This lesson will work with some of the data from my senior capstone project. We have records of government-provisioned disaster aid at the household level in the aid
object. In contrib
, we have (publicly available) data on campaign contributions to Gov. Chris Christie, separated by donation. Finally, we have data on the percentage of flooded area in each zip code stored in flood
.
library(dplyr)
aid <- readRDS("data/house_aid.rds")
contrib <- readRDS("data/indv_contrib.rds")
flood <- readRDS("data/zip_flood.rds")
For the most part, these data files are already cleaned. We’ll go through an example here to demonstrate some dplyr
functions. You can use filter()
to create a subset of your data based on one or more logical tests. Here, we’ll look for contributions before 2013, then search within those for donations coming from Flemington.
pre13 <- filter(contrib, date < as.Date("2013-01-01"))
pre13flem <- filter(pre13, city == "FLEMINGTON")
head(pre13flem)
## id name amount date city zip
## 1 5861012 HAMILTON, WALTER 3400 2009-02-27 FLEMINGTON 08822
## 2 4002170 KARROW, MARCIA A 2400 2009-09-21 FLEMINGTON 08822
## 3 7234182 CONNOR, SCOTT C 2000 2009-09-07 FLEMINGTON 08822
## 4 7234401 FACCHINA, ROBERT 1000 2009-09-22 FLEMINGTON 08822
## 5 4002170 KARROW, MARCIA A 1000 2009-07-14 FLEMINGTON 08822
## 6 7234545 MAZZAGETTI, DOMINICK 500 2009-10-05 FLEMINGTON 08822
We could also do this in one step:
pre13flem2 <- filter(contrib,
date < as.Date("2013-01-01"),
city == "FLEMINGTON")
pre13flem3 <- filter(contrib,
date < as.Date("2013-01-01") &
city == "FLEMINGTON")
identical(pre13flem, pre13flem2)
## [1] TRUE
identical(pre13flem, pre13flem3)
## [1] TRUE
In our contributions data set, cities are written in capital letters, but our aid data set has them in title case. Here, we’ll use mutate()
to clean and create columns, returning a new data frame. Like filter()
, mutate can also work on multiple columns at once. (Take a look at ?mutate_at
for documentation on how to apply one function quickly to multiple columns.)
aid2 <- mutate(aid,
city = toupper(city),
ihp_ref = as.numeric(ihp_ref),
ihp_elig = as.numeric(ihp_elig))
head(aid2)
## state_id city county zip ihp_ref ihp_elig ihp_amount
## 1 NJ ATLANTIC ABSECON 08201 2 1 0.0
## 2 NJ ATLANTIC ABSECON 08201 2 1 0.0
## 3 NJ ATLANTIC ABSECON 08201 2 2 12514.3
## 4 NJ ATLANTIC ABSECON 08201 2 1 0.0
## 5 NJ ATLANTIC ABSECON 08201 2 1 0.0
## 6 NJ ATLANTIC ABSECON 08201 1 1 0.0
Grouping in the tidyverse uses the “adverb” group_by()
. This function tells dplyr
that you would like to perform operations within groups, rather than across the entire data set at once.
To understand how grouping affects the filter()
function, we will look for the last contribution in each city. Consider what would be the result of running the following line:
filter(contrib, date == last(date))
Instead of finding each city’s last contribution, the line above subsets to all contributions made on the final date observed in the data set. To compare contribution dates within cities, we’ll need to use group_by()
.
contrib_grpd <- group_by(contrib, city)
last_contrib <- filter(contrib_grpd, date == last(date))
Some of the cities in last_contrib
still have multiple donations listed. These are cities in which multiple contributions were made on the last day. Just to demonstrate further, let’s see how a second dplyr
function works with group_by()
.
sample_n(last_contrib, 1)
## # A tibble: 534 x 6
## # Groups: city [534]
## id name amount date city zip
## <int> <fctr> <dbl> <date> <fctr> <chr>
## 1 15932902 SALEMI, GIANCARLO 3000 2013-06-27 ABERDEEN 07747
## 2 13578941 FIORE, KAY B 35 2013-07-31 ABSECON 08201
## 3 4025808 MALLETT JR, RUSSELL B 50 2009-09-17 ALLAMUCHY 07820
## 4 15044714 PASCH, DAVID 25 2013-10-16 ALLENDALE 07401
## 5 7233573 DENGROVE, ROBERT S 150 2009-09-01 ALLENHURST 07711
## 6 13579450 DEMAURO, NICHOLAS P 300 2013-08-09 ALLENTOWN 08501
## 7 4000078 WESHNAK, BARRY 300 2009-09-16 ALLENWOOD 08720
## 8 4435616 CONWAY III, GEORGE T 50 2013-03-28 ALPINE 07620
## 9 13578912 RITTIE, JASON 250 2013-07-30 ANDOVER 07821
## 10 4039714 MILITA, MARTIN 1000 2013-06-03 ANNANDALE 08801
## # ... with 524 more rows
Note that here, instead of returning data frame with one row, sample_n()
sampled within groups, returning a random observation from each city in last_contrib
. We did not need to call group_by()
again, because using filter()
on a grouped_df
(like contrib_grpd
) returns a data frame that is still grouped. (You can use ungroup()
to remove the grouping.)
Grouping is often most useful when summarizing or aggregating data. Our data sets are currently on three different levels: household (aid
), contribution (contrib
), and zip code (flood
). We will aggregate the first two data sets to the zip code level so that the three can be merged and analyzed together easily.
Here, we’ll also use the pipe operator, %>%
to make the code easier to read. To aggregate the contribution amounts, we can calculate the sums across each to zip code. The summarize()
function accepts a grouped_df()
and applies operations within groups.
zip_contrib <-
contrib %>%
group_by(zip) %>%
summarize(contrib_amount = sum(amount))
head(zip_contrib)
## # A tibble: 6 x 2
## zip contrib_amount
## <chr> <dbl>
## 1 07001 250
## 2 07002 4685
## 3 07003 22330
## 4 07004 46335
## 5 07005 33210
## 6 07006 157766
For the aid data, let’s total the disbursement amounts and count the number of individuals eligible for this government aid package.
zip_aid <-
aid2 %>%
group_by(zip) %>%
summarize(aid_amount = sum(ihp_amount),
ihp_elig = sum(ihp_elig))
head(zip_aid)
## # A tibble: 6 x 3
## zip aid_amount ihp_elig
## <chr> <dbl> <dbl>
## 1 07001 215991.66 361
## 2 07002 4699699.99 4657
## 3 07003 51655.87 732
## 4 07004 3367.49 58
## 5 07005 38856.34 143
## 6 07006 42445.03 280
Now we have three data sets with different information about New Jersey zip codes. We’ll save the two new data sets to use in another lesson.
saveRDS(zip_aid, "data/zip_aid.rds")
saveRDS(zip_contrib, "data/zip_contrib.rds")
(in progress)