Groups in the social sciences

While much research in the social sciences seeks to analyze individual behavior, it’s often necessary to make inferences from observations of groups. For example, if it’s impossible to obtain voter turnout data specifying whether each citizen voted, researchers may study precinct-level data instead. Working with different groups and levels of aggregation is a routine part of data analysis, and dplyr makes it easy to implement the basic tools you’ll need.

About the data

This lesson will work with some of the data from my senior capstone project. The aid object holds records of government-provisioned disaster aid at the household level. In contrib, we have (publicly available) data on campaign contributions to Gov. Chris Christie, with one row per donation. Finally, flood stores the percentage of flooded area in each zip code.

library(dplyr)

aid <- readRDS("data/house_aid.rds")
contrib <- readRDS("data/indv_contrib.rds")
flood <- readRDS("data/zip_flood.rds")
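
If you’d like to inspect the structure of these objects yourself, dplyr’s glimpse() prints a compact column-by-column overview (just a suggestion; str() or head() would work as well):

# Peek at the columns and types in each data set
glimpse(aid)
glimpse(contrib)
glimpse(flood)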

A little cleaning

For the most part, these data files are already cleaned. We’ll go through an example here to demonstrate some dplyr functions. You can use filter() to create a subset of your data based on one or more logical tests. Here, we’ll look for contributions before 2013, then search within those for donations coming from Flemington.

pre13 <- filter(contrib, date < as.Date("2013-01-01"))
pre13flem <- filter(pre13, city == "FLEMINGTON")

head(pre13flem)
##        id                 name amount       date       city   zip
## 1 5861012     HAMILTON, WALTER   3400 2009-02-27 FLEMINGTON 08822
## 2 4002170     KARROW, MARCIA A   2400 2009-09-21 FLEMINGTON 08822
## 3 7234182      CONNOR, SCOTT C   2000 2009-09-07 FLEMINGTON 08822
## 4 7234401     FACCHINA, ROBERT   1000 2009-09-22 FLEMINGTON 08822
## 5 4002170     KARROW, MARCIA A   1000 2009-07-14 FLEMINGTON 08822
## 6 7234545 MAZZAGETTI, DOMINICK    500 2009-10-05 FLEMINGTON 08822

We could also do this in one step:

pre13flem2 <- filter(contrib, 
                     date < as.Date("2013-01-01"),
                     city == "FLEMINGTON")
pre13flem3 <- filter(contrib, 
                     date < as.Date("2013-01-01") &
                     city == "FLEMINGTON")
identical(pre13flem, pre13flem2)
## [1] TRUE
identical(pre13flem, pre13flem3)
## [1] TRUE

In our contributions data set, cities are written in capital letters, but our aid data set has them in title case. Here, we’ll use mutate() to clean existing columns and create new ones, returning a new data frame. Like filter(), mutate() accepts multiple arguments, so it can transform several columns in one call. (Take a look at ?mutate_at for documentation on how to apply one function quickly to multiple columns; a short sketch follows the output below.)

aid2 <- mutate(aid,
               city = toupper(city),
               ihp_ref = as.numeric(ihp_ref),
               ihp_elig = as.numeric(ihp_elig))
head(aid2)
##   state_id     city  county   zip ihp_ref ihp_elig ihp_amount
## 1       NJ ATLANTIC ABSECON 08201       2        1        0.0
## 2       NJ ATLANTIC ABSECON 08201       2        1        0.0
## 3       NJ ATLANTIC ABSECON 08201       2        2    12514.3
## 4       NJ ATLANTIC ABSECON 08201       2        1        0.0
## 5       NJ ATLANTIC ABSECON 08201       2        1        0.0
## 6       NJ ATLANTIC ABSECON 08201       1        1        0.0
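
As a side note on the ?mutate_at pointer above, the two as.numeric() conversions could also be written in a single step. A minimal sketch, not needed for the rest of the lesson (newer dplyr versions favor across() for the same job):

# Apply as.numeric() to both ihp columns in one call
aid2_alt <- mutate_at(aid, vars(ihp_ref, ihp_elig), as.numeric)
# Equivalent with the newer across() syntax:
# aid2_alt <- mutate(aid, across(c(ihp_ref, ihp_elig), as.numeric))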

What changes when grouping

Grouping in the tidyverse uses the “adverb” group_by(). This function tells dplyr that you would like to perform operations within groups, rather than across the entire data set at once.

To understand how grouping affects the filter() function, we will look for the last contribution in each city. Consider what would be the result of running the following line:

filter(contrib, date == last(date))

Instead of finding each city’s last contribution, the line above subsets to all contributions made on the final date observed in the data set. To compare contribution dates within cities, we’ll need to use group_by().

contrib_grpd <- group_by(contrib, city)
last_contrib <- filter(contrib_grpd, date == last(date))

Some of the cities in last_contrib still have multiple donations listed. These are cities in which multiple contributions were made on the last day. Just to demonstrate further, let’s see how a second dplyr function works with group_by().

sample_n(last_contrib, 1)
## # A tibble: 534 x 6
## # Groups:   city [534]
##          id                  name amount       date       city   zip
##       <int>                <fctr>  <dbl>     <date>     <fctr> <chr>
##  1 15932902     SALEMI, GIANCARLO   3000 2013-06-27   ABERDEEN 07747
##  2 13578941          FIORE, KAY B     35 2013-07-31    ABSECON 08201
##  3  4025808 MALLETT JR, RUSSELL B     50 2009-09-17  ALLAMUCHY 07820
##  4 15044714          PASCH, DAVID     25 2013-10-16  ALLENDALE 07401
##  5  7233573    DENGROVE, ROBERT S    150 2009-09-01 ALLENHURST 07711
##  6 13579450   DEMAURO, NICHOLAS P    300 2013-08-09  ALLENTOWN 08501
##  7  4000078        WESHNAK, BARRY    300 2009-09-16  ALLENWOOD 08720
##  8  4435616  CONWAY III, GEORGE T     50 2013-03-28     ALPINE 07620
##  9 13578912         RITTIE, JASON    250 2013-07-30    ANDOVER 07821
## 10  4039714        MILITA, MARTIN   1000 2013-06-03  ANNANDALE 08801
## # ... with 524 more rows

Note that here, instead of returning a data frame with one row, sample_n() sampled within groups, returning a random observation from each city in last_contrib. We did not need to call group_by() again, because using filter() on a grouped_df (like contrib_grpd) returns a data frame that is still grouped. (You can use ungroup() to remove the grouping.)
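
For completeness, a minimal sketch of dropping that grouping when you no longer need it (last_contrib_flat is just an illustrative name):

# filter() preserved the grouping by city; ungroup() removes it
last_contrib_flat <- ungroup(last_contrib)
group_vars(last_contrib_flat)  # returns character(0): no grouping variables remain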

Aggregating data

Grouping is often most useful when summarizing or aggregating data. Our data sets are currently on three different levels: household (aid), contribution (contrib), and zip code (flood). We will aggregate the first two data sets to the zip code level so that the three can be merged and analyzed together easily.

Here, we’ll also use the pipe operator, %>%, to make the code easier to read. To aggregate the contribution amounts, we can calculate the sum within each zip code. The summarize() function accepts a grouped data frame (a grouped_df) and applies its operations within each group.

zip_contrib <- 
  contrib %>%
  group_by(zip) %>%
  summarize(contrib_amount = sum(amount))
head(zip_contrib)
## # A tibble: 6 x 2
##     zip contrib_amount
##   <chr>          <dbl>
## 1 07001            250
## 2 07002           4685
## 3 07003          22330
## 4 07004          46335
## 5 07005          33210
## 6 07006         157766

For the aid data, let’s total the disbursement amounts and count the number of individuals eligible for this government aid package.

zip_aid <- 
  aid2 %>%
  group_by(zip) %>%
  summarize(aid_amount  = sum(ihp_amount),
            ihp_elig    = sum(ihp_elig))
head(zip_aid)
## # A tibble: 6 x 3
##     zip aid_amount ihp_elig
##   <chr>      <dbl>    <dbl>
## 1 07001  215991.66      361
## 2 07002 4699699.99     4657
## 3 07003   51655.87      732
## 4 07004    3367.49       58
## 5 07005   38856.34      143
## 6 07006   42445.03      280

Now we have three data sets with different information about New Jersey zip codes. We’ll save the two new data sets to use in another lesson.

saveRDS(zip_aid, "data/zip_aid.rds")
saveRDS(zip_contrib, "data/zip_contrib.rds")
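
Looking ahead, one hedged sketch of how the three zip-code-level data sets might eventually be combined, assuming flood also stores its zip codes in a column named zip with matching formatting (the actual merge is left for that later lesson):

# Start from the flood data and attach the aggregated aid and contribution totals
zip_all <- flood %>%
  left_join(zip_aid, by = "zip") %>%
  left_join(zip_contrib, by = "zip")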

Fitting models within groups

(in progress)