Functions covered:

  • st_read()
  • geom_sf()
  • dplyr function methods for to sf objects

Spatial data and packages in R

GIS data for social science research tend to be stored in one of two formats: vector or raster. Here, we’ll be working entirely with the former. Vector data store features such as points, lines, and polygons. These features may represent individual locations (points), roads and rivers (lines), or political units (polygons). Vector data are commonly stored in shapefiles (.shp), accompanied by several other files (e.g., .prj, .dbf) holding accessory data.

About sf

Standard packages for analyzing spatial data provide methods and object classes specialized for particular types of data. Of course, they also supply functions for reading, writing, generating, and manipulating spatial objects. This tutorial will cover the basic tools in sf, a package for representing simple features in R. Because sf is still under development, it does not support several crucial operations, though eventually sf will replace the currently standard suite of R spatial packages for vector data: sp, maptools, rgdal, rgeos.

Rather than cover the specifics of GIS or how to create simple features from scratch, this tutorial will give a superficial introduction to sf objects before demonstrating easy-to-implement manipulation and plotting methods. I learned of the sf package from an excellent, but more advanced, blog post. I recommend going there next, if you’re looking to learn more. The package authors include a series of very detailed vignettes.

library(dplyr)
library(ggplot2)
# If you can't find the function ggplot2::geom_sf(), 
# you may need to download the development version of ggplot2: devtools::install_github("tidyverse/ggplot2")
library(scales)

library(sf)

Its authors designed the sf package to fit neatly into the tidyverse. Like stringr, functions begin with a common string: st_, an abbreviation of spatial and temporal. While the package defines many auxiliary object classes, you will primarily be working with objects that are merely extensions of the basic data frame.

# Read in precinct-level 2016 presidential election results from Hawaii
pres2016 <- readRDS("data/HI_precinct_gen_pres2016.rds")

# Read in as a data frame
# st_read uses other files in the hi_precinct directory, 
# but you need to point it only to the .shp (shapefile)
hi <- st_read("data/hi_precinct/Precincts_2016_01_13b_PUBLIC.shp")
## Reading layer `Precincts_2016_01_13b_PUBLIC' from data source `/Users/Cole/Dropbox/tidy_intro/data/hi_precinct/Precincts_2016_01_13b_PUBLIC.shp' using driver `ESRI Shapefile'
## Simple feature collection with 264 features and 5 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -1424396 ymin: 2089224 xmax: 945628.1 ymax: 3312458
## epsg (SRID):    26904
## proj4string:    +proj=utm +zone=4 +datum=NAD83 +units=m +no_defs
class(hi)
## [1] "sf"         "data.frame"

This structure makes indexing, merging, and aggregating spatial objects intuitive; for the most part, you can proceed as you would with any other data frame. What sets sf objects apart is their geometry column, which holds the GIS data associated with the corresponding row of the data frame. This column follows your sf object around as long as it retains its class.[^select_nogeom] To remove the geometry, use as.data.frame().

Working with sf objects

# Merge with election results data
hi$DP <- as.character(hi$DP)
hi_pres <- left_join(hi, pres2016, by = c(DP = "precinct"))

# Create new variables
hi_pres2 <- mutate(hi_pres,
                   district_code    = as.numeric(gsub("-[0-9]{2}$", "", DP)),
                   total_pres_votes = castle_votes_total + johnson_votes_total + stein_votes_total + 
                                        clinton_votes_total + trump_votes_total,
                   trump_vshare     = trump_votes_total / total_pres_votes,
                   clinton_vshare   = clinton_votes_total / total_pres_votes
                   )
## Warning in evalq(as.numeric(gsub("-[0-9]{2}$", "", DP)), <environment>):
## NAs introduced by coercion
# Grouped operation
hi_district <- 
  group_by(hi_pres2,
           district_code) %>%
  summarize(no_prec = n(),
            mean_trump = mean(trump_vshare),
            turnout = sum(no_ballots, na.rm = TRUE)/sum(no_reg_voters, na.rm = TRUE)
            )
summary(hi_district)
##  district_code     no_prec         mean_trump        turnout      
##  Min.   : 1.0   Min.   : 2.000   Min.   :0.2049   Min.   :0.4257  
##  1st Qu.:13.5   1st Qu.: 4.000   1st Qu.:0.2615   1st Qu.:0.5362  
##  Median :26.0   Median : 5.000   Median :0.3006   Median :0.5879  
##  Mean   :26.0   Mean   : 5.077   Mean   :0.3013   Mean   :0.5769  
##  3rd Qu.:38.5   3rd Qu.: 6.000   3rd Qu.:0.3284   3rd Qu.:0.6264  
##  Max.   :51.0   Max.   :11.000   Max.   :0.4132   Max.   :0.6869  
##  NA's   :1                       NA's   :10       NA's   :1       
##           geometry 
##  MULTIPOLYGON : 4  
##  POLYGON      :48  
##  epsg:26904   : 0  
##  +proj=utm ...: 0  
##                    
##                    
## 

Plotting sf objects

# Subset to Oahu island
oahu_pres <- filter(hi_pres2,
                    between(district_code, 17, 51))


ggplot(oahu_pres) + geom_sf(aes(fill = clinton_vshare)) + 
  scale_fill_gradient2(midpoint = 0.5, 
                       mid = "maroon4", high = "blue", low = "red", 
                       limits = c(0, 1)) + 
  theme_bw()

# Highlight where third-party votes decided the precinct winner
oahu_pres2 <- mutate(oahu_pres,
                     pwin3rd = abs(clinton_votes_total - trump_votes_total) <= 
                       stein_votes_total + johnson_votes_total + castle_votes_total
                     )
ggplot(oahu_pres2) + geom_sf(aes(fill = pwin3rd)) + theme_bw()

You can also use the base plotting system. It’s especially useful for quickly checking that your data look as you’d expect.

big <- filter(hi_pres2, between(district_code, 0, 7))
big_district <- filter(hi_district, between(district_code, 0, 7))

plot(big, max.plot = 1, col = "grey")

plot(big_district, max.plot = 1, col = "grey")