Vast improvements in processing power have opened the potential of spatial data analysis to the average social science researcher, laying the foundations for empirical work that has incorporated increasingly complex analyses of spatial relationships. Both the public and private sectors in many highly developed countries have advanced their spatial data capabilities in parallel with technological innovations, producing high-quality GIS products and services to meet the rising demand. In the political science literature, these data have yielded numerous high-powered, micro-level studies of behavior and political phenomena.
Many developing countries do not have such robust support for researchers hoping to conduct spatial analyses. Development institutions have produced high-resolution raster interpolations on a near-global scale for quantities like population density, literacy, and mortality; however, key political covariates exist only at much higher levels of aggregation. Government-issued shapefiles for administrative boundaries tend to be of much lower quality and are subject to change over time, with or without a formal delimitation procedure. Accurate geographic coordinates are simply unavailable for many types of point data.
The dearth of readily available, high-quality spatial data in the developing world, we believe, dramatically increases the relative demand for such a product. These data are not wholly inaccessible. With experience admittedly limited to Kenya, we conjecture that spatial data in fact do exist for many developing countries. The problem is that these data are scattered so sparsely across various people, departments, and organizations that collecting the data files alone requires significant time and personal effort, before any matching begins. As long as researchers cannot simply download these data in prepackaged form, they remain an untapped resource for those willing to put in the hours required to link the mass of geocoded points, available from an inconvenient number of sources, to the lists of schools, health clinics, markets, administrative offices, and polling centers that social scientists hope to study.
The remainder of this document discusses in detail our process of matching location names to geographic points. Example Data describes the sample of polling stations available to download and the various points and polygons required for the linkage process. Geocoding explains the techniques we’ve found helpful in our matching method. Imputing Points discusses our process of stochastically imputing point coordinates for polling stations we could not accurately geocode.
We assume of the reader an intermediate-level understanding of the R programming language and basic familiarity with the S4 spatial data classes defined in the sp package (R Core Team 2016; Pebesma and Bivand 2005). Some processes in this document are run in parallel on six cores. All operations are performed on a server running R version 3.4.0 (64-bit) under Ubuntu 14. Reproducing the results on a machine with less than 32GB of RAM or using the 32-bit version of R may require adjusting the code.
We demonstrate our methods using samples of the polling stations file and a set of geocoded points. We also rely on polygon shapefiles detailing several levels of Kenyan administrative boundaries at different points in time. Finally, we use a 2010 population density raster for Kenya from WorldPop (Linard et al. 2012). All of these data are available to download from the GitHub repository.
# Load packages
library(sp)
library(maptools)
library(rgeos)
library(rgdal)
library(raster)
library(spatstat)
library(parallel)
library(matrixStats)
library(stringdist)
library(stringr)
library(purrr)
library(dplyr)
# Load user-defined functions
source("code/functions.R")
# Set number of cores for parallel processing
cores <- 6
We use two sets of point-specific data. First, the polling stations file, data/ps_samp.rds, is a sample of polling stations scraped from voter rolls and electoral results. Barring linkage errors, we observe all polling stations active in the 1997, 2002, 2007, and 2013 elections in Kenya, as well as the 2005 and 2010 constitutional referenda. The data include not only the name of the active polling station in that year (e.g., “MURKISIAN PRIMARY SCHOOL”) but also important information on what administrative units govern the polling station. These regionally identifying variables allow us to index to a geographic area, or “block”, within which we roughly restrict our searches. (More on this below.)
Second, we have a mass of geocoded match points in Kenya obtained from various local, foreign, public, and private sources. A sample of these is stored in data/mp_samp.rds. The accuracy of the coordinates and the consistency of formatting in the name strings vary by source. Each match point represents a potential location for a given polling station, but the match points file is, in reality, neither complete nor unambiguous. Therefore, while the procedure outlined here links these two files and yields a version of the polling stations file with geographic coordinates attached to each record, some polling stations have no acceptable match, and their coordinates must be imputed instead.
# Read points data
ps <- readRDS("data/ps_samp.rds")
mp <- readRDS("data/mp_samp.rds")
The polygon files included with this document represent several administrative levels. Redistricting and revisions in data quality have produced files that record changes in boundaries over time. In roughly decreasing order of scale, the files read below cover constituencies (2013 and pre-2013 boundaries), wards (current and old), and sublocations.
# Read polygons data
const_old <- readRDS("data/const_old.rds")
const13 <- readRDS("data/const13.rds")
caw <- readRDS("data/caw.rds")
caw_old <- readRDS("data/caw_old.rds")
sl <- readRDS("data/sl.rds")
Other regionally identifying information for polling stations, such as county, division, and province, is sometimes available, but it denotes units that are too large to be helpful in narrowing down our searches.
Because the boundaries of these geographic units have changed over time, and because the Kenyan government has not always provided these data at a high quality, we cannot perfectly and consistently map polling stations to units like the ward, location, or sublocation. This difficulty is particularly pronounced along the borders of administrative units, where anecdotal evidence informs us that what the central Kenyan government may understand and treat as dividing lines are not as clear-cut to chiefs and other local administrators on the ground. Therefore, the borders delineated by the polygon shapefiles can be treated as only soft boundaries on point locations.
Understanding the quality of the polygon shapefiles informs us, if roughly, of how much error we should expect at the borders of administrative units. We’ve found two attributes of shapefiles to be most informative with only a quick study of the data: boundary detail and border alignment.
First, the level of detail provided at the boundaries of shapefile features indicates the amount of time, money, and effort put into creating the data. Building shapefiles that accurately represent the geography on the ground is labor intensive. Shortcuts, just like lower-resolution raster files, produce unrealistically rectangular boundaries where the land’s natural cuts and curves were too costly to map. These differences are most conspicuous along shorelines.
Below, we plot the overlay of three constituencies bordering Lake Victoria as represented in two different shapefiles.
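A sketch of such an overlay is below. The constituency IDs are hypothetical placeholders, since the two files use different ID schemes; look up the constituencies you want in each file’s attribute table.
# Overlay sketch: three lakeside constituencies as drawn in the older,
# coarser shapefile versus the more detailed 2013 shapefile
# (the IDs below are hypothetical placeholders)
old_ids <- c(210, 211, 212)
new_ids <- c(240, 241, 242)
plot(const_old[const_old$id %in% old_ids, ], border = "grey60")
plot(const13[const13$constid %in% new_ids, ], border = "red", add = TRUE)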
Such exploratory work is most easily performed in an interactive GUI like QGIS or ArcGIS, but even this simple plot demonstrates the vast differences in detail offered by the two files: a clear coastline, rather than blockish figures, and distinct islands.
A second indicator of a shapefile’s quality can be seen at points where two features meet. From a distance, the two constituencies plotted below appear to be drawn somewhat roughly: suspiciously straight lines mark borders.
plot(const13[const13$constid %in% 239:240, ], col = "azure2")
Again, an interactive GUI might be helpful here to zoom in on the boundary between these two constituencies. However, we can make do by unioning the two features together.
plot(gUnaryUnion(const13[const13$constid %in% 239:240, ]))
Despite converting the object into a single feature, we still observe a partial line delimiting one constituency from the other. That line is the result of a gap between the two features, a flaw in the shapefile.
Whenever possible, we use UTM zone 37S as our coordinate reference system. Working with a projected CRS allows us to use functions like gBuffer() seamlessly, defining buffer widths in easily understood units like meters. If you are working with a relatively small area (e.g., a single medium-sized country), we would recommend picking an appropriate projected CRS early in the data cleaning process.
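For example, reprojecting the point data might look like the following (a sketch, assuming the objects already carry a valid geographic CRS; EPSG:32737 corresponds to WGS 84 / UTM zone 37S):
# Reproject points to UTM zone 37S so buffers and distances are in meters
utm37s <- CRS("+init=epsg:32737")
ps <- spTransform(ps, utm37s)
mp <- spTransform(mp, utm37s)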
We now outline our methods of finding the closest geographic match for polling stations in Kenya. It is worth noting at the outset that our objective is neither to find the closest string match nor to find as many schools or health clinics as possible. Rather, we aim to minimize geographic error, even when that means accepting what is almost certainly the wrong named point.
It would be difficult to overstate the importance of preprocessing. Time spent developing the consistency of the data early in the matching procedure can lead to enormous gains in accuracy down the road. Perhaps more importantly, this phase develops a familiarity with the data’s idiosyncrasies that becomes invaluable when tailoring matching strategies to a specific data set or goal. By its nature, however, preprocessing requires techniques that may vary widely depending on whatever quirks of the unmatched data require standardization. We therefore begin with data that have already gone through some cleaning. We describe only a few preprocessing steps that are helpful in the most general cases, omitting any discussion of the more idiosyncratic adjustments we made.
Polling stations were sometimes recorded in the pattern ALPHA PR. SCH, but variations including ALPHA PRY SCH, ALPHA PRIMARY SCHOOL, and ALPHA PR.SCH abound. This lack of standardization is common for all types of polling station locations. One approach would be to standardize the abbreviation of PRIMARY and SCHOOL by replacing all permutations with a single string. However, the fuzzy-matching algorithm we use not only inspects differences between strings but also normalizes that edit distance by the number of characters in each string. In the example below, we return two different string distance scores given inputs of essentially the same information.
stringdist("ALPHA PRIMARY SCHOOL", "BETA PRIMARY SCHOOL", method = "jw", p = 0.15)
## [1] 0.1817982
stringdist("ALPHA PR. SCH", "BETA PR. SCH", method = "jw", p = 0.15)
## [1] 0.2229345
For our purposes, it is more important to match polling stations to the correct geographic area than to match primary schools strictly to primary schools.1 We can therefore remove the nonessential information and store it separately. We do this by cataloging the most common types of polling stations, then truncating the strings to separate these labels from the information that identifies the town, village, or area in which the polling station is located. The result is a truncated polling station name (e.g., ALPHA) with classifying information (e.g., SCH and PRI) in separate columns. Conveniently, in Kenya, substrings in the ALPHA position almost always refer to a village or other small locality. Thus, when calculating the similarity of two strings (e.g., ALPHA and BETA), we concentrate on the regionally identifying information and, in turn, on geographic accuracy.
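A rough sketch of this truncation step follows; the patterns below are illustrative only, and our actual catalog of polling station types is more extensive.
# Illustrative truncation: separate type/subtype labels from the locality name
raw <- c("ALPHA PRIMARY SCHOOL", "BETA PRY SCH", "GAMMA PR.SCH")
type <- ifelse(str_detect(raw, "SCH(OOL)?\\b"), "SCH", NA)
subtype <- ifelse(str_detect(raw, "\\bPR(\\.|Y|IMARY)?\\b"), "PRI", NA)
trunc <- str_trim(str_replace(raw, "\\s*PR(\\.|Y|IMARY)?\\.?\\s*SCH(OOL)?$", ""))
trunc
## [1] "ALPHA" "BETA"  "GAMMA"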
Classifying data for each polling station-year are stored in election-specific type and subtype columns. These columns align temporally with the election-specific polling station names from which these data were extracted. For example, a polling station active in 2013 will have columns for its truncated name (ALPHA), its type (SCH), and its subtype (PRI); separate columns will hold analogous information for any prior elections during which this polling station was active.
To improve matching further, we clean the strings of both our point (polling station) and polygon (administrative boundary) data. Most of the cleaning will likely be idiosyncratic to the locale of the data, such as correcting for inconsistencies in transliteration, pronunciation, or abbreviation.
One technique we’ve found particularly helpful is to order the elements of each string. This corrects the particularly troublesome problem of some strings following the pattern SAMBURU EAST, while the same constituency may be recorded in a second data set as EAST SAMBURU. The function order_substr(), defined in code/functions.R, splits each string of a character vector at any commas, whitespace, forward slashes, and hyphens.2 The output is a vector with those substrings ordered alphabetically and collapsed with a single space. Alternative regular expressions for splitting can be included in the split parameter; other strings may be used to separate the alphabetized substrings via the collapse parameter.
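The definition in code/functions.R handles additional edge cases, but a minimal sketch of the idea looks like this:
# Minimal sketch of order_substr(): split, drop empty pieces, sort, rejoin
order_substr_sketch <- function(x, split = "[,/[:space:]-]+", collapse = " ",
                                reverse = FALSE) {
  vapply(strsplit(x, split), function(s) {
    paste(sort(s[s != ""], decreasing = reverse), collapse = collapse)
  }, character(1))
}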
tuna <- c("tuna,skipjack", "tuna , bluefin", "yellow-fin - tuna", "tuna, albacore")
order_substr(tuna)
## [1] "skipjack tuna" "bluefin tuna" "fin tuna yellow" "albacore tuna"
colors <- c("green/red", "yellow / blue", "orange purple")
order_substr(colors, collapse = "-", reverse = TRUE)
## [1] "red-green" "yellow-blue" "purple-orange"
We define a function for each of several matching steps. These steps differ primarily in terms of which lower-level polygons they use (e.g., sublocation, old ward, new ward). To cut down on processing time, we run these functions on multiple cores via parallel::mclapply().3
Each matching function follows the same basic framework: (1) block on constituency, subsetting the match points and administrative polygons to the polling station’s buffered constituency; (2) fuzzy match the polling station’s administrative-unit names against the polygon name strings, and its truncated name against the match point name strings; (3) index to the closest polygon and point matches; and (4) record quality measures, such as the geographic distance between point and polygon matches and whether the type and subtype labels agree.
Below, we describe the blocking and fuzzy matching steps in detail before walking through the code for a simplified matching function with a single polling station. We then use the example data sets to demonstrate the two functions we used first in our matching process, match_lsl() and match_inf_loc13(). Together, these functions geocode roughly 75 percent of Kenya’s 1997-2013 polling stations. We discuss two techniques particular to these functions; their full definitions can be found in code/functions.R.
Our full data set of match points contains over 125,000 coordinates. Running matching algorithms for each of the nearly 26,000 polling stations against all match points would therefore be hugely expensive. To save time and RAM, we adapt a common record-linkage technique, blocking, to a spatial context: we split our matching data geographically and use constituency code identifiers as a blocking key.
We first buffer our constituency polygons by 500 meters to account for error in the spatial data. Next, we build a list in which each element is the subset of match points contained by each of the buffered constituencies. We use a convenience function, save_list(), defined in code/functions.R to save each object in a new directory: data/mp_constb/. By default, the name of each file matches the name of the list element to which the file corresponds. We repeat this process to split the sublocation polygons into constituency blocks.
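This convenience function is roughly equivalent to the following sketch:
# Sketch of save_list(): write each list element to dir as <name>.rds
save_list_sketch <- function(l, dir) {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  invisible(Map(function(obj, nm) saveRDS(obj, file.path(dir, paste0(nm, ".rds"))),
                l, names(l)))
}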
# Buffer old and new constituencies
constb <- gBuffer(const_old, byid = TRUE, width = 500)
l_constb <- split(constb, constb$id)
const13b <- gBuffer(const13, byid = TRUE, width = 500)
l_const13b <- split(const13b, const13b$constid)
# Add sublocation and location information to match points
row.names(mp) <- 1:nrow(mp)
mp_sl <- over(mp, sl)
mp@data <- data.frame(mp@data, mp_sl)
# Split match points by buffered constituency
mp_constb <- mclapply(l_constb, function(x) mp[x, ],
mc.cores = cores)
save_list(mp_constb, "data/mp_constb")
mp_const13b <- mclapply(l_const13b, function(x) mp[x, ],
mc.cores = cores)
save_list(mp_const13b, "data/mp_const13b")
# Split sublocations by buffered constituency
sl_constb <- mclapply(l_constb, function(x) sl[x, ],
mc.cores = cores)
save_list(sl_constb, "data/sl_constb")
sl_const13b <- mclapply(l_const13b, function(x) sl[x, ],
mc.cores = cores)
save_list(sl_const13b, "data/sl_const13b")
Each match points and sublocations subset is named for the old (i.e., pre-2013) constituency code in which the spatial features are contained. We use these constituency codes as blocking keys by searching for matches only within subsets corresponding to a polling station’s constituency. For example, when searching for matches to a polling station assigned to constituency 067 in 2002, we load only the objects in data/mp_constb/067.rds and data/sl_constb/067.rds.
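In code, retrieving the blocked subsets for a given polling station is just a pair of reads (a sketch using the constituency code from the example above):
# Load only the match points and sublocations for constituency 067
block_id <- "067"
mp_block <- readRDS(file.path("data/mp_constb", paste0(block_id, ".rds")))
sl_block <- readRDS(file.path("data/sl_constb", paste0(block_id, ".rds")))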
We use the stringdist package to calculate Jaro-Winkler distances between the polling stations data and both the match points’ and the polygons’ name strings (van der Loo 2014). The stringdist package supports a number of metrics for measuring string similarity; we’ve found the Jaro-Winkler distance to work best for our purposes. Results of stringdist(a, b, method = "jw") are scaled between 0 and 1, where lower values indicate more similar strings.
Aside from supporting multiple string metrics, one advantage of stringdist() over agrep() is that it returns the distances themselves, providing a sense of how the calculations change across comparisons. Inspecting the results by hand allows you to pick precise thresholds for the minimum acceptable string distance.
set.seed(575)
# Sample polling stations to work with and pull out
# the names of their 2002 sublocations
sl_samp <- sample(ps$sl02[ps$sl02 != ""], 20)
# Compute string distances between polling stations' sublocations
# and the names contained in the sublocations shapefile
sd_l <- lapply(sl_samp, stringdist,
b = sl$SUBLOCATIO, method = "jw")
# Find sublocation strings with the minimum string distance score
# (In this example, we'll just take the first match)
sl_match <- sapply(sd_l, function(x) sl[x == min(x), ]$SUBLOCATIO[1])
sl_dist <- sapply(sd_l, min)
# Bind the results in a data.frame
df <- data.frame(sl_samp, sl_match, sl_dist,
stringsAsFactors = FALSE)
df2 <- arrange(df, desc(sl_dist))
head(df2, 10)
## sl_samp sl_match sl_dist
## 1 GESURE LAGSURE 0.1507937
## 2 MULILII ULILINZI 0.1309524
## 3 NGOO NGOO 0.0000000
## 4 KORU KORU 0.0000000
## 5 NYATHUNA NYATHUNA 0.0000000
## 6 KIBUGAT KIBUGAT 0.0000000
## 7 KAKELO KAMROTH KAKELO KAMROTH 0.0000000
## 8 BANDE BANDE 0.0000000
## 9 NZAWA NZAWA 0.0000000
## 10 ALOMALA LOCHOR ALOMALA LOCHOR 0.0000000
In the example above, fourteen of the twenty polling stations were associated with 2002 sublocations that had an exact name match in the sublocations shapefile. The string distances listed in the sl_dist column tell us a little about how stringdist() scored the sublocation names that weren’t identical. For this particular sample, it looks like matches with a score less than 0.11 would be acceptable.4 We would encourage researchers to go through this exercise, with their full data set, each time they set a string distance threshold.
We’ll now go through the steps outlined above in order to match a single polling station. This workflow demonstrates the core stages of our geocoding procedure: spatial blocking, fuzzy string matching, and match evaluation. We simplify the exercise by using polygon data only from 2002. The point name strings have already been cleaned and truncated.
# Pick a polling station
tps <- sample_n(ps, 1)
# Blocking ----
# Use old constituency code as a blocking key
tconst <- const_old[const_old$id == tps$constid10, ]
tconstb <- gBuffer(tconst, width = 500)
# Subset match points and sublocations to this buffered constituency
tmp <- mp[tconstb, ]
tsl <- sl[tconstb, ]
# Fuzzy string matching, polygons then points ----
strd_poly_A <- stringdist(tps$sl07, tsl$SUBLOCATIO, method = "jw")
strd_poly_B <- stringdist(tps$loc07, tsl$LOCATION, method = "jw")
strd_mp <- stringdist(tps$trunc13, tmp$trunc, method = "jw")
# Index to the closest polygon and point matches
poly_ind_A <- which(strd_poly_A == min(strd_poly_A) & strd_poly_A < 0.08)
poly_ind_B <- which(strd_poly_B == min(strd_poly_B) & strd_poly_B < 0.08)
mp_ind <- which(strd_mp == min(strd_mp))
# Quality measures ----
# Calculate geographic distance between point and polygon matches
gdist_A <- gDistance(tmp[mp_ind, ], tsl[poly_ind_A, ], byid = TRUE)
gdist_B <- gDistance(tmp[mp_ind, ], tsl[poly_ind_B, ], byid = TRUE)
# Record the minimum distance for each potential point match
min_gdist_A <- colMins(gdist_A)
min_gdist_B <- colMins(gdist_B)
type_match <- tps$type13 == tmp$type[mp_ind]
subtype_match <- tps$subtype13 == tmp$subtype[mp_ind]
# Build final output
tmatch <- tmp[mp_ind, ]
tmatch@data <-
data.frame(tps, type_match, subtype_match, min_gdist_A, min_gdist_B,
mp_name = tmatch$name, mp_code = tmatch$code, mp_source = tmatch$source,
slmatch = tsl$SL_ID[poly_ind_A])
Our result here, tmatch, contains a single likely location for the chosen polling station. We’ve also included additional data on the quality of the match, as well as the ID number of the sublocation in which we matched the polling station.
The example above would constitute a single step in our matching algorithm. In our full record-linkage process, we iterate a similar procedure for all unmatched polling stations using progressively less precise blocking methods and appropriately adjusted string distance thresholds. The code/functions.R script contains the full code for our two most precise matching functions: match_lsl() and match_inf_loc13(). Please refer to that script for more detailed comments on how these functions differ from the example above. Below, we demonstrate how to apply these two functions to the full sample data set. We run these fairly intensive processes in parallel.5 These two functions accept three arguments: const, an sp object of buffered constituency polygons; urb_const, a vector of urban constituency IDs; and polys, a character vector naming the administrative-unit columns to match against. Instead of returning a data frame, these functions output a list, though the data contained are essentially the same.
# Split polling stations into list
match_list <- split(ps, ps$uid)
# Find old urban constituency IDs
urb_county <- c("MOMBASA", "NAIROBI CITY", "NAKURU", "KISUMU")
urb_constid02 <- unique(filter(ps, county13 %in% urb_county)$constid02)
# Run most precise matching function on six cores
lsl_match <- mclapply(match_list,
match_lsl,
const = constb,
urb_const = urb_constid02,
polys = c("sl97", "loc97", "sl02", "loc02", "sl07", "loc07"),
mc.cores = cores)
# Find unmatched polling stations
found_lsl <- map(lsl_match, "match_mp")
no_match <- which(sapply(found_lsl, is.character))
In subsequent matching steps, the lower precision typically comes from subsetting the match points with larger polygons or from relying on lower-quality, less recent data. Polling stations may be missing the data that allow us to link them to relatively small polygons. In such cases, it may be possible to infer a polling station’s inclusion in a set of administrative boundaries based on data from other polling stations. For example, polling stations newly created in 2013 are missing any information on the encompassing sublocation. However, we do observe the 2013 ward. Rather than begin by searching in an entire ward, we can instead infer sublocation data based on previously established polling stations in the same ward. We then look for matches within these smaller, high-quality polygons.
# Remove matched polling stations
ps2 <- filter(ps, uid %in% names(no_match))
# Build lists of potential locations and sublocations
ps3 <- group_by(ps2, wardid13) %>%
mutate(loc13 = na.omit(c(loc07, loc02, loc97)) %>% unique() %>% list(),
sl13 = na.omit(c(sl07, sl02, sl97)) %>% unique() %>% list()
) %>% ungroup()
# Find new urban constituency IDs
urb_constid13 <- unique(filter(ps, county13 %in% urb_county)$constid13)
match_list2 <- split(ps3, ps3$uid)
inf_loc13_match <- mclapply(match_list2,
match_inf_loc13,
const = const13b,
urb_const = urb_constid13,
polys = c("sl13", "loc13"),
mc.cores = cores)
For any polling stations remaining unmatched after all steps, we resort to imputing points within the smallest geographic area possible. The spatstat package provides functions and classes that make this process simple, allowing us to use a population density raster as the probability density of the polling station locations (Baddeley, Rubak, and Turner 2015). Below, we continue with the previous example’s polling station, as if we had not found any point matches.
# Index to matched sublocation
tsl <- sl[sl$SL_ID == tmatch$slmatch, ]
# Read raster (file cropped from full Kenya population density raster)
pd <- raster("data/pd_sl5109.tif")
# Generate point
ppp <- rpoint(n = 1, f = as.im(pd))
# Convert to spatial point object
imp_point <- as(ppp, "SpatialPoints")
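As a quick sanity check, we can confirm visually that the imputed point falls within the matched sublocation. (The CRS assignment below is needed because the conversion from spatstat drops projection information.)
# Assign the sublocation's CRS to the imputed point, then plot both
proj4string(imp_point) <- proj4string(tsl)
plot(tsl)
plot(imp_point, add = TRUE, pch = 19, col = "red")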
Social scientists have rapidly seized on the research potential of GIS data. However, the analyses that spatial data afford almost always require some amount of record linkage to join quantities of interest to the relevant spatial covariates. In locations with a still-nascent spatial data industry, researchers’ typical problems managing messy data are compounded by the difficulties of working with imprecise or unstandardized GIS files.
We hope that this guide can offer a useful resource for those looking to geocode points rigorously and systematically. The methods presented here are far from perfect: errors may still occur at any stage of our linkage process. However, we have found this procedure to be both sufficiently accurate and scalable to relatively large data sets.
Baddeley, Adrian, Ege Rubak, and Rolf Turner. 2015. Spatial Point Patterns: Methodology and Applications with R. London: Chapman and Hall/CRC Press. http://www.crcpress.com/Spatial-Point-Patterns-Methodology-and-Applications-with-R/Baddeley-Rubak-Turner/9781482210200/.
Bivand, Roger, and Nicholas Lewin-Koh. 2017. Maptools: Tools for Reading and Handling Spatial Objects. https://CRAN.R-project.org/package=maptools.
Bivand, Roger, and Colin Rundel. 2017. Rgeos: Interface to Geometry Engine - Open Source (Geos). https://CRAN.R-project.org/package=rgeos.
Bivand, Roger, Tim Keitt, and Barry Rowlingson. 2017. Rgdal: Bindings for the Geospatial Data Abstraction Library. https://CRAN.R-project.org/package=rgdal.
Hijmans, Robert J. 2016. Raster: Geographic Data Analysis and Modeling. https://CRAN.R-project.org/package=raster.
Linard, Catherine, Marius Gilbert, Robert W. Snow, Abdisalan M. Noor, and Andrew J. Tatem. 2012. “Population distribution, settlement patterns and accessibility across Africa in 2010.” PLoS ONE 7 (2). doi:10.1371/journal.pone.0031743.
Pebesma, E.J., and R.S. Bivand. 2005. “Classes and Methods for Spatial Data in R.” R News 5 (2). https://cran.r-project.org/doc/Rnews/.
R Core Team. 2016. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
van der Loo, M.P.J. 2014. “The Stringdist Package for Approximate String Matching.” The R Journal 6 (1): 111–22. https://CRAN.R-project.org/package=stringdist.
When geocoding ALPHA PRIMARY SCHOOL, we would consider ALPHA SECONDARY SCHOOL to be a much better match than BETA PRIMARY SCHOOL. We are most intent on assigning the correct geographic location (ALPHA), not the correct school type.↩
Cleaning with the order_substr() function works poorly when substrings are inconsistently separated by the split pattern. For example, the vector of tuna species above hyphenates yellow-fin but leaves bluefin as a single word. Therefore, with the default split patterns, order_substr() splits “yellow-fin” prior to alphabetizing the string. (In this particular case, a good fix may be to adjust the regex to split at hyphens only when they are surrounded by whitespace.)↩
If you are running this code on a machine with less than 32GB of RAM, we suggest decreasing the mc.cores parameter for the matching processes.↩
Given that sublocation names are not always unique, we could be more confident in these matches if we had blocked on constituency.↩
If you are running this code on a machine with less than 32GB of RAM, we suggest decreasing the mc.cores parameter for the matching processes.↩