library(getACS)
library(ggplot2)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(gt)
options(tigris_use_cache = TRUE)
To illustrate how to interpolate ACS data to a new geometry using
{getACS}
, we can first download some basic population
data:
pop_sf <- get_acs_tables(
geography = "tract",
state = "Maryland",
county = "Baltimore city",
table = "B01001",
geometry = TRUE,
crs = 3857
)
#> Warning: • You have not set a Census API key. Users without a key are limited to 500
#> queries per day and may experience performance limitations.
#> ℹ For best results, get a Census API key at
#> http://api.census.gov/data/key_signup.html and then supply the key to the
#> `census_api_key()` function to use it throughout your tidycensus session.
#> This warning is displayed once per session.
#> | | | 0% | |= | 2% | |=== | 4% | |==== | 6% | |====== | 8% | |========== | 14% | |======================= | 32% | |=========================================== | 61% | |============================================= | 65% | |================================================== | 71% | |======================================================================| 100%
total_pop_sf <- filter_acs(pop_sf, vars = 1)
total_pop <- sf::st_drop_geometry(total_pop_sf)
We can also download Baltimore City neighborhood boundaries to provide an alternate geometry that we are interpolating the data to fit:
area <- sf::read_sf("https://services1.arcgis.com/UWYHeuuJISiGmgXx/ArcGIS/rest/services/Neighborhoods/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson")
area <- sf::st_transform(area, crs = sf::st_crs(pop_sf))
area <- select(area, NAME = Name)
Interpolating ACS data to a new geometry using {getACS}
is a multi-step process:
make_block_xwalk()
make_area_xwalk()
use_area_xwalk()
block_xwalk <- make_block_xwalk(
state = "Maryland",
county = "Baltimore city"
)
#> | | | 0% | |== | 3% | |==== | 6% | |====== | 8% | |======== | 11% | |========= | 13% | |=========== | 16% | |============= | 19% | |=============== | 21% | |================= | 24% | |================== | 26% | |==================== | 29% | |====================== | 31% | |======================== | 34% | |========================== | 37% | |=========================== | 39% | |============================= | 42% | |=============================== | 44% | |================================= | 47% | |=================================== | 50% | |==================================== | 52% | |====================================== | 55% | |======================================== | 57% | |========================================== | 60% | |============================================ | 62% | |============================================== | 65% | |=============================================== | 68% | |================================================= | 70% | |=================================================== | 73% | |===================================================== | 75% | |======================================================= | 78% | |======================================================== | 81% | |========================================================== | 83% | |============================================================ | 86% | |============================================================== | 88% | |================================================================ | 91% | |================================================================= | 93% | |=================================================================== | 96% | |===================================================================== | 99% | |======================================================================| 100%
#> | | | 0% | |==================== | 29% | |========================================= | 58% | |============================================================= | 87% | |======================================================================| 100%
area_xwalk <- make_area_xwalk(
area = area,
block_xwalk = block_xwalk,
coverage = FALSE,
weight_col = "POP20",
erase = TRUE
)
total_pop_xwalk <- use_area_xwalk(
total_pop,
area_xwalk,
weight_col = "perc_POP20"
)
total_pop_xwalk <- arrange(total_pop_xwalk, NAME)
Note that this method does not require the input data
(total_pop
in this example) to be a sf object. A data frame
is all you need. You can also avoid a separate call to
make_block_xwalk()
by supplying a state and county to
make_area_xwalk()
. One limitation is that
use_area_xwalk()
requires a long format data frame and does
not support interpolation for values returned by
tidycensus::get_acs()
when
output = "wide"
.
The tidycensus::interpolate_pw()
function provides very
similar functionality. While the block data needs to be prepared
separately, a single function can substitute for both
make_area_xwalk()
and use_area_xwalk()
.
blocks <- tigris::blocks("MD", "Baltimore city", year = 2020)
blocks <- sf::st_transform(blocks, crs = 3857)
total_pop_interpolated <- tidycensus::interpolate_pw(
from = total_pop_sf,
to = area,
to_id = "NAME",
weights = blocks,
weight_column = "POP20",
extensive = TRUE,
crs = 3857
)
total_pop_interpolated <- select(total_pop_interpolated, NAME, estimate)
total_pop_interpolated <- sf::st_drop_geometry(total_pop_interpolated)
This method works well but there are a few limitations:
from
, to
, and weights
parameters must all share the same coordinate reference system unless
crs
is explicitly provided.tigris::erase_water()
must be applied to input data in
advance. For make_area_xwalk()
, the erase parameter can
also be an sf object supporting the option to erase open space or other
undeveloped land as well as water areas.to
is
dropped and all numeric values are transformed (even if they aren’t an
estimate value).tidycensus::moe_sum()
.By default, make_area_xwalk()
joins the
block_xwalk
and area
based on which area has
the largest overlapping area with each U.S. Census block. This method is
slower than the default method for
tidycensus::interpolate_pw
: converting the Census block
geometries to a point on surface before joining the two. You can select
the latter approach by setting placement = "surface"
or
placement = "centroid"
for
make_area_xwalk()
.
This difference in spatial joins accounts for the small differences between the interpolated estimates using the two different functions. The following table shows the differences between the estimates returned by each method:
comparison_data <- left_join(
select(total_pop_xwalk, name = NAME, estimate),
select(total_pop_interpolated, name = NAME, estimate),
by = "name",
suffix = c("_xwalk", "_interpolated")
) |>
mutate(
estimate_interpolated = round(estimate_interpolated),
difference = estimate_xwalk - estimate_interpolated
)
comparison_summary <- bind_rows(
slice_max(comparison_data, difference, n = 5) |>
mutate(
category = "`interpolate_pw` estimate is lower"
),
slice_min(comparison_data, difference, n = 5) |>
mutate(
category = "`use_area_xwalk` estimate is lower"
)
)
comparison_summary |>
gt(groupname_col = "category") |>
tab_header(
"10 areas with the largest difference between interpolation methods"
)
10 areas with the largest difference between interpolation methods | |||
name | estimate_xwalk | estimate_interpolated | difference |
---|---|---|---|
`interpolate_pw` estimate is lower | |||
Locust Point Industrial Area | 1296 | 847 | 449 |
Belair-Edison | 15983 | 15857 | 126 |
Evergreen | 357 | 295 | 62 |
New Southwest/Mount Clare | 1993 | 1932 | 61 |
Dorchester | 2110 | 2057 | 53 |
`use_area_xwalk` estimate is lower | |||
Dickeyville | 161 | 837 | -676 |
Locust Point | 2143 | 2586 | -443 |
Four By Four | 866 | 997 | -131 |
Roland Park | 4971 | 5074 | -103 |
Carrollton Ridge | 2241 | 2302 | -61 |
As you can see, the differences in the estimates are mostly very small. Most of these areas are adjacent to large water or open space areas suggesting that any differences could be mitigated by erasing the non-populated geometry from both the blocks and area data before interpolating estimates.
In summary, I’d recommend the {getACS}
functions if you
want to retain a more accurate margin of error, you like a more verbose
function, or you don’t want to worry about matching coordinate reference
systems in advance.
If you prefer working with wide format data and don’t want to
transform the shape of your data back and forth with
tidyr::pivot_longer()
and
tidyr::pivot_wider()
, you may prefer sticking with
tidycensus::interpolate_pw()
instead.