censusapi is a lightweight package that retrieves data from the U.S. Census Bureau’s APIs. More than 1,000 Census API endpoints are available, including the Decennial Census, American Community Survey, Poverty Statistics, Population Estimates, and Census microdata. This package is designed to let you get data from all of those APIs using the same main functions and syntax for every dataset.
This package returns the data as-is with the original variable names created by the Census Bureau and any quirks inherent in the data. Each dataset is a little different. Some are documented thoroughly, others have documentation that is sparse. Sometimes variable names change each year. This package can’t overcome those challenges, but tries to make it easier to get the data for use in your analysis. Make sure to thoroughly read the documentation for your dataset and see below for how to get help with Census data.
API key setup
To use the Census APIs, sign up for an API key, which will be sent to your provided email address. You’ll need that key to use this package.
To save your API key, within R, run:
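# Set the key as an environment variable (replace the placeholder with your own key, in quotes)
Sys.setenv(CENSUS_KEY = "PASTEYOURKEYHERE")
# To keep the key across sessions, add the line CENSUS_KEY=PASTEYOURKEYHERE to ~/.Renviron, then reload it
readRenviron("~/.Renviron")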
# Check to see that the expected key is output in your R console
Sys.getenv("CENSUS_KEY")
Once you’ve added your census key to your system environment, censusapi will use it by default without any extra work on your part.
In some instances you might not want to put your key in your .Renviron, for example, if you’re on a shared school computer. You can always choose to manually set key = "YOURKEY" as an argument in getCensus() if you prefer.
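Before making any calls, it can help to confirm whether a key is actually visible to your R session. The following is a minimal sketch using only base R; the variable name census_key is an illustrative choice, not something the package defines.

```r
# Read the key from the environment; Sys.getenv() returns "" when CENSUS_KEY is not set.
census_key <- Sys.getenv("CENSUS_KEY")

if (nzchar(census_key)) {
  # Report only the length so the key itself is never printed
  message("CENSUS_KEY is set (", nchar(census_key), " characters).")
} else {
  message("CENSUS_KEY is not set; pass key = \"YOURKEY\" to getCensus() instead.")
}
```

Checking the length rather than printing the value keeps the key out of shared consoles and logs.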
Finding your API
To get started, load the censusapi package.
library(censusapi)
To see a current table of every available endpoint, use listCensusApis(). This data frame includes useful information for making your API call, including the dataset’s name, description and title, as well as a contact email for questions about the underlying data.
apis <- listCensusApis()
colnames(apis)
#> [1] "title" "name" "vintage" "type" "temporal"
#> [6] "url" "modified" "description" "contact"
Each column describes one attribute of an endpoint:
- title: Short written description of the dataset
- name: Programmatic name of the dataset, to be used with censusapi functions
- vintage: Year of the survey, for use with microdata and aggregate datasets
- type: Dataset type, which is either Aggregate, Microdata, or Timeseries
- temporal: Time period of the dataset - only documented sometimes
- url: Base URL of the endpoint
- modified: Date last modified
- description: Long written description of the dataset
- contact: Email address for specific questions about the Census Bureau survey
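Because the endpoint table is an ordinary data frame, it can be narrowed with base R. The sketch below uses a small hand-made stand-in (apis_sample, a hypothetical name) with three of the columns listed above, so it runs without a network call. The NA vintage for the timeseries row is an assumption about how the real table looks.

```r
# Hand-made stand-in for listCensusApis() output, limited to three columns.
# The rows reuse endpoint names and vintages that appear in this guide.
apis_sample <- data.frame(
  name = c("acs/acs5", "dec/sf1", "timeseries/healthins/sahie"),
  vintage = c(2020, 2010, NA),  # assumption: timeseries endpoints have no vintage
  type = c("Aggregate", "Aggregate", "Timeseries"),
  stringsAsFactors = FALSE
)

# Keep only timeseries endpoints
timeseries_apis <- subset(apis_sample, type == "Timeseries")
timeseries_apis$name

# Find endpoints whose programmatic name starts with "acs/"
acs_apis <- apis_sample[grepl("^acs/", apis_sample$name), ]
acs_apis
```

The same subset() and grepl() calls should work on the real apis data frame, since they rely only on the name and type columns.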
Dataset types
There are three types of datasets included in the Census Bureau API universe: aggregate, microdata, and timeseries. These type names were defined by the Census Bureau and are included as a column in listCensusApis().
table(apis$type)
#>
#> Aggregate Microdata Timeseries
#> 556 637 57
Most users will work with summary data, either aggregate or timeseries. Summary data contains pre-calculated numbers or percentages for a given statistic — like the number of children in a state or the median household income. The examples below and in the broader list of censusapi examples use summary data.
Aggregate datasets, like the American Community Survey or Decennial Census, include data for only one time period (a vintage), usually one year. Datasets like the American Community Survey contain thousands of these pre-computed variables.
Timeseries datasets, including the Small Area Income and Poverty Estimates, the Quarterly Workforce Indicators, and International Trade statistics, allow users to query data for more than one time period in a single API call.
Microdata contains the individual-level responses for a survey for use in custom analysis. One row represents one person. Only advanced analysts will want to use microdata. Learn more about what microdata is and how to use it with censusapi in Accessing microdata.
Using getCensus
The main function in censusapi is getCensus(), which makes an API call to a given endpoint and returns a data frame with results. Each API has slightly different parameters, but there are always a few required arguments:
- name: the programmatic name of the endpoint as defined by the Census, like “acs/acs5” or “timeseries/bds/firms”
- vintage: the survey year, required for aggregate or microdata APIs
- vars: a list of variables to retrieve
- region: the geography level to retrieve, such as state or county, required for most endpoints
Some APIs have additional required or optional arguments, like time or monthly for some timeseries datasets. Check the specific documentation for your API and explore its metadata with listCensusMetadata() to see what options are allowed.
Let’s walk through an example getting uninsured rates using the Small Area Health Insurance Estimates API, which provides detailed annual state-level and county-level estimates of health insurance rates for people below age 65.
Choosing variables
censusapi includes a metadata function called listCensusMetadata() to get information about an API’s variable and geography options. Let’s see what variables are available in the SAHIE API:
sahie_vars <- listCensusMetadata(
name = "timeseries/healthins/sahie",
type = "variables")
# See the full list of variables
sahie_vars$name
#> [1] "for" "in" "time" "NIPR_LB90" "NIPR_PT"
#> [6] "AGECAT" "NIC_PT" "GEOID" "STATE" "RACE_DESC"
#> [11] "YEAR" "IPRCAT" "PCTIC_UB90" "NIPR_MOE" "PCTUI_LB90"
#> [16] "NIC_MOE" "US" "COUNTY" "NUI_UB90" "PCTUI_MOE"
#> [21] "NIC_UB90" "NUI_MOE" "SEXCAT" "PCTUI_PT" "PCTIC_LB90"
#> [26] "PCTUI_UB90" "NUI_PT" "STABREV" "AGE_DESC" "NAME"
#> [31] "NIC_LB90" "PCTIC_PT" "PCTIC_MOE" "IPR_DESC" "NIPR_UB90"
#> [36] "NUI_LB90" "GEOCAT" "SEX_DESC" "RACECAT"
# Full info on the first several variables
head(sahie_vars)
name | label | concept | predicateType | group | limit | predicateOnly | required |
---|---|---|---|---|---|---|---|
for | Census API FIPS ‘for’ clause | Census API Geography Specification | fips-for | N/A | 0 | TRUE | NA |
in | Census API FIPS ‘in’ clause | Census API Geography Specification | fips-in | N/A | 0 | TRUE | NA |
time | ISO-8601 Date/Time value | Census API Date/Time Specification | datetime | N/A | 0 | TRUE | true |
NIPR_LB90 | Number in Demographic Group for Selected Income Range, Lower Bound for 90% Confidence Interval | Uncertainty Measure | int | N/A | 0 | NA | NA |
NIPR_PT | Number in Demographic Group for Selected Income Range, Estimate | Estimate | int | N/A | 0 | NA | NA |
AGECAT | Age Category | Demographic ID | int | N/A | 6 | NA | default displayed |
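With dozens of variables, pattern matching on the names is a quick way to find the ones you need. This sketch copies the names from the sahie_vars$name output above into a plain character vector (sahie_var_names, a hypothetical name) so it runs offline. The suffix meanings are inferred from the labels shown in the table.

```r
# Variable names copied from the sahie_vars$name output above
sahie_var_names <- c(
  "for", "in", "time", "NIPR_LB90", "NIPR_PT", "AGECAT", "NIC_PT", "GEOID",
  "STATE", "RACE_DESC", "YEAR", "IPRCAT", "PCTIC_UB90", "NIPR_MOE",
  "PCTUI_LB90", "NIC_MOE", "US", "COUNTY", "NUI_UB90", "PCTUI_MOE",
  "NIC_UB90", "NUI_MOE", "SEXCAT", "PCTUI_PT", "PCTIC_LB90", "PCTUI_UB90",
  "NUI_PT", "STABREV", "AGE_DESC", "NAME", "NIC_LB90", "PCTIC_PT",
  "PCTIC_MOE", "IPR_DESC", "NIPR_UB90", "NUI_LB90", "GEOCAT", "SEX_DESC",
  "RACECAT"
)

# Point estimates end in "_PT"
estimate_vars <- grep("_PT$", sahie_var_names, value = TRUE)
estimate_vars

# Category ID variables end in "CAT"
category_vars <- grep("CAT$", sahie_var_names, value = TRUE)
category_vars
```

On the real metadata, the same grep() calls can be applied to sahie_vars$name, or grepl() can be used to subset the full data frame and keep the labels alongside the names.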
Choosing regions
We can also use listCensusMetadata() to see which geographic levels are available.
listCensusMetadata(
name = "timeseries/healthins/sahie",
type = "geography")
name | geoLevelId | limit | referenceDate | requires | wildcard | optionalWithWCFor |
---|---|---|---|---|---|---|
us | 010 | 1 | 2015-01-01 | NULL | NULL | NA |
county | 050 | 3142 | 2015-01-01 | state | state | state |
state | 040 | 52 | 2015-01-01 | NULL | NULL | NA |
This API has three geographic levels: us, county, and state. County data can be queried for all counties nationally or within a specific state.
Making a censusapi call
First, using getCensus(), let’s get the percent (PCTUI_PT) and number (NUI_PT) of people uninsured, using the wildcard star (*) to retrieve data for all counties.
sahie_counties <- getCensus(
name = "timeseries/healthins/sahie",
vars = c("NAME", "PCTUI_PT", "NUI_PT"),
region = "county:*",
time = 2019)
head(sahie_counties)
time | state | county | NAME | PCTUI_PT | NUI_PT |
---|---|---|---|---|---|
2019 | 01 | 001 | Autauga County, AL | 9.4 | 4366 |
2019 | 01 | 003 | Baldwin County, AL | 10.9 | 19085 |
2019 | 01 | 005 | Barbour County, AL | 13.0 | 2194 |
2019 | 01 | 007 | Bibb County, AL | 11.0 | 1824 |
2019 | 01 | 009 | Blount County, AL | 14.3 | 6663 |
2019 | 01 | 011 | Bullock County, AL | 11.1 | 752 |
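County results come back with separate state and county codes. To join them to other county-level data you will often want a single five-digit FIPS code. Here is a minimal sketch using the first three rows of the table above, typed in by hand as sahie_sample (a hypothetical name).

```r
# First three rows of the county result above, typed in by hand
sahie_sample <- data.frame(
  state = c("01", "01", "01"),
  county = c("001", "003", "005"),
  NAME = c("Autauga County, AL", "Baldwin County, AL", "Barbour County, AL"),
  PCTUI_PT = c(9.4, 10.9, 13.0),
  stringsAsFactors = FALSE
)

# Keep the codes as character so the leading zeros survive
sahie_sample$fips <- paste0(sahie_sample$state, sahie_sample$county)
sahie_sample[, c("fips", "NAME", "PCTUI_PT")]
```

Converting the codes to numbers would turn "01001" into 1001, which is why the sketch keeps them as strings.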
We can also get data on detailed income and demographic groups from the SAHIE. We’ll use region to specify county-level results and regionin to filter to Virginia, state code 51. We’ll get uninsured rates by income group, IPRCAT.
sahie_virginia <- getCensus(
name = "timeseries/healthins/sahie",
vars = c("NAME", "IPRCAT", "IPR_DESC", "PCTUI_PT"),
region = "county:*",
regionin = "state:51",
time = 2019)
head(sahie_virginia)
time | state | county | NAME | IPRCAT | IPR_DESC | PCTUI_PT |
---|---|---|---|---|---|---|
2019 | 51 | 001 | Accomack County, VA | 0 | All Incomes | 15.1 |
2019 | 51 | 001 | Accomack County, VA | 1 | <= 200% of Poverty | 19.6 |
2019 | 51 | 001 | Accomack County, VA | 2 | <= 250% of Poverty | 19.4 |
2019 | 51 | 001 | Accomack County, VA | 3 | <= 138% of Poverty | 19.7 |
2019 | 51 | 001 | Accomack County, VA | 4 | <= 400% of Poverty | 17.5 |
2019 | 51 | 001 | Accomack County, VA | 5 | 138% to 400% of Poverty | 16.3 |
Because the SAHIE API is a timeseries dataset, as indicated in its name, we can get multiple years of data at once by changing time = X to time = "from X to Y". Let’s get that data for DeKalb County, Georgia using county FIPS code 089 and state FIPS code 13. You can look up FIPS codes on the Census Bureau website.
sahie_years <- getCensus(
name = "timeseries/healthins/sahie",
vars = c("NAME", "PCTUI_PT"),
region = "county:089",
regionin = "state:13",
time = "from 2006 to 2019")
sahie_years
time | state | county | NAME | PCTUI_PT |
---|---|---|---|---|
2006 | 13 | 089 | DeKalb County, GA | 19.0 |
2007 | 13 | 089 | DeKalb County, GA | 17.2 |
2008 | 13 | 089 | DeKalb County, GA | 22.5 |
2009 | 13 | 089 | DeKalb County, GA | 22.9 |
2010 | 13 | 089 | DeKalb County, GA | 25.8 |
2011 | 13 | 089 | DeKalb County, GA | 23.9 |
2012 | 13 | 089 | DeKalb County, GA | 21.7 |
2013 | 13 | 089 | DeKalb County, GA | 22.1 |
2014 | 13 | 089 | DeKalb County, GA | 19.4 |
2015 | 13 | 089 | DeKalb County, GA | 16.9 |
2016 | 13 | 089 | DeKalb County, GA | 15.3 |
2017 | 13 | 089 | DeKalb County, GA | 15.9 |
2018 | 13 | 089 | DeKalb County, GA | 17.1 |
2019 | 13 | 089 | DeKalb County, GA | 16.9 |
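Once several years are in one data frame, simple summaries are a line or two of base R. This sketch re-enters the DeKalb County values from the table above as sahie_years_sample (a hypothetical name), then finds the peak year and the change from that peak to the latest year.

```r
# Uninsured rates for DeKalb County, GA, copied from the table above
sahie_years_sample <- data.frame(
  time = 2006:2019,
  PCTUI_PT = c(19.0, 17.2, 22.5, 22.9, 25.8, 23.9, 21.7,
               22.1, 19.4, 16.9, 15.3, 15.9, 17.1, 16.9)
)

# Year with the highest uninsured rate
peak_year <- sahie_years_sample$time[which.max(sahie_years_sample$PCTUI_PT)]
peak_year

# Percentage-point change from the peak to the most recent year
latest_rate <- tail(sahie_years_sample$PCTUI_PT, 1)
change_since_peak <- round(latest_rate - max(sahie_years_sample$PCTUI_PT), 1)
change_since_peak
```

Rounding to one decimal place matches the precision of the published estimates and avoids floating-point noise in the difference.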
We can also filter the data by income group using the IPRCAT variable again. IPRCAT = 3 represents <=138% of the federal poverty line. That is the threshold for Medicaid eligibility in states that have expanded it under the Affordable Care Act.
Getting this data for Los Angeles County (FIPS code 06037), we can see the dramatic decrease in the uninsured rate in this income group after California expanded Medicaid.
sahie_138 <- getCensus(
name = "timeseries/healthins/sahie",
vars = c("NAME", "PCTUI_PT", "NUI_PT"),
region = "county:037",
regionin = "state:06",
IPRCAT = 3,
time = "from 2010 to 2019")
sahie_138
time | state | county | NAME | PCTUI_PT | NUI_PT | IPRCAT |
---|---|---|---|---|---|---|
2010 | 06 | 037 | Los Angeles County, CA | 37.4 | 894385 | 3 |
2011 | 06 | 037 | Los Angeles County, CA | 35.1 | 867577 | 3 |
2012 | 06 | 037 | Los Angeles County, CA | 34.4 | 865516 | 3 |
2013 | 06 | 037 | Los Angeles County, CA | 33.0 | 818978 | 3 |
2014 | 06 | 037 | Los Angeles County, CA | 24.9 | 607542 | 3 |
2015 | 06 | 037 | Los Angeles County, CA | 17.8 | 402977 | 3 |
2016 | 06 | 037 | Los Angeles County, CA | 15.4 | 329251 | 3 |
2017 | 06 | 037 | Los Angeles County, CA | 14.3 | 281842 | 3 |
2018 | 06 | 037 | Los Angeles County, CA | 13.9 | 255520 | 3 |
2019 | 06 | 037 | Los Angeles County, CA | 15.1 | 254740 | 3 |
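To see where the decline was steepest, you can compute year-over-year changes with diff(). This sketch copies the PCTUI_PT column from the table above into sahie_138_sample (a hypothetical name) and picks out the largest single-year drop.

```r
# Uninsured rates for Los Angeles County, IPRCAT = 3, copied from the table above
sahie_138_sample <- data.frame(
  time = 2010:2019,
  PCTUI_PT = c(37.4, 35.1, 34.4, 33.0, 24.9, 17.8, 15.4, 14.3, 13.9, 15.1)
)

# Percentage-point change from the previous year; the first year has no predecessor
yearly_change <- data.frame(
  time = sahie_138_sample$time[-1],
  change = round(diff(sahie_138_sample$PCTUI_PT), 1)
)

# Row with the largest single-year decrease
largest_drop <- yearly_change[which.min(yearly_change$change), ]
largest_drop
```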
We can also get data for other useful demographics such as age group.
sahie_age <- getCensus(
name = "timeseries/healthins/sahie",
vars = c("NAME", "PCTUI_PT", "NUI_PT", "AGECAT", "AGE_DESC"),
region = "county:037",
regionin = "state:06",
time = 2019)
sahie_age
time | state | county | NAME | PCTUI_PT | NUI_PT | AGECAT | AGE_DESC |
---|---|---|---|---|---|---|---|
2019 | 06 | 037 | Los Angeles County, CA | 11.1 | 940376 | 0 | Under 65 years |
2019 | 06 | 037 | Los Angeles County, CA | 13.6 | 864634 | 1 | 18 to 64 years |
2019 | 06 | 037 | Los Angeles County, CA | 12.8 | 406708 | 2 | 40 to 64 years |
2019 | 06 | 037 | Los Angeles County, CA | 11.3 | 208558 | 3 | 50 to 64 years |
2019 | 06 | 037 | Los Angeles County, CA | 3.9 | 85306 | 4 | Under 19 years |
2019 | 06 | 037 | Los Angeles County, CA | 13.7 | 822705 | 5 | 21 to 64 years |
Annotations
Some Census datasets, including the American Community Survey, use annotated values. These values use numbers or symbols to indicate that the data is unavailable, has been top coded, has an insufficient sample size, or other noteworthy characteristics. Read more from the Census Bureau on ACS annotation meanings and ACS variable types.
The censusapi package is intended to return the data as-is so that you can receive those unaltered annotations. If you are using data for a small geography like Census tract or block group make sure to check for values like -666666666 or check the annotation columns for non-empty values to exclude as needed.
As an example, we’ll get median income with associated annotations and margin of error for three census tracts in Washington, DC. The value for one tract is available, one is top coded, and one is unavailable. Notice that income is top coded at $250,000, meaning any tract’s income that is above that threshold is listed as $250,001. The “EA” (estimate annotation) and “MA” (margin of error annotation) columns show when a value has a special meaning.
acs_income <- getCensus(
name = "acs/acs5",
vintage = 2020,
vars = c("B19013_001E", "B19013_001EA", "B19013_001M", "B19013_001MA"),
region = "tract:006804,007703,000903",
regionin = "county:001&state:11")
acs_income
state | county | tract | B19013_001E | B19013_001EA | B19013_001M | B19013_001MA |
---|---|---|---|---|---|---|
11 | 001 | 007703 | 46156 | NA | 24087 | NA |
11 | 001 | 000903 | 250001 | 250,000+ | -333333333 | *** |
11 | 001 | 006804 | -666666666 | - | -222222222 | ** |
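One way to act on the annotations is to build a cleaned column that keeps an estimate only when its annotation is empty. The sketch below re-enters the three rows above as acs_income_sample (a hypothetical name). The column income_clean is an illustrative choice, and whether to drop or keep top-coded values depends on your analysis.

```r
# The three tracts from the table above, typed in by hand
acs_income_sample <- data.frame(
  tract = c("007703", "000903", "006804"),
  B19013_001E = c(46156, 250001, -666666666),
  B19013_001EA = c(NA, "250,000+", "-"),
  stringsAsFactors = FALSE
)

# An estimate is usable as-is only when its annotation column is empty (NA here)
has_annotation <- !is.na(acs_income_sample$B19013_001EA)

# Replace annotated estimates (top-coded or unavailable) with NA
acs_income_sample$income_clean <- ifelse(
  has_annotation, NA, acs_income_sample$B19013_001E
)
acs_income_sample
```

Keeping the original columns next to the cleaned one makes it easy to audit which values were set aside and why.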
Variable groups
For some surveys, particularly the American Community Survey and Decennial Census, you can get many related variables at once using a variable group. These groups are defined by the Census Bureau. In some other data tools, like data.census.gov, this concept is referred to as a table.
Some groups have several dozen variables, others just have a few. As an example, we’ll get the estimate, margin of error and annotations for median household income in the past 12 months for Census tracts in Alaska using group B19013.
First, see descriptions of the variables in group B19013:
group_B19013 <- listCensusMetadata(
name = "acs/acs5",
vintage = 2017,
type = "variables",
group = "B19013")
group_B19013
name | label | concept | predicateType | group | limit | predicateOnly |
---|---|---|---|---|---|---|
B19013_001MA | Annotation of Margin of Error!!Median household income in the past 12 months (in 2017 inflation-adjusted dollars) | MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2017 INFLATION-ADJUSTED DOLLARS) | string | B19013 | 0 | TRUE |
B19013_001EA | Annotation of Estimate!!Median household income in the past 12 months (in 2017 inflation-adjusted dollars) | MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2017 INFLATION-ADJUSTED DOLLARS) | string | B19013 | 0 | TRUE |
B19013_001E | Estimate!!Median household income in the past 12 months (in 2017 inflation-adjusted dollars) | MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2017 INFLATION-ADJUSTED DOLLARS) | int | B19013 | 0 | TRUE |
B19013_001M | Margin of Error!!Median household income in the past 12 months (in 2017 inflation-adjusted dollars) | MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2017 INFLATION-ADJUSTED DOLLARS) | int | B19013 | 0 | TRUE |
Now, retrieve the data using vars = "group(B19013)". You could alternatively manually list each variable as vars = c("NAME", "B19013_001E", "B19013_001EA", "B19013_001M", "B19013_001MA"), but using the groups is much easier.
acs_income_group <- getCensus(
name = "acs/acs5",
vintage = 2017,
vars = "group(B19013)",
region = "tract:*",
regionin = "state:02")
head(acs_income_group)
state | county | tract | B19013_001E | B19013_001EA | B19013_001M | B19013_001MA | GEO_ID | NAME |
---|---|---|---|---|---|---|---|---|
02 | 068 | 000100 | 83295 | NA | 6362 | NA | 1400000US02068000100 | Census Tract 1, Denali Borough, Alaska |
02 | 261 | 000200 | 95227 | NA | 22638 | NA | 1400000US02261000200 | Census Tract 2, Valdez-Cordova Census Area, Alaska |
02 | 261 | 000300 | 89000 | NA | 20435 | NA | 1400000US02261000300 | Census Tract 3, Valdez-Cordova Census Area, Alaska |
02 | 261 | 000100 | 49076 | NA | 7165 | NA | 1400000US02261000100 | Census Tract 1, Valdez-Cordova Census Area, Alaska |
02 | 122 | 000200 | 57694 | NA | 6526 | NA | 1400000US02122000200 | Census Tract 2, Kenai Peninsula Borough, Alaska |
02 | 122 | 000800 | 50904 | NA | 3723 | NA | 1400000US02122000800 | Census Tract 8, Kenai Peninsula Borough, Alaska |
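A group call returns estimates, margins of error, and annotations side by side. If you only need some of them, you can select columns by their suffix. This sketch uses two rows from the table above, entered by hand as acs_group_sample (a hypothetical name).

```r
# Two rows from the group result above, typed in by hand
acs_group_sample <- data.frame(
  state = c("02", "02"),
  county = c("068", "261"),
  tract = c("000100", "000200"),
  B19013_001E = c(83295, 95227),
  B19013_001EA = c(NA, NA),
  B19013_001M = c(6362, 22638),
  B19013_001MA = c(NA, NA),
  NAME = c("Census Tract 1, Denali Borough, Alaska",
           "Census Tract 2, Valdez-Cordova Census Area, Alaska"),
  stringsAsFactors = FALSE
)

# Estimates end in "E" and margins of error in "M" after the three-digit index
estimate_cols <- grep("^B19013_[0-9]{3}E$", names(acs_group_sample), value = TRUE)
moe_cols <- grep("^B19013_[0-9]{3}M$", names(acs_group_sample), value = TRUE)

# Keep the name plus estimate and margin-of-error columns, dropping annotations
acs_estimates <- acs_group_sample[, c("NAME", estimate_cols, moe_cols)]
acs_estimates
```

The anchored patterns avoid accidentally matching the annotation columns, which end in "EA" and "MA".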
Advanced geographies
Some geographies, particularly Census tracts and blocks, need to be specified within larger geographies like states and counties. This varies by API endpoint, so make sure to read the documentation for your specific API and run listCensusMetadata(type = "geographies") to see the available options.
Tract-level data from the 2010 Decennial Census can only be requested from one state at a time. In this example, we use the built-in fips list of state FIPS codes to request tract-level data from each state and combine the results into a single data frame.
tracts <- NULL
for (f in fips) {
  # Build the state filter for this iteration, e.g. "state:01"
  stateget <- paste0("state:", f)
  temp <- getCensus(
    name = "dec/sf1",
    vintage = 2010,
    vars = "P001001",
    region = "tract:*",
    regionin = stateget)
  # Append this state's tracts to the running result
  tracts <- rbind(tracts, temp)
}
# How many tracts are present?
nrow(tracts)
#> [1] 73057
head(tracts)
state | county | tract | P001001 |
---|---|---|---|
01 | 001 | 020100 | 1912 |
01 | 001 | 020500 | 10766 |
01 | 001 | 020300 | 3373 |
01 | 001 | 020400 | 4386 |
01 | 001 | 020200 | 2170 |
01 | 001 | 020600 | 3668 |
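An alternative to growing a data frame inside a loop is to collect each state's result in a list and bind them once at the end. The sketch below shows that pattern with a stand-in function, fetch_state_tracts(), which is hypothetical and returns made-up rows so the example runs offline. In practice its body would be the per-state getCensus() call.

```r
# Hypothetical stand-in for a per-state getCensus() call.
# It returns two placeholder tract rows for whichever state code it is given.
fetch_state_tracts <- function(state_code) {
  data.frame(
    state = state_code,
    tract = c("000100", "000200"),
    stringsAsFactors = FALSE
  )
}

state_codes <- c("01", "02", "04")

# One data frame per state, collected in a list
pieces <- lapply(state_codes, fetch_state_tracts)

# Bind all pieces in a single step
tracts_sample <- do.call(rbind, pieces)
nrow(tracts_sample)
```

Binding once avoids copying the accumulated result on every iteration, which starts to matter as the number of rows grows.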
The regionin argument of getCensus() can also be used with a string of nested geographies, as shown below.
The 2010 Decennial Census summary file 1 requires you to specify a state and county to retrieve block-level data. Use region to request block-level data, and regionin to specify the desired state and county. The example below also narrows the request to a single tract.
data2010 <- getCensus(
name = "dec/sf1",
vintage = 2010,
vars = "P001001",
region = "block:*",
regionin = "state:36+county:027+tract:010000")
head(data2010)
state | county | tract | block | P001001 |
---|---|---|---|---|
36 | 027 | 010000 | 1000 | 31 |
36 | 027 | 010000 | 1011 | 17 |
36 | 027 | 010000 | 1028 | 41 |
36 | 027 | 010000 | 1001 | 0 |
36 | 027 | 010000 | 1031 | 0 |
36 | 027 | 010000 | 1002 | 4 |
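When the nested geography changes from call to call, it can be convenient to assemble the regionin string from its parts. The helper below, build_regionin(), is a hypothetical convenience function and not part of censusapi. It simply reproduces the "+"-separated format used in the example above.

```r
# Hypothetical helper (not part of censusapi): join named geography codes with "+"
build_regionin <- function(...) {
  geos <- c(...)  # named character vector, e.g. c(state = "36", county = "027")
  paste(names(geos), geos, sep = ":", collapse = "+")
}

build_regionin(state = "36", county = "027", tract = "010000")
#> [1] "state:36+county:027+tract:010000"
```

Passing the codes as quoted strings keeps any leading zeros intact.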
For many more examples and advanced topics check out all of the articles.
Troubleshooting
The APIs contain more than 1,000 endpoints, each of which works a little differently. The Census Bureau also makes frequent changes to the APIs, which unfortunately are not usually announced in advance. If you’re getting an error message or unexpected results, here are some things to check.
Variables
Use listCensusMetadata(type = "variables") on your endpoint to see what variables are available. Occasionally the names will change from year to year. This is very common with the ACS and Decennial surveys as well as the Population Estimates Program.
The Census APIs are case-sensitive, which means that if the variable name you want is uppercase you’ll need to write it uppercase in your request. Most of the APIs use uppercase, but some use lowercase and some even use sentence case variable names.
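A quick pre-flight check can catch case mismatches before you send a request. This sketch compares requested names against a short, hand-entered list of SAHIE variable names; on a real endpoint you would use the name column returned by listCensusMetadata() instead. The variable names requested, unknown, and suggestions are illustrative.

```r
# A few valid SAHIE variable names, typed in by hand for this example
available <- c("NAME", "PCTUI_PT", "NUI_PT", "IPRCAT")

# Names as a user might type them, two with the wrong case
requested <- c("NAME", "pctui_pt", "Nui_Pt")

# Exact, case-sensitive comparison: anything left over will not match the API
unknown <- setdiff(requested, available)

# Suggest the correctly cased name when one exists
suggestions <- available[match(toupper(unknown), toupper(available))]
data.frame(requested = unknown, did_you_mean = suggestions)
```

A request name with no case-insensitive match would show NA in the did_you_mean column, which suggests the variable does not exist on that endpoint at all.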
Geographies
Use listCensusMetadata(type = "geographies") on your dataset to check which geographies you can use. Each API has its own list of valid geographies and they occasionally change as the Census Bureau makes updates.
If you’re specifying a region by FIPS code, for example state:01, make sure to use the full code, padded with 0s if necessary. Previously, specifying state:1 usually worked, but the APIs now enforce using the full-length FIPS codes. See the Census Bureau FIPS reference for valid codes.
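If your codes are stored as numbers or have lost their leading zeros, you can pad them before building the region string. The helper pad_fips() below is a hypothetical convenience function, not part of censusapi. It assumes state codes are two digits and county codes are three, as in the examples in this guide.

```r
# Hypothetical helper (not part of censusapi): left-pad a FIPS code with zeros
pad_fips <- function(code, width) {
  formatC(as.integer(code), width = width, flag = "0")
}

pad_fips(1, width = 2)      # state code stored as a number
pad_fips("89", width = 3)   # county code that lost its leading zero

# Build a region string with the padded code
paste0("state:", pad_fips(1, width = 2))
```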
General
Read the online documentation for your dataset. Unfortunately, some information is not included in the developer metadata or documentation pages and is only available in PDFs. These PDFs are linked on the Census Bureau’s website. Please check for PDF documentation.
Unexpected errors
Occasionally you might get the general error message "There was an error while running your query. We've logged the error and we'll correct it ASAP. Sorry for the inconvenience." This comes from the Census Bureau and could be caused by any number of problems, including server issues. Try rerunning your API call. If that doesn’t work and you are requesting a large amount of data, try reducing the amount that you’re requesting. If you’re still having trouble, see below for ways to get help.
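Rerunning a call can be automated with a small retry wrapper built on tryCatch(). The sketch below is one possible approach, not a feature of censusapi: with_retry() and flaky_call() are hypothetical names, and flaky_call() simulates a request that fails once so the example runs offline.

```r
# Hypothetical helper (not part of censusapi): call f() up to `times` times,
# pausing `wait` seconds between failed attempts
with_retry <- function(f, times = 3, wait = 0) {
  for (attempt in seq_len(times)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < times) Sys.sleep(wait)
  }
  stop("All ", times, " attempts failed: ", conditionMessage(result))
}

# Stand-in for an API call that fails on the first try, then succeeds
calls <- 0
flaky_call <- function() {
  calls <<- calls + 1
  if (calls < 2) stop("There was an error while running your query.")
  data.frame(state = "01", value = 1)
}

result <- with_retry(flaky_call)
result
```

With a real request you would wrap the call in a function, for example with_retry(function() getCensus(...), wait = 5), filling in the arguments for your endpoint. A short wait between attempts gives a struggling server time to recover.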
Other ways to get help
- If your getCensus() call results in an error, it will print the underlying API call in your R console. You can open this URL in your web browser to view it directly. You can always view the underlying call by using getCensus(show_call = TRUE).
- Open a GitHub issue for bugs or issues caused by this R package.
- Join the public Census Bureau Slack channel and ask your question in the R or API rooms.
- Email the Census Bureau API team at census.data@census.gov for questions relating to the underlying data and APIs. Make sure to include the underlying API call if you’re having trouble with a specific API request, not the R code. You can see this API call in the censusapi error message. You can also reach out to the contact listed in the dataset metadata found in listCensusApis() for questions about a specific survey.