Accessing microdata

This package provides basic support for the Census’s new microdata APIs, using the same getCensus() function used for summary data. Getting the data with getCensus() is easy. Using it responsibly takes some homework.

About microdata

Microdata contains individual-level responses: one row per person. It is a vital tool for custom analysis, but with great power comes great responsibility. The individual-level responses must be weighted appropriately. You’ll often need to work with household relationships, and you’ll need to handle responses that aren’t in the universe of the question (for example, removing children from an analysis of college graduation rates).

If you’re new to working with microdata you’ll need to do some reading before diving in. Here are some resources from the Census Bureau:

What is microdata and why should I use it? (video and transcript)
Census Microdata API User Guide (pdf)
Microdata API documentation

As with all other endpoints, censusapi retrieves the data so that you can perform your own analysis using your methodology of choice. If you’re looking for an interactive microdata analysis tool, try the data.census.gov microdata interactive tool or the IPUMS online data analysis tool.

Once you’ve learned how to use microdata and gained an understanding of weighting, getting the data using censusapi is simple.

Getting microdata with censusapi

As an example, we’ll get data from the 2020 Current Population Survey Voting Supplement. This survey asks people if they voted, how, and when, and includes useful demographic data.

See the available variables:

library(censusapi)

voting_vars <- listCensusMetadata(
    name = "cps/voting/nov",
    vintage = 2020,
    type = "variables")
head(voting_vars)

name	label	concept	predicateType	group	predicateOnly	suggested_weight	is_weight
for	Census API FIPS ‘for’ clause	Census API Geography Specification	fips-for	N/A	TRUE	NA	NA
in	Census API FIPS ‘in’ clause	Census API Geography Specification	fips-in	N/A	TRUE	NA	NA
ucgid	Uniform Census Geography Identifier clause	Census API Geography Specification	ucgid	N/A	TRUE	NA	NA
PEEDUCA	Demographics-highest level of school completed	NA	int	N/A	NA	PWSSWGT	NA
PUBUS1	Labor Force-unpaid work in family business/farm,y/n	NA	int	N/A	NA	PWCMPWGT	NA
PRCOW1	Indus.&Occ.-(main job)class of worker-recode	NA	int	N/A	NA	PWCMPWGT	NA

From the CPS Voting supplement, get data on method of voting in New York state using PES5 (Vote in person or by mail?) and PESEX (gender), along with the appropriate weighting variable, PWSSWGT. We’ll only get data for people with a response of 1 (yes) to PES1 (Did you vote?).

cps_voting <- getCensus(
    name = "cps/voting/nov",
    vintage = 2020,
    vars = c("PES5", "PESEX", "PWSSWGT"),
    region = "state:36",
    PES1 = 1)
head(cps_voting)

state	PES5	PESEX	PWSSWGT	PES1
36	1	1	4571.216	1
36	1	2	4806.369	1
36	1	2	3440.301	1
36	-3	1	5204.566	1
36	-3	2	4993.819	1
36	1	2	4602.958	1
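To see why the weight column matters, here is a minimal sketch that re-enters the six rows printed above by hand (so it runs without an API key) and compares the raw row count with the weighted total. The object name cps_sample is made up for this illustration, and six rows are far too few for a real estimate.

```r
# The six rows shown above, typed in by hand for illustration only.
cps_sample <- data.frame(
    state   = "36",
    PES5    = c(1, 1, 1, -3, -3, 1),
    PESEX   = c(1, 2, 2, 1, 2, 2),
    PWSSWGT = c(4571.216, 4806.369, 3440.301, 5204.566, 4993.819, 4602.958),
    PES1    = 1,
    stringsAsFactors = FALSE)

# Coerce the weight defensively in case it arrives as character.
cps_sample$PWSSWGT <- as.numeric(cps_sample$PWSSWGT)

# Unweighted: the number of survey respondents in the sample.
n_respondents <- nrow(cps_sample)

# Weighted: the number of people those respondents represent.
weighted_total <- sum(cps_sample$PWSSWGT)

# Weighted totals within each PESEX code.
weighted_by_sex <- tapply(cps_sample$PWSSWGT, cps_sample$PESEX, sum)

n_respondents
weighted_total
weighted_by_sex
```

Six respondents stand in for tens of thousands of people, which is why unweighted row counts should not be reported as population figures.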

Making a data dictionary

Most microdata variables are encoded, which means that your data will have a lot of numbers instead of text labels.

A data dictionary, which includes the definitions and labels for every variable in the dataset, is helpful. Calling listCensusMetadata() with include_values = TRUE returns a data dictionary with one row for each variable-label pair. That means if there are 30 codes for a given variable, it will have 30 rows in the data dictionary. Variables that don’t have value labels in the metadata will have only one row.

voting_dict <- listCensusMetadata(
    name = "cps/voting/nov",
    vintage = 2020,
    type = "variables",
    include_values = TRUE)
head(voting_dict)

name	label	concept	predicateType	group	predicateOnly	suggested_weight	is_weight	values_code	values_label
for	Census API FIPS ‘for’ clause	Census API Geography Specification	fips-for	N/A	TRUE	NA	NA	NA	NA
in	Census API FIPS ‘in’ clause	Census API Geography Specification	fips-in	N/A	TRUE	NA	NA	NA	NA
ucgid	Uniform Census Geography Identifier clause	Census API Geography Specification	ucgid	N/A	TRUE	NA	NA	NA	NA
PEEDUCA	Demographics-highest level of school completed	NA	int	N/A	NA	PWSSWGT	NA	46	DOCTORATE DEGREE(EX:PhD,EdD)
PEEDUCA	Demographics-highest level of school completed	NA	int	N/A	NA	PWSSWGT	NA	33	5th Or 6th Grade
PEEDUCA	Demographics-highest level of school completed	NA	int	N/A	NA	PWSSWGT	NA	44	MASTER’S DEGREE(EX:MA,MS,MEng,MEd,MSW)
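Once you have a dictionary in this shape, looking up a single code is a small filtering step. The sketch below is one way to do it, not a censusapi feature: lookup_label() is a hypothetical helper, and dict_sample re-enters the three PEEDUCA rows printed above so the example runs without an API call.

```r
# Three dictionary rows from the output above, typed in by hand.
dict_sample <- data.frame(
    name = "PEEDUCA",
    values_code = c("46", "33", "44"),
    values_label = c(
        "DOCTORATE DEGREE(EX:PhD,EdD)",
        "5th Or 6th Grade",
        "MASTER’S DEGREE(EX:MA,MS,MEng,MEd,MSW)"),
    stringsAsFactors = FALSE)

# Hypothetical helper: return the label for one variable/code pair,
# or NA if the pair is not in the dictionary.
lookup_label <- function(dict, variable, code) {
    # which() drops NA comparisons, such as rows with no value codes.
    idx <- which(
        dict$name == variable &
        as.character(dict$values_code) == as.character(code))
    if (length(idx) == 0) {
        return(NA_character_)
    }
    dict$values_label[idx[1]]
}

lookup_label(dict_sample, "PEEDUCA", 33)
lookup_label(dict_sample, "PEEDUCA", 99)
```

Comparing codes as character avoids surprises if the dictionary stores them as text while your data stores them as numbers.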

You can also look up the meaning of those codes for a single variable using the same function, listCensusMetadata(). Here are the values of PES5, the variable for “Vote in person or by mail?”

PES5_values <- listCensusMetadata(
    name = "cps/voting/nov",
    vintage = 2020,
    type = "values",
    variable = "PES5")
PES5_values

code	label
2	By Mail
-2	Don’t Know
1	In person
-1	Not in Universe
-9	No Response
-3	Refused
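With the value labels in hand, you can attach them to the responses and separate substantive answers from codes such as "Refused" or "Not in Universe". The following is a rough sketch under stated assumptions: both data frames are typed in by hand from the outputs shown earlier, the object names are illustrative, and whether to drop refusals is a methodology choice this example does not settle.

```r
# Value labels for PES5, copied from the output above.
PES5_labels <- data.frame(
    code = c(2, -2, 1, -1, -9, -3),
    label = c("By Mail", "Don’t Know", "In person",
              "Not in Universe", "No Response", "Refused"),
    stringsAsFactors = FALSE)

# PES5 and weight columns from the six New York rows shown earlier.
cps_sample <- data.frame(
    PES5    = c(1, 1, 1, -3, -3, 1),
    PWSSWGT = c(4571.216, 4806.369, 3440.301, 5204.566, 4993.819, 4602.958))

# Attach a text label to each response by matching on the code.
cps_sample$PES5_label <-
    PES5_labels$label[match(cps_sample$PES5, PES5_labels$code)]

# Weighted total for each labeled response.
weighted_by_label <- tapply(cps_sample$PWSSWGT, cps_sample$PES5_label, sum)

# Keep only substantive answers: 1 (in person) or 2 (by mail).
substantive <- cps_sample[cps_sample$PES5 %in% c(1, 2), ]

weighted_by_label
nrow(substantive)
```

In this tiny sample every substantive answer is "In person", so a share calculation would be meaningless here; on the full dataset the same steps give weighted totals for each voting method.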

Other ways to access microdata

The Census Bureau microdata APIs are helpful for working with a limited number of just-released datasets. But they’re not your only option. Some other ways to get microdata are:

Retrieve standardized, cleaned microdata from IPUMS and import it with the ipumsr package. IPUMS is widely used in research when the data needed is not brand new. I highly recommend that you check out IPUMS’ cleaned microdata files as well as historic geographic data. These standardized files are generally released months to a year after the raw microdata becomes available directly from the Census Bureau.
Download complete bulk files from the Census FTP (File Transfer Protocol) site. This is helpful if you need a large number of variables. You might run into size limitations getting many variables through the APIs.
Retrieve American Community Survey microdata via the Census APIs with tidycensus, which has helpful functions for working with those endpoints.