---
title: "IPUMS Data and R"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{IPUMS Data and R}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(ipumsr)
```

This article provides an overview of how to find, request, download, and
read IPUMS data into R. For a general introduction to IPUMS and ipumsr,
see the [ipumsr home page](https://tech.popdata.org/ipumsr/index.html).

## Obtaining IPUMS data

IPUMS data are free, but do require registration. New users can register
with a particular [IPUMS project](https://www.ipums.org/overview) by
clicking the **Register** link at the top right of the project website.

Users obtain IPUMS data by creating and submitting an *extract request*.
This specifies which data to include in the resulting *extract* (or
*data extract*). IPUMS servers process each submitted extract request,
and when complete, users can download the extract containing the
requested data.

Extracts typically contain both data and metadata files. Data files
typically come as fixed-width (.dat) files or comma-delimited (.csv)
files. Metadata files contain information about the data file and its
contents, including variable descriptions and parsing instructions for
fixed-width data files. IPUMS microdata projects provide metadata in DDI
(.xml) files. Aggregate data projects provide metadata in either .txt or
.csv formats.

Users can submit extract requests and download extracts via either the
**IPUMS website** or the **IPUMS API**. ipumsr provides a set of client tools
to interface with the API. Note that only [certain IPUMS projects](https://developer.ipums.org/docs/v2/apiprogram/apis/) are currently
supported by the IPUMS API.

### Obtaining data via an IPUMS project website

To create a new extract request via an IPUMS project website 
(e.g. [IPUMS CPS](https://cps.ipums.org/cps/)), navigate to the
extract interface for that project by clicking **Select Data** in the 
heading of the project website.

```{r, echo=FALSE, out.width = "70%", fig.align="center"}
#| fig.alt: >
#|   Screenshot of the Select Data link at the top of the IPUMS CPS homepage
knitr::include_graphics("cps_select_data.jpg")
```

The project's extract interface allows you to explore what's available, find 
documentation about data concepts and sources, and specify the data you'd like 
to download. The data selection parameters will differ across projects; see
each project's documentation for more details on the available options.

If you've never created an extract for the project you're interested in,
a good way to learn the basics is to watch a project-specific video on
creating extracts hosted on the [IPUMS Tutorials
page](https://www.ipums.org/support/tutorials).

#### Downloading from microdata projects

Once your extract is ready, click the green **Download** button to
download the data file. Then, right-click the **DDI** link in the
Codebook column, and select **Save Link As...** (see below).

```{r echo=FALSE}
#| fig.alt: >
#|   Screenshot of the My Data page on IPUMS USA website, with the "Download 
#|   .DAT" button and the "DDI" codebook link each surrounded by a red box for 
#|   emphasis. The right-click menu has been called up on the "DDI" codebook 
#|   link, and the menu option "Save Link As..." is also surrounded by a red box 
#|   for emphasis.
knitr::include_graphics("microdata_annotated_screenshot.png")
```

Note that some browsers may display different text, but there should be
an option to download the DDI file as .xml. (For instance, on Safari,
select **Download Linked File As...**.) For ipumsr to read the metadata,
you must **save the file in .xml format, *not* .html format**.

#### Downloading from aggregate data projects

Aggregate data projects include data and metadata together in a single
.zip archive. To download them, simply click on the green
**Tables** button (for tabular data) and/or **GIS Files** button (for
spatial boundary or location data) in the **Download Data** column.

### Obtaining data via the IPUMS API

Users can also create and submit extract requests within R by using
ipumsr functions that interface with the 
[IPUMS API](https://developer.ipums.org/). The IPUMS API currently supports
access to the extract system for [certain IPUMS collections](https://developer.ipums.org/docs/v2/apiprogram/apis/).

#### Extract support

ipumsr provides an interface to the IPUMS extract system via the IPUMS API 
for the following collections:

-   IPUMS USA
-   IPUMS CPS
-   IPUMS International
-   IPUMS Time Use (ATUS, AHTUS, MTUS)
-   IPUMS Health Surveys (NHIS, MEPS)
-   IPUMS NHGIS

#### Metadata support

For IPUMS NHGIS, ipumsr provides access to comprehensive metadata via the IPUMS
API. Users can query NHGIS metadata to explore available data when specifying 
NHGIS extract requests.

Increased access to metadata for microdata projects is in 
progress. Currently, the IPUMS API only provides a listing of available
samples for each microdata collection. At this time, creating extract requests 
for these projects requires using the corresponding project websites to find 
samples and variables of interest and obtain their API identifiers for use 
in R extract definitions.

#### Workflow

Once you have identified the data you would like to request, you can use ipumsr 
functions to define your extract request, submit it, wait for it to process, 
and download the associated data.

First, define the parameters of your extract. The available extract
definition options will differ by IPUMS data collection. See the
[microdata API request](ipums-api-micro.html) and [NHGIS API
request](ipums-api-nhgis.html) vignettes for more details on defining an
extract.

```{r}
# Define a microdata extract request, e.g. for IPUMS CPS
cps_extract_request <- define_extract_micro(
  collection = "cps",
  description = "2018-2019 CPS Data",
  samples = c("cps2018_05s", "cps2019_05s"),
  variables = c("SEX", "AGE", "YEAR")
)

# Define an NHGIS extract request
nhgis_extract_request <- define_extract_nhgis(
  description = "NHGIS Data via IPUMS API",
  datasets = ds_spec(
    "1990_STF1",
    data_tables = c("NP1", "NP2", "NP3"),
    geog_levels = "state"
  )
)
```

Next, submit your extract definition. After waiting for it to complete,
you can download the files directly to your local machine without leaving your 
R session:

```{r, eval = FALSE}
submitted_extract <- submit_extract(cps_extract_request)
downloadable_extract <- wait_for_extract(submitted_extract)
path_to_data_files <- download_extract(downloadable_extract)
```

You can also get the specifications of your previous extract requests,
even if they weren't made with the API:

```{r, eval=FALSE}
past_extracts <- get_extract_history("nhgis")
```

See the [introduction to the IPUMS API](ipums-api.html) for
more details about how to use ipumsr to interact with the IPUMS API.

## Reading IPUMS data

Once you have downloaded an extract, you can load the data into R with
the family of `read_*()` functions in ipumsr. These functions expand on
those provided in `{readr}` in two
ways:

-   ipumsr anticipates standard IPUMS file structures, limiting the need
    for users to manually extract and organize their downloaded files
    before reading.
-   ipumsr uses an extract's metadata files to automatically attach
    contextual information to the data. This allows users to easily
    identify variable names, variable descriptions, and labeled data
    values (from `{haven}`), which are
    common in IPUMS files.
    
File loading is covered in depth in the [reading IPUMS data](ipums-read.html)
vignette.

#### Microdata files

For microdata files, use the `read_ipums_micro_*()` family with the DDI (.xml)
metadata file for your extract:

```{r}
cps_file <- ipums_example("cps_00157.xml")
cps_data <- read_ipums_micro(cps_file)
head(cps_data)
```

#### NHGIS files

For NHGIS files, use `read_nhgis()`:

```{r}
nhgis_file <- ipums_example("nhgis0972_csv.zip")
nhgis_data <- read_nhgis(nhgis_file, verbose = FALSE)

head(nhgis_data)
```

#### Spatial boundary files

ipumsr also supports the reading of IPUMS shapefiles (spatial boundary
and location files) into the `sf` format provided by the
`{sf}` package:

```{r, eval = requireNamespace("sf")}
shp_file <- ipums_example("nhgis0972_shape_small.zip")
nhgis_shp <- read_ipums_sf(shp_file)

head(nhgis_shp)
```

#### Ancillary files

ipumsr is primarily designed to read data produced by the IPUMS extract
system. However, IPUMS does distribute other files, often available via
direct download. In many cases, these can be loaded with ipumsr.
Otherwise, these files can likely be handled by existing data reading
packages like `{readr}` (for delimited files) or `{haven}` (for Stata, SPSS,
or SAS files).

### Exploring file metadata

Load a file's metadata with `read_ipums_ddi()` (for microdata projects)
and `read_nhgis_codebook()` (for NHGIS). These provide file- and
variable-level metadata for a given data source, which can be used to
interpret the data contents.

```{r}
cps_meta <- read_ipums_ddi(cps_file)
nhgis_meta <- read_nhgis_codebook(nhgis_file)
```

Summarize the variable metadata for a dataset using `ipums_var_info()`:

```{r}
ipums_var_info(cps_meta)
```

You can also get contextual details for specific variables:

```{r}
ipums_var_desc(cps_data$INCTOT)
ipums_val_labels(cps_data$STATEFIP)
```

#### Labelled values

ipumsr also provides a family of `lbl_*()` functions to assist in
accessing and manipulating the value-level metadata included in IPUMS
data. This allows for value labels to be incorporated into the data
processing pipeline. For instance:

```{r}
# Remove labels for values that do not appear in the data
cps_data$STATEFIP <- lbl_clean(cps_data$STATEFIP)

ipums_val_labels(cps_data$STATEFIP)
```

```{r}
# Combine North and South Dakota into a single value/label pair
cps_data$STATEFIP <- lbl_relabel(
  cps_data$STATEFIP,
  lbl("38_46", "Dakotas") ~ grepl("Dakota", .lbl)
)

ipums_val_labels(cps_data$STATEFIP)
```

See the [value labels](value-labels.html) vignette for more details.