---
title: "Introduction to crosswalkr"
author: "Benjamin Skinner"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    css: vignette.css
vignette: >
  %\VignetteIndexEntry{Introduction to crosswalkr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\VignetteDepends{dplyr, haven, labelled}
---

Researchers often must compile master data sets
from a number of smaller data sets that are not consistent in terms of
variable names or value encodings. This can be especially true for
large administrative data sets that span multiple years and/or
departments. Other times, teams of researchers must work together to
maintain a master data set and it is important for replicability and
future collaboration that the team rely on consistent naming and
encoding conventions.

For example, let's say there are three flat files of student
information that need to be merged into a single large data set for
analysis.

### File 1
|sid|lname|state|t_score|
|:--|:--|:--|:--|
|1|Jackson|VA|74|
|2|Harrison|KY|86|
|3|Nixon|IL|78|

### File 2
|stu\_id|last\_name|st|test_score|
|:--|:--|:--|:--|
|4|Washington|35|92|
|5|Roosevelt|11|67|
|6|Taylor|47|68|

### File 3
|s\_id|name|sta|score|
|:--|:--|:--|:--|
|7|Tyler|North Dakota|91|
|8|Grant|South Dakota|82|
|9|Adams|Illinois|89|

It is clear that these files contain the same basic information, but
neither the names nor encodings for `state` | `st` | `sta` are consistent.

One solution is to just fix these one at a time before joining
them. For example:

```{r, message = FALSE}
library(crosswalkr)
library(dplyr)
library(labelled)
library(haven)
```
```{r, echo = FALSE}
file_1 <- data.frame(sid = c(1:3),
                     lname = c('Jackson','Harrison','Nixon'),
                     stat = c('VA','KY','IL'),
                     t_score = c(74,86,78),
                     stringsAsFactors = FALSE)

file_2 <- data.frame(stu_id = c(4:6),
                     last_name = c('Washington','Roosevelt','Taylor'),
                     st = c(35,11,47),
                     test_score = c(92,82,89),
                     stringsAsFactors = FALSE)

file_3 <- data.frame(s_id = c(7:9),
                     name = c('Tyler','Grant','Adams'),
                     sta = c('North Dakota','South Dakota','Illinois'),
                     score = c(91,82,89),
                     stringsAsFactors = FALSE)
```
```{r}
df1 <- file_1 %>%
    rename(id = sid,
           last_name = lname,
           stabbr = stat,
           score = t_score)

df2 <- file_2 %>%
    rename(id = stu_id,
           stabbr = st,
           score = test_score) %>%
    mutate(stabbr = as.character(stabbr))

df3 <- file_3 %>%
    rename(id = s_id,
           stabbr = sta,
           last_name = name)

df <- rbind(df1, df2, df3)
df

```
The problem, of course, is there is a lot of room for error since the
renaming process has to be repeated for each data frame. 

### Using a crosswalk file

Instead, it makes more sense to create a crosswalk data set that
aligns old (or raw) column names with new (or clean) column names and,
if desired, labels. The `crosswalk` to join these files could be:


|clean|label|file\_1\_raw|file\_2\_raw|file\_3\_raw|
|:--|:--|:--|:--|:--|
|id|Student ID|sid|stu\_id|s\_id|
|last\_name|Student last name|lname|last\_name|name|
|stabbr|State abbreviation|stat|st|sta|
|score|Test score|t\_score|test\_score|score|

```{r, echo = FALSE}
crosswalk <- data.frame(clean = c('id','last_name','stabbr','score'),
                        label = c('Student ID','Student last name',
                                  'State abbreviation','Test score'),
                        file_1_raw = c('sid','lname','stat','t_score'),
                        file_2_raw = c('stu_id','last_name','st','test_score'),
                        file_3_raw = c('s_id','name','sta','score'),
                        stringsAsFactors = FALSE)
```

The crosswalk file (`cw_file`) could be:  

1. Data frame object already in memory  
2. A string with path and name (*e.g.*, `'./path/to/crosswalk.csv'`) of a
   flat file of one of the following types:  
	   1. Comma separated (`*.csv`)  
	   2. Tab separated (`*.tsv`)  
	   3. Other delimited (`*.txt`) with `delimiter` option set to
   delimiter string (*e.g.*, `delimiter = '|'`)  
	   4. Excel (`*.xls` or `*.xlsx`) with `sheet` option set to sheet
   number or string name (defaulting to the first sheet)  
	   5. R data (`*.rdata`, `*.rda`, `*.rds`)  
	   6. Stata data (`*.dta`)  
	   
If given a string to the `cw_file` argument, `renamefrom()` and
`encodefrom()` determine the type of file by its ending.

## Renaming

To rename using the `renamefrom()` command:
```{r}
df1 <- renamefrom(file_1, cw_file = crosswalk, raw = file_1_raw, clean = clean, label = label)
df2 <- renamefrom(file_2, cw_file = crosswalk, raw = file_2_raw, clean = clean, label = label)
df3 <- renamefrom(file_3, cw_file = crosswalk, raw = file_3_raw, clean = clean, label = label)

df <- rbind(df1, df2, df3)
df
```

And check out the labels:
```{r}
var_label(df)
```

As new raw data files are added to the project, they could simply be
given a new column in the crosswalk file that mapped their raw column
names to the clean versions.

## Encoding

These same example files have inconsistent encodings for state: one
uses two-letter abbreviations, another the FIPS code, and another the
full name. Again, instead of fixing each one at a time, a separate crosswalk
for encoding these values could be used. The `crosswalkr` package
includes a state-level crosswalk, `stcrosswalk`:

```{r}

data(stcrosswalk)
stcrosswalk

```

The `encodefrom()` function works much like `renamefrom()`. The only
difference is that a vector of encoded values is returned that can be
added to an existing dataframe. 

`encodefrom()` returns either base R factors or labels depending on
whether the input data frame is a tibble.

#### factor

```{r}
df1$state <- encodefrom(file_1, var = stat, stcrosswalk, raw = stabbr, clean = stfips, label = stname)
df1
sapply(df1, class)

```
#### labelled vector
```{r}
file_1_ <- file_1 %>% tbl_df()
df1$state <- encodefrom(file_1_, var = stat, stcrosswalk, raw = stabbr,
                        clean = stfips, label = stname)
as_factor(df1)
zap_labels(df1)
```

## Combined example: `dplyr` chain

The `renamefrom()` and `encodefrom()` functions can be combined in a
`dplyr` chain.

```{r}
df <- rbind(file_1 %>%
            tbl_df() %>%
            renamefrom(., crosswalk, file_1_raw, clean, label) %>%
            mutate(stabbr = encodefrom(., stabbr, stcrosswalk, stabbr, stfips, stname)),

            ## append file 2
            file_2 %>%
            tbl_df() %>%
            renamefrom(., crosswalk, file_2_raw, clean, label) %>%
            mutate(stabbr = encodefrom(., stabbr, stcrosswalk, stfips, stfips, stname)),

            ## append file 3
            file_3 %>%
            tbl_df() %>%
            renamefrom(., crosswalk, file_3_raw, clean, label) %>%
            mutate(stabbr = encodefrom(., stabbr, stcrosswalk, stname, stfips, stname)))

df
as_factor(df)            
```