---
title: "DisImpact Tutorial"
author: "Vinh Nguyen"
date: "`r Sys.Date()`"
output:
  prettydoc::html_pretty:
    theme: architect
    highlight: github
    toc: true
vignette: >
  %\VignetteIndexEntry{DisImpact Tutorial}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction

The `DisImpact` R package contains functions that help in determining disproportionate impact (DI) based on the following methodologies:

1. [percentage point gap](https://www.cccco.edu/-/media/CCCCO-Website/About-Us/Divisions/Digital-Innovation-and-Infrastructure/Research/Files/PercentagePointGapMethod2017.ashx) (PPG) method,
2. [proportionality index](https://www.cccco.edu/-/media/CCCCO-Website/Files/DII/guidelines-for-measuring-disproportionate-impact-in-equity-plans-tfa-ada.pdf) method (method #1 in reference), and
3. [80% index](https://www.cccco.edu/-/media/CCCCO-Website/Files/DII/guidelines-for-measuring-disproportionate-impact-in-equity-plans-tfa-ada.pdf) method (method #2 in reference).

# Install Package

```R
# From CRAN (Official)
install.packages('DisImpact')

# From github (Development)
devtools::install_github('vinhdizzo/DisImpact')
```

# Load Packages

```{r, message=FALSE, warning=FALSE}
library(DisImpact)
library(dplyr) # Ease in manipulations with data frames
```

# Load toy student equity data

To illustrate the functionality of the package, let's load a toy data set:
```{r}
# Load fake data set
data(student_equity)

# Print first few observations
head(student_equity)

# For description of data set
## ?student_equity
```

For a description of the `student_equity` data set, type `?student_equity` in the R console.

The toy data set can be summarized as follows:
```{r}
# Summarize toy data
dim(student_equity)
dSumm <- student_equity %>%
  group_by(Cohort, Ethnicity) %>%
  summarize(n=n(), Transfer_Rate=mean(Transfer))
dSumm ## This is a summarized version of the data set
```

# Percentage point gap (PPG) method

`di_ppg` is the main work function, and it can take on vectors or column names the tidy way:
```{r}
# Vector
di_ppg(success=student_equity$Transfer, group=student_equity$Ethnicity) %>% as.data.frame
# Tidy and column reference
di_ppg(success=Transfer, group=Ethnicity, data=student_equity) %>%
  as.data.frame
```

For a description of the `di_ppg` function, including both function arguments and returned results, type `?di_ppg` in the R console.

Sometimes, one might want to break out the DI calculation by cohort:
```{r}
# Cohort
di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, data=student_equity) %>%
  as.data.frame
```

`di_ppg` is also applicable to summarized data; just pass the counts to `success` and group size to `weight`.  For example, we use the summarized data set, `dSumm`, and sample size `n`, in the following:
```{r}
di_ppg(success=Transfer_Rate*n, group=Ethnicity, cohort=Cohort, weight=n, data=dSumm) %>%
  as.data.frame
```

By default, `di_ppg` uses the overall success rate as the reference rate for comparison (default: `reference='overall'`).  The `reference` argument also accepts `'hpg'` (highest performing group success rate as the reference rate), `'all but current'` (success rate of all groups combined excluding the comparison group), or a group value from `group`.

```{r}
# Reference: Highest performing group
di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, reference='hpg', data=student_equity) %>%
  as.data.frame

# Reference: All but current (PPG minus 1)
di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, reference='all but current', data=student_equity) %>%
  as.data.frame

# Reference: custom group
di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, reference='White', data=student_equity) %>%
  as.data.frame
di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, reference='Asian', data=student_equity) %>%
  as.data.frame
```

The user could also pass in custom reference points for comparison (e.g., a state-wide rate).  `di_ppg` accepts either a single reference point to be used or a vector of reference points, one for each cohort.  For the latter, the vector of reference points will be taken to correspond to the `cohort` variable, alphabetically ordered.
```{r}
# With custom reference (single)
di_ppg(success=Transfer, group=Ethnicity, reference=0.54, data=student_equity) %>%
  as.data.frame

# With custom reference (multiple)
di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, reference=c(0.5, 0.55), data=student_equity) %>%
  as.data.frame
```

Disproportionate impact using the PPG relies on calculating the margine margin of error (MOE) pertaining around the success rate.  The MOE calculated in `di_ppg` has 2 underlying assumptions (defaults):

1. the minimum MOE returned is 0.03, and
2. using 0.50 as the proportion in the margin of error formula, $1.96 \times \sqrt{\hat{p} (1-\hat{p}) / n}$.

To override 1, the user could specify `min_moe` in `di_ppg`.  To override 2, the user could specify `use_prop_in_moe=TRUE` in `di_ppg`.
```{r}
# min_moe
di_ppg(success=Transfer, group=Ethnicity, data=student_equity, min_moe=0.02) %>%
  as.data.frame
# use_prop_in_moe
di_ppg(success=Transfer, group=Ethnicity, data=student_equity, min_moe=0.02, use_prop_in_moe=TRUE) %>%
  as.data.frame
```

In cases where the proportion is used in calculating MOE, an observed proportion of 0 or 1 would lead to a zero MOE.  To account for these scenarios, the user could leverage the `prop_sub_0` and `prop_sub_1` parameters in `di_ppg` and `ppg_moe` as substitutes.  These parameters default to `0.5`, which maximizes the MOE (making it more difficult to declare disproportionate impact).

```{r}
# Set Native American to have have zero transfers and see what the results
di_ppg(success=Transfer, group=Ethnicity, data=student_equity %>% mutate(Transfer=ifelse(Ethnicity=='Native American', 0, Transfer)), use_prop_in_moe=TRUE, prop_sub_0=0.1, prop_sub_1=0.9) %>%
  as.data.frame
```


# Proportionality index method

`di_prop_index` is the main work function for this method, and it can take on vectors or column names the tidy way:

```{r}
# Without cohort
## Vector
di_prop_index(success=student_equity$Transfer, group=student_equity$Ethnicity) %>% as.data.frame
## Tidy and column reference
di_prop_index(success=Transfer, group=Ethnicity, data=student_equity) %>%
  as.data.frame

# With cohort
## Vector
di_prop_index(success=student_equity$Transfer, group=student_equity$Ethnicity, cohort=student_equity$Cohort) %>% as.data.frame
## Tidy and column reference
di_prop_index(success=Transfer, group=Ethnicity, cohort=Cohort, data=student_equity) %>%
  as.data.frame
```

For a description of the `di_prop_index` function, including both function arguments and returned results, type `?di_prop_index` in the R console.

Note that the referenced document describing this method does not recommend a threshold on the proportionality index for declaring disproportionate impact.  The `di_prop_index` function uses `di_prop_index_cutoff=0.8` as the default threshold, which the user could change.

```{r}
# Changing threshold for DI
di_prop_index(success=student_equity$Transfer, group=student_equity$Ethnicity, cohort=student_equity$Cohort, di_prop_index_cutoff=0.5) %>% as.data.frame
```

# 80% index method

`di_80_index` is the main work function for this method, and it can take on vectors or column names the tidy way:

```{r}
# Without cohort
## Vector
di_80_index(success=student_equity$Transfer, group=student_equity$Ethnicity) %>% as.data.frame
## Tidy and column reference
di_80_index(success=Transfer, group=Ethnicity, data=student_equity) %>%
  as.data.frame

# With cohort
## Vector
di_80_index(success=student_equity$Transfer, group=student_equity$Ethnicity, cohort=student_equity$Cohort) %>% as.data.frame
## Tidy and column reference
di_80_index(success=Transfer, group=Ethnicity, cohort=Cohort, data=student_equity) %>%
  as.data.frame
```

For a description of the `di_80_index` function, including both function arguments and returned results, type `?di_80_index` in the R console.

By default, `di_80_index` uses the group with the highest success rate as reference in calculating the index.  One could specify the the comparison group using the `reference_group` argument (a value from `group`).

```{r}
# Changing reference group
di_80_index(success=student_equity$Transfer, group=student_equity$Ethnicity, cohort=student_equity$Cohort, reference_group='White') %>% as.data.frame
```

By default, `di_80_index` uses 80% (`di_80_index_cutoff=0.80`) as the default threshold for declaring disproportionate impact.  One could override this using another threshold via the `di_80_index_cutoff` argument.


```{r}
# Changing threshold for DI
di_80_index(success=student_equity$Transfer, group=student_equity$Ethnicity, cohort=student_equity$Cohort, di_80_index_cutoff=0.50) %>% as.data.frame
```

# When dealing with a non-success variable like drop-out or probation

All methods and functions implemented in the `DisImpact` package treat outcomes as positive: 1 is desired over 0 (higher rate is better, lower rate indicates disparity).  The choice of the name `success` in the functions' arguments is intentional to remind the user of this.

Suppose we have a variable that indicates something negative (e.g., a flag for students on academic probation).  We could calculate DI on the converse of it by using the `!` (logical negation) operator:
```{r}
## di_ppg(success=!Probation, group=Ethnicity, data=student_equity) %>%
##   as.data.frame ## If there were a Probation variable
di_ppg(success=!Transfer, group=Ethnicity, data=student_equity) %>%
  as.data.frame ## Illustrating the point with `!`
```

# Transformations on the fly

We can compute the success, group, and cohort variables on the fly:
```{r}
# Transform success
a <- sample(0:1, size=nrow(student_equity), replace=TRUE, prob=c(0.95, 0.05))
mean(a)
di_ppg(success=pmax(Transfer, a), group=Ethnicity, data=student_equity) %>%
  as.data.frame

# Collapse Black and Hispanic
di_ppg(success=Transfer, group=ifelse(Ethnicity %in% c('Black', 'Hispanic'), 'Black/Hispanic', Ethnicity), data=student_equity) %>% as.data.frame
```

# Calculate DI for many variables and groups

It is often the case that the user desires to calculate disproportionate impact across many outcome variables and many disaggregation/group variables.  The function `di_iterate` allows the user to specify a data set and the various variables to iterate across:
```{r}
# Multiple group variables
di_iterate(data=student_equity, success_vars=c('Transfer'), group_vars=c('Ethnicity', 'Gender'), cohort_vars=c('Cohort'), ppg_reference_groups='overall') %>% as.data.frame

# Multiple group variables and different reference groups

bind_rows(
  di_iterate(data=student_equity, success_vars=c('Transfer'), group_vars=c('Ethnicity', 'Gender'), cohort_vars=c('Cohort'), ppg_reference_groups='overall')
  , di_iterate(data=student_equity, success_vars=c('Transfer'), group_vars=c('Ethnicity', 'Gender'), cohort_vars=c('Cohort'), ppg_reference_groups=c('White', 'Male'), include_non_disagg_results=FALSE) # include_non_disagg_results = FALSE: Already have this scenario in Overall run
)
```

There is a separate [vignette](Scaling-DI-Calculations.html) that explains how one might leverage `di_iterate` for rapid dashboard development and deployment with disaggregation and disproportionate impact features.

# Appendix: R and R Package Versions

This vignette was generated using an R session with the following packages.  There may be some discrepancies when the reader replicates the code caused by version mismatch.

```{r}
sessionInfo()
```