---
title: "Introduction to AuxSurvey: a package for improving survey inference using administrative records"
output: html_vignette
vignette: >
  %\VignetteIndexEntry{introduction}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

The **AuxSurvey** R package provides statistical methods for improving survey inference with **discretized auxiliary variables** from administrative records. Such variables are often discretized for confidentiality reasons, which reduces their usefulness; the package offers several estimators designed to handle them effectively.

This vignette demonstrates the key functionalities of the **AuxSurvey** package, including:

- Weighted or unweighted sample mean
- Weighted or unweighted raking
- Weighted or unweighted poststratification
- MRP (Bayesian Multilevel Regression with Poststratification)
- GAMP (Bayesian Generalized Additive Model of Response Propensity)
- Bayesian Linear Regression
- BART (Bayesian Additive Regression Trees)

These methods are designed for use with discretized auxiliary variables in survey data, and we will walk through data generation and estimation examples.


## Generate Simulated Data

The **AuxSurvey** package includes a function `simulate()` that generates the datasets used in the paper. Each simulated dataset consists of a population of 3000 units and a sample of about 600 cases, with two outcomes (`Y1`, a continuous outcome, and `Y2`, a binary outcome).

The covariates in the dataset include:

- `Z1`, `Z2`, `Z3`: Binary covariates
- `X`: Continuous covariate
- Discretized versions of `X`: `auX_3`, `auX_5`, `auX_10`
- Propensity scores: `true_pi` (true) and `estimated_pi` (estimated using BART), together with the logit-scale versions `logit_true_pi` and `logit_estimated_pi` used in the examples below
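
To make the idea of a discretized covariate concrete, here is a minimal base R sketch that bins a continuous variable into groups of roughly equal size. It assumes quantile-based cut points, which may not be the rule `simulate()` actually uses; `discretize_quantile()`, `x_toy`, and `x_toy_3` are hypothetical names introduced only for this illustration.

```{r}
# Hypothetical helper: bin a continuous vector into k groups of roughly equal size.
# Assumption: quantile cut points; simulate() may use a different rule.
discretize_quantile <- function(x, k) {
  breaks <- quantile(x, probs = seq(0, 1, length.out = k + 1))
  cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
}

x_toy <- seq(0, 1, length.out = 30)       # toy continuous covariate
x_toy_3 <- discretize_quantile(x_toy, 3)  # integer codes 1, 2, 3
table(x_toy_3)                            # number of units per bin
```

Only the bin codes would be released to the analyst; the exact values of the continuous variable within each bin are lost.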

```{r}
library(AuxSurvey)

# Generate data
data = simulate(N = 3000, discretize = c(3, 5, 10), setting = 1, seed = 123)
population = data$population  # Full population data (3000 units)
samples = data$samples  # Sample data (about 600 cases)
ipw = 1 / samples$true_pi  # Inverse probability weights from the true propensity scores
est_ipw = 1 / samples$estimated_pi  # Inverse probability weights from the estimated propensity scores
true_mean = mean(population$Y1)  # True population mean of Y1 (the estimand)
```

## Estimation Methods

After generating the data, we can apply the estimators with the `auxsurvey()` function. It supports the unweighted or weighted sample mean, raking, poststratification, MRP, GAMP, Bayesian linear regression, and BART.

### Example 1: Sample Mean

To estimate the **sample mean** for `Y1`, we can use the `auxsurvey()` function. For the unweighted sample mean:

```{r}
# Unweighted sample mean
sample_mean = auxsurvey("~Y1", auxiliary = NULL, weights = NULL, samples = samples, population = NULL, method = "sample_mean", levels = 0.95)
```

For the inverse probability weighted (IPW) sample mean:

```{r}
# IPW sample mean
IPW_sample_mean = auxsurvey("~Y1", auxiliary = NULL, weights = ipw, samples = samples, population = population, method = "sample_mean", levels = 0.95)
```
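
As a rough illustration of what the weights do, the sketch below computes an inverse-probability-weighted mean by hand in base R. The toy values and the names `y_toy`, `pi_toy`, and `w_toy` are made up for this example, and the ratio form shown is an assumption about the estimator, not a description of the internals of `auxsurvey()`.

```{r}
y_toy  <- c(1, 2, 3, 10)          # observed outcomes (toy values)
pi_toy <- c(0.5, 0.5, 0.5, 0.1)   # inclusion probabilities, assumed known
w_toy  <- 1 / pi_toy              # inverse probability weights

unweighted_mean <- mean(y_toy)
# Ratio-type weighted mean; equivalent to weighted.mean(y_toy, w_toy)
ipw_mean <- sum(w_toy * y_toy) / sum(w_toy)

c(unweighted = unweighted_mean, ipw = ipw_mean)
```

The unit with the smallest inclusion probability stands in for many unsampled units, so it pulls the weighted mean towards its own value.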

### Example 2: Raking

The **raking** method adjusts the sample weights so that the weighted sample matches known population marginal distributions of the auxiliary variables. The example below rakes on `Z2`, `Z3`, and the interaction of `auX_5` with `Z1`:

```{r}
# Unweighted Raking for auX_5 with interaction with Z1
rake_5_Z1 = auxsurvey("~Y1", auxiliary = "Z2 + Z3 + auX_5 * Z1", weights = NULL, samples = samples, population = population, method = "rake", levels = 0.95)
```

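For IPW raking on `auX_10`, starting from the inverse probability weights and requesting intervals at two levels (0.95 and 0.8):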

```{r}
# IPW Raking for auX_10
rake_10 = auxsurvey("~Y1", auxiliary = "Z1 + Z2 + Z3 + auX_10", weights = ipw, samples = samples, population = population, method = "rake", levels = c(0.95, 0.8))
```
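
The weight adjustment behind raking can be sketched as iterative proportional fitting. The base R example below is a simplified stand-in for what a raking routine does, not the code path used by `auxsurvey()`; the data frame `rake_toy`, the margin totals `target_a` and `target_b`, and the fixed number of iterations are all assumptions chosen for illustration.

```{r}
# Toy sample with two binary auxiliary variables and an outcome
rake_toy <- data.frame(
  a = c(0, 0, 0, 1, 1, 1, 1, 1),
  b = c(0, 1, 1, 0, 0, 1, 1, 1),
  y = c(2, 4, 5, 3, 4, 7, 8, 9)
)
# Known population totals for each margin (assumed values)
target_a <- c("0" = 50, "1" = 50)
target_b <- c("0" = 40, "1" = 60)

rake_w <- rep(1, nrow(rake_toy))   # start from equal weights
for (iter in 1:100) {
  # Rescale so the weighted totals of `a` match target_a ...
  rake_w <- rake_w * unname(target_a[as.character(rake_toy$a)]) /
    ave(rake_w, rake_toy$a, FUN = sum)
  # ... then rescale so the weighted totals of `b` match target_b
  rake_w <- rake_w * unname(target_b[as.character(rake_toy$b)]) /
    ave(rake_w, rake_toy$b, FUN = sum)
}

tapply(rake_w, rake_toy$a, sum)     # weighted margin of a
tapply(rake_w, rake_toy$b, sum)     # weighted margin of b
weighted.mean(rake_toy$y, rake_w)   # raked estimate of the mean of y
```

Each pass fixes one margin and slightly disturbs the other; repeating the two steps brings both margins to their targets as long as no cell of the cross-table is empty.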

### Example 3: MRP (Multilevel Regression with Poststratification)

The **MRP** method models the outcome using both fixed and random effects and then poststratifies the model predictions to the population. Here is an example with `Z1` and `Z2` as fixed effects and `Z3` and `auX_3` as random effects:

```{r eval=FALSE}
# MRP with auX_3
MRP_1 = auxsurvey("Y1 ~ Z1 + Z2", auxiliary = "Z3 + auX_3", samples = samples, population = population, method = "MRP", levels = 0.95)
```
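
The poststratification step can be illustrated without a model. In the base R sketch below, raw sample cell means stand in for the multilevel model predictions that MRP would supply, and the population cell counts are assumed values; `ps_sample`, `ps_pop_counts`, and `ps_estimate` are names introduced only for this example.

```{r}
# Toy sample with one categorical auxiliary variable defining the cells
ps_sample <- data.frame(
  cell = c("low", "low", "mid", "mid", "mid", "high"),
  y    = c(1, 3, 4, 5, 6, 10)
)
# Known population counts per cell (assumed values)
ps_pop_counts <- c(low = 500, mid = 300, high = 200)

# Cell-level estimates; MRP would replace these raw means with model predictions
cell_est <- tapply(ps_sample$y, ps_sample$cell, mean)

# Poststratified estimate: average of cell estimates weighted by population shares
shares <- ps_pop_counts / sum(ps_pop_counts)
ps_estimate <- sum(shares * cell_est[names(shares)])

c(sample_mean = mean(ps_sample$y), poststratified = ps_estimate)
```

With many sparse cells the raw cell means become unstable, which is the motivation for replacing them with partially pooled model predictions.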

### Example 4: GAMP (Generalized Additive Model of Response Propensity)

The **GAMP** method can include smooth functions of the auxiliary variables. For example, here’s how to use smooth functions of `logit_estimated_pi` and `auX_10`:

```{r eval=FALSE}
# GAMP with smooth functions
GAMP_1 = auxsurvey("Y1 ~ 1 + Z1 + Z2 + Z3", auxiliary = "s(logit_estimated_pi) + s(auX_10)", samples = samples, population = population, method = "GAMP", levels = 0.95)
```

### Example 5: Bayesian Linear Regression

The **Bayesian Linear Regression** method treats categorical variables as dummy variables. Here’s an example for `Y1`:

```{r eval=FALSE}
# Linear regression with Bayesian estimation
LR_1 = auxsurvey("Y1 ~ 1 + Z1 + Z2 + Z3", auxiliary = "auX_3", samples = samples, population = population, method = "linear", levels = 0.95)
```
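
The prediction logic behind a regression estimator can be sketched with ordinary least squares. The example below uses `lm()` as a non-Bayesian stand-in for the Bayesian fit, on a made-up, noise-free toy population; `lr_pop`, `lr_sample`, and the outcome formula are assumptions for illustration only.

```{r}
# Toy population: binary covariate z, 3-level discretized covariate g
lr_pop <- expand.grid(z = c(0, 1), g = factor(1:3), unit = 1:20)
# Noise-free toy outcome, so the model below fits it exactly
lr_pop$y <- 1 + 2 * lr_pop$z + c(0, 1, 3)[as.integer(lr_pop$g)]

# Unrepresentative sample: units with z = 1 are heavily over-represented
lr_sample <- lr_pop[lr_pop$unit <= ifelse(lr_pop$z == 1, 10, 2), ]

# The factor g enters the design matrix as dummy variables
lr_fit <- lm(y ~ z + g, data = lr_sample)
colnames(model.matrix(lr_fit))

# Predict for every population unit, then average the predictions
pred_mean <- mean(predict(lr_fit, newdata = lr_pop))
c(sample_mean = mean(lr_sample$y), prediction = pred_mean,
  population = mean(lr_pop$y))
```

Because the toy outcome is an exact linear function of the covariates, the prediction estimator recovers the population mean here while the raw sample mean does not; with real data the gain depends on how well the model is specified.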

### Example 6: BART (Bayesian Additive Regression Trees)

Finally, the **BART** method models the relationship between the outcome and the covariates with a sum of regression trees. Here is an example that estimates the mean of `Y1` using BART, with all covariates placed in the formula:

```{r}
# BART for estimation
BART_1 = auxsurvey("Y1 ~ Z1 + Z2 + Z3 + auX_3 + logit_true_pi", auxiliary = NULL, samples = samples, population = population, method = "BART", levels = 0.95)
```

## Conclusion

The **AuxSurvey** package provides a set of tools for survey inference with discretized auxiliary data. By combining Bayesian models with traditional survey methods such as raking and poststratification, users can improve their estimates while the auxiliary variables stay in their discretized, confidentiality-preserving form.

For further details on the package's functionality, please refer to the documentation, which provides more examples and explanation of the various estimators.