---
title: "Getting started with simputation"
author: "Mark van der Loo"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Getting started with simputation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo=FALSE}
library(simputation)
```

---

This package offers a number of commonly used single imputation methods, each
with a similar and hopefully simple interface. At the moment the following
imputation methodology is supported.

- Model based (optionally add [non-]parametric random residual)
    - linear regression 
    - robust linear regression
    - ridge/elasticnet/lasso regression
    - CART models (decision trees)
    - Random forest
- Multivariate imputation
    - Imputation based on the expectation-maximization algorithm
    - missForest (=iterative random forest imputation)
- Donor imputation (including various donor pool specifications)
    - k-nearest neigbour (based on [gower](https://cran.r-project.org/package=gower)'s distance)
    - sequential hotdeck (LOCF, NOCB)
    - random hotdeck
    - Predictive mean matching
- Other
    - (groupwise) median imputation (optional random residual)
    - Proxy imputation: copy another variable or use a simple transformation
      to compute imputed values.
    - Apply trained models for imputation purposes.


### Installation

The latest release of the package can be installed as follows.
```{r,eval=FALSE}
install.packages('simputation')
```

This package is a wrapper package. It stands on the shoulders of some great packages
that other authors have provided. Below is an overview of the packages that make
imputation with `simputation` possible.

```{r,echo=FALSE}
knitr::kable(
  data.frame(
      `function` = c("impute_rlm"    ,"impute_en"              , "impute_cart", "impute_rf", "impute_rhd","impute_shd","impute_knn","impute_mf","impute_em")
    , model = c("M-estimation", "ridge/elasticnet/lasso", "CART"       , "random forest","random hot deck","sequential hot deck","k nearest neighbours","missForest","mv-normal")
    , package = c("MASS"      ,"glmnet"                 , "rpart"      , "randomForest","VIM (optional)","VIM (optional)","VIM (optional)","missForest","norm")
    , R.recommended = c("yes","no","yes","no","no","no","no","no","no")
    ,stringsAsFactors=FALSE
  )
)
```


### General remarks

A call to an imputation function has the following structure.

```{r, eval=FALSE}
impute_<model>(data, formula, [model-specific options])
```
The output is similar to the ```data``` argument, except that empty values are
imputed (where possible) using the specified model.

The `formula` argument speciefies the variables to be imputed, the model
specification for `<model>` and possibly the grouping of the dataset.
The structure of a formula object is as follows:
```{r, eval=FALSE}
IMPUTED ~ MODEL_SPECIFICATION [ | GROUPING ]
```
where the part between `[]` is optional.

In the following, we assume that the reader already has some familiarity with
the use of formulas in R (e.g. when specifying linear models) and statistical
models commonly used in imputation.




### A first example
First create a copy of the iris dataset with some empty values in columns
1 (`Sepal.Length`), 2 (`Sepal.Width`) and 5 (`Species`).
```{r}
dat <- iris
dat[1:3,1] <- dat[3:7,2] <- dat[8:10,5] <- NA
head(dat,10)
```

To impute `Sepal.Length` using a linear model use the `impute_lm` function.
```{r}
da1 <- impute_lm(dat, Sepal.Length ~ Sepal.Width + Species)
head(da1,3)
```
Observe that the 3rd value is not imputed. This is because one of the predictor variables
is missing so the linear model does not produce an output. `simputation` does not report such cases but simply returns the partly imputed result. The remaining value can be imputed
using a new linear model or as shown below, using the group median.
```{r}
da2 <- impute_median(da1, Sepal.Length ~ Species)
head(da2,3)
```
Here, `Species` is used to group the data before computing the medians.

Finally, we impute the `Species` variable using a [decision tree](https://en.wikipedia.org/wiki/Decision_tree_learning) model. All variables except `Species` are used as predictor.

```{r}
da3 <- impute_cart(da2, Species ~ .)
head(da3,10)
```


### Chaining imputation methods

Using the `|>` operator (R 4.0.0 or later) allows for a very compact
specification of the above examples. 

```{r, eval=FALSE}
da4 <- dat |> 
  impute_lm(Sepal.Length ~ Sepal.Width + Species) |>
  impute_median(Sepal.Length ~ Species) |>
  impute_cart(Species ~ .)
```

### Similar model for multiple variables

The simputation package allows users to specify an imputation model for multiple
variables at once. For example, to impute both `Sepal.Length` and `Sepal.Width`
with a similar robust linear model, do the following.
```{r}
da5 <- impute_rlm(dat, Sepal.Length + Sepal.Width ~ Petal.Length + Species)
head(da5)
```

The function will model `Sepal.Length` and `Sepal.Width` against the predictor 
variables independently and impute them. The order of variables in the
specification is therefore not important for the result.

In general, the left-hand side of the model formula is analyzed by `simputation`,
combined appropriately with the right hand side and then passed through to the underlying modeling routine. Simputation also understands the `"."` syntax, which stands for "every
variable not otherwise present" and the "-" sign to remove variables from a formula. For example, the next expression imputes every variable except `Species` with the group
mean plus a normally distributed random residual.
```{r}
da6 <- impute_lm(dat, . - Species ~ 0 + Species, add_residual = "normal")
head(da6)
```
where `Species` on the right-hand-side defines the grouping variable.

### Grouping data for imputation

Use `|` in the `formula` argument to specify groups. 
```{r}
# New data set, leaving Species intact
dat <- iris
dat[1:3,1] <- dat[3:7,2] <- NA

# split dat into groups according to 'Species', impute, combine and return.
da8 <- impute_lm(dat, Sepal.Length ~ Petal.Width | Species)
head(da8)
```

If one or more grouping variables are specified (multiple are specified by separating them with `+`), imputation takes place as follows.

1. Split the data into subsets according to the values of the grouping variables.
2. Estimate the model for each data subset and impute.
3. Combine the imputed subsets.

Simputation also integrates with the [dplyr](https://cran.r-project.org/package=dplyr) package and recognizes grouping specified with `group_by`.
```{r,eval=FALSE}
library(magrittr)
library(dplyr)

dat <- iris
dat[1:3,1] <- dat[3:7,2] <- NA

dat |> group_by(Species) |> 
  impute_lm(Sepal.Length ~ Petal.Width)
```


### Specify your own method with impute_proxy

The `impute_proxy` function is somewhat special since it allows you to define
an imputation method in the right-hand-side of the formula object. Below we
implement a `robust ratio imputation' (for what its worth) as example.
```{r}
dat <- iris
dat[1:3,1] <- dat[3:7,2] <- NA

dat <- impute_proxy(dat, Sepal.Length ~ median(Sepal.Length,na.rm=TRUE)/median(Sepal.Width, na.rm=TRUE) * Sepal.Width | Species)
head(dat)
```

### Imputing a dataset with models trained on another dataset

This can be done with the `impute` function. To use it, train your
model in the way you are used to.
```{r}
m <- lm(Sepal.Length ~ Sepal.Width + Species, data=iris)
```
Next, use this model to impute a dataset.
```{r}
dat <- iris
dat[1:3,1] <- dat[3:7,2] <- NA
head(dat)

dat <- impute(dat, Sepal.Length ~ m)
head(dat)
```
That's really all there is to it.

### Using VIM as (much) faster backend for hotdeck imputations

The [VIM](https://CRAN.R-project.org/package=VIM) package offers fast implementations for sequential
and random hotdeck procedures (based on the [data.table](https://CRAN.R-project.org/package=data.table) 
package). It also offers somewhat finer control over certain features such as donor selection. For
this reason, the sequential, random, and k-nearest neighbours hotdeck imputation procedures can
be told to use VIM as backend.

```{r,eval=FALSE}
dat <- data.frame(
  foo = c(1,2,NA,4)
  , bar = c(1,NA,8,NA)
)
# sequential hotdeck imputation, no sorting variables
impute_shd(dat, . ~ 1, pool="complete")
impute_shd(dat, . ~ 1, pool="univariate")
impute_shd(dat, .~1, backend="VIM")
```
Note that VIM uses last observation carried forward by default, and the specification of donor pool
is on a per-variable basis (this cannot be changed). See `?impute_shd` for the full specification.