---
title: "Getting Started with DoubleML"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with DoubleML}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(eval = TRUE)
```

The following case studies demonstrate the core functionalities of `DoubleML`.


## Installation 

The development version of the **DoubleML** package for R can be installed from GitHub (this requires a previous installation of the [`remotes` package](https://remotes.r-lib.org/index.html)).

```{r, eval = FALSE}
remotes::install_github("DoubleML/doubleml-for-r")
```
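
A release version of **DoubleML** is also distributed via [CRAN](https://cran.r-project.org/package=DoubleML) and can be installed with:

```{r, eval = FALSE}
install.packages("DoubleML")
```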

Load the package once the installation is complete.

```{r, message=FALSE, warning=FALSE}
library(DoubleML)
```

The Python package `DoubleML` is available via its GitHub repository. For more information, please visit the user guide.

## Data

For our case study, we download the Bonus data set from the Pennsylvania Reemployment Bonus experiment. As a second example, we simulate data from a partially linear regression model.

```{r}
# Load bonus data
df_bonus = fetch_bonus(return_type = "data.table")
head(df_bonus)

# Simulate data
set.seed(3141)
n_obs = 500
n_vars = 100
theta = 3
X = matrix(rnorm(n_obs * n_vars), nrow = n_obs, ncol = n_vars) # covariates
d = X[, 1:3] %*% c(5, 5, 5) + rnorm(n_obs) # treatment depends on the first three covariates
y = theta * d + X[, 1:3] %*% c(5, 5, 5) + rnorm(n_obs) # outcome with treatment effect theta
```


## The causal model

We consider the following partially linear regression (PLR) model:
\begin{align*}
Y = D \theta_0 + g_0(X) + \zeta, & &\mathbb{E}(\zeta | D,X) = 0, \\
D = m_0(X) + V, & &\mathbb{E}(V | X) = 0,
\end{align*}
where $Y$ is the outcome variable and $D$ is the policy variable of interest.
The high-dimensional vector $X = (X_1, \ldots, X_p)$ consists of other confounding covariates,
and $\zeta$ and $V$ are stochastic errors. In the simulated example above, $\theta_0 = 3$ and $m_0(X) = g_0(X) = 5 (X_1 + X_2 + X_3)$.

## The data-backend `DoubleMLData`

`DoubleML` provides interfaces to objects of class [`data.table`](https://rdatatable.gitlab.io/data.table/) as well as the R base classes `data.frame` and `matrix`. Details on the data-backend and the interfaces can be found in the user guide. The `DoubleMLData` class serves as the data-backend and can be initialized from a data frame by specifying the column `y_col = "inuidur1"` serving as outcome variable $Y$, the column(s) `d_cols = "tg"` serving as treatment variable $D$, and the columns `x_cols = c("female", "black", "othrace", "dep1", "dep2", "q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54", "durable", "lusd", "husd")` specifying the confounders. Alternatively, a matrix interface can be used, as shown below for the simulated data.


```{r}
# Specify the data and variables for the causal model
dml_data_bonus = DoubleMLData$new(df_bonus,
  y_col = "inuidur1",
  d_cols = "tg",
  x_cols = c("female", "black", "othrace", "dep1", "dep2",
    "q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54",
    "durable", "lusd", "husd"))
print(dml_data_bonus)

# matrix interface to DoubleMLData
dml_data_sim = double_ml_data_from_matrix(X = X, y = y, d = d)
dml_data_sim
```
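
For completeness, a `data.frame` interface is available as well. The following sketch (not evaluated here) assumes the helper `double_ml_data_from_data_frame()`, exported alongside `double_ml_data_from_matrix()`, applied to the bonus data coerced to a plain `data.frame`:

```{r, eval = FALSE}
# data.frame interface to DoubleMLData; same variable roles as above
df_bonus_frame = as.data.frame(df_bonus)
dml_data_bonus_df = double_ml_data_from_data_frame(df_bonus_frame,
  y_col = "inuidur1",
  d_cols = "tg",
  x_cols = c("female", "black", "othrace", "dep1", "dep2",
    "q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54",
    "durable", "lusd", "husd"))
```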


## Learners to estimate the nuisance models

To estimate our partially linear regression (PLR) model with the double machine learning algorithm, we first have to specify machine learners to estimate $m_0$ and $g_0$. For the bonus data we use a random forest regression model, and for our simulated data from a sparse partially linear model we use a Lasso regression model. The implementation of `DoubleML` is based on the meta-package [mlr3](https://mlr3.mlr-org.com/) for R. For details on the specification of learners and their hyperparameters, we refer to the user guide section on learners, hyperparameters and hyperparameter tuning.

```{r}
library(mlr3)
library(mlr3learners)
# suppress messages from the mlr3 package during fitting
lgr::get_logger("mlr3")$set_threshold("warn")

learner = lrn("regr.ranger", num.trees = 500, max.depth = 5, min.node.size = 2)
ml_l_bonus = learner$clone()
ml_m_bonus = learner$clone()

learner = lrn("regr.glmnet", lambda = sqrt(log(n_vars) / (n_obs)))
ml_l_sim = learner$clone()
ml_m_sim = learner$clone()
```
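
Any regression learner from the mlr3 ecosystem can be plugged in here. As an illustration only (not used in the remainder of this vignette), a cross-validated Lasso could replace the fixed-penalty Lasso above; this sketch assumes the `glmnet` package is installed:

```{r, eval = FALSE}
# Alternative learner: Lasso with lambda chosen by internal cross-validation
learner_cv = lrn("regr.cv_glmnet", s = "lambda.min")
ml_l_sim_cv = learner_cv$clone()
ml_m_sim_cv = learner_cv$clone()
```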


## Cross-fitting, DML algorithms and Neyman-orthogonal score functions

When initializing a `DoubleMLPLR` object for PLR models, we can further set parameters specifying the resampling:

* The number of folds used for cross-fitting `n_folds` (defaults to `n_folds = 5`) as well as 
* the number of repetitions when applying repeated cross-fitting `n_rep` (defaults to `n_rep = 1`). 

Additionally, one can choose between the algorithms `"dml1"` and `"dml2"` via `dml_procedure` (defaults to `"dml2"`). Depending on the causal model, one can further choose between different Neyman-orthogonal score / moment functions. For the PLR model the default score is `"partialling out"`, i.e., 
\begin{align*}
\psi(W; \theta, \eta) := [Y - \ell(X) - \theta (D - m(X))] [D - m(X)].
\end{align*}

Note that with this score, we do not estimate $g_0(X)$ directly; instead, we estimate the conditional expectation of $Y$ given $X$, $\ell_0(X) = \mathbb{E}[Y|X]$. The user guide provides details on sample splitting, cross-fitting and repeated cross-fitting, the double machine learning algorithms, and the score functions.
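
These options are all set at initialization. The following sketch (not used below) shows a non-default configuration, assuming the argument names `n_folds`, `n_rep`, `score` and `dml_procedure` of `DoubleMLPLR$new()` as introduced above:

```{r, eval = FALSE}
# Sketch: 10-fold cross-fitting, repeated 3 times, with the DML1 algorithm
# and the default "partialling out" score
obj_dml_plr_custom = DoubleMLPLR$new(dml_data_bonus,
  ml_l = ml_l_bonus, ml_m = ml_m_bonus,
  n_folds = 10,
  n_rep = 3,
  score = "partialling out",
  dml_procedure = "dml1")
```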


## Estimate double/debiased machine learning models

We now initialize `DoubleMLPLR` objects for our examples using default parameters. The models are estimated by calling the `fit()` method, and we can, for example, inspect the estimated treatment effect using the `summary()` method. A more detailed result summary can be obtained via the `print()` method. Besides the `fit()` method, `DoubleML` model classes also provide functionalities to perform statistical inference, such as `bootstrap()`, `confint()` and `p_adjust()`; for details, see the user guide section on variance estimation, confidence intervals and bootstrap standard errors.

```{r}
set.seed(3141)
obj_dml_plr_bonus = DoubleMLPLR$new(dml_data_bonus, ml_l = ml_l_bonus, ml_m = ml_m_bonus)
obj_dml_plr_bonus$fit()
print(obj_dml_plr_bonus)

obj_dml_plr_sim = DoubleMLPLR$new(dml_data_sim, ml_l = ml_l_sim, ml_m = ml_m_sim)
obj_dml_plr_sim$fit()
print(obj_dml_plr_sim)
```
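
After fitting, the inference methods mentioned above can be applied directly to the model objects. A brief sketch, assuming the default settings of `bootstrap()`, `confint()` and `p_adjust()` as documented for the model classes:

```{r, eval = FALSE}
# Summary and pointwise 95% confidence interval for the bonus model
obj_dml_plr_bonus$summary()
obj_dml_plr_bonus$confint()

# Multiplier bootstrap, followed by a joint confidence interval
obj_dml_plr_bonus$bootstrap(method = "normal", n_rep_boot = 500)
obj_dml_plr_bonus$confint(joint = TRUE)

# p-values adjusted for multiple testing (relevant with several treatment variables)
obj_dml_plr_bonus$p_adjust(method = "romano-wolf")
```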