---
title: "Monte Carlo Simulations"
output: rmarkdown::html_vignette
author: "Jonas Kurle"
date: "11 June 2022"
vignette: >
  %\VignetteIndexEntry{Monte Carlo Simulations}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
<style>
body {
text-align: justify}
</style>

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
```{r, include = FALSE, echo = FALSE}
Sys.setenv(LANGUAGE="en")
```

# Introduction
We provide several functions for Monte Carlo simulations to assess the 
performance of the outlier detection algorithm `outlier_detection()` and the 
various statistical tools such as `outliers_prop()`. The simulations can be
executed in parallel using various backends.

Monte Carlo simulations involve the following steps:

1. create or choose a *true* 2SLS model (including parameters)
2. specify the outlier detection algorithm to be analysed
3. *optionally* choose which simulation parameters to vary, such as the sample
size
4. choose whether to execute the simulations sequentially or in parallel and run the simulations

See the vignette *Introduction to the robust2sls Package* for more details on
the model setup (step 1) and the different algorithms (step 2) that are
implemented.

```{r, eval = FALSE}
utils::vignette("overview", package = "robust2sls")
```

# Step 1: True model
We conceptualise data as being generated by some *true* model, the so-called
data-generating process (DGP). Specifying a DGP ourselves in simulations, allows
us to check whether the theory works in practice. For example, we could use
Monte Carlo simulations to check whether the 2SLS estimator recovers the true
parameters; whether the proportion of detected outliers corresponds to the 
expected proportion; or whether a statistical test has expected size even in
finite samples.

First of all, we need to specify a valid 2SLS model and its parameters. The
function `generate_param()` can be used to generate random parameters of a 2SLS
model that fulfill the 2SLS conditions. For instance, the parameters are created
such that the structural error is uncorrelated to the instruments. Instead of 
random parameters, they can also - partly or fully - be specified by the user.

```{r}
library(robust2sls)
p <- generate_param(dx1 = 3, dx2 = 2, dz2 = 3, intercept = TRUE, seed = 10)
```

Here, we create parameters for a model with 3 exogenous and 2 endogenous 
regressors, and 3 outside instruments. The model includes an intercept, so one
of the exogenous instruments is simply a constant. The parameters are stored in
a list.

Structural equation: $y_{i} = \beta_{1} x_{1,i} + \beta_{2} x_{2,i} + \beta_{3} x_{3,i} + \beta_{4} x_{4,i} + \beta_{5} x_{5,i} + u_{i}$

First stage: $x_{i} = \Pi^{\prime} z_{i} + r_{i}$,

where the vector $x_{i}$ contains all the regressors and the vector of
instruments $z_{i}$ contains the 3 exogenous regressors and the two excluded
instruments. $\Pi$ is the matrix of first stage coefficients.

# Step 2: Outlier detection algorithm

The workhorse command for different types of trimmed 2SLS algorithms in the
*robust2sls* package is `outlier_detection()`. The main decisions are

* which initial estimator to use
* how the sample is trimmed, which is governed by
  * the reference distribution against which the errors are judged to be outliers or not
  * the cut-off value $c$ that determines the size of the standardised errors beyond which observations are labelled as outliers and subsequently removed
* how often the algorithm is iterated, which is represented by the parameter $m$.

To keep things simple and the run-time of the simulations low, we do not iterate
the algorithm in this example. We use the Robustified 2SLS algorithm, which uses
the full sample for the initial estimates. As is commonly done, we use the
normal distribution as the reference distribution. To target a 
*false outlier detection rate* of approximately 5%, we choose a cut-off value of
approximately 1.96, meaning that observations with an absolute standardised
residual larger than 1.96 are classified as outliers. This is set using the
`sign_level` argument of the function, which together with the reference
distribution, `ref_dist`, automatically determines the cut-off value.

The simulation function `mc_grid()` also takes these arguments and internally
uses them to call the `outlier_detection()` function repeatedly across
replications.

# Step 3: Parameter settings

Again to keep the run-time low, we only vary the sample size. We choose small
sample sizes of 50 and 100, respectively.

# Step 4: Backend for execution
The functions `mc()` and `mc_grid()` are designed to be used either sequentially
or in parallel. They are implemented using the 
[foreach](https://CRAN.R-project.org/package=foreach) package.
To ensure that the results are reproducible across different ways of executing 
the simulations (sequentially or parallel; within the latter as multisession, 
multicore, cluster etc.), the package 
[doRNG](https://CRAN.R-project.org/package=doRNG) is used to
execute the loops.

The Monte Carlo functions leave the registration of the foreach adaptor to the
end-user. For example, both the packages 
[doParallel](https://CRAN.R-project.org/package=doParallel) and
[doFuture](https://CRAN.R-project.org/package=doFuture) can be
used.

## Parallel loop

We first consider running the Monte Carlo simulation in parallel. We set the
number of cores and create the cluster. Note that CRAN only allows for at most
two cores, so the code limits the number of cores. For `registerDoParallel()`, 
we need to export the functions that are used within `mc_grid()` explicitly.
With `registerDoFuture()`, it should not be necessary to explicitly export
variables or packages because it identifies them automatically via static code
inspection.

```{r}
library(parallel)
ncores <- 2
cl <- makeCluster(ncores)
# export libraries to all workers in the cluster
invisible(clusterCall(cl = cl, function(x) .libPaths(x), .libPaths()))
```

First, we use the [doParallel](https://CRAN.R-project.org/package=doParallel)
package to run the simulations in parallel. 

```{r}
library(doParallel)
registerDoParallel(cl)
sim1 <- mc_grid(M = 100, n = c(100, 1000), seed = 42, parameters = p, 
               formula = p$setting$formula, ref_dist = "normal", 
               sign_level = 0.05, initial_est = "robustified", iterations = 0,
               shuffle = FALSE, shuffle_seed = 42, split = 0.5)
```

Next, we use the [doFuture](https://CRAN.R-project.org/package=doFuture) package
for the parallel loop. Both implementations yield the same result.

```{r}
library(doFuture)
registerDoFuture()
plan(cluster, workers = cl)
sim2 <- mc_grid(M = 100, n = c(100, 1000), seed = 42, parameters = p, 
               formula = p$setting$formula, ref_dist = "normal", 
               sign_level = 0.05, initial_est = "robustified", iterations = 0,
               shuffle = FALSE, shuffle_seed = 42, split = 0.5)
stopCluster(cl)

# check identical results
identical(sim1, sim2)
```

## Sequential loop

To run the loop sequentially, we can again use the [doFuture](https://CRAN.R-project.org/package=doFuture) package but this time setting a different plan. The [doRNG](https://CRAN.R-project.org/package=doRNG) ensures that the results are identical to those from the parallel loops.

```{r}
library(doFuture)
registerDoFuture()
plan(sequential)
sim3 <- mc_grid(M = 100, n = c(100, 1000), seed = 42, parameters = p, 
               formula = p$setting$formula, ref_dist = "normal", 
               sign_level = 0.05, initial_est = "robustified", iterations = 0,
               shuffle = FALSE, shuffle_seed = 42, split = 0.5)

# check identical results
identical(sim1, sim3)
```