---
title: "The simTool package: Facilitate simulations"
author: "Marsel Scheer"
date: "`r Sys.Date()`"
output:
  html_document:
    fig_caption: false
    number_sections: true 
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: false
vignette: >
  %\VignetteIndexEntry{The simTool package: Facilitate simulations}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

<style>
p {
    max-width: 940px;
}
</style>

```{r, echo = FALSE, message = FALSE}
knitr::opts_chunk$set(comment = "")
options(width = 100, max.print = 100)
set.seed(123456)
library(simTool)
library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)
library(broom)
library(boot)
```

# A first example

1. Generate data 
2. Fit different models
3. Repeat steps 1 and 2 three times
4. Summarize the results with respect to the different data generating functions and models by calculating mean and standard deviation for the corresponding model terms
```{r}
regData <- function(n, SD) {
  x <- seq(0, 1, length = n)
  y <- 10 + 2 * x + rnorm(n, sd = SD)
  tibble(x = x, y = y)
}

eval_tibbles(
  expand_tibble(fun = "regData", n = 5L, SD = 1:2),
  expand_tibble(proc = "lm", formula = c("y~x", "y~I(x^2)")),
  post_analyze = broom::tidy,
  summary_fun = list(mean = mean, sd = sd),
  group_for_summary = "term",
  replications = 3
)
```


# Introduction

The purpose of the *simTool* package is to free the researcher from the
administrative source code that is usually an annoying necessity of a 
simulation study.

This vignette gives an introduction to the *simTool* package, mainly by
examples of growing complexity. The workhorse is the function *eval_tibbles*. Every
parameter of this function is discussed briefly, and the functionality is
illustrated by at least one example.

# Workflow

The workflow is quite easy and natural. One defines two *data.frames* (or *tibbles*):
the first one represents the functions that generate the data sets, and the second one represents
the functions that analyze the data. These two *data.frames* are passed to
*eval_tibbles*, which conducts the simulation. Afterwards, the results can nicely
be displayed as a *data.frame*.

# Defining the *data.frames* for data generation and analysis

There are 3 rules:

* the first column (a *character* vector) defines the functions to be called
* the other columns are the parameters that are passed to the function specified in the first column
* an entry of *NA* will not be passed to the function specified in the first column

The function *expand_tibble* is a convenient function for defining such *data.frames*.

We now define the data generation functions for our first simulation.
```{r}
print(dg <- dplyr::bind_rows(
  expand_tibble(fun = "rexp", n = c(10L, 20L), rate = 1:2),
  expand_tibble(fun = "rnorm", n = c(10L, 20L), mean = 1:2)
))
```


This *data.frame* represents 8 *R*-functions. For instance, the second
row represents a function that generates 20 exponentially distributed random variables
with *rate* 1. Since *mean = NA* in the second row, this parameter is
not passed to *rexp*.
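
Written out as a plain call, the second row corresponds roughly to the following sketch (not evaluated here):
```{r, eval = FALSE}
# equivalent call encoded by the second row of dg;
# mean = NA is dropped and therefore not passed to rexp()
rexp(n = 20, rate = 1)
```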

Similarly, we define the *data.frame* for the data analyzing functions.

```{r}
print(pg <- dplyr::bind_rows(
  expand_tibble(proc = "min"),
  expand_tibble(proc = "mean", trim = c(0.1, 0.2))
))
```

Hence, this *data.frame* represents 3 *R*-functions, i.e. calculating the
minimum and the arithmetic mean with *trim = 0.1* and *trim = 0.2*.
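
Written as plain calls (with *x* denoting a generated data set), these three functions would look roughly as follows:
```{r, eval = FALSE}
# the three analyzing functions encoded in pg, applied to a data set x
min(x)
mean(x, trim = 0.1)
mean(x, trim = 0.2)
```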

# The workhorse *eval_tibbles*

The workhorse *eval_tibbles* has the following simplified pseudo code:


```{r, eval=FALSE}
1.  convert dg to R-functions  {g_1, ..., g_k} 
2.  convert pg to R-functions  {f_1, ..., f_L} 
3.  initialize result object 
4.  append dg and pg to the result object 
5.  t1 = current.time() 
6.  for g in  {g_1, ..., g_k} 
7.      for r in 1:replications (optionally in a parallel manner) 
8.          data = g() 
9.          for f in  {f_1, ..., f_L} 
10.             append f(data) to the result object (optionally apply a post-analyze-function)
11.         optionally append data to the result object 
12.      optionally summarize the result object over all  
         replications but separately for f_1, ..., f_L (and optional group variables)
13. t2 = current.time() 
14. Estimate the number of replications per hour from t1 and t2 
```

The object returned by *eval_tibbles* is a *list* of class *eval_tibbles*.

```{r}
dg <- expand_tibble(fun = "rnorm", n = 10, mean = 1:2)
pg <- expand_tibble(proc = "min")
eg <- eval_tibbles(data_grid = dg, proc_grid = pg, replications = 2)
eg
```

As you can see, the function always estimates the number of replications that can be
done in one hour. 

## Parameter *replications*

Of course, this parameter controls the number of replications conducted.
```{r}
eg <- eval_tibbles(data_grid = dg, proc_grid = pg, replications = 3)
eg
```

## Parameter *discard_generated_data*

By default, *eval_tibbles* saves ALL generated data sets.
```{r}
eg <- eval_tibbles(data_grid = dg, proc_grid = pg, replications = 1)
eg$simulation
eg$generated_data
```
In general, it is sometimes very handy to
have the data sets at hand in order to investigate unusual or unexpected results.
But saving the generated data sets can consume a lot of memory.
Saving the generated data sets can be switched off by setting 
*discard_generated_data = TRUE*. See command line 11 in the pseudo code.
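
For instance, reusing *dg* and *pg* from above, the generated data sets are no longer stored (the element *generated_data* should then be empty):
```{r}
eg <- eval_tibbles(
  data_grid = dg, proc_grid = pg, replications = 1,
  discard_generated_data = TRUE
)
eg$generated_data
```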

## Parameter *summary_fun*

As stated in command line 12 we can summarize the result objects over 
all replications but separately for all data analyzing functions.
```{r}
dg <- expand_tibble(fun = "runif", n = c(10, 20, 30))
pg <- expand_tibble(proc = c("min", "max"))
eval_tibbles(
  data_grid = dg, proc_grid = pg, replications = 1000,
  summary_fun = list(mean = mean)
)
eval_tibbles(
  data_grid = dg, proc_grid = pg, replications = 1000,
  summary_fun = list(mean = mean, sd = sd)
)
```
Note that by specifying the parameter *summary_fun*, the generated data sets
and all individual result objects are discarded. In this example we discard
$3 \times 1000$ data sets and $3 \times 1000 \times 2$ individual result
objects. 

## Parameter *post_analyze*

Sometimes the analyzing functions return quite complicated objects, as in
the section *A first example*.
```{r}
eval_tibbles(
  expand_tibble(fun = "regData", n = 5L, SD = 1:2),
  expand_tibble(proc = "lm", formula = c("y~x", "y~I(x^2)")),
  replications = 2
)
```
The parameter *post_analyze* (if specified) is applied directly after a
result object was generated (see command line 10). Note that *purrr::compose* can be
very handy if your post-analyzing function can be composed of a few simple
functions:
```{r}
eval_tibbles(
  expand_tibble(fun = "regData", n = 5L, SD = 1:2),
  expand_tibble(proc = "lm", formula = c("y~x", "y~I(x^2)")),
  post_analyze = purrr::compose(function(mat) mat["(Intercept)", "Estimate"], coef, summary.lm),
  replications = 2
)
```


## Parameter *group_for_summary*

When the result object is a *data.frame* itself, 
for instance

```{r}
presever_rownames <- function(mat) {
  rn <- rownames(mat)
  ret <- tibble::as_tibble(mat)
  ret$term <- rn
  ret
}

eval_tibbles(
  expand_tibble(fun = "regData", n = 5L, SD = 1:2),
  expand_tibble(proc = "lm", formula = c("y~x", "y~I(x^2)")),
  post_analyze = purrr::compose(presever_rownames, coef, summary),
  replications = 3
)
```
In order to summarize over the replications it is necessary to 
additionally group the calculations with respect to 
another variable. This variable can be passed to
*group_for_summary*:
```{r}
eval_tibbles(
  expand_tibble(fun = "regData", n = 5L, SD = 1:2),
  expand_tibble(proc = "lm", formula = c("y~x", "y~I(x^2)")),
  post_analyze = purrr::compose(presever_rownames, coef, summary),
  summary_fun = list(mean = mean, sd = sd),
  group_for_summary = "term",
  replications = 3
)
```


## Parameter *ncpus* and *cluster_seed*

By specifying *ncpus* larger than 1, a cluster object is created
for the user and passed to the parameter *cluster* discussed in the
next section.
```{r}
eval_tibbles(
  data_grid = dg, proc_grid = pg, replications = 10,
  ncpus = 2, summary_fun = list(mean = mean)
)
```
As it is stated in command line 7, the replications are parallelized. In our case, this means
that roughly every CPU conducts 5 replications.

The parameter *cluster_seed* must be an integer vector of length 6 and
serves the same purpose as the function *set.seed*. By default,
*cluster_seed* equals *rep(12345, 6)*. Note that in order
to reproduce the simulation study it is also necessary that *ncpus*
does not change.
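
As a minimal sketch, passing the default seed explicitly, two runs with the same *cluster_seed* and the same *ncpus* should yield identical results:
```{r}
# two runs with identical cluster_seed and ncpus
r1 <- eval_tibbles(
  data_grid = dg, proc_grid = pg, replications = 10,
  ncpus = 2, cluster_seed = rep(12345L, 6), summary_fun = list(mean = mean)
)
r2 <- eval_tibbles(
  data_grid = dg, proc_grid = pg, replications = 10,
  ncpus = 2, cluster_seed = rep(12345L, 6), summary_fun = list(mean = mean)
)
identical(r1$simulation, r2$simulation)
```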

## Parameter *cluster*
The user can create a cluster on their own. This also enables the user
to distribute the replications over different computers in a network.
```{r}
library(parallel)
cl <- makeCluster(rep("localhost", 2), type = "PSOCK")
eval_tibbles(
  data_grid = dg, proc_grid = pg, replications = 10,
  cluster = cl, summary_fun = list(mean = mean)
)
stopCluster(cl)
```
As you can see, our cluster consists of 2 workers. Hence, this reproduces the
results obtained with *ncpus = 2* above. Further note that if the user starts the
cluster, the user also has to stop the cluster. A cluster that is created
within *eval_tibbles* by specifying *ncpus* is also stopped within
*eval_tibbles*.

## Parameter *cluster_libraries* and *cluster_global_objects*

A newly created cluster is "empty". Hence, if the simulation study requires
libraries or objects from the global environment, they must be transferred to 
the cluster.

Let's look at a standard example from the *boot* package.
```{r}
library(boot)
ratio <- function(d, w) sum(d$x * w) / sum(d$u * w)
city.boot <- boot(city, ratio,
  R = 999, stype = "w",
  sim = "ordinary"
)
boot.ci(city.boot,
  conf = c(0.90, 0.95),
  type = c("norm", "basic", "perc", "bca")
)
```

The following data generating function is extremely boring because it always returns
the data set *city* from the library *boot*.
```{r}
returnCity <- function() {
  city
}
bootConfInt <- function(data) {
  city.boot <- boot(data, ratio,
    R = 999, stype = "w",
    sim = "ordinary"
  )
  boot.ci(city.boot,
    conf = c(0.90, 0.95),
    type = c("norm", "basic", "perc", "bca")
  )
}
```

The function *ratio* currently exists only in our global environment.
Furthermore, we had to load the *boot* package. Hence, we load the *boot*
package on the workers by setting *cluster_libraries = c("boot")* and transfer the function
*ratio* by setting *cluster_global_objects = c("ratio")*.

```{r}
dg <- expand_tibble(fun = "returnCity")
pg <- expand_tibble(proc = "bootConfInt")
eval_tibbles(dg, pg,
  replications = 10, ncpus = 2,
  cluster_libraries = c("boot"),
  cluster_global_objects = c("ratio")
)
```

Of course, it is possible to set *cluster_global_objects=ls()*, but then 
all objects from the global environment are transferred to all workers.
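
For example (not evaluated here), transferring everything from the global environment would look like this:
```{r, eval = FALSE}
# transfers every object of the current global environment to all workers
eval_tibbles(dg, pg,
  replications = 10, ncpus = 2,
  cluster_libraries = c("boot"),
  cluster_global_objects = ls()
)
```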

## Parameter *envir*

In a first step, the function *eval_tibbles* generates
function calls from *data_grid*
and *proc_grid*. This is achieved by applying
the *R*-function *get*. By default, *envir = globalenv()*,
and thus *get* searches the global environment of the
current R session. The following example shows how to use
the parameter *envir*.
```{r}
# masking summary from the base package
summary <- function(x) tibble(sd = sd(x))
g <- function(x) tibble(q0.1 = quantile(x, 0.1))
someFunc <- function() {
  summary <- function(x) tibble(sd = sd(x), mean = mean(x))

  dg <- expand_tibble(fun = "runif", n = 100)
  pg <- expand_tibble(proc = c("summary", "g"))

  # the default is to use the global
  # environment, hence the summary defined outside
  # of someFunc() will be used
  print(eval_tibbles(dg, pg))
  cat("--------------------------------------------------\n")
  # will use the locally defined summary, but g
  # from the global environment, because
  # g is not locally defined.
  print(eval_tibbles(dg, pg, envir = environment()))
}
someFunc()
```

## .truth-functionality

Sometimes it is handy to access, in the (post) data analyzing phase, the
parameter constellation that was used during the data generation. Of course,
one could write wrapper functions for every data generating function and
append the parameter constellation from the data generation as attributes
to the data set, but the purpose of this package is to reduce such
administrative source code. Hence, if the
(post) data analyzing function has an argument *.truth*, then 
*eval_tibbles* will manage that hand-over. A brief example should
explain this. Suppose we want to estimate the bias of the empirical
quantile estimator if the data is normally distributed.

```{r}
dg <- expand_tibble(fun = c("rnorm"), mean = c(1,1000), sd = c(1,10), n = c(10L, 100L))
pg <- expand_tibble(proc = "quantile", probs = 0.975)
post_ana <- function(q_est, .truth){
  tibble::tibble(bias = q_est - stats::qnorm(0.975, mean = .truth$mean, sd = .truth$sd))
}
eval_tibbles(dg, pg, replications = 10^3, discard_generated_data = TRUE,
                   ncpus = 2,
                   post_analyze = post_ana,
                   summary_fun = list(mean = mean))
```

If we want to do the analysis for different distributions, we could 
modify our post data analyzing function, but we can also simply 
add a *.truth*-column to the data generating grid. In this case,
the information from the *.truth*-column is directly passed to
the *.truth*-parameter:

```{r}
dg <- dplyr::bind_rows(
  expand_tibble(fun = c("rnorm"), mean = 0, n = c(10L, 100L), .truth = qnorm(0.975)),
  expand_tibble(fun = c("rexp"), rate = 1, n = c(10L, 100L), .truth = qexp(0.975, rate = 1)),
  expand_tibble(fun = c("runif"), max = 2, n = c(10L, 100L), .truth = qunif(0.975, max = 2))
)
pg <- expand_tibble(proc = "quantile", probs = 0.975)
post_ana <- function(q_est, .truth){
  ret <- q_est - .truth
  names(ret) <- "bias"
  ret
}
eval_tibbles(dg, pg, replications = 10^3, discard_generated_data = TRUE,
                   ncpus = 2,
                   post_analyze = post_ana,
                   summary_fun = list(mean = mean))
```

In the same fashion one could write a data analyzing function with a parameter *.truth*.
To go even a step further, we store the analytic quantile function in the
*.truth* column:

```{r}
dg <- dplyr::bind_rows(
  expand_tibble(fun = c("rnorm"), mean = 0, n = c(10L, 1000L), 
                .truth = list(function(prob) qnorm(prob, mean = 0))),
  expand_tibble(fun = c("rexp"), rate = 1, n = c(10L, 1000L),
                .truth = list(function(prob) qexp(prob, rate = 1))),
  expand_tibble(fun = c("runif"), max = 2, n = c(10L, 1000L),
                .truth = list(function(prob) qunif(prob, max = 2)))
)
bias_quantile <- function(x, prob, .truth) {
  est <- quantile(x, probs = prob)
  ret <- est - .truth[[1]](prob)
  names(ret) <- "bias"
  ret
}
pg <- expand_tibble(proc = "bias_quantile", prob = c(0.9, 0.975))
eval_tibbles(dg, pg, replications = 10^3, discard_generated_data = TRUE,
                   ncpus = 1,
                   summary_fun = list(mean = mean))
```

But one should keep in mind that if one calculates the quantile during the
(post) analyzing phase, this happens on the replication level. To be
more precise, let's look at an excerpt of the pseudo code from the beginning 
of the vignette:

```{r, eval = FALSE}
6.  for g in  {g_1, ..., g_k} 
7.      for r in 1:replications (optionally in a parallel manner) 
8.          data = g() 
9.          for f in  {f_1, ..., f_L} 
10.             append f(data) to the result object (optionally apply a post-analyze-function)
```

No matter whether one extends the data analyzing functions *f_1, ..., f_L* or
the post-analyze-function with an argument *.truth*, the calculations
are made for every single replication during step 10. Hence, the operations
are not vectorized! 



# Some Examples

Note that the following code examples use more computational 
resources. In order to prevent them from being checked/executed
on the CRAN check farm, they are only evaluated
if the environment variable *NOT_CRAN* is set to "true":

```{r}
EVAL <- FALSE
if (Sys.getenv("NOT_CRAN") == "true") {
  EVAL <- TRUE
}
```

## Sampling distribution of mean and median for normal and exponential distributed data

First, we define how the data is generated, where the sample sizes are 10 and 100:
```{r, eval = EVAL}
dg <- dplyr::bind_rows(
  expand_tibble(fun = c("rnorm"), mean = 1, n = c(10L, 100L)),
  expand_tibble(fun = c("rexp"), rate = 1, n = c(10L, 100L))
)
dg
```

Afterwards we define how we want to analyze the data:
```{r, eval = EVAL}
pg <- expand_tibble(proc = c("mean", "median"))
pg
```

Finally, we conduct the simulation and visualize the results
```{r, eval = EVAL}
et <- eval_tibbles(dg, pg, replications = 10^4, ncpus = 2)

et$simulation %>%
  ggplot(aes(x = results, color = interaction(fun, n), fill = interaction(fun, n))) +
  geom_density(alpha = 0.3) +
  facet_wrap(~ proc)
```

## Comparing bootstrap confidence intervals with classical studentized interval

We want to compare the confidence intervals that are
generated by *boot::boot.ci()* and *stats::t.test()*.
Unfortunately, *boot::boot.ci()* cannot be applied
directly to the generated data sets. Therefore, we
write a new function:
```{r, eval = EVAL}
bootstrap_ci <- function(x, conf.level, R = 999) {
  b <- boot::boot(x, function(d, i) {
    n <- length(i)
    c(
      mean = mean(d[i]),
      variance = (n - 1) * var(d[i]) / n^2
    )
  }, R = R)
  boot::boot.ci(b, conf = conf.level, type = "all")
}
```

Furthermore, *boot::boot.ci()* in general returns
more than one confidence interval,
and the structures returned by *boot::boot.ci()* and *stats::t.test()*
are also very different. One solution could be
to write a function *t_test()* that calls *stats::t.test()*
and modifies the returned object, and additionally to
modify the function *bootstrap_ci* so that both
functions return objects with a unified structure.
But instead of that we will implement a function that
is later on passed to the argument *post_analyze* of
*eval_tibbles*:
```{r, eval = EVAL}
post_analyze <- function(o, .truth) {
  if (class(o) == "htest") {
  #post-process the object returned by t.test
    ci <- o$conf.int
    return(tibble::tibble(
      type = "t.test",
      aspect = c("covered", "ci_length"),
      value = c(ci[1] <= .truth && .truth <= ci[2], ci[2] - ci[1])
    ))
  } else if (class(o) == "bootci") {
  #post-process the object returned by boot.ci
    method = c("normal", "basic", "student", "percent", "bca")
    ret = o[method]
    lower = unlist(purrr::map(ret, ~dplyr::nth(.x[1,], -2)))
    upper = unlist(purrr::map(ret, ~dplyr::nth(.x[1,], -1)))
    type = paste("boot", method, sep = "_")

    return(
      dplyr::bind_rows(
      tibble::tibble(
        type = type, 
        aspect = "covered", 
        value = as.integer(lower <= .truth & .truth <= upper)),
      tibble::tibble(
        type = type, 
        aspect = "ci_length", 
        value = upper - lower))
    )
  }
}
```
As you can see, the objects returned are tibbles with more than one row.
Summarizing the data over all replications will in general use the
variables *type* and *aspect* as grouping variables. This can be achieved
by using the parameter *group_for_summary* of *eval_tibbles*.

### Define and run the simulation

We want to generate normally, uniformly, and exponentially distributed data:
```{r, eval = EVAL}
dg <- dplyr::bind_rows(
  simTool::expand_tibble(fun = "rnorm", n = 10L, mean = 0, sd = sqrt(3), .truth = 0),
  simTool::expand_tibble(fun = "runif", n = 10L, max = 6, .truth = 3),
  simTool::expand_tibble(fun = "rexp", n = 10L, rate = 1 / sqrt(3), .truth = sqrt(3))
)
dg
```
and apply our functions that calculate the confidence intervals to it
using two different confidence levels.

```{r, eval = EVAL}
pg <- simTool::expand_tibble(
  proc = c("t.test","bootstrap_ci"),
  conf.level = c(0.9, 0.95)
)
pg
```
Note that the structures of the objects returned are quite different, which is
addressed by our function *post_analyze*. The variables *type* and *aspect*
that are created by *post_analyze* serve to distinguish the different
confidence intervals. Since these variables are part of the result objects,
*eval_tibbles* assumes that these variables are results.
In order to summarize the results (calculating the mean) over all replications
correctly, we need to tell *eval_tibbles* that the additional group variables
are *type* and *aspect*:
```{r, eval = EVAL}
et <- eval_tibbles(dg, pg,
  replications = 10^3, ncpus = 2,
  cluster_global_objects = "post_analyze",
  post_analyze = post_analyze,
  summary_fun = list(mean = mean),
  group_for_summary = c("aspect", "type")
)
et
```

Finally, we can visualize the summarized results:
```{r, eval = EVAL, fig.width=13}
et$simulation %>%
  ggplot(aes(x = fun, y = value, group = type, fill = type, label = round(value, 2))) +
  geom_col(position = "dodge") +
  geom_label(position = position_dodge(0.9), size = 3) +
  theme(legend.position = "bottom") + 
  facet_grid(aspect ~ conf.level, scales = "free")
```

### A different implementation

Here we briefly implement the simulation differently by leveraging
data analyzing functions that return objects with a unified structure:


```{r, eval = EVAL}
t_test = function(x, conf.level){
  tt <- t.test(x, conf.level = conf.level)
  
  # unify return
  tibble::tibble(type = "t.test", lower = tt$conf.int[1], upper = tt$conf.int[2])
}

bootstrap_ci <- function(x, conf.level, R = 999) {
  b <- boot::boot(x, function(d, i) {
    n <- length(i)
    c(
      mean = mean(d[i]),
      variance = (n - 1) * var(d[i]) / n^2
    )
  }, R = R)
  ci <- boot::boot.ci(b, conf = conf.level, type = "all")
  method = c("normal", "basic", "student", "percent", "bca")
  ret = ci[method]
  lower = unlist(purrr::map(ret, ~dplyr::nth(.x[1,], -2)))
  upper = unlist(purrr::map(ret, ~dplyr::nth(.x[1,], -1)))
  type = paste("boot", method, sep = "_")
  
  # unify return
  tibble::tibble(type = type, lower = lower, upper = upper)
}

dg <- dplyr::bind_rows(
  simTool::expand_tibble(fun = "rnorm", n = 10L, mean = 0, sd = sqrt(3), .truth = 0),
  simTool::expand_tibble(fun = "runif", n = 10L, max = 6, .truth = 3),
  simTool::expand_tibble(fun = "rexp", n = 10L, rate = 1 / sqrt(3), .truth = sqrt(3))
)

pg <- simTool::expand_tibble(
  proc = c("t_test","bootstrap_ci"),
  conf.level = c(0.9, 0.95)
) %>% 
  mutate(R = ifelse(proc == "bootstrap_ci", 100, NA))

et <- eval_tibbles(dg, pg,
  replications = 10^2, ncpus = 2
)
et
```

```{r, eval = EVAL}
# grouping variables: all columns from fun up to type, except replications
grps <- et$simulation %>% 
  select(-replications) %>% 
  select(fun:type) %>% 
  names()

et$simulation %>% 
  mutate(covered = lower <= .truth & .truth <= upper,
         ci_length = upper - lower) %>% 
  group_by(across(all_of(grps))) %>% 
  summarise(coverage = mean(covered),
            ci_length = mean(ci_length))
```