---
title: "Repeated Cross-fitting"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Repeated Cross-fitting}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6
)
```

Contents:

  * [Repeated Cross-fitting](#repfit)
    + [Create an AIPW object](#constructor)
    + [Decorate with `Repeated` class](#decorator)
  * [More `num_reps` vs More `k_split`?](#whichbetter)
  
  

## <a id="repfit"></a>Repeated Cross-fitting
  
Repeated cross-fitting reduces the variability that comes from relying on one particular split of the data: the estimator is refitted on several different splits and the resulting estimates are summarized, as suggested by Chernozhukov et al. (2018).

  
### <a id="constructor"></a>Create an AIPW object

```{r one_line}
library(AIPW)
library(SuperLearner)
library(ggplot2)
set.seed(123)
data("eager_sim_obs")
cov = c("eligibility","loss_num","age", "time_try_pregnant","BMI","meanAP")

AIPW_SL <- AIPW$new(Y= eager_sim_obs$sim_Y,
                    A= eager_sim_obs$sim_A,
                    W= subset(eager_sim_obs,select=cov), 
                    Q.SL.library = c("SL.glm"),
                    g.SL.library = c("SL.glm"),
                    k_split = 2,
                    verbose=TRUE)$
  fit()$
  summary()
```

### <a id="decorator"></a>Decorate with `Repeated` class

```{r refit}
# Create a new object from the previous AIPW_SL (the Repeated class extends the AIPW class)
repeated_aipw_sl <- Repeated$new(aipw_obj = AIPW_SL)
# Refit repeatedly, using a different split of the data in each repetition
repeated_aipw_sl$repfit(num_reps = 30, stratified = FALSE)
# Summarise the median estimate, the median SE, and the SE of the median estimate adjusted for the `num_reps` repetitions
repeated_aipw_sl$summary_median()
```
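
To make the three summary quantities more concrete, here is a minimal base R sketch of the median aggregation proposed by Chernozhukov et al. (2018), applied to made-up numbers. The helper `median_aggregate()` and the toy vectors are hypothetical and not part of the AIPW package; whether `summary_median()` computes the adjusted standard error in exactly this form is an assumption.

```{r median_sketch}
# Hypothetical helper (not part of AIPW): aggregate estimates from repeated splits
median_aggregate <- function(estimates, ses) {
  stopifnot(length(estimates) == length(ses))
  est_med <- median(estimates)
  # Adjusted variance: each split's variance plus its squared deviation
  # from the median estimate, summarised by the median across splits
  var_adj <- median(ses^2 + (estimates - est_med)^2)
  c(median_estimate = est_med,
    median_se       = median(ses),
    se_of_median    = sqrt(var_adj))
}

# Made-up estimates and standard errors from five repetitions
toy_est <- c(0.10, 0.12, 0.11, 0.15, 0.09)
toy_se  <- c(0.05, 0.05, 0.06, 0.05, 0.04)
agg <- median_aggregate(toy_est, toy_se)
agg
```

Under this formula the adjusted SE can never fall below the median SE, because a non-negative term is added to each split's variance before the median is taken.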

```{r check_refit}
# Check the distributions of estimates and standard errors across the `num_reps` repetitions
s <- repeated_aipw_sl$repeated_estimates
ggplot2::ggplot(data = s, ggplot2::aes(x = Estimate)) +
  ggplot2::geom_histogram(bins = 10) +
  ggplot2::facet_grid(~Estimand, scales = "free")
ggplot2::ggplot(data = s, ggplot2::aes(x = SE)) +
  ggplot2::geom_histogram(bins = 10) +
  ggplot2::facet_grid(~Estimand, scales = "free")
```



### <a id="whichbetter"></a>More `num_reps` vs More `k_split`?

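Which is preferable depends on several considerations:

1. Computational resources: run time grows with both `num_reps` and `k_split`, because every repetition refits the nuisance models on every fold.
2. Sample size: a larger `k_split` leaves more observations for training each nuisance model, which matters most in small samples.
3. Complexity of the SuperLearner algorithms: flexible learners take longer to fit and typically need more training data than simple ones such as `SL.glm`.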
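
The trade-off between the first two considerations can be sketched with a back-of-envelope calculation. The helper `crossfit_cost()` below is hypothetical and not part of the AIPW package. It assumes the usual cross-fitting scheme in which each nuisance model is trained on `k_split - 1` folds, and it ignores SuperLearner's internal cross-validation, so the counts are a lower bound rather than an exact figure.

```{r cost_sketch}
# Hypothetical helper (not part of AIPW): rough cost of one repeated cross-fitting run
crossfit_cost <- function(n, num_reps, k_split,
                          n_learners_Q = 1, n_learners_g = 1) {
  stopifnot(k_split >= 2)
  data.frame(
    num_reps = num_reps,
    k_split  = k_split,
    # One outcome-model and one exposure-model fit per learner, per fold, per repetition
    learner_fits = num_reps * k_split * (n_learners_Q + n_learners_g),
    # Observations available to train each nuisance model (assumes k_split - 1 training folds)
    n_train = floor(n * (k_split - 1) / k_split)
  )
}

# Compare two made-up configurations for a sample of 200 observations
cost_grid <- rbind(
  crossfit_cost(n = 200, num_reps = 30, k_split = 2),
  crossfit_cost(n = 200, num_reps = 10, k_split = 10)
)
cost_grid
```

Reading the two rows side by side shows the tension: the second configuration trains each model on more data but, even with fewer repetitions, requires more learner fits.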



### <a id="ref"></a>References:
Chernozhukov V, Chetverikov D, Demirer M, et al. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1-C68.