---
title: 'Example: abc'
author: "Thijs Janzen"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Example: abc}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r setup, include=FALSE}
required <- c("abcrf", "ape", "DDD")
if (!all(unlist(lapply(required,
                       function(pkg) requireNamespace(pkg, quietly = TRUE)))))
  knitr::opts_chunk$set(eval = FALSE)


knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(fig.width = 6)
knitr::opts_chunk$set(fig.height = 6)
library(abcrf)
```

## ABC

In Approximate Bayesian Computation (ABC), the goal is to fit a model to data by
comparing summary statistics of data simulated under the model with summary
statistics of the target data: parameter values whose simulations yield summary
statistics close to those of the target data are retained, which yields an
approximate posterior distribution. Here, we will make use of abcrf, a fast ABC
variant that builds a random forest of decision trees. The forest is trained on
simulated (training) data - in that sense abcrf is a machine-learning approach
to ABC - and can then be used to predict parameter values or model choice for
new (test) data.
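
To make the idea behind ABC concrete, here is a minimal rejection-ABC sketch
that does not use abcrf or trees at all: we estimate the rate of a Poisson
sample by keeping only those candidate parameter values whose simulated summary
statistic (here simply the sample mean) lies close to the observed one. The
prior, tolerance and sample sizes are arbitrary choices, purely for
illustration:
```{r rejection_abc_sketch}
set.seed(42)
observed <- rpois(n = 50, lambda = 3)   # "target data"
obs_stat <- mean(observed)              # summary statistic of the target data

candidates <- runif(10000, min = 0, max = 10)   # draws from a uniform prior
sim_stats  <- sapply(candidates,
                     function(l) mean(rpois(n = 50, lambda = l)))

# keep only those parameter values whose simulated statistic is close
# to the observed statistic (tolerance of 0.1)
accepted <- candidates[abs(sim_stats - obs_stat) < 0.1]
mean(accepted)   # approximate posterior mean, close to the true value of 3
```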

### Simple parameter example
In this example, we will see if abcrf can be trained to estimate the birth rate
of a Yule tree (i.e. a birth-death tree without extinction). First, we generate
training data:
```{r training_simple}
num_points <- 100
# each row will hold the true birth rate plus all summary statistics of one tree
test_data <- matrix(nrow = num_points,
                    ncol = 1 + length(treestats::list_statistics()))
for (r in 1:num_points) {
  b <- runif(1) # draw the birth rate from a uniform prior on [0, 1]
  focal_tree <- ape::rphylo(n = 100, birth = b, death = 0)
  focal_stats <- treestats::calc_all_stats(focal_tree)
  test_data[r, ] <- c(b, focal_stats)
}
colnames(test_data) <- c("birth", names(focal_stats))
test_data <- as.data.frame(test_data)
```
Given the training data, we can create our random forest:
```{r abcrf_simple}
forest <- regAbcrf(birth ~ ., test_data, ntree = 100)
```
With our random forest set up, let's put it to the test on some other Yule tree:
```{r test_simple}
test_tree <- ape::rphylo(n = 100, birth = 0.5, death = 0)
test_stats <- treestats::calc_all_stats(test_tree)
test_stats <- as.data.frame(t(test_stats))

predict(forest, test_stats, test_data)
```
We see that even when using only 100 decision trees, the parameter estimate is
quite close to the true value of 0.5. This can be visualized further:
```{r plot_simpl2}
densityPlot(forest, test_stats, test_data)
```
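
To check whether 100 trees are enough, we can inspect how the out-of-bag
prediction error changes with the number of trees; if the error has not yet
levelled off, `ntree` should be increased. This is a sketch assuming the
`err.regAbcrf()` helper from abcrf, called with the forest and the training
data frame:
```{r oob_error_simple, eval = FALSE}
# out-of-bag error as a function of the number of trees in the forest
err.regAbcrf(forest, test_data)
```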

Interestingly, we can also assess which summary statistics were most
informative:
```{r plot_simple}
plot(forest)
```
This indicates that the mean branch length and the phylogenetic diversity -
both branch-length dependent statistics, i.e. properties of the tree directly
related to the speciation rate - are most decisive in estimating the birth
rate.
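
The same information can also be read off numerically. The snippet below
assumes that the fitted object stores the underlying ranger forest in its
`model.rf` component (as current versions of abcrf do); if not, `str(forest)`
shows where the importances live:
```{r importance_table, eval = FALSE}
# the ten most informative summary statistics, by variable importance
head(sort(forest$model.rf$variable.importance, decreasing = TRUE), n = 10)
```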

### Comparing models
ABCrf can also be used to do model selection. Here, we will extend the Yule data
we generated before with trees generated using the DDD model, and see if abcrf
can distinguish between these two models.
First, we prep our new training data frame and generate DDD trees:
```{r sim_ddd}
# append a column with the model label: 1 = Yule, 2 = DDD
new_test_data <- cbind(test_data, 1)
for (r in 1:num_points) {
  b <- runif(1)
  # simulate a DDD tree with speciation rate b, extinction rate 0 and
  # carrying capacity 130
  focal_tree <- DDD::dd_sim(pars = c(b, 0, 130), age = 26.5)$tes
  # trees that are too small will yield NA values for some statistics
  while (length(focal_tree$tip.label) < 4)
    focal_tree <- DDD::dd_sim(pars = c(b, 0, 130), age = 26.5)$tes
  focal_stats <- treestats::calc_all_stats(focal_tree)
  new_test_data <- rbind(new_test_data, c(b, focal_stats, 2))
}

# drop the birth rate column: for model selection we only use the summary
# statistics and the model label
for_abc <- new_test_data[, 2:ncol(new_test_data)]
colnames(for_abc) <- c(names(focal_stats), "model")
for_abc$model <- as.factor(for_abc$model)

forest <- abcrf::abcrf(model ~ ., data = for_abc, ntree = 100)
```
The random forest can now be plotted, and used to assess target trees:
```{r plot_abcrf_model_select}
plot(forest, for_abc)
```
```{r predict_model_select}
to_test <- c()
# five Yule trees, simulated with a birth rate of 0.5
for (i in 1:5) {
  focal_tree <- ape::rphylo(n = 100, birth = 0.5, death = 0)
  to_test <- rbind(to_test, treestats::calc_all_stats(focal_tree))
}

# five DDD trees, simulated with a speciation rate of 0.5
for (i in 1:5) {
  focal_tree <- DDD::dd_sim(pars = c(0.5, 0, 130), age = 26.5)$tes
  # trees that are too small will yield NA values for some statistics
  while (length(focal_tree$tip.label) < 4)
    focal_tree <- DDD::dd_sim(pars = c(0.5, 0, 130), age = 26.5)$tes
  to_test <- rbind(to_test, treestats::calc_all_stats(focal_tree))
}
to_test <- as.data.frame(to_test)
predict(forest, to_test, for_abc)
```
Indeed, we see that the first five trees are convincingly assigned to model 1
(Yule), and that the last five trees are convincingly assigned to model 2
(DDD).
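
As a final check of how reliable such assignments are in general, the
out-of-bag classification error of the forest can be inspected as a function of
the number of trees, analogous to the regression case above. This is a sketch
assuming the `err.abcrf()` helper from abcrf, called with the forest and its
training data frame:
```{r oob_error_model_select, eval = FALSE}
# out-of-bag classification error as a function of the number of trees
err.abcrf(forest, for_abc)
```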