---
title: "CFTP Method"
output: 
  rmarkdown::html_vignette:
    toc: true
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{CFTP Method}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(MethEvolSIM)
```


# Introduction

Without the CFTP method, the initial methylation states are sampled independently 
for each CpG site based on the equilibrium frequencies of its genomic structure 
(island or non-island).
As a result, the initial distribution lacks neighbor
correlations of the correlated Single-Site Event (SSE) process.
These correlations develop over time during evolution along a branch.

To ensure the that root sequence includes neighbor correlations, we adapted the Coupling 
From The Past (CFTP) algorithm (see @ProppWilson96, @ProppWilson98 for details).
This method samples methylation states from the model's equilibrium 
accounting for the correlated SSE process.
Note that in the current implementation CFTP neglects the Island-Wide Event Process
and thus in particular the reduction of neighbor-specific correlations within CpG islands after
and island-wide event.

# Use examples

The CFTP method can be applied when simulating
initial methylation data with `simulate_initialData` or when simulating data evolution
along a tree with `simulate_evolData` with the argument `CFTP`.
The default value for this argument is FALSE but can be set to TRUE to apply the
CFTP algorithm initial data or at the root of the tree.

Example: Sample initial data with the sequence of methylation states according to 
the equilibrium.

```{r}
# Example with sampled initial methylation frequencies and default parameter values
custom_infoStr <- data.frame(n = c(100, 100, 100),
                             globalState = c("M", "U", "M"))
initialD <- simulate_initialData(infoStr = custom_infoStr, CFTP = TRUE)
```

Example: Simulate evolution of data along a tree calling the CFTP method to sample 
a sequence of methylation states at the root of the tree from the equilibrium.

```{r}
# Example with customized initial methylation frequencies, customized parameter values 
# and default dt (0.01)
tree <- "((a:1, b:1):2, c:2, (d:3.7, (e:4, f:1):3):5);"
custom_infoStr <- data.frame(n = c(10, 10, 10),
                             globalState = c("M", "U", "M"),
                             u_eqFreq = c(0.1, 0.8, 0.1),
                             p_eqFreq = c(0.1, 0.1, 0.1),
                             m_eqFreq = c(0.8, 0.1, 0.8))
custom_params <- get_parameterValues()
custom_params$mu <- 0.005

evolD <- simulate_evolData(infoStr = custom_infoStr, tree = tree, params = custom_params, CFTP = TRUE)
```

# Convergence and Convergence Approximations

Applying the CFTP method increases run time and memmory requirements. This increase
depends on the parameterization used for the given simulation.
For example, a combination of high $\iota$ values (leading to the SSE process being mostly independent)
and low $\alpha_{Ri}$ values (leading to a high variation in the 3 rates of 
independent SSE) leads to some sites having a rate of change through the SSE process
of nearly 0, which results in a long runtime of the exact CFTP algorithm.
To limit the run time and memmory consumption for such cases, the
maximum number of steps used is limited with the argument `CFTP_step_limit`.
If after the limiting number of steps the algorithm has not converged, an approximation
is applied and a message will show:

"Steps limit reached. Applying approximation for CFTP."

The number of steps used by the algorithm is increased by subsequent iterations, 
so that the number of steps at each iteration $n$ given by:

$$\text{steps}_n = 10000 \cdot 2^{n-1}$$

The default value for `CFTP_step_limit` is set to `327680000`, allowing the 
algorithm to try 16 full iterations before applying the approximation.

# Visual Examples

To help understanding how CFTP samples an initial sequence of methylation states from
equilibrium, we provide plots from our tests.
The test used 10 parameter combinations since the model equilibrium depends on them.
The summary statistic used is the mean correlation of a central segment of
methylation states per genomic structure type (see vignette "Summary Statistics with MethEvolSIM").

Without CFTP, the initial states of the sequence are sampled independently based 
on the equilibrium frequencies of each genomic structure. Over time, local 
correlations between methylation states develop during evolution.
With CFTP, the initial states of the sequence are sampled from the equilibrium 
accounting for the correlated SSE process. Therefore, the local correlations are
already present in the initial states of the sequence.

Our test consists in comparing the change in correlations observed in a sequence
initiated without CFTP along evolutionary time to the correlations observed in a
sequence initiated with CFTP.
We expect the mean correlation of the sequence without CFTP to change with 
evolutionary time until it stabilizes at the equilibrium for the corresponding
parameter combination.
To track this, we discretize evolution along a tree branch into small steps of length 0.01. 
We also apply CFTP to the initial data, sampling a sequence of methylation states 
from equilibrium and computing the corresponding summary statistic. 
The distribution of the summary statistics computed for the sequence obtained
with the CFTP method should match the summary statistics equilibrium reached 
without the CFTP method after evolutionary time.

In the following figures, the y-axis represents the mean correlation, and the x-axis
shows the number of discretized time steps.
Each figure corresponds to one of the 10 parameter combinations.
For each combination, 10 replicates were simulated, shown in
different colors.
Dots (with unique shapes for each replicate) indicate the mean
correlation at each time step for the sequences initiated without CFTP. 
For each sequence, the line of the same color represents the value after applying
CFTP with the same CpG-island-specific equilibrium probabilities of the
methylation states as in the initialization of the non-CFTP simulations.
The plots show that CFTP leads to a correlation sampled from a distribution
to which the simulations without CFTP converge.

![Sampled parameter combination 1](Figures/CFTP_testConvergence_paramsID_01.png)

![Sampled parameter combination 2](Figures/CFTP_testConvergence_paramsID_02.png)

![Sampled parameter combination 3](Figures/CFTP_testConvergence_paramsID_03.png)

![Sampled parameter combination 4](Figures/CFTP_testConvergence_paramsID_04.png)

![Sampled parameter combination 5](Figures/CFTP_testConvergence_paramsID_05.png)

![Sampled parameter combination 6](Figures/CFTP_testConvergence_paramsID_06.png)

![Sampled parameter combination 7](Figures/CFTP_testConvergence_paramsID_07.png)

![Sampled parameter combination 8](Figures/CFTP_testConvergence_paramsID_08.png)

![Sampled parameter combination 9](Figures/CFTP_testConvergence_paramsID_09.png)

![Sampled parameter combination 10](Figures/CFTP_testConvergence_paramsID_10.png)


## Reproducibility

The test used to generate the visual example is provided as a workflow in 
[MethEvolSIM GitHub Repository](https://github.com/statgenlmu/MethEvolSIM), under
`/extensive_tests/CFTP_convergence`.
All scripts are provided and can be run independently, or as in our example,
using [Snakemake](https://snakemake.readthedocs.io). 
Here, we describe the structure and content of the workflow. 


### Workflow Requirements

-   Dependencies are listed on the file `environment.yaml`. They can be installed
previous to running the workflow or, alternatively,
a conda environment can be created with:

```{bash, eval = FALSE}
conda env create -n cftp_testConvergence -f environment.yaml
```

Then, the Snakemake workflow can be run by:

```{bash, eval = FALSE}
conda activate cftp_testConvergence
snakemake --use-conda
```

If using nohup one can activate the environment as in the following command.

```{bash, eval = FALSE}
nohup conda run -n cftp_testConvergence snakemake &
```

-   Provided: `environment.yaml`, `Snakefile`, `config.yaml` and associated scripts.

### Steps in the Workflow

All the given parameters to the workflow steps are taken from the `config.yaml` file.

1.  Rule `get_spatial_str` run by script `get_spatial_str.R`. 
It sets the spatial distribution of the sequence to simulate.
The test simulates data on a simplified genomic region alternating island and 
non-island structures each a given number of times (`Str_n`) and 
each with a constant number of CpGs (`CpG_n`). 
The output is a file named as `methSite_genomicDist` under the directory `dir`.

2.  Rule `set_design` run by script `set_sim_design.R`. 
Takes as input the output of the previous rule. For a number of model parameter 
combinations (`n_sim`) and setting a seed (`seed`) for reproducibility, 
samples the parameter combinations from the prior distributions as defined in
`prior_distributions`.
The output is a file named "design.RData" under the  directory `dir` containing
both the spatial structure defined in the first rule 
and the sampled parameters. 
*Note that this script corrects the values of* $\iota$ *and* $\alpha_{R_i}$ 
*in the sampled parameters to be a minimum of 1e-2*, 
*as that is the smallest value used in the implementation due to numerical limitations*.

3.  Rule `run_sim` run by script `run_sim.R`. 
Takes as input the output of the previous rule. 
For each parameter combination it simulates a number of replicates (`replicate_n`) and
each replicate is simulated with a seed obtained as the index of the corresponding 
parameter combination * 1000 + the replicate number for reproducibility. For each 
replicate it first generates initial data without the CFTP method and 
simulates evolution along a branch of length `branch_length` a number of times from `start` to `end` and, then calls the CFTP method on the previously generated initial data. 
The output files are written under the directory `dir` and named:

  - `"CFTP_testConvergence_paramsID_{parameter_combination_number}_rep_{replicate_number}_{step_n}.RData"`

  - `"CFTP_testConvergence_paramsID_{parameter_combination_number}_rep_{replicate_number}_cftp.RData"`


4.  Rule `compute_meanCor` run by script `compute_meanCor.R`. 
Takes as input the output of the previous rule. 
For each branch step output and cftp output it computes the mean correlation of 
a segment of methylation states per genomic structure type (see
vignette "Summary Statistics with MethEvolSIM"). 
The output files are written under the directory `dir` and are named:

- `"summaryStats_CFTP_testConvergence_paramsID_{parameter_combination_number}_cftp.RData"`

- `"summaryStats_CFTP_testConvergence_paramsID_{parameter_combination_number}_rep_{replicate_number}.RData"`



5.  Rule `plot_results` run by script `plot.R`. 
Takes as input the output of the previous rule.  
The output files are written under a subdirectory `Figures` generated in the 
snakemake directory (or the directory from where the R script is run if not using Snakemake),
and are named "CFTP_testConvergence_paramsID_{parameter_combination_number}" 
both in `.pdf` and `.png` format.

# References