---
title: "Sampling from (CO)VLMC"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Sampling from (CO)VLMC}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 7,
  fig.align = "center",
  cache.path = ".sampling_cache/"
)
```

```{r setup, message=FALSE}
library(mixvlmc)
library(ggplot2)
library(data.table) ## for frollapply
```

Once a (CO)VLMC has been estimated from a sequence, it can be used to produce new sequences. Any statistical estimator defined on the original sequence can be applied to the simulated sequences leading a form of [semi-parametric bootstrap](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)). This provides
an interesting alternative to other bootstrapping solutions for correlated data,
such as the block bootstrap. This can also be use to consider what-if scenarios 
when different values of the covariates are used study their effect beyond analysis
coefficients in context specific models. 

## Theoretical results

BÃ¼hlmann and Wyner show in [their paper](https://dx.doi.org/10.1214/aos/1018031204)
that this approach is consistent for statistics that depend smoothly from arbitrary
means of fixed mappings of contexts to numerical values. In simple terms, one can
first choose a mapping of arbitrary tuples of values in the state space to numerical 
vectors and then build a estimator by averaging those vectors over the observed sequence.
The resulting vector mean can then be transformed smoothly to lead to an estimator.
The VLMC bootstrap is then consistent for this estimator under relatively mild
hypothesis on the underlying VLMC. 

A typical example of such estimator is the probability of observing a fixed 
pattern in the sequence. More generally, the class of estimators includes smooth
transformations of the empirical distribution of patterns up to a fixed length. 

In practice, it is recommended to sample a longer sequence than the one needed and
to keep only the last $n$ values, in order to minimise the influence of the starting
values. The first values are considered as a *burn in* or *warm up* phase. 

## Sampling from a VLMC

As explained in `vignette("likelihood")`, a VLMC does not specify distributions
for the initial values of time series as no context can be computed from them. 
In `simulate.vlmc()`, we use the notion of *extended context* described in the 
vignette to extend any VLMC into a full model, that specifies both the conditional
distributions given its contexts and also the distributions of the initial values.
   
### Introduction

Let us consider a simple example with an independent sample:
```{r}
dts <- sample(0:1, 500, replace = TRUE)
```
The optimal VLMC according to the BIC has no memory:
```{r}
best_vlmc <- tune_vlmc(dts)
draw(as_vlmc(best_vlmc))
```

Simulating a sequence using the model is done via `simulate.vlmc()` which overrides
the standard `stats::simulate()`. For instance:

```{r}
dts_sample <- simulate(as_vlmc(best_vlmc), 600)[-(1:100)]
```

Even if this is useless here because of the independence, we drop the first 100
samples. In general `simulate.vlmc()` is used as `stats::simulate()` and supports 
the standard parameters:

- `nsim` for the number of simulated values, here the length of the new time series
- `seed` to specify the random seed used during the simulation (the initial state of
   the random number generator is restored to its previous after simulating the values)
   
As for `stats::simulate()`, the state of the random number generator is never 
modified by a call to `simulate.vlmc()`. 

In addition, one can specify the initial values of the sequence via the `init` parameter, for
instance:
```{r}
dts_sample_2 <- simulate(as_vlmc(best_vlmc), 10, init = c(0L, 0L))
```
is guaranteed to start with $0, 0$:
```{r}
dts_sample_2
```
This provides an alternative to dropping the initial values of the simulated time
series by setting those values to the one observed in the data set. Notice that this
practice has no theoretical justification. 

To implement a *burn in* period, just use the `burnin` parameter. For instance:
```{r}
dts_sample_3 <- simulate(as_vlmc(best_vlmc), 10, burnin = 20)
```
will simulated 30 values and drop the first 20. 

### CAC time series
Let us consider the French CAC index provided in `EuStockMarkets`:
```{r}
CAC_raw <- as.data.frame(EuStockMarkets)$CAC
```
We turn it into a discrete time series with three values:

-  Stay if the value of the index on day t+1 is between 99.5% and 100.5% of the value
on day t
-  Up if the value increased by at least 0.5%
-  Down if the value decreased by at least 0.5%

```{r}
CAC_rel_evol <- diff(CAC_raw) / CAC_raw[-length(CAC_raw)]
CAC_dts <- factor(
  ifelse(CAC_rel_evol >= 0.005, "Up",
    ifelse(CAC_rel_evol <= -0.005, "Down", "Stay")
  ),
  levels = c("Down", "Stay", "Up")
)
```

Then we adjust a VLMC to the time series using the AIC criterion:
```{r cache=TRUE}
CAC_vlmc <- tune_vlmc(CAC_dts, criterion = "AIC")
CAC_model <- as_vlmc(CAC_vlmc)
```

We use here the AIC to favour predictive performances and as a consequence the
obtained model is quite complex with `r context_number(CAC_model)` contexts. 

The original discrete time series does not exhibit long subsequences of constant 
values as shown in the following graphical representation.
```{r fig.height=4}
CAC_rle <- rle(as.integer(CAC_dts))
CAC_rle_df <- data.frame(value = CAC_rle$values, length = CAC_rle$lengths)
ggplot(CAC_rle_df, aes(x = length)) +
  geom_bar() +
  labs(
    title = "Distribution of the lengths of constant subsequences",
    x = "Length",
    y = "Count"
  )
```

To study this aspect of the time series, a natural statistics is the probability 
of observing a constant subsequence of a length between 5 and 10 among all 
subsequences of length 10 that can be generated by the model. Notice that the 
patterns used in the statistics must be of fixed lengths to be covered by the 
consistency theorems mentioned above. We implement the statistics as follows:

```{r}
long_sequence <- function(dts) {
  dts_int <- as.integer(dts)
  ## RLE cannot be used directly as we need to account for overlapping
  ## patterns
  dts_freq <- frollapply(dts_int, 10, \(x) max(rle(x)$length) >= 5)
  mean(dts_freq, na.rm = TRUE)
}
```

We generate 100 bootstrap samples, using a *burn in* time proportional to the 
depth of the model, in order to allow for a proper mixing to take place. Notice 
that we could use also `burnin="auto"` to use the another heuristics which uses
the number of contexts to compute the length of the *burn in* period. We use
the `seed` parameter of `stats::simulate()` to ensure reproducibility. 

```{r cache=TRUE}
bootstrap_samples <- vector(100, mode = "list")
burn_in_time <- 50 * depth(CAC_model)
for (b in seq_along(bootstrap_samples)) {
  bootstrap_samples[[b]] <- simulate(CAC_model,
    length(CAC_dts),
    burin = burn_in_time,
    seed = b
  )
}
```

Then we compute the statistics on the bootstrap samples. 
```{r cache=TRUE}
bootstrap_ls <- sapply(bootstrap_samples, long_sequence)
```

The bootstrap distribution of this statistics is illustrated on the following
figure in which the red vertical line represents the value of the statistics on
the original sequence. 
```{r fig.height=4}
ggplot(mapping = aes(x = bootstrap_ls)) +
  geom_density() +
  geom_rug() +
  labs(
    title = "Bootstrap distribution of the probability of long constant subsequences",
    x = "Probability"
  ) +
  geom_vline(xintercept = long_sequence(CAC_dts), color = "red")
```

## Sampling from a COVLMC

The case of COVLMC is more complex mainly because of the dependence towards an 
external time series that is not modelled. In addition, no theoretical result
justify a form of COVLMC bootstrap at the time of writing this document (2023).

We can nevertheless use sampling both to get an idea of the potential variability
of the results from a qualitative point of view, and also to study some what-if 
scenarios based on a modification of the covariates. 

### CAC example continued
Let us study the CAC discrete time series using the three other indexes from 
`EuStockMarkets` as covariates. As previously, we select a model with the AIC
criterion:
```{r cache=TRUE}
CAC_covariates <- as.data.frame(EuStockMarkets)[c("DAX", "SMI", "FTSE")][-1, ]
CAC_covlmc <- tune_covlmc(CAC_dts, CAC_covariates, criterion = "AIC")
CAC_comodel <- as_covlmc(CAC_covlmc)
```
The obtained model is relatively complex as shown below:
```{r}
draw(CAC_comodel, model = "full", with_state = TRUE)
```

The variability of the discrete time series for fixed values of the covariates
can be investigated using simulated sequences. The only difference between
`simulate.vlmc()` and `simulate.covlmc()` is the need for the covariates in the 
latter. A *burn in*  time is also problematic here as we do not have additional 
throw away values for the covariates. A possibility used here consists in starting
the discrete time series as the observed one started, using as many values as the
size of the largest context. Notice that in the current implementation, 
simulating from a COVLMC is relatively slow compared to simulation from a VLMC. 
Here we perform a single simulation to illustrate the feature.
```{r cache=TRUE}
co_sample <- simulate(CAC_comodel,
  length(CAC_dts),
  seed = 0,
  init = CAC_dts[1:depth(CAC_comodel)],
  covariate = CAC_covariates
)
```

### What-if scenarios
In some situations, one can modify the covariates to test the consequences of the 
external influence on the discrete time series under study. Let us consider for
instance a week of electrical consumption:

```{r fig.height=5}
pc_week <- powerconsumption[powerconsumption$week == 12, ]
elec <- pc_week$active_power
ggplot(pc_week, aes(x = date_time, y = active_power)) +
  geom_line() +
  xlab("Date") +
  ylab("Activer power (kW)")
```
We build a discrete version by considering two states: the background base 
active power below 0.4 kW and the active use of electric appliance above this limit.

```{r}
elec_dts <- cut(elec, breaks = c(0, 1.2, 8), labels = c("background", "active"))
elec_cov <- data.frame(day = (pc_week$hour >= 7 & pc_week$hour <= 17))
```

Then we adjust a COVLMC with a binary covariate day/night, using the AIC. 

```{r}
elec_covlmc_tune <- tune_covlmc(elec_dts, elec_cov, criterion = "AIC")
best_elec_covlmc <- as_covlmc(elec_covlmc_tune)
draw(best_elec_covlmc, model = "full", time_sep = " | ", p_value = FALSE)
```

Then we can simulate longer or shorter days by manipulating the covariates. For
instance a "day only" full week could be obtained as follows
```{r}
day_only <- simulate(best_elec_covlmc,
  length(elec_dts),
  seed = 0,
  covariate = data.frame(day = rep(TRUE, length(elec_dts)))
)
```
while a "night only" full week is given by
```{r}
night_only <- simulate(best_elec_covlmc,
  length(elec_dts),
  seed = 0,
  covariate = data.frame(day = rep(FALSE, length(elec_dts)))
)
```

A typical use of this approach consists in simulating first a collection of 
sequences with the original covariates to get a sense of the variability of the
statistics of interest. In a second phase, a comparable collection of sequences
is generated using the manipulated covariates.