---
title: "Adjusting the noise distribution in the Barker proposal"
output: rmarkdown::html_vignette
bibliography: references.bib
link-citations: true
vignette: >
  %\VignetteIndexEntry{Adjusting the noise distribution in the Barker proposal}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

The Barker proposal implementation in the `rmcmc` package in `barker_proposal()` by default uses a standard normal distribution for the auxiliary noise variables generated in the proposal, as suggested by @livingstone2022barker.
The analysis in @vogrinc2023optimal however implies that alternative choices of noise distribution can give improved performance in some situations.
This vignette demonstrates how to adjust the Barker proposal noise distribution.

```{r setup}
library(rmcmc)
```

## Example target distribution

As a simple example of a target distribution, we consider a $D$-dimensional distribution with product form (independent dimensions) and 'hyperbolic' marginals 

$$\pi(x) \propto \exp\left(-\sum_{d=1}^D(\delta^2 + x_d^2)^{\frac{1}{2}}\right).$$
The gradient of the corresponding log density function can be derived as
$$\nabla(\log \pi)(x)_d = -x_d / (\delta^2 + x_d^2)^{\frac{1}{2}}.$$
This can be implemented in R as

```{r}
delta_sq <- 0.1
target_distribution <- list(
  log_density = function(x) -sum(sqrt(delta_sq + x^2)),
  gradient_log_density = function(x) -x / sqrt(delta_sq + x^2)
)
```

Here we use $\delta^2 = `r delta_sq`$, corresponding to Example 4 in @vogrinc2023optimal.

## Creating Barker proposal with a custom noise distribution

The `barker_proposal()` function accepts an optional argument `sample_auxiliary`,
which can be used to specify the distribution the auxiliary noise variables are drawn from.
The argument should be passed a function accepting a single integer argument,
corresponding to the target distribution dimension,
and returning a vector of auxiliary noise variable samples,
one per target distribution dimension.
By default this function is set to `stats::rnorm()`, 
thus using independent standard normal variates for the auxiliary noise variables.

Below we create a function `sample_bimodal` which instead samples from a bimodal normal mixture,
with two equally-weighted normal components with means $\pm (1 - \sigma^2)$ and common variance $\sigma^2 = 0.01$.

```{r}
sample_bimodal <- function(dimension) {
  sigma <- 0.1
  return(
    sample(c(-1, 1), dimension, TRUE) * sqrt(1 - sigma^2) +
      stats::rnorm(dimension) * sigma
  )
}
```

This choice of bimodal distribution for the auxiliary noise variables is motivated by the analysis in @vogrinc2023optimal,
which shows that for target distributions of product form and using a Barker proposal, 
the asymptotic _expect squared jump distance_ (ESJD) a measure of chain sampling performance,
is maximised when the auxiliary noise distribution is chosen to be the Rademacher distribution
(the distribution with half of its mass in atoms at -1 and +1).
The Rademacher distribution itself does not give a practical sampling algorithm,
as the resulting Markov kernel will not be irreducible.
As a more practical alternative @vogrinc2023optimal therefore suggests using the above normal mixture bimodal distribution,
which can be considered a relaxation of the Rademacher distribution.

We now create two instances of the the Barker proposal,
the first using the default of a standard normal distribution for the auxiliary noise variables,
and the second using the bimodal distribution.

```{r}
normal_noise_proposal <- barker_proposal()
bimodal_noise_proposal <- barker_proposal(sample_auxiliary = sample_bimodal)
```

A convenience function `bimodal_barker_proposal()` is also provided in the package to simplify creating a proposal with bimodal noise distribution as above, with optional `sigma` argument specifying the standard deviation of the normal components. That is the second line above can equivalently be written `bimodal_noise_proposal <- bimodal_barker_proposal()` without the need to define the `sample_bimodal` function.

## Sampling chains with proposals

Now that we have the target distribution and two proposals specified,
we can sample chains to compare their performance.

```{r}
set.seed(7861932L)
dimension <- 100
initial_state <- rnorm(dimension)
adapters <- list(scale_adapter())
n_warm_up_iteration <- 10000
n_main_iteration <- 10000
```

Above we specify some common settings for the `sample_chain()` calls,
namely the dimension of the target distribution,
the initial chain state (shared for both chains),
the adapters to use during the warm-up iterations (here we adapt only the proposal scale to achieve a target acceptance rate of 0.57),
and the number of warm-up and main chain iterations.

We can now sample a chain using the bimodal noise proposal variant

```{r}
bimodal_noise_results <- sample_chain(
  target_distribution = target_distribution,
  proposal = bimodal_noise_proposal,
  initial_state = initial_state,
  n_warm_up_iteration = n_warm_up_iteration,
  n_main_iteration = n_main_iteration,
  adapters = adapters
)
```

and similarly using the default normal noise proposal

```{r}
normal_noise_results <- sample_chain(
  target_distribution = target_distribution,
  proposal = normal_noise_proposal,
  initial_state = initial_state,
  n_warm_up_iteration = n_warm_up_iteration,
  n_main_iteration = n_main_iteration,
  adapters = adapters
)
```

## Comparing performance of proposals

For a realisation of a Markov chain $X_{1:N}$,
the ESJD metric used to define the notion of sampling efficiency optimized over in @vogrinc2023optimal,
can be estimated as

$$
  \widehat{\textsf{ESJD}}(X_{1:N}) = \frac{1}{N - 1} \sum_{n=1}^{N-1} \Vert X_{n+1} - X_n \Vert^2
$$
with corresponding R implementation as below

```{r}
expected_square_jumping_distance <- function(traces) {
  n_iteration <- nrow(traces)
  mean(rowSums(traces[2:n_iteration, ] - traces[1:n_iteration - 1, ])^2)
}
```

Applying this function to the chain traces generated using the Barker proposal with bimodal noise,


```{r}
cat(
  sprintf(
    "Expected square jumping distance using normal noise is %.2f",
    expected_square_jumping_distance(bimodal_noise_results$traces[, 1:dimension])
  )
)
```

and Barker proposal with normal noise,

```{r}
cat(
  sprintf(
    "Expected square jumping distance using bimodal noise is %.2f",
    expected_square_jumping_distance(normal_noise_results$traces[, 1:dimension])
  )
)
```

in both cases selecting columns `1:dimension` to exclude the traced log density values,
we find that the proposal with bimodal noise gives roughly a factor two improvement in ESJD compared to using normal noise,
correlating with the results in @vogrinc2023optimal.


```{r}
library(posterior)
```

Using the `posterior` R package, we can all compare how the chains perform in terms of the estimated _effective sample size_ (ESS) of the estimate of the mean of the target log density.

First considering the chain using the Barker proposal with bimodal noise


```{r}
cat(
  sprintf(
    "Estimated ESS of mean(target_log_density) using bimodal noise is %.0f",
    ess_mean(
      extract_variable(
        as_draws_matrix(bimodal_noise_results$traces), "target_log_density"
      )
    )
  )
)
```

and similarly for the Barker proposal with normal noise

```{r}
cat(
  sprintf(
    "Estimated ESS of mean(target_log_density) using normal noise is %.0f",
    ess_mean(
      extract_variable(
        as_draws_matrix(normal_noise_results$traces), "target_log_density"
      )
    )
  )
)
```

we again find that the bimodal noise variant appears to show significantly improved sampling efficiency.

## References