---
title: "Introduction to XDNUTS package"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to XDNUTS package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: REFERENCES.bib
header-includes:
  - \usepackage{amsmath}
  - \usepackage{mathtools}
  - \usepackage{amsfonts}
  - \usepackage{amssymb}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r, echo = FALSE, eval = TRUE}
set.seed(123)
```

```{r setup, eval = TRUE}
library(XDNUTS)
```

## Example with both continuous and discontinuous components

Consider the following probabilistic model
\begin{align*}
X & \sim Negbin(r,p) \notag \\
p & \sim Beta(a_0,b_0) \notag \\
r & \sim Unif(\{1,\dotsc,X\}) \notag
\end{align*}
$X$ represents the number of trials necessary to obtain $r$ successes in a series of independent and identically distributed events with probability $p$. We can generate an arbitrary value for $X$ and set the hyperparameters $a_0$ and $b_0$ as desired:

```{r, eval = TRUE}
#observed data
X <- 50

#hyperparameters
a0 <- 10
b0 <- 10

#list of arguments
arglist <- list(data = X, hyp = c(a0,b0))
```

Since $p$ is a continuous parameter and $r$ is a positive discrete one, the classic Hamiltonian Monte Carlo algorithms do not work, due to the lack of differentiability. To address this, we can use the method described by @nishimura2020discontinuous. First, we need to specify a continuous version of $r$, which we will denote as $\tilde{r}$. This continuous variable $\tilde{r}$ is defined on the open interval $(1,X+1)$. Next, consider the trivial sequence $\{a_n\}_{n=1}^{X+1}$ with $a_n = n$. To transfer the probability mass $\mathbb{P}(r) = \frac{1}{X}$, $r = 1,\dotsc,X$, to the continuous interval spanning from $a_{r}$ to $a_{r+1}$, we divide it by the length of this interval. Using this approach and the indicator function, the probability density function of $\tilde{r}$ can be written as:
\[
\pi_{\tilde{R}}(\tilde{r}) = \displaystyle\sum_{r = 1}^X \frac{\pi_R(r)}{a_{r+1} - a_r} \mathcal{I}(a_r < \tilde{r} \leq a_{r+1})
\]
Since the support of this parameter is bounded, it is preferable to apply a transformation that maps $\tilde{r}$ to $\mathbb{R}$. Any bijective function can be used for this purpose; for example, the logit (inverse logistic) function serves this need:
\[
\hat{r} = \log\left(\frac{\tilde{r} - 1}{ (X+1) - \tilde{r}}\right)
\]
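Inverting this map gives $\tilde{r} = 1 + \frac{X}{1+e^{-\hat{r}}}$, i.e. `1 + X*plogis(r_hat)` in R. The following minimal sketch, with a hypothetical value of $\hat{r}$, shows how a point on the real line maps back to the embedded $\tilde{r}$ and then to the original discrete $r$; the same mapping is used inside the posterior function defined below.

```{r, eval = TRUE}
#sketch: from the unconstrained r_hat back to the discrete r
r_hat <- 0.3                       #hypothetical value on the real line
r_tilde <- 1 + X * plogis(r_hat)   #continuous embedding, lies in (1, X + 1)
r <- floor(r_tilde)                #discrete value, lies in {1, ..., X}
c(r_tilde = r_tilde, r = r)
```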
The new discontinuous probability density can be derived from the cumulative distribution function. Interestingly, this approach yields the same result as applying the Jacobian determinant of the inverse transformation:
\[
\pi_{\hat{R}}(\hat{r}) = \displaystyle\sum_{r = 1}^X \frac{\pi_R(r)}{a_{r+1} - a_r} \mathcal{I}(a_r < \tilde{r} \leq a_{r+1}) \cdot ( (X+1) - 1) \cdot\left(\frac{1}{1+e^{-\hat{r}}}\right)\cdot\left(\frac{e^{-\hat{r}}}{1+e^{-\hat{r}}}\right)
\]
where $\tilde{r} = 1 + \frac{(X+1) - 1}{1 + e^{-\hat{r}}}$. Since $\pi_R(r) = \frac{1}{X}$, the final result is:
\[
\pi_{\hat{R}}(\hat{r}) = \displaystyle\sum_{r = 1}^X \frac{\mathcal{I}(a_r < \tilde{r} \leq a_{r+1})}{a_{r+1} - a_r} \cdot\left(\frac{1}{1+e^{-\hat{r}}}\right)\cdot\left(\frac{e^{-\hat{r}}}{1+e^{-\hat{r}}}\right)
\]
For the same reason, we can apply a similar transformation to map the support of $p$ from $(0,1)$ to $\mathbb{R}$. Define:
\[
\omega = \log\left(\frac{p}{1-p}\right)
\]
which yields the following prior distribution for $\omega$:
\[
\pi(\omega) \propto \left(\frac{1}{1+e^{-\omega}}\right)^{a_0}\cdot\left(\frac{e^{-\omega}}{1+e^{-\omega}}\right)^{b_0}
\]
The resulting posterior distribution in this parameterization is given by:
\[
\pi(\omega,\hat{r}|X) \propto \binom{X-1}{r-1} \cdot \left(\frac{1}{1+e^{-\omega}}\right)^{r + a_0}\cdot\left(\frac{e^{-\omega}}{1+e^{-\omega}}\right)^{X - r + b_0 } \cdot\left(\frac{1}{1+e^{-\hat{r}}}\right)\cdot\left(\frac{e^{-\hat{r}}}{1+e^{-\hat{r}}}\right)
\]
Ultimately, it can easily be shown that the negative logarithm of the posterior distribution and its derivative with respect to $\omega$ are as follows:
\begin{align*}
-\log\pi(\omega,\hat{r}|X) \propto -\displaystyle\sum_{i = X - r + 1}^{X - 1} \log(i) &+ \displaystyle\sum_{i = 1}^{r-1}\log(i) + (X + a_0+b_0)\log\left(1+e^{-\omega}\right) \notag \\
&+ (X - r + b_0)\omega + \hat{r} + 2\log\left(1+e^{-\hat{r}}\right) \notag
\end{align*}
\[
\frac{\partial-\log\pi(\omega,\hat{r}|X)}{\partial\omega} = (X - r + b_0) - (X+a_0+b_0)\frac{e^{-\omega}}{1+e^{-\omega}}
\]
We can define an R function to evaluate these quantities. To be compatible with the package, the first argument should be the parameter values, the second should be the list of arguments necessary to evaluate the posterior, and the third should be a logical value: if `TRUE`, the function must return the negative logarithm of the posterior distribution; otherwise, it should return the gradient with respect to the continuous components, which, in this case, refers to the first element of the parameter vector.

```{r, eval = TRUE}
#function for the negative log posterior and its gradient
#with respect to the continuous components
nlp <- function(par,args,eval_nlp = TRUE){
  
  if(eval_nlp){
    #only the negative log posterior
    
    #overflow
    if(any(par > 30)) return(Inf)
    
    #conversion of the r value
    r <- floor(1 + args$data*plogis(par[2]))
    
    #output
    out <- sum(log(seq_len(r-1))) + 
      (args$data + args$hyp[1] + args$hyp[2])*log(1+exp(-par[1])) + 
      par[1]*(args$data - r + args$hyp[2]) + par[2] + 2*log(1+exp(-par[2]))
    #the binomial coefficient term is non-zero only when r > 1
    if(r > 1) out <- out - sum(log(seq(args$data - r + 1,args$data - 1)))
    
    return(out)
    
  }else{
    #only the gradient
    
    #overflow
    if(any(par > 30)) return(Inf)
    
    #conversion of the r value
    r <- floor(1 + args$data*plogis(par[2]))
    
    #output
    return( (args$data - r + args$hyp[2]) - 
              (args$data + args$hyp[1] + args$hyp[2])*(1-plogis(par[1])) )
  }
}
```
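Before running the sampler, it can be reassuring to compare the analytic gradient with a numerical one. The following is a minimal sketch at an arbitrary point, assuming the `numDeriv` package (also mentioned later in this vignette) is installed; the point `par0` is hypothetical.

```{r, eval = TRUE}
#sketch: check the analytic gradient against a numerical one
if (requireNamespace("numDeriv", quietly = TRUE)) {
  par0 <- c(0.5, -0.2) #arbitrary values for omega and r_hat
  #numerical derivative of the negative log posterior with respect to omega only
  num_grad <- numDeriv::grad(function(w) nlp(c(w, par0[2]), arglist, TRUE), par0[1])
  c(analytic = nlp(par0, arglist, FALSE), numerical = num_grad)
}
```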
To run the algorithm, use the `xdnuts` function. For instance, if you want to run 4 chains, you need to provide a list of 4 initial vectors as the first argument.

```{r, warning=FALSE, message=FALSE, eval = TRUE}
#MCMC
set.seed(1)
chains <- xdnuts(theta0 = lapply(1:4,function(x) c(omega = rnorm(1),r_hat = rnorm(1))),
                 nlp = nlp,
                 args = arglist,
                 k = 1,
                 thin = 1,
                 parallel = FALSE,
                 method = "NUTS",
                 hide = TRUE)
```

After sampling, you can examine the results using the `plot` function:

```{r, out.width='60%', out.height='60%', eval = TRUE}
plot(chains)
```

By default, it displays the trace plots for each marginal chain, which is useful for checking if the chains are mixing well. To view the marginal and bivariate densities, specify `type = 2`.

```{r, out.width='60%', out.height='60%', eval = TRUE}
plot(chains, type = 2)
```

Examining the marginal densities can help diagnose poor convergence of the chains, though this does not seem to be an issue here.
\
Another useful plot can be generated with the `type = 3` argument. This option overlays the energy level Markov chains. While it is less sensitive to convergence issues, it provides a summary of the global Markov process in a single chain, making it especially valuable for large models.

```{r, out.width='60%', out.height='60%', eval = TRUE}
plot(chains, type = 3)
```

\
With `type = 4`, a stick plot of the trajectory lengths is displayed, with data from different chains overlaid. This plot serves as a good proxy for the computational cost of the procedure, as each step requires evaluating both the negative logarithmic posterior and its gradient multiple times.

```{r, out.width='60%', out.height='60%', eval = TRUE}
plot(chains, type = 4)
```

\
With `type = 5`, the histogram of the marginal energy levels (in gray) is overlaid with the histogram of the corresponding proposal values (in red).

```{r, out.width='60%', out.height='60%', eval = TRUE}
plot(chains, type = 5)
```

As with the classic Metropolis scheme, when the proposal distribution matches the target distribution well, there is effective exploration of the parameter space where the probability mass is highest. A poor match would indicate a suboptimal choice of the transition kernel, which in this case is determined solely by the choice of the momenta distribution. This would be reflected in a more leptokurtic red histogram compared to the gray histogram. For more details on this topic, see @betancourt2016diagnosing.
\
With `type = 6`, the function plots the autocorrelation function for each chain and parameter. This is an effective way to diagnose slow exploration of the posterior distribution.

```{r, out.width='60%', out.height='60%', eval = TRUE}
plot(chains, type = 6)
```

\
With `type = 7`, the function plots the empirical Metropolis acceptance rate and the refraction rate of the discontinuous component for each chain. This plot helps diagnose suboptimal adaptation of the step-size.

```{r, out.width='60%', out.height='60%', eval = TRUE}
plot(chains, type = 7)
```

\
Last but not least, we can summarize the chains' output using the `summary` function.

```{r, warning=FALSE, message=FALSE, out.width='60%',out.height='60%', eval = TRUE}
summary(chains)
```

Quantities of interest for each marginal distribution are returned, with the last two columns referring to the Potential Scale Reduction Factor statistic from @gelman1992inference. Values greater than 1.1 indicate non-convergence of the chains. Below the table, the multivariate version of this test is printed to the console, along with the Bayesian Fraction of Missing Information from @betancourt2016diagnosing. If the latter is less than 0.2, it suggests slow exploration of the energy distribution.
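As a quick illustration of how the draws can be used, the sketch below maps them back to the original $(p, r)$ scale and computes posterior means. It relies on the `xdextract` function, described at the end of this vignette, and assumes it returns a matrix with one column per parameter, in the order given above.

```{r, eval = TRUE}
#sketch: posterior summaries on the original (p, r) scale
draws <- xdextract(chains, collapse = TRUE) #one column per parameter
p_draws <- plogis(draws[, 1])                #omega -> p
r_draws <- floor(1 + X * plogis(draws[, 2])) #r_hat -> r
c(p_mean = mean(p_draws), r_mean = mean(r_draws))
```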
## Example with only discontinuous components

It is possible to reuse the same model adopted previously while treating the first parameter as inducing a discontinuity as well. In this case, you should set `k = 2` in the `xdnuts` function. For teaching purposes, this time the algorithm with the exhaustion termination criterion from @betancourt2016identifying will be used, specified by setting `method = "XHMC"`. In this scenario, you must specify a threshold value for this method. A reasonable but sub-optimal value for this threshold seems to be `tau = 1`.

```{r, eval = TRUE}
#MCMC
set.seed(1)
chains <- xdnuts(theta0 = lapply(1:4,function(x) rnorm(2)),
                 nlp = nlp,
                 args = arglist,
                 k = 2,
                 thin = 1,
                 parallel = FALSE,
                 method = "XHMC",
                 hide = TRUE,
                 tau = 1)
```

This time, we transform the chains back to their original parameterization using the `xdtransform` function.

```{r, eval = TRUE}
#define the function to be applied to each sample
f <- function(x,XX) {
  c(
    plogis(x[1]),              #back-transform to the probability p
    floor(1 + XX*plogis(x[2])) #number of trials
  )
}

original_chains <- xdtransform(X = chains, which = NULL, FUN = f,
                               XX = arglist$data, new.names = c("p","r"))
```

To avoid repetition, not all plots are displayed.

```{r, out.width='60%', out.height='60%', eval = TRUE}
plot(original_chains, type = 2)
```

The imperfect overlap of the marginal densities suggests poorer chain convergence, primarily due to the increased autocorrelation caused by giving up the pure Hamiltonian algorithm for the first parameter. Increasing the sample size to `N = 5e3` or using a larger thinning interval, such as `thin = 5`, can help address this issue.

```{r, out.width='60%', out.height='60%', eval = TRUE}
plot(original_chains, type = 4)
```

This behavior is typical for this type of termination criterion, especially when applied to a discontinuous Hamiltonian system like this one. For some trajectories, the algorithm may never terminate, which significantly increases the computational cost and can lead the sampler to propose extreme values that might cause numerical issues. To avoid such problems, you can limit the depth of the binary tree trajectory by setting `control = set_parameters(max_treedepth = 5)`.

```{r, out.width='60%', out.height='60%', eval = TRUE}
plot(original_chains, type = 6)
```

As suggested by the first graph, the autocorrelations have increased, slowing the convergence of the estimates.

```{r, out.width='60%', out.height='60%', eval = TRUE}
summary(original_chains)
```

It is noteworthy that the Effective Sample Size has decreased, as discussed by @geyer2011introduction. Additionally, the function may alert us to the presence of prematurely terminated trajectories, as the second graph already suggested.

## Example with only continuous components

For this example, we can use a Gaussian hierarchical model applied to the blood viscosity dataset, which is available in the package:

```{r, eval = TRUE}
data(viscosity)
viscosity

#create the list of arguments
arglist <- list(data = as.matrix(viscosity[,-1]),
                hyp = c(0,    #prior mean for mu
                        1000, #prior variance for mu
                        0.5,1,0.5,1 #inverse gamma hyperparameters
                        )
                )
```

This dataset contains 7 blood viscosity measurements for each of 6 different subjects.
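Before specifying the model, a quick descriptive look at the data can be helpful. A minimal sketch, assuming (as in the argument list above) that the first column of `viscosity` holds the subject identifier and the remaining columns hold the measurements:

```{r, out.width='60%', out.height='60%', eval = TRUE}
#sketch: one boxplot of the seven measurements for each subject
boxplot(t(as.matrix(viscosity[,-1])),
        names = paste("Subject", 1:6),
        xlab = "Subject", ylab = "Viscosity")
```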
The model considered is as follows:
\begin{align*}
y_{ij} &\sim N(a_j,\sigma^2)\,, \quad i = 1,\dotsc,I\,,\; j = 1,\dotsc, J \notag \\
a_j &\sim N(\mu,\sigma_a^2) \notag \\
\mu &\sim N(m_0,s_0^2) \notag \\
\sigma^2 &\sim InvGamma(a_0,b_0) \notag \\
\sigma_a^2 &\sim InvGamma(a_1,b_1) \notag
\end{align*}
with $I = 7$, $J = 6$ and $n = I\times J$. The variance parameters are transformed using a logarithmic scale to facilitate the algorithm's exploration of the parameter space: $\omega = \log\sigma^2$ and $\omega_a = \log\sigma^2_a$. In this case, only the final results are presented, as the intermediate calculations are not necessary.
\begin{align*}
-\log\pi(\theta|y) \propto &\quad\omega(n + a_0) + 0.5e^{-\omega}\left[2b_0 + \sum_{i,j} (y_{i,j} - a_j)^2 \right] \notag \\
&+ \omega_a(J + a_1) + 0.5e^{-\omega_a}\left[2b_1 + \sum_{j=1}^J(a_j - \mu)^2\right] + \frac{\mu^2 - 2\mu m_0}{2s_0^2} \notag
\end{align*}
\begin{align*}
\frac{\partial-\log\pi(\theta|y)}{\partial \mu} &= \frac{\mu - m_0}{s_0^2} - e^{-\omega_a}\displaystyle\sum_{j=1}^J (a_j - \mu) \notag \\
\frac{\partial-\log\pi(\theta|y)}{\partial \omega} &= (n+a_0) - 0.5e^{-\omega}\left[2b_0 + \sum_{i,j} (y_{i,j} - a_j)^2\right] \notag \\
\frac{\partial-\log\pi(\theta|y)}{\partial \omega_a} &= (J+a_1) - 0.5e^{-\omega_a}\left[ 2b_1 + \displaystyle\sum_{j=1}^J (a_j - \mu)^2 \right]
\end{align*}
These computations are summarized in the following R function:

```{r, eval = TRUE}
nlp <- function(par,args,eval_nlp = TRUE){
  
  if(eval_nlp){
    #only the negative log posterior
    
    #overflow
    if(any(abs(par[2:3]) > 30)) return(Inf)
    
    return(par[2] * ( prod(dim(args$data)) + args$hyp[3] ) + 
             (sum( (args$data - par[-(1:3)])^2 ) + 2*args$hyp[4])*exp(-par[2])/2 + 
             par[3] * (length(par[-(1:3)]) + args$hyp[5]) + 
             (sum( (par[-(1:3)] - par[1])^2 ) + 2*args$hyp[6])*exp(-par[3])/2 + 
             (par[1] - args$hyp[1])^2/2/args$hyp[2])
    
  }else{
    #only the gradient
    
    #overflow
    if(any(abs(par[2:3]) > 30)) return(rep(Inf,9))
    
    c(
      -sum( par[-(1:3)] - par[1] )*exp(-par[3]) + 
        ( par[1] - args$hyp[1])/args$hyp[2], #mu derivative
      prod(dim(args$data)) + args$hyp[3] - 
        (sum( (args$data - par[-(1:3)])^2 ) + 2*args$hyp[4])*exp(-par[2])/2, #omega derivative
      length(par[-(1:3)]) + args$hyp[5] - 
        (sum( (par[-(1:3)] - par[1])^2 ) + 2*args$hyp[6])*exp(-par[3])/2, #omega_a derivative
      -apply(args$data - par[-(1:3)],1,sum)*exp(-par[2]) + 
        (par[-(1:3)] - par[1])*exp(-par[3]) #random effects gradient
    )
  }
}
```

The derivation of the gradient is recommended for computational efficiency; however, it is always possible to calculate it numerically using methods like the `numDeriv::grad` function (a brief sketch of this approach is given after the sampling step below). Since this is the only algorithm not yet explored, we will use the standard Hamiltonian Monte Carlo with a fixed trajectory length `L`. In practice, at each iteration the trajectory length is sampled uniformly from the interval [`max(1,L - L_jitter)`, `L + L_jitter`]. By default, `L_jitter` is set to 1, while `L` must be specified by the user. A sub-optimal but reasonable value for `L` is `20`.

```{r, eval = TRUE}
#MCMC
set.seed(1)
chains <- xdnuts(theta0 = lapply(1:4,function(x) {
                   out <- rnorm(3 + 6)
                   names(out) <- c("mu","log_sigma2","log_sigma2_a",
                                   paste0("mu",1:6))
                   out}),
                 nlp = nlp,
                 args = arglist,
                 k = 0, #no discontinuous components
                 thin = 1,
                 parallel = FALSE,
                 method = "HMC",
                 hide = TRUE,
                 L = 20,
                 control = set_parameters(L_jitter = 5))
```
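As mentioned above, if deriving the gradient analytically is inconvenient, a numerical fallback can be used instead. Below is a minimal sketch of such a wrapper; the name `nlp_num` is hypothetical and the `numDeriv` package is assumed to be installed. It is slower than the analytic version, but handy for prototyping a model or double-checking the gradient.

```{r, eval = TRUE}
#sketch of a numerical-gradient alternative to the analytic one
nlp_num <- function(par, args, eval_nlp = TRUE){
  if(eval_nlp){
    nlp(par, args, TRUE) #same negative log posterior as above
  }else{
    numDeriv::grad(function(x) nlp(x, args, TRUE), par) #numerical gradient
  }
}
```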
We can transform the variance components back from their logarithmic parameterization using the `xdtransform` function. This time, only these components should be selected.

```{r, eval = TRUE}
original_chains <- xdtransform(X = chains, which = 2:3, FUN = exp,
                               new.names = c("sigma2","sigma2_a"))
```

Since the parameter dimension is relatively high, we can examine the chain of energy level sets for a mixing diagnostic.

```{r, out.width='60%', out.height='60%', eval = TRUE}
plot(original_chains, type = 3)
```

Although in a Bayesian context the only difference concerns the informativeness of the priors, we can plot the densities of the *fixed* and *random* parameters separately.

```{r, out.width='60%', out.height='60%', eval = TRUE}
plot(original_chains, type = 2, which = 1:3) #fixed
```

```{r, out.width='60%', out.height='60%', eval = TRUE}
plot(original_chains, type = 2, which = 4:9) #random
```

Next, it is useful to check the autocorrelation plots for each chain:

```{r, out.width='60%', out.height='60%', eval = TRUE}
plot(original_chains, type = 6)
```

As expected from the theory [@gelfand1995efficient], the centered parameterization of the model makes exploring the random effects variance components challenging, which is reflected in the strong autocorrelation of their chains.
\
Finally, the summary function:

```{r, out.width='60%', out.height='60%', eval = TRUE}
summary(original_chains)
```

By using the `xdextract` function, you can rearrange the chains into a more convenient format, such as an array or a matrix. This is useful for computing posterior quantities, including model predictions.

```{r, eval = TRUE}
#extract samples into a matrix
theta <- xdextract(original_chains, collapse = TRUE)

#compute the predictions
y_hat <- sapply(1:6, function(i){
  rnorm(NROW(theta), theta[,3+i], sqrt(theta[,2]))
})

#plot the predictions
op <- par(no.readonly = TRUE)
par(mar=c(5.1, 4.1, 2.1, 4.1), xpd=TRUE)
plot(NULL, xlim = c(1,6), ylim = c(15,85), xlab = "Subject", ylab = "Viscosity",
     bty = "L")
for(i in 1:6){
  #data
  points(rep(i,7), viscosity[i,-1], pch = 16)
  
  #95% credible intervals for the prediction
  lines(rep(i,2), quantile(y_hat[,i], c(0.025,0.975)), col = 3, lwd = 3)
  
  #95% credible intervals for the random effects
  lines(rep(i,2), quantile(theta[,3+i], c(0.025,0.975)), col = 4, lwd = 4)
}
legend("topright", inset = c(-0.2,-0.2), lty = 1, lwd = 2, col = c(3,4),
       title = "95% Credible Intervals", legend = c("prediction","random effects"),
       bty = "n", bg = "transparent", cex = 0.8)
par(op)
```

The investigation of the first subject, which appears to be an outlier, is beyond the scope of this analysis. However, the hierarchical model's property of borrowing strength across subjects seems to be effective.

## Bibliography