Introduction to XDNUTS package

library(XDNUTS)

Example with both continuous and discontinuous components

Consider the following probabilistic model \[\begin{align*} X & \sim Negbin(r,p) \notag \\ p & \sim Beta(a_0,b_0) \notag \\ n & \sim Unif(\{1,\dotsc,X\}) \notag \end{align*}\] \(X\) represents the number of trials necessary to obtain \(r\) successes in a series of independent and identically distributed events with probability \(p\). We can generate an arbitrary value for \(X\) and set the hyperparameters \(a_0\) and \(b_0\) as desired:

#observed data
X <- 50

#hyperparameteers
a0 <- 10
b0 <- 10

#list of arguments
arglist <- list(data = X, hyp =  c(a0,b0))

Since \(p\) is a continuous parameter and \(r\) is positive discrete, the classic Hamiltonian Monte Carlo algorithms don’t work, due to lack of differentiability. To address this, we can use the method described by Nishimura, Dunson, and Lu (2020) . First, we need to specify a continuous version of \(r\), which we will denote as \(\tilde{r}\). This continuous variable \(\tilde{r}\) is defined on the open interval \((1,X+1)\). Next, consider the trivial sequence \(\{a_n\}_1^{X+1}: a_n = n\). To transfer the mass probabilities of \(\mathbb{P}(r) = \frac{1}{X} \,,r = 1,\dotsc,X\) to the continuous interval spanning from \(a_{r}\) to \(a_{r+1}\) we divide by the length of this interval. Using this approach and the indicator function, the probability density function of \(\tilde{r}\) can be given by: \[ \pi_{\hat{R}}(\hat{r}) = \displaystyle\sum_{r = 1}^X \frac{\pi_R(r)}{a_{r+1} - a_r} \mathcal{I}(a_r < \tilde{r} \leq a_{r+1}) \] Since the support of this parameter is bounded, it is preferable to apply a transformation that maps \(\tilde{r}\) to \(\mathbb{R}\). Any bijective function can be used for this purpose; for example, the inverse logistic function serves this need: \[ \hat{r} = \log\left(\frac{\tilde{r} - 1}{ (X+1) - \tilde{r}}\right) \] The new discontinuous probability density can be derived from the cumulative distribution function. Interestingly, this approach yields the same result as applying the Jacobian determinant of the inverse transformation. \[ \pi_{\hat{R}}(\hat{r}) = \displaystyle\sum_{r = 1}^X \frac{\pi_R(r)}{a_{r+1} - a_r} \mathcal{I}(a_r < \tilde{r} \leq a_{r+1}) \cdot ( (X+1) - 1) \cdot\left(\frac{1}{1+e^{-\hat{r}}}\right)\cdot\left(\frac{e^{-\hat{r}}}{1+e^{-\hat{r}}}\right) \] where \(\tilde{r} = 1 + \frac{(X+1) - 1}{1 + e^{-\hat{r}}}\) and since \(\pi_R(r) = \frac{1}{X}\) the final result gives: \[ \pi_{\hat{R}}(\hat{r}) = \displaystyle\sum_{r = 1}^X \frac{\mathcal{I}(a_r < \tilde{r} \leq a_{r+1})}{a_{r+1} - a_r} \cdot\left(\frac{1}{1+e^{-\hat{r}}}\right)\cdot\left(\frac{e^{-\hat{r}}}{1+e^{-\hat{r}}}\right) \]

For the same reason, we could apply a similar transformation to map the support of \(p\) from \((0,1)\) to \(\mathbb{R}\), Define: \[ \omega = \log\left(\frac{p}{1-p}\right) \] which yields the following prior distribution for \(\omega\): \[ \pi(\omega) = \left(\frac{1}{1+e^{-\omega}}\right)^{a_0}\cdot\left(\frac{e^{-\omega}}{1+e^{-\omega}}\right)^{b_0} \] The resulting posterior distribution in this parameterization is given by: \[ \pi(\omega,\hat{r}|X) \propto \binom{X-1}{r-1} \cdot \left(\frac{1}{1+e^{-\omega}}\right)^{r + a_0}\cdot\left(\frac{e^{-\omega}}{1+e^{-\omega}}\right)^{X - r + b_0 } \cdot\left(\frac{1}{1+e^{-\hat{r}}}\right)\cdot\left(\frac{e^{-\hat{r}}}{1+e^{-\hat{r}}}\right) \] Ultimately, it can be easily shown that the negative logarithm of the posterior distribution and its derivative with respect to \(\omega\) are as follows: \[\begin{align*} -\log\pi(\omega,\hat{r}|X) \propto -\displaystyle\sum_{i = X - r + 1}^{X - 1} \log(i) &+ \displaystyle\sum_{i = 1}^{r-1}\log(i) + (X + a_0+b_0)\log\left(1+e^{-\omega}\right) \notag \\ &+ (X - r + b_0)\omega + \hat{r} + 2\log\left(1+e^{-\hat{r}}\right) \notag \end{align*}\] \[ \frac{\partial-\log\pi(\omega,\hat{r}|X)}{\partial\omega} = (X - r + b_0) - (X+a_0+b_0)\frac{e^{-\omega}}{1+e^{-\omega}} \] We can define an R function to evaluate these quantities. To be compatible with the package, the first argument should be the parameter values, the second should be the list of arguments necessary to evaluate the posterior, and the third should be a logical value: if TRUE, the function must return the negative logarithm of the posterior distribution; otherwise, it should return the gradient with respect to the continuous components, which, in this case, refers to the first element of the parameter vector.

#function for the negative log posterior and its gradient
#with respect to the continuous components
nlp <- function(par,args,eval_nlp = TRUE){
  
  if(eval_nlp){
    #only the negative log posterior
    
    #overflow
    if(any(par > 30)) return(Inf) 
    
    #conversion of the r value
    r <- floor(1 + args$data*plogis(par[2]))
    
    #output
    out <- sum(log(seq_len(r-1))) + 
      (args$data + args$hyp[1] + args$hyp[2])*log(1+exp(-par[1])) + 
      par[1]*(args$data - r + args$hyp[2]) + par[2] + 2*log(1+exp(-par[2]))
    if(r > 2) out <- out - sum(log(seq(args$data - r + 1,args$data - 1)))
    
    return(out)
    
  }else{
    #only the gradient
    
    #overflow
    if(any(par > 30)) return(Inf) 
    
    #conversion of the r value
    r <- floor(1 + args$data*plogis(par[2]))
    
    #output
    return( (args$data - r + args$hyp[2]) - 
              (args$data + args$hyp[1] + args$hyp[2])*(1-plogis(par[1])) )
  }
  
}

To run the algorithm, use the xdnuts function. For instance, if you want to run 4 chains, you need to provide a list of 4 initial vectors as the first argument.

#MCMC
set.seed(1)
chains <- xdnuts(theta0 = lapply(1:4,function(x) c(omega = rnorm(1),r_hat = rnorm(1))),
                 nlp = nlp,
                 args = arglist,
                 k = 1,
                 thin = 1,
                 parallel = FALSE,
                 method = "NUTS",
                 hide = TRUE)

After sampling, you can examine the results using the plot function:

plot(chains)

By default, it displays the trace plots for each marginal chain, which is useful for checking if the chains are mixing well. To view the marginal and bivariate densities, specify type = 2.

plot(chains, type = 2)

Examining the marginal densities can help diagnose poor convergence of the chains, though this does not seem to be an issue here.


Another useful plot can be generated with the type = 3 argument. This option overlays the energy level Markov chains. While it is less sensitive to convergence issues, it provides a summary of the global Markov process in a single chain, making it especially valuable for large models.

plot(chains, type = 3)


With type = 4, a stick plot of the trajectory lengths is displayed, with data from different chains overlaid. This plot serves as a good proxy for the computational cost of the procedure, as each step requires evaluating both the negative logarithmic posterior and its gradient multiple times.

plot(chains, type = 4)


With type = 5, the histogram of the marginal energy levels (in gray) is overlaid with the histogram of the corresponding proposal values (in red).

plot(chains, type = 5)

As with the classic Metropolis scheme, when the proposal distribution matches the target distribution well, there is effective exploration of the parameter space where the probability mass is highest. A poor match would indicate a suboptimal choice of the transition kernel, which in this case is determined solely by the choice of the momenta distribution. This would be reflected in a more leptokurtic red histogram compared to the gray histogram. For more details on this topic, see Betancourt (2016a).


With type = 6, the function plots the autocorrelation function for each chain and parameter. This is an effective way to diagnose slow exploration of the posterior distribution.

plot(chains, type = 6)


With type = 7, the function plots the empirical Metropolis acceptance rate and the refraction rate of the discontinuous component for each chain. This plot helps diagnose suboptimal adaptation of the step-size.

plot(chains, type = 7)


Last but not least, we can summarize the chain’s output using the summary function.

summary(chains)
#>          mean      sd       5%      25%     50%     75%     95%      ESS
#> omega 0.09927 0.45341 -0.64615 -0.20095 0.09966 0.40515 0.83205 2812.841
#> r_hat 0.09384 0.55231 -0.79911 -0.26481 0.09345 0.45826 0.98466 2600.518
#>         R_hat R_hat_upper_CI
#> omega 1.00257        1.00808
#> r_hat 1.00245        1.00712
#> 
#> Multivariate Gelman Test:  1.00212
#> Estimated Bayesian Fraction of Missing Information:  0.79836

Quantities of interest for each marginal distribution are returned, with the last two columns referring to the Potential Scale Reduction Factor statistic from Gelman and Rubin (1992). Values greater than 1.1 indicate non-convergence of the chains. Below the table, the multivariate version of this test is printed to the console, along with the Bayesian Fraction of Missing Information from Betancourt (2016a). If the latter is less than 0.2, it suggests slow exploration of the energy distribution.

Example with only discontinuous components

It is possible to reuse the same models adopted previously while treating the first parameters as inducing a discontinuity as well. In this case, you should set k=2 in the xdnuts function. For teaching purposes, this time the algorithm with the exhaustion termination criterion from Betancourt (2016b) will be used, specified by setting method = "XHMC". In this scenario, you must specify a threshold value for this method. A reasonable but sub-optimal value for this threshold seems to be tau = 1.

#MCMC
set.seed(1)
chains <- xdnuts(theta0 = lapply(1:4,function(x) rnorm(2)),
                 nlp = nlp,
                 args = arglist,
                 k = 2,
                 thin = 1,
                 parallel = FALSE,
                 method = "XHMC",
                 hide = TRUE,
                 tau = 1)

This time, we transform the chains back to their original parameterization using the function xdtransform

#define the function to be applied to each sample
f <- function(x,XX) {
  c(
    plogis(x[1]), #inverse logistic for the probability
    floor(1 + XX*plogis(x[2])) #number of trials
  )
}
original_chains <- xdtransform(X = chains, which = NULL,
                               FUN = f,XX = arglist$data,
                               new.names = c("p","r"))

To avoid repetitions, not all plots are displayed.

plot(original_chains, type = 2)

The imperfect overlap of the marginal densities suggests poorer chain convergence, primarily due to increased autocorrelation caused by removing the pure Hamiltonian algorithm for the first parameters. Increasing the sample size to N = 5e3 or using a greater thinning interval, such as thin = 5, can help address this issue.

plot(original_chains, type = 4)

This behavior is typical for this type of termination criterion, especially when applied to a discontinuous Hamiltonian system like this one. For some trajectories, the algorithm may never reach a termination, which significantly increases computational cost and can lead to the sampler proposing extreme values that might cause numerical issues. To avoid such problems, you can limit the depth of the binary tree trajectory by setting control = set_parameters(max_treedepth = 5).

plot(original_chains, type = 6)

As suggested by the first graph, the autocorrelations have increased, causing a slower convergence of the estimates.

summary(original_chains)
#>       mean      sd       5%      25%      50%      75%     95%      ESS   R_hat
#> p  0.56329 0.12356  0.34886  0.47866  0.56852  0.65195  0.7586 1256.005 1.00353
#> r 27.94750 7.35655 16.00000 23.00000 28.00000 33.00000 39.0000 1454.517 1.00320
#>   R_hat_upper_CI
#> p        1.01146
#> r        1.01097
#> 
#> Multivariate Gelman Test:  1.00372
#> Estimated Bayesian Fraction of Missing Information:  1.25657

It is noteworthy that the Effective Sample Size has decreased, as discussed by Geyer (2011). Additionally, the function could alerts us to the presence of premature termination of trajectories, as the second graph may suggests.

Example with only continuous components

For this example, we can use a Gaussian hierarchical model applied to the blood viscosity dataset, which is available in the package:

data(viscosity)
viscosity
#>   id Time.1 Time.2 Time.3 Time.4 Time.5 Time.6 Time.7
#> 1  1     68     42     69     64     39     66     29
#> 2  2     49     52     41     56     40     43     20
#> 3  3     41     40     26     33     42     27     35
#> 4  4     33     27     48     54     42     56     19
#> 5  5     40     45     50     41     37     34     42
#> 6  6     30     42     35     44     49     25     45

#create the list function
arglist <- list(data =  as.matrix(viscosity[,-1]),
                hyp = c(0, #mean a priori for mu
                        1000, #variance a priori for mu
                        0.5,1,0.5,1 #inverse gamma iperparameters
                        )
                )

This dataset contains 7 blood viscosity measurements for each of 6 different subjects. The model considered is as follows:

\[\begin{align*} y_{ij} &\sim N(a_j,\sigma^2) \quad, i = 1,\dotsc,I\,,j = 1,\dotsc, J \notag \\ a_j &\sim N(\mu,\sigma_a^2) \notag \\ \mu &\sim N(m_0,s_0^2) \notag \\ \sigma^2 &\sim InvGamma(a_0,b_0) \notag \\ \sigma_a^2 &\sim InvGamma(a_1,b_1) \notag \end{align*}\]

with \(I = 7\), \(J = 6\) and \(n = I\times J\).

The variance parameters are transformed using a logarithmic scale to facilitate the algorithm’s exploration of the parameter space: \(\omega = \log\sigma^2\) and \(\omega_a = \log\sigma^2_a\). In this case, only the final results are presented, as the intermediate calculations are not necessary.

\[\begin{align*} -\log\pi(\theta|y) \propto &\quad\omega(n + a_0) + 0.5e^{-\omega}\left[2b_0 + \sum_{i,j} (y_{i,j} - a_j)^2 \right] \notag \\ &+ \omega_a(J + a_1) + 0.5e^{-\omega_a}\left[2b_1 + \sum_{j=1}^J(a_j - \mu)^2\right] + \frac{\mu^2 - 2\mu m_0}{2s_0^2} \notag \end{align*}\]

\[\begin{align*} \frac{\partial-\log\pi(\theta|y)}{\partial \mu} &= \frac{\mu - m_0}{s_0^2} - e^{-\omega_a}\displaystyle\sum_{j=1}^J (a_j - \mu)^2 \notag \\ \frac{\partial-\log\pi(\theta|y)}{\partial \omega} &= (n+a_0) - 0.5e^{-\omega}\left[2b_0 + \sum_{i,j} (y_{i,j} - a_j)^2\right] \notag \\ \frac{\partial-\log\pi(\theta|y)}{\partial \omega_a} &= (J+a_1) - 0.5e^{-\omega_a}\left[ 2b_1 + \displaystyle\sum_{j=1}^J (a_j - \mu)^2 \right] \end{align*}\]

These computations are summarized in the following R function:

nlp <- function(par,args,eval_nlp = TRUE){
    
    if(eval_nlp){
      #only the negative log posterior

      #overflow
      if(any(abs(par[2:3]) > 30)) return(Inf)
      
      return(par[2] * ( prod(dim(args$data)) + args$hyp[3] ) + 
               (sum( (args$data - par[-(1:3)])^2 ) + 
                  2*args$hyp[4])*exp(-par[2])/2 +
               par[3] * (length(par[-(1:3)]) + 
                           args$hyp[5]) +  
               (sum( (par[-(1:3)] - par[1])^2 ) + 
                  2+args$hyp[6])*exp(-par[3])/2 +
               (par[1] - args$hyp[1])^2/2/args$hyp[2])
      
    }else{
      #only the gradient
      
      #overflow
      if(any(abs(par[2:3]) > 30)) return(rep(Inf,9))
      
      c(
        -sum( par[-(1:3)] - par[1] )*exp(-par[3]) + (
          par[1] - args$hyp[1])/args$hyp[2], #mu derivative
        
        prod(dim(args$data)) + args$hyp[3] - 
          (sum( (args$data - par[-(1:3)])^2 ) +
             2*args$hyp[4])*exp(-par[2])/2, #omega derivative
        
        length(par[-(1:3)]) + args$hyp[5] - 
          (sum( (par[-(1:3)] - par[1])^2 ) + 
             2+args$hyp[6])*exp(-par[3])/2, #omega_a derivative
        
        -apply(args$data - par[-(1:3)],1,sum)*exp(-par[2]) + 
          (par[-(1:3)] - par[1])*exp(-par[3]) #random effects gradient
      )
      
    }
  
}

The derivation of the gradient is recommended for computational efficiency; however, it is always possible to calculate it numerically using methods like the numDeriv::grad function.

Since this is the only algorithm not yet explored, we will use the standard Hamiltonian Monte Carlo with a fixed trajectory length L. In practice, the trajectory length L is sampled uniformly from the interval [max(1,L - L_jitter),L + L_jitter]. By default, L_jitter is set to 1, while L must be specified by the user. A sub-optimal but reasonable value for L is 20.

#MCMC
set.seed(1)
chains <- xdnuts(theta0 = lapply(1:4,function(x) {
                      out <- rnorm(3 + 6)
                      names(out) <- c("mu","log_sigma2","log_sigma2_a",
                                      paste0("mu",1:6))
                      out}),
                 nlp = nlp,
                 args = arglist,
                 k = 0, #no discontinuous components
                 thin = 1,
                 parallel = FALSE,
                 method = "HMC",
                 hide = TRUE,
                 L = 20,
                 control = set_parameters(L_jitter = 5))

We can transform the variance components back from their logarithmic parameterization using the xdtransform function. This time, only these components should be selected.

original_chains <- xdtransform(X = chains,which = 2:3,
                               FUN = exp,new.names = c("sigma2","sigma2_a"))

Since the parameter dimension is relatively high, we can examine the chain of energy level sets for a mixing diagnostic.

plot(original_chains, type = 3)

Although in a Bayesian context the only difference concerns the informativeness of the priors, we can plot the density plots of the and parameters separately.

plot(original_chains, type = 2, which = 1:3) #fixed

plot(original_chains, type = 2, which = 4:9) #random

Next, it is useful to check the autocorrelation plots for each chain:

plot(original_chains, type = 6)

As expected from theory Gelfand, Sahu, and Carlin (1995), the centered parameterization of the model makes exploring the random effects variance components challenging, which is reflected in the strong autocorrelation of their chains.
Finally, the summary function:

summary(original_chains)
#>              mean       sd       5%      25%      50%      75%      95%
#> mu       41.82778  1.33343 39.61992 40.94063 41.81027 42.70707 44.06480
#> sigma2   71.10762 12.36267 53.97771 62.69590 69.50171 77.68585 93.14380
#> sigma2_a  0.77694  0.69979  0.22426  0.36124  0.53629  0.93777  2.15117
#> mu1      42.68049  1.60977 40.24659 41.54516 42.63272 43.67899 45.45473
#> mu2      41.92201  1.47945 39.47879 40.94731 41.92428 42.89715 44.38296
#> mu3      41.34194  1.52379 38.87303 40.33515 41.33641 42.37289 43.78504
#> mu4      41.62059  1.48213 39.14401 40.68174 41.61924 42.61911 44.02651
#> mu5      41.77944  1.49325 39.42050 40.76230 41.74429 42.77138 44.29616
#> mu6      41.54661  1.49635 39.08980 40.53589 41.54691 42.54118 44.05494
#>                ESS   R_hat R_hat_upper_CI
#> mu       1072.9918 1.02082        1.04927
#> sigma2    412.5059 1.02488        1.06126
#> sigma2_a  216.0952 1.23888        1.70282
#> mu1       439.9831 1.03900        1.10723
#> mu2       940.9661 1.01616        1.03492
#> mu3       708.7164 1.03525        1.07290
#> mu4       841.0502 1.01340        1.03287
#> mu5       993.2040 1.02011        1.04982
#> mu6       855.8656 1.01754        1.04256
#> 
#> Multivariate Gelman Test:  1.12237
#> Estimated Bayesian Fraction of Missing Information:  1.5032

By using the xdextract function, you can rearrange the chains into a more convenient format, such as an array or a matrix. This is useful for computing posterior quantities, including model predictions.

#extract samples into matrix
theta <- xdextract(original_chains,collapse = TRUE)

#compute prediction
y_hat <- sapply(1:6, function(i){
  rnorm(NROW(theta),theta[,3+i],sqrt(theta[,2]))
})

#plot prediction
op <- par(no.readonly = TRUE)
par(mar=c(5.1, 4.1, 2.1, 4.1), xpd=TRUE)
plot(NULL, xlim = c(1,6), ylim = c(15,85), xlab = "Subject",
     ylab =  "Viscosity", bty = "L")
for(i in 1:6){
  #data
  points(rep(i,7),viscosity[i,-1], pch = 16)
  
  #data 95% credible intervals for the prediction
  lines(rep(i,2),quantile(y_hat[,i],c(0.025,0.975)), col = 3, lwd = 3)
  
  #random effects 95% credible intervals
  lines(rep(i,2),quantile(theta[,3+i],c(0.025,0.975)), col = 4, lwd = 4)
}
legend("topright",inset=c(-0.2,-0.2), lty = 1, lwd = 2, col = c(3,4),
       title = "95% Credible Intervals",
       legend = c("prediction","random effects"),
       bty = "n", bg = "transparent", cex = 0.8)

par(op)

The investigation of the first subject, which appears to be an outlier, is beyond the scope of this analysis. However, the hierarchical model’s property of borrowing strength across subjects seems to be effective.

Bibliography

Betancourt, Michael. 2016a. “Diagnosing Suboptimal Cotangent Disintegrations in Hamiltonian Monte Carlo.” arXiv Preprint arXiv:1604.00695.
———. 2016b. “Identifying the Optimal Integration Time in Hamiltonian Monte Carlo.” arXiv Preprint arXiv:1601.00225.
Gelfand, Alan E, Sujit K Sahu, and Bradley P Carlin. 1995. “Efficient Parametrisations for Normal Linear Mixed Models.” Biometrika 82 (3): 479–88.
Gelman, Andrew, and Donald B Rubin. 1992. “Inference from Iterative Simulation Using Multiple Sequences.” Statistical Science 7 (4): 457–72.
Geyer, Charles J. 2011. “Introduction to Markov Chain Monte Carlo.” Handbook of Markov Chain Monte Carlo 20116022 (45): 22.
Nishimura, Akihiko, David B Dunson, and Jianfeng Lu. 2020. “Discontinuous Hamiltonian Monte Carlo for Discrete Parameters and Discontinuous Likelihoods.” Biometrika 107 (2): 365–80.