Mhorseshoe

In this vignette, the functions exact_horseshoe and approx_horseshoe are explained in the following sections.

  1. About the horseshoe estimator in this package
  2. Modified approximate algorithm

1. About the horseshoe estimator in this package

The horseshoe prior is a continuous shrinkage prior frequently used in high-dimensional Bayesian linear models; it can be applied effectively to sparse data and has theoretically well-established shrinkage properties. Mhorseshoe provides estimators that apply the horseshoe prior. The likelihood of the Gaussian linear model considered in this package is as follows:

\[L(y\ |\ X, \beta, \sigma^2) = \left(2\pi\sigma^{2}\right)^{-N/2}\exp \left\{ -\frac{1}{2\sigma^2}(y-X\beta)^T(y-X\beta)\right\},\\ X \in \mathbb{R}^{N \times p},\ y \in \mathbb{R}^{N},\ \beta \in \mathbb{R}^{p}\]

The hierarchical model obtained by placing the horseshoe prior on \(\beta\) is expressed as follows:

\[\beta_{j}\ |\ \sigma^{2}, \tau^{2}, \lambda_{j}^{2} \overset{i.i.d.}{\sim} N\left(0, \sigma^{2}\tau^{2}\lambda_{j}^{2} \right), \quad \lambda_{j}\overset{i.i.d.}{\sim}C^{+}(0,1),\quad j=1,2,...,p, \\ \tau \sim C^{+}(0,1),\quad p(\sigma^2)\propto \mathrm{Gamma}\left( w/2,\ w/2\right).\]

Here \(C^{+}(0,1)\) denotes the standard half-Cauchy distribution, \(\lambda_{1}, ..., \lambda_{p}\) are the local shrinkage parameters, \(\tau\) is the global shrinkage parameter, and \(w\) is a hyperparameter of the prior on \(\sigma^{2}\). The exact_horseshoe function in this package is a horseshoe estimator for the prior defined above, following the exact MCMC algorithm of Johndrow et al. (2020). It adopts a blocked Metropolis-within-Gibbs scheme that samples \(\lambda\) and \(\left(\tau, \sigma, \beta \right)\). Since the algorithm targets the case \(p \gg N\), the computational cost of the \(\beta\)-sampling step is lowered by the fast sampling scheme of Bhattacharya et al. (2016).
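
For reference, the sketch below illustrates the fast sampling scheme of Bhattacharya et al. (2016) for drawing \(\beta\) from its conditional posterior \(N(A^{-1}X^{T}y/\sigma^{2},\ A^{-1})\) with \(A = X^{T}X/\sigma^{2} + (\sigma^{2}\tau^{2}\Lambda)^{-1}\). This is a minimal illustration only; the function name fast_beta_sample is hypothetical and the internal implementation of exact_horseshoe may differ.

# Minimal sketch of the fast sampler of Bhattacharya et al. (2016).
# fast_beta_sample is a hypothetical name, not part of the Mhorseshoe package.
fast_beta_sample <- function(X, y, sigma2, tau2, lambda2) {
  N <- nrow(X)
  p <- ncol(X)
  d <- sigma2 * tau2 * lambda2            # diagonal of the prior covariance D
  Phi <- X / sqrt(sigma2)                 # scaled design matrix
  alpha <- y / sqrt(sigma2)
  u <- stats::rnorm(p, mean = 0, sd = sqrt(d))
  v <- as.vector(Phi %*% u) + stats::rnorm(N)
  # Solve an N x N system instead of a p x p system (cheap when p >> N).
  M <- Phi %*% (d * t(Phi)) + diag(N)
  w <- solve(M, alpha - v)
  as.vector(u + d * crossprod(Phi, w))    # one draw of beta
}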

For the simulation, the data size was set to N = 300 and p = 500, and the design matrix \(X=\{x_{j} \}_{j=1}^{p}\) and the true value of \(\beta\) were set as follows.

\[x_{j} \sim N_{N}\left(0, I_{N} \right), \\ y_{i} \sim N(x_{i}\beta, 4), \\ \beta_{j} = \begin{cases} 1, & \mbox{if }\ \mbox{j<51,} \\ 0, & \mbox{otherwise.}\end{cases}\]

Note that this data is intended only for testing the application of the algorithms; it is not a simulation that strictly satisfies sparsity conditions.

# packages used in this vignette.
library(Mhorseshoe)
library(ggplot2)

# making simulation data.
set.seed(123)
N <- 300
p <- 500
p_star <- 50
true_beta <- c(rep(1, p_star), rep(0, p-p_star))

# design matrix X.
X <- matrix(1, nrow = N, ncol = p)
for (i in 1:p) {
  X[, i] <- stats::rnorm(N, mean = 0, sd = 1)
}
  
# response variable y.
y <- vector(mode = "numeric", length = N)
e <- rnorm(N, mean = 0, sd = 2)
for (i in 1:p_star) {
  y <- y + true_beta[i] * X[, i]
}
y <- y + e

For this simulation data, the results of exact_horseshoe were compared with those of the horseshoe function in the horseshoe package. We fit both functions to the data and plot the results for the first 100 indices, which include the first 50 indices whose true coefficients are non-zero.

# horseshoe in horseshoe package.
horseshoe_result <- horseshoe::horseshoe(y, X, method.tau = "halfCauchy",
                                         method.sigma = "Jeffreys",
                                         burn = 0, nmc = 500)

# exact_horseshoe in Mhorseshoe package.
exact_horseshoe_result <- exact_horseshoe(y, X, burn = 0, iter = 500)

df <- data.frame(index = 1:100,
                 horseshoe_BetaHat = horseshoe_result$BetaHat[1:100],
                 horseshoe_LeftCI = horseshoe_result$LeftCI[1:100],
                 horseshoe_RightCI = horseshoe_result$RightCI[1:100],
                 exhorseshoe_BetaHat = exact_horseshoe_result$BetaHat[1:100],
                 exhorseshoe_LeftCI = exact_horseshoe_result$LeftCI[1:100],
                 exhorseshoe_RightCI = exact_horseshoe_result$RightCI[1:100],
                 true_beta = true_beta[1:100])

# Estimation results of the horseshoe in horseshoe package.
ggplot(data = df, aes(x = index, y = true_beta)) + 
  geom_point(size = 2) + 
  geom_point(aes(x = index, y = horseshoe_BetaHat), size = 2, col = "red") +
  geom_errorbar(aes(ymin = horseshoe_LeftCI,
                    ymax = horseshoe_RightCI), width = .1, col = "red") +
  labs(title = "95% Credible intervals of the horseshoe in horseshoe package", 
       y = "beta")


# Estimation results of the exact_horseshoe in Mhorseshoe package.
ggplot(data = df, aes(x = index, y = true_beta)) + 
  geom_point(size = 2) + 
  geom_point(aes(x = index, y = exhorseshoe_BetaHat), 
             size = 2, col = "red") +
  geom_errorbar(aes(ymin = exhorseshoe_LeftCI, 
                    ymax = exhorseshoe_RightCI), width = .1, col = "red") +
  labs(title = "95% Credible intervals of the exact_horseshoe in Mhorseshoe package",
       y = "beta")

The results of the two functions are practically the same. For both, the 95% credible intervals of all non-zero indices except 4, 36, 41, and 46 include 1.

2. Modified approximate algorithm

Johndrow et al. (2020) proposed a scalable approximate MCMC algorithm that reduces the computational cost by introducing a thresholding method while applying the horseshoe prior. The set of columns satisfying \(\tau^{2}\lambda_{j}^{2} > \delta\) is defined as the active set, and \(S\) denotes the set of indices of those columns:

\[S = \{j\ |\ \tau^{2}\lambda_{j}^{2} > \delta,\ j=1,2,...,p. \}.\]

If \(\tau^{2}\lambda_{j}^{2}\) is very small, the conditional posterior of \(\beta_{j}\) has mean and variance close to 0. Let \(s_{\delta}\) be the number of elements of \(S\), and let \(\Lambda = diag(\lambda_1^2,...,\lambda_p^2)\) be the diagonal matrix of the local shrinkage parameters. The approximate algorithm uses the reduced matrix \(\Lambda_{S} \in \mathbb{R}^{s_{\delta} \times s_{\delta}}\) in the main sampling steps, which lowers the computational cost. This algorithm is implemented in the approx_horseshoe function. Johndrow et al. (2020) use a fixed \(\delta\) in the approximate algorithm; in contrast, this package applies a new method that sets \(\delta\) through an adaptive probability.
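
As an illustration of the thresholding step (a hedged sketch using hypothetical names and placeholder values for the current shrinkage parameters, not the internals of approx_horseshoe), the active set \(S\) and the reduced matrix \(\Lambda_{S}\) could be formed as follows.

# Illustrative sketch of forming the active set S and the reduced matrix
# Lambda_S for a given threshold delta; tau and lambda are placeholders
# standing in for the current values in the sampler.
tau    <- 0.1
lambda <- abs(stats::rcauchy(p))
delta  <- 0.002                               # example threshold
S        <- which(tau^2 * lambda^2 > delta)   # indices of active columns
s_delta  <- length(S)                         # size of the active set
Lambda_S <- diag(lambda[S]^2, nrow = s_delta) # reduced diagonal matrix
X_S      <- X[, S, drop = FALSE]              # active columns of X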

The adaptive algorithm developed in this package computes a new threshold \(\delta^{New}\) from the shrinkage parameters updated at each MCMC iteration and adopts \(\delta^{New}\) according to an adaptive probability. The threshold update applies the properties of the shrinkage weights \(k_{j},\ j=1,2,...,p\), proposed by Piironen and Vehtari (2017). Under the prior \(\beta_{j} \sim N(0, \sigma^{2}\tau^{2}\lambda_{j}^{2})\), the quantity \(m_{eff}\) is defined as follows.

\[k_{j} = 1/\left(1+n\tau^{2}s_{j}^{2}\lambda_{j}^{2} \right), \quad m_{eff} = \sum_{j=1}^{p}{\left(1-k_{j} \right)}.\]

Here \(ns_{j}^2,\ j=1,2,...,p\), are the diagonal components of \(X^{T}X\). For the zero components of \(\beta\), \(k_{j}\) is close to 1, while for the nonzero components \(k_{j}\) is close to 0, so \(m_{eff}\) is called the effective number of nonzero coefficients. In this algorithm, \(\delta\) is updated so that \(s_{\delta} = ceiling(m_{eff})\). The adaptive probability is defined to satisfy Theorem 5 (the diminishing adaptation condition) of Roberts and Rosenthal (2007). At the \(T\)th iteration,

\[p(T) = \exp\left(p_{0} + p_{1}T\right),\quad p_{1} < 0, \quad u \sim U(0, 1), \\ \mbox{if } u < p(T),\ \mbox{accept } \delta^{New} \mbox{ with } s_{\delta^{New}} = ceiling(m_{eff}).\]

The defaults are \(p_{0} = 0\) and \(p_{1} = -4.6 \times 10^{-4}\); under this setting, \(p(10000) \approx 0.01\). The approximate algorithm with this added procedure is called the modified approximate algorithm, and a simulation is performed below.
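
Continuing the sketch above, the following hedged illustration shows one plausible form of the threshold update: compute \(k_{j}\) and \(m_{eff}\), place \(\delta^{New}\) just below the \(ceiling(m_{eff})\)-th largest \(\tau^{2}\lambda_{j}^{2}\) so that \(s_{\delta^{New}} = ceiling(m_{eff})\), and accept it with probability \(p(T)\). The variable names are hypothetical and the internal implementation of approx_horseshoe may differ.

# Illustrative sketch of the adaptive threshold update (hypothetical names).
k     <- 1 / (1 + colSums(X^2) * tau^2 * lambda^2)  # k_j; colSums(X^2) = diag(X^T X)
m_eff <- sum(1 - k)                                 # effective number of nonzero coefficients
s_new <- max(1, ceiling(m_eff))                     # target size of the active set
# place the new threshold just below the s_new-th largest tau^2 * lambda_j^2
ranked    <- sort(tau^2 * lambda^2, decreasing = TRUE)
delta_new <- ranked[s_new] * (1 - 1e-8)
# adaptive probability at iteration T with the defaults p0 = 0, p1 = -4.6e-4
p0 <- 0
p1 <- -4.6e-4
T_iter <- 100
if (stats::runif(1) < exp(p0 + p1 * T_iter)) delta <- delta_new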

# approximate algorithm with fixed default threshold.
approx_horseshoe_result <- approx_horseshoe(y, X, burn = 0, iter = 500, 
                                            auto.threshold = FALSE)
#> You chose FALSE for the auto.threshold argument. and since threshold = 0, set it to the default value of 0.002.

# modified approximate algorithm.
mapprox_horseshoe_result <- approx_horseshoe(y, X, burn = 0, iter = 500,
                                             auto.threshold = TRUE)

df2 <- data.frame(index = 1:100,
                  approx_BetaHat = approx_horseshoe_result$BetaHat[1:100],
                  approx_LeftCI = approx_horseshoe_result$LeftCI[1:100],
                  approx_RightCI = approx_horseshoe_result$RightCI[1:100],
                  mapprox_BetaHat = mapprox_horseshoe_result$BetaHat[1:100],
                  mapprox_LeftCI = mapprox_horseshoe_result$LeftCI[1:100],
                  mapprox_RightCI = mapprox_horseshoe_result$RightCI[1:100],
                  true_beta = true_beta[1:100])

# Estimation results of the approximate algorithm.
ggplot(data = df2, aes(x = index, y = true_beta)) + 
  geom_point(size = 2) + 
  geom_point(aes(x = index, y = approx_BetaHat), size = 2, col = "red") +
  geom_errorbar(aes(ymin = approx_LeftCI, 
                    ymax = approx_RightCI), width = .1, col = "red") +
  labs(title = "95% Credible intervals of the approx_horseshoe", y = "beta")


# Estimation results of the modified approximate algorithm.
ggplot(data = df2, aes(x = index, y = true_beta)) + 
  geom_point(size = 2) + 
  geom_point(aes(x = index, y = mapprox_BetaHat), 
             size = 2, col = "red") +
  geom_errorbar(aes(ymin = mapprox_LeftCI, 
                    ymax = mapprox_RightCI), 
                width = .1, col = "red") +
  labs(title = "95% Credible intervals of the modified_approx_horseshoe", 
       y = "beta")

In the case of exact_horseshoe, the full \(\Lambda\) is always used, so \(length(S) = p = 500\) at every iteration. On the other hand, since the approximate algorithm and the modified approximate algorithm set the threshold differently, their active set sizes \(length(S) = s_{\delta}\) differ.

exact_activeset <- rep(p, 500)
approx_activeset <- apply(approx_horseshoe_result$ActiveSet, MARGIN = 1, sum)
mapprox_activeset <- apply(mapprox_horseshoe_result$ActiveSet, MARGIN = 1, sum)

# active set plot
ggplot(data = data.frame(X = 1:500,
                         exact_activeset = exact_activeset,
                         approx_activeset = approx_activeset,
                         mapprox_activeset = mapprox_activeset)) +
  geom_line(mapping = aes(x = X, y = exact_activeset,
                          color = "exact")) +
  geom_line(mapping = aes(x = X, y = approx_activeset,
                          color = "approx"),
            alpha = 0.5) +
  geom_line(mapping = aes(x = X, y = mapprox_activeset,
                          color = "modified_approx"),
            alpha = 0.5) +
  scale_color_manual(name = "algorithm",
                     values = c("black", "red", "blue"), 
                     breaks = c("exact", "approx", "modified_approx"), 
                     labels = c("exact", "approx", "modified_approx"))

References

Bhattacharya, A., Chakraborty, A., & Mallick, B. K. (2016). Fast sampling with Gaussian scale mixture priors in high-dimensional regression. Biometrika, 103(4), 985-991.

Johndrow, J., Orenstein, P., & Bhattacharya, A. (2020). Scalable Approximate MCMC Algorithms for the Horseshoe Prior. Journal of Machine Learning Research, 21.

Piironen, J., & Vehtari, A. (2017). Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11, 5018-5051.

Roberts, G. O., & Rosenthal, J. S. (2007). Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. Journal of Applied Probability, 44, 458-475.