---
title: "Beta Conjugate Prior"
author: "Thomas I. J. Bartos, Michael A. Spence, Michael J. Thomson"
date: "2024-06-5"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Beta Conjugate Prior}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
references: 
- id: ices_ns
  title: "Report of the Working Group on the Assessment of Demersal Stocks in the North Sea and Skagerrak"
  author: 
    - family: ICES
  issued:
    year: 2020
  type: report
  container-title: "ICES Scientific Reports. 2:61. 1353, ICES, Copenhagen"
  doi: "10.17895/ices.pub.6092"
- id: lkj
  title: "Generating random correlation matrices based on vines and extended onion method"
  author:
  - family: Lewandowski
    given: Daniel
  - family: Kuratowicka
    given: Dorota
  - family: Joe
    given: Harry
  container-title: Journal of Multivariate Analysis
  volume: 100
  page: 1989--2001
  type: article-journal 
  issued:
    month: October
    year: 2009
- id: conj_prior_wiki
  url: https://en.wikipedia.org/wiki/Conjugate_prior
---

```{r setup_lies,warning=FALSE}
library(EcoEnsemble)
set.seed(1234)
```

## Introduction

In this document we demonstrate the use of the beta conjugate prior distribution for the hyperparameters of the hierarchical prior for the correlation matrix entries governing the dependencies of the individual short-term discrepancy dynamics between variables of interest. We demonstrate the usage based on the built-in example of four fish species in the North Sea.

## Short-term individual discrepancy

Suppose we have $M$ simulators with up to $N$ variables of interest predicted by each simulator. The short-term individual discrepancy for the $m$th model follows an auto-regressive process,
\begin{equation*}
\mathbf{z}_{m}^{(t)}\sim{}N(R_{m}\mathbf{z}_{m}^{(t-1)},\Lambda_{m})
\end{equation*}
with $R_{m}=\text{diag}(r_{m,1},\ldots,r_{m,4})$
\begin{equation*}
\Lambda_{m} = \Pi_{m}  P_{m} \Pi_{m},
\end{equation*}
where $P_{m}$ is a correlation matrix and $\Pi_{m}=\text{diag}(\pi_{m,1},\ldots,\pi_{m,4})$.

The distribution is
\begin{equation}
    f(P_m | \alpha, \beta) \propto \begin{cases}
    \prod_{i = 1}^{N-1}\prod_{j = i+ 1}^{N} \frac{1}{\mathrm{B}(\alpha_{\rho,i,j}, \beta_{\rho,i, j})} s(P_{m,i,j})^{\alpha_{\rho,i, j} - 1}\left(1-s(P_{m,i, j})\right)^{\beta_{\rho,i, j} - 1}&\text{if }P_{m}\text{ is positive definite}\\
    0 &\text{otherwise}.
    \end{cases}
\end{equation}
with 
\begin{equation}
    s(\rho)=\frac{\rho+1}{2},
\end{equation}
and where we need to also specify hyperpriors on the hyperparameters $\alpha_{\rho, i, j}$ and $\beta_{\rho, i, j}$. Conditional on $P_m$ being positive definite, the marginal distribution of each off-diagonal entry of $P_m$ is a scaled beta distribution. Hence one option is to place a conjugate prior on the parameters of the derived un-scaled beta distribution, which means placing a conjugate prior on the joint distribution of $(\alpha_{\rho, i, j}, \beta_{\rho, i,j})$. It is known that up to proportionality, this prior is ([https://en.wikipedia.org/wiki/Conjugate_prior][@conj_prior_wiki]),
\begin{equation}
g(\alpha_{\rho, i, j}, \beta_{\rho, i,j}\,|\,p, q, k) \propto \frac{\Gamma(\alpha_{\rho, i, j} + \beta_{\rho, i, j})^{k}p^{\alpha_{\rho, i, j}}q^{\beta_{\rho, i, j}}}{\Gamma(\alpha_{\rho, i, j})^{k}\Gamma(\beta_{\rho, i, j})^{k}}\,,   
\end{equation}
for given positive real parameters $p$, $q$ and $k$. We refer to the distribution defined by the above density as the *Beta conjugate prior*. The interpretation of the prior in the case that $k$ is a positive integer and $0 < p\,,\,q < 1$ is that $\alpha_{\rho, i, j}$ and $\beta_{\rho, i, j}$ have been estimated from $k$ observations with product $p$ and product of complements $q$. The equivalent of an improper prior occurs when $k \approx 0$ and $p = q = 1$.

## Exploring the prior

### New parameterisation

We introduce a new parameterisation $(r, s, k)$ of the Beta conjugate prior, which restricts the original parameters $(p, q, k)$ to a subspace of reasonable choices available to the user. We keep $k > 0$ the same and assume we can express the original parameters as $p = z^{k}w^k$ and $q = (1-z)^{k}w^k$ where we require $0 \leq z, w \leq 1$. Then we set $r$ to be the logit transform of $z$, $r = \log\big(\frac{z}{1-z}\big) = \frac{1}{k}\log\big(\frac{p}{q}\big)$. The second parameter is a scale parameter $s$ controlling the order of magnitude of $p$ and $q$, which we define as the logit transform of $w$, $s = \log\big(\frac{w}{1-w}\big)$.
In the case that $k$ is an integer we can interpret the values $z$, $w$ and $k$ as follows: we form our prior using $k$ observations $x_{1}, \ldots, x_{k}$ of a Beta distribution where $w$ is the sum of the geometric means of the observations $x_i$ and observed complements $1-x_{i}$:
$$w = \bigg(\prod_{i = 1}^{k}x_{i}\bigg)^{\frac{1}{k}} + \bigg(\prod_{i = 1}^{k}(1-x_{i})\bigg)^{\frac{1}{k}}\,.$$
We interpret $z$ as the ratio of the geometric mean of the observations $x_i$ to $w$:
$$z = \frac{\bigg(\prod_{i = 1}^{k}x_{i}\bigg)^{\frac{1}{k}}}{\bigg(\prod_{i = 1}^{k}x_{i}\bigg)^{\frac{1}{k}} + \bigg(\prod_{i = 1}^{k}(1-x_{i})\bigg)^{\frac{1}{k}}}\,.$$
Note however, that $k$ need not be integer-valued in general. In the plots below, we show the marginal distribution of $\rho_{m,1,2}$ with $m = 5$ in a number of cases, with the green curve being the resulting marginal of the LKJ(1) distribution, which gives a uniform distribution over all correlation matrices (@lkj). Sampling from the prior with $k = 0.05$ we have
```{r eval = FALSE, message=FALSE, warning=FALSE, results='hide', silent=TRUE}
cor_pri_st <- rstan::stan_model(model_code = " functions{
  real priors_cor_hierarchical_beta(matrix ind_st_cor, int N, matrix M){
    real log_prior = 0;
    for (i in 1:(N-1)){
      for (j in (i+1):N){
        log_prior += beta_lpdf(0.5*(ind_st_cor[i, j] + 1)| M[i, j], M[j, i] );
      }
    }
    return log_prior;
  }

  real beta_conj_prior(real alpha, real betas, real r, real s, real k){
    real rl = 1/(1 + exp(-r));
    real sl = 1/(1 + exp(-s));
    real p = (sl * rl)^k;
    real q = (sl * (1 - rl))^k;
    real ret = alpha * log(p) + betas * log(q) - k * lbeta(alpha,betas);
    return ret;
  }

}

data {
  vector[3] cor_p;
}

parameters {
  matrix <lower=0>[5,5] beta_pars;
  corr_matrix[5] rho[4];
}

model {
  for (i in 1:4){
    for (j in (i+1):5){
      target += beta_conj_prior(beta_pars[i,j], beta_pars[j,i], cor_p[1], cor_p[2], cor_p[3]);
    }
  }

  for (i in 1:4){
    target += priors_cor_hierarchical_beta(rho[i],5,beta_pars);
    diagonal(beta_pars) ~ std_normal();
  }

}

generated quantities {

  matrix [5,5] rhovars;
  for (i in 1:4){
    for (j in (i+1):5){
      rhovars[i,j] = 4 * (beta_pars[i,j] * beta_pars[j,i])/(square(beta_pars[i,j] + beta_pars[j,i]) * (beta_pars[i,j] + beta_pars[j,i] + 1));
      rhovars[j,i] = (2 * beta_pars[i,j]/(beta_pars[i,j] + beta_pars[j,i])) - 1;
    }
  }
  
  for (i in 1:5){
    rhovars[i,i] = 4;
  }
  
}
")
library(ggplot2)
rhoplots <- list()
kvals <- c(0.05, 5)
parvals <- do.call(expand.grid, c(rep(list(c(0.25, 3)), 2), list(kvals)))
#Sampling and gathering outputs for plotting
for (i in 1:nrow(parvals)){
  fit_cor <- rstan::sampling(cor_pri_st, data = list(cor_p=as.numeric(parvals[i,])), iter = 2000, chains=4)
  ex.fit <- rstan::extract(fit_cor)
  rho_density <- density(ex.fit$rho[,1,1,2], from = -1, to = 1)
  rhoplot_data <- data.frame(rho_density$x, rho_density$y)
  names(rhoplot_data) <- c("rho", "Density")
  rhoplot_data <- cbind(rhoplot_data, rep(parvals[i,1], nrow(rhoplot_data)), rep(parvals[i,2], nrow(rhoplot_data)), rep(parvals[i,3], nrow(rhoplot_data)))
  names(rhoplot_data)[3:5] <- c("r", "s", "k")
  rhoplots[[i]] <- rhoplot_data
}
#Construct plots
rhoplot_data <- do.call(rbind, rhoplots)
rhoplot_data <- cbind(rhoplot_data, rep("Beta conjugate prior", nrow(rhoplot_data)))
names(rhoplot_data)[6] <- "Prior"
unif_range <- seq(-1, 1, length.out = nrow(rhoplots[[1]]))
unif_data <- data.frame(cbind(rep(unif_range, nrow(parvals)), rep(dbeta((unif_range+1)/2,5/2,5/2)/2, nrow(parvals)), rhoplot_data[,3:5], rep("Uniform prior", nrow(rhoplot_data))))
names(unif_data) <- names(rhoplot_data)
rhoplot_data <- rbind(rhoplot_data, unif_data)
rhoplot1 <- ggplot(rhoplot_data[which(rhoplot_data$k == kvals[1]),]) + geom_area(aes(x = rho, y = Density, color = Prior, fill = Prior), alpha = 0.3, position = "identity") + scale_x_continuous(bquote(rho), sec.axis = sec_axis(~., name = "r", breaks = NULL, labels = NULL)) + scale_y_continuous(sec.axis = sec_axis(~., name = "s", breaks = NULL, labels = NULL)) + theme(aspect.ratio = 1) + facet_grid(rows = vars(r), cols = vars(s)) + scale_color_manual(values = c("blue", "green")) + scale_fill_manual(values = c("blue", "green"))
rhoplot2 <- ggplot(rhoplot_data[which(rhoplot_data$k == kvals[2]),], aes(x = rho, y = Density)) + geom_area(aes(color = Prior, fill = Prior), alpha = 0.3, position = "identity") + scale_x_continuous(bquote(rho), sec.axis = sec_axis(~., name = "r", breaks = NULL, labels = NULL)) + scale_y_continuous(sec.axis = sec_axis(~., name = "s", breaks = NULL, labels = NULL)) + theme(aspect.ratio = 1) + facet_grid(rows = vars(r), cols = vars(s)) + scale_color_manual(values = c("blue", "green")) + scale_fill_manual(values = c("blue", "green"))
```

```{r, echo = FALSE, out.width="700px", out.height="700px"}
knitr::include_graphics("data/rhoplot1.png")
```

```{r, eval = FALSE, fig.dim = c(7,6), echo = FALSE}
rhoplot1
```

and for $k = 5$ we have

```{r, echo = FALSE, out.width="700px", out.height="700px"}
knitr::include_graphics("data/rhoplot2.png")
```

```{r, eval = FALSE, fig.dim = c(7,6), echo = FALSE}
rhoplot2
```
Note that in the case $k = 0.05$, we see that choosing $r$ large we get a reduced variance compared to the uniform prior equivalent. This is more pronounced in the case $k = 0.5$. In general, as $k$ gets smaller, this can be interpreted as having less prior information and so the variance of $\rho$ is increased. Note that increasing $s$ has also yielded a more skewed prior as larger $s$ puts greater density on beta distributions with large $\alpha$ and $\beta$ (whether symmetric or not). To understand the effect of increasing $k$, we plot the sampled variances and expectations of $\rho$ for six different values of $k$ but with other parameters fixed, $r = -0.3$, $s = 3$.

```{r, message=FALSE, warning=FALSE,eval=FALSE, results='hide', silent=TRUE}
kvals <- c(0.1, 0.2, 0.4, 0.8, 1.6, 3.2)
#parvals <- cbind(rep(0.3, 6), rep(-1, 6), kvals)
parvals <- cbind(rep(-0.3, 6), rep(3, 6), kvals)
rhovarplots <- list()
rhomeanplots <- list()
for (i in 1:6){
  fit_cor <- rstan::sampling(cor_pri_st, data = list(cor_p=parvals[i,]), iter = 2000, chains=4)
  ex.fit <- rstan::extract(fit_cor)
  rhovar_density <- density(ex.fit$rhovars[,1,2])
  rhomean_density <- density(ex.fit$rhovars[,2,1])
  rhovar_data <- data.frame(cbind(rhovar_density$x, rhovar_density$y, rep(kvals[i], length(rhovar_density$x))))
  rhomean_data <- data.frame(cbind(rhomean_density$x, rhomean_density$y, rep(kvals[i], length(rhomean_density$x))))
  names(rhovar_data) <- c("Variance", "Density", "k")
  names(rhomean_data) <- c("Expectation", "Density", "k")
  rhovarplots[[i]] <- rhovar_data
  rhomeanplots[[i]] <- rhomean_data
}
rhovarplot_data <- do.call(rbind, rhovarplots)
rhomeanplot_data <- do.call(rbind, rhomeanplots)
rhovarplot <- ggplot(rhovarplot_data) + geom_area(aes(x = Variance, y = Density), color = "blue", fill = "blue", alpha = 0.3, position = "identity") + geom_vline(xintercept = 0.2, color = "green", linetype = "dashed", linewidth = 0.8) + facet_wrap(vars(k), nrow = 2, ncol = 3) + scale_x_continuous(sec.axis = sec_axis(~., name = "k", breaks = NULL, labels = NULL))
rhomeanplot <- ggplot(rhomeanplot_data) + geom_area(aes(x = Expectation, y = Density), color = "blue", fill = "blue", alpha = 0.3, position = "identity") + geom_vline(xintercept = 0, color = "green", linetype = "dashed", linewidth = 0.8) + facet_wrap(vars(k), nrow = 2, ncol = 3) + scale_x_continuous(sec.axis = sec_axis(~., name = "k", breaks = NULL, labels = NULL))
```

```{r, echo = FALSE, out.width="600px", out.height = "500px"}
knitr::include_graphics("data/rhovarplot.png")
```

```{r, eval = FALSE, echo = FALSE}
rhovarplot
```
Note that in this case increasing $k$ leads to increased variance of $\rho$. The dashed green line at 0.2 indicates the variance of the marginal distribution of LKJ(1) correlation parameters, the uniform equivalent.

```{r, echo = FALSE, out.width = "600px", out.height = "500px"}
knitr::include_graphics("data/rhomeanplot.png")
```

```{r, eval=FALSE, fig.dim = c(8,5), echo = FALSE}
rhomeanplot
```

Note that as $k$ increases the variance of the expectation of $\rho$ decreases. 

## Prior predictive distribution
As an example of using the Beta conjugate prior for the short-term individual discrepancy correlations, we sample from the prior predictive distribution of the ensemble model in the case of four species with $r = 0.25, s = 3$ and $k = 4$ as our hyperparameters. We first draw from the the marginal prior distribution of the first off-diagonal correlation matrix entry $\rho_{m,1,2}$. We plot these draws below: 
```{r, eval = FALSE, message = FALSE, warning = FALSE, silent = TRUE, results = 'hide'}
fit_cor <- rstan::sampling(cor_pri_st, data = list(cor_p=c(0.25, 3, 4), iter = 2000, chains=4))
ex.fit <- rstan::extract(fit_cor)
rhoplot_data  <- data.frame(ex.fit$rho[,1,1,2])
names(rhoplot_data) <- "rho"
rhoplot <- ggplot(rhoplot_data) + geom_histogram(aes(x = rho), color = "blue", fill = "blue", alpha = 0.3, binwidth = 0.05) + scale_x_continuous(bquote(rho))
```

```{r, echo = FALSE, out.width = "400px", out.height = "400px"}
knitr::include_graphics("data/rhoplot.png")
```

```{r, eval = FALSE, echo = FALSE, warning = FALSE, fig.dim = c(4,3)}
rhoplot
```

The prior predictive distribution is then
```{r,eval=FALSE,message=FALSE, warning = FALSE, results = 'hide',silent=TRUE}
priors <- EnsemblePrior(4,
                        ind_st_params =IndSTPrior("hierarchical_beta_conjugate",list(-3,1,8,4),
                                                  list(0.25,3,4),AR_params=c(2,2)),
                        ind_lt_params = IndLTPrior("lkj",list(1,1),1),
                        sha_st_params = ShaSTPrior("lkj",list(1,10),1,AR_params=c(2,2)),
                        sha_lt_params = 5
)
prior_density <- prior_ensemble_model(priors, M = 4)
samples <- sample_prior(observations = list(SSB_obs, Sigma_obs),
             simulators = list(list(SSB_ewe, Sigma_ewe,"EwE"),
                   list(SSB_fs ,  Sigma_fs,"FishSums"),
                   list(SSB_lm ,  Sigma_lm,"LeMans"),
                   list(SSB_miz, Sigma_miz,"mizer")),
             priors=priors,
             sam_priors = prior_density)
```

```{r, eval = FALSE, fig.dim = c(7, 6)}
plot(samples)
```

```{r, echo = FALSE, out.height="500px", out.width="800px"}
knitr::include_graphics("data/p_priors_beta_conj.png")
```

# References