---
title: "Fine-mapping example"
author: "Gao Wang"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Fine-mapping example}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#", fig.width = 5,
                      fig.height = 3, fig.align = "center",
                      fig.cap = " ", dpi = 120)
```

This vignette demonstrates `susieR` in the context of genetic
fine-mapping. We use simulated data on the expression level of a gene
($y$) in $N \approx 600$ individuals. We want to identify, using the
genotype matrix $X_{N\times P}$ ($P = 1001$), the genetic variables
that cause changes in expression level.

The data set is simulated to have exactly 3 non-zero effects.

```{r}
library(susieR)
set.seed(1)
```

## The data set

```{r}
data(N3finemapping)
attach(N3finemapping)
```

The loaded [N3finemapping] data set contains the regression data $X$
and $Y$, along with other properties relevant in the context of
genetic studies. It also contains the "true" regression coefficients
from which the data were simulated.
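To get a quick overview of what the data set provides (the exact
components may vary slightly between package versions), we can inspect
its structure:

```{r}
str(N3finemapping, max.level = 1)
```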

Notice that we have simulated two replicates of $y$, stored as the two
columns of $Y$. Here we focus on the first replicate.

```{r}
dim(Y)
```

Here are the 3 "true" signals in the first replicate:

```{r}
b <- true_coef[,1]
plot(b, pch=16, ylab='effect size')
```

```{r}
which(b != 0)
```

So the underlying causal variables are 403, 653 and 773.
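For reference, here are the simulated effect sizes at those positions:

```{r}
b[b != 0]
```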

## Simple regression summary statistics

The `univariate_regression` function can be used to compute summary
statistics by fitting a simple linear regression for each variable
separately. The results are $\hat{\beta}$ and $SE(\hat{\beta})$, from
which z-scores can be derived. Again we focus only on results from the
first replicate:

```{r}
sumstats <- univariate_regression(X, Y[,1])
z_scores <- sumstats$betahat / sumstats$sebetahat
susie_plot(z_scores, y = "z", b=b)
```
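As a point of comparison with the fine-mapping results below, we can
list the variables with the strongest marginal associations. Note that
the largest absolute z-score need not occur at a true causal variable:

```{r}
# Indices of the 10 largest absolute z-scores.
head(order(abs(z_scores), decreasing = TRUE), 10)
```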

## Fine-mapping with `susieR`

For starters, we assume there are at most 10 causal variables, i.e.,
set `L = 10`, although SuSiE is robust to the choice of `L`.

The `susie` function call is:

```{r}
fitted <- susie(X, Y[,1],
                L = 10,
                verbose = TRUE)
```
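To illustrate the robustness to `L`, one could refit with a smaller
value and check that essentially the same credible sets are recovered;
a minimal sketch (not evaluated here):

```{r, eval=FALSE}
# Refit with L = 5; the credible sets reported below should be
# essentially unchanged.
fitted_L5 <- susie(X, Y[,1], L = 5)
print(fitted_L5$sets$cs)
```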

### Credible sets

By default, `susie` reports 95% credible sets (CSs):

```{r}
print(fitted$sets)
```

The 3 causal signals have been captured by the 3 CSs reported here.
The 3rd CS contains many variables, including the true causal variable
`403`. The minimum absolute correlation among variables in this CS
(its "purity") is 0.86.
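The purity statistics are part of the `sets` output and can be
accessed directly; each row reports the minimum, mean and median
absolute correlation among the variables in a CS:

```{r}
fitted$sets$purity
```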

If we instead use 90% coverage for the credible sets, we still capture
the 3 signals, but the purity of the 3rd CS improves to 0.91 and the
CS is also a bit smaller.

```{r}
sets <- susie_get_cs(fitted,
                     X = X,
                     coverage = 0.9,
                     min_abs_corr = 0.1)
```

```{r}
print(sets)
```
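The sizes of the credible sets can be compared directly between the
two coverage levels:

```{r}
# Number of variables in each CS at 95% (default) and 90% coverage.
sapply(fitted$sets$cs, length)
sapply(sets$cs, length)
```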

### Posterior inclusion probabilities

The credible sets above suggest that summing over 3 single-effect
regression models is appropriate for this application. Here we
summarize the variable selection results by the posterior inclusion
probability (PIP) of each variable:

```{r}
susie_plot(fitted, y="PIP", b=b)
```

The true causal variables are colored red. The 95% CSs identified are
circled in different colors. Of interest is the cluster around
position 400: the true signal is 403, yet it does not have the highest
PIP. To compare the ranking by PIP with the ranking by the original
z-scores within that CS:

```{r}
i  <- fitted$sets$cs[[3]]                      # variables in the 3rd CS
z3 <- cbind(i, z_scores[i], fitted$pip[i])
colnames(z3) <- c('position', 'z-score', 'PIP')
z3[order(z3[,2], decreasing = TRUE),]          # sort by z-score
```
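Note that, by construction, the PIPs of the variables in a 95% CS sum
to at least 0.95. We can verify this for the 3rd CS:

```{r}
sum(fitted$pip[i])
```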

### Choice of priors

Notice that by default SuSiE estimates the prior effect size from the
data. For fine-mapping applications, however, we sometimes have prior
knowledge of the effect size, since it is parameterized as the
percentage of variance explained (PVE) by a non-zero effect, which, in
the context of fine-mapping, is related to per-SNP heritability. To
fix the prior effect to a given value, specify this PVE via
`scaled_prior_variance` and explicitly set
`estimate_prior_variance = FALSE`.

In this data set, SuSiE is robust to the choice of prior. Here we set
the PVE to 0.2 and compare with the previous results:

```{r}
fitted <- susie(X, Y[,1],
                L = 10,
                estimate_residual_variance = TRUE,
                estimate_prior_variance = FALSE,
                scaled_prior_variance = 0.2)
susie_plot(fitted, y='PIP', b=b)
```

The results remain largely unchanged.
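For a more direct comparison, one could keep both model objects and
plot one set of PIPs against the other; a minimal sketch (not
evaluated here, with `fitted_default` a hypothetical name for a refit
under the default prior):

```{r, eval=FALSE}
# Refit with the default (estimated) prior and compare PIPs with the
# fixed-prior fit above.
fitted_default <- susie(X, Y[,1], L = 10)
plot(fitted_default$pip, fitted$pip,
     xlab = "PIP, estimated prior",
     ylab = "PIP, fixed prior (PVE = 0.2)")
abline(0, 1, lty = 2)
```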

### A note on covariate adjustment

To include covariate `Z` in SuSiE, one approach is to regress it out from both `y`
and `X`, and then run SuSiE on the residuals. The code below illustrates the procedure:

```{r, eval=FALSE}
# remove.covariate.effects regresses Z out of both y and X, and
# returns the residuals along with the fitted covariate effects.
library(Matrix)
remove.covariate.effects <- function (X, Z, y) {
  # Include an intercept term if Z does not already have one.
  if (any(Z[,1] != 1)) Z <- cbind(1, Z)
  A   <- forceSymmetric(crossprod(Z))
  SZy <- as.vector(solve(A, c(y %*% Z)))
  SZX <- as.matrix(solve(A, t(Z) %*% X))
  y   <- y - c(Z %*% SZy)
  X   <- X - Z %*% SZX
  return(list(X = X, y = y, SZy = SZy, SZX = SZX))
}

out <- remove.covariate.effects(X, Z, Y[,1])
fitted_adjusted <- susie(out$X, out$y, L = 10)
```

Note that the covariates `Z` should have a column of ones as the
first column. If not, the function `remove.covariate.effects` above
will add such a column to `Z` before regressing it out. As a result,
the data are centered, and their scale changes after regressing out
`Z`. This introduces some subtleties in interpreting the results. For
this reason, we provide the covariate adjustment procedure as a tip in
the documentation rather than as part of the `susieR::susie()`
function. Caution should be taken when applying this procedure and
interpreting its results.
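As a basic sanity check when using this procedure, the residuals
should be numerically orthogonal to the covariates; assuming `Z` and
`out` from the chunk above:

```{r, eval=FALSE}
# The cross-product of the intercept-augmented covariates with the
# residualized y should be numerically zero.
max(abs(crossprod(cbind(1, Z), out$y)))
```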

## Session information

Here are some details about the computing environment, including the
versions of R and of the R packages used to generate these results.

```{r}
sessionInfo()
```

[N3finemapping]: https://github.com/stephenslab/susieR/blob/master/inst/datafiles/N3finemapping.rds