---
title: "Diagnostic for fine-mapping with summary statistics"
author: "Yuxin Zou"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Diagnostic for fine-mapping with summary statistics}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE,comment = "#",fig.width = 5,
                      fig.height = 3,fig.align = "center",
                      fig.cap = " ",dpi = 120)
```

This vignette demonstrates the use of the diagnostic plot for
assessing consistency of the summary statistics and the reference LD
matrix.

The `susie_rss` assumes the LD matrix accurately estimate the
correlations among SNPs from the original GWAS genotype
data. Typically, the LD matrix comes from some public database of
genotypes in a suitable reference population. The inaccurate LD
information leads to unreliable fine-mapping result.

The diagnostic for consistency between summary statistics and
refenrence LD matrix is based on the RSS model under the null with
regularized LD matrix.
$$
\hat{z} | R, \lambda \sim N(0, (1-\lambda)R + \lambda I), 0<\lambda<1
$$
The parameter $\lambda$ is estimated by maximum likelihood. A larger
$\lambda$ means a greater inconsistency between summary statistics and
the LD matrix. The expected z score is computed for each SNP,
$E(\hat{z}_j | \hat{z}_{-j})$, and plotted against the observed z
scores.

```{r}
library(susieR)
library(curl)
```

## LD information from the original genotype data

We demonstrate the diagnostic plot in a simple case, the LD matrix is
estimated from the original genotype data. In this case, we expect the
diagnostic plot to confirm that the LD matrix is consistent with the
z scores.

We use the same simulated data as in
[fine mapping vignette](finemapping.html).

```{r}
data("N3finemapping")
n = nrow(N3finemapping$X)
b = N3finemapping$true_coef[,1]
sumstats <- univariate_regression(N3finemapping$X, N3finemapping$Y[,1])
z_scores <- sumstats$betahat / sumstats$sebetahat
Rin = cor(N3finemapping$X)
attr(Rin, "eigen") = eigen(Rin, symmetric = TRUE)
susie_plot(z_scores, y = "z", b=b)
```

The estimated $\lambda$ is
```{r}
lambda = estimate_s_rss(z_scores, Rin, n=n)
lambda
```

The plot for the observed z scores vs the expected z scores is 
```{r}
condz_in = kriging_rss(z_scores, Rin, n=n)
condz_in$plot
```

Summary of SuSiE Credible Sets:
```{r}
fit <- susie_rss(z_scores, Rin, n=n, estimate_residual_variance = TRUE)
susie_plot(fit,y = "PIP", b=b)
```

## LD information from the reference panel

We use another simulated data where the LD matrix is estimated from a
reference panel. In this example data set, there is one association
signal in the simulated data (red point), and there is one SNP with
mismatched reference and alternative allele between summary statistics
and the reference panel (yellow point).

**Note:** In some versions of [PLINK][plink], these mismatches can
occur when [PLINK automatically flips the alleles to make the minor
allele be the effect
allele](https://github.com/stephenslab/susieR/issues/148), leading to
different allele encodings in the z scores and LD matrix. Adding the
flag `--keep-allele-order` will disable this behaviour in PLINK.

```{r}
data_file <- tempfile(fileext = ".RData")
data_url <- paste0("https://raw.githubusercontent.com/stephenslab/susieR/",
                   "master/inst/datafiles/SummaryConsistency1k.RData")
curl_download(data_url,data_file)
load(data_file)
zflip = SummaryConsistency$z
ld = SummaryConsistency$ldref
n=10000
b = numeric(length(zflip))
b[SummaryConsistency$signal_id] = zflip[SummaryConsistency$signal_id]
plot(zflip, pch = 16, col = "#767676", main = "Marginal Associations", 
     xlab="SNP", ylab = "z scores")
points(SummaryConsistency$signal_id, zflip[SummaryConsistency$signal_id], col=2, pch=16)
points(SummaryConsistency$flip_id, zflip[SummaryConsistency$flip_id], col=7, pch=16)
```
Using the data with misaligned allele, SuSiE-RSS identifies a true positive CS containing the true effect SNP; and a false positive CS that incorrectly contains the mismatched SNP.
```{r}
fit = susie_rss(zflip, ld, n=n)
susie_plot(fit, y='PIP', b=b)
points(SummaryConsistency$flip_id, fit$pip[SummaryConsistency$flip_id], col=7, pch=16)
```

The estimated $\lambda$ is
```{r}
lambda = estimate_s_rss(zflip, ld, n=n)
lambda
```

In the diagnostic plot, the mismatched SNP shows the largest difference between observed and expected z-scores, and therefore appears furthest away from the diagonal.

```{r}
condz = kriging_rss(zflip, ld, n=n)
condz$plot
```

After fixing the allele encoding, SuSiE-RSS identifies a single true positive CS containing the true-effect SNP, and the formerly mismatched SNP is (correctly) not included in a CS.

```{r}
z = zflip
z[SummaryConsistency$flip_id] = -z[SummaryConsistency$flip_id]
fit = susie_rss(z, ld, n=n)
susie_plot(fit, y='PIP', b=b)
```

## Session information

Here are some details about the computing environment, including the
versions of R, and the R packages, used to generate these results.

```{r session-info}
sessionInfo()
```

[plink]: https://www.cog-genomics.org/plink