---
title: "Using SNPknock in R"
author: "Matteo Sesia (msesia@stanford.edu)"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
bibliography: ../inst/REFERENCES.bib
vignette: >
  %\VignetteIndexEntry{Using SNPknock in R}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc}
---

This vignette illustrates the usage of the `SNPknock` package for creating knockoffs of discrete Markov chains and hidden Markov models [@sesia2019]. For simplicity, we will use synthetic data.

The `SNPknock` package also provides a simple interface to the genotype imputation software `fastPhase`, which can be used to fit hidden Markov models for genotype data. Since `fastPhase` is not available as an R package, this particular functionality of `SNPknock` cannot be demonstrated here. A tutorial showing how to use a combination of `SNPknock` and `fastPhase` to create knockoffs of genotype data can be found [here](genotypes.html).


## Knockoffs of discrete Markov chains

First, we verify that the `SNPknock` can be loaded.
```{r}
library(SNPknock)
```

We define a Markov chain model with 50 variables, each taking one of 5 possible values.
We specify a uniform marginal distribution for the first variable in the chain and create 49
transition matrices with randomly sampled entries.
```{r}
p=50; # Number of variables in the model
K=5;  # Number of possible states for each variable
# Marginal distribution for the first variable
pInit = rep(1/K,K)
# Create p-1 transition matrices
Q = array(stats::runif((p-1)*K*K),c(p-1,K,K))
for(j in 1:(p-1)) {
  Q[j,,] = Q[j,,] + diag(rep(1,K))
  Q[j,,] = Q[j,,] / rowSums(Q[j,,])
}
```
We can sample 100 independent observations of this Markov chain using the `SNPknock` package.
```{r}
set.seed(1234)
X = sampleDMC(pInit, Q, n=100)
print(X[1:5,1:10])
```
Above, each row of `X` contains an independent realization of the Markov chain.

A knockoff copy of `X` can be sampled as follows.
```{r}
Xk = knockoffDMC(X, pInit, Q)
print(Xk[1:5,1:10])
```

## Knockoffs of hidden Markov models

We define a hidden Markov chain model with 50 variables, each taking one of 3 possible values,
and 50 latent variables distributed as a Markov chain, each taking one of 5 values.
We specify a uniform marginal distribution for the first variable in the chain and create 49
transition matrices with randomly sampled entries. We also define an emission distributions
over the 3 possible emission states, for each of the 50 latent variables and the 5 latent states.
```{r}
p=200; # Number of variables in the model
K=5;  # Number of possible states for each variable
M=3;  # Number of possible emission states for each variable
# Marginal distribution for the first variable
pInit = rep(1/K,K)
# Create p-1 transition matrices
Q = array(stats::runif((p-1)*K*K),c(p-1,K,K))
rho = stats::runif(p-1, min = 1, max = 50)
for(j in 1:(p-1)) {
  Q[j,,] = Q[j,,] + rho[j] * diag(rep(1,K))
  Q[j,,] = Q[j,,] / rowSums(Q[j,,])
}
# Create p emission matrices
pEmit = array(stats::runif(p*M*K),c(p,M,K))
for(j in 1:p) { pEmit[j,,] = sweep(pEmit[j,,],2,colSums(pEmit[j,,]),`/`) }
```
We can sample 100 independent observations of this hidden Markov model using the `SNPknock` package.
```{r}
set.seed(1234)
X = sampleHMM(pInit, Q, pEmit, n=100)
print(X[1:5,1:10])
```
Above, each row of `X` contains an independent realization of the hidden Markov model.

A knockoff copy of `X` can be sampled as follows.
```{r}
Xk = knockoffHMM(X, pInit, Q, pEmit)
print(Xk[1:5,1:10])
```

## See also

If you want to see how to use `SNPknock` to create knockoffs of genotype data, see the [genotypes vignette](genotypes.html).

If you want to learn about `SNPknock` for large genome-wide association studies [@sesia2019multi], see <https://msesia.github.io/knockoffzoom/>