---
title: "Genotype simulation supported by *PhenotypeSimulator*"
author: "Hannah Meyer"
date: "`r Sys.Date()`"
output: pdf_document
bibliography: genotype-simulation.bib
csl: plos-genetics.csl
vignette: >
  %\VignetteIndexEntry{Supported Genotype Simulation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, echo = FALSE, message=FALSE}
library(knitr)
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

There are a number of different strategies to generate genotype data for genetic
association studies.

1. **Simple bi-allelic SNPs without structure**:
    In the most simple case and assuming bi-allelic SNPs, each SNP is simulated
    from a binomial distribution with two trials and probability equal to the
    given allele frequencies. This simple approach does not simulate any
    dependency between the genotypes as is observed with LD structure in the
    genome. Simple bi-allelic SNPs can be simulated with
    PhenotypeSimulator::simulateGenotypes or via PLINK @Chang2015. For large
    datasets (~ 1 million SNPs), simulation via PLINK is recommended. 

1. **Coalescent approaches**:
    Coalescent methods simulate genealogical events backward in time. These
    events typically include the coalescence of two sequences into a single
    ancestral lineage, recombination within a sequence or migration between
    populations. A coalescent-based approach to simulate whole genome data is
    implemented in GENOME @Liang2007 and its output format is supported as an
    input format for *PhenotypeSimulator*. 
    
1. **Forward-time approaches**: 
    Forward-time simulation methods evolve a population forward in time, subject
    to arbitrary genetic and demographic factors @Peng2010. Several algorithms
    are available, many of which allow customisation by building the simulation
    scheme in R or python, hence the output format of the genotypes can be
    specified by the user. *PhenotypeSimulator* offers reading genotypes from
    delimited-files and as such, any genotypes generated by these programmes can
    be read (e.g. MetaSim @Strand2002 or simuPop @Peng2005).

1. **Resampling approaches**: 
    Resampling-based approaches combine existing genotype data into the
    genotypes of the simulated samples, thereby retaining allele frequency and
    LD patterns @Wright2007. A standard resampling-based approach that uses
    common genotype formats (common for the 
    [oxford genetics format](http://www.stats.ox.ac.uk/~marchini/software/gwas/file_format.html))
    is Hapgen2 and an example usage is given below. 

Examples for genotype simulation via PLINK, GENOME and Hapgen2 are given below.
The corresponding data can be found in the extdata folder. 

## Simple bi-allelic SNPs without structure via PLINK
Download [PLINK](https://www.cog-genomics.org/plink/1.9/) and create a folder
for the PLINK output files.  The example below uses PLINK to simulate 1000 SNPs,
with allele frequencies between 0 and 1 for 100 controls and 0 cases.
PhenotypeSimulator::readStandardGenotypes reads the resulting .bim, .fam, .bed
files. 

```{bash, eval=FALSE}
# Serves as output directory for simulation and parameter file
mkdir -p ~/PhenotypeSimulator/inst/extdata/genotypes/plink
cd ~/PhenotypeSimulator/inst/extdata/genotypes/plink

# Write paramter file to simulate 1000 SNPs, with allele frequencies between 0 
# and 1 for 100 controls and 0 cases 
echo -e "1000\tSNP\t0.00\t1.00\t1.00\t1.00" > plink_sim_par.txt

plink --simulate plink_sim_par.txt \
      --simulate-ncases 0 \
      --simulate-ncontrols 100 \
      --out genotypes_plink
```



## Coalescent simulation via Genome
Download [GENOME](http://csg.sph.umich.edu/liang/genome/download.html) and
create a folder for the GENOME output files. The example below uses GENOME to
simulate genetic data for a population comprised of three sub-populations with
30, 30 and 40 samples. The simulated POPULATION PROFILE, SNP positions and the
genotypes are all saved in genotypes_genome.txt.
PhenotypeSimulator::readStandardGenotypes reads genotype information by parsing
this output file and extracting the samples information following the line
'Samples:'

```{bash, eval=FALSE} 
mkdir -p ~/PhenotypeSimulator/inst/extdata/genotypes/genome
cd ~/PhenotypeSimulator/inst/extdata/genotypes/genome

#  subpopulation with 30, 30 and 40 individuals each
genome -pop 3 30 30 40 > genotypes_genome.txt
```

## Resampling-based simulation via Hapgen2
Download [hapgen2](http://mathgen.stats.ox.ac.uk/genetics_software/hapgen/hapgen2.html),
[hapgen example files](http://mathgen.stats.ox.ac.uk/genetics_software/hapgen/download/example/hapgen2.example.tgz)
and the 1000Genomes data from the
[impute2 webpage](https://mathgen.stats.ox.ac.uk/impute/impute_v1.html#Using_IMPUTE_with_the_HapMap_Data)
as starting point for the resampling-based genotype simulation with hapgen2. 
For formating of the 1000Genomes data, have a look at this [vignette](https://hannahvmeyer.github.io/PhenotypeSimulator/articles/Simulation-and-LinearModel.html#genotype-simulation-via-hapgen2).
The example files contain example haplotypes (ex.haps), legend files  (ex.leg) map
(ex.map) and TagSNP files (ex.tags). The following examples simulates genotype
data for 100 controls and 0 cases with 1000 SNPs from chromosome 1.
PhenotypeSimulator::readStandardGenotypes reads the resulting .gen and .samples
files. 

```{bash, eval=FALSE}
# contains ex.leg file and serves as output directory
mkdir -p ~/PhenotypeSimulator/inst/extdata/genotypes/hapgen
cd ~/PhenotypeSimulator/inst/extdata/genotypes/hapgen

# contains 1000Genomes haplotype data used for sampling
hapdir=/path/to/CEU.0908.impute.files

hapgen2 -m $hapdir/genetic_map_chr1_combined_b37.txt \
        -l ex.leg \
        -h $hapdir/chr1.ceu_subset.hap \
        -o genotypes_hapgen \
        -n 100 0 \
        -dl 45162 0 0 0 \
        -no_haps_output
```

## References