---
title: "Empirical data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Empirical data}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r setup, include=FALSE}
library(GenomeAdmixR)
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(fig.width = 6)
print_mat <- function(mat) {
  n <- nrow(mat)
  c("\\begin{bmatrix}",
    paste0(sapply(seq_len(n - 1),
                  function(i) paste0(mat[i, ], collapse = " & ")),
           " \\\\"),
    paste0(mat[n, ], collapse = " & "),
    "\\end{bmatrix}")
}
```

## Empirical data

By default, the package uses local ancestry data, where each locus can be 
uniquely traced back to one of the founding populations of the admixed focal 
population. 
If, however, you are more interested in admixing sequence data, the GenomeAdmixR 
package provides this optionality as well, where you can import your own 
sequences and admix those in a similar fashion.

```{r creating fake data 2}
  fake_input_data <-
    create_artificial_genomeadmixr_data(number_of_individuals = 10,
                                        marker_locations = 1:10)

  fake_input_data$markers
  fake_input_data$genomes

```
Here, we create fake input data in absence of an example file. The data consists
of two components: a genome matrix and a vector with marker locations. The 
genome matrix has two rows per individual (one for each chromosome) and one 
column per molecular marker. The number of columns of the genome matrix should
match the length of the marker vector. Instead of using a/c/t/g, we use 
numerical notation, e.g. 1/2/3/4 (with missing data = 0).
Instead, you can also load sequence data from a VCF file (function 
\code{vcfR_to_genomeadmixr_data}) or from PLINK style data (function 
\code{ped_map_table_to_genomeadmixr_data}).
The loaded data can now be used as input for a simulation using the 
sequence_module:

```{r simulate admix data}
simulated_pop <- simulate_admixture(
                    module = sequence_module(molecular_data = fake_input_data),
                    pop_size = 100,
                    total_runtime = 100)
```
If you want to study admixture, you might have VCF data with individuals from 
multiple populations. GenomeAdmixR provides functionality to use this data as
seeding data as well, although we do ask you to prepare two distinct VCF files,
where each VCF contains the individuals from each population.
Because \code{simulate_admixture} only accepts a single input matrix, we 
can artifically create an initial hybrid input swarm by loading data from 
multiple VCF files, and then randomly sampling individuals from these datasets
in order to form the starting dataset. We can do so using the function
combine_input_data:
```{r creating fake data}
  chosen_markers <- 1:100
   fake_input_data1 <-
    create_artificial_genomeadmixr_data(number_of_individuals = 100,
                                        marker_locations = chosen_markers,
                                        used_nucleotides = 1:2)
  fake_input_data2 <-
    create_artificial_genomeadmixr_data(number_of_individuals = 100,
                                        marker_locations = chosen_markers,
                                        used_nucleotides = 3:4)
  
  combined_data <- combine_input_data(input_data_list = list(fake_input_data1,
                                                             fake_input_data2),
                                      frequencies = c(0.5, 0.5),
                                      pop_size = 1000)
```
We can use this input data as a starting point for a simulation again:
```{r simulate}
simulated_pop <- simulate_admixture(
                      module = sequence_module(molecular_data = combined_data,
                                               markers = 1:100,
                                               morgan = 1),
                      pop_size = 1000,
                      total_runtime = 100)
```
We have simulated here a population of 1000 individuals (initially randomly 
drawn from the sequences we created using combine_input_data), for 100 
generations. 
The result is identical in structure to results from 'simulate_admixture' and we
can use the same visualisation and analytical devices:

```{r visualise}
plot_over_time(simulated_pop$frequencies, focal_location = 50)
plot_joyplot_frequencies(simulated_pop$frequencies, time_points = c(0, 50, 100))
```

Furthermore, we can also perform selection on a certain allele, where the 
location under selection (first entry in the selection matrix) has to exist in
the original dataset.

```{r simulate selection}
selection_matrix <- matrix(NA, nrow = 1, ncol = 5)
selection_matrix[1, ] <- c(50, 1, 1.1, 1.2, 1)

# we expect allele 'a' to do well at 0.5 Morgan:
selected_pop <- simulate_admixture(
                      module = sequence_module(molecular_data = fake_input_data1,
                                               morgan = 1,
                                               markers = chosen_markers),
                      pop_size = 1000,
                      total_runtime = 100,
                      select_matrix = selection_matrix)
plot_over_time(selected_pop$frequencies, focal_location = 50)
```

# Migration
Sequence data can also be used to seed a simulation where two populations 
exchange migrants, by specifying parameters for the migration argument, using the migration_settings function:
```{r migration}

migr_pop <-
  simulate_admixture(
        module = sequence_module(molecular_data = list(fake_input_data1,
                                                       fake_input_data2)),
        migration = migration_settings(migration_rate = 0.01,
                                       population_size = c(100, 100)),
        total_runtime = 100)
```

# Using simulation data as input
Lastly, not only "true" sequencing data can be used as input, but also previous
output from GenomeAdmixR simulations. In order to so, output from previous 
simulations can be converted to genomeadmixr_data, either using pre-existing
molecular markers as used in the simulation, or by superimposing markers on 
new locations.

```{r from_simulation}
simulated_pop <- simulate_admixture(
                      module = ancestry_module(), 
                      pop_size = 100,
                      total_runtime = 100)
prepared_pop  <-
  simulation_data_to_genomeadmixr_data(simulation_data = simulated_pop,
                                       markers = seq(0, 1, length.out = 100))

simulated_pop2 <- simulate_admixture(
                        module = sequence_module(molecular_data = prepared_pop),
                        pop_size = 100,
                        total_runtime = 100)
```

# Mutation rate
When using sequencing data, the simulations can also be extended using mutation.
One can do so by specifying the mutation rate (in probability per bp per 
generation), and by specifying the substitution matrix, where the substitution 
matrix is a 4x4 matrix, that specifies the probability of the different 
substitutions and transversions. For instance, for the JC69 model, 
the substitution matrix is given by:
```{r, results = 'asis'}
writeLines(print_mat(matrix(1 / 4, nrow = 4, ncol = 4)))
```
Please note that this is not necessarily the rate matrix! In the substitution 
matrix we use, all rows MUST sum to 1 (as these represent the probabilities 
of mutating into another base). 
For instance, for the K80 model with parameter kappa, we would get the following
substitution matrix, with kappa = 0.2:
```{r, results = 'asis'}
kappa <- 0.2
diag_entry <- 1 - (1 / 4 + 1 / 4 + 0.2 / 4)

writeLines(print_mat(matrix(c(diag_entry, kappa, 1 / 4, 1 / 4,
                              kappa, diag_entry, 1 / 4, 1 / 4,
                              1 / 4, 1 / 4, diag_entry, kappa,
                              1 / 4, 1 / 4, kappa, diag_entry),
                              nrow = 4, ncol = 4)))
```
We understand that it might be tricky to make these calculations a priori, 
therefore we automatically translate the substitution matrix to the probability 
matrix required in case the matrix is a rate matrix. Thus, summarising, the
substitution matrix provides the conditional probability of mutating to another
base, given that mutation occurs (as governed by the overall mutation rate).
Using artificial sequencing data, we can for instance show how new bases arise 
in the population due to mutation:
```{r example mutation}
  fake_input_data3 <-
  create_artificial_genomeadmixr_data(number_of_individuals = 100,
                                      marker_locations = 1:1000,
                                      used_nucleotides = 1)
  
  kappa <- 0.5
  rate_matrix <- matrix(c(0, kappa, 1, 1,
                          kappa, 0, 1, 1,
                          1, 1, 0, kappa,
                          1, 1, kappa, 0), nrow = 4, ncol = 4)
  
  mutation_rate <- 1e-5
  
  simulated_pop <- simulate_admixture(
                      module = sequence_module(molecular_data = fake_input_data3,
                                               morgan = 1,
                                               markers = chosen_markers,
                                               mutation_rate = mutation_rate,
                                               substitution_matrix = rate_matrix),
                      pop_size = 1000,
                      total_runtime = 100)
  
  freqs <- calculate_allele_frequencies(simulated_pop)
  
  require(magrittr, quietly = TRUE)
  freqs %>%
    dplyr::group_by(ancestor) %>%
    dplyr::summarise(mean(frequency))
```
The final results show that indeed mutations have arisen in the population: 
the initial fake data only contained 'a' for all sequences, whereas at the end
of the simulation, we see that the other bases also exist in non-zero 
frequencies.