---
title: "Similarity Regression - Introduction"
author: "Daniel Greene"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Similarity Regression - Introduction}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc}
---
```{r echo=FALSE}
knitr::opts_chunk$set(fig.width=7, fig.height=7)
```

`SimReg` has a function `sim_reg` for performing *Bayesian Similarity Regression* [1]. It returns an estimated probability of an association between ontological term sets and a binary variable, and conditional on the association: a *characteristic ontological profile*, such that ontological similarity to the profile increases the probability of the binary variable taking the value `TRUE`. The procedure has been used in the context of linking an ontologically encoded phenotype (as HPO terms) to a binary genotype (indicating the presence or absence of a rare variant within given genes) [1], so in this guide, we'll adopt the same theme.  

The function accepts arguments including `logical` response variable `y`, ontologically encoded predictor variable `x`, and additional arguments for tuning the compromise between execution speed and precision for the procedure. 

It returns an object of class `sim_reg_output`, which contains the pertinent information about the results of the inference. Of particular interest is the estimated probability of association, i.e. the probability that model selection indicator `gamma` = 1. The function `prob_association` can be used on the output to obtain such and estimate. Also, the posterior distribution of the characteristic ontological profile `phi` may be of interest, for which the function `get_term_marginals` can be used.

To set up an environment where we can run a simple example, we need an `ontology_index` object. The `ontologyIndex` package contains such an object - the Human Phenotype Ontology, `hpo` - so we load the package and the data, and proceed to create an HPO profile `template` and an enclosing set of terms, `terms`, from which we'll generate random term sets upon which to apply the function. In our setting, we'll interpret this HPO profile `template` as the typical phenotype of a hypothetical disease. We set `template` to the set `HP:0005537, HP:0000729` and `HP:0001873`, corresponding to phenotype abnormalities 'Decreased mean platelet volume', 'Autistic behavior' and 'Thrombocytopenia' respectively.

```{r}
library(ontologyIndex)
library(SimReg)
data(hpo)

# only use terms which are descended from HP:0000001
pa_descs <- intersection_with_descendants(hpo, "HP:0000001", hpo$id)
hpo <- lapply(hpo, "[", pa_descs)
class(hpo) <- "ontology_index"

set.seed(1)

template <- c("HP:0005537", "HP:0000729", "HP:0001873")
terms <- get_ancestors(hpo, c(template, sample(hpo$id, size=50)))
```

First, we'll do an example where there is no association between `x` and `y`, and then one where there is an association. 

In the example with no association, we'll fix `y`, with 10 `TRUE`s and generate the `x` randomly, with each set of ontological terms determined by sampling 5 random terms from `terms`.
```{r}
y <- c(rep(TRUE, 10), rep(FALSE, 90))
x <- replicate(simplify=FALSE, n=100, expr=minimal_set(hpo, sample(terms, size=5)))
```

Thus, our input data looks like:
```{r}
y
head(x)
```

Now we can call the `sim_reg` function to estimate the probability of an association (note: by default, the probability of an association has a prior of 0.05 and this can be set by passing a `gamma_prior_prob` argument), and print the mean posterior value of `gamma`, corresponding to our estimated probability of association.

```{r}
no_assoc <- sim_reg(ontology=hpo, x=x, y=y)
prob_association(no_assoc)
```

We note that there is a low probability of association. Now, we sample `x` conditional on `y`, so that if `y[i] == TRUE`, then `x` has 2 out of the 3 terms in `template` added to its profile.
```{r}
x_assoc <- lapply(y, function(y_i) minimal_set(hpo, c(
	sample(terms, size=5), if (y_i) sample(template, size=2))))
```

If we look again at the first few values in `x` for which `y[i] == TRUE`, we notice that they contain terms from the template.
```{r}
head(x_assoc)
```

Now we run the procedure again with the new `x` and `y` and print the mean posterior value of `gamma`.
```{r}
assoc <- sim_reg(ontology=hpo, x=x_assoc, y=y)
prob_association(assoc)
```

We note that we infer a higher probability of association. We can also visualise the estimated characteristic ontological profile, using the function `plot_term_marginals`, and note that the inferred characteristic phenotype corresponds well to `template`.

```{r}
plot_term_marginals(hpo, get_term_marginals(assoc), max_terms=8, fontsize=30)
```

## References

1.  D. Greene, NIHR BioResource, S. Richardson, E. Turro, Phenotype similarity regression for identifying the genetic determinants of rare diseases, The American Journal of Human Genetics 98, 1-10, March 3, 2016.