---
title: "Generalized Robinson-Foulds distances"
author: "[Martin R. Smith](https://smithlabdurham.github.io/)"
output: rmarkdown::html_vignette
bibliography: ../inst/REFERENCES.bib
csl: ../inst/apa-old-doi-prefix.csl
vignette: >
  %\VignetteIndexEntry{Generalized Robinson-Foulds distances}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r init, message=FALSE, warning=FALSE, echo = FALSE}
library('ape')
library('TreeDist')
tree1 <- read.tree(text='((A, B), ((C, (D, E)), (F, (G, (H, I)))));')
tree2 <- read.tree(text='((A, B), ((C, D, (E, I)), (F, (G, H))));')
AtoJ <-  read.tree(text='(((((A, B), C), D), E), (F, (G, (H, (I, J)))));')
swapAJ <-  read.tree(text='(((((J, B), C), D), E), (F, (G, (H, (I, A)))));')
```

This document outlines the similarity measures employed by the 
generalized Robinson–Foulds distances implemented in this package.

Generalized RF distances are introduced [elsewhere](Robinson-Foulds.html); 
before you read further, you may also wish to revisit how to
[use the 'TreeDist' package](Using-TreeDist.html), and relevant 
[principles of information theory](information.html).

## Shared phylogenetic information

Under the shared phylogenetic information tree distance measure [@SmithDist], 
pairs of splits are assigned a similarity score that corresponds to the amount
of phylogenetic information [_sensu_ @Steel2006] that they share in common
(see [separate vignette](information.html)), a concept introduced (though not 
developed) by Nelson [-@Nelson1979].

```{r, fig.width=6, fig.align='center'}
VisualizeMatching(SharedPhylogeneticInfo, tree1, tree2, 
                  Plot = TreeDistPlot, matchZeros = FALSE)
SharedPhylogeneticInfo(tree1, tree2)
```

This distance is measured in bits; on this measure, the total information 
content of a tree is given by 
```{r, total-phylo-info}
SplitwiseInfo(tree1)
```

### Conflicting splits can nevertheless be instructive

Shared phylogenetic information assigns zero similarity to incompatible splits,
i.e. those that cannot both occur on a single tree.
This leads to problematic behaviour in certain cases: for example,
swapping the position of two distant leaves (as with leaves 'A' and 'J' below)
can disproportionately reduce similarity -- in this example, to zero.

```{r, fig.width=6, fig.align='center'}
VisualizeMatching(SharedPhylogeneticInfo, AtoJ, swapAJ,
                  Plot = TreeDistPlot, matchZeros = FALSE, prune = c(5, 18))
```

## Mutual clustering information

Scoring each pair of splits according to their mutual clustering information
[@SmithDist]  (see [separate vignette](information.html)) results in a
information-based tree distance metric that recognizes similarity in tree
structure even when every possible pairing of splits conflicts:

```{r, fig.width=6, fig.align='center'}
VisualizeMatching(MutualClusteringInfo, AtoJ, swapAJ,
                  Plot = TreeDistPlot, matchZeros = FALSE, prune = c(5, 18))
MutualClusteringInfo(AtoJ, swapAJ)
```

Because no pair of non-trivial splits has zero mutual clustering information,
even a dissimilar pairing (such as `HI|ABCDEFG` &Rightarrow; `EI|ABCDFGH` below) 
is (slightly) preferable to leaving a split unpaired.

```{r, fig.width=6, fig.align='center'}
VisualizeMatching(MutualClusteringInfo, tree1, tree2, 
                  Plot = TreeDistPlot, matchZeros = FALSE)
```

The total mutual clustering information in a single tree is given by
```{r total-mci}
ClusteringEntropy(tree1)
```

## Nye _et al._ tree similarity metric

The Nye _et al_. [-@Nye2006] tree similarity metric scores pairs by
considering the elements held in common between subsets of each split.

Consider a pair of splits `ABCDEF|GHIJ` and `ABCDEIJ|FGH`.
These can be aligned thus:

```
ABCDEF  | GHIJ
ABCDE IJ|FGH
```

The first pair of subsets, `ABCDEF` and `ABCDEIJ`, have five elements in common 
(`ABCDE`), and together encompass eight elements (`ABCDEFIJ`).  Their 
_subset score_ is thus $\frac{5}{8}$.

The second pair of subsets, `GHIJ` and `FGH`, have two elements (`GH`) in common,
of the five total (`FGHIJ`), and hence receive a subset score of $\frac{2}{5}$.

This split alignment then receives an _alignment score_ corresponding to the 
lower of the two subset scores, $\frac{2}{5}$.

We must now consider the other alignment of this pair of splits,


```
ABCDEF  |     GHIJ
     FGH|ABCDE  IJ
```

This yields subset scores of $\frac{1}{8}$ and $\frac{2}{9}$, and thus has an
alignment score of $\frac{1}{8}$.
This alignment gives a lower score than the other, so is disregarded.
The pair of splits is allocated a similarity score corresponding to the 
better alignment: $\frac{2}{5}$.

As such, splits that match exactly will receive a similarity score of 1,
in a manner analogous to the Robinson–Foulds distance.  (This is despite the
fact that some splits are more likely to match than others.)
It is not possible for a pair of splits to receive a zero similarity 
score.

```{r, fig.width=6, fig.align='center'}
VisualizeMatching(NyeSimilarity, tree1, tree2, 
                  Plot = TreeDistPlot, matchZeros = FALSE)
NyeSimilarity(tree1, tree2, normalize = FALSE)
```

### Jaccard–Robinson–Foulds metric

B&ouml;cker _et al_. [-@Bocker2013] employ the same split similarity calculation
as Nye _et al._ (above), which they suggest ought to be raised to an
arbitrary exponent in order to down-weight the contribution of paired splits 
that are not identical.
In order for the metric to converge to the Robinson–Foulds metric as the
exponent grows towards infinity, the resulting score is then doubled.

```{r, fig.width=6, fig.align='center'}
JaccardRobinsonFoulds(tree1, tree2, k = 1)
VisualizeMatching(JaccardRobinsonFoulds, tree1, tree2,
                  Plot = TreeDistPlot, matchZeros = FALSE)

JRF2 <- function(...) JaccardRobinsonFoulds(k = 2, ...)
JRF2(tree1, tree2)
VisualizeMatching(JRF2, tree1, tree2,
                  Plot = TreeDistPlot, matchZeros = FALSE)
```

The figure below shows how the JRF distance between the two trees plotted above
varies with the value of the exponent _k_, relative to the Nye _et al._ and 
Robinson–Foulds distances between the trees:

```{r, fig.width=6, fig.height = 4,fig.align = 'center', echo = FALSE}
oldPar <- par(mar = c(4, 4, 0, 0))
x <- seq(1, 20, by = 0.1)
cbPalette8 <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442",
                "#0072B2", "#D55E00", "#CC79A7") # dput(Ternary::cbPalette8)
plot(0, xlim = range(x), ylim = c(3.5, 9.5),
     xlab = expression(italic('k')),
     ylab = 'Jaccard\u2013Robinson\u2013Foulds distance',
     pch = 3, type = 'n')

abline(h = RobinsonFoulds(tree1, tree2),
       lty = 'dashed', col = cbPalette8[6])
abline(h = NyeSimilarity(tree1, tree2, similarity = FALSE),
       lty = 'dotted', col = cbPalette8[8])

lines(x = x, y = vapply(x, JaccardRobinsonFoulds, tree1 = tree1, tree2 = tree2,
                        allowConflict = TRUE, 0), col = cbPalette8[5])
lines(x = x, y = vapply(x, JaccardRobinsonFoulds, tree1 = tree1, tree2 = tree2,
                        0), col = cbPalette8[4])

legend('right', c('Robinson\u2013Foulds', 'JRF, no conflict', 'JRF, conflict ok', 'Nye et al.'),
       lty = c('dashed', 'solid', 'solid', 'dotted'), 
       col = cbPalette8[c(6, 4, 5, 8)], bty = 'n')
par(oldPar)
```

The theoretical and practical performance of the JRF metric, and its speed
of calculation, are best at lower values of _k_ [@SmithDist], raising the
question of whether an exponent is useful.

B&ouml;cker _et al_. [-@Bocker2013] suggest that 'reasonable' matchings 
exhibit a property they term _arboreality_.
Their definition of an arboreal matching supposes that trees are rooted, in 
which case each split corresponds to a clade.
In an arboreal matching on a rooted tree, no pairing of splits conflicts with
any other pairing of splits.
Consider the case where splits 'A' and 'B' in tree 1 are paired 
respectively with splits 'C' and 'D' in tree 2.  If 'A' is paired with 'C' 
and 'B' with 'D', then to avoid conflict:

- if A is nested within B, then C must be nested within D

- if B is nested within A, then D must be nested within C

- if A and B do not overlap, then C and D must not overlap.

Equivalent statements for unrooted trees are a little harder to express.

Unfortunately, constructing arboreal matchings is NP-complete, making an optimal
arboreal matching slow to find.
A faster alternative is to prohibit pairings of contradictory splits, though 
distances generated under such an approach are theoretically less coherent and 
practically no more effective than those calculated when contradictory splits
may be paired, so the only advantage of this approach is a slight increase in 
calculation speed [@SmithDist].

## Matching Split Distance

Bogdanowicz & Giaro [-@Bogdanowicz2012] propose an alternative distance, which
they term the Matching Split Distance.  (This approach was independently 
proposed by Lin _et al._ [-@Lin2012].)

```{r, fig.width=6, fig.align='center'}
MatchingSplitDistance(tree1, tree2)
VisualizeMatching(MatchingSplitDistance, tree1, tree2,
                  Plot = TreeDistPlot, matchZeros = FALSE)
```

Note that the visualization shows the difference, rather than the similarity,
between splits.

Similar to the Nye _et al_. similarity metric, this method compares the subsets
implied by a pair of splits.
Here, the relevant quantity is the number of elements that must be moved from 
one subset to another in order to make the two splits identical.
With the pair of splits

```
ABCDEF  | GHIJ
ABCDE IJ|FGH
```

three leaves ('F', 'I' and 'J') must be moved before the splits are identical;
as such, the pair of splits are assigned a difference score of three.  
Formally, where $S_i$ splits $n$ leaves into bipartitions $A_i$ and $B_i$, the
difference score is calculated by

$n - m$

where $m$ counts the number of leaves that already match, and is defined as

$m = \max\{|A_1 \cap A_2| + |B_1 \cap B_2|, |A_1 \cap B_2| + |B_1 \cap A_2|\}$

```{r, fig.width=6, fig.align='center'}
MatchingSplitDistance(read.tree(text='((a, b, c, d, e, f), (g, h, i, j));'),
                      read.tree(text='((a, b, c, d, e, i, j), (g, h, f));'))
```

This distance is difficult to normalize, as it is not easy to calculate its
maximum value.

### Information theoretic alternative

In the matching split distance, $m$ represents a simple count of the number of
shared taxa.  An alternative is to measure the phylogenetic information content
of the largest split consistent with $S_1$ and $S_2$:

$m = \max\{h(A_1 \cap A_2 | B_1 \cap B_2), h(A_1 \cap B_2 | B_1 \cap A_2)\}$

The most information-rich split consistent with

```
ABCDEF  | GHIJ
ABCDE IJ|FGH  
```

is `ABCDE | GH`, which contains 
```{r}
TreeTools::SplitInformation(5, 2)
```
bits of phylogenetic information. This value can be used as a similarity score
for this pairing of splits.

```{r, fig.width=6, fig.align='center'}
MatchingSplitInfoDistance(tree1, tree2)
VisualizeMatching(MatchingSplitInfoDistance, tree1, tree2,
                  Plot = TreeDistPlot, matchZeros = FALSE)
```


## References