---
title: "Getting started: Simple tree searches"
author: "Martin R. Smith"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
bibliography: ../inst/REFERENCES.bib
csl: ../inst/apa-old-doi-prefix.csl
vignette: >
  %\VignetteIndexEntry{Getting started: Simple tree searches}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo = FALSE}
knitr::opts_chunk$set(fig.width = 7.2, fig.asp = 0.7) 
```

"TreeSearch" is an R package that allows, among other things, 
parsimony search on morphological datasets that contain inapplicable data,
using the algorithm proposed by Brazeau, Guillerme and Smith [-@Brazeau2019]
and implemented in the 'MorphyLib' C library [@Brazeau2017]
([details](https://rawgit.com/TGuillerme/Inapp/c4d502fa4ae3702eb207b2da3a692e469ba3b6ab/inst/gitbook/_book/index.html)).

## Getting started

[A companion vignette](getting-started.html) gives details on installing the
package and getting up and running.

Launch an interactive 'app' in your browser by typing `TreeSearch::EasyTrees()`
at the R / RStudio command line.

This will allow you to load data from a file, modify search settings, and
explore the distribution of most parsimonious trees in tree space.

![Starting parsimony search](gui_start.png){ width=100% }

View a consensus tree and explore the position of rogue taxa [@SmithCons]:

![Visualizing position of rogue taxon on search result consensus tree](gui_rogue.png){ width=100% }

Explore the distribution of trees (whether found by search or loaded from file)
in tree space [@SmithSpace], and evaluate search progress [@Whidden2015]:

![Evaluating search progress using tree space](gui_space.png){ width=100% }

Map characters on a chosen tree, using character and taxon notes imported from
a Nexus file, if present.  (This is designed to be interoperable with
[MorphoBank](https://morphobank.org) matrices.)

![Mapping character reconstructions](gui_char.png){ width=100% }

Trees can be saved as images, or in Nexus/Newick for further analysis.


## Command line tree search

You can also run tree searches using the R command line.
Once installed, load the "TreeSearch" package into R using
```{r Load-library}
library("TreeSearch")
```

You can 
[load your own dataset](https://ms609.github.io/TreeTools/articles/load-data.html),
but for now, we'll use the Vinther _et al._ [-@Vinther2008] dataset that comes 
bundled with "TreeSearch".

This dataset is small enough that it runs reasonably quickly, but its 
phylogenetic signal is obscure enough that it can require Ratchet searches to
escape from local optima.

```{r Load-data}
vinther <- TreeSearch::inapplicable.phyData[["Vinther2008"]]
```

```{R RNG-version, warn = FALSE}
# Set a random seed so that random functions in this document are reproducible
RNGversion("3.5.0")
set.seed(0)
```

We can conduct a basic parsimony search with:
```{r first-pass, message = FALSE}
bestTrees <- MaximizeParsimony(vinther)
```

It can be instructive to inspect the progress of tree search.

```{r inspect-progress}
firstHit <- attr(bestTrees, "firstHit")
firstHit
```

Here, we can see that many of the earliest ratchet iterations were finding
optimal trees that had not previously been visited.
Later iterations found progressively fewer new trees, suggesting that the
search is likely to have been effective.

Advanced users might wish to visualize the progress of tree search by mapping
tree space:

```{r map-search, fig.asp = 1}
distances <- TreeDist::ClusteringInfoDistance(bestTrees)
searchStages <- length(firstHit)
map <- cmdscale(distances, k = 3)
cols <- hcl.colors(searchStages, alpha = 0.8)
par(mar = rep(0, 4))
TreeDist::Plot3(map,
                col = cols[rep(seq_along(firstHit), firstHit)],
                pch = 16, cex = 2,
                axes = FALSE, xlab = "", ylab = "", asp = 1)
TreeTools::MSTEdges(distances, plot = TRUE, map[, 1], map[, 2],
                    col = "#00000030", lty = 2)
legend("topright", names(firstHit), col = cols, pch = 16, bty = "n")
```

A quick glance suggests that early ratchet iterations captured a large part of
the diversity of optimal trees, and that iterations aren't getting stuck in 
local optima -- though conscientious users will 
[ensure that the mapping of tree space is meaningful and adequate](
https://ms609.github.io/TreeDist/articles/treespace.html)
to detect structure before making any firm conclusions [@SmithSpace].

To be thorough, we might consider continuing the search for a little longer,
fine-tuning the search parameters:

```{r second-pass, message = FALSE}
bestTrees <- MaximizeParsimony(vinther, tree = bestTrees,
                               ratchIter = 6L,
                               tbrIter = 4L, 
                               finalIter = 3L,
                               maxHits = 80L)
```

As it happens, the best tree for this dataset has a score of 79 under
equal weights parsimony.

We can plot the best tree(s) that we've found, and check its parsimony score
(length):

```{r plot-tree}
par(mar = rep(0.25, 4), cex = 0.75) # make plot easier to read
plot(ape::consensus(bestTrees))
TreeLength(bestTrees[[1]], vinther)
```

### Evaluating clade support

We might be interested in labelling clades with their frequency among the 
sampled most-parsimonious trees:

```{r plot-label-nodes}
par(mar = rep(0.25, 4), cex = 0.75) # make plot easier to read
majCons <- ape::consensus(bestTrees, p = 0.5)
splitFreqs <- TreeTools::SplitFrequency(majCons, bestTrees) / length(bestTrees)
plot(majCons)
TreeTools::LabelSplits(majCons, round(splitFreqs * 100), unit = "%",
                       col = TreeTools::SupportColor(splitFreqs),
                       frame = "none", pos = 3L)
```

This approach is a sensible way to analyse clade credibility when the frequency
of a split corresponds to its probability, as is the case in a Bayesian
posterior sample of trees.
This is not the case, however, in sets of most parsimonious trees.

Imagine if our set of most parsimonious trees was expanded to include a single
additional tree in which _Halkieria_ was sister to the brachiopod
_Terebratulina_ (close to the outgroup).
We would then be in a situation in which _Halkieria_ may be a brachiopod –
in which case a single interpretation of molluscan relationships is most
parsimonious – or may be a mollusc, in which case its mosaic of characters
can be reconciled with molluscs in a number of equally-parsimonious ways.
Neither interpretation should be considered more or less plausible, even if we
observe more unique most parsimonious trees in which _Halkieria_ is a mollusc
simply because of the greater resultant uncertainty in the placement of taxa
such as _Odontogriphus_ and _Wiwaxia_.

A more instructive measure of clade support can be generated using
Jackknife resampling.
The `Resample()` [manual page](
https://ms609.github.io/TreeSearch/reference/MaximizeParsimony.html)
has suggestions for appropriate numbers of replicates and search intensity, 
and instructions for calculating bootstrap support;
the code here gives a quick-to-run jackknife framework that can be adapted to
the requirements of a particular dataset.

```{r Jackknife-annotations}
nReplicates <- 10
jackTrees <- lapply(logical(nReplicates), function (x)
  Resample(vinther, bestTrees, ratchIter = 0, tbrIter = 1, maxHits = 4,
           verbosity = 0)
)

strict <- ape::consensus(bestTrees, p = 1)

par(mar = rep(0, 4), cex = 0.8)
# Take the strict consensus of all trees for each replicate
JackLabels(strict, lapply(jackTrees, ape::consensus)) -> XX
```

Jackknife and bootstrap support values give an indication of the volume of data
that supports each node, but don't necessarily indicate whether the data
are unanimous on the existence of a clade: a high bootstrap support value could
indicate a large number of characters supporting a clade, and an only slightly
smaller number of characters contradicting it.

"TreeSearch" implements a number of concordance measures that aim to indicate
whether a dataset is unanimous or divided in support of a grouping, 
independently of the method of tree reconstruction.

```{r concordance}
concordance <- QuartetConcordance(strict, vinther)

# Alternative measures:
# concordance <- ClusteringConcordance(strict, vinther)
# concordance <- PhylogeneticConcordance(strict, vinther)

par(mar = rep(0, 4), cex = 0.8)
plot(strict)
TreeTools::LabelSplits(strict, signif(concordance, 3),
                       col = TreeTools::SupportColor(concordance / max(concordance)),
                       frame = "none", pos = 3L)
```

### Exploring taxon stability

Measure of clade support can only captures one aspect of the uncertainty in a
set of trees.  Often one "rogue" taxon with an uncertain position can mask
agreement in the relationships of other taxa [@SmithCons].
The potential impact of rogue taxa can be explored by colouring individual tips
according to their stability in the tree set:

```{r stability}
par(mar = rep(0, 4), cex = 0.8)

plot(strict, tip.color = Rogue::ColByStability(bestTrees))
```

Would removing an unstable taxon reveal hidden support for relationships at
the base of Mollusca?  We can test to see whether the removal of a taxon from
a summary tree is justified using:

```{r find-rogues}
Rogue::QuickRogue(bestTrees, p = 1)
```

In this case, dropping _Wiwaxia_ would improve the resolution of the
strict consensus by enough to justify the loss of the information
regarding its own position (a net gain of 14.3 bits).
The most informative single summary tree is thus provided by:

```{r cons-without-halk}
par(mar = rep(0, 4), cex = 0.8)
noWiwaxia <- lapply(bestTrees, TreeTools::DropTip, "Wiwaxia")
plot(ape::consensus(noWiwaxia), tip.color = Rogue::ColByStability(noWiwaxia))
```

This reveals that all trees agree that _Halkieria_ and _Orthrozanclus_ are
closer to aculiferan molluscs than _Odontogriphus_ and _Neopilina_, a fact
masked in the strict consensus due to the uncertain position of _Wiwaxia_.

We can see where _Wiwaxia_ would plot on this tree using:

```{r restore-wiwaxia}
par(mar = rep(0, 4), cex = 0.8)
xx <- TreeTools::RoguePlot(bestTrees, "Wiwaxia", p = 1)
```

Brighter greens indicate that more trees contained _Wiwaxia_ in this position.

More details on rogue taxon identification are available in the package
["Rogue"](https://ms609.github.io/Rogue/).

## Implied weighting

Equal weights produces trees that are less accurate and less precise than
implied weights [@Smith2019]; equally weighted analysis should never be
conducted without also considering the results of implied weights 
[@Goloboff1993;@Goloboff1997],
ideally under a range of concavity constants [cf. @Smith2014].

Implied weights can be activated by simply specifying a value of the concavity 
constant, _k_:

```{r iw-search, message = FALSE}
iwTrees <- MaximizeParsimony(vinther, concavity = 10)
par(mar = rep(0.25, 4), cex = 0.75) # make plot easier to read
plot(ape::consensus(iwTrees))
```

Note that we recommend a default value of 10, somewhat higher than the default 
of 3 in TNT; this low default gives poorer results in many settings
[@Goloboff2018;@Smith2019].
Better still is to use multiple values and compare the results,
perhaps in 
[Tree space](https://ms609.github.io/TreeDist/articles/treespace.html).
Even better (?) is to use [profile parsimony](https://ms609.github.io/TreeSearch/articles/profile.html).

## Constraining a search

It is sometimes legitimate to insist that trees must contain a certain clade.
Doing so reduces the number of tree rearrangements that are considered, and 
can this speed up tree search.

"TreeSearch" supports soft constraints, which can be specified using a 
separate Nexus file, or by creating a `phyDat` object in R.
Constraints are effectively phylogenetic characters; only trees on which each
such character fits perfectly will be considered.
The position of taxa not listed in a constraint will not be constrained.

`MaximizeParsimony()` will attempt to find a starting tree that satisfies the
constraints, but if it cannot, it may be necessary to specify one manually
-- perhaps after checking that no constraints are contradictory.

Here's a simple example on six taxa that enforces the bipartition ab | cdef:

```{r simple-constraints, fig.width = 4, fig.align = "center", message = FALSE}
library("TreeTools", quietly = TRUE)
constraint <- MatrixToPhyDat(c(a = 1, b = 1, c = 0, d = 0, e = 0, f = 0))
characters <- MatrixToPhyDat(matrix(
  c(0, 1, 1, 1, 0, 0,
    1, 1, 1, 0, 0, 0), ncol = 2,
  dimnames = list(letters[1:6], NULL)))
plot(MaximizeParsimony(characters, constraint = constraint,
                       verbosity = -1)[[1]])
```

Here's a more complex example that imposes the splits `ab | cef` and `abcd | ef`,
whilst allowing `g` to plot anywhere on the tree:

```{r complex-constraints, fig.width = 4, fig.align = "center", message = FALSE}
characters <- MatrixToPhyDat(matrix(
  c(0, 0, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 0, 0, 0), ncol = 2,
  dimnames = list(letters[1:7], NULL)))
constraint <- MatrixToPhyDat(matrix(
  c(0, 0, 1, "?", 1, 1,
    1, 1, 1,   1, 0, 0), ncol = 2,
  dimnames = list(letters[1:6], NULL)))
plot(MaximizeParsimony(characters, constraint = constraint,
                       verbosity = -1)[[1]])
```

Constraints can also be loaded from a Nexus file with
`constraint <- TreeTools::ReadAsPhyDat("constraint_file.nex")`.

## Where next?

- [Documentation home](https://ms609.github.io/TreeSearch/)

- [Guide to installation](getting-started.html)

- Search for trees using
  - [standard parsimony](tree-search.html) (corrected for inapplicable data)
  - [profile parsimony](profile.html)
  - [custom optimality criteria](custom.Rmd)
  
- Explore the distribution of optimal trees in [mappings](tree-space.html)

## References