---
title: 'Example: data'
author: "Thijs Janzen"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteEncoding{UTF-8}
  %\VignetteIndexEntry{Example: data}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  markdown: 
    wrap: 72
---

```{r setup, include=FALSE}
required <- c("ape", "pheatmap")
if (!all(unlist(lapply(required,
                       function(pkg) requireNamespace(pkg, quietly = TRUE)))))
  knitr::opts_chunk$set(eval = FALSE)

knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(fig.width = 6)
knitr::opts_chunk$set(fig.height = 6)


```

## Data

The treestats package can rapidly calculate summary statistics on
phylogenetic trees, and in this vignette, we demonstrate this on
empirical trees. We will make use of family-level pruned trees stemming
from the clootl supertree of birds. These were created for the original
publication accompanying the treestats paper.

```{r read_trees}
focal_trees <- ape::read.tree(file = "https://raw.githubusercontent.com/thijsjanzen/treestats-scripts/main/datasets/phylogenies/fracced/birds.trees")  # nolint
```

We can now calculate all summary statistics for all trees:

```{r calc_sum_stats}
all_stats <- c()
for (i in seq_along(focal_trees)) {
  focal_stats <- treestats::calc_all_stats(focal_trees[[i]])
  all_stats <- rbind(all_stats, focal_stats)
}
all_stats <- as.data.frame(all_stats)
```

We can now, for instance, plot the distribution of family sizes in
birds:

```{r plot_size}
hist(all_stats$number_of_lineages)
```

Furthermore, we can make a heatmap of all correlations:

```{r corr}
cor.dist <- cor(all_stats)
diag(cor.dist) <- NA
heatmap(cor.dist)
```

This will generate a distorted image: correlations are not corrected for
tree size. We can study this a bit more in detail:

```{r plot with tree size}
opar <- par()
par(mfrow = c(3, 3))
for (stat in c("area_per_pair", "colless", "eigen_centrality",
               "four_prong", "max_betweenness", "max_width",
               "mntd", "sackin", "wiener")) {
  if (stat != "number_of_lineages") {
    x <- all_stats[, colnames(all_stats) == "number_of_lineages"]
    y <- all_stats[, colnames(all_stats) == stat]
    plot(y ~ x, xlab = "Tree size", ylab = stat, pch = 16)
  }
}
par(opar)
```

To correct for this, we will have to go over the entire correlation
matrix.

```{r correct_corr}
tree_size <- all_stats[, colnames(all_stats) == "number_of_lineages"]

for (i in seq_len(nrow(cor.dist))) {
  for (j in seq_len(ncol(cor.dist))) {
    stat1 <- rownames(cor.dist)[i]
    stat2 <- colnames(cor.dist)[j]
    x <- all_stats[, colnames(all_stats) == stat1]
    y <- all_stats[, colnames(all_stats) == stat2]

    a1 <- lm(x ~ tree_size)
    a2 <- lm(y ~ tree_size)
    new_cor <- cor(a1$residuals, a2$residuals)
    cor.dist[i, j] <- new_cor
  }
}
diag(cor.dist) <- NA
heatmap(cor.dist)
```

A nicer way to visualize this is given by the package ppheatmap:

```{r nicer plot, out.width="100%" }
if (requireNamespace("pheatmap")) pheatmap::pheatmap(cor.dist)
```