--- title: "Introduction to BioTIMEr" author: "Inês S. Martins" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to BioTIMEr} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- The `BioTIMEr` package provides tools designed to interact with the BioTIME database, which contains assemblage time series from multiple habitats and for multiple taxa (Dornelas et al 2018, 2024). The functions provided include the BioTIME recommended methods for preparing time series data (gridding and rarefaction), calculating a selection of standard biodiversity metrics (e.g. species richness, numerical abundance and exponential Shannon index) per year, alongside examples on how to calculate and plot linear trends of biodiversity change over time. Note that each study in BioTIME is identified by a unique Study_ID and that all the functions in BioTIMEr assume that the variables are named as in the BioTIME data. #### **Load the required libraries** ```{r setup, include=FALSE} library(knitr) opts_chunk$set(echo = TRUE) library(vegan) library(dplyr) library(tidyr) library(ggplot2) library(dggridR) set.seed(19) ``` ```{r bt_package_loading, warning=FALSE, message=FALSE, eval=TRUE} # Install and load the latest version of BioTIMEr library(BioTIMEr) ``` When loading the BioTIMEr package, a subset of BioTIME data is loaded together with its metadata. This subset includes 12 studies from marine, terrestrial and freshwater time series across different taxonomic groups. The whole BioTIME database is accessible [here](https://biotime.st-andrews.ac.uk/) ```{r data_description, warning=FALSE, message=FALSE, eval=TRUE, include=TRUE} # you can run the following commands to retrieve more information about the subsets. ?BTsubset_meta ?BTsubset_data ``` ```{r package info, warning=FALSE, message=FALSE, eval=TRUE, include=TRUE} # you can also see a full list of BioTIMEr functions and help pages by: ??BioTIMEr ``` ### How to use BioTIMEr? 1. Grid BioTIME data into a discrete global grid based on the samples’ location (latitude/longitude) to standardize the spatial extent between different studies using: - `gridding()` 2. Apply sample-based rarefaction to standardize the number of samples per year within each cell-level time series using: - `resampling()` 3. Calculate a set of standard alpha and/or beta diversity metrics using: - `getAlphaMetrics()` - `getBetaMetrics()` 4. Fit linear regression models to step 3 outputs as a function of time with: - `getLinearRegressions ()` 5. Visualise the results. Below is a detailed run through these different steps, breaking down some of the tasks and the reasoning behind our suggestions for best practice when analysing BioTIME data. We use the ‘pipe’ operator `(%>%)` to make our code more efficient and streamlined, and we use ggplot2 library for most visualisations. This and other libraries required for BioTIMEr to run will be, if needed, installed at the same time BioTIMEr is installed. ---------------------------------------------------------------------- ### Why bother with separating observations using a global grid? Typically, a BioTIME study contains distinct samples collected with a consistent methodology for a given period of time. However, the spatial extent and sample distribution among studies can vary substantially depending on the study and the taxon being sampled (e.g. forest quadrants, bird transects, fish tows, etc.; which can be implemented within smaller or larger areas). Because the spatial extent varies among studies, an advisable step is to grid those studies that have large extents and multiple sampling locations into smaller equal-sized subsets. The `gridding()` function is designed to perform such task using a global grid of hexagonal cells derived from dggridR (see `vignette(‘dggridR’)`). By implementing this gridding, the large scales studies are split into multiple cell-level assemblages, while the small scale studies are assigned to a single cell. Let's apply `gridding()` to a subset of BioTIME data and see what happens: ```{r gridding_ex, cache=TRUE, echo=TRUE, message=FALSE, tidy=FALSE, include=TRUE} grid_samples <- gridding(meta = BTsubset_meta, btf = BTsubset_data, res = 12, resByData = FALSE) # Get a look at the output # grid_samples %>% head() %>% kable() ``` You can see that each sample is assigned a different combination of study ID and grid cell (based on its latitude and longitude) resulting in a unique identifier that determines our assemblage time series (`assemblageID`). This `assemblageID` can now be used as the study unit for analysis of biodiversity change based on similar spatial extent between them. Let’s take a look and see how many studies/cells/assemblages time series are there in this subset: ```{r gridding_ex_plot, cache=TRUE, echo=TRUE, message=FALSE, fig.width=7, fig.height=3,tidy=FALSE, include=TRUE} check_res_1 <- grid_samples %>% group_by(STUDY_ID, StudyMethod) %>% summarise(n_cell = n_distinct(cell), n_aID = n_distinct(assemblageID), res = "res12") check_res_1 %>% head(10) %>% kable() # How many samples were there in each study? ggplot(data = check_res_1) + geom_bar(mapping = aes(x = as.character(STUDY_ID), y = n_aID, fill = res), stat = "identity") + scale_fill_discrete(type = c("#155f49")) + xlab("StudyID") + ylab("Number of assemblages in a study") + theme_bw() + theme(legend.position = "none") ``` You can see that several studies are not partitioned, and that's because all their observations were already contained within a single cell. #### **What happens if I change `res`?** By default, the function separates studies into hexagonal cells of ~96 km2 (`res=12`), as this resolution was found to be the most appropriate when working on the whole BioTIME database. Yet sometimes, particularly when working with a subset of BioTIME, you may want to explore alternative grid resolutions. The function has two options: you can either directly define a new resolution (see `vignette(‘dggridR’)`) or allow the function to find the best resolution based on the average study extent of your dataset. Let’s run over these alternative options with a few examples: ```{r gridding_ex2, cache=TRUE, echo=TRUE, message=FALSE, tidy=FALSE, include=TRUE,results='hold'} # define an alternative resolution of 14 (~10.7 km2 cells) grid_samples_14 <- gridding(meta = BTsubset_meta, btf = BTsubset_data, res = 14, resByData = FALSE) # allow the spatial extent of the data to define the resolution grid_samples_auto <- gridding(meta = BTsubset_meta, btf = BTsubset_data, res = 12, resByData = TRUE) # this option also returns a message with the automatically picked resolution: ``` So what changed with different `res` values? ```{r gridding_ex3, cache=TRUE, echo=TRUE, message=FALSE, fig.width=7, fig.height=3,tidy=FALSE, include=TRUE} check_res_2 <- grid_samples_14 %>% group_by(StudyMethod, STUDY_ID) %>% summarise(n_cell = n_distinct(cell), n_aID = n_distinct(assemblageID), res = "res14") check_res_3 <- grid_samples_auto %>% group_by(StudyMethod, STUDY_ID) %>% summarise(n_cell = n_distinct(cell), n_aID = n_distinct(assemblageID), res = "res15") checks <- rbind(check_res_1, check_res_2, check_res_3) ggplot(data = checks) + geom_bar(mapping = aes(x = as.character(STUDY_ID), y = n_aID, fill = res), stat = "identity", position = "dodge") + scale_fill_discrete(type = c("#155f49","#66c1d1","#d9d956")) + xlab("StudyID") + ylab("Number of assemblages in a study") + theme_bw() ``` The number of cells in each study varies depending on the resolution chosen. You can also map what happens when gridding the data: ```{r gridding_map, cache=TRUE, echo=TRUE, message=FALSE, fig.width=7, fig.height=3,tidy=FALSE, include=TRUE,results='hold', figures='hold'} ## The following example is built on the demonstrations in ## https://cran.r-project.org/web/packages/dggridR/vignettes/dggridR.html. # First we build the ~96 km2 global grid dgg_12 <- dggridR::dgconstruct(res = 12) # To simplify, we only map the grid cell boundaries for cells which # have observations. # NOTE: if you are working with the full BioTIME database, this step may take some time. grid_12 <- dggridR::dgcellstogrid(dgg_12, grid_samples$cell) # Now let's follow the same steps and build a ~10.7 km2 global grid: dgg_14 <- dggridR::dgconstruct(res = 14) grid_14 <- dggridR::dgcellstogrid(dgg_14, grid_samples_14$cell) # And we get some polygons for each country of the world, to create a background: countries <- ggplot2::map_data("world") # Now you could map the whole world, but let's just zoom in the UK and have a look at # STUDY 466 (Marine Fish): map_uk_locations <- ggplot() + geom_polygon(data = countries, aes(x = long, y = lat, group = group), fill = NA, color = "grey") + geom_sf(data = grid_12, aes(), color = alpha("blue", 0.4)) + coord_sf(xlim = c(-20, 10), ylim = c(50, 60)) + geom_rect(aes(xmin = -11, xmax = -0.7, ymin = 57.2, ymax = 59), colour = "red", fill = NA) + labs(x = NULL, y = NULL) + theme_bw() + theme(text = element_text(size = 8)) zoom_in_map <- ggplot() + geom_polygon(data = countries, aes(x = long, y = lat, group = group), fill = NA, color = "grey") + geom_sf(data = grid_12, aes(), color = alpha("blue", 0.4)) + geom_sf(data = grid_14, aes(), color = alpha("red", 0.4)) + coord_sf(xlim = c(-11, -0.7), ylim = c(57.2, 59)) + theme_bw() grid::grid.newpage() main <- grid::viewport(width = 1, height = 1, x = 0.5, y = 0.5) # the main map inset <- grid::viewport(width = 0.4, height = 0.4, x = 0.82, y = 0.45) # the inset in bottom left # The resulting distribution of different size cells appears as follows: print(zoom_in_map, vp = main) print(map_uk_locations, vp = inset) ``` We can see that the higher the resolution set the higher the number of cells into which samples get assigned. There is thus the risk of the resulting assemblages time series becoming too small to have any ecological meaning. A good rule of thumb is to never set `res` higher than the resolution you get when `resByData = TRUE`. ### Why rarefaction is a must! Now we will deal with another common issue when working with time series data: the fact that the number of samples may vary over time. When evaluating changes through time it's important to be able to quantify change independently of the number of samples, i.e. we want to be able to assess whether e.g. alpha diversity has truly increased or decreased over time and that such signal is not a sampling artifact due to uneven sampling effort – i.e. resulting from having more or fewer samples in different years or periods of the time series being compared. Probably the most common and widely used approach to prevent temporal variation in sampling effort from affecting diversity estimate is to apply sample-based rarefaction. Here (add link) is a great introduction to it. BioTIMEr provides the `resampling()` function that implements sample-based rarefaction, already adapted to BioTIME data and designed to run smoothly over the previously gridded dataset as done above. The function identifies the minimum number of samples across all years in a time series and then uses this minimum to randomly resample each year down to that number. Let's continue with our BioTIME example subset dataset: ```{r resampling_ex, cache=TRUE, echo=TRUE, message=FALSE, tidy=FALSE, include=TRUE} # First, if you are not sure you need this step, # you can always check how many samples there are in every year of the different time series: check_samples <- grid_samples %>% group_by(STUDY_ID, assemblageID, YEAR) %>% summarise(n_samples = n_distinct(SAMPLE_DESC), n_species = n_distinct(Species)) check_samples %>% head(10) %>% kable() ``` Looking at the `n_samples` column we can already see for the few assemblages in this subset that there are substantial differences in sampling effort (i.e. number of samples) through time. Thus, we should standardize for sampling effort: ```{r resampling_ex1, cache=TRUE, echo=TRUE, message=FALSE, tidy=FALSE, include=TRUE} # Let's apply resampling() then, using the data frame of the gridded data: grid_rare <- resampling(x = grid_samples, measure = "ABUNDANCE", resamps = 1, conservative = FALSE) ``` ```{r tests1, cache=TRUE, echo=TRUE, message=FALSE, tidy=TRUE, include=FALSE} #let's apply it then: #g rid_samples_temp <- subset(grid_samples, !grid_samples$BIOMASS==0) #to be deleted after Faye reviews the data and makes all 0=NAs # grid_rare <- resampling( x= grid_samples_test, measure ="BIOMASS", resamps = 1, conservative = FALSE) ``` The function returns a single long-form data frame containing the total currency of interest (i.e. summed across all samples) for each species in each year within each assemblage time series. You can specify what the currency field is: Abundance, Biomass or both (BioTIME studies can include records for either or both). Note that resampling() also tests and removes any `NAs` within the selected currency field(s) before implementing the sample-based rarefaction. Thus, if you choose to use both currency fields, only observations (when `conservative = FALSE`) or full sampling events (when `conservative = TRUE`) in which BOTH Abundance and Biomass were recorded are retained and a warning is issued as follows: ```{r resampling_ex12, cache=TRUE, echo=TRUE, message=FALSE, tidy=TRUE, include=TRUE} # Keep only observations with both abundance and biomass grid_rare_ab <- resampling(x = grid_samples, measure = c("ABUNDANCE", "BIOMASS"), resamps = 1, conservative = FALSE) # Keep only sampling events where all observations within had both abundance and biomass to start with grid_rare_abT <- resampling(x = grid_samples, measure = c("ABUNDANCE", "BIOMASS"), resamps = 1, conservative = TRUE) ``` #### **How many resampling iterations should we implement?** Remember that sample-based rarefaction involves randomly selecting a specified number of samples (i.e. sampling events) for each year, this number being equal to the number of samples in the year with the lowest sampling effort in each time series. If the sampling effort through time was equal or relatively stable, then you probably do not need to concern yourself any further. However, for many time series in BioTIME (or indeed other biodiversity data) sampling effort can vary substantially, and thus the sample-based rarefaction process can lead to different species diversity and composition simply by randomly selecting different samples. To help exemplify this point across, let’s pick one assemblage time series from the subset and plot the number of samples recorded in each year: ```{r resampling_ex3, cache=TRUE, echo=TRUE, message=FALSE, fig.width=7, fig.height=3,tidy=FALSE, include=TRUE,results='hold', figures='hold'} # What is the number of samples in the year with the lowest sampling effort? ggplot(data = check_samples[check_samples$assemblageID == "18_335699",], aes(x = YEAR, y = n_samples)) + geom_col(aes(x = YEAR, y = n_samples), fill = "red", alpha = 0.5) + geom_segment(aes(x = 1926, y = min(n_samples) + 3, xend = 1927, yend = min(n_samples)), arrow = arrow(length = unit(0.2, "cm"))) + xlab("Year") + ylab("Number of samples") + theme_bw() # In this case,the year 1927 had the lowest sampling effort, with 3 samples (arrow). ``` In this particular case, what `resampling()`will do is to randomly select and retain only 3 samples from each year of this time series. As you can see in the plot, several years have 20 samples or more; so if you run the function a second time it’s likely that the sample-based rarefaction process will select a different combination of 3 samples (with the exception of year 1927), which may or may not differ in their species diversity and composition. One option to deal with this issue is to repeat this random sampling process multiple times, effectively creating multiple alternative datasets for the same time series (and which can be used in subsequent sensitivity analysis or to estimate the variability in diversity metrics that originates from the random sample selection process): ```{r resampling_ex4, cache=TRUE, echo=TRUE, message=FALSE, fig.width=7, fig.height=3,tidy=FALSE, include=TRUE,results='hold', figures='hold'} # Let's implement the sample-based rarefaction by resampling the dataset 10 times. grid_rare_n10 <- resampling(x = grid_samples, measure = "ABUNDANCE", resamps = 10, conservative = FALSE) # Note that you may want to resample many more times (e.g. at least 30-100+ times, but up to 199 # if e.g. working with the whole BioTIME data), depending on how many iterations you want a # subsequent bootstrap analysis to have. # This may also take some computation time, so if you are working with a big subset of BioTIME # is advisable to break it down in smaller subsets. # Each resampling iteration will be identified as resamp = 1:n, in this case 1:10. # Now we can check if there are differences across the first 3 of these iterations: check_resamps <- grid_rare_n10[grid_rare_n10$resamp < 4,] %>% group_by(STUDY_ID, assemblageID, resamp) %>% summarise(n_obs = n(), n_species = n_distinct(Species), n_year = n_distinct(YEAR)) check_resamps %>% head(10) %>% kable() ``` Since sampling events within a study are retrieved with a consistent methodology and observational effort, we don’t expect to see big differences between picking some sampling events over others, however it’s still an advisable check to do. The example above also highlights the fact that during the process of standardising the number of samples per year within each time series, a lot of data will be discarded. Additional steps, such as removing low sampling years, before applying the `resampling()` function may minimize this effect. **Your data is now ready for analysis!** You can stop here and go do your own thing (the world is your oyster!)...or you can follow the next steps and learn how to use the BioTIMEr package to estimate a few standard diversity metrics, as well as some hints on how to quantify and visualize temporal biodiversity change for a given BioTIME data subset. Please note that as BioTIMEr is designed to interact with BioTIME data, the functions are built using BioTIME variable names. However the functions and workflow we describe can be easily applied to any time series data as long as the variable names are modified accordingly before. ### Calculate diversity estimates Now the following functions are pretty straightforward. `getAlphaMetrics()` and `getBetaMetrics()` estimate a set of common alpha and beta diversity metrics for each year within each time series in your BioTIME data. Let’s apply them to our now processed BioTIME example dataset: ```{r metrics, cache=TRUE, echo=TRUE, message=FALSE, fig.width=7, fig.height=3,tidy=FALSE, include=TRUE, figures='hold'} # Get alpha metrics estimates: alpha_metrics <- getAlphaMetrics(x = grid_rare, measure = "ABUNDANCE") # see also help("getAlphaMetrics") for more details on the metrics # Have a quick look at the output alpha_metrics %>% head(6) %>% kable() # Get beta metrics estimates: beta_metrics <- getBetaMetrics(x = grid_rare, measure = "ABUNDANCE") #see also help("getBetaMetrics") for more details on the metrics # Have a quick look at the output beta_metrics %>% head(6) %>% kable() # NOTE the functions used the rarefied data with only one resampling iteration ``` ```{r tests2, cache=TRUE, echo=TRUE, message=FALSE, fig.width=7, fig.height=3,tidy=FALSE, include=FALSE, figures='hold'} # # Get alpha metrics estimates: # alpha_metrics <- getAlphaMetrics(x = grid_rare_ab, measure = "BIOMASS") # #see also help("getAlphaMetrics") for more details on the metrics # # #Have a quick look at the output # alpha_metrics %>% head(6) %>% kable() # # # Get beta metrics estimates: # beta_metrics <- getBetaMetrics(x = grid_rare_ab, measure = "BIOMASS") # #see also help("getBetaMetrics") for more details on the metrics # # #Have a quick look at the output # beta_metrics %>% head(6) %>% kable() ``` There are many metrics for measuring biodiversity in community ecology. Here, we focused on the more commonly used metrics when measuring temporal changes in biodiversity. getAlphaMetrics() estimates nine alpha diversity metrics: Species richness (`S`), numerical abundance (`N`), maximum numerical abundance (`maxN`), Shannon index, Simpson index, inverse Simpson, McNaughton dominance, probability of intraspecific encounter (`PIE`) and exponential Shannon. getBetaMetrics() estimates three beta diversity metrics: Jaccard dissimilarity, Bray-Curtis dissimilarity and Morisita-Horn dissimilarity, assuming the first year in the time series as the baseline. Note that `N` and `maxN` metrics return numerical abundance when `measure=ABUNDANCE`, and biomass values when `measure=BIOMASS`. ### Investigating biodiversity trends Finally, when looking to measure biodiversity change using temporal data, it is common to estimate temporal trends in alpha diversity and beta diversity by fitting a simple linear regression to the diversity estimates over the duration of the time series. The `getLinearRegressions()` function implements linear models as a function of year and is designed to run smoothly over an output of either the `getAlphaMetrics()` or `getBetaMetrics()` functions. Specifically, it provides the results of simple linear regressions (slope, p-value, significance, intercept) using lm(), and fit individually to the yearly diversity estimates of each metric within each assemblage time series in your dataset. ```{r trends, cache=TRUE, echo=TRUE, message=FALSE, fig.width=7, fig.height=3,tidy=FALSE, include=TRUE,results='hold', figures='hold'} # Let's apply it then: alpha_slopes <- getLinearRegressions(x = alpha_metrics, divType = "alpha", pThreshold = 0.05) #for alpha metrics # Have a quick look at the output alpha_slopes %>% head(6) %>% kable() beta_slopes <- getLinearRegressions(x = beta_metrics, divType = "beta", pThreshold = 0.05) #for beta metrics # Have a quick look at the output # beta_slopes %>% head(6) %>% kable() ``` You can change the default P-value threshold for statistical significance (i.e. the measure of the strength of evidence against `H0`). For easy sorting, plotting and interpretation, a column called `significance` is also provided, where 1 indicates the threshold is met, i.e. the estimated slope is significant at the specified p-value level. Also note that some biodiversity analysis may benefit from additional transformations and scaling of data before modelling (check a nice tutorial on the subject [here](https://ourcodingclub.github.io/tutorials/data-scaling/)). ### Summarize and visualize biodiversity trends Now that we have quantified change through time, we are ready to make some plots! The next section provides only a few examples of queries and data visualization, but it’s not a comprehensive exploration of the results, which will also vary depending on your own research question. Right, let's start will some overall summaries and trends: ```{r trends2, cache=TRUE, echo=TRUE, message=FALSE, fig.width=10, fig.height=7,tidy=FALSE, include=TRUE} # First, how many assemblages in our dataset show a moderate evidence (P < 0.05) of change in alpha diversity? check_alpha_trend <- alpha_slopes %>% group_by(metric) %>% filter(significance == 1) %>% # or use filter(pvalue<0.05) summarise(n_sig = n_distinct(assemblageID), mean(slope)) check_alpha_trend %>% kable() # We can see that only a few (<40) of the assemblage time series actually show a significant # trend of change over time, independently of the metric used. This indicates that in most # time series in the studies we analysed alpha diversity is not really changing through time. ``` Let’s do some summary plots showing the variation of alpha diversity change: ```{r trends3, cache=TRUE, echo=TRUE, message=FALSE, fig.width=7.2, fig.height=5,tidy=FALSE, include=TRUE,results='hold'} # Get a slope per assemblageID and metric alpha_slopes_simp <- alpha_slopes %>% group_by(assemblageID, metric, pvalue) %>% filter(significance == 1) %>% #select only the assemblages with significant trends summarise(slope = unique(slope)) # Calculate the mean slope and CIs stats_alpha <- alpha_slopes_simp %>% group_by(metric) %>% summarise(mean_slope = mean(slope), #mean ci_slope = qt(0.95, df = length(slope) - 1) * (sd(slope, na.rm = TRUE) / sqrt(length(slope)))) #margin of error # Let's put it all together ggplot(data = alpha_slopes_simp) + geom_histogram(aes(x = slope, fill = metric), bins = 25) + #geom_density(alpha=0.5)+ #in case you what a density plot instead geom_vline(aes(xintercept = 0), linetype = 3, colour = "grey",linewidth = 0.5) + #add slope=0 line geom_vline(data = stats_alpha, aes(xintercept = mean_slope), linetype = 1, linewidth = 0.5,colour = "black") + #mean geom_vline(data = stats_alpha, aes(xintercept = mean_slope - ci_slope), linetype = 2, linewidth = 0.5, colour = "black") + #lower confidence interval geom_vline(data = stats_alpha, aes(xintercept = mean_slope + ci_slope), linetype = 2, linewidth = 0.5, colour = "black") + #upper confidence interval facet_wrap(~metric, scales = "free") + scale_fill_biotime() + #using the customize BioTIME colour scale. See help("scale_color_biotime") for more options. ggtitle("Alpha diversity change") + theme_bw() + theme(legend.position = "none",plot.title = element_text(size = 11, hjust = 0.5)) ``` Now the same for beta diversity change: ```{r trends4, cache=TRUE, echo=FALSE, message=FALSE, fig.width=7.2, fig.height=1.9,tidy=FALSE, include=TRUE,results='hold'} # Get a slope per assemblageID and metric beta_slopes_simp <- beta_slopes %>% group_by(assemblageID, metric, pvalue) %>% filter(significance == 1) %>% #select only the assemblages with significant trends summarise(slope = unique(slope)) # Calculate the mean slope and CIs stats_beta <- beta_slopes_simp %>% group_by(metric) %>% summarise(mean_slope = mean(slope), #mean ci_slope = qt(0.95, df = length(slope) - 1) * (sd(slope, na.rm = TRUE) / sqrt(length(slope)))) #margin of error #Let's put it all together ggplot(data = beta_slopes_simp) + geom_histogram(aes(x = slope, fill = metric), bins = 25) + #geom_density(alpha=0.5)+ #in case you what a density plot instead geom_vline(aes(xintercept = 0), linetype = 3, colour = "grey", linewidth = 0.5) + #add slope=0 line geom_vline(data = stats_beta, aes(xintercept = mean_slope), linetype = 1, linewidth = 0.5, colour = "black") + #mean geom_vline(data = stats_beta, aes(xintercept = mean_slope - ci_slope), linetype = 2, linewidth = 0.5, colour = "black") + #lower confidence interval geom_vline(data = stats_beta, aes(xintercept = mean_slope + ci_slope), linetype = 2, linewidth = 0.5, colour = "black") + #upper confidence interval facet_wrap(~metric, scales = "free") + scale_fill_biotime() + #using the customize BioTIME colour scale. See help("scale_color_biotime") for more options. ggtitle("Beta diversity change") + theme_bw() + theme(legend.position = "none",plot.title = element_text(size = 11, hjust = 0.5)) ``` ```{r trends5, cache=TRUE, echo=TRUE, message=FALSE, fig.width=7.2, fig.height=1.9,tidy=FALSE, include=TRUE,results='hold'} # Hint: If you wish to plot all metrics together, simply merge you alpha_slopes and # beta_slopes dataframes beforehand. ``` That's great! While overall trends are important, you may also want to evaluate the patterns for a specific taxa or climate, and look at both sets of trends (i.e. significant and non-significant) together. For example, let's say you want to have a look at changes in Shannon diversity for the two more represented taxon in our example BioTIME dataset. ```{r trends6, cache=TRUE, echo=TRUE, message=FALSE, fig.width=7.2, fig.height=2.3,tidy=FALSE, include=TRUE,results='hold'} # First, we need to check the meta file and retrieve the information for the studies of interest #head(BTsubset_meta) meta <- select(BTsubset_meta, STUDY_ID, TAXA, REALM, CLIMATE) # Get a slope per assemblageID and metric alpha_slopes_simp <- alpha_slopes %>% group_by(assemblageID, metric,'pvalue', significance) %>% summarise(slope = unique(slope)) # Get back the Study ID by separating our assemblageID column into multiple other columns. alpha_slopes_simp <- as.data.frame(alpha_slopes_simp %>% separate(., assemblageID, into = c("STUDY_ID", "cell"), sep = "_", remove = FALSE)) # Merge it all alpha_slopes_meta <- merge(alpha_slopes_simp, meta, by = "STUDY_ID") # Select relevant data for plotting alpha_slopes_set1 <- subset(alpha_slopes_meta, alpha_slopes_meta$metric == "Shannon" & (alpha_slopes_meta$TAXA == "Fish" | alpha_slopes_meta$TAXA == "Birds")) # Let's put it all together ggplot(data = alpha_slopes_set1, aes(x = slope, fill = TAXA)) + geom_histogram(fill = "grey", bins = 25) + #all assemblages geom_histogram(data = alpha_slopes_set1[alpha_slopes_set1$significance == 1, ], bins = 25) + #only significant change assemblages geom_vline(aes(xintercept = 0), linetype = 3, colour = "black", linewidth = 0.5) + #add slope=0 line facet_wrap(~TAXA, scales = "free") + scale_fill_biotime() + ggtitle("Shannon-Weiner Species Diversity Index") + theme_bw() + theme(plot.title = element_text(size = 11, hjust = 0.5)) ``` ```{r trends7, cache=TRUE, echo=TRUE, message=FALSE, fig.width=7.2, fig.height=1.9, include=TRUE,results='hold'} # Now we see the full distribution of slopes of change for the Shannon index for the birds # and fish time series in our subset: all slopes are shown in grey, while in color we show # the subset of assemblages for which evidence (P < 0.05) of change was detected. ``` We might also want to explore the different trends for a given time series, allowing us to visualise the data and the trend line: ```{r trends8, cache=TRUE, echo=TRUE, message=FALSE, tidy=FALSE, include=TRUE,fig.width=7.2, fig.height=5, eval=TRUE,results='hold', figures='hold'} # Let's go back to the data frame with the yearly values and select our relevant data for plotting alpha_metrics_set <- subset(alpha_metrics, assemblageID == "18_335699") # Transform data alpha_metrics_set_long <- pivot_longer(alpha_metrics_set, cols = c("S", "N", "maxN", "Shannon", "expShannon", "Simpson", "invSimpson", "PIE", "DomMc"), names_to = "metric", names_transform = as.factor) # Get assemblage ID slope of change alpha_slopes_set2 <- subset(alpha_slopes, assemblageID == "18_335699") # Get assemblage ID name_assemblage <- unique(alpha_slopes_set2$assemblageID) # Merge info alpha_metr_slop<- left_join(alpha_slopes_set2, alpha_metrics_set_long, join_by(assemblageID, metric)) # Let's put it all together ggplot(data = alpha_metr_slop, aes(x = YEAR, y = value)) + geom_point(colour = "#155f49", size = 1) + #plot the yearly estimates stat_smooth(method = "lm", se = FALSE, linetype = 2, colour = "grey") + #draw all regression line stat_smooth(data = subset(alpha_metr_slop, alpha_metr_slop$significance == 1), aes(x = YEAR, y = value), method = "lm", se = FALSE, linetype = 2, colour = "#155f49") + #color only trends that are significant facet_wrap(~metric, scales = "free") + scale_fill_biotime() + ggtitle(paste("Assemblage", name_assemblage)) + ylab("Diversity") + theme_bw() + theme(plot.title = element_text(size = 11, hjust = 0.5)) ``` ```{r trends9, cache=TRUE, echo=TRUE, message=FALSE, fig.width=10, fig.height=7,tidy=FALSE, include=FALSE,results='hold', figures='hold'} # Hint: If you want to draw the p-value on the plot, you can try using the #ggpmisc::stat_fit_glance which takes anything passed through lm() in R and #allows it to processed and printed using ggplot2. ``` We can see that results vary depending on the dataset and the metric of change used. It is important to remember that biodiversity metrics are just a tool used to measure changes, and each have diverse abilities to detect changes in different biodiversity dimensions and under different scenarios e.g.They thus provide different pieces of a complex puzzle and may strongly affect our interpretation of biodiversity change. ### Credits and Disclaimer This R package was developed by the BioTIME team. The functions and documentation were developed predominantly by Alban Sagouis, Faye Moyes and Inês S. Martins. For optimal use, the package should operate in the manner described here. Feedback is appreciated, particularly if you come across any errors or would like to suggest changes. Updates will be made sporadically. ### How to cite BioTIMEr ```{r citation, cache=TRUE, echo=FALSE, message=FALSE, fig.width=10, fig.height=7,tidy=FALSE, include=FALSE,results='hold', figures='hold'} citation("BioTIMEr") ``` Sagouis A, Moyes F, Martins IS, Blowes SA, Brambilla V, Chow CFY, Fontrodona-Eslava A, Antão L, Chase JM, Dornelas M, Magurran AE (2024). BioTIMEr: tools to use and explore the BioTIME database. https://github.com/bioTIMEHub/BioTIMEr