---
title: "APCalign"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{APCalign}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

When working with biodiversity data, it is important to verify taxonomic names against an authoritative list and correct any out-of-date names.

The `APCalign` package simplifies this process by:

- Accessing up-to-date taxonomic information from the [Australian Plant Census](https://biodiversity.org.au/nsl/services/search/taxonomy) and the [Australian Plant Name Index](https://biodiversity.org.au/nsl/services/search/names).
- Aligning authoritative names to your taxonomic names using our [fuzzy matching algorithm](https://traitecoevo.github.io/APCalign/articles/updating-taxon-names.html).
- Updating your taxonomic names in a transparent, reproducible manner.

## Installation

```r
install.packages("remotes")
remotes::install_github("traitecoevo/APCalign")

library(APCalign)
```

To demonstrate how to use `APCalign`, we will use the example dataset `gbif_lite`, which is documented in `?gbif_lite`.

```r
dim(gbif_lite)
#> [1] 129 7

gbif_lite |> print(n = 6)
#> # A tibble: 129 × 7
#> species infraspecificepithet taxonrank decimalLongitude decimalLatitude scientificname
#>
#> 1 Tetratheca… SPECIES 145. -37.4 Tetratheca ci…
#> 2 Peganum ha… SPECIES 139. -33.3 Peganum harma…
#> 3 Calotis mu… SPECIES 115. -24.3 Calotis multi…
#> 4 Leptosperm… SPECIES 151. -34.0 Leptospermum …
#> 5 Lepidosper… SPECIES 142. -37.3 Lepidosperma …
#> 6 Enneapogon… SPECIES 129. -17.8 Enneapogon po…
#> # ℹ 123 more rows
#> # ℹ 1 more variable: verbatimscientificname
```

## Retrieve taxonomic resources

The first step is to retrieve the entire APC and APNI name databases and store them locally as taxonomic resources. We achieve this using `load_taxonomic_resources()`. The resources are compressed as parquet files to speed up download and local loading.
There are two versions of the databases that you can retrieve with the `stable_or_current_data` argument. Calling:

- `stable` will retrieve the most recent archived version of the databases from our [GitHub releases](https://github.com/traitecoevo/APCalign/releases). This is the default option.
- `current` will retrieve the up-to-date databases directly from the APC and APNI websites.

Note that the databases are reasonably large, so the initial retrieval of the core data will take a few minutes. Once the taxonomic resources have been stored locally, subsequent retrievals will take less time. Retrieving `current` resources will always take longer because it accesses the latest information from the websites in an uncompressed format.

```r
# Benchmarking the retrieval of `stable` or `current` resources
stable_start_time <- Sys.time()
stable_resources <- load_taxonomic_resources(stable_or_current_data = "stable")
#> Loading resources......done
stable_end_time <- Sys.time()

current_start_time <- Sys.time()
current_resources <- load_taxonomic_resources(stable_or_current_data = "current")
#> Loading resources......done
current_end_time <- Sys.time()

# Compare times
stable_end_time - stable_start_time
#> Time difference of 16.48976 secs
current_end_time - current_start_time
```

For a more reproducible workflow, we recommend specifying the exact `stable` version you want to use.

```r
resources <- load_taxonomic_resources(stable_or_current_data = "stable", version = "0.0.2.9000")
#> Loading resources......done
```

## Align and update plant taxon names

Now we can query our taxonomic names against the taxonomic resources we just retrieved using `create_taxonomic_update_lookup()`. This all-in-one function will:

- Align your taxonomic names to APC and APNI using our [matching algorithms](https://traitecoevo.github.io/APCalign/articles/updating-taxon-names.html).
- Update names to an APC-accepted species or infraspecific name whenever possible.
- Return a suggested name for all names, defaulting to an `accepted_name` when available, and otherwise providing an APNI name or a name where only a genus-level alignment is possible.

If you would like to learn more about each of these steps, take a look at the section [Closer look at name alignment and updating with 'APCalign'](#closer-look).

```r
library(dplyr)

updated_gbif_names <- gbif_lite |>
  pull(species) |>
  create_taxonomic_update_lookup(resources = resources)
#> Checking alignments of 121 taxa
#> -> 0 names already matched; 0 names checked but without a match; 121 taxa yet to be checked

updated_gbif_names |> print(n = 6)
#> # A tibble: 129 × 12
#> original_name aligned_name accepted_name suggested_name genus taxon_rank taxonomic_dataset
#>
#> 1 Tetratheca c… Tetratheca … Tetratheca c… Tetratheca ci… Tetr… species APC
#> 2 Peganum harm… Peganum har… Peganum harm… Peganum harma… Pega… species APC
#> 3 Calotis mult… Calotis mul… Calotis mult… Calotis multi… Calo… species APC
#> 4 Leptospermum… Leptospermu… Leptospermum… Leptospermum … Lept… species APC
#> 5 Lepidosperma… Lepidosperm… Lepidosperma… Lepidosperma … Lepi… species APC
#> 6 Enneapogon p… Enneapogon … Enneapogon p… Enneapogon po… Enne… species APC
#> # ℹ 123 more rows
#> # ℹ 5 more variables: taxonomic_status, scientific_name_authorship,
#> #   aligned_reason, update_reason, number_of_collapsed_taxa
```

The columns of the lookup are:

- `original_name` is the taxon name used in your original data.
- `aligned_name` is the taxon name we used to link with the APC to identify any synonyms.
- `accepted_name` is the currently accepted taxon name used by the Australian Plant Census.
- `suggested_name` is the best possible name option for the `original_name`.

## Plant established status across states/territories

'APCalign' can also provide the state/territory distribution for established status (native/introduced) from the APC.
We can access the established status data by state/territory using `create_species_state_origin_matrix()`.

```r
# Retrieve status data by state/territory
status_matrix <- create_species_state_origin_matrix(resources = resources)
```

Here is a breakdown of all possible values for `origin`:

```r
library(purrr)
library(janitor)

# Obtain unique values
status_matrix |>
  select(-species) |>
  flatten_chr() |>
  tabyl()
#> flatten_chr(select(status_matrix, -species)) n percent
#> doubtfully naturalised 1120 2.371003e-03
#> formerly naturalised 277 5.863998e-04
#> native 40336 8.538997e-02
#> native and doubtfully naturalised 9 1.905270e-05
#> native and naturalised 136 2.879075e-04
#> native and uncertain origin 2 4.233933e-06
#> naturalised 8765 1.855521e-02
#> not present 421606 8.925258e-01
#> presumed extinct 101 2.138136e-04
#> uncertain origin 22 4.657327e-05
```

You can also obtain the breakdown of species by established status for a particular state/territory using `state_diversity_counts()`.

```r
state_diversity_counts("NSW", resources = resources)
#> # A tibble: 7 × 3
#> origin state num_species
#>
#> 1 doubtfully naturalised NSW 93
#> 2 formerly naturalised NSW 8
#> 3 native NSW 5958
#> 4 native and doubtfully naturalised NSW 2
#> 5 native and naturalised NSW 34
#> 6 naturalised NSW 1580
#> 7 presumed extinct NSW 8
```

Using the established status data and state/territory information, we can check whether a plant taxon is native using `native_anywhere_in_australia()`.

```r
library(dplyr)

updated_gbif_names |>
  sample_n(1) |> # Choosing a random species
  pull(suggested_name) |> # Extracting this APC accepted name
  native_anywhere_in_australia(resources = resources)
#> # A tibble: 1 × 2
#> species native_anywhere_in_aus
#>
#> 1 Solanum prinophyllum considered native to Australia by APC
```

## Closer look at name standardisation with 'APCalign' {#closer-look}

`create_taxonomic_update_lookup` is a simple wrapper function for novice users who want to quickly check and standardise taxon names. More experienced users can take a look at the sub-functions `match_taxa()`, `align_taxa()` and `update_taxonomy()` to see how taxon names are processed, aligned and updated.

![](../man/figures/standardise_taxonomy_workflow.png)

### Aligning names to APC and APNI

The function `align_taxa` will:

1. Clean up your taxonomic names.
    - The functions `standardise_names`, `strip_names` and `strip_names_extra` standardise infraspecific taxon designations and clean up punctuation and whitespace.
2. Find the best alignment of your taxonomic name with the APC or APNI using the function [match_taxa](https://traitecoevo.github.io/APCalign/articles/updating-taxon-names.html).
    - A taxonomic name flows through a progression of [50 match algorithms](https://traitecoevo.github.io/APCalign/articles/updating-taxon-names.html) until it can be aligned to a name on either the APC or APNI list.
    - These include [exact and fuzzy matches](#fuzzy-match). Fuzzy matches are designed to capture small spelling mistakes and syntax errors in phrase names.
    - These include matches to the entire name string and matches on just select words in the sequence.
    - The sequence of matches has been carefully curated to align names with the fewest mistakes.
3. Determine the `taxon_rank` to which the name can be resolved, based on its syntax.
    - For names that can only be resolved to genus, `align_taxa` reformats the name to offer a standardised `genus sp.` name, with additional information/notes provided as part of the original name in square brackets, as in `Acacia sp. [skinny leaves]` or `Acacia sp. [Broken Hill]`.
4. Determine the `taxonomic_dataset` (APC or APNI) of each name alignment.

**Note** that `align_taxa` **does not** seek to update outdated taxonomy. That happens during the [update_taxonomy](#update) step. `align_taxa` instead aligns each name input to the closest match amongst names documented by the APC and APNI.
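
To make the idea of a fuzzy match concrete, the following is a minimal sketch in base R of an edit-distance filter with an absolute and a relative threshold. It is an illustration only, not the implementation used by `match_taxa`: the helper `fuzzy_candidates()` and the toy `reference` vector are hypothetical, and defining the relative distance as the edit distance divided by the length of the input name is an assumption. The default thresholds mirror the `fuzzy_abs_dist = 3` and `fuzzy_rel_dist = 0.2` defaults described under "Configuring matching precision and aligned output".

```r
# Hypothetical helper (not part of APCalign): return reference names whose
# Levenshtein distance to `name` passes both an absolute and a relative cut-off.
fuzzy_candidates <- function(name, reference, max_abs = 3, max_rel = 0.2) {
  # Edit distance between the input name and every reference name
  d <- as.vector(utils::adist(name, reference))
  # Assumed definition: relative distance = edit distance / length of input
  keep <- d <= max_abs & (d / nchar(name)) <= max_rel
  # Closest candidates first
  reference[keep][order(d[keep])]
}

# Toy reference list standing in for the APC/APNI names
reference <- c("Acacia aneura", "Acacia minyura", "Banksia serrata")

# A misspelt epithet is still matched to the intended name
fuzzy_candidates("Acacia anuera", reference)
```

Requiring both thresholds means that short names tolerate fewer edits than long ones, which is one plausible way to stop a brief epithet from matching an unrelated name.
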
```r
library(dplyr)

aligned_gbif_taxa <- gbif_lite |>
  pull(species) |>
  align_taxa(resources = resources)
#> Checking alignments of 121 taxa
#> -> 0 names already matched; 0 names checked but without a match; 121 taxa yet to be checked

aligned_gbif_taxa |> print(n = 6)
#> # A tibble: 129 × 7
#> original_name cleaned_name aligned_name taxonomic_dataset taxon_rank aligned_reason
#>
#> 1 Tetratheca ciliata Tetratheca … Tetratheca … APC species Exact match o…
#> 2 Peganum harmala Peganum har… Peganum har… APC species Exact match o…
#> 3 Calotis multicaulis Calotis mul… Calotis mul… APC species Exact match o…
#> 4 Leptospermum triner… Leptospermu… Leptospermu… APC species Exact match o…
#> 5 Lepidosperma latera… Lepidosperm… Lepidosperm… APC species Exact match o…
#> 6 Enneapogon polyphyl… Enneapogon … Enneapogon … APC species Exact match o…
#> # ℹ 123 more rows
#> # ℹ 1 more variable: alignment_code
```

For every `aligned_name`, `align_taxa()` will provide an `aligned_reason`, which you can review as a table of counts:

```r
library(janitor)

aligned_gbif_taxa |>
  pull(aligned_reason) |>
  tabyl() |>
  tibble()
#> # A tibble: 6 × 4
#> `pull(aligned_gbif_taxa, aligned_reason)` n percent valid_percent
#>
#> 1 Exact match of taxon name to an APC-accepted canonical name o… 118 0.915 0.929
#> 2 Exact match of taxon name to an APC-known canonical name once… 6 0.0465 0.0472
#> 3 Exact match of taxon name to an APNI-listed canonical name on… 1 0.00775 0.00787
#> 4 Exact match of the first two words of the taxon name to an AP… 1 0.00775 0.00787
#> 5 Exact match of the first word of the taxon name to an APC-acc… 1 0.00775 0.00787
#> 6 <NA> 2 0.0155 NA
```

#### Configuring matching precision and aligned output {#fuzzy-match}

`align_taxa` has arguments that allow you to select which of the 50 matching algorithms are activated or deactivated, and to set the degree of fuzziness of the fuzzy matching function:

- `fuzzy_matches` turns fuzzy matching on / off (it defaults to `TRUE`).
- `fuzzy_abs_dist` and `fuzzy_rel_dist` control the degree of fuzzy matching (they default to `fuzzy_abs_dist = 3` & `fuzzy_rel_dist = 0.2`).
- `imprecise_fuzzy_matches` turns imprecise fuzzy matching on / off (it defaults to `FALSE`; for true it is set to `fuzzy_abs_dist = 5` & `fuzzy_rel_dist = 0.25`).
- `APNI_matches` turns matches to the APNI list on/off (it defaults to `TRUE`).
- `identifier` allows you to specify a text string that is added to genus-level matches, indicating the site, study, etc e.g. `Acacia sp. [Blue Mountains]`

### Updating to APC-accepted names {#update}

`update_taxonomy()` uses the information generated by `align_taxa()` to, whenever possible, update names to APC-accepted names.

```r
updated_gbif_taxa <- aligned_gbif_taxa |>
  update_taxonomy(resources = resources)

updated_gbif_taxa |> print(n = 6)
#> # A tibble: 129 × 21
#> original_name aligned_name accepted_name suggested_name genus family taxon_rank
#>
#> 1 Tetratheca ciliata Tetratheca c… Tetratheca c… Tetratheca ci… Tetr… Elaeo… species
#> 2 Peganum harmala Peganum harm… Peganum harm… Peganum harma… Pega… Nitra… species
#> 3 Calotis multicaulis Calotis mult… Calotis mult… Calotis multi… Calo… Aster… species
#> 4 Leptospermum trinervium Leptospermum… Leptospermum… Leptospermum … Lept… Myrta… species
#> 5 Lepidosperma laterale Lepidosperma… Lepidosperma… Lepidosperma … Lepi… Cyper… species
#> 6 Enneapogon polyphyllus Enneapogon p… Enneapogon p… Enneapogon po… Enne… Poace… species
#> # ℹ 123 more rows
#> # ℹ 14 more variables: taxonomic_dataset, taxonomic_status,
#> #   taxonomic_status_aligned, aligned_reason, update_reason,
#> #   subclass, taxon_distribution, scientific_name_authorship,
#> #   taxon_ID, taxon_ID_genus, scientific_name_ID, canonical_name,
#> #   row_number, number_of_collapsed_taxa
```

#### Taxonomic resources used for updating names

- The APC includes all previously recorded taxonomic names for a current taxon concept, designating the currently-accepted name as
`taxonomic_status: accepted`, while previously used or inappropriately used names for the taxon concept have alternative taxonomic statuses documented (e.g. taxonomic synonym, orthographic variant, misapplied).
- The APC includes a column `acceptedNameUsageID` that links a taxon name with an alternative taxonomic status to the current taxon name, allowing outdated/inappropriately used names to be synced to their current name.

*Note*: Names listed on the APNI but absent from the APC are designated as `taxonomic_dataset: APNI` by `APCalign`. These names are currently `unknown` to the APC. Over time, this list shrinks, as taxonomists link ever more occasionally used name variants to an APC-accepted taxon. However, for now, *names listed only on the APNI cannot be updated*.

#### Name updates at different taxonomic levels

- `update_taxonomy()` divides names into lists based on the `taxon_rank` and `taxonomic_dataset` assigned by `align_taxa`, as each list requires different updating algorithms.
- Only taxonomic names that are designated as `taxon_rank = species/infraspecific` and `taxonomic_dataset = APC` can be updated to an APC-accepted name.
- For all other taxa, it may be possible to align the genus name to an APC-accepted genus.
- For all taxa, a `suggested_name` is provided, selecting the `accepted_name` when available, and otherwise the `aligned_name`, but with, if possible, an updated, APC-accepted genus name.

#### Taxonomic splits

- Taxonomic splits refer to instances where a single taxon concept is subsequently split into multiple taxon concepts. For such taxa, when the `aligned_name` is the "old" taxon concept name, it is impossible to know which of the currently accepted taxon concepts the name represents.
- The function `update_taxonomy` includes an argument `taxonomic_splits`, offering three alternative outputs for taxon concepts that have been split.
    1. `most_likely_species` is the default value, and returns the `accepted_name` of the original taxon concept; alternative names are documented in square brackets as part of the suggested name (`Acacia aneura [alternative possible names: Acacia minyura (pro parte misapplied) | Acacia paraneura (pro parte misapplied) | Acacia quadrimarginea (misapplied)]`).
    2. `return_all` returns all currently accepted names that were split from the original taxon concept; this leads to an increase in the number of rows in the output table. (Acacia aneura, Acacia minyura and Acacia paraneura are each output as a separate row, each with a unique `taxon_ID`.)
    3. `collapse_to_higher_taxon` declares that for split names, there is no way to be certain about which accepted name is appropriate and therefore that the best possible match is at the genus level; no `accepted_name` is returned, the `taxon_rank` is demoted to `genus` and the suggested name documents the possible species-level names in square brackets (`Acacia sp. [collapsed names: Acacia aneura (accepted) | Acacia minyura (pro parte misapplied) | Acacia paraneura (pro parte misapplied)]`).

```r
library(dplyr)

aligned_gbif_taxa |>
  update_taxonomy(taxonomic_splits = "most_likely_species", resources = resources) |>
  filter(original_name == "Acacia aneura") # Subsetting Acacia aneura as an example
#> # A tibble: 1 × 21
#> original_name aligned_name accepted_name suggested_name genus family taxon_rank
#>
#> 1 Acacia aneura Acacia aneura Acacia aneura Acacia aneura [alternat… Acac… Fabac… species
#> # ℹ 14 more variables: taxonomic_dataset, taxonomic_status,
#> #   taxonomic_status_aligned, aligned_reason, update_reason,
#> #   subclass, taxon_distribution, scientific_name_authorship,
#> #   taxon_ID, taxon_ID_genus, scientific_name_ID, canonical_name,
#> #   row_number, number_of_collapsed_taxa
```

```r
aligned_gbif_taxa |>
  update_taxonomy(taxonomic_splits = "return_all", resources = resources) |>
  filter(original_name == "Acacia aneura") # Subsetting Acacia aneura as an example
#> # A tibble: 3 × 21
#> original_name aligned_name accepted_name suggested_name genus family taxon_rank
#>
#> 1 Acacia aneura Acacia aneura Acacia aneura Acacia aneura Acacia Fabaceae species
#> 2 Acacia aneura Acacia aneura Acacia minyura Acacia minyura Acacia Fabaceae species
#> 3 Acacia aneura Acacia aneura Acacia paraneura Acacia paraneura Acacia Fabaceae species
#> # ℹ 14 more variables: taxonomic_dataset, taxonomic_status,
#> #   taxonomic_status_aligned, aligned_reason, update_reason,
#> #   subclass, taxon_distribution, scientific_name_authorship,
#> #   taxon_ID, taxon_ID_genus, scientific_name_ID, canonical_name,
#> #   row_number, number_of_collapsed_taxa
```

```r
aligned_gbif_taxa |>
  update_taxonomy(taxonomic_splits = "collapse_to_higher_taxon", resources = resources) |>
  filter(original_name == "Acacia aneura") # Subsetting Acacia aneura as an example
#> # A tibble: 1 × 21
#> original_name aligned_name accepted_name suggested_name genus family taxon_rank
#>
#> 1 Acacia aneura Acacia aneura Acacia sp. Acacia sp. [collapsed n… Acac… Fabac… species
#> # ℹ 14 more variables: taxonomic_dataset, taxonomic_status,
#> #   taxonomic_status_aligned, aligned_reason, update_reason,
#> #   subclass, taxon_distribution, scientific_name_authorship,
#> #   taxon_ID, taxon_ID_genus, scientific_name_ID, canonical_name,
#> #   row_number, number_of_collapsed_taxa
```
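
Because `return_all` adds rows, it can be useful to list which original names were affected by a split before joining the result back to your own data. The sketch below shows one way to do this with `dplyr`. It runs on a small hand-built tibble, `updated_all`, that stands in for the output of `update_taxonomy(taxonomic_splits = "return_all")` and borrows only the `original_name` and `accepted_name` columns shown above; the object names `updated_all` and `split_names` are hypothetical.

```r
library(dplyr)

# Toy stand-in for the output of update_taxonomy(taxonomic_splits = "return_all");
# only the two columns needed for the check are included.
updated_all <- tibble(
  original_name = c("Acacia aneura", "Acacia aneura", "Acacia aneura", "Tetratheca ciliata"),
  accepted_name = c("Acacia aneura", "Acacia minyura", "Acacia paraneura", "Tetratheca ciliata")
)

# Flag original names that map to more than one accepted name
split_names <- updated_all |>
  group_by(original_name) |>
  summarise(
    n_accepted = n_distinct(accepted_name),
    candidates = paste(sort(unique(accepted_name)), collapse = " | ")
  ) |>
  filter(n_accepted > 1)

split_names
```

Names flagged in this way are the ones for which a one-to-one join back to the original records is not possible without choosing among the candidate accepted names.
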