Quantifying the geological completeness of paleontological sampling in North America

Authors: Palaeoverse Development Team

Last updated: 2024-07-29

Introduction

rmacrostrat is an R package which allows users to easily retrieve geological data from the Macrostrat database and facilitates analyses of these data within the R environment. This vignette (or tutorial, if you prefer) is provided to guide you through the installation process and some of the functionality available within rmacrostrat. Specifically, we will focus on reproducing a classical figure (i.e., Fig. 14) from Peters & Heim (2010), which visualizes the changing number–and proportion–of lithostratigraphic units from different paleoenvironments throughout the Phanerozoic. Let’s get started!

Installation

The rmacrostrat package can be installed via CRAN, or its dedicated GitHub repository if the development version is preferred. To install via CRAN, simply use:

install.packages("rmacrostrat")

To install the development version, first install the devtools package, and then use install_github to install rmacrostrat directly from GitHub.

install.packages("devtools")
devtools::install_github("palaeoverse/rmacrostrat")

You can now load rmacrostrat using the standard library function:

library(rmacrostrat)

Before we get into the good stuff, the development team has a small request. If you use rmacrostrat in your research, please cite the associated publication. This will help us to continue our work in supporting you to do yours. You can access the appropriate citation via:

citation("rmacrostrat")
## To cite rmacrostrat in publications, use the following citation:
## 
##   Jones, L.A., Dean, C.D., Gearty, W., and Allen, B.J. 2024. rmacrostrat: An R
##   package for accessing and retrieving data from the Macrostrat geological
##   database. EarthArXiv, 1–29, doi: 10.31223/X5XX37
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {rmacrostrat: An R Package for fetching geologic data from the Macrostrat database.},
##     author = {Lewis A. Jones and Christopher D. Dean and William Gearty and Bethany J. Allen},
##     year = {2024},
##     journal = {EarthArXiv},
##     pages = {1–29},
##     doi = {10.31223/X5XX37},
##   }

Context

Quantifying the spatiotemporal distribution of lithostratigraphic units is fundamental to understanding how environments, and the biodiversity within them, have evolved throughout Earth’s history. Previous work by Peters & Heim (2010) focused on “the geological completeness of paleontological sampling in North America”, with the aim of recognizing and overcoming geologically-controlled sampling biases. One component of this article involved quantifying how the number and proportion of lithostratigraphic units preserving different paleoenvironment types (e.g., marine, marginal, terrestrial) varies through time. In this vignette, we will revisit this question and examine how the number and proportion of different lithostratigraphic units–grouped by paleoenvironment type–changes over the Phanerozoic. To do so, we will make use of the rmacrostrat package to fetch data from the Macrostrat database.

Retrieving data

To quantify how the number and proportion of different lithostratigraphic units change through time, we will focus on the following paleoenvironmental groupings: marine, marginal, and terrestrial. The first step in establishing these groupings is to see what different environments are actually available in Macrostrat:

Hint: Remember that all definitions of data stored in Macrostrat are available via the def_ suite of functions.

# What environments are available?
# Hint: Defining no arguments returns all environments
environments <- def_environments()
# See the first few rows
head(environments)
##   environ_id                                 name      type  class   color t_units
## 1          1                            peritidal carbonate marine #B8B8E6      94
## 2          2                     shallow subtidal carbonate marine #3399FF     815
## 3          3                open shallow subtidal carbonate marine #3399FF     421
## 4          4 lagoonal/restricted shallow subtidal carbonate marine #B8B8E6     216
## 5          5                           sand shoal carbonate marine #5CADFF      25
## 6          6                                 reef carbonate marine #33CCFF     156

From this call, we can see a whole suite of different inferred paleoenvironments are available, which are conveniently grouped into a type (e.g., carbonate, fluvial, lacustrine) and class (i.e., marine and non-marine). Class seems helpful for discerning between marine and non-marine environments, but we also want a marginal category, so we need to re-classify some of these environment definitions.

We can define which paleoenvironments are marginal using the additional information available in the name column of environments. For this definition, let’s consider that any environment that is classified as a delta, beach, barrier island, estuary, lagoon, or tidal flat is a marginal environment. From our available environments (n = 87), we can identify 22 as marginal environments. This classification is perhaps a little subjective, but for this vignette it provides a fair representation of marginal environments:

# Define an object classifying marginal paleoenvironments
marginal_envs <- c("shallow subtidal", "open shallow subtidal", 
              "lagoonal/restricted shallow subtidal", "paralic indet.", 
              "lagoonal", "coastal indet.", "foreshore", "shoreface",
              "transition zone/lower shoreface", "barrier bar",
              "deltaic indet.", "delta plain", "interdistributary bay",
              "delta front", "prodelta", "marginal marine",
              "deltaic indet.", "delta plain", "lacustrine deltaic indet.",
              "lacustrine delta plain", "lacustrine prodelta", "playa")

Now that we’ve defined our marginal paleoenvironments, we can re-classify the class column in environments so that we have three categories. We can also replace the label “non-marine” with “terrestrial” for clarity:

# Update class to include marginal
environments$class[which(environments$name %in% marginal_envs)] <- "marginal"
# Change the name of non-marine class to terrestrial
environments$class[which(environments$class == "non-marine")] <- "terrestrial"

For simplicity, we will create three separate data.frames, one for each paleoenvironment.

# Subset dataframe
marine <- subset(environments, class == "marine")
marginal <- subset(environments, class == "marginal")
terrestrial <- subset(environments, class == "terrestrial")

Great! So we now have our groupings, but if we want to investigate how lithostratigraphic units change through time, we need a little more information. Specifically, we need to know the inferred paleoenvironment and age of each lithostratigraphic unit. We can retrieve such information about lithostratigraphic units using the get_units function. Conveniently, we can request such data for each paleoenvironmental class using the environ_id column.

Hint: Remember that to retrieve data stored in Macrostrat (e.g., Macrostrat columns, sections, and units), the get_ suite of functions must be used.

# Retrieve marine units
marine_units <- get_units(environ_id = marine$environ_id)
# Retrieve marginal units
marginal_units <- get_units(environ_id = marginal$environ_id)
# Retrieve terrestrial units
terrestrial_units <- get_units(environ_id = terrestrial$environ_id)
# Let's also add the environment class in these dataframes for consistency
marine_units$environ_class <- "Marine"
marginal_units$environ_class <- "Marginal"
terrestrial_units$environ_class <- "Terrestrial"

Nice! That was pretty straightforward. However, some unit_id values in the marine dataframe are also present in the marginal and terrestrial dataframes, and vice versa:

# How many marine unit IDs are also in terrestrial?
length(which(marine_units$unit_id %in% terrestrial_units$unit_id))
## [1] 499

On further investigation, we can see that some lithostratigraphic units are associated with multiple different paleoenvironments (various combinations of marine, marginal, and terrestrial). Interesting, but perhaps to be expected given transgression and regression cycles! We should treat these differently to our pre-existing groupings for clarity. Let’s make a new paleoenvironment class called “mixed”, and update the environment class accordingly.

# Which unit IDs are shared between several paleoenvironments?
shared_id <- c(
  # Marine
  marine_units$unit_id[which(marine_units$unit_id %in% terrestrial_units$unit_id)],
  marine_units$unit_id[which(marine_units$unit_id %in% marginal_units$unit_id)],
  # Marginal
  marginal_units$unit_id[which(marginal_units$unit_id %in% marine_units$unit_id)],
  marginal_units$unit_id[which(marginal_units$unit_id %in% terrestrial_units$unit_id)],
  # Terrestrial
  terrestrial_units$unit_id[which(terrestrial_units$unit_id %in% marine_units$unit_id)],
  terrestrial_units$unit_id[which(terrestrial_units$unit_id %in% marginal_units$unit_id)]
)
# Get unique IDs
shared_id <- unique(shared_id)
# Combine dataframes
litho_units <- rbind.data.frame(marine_units, marginal_units, terrestrial_units)
# Recode environment class
litho_units$environ_class[which(litho_units$unit_id %in% shared_id)] <- "Mixed"
# Retain only unique units
litho_units <- unique(litho_units)

Great, we can now see that we don’t have any duplicate units in our dataset anymore:

# Check if any units appear more than once
any(table(litho_units$unit_id) > 1)
## [1] FALSE

OK - we now have four lithostratigraphic groupings for our units, but Figure 14 from Peters & Heim (2010) includes a fifth unit group, “unknown”. For the sake of completeness, let’s get all lithostratigraphic units not already covered by our other four groupings (marine, marginal, mixed, and terrestrial). As we cannot search Macrostrat by “unknown”, we will first pull all Phanerozoic units, and then filter out units already present in our litho_units dataset.

# Get all Phanerozoic units
all_units <- get_units(interval_name = "Phanerozoic")
# Remove units not already present in litho_units (unknown palaeoenvironments)
unknown_units <- subset(all_units, !unit_id %in% litho_units$unit_id)
# Check if any environment information is associated with units - none, great!
head(unknown_units$environ)
## [[1]]
## data frame with 0 columns and 0 rows
## 
## [[2]]
## data frame with 0 columns and 0 rows
## 
## [[3]]
## data frame with 0 columns and 0 rows
## 
## [[4]]
## data frame with 0 columns and 0 rows
## 
## [[5]]
## data frame with 0 columns and 0 rows
## 
## [[6]]
## data frame with 0 columns and 0 rows
# Add environment class
unknown_units$environ_class <- "Unknown"
# Combine dataframes
litho_units <- rbind.data.frame(litho_units, unknown_units)

Great! But we’ve just remembered that Macrostrat doesn’t only cover North America, and we’ve pulled all units available worldwide. To make a closer reproduction of Figure 14 from Peters & Heim (2010), we ought to just focus on North America. Conveniently, Macrostrat is split into different projects, which tend to cover different regions:

# Check available projects
def_projects()[, 1:2]
##    project_id                   project
## 1           1             North America
## 2           3                      eODP
## 3           4                  Deep Sea
## 4           5               New Zealand
## 5           6                 Australia
## 6           7                 Caribbean
## 7           8             South America
## 8           9                    Africa
## 9          10  North American Ediacaran
## 10         11 North American Cretaceous
## 11         12                 Indonesia
## 12         13          Northern Eurasia

From this call, we can see that North America has a project_id of 1. We can use this information to further filter litho_units:

# Filter to N. America
litho_units <- subset(litho_units, project_id == 1)

Now we’ve got our data and done a bit of wrangling, let’s get on with summarizing and visualizing it.

Summarizing data

We are interested in how the number and proportion of lithostratigraphic units representing different paleoenvironment classes changes through time. Conveniently, lithostratigraphic units in Macrostrat have age information associated with them, specifically the minimum (t_age) and maximum (b_age) age in millions of years before present. Using this data, with some handy support functions from the palaeoverse R package, we can count the number of lithographic units within each paleoenvironmental group, and within each time bin. We will use international stratigraphic stage bins as our time bins for this:

# Load palaeoverse R package
library(palaeoverse)
# Generate stage-level time bins
bins <- time_bins(scale = "international ages")
# Rename age columns in litho_units to be consistent with bins
colnames(litho_units)[which(colnames(litho_units) == "b_age")] <- "max_ma"
colnames(litho_units)[which(colnames(litho_units) == "t_age")] <- "min_ma"
# As we are only interested in the Phanerozoic, let's remove any
# units that have a minimum age of more than 538.8 Ma 
litho_units <- subset(litho_units, min_ma <= 538.8)
# In case the maximum age of units exceeds 538.8, but the minimum 
# age is less than 538.8, let's set max_ma to 538.8 to still count it 
# if it appears at any point in the Fortunian
litho_units$max_ma[which(litho_units$max_ma > 538.8)] <- 538.8
# Bin data using the bin_time function with the "all"
# method. This will place units into all bins the unit
# overlaps with, generating additional rows for each time 
# bin covered (expect an increase in the number of rows)
litho_units <- bin_time(occdf = litho_units,
                        bins = bins,
                        min_ma = "min_ma", 
                        max_ma = "max_ma",
                        method = "all")
# Calculate the number of environment classes per time bin
# Remember, each row is a unique instance of a unit within a bin
counts <- group_apply(occdf = litho_units, 
                      group = c("environ_class", "bin_assignment"),
                      fun = nrow)
# Rename columns to ease reading and merging
colnames(counts) <- c("count", "environment", "bin")
# Merge datasets to combine time bin information with counts
counts <- merge(x = bins, y = counts, by = "bin")

Visualizing data

Now we have our summary of the lithostratigraphic units, we can visualize this data. First, we will plot the number of units within each paleoenvironment group through time. To support this data visualization, we will make use of the ggplot2 (plotting) and deeptime (adding a geological timescale axis) R packages.

# Load data visualization packages
library(ggplot2)
library(deeptime)

# Let's first set our factor levels to replicate the original figure
counts$environment <- factor(counts$environment, levels = c("Marine", "Marginal", 
                                                            "Mixed", "Terrestrial", 
                                                            "Unknown"))

# Generate a plot of raw counts of lithostratigraphic units through time
# grouped by paleoenvironment
ggplot(counts, aes(fill = environment, y = count, x = mid_ma)) + 
  # Stacked bar chart with bar width specified by interval duration
  geom_bar(position = "stack", stat = "identity", width = counts$duration_myr,
           color = "black", linewidth = 0.1) +
  # Label y-axis
  scale_y_continuous("Number of lithostratigraphic units") +
  # Label x-axis and reverse direction
  scale_x_reverse("Time (Ma)") +
  # Data plotting colors
  scale_fill_viridis_d() +
  # Theming 
  theme_bw() +
  theme(legend.title = element_blank(),
        legend.position = "top") +
  # Add geological timescale
  coord_geo()
plot of chunk visualize_counts

plot of chunk visualize_counts

We can also plot the proportion of units within each paleoenvironment group through time.

# Generate a plot of lithostratigraphic units through time grouped by
# paleoenvironment as proportions
ggplot(counts, aes(fill = environment, y = count, x = mid_ma)) + 
  # Proportional bar chart with bar width specified by interval duration
  geom_bar(position = "fill", stat = "identity", width = counts$duration_myr,
           color = "black", linewidth = 0.1) +
  # Label y-axis
  scale_y_continuous("Proportion of lithostratigraphic units") +
  # Label x-axis and reverse direction
  scale_x_reverse("Time (Ma)") +
  # Data plotting colors
  scale_fill_viridis_d() +
  # Theming
  theme_bw() +
  theme(legend.title = element_blank(),
        legend.position = "top") +
  # Add geological timescale
  coord_geo()
plot of chunk visualize_proportions

plot of chunk visualize_proportions

And that’s it! Not too bad, right? There are some clear differences between our generated plots and those from the original Peters & Heim (2010) paper (Fig. 14), which is to be expected given that substantially more data is now available in Macrostrat, and there may be differences in what we defined as “marginal” environments. Despite these differences, some broad-scale patterns are largely consistent, such as the higher proportion of terrestrial lithostratigraphic units in the Mesozoic and Cenozoic, compared to the Paleozoic.

Hopefully this vignette has shown you some potential uses for rmacrostrat functions and helped provide a workflow for your own analyses. If you have any questions about the package or its functionality, please feel free to join our Palaeoverse Google group and leave a comment; we’ll aim to answer it as soon as possible!

If you’re interested in learning more about rmacrostrat, don’t forget to check out our other vignettes! You can see which ones are available by calling vignette(package = "rmacrostrat").

References

Gearty, W. 2024. deeptime: Plotting Tools for Anyone Working in Deep Time. R package version 1.1.1, https://CRAN.R-project.org/package=deeptime.

Jones, L.A., Gearty, W., Allen, B.J., Eichenseer, K., Dean, C.D., Galván S., Kouvari, M., Godoy, P.L., Nicholl, C.S.C., Dillon, E.M., Flannery-Sutherland, J.T., Chiarenza, A.A. 2022. palaeoverse: A community-driven R package to support palaeobiological analysis. Methods in Ecology and Evolution, 14(9), 2205–2215. doi: 10.1111/2041-210X.14099.

Peters, S.E. and Heim, N.A. 2010. The Geological Completeness of Paleontological Sampling in North America. Paleobiology 36(10), pp61–79.

Wickham, H. 2016. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.