---
title: "tidysdm overview"
output: rmarkdown::html_vignette
#output: rmarkdown::pdf_document
vignette: >
  %\VignetteIndexEntry{tidysdm overview}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
# xgboost uses data.table
data.table::setDTthreads(2)
RhpcBLASctl::blas_set_num_threads(2)
RhpcBLASctl::omp_set_num_threads(2)

# set the two variables below to FALSE for a normal vignette using the data
# included in the package
download_data <- FALSE
create_sample_data <- FALSE
if (create_sample_data) {
  download_data <- TRUE # we need to download the data to create the sample data
}
```

# SDMs with `tidymodels`

Species Distribution Modelling relies on several algorithms, many of
which have a number of hyperparameters that require tuning. The
`tidymodels` universe includes a number of packages specifically designed
to fit, tune and validate models. The advantage of `tidymodels` is that
the model syntax and the results returned to the user are
standardised, thus providing a coherent interface to modelling. Given
the variety of models required for SDM, `tidymodels` is an ideal
framework. `tidysdm` provides a number of wrappers and specialised
functions to facilitate the fitting of SDMs with `tidymodels`.

This article provides an overview of how `tidysdm` facilitates
fitting SDMs. Further articles, detailing how to use the package for
palaeodata, how to fit more complex models, and how to troubleshoot
models, can be found on the [`tidysdm`
website](https://evolecolgroup.github.io/tidysdm/). As `tidysdm` relies
on `tidymodels`, users are advised to familiarise themselves with the
introductory tutorials on the [`tidymodels`
website](https://www.tidymodels.org/start/).

When we load `tidysdm`, it automatically loads `tidymodels` and all
associated packages necessary to fit models:

```{r libraries}
library(tidysdm)
```

### Accessing the data for this vignette: how to use `rgbif`

We start by reading in a set of presences for a species of lizard that
inhabits the Iberian peninsula, *Lacerta schreiberi*. This data is taken
from GBIF Occurrence Download (6 July 2023)
<https://doi.org/10.15468/dl.srq3b3>. The dataset is already included in
the `tidysdm` package:

```{r load_presences}
data(lacerta)
head(lacerta)
```

Alternatively, we can easily access and manipulate this dataset using
`rgbif`. Note that data from GBIF often require some level of cleaning. Here
we will use a simple cleaning function from the `CoordinateCleaner` package; in
general, we recommend inspecting the records flagged as problematic, rather
than simply removing them as we do here:

```{r download_presences, eval = download_data}
# download presences
library(rgbif)
occ_download_get(key = "0068808-230530130749713", path = tempdir())
# read file
library(readr)
distrib <- read_delim(file.path(tempdir(), "0068808-230530130749713.zip"))
# keep the necessary columns and rename them
lacerta <- distrib %>%
  select(gbifID, decimalLatitude, decimalLongitude) %>%
  rename(ID = gbifID, latitude = decimalLatitude, longitude = decimalLongitude)
# clean up the data
library(CoordinateCleaner)
lacerta <- clean_coordinates(
  x = lacerta,
  lon = "longitude",
  lat = "latitude",
  species = "ID",
  value = "clean"
)
```
```{r echo=FALSE, results='hide', eval=create_sample_data}
usethis::use_data(lacerta, overwrite = TRUE)
```

# Preparing your data

First, let us visualise our presences by plotting them on a map. `tidysdm`
works with `sf` objects to represent locations, so we will cast our
coordinates into an `sf` object, and set its projection to the standard
'lonlat' using the proj4 string "+proj=longlat".

```{r cast_to_sf}
library(sf)
lacerta <- st_as_sf(lacerta, coords = c("longitude", "latitude"))
st_crs(lacerta) <- "+proj=longlat"
```

It is usually advisable to plot the locations directly on the raster
that will be used to extract climatic variables, to see how the
locations fall within the discrete space of the raster. For this
vignette, we will use WorldClim as our source of climatic information.
We will access the WorldClim data via the library `pastclim`; even
though this library, as the name suggests, is mostly designed to handle
palaeoclimatic reconstructions, it also provides convenient functions to
access present day reconstructions and future projections. `pastclim`
has a handy function to get the land mask for the available datasets,
which we can use as a background for our locations. For this example, we will
crop the raster to the Iberian peninsula, where our lizard lives:

```{r land_mask, eval= download_data}
library(pastclim)
download_dataset(dataset = "WorldClim_2.1_10m")
land_mask <-
  get_land_mask(time_ce = 1985, dataset = "WorldClim_2.1_10m")

# Iberian peninsula extent
iberia_poly <-
  terra::vect(
    "POLYGON((-9.8 43.3,-7.8 44.1,-2.0 43.7,3.6 42.5,3.8 41.5,1.3 40.8,0.3 39.5,
     0.9 38.6,-0.4 37.5,-1.6 36.7,-2.3 36.3,-4.1 36.4,-4.5 36.4,-5.0 36.1,
    -5.6 36.0,-6.3 36.0,-7.1 36.9,-9.5 36.6,-9.4 38.0,-10.6 38.9,-9.5 40.8,
    -9.8 43.3))"
  )

crs(iberia_poly) <- "+proj=longlat"
# crop the extent
land_mask <- crop(land_mask, iberia_poly)
# and mask to the polygon
land_mask <- mask(land_mask, iberia_poly)
```

```{r land_mask_save, echo=FALSE, results = 'hide', eval=create_sample_data}
terra::saveRDS(land_mask, "../inst/extdata/lacerta_land_mask.rds")
```


```{r land_mask_load, echo=FALSE, eval=!download_data}
library(pastclim)
set_data_path(on_CRAN = TRUE)
# Iberian peninsula extent
iberia_poly <-
  terra::vect(
    "POLYGON((-9.8 43.3,-7.8 44.1,-2.0 43.7,3.6 42.5,3.8 41.5,1.3 40.8,0.3 39.5,
     0.9 38.6,-0.4 37.5,-1.6 36.7,-2.3 36.3,-4.1 36.4,-4.5 36.4,-5.0 36.1,
    -5.6 36.0,-6.3 36.0,-7.1 36.9,-9.5 36.6,-9.4 38.0,-10.6 38.9,-9.5 40.8,
    -9.8 43.3))"
  )
crs(iberia_poly) <- "+proj=longlat"
land_mask <- terra::readRDS(system.file("extdata/lacerta_land_mask.rds",
  package = "tidysdm"
))
```

For plotting, we will take advantage of `tidyterra`, which makes
handling of `terra` rasters with `ggplot` a breeze.

```{r, fig.width=6, fig.height=4}
library(tidyterra)
library(ggplot2)
ggplot() +
  geom_spatraster(data = land_mask, aes(fill = land_mask_1985)) +
  geom_sf(data = lacerta) +
  guides(fill = "none")
```

# Map projection

Before we start thinning the data, we need to make sure that all our data (points and rasters) share the same coordinate reference system (CRS), which we achieve by projecting them. In some of the pipeline steps (e.g. thinning data, measuring areas), using an equal-area projection can make a significant difference, especially for large-scale projects.

You can use the website `projectionwizard.org` (https://link.springer.com/chapter/10.1007/978-3-319-51835-0_9)
to find an appropriate equal-area projection for any region.

To define our projection within the code, we will use a proj4 string, which provides information on the type of projection, its parameters and the units of distance in which the new coordinates will be expressed (if you use `projectionwizard.org`, it will provide you with the string as well).

In this case, we will use an Albers Equal Area Conic projection centred on the Iberian peninsula, with metres as units. The proj4 string is:

```{r}
iberia_proj4 <-
  "+proj=aea +lon_0=-4.0 +lat_1=36.8 +lat_2=42.6 +lat_0=39.7 +datum=WGS84 +units=m +no_defs"
```

For rasters (maps), we use the `terra` function `project` to change the CRS. We pass the raster object and the proj4 string as arguments:

```{r}
land_mask <- terra::project(land_mask, y = iberia_proj4)
```

Now we need to project the data points to the same CRS as the raster. We will do so using the appropriate `sf` function:

```{r}
lacerta <- st_transform(lacerta, iberia_proj4)
```

If we plot the data again, we can see that the shape of the land mask has changed slightly under the new projection.

```{r project_iberia, fig.width=6, fig.height=4}
ggplot() +
  geom_spatraster(data = land_mask, aes(fill = land_mask_1985)) +
  geom_sf(data = lacerta) +
  guides(fill = "none")
```


# Thinning step

Now, we thin the observations to have one per cell in the raster (given
our projection, each cell is approximately the same size):

```{r thin_by_cell}
set.seed(1234567)
lacerta <- thin_by_cell(lacerta, raster = land_mask)
nrow(lacerta)
```

We can also check that all the thinned presences fall on valid (non-NA) cells of
the land mask by extracting its values at their locations:

```{r}
pres_data <- terra::extract(land_mask, lacerta)
summary(pres_data)
```


```{r plot_thin_by_cell, fig.width=6, fig.height=4}
ggplot() +
  geom_spatraster(data = land_mask, aes(fill = land_mask_1985)) +
  geom_sf(data = lacerta) +
  guides(fill = "none")
```

Now, we thin further to remove points that are closer than 20 km to each other.
As the units of our projection are metres (the default for most projections),
we use a convenient conversion function, `km2m()`, to
avoid having to write lots of zeroes:

```{r thin_by_dist}
set.seed(1234567)
lacerta_thin <- thin_by_dist(lacerta, dist_min = km2m(20))
nrow(lacerta_thin)
```

Let's see what we have left of our points:

```{r plot_thin_by_dist, fig.width=6, fig.height=4}
ggplot() +
  geom_spatraster(data = land_mask, aes(fill = land_mask_1985)) +
  geom_sf(data = lacerta_thin) +
  guides(fill = "none")
```

We now need to select points that represent the potential available area
for the species. There are two approaches: we can either sample the
background with `sample_background()`, or we can generate
pseudo-absences with `sample_pseudoabs()`. In this example, we will
sample the background; more specifically, we will attempt to account for
potential sampling biases by using a target-group approach, where
presences from other species within the same taxonomic group are used to
condition the sampling of the background, providing information on
differential sampling of different areas within the region of interest.
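For reference, the pseudo-absence route would look roughly like the sketch
below (not run in this vignette; the 50 km minimum distance from presences is
just an illustrative choice):

```{r sample_pseudoabs_sketch, eval=FALSE}
# a minimal sketch of the pseudo-absence alternative (not run):
# sample three pseudo-absences per presence, at least 50 km away from
# any presence
set.seed(1234567)
lacerta_pseudoabs <- sample_pseudoabs(lacerta_thin,
  n = 3 * nrow(lacerta_thin),
  raster = land_mask,
  method = c("dist_min", km2m(50))
)
```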

We will start by downloading records from 8 genera of *Lacertidae*,
covering the same geographic region of the Iberian peninsula from GBIF
<https://doi.org/10.15468/dl.53js5z>:

```{r download_background, eval=FALSE}
library(rgbif)
occ_download_get(key = "0121761-240321170329656", path = tempdir())
library(readr)
backg_distrib <- readr::read_delim(file.path(
  tempdir(),
  "0121761-240321170329656.zip"
))
# keep the necessary columns
lacertidae_background <- backg_distrib %>%
  select(gbifID, decimalLatitude, decimalLongitude) %>%
  rename(ID = gbifID, latitude = decimalLatitude, longitude = decimalLongitude)
```

Here too, we need to project the background points with the same projection that we defined earlier. If the projections of the points and the raster do not match, the analysis will stop with an error.

```{r projection_background, eval=FALSE}
# convert to an sf object
lacertidae_background <- st_as_sf(lacertidae_background,
  coords = c("longitude", "latitude")
)

st_crs(lacertidae_background) <- "+proj=longlat"
lacertidae_background <- st_transform(lacertidae_background, crs = iberia_proj4)
```
```{r echo=FALSE, results='hide', eval=create_sample_data}
usethis::use_data(lacertidae_background, overwrite = TRUE)
```

```{r echo=FALSE}
data("lacertidae_background")
lacertidae_background <- st_as_sf(lacertidae_background,
  coords = c("longitude", "latitude")
)
st_crs(lacertidae_background) <- "+proj=longlat"
lacertidae_background <- st_transform(lacertidae_background, crs = iberia_proj4)
```

We need to convert these observations into a raster whose values are the
number of records per cell (which will later be used to determine how likely each
cell is to be used as a background point). We will also mask the resulting background raster to match the land mask of interest.

```{r background_to_raster, fig.width=6, fig.height=4}
lacertidae_background_raster <- rasterize(lacertidae_background,
  land_mask,
  fun = "count"
)
lacertidae_background_raster <- mask(
  lacertidae_background_raster,
  land_mask
)
ggplot() +
  geom_spatraster(data = lacertidae_background_raster, aes(fill = count)) +
  scale_fill_viridis_b(na.value = "transparent") +
  guides(fill = "none")
```

We can see that the sampling is far from random, with certain locations
having a very large number of records. We can now sample the background,
using the 'bias' method to represent this heterogeneity in sampling
effort:

```{r sample_background}
set.seed(1234567)
lacerta_thin <- sample_background(
  data = lacerta_thin, raster = lacertidae_background_raster,
  n = 3 * nrow(lacerta_thin),
  method = "bias",
  class_label = "background",
  return_pres = TRUE
)
```

Let's see our presences and background:

```{r plot_sample_pseudoabs, fig.width=6, fig.height=4}
ggplot() +
  geom_spatraster(data = land_mask, aes(fill = land_mask_1985)) +
  geom_sf(data = lacerta_thin, aes(col = class)) +
  guides(fill = "none")
```

We can use `pastclim` to download the WorldClim dataset (we'll use the 10 arc-minute
resolution) and extract the bioclimatic
variables that are available (but you do not have to use `pastclim`, you could 
use any raster dataset you have access to, loading it directly with `terra`).
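If you prefer to use climate layers of your own, a minimal sketch would be the
following (the file path is purely a placeholder, not a file shipped with
`tidysdm`):

```{r load_own_raster_sketch, eval=FALSE}
# load a multi-layer climate raster of your own with terra (not run);
# replace the placeholder path with the location of your file
climate_present <- terra::rast("path/to/my_climate_layers.tif")
```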

```{r load_climate, eval=download_data}
download_dataset("WorldClim_2.1_10m")
climate_vars <- get_vars_for_dataset("WorldClim_2.1_10m")
climate_present <- pastclim::region_slice(
  time_ce = 1985,
  bio_variables = climate_vars,
  data = "WorldClim_2.1_10m",
  crop = iberia_poly
)
```

```{r echo=FALSE, results='hide', eval=create_sample_data}
terra::saveRDS(
  climate_present,
  "../inst/extdata/lacerta_climate_present_10m.rds"
)
```


```{r echo=FALSE, results='hide', eval=!download_data}
climate_present <- terra::readRDS(
  system.file("extdata/lacerta_climate_present_10m.rds",
    package = "tidysdm"
  )
)
climate_vars <- names(climate_present)
```

Note that the dataset covers the
period 1970-2000, so `pastclim` dates it as 1985 (the midpoint). We have also cropped
it directly to the Iberian peninsula.

Note that, in this vignette, we focus on continuous variables; most machine learning algorithms
do not natively cope with multi-level factors, but it is possible to use `recipes::step_dummy()`
to generate dummy variables from factors. A worked example can be found in the
[article on additional features of tidymodels with tidysdm](https://evolecolgroup.github.io/tidysdm/articles/a2_tidymodels_additions.html).
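As a hedged illustration (the `habitat` factor below is hypothetical and not
part of our dataset), creating dummy variables within a recipe would look
something like:

```{r step_dummy_sketch, eval=FALSE}
# hypothetical sketch (not run): if our data included a factor such as
# "habitat", we could convert it into dummy (indicator) variables as a
# step of the recipe
lacerta_rec_dummy <- recipe(lacerta_thin, formula = class ~ .) %>%
  step_dummy(habitat)
```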

And now we project the climate variables in the same way as we did for all previous
spatial data:
```{r}
climate_present <- terra::project(climate_present, y = iberia_proj4)
```

Next, we extract climate for all presences and background points:

```{r}
lacerta_thin <- lacerta_thin %>%
  bind_cols(terra::extract(climate_present, lacerta_thin, ID = FALSE))
```


```{r echo=FALSE, results='hide', eval=create_sample_data}
terra::saveRDS(lacerta_thin, "../inst/extdata/lacerta_thin_all_vars.rds")
```

Before going forward with the analysis, we should make sure that there
are no missing values in the climate data that we extracted:

```{r}
summary(lacerta_thin)
```

We can see that there are no missing values in any of the
extracted climate variables. If that were not the case, we would have
to go back to the climate raster and homogenise the NAs
across layers (i.e. variables). This can be achieved either by
setting the same cells to NA in all layers (including the land mask
that we used to thin the data), or by interpolating the layers with less
information to fill the gaps (e.g. cloud cover in some remotely sensed data).
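A minimal sketch of the first option (masking all layers to a common NA
footprint) is given below; it is not run here, since our raster has no missing
values:

```{r homogenise_na_sketch, eval=FALSE}
# flag cells that are NA in at least one layer
na_cells <- terra::app(climate_present, fun = function(x) any(is.na(x)))
# set those cells to NA in every layer, so that all layers (and thus all
# variables) share the same missing-data footprint
climate_present <- terra::mask(climate_present, na_cells, maskvalues = TRUE)
```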

Based on this paper (<https://doi.org/10.1007/s10531-010-9865-2>), we
are interested in these variables: "bio06", "bio05", "bio13", "bio14", "bio15".
We can visualise the differences between presences and the background using violin plots:

```{r fig.height=11, fig.width=7}
lacerta_thin %>% plot_pres_vs_bg(class)
```

We can see that all the variables of interest do seem to have a different distribution
between presences and the background. We can formally quantify the mismatch between
the two by computing the overlap:

```{r}
lacerta_thin %>% dist_pres_vs_bg(class)
```

Again, we can see that the variables of interest seem good candidates with a clear
signal. Let us then focus on those variables:

```{r climate_variables}
suggested_vars <- c("bio06", "bio05", "bio13", "bio14", "bio15")
```

Environmental variables are often highly correlated, and collinearity is
an issue for several types of models. We can inspect the correlation
among variables with:

```{r, fig.width=7, fig.height=8}
pairs(climate_present[[suggested_vars]])
```

We can see that some variables have rather high correlation (e.g. bio05
vs bio14). We can subset to variables below a certain threshold
correlation (e.g. 0.7) with:

```{r choose_var_cor_keep}
climate_present <- climate_present[[suggested_vars]]

vars_uncor <- filter_collinear(climate_present,
  cutoff = 0.7,
  method = "cor_caret"
)
vars_uncor
```

So, removing bio14 leaves us with a set of uncorrelated variables. Note that
`filter_collinear` has other methods, based on variance inflation factors, that
would also be worth exploring. For this example, we will remove bio14 and work
with the remaining variables.
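A hedged sketch of a VIF-based alternative (not run; check
`?filter_collinear` for the methods available in your installed version of
`tidysdm`):

```{r filter_collinear_vif_sketch, eval=FALSE}
# stepwise removal of variables based on variance inflation factors,
# using the method's default VIF cutoff
vars_uncor_vif <- filter_collinear(climate_present, method = "vif_step")
vars_uncor_vif
```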

```{r}
lacerta_thin <- lacerta_thin %>% select(all_of(c(vars_uncor, "class")))
climate_present <- climate_present[[vars_uncor]]
names(climate_present) # variables retained in the end
```

# Fit the model by cross-validation

Next, we need to set up a `recipe` to define how to handle our dataset.
We don't want to do anything to our data in terms of transformations, so
we just need to define the formula (*class* is the `outcome`, all other
variables are `predictors`; note that, for `sf` objects, `geometry` is
automatically replaced by `X` and `Y` columns which are assigned a role
of `coords`, and thus not used as predictors):

```{r recipe}
lacerta_rec <- recipe(lacerta_thin, formula = class ~ .)
lacerta_rec
```

Note that `step_` functions from `recipes` that filter data rows (e.g. `step_naomit()`)
can be problematic (see the relevant section of the man page for [`step_naomit()`](https://recipes.tidymodels.org/reference/step_naomit.html#row-filtering)). We recommend that you filter your data before setting
up the recipe, rather than attempting to use steps in the recipe.
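A minimal sketch of such pre-filtering (not needed here, as we have already
checked that there are no missing values in our predictors):

```{r filter_na_sketch, eval=FALSE}
# drop any row with a missing value in one of the predictors before
# building the recipe, instead of relying on step_naomit()
lacerta_thin <- lacerta_thin %>%
  filter(!if_any(all_of(vars_uncor), is.na))
```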

In classification models for `tidymodels`, the assumption is that the
level of interest for the response (in our case, presences) is the
reference level. We can confirm that we have the data correctly
formatted with:

```{r}
lacerta_thin %>% check_sdm_presence(class)
```

We now build a `workflow_set` of different models, defining which
hyperparameters we want to tune. We will use *glm*, *random forest*,
*boosted_trees* and *maxent* as our models (for more details on how to
use `workflow_set`s, see [this
tutorial](https://workflowsets.tidymodels.org/articles/tuning-and-comparing-models.html)).
The latter three models have tunable hyperparameters. For the most
commonly used models, `tidysdm` automatically chooses the most important
parameters, but it is possible to fully customise model specifications
(e.g. see the help for `sdm_spec_rf()`). Note that, if you use GAMs with `sdm_spec_gam()`,
it is necessary to update the model with `gam_formula()` due to the non-standard
formula notation of GAMs (see the help
of `sdm_spec_gam()` for an example of how to do this).
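As a hedged sketch (not run), adding a GAM to a `workflow_set` and updating its
formula might look like this:

```{r gam_sketch, eval=FALSE}
# sketch: add a GAM to a workflow_set, then give it a proper smooth
# formula built from the recipe with gam_formula()
gam_models <- workflow_set(
  preproc = list(default = lacerta_rec),
  models = list(gam = sdm_spec_gam()),
  cross = TRUE
) %>%
  update_workflow_model("default_gam",
    spec = sdm_spec_gam(),
    formula = gam_formula(lacerta_rec)
  )
```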

```{r workflow_set}
lacerta_models <-
  # create the workflow_set
  workflow_set(
    preproc = list(default = lacerta_rec),
    models = list(
      # the standard glm specs
      glm = sdm_spec_glm(),
      # rf specs with tuning
      rf = sdm_spec_rf(),
      # boosted tree model (gbm) specs with tuning
      gbm = sdm_spec_boost_tree(),
      # maxent specs with tuning
      maxent = sdm_spec_maxent()
    ),
    # make all combinations of preproc and models,
    cross = TRUE
  ) %>%
  # tweak controls to store information needed later to create the ensemble
  option_add(control = control_ensemble_grid())
```

We now want to set up a spatial block cross-validation scheme to tune
and assess our models. We will split the data by creating 5 folds, using
the `spatial_block_cv` function from the package `spatialsample`.
`spatialsample` offers a number of sampling approaches for spatial data;
it is also possible to convert objects created with `blockCV` (which
offers further features for spatial sampling, such as stratified
sampling) into an `rsample` object suitable for `tidysdm` with the
function `blockcv2rsample`.
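A hedged sketch of the `blockCV` route (not run; argument names follow
`blockCV` version 3.x, and the block size is an arbitrary illustration):

```{r blockcv_sketch, eval=FALSE}
library(blockCV)
# create spatial blocks with blockCV, then convert them into an rsample
# object that tidysdm/tidymodels can use
lacerta_blocks <- cv_spatial(
  x = lacerta_thin,
  column = "class", # the response column
  size = km2m(50), # block size in metres
  k = 5, # number of folds
  selection = "random"
)
lacerta_cv_blockcv <- blockcv2rsample(lacerta_blocks, lacerta_thin)
```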

```{r training_cv, fig.width=6, fig.height=4}
library(tidysdm)
set.seed(105)
lacerta_cv <- spatial_block_cv(lacerta_thin, v = 5)
autoplot(lacerta_cv)
```

We can check that the splits are reasonably balanced with:
```{r check_balance}
check_splits_balance(lacerta_cv, class)
```

We can now use the block CV folds to tune and assess the models (to keep
computations fast, we will only explore 3 combinations of hyperparameters
per model; this is far too few for a real analysis!):

```{r tune_grid}
set.seed(1234567)
lacerta_models <-
  lacerta_models %>%
  workflow_map("tune_grid",
    resamples = lacerta_cv, grid = 3,
    metrics = sdm_metric_set(), verbose = TRUE
  )
```

Note that `workflow_set` correctly detects that we have no tuning
parameters for *glm*. We can have a look at the performance of our
models with:

```{r autoplot_models, fig.width=7, fig.height=4}
autoplot(lacerta_models)
```

Now let's create an ensemble, selecting the best set of parameters for
each model (this is really only relevant for the ML algorithms, as there
were no hyperparameters to tune for the *glm*). We will use
the Boyce continuous index as our metric to choose the best random
forest and boosted tree. When members are added to an ensemble, they are
automatically fitted to the full training dataset, and so are ready to make
predictions.

```{r}
lacerta_ensemble <- simple_ensemble() %>%
  add_member(lacerta_models, metric = "boyce_cont")
lacerta_ensemble
```

```{r echo=FALSE, results='hide', eval=create_sample_data}
usethis::use_data(lacerta_ensemble, overwrite = TRUE)
```

And visualise it:

```{r autoplot_ens, fig.width=7, fig.height=4}
autoplot(lacerta_ensemble)
```

A tabular form of the model metrics can be obtained with:

```{r}
lacerta_ensemble %>% collect_metrics()
```

# Projecting to the present

We can now make predictions with this ensemble (using the default option
of taking the mean of the predictions from each model).

```{r plot_present, fig.width=6, fig.height=4}
prediction_present <- predict_raster(lacerta_ensemble, climate_present)
ggplot() +
  geom_spatraster(data = prediction_present, aes(fill = mean)) +
  scale_fill_terrain_c() +
  # plot presences used in the model
  geom_sf(data = lacerta_thin %>% filter(class == "presence"))
```

We can subset the ensemble to only use the best models, based on the
Boyce continuous index, by setting a minimum threshold of 0.5 for that
metric (this is somewhat low; for a real analysis we would recommend a value
of 0.7 or higher). We will also take the median of the available model predictions
(instead of the mean, which is the default). The plot does not change
much (the models are quite consistent).

```{r plot_present_best, fig.width=6, fig.height=4}
prediction_present_boyce <- predict_raster(lacerta_ensemble, climate_present,
  metric_thresh = c("boyce_cont", 0.5),
  fun = "median"
)
ggplot() +
  geom_spatraster(data = prediction_present_boyce, aes(fill = median)) +
  scale_fill_terrain_c() +
  geom_sf(data = lacerta_thin %>% filter(class == "presence"))
```

Sometimes, it is desirable to have binary predictions (presence vs
absence), rather than the probability of occurrence. To do so, we first
need to calibrate the threshold used to convert probabilities into
classes (in this case, we optimise the TSS):


```{r}
lacerta_ensemble <- calib_class_thresh(lacerta_ensemble,
  class_thresh = "tss_max",
  metric_thresh = c("boyce_cont", 0.5)
)
```

And now we can make binary predictions over our study area:

```{r, fig.width=6, fig.height=4}
prediction_present_binary <- predict_raster(lacerta_ensemble,
  climate_present,
  type = "class",
  class_thresh = c("tss_max"),
  metric_thresh = c("boyce_cont", 0.5)
)
ggplot() +
  geom_spatraster(data = prediction_present_binary, aes(fill = binary_mean)) +
  geom_sf(data = lacerta_thin %>% filter(class == "presence")) +
  scale_fill_discrete(na.value = "transparent")
```

# Projecting to the future

WorldClim has a wide selection of projections for the future based on
different models and Shared Socio-economic Pathways (SSP). Type
`help("WorldClim_2.1")` for a full list. We will use predictions based
on "HadGEM3-GC31-LL" model for SSP 245 (intermediate green house gas
emissions) at the same resolution as the present day data (10
arc-minutes). We first download the data:

```{r eval=download_data}
download_dataset("WorldClim_2.1_HadGEM3-GC31-LL_ssp245_10m")
```

Let's see what times are available:

```{r eval=FALSE}
get_time_ce_steps("WorldClim_2.1_HadGEM3-GC31-LL_ssp245_10m")
```

```{r echo=FALSE}
c(2030, 2050, 2070, 2090)
```

We will predict for 2090, the furthest time step into the future that is
available.

Let's now check the available variables:

```{r eval=FALSE}
get_vars_for_dataset("WorldClim_2.1_HadGEM3-GC31-LL_ssp245_10m")
```

```{r echo=FALSE}
climate_vars[-length(climate_vars)]
```

Note that future predictions do not include *altitude* (as that does not
change with time), so if we needed it, we would have to copy it over
from the present. However, it is not in our set of uncorrelated
variables that we used earlier, so we don't need to worry about it.
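If we did need it, a minimal sketch (not run; it assumes the unsubset
present-day raster, with its `altitude` layer, is still available) would be:

```{r copy_altitude_sketch, eval=FALSE}
# copy the static altitude layer from the present-day raster into the
# future raster (altitude is assumed not to change between time slices)
climate_future$altitude <- climate_present$altitude
```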

```{r eval=download_data}
climate_future <- pastclim::region_slice(
  time_ce = 2090,
  bio_variables = vars_uncor,
  data = "WorldClim_2.1_HadGEM3-GC31-LL_ssp245_10m",
  crop = iberia_poly
)
```

```{r echo=FALSE, results='hide', eval=create_sample_data}
terra::saveRDS(
  climate_future,
  "../inst/extdata/lacerta_climate_future_10m.rds"
)
```

```{r echo=FALSE, results='asis', eval=!download_data}
climate_future <- terra::readRDS(
  system.file("extdata/lacerta_climate_future_10m.rds",
    package = "tidysdm"
  )
)
```

We now project the future climate raster with the same projection that we have
been using for the analysis:
```{r}
climate_future <- terra::project(climate_future, y = iberia_proj4)
```

And predict using the ensemble:

```{r plot_future, fig.width=6, fig.height=4}
prediction_future <- predict_raster(lacerta_ensemble, climate_future)

ggplot() +
  geom_spatraster(data = prediction_future, aes(fill = mean)) +
  scale_fill_terrain_c()
```


# Dealing with extrapolation

The total area of projection of the model may include environmental conditions 
which lie outside the range of conditions covered by the calibration dataset. 
This phenomenon can lead to misinterpretation of the SDM outcomes due to 
spatial extrapolation.

`tidysdm` offers a couple of approaches to deal with this problem. The simplest is to clamp the environmental variables so that they stay within the limits observed
in the calibration set:

```{r, fig.width=6, fig.height=4}
climate_future_clamped <- clamp_predictors(climate_future,
  training = lacerta_thin,
  .col = class
)
prediction_future_clamped <- predict_raster(lacerta_ensemble,
  raster = climate_future_clamped
)

ggplot() +
  geom_spatraster(data = prediction_future_clamped, aes(fill = mean)) +
  scale_fill_terrain_c()
```

The predictions seem to have changed very little.

An alternative is to allow values to exceed the ranges of the calibration set,
but compute the Multivariate environmental similarity surfaces (MESS) (Elith et al. 2010) to highlight areas where extrapolation occurs and thus visualise the prediction's uncertainty.

We estimate the MESS for the same future time slice used above:

```{r, fig.width=6, fig.height=4}
lacerta_mess_future <- extrapol_mess(
  x = climate_future,
  training = lacerta_thin,
  .col = "class"
)

ggplot() +
  geom_spatraster(data = lacerta_mess_future) +
  scale_fill_viridis_b(na.value = "transparent")
```

Extrapolation occurs in areas where MESS values are negative, with the magnitude
of the negative values indicating how extreme the extrapolation is. From
this plot, we can see that the areas of extrapolation are where the model already
predicted a suitability of zero. This explains why clamping did little to our
predictions.

We can now overlay the MESS values on the future prediction to visualise the areas characterised by spatial extrapolation.

```{r, fig.width=6, fig.height=4}
# subset mess
lacerta_mess_future_subset <- lacerta_mess_future
lacerta_mess_future_subset[lacerta_mess_future_subset >= 0] <- NA
lacerta_mess_future_subset[lacerta_mess_future_subset < 0] <- 1

# convert into polygon
lacerta_mess_future_subset <- as.polygons(lacerta_mess_future_subset)

library(ggpattern)

# plot as a mask
ggplot() +
  geom_spatraster(data = prediction_future) +
  scale_fill_terrain_c() +
  geom_sf_pattern(
    data = lacerta_mess_future_subset,
    pattern = "stripe",
    fill = "transparent",
    pattern_fill = "black",
    pattern_density = 0.02,
    pattern_spacing = 0.05,
    pattern_angle = 45,
    alpha = 0.1,
    linewidth = 0.5
  )
```

Note that clamping and MESS are not only useful when making predictions into the future, but also into the past and present (in the latter case, it allows us to make
sure that the background/pseudoabsences do cover the full range of predictor
variables over the area of interest).
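For instance, a quick check on the present-day layers would simply mirror the
call we used above (not run):

```{r mess_present_sketch, eval=FALSE}
# MESS for the present-day raster, to check that our presences and
# background span the range of conditions found over the study area
lacerta_mess_present <- extrapol_mess(
  x = climate_present,
  training = lacerta_thin,
  .col = "class"
)
```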

The `tidymodels` universe also includes functions to estimate the area of 
applicability in the package `waywiser`, which can be used with `tidysdm`.

# Visualising the contribution of individual variables

It is sometimes of interest to understand the relative contribution of
individual variables to the prediction. This is a complex task,
especially if there are interactions among variables. For simpler linear
models, it is possible to obtain marginal response curves (which show
the effect of a variable whilst keeping all other variables to their
mean) using `step_profile()` from the `recipes` package. We use
`step_profile()` to define a new recipe which we can then bake to
generate the appropriate dataset to make the marginal prediction. We can
then plot the predictions against the values of the variable of
interest. For example, to investigate the contribution of `bio05`, we
would:

```{r, fig.width=6, fig.height=4}
bio05_prof <- lacerta_rec %>%
  step_profile(-bio05, profile = vars(bio05)) %>%
  prep(training = lacerta_thin)

bio05_data <- bake(bio05_prof, new_data = NULL)

bio05_data <- bio05_data %>%
  mutate(
    pred = predict(lacerta_ensemble, bio05_data)$mean
  )

ggplot(bio05_data, aes(x = bio05, y = pred)) +
  geom_point(alpha = .5, cex = 1)
```

It is also possible to use
[DALEX](https://modeloriented.github.io/DALEX/) to explore `tidysdm`
models; see more details in the [tidymodels
additions](https://evolecolgroup.github.io/tidysdm/dev/articles/a2_tidymodels_additions.html)
article.

# Repeated ensembles

The steps of thinning and sampling pseudo-absences can have a big impact
on the performance of SDMs. As these steps are stochastic, it is good
practice to explore their effect by repeating them, and then creating
ensembles of models over these repeats. In `tidysdm`, it is possible to
create `repeat_ensembles`. We start by creating a list of
`simple_ensembles`, by looping through the SDM pipeline. We will just
use two fast models to speed up the process,
and use pseudo-absences instead of background points.

```{r}
# empty object to store the simple ensembles that we will create
ensemble_list <- list()
set.seed(1234) # make sure you set the seed OUTSIDE the loop
for (i_repeat in 1:3) {
  # thin the data
  lacerta_thin_rep <- thin_by_cell(lacerta, raster = climate_present)
  lacerta_thin_rep <- thin_by_dist(lacerta_thin_rep, dist_min = 20000)
  # sample pseudo-absences
  lacerta_thin_rep <- sample_pseudoabs(lacerta_thin_rep,
    n = 3 * nrow(lacerta_thin_rep),
    raster = climate_present,
    method = c("dist_min", 50000)
  )
  # get climate
  lacerta_thin_rep <- lacerta_thin_rep %>%
    bind_cols(terra::extract(climate_present, lacerta_thin_rep, ID = FALSE))
  # create folds
  lacerta_thin_rep_cv <- spatial_block_cv(lacerta_thin_rep, v = 5)
  # create a recipe
  lacerta_thin_rep_rec <- recipe(lacerta_thin_rep, formula = class ~ .)
  # create a workflow_set
  lacerta_thin_rep_models <-
    # create the workflow_set
    workflow_set(
      preproc = list(default = lacerta_thin_rep_rec),
      models = list(
        # the standard glm specs
        glm = sdm_spec_glm(),
        # maxent specs with tuning
        maxent = sdm_spec_maxent()
      ),
      # make all combinations of preproc and models,
      cross = TRUE
    ) %>%
    # tweak controls to store information needed later to create the ensemble
    option_add(control = control_ensemble_grid())

  # train the model
  lacerta_thin_rep_models <-
    lacerta_thin_rep_models %>%
    workflow_map("tune_grid",
      resamples = lacerta_thin_rep_cv, grid = 3,
      metrics = sdm_metric_set(), verbose = TRUE
    )
  # make a simple ensemble and add it to the list
  ensemble_list[[i_repeat]] <- simple_ensemble() %>%
    add_member(lacerta_thin_rep_models, metric = "boyce_cont")
}
```

Now we can create a `repeat_ensemble` from the list:

```{r}
lacerta_rep_ens <- repeat_ensemble() %>% add_repeat(ensemble_list)
lacerta_rep_ens
```

```{r echo=FALSE, results='hide', eval=create_sample_data}
usethis::use_data(lacerta_rep_ens, overwrite = TRUE)
```


We can summarise the goodness of fit of the models for each repeat with
`collect_metrics()`, but there is no `autoplot()` method for
`repeat_ensemble` objects.
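For example (not run):

```{r rep_ens_metrics_sketch, eval=FALSE}
# tabular summary of the metrics across all repeats and models
lacerta_rep_ens %>% collect_metrics()
```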

We can then predict in the usual way. We will take the mean and median
of all models, without filtering by performance, and plot the results:

```{r, fig.width=6, fig.height=4}
lacerta_rep_ens <- predict_raster(lacerta_rep_ens, climate_present,
  fun = c("mean", "median")
)
ggplot() +
  geom_spatraster(data = lacerta_rep_ens, aes(fill = median)) +
  scale_fill_terrain_c()
```

Note that the predictions are quite similar to the ones we obtained
before, but the predicted suitable range is somewhat larger, probably because we
included models that are not very good (as we did not filter by performance) in the ensemble.