---
title: "Running spatial machine learning models"
output:
  rmarkdown::html_vignette: default
vignette: >
  %\VignetteIndexEntry{Running spatial machine learning models}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

The default MBG setup models a uniform linear relationship between each
covariate and the outcome in modeling space. This assumption may not
always hold: there may be nonlinear relationships, spatially-varying
relationships, and covariate interactions that affect the outcome. If we
want to capture the nuances of these predictor relationships without
over-fitting our model, we can turn to machine learning (ML) algorithms
that are designed to fit flexible relationships across a
high-dimensional feature space.

# Setup

In this tutorial, we will run multiple machine learning models using
point-georeferenced outcome data and raster covariate data, then apply
the ML model predictions to generate raster predictions across the
prediction space. To begin, load the `mbg` package and helper packages
for data manipulation, then load the example data. Generate a raster of
the full prediction space using the `build_id_raster()` function.

``` r
## Load packages
library(data.table)
library(sf)
library(terra)
library(mbg)

## Load data
# Outcome: child stunting
outcomes <- data.table::fread(
  system.file('extdata/child_stunting.csv', package = 'mbg')
)
# Spatial covariates, including an intercept
covariates <- list(
  access = terra::rast(system.file('extdata/access.tif', package = 'mbg')),
  evi = terra::rast(system.file('extdata/evi.tif', package = 'mbg')),
  temperature = terra::rast(system.file('extdata/temperature.tif', package = 'mbg'))
)
covariates$intercept <- covariates[[1]] * 0 + 1
# Administrative boundaries
# Departments (first-level administrative divisions)
departments <- sf::st_read(
  system.file('extdata/Benin_departments.gpkg', package = 'mbg'),
  quiet = TRUE
)

# Create ID raster
id_raster <- mbg::build_id_raster(
  polygons = departments,
  template_raster = covariates[[1]]
)
```
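
Before modeling, it can help to quickly inspect the loaded objects. The
calls below are a minimal sanity check using only the objects created
above:

``` r
# Quick look at the inputs: outcome table, covariate layers, ID raster
head(outcomes)
names(covariates)
id_raster
```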

# Running ML models with `mbg` and `caret`

The `mbg` package wraps the [`caret`](https://topepo.github.io/caret/)
package to run predictive models over space. The `caret` (Classification
And REgression Training) package contains 137 different model
specifications for regression: these include linear regression and
generalized linear models, generalized additive models, penalized
regression (LASSO, ridge, elastic net), support vector machines,
regression trees, gradient boosting, neural nets, and more.

You can search the full list of available regression models in the
`caret`
[documentation](https://topepo.github.io/caret/available-models.html).
Note that some models have tuning parameters that can be set beforehand.
We will create a named list of `submodel_settings`, where each name
corresponds to the name of the model we will be using, and each value is
an optional named list containing any tuning parameters we want to pass
to that model.
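
As a shortcut, `caret` can also list its model catalog
programmatically. The sketch below uses `caret::modelLookup()`, which
reports each model's tuning parameters and whether it supports
regression:

``` r
# Browse caret's model catalog without leaving R
library(caret)
# Models flagged as supporting regression ('forReg')
regression_models <- unique(subset(caret::modelLookup(), forReg)$model)
head(regression_models)
# Tuning parameters for a single model, e.g. gradient boosted machines
caret::modelLookup('gbm')
```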

Each ML model will be tuned using cross-validation. We can control model
tuning by editing `cross_validation_settings`, a named list of arguments
that will be passed to the `caret::trainControl` function.
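
For reference, the settings used later in this tutorial correspond to
the `caret::trainControl()` call sketched below, on the assumption that
the wrapper forwards the list entries as arguments:

``` r
# Equivalent trainControl call: 5-fold cross-validation, repeated 5 times
library(caret)
ctrl <- caret::trainControl(method = 'repeatedcv', number = 5, repeats = 5)
```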

The `run_regression_submodels()` function fits these models and
generates predictions across the study area. It returns a named list
with two items:

  - `"models"`: The fitted caret models, which can be used to closely
    inspect the model results and generate new predictions.
  - `"predictions"`: Rasters containing the predictions of the outcome
    generated from each model.

In the code block below, we run three regression models: elastic net (a
penalized regression method), gradient boosted machines, and bagged
regression trees. Each uses child stunting in Benin as the outcome and
the three spatial covariates as predictors. We then plot the results
across Benin:

``` r
## Run ML models using input covariates
cross_validation_settings <- list(method = 'repeatedcv', number = 5, repeats = 5)
submodel_settings <- list(
  enet = NULL,
  gbm = list(verbose = FALSE),
  treebag = NULL
)
submodels <- mbg::run_regression_submodels(
  input_data = outcomes,
  id_raster = id_raster,
  covariates = covariates,
  cv_settings = cross_validation_settings,
  model_settings = submodel_settings,
  prediction_range = c(0, 1)
)
#> Fitting 3 regression models
#>   Candidate model:  enet
#> Loading required package: ggplot2
#> Loading required package: lattice
#>   Candidate model:  enet: 0.932 sec elapsed
#>   Candidate model:  gbm
#>   Candidate model:  gbm: 1.825 sec elapsed
#>   Candidate model:  treebag
#>   Candidate model:  treebag: 2.508 sec elapsed
#> Fitting 3 regression models: 5.377 sec elapsed
```
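
The `models` item contains ordinary `caret` train objects. Assuming the
list is named by submodel, as the predictions are, we can inspect the
cross-validation results directly:

``` r
# Inspect one fitted submodel: chosen tuning values and CV performance
enet_fit <- submodels$models$enet
enet_fit$bestTune       # tuning values selected by cross-validation
head(enet_fit$results)  # resampled performance across the tuning grid
# Relative variable influence for the gradient boosted machine
caret::varImp(submodels$models$gbm)
```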

``` r
plot(
  terra::rast(submodels$predictions),
  main = paste('Submodel prediction:', names(submodels$predictions)),
  range = c(.15, .45),
  fill_range = TRUE
)
```

![](spatial-ml-models_files/figure-gfm/run-ml-models-1.png)<!-- -->

Many machine learning models can handle a large feature space without
over-fitting, meaning that we can safely add more candidate predictors
to the model. The `run_regression_submodels()` function includes an
option to add administrative boundaries to the feature space as one-hot
encoded variables. In the code block below, we use this option to add
department (first-level administrative division) identifiers to the
predictors:

``` r
submodels_with_department_effects <- mbg::run_regression_submodels(
  input_data = outcomes,
  id_raster = id_raster,
  covariates = covariates,
  cv_settings = cross_validation_settings,
  model_settings = submodel_settings,
  use_admin_bounds = TRUE,
  admin_bounds = departments,
  admin_bounds_id = 'department_code',
  prediction_range = c(0, 1)
)
#> Fitting 3 regression models
#>   Candidate model:  enet
#>   Candidate model:  enet: 1.055 sec elapsed
#>   Candidate model:  gbm
#>   Candidate model:  gbm: 2.548 sec elapsed
#>   Candidate model:  treebag
#>   Candidate model:  treebag: 3.82 sec elapsed
#> Fitting 3 regression models: 7.529 sec elapsed
```

``` r
plot(
  terra::rast(submodels_with_department_effects$predictions),
  main = paste('Submodel prediction:', names(submodels_with_department_effects$predictions)),
  range = c(.15, .45),
  fill_range = TRUE
)
```

![](spatial-ml-models_files/figure-gfm/run-with-department-effects-1.png)<!-- -->

# Stacking ML models

The `run_regression_submodels()` function allows users to estimate the
outcome using many different types of regression models, which may not
make similar predictions across the study area. How can we combine
estimates from these models while also incorporating some of the
geostatistical principles used in the baseline `mbg` model? One option
is to use [stacked
generalization](https://doi.org/10.1007/s10654-018-0390-z), which
combines model predictions to create a “super learner” with better
performance than any individual model. [Previous
studies](https://doi.org/10.1098/rsif.2017.0520) indicate that this
“super learner” ensemble can replace the covariate term in a standard
geostatistical model to improve prediction accuracy.
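
As a toy illustration of the idea, and not the package's method, the
sketch below averages the submodel surfaces from the previous section
with equal weights; the actual “super learner” instead estimates the
submodel weights jointly with the spatial terms in the geostatistical
model:

``` r
# Toy illustration only: an unweighted mean of the submodel surfaces
pred_stack <- terra::rast(submodels$predictions)
naive_ensemble <- terra::app(pred_stack, mean)
plot(naive_ensemble, main = 'Naive ensemble: unweighted submodel mean')
```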

To run a geostatistical model with stacked generalization, create a
`MbgModelRunner` object with `use_stacking = TRUE`. The arguments
`stacking_model_settings`, `stacking_cv_settings`, and
`stacking_prediction_range` should also be set when running stacked
generalization. In the example below, we also include administrative
effects as one-hot encoded variables in the component submodels.

After enabling stacking, run the `MbgModelRunner$run_mbg_pipeline()`
method to complete the full model workflow. This will now include the
following steps:

  - Run the component regression submodels
  - For each submodel, create estimates of the outcome at each pixel
    across the study area
  - Run the geostatistical model, replacing the standard covariate
    effect with the “super learner” ensemble based on the regression
    submodels
  - Generate final predictions from the geostatistical model

``` r
# Run stacked generalization MBG model
model_runner <- MbgModelRunner$new(
  input_data = outcomes,
  id_raster = id_raster,
  covariate_rasters = covariates,
  use_stacking = TRUE,
  stacking_cv_settings = cross_validation_settings,
  stacking_model_settings = submodel_settings,
  stacking_prediction_range = c(0, 1),
  stacking_use_admin_bounds = TRUE,
  admin_bounds = departments,
  admin_bounds_id = 'department_code'
)
model_runner$run_mbg_pipeline()
#> Fitting 3 regression models
#>   Candidate model:  enet
#>   Candidate model:  enet: 1.219 sec elapsed
#>   Candidate model:  gbm
#>   Candidate model:  gbm: 2.403 sec elapsed
#>   Candidate model:  treebag
#>   Candidate model:  treebag: 3.434 sec elapsed
#> Fitting 3 regression models: 7.2 sec elapsed
#> MBG model fitting
#> MBG model fitting: 12.556 sec elapsed
#> Generating model predictions
#>   Parameter posterior samples
#>   Parameter posterior samples: 5.23 sec elapsed
#>   Cell draws
#>   Cell draws: 2.128 sec elapsed
#>   Summarize draws
#>   Summarize draws: 1.447 sec elapsed
#> Generating model predictions: 8.807 sec elapsed
```

When `use_stacking = TRUE`, the submodel predictions are saved in the
`MbgModelRunner$model_covariates` attribute. These can be pulled and
plotted:

``` r
# Get predictions from each component submodel
mbg_component_models <- model_runner$model_covariates

plot(
  terra::rast(mbg_component_models),
  range = c(.15, .45),
  fill_range = TRUE
)
```

![](spatial-ml-models_files/figure-gfm/plot-submodels-1.png)<!-- -->

We can also plot mean estimates from the geostatistical model. Note how
the geostatistical model blends estimates from the component submodels
while also incorporating other effects, including the latent spatial
term and the nugget:

``` r
# Get predictions from the overall model
grid_cell_predictions <- model_runner$grid_cell_predictions
# Plot mean estimates
plot(
  grid_cell_predictions$cell_pred_mean,
  main = 'MBG mean estimates'
)
lines(departments)
```

![](spatial-ml-models_files/figure-gfm/plot-model-results-1.png)<!-- -->

As with the default model, we can plot uncertainty by pixel:

``` r
ui_raster <- grid_cell_predictions$cell_pred_upper - grid_cell_predictions$cell_pred_lower
plot(
  ui_raster,
  col = sf::sf.colors(n = 100),
  main = 'MBG estimates: 95% uncertainty interval width'
)
lines(departments)
```

![](spatial-ml-models_files/figure-gfm/plot-model-uncertainty-1.png)<!-- -->
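
As an optional follow-up, we can scale the interval width by the mean
estimate, a rough sketch for spotting areas where uncertainty is large
relative to the predicted level:

``` r
# Relative uncertainty: interval width as a share of the mean estimate
relative_ui <- ui_raster / grid_cell_predictions$cell_pred_mean
plot(relative_ui, main = 'MBG estimates: relative uncertainty')
lines(departments)
```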