---
title: "Quickstart: applying PCP to a simulated environmental mixture"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Quickstart: applying PCP to a simulated environmental mixture}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

In this vignette, we will see how to leverage the functions in `pcpr` to:

1. Simulate and explore a simple environmental mixture dataset;
2. Determine the best PCP algorithm to employ with the simulated data;
3. Tune PCP's parameters using a cross-validated grid search; and
4. Evaluate PCP's performance against the simulated ground truth mixture model.

We start by loading and attaching the `pcpr` package to our current R session.

```{r setup}
library(pcpr)
```

Let's also write a quick helper function that will enable us to visualize, as heatmaps, the matrices we will be working with throughout this vignette:

```{r helper}
library(magrittr) # for the pipe %>%
library(ggplot2) # for plotting

plot_matrix <- function(D, ..., lp = "none", title = NULL) {
  D <- t(D)
  if (is.null(colnames(D))) colnames(D) <- paste0("C", 1:ncol(D))
  data.frame(D) %>%
    dplyr::mutate(x = paste0("R", 1:nrow(.))) %>%
    tidyr::pivot_longer(tidyselect::all_of(colnames(D)), names_to = "y", values_to = "value") %>%
    dplyr::mutate(x = factor(x, levels = unique(x)), y = factor(y, levels = unique(y))) %>%
    ggplot(aes(x = x, y = y, fill = value)) +
    geom_raster() +
    scale_y_discrete(limits = rev) +
    coord_equal() +
    scale_fill_viridis_c(na.value = "white", ...) +
    theme_minimal() +
    theme(
      axis.text.x = element_blank(),
      axis.ticks.x = element_blank(),
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank(),
      axis.title.x = element_blank(),
      axis.title.y = element_blank(),
      legend.position = lp,
      plot.margin = margin(0, 0, 0, 0),
      aspect.ratio = 1
    ) +
    ggtitle(title)
}
```

## Simulating data

The `sim_data()` function lets users generate simple mixture models for
quick experimentation. Let's use its default parameters to simulate a noisy environmental mixture $D = L_0 + S_0 + Z_0$
comprised of $n = 100$ observations of $p = 10$ chemicals, with three underlying chemical
exposure patterns (i.e., a rank of $r = 3$), extreme outlying exposure events along
the diagonal of the matrix, and dense Gaussian noise corrupting all the
measurements in the matrix:

```{r sim data}
data <- sim_data()
D <- data$D
L_0 <- data$L
S_0 <- data$S
Z_0 <- data$Z
```

Let's get a sense of what these matrices look like. Remember that in real-world analyses,
we only observe $D$. The matrices $L_0$, $S_0$, and $Z_0$ constitute the ground truth, coming
from some unobserved data-generating mechanism:

```{r mat viz}
plot_matrix(D)
plot_matrix(L_0)
plot_matrix(S_0)
plot_matrix(Z_0)
```

The `matrix_rank()` function estimates the rank of a given matrix by counting the number of
nonzero singular values governing that matrix (with the help of a `thresh` parameter determining
what is "practically zero"):

```{r matrix rank}
matrix_rank(L_0)
matrix_rank(D)
```

Here we can see `L_0` has `r matrix_rank(L_0)` underlying patterns. This is obscured in the observed, full-rank
mixture matrix `D`, since the `S_0` and `Z_0` components are corrupting the underlying structure
provided by `L_0`.
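To build intuition for what `matrix_rank()` is doing, here is a minimal base-R sketch (independent of `pcpr`; the cutoff value is illustrative) that counts the singular values above a small "practically zero" threshold:

```r
# Construct a 100 x 10 matrix of rank 3 as a product of two thin factors
set.seed(42)
L_toy <- matrix(rnorm(100 * 3), 100, 3) %*% matrix(rnorm(3 * 10), 3, 10)

# Singular values past the third are numerically zero, so counting those
# above a small cutoff recovers the rank
sv <- svd(L_toy)$d
sum(sv > max(sv) * 1e-8)
```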

Let's use the `sim_lod()` function to simulate the chemicals (columns) of our data being subject to
a limit of detection (LOD). We will treat the bottom 10th percentile of each column as below the LOD,
and examine the LOD for each column:

```{r lod}
lod_info <- sim_lod(D, q = 0.1)
D_lod <- lod_info$D_tilde
lod <- lod_info$lod
lod
```

Next, because mixtures data are often incomplete in practice, let's further simulate a random 5% of
the values as missing (`NA`) with the `sim_na()` function. Once our missing values are in place, we can
finish simulating the LOD in the mixture by imputing the values simulated as below the LOD with $LOD / \sqrt{2}$,
a common imputation scheme for measurements below the LOD:

```{r corrupt mat randomly}
corrupted_data <- sim_na(D_lod, perc = 0.05)
D_tilde <- corrupted_data$D_tilde
lod_root2 <- matrix(
  lod / sqrt(2),
  nrow = nrow(D_tilde),
  ncol = ncol(D_tilde), byrow = TRUE
)
lod_idxs <- which(lod_info$tilde_mask == 1)
D_tilde[lod_idxs] <- lod_root2[lod_idxs]
plot_matrix(D_tilde)
```

The `D_tilde` matrix represents our observed, messy mixtures model, suffering from
incomplete `NA` observations and a chemical-specific LOD.

## Model selection

There are two PCP algorithms shipped with `pcpr`: the convex `root_pcp()`
[[Zhang et al. (2021)](https://proceedings.neurips.cc/paper/2021/hash/f65854da4622c1f1ad4ffeb361d7703c-Abstract.html)]
and non-convex `rrmc()` [[Cherapanamjeri et al. (2017)](https://proceedings.mlr.press/v70/cherapanamjeri17a.html)].
To figure out which model would be best for our data, let's inspect the singular
values of our observed mixture using the `sing()` method (`sing()` cannot accept
missing values, so we will use `impute_matrix()` to impute `NA` values in `D_tilde`
with their respective column means):

```{r sing}
D_imputed <- impute_matrix(D_tilde, apply(D_tilde, 2, mean, na.rm = TRUE))
singular_values <- sing(D_imputed)
plot(singular_values, type = "b")
```

`rrmc()` is best suited for data characterized by slowly decaying singular values,
indicative of complex underlying patterns and a relatively large degree of noise.
Most environmental health data can be described this way. `root_pcp()` is best for data characterized
by rapidly decaying singular values, indicative of very well-defined latent patterns.

The singular values plotted above decay quickly from the first to the second, but very gradually
from the second onward. For this simple simulated dataset, both PCP models are perfectly suitable.
We will use `rrmc()`, as this is the model environmental health researchers will
likely employ most frequently. The `vignette("pcp-applied")` contains an exemplary
mixtures matrix singular value plot with slowly decaying singular values.
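To see this decay heuristic in action outside of `pcpr`, we can compare the singular value profile of a lightly corrupted low-rank matrix against a heavily corrupted one (a base-R sketch; the noise scales here are arbitrary):

```r
set.seed(1)
signal <- matrix(rnorm(100 * 3), 100, 3) %*% matrix(rnorm(3 * 10), 3, 10)
light_noise <- signal + 0.05 * matrix(rnorm(100 * 10), 100, 10)
heavy_noise <- signal + 2 * matrix(rnorm(100 * 10), 100, 10)

# Normalize each spectrum by its largest singular value: with light noise the
# values collapse after the true rank (3); heavy noise flattens the tail
round(svd(light_noise)$d / svd(light_noise)$d[1], 3)
round(svd(heavy_noise)$d / svd(heavy_noise)$d[1], 3)
```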

## Grid search for parameter tuning

To estimate the low-rank and sparse matrices, `rrmc()` needs to be given a maximum rank
`r` and regularization parameter `eta`. To determine the optimal values of `r` and `eta`,
we will conduct a brief grid search using the `grid_search_cv()` function. In our grid
search below, we will examine all models from ranks 1 through 5 and all values of `eta`
near the default, calculated with `get_pcp_defaults()`.

```{r gs, eval = FALSE, echo = TRUE}
eta_0 <- get_pcp_defaults(D_tilde)$eta
etas <- data.frame("eta" = sort(c(0.1 * eta_0, eta_0 * seq(1, 10, 2))))
# to get progress bar, could wrap this
# in a call to progressr::with_progress({ gs <- grid_search_cv(...) })
gs <- grid_search_cv(
  D_tilde,
  pcp_fn = rrmc,
  grid = etas, r = 5, LOD = lod,
  parallel_strategy = "multisession",
  num_workers = 16,
  verbose = FALSE
)
```

The `rrmc()` approach to PCP uses an iterative rank-based procedure to recover `L` and
`S`, meaning it first constructs a rank `1` model and iteratively builds up to the
specified rank `r` solution. As such, for the above grid search, we
passed `etas` as the `grid` argument to search and sent $r = 5$ as a constant
parameter common to all models in the search. Since `etas` contains 6 values and $r = 5$, we
searched through 30 different PCP models. The `num_runs` argument determines how many (random)
tests should be performed for each unique model setting. By default, `num_runs = 100`,
so our grid search tuned `r` and `eta` by measuring the performance of 3000 different PCP models.
We passed the simulated `lod` vector as another constant to the grid search,
equipping each `rrmc()` run with the same LOD information.

```{r real gs, eval = TRUE, echo = FALSE}
gs <- readRDS("rds_files/quickstart-gs.rds")
```

```{r gs results}
r_star <- gs$summary_stats$r[1]
eta_star <- round(gs$summary_stats$eta[1], 3)
gs$summary_stats
```

Inspecting the `summary_stats` table from the output grid search provides the mean-aggregated
statistics for each of the 30 distinct parameter settings we tested.
The grid search correctly identified the rank `r r_star` solution as the best
(lowest relative error `rel_err`). The corresponding `eta` = `r eta_star`. The top three parameter
settings also have reasonable `S_sparsity` levels (all are above `0.95`). The next three
parameter settings seem to under-regularize the sparse `S` matrix by quite a bit, leaving roughly 80% of entries non-zero.
We will take the top parameters identified by the grid search in this instance. Had the very top parameters
yielded a sparsity of e.g. `0.7`, we likely would have preferred the second set of parameters, with sparsities in the
`0.9`s. That decision would be grounded in prior assumptions about the number of outliers to expect in the mixture.
For more on the interpretation of grid search results, consult the documentation for the `grid_search_cv()` function.

## Running PCP

Now we can run our PCP model:

```{r rrmc}
pcp_model <- rrmc(D_tilde, r = r_star, eta = eta_star, LOD = lod)
```

We can inspect the evolution of the objective function over the course of PCP's optimization:

```{r obj}
plot(pcp_model$objective, type = "l")
```

And the output `L` matrix:

```{r output L}
plot_matrix(pcp_model$L)
matrix_rank(pcp_model$L)
```

Let's briefly inspect PCP's estimate of the sparse matrix `S`, and fix any values that
are "practically" zero using the `hard_threshold()` function. The histogram below shows
a majority of the entries in `S` are between -0.4 and 0.4, so we will call those values
"practically" zero, and the rest true outlying exposure events. We can then calculate
the sparsity of `S` with `sparsity()`:

```{r sparse}
hist(pcp_model$S)
pcp_model$S <- hard_threshold(pcp_model$S, thresh = 0.4)
plot_matrix(pcp_model$S)
sparsity(pcp_model$S)
```
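Conceptually, hard thresholding simply zeroes out small-magnitude entries while leaving large ones untouched. A minimal base-R sketch of the idea (illustrative only; see `?hard_threshold` for `pcpr`'s actual implementation):

```r
# Zero out any entry whose absolute value falls below the threshold
hard_threshold_sketch <- function(M, thresh) {
  M[abs(M) < thresh] <- 0
  M
}

S_toy <- matrix(c(-0.1, 2.5, 0.3, -1.8, 0.05, 0), nrow = 2)
hard_threshold_sketch(S_toy, thresh = 0.4)
# Only the entries with |value| >= 0.4 (here 2.5 and -1.8) survive
```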

## Benchmarking with PCA

Before evaluating our PCP model, let's see how well a more traditional method such as
Principal Component Analysis (PCA) can recover `L_0`, to provide a benchmark for
comparison.

The `proj_rank_r()` function (project matrix to rank `r`) approximates an input matrix
as low-rank using a rank-`r` truncated SVD, the same way PCA approximates a low-rank
matrix. Normally, a researcher would need to determine `r` subjectively. We will give
PCA an advantage by sharing PCP's discovery from the above grid search that the solution
should be of rank `r r_star`:

```{r pca}
L_pca <- proj_rank_r(D_imputed, r = r_star)
```
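For intuition, a rank-`r` truncated SVD keeps only the `r` largest singular values and their singular vectors. A base-R sketch of the same projection (`truncate_svd` is a hypothetical helper name, not `pcpr`'s internals):

```r
# Rank-r approximation: reconstruct from the top r singular triplets
truncate_svd <- function(M, r) {
  s <- svd(M)
  s$u[, 1:r, drop = FALSE] %*% diag(s$d[1:r], r, r) %*% t(s$v[, 1:r, drop = FALSE])
}

# Projecting a rank-1 matrix to rank 2 recovers it (up to numerical error)
M <- outer(1:4, 1:3)
max(abs(truncate_svd(M, 2) - M))
```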

## Evaluating PCP against the ground truth

Finally, let's see how we did in recovering `L_0` and `S_0`. We will examine the
relative error between our model's estimates and the simulated ground truth matrices.
We use the Frobenius norm to calculate the relative errors between the matrices:

```{r performance metrics}
data.frame(
  "Obs_rel_err" = norm(L_0 - D_imputed, "F") / norm(L_0, "F"),
  "PCA_L_rel_err" = norm(L_0 - L_pca, "F") / norm(L_0, "F"),
  "PCP_L_rel_err" = norm(L_0 - pcp_model$L, "F") / norm(L_0, "F"),
  "PCP_S_rel_err" = norm(S_0 - pcp_model$S, "F") / norm(S_0, "F"),
  "PCP_L_rank" = matrix_rank(pcp_model$L),
  "PCP_S_sparsity" = sparsity(pcp_model$S)
)
```

PCP outperformed PCA by quite a bit! PCP's relative recovery error on the `L_0` matrix
stood at only `r round(norm(L_0 - pcp_model$L, "F") / norm(L_0, "F") * 100, 2)`%,
compared to an observed relative error of
`r round(norm(L_0 - D_imputed, "F") / norm(L_0, "F") * 100, 2)`% and PCA's relative
error of `r round(norm(L_0 - L_pca, "F") / norm(L_0, "F") * 100, 2)`%.
PCP's sparse matrix estimate was only off from the ground truth `S_0` by
`r round(norm(S_0 - pcp_model$S, "F") / norm(S_0, "F") * 100, 2)`%.

## After PCP

We can now pair our estimated `L` matrix with any matrix factorization method of our
choice (e.g. PCA, factor analysis, or non-negative matrix factorization) to extract
the latent chemical exposure patterns (an example of what this looks like is
in `vignette("pcp-applied")`, where non-negative matrix factorization is used to extract
patterns from PCP's `L` matrix). These patterns, along with the isolated outlying
exposure events in `S`, can then be analyzed with any outcomes of interest in
downstream epidemiological analyses.