---
title: "Multivariate missingness and monotonicity"
author: "Janick Weberpals"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Multivariate missingness and monotonicity}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

<style>
p.comment {
background-color: #DBDBDB;
padding: 10px;
border: 1px solid black;
margin-left: 25px;
border-radius: 5px;
font-style: italic;
}
</style>

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  dpi = 150,
  fig.width = 6,
  fig.height = 4.5
  )
```

```{r setup}
library(smdi)
library(gt)
suppressPackageStartupMessages(library(dplyr))
```

# Multivariate missing data in `smdi`

In this article, we want to briefly highlight two aspect regarding **multivariate missingness**: 

1. How does `smdi` handle multivariate missingness?

2. What is the link between missing data patterns and missing data mechanisms and how does this affect the behavior and performance of the `smdi` functionality?

## Established taxonomies

In general, there are two classic established missing data taxonomies:

* Mechanisms: Missing completely at random (MCAR), at random (MAR) and not at random (MNAR)

* Patterns: Monotone versus Non-monotone missingness

# How does `smdi` handle multivariate missingness?

In all `smdi` functions, except `smdi_little()`, binary missing indicator variables are created for each partially observed variable (either specified by the analyst using the `covar` parameter or automatically identified via `smdi_check_covar()` if `covar` = NULL) and the columns with the actual variable values are dropped. For the variable importance visualization in `smdi_rf()`, these variables are indicated with a *"_NA"* suffix. Missing values are accordingly indicated with a *1* and complete observations with a *0*. This functionality is controlled via the `smdi_na_indicator()` utility function. 

```{r, fig.cap="Illustrating missing indicator variable generation within `smdi` functions"}
smdi_data %>% 
  smdi_na_indicator(
    drop_NA_col = FALSE # usually TRUE, but for demonstration purposes set to FALSE
    ) %>% 
  select(
    ecog_cat, ecog_cat_NA, 
    egfr_cat, egfr_cat_NA, 
    pdl1_num, pdl1_num_NA
    ) %>% 
  head() %>% 
  gt()
```

Now, let's assume we have three partially observed covariates *X*, *Y* and *Z*, which we would like to include in our missingness diagnostics. All `smdi_diagnose()` functions, except `smdi_little()`, create *X_NA*, *Y_NA* and *Z_NA* and *X*, *Y* and *Z* are discarded from the dataset. The functions will then iterate the diagnostics through all *X_NA*, *Y_NA* and *Z_NA* one-by-one. That is, if, for example, *X_NA* is assessed, *Y_NA* and *Z_NA* serve as predictor variables along with all other covariates in the dataset. If *Y_NA* is assessed, *X_NA* and *Z_NA* are included as predictor variables, and so forth.

<div class="alert alert-info">
  <strong>Important!</strong> It is important to notice that this strategy is the default to deal with multivariate missingness in the `smdi` package, however, another possible approach could be to *not* consider the other partially observed variables in the first place (e.g. by dropping them before applying any `smdi` function) and stacking the diagnostics focusing on one partially observed variable at a time. Such a strategy would be advisable in scenarios of monotone missing data patterns (see next section).
</div>

# `smdi` in case of monotone missing data patterns

While in the `smdi` package we mainly focus on missing data mechanisms, missing data patterns always need to be considered, too. Please refer to the [routine structural missing data diagnostics article](https://janickweberpals.gitlab-pages.partners.org/smdi/articles/b_routine_diagnostics.html#descriptives), where we highlight the importance of describing missingness proportions and patterns before running any of the `smdi` diagnostics.

As mentioned in the section before, in case of monotone missing data patterns, the `smdi` functionality may be misleading.

<div class="alert alert-info">
  <strong>Monotonicity</strong> A missing data pattern is said to be monotone if the variables Yj can be ordered such that if Yj is missing then all variables Yk with k > j are also missing (taken from Stef van Buuren [^1]).
</div>

[^1]: For more information on missing data patterns see <https://stefvanbuuren.name/fimd/missing-data-pattern.html>

A good example for monotone missing data could be clinical blood laboratory tests ("labs") which are often tested together in a lab panel. If one lab is missing, typically the other labs of this panel are also missing.

```{r}
# we simulatea monotone missingness pattern
# following an MCAR mechanism

set.seed(42)

data_monotone <- smdi_data_complete %>% 
  mutate(
    lab1 = rnorm(nrow(smdi_data_complete), mean = 5, sd = 0.5),
    lab2 = rnorm(nrow(smdi_data_complete), mean = 10, sd = 2.25)
    )

data_monotone[3:503, "lab1"] <- NA
data_monotone[1:500, "lab2"] <- NA
```

```{r}
smdi::gg_miss_upset(data = data_monotone)
```

```{r}
smdi::md.pattern(data_monotone[, c("lab1", "lab2")], plot = FALSE)
```

In extreme cases of perfect linearity, this can lead to multiple warnings and errors such as `system is exactly singular` or `-InfWarning: Variable has only NA's in at least one stratum`.

In cases in which monotonicity is still clearly present but not as extreme (like in the example above), `smdi` will prompt a message to the analyst to raise awareness of this issue as the `smdi` output can be **highly misleading** in such instances.

```{r}
diagnostics_jointly <- smdi_diagnose(
  data = data_monotone,
  covar = NULL, # NULL includes all covariates with at least one NA
  model = "cox",
  form_lhs = "Surv(eventtime, status)"
  )
```

```{r, fig.cap="Diagnostics of lab 1 if analyzed separately."}
diagnostics_jointly %>% 
  smdi_style_gt()
```

In such cases, it may be advisable to *not* consider including *lab2* in the missingness diagnostics of *lab1* and vice versa and stack the diagnostics focusing on one partially observed variable at a time.

## Lab 1 analyzed without Lab 2

```{r, fig.cap="Diagnostics of lab 1 if analyzed separately."}
# lab 1
lab1_diagnostics <- smdi_diagnose(
  data = data_monotone %>% select(-lab2),
  model = "cox",
  form_lhs = "Surv(eventtime, status)"
  )

lab1_diagnostics %>% 
  smdi_style_gt()
```

## Lab 2 analyzed without Lab 1

```{r, fig.cap="Diagnostics of lab 2 if analyzed separately."}
# lab 2
lab2_diagnostics <- smdi_diagnose(
  data = data_monotone %>% select(-lab1),
  model = "cox",
  form_lhs = "Surv(eventtime, status)"
  )

lab2_diagnostics %>% 
  smdi_style_gt()
```

## Presented in one table using `smdi_style_gt()`

We can also combine the output of individually stacked `smdi_diagnose` tables and enhance it with a global Little's test that takes into account the multivariate missingness of the entire dataset.

```{r}
# computing a gloabl p-value for Little's test including both lab1 and lab2
little_global <- smdi_little(data = data_monotone)

# combining two individual lab smdi tables and global Little's test
smdi_style_gt(
  smdi_object = rbind(lab1_diagnostics$smdi_tbl, lab2_diagnostics$smdi_tbl), 
  include_little = little_global
  )
```

Since the missingness follows an MCAR mechanism, `smdi_diagnose()` now shows the expected missingness diagnostics patterns one would expect from an MCAR mechanism.