---
title: "NMI Scores"
output:
    rmarkdown::html_vignette:
        toc: true
description: >
  Calculate how important various features were to the final SNF cluster solution.
vignette: >
  %\VignetteIndexEntry{NMI Scores}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

<style>
div.aside { background-color:#fff2e6; }
</style>

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

```{r echo = FALSE}
options(crayon.enabled = FALSE, cli.num_colors = 0)
```

Download a copy of the vignette to follow along here: [nmi_scores.Rmd](https://raw.githubusercontent.com/BRANCHlab/metasnf/main/vignettes/nmi_scores.Rmd)

Normalized mutual information (NMI) scores were used in the original `SNFtool` package as a unitless way to compare the relative importance of different features in a final cluster solution.
The premise of this approach is that if a feature was very important, clustering on that feature alone should produce a solution very similar to the one generated by clustering on all the features together.
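
To make the premise concrete, the chunk below is a minimal base R sketch of how NMI compares two cluster assignments.
The helper `nmi_sketch()` is hypothetical (it is not part of `metasnf` or `SNFtool`) and assumes normalization by the geometric mean of the two entropies, which is one common convention; the packages' own calculations may differ in detail.

```{r}
# Hypothetical helper for illustration only: NMI between two label vectors,
# normalized by the geometric mean of their entropies (an assumed convention).
nmi_sketch <- function(a, b) {
    joint <- table(a, b) / length(a) # joint distribution of the two labelings
    pa <- rowSums(joint) # marginal distribution of labels in `a`
    pb <- colSums(joint) # marginal distribution of labels in `b`
    entropy <- function(p) -sum(p[p > 0] * log(p[p > 0]))
    expected <- outer(pa, pb) # joint distribution if labelings were independent
    nz <- joint > 0 # skip empty cells to avoid log(0)
    mutual_info <- sum(joint[nz] * log(joint[nz] / expected[nz]))
    mutual_info / sqrt(entropy(pa) * entropy(pb))
}

full_solution <- c(1, 1, 1, 1, 2, 2, 2, 2)
relabelled <- c(2, 2, 2, 2, 1, 1, 1, 1) # same partition, labels swapped
unrelated <- c(1, 1, 2, 2, 1, 1, 2, 2) # statistically independent partition

nmi_sketch(full_solution, relabelled)
nmi_sketch(full_solution, unrelated)
```

Identical partitions score 1 regardless of how the clusters are labelled, while independent partitions score 0, so a feature whose solo solution scores close to 1 reproduces the full solution almost exactly.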

In the original `SNFtool` implementation, the cluster solution for each individual feature being assessed was always generated using squared Euclidean distance, a K hyperparameter value of 20, an alpha hyperparameter value of 0.5, and spectral clustering, with the number of clusters chosen by the best eigen-gap value among solutions spanning 2 to 5 clusters.

In contrast, the `metasnf` implementation reuses the architectural details and hyperparameters supplied in the original SNF config and `batch_snf()` call so that the solo-feature and all-feature solutions are as comparable as possible.

The chunk below shows how the primary NMI-calculating function, `calc_nmis()`, can be used.

```{r}
library(metasnf)

dl <- data_list(
    list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
    list(income, "household_income", "demographics", "continuous"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    list(anxiety, "anxiety", "behaviour", "ordinal"),
    list(depress, "depressed", "behaviour", "ordinal"),
    uid = "unique_id"
)

set.seed(42)
sc <- snf_config(
    dl = dl,
    n_solutions = 20,
    min_k = 20,
    max_k = 50
)

# Generation of 20 cluster solutions
sol_df <- batch_snf(dl, sc)

# Let's just calculate NMIs of the anxiety and depression data types for the
# first 5 cluster solutions to save time:
feature_nmis <- calc_nmis(dl[4:5], sol_df[1:5, ])

print(feature_nmis)
```

One important thing to note is that if the cluster space you initially set up when calling `batch_snf()` relied on custom distance metrics, custom clustering algorithms, or the `automatic_standard_normalize` parameter, you should supply those same values when calling `calc_nmis()`.

Another important note is that, by default, `calc_nmis()` ignores the `inc_*` columns of the settings data frame, i.e., no data types are dropped when calculating solo-feature cluster solutions.
This can lead to an odd interpretation if you view NMI as a direct reflection of a feature's contribution to the final SNF output.
A feature that was not part of a particular cluster solution can still produce its own cluster solution that has a very high NMI score with the full solution.
If you wish to suppress the calculation of NMIs for features that were excluded from a particular SNF run (those with a 0 value in the corresponding inclusion column), set the `ignore_inclusions` parameter to `FALSE`.

Finally, if you'd like the NMI information presented in the other orientation (rows and columns swapped), set `transpose` to `FALSE`.
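
As a brief usage sketch, the two arguments described above can be passed as shown below.
The chunk assumes the `dl` and `sol_df` objects created earlier in this vignette are still in the session, and it is marked `eval = FALSE` so that it serves as a template rather than adding to the build time.

```{r eval = FALSE}
library(metasnf)

# Assumes `dl` and `sol_df` from the earlier chunk are available.

# Respect the inclusion columns: skip NMI calculations for features that were
# dropped from a given SNF run.
nmis_included_only <- calc_nmis(
    dl[4:5],
    sol_df[1:5, ],
    ignore_inclusions = FALSE
)

# Swap rows and columns relative to the default layout.
nmis_other_layout <- calc_nmis(
    dl[4:5],
    sol_df[1:5, ],
    transpose = FALSE
)
```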