---
title: "HDStIM"
author: "Rohit Farmer"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{HDStIM}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

The primary function `HDStIM()` in the `HDStIM` package follows a heuristic approach to group cells into responding and non-responding. For a combination of cell population and stimulation type (e.g., CD127+ T-helper cells and interferon-alpha), `HDStIM()` starts by performing K-means clustering on the combined set of cells from stimulated and unstimulated samples. K-means clustering is performed on combined expression data of all the state (signaling/intracellular) markers. Upon clustering using a contingency table as shown below, a Fisher's exact test determines the effect size and the statistical significance of partitioning. Cells from the combinations that pass the Fisher's exact test (p-value < 0.05) are considered responding. An optional UMAP can also be calculated to visually verify the cell partitioning in responding and non-responding groups by using auxiliary plotting scripts provided in the package.

In addition to an auxiliary script to plot UMAPs, the package also comes with two other plotting scripts for K-means clustering and Fisher’s exact test and state marker density before and after mapping.

**An example of the contingency table used for Fisher's exact test.**
```{r con_table}
matrix(c(60, 40, 20, 80),nrow = 2, ncol = 2, 
       dimnames = list(c("Cluster1", "Cluster2"), c("Stim", "Unstim")))
```

# To Run The Main HDStIM Function

As stated above, `HDStIM()` is the primary function of the `HDStIM` package. We will use the example data set  `chi11` (from mass cytometry) included in the package.

*Note:`chi11` is a minimal dataset included for unit testing only. Therefore, it does not represent a typical mass/flow cytometry assay.*

```{r example}
library(HDStIM)

mapped_data <-  HDStIM(chi11$expr_data, chi11$state_markers,
                       chi11$cluster_col, chi11$stim_label,
                       chi11$unstim_label, seed_val = 123, 
                       umap = TRUE, umap_cells = 500, 
                       verbose = FALSE)

class(mapped_data)

attributes(mapped_data)
```

## Output

`HDStIM()` returns a list with the mapped expression data, data to plot stacked bar plots to visualize the K-means and Fisher's exact test results, and data to plot the optional UMAPs. The list also includes tables containing statistical information from K-means and Fisher's exact test and other information passed as the function attributes.

```{r head_expr}
head(mapped_data$response_mapping_main)
```

```{r head_stacked}
head(mapped_data$stacked_bar_plot_data)
```

```{r head_umap}
head(mapped_data$umap_plot_data)
```

```{r head_all_F}
head(mapped_data$all_fisher_p_val)
```

```{r head_all_K}
head(mapped_data$all_k_means_data)
```

# To Plot Diagnostic Figures

## Plots Explaining K-means Clustering And Fisher's Exact Test

Using the `stacked_bar_plot_data`, `plot_K_Fisher()` generates bar plots showing the percentage of cells from the stimulated and unstimulated samples clustered in the two K-means clusters a given cell population and stimulation type. 

`plot_K_Fisher()` returns a list of ggplot objects. If the path is specified, it can also render and save the plots in PNG format.

```{r k_plots, fig.width = 4, fig.align = "center", dpi = 150}
k_plots <- plot_K_Fisher(mapped_data, path = NULL, verbose = FALSE)
k_plots[[1]]
```

## UMAP Plots To Visually Inspect Responding and Non-Responding Cell Mapping

*Note: You can only generate these plots if you have asked UMAPs to be calculated in 
the `HDStIM()` function.*

UMAP plots can be helpful for visually inspecting how well `HDStIM()` has mapped responding vs. non-responding cells for a cell population and stimulation type. `plot_umap()` also returns a list of ggplot objects and if the path is specified, it will render and save the plots in PNG format. 

```{r u_plots, fig.width = 4, fig.align = "center", dpi = 150}
u_plots <- plot_umap(mapped_data, path = NULL, verbose = FALSE)
u_plots[[1]]
```

## Distribution Plots for Individual State Marker before And After Mapping

For each state/signaling markers distribution plots shows the kernel density estimation of the pre `HDStIM()` data from both stimulated and unstimulated samples along with the density from cells from stimulated samples mapped as responding. `plot_exprs()` also returns a list of ggplot objects and if the path is specified, it will render and save the plots in PNG format. 

```{r e_plots, fig.width = 7, fig.height = 5, fig.align = "center"}
e_plots <- plot_exprs(mapped_data, path = NULL,verbose = FALSE)
library(ggplot2)
e_plots[[1]] +
    theme(text = element_text(size = 11))
```

# To Rank State/Signaling Markers According To Their Contribution To The Response

`marker_ranking_boruta()` function runs Boruta on the stimulation - cell population combinations that passed the Fisher's exact test to rank the markers according to their contribution to the response. The function returns a list with a tibble containing attribute statistics calculated by Boruta and ggplot objects. If the path is not NULL, plots are also rendered and saved in the specified folder in PNG format. 

```{r m_ranks, fig.width = 7, fig.height = 6, fig.align = "center"}
m_ranks <- marker_ranking_boruta(mapped_data, path = NULL, n_cells = NULL,
                                 max_runs = 100, seed_val = 123,
                                 verbose = FALSE)

head(m_ranks$attribute_stats)

m_ranks$plots[[1]]
```