---
title: "RcppPlanc, Fast NMF and iNMF Implementation with C++"
author: Yichen Wang
date: 10-02-2023
output:
  html_document:
    toc: true
    toc_float:
      toc_collapsed: true
    toc_depth: 3
    theme: cerulean
    highlight: tango
    mathjax: "https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"
vignette: >
  %\VignetteIndexEntry{RcppPlanc, Fast NMF and iNMF Implementation with C++}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

RcppPlanc is an R package initiated for wrapping the C++ library PLANC ([planc GitHub](https://github.com/ramkikannan/planc), [S. Eswar, 2021](https://doi.org/10.1145/3432185)), where fast algorithms for non-negative matrix factorization (NMF) are implemented. Based on that, we also mimic the optimization and implemented integrative NMF (iNMF) ([liger GitHub](https://github.com/welch-lab/liger), [J. D. Welch, 2019](https://doi.org/10.1016/j.cell.2019.05.006)).

```{r setup}
library(RcppPlanc)
library(Matrix)
```

## Running Non-negative Matrix Factorization (NMF)

NMF problem is stated with the objective below and it aims at factorizing a given non-negative $X$ matrix into two low-rank non-negative matrices $W$ and $H$ so that $WH \approx X$. Denote the dimensionality of $X$ to be $m$ features by $n$ sample points, then $W$ will have the size of $m \times k$ and $H$ is of $k \times n$. $k$ is the inner dimension of the factorization and should be less than $m$ and $n$. 

\begin{align*}
\arg\min_{W\ge0,H\ge0}||X-WH||_F^2
\end{align*}

The implementation currently supports the factorization of both dense and sparse (`dgCMatrix`) matrix. The following code chunk factorizes a randomly initialized non-negative dense matrix. 

```{r nmfDense}
mat <- matrix(runif(50*100), nrow = 50, ncol = 100)
res <- nmf(mat, k = 10, nCores = 2)
```

The output result `res` is a list object with entries `res$W` and `res$H` for the output matrices as described above. Note that here matrix `res$H` is transposed (i.e. $n \times k$) for the sake of efficient implementation. `res$objErr` is the objective error computed for the result.

There are other variant algorithms for NMF that can be specified with argument `algo=` and also symmetric NMF approach implemented in `symNMF()`. Please see `?nmf` or `?symNMF` for more detailed documentation of the functions.

## Running Integrative Non-negative Matrix Factorization (iNMF)

iNMF jointly factorizes multiple datasets that share the same set of features. It is originally designed for integrating single-cell biological data, such as transcriptomics and other modalities. As its objective function stated as below, it factorizes given datasets $E_i$ ($m \times n_i$) into $W$ ($m \times k$), $V_i$ ($m \times k$) and $H_i$ ($k \times n_i$), where $W$ is shared across all datasets and $V$ and $H$ are dataset specific. $\lambda$ is regularization parameter, $m$ is the number of shared features (e.g. gene expression), $n_i$ is the number of sample points (e.g. cells) of each dataset and $d$ is the number of datasets, It is expected that $(W+V_i)H_i \approx E_i$. Please see `?inmf` for more detailed explanation.

\begin{align*}
\arg\min_{H\ge0,W\ge0,V\ge0}\sum_{i}^{d}||E_i-(W+V_i)Hi||^2_F+\lambda\sum_{i}^{d}||V_iH_i||_F^2
\end{align*}

We prepared some example datasets that are down-sampled from public available study ([Hyun Min Kang, 2018](https://doi.org/10.1038/nbt.4042)). They can be loaded with the following code chunk. These datasets are presented as sparse matrices. 

```{r inmfLoadData}
data("ctrl.sparse")
data("stim.sparse")
```

To run iNMF with the datasets in hand, simply create a list of them and call the function `inmf()`.

```{r inmf, message=FALSE, results='hide'}
res <- inmf(list(ctrl.sparse, stim.sparse), k = 20, lambda = 5, nCores = 2)
```

The returned result is also a list object, which contains the following entries:

- `res$H`, a list of dataset specific $H$ matrices, each of size $n_i \times k$. Similarly, they are also transposed.
- `res$V`, a list of dataset specific $V$ matrices, each of size $m \times k$.
- `res$W`, the $W$ matrix, $m \times k$.
- `res$objErr`, the objective error value calculated for the result.

We also implemented the other variant of iNMF in `onlineINMF()` ([C. Gao, 2021](https://doi.org/10.1038/s41587-021-00867-x)) and `uinmf()` ([A.R. Kriebel, 2022](https://doi.org/10.1038/s41467-022-28431-4)). Please see their help pages for the detail. 

## Using Data Stored in HDF5 Files for iNMF

[HDF5](https://www.hdfgroup.org/solutions/hdf5/) files have been widely adpoted for storing large-scale datasets on hard drive. Relevant examples can be found such as the H5AD files used by Python [AnnData](https://github.com/scverse/anndata) package that store annotated data, and the output file of [10X CellRanger Count](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/count) which stores pre-processed single-cell expression matrix and metadata. 

We support using either dense matrices stored in 2-D datasets or CSC (compressed sparse column) sparse matrices stored with sets of three 1-D array datasets. Currently, we don't provide interfaces for interactively exploring the data in an HDF5, but this can be easily achieved with other packages, such as [hdf5r](https://CRAN.R-project.org/package=hdf5r). To use datasets stored in HDF5 files, we provide argument list constructors `H5Mat()` and `H5SpMat()`, respectively for dense and sparse matrix, to organize the necessary information for retrieving data.

We prepared example HDF5 files with identical information as the package data `ctrl.sparse` and `stim.sparse`, in both dense and sparse forms. The argument lists for them can be constructed as shown below:

```{r h5mat}
ctrl.denseH5FilePath <- system.file("extdata/ctrl_dense.h5", package = "RcppPlanc")
ctrl.h5dense <- H5Mat(filename = ctrl.denseH5FilePath, dataPath = "data")
ctrl.h5dense

stim.denseH5FilePath <- system.file("extdata/stim_dense.h5", package = "RcppPlanc")
stim.h5dense <- H5Mat(filename = stim.denseH5FilePath, dataPath = "data")
stim.h5dense

ctrl.sparseH5FilePath <- system.file("extdata/ctrl_sparse.h5", package = "RcppPlanc")
ctrl.h5sparse <- H5SpMat(filename = ctrl.sparseH5FilePath, 
                         valuePath = "scaleDataSparse/data", rowindPath = "scaleDataSparse/indices", colptrPath = "scaleDataSparse/indptr",
                         nrow = nrow(ctrl.sparse), ncol = ncol(ctrl.sparse))
ctrl.h5sparse

stim.sparseH5FilePath <- system.file("extdata/stim_sparse.h5", package = "RcppPlanc")
stim.h5sparse <- H5SpMat(filename = stim.sparseH5FilePath,
                         valuePath = "scaleDataSparse/data", rowindPath = "scaleDataSparse/indices", colptrPath = "scaleDataSparse/indptr",
                         nrow = nrow(stim.sparse), ncol = ncol(stim.sparse))
stim.h5sparse
```

With the argument list constructed, we can simply run any iNMF algorithm with having them in a list, which is the same as what we did with in-memory matrix objects. 

>Note that it is recommended to apply `onlineINMF()` with HDF5 files due to the nature of the HALS algorithm it adopts. `inmf()` and `uinmf()` use ANLS updating approach which requires subsetting on rows of the data and could be very slow. Although this is still supported. Optimization might be updated in future versions.

```{r onlineinmf, message=FALSE, results='hide'}
res <- onlineINMF(list(ctrl.h5sparse, stim.h5sparse), k = 20, lambda = 5, minibatchSize = 50, nCores = 2)
```

Here, the result is also a similar list object but with two additional entries `res$A` and `res$B`. These are dataset specific matrices storing the algorithm update information but not the primary result of iNMF. These are needed for performing other scenarios of online iNMF. Please see `?onlineINMF` for more explanation. 

Note that the argument `minibatchSize` is set to `50` only for demonstrating this minimal example of this vignette. This is a parameter specifying the chunk size of each updates in the iterative algorithm, which obviously should be a smaller number than the number of sample points in provided datasets. Users testing with real life large-scale data should ignore it or determine it basing on the specification (e.g. RAM) of the machines. 

\br
\br