---
title: Analysis of single-cell RNA-seq data using fastglmpca
author: Eric Weine and Peter Carbonetto
date: "`r Sys.Date()`"
output:
  html_document:
    toc: no
    highlight: textmate
    theme: readable
vignette: >
  %\VignetteIndexEntry{Analysis of single-cell RNA-seq data using fastglmpca}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

The aim of this vignette is to introduce the basic steps involved in
fitting GLM-PCA model to single-cell RNA-seq data using
**fastglmpca**. (See [Townes *et al* 2019][townes-2019] or
[Collins *et al* 2001][collins-2001] for a detailed description of the
GLM-PCA model.)

```{r knitr-opts, include=FALSE}
knitr::opts_chunk$set(comment = "#",collapse = TRUE,results = "hold",
                      fig.align = "center",dpi = 120)
```

To begin, load the packages that are needed.

```{r load-pkgs, message=FALSE, warning=FALSE}
library(Matrix)
library(fastglmpca)
library(ggplot2)
library(cowplot)
```

Set the seed so that the results can be reproduced.

```{r set-seed}
set.seed(1)
```

The example data set
--------------------

We will illustrate fastglmpca using a single-cell RNA-seq data set
from [Zheng *et al* (2017)][zheng-2017]. These data are reference
transcriptome profiles from 10 bead-enriched subpopulations of
peripheral blood mononuclear cells (PBMCs). The original data set is
much larger; for this introduction, we have taken a subset of
roughly 3,700 cells.

The data we will analyze are unique molecular identifier (UMI)
counts. These data are stored as an $n \times m$ sparse matrix, where
$n$ is the number of genes and $m$ is the number of cells:

```{r load-data}
data(pbmc_facs)
dim(pbmc_facs$counts)
```

The UMI counts are "sparse"---that is, most of the counts are
zero. Indeed, over 95% of the UMI counts are zero:

```{r nonzero-rate}
mean(pbmc_facs$counts > 0)
```

For the purposes of this vignette only, we randomly subset the data
further to reduce the running time:

```{r subset-genes-1}
counts <- pbmc_facs$counts
n      <- nrow(counts)
rows   <- sample(n,3000)
counts <- counts[rows,]
```

Now we have a 3,000 x 3,774 counts matrix:

```{r subset-genes-2}
dim(counts)
```

Initializating and fitting the GLM-PCA model
--------------------------------------------

Since no preprocessing of UMI counts is needed (e.g., a
log-transformation), the first step is to initialize the model fit
using `init_glmpca_pois()`. This function has many input arguments and
options, but here we will keep all settings at the defaults, and we
set K, the rank of the matrix factorization, to 2:

```{r init-glmpca}
fit0 <- init_glmpca_pois(counts,K = 2)
```

By default, `init_glmpca_pois()` adds gene- (or row-) specific
intercept, and a fixed cell- (or column-) specific size-factor. This
is intended to mimic the defaults in [glmpca][glmpca-pkg].
`init_glmpca_pois()` has many other options which we do not demonstrate
here.

Once we have initialized the model, we are ready to run the
optimization algorithm to fit the model (i.e., estimate the model
parameters). This is accomplished by a call to `fit_glmpca_pois()`:

```{r fit-glmpca-1, results="hide", eval=FALSE}
fit <- fit_glmpca_pois(counts,fit0 = fit0)
```

If you prefer not to wait for the model optimization (it may take
several minutes to run), you are welcome to load the previously fitted
model (which is the output from the `fit_glmpca_pois` call above):

```{r fit-glmpca-2}
fit <- pbmc_facs$fit
```

The return value of `fit_glmpca_pois()` resembles the output of
`svd()` and similar functions, with a few other outputs giving
additional information about the model:

```{r fit-glmpca-3}
names(fit)
```

In particular, the outputs that are capital letters the low-rank
reconstruction of the counts matrix:

```{r fitted-counts}
fitted_counts <- with(fit,
  exp(tcrossprod(U %*% diag(d),V) +
      tcrossprod(X,B) +
      tcrossprod(W,Z)))
```

Let's compare (a random subset of) the reconstructed ("fitted") counts
versus the observed counts:

```{r fitted-vs-observed, fig.height=3, fig.width=3}
i <- sample(prod(dim(counts)),2e4)
pdat <- data.frame(obs    = as.matrix(counts)[i],
                   fitted = fitted_counts[i])
ggplot(pdat,aes(x = obs,y = fitted)) +
  geom_point() +
  geom_abline(intercept = 0,slope = 1,color = "magenta",linetype = "dotted") +
  labs(x = "observed count",y = "fitted count") +
  theme_cowplot(font_size = 12)
```

The U and V outputs in particular are interesting because they give
low-dimensional (in this case, 2-d) embeddings of the genes and cells,
respectively. Let's compare this 2-d embedding of the cells the
provided cell-type labels:

```{r plot-v, fig.height=3, fig.width=4.5}
celltype_colors <- c("forestgreen","dodgerblue","darkmagenta",
                     "gray","hotpink","red")
celltype <- as.character(pbmc_facs$samples$celltype)
celltype[celltype == "CD4+/CD25 T Reg" |
         celltype == "CD4+ T Helper2" |
         celltype == "CD8+/CD45RA+ Naive Cytotoxic" |
		 celltype == "CD4+/CD45RA+/CD25- Naive T" |
		 celltype == "CD4+/CD45RO+ Memory"] <- "T cell"
celltype <- factor(celltype)
pdat <- data.frame(celltype = celltype,
                   pc1 = fit$V[,1],
				   pc2 = fit$V[,2])
ggplot(pdat,aes(x = pc1,y = pc2,color = celltype)) +
  geom_point() +
  scale_color_manual(values = celltype_colors) +
  theme_cowplot(font_size = 10)
```

The 2-d embedding separates well the CD34+ and CD14+ cells from the
others, and somewhat distinguishes the other cell types (B cells, T
cells, NK cells).

Session info
------------

This is the version of R and the packages that were used to generate
these results.

```{r session-info}
sessionInfo()
```

[zheng-2017]: https://doi.org/10.1038/ncomms14049
[townes-2019]: https://doi.org/10.1186/s13059-019-1861-6
[collins-2001]: https://proceedings.neurips.cc/paper_files/paper/2001/file/f410588e48dc83f2822a880a68f78923-Paper.pdf
[glmpca-pkg]: https://cran.r-project.org/package=glmpca