---
title: "Large-Scale Eigenvalue Decomposition and SVD with RSpectra"
author: "Yixuan Qiu"
date: "`r Sys.Date()`"
output:
  prettydoc::html_pretty:
    theme: cayman
    highlight: github
vignette: >
  %\VignetteIndexEntry{Large-Scale Eigenvalue Decomposition and SVD with RSpectra}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Introduction

Eigenvalue decomposition is a commonly used technique in
numerous statistical problems. For example, principal component analysis (PCA)
basically conducts eigenvalue decomposition on the sample covariance of a data
matrix: the eigenvalues are the component variances, and eigenvectors are the
variable loadings.

In R, the standard way to compute eigenvalues is the `eigen()` function.
However, when the matrix becomes large, `eigen()` can be very time-consuming:
the complexity to calculate all eigenvalues of a $n \times n$ matrix is
$O(n^3)$.

While in real applications, we usually only need to compute a few
eigenvalues or eigenvectors, for example to visualize high dimensional
data using PCA, we may only use the first two or three components to draw
a scatterplot. Unfortunately in `eigen()`, there is no option to limit the
number of eigenvalues to be computed. This means that we always need to do the
full eigen decomposition, which can cause a huge waste in computation.

The same thing happens in Singular Value Decomposition (SVD). It is often the
case that only a Partial SVD or Truncated SVD is needed, and moreover the
matrix is usually stored in sparse format. Base R does not provide functions
suitable for these special needs.

And this is why the **RSpectra** package was developed. **RSpectra** provides
an R interface to the
[Spectra](https://spectralib.org/) library, which is used to solve
large scale eigenvalue problems. The core part of **Spectra** and **RSpectra**
was written in efficient C++ code, and they can handle large scale matrices
in either dense or sparse formats.



## Eigenvalue Problem

### Basic Usage

The **RSpectra** package provides functions `eigs()` and `eigs_sym()` to
calculate eigenvalues of general and symmetric matrices respectively.
If the matrix is known to be symmetric,
`eigs_sym()` is preferred since it guarantees that the eigenvalues are real.

To obtain eigenvalues of a square matrix `A`, simply call the `eigs()` or
`eigs_sym()` function, tell it how many eigenvalues are requested (argument `k`),
and which ones are going to be selected (argument `which`).
By default, `which = "LM"` means to pick the eigenvalues
with the largest magnitude (modulus for complex numbers and absolute value
for real numbers). 

Below we first generate some random matrices and display some of the eigenvalues
and eigenvectors:

```{r}
set.seed(123)
n = 100  # matrix size
k = 5    # number of eigenvalues to calculate
# Some random data
M = matrix(rnorm(n^2), n)
# Make it symmetric
A = crossprod(M)
# Show its largest 5 eigenvalues and associated eigenvectors
head(eigen(A)$values, 5)
head(eigen(A)$vectors[, 1:5])
```

Now we use **RSpectra** to directly obtain the largest 5 eigenvalues:

```{r}
library(RSpectra)
res = eigs_sym(A, k, which = "LM")  # "LM" is the default
res$values
head(res$vectors)
```

If only eigenvalues are requested, the `retvec` option can be used:

```{r}
eigs_sym(A, k, opts = list(retvec = FALSE))
```

### Matrix in Sparse Format

For really large data, the matrix is usually stored in sparse format.
**RSpectra** supports the `dgCMatrix` and `dgRMatrix` matrix types defined
in the **Matrix** package.

```{r}
library(Matrix)
Msp = as(M, "dgCMatrix")
Asp = as(A, "dgRMatrix")

eigs(Msp, k, which = "LR", opts = list(retvec = FALSE))$values  # largest real part
eigs_sym(Asp, k, opts = list(retvec = FALSE))$values
```

An even more general way to specify the matrix `A` is to define a function that
calculates `A %*% x` for any vector `x`. This representation is called the
**Function Interface**, which can also be regarded as a sparse format, since
users do not need to store the matrix `A`, but only the operator `x -> Ax`.
For example, to represent a diagonal matrix $diag(1, 2, \ldots, 10)$ and
calculate its eigenvalues, we can define the function `f` and call the
`eigs_sym()` function:

```{r}
# Implicitly define the matrix by a function that calculates A %*% x
# Below represents a diagonal matrix whose elements are stored in the `args` parameter
f = function(x, args)
{
    # diag(args) %*% x == x * args
    return(x * args)
}
eigs_sym(f, k = 3, n = 10, args = 1:10)
```

`n` gives the dimension of the matrix, and `args` contains user data that will
be passed to `f`.

### Smallest Eigenvalues

Sometimes you may need to calculate the smallest (in magnitude) `k` eigenvalues
of a matrix. A direct way to do so is to use `eigs(..., which = "SM")`.
However, the algorithm of **RSpectra** is good at finding large
eigenvalues rather than small ones, so `which = "SM"` tends to require much
more iterations or even may not converge.

The recommended way to retrieve the smallest eigenvalues is to utilize
the spectral transformation: If $A$ has eigenvalues $\lambda_i$,
then $(A-\sigma I)^{-1}$ has eigenvalues $1/(\lambda_i-\sigma)$. Therefore,
we can set $\sigma = 0$ and calculate the largest eigenvalues of $A^{-1}$,
whose reciprocals are exactly the smallest eigenvalues of $A$.

`eigs()` implements the spectral transformation via the parameter `sigma`, so
the following code returns the smallest eigenvalues of `A`. Note that
`eigs()` always returns eigenvalues in the original scale (i.e., $\lambda_i$
instead of $1/(\lambda_i-\sigma)$).

```{r}
eigs_sym(A, k, which = "LM", sigma = 0)$values  # recommended way
eigs_sym(A, k, which = "SM")$values             # not recommended
```

More generally, the option `which = "LM", sigma = s` selects eigenvalues
that are closest to `s`.



## Truncated SVD

Truncated SVD (or Partial SVD) is frequently used in text mining and image
compression, which computes the leading singular values and singular vectors
of a rectangular matrix.

**RSpectra** has the `svds()` function to compute Truncated SVD:

```{r}
set.seed(123)
m = 100
n = 20
k = 5
A = matrix(rnorm(m * n), m)

str(svds(A, k, nu = k, nv = k))
```

Similar to `eigs()`, `svds()` supports sparse matrices:

```{r}
Asp = as(A, "dgCMatrix")
svds(Asp, k, nu = 0, nv = 0)
```

`svds()` also has the function interface: users provide functions `A`
and `Atrans` that calcualtes $Ax$ and $A'x$ respectively, and `svds()` will
compute the Truncated SVD.

```{r}
Af      = function(x, args)  as.numeric(args %*% x)
Atransf = function(x, args)  as.numeric(crossprod(args, x))
str(svds(Af, k, Atrans = Atransf, dim = c(m, n), args = Asp))
```


## Performance

**RSpectra** is written in efficient C++ code.
[This page](https://spectralib.org/performance.html) gives some
benchmarking results of `svds()` with other similar functions available in R
and extension packages.