---
title: "High Dimensional Data: Must Read!!"
bibliography: ../inst/REFERENCES.bib
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{High Dimensional Data: Must Read!!}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```


# Introduction
**GGMncv** is set up for low-dimensional settings, that is, when 
the number of observations ($n$) is much greater than the number 
of nodes ($p$). This is perhaps not typical in the Gaussian graphical
modeling literature, but it is a direct result of my field (I am the author
of **GGMncv**) most often encountering low-dimensional data 
[see for example @williams2019nonregularized; @williams2020back]. As a result, 
**the defaults are honed for low-dimensional data**!

Of course, **GGMncv** can readily be used for high-dimensional data. In what 
follows, I highlight several issues that may arise, along with solutions 
for overcoming them.

# Failure Altogether

By default, **GGMncv** uses the sample-based (inverse) covariance matrix for the 
initial values, which is needed for employing nonconvex regularization. 
When $p > n$, **GGMncv** will produce an error because the sample covariance 
matrix is singular and therefore cannot be inverted.

For example, notice the error when running the following code:

```{r, error=TRUE}
library(GGMncv)

# p > n
set.seed(2)
main <- gen_net(p = 50, 
                edge_prob = 0.05)

set.seed(2)
y <- MASS::mvrnorm(n = 49, 
                   mu = rep(0, 50), 
                   Sigma = main$cors)

fit <- ggmncv(cor(y), n = nrow(y))
```
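A quick check, outside of **GGMncv** itself, confirms the problem: with
$n = 49$ observations, the sample correlation matrix has rank at most
$n - 1 = 48$, which falls short of $p = 50$. A minimal sketch using base R:

```{r}
R <- cor(y)

# rank is at most n - 1 = 48, short of p = 50
qr(R)$rank

# hence solve(R) fails; tryCatch() captures the error message
tryCatch(solve(R), error = function(e) e$message)
```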

## Solution
The solution is to provide a function that computes the initial matrix. To this end,
**GGMncv** includes the function `ledoit_wolf()`, which is a shrinkage 
estimator [@ledoit2004well]. It is important to note that any 
function can be used, so long as it returns the inverse **correlation** 
matrix.

```{r, message=FALSE, warning=FALSE}
fit <- ggmncv(cor(y), n = nrow(y), 
              penalty = "atan",
              progress = FALSE,
              initial = ledoit_wolf, Y = y)
```

Notice the `Y = y`, which is used internally to pass additional arguments 
via `...` to the function provided in `initial`.
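To see what is being passed along, `ledoit_wolf()` can also be called
directly (assuming, as in the call above, that it takes the data via `Y`);
the returned matrix is the initial inverse correlation matrix:

```{r}
# initial (inverse correlation) matrix from the Ledoit-Wolf estimator;
# first three rows and columns shown
init_mat <- ledoit_wolf(Y = y)
init_mat[1:3, 1:3]
```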

The conditional dependence structure can then be plotted with

```{r, message=FALSE, warning=FALSE}
plot(get_graph(fit), 
     node_size = 5)
```
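The plot is based on the binary adjacency matrix stored in the fitted
object, which can also be inspected directly (shown here for the first
five nodes):

```{r}
# adjacency matrix: 1 = edge, 0 = no edge
fit$adj[1:5, 1:5]
```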

Here is an example of providing a user-defined function, in this case based 
on the Schäfer and Strimmer shrinkage estimator from the **corpcor** package.

```{r}
# custom initial function: must return the inverse correlation matrix
initial_ggmncv <- function(y, ...){
  Rinv <- corpcor::invcor.shrink(y, verbose = FALSE)
  return(Rinv)
}

fit2 <- ggmncv(cor(y), n = nrow(y), 
               penalty = "atan",
               progress = FALSE,
               initial = initial_ggmncv, 
               y = y)

plot(get_graph(fit2), 
     node_size = 5)
```
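As a quick sanity check (not part of the workflow above), the custom
function can be called on its own to verify that it returns a $p \times p$
matrix:

```{r}
# should be 50 x 50, matching the number of nodes
dim(initial_ggmncv(y))
```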


It may also be of interest to compare performance, given 
that different initial values were used.

```{r}
# Ledoit and Wolf
score_binary(estimate = fit$adj, 
             true = main$adj, 
             model_name = "lw")

# Schäfer and Strimmer
score_binary(estimate = fit2$adj, 
             true = main$adj,  
             model_name = "ss")
```

# Works but Bad Performance

A trickier situation arises when the covariance matrix 
can be inverted but is still ill-conditioned. This can occur 
when $p$ approaches, but does not exceed, $n$. Here performance 
can be very poor.

```{r, error=TRUE}
# p -> n
main <- gen_net(p = 50, 
                edge_prob = 0.05)

y <- MASS::mvrnorm(n = 60, 
                   mu = rep(0, 50), 
                   Sigma = main$cors)

fit <- ggmncv(cor(y), n = nrow(y), 
              penalty = "atan",
              progress = FALSE)

score_binary(estimate = fit$adj, 
             true = main$adj)
```
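One informal diagnostic for what is going on here (base R's `kappa()`, not 
part of **GGMncv**) is the condition number of the sample correlation 
matrix, where a large value signals near-singularity:

```{r}
# condition number of the sample correlation matrix:
# a large value indicates an ill-conditioned matrix
kappa(cor(y), exact = TRUE)
```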

This is extremely problematic: there was no error, and yet the performance
was terrible (note that 1 - specificity is the false positive rate).
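To make that note concrete, here is a sketch computing the false positive
rate directly from the adjacency matrices used above, that is, the
proportion of true non-edges that were (incorrectly) estimated as edges:

```{r}
# upper triangles of the true and estimated adjacency matrices
true_edges <- main$adj[upper.tri(main$adj)]
est_edges  <- fit$adj[upper.tri(fit$adj)]

# false positive rate = 1 - specificity
mean(est_edges[true_edges == 0])
```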

## Solution

One solution is again to provide a function to `initial`.

```{r}
fit <- ggmncv(cor(y), n = nrow(y), 
              progress = FALSE,
              penalty = "atan",
              initial = ledoit_wolf, Y = y)

score_binary(estimate = fit$adj, 
             true = main$adj)
```

An additional solution is to use $L_1$ regularization, i.e., the lasso:

```{r}
fit_l1 <- ggmncv(cor(y), n = nrow(y), 
                 progress = FALSE, 
                 penalty = "lasso")

score_binary(estimate = fit_l1$adj, 
             true = main$adj)
```
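Another way to see the difference between the penalties, assuming `adj` is
the binary adjacency matrix as above, is to count the selected edges; the
lasso typically yields a denser graph:

```{r}
# number of edges selected by each penalty
c(atan  = sum(fit$adj[upper.tri(fit$adj)]),
  lasso = sum(fit_l1$adj[upper.tri(fit_l1$adj)]))
```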


Finally, here is a quick comparison of KL-divergence:

```{r}
# atan
kl_mvn(true = solve(main$cors), estimate = fit$Theta)

# lasso
kl_mvn(true = solve(main$cors), estimate = fit_l1$Theta)
```
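For reference, for zero-mean multivariate normals the KL-divergence from the
true distribution to the estimate takes the standard form (the exact
parameterization inside `kl_mvn()` may differ)

$$
\text{KL}\left(\mathcal{N}_{\text{true}} \,\|\, \mathcal{N}_{\text{est}}\right) =
\frac{1}{2}\left[\text{tr}\left(\boldsymbol{\Sigma}_{\text{est}}^{-1}\boldsymbol{\Sigma}_{\text{true}}\right) - p +
\log\frac{\det \boldsymbol{\Sigma}_{\text{est}}}{\det \boldsymbol{\Sigma}_{\text{true}}}\right],
$$

where $\boldsymbol{\Sigma} = \boldsymbol{\Theta}^{-1}$ and smaller values
indicate an estimated precision matrix closer to the truth.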


# References