---
title: "Using outbreak data to estimate the relative case fatality ratio"
author: "Nicholas G Reich"
date: "`r format(Sys.time(), '%d %B %Y')`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Example for estimating the case fatality ratio}
  %\VignetteEngine{knitr::knitr}
  \usepackage[utf8]{inputenc}
---


## Introduction to the data

The goal of this vignette is to provide a tutorial for implementing the
case fatality ratio estimation methods available in the coarseDataTools R package.  To illustrate the methods, we will estimate a
relative case fatality ratio from a simulated outbreak dataset.


## Loading the data and the code
We begin by loading the package and a set of simulated data.

```{r label=sourcing}
library(coarseDataTools)
data(simulated.outbreak.deaths)
```

The data that we have loaded is a simulated dataset with five
columns.  Let's take a peek at the dataset.
```{r label=datapeek}
simulated.outbreak.deaths[15:20, ]
```
The rows index observations at a particular time and group.  The columns are defined as follows:

  -  time = the specified unit of time, $t$, for this observation
  -  grp = the covariate group, $j$, for this observation
  -  R = the number of recovered cases observed
  -  D = the number of dead cases observed
  -  N = the total number of cases observed, or D + R

We must perform a little bit of pre-processing on the data so that it
is ready for the analysis.  Specifically, we need to define the $T$
time periods that we want to use for analysis.  To avoid (for the
moment)  some convergence issues associated with
small sample sizes, we will only include times
where both groups have at least 10 total cases observed.

```{r label=preprocessing}
## set minimum number of observed cases for inclusion
min.cases <- 10

## observed cases
N.1 <- simulated.outbreak.deaths[1:60, "N"]
N.2 <- simulated.outbreak.deaths[61:120, "N"]

## subset to run analyis on times with greater than min.cases
first.t <- min(which(N.1 > min.cases & N.2 > min.cases))
last.t <- max(which(N.1 > min.cases & N.2 > min.cases))
idx.for.Estep <- first.t:last.t

## find and label the subset of times to be used for estimation routine
new.times <- seq_along(idx.for.Estep)
simulated.outbreak.deaths <- cbind(simulated.outbreak.deaths, new.times = NA)
simulated.outbreak.deaths[c(idx.for.Estep, idx.for.Estep + 60), "new.times"] <- rep(new.times, 2)
```

If we look at the new data matrix, we can see the new variable
that we've made and that will be used in the analysis:
```{r label=datapeek2}
simulated.outbreak.deaths[15:20, ]
```

## Running an analysis

We wish to fit a model to the simulated dataset that adjusts for changing reporting rates over time and
for a lag between disease onset and death.  We assume that the
reporting rates for dead and recovered cases vary by time but not by
covariate $j$.  We also assume that the case fatality ratio does not
vary by time but is different for each of the $j$ covariate groups.  The model formula that
adjusts for reporting rates but not the lag is
\begin{eqnarray}
\log E (D_{tj}) &=& \log N_{tj} + \beta_0 + \alpha_t\cdot X_t + \gamma_j\cdot Y_j
\end{eqnarray}
where $X_t$ and $Y_j$ are indicator variables  for $t=2,\dots,T$ and
$j=2, \dots,J$.  Then, $\gamma_j$ is the log relative case
fatality ratio.  The EM algorithm described in the main paper imputes
the values of $D_{tj}$ for the E-step and uses the
above equation as the M-step.

Before running an analysis, we need to fix a few parameters.
Specifically, we need to fix the assumed $\eta$, or the
vector of probabilities that define the survival distribution.  We
will assume that it is (0, .3, .4, .3).  For example, given that a
case will end up dying, there is a 40\% chance that the death will
occur on the third day after disease onset.  Also, we need to set
starting values for the $\alpha$ parameters in the model.  We will
assume that $\alpha_t=0$ for all $t$.

```{r label=setValues}
assumed.nu <- c(0, 0.3, 0.4, 0.3)
alpha.start <- rep(0, 22)
```

That is all the set up required.  We can now call the central
function, $\tt{EMforCFR()}$, which generates three estimates of the
CFR: the na\"ive, the reporting-rate-adjusted and the lag-adjusted estimators.
```{r label=runAnalysis, cache=TRUE, warning=FALSE}
cfr.ests <- EMforCFR(
  assumed.nu = assumed.nu,
  alpha.start.values = alpha.start, full.data = simulated.outbreak.deaths, verb = FALSE,
  SEM.var = TRUE, max.iter = 100, tol = 1e-5
)
```

[Note that running cfr.ests will generate warnings (many of them,
likely) but not to worry because this is due to the fact that the
likelihood uses non-integer outcomes in calculating the Poisson
density function.  This is an expected behavior and is not an
indication of a problem with the routine.]

The function $\tt{EMforCFR()}$ returns a list with the following components:

  -  naive.rel.cfr = the na\"ive estimator for the relative CFR
  -  glm.rel.cfr = the reporting-rate-adjusted estimator for the relative CFR
  -  EM.rel.cfr = the lag-adjusted estimator for the relative CFR
  -  EM.rel.cfr.var = the variance for the log-scale lag-adjusted
    estimator taken from the final M-step
  -  EM.rel.cfr.var.SEM = the Supplemented EM algorithm variance
    for the log-scale lag-adjusted estimator
  -  EM.rel.cfr.chain = a vector of the EM algorithm iterates of the
    lag-adjusted relative CFR estimates
  -  ests = the coefficient estimates for the model
  -  ests.chain.EM = a matrix with all of the coefficient
    estimates, at each EM iteration
  -  DM = the DM matrix from the SEM algorithm
  -  DMiter = a vector showing how many iterations it took for the
    variance component to converge in the SEM algorithm


We are particularly interested in the following components:
```{r label=estimationResults}
cfr.ests$naive.rel.cfr
cfr.ests$glm.rel.cfr
cfr.ests$EM.rel.cfr
cfr.ests$EM.rel.cfr.var.SEM
```