---
title: "psc-vignette"
author: "Richard Jackson"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{psc-vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height= 5
)
```

# Introduction

The psc.R package implements the methods for applying Personalised Synthetic 
Controls, which allows for patients receiving some experimental treatment to be 
compared against a model which predicts their reponse to some control.  This is 
a form of causal inference which differes from other approaches in that

\item Data are only required on a single treatment - all counterfactual evidence 
is supplied by a parametric model
\item Causal inference, in theory at least, is estimated at a patient level - as 
opposed to estimating average effects over a population

The causal estimand obtained is the Average Treatment Effect of the Treated (ATT)
which differs from the Average Treatment Effect (ATE) obtained in other settings 
and addresses the question of whether treatments are effective in the population 
of patients who are treated.  This estimand then targets efficacy over effectivness.



In its basic form, this method creates a likelihood to compare a cohort of data 
to a parametric model.  See (X) for disucssion on it's use as a causal inference 
tool.  To use this package, two basic peices of information are required, a 
dataset and a model against which they can be compared.

In this vignette, we will detail how the psc.r package is constructed and give 
some examples for it's application in practice.


# Methodology


The `pscfit` function compares a dataset ('DC') against a parametric model. This 
is done by selecting a likelihood which is identified by the type of CFM that is 
supplied. At present, two types of model are supported, a flexible parmaeteric 
survival model of type 'flexsurvreg' and a geleneralised linear model of type 
'glm'.

Where the CFM is of type 'flexsurvreg' the likeihood supplied is of the form:


$$L(D∣\Lambda,\Gamma_i)=\prod_{i=1}^{n} f(t_i∣\Lambda,\Gamma_i)^{c_i} 
S(t_i∣\Gamma,\Lambda_i)^{(1−c_i)}$$

Where $\Gamma$ defines the cumulative baseline hazard function, $\Lambda$ is the 
linear predictor and $t$ and $c$ are the event time and indicator variables.

Where the CFM is of the type 'glm' the likelihood supplied is of the form:

$$L(x∣\Gamma_i) = \prod_{i=1}^{n} b (x∣ \Gamma_i )\exp\{\Gamma_i t(x)−
c(\Gamma_i)\}$$


Where $b(.)$, $t(.)$ and $c(.)$ represent the functions of the exponential 
family. In both cases, $\Gamma$ is defiend as:

$$ \Gamma_i = \gamma x_i+\beta $$

Where $\gamma$ are the model coefficients supplied by the CFM and $\beta$ is the 
parameter set to measure the difference between the CFM and the DC.

Estimation is performed using a Bayesian MCMC procedure. Prior distributions for 
$\Gamma$ (& $\Lambda$) are derived directly from the model coefficients (mean 
and variance covariance matrix) or the CFM. A bespoke MCMC routine is performed 
to estimate $\beta$. Please see '?mcmc' for more detials.

For the standard example where the DC contains information from only a single 
treatment, trt need not be specified. Where comparisons between the CFM and 
multiple treatments are require, a covariate of treamtne allocations must be 
specified sperately (using the 'trt' option).


# Package Structure

The main function for using applying Personal Synthetic Controls is the pscfit() 
function which has two inputs, a Counter-Factual Model (CFM) and a data cohort 
(DC).  Further arguments include

* nsim which sets the number of MCMC iterations (defaults to 5000)
* 'id' if the user wishes to restrict estimation to a sub-set (or individual) 
within the DC
* 'trt' to be used as an initial identifier if mulitple treatment comparisons 
are to be made (please see the Mulitple Treatment Comparison below)


## psc object

The output of the "pscfit()" function is an object of class 'psc'.  This class 
contains the following attributes

* A definition of the calss of the model supplied
* A 'cleaned' dataset including extracted components of the CFM and the cleaned DC included in the procedure
* An object defingin the class of model (and therefore the procedure applied - see above)
* A matrix containing the draws of the posterior distributions

## Postestimation functions

basic post estimation functions have been developed to work with the psc object, 
namely "print()", "coef()", "summary()" and "plot()".  For the first three of 
these these provided basic summaries of the efficacy parameter obtained from the 
posterior distribution.


# Motivating Example


The psc.r package includes as example, a dataset which is derived from patients 
with advanced Hepatocellular Carcinoma (aHCC) who have all received some 
experimental treatment.  The dataset is simply named 'data' and is loaded into 
the enviroment using the "data()" function

```{r}
#install.packages("psc")
library(psc)
```

Included is a list of prognostic covariates:

* vi - Vascular Invasion
* age60 - Age - centered at 60
* ecog - ECOG performance Status
* logafp - AFP on the natural log scale
* alb - Albumin
* logcreat - Creatinine on the log scale
* logast - AST on the natuarl log scale
* allmets -Presence of Metastesis
* aet - Ateiology; HBV, HCV or Other

Also included are the following structures

* time - survival time
* cen - censoring indictor
* os - time to be used as a continuous outcome
* event - binary event for use as a binary outcome
* count - count data to be used for a count outcome

Lastly the dataset also inlclude a 'trt' variable to be used in the estimation 
of multiple treatment comparisons.

We give esamples of how the 'pscfit()' function can be used to comapre data 
against models with survival outcomes (with a 'flexsurvreg' model) along with 
binary, continuous and count outcomes (with a 'glm' model).


## Survival Example

For an example with a survival outcome a model must be supplied which is 
contructed ont he basis of flexible parametric splines.  This is contructed 
using the "flexsurvreg" function within the "flexsurv" package.  An example is 
included within the 'psc.r' package names 'surv.mod' and is loaded using the 
'data()" function:

```{r}
surv.mod <- psc::surv.mod
surv.mod
```

In this example you can see that this is a model constructed with 3 internal 
knots and hence 5 parameters to describe the baseline cumulative hazard 
function.  There are also prognostic covariates which match with the prognostic 
covaraites in the data cohort.

To begin it is worth looking at the performance of the model and looking at how 
the survival of patietns in the data cohort compare

```{r}
library(survival)
sfit <- survfit(Surv(data$time,data$cen)~1)
plot(sfit)
```

Comparing the dataset to the model is then performed using

```{r,include=F}
surv.psc <- pscfit(surv.mod,data)
```

and we can view the attributes of the psc object that is created

```{r}
attributes(surv.psc)
```

For example to view the matrix contianing the draws of the posterior 
distribution we use

```{r}
surv.post <- surv.psc$posterior
head(surv.post)
```

Inspection will show that there is a column for each parameter in the original model as well as 'beta' and 'DIC' vcolumns which give teh posterior estiamtes for $\beta$ and the Deviance Informaiton Criterion respectively.

We can inspect the poterior distribution using the autocorrelation function, trace and stardard summary statistics:

#### Autocorrelation
```{r}
acf(surv.post$beta)
```

#### Trace
```{r}
plot(surv.post$beta,typ="s")
```

#### Summary

```{r}
summary(surv.post$beta)
```

Standard 'summary()' function wil summarise the model fit

```{r}
summary(surv.psc)
```

To visualise the original model and the fit of the data, the plot function has 
been developed

```{r}
plot(surv.psc)
```


## GLM

The "pscfit()" object uses class of the model supplied to derive the likelihoiod 
and estimation procedure that isde required.  In this example, the "enrichwith" 
package is utilised to extract from the model the parameters of the exponential 
family.  Important from the attributes of the GLM are the "family" statements 
which dictates the form of the likelihood.  For each of the binary, continuous 
and Count data outcomes then, the syntax and the DC remains the same and it is 
the form of the CFM that dictates the analysis


### Binary

#### Step 1:  Load Model
```{r}
bin.mod <- psc::bin.mod
bin.mod
```

#### Step 2: Fit psc object
```{r,include=F}
psc.bin <- pscfit(bin.mod,data)
psc.bin
```

#### Step 3: Review summary statistics
```{r}
summary(psc.bin)
```

#### Step 4: Plot Output
```{r}
plot(psc.bin)
```

### Count


#### Step 1:  Load Model
```{r}
count.mod <- psc::count.mod
count.mod
#data("count.mod")
```

#### Step 2: Fit psc object
```{r,include=F}
psc.count <- pscfit(count.mod,data)
psc.count
```

#### Step 3: Review summary statistics
```{r}
summary(psc.count)
```

#### Step 4: Plot Output
```{r}
plot(psc.count)
```

### Continuous

#### Step 1:  Load Model
```{r}
cont.mod <- psc::cont.mod
cont.mod
```

#### Step 2: Fit psc object

```{r,include=F}
psc.con <- pscfit(cont.mod,data)
psc.con
```

#### Step 3: Review summary statistics
```{r}
summary(psc.con)
```

#### Step 4: Plot Output
```{r}
plot(psc.con)
```


# Sub group Effects

An attractive feature of Personalised Synthetic Controls is its use in fitting 
sub-group effects.  Whereas other casual inference tools typically make 
ssumptions about population levels of balance and then further assume that this 
balance holds at sub-group levels, Personalised Synthetic Controls differ in 
that they estimate treatment effects at a patient level and then average across 
populations.  To estimate sub-group effects then we need only to restrict 
estimation over some sub-group of the population.  This can be achived by directly 
slecting the subgroup you wish to evaluate

Using an example where we wnat to see if the treatment effect is consistent by 
patients with ECOG=0 and ECOG =1

## Sub group effects by restricting the population


### PSC fit for pateitns with ECOG=0
```{r,include=T,echo=T}
id1 <- which(data$ecog==0)
sub1 <- pscfit(surv.mod,data,id=id1)
```


### PSC fit for pateitns with ECOG=1
```{r,include=T,echo=T}
id2 <- which(data$ecog==1)
sub2 <- pscfit(surv.mod,data,id=id2)
```


We can then easily compare the model coefficients

```{r}
summary(sub1)
cat("\n")
summary(sub2)
```


And look at the plots for each outcome

#### PSC plot for ECOG=0
```{r}
plot(sub1)
```

#### PSC plot for ECOG=1
```{r}
plot(sub2)
```