---
title: "Umpire 2.0: Clinically Realistic Simulations"
author: "Kevin R. Coombes and Caitlin E. Coombes"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Umpire 2.0}
  %\VignetteKeywords{Umpire, simulations, mixed type data, clinical data}
  %\VignetteDepends{Umpire}
  %\VignettePackage{Umpire}
  %\VignetteEngine{knitr::rmarkdown}
---

# Introduction

Version 2.0 of the Ultimate Microarray Prediction, Inference, and Reality Engine (Umpire) extends the functions of the Umpire 1.0 R package to allow researchers to simulate realistic, mixed-type, clinical data.

Statisticians, computer scientists, and clinical informaticians who develop and improve methods to analyze clinical data from a variety of contexts (including clinical trials, population cohorts, and electronic medical record sources) recognize that it is difficult to evaluate methods on real data where "ground truth" is unknown. Frequently, they turn to simulations where they can control the underlying structure, but the resulting simulations are often too simplistic to reflect the complexity of real clinical data. Clinical measurements on patients may be treated as independent, in spite of the elaborate correlation structures that arise in networks, pathways, organ systems, and syndromes in real biology. Further, researchers have few tools at their disposal for simulating binary, categorical, or mixed data at this level of biological complexity.

In this vignette, we describe a workflow with the <tt>Umpire</tt> package to simulate biologically realistic, mixed-type clinical data. As usual, we start by loading the package:

```{r}
library(Umpire)
```

# Simulating Mixed-Type Clinical Data

Since we are going to run simulations, we set the seed of the random number generator for reproducibility.
```{r seed}
set.seed(84503)
```

## Model Subtypes and Survival

The simulation workflow begins by simulating complex, correlated, continuous data with known "ground truth" by instantiating a <tt>ClinicalEngine</tt>. We simulate 20 features and 4 clusters of unequal size. The <tt>ClinicalEngine</tt> generates subtypes (clusters) with known "ground truth" through an implementation of the Umpire 1.0 <tt>CancerModel</tt> and <tt>CancerEngine</tt>.

```{r}
# 20 features, 4 clusters; isWeighted = TRUE gives unequal cluster prevalences
ce <- ClinicalEngine(20, 4, isWeighted = TRUE)
summary(ce)
```

Note that the prevalences are not equal; when you use <tt>isWeighted = TRUE</tt>, they are chosen from a Dirichlet distribution. Note also that the <tt>summary</tt> function describes the object as a <tt>CancerEngine</tt>, since the same underlying structure is used to implement a <tt>ClinicalEngine</tt>.

Now we confirm that the model expects to produce the 20 features that we requested. It will do so using 10 "components", where each component consists of a pair of correlated features.

```{r nrow}
nrow(ce)
nComponents(ce)
```

## Simulate Raw Data

The <tt>ClinicalEngine</tt> is used to simulate the raw, base dataset.

```{r}
# Simulate 300 patients from the engine
dset <- rand(ce, 300)
```

Data are simulated as a list with two objects: simulated <tt>data</tt> and associated <tt>clinical</tt> information, including "ground truth" subtype membership and survival data (outcome, length of follow-up, and occurrence of the event of interest within the follow-up period).

```{r}
class(dset)
names(dset)
summary(dset$clinical)
```

The raw <tt>data</tt> are simulated as a matrix of continuous values.

```{r}
class(dset$data)
dim(dset$data)
```

## Apply Clinically Realistic Noise

The user may add further additive noise to the raw data. The <tt>ClinicalNoiseModel</tt> simulates additive noise for each feature _f_ and patient _i_ from a normal distribution $E_{fi} \sim N(0, \tau)$, where the standard deviation $\tau$ itself varies according to a gamma distribution, $\tau \sim Gamma(shape, scale)$, whose shape and scale are hyperparameters.
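
To build intuition for this noise model, the following base-R sketch draws one standard deviation per feature from a gamma distribution and then adds normal noise with that standard deviation. It does not call the package: the names <tt>n_features</tt>, <tt>n_patients</tt>, <tt>tau</tt>, and <tt>clean</tt> are hypothetical, the per-feature draw of $\tau$ is our reading of the description above, and the shape and scale values are the ones passed to <tt>ClinicalNoiseModel</tt> in this vignette.

```r
# Illustrative sketch only: a base-R imitation of gamma-distributed noise levels.
# It does not use Umpire internals; all object names here are hypothetical.
set.seed(1)
n_features <- 20
n_patients <- 300

# Assumption: one standard deviation (tau) per feature, drawn from a gamma distribution.
tau <- rgamma(n_features, shape = 1.02, scale = 0.05)

# Stand-in for the raw data: rows are features, columns are patients.
clean <- matrix(rnorm(n_features * n_patients), nrow = n_features)

# rnorm recycles sd = tau; because matrices are filled column by column,
# every entry in row f is drawn with standard deviation tau[f].
noise <- matrix(rnorm(n_features * n_patients, mean = 0, sd = tau),
                nrow = n_features)
blurred <- clean + noise

# Inspect the distribution of per-feature noise levels.
summary(tau)
```

With these values the mean of $\tau$ is shape times scale, or 0.051, and the gamma distribution is strongly right-skewed.
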
Thus, the <tt>ClinicalNoiseModel</tt> generates many features with low noise (such as a tightly calibrated laboratory test) and some features with high noise (such as a blood pressure measured by hand and manually entered into the medical record). The user may apply default parameters or individual parameters.

Next, the <tt>ClinicalNoiseModel</tt> is applied to <tt>blur</tt> the previously simulated data. The default model below generates a low overall level of additive noise.

```{r}
cnm <- ClinicalNoiseModel(nrow(ce@localenv$eng), shape = 1.02, scale = 0.05)
summary(cnm)
# Add the simulated noise to the raw data matrix
noisy <- blur(cnm, dset$data)
```

## Simulate Mixed-Type Data

<tt>Umpire 2.0</tt> allows the simulation of binary, nominal, and ordinal data from raw, continuous data in variable, user-defined mixtures. The user defines prevalences, summing to 1, of binary, continuous, and categorical data in the desired final mixture. For categorical features, the user may set the fraction that should be nominal (rather than ordinal) and the range of the number of categories to simulate.

The data simulated above by the <tt>ClinicalEngine</tt> and <tt>ClinicalNoiseModel</tt> follow the omics convention in which rows (not columns) are features. Thus, by default, when generating data, rows are treated as features and columns as patients. The <tt>makeDataTypes</tt> method transposes its results to a data frame where the columns are features and the rows are patients. This transposition both fits better with the conventions used for clinical data and supports the ability to store different kinds of (mixed-type) data in different columns.

```{r}
dt <- makeDataTypes(dset$data,
                    pCont = 1/3, pBin = 1/3, pCat = 1/3,
                    pNominal = 0.5, range = 3:9,
                    inputRowsAreFeatures = TRUE)
names(dt)
```

The <tt>makeDataTypes</tt> function generates a list containing two objects: a data.frame of mixed-type data...
```{r}
class(dt$binned)
dim(dt$binned)
summary(dt$binned)
```

The <tt>cutpoints</tt> contain a record, for each feature, of data type, break points, and labels. Here are two examples of the kind of information stored for a cutpoint.

```{r}
dt$cutpoints[[1]]
dt$cutpoints[[5]]
```

And here is an overview of the number of features of each type.

```{r}
cp <- dt$cutpoints
type <- sapply(cp, function(X) { X$Type })
table(type)
```

The <tt>cutpoints</tt> should be saved for downstream use in the <tt>MixedTypeEngine</tt>.

## The MixedTypeEngine

The many parameters defining a simulated data mixture can be stored as a single <tt>MixedTypeEngine</tt>, which makes it easy to generate future datasets with the same simulation parameters. The <tt>MixedTypeEngine</tt> stores the following components for re-implementation:

1. The <tt>ClinicalEngine</tt>, including parameters for generating the subtype pattern and survival model.
2. The <tt>ClinicalNoiseModel</tt>.
3. The <tt>cutpoints</tt> generated by <tt>makeDataTypes</tt>.

```{r}
mte <- MixedTypeEngine(ce, noise = cnm, cutpoints = dt$cutpoints)
summary(mte)
```

With <tt>rand</tt>, the user can easily generate new data sets with the same simulation parameters.

```{r}
dset2 <- rand(mte, 20)
class(dset2)
summary(dset2$data)
summary(dset2$clinical)
```

By setting the <tt>keepall</tt> argument of <tt>rand</tt> to <tt>TRUE</tt>, you can keep the intermediate datasets that it produces.

```{r}
dset3 <- rand(mte, 25, keepall = TRUE)
class(dset3)
names(dset3)
```

The <tt>raw</tt> and <tt>noisy</tt> elements have the rows as (future clinical) features and the columns as patients/samples.

```{r raw}
# Transpose before summarizing so that each summary column is a feature
dim(dset3$raw)
summary(t(dset3$raw))
dim(dset3$noisy)
summary(t(dset3$noisy))
```

Noisy data arise by adding simulated noise to the raw data.

```{r, fig.cap="Raw and noisy data."}
plot(dset3$raw[5,], dset3$noisy[5,], xlab = "Raw", ylab = "Noisy", pch=16)
```

The <tt>binned</tt> element has columns as features and rows as samples.
Binned data arise by applying cut points to the noisy data.

```{r fig.cap = "Noisy and binned data."}
dim(dset3$binned)
summary(dset3$binned)
plot(dset3$binned[,5], dset3$noisy[5,], xlab = "Binned", ylab = "Noisy")
```
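
The binning step can also be illustrated outside the package with base R's <tt>cut</tt> function. The sketch below is not the Umpire implementation; the break points, labels, and variable names are hypothetical, chosen only to show how one continuous feature becomes binary or ordinal once cut points are fixed.

```r
# Illustrative sketch only: binning a continuous feature with fixed cut points.
# The breaks and labels are hypothetical, not those chosen by makeDataTypes.
set.seed(2)
noisy_feature <- rnorm(25)   # one continuous feature measured on 25 patients

# Binary feature: a single interior cut point at the median.
binary_breaks <- c(-Inf, median(noisy_feature), Inf)
binary <- cut(noisy_feature, breaks = binary_breaks, labels = c("low", "high"))

# Ordinal feature: three interior cut points give four ordered categories.
ordinal_breaks <- c(-Inf, quantile(noisy_feature, c(0.25, 0.5, 0.75)), Inf)
ordinal <- cut(noisy_feature, breaks = ordinal_breaks,
               labels = c("Q1", "Q2", "Q3", "Q4"), ordered_result = TRUE)

table(binary)
table(ordinal)

# Reusing the same breaks on new data keeps the category definitions fixed.
new_feature <- rnorm(10)
new_ordinal <- cut(new_feature, breaks = ordinal_breaks,
                   labels = c("Q1", "Q2", "Q3", "Q4"), ordered_result = TRUE)
```

Storing the breaks and labels alongside each feature is what allows later data sets to be binned in exactly the same way, which is the role the <tt>cutpoints</tt> play in the <tt>MixedTypeEngine</tt>.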