--- title: "Population Time Plots" author: "Sahir R. Bhatnagar" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: fig_height: 8 fig_width: 11 keep_md: yes toc: yes toc_depth: 4 vignette: > %\VignetteIndexEntry{Population Time Plots} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} bibliography: reference.bib editor_options: chunk_output_type: console --- # Overview Population time plots can be extremely informative graphical displays of survival data. They should be the first step in your exploratory data analyses. We facilitate this task in the `casebase` package by providing a `plot` method for objects returned by the `popTime` function. This is done in two steps: 1) Pass your dataset to the `casebase::popTime` function and specify the column names corresponding to * the time of event * the event status * exposure (optional) This will create an object of class `popTime` with an exposure attribute if you specify a value for the `exposure` argument. 2) Pass the object of class `popTime` to the `plot` function In this vignette we show how to create population-time plots for the two datasets that ship with the `casebase` package, as well as several well known survival datasets from the `survival` package. For a more detailed description of how to modify the aesthetics of a population time plot, please refer to the [Customizing Population Time Plots](http://sahirbhatnagar.com/casebase/articles/customizingpopTime.html) vignette. # Load Required Packages ```{r setup, echo=TRUE, message=FALSE, warning=FALSE} library(survival) library(casebase) library(ggplot2) library(data.table) ``` # European Randomized Study of Prostate Cancer Screening Data For our first example, we make use of the European Randomized Study of Prostate Cancer Screening (ERSPC) data which ships with the `casebase` package (see `help("ERSPC", package = "casebase")` for details about this data). ```{r} data("ERSPC") head(ERSPC) ``` We first pass this dataset to the `popTime` function. If you do not specify the `time` and `event` arguments, this function will try to guess them using regular expressions. See the **Details** section of `help("popTime", package = "casebase")` for how we try to guess these inputs. ```{r} pt_object <- casebase::popTime(ERSPC, event = "DeadOfPrCa") ``` We can see its contents and its class: ```{r} head(pt_object) class(pt_object) ``` The `casebase` package has a `plot` method for objects of class `popTime`: ```{r} # plot method for objects of class 'popTime' plot(pt_object) ``` One benefit from these plots is that it allows you to see the incidence density. This can be seen from the distribution of the red dots in the above plot. We can see that more events are observed later on in time. Therefore a constant hazard model would not be appropriate in this instance as it would overestimate the cumulative incidence earlier on in time, and underestimate it later on. The unique 'step shape' of the population time plot is due to the randomization of the Finnish cohorts which were carried out on January 1 of each of the 4 years 1996 to 1999. This, coupled with the uniform December 31 2006 censoring date, lead to large numbers of men with exactly 11, 10, 9 or 8 years of follow-up. It is important to note that the red points are random distributed across the gray area for each time of event. That is, imagine we draw a vertical line at a specific event time. We then plot the red point at a randomly sampled y-coordinate along this vertical line. This is done to avoid having all red points along the upper edge of the plot (because the subjects with the least amount of observation time are plotted at the top of the y-axis). By randomly distributing them, we can get a better sense of the incidence density. ## Exposure Stratified Population Time Plot Often the observations in a study are in specific groups such as treatment arms. It may be of interest to compare the population time plots between these two groups. Here we compare the control group and the screening group. We create exposure stratified plots by specifying the `exposure` argument in the `popTime` function: ```{r} # stratified population time plot pt_object_strat <- casebase::popTime(ERSPC, event = "DeadOfPrCa", exposure = "ScrArm") ``` We can see its contents, class and that it has an exposure attribute: ```{r} head(pt_object_strat) class(pt_object_strat) ``` We can also see that the `pt_object_strat` has an exposure attribute which contains the name of the exposure variable in the dataset: ```{r} attr(pt_object_strat, "exposure") ``` The plot method for objects of class `popTime` will use this exposure attribute to create exposure stratified population time plots: ```{r} plot(pt_object_strat) ``` We can also plot them side-by-side using the `facet.params` argument, which is a list of arguments that are passed to the [`facet_wrap()`](https://ggplot2.tidyverse.org/reference/facet_wrap.html) function in the `ggplot2` package: ```{r} plot(pt_object_strat, facet.params = list(ncol = 2)) ``` ## Plotting the base series To illustrate the casebase sampling methodology, we can also plot the base series using the `add.base.series` function: ```{r} plot(pt_object_strat, add.base.series = TRUE) ``` Note that the `theme.params` argument is a list of arguments passed to the [`ggplot2::theme()`](https://ggplot2.tidyverse.org/reference/theme.html) function. # Stem Cell Data Next we show population time plots when there is a competing event. The `bmtcrr` dataset contains information on patients who underwent haematopoietic stem cell transplantation for acute leukaemia. This data is included in the `casebase` package. See `help("bmtcrr", package = "casebase")` for more details. We will use this dataset to further illustrate the fundamental components of a population time plot. Note that for this dataset, the `popTime` fails to identify a `time` variable if you didn't specify one: ```{r, error=TRUE} # load data data(bmtcrr) str(bmtcrr) # table of event status by exposure table(bmtcrr$Status, bmtcrr$D) # error because it can't determine a time variable popTimeData <- popTime(data = bmtcrr) ``` In this case, you must be explicit about what the time variable is: ```{r} popTimeData <- popTime(data = bmtcrr, time = "ftime") class(popTimeData) ``` ## Plotting the follow-up times for each observation We first plot that area on the graph representing the observed follow-up time. Fundamentally, this area is constructed by plotting a line for each individual, where the length of each line represents their follow-up time in the cohort. The follow-up times are plotted from top (shortest follow-up time) to bottom (longest follow-up time). In practice, we instead plot a polygon using the [`ggplot2::geom_ribbon()`](https://ggplot2.tidyverse.org/reference/geom_ribbon.html) function. The following figure shows this area for the `bmtcrr` dataset. Note that we must specify `add.case.series = FALSE` because the default is to add the case series: ```{r} plot(popTimeData, add.case.series = FALSE) ``` Note that we can change the aesthetics of the area by using the `ribbon.params()` argument as follows. These arguments are passed to the [`ggplot2::geom_ribbon()`](https://ggplot2.tidyverse.org/reference/geom_ribbon.html) function: ```{r} plot(popTimeData, add.case.series = FALSE, ribbon.params = list(color = "red", fill = "blue", size = 2, alpha = 0.2)) ``` ## Plot the Case Series Next we add the case series. Note that because the `Status` column has a competing event (coded as 2), we must specify `comprisk = TRUE` (even if you don't want to plot the competing event): ```{r} plot(popTimeData, add.case.series = TRUE, comprisk = TRUE) ``` In the above plot we can clearly see many of the deaths occur at the beginning, so in this case, a constant hazard assumption isn't valid. This information is useful when deciding on the type of model to use. ## Plot the Base Series We can now add the base series with the `add.base.series` argument. Internally, the `plot` method calls the `casebase::sampleCaseBase` function to sample the base series from the total person moments. This requires us to specify the ratio of base series to case series in the `ratio` argument which we will leave at its default of 1. A legend is also added by default: ```{r} plot(popTimeData, add.case.series = TRUE, add.base.series = TRUE, comprisk = TRUE) ``` ## Plot the Competing event We specify the `add.competing.event = TRUE` in order to also plot the competing event. Note, that like the case series, the competing event is sampled randomly on the vertical axis in order to see the incidence density. ```{r} plot(popTimeData, add.case.series = TRUE, add.base.series = TRUE, add.competing.event = TRUE, comprisk = TRUE) ``` We can also only plot the case series and competing event (or any combination): ```{r} plot(popTimeData, add.case.series = TRUE, add.base.series = FALSE, add.competing.event = TRUE, comprisk = TRUE) ``` ## Stratified by Disease Next we stratify by disease; lymphoblastic or myeloblastic leukemia, abbreviated as ALL and AML, respectively. We must specify the `exposure` variable. Furthermore it is important to properly label the factor variable corresponding to the exposure variable; this will ensure proper labeling of the panels: ```{r} # create 'popTime' object popTimeData <- popTime(data = bmtcrr, time = "ftime", exposure = "D") attr(popTimeData, "exposure") plot(popTimeData, add.case.series = TRUE, add.base.series = TRUE, add.competing.event = TRUE, comprisk = TRUE) ``` ## Change color points and legend labels Here is some code to change color points and legend labels. For a more thorough description, please see the [Customizing Population Time Plots](http://sahirbhatnagar.com/casebase/articles/customizingpopTime.html) vignette. ```{r} plot(popTimeData, add.case.series = TRUE, add.base.series = TRUE, add.competing.event = TRUE, comprisk = TRUE, case.params = list(mapping = aes(x = time, y = yc, colour = "Relapse", fill = "Relapse")), base.params = list(mapping = aes(x = time, y = ycoord, colour = "Base series", fill = "Base series")), competing.params = list(mapping = aes(x = time, y = yc, colour = "Competing event", fill = "Competing event")), fill.params = list(name = "Legend Name", breaks = c("Relapse", "Base series", "Competing event"), values = c("Relapse" = "blue", "Competing event" = "red", "Base series" = "orange")) ) ``` # Veteran Data Below are the steps to create a population time plot for the Veterans' Administration Lung Cancer study (see `help("veteran", package = "survival")` for more details on this dataset). ```{r} # veteran data in library(survival) data("veteran") str(veteran) popTimeData <- casebase::popTime(data = veteran) class(popTimeData) plot(popTimeData) ``` We can see in this example that the dots are fairly evenly spread out. That is, we don't see any particular clusters of red dots indicating that perhaps a constant hazard assumption would be appropriate. ## Stratified by treatment population time plot In this example we compare the standard and test treatment groups. A reminder that this is done by simply specifying the `exposure` argument in the `casebase::popTime` function: ```{r} # Label the factor so that it appears in the plot veteran <- transform(veteran, trt = factor(trt, levels = 1:2, labels = c("standard", "test"))) # create 'popTime' object popTimeData <- popTime(data = veteran, exposure = "trt") # object of class 'popTime' class(popTimeData) # has name of exposure variable as an attribute attr(popTimeData, "exposure") ``` Again, we simply pass this object to the `plot` function to get an exposure stratified population time plot: ```{r} # plot method for objects of class 'popTime' plot(popTimeData) ``` # Stanford Heart Transplant Data Population time plots also allow you to explain patterns in the data. We use the Stanford Heart Transplant Data to demonstrate this. See `help("heart", package = "survival")` for details about this dataset. For this example, we must create a time variable, because we only have the start and stop times. This is a good example to show that population time plots are also valid for this type of data (i.e. subjects who have different entry times) because we are only plotting the time spent in the study on the x-axis. ```{r} # data from library(survival) data("heart") str(heart) # create time variable for time in study heart <- transform(heart, time = stop - start, transplant = factor(transplant, labels = c("no transplant", "transplant"))) # stratify by transplant indicator popTimeData <- popTime(data = heart, exposure = "transplant") plot(popTimeData) ``` In the plot above we see that those who didn't receive transplant died very early (many red dots at the start of the x-axis). Those who did receive the transplant have much better survival (as indicated by the grey area). Does this show clear evidence that getting a heart transplant increases survival? Not exactly. This is a classic case of confounding by indication. In this study, the doctors only gave a transplant to the healthier patients because they had a better chance of surviving surgery. # NCCTG Lung Cancer Data The following example is from survival in patients with advanced lung cancer from the North Central Cancer Treatment Group. See `help("cancer", package = "survival")` for details about this data. ```{r} # data from library(survival) data("cancer") str(cancer) # since the event indicator 'status' is numeric, it must have # 0 for censored and 1 for event cancer <- transform(cancer, status = status - 1, sex = factor(sex, levels = 1:2, labels = c("Male", "Female"))) popTimeData <- popTime(data = cancer) plot(popTimeData) ``` ## Stratified by gender We can also switch back to the default `ggplot2` theme by specifying `casebase.theme = FALSE`: ```{r} popTimeData <- popTime(data = cancer, exposure = "sex") plot(popTimeData, casebase.theme = FALSE) ``` # Simulated Data Example Below is an example based on simulated data. ## Simulate the data ```{r} set.seed(1) nobs <- 500 # simulation parameters a1 <- 1.0 b1 <- 200 a2 <- 1.0 b2 <- 50 c1 <- 0.0 c2 <- 0.0 # end of study time eost <- 10 # e event type 0-censored, 1-event of interest, 2-competing event # t observed time/endpoint # z is a binary covariate DTsim <- data.table(ID = seq_len(nobs), z=rbinom(nobs, 1, 0.5)) setkey(DTsim, ID) DTsim[,`:=` (event_time = rweibull(nobs, a1, b1 * exp(z * c1)^(-1/a1)), competing_time = rweibull(nobs, a2, b2 * exp(z * c2)^(-1/a2)), end_of_study_time = eost)] DTsim[,`:=`(event = 1 * (event_time < competing_time) + 2 * (event_time >= competing_time), time = pmin(event_time, competing_time))] DTsim[time >= end_of_study_time, event := 0] DTsim[time >= end_of_study_time, time:=end_of_study_time] ``` ## Population Time Plot ```{r} # create 'popTime' object popTimeData <- popTime(data = DTsim, time = "time", event = "event") plot(popTimeData) ``` ## Stratified by Binary Covariate z ```{r} # stratified by binary covariate z popTimeData <- popTime(data = DTsim, time = "time", event = "event", exposure = "z") # we can line up the plots side-by-side instead of one on top of the other # we can also change the theme by adding plot(popTimeData, facet.params = list(ncol = 2)) + theme_linedraw() ``` # Session information ```{r echo=FALSE, eval=TRUE} print(sessionInfo(), locale = FALSE) ```