--- title: "Modgo simulation" output: html_vignette: self_contained: yes description: | Modgo Vignette demonstrates how to use the package to simulate a data set, and also to exploit the main capabilities of the package vignette: > %\VignetteIndexEntry{Modgo simulation} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: bibliography.bib --- ```{r include = FALSE} knitr::opts_chunk$set(fig.width = 8, fig.height = 8, out.height = "80%", out.width = "80%", dpi = 300) counter <- 0 ``` ```{r setup, warning=FALSE, include = FALSE} library(survival) ``` # 1. Default *modgo* simulation For illustration, we selected the Cleveland Clinic Heart Disease Data set from the University of California in Irvine (UCI) machine learning data repository [@Dua2019]. Below, we are using eleven variables, five of which are continuous, four are dichotomous, and two categorical variables. ```{r data} library(modgo) data("Cleveland", package = "modgo") ``` ```{r basic_arguments} # Specifying dichotomous and ordinal categorical variables binary_variables <- c("Sex","HighFastBloodSugar","CAD","ExInducedAngina") categorical_variables <- c("Chestpaintype","RestingECG") nrep <- 500 plot_variables <- c("Age", "STDepression", binary_variables[c(1,3)], categorical_variables) ``` In this section, we run *modgo* with its default settings. For *modgo* to produce results that mimic the original data set efficiently, user needs to specify dichotomous and ordinal categorical variables. Variables will be considered as continuous, otherwise. All *modgo* runs in this and the following sections will produce 500 data sets with the specification nrep = 500; the default is 100. Figure 1 shows the correlation plots for the default *modgo* run, and Figure 2 displays the distribution plots for the original data set and one simulated data set. The default displayed simulated data set is the first one. Moreover, for all the plots a set of variables are used. ```{r default_test} test <- modgo(data = Cleveland, bin_variables = binary_variables, categ_variables = categorical_variables, nrep = nrep) ``` ```{r correlation_default,echo=FALSE, fig.cap=paste0("Figure ",counter,": Correlation plots for a default *modgo* run.")} corr_plots(test, variables = plot_variables) counter <- counter + 1 ``` ```{r distr_default,echo=FALSE, fig.cap=paste0("Figure ",counter,": Distribution plots for a default *modgo* run.")} distr_plots(test, variables = plot_variables) counter <- counter + 1 ``` # 2. Expansions ## 2.1 Selection by thresholds of variables *modgo* provides an option so that only subjects (instances) are simulated that fulfill a specific requirement. In the simplest case (Section 2.1), the user can specify an upper or a lower boundary, or an interval for a variable. The use may alternatively specify a combination of variables and thresholds. Three steps are required when subjects need to fulfill a specific selection criterion for a continuous variable. First, the name of the variable needs to be specified, for which the threshold needs to be set. Second, the left and right boundaries need to be specified. Third, a data frame with three columns is defined with Column 1: variable name of threshold variable, Column 2: left boundary, i.e., lower bound, Column 3: right boundary, i.e., upper bound. Finally, the data frame is imported using the *thresh_var* argument. In the example, all subjects have to be at least 66 years old. The selection variable therefore is *Age* with left threshold *65* and right threshold infinity *NA*. If the percentage of samples fulfilling the indicated threshold requirements are less than 10% of the simulated samples, *modgo* stops to avoid excessive computation time. However, users can force *thresh_force = TRUE* the requested simulation to be run. Figure 3 shows the correlation plot for this illustration. Substantial differences between the original and the simulated correlation plots can be observed for the RestingECG and several other variables. Figure 4 displays the corresponding distribution plot. The age distribution is shifted as expected. Furthermore, the distribution of subjects with coronary artery disease (CAD = 1) is higher in the simulated than the original data set. ```{r} Variables <- c("Age") thresh_left <- c(65) thresh_right <- c(NA) thresholds <- data.frame(Variables, thresh_left, thresh_right) print(as.matrix(thresholds)) test_thresh <- modgo(data = Cleveland, bin_variables = binary_variables, categ_variables = categorical_variables, thresh_var = thresholds, nrep = nrep, thresh_force = TRUE) ``` ```{r, echo=FALSE, fig.cap=paste0("Figure ",counter,": Correlation plot for Age > 65 threshold *modgo* run")} corr_plots(test_thresh, variables = plot_variables) counter <- counter + 1 ``` ```{r,echo=FALSE, fig.cap=paste0("Figure ",counter,": Distribution plot for Age > 65 threshold *modgo* run")} distr_plots(test_thresh, variables = plot_variables) counter <- counter + 1 ``` ## 2.2 Perturbation analysis - Unchanged variance For continuous variables, *modgo* provides the option to add a normally distributed noise with mean 0 and variance $\sigma_{p}^2$. With this perturbation, the variance of the perturbed variable is identical to the variance of the original variable. This option permits the generation of values from continuous variables, which were not observed in the original data set. To specify which variables are to be perturbed and to which degree, i.e., percentage, the user needs to provide *modgo* with a named vector of the percentages and with the corresponding variables names as the names of the vector. Similar to the previous examples, Figure 5 shows the correlation plots for the expansion to perturbations, and Figure 6 displays the distribution plots. Figure 6 shows that the distribution of both resting blood pressure and cholesterol change substantially due to the perturbation. ```{r} #Create named vector perturb_vector <- c(0.9,0.7) names(perturb_vector) <- c("RestingBP","Cholsterol") test_pertru <- modgo(data = Cleveland, bin_variables = binary_variables, categ_variables = categorical_variables, pertr_vec = perturb_vector, nrep = nrep) ``` ```{r, echo=FALSE, fig.cap=paste0("Figure ",counter,": Correlation plot for Pertrubation Expansion *modgo* run")} corr_plots(test_pertru, variables = c(plot_variables, names(perturb_vector))) counter <- counter + 1 ``` ```{r, echo=FALSE, fig.cap=paste0("Figure ",counter,": Distribution plot for Pertrubation Expansion *modgo* run")} distr_plots(test_pertru, variables = c(plot_variables, names(perturb_vector))) counter <- counter + 1 ``` # 3. Generalized lambda *modgo* simulation Another feature provided by *modgo* is the ability to simulate a data set using the Generalized Lambdas Distribution (GLD) method. This method allows users to generate values that are not present in the original data set, as the default method only uses values from the original data. The GLD method is based on the GLDEX package [@SteveSU]. More information on Generalized Lambdas Distributions can be found in Fitting Statistical Distributions [@Zaven_Karian]. ```{r GLD_run} test_GLD <- modgo(data = Cleveland, bin_variables = binary_variables, categ_variables = categorical_variables, generalized_mode = TRUE, nrep = nrep) ``` ```{r correlation_GLD, echo=FALSE, fig.cap=paste0("Figure ",counter,": Correlation plots for Generalized Lambda Distribtion *modgo* run")} corr_plots(test_GLD, variables = plot_variables) counter <- counter + 1 ``` ```{r distr_GLD, echo=FALSE, fig.cap=paste0("Figure ",counter,": Distribution plot for Generalized Lambda Distribtion *modgo* run")} distr_plots(test_GLD, variables = plot_variables) counter <- counter + 1 ``` GLDEX provides three basic models for calculating the four Lambdas for each distribution. These models are called rmfmkl (default model), rprs, and star [@SteveSU]. They can also be combined for a bi-modal estimation. We give you the option to specify your desired model(or a combination of models) for each variable in the dataset. In the next step, we will show you how to specify the desired GLD models. ```{r arguments_GLD_def_model} Variables <- c("Age","STDepression") Model <- c("rprs", "star-rmfmkl") model_matrix <- cbind(Variables, Model) ``` ```{r GLD_run_def_model} test_GLD_define_model <- modgo(data = Cleveland, bin_variables = binary_variables, categ_variables = categorical_variables, generalized_mode = TRUE, generalized_mode_model = model_matrix, nrep = nrep) ``` ```{r correlation_GLD_def_model, echo=FALSE, fig.cap=paste0("Figure ",counter,": Correlation plots for Generalized Lambda Distribtion *modgo* run with specified GLD models")} corr_plots(test_GLD_define_model, variables = plot_variables) counter <- counter + 1 ``` ```{r distr_GLD_def_model,echo=FALSE, fig.cap=paste0("Figure ",counter,": Distribution plots for Generalized Lambda Distribtion *modgo* run with specified GLD models")} distr_plots(test_GLD_define_model, variables = plot_variables) counter <- counter + 1 ``` By examining the distribution plots shown above, we can observe that the GLD method has the capability to generate values for certain variables, such as STDepression, that do not exist in the original data set and can also be extremely high. Additionally, there is an alternative option to compute Generalized Lambdas independently of the *modgo* function and subsequently utilize them as an input for a subsequent *modgo* run. ```{r GLD_run_def_model_set_lambdas} gener_lambdas_matrix <- generalizedMatrix(data = Cleveland, generalized_mode_model = model_matrix, bin_variables = binary_variables) test_GLD_define_model_set_lambdas <- modgo(data = Cleveland, bin_variables = binary_variables, categ_variables = categorical_variables, generalized_mode = TRUE, generalized_mode_lmbds = gener_lambdas_matrix, nrep = nrep) ``` Lastly, an intriguing feature provided by *modgo* in conjunction with the GLD method is the capability to simulate a data set without requiring the original data. To execute *modgo* without a data set, the user must provide the following: 1) Correlation matrix of the data set 2) Generalized matrix of the data set 3) Sample size of the simulated data set 4) Variable names Below, we present an example of this case. ```{r GLD_run_no_data_set} # Necessary arguments gener_lambdas_matrix <- generalizedMatrix(data = Cleveland, generalized_mode_model = model_matrix, bin_variables = binary_variables) sigma <- cor(Cleveland) variables_names <- colnames(sigma) sample_size <- 100 test_GLD_no_data_set <- modgo(data = NULL, variables = variables_names, bin_variables = binary_variables, categ_variables = categorical_variables, sigma = sigma, generalized_mode = TRUE, generalized_mode_lmbds = gener_lambdas_matrix, n_samples = sample_size, nrep = nrep) ``` ```{r correlation_GLD_run_no_data_set, echo=FALSE, fig.cap=paste0("Figure ",counter,": Correlation plots for Generalized Lambda Distribtion *modgo* run without providing a data set")} corr_plots(test_GLD_no_data_set, variables = plot_variables) counter <- counter + 1 ``` # 4. Survival example To demonstrate the simulation of survival variables, we chose the cancer data set from the *survival* package [@survival-package]. This data set contains 167 samples and 10 variables. In order to set up modgo_survival(), the user must specify a status variable and a time variable, in addition to the other arguments of *modgo*. ```{r survival_data_set} # cancer prepare data("cancer", package = "survival") cancer <- na.omit(cancer) cancer$sex <- cancer$sex - 1 cancer$status <- cancer$status - 1 time_var_cancer <- "time" status_var_cancer <- "status" bin_var_cancer <- c("status", "sex") cat_var_list_cancer <- c("ph.ecog") plot_variables_surv <- colnames(cancer)[1:6] ``` The modgo_survival function divides the data set into two separate data sets based on the status variable, and then it simulates each data set individually using the Generalized Lambdas Distribution method. The user can specify which GLDEX model should be used for each data set, with the default being "rprs". ```{r survival_data_set_run} # Survival run test_surv <- modgo_survival(data = cancer, surv_method = 1, bin_variables = bin_var_cancer, categ_variables = cat_var_list_cancer, event_variable = status_var_cancer, time_variable = time_var_cancer, generalized_mode_model_no_event = "rmfmkl", generalized_mode_model_event = "rprs") ``` ```{r survival_data_set_corr, echo=FALSE, fig.cap=paste0("Figure ",counter,": Correlation plots for modgo_survival run")} corr_plots(test_surv, variables = plot_variables_surv) counter <- counter + 1 ``` ```{r survival_data_set_distr, echo=FALSE, fig.cap=paste0("Figure ",counter,": Distribution plots for modgo_survival run")} distr_plots(test_surv, variables = plot_variables_surv) counter <- counter + 1 ``` In the following plot, the surv_fit() curves from the *survival* package for both original and simulated data are depicted. ```{r survival_data_set_coxplots, fig.cap=paste0("Figure ",counter,": Survival fit curves plot for modgo_survival run")} data_set_info <- c(rep("Original", dim(test_surv$original_data)[1]), rep("Simulated", dim(test_surv$simulated_data[[1]])[1])) combine_data_set <- rbind(test_surv$original_data, test_surv$simulated_data[[1]]) combine_data_set <- cbind(combine_data_set, data_set_info) fit <- survfit(Surv(time, status) ~ data_set_info, data=combine_data_set) plot(fit, fun = "F", col=1:2) legend(700, 1, c("Original data set", "Simulated data set"), lty=c(1,1), col=c(1,2), bty='n', lwd=2) ``` # References