--- title: "Adding Custom Feature Engineering Functions" author: "Jenna Reps, Egill Fridgeirsson" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Adding Custom Feature Engineering Functions} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo = FALSE, message = FALSE, warning = FALSE} library(PatientLevelPrediction) ``` # Introduction This vignette describes how you can add your own custom function for feature engineering in the Observational Health Data Sciences and Informatics (OHDSI) [`PatientLevelPrediction`](https://github.com/OHDSI/PatientLevelPrediction) package. This vignette assumes you have read and are comfortable with building single patient level prediction models as described in the `vignette('BuildingPredictiveModels')`. **We invite you to share your new feature engineering functions with the OHDSI community through our [GitHub repository](https://github.com/OHDSI/PatientLevelPrediction).** # Feature Engineering Function Code Structure To make a custom feature engineering function that can be used within PatientLevelPrediction you need to write two different functions. The 'create' function and the 'implement' function. The 'create' function, e.g., create\<FeatureEngineeringFunctionName\>, takes the parameters of the feature engineering 'implement' function as input, checks these are valid and outputs these as a list of class 'featureEngineeringSettings' with the 'fun' attribute specifying the 'implement' function to call. The 'implement' function, e.g., implement\<FeatureEngineeringFunctionName\>, must take as input: - `trainData` - a list containing: - `covariateData`: the `plpData$covariateData`restricted to the training patients - `labels`: a data frame that contain `rowId`(patient identifier) and `outcomeCount` (the class labels) - `folds`: a data.frame that contains `rowId` (patient identifier) and `index` (the cross validation fold) - `featureEngineeringSettings` - the output of your create\<FeatureEngineeringFunctionName\> The 'implement' function can then do any manipulation of the `trainData` (adding new features or removing features) but must output a `trainData` object containing the new `covariateData`, `labels` and `folds` for the training data patients. # Example Let's consider the situation where we wish to create an age spline feature. To make this custom feature engineering function we need to write the 'create' and 'implement' R functions. ## Create function Our age spline feature function will create a new feature using the `plpData$cohorts$ageYear` column. We will implement a restricted cubic spline that requires specifying the number of knots. Therefore, the inputs for this are: `knots` - an integer/double specifying the number of knots. ```{r, echo = TRUE, eval=FALSE} createAgeSpline <- function(knots = 5) { # create list of inputs to implement function featureEngineeringSettings <- list( knots = knots ) # specify the function that will implement the sampling attr(featureEngineeringSettings, "fun") <- "implementAgeSplines" # make sure the object returned is of class "sampleSettings" class(featureEngineeringSettings) <- "featureEngineeringSettings" return(featureEngineeringSettings) } ``` We now need to create the 'implement' function `implementAgeSplines()` ## Implement function All 'implement' functions must take as input the `trainData` and the `featureEngineeringSettings` (this is the output of the 'create' function). They must return a `trainData` object containing the new `covariateData`, `labels` and `folds`. In our example, the `createAgeSpline()` will return a list with 'knots'. The `featureEngineeringSettings` therefore contains this. ```{r tidy=FALSE,eval=FALSE} implementAgeSplines <- function(trainData, featureEngineeringSettings, model = NULL) { # if there is a model, it means this function is called through applyFeatureengineering, meaning it # should apply the model fitten on training data to the test data if (is.null(model)) { knots <- featureEngineeringSettings$knots ageData <- trainData$labels y <- ageData$outcomeCount X <- ageData$ageYear model <- mgcv::gam( y ~ s(X, bs = "cr", k = knots, m = 2) ) newData <- data.frame( rowId = ageData$rowId, covariateId = 2002, covariateValue = model$fitted.values ) } else { ageData <- trainData$labels X <- trainData$labels$ageYear y <- ageData$outcomeCount newData <- data.frame(y = y, X = X) yHat <- predict(model, newData) newData <- data.frame( rowId = trainData$labels$rowId, covariateId = 2002, covariateValue = yHat ) } # remove existing age if in covariates trainData$covariateData$covariates <- trainData$covariateData$covariates |> dplyr::filter(!.data$covariateId %in% c(1002)) # update covRef Andromeda::appendToTable( trainData$covariateData$covariateRef, data.frame( covariateId = 2002, covariateName = "Cubic restricted age splines", analysisId = 2, conceptId = 2002 ) ) # update covariates Andromeda::appendToTable(trainData$covariateData$covariates, newData) featureEngineering <- list( funct = "implementAgeSplines", settings = list( featureEngineeringSettings = featureEngineeringSettings, model = model ) ) attr(trainData$covariateData, "metaData")$featureEngineering <- listAppend( attr(trainData$covariateData, "metaData")$featureEngineering, featureEngineering ) return(trainData) } ``` # Acknowledgments Considerable work has been dedicated to provide the `PatientLevelPrediction` package. ```{r tidy=TRUE,eval=TRUE} citation("PatientLevelPrediction") ``` **Please reference this paper if you use the PLP Package in your work:** [Reps JM, Schuemie MJ, Suchard MA, Ryan PB, Rijnbeek PR. Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data. J Am Med Inform Assoc. 2018;25(8):969-975.](https://dx.doi.org/10.1093/jamia/ocy032) This work is supported in part through the National Science Foundation grant IIS 1251151.