---
title: "Adding Custom Feature Engineering Functions"
author: "Jenna Reps, Egill Fridgeirsson"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Adding Custom Feature Engineering Functions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo = FALSE, message = FALSE, warning = FALSE}
library(PatientLevelPrediction)
```

# Introduction

This vignette describes how you can add your own custom function for feature engineering in the Observational Health Data Sciences and Informatics (OHDSI) [`PatientLevelPrediction`](https://github.com/OHDSI/PatientLevelPrediction) package. This vignette assumes you have read and are comfortable with building single patient level prediction models as described in the `vignette('BuildingPredictiveModels')`.

**We invite you to share your new feature engineering functions with the OHDSI community through our [GitHub repository](https://github.com/OHDSI/PatientLevelPrediction).**

# Feature Engineering Function Code Structure

To make a custom feature engineering function that can be used within PatientLevelPrediction you need to write two different functions. The 'create' function and the 'implement' function.

The 'create' function, e.g., create\<FeatureEngineeringFunctionName\>, takes the parameters of the feature engineering 'implement' function as input, checks these are valid and outputs these as a list of class 'featureEngineeringSettings' with the 'fun' attribute specifying the 'implement' function to call.

The 'implement' function, e.g., implement\<FeatureEngineeringFunctionName\>, must take as input:

-   `trainData` - a list containing:

    -   `covariateData`: the `plpData$covariateData`restricted to the training patients

    -   `labels`: a data frame that contain `rowId`(patient identifier) and `outcomeCount` (the class labels)

    -   `folds`: a data.frame that contains `rowId` (patient identifier) and `index` (the cross validation fold)

-   `featureEngineeringSettings` - the output of your create\<FeatureEngineeringFunctionName\>

The 'implement' function can then do any manipulation of the `trainData` (adding new features or removing features) but must output a `trainData` object containing the new `covariateData`, `labels` and `folds` for the training data patients.

# Example

Let's consider the situation where we wish to create an age spline feature. To make this custom feature engineering function we need to write the 'create' and 'implement' R functions.

## Create function

Our age spline feature function will create a new feature using the `plpData$cohorts$ageYear` column. We will implement a restricted cubic spline that requires specifying the number of knots. Therefore, the inputs for this are: `knots` - an integer/double specifying the number of knots.

```{r, echo = TRUE, eval=FALSE}
createAgeSpline <- function(knots = 5) {
  # create list of inputs to implement function
  featureEngineeringSettings <- list(
    knots = knots
  )

  # specify the function that will implement the sampling
  attr(featureEngineeringSettings, "fun") <- "implementAgeSplines"

  # make sure the object returned is of class "sampleSettings"
  class(featureEngineeringSettings) <- "featureEngineeringSettings"
  return(featureEngineeringSettings)
}
```

We now need to create the 'implement' function `implementAgeSplines()`

## Implement function

All 'implement' functions must take as input the `trainData` and the `featureEngineeringSettings` (this is the output of the 'create' function). They must return a `trainData` object containing the new `covariateData`, `labels` and `folds`.

In our example, the `createAgeSpline()` will return a list with 'knots'. The `featureEngineeringSettings` therefore contains this.

```{r tidy=FALSE,eval=FALSE}
implementAgeSplines <- function(trainData, featureEngineeringSettings, model = NULL) {
  # if there is a model, it means this function is called through applyFeatureengineering, meaning it   # should apply the model fitten on training data to the test data
  if (is.null(model)) {
    knots <- featureEngineeringSettings$knots
    ageData <- trainData$labels
    y <- ageData$outcomeCount
    X <- ageData$ageYear
    model <- mgcv::gam(
      y ~ s(X, bs = "cr", k = knots, m = 2)
    )
    newData <- data.frame(
      rowId = ageData$rowId,
      covariateId = 2002,
      covariateValue = model$fitted.values
    )
  } else {
    ageData <- trainData$labels
    X <- trainData$labels$ageYear
    y <- ageData$outcomeCount
    newData <- data.frame(y = y, X = X)
    yHat <- predict(model, newData)
    newData <- data.frame(
      rowId = trainData$labels$rowId,
      covariateId = 2002,
      covariateValue = yHat
    )
  }

  # remove existing age if in covariates
  trainData$covariateData$covariates <- trainData$covariateData$covariates |>
    dplyr::filter(!.data$covariateId %in% c(1002))

  # update covRef
  Andromeda::appendToTable(
    trainData$covariateData$covariateRef,
    data.frame(
      covariateId = 2002,
      covariateName = "Cubic restricted age splines",
      analysisId = 2,
      conceptId = 2002
    )
  )

  # update covariates
  Andromeda::appendToTable(trainData$covariateData$covariates, newData)

  featureEngineering <- list(
    funct = "implementAgeSplines",
    settings = list(
      featureEngineeringSettings = featureEngineeringSettings,
      model = model
    )
  )

  attr(trainData$covariateData, "metaData")$featureEngineering <- listAppend(
    attr(trainData$covariateData, "metaData")$featureEngineering,
    featureEngineering
  )

  return(trainData)
}
```

# Acknowledgments

Considerable work has been dedicated to provide the `PatientLevelPrediction` package.

```{r tidy=TRUE,eval=TRUE}
citation("PatientLevelPrediction")
```

**Please reference this paper if you use the PLP Package in your work:**

[Reps JM, Schuemie MJ, Suchard MA, Ryan PB, Rijnbeek PR. Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data. J Am Med Inform Assoc. 2018;25(8):969-975.](https://dx.doi.org/10.1093/jamia/ocy032)

This work is supported in part through the National Science Foundation grant IIS 1251151.