---
title: "Motivation"
author: "Dr. Simon Müller"
date: "`r Sys.Date()`"
output:
  html_vignette:
    css: kable.css
    number_sections: yes
    toc: yes
vignette: >
  %\VignetteIndexEntry{Motivation}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  markdown: 
    wrap: 72
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Objective {#objective}

## Define a Distance Measure for Numerical and Categorical Features {#define-a-distance-measure-for-numerical-and-categorical-features}

Let $x_i = \left(x_i^1,\cdots,x_i^p\right)^T$ and
$x_j = \left(x_j^1,\cdots,x_j^p\right)^T$ be the feature vectors of two
distinct patients $i$ and $j$. A first rough idea may be to calculate
the (normalized) $L_1$-distance between these two feature vectors: $$
\|x_i - x_j\|_{L_1}=\frac{1}{p}\sum\limits_{k=1}^p|x_i^k - x_j^k|
$$
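
As a quick illustration, the following minimal base-R sketch computes this
normalized $L_1$ distance for two purely numerical feature vectors; the
patient values are made up for demonstration purposes:

```{r naive-l1-distance}
# Two hypothetical patients described by p = 3 numerical features
# (age in years, weight in kg, height in cm)
x_i <- c(age = 56, weight = 80, height = 178)
x_j <- c(age = 60, weight = 72, height = 165)

# Normalized L1 distance: mean absolute difference over the p features
l1_distance <- function(x, y) mean(abs(x - y))
l1_distance(x_i, x_j)
```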

However, this naive approach gives rise to several problems:

1.  This measure is well-defined for numerical features but not for
    categorical features, such as gender (male/female). We need a method
    to define the distance for categorical variables.

2.  $|x_i^k - x_j^k|$ is not scale-invariant, meaning that if one
    changes the unit of measurement (e.g., from meters to centimeters),
    the contribution of this feature would increase by a factor of 100.
    Features with large values will dominate the distance (a short
    numerical illustration follows this list).

3.  All features are treated equally, which may not reflect their actual
    importance. For example, in the case of lung cancer, the patient's
    smoking status (no/yes) seems more relevant than the city they live
    in.
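
To make the second problem concrete, the short sketch below (with made-up
heights) shows how merely changing the unit of a feature inflates its
contribution to the naive distance:

```{r scale-dependence}
# Height of two hypothetical patients, once in meters and once in centimeters
height_m  <- c(1.78, 1.65)
height_cm <- height_m * 100

# The same physical difference contributes 0.13 in meters but 13 in centimeters,
# so the unit of measurement alone decides how strongly the feature dominates
abs(diff(height_m))
abs(diff(height_cm))
```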

To address these issues, we propose a machine learning approach that can
handle both numerical and categorical features while accounting for
their varying importance:

1.  For categorical features, use a distance metric such as the Gower
    distance, which can handle mixed data types. This metric assigns a
    distance of 0 for matching categories and 1 for non-matching
    categories.
2.  Normalize numerical features to make them scale-invariant. This can
    be done using min-max scaling or standardization (subtracting the
    mean and dividing by the standard deviation).
3.  Assign weights to features based on their importance in predicting
    the outcome. This can be achieved using feature selection techniques
    or by incorporating feature importance scores from machine learning
    models such as decision trees or random forests (a sketch combining
    all three steps follows this list).
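
The following base-R sketch combines these three ideas for a small, made-up
data set with one numerical and one categorical feature. The weights are
placeholder values chosen by hand; in practice they would come from one of
the feature-importance techniques mentioned above:

```{r weighted-mixed-distance}
# Hypothetical patients: one numerical feature (age) and one categorical feature (smoker)
patients <- data.frame(
  age    = c(56, 60, 80),
  smoker = c("yes", "no", "yes")
)

# Standardize the numerical feature to remove the dependence on its unit
patients$age_std <- (patients$age - mean(patients$age)) / sd(patients$age)

# Simple mixed-type distance between patients i and j:
#  - numerical feature: absolute difference of the standardized values
#  - categorical feature: 0 if the categories match, 1 otherwise (Gower-style)
#  - weights: hand-picked here, standing in for feature-importance scores
mixed_distance <- function(data, i, j, weights = c(age = 1, smoker = 2)) {
  d_age    <- abs(data$age_std[i] - data$age_std[j])
  d_smoker <- as.numeric(data$smoker[i] != data$smoker[j])
  unname(weights["age"] * d_age + weights["smoker"] * d_smoker)
}

mixed_distance(patients, 1, 2)
mixed_distance(patients, 1, 3)
```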

By employing a machine learning approach to define the distance measure,
we can effectively address the challenges associated with numerical and
categorical features while accounting for their relative importance in
the context of the problem.

## Weighted Distance Measure {#weighted-distance-measure}

To address the challenges discussed earlier, we need to perform two
steps:

1.  Generalize the L1-distance to handle different variable types,
    including numerical and categorical features.

2.  Assign appropriate weights to each feature to:

    a.  eliminate the dependency on the scale of the variables,
    b.  account for the varying importance across features, and
    c.  consider the effect size of each feature.

By incorporating these modifications, we arrive at the following
weighted distance measure:

$$
d(x_i, x_j) = \sum\limits_{k=1}^p |\alpha(x_i^k, x_j^k) d(x_i^k, x_j^k)|
$$

The weighted distance measure now includes weights
$\alpha(x_i^k, x_j^k)$, which are determined based on the training data.
In the next section, we will discuss how to obtain these weights and
effectively compute the weighted distance measure for mixed data types
and varying feature importance.

# Statistical Model {#statistical-model}

Let $\alpha(x_i^k, x_j^k)$ be the weights and $d(x_i^k, x_j^k)$ be the
distance for feature $k$ and observations $i$ and $j$. We define them as
follows:

If feature $k$ is numerical, then

$$d(x_i^k, x_j^k) = x_i^k - x_j^k$$ and

$$\alpha(x_i^k, x_j^k) = \hat{\beta_k}$$

If feature $k$ is categorical, then

$$d(x_i^k, x_j^k) = 1 \text{ when } x_i^k \neq x_j^k \text{, else } 0$$ and
$$\alpha(x_i^k, x_j^k) = \hat{\beta_k}^i - \hat{\beta_k}^j,$$

where $\hat{\beta_k}$, $\hat{\beta_k}^i$, and $\hat{\beta_k}^j$ are
coefficients of a regression model (linear, logistic, or Cox proportional
hazards (CPH) model); for a categorical feature, $\hat{\beta_k}^i$ denotes the
coefficient of the level taken by observation $i$, with the reference level
set to zero. By defining the distance and weight
measures in this way, we account for both numerical and categorical
features while incorporating the relative importance of each feature in
the model. The weights are determined based on the regression model's
coefficients, reflecting the impact of each feature on the prediction.
This approach provides a more meaningful and interpretable distance
measure for mixed data types and varying feature importance.
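
A minimal sketch of this construction in base R, using a simulated binary
outcome and a logistic regression model; the data, the coefficients, and the
helper `weighted_distance()` are invented for illustration and do not
correspond to any package function:

```{r regression-based-weights}
set.seed(1)

# Simulated data: one numerical and one categorical feature plus a binary outcome
n  <- 200
df <- data.frame(
  age    = rnorm(n, mean = 60, sd = 10),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE))
)
df$outcome <- rbinom(n, 1, plogis(-10 + 0.15 * df$age + 1.2 * (df$smoker == "yes")))

# Logistic regression supplies the coefficients used as weights
fit  <- glm(outcome ~ age + smoker, data = df, family = binomial())
beta <- coef(fit)

# Coefficient per smoker level: the reference level "no" gets 0
beta_smoker <- c(no = 0, yes = unname(beta["smokeryes"]))

# Weighted distance between observations i and j following the definitions above
weighted_distance <- function(data, i, j) {
  # numerical feature: alpha = beta_age, d = difference of the raw values
  d_num <- beta["age"] * (data$age[i] - data$age[j])
  # categorical feature: alpha = coefficient difference, d = 1 if levels differ
  d_cat <- (beta_smoker[as.character(data$smoker[i])] -
            beta_smoker[as.character(data$smoker[j])]) *
           as.numeric(data$smoker[i] != data$smoker[j])
  unname(abs(d_num) + abs(d_cat))
}

weighted_distance(df, 1, 2)
```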

# Random Forests {#random-forests}

## Proximity Measure {#proximity-measure}

Two observations $i$ and $j$ are considered more similar when the
fraction of trees in which patients $i$ and $j$ share the same terminal
node is close to one (Breiman, 2002).

$$d(x_i, x_j)^2 = 1 - \frac{1}{M}\sum\limits_{t=1}^T 1_{[x_i \text{ and } x_j \text{ share the same terminal node in tree } t]},$$
where $M$ is the number of trees that contain both observations and $T$
is the total number of trees. A drawback of this measure is that the
decision is binary, meaning that potentially similar observations might
be counted as dissimilar. For example, suppose a final cut-off is
consistently made around age 58, and observation 1 has an age of 56
while observation 2 has an age of 60. In this case, the distance between
observation 1 and observation 2 would be the same as the distance
between observation 1 and an observation with an age of 80. This
limitation makes the proximity measure less sensitive to small
differences between observations, potentially affecting the overall
analysis of similarity.
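
Assuming the randomForest package is available, the proximity-based distance
can be computed roughly as follows; this sketch uses the built-in `iris` data
and only runs when the package is installed:

```{r rf-proximity, eval = requireNamespace("randomForest", quietly = TRUE), message = FALSE}
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, proximity = TRUE, ntree = 500)

# Proximity = fraction of trees in which two observations share a terminal node;
# converting it into the distance defined above (d^2 = 1 - proximity)
prox        <- rf$proximity
dist_matrix <- sqrt(1 - prox)
dist_matrix[1:3, 1:3]
```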

## Depth Measure: A modified proximity measure {#depth-measure-a-modified-proximity-measure}

In contrast to the proximity measure, the depth measure takes into
account the number of edges between two observations instead of their
final nodes in each tree. This distance measure is averaged over all
trees and is defined as:

$$
d(x_i, x_j) = \frac{1}{M}\sum\limits_{t=1}^T g_{ij}^t,
$$

where $M$ is the number of trees containing both observations, and
$g_{ij}^t$ is the number of edges between the terminal nodes of observations $i$
and $j$ in tree $t$. This measure considers the structure of the trees
and provides a more nuanced understanding of the similarity between
observations.
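
The core quantity $g_{ij}^t$ is simply the path length between two terminal
nodes within a single tree. The following self-contained toy sketch, which is
not tied to any particular random forest implementation, counts these edges
via the deepest common ancestor of the two terminal nodes:

```{r depth-measure-toy}
# Toy binary tree encoded by parent pointers: node 1 is the root
#        1
#      /   \
#     2     3
#    / \   / \
#   4   5 6   7
parent <- c(NA, 1, 1, 2, 2, 3, 3)

# Path from a node up to the root (ordered node -> root)
path_to_root <- function(node) {
  path <- node
  while (!is.na(parent[node])) {
    node <- parent[node]
    path <- c(path, node)
  }
  path
}

# Number of edges between two terminal nodes = steps up to their deepest common ancestor
edges_between <- function(i, j) {
  p_i <- path_to_root(i)
  p_j <- path_to_root(j)
  common <- intersect(p_i, p_j)[1]   # deepest common ancestor
  (match(common, p_i) - 1) + (match(common, p_j) - 1)
}

edges_between(4, 5)  # siblings: 2 edges
edges_between(4, 7)  # opposite sides of the root: 4 edges
```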

For more details and a thorough explanation of the depth measure, refer
to the publication by Englund and Verikas: "[A novel approach to
estimate proximity in a random forest: An exploratory
study.](https://www.sciencedirect.com/science/article/abs/pii/S095741741200810X)"
The depth measure addresses some of the limitations of the proximity
measure by considering the tree structure and the path traversed by the
observations, which results in a more accurate and informative distance
measure.