---
title: "Performance metrics"
author: "Alex Zwanenburg"
date: "2024-09-20"
output:
rmarkdown::html_vignette:
includes:
in_header: familiar_logo.html
after_body: license.html
toc: TRUE
rmarkdown::github_document:
html_preview: FALSE
includes:
in_header: familiar_logo.html
after_body: license.html
toc: TRUE
bibliography: "refs.bib"
vignette: >
%\VignetteIndexEntry{Performance metrics}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
When we train a model, we usually want to know how good the model is. Model
performance is assessed using different metrics that quantify how well a model
discriminates cases, stratifies groups or predicts values. `familiar` implements
metrics that are typically used to assess the performance of categorical,
regression and survival models.
# Categorical outcomes
Performance metrics for models with categorical outcomes, i.e. `binomial` and
`multinomial`, are listed below.
| **method** | **tag** | **averaging** |
|:----------------------------------------|:-----------------------------------------------------|:-------------:|
| accuracy | `accuracy` | |
| area under the receiver-operating curve | `auc`, `auc_roc` | |
| balanced accuracy | `bac`, `balanced_accuracy` | |
| balanced error rate | `ber`, `balanced_error_rate` | |
| Brier score | `brier` | |
| Cohen's kappa | `kappa`, `cohen_kappa` | |
| f1 score | `f1_score` | × |
| false discovery rate | `fdr`, `false_discovery_rate` | × |
| informedness | `informedness` | × |
| markedness | `markedness` | × |
| Matthews' correlation coefficient | `mcc`, `matthews_correlation_coefficient` | |
| negative predictive value | `npv` | × |
| positive predictive value | `precision`, `ppv` | × |
| recall | `recall`, `sensitivity`, `tpr`, `true_positive_rate` | × |
| specificity | `specificity`, `tnr`, `true_negative_rate` | × |
| Youden's J | `youden_j`, `youden_index` | × |
Table: Overview of performance metrics for categorical outcomes. Some contingency
table-based metrics require averaging for `multinomial` outcomes as their
original definition only covers `binomial` problems with two classes.
## Area under the receiver-operating curve
The area under the receiver-operating curve is quite literally that. It is the
area under the curve created by plotting the true positive rate (TPR, i.e.
sensitivity) against the false positive rate (FPR, i.e. 1-specificity). TPR and
FPR are derived from a contingency table, which is created by comparing
predicted class probabilities against a threshold. The receiver-operating curve
is created by iterating over
threshold values. The AUC of a model that predicts perfectly is $1.0$, while
$0.5$ indicates predictions that are no better than random.
The implementation in `familiar` does not use the ROC curve to compute the AUC.
Instead, an algebraic equation by @Hand2001-ij is used. For `multinomial`
outcomes the AUC is computed for each pairwise comparison of outcome classes,
and averaged [@Hand2001-ij].
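The same algebraic approach can be illustrated with a short base-R sketch that
computes the AUC for a two-class comparison from rank statistics rather than
from the ROC curve. This is a minimal, illustrative example and not the
`familiar` implementation; the function name and arguments are assumptions.

```r
# Rank-based AUC for a two-class comparison: p holds predicted probabilities
# for the 'positive' class, y the observed classes of the same samples.
pairwise_auc <- function(p, y, positive) {
  is_positive <- y == positive
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  # Rank all predicted probabilities; ties receive average ranks.
  r <- rank(p, ties.method = "average")
  # AUC from the rank sum of the positive class (Mann-Whitney statistic).
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
```

For `multinomial` outcomes this comparison would be repeated for every pair of
classes, restricting the data to samples from the two classes involved, and
the resulting values averaged.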
## Brier score
The Brier score [@Brier1950-lo] is a measure of deviation of predicted
probabilities from the ideal situation where the probability for class *a* is
$1.0$ if the observed class is *a* and $0.0$ if it is not *a*. Hence, it can be
viewed as a measure of calibration as well. A value of $0.0$ is optimal.
The implementation in `familiar` iterates over all outcome classes in a
one-versus-all approach, as originally devised by @Brier1950-lo. For `binomial`
outcomes, the score is divided by 2, so that it falls in the $[0.0, 1.0]$ range.
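A minimal sketch of this one-versus-all computation in base R is shown below;
the function name and the exact handling of class levels are illustrative
assumptions.

```r
# prob: matrix of predicted class probabilities (samples x classes), with
#       column names equal to the outcome class levels.
# y:    factor of observed classes with the same levels.
brier_score <- function(prob, y) {
  classes <- colnames(prob)
  # Indicator matrix: 1.0 where the observed class matches the column, else 0.0.
  indicator <- sapply(classes, function(cl) as.numeric(y == cl))
  score <- mean(rowSums((prob - indicator)^2))
  # For binomial outcomes the score is halved so that it lies in [0, 1].
  if (length(classes) == 2) score <- score / 2
  score
}
```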
## Contingency table-based metrics
A contingency table or confusion matrix displays the observed and predicted
classes. When dealing with two classes, e.g. *a* and *b*, one of the classes is
usually termed 'positive' and the other 'negative'. For example, let *b* be the
'positive' class. Then we can define the following four categories:
- True positive ($TP$): *b* is predicted and observed.
- True negative ($TN$): *a* is predicted and observed.
- False negative ($FN$): *b* was observed, but *a* was predicted.
- False positive ($FP$): *a* was observed, but *b* was predicted.
A contingency table contains the occurrence of each of the four cases. If a
model is good, most samples will be either true positive or true negative.
Models that are not as good may have larger numbers of false positives and/or
false negatives.
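As a small illustration, the contingency table and the four categories can be
obtained in base R as follows; the class labels and predictions are
hypothetical.

```r
# Observed and predicted classes; "b" is treated as the 'positive' class.
observed  <- factor(c("a", "a", "b", "b", "b", "a"), levels = c("a", "b"))
predicted <- factor(c("a", "b", "b", "b", "a", "a"), levels = c("a", "b"))

# Contingency table (confusion matrix) of observed versus predicted classes.
confusion <- table(observed = observed, predicted = predicted)

tp <- confusion["b", "b"]  # true positives
tn <- confusion["a", "a"]  # true negatives
fp <- confusion["a", "b"]  # false positives
fn <- confusion["b", "a"]  # false negatives
```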
Metrics based on the contingency table use two or more of the four categories to
characterise model performance. The extension from two classes (`binomial`) to
more (`multinomial`) is often not trivial. For many metrics, `familiar` uses a
one-versus-all approach. Here, all classes are iteratively used as the
'positive' class, with the rest grouped as 'negative'. Three options can be
used to obtain performance values for `multinomial` problems, with an
implementation similar to that of `scikit-learn`:
- `micro`: The number of true positives, true negatives, false positives and
false negatives are computed for each class iteration, and then summed over
all classes. The score is calculated afterwards.
- `macro`: A score is computed for each class iteration, and then averaged.
- `weighted`: A score is computed for each class iteration, and then averaged
with a weight corresponding to the number of samples with the observed
'positive' class, i.e. the prevalence.
By default, `familiar` uses `macro`, but the averaging procedure may be selected
through appending `_micro`, `_macro` or `_weighted` to the name of the metric.
For example, `recall_micro` will compute the recall metric using `micro`
averaging.
Averaging only applies to `multinomial` outcomes. No averaging is performed for
`binomial` problems with two classes. In this case `familiar` will always
consider the second class level to correspond to the 'positive' class.
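The sketch below illustrates the three averaging schemes for the recall metric
in a one-versus-all setting. It is a simplified base-R example under the
stated assumptions, not the `familiar` implementation.

```r
# observed, predicted: factors that share the same class levels.
averaged_recall <- function(observed, predicted, average = "macro") {
  classes <- levels(observed)
  # Per-class counts from a one-versus-all split.
  tp <- sapply(classes, function(cl) sum(observed == cl & predicted == cl))
  fn <- sapply(classes, function(cl) sum(observed == cl & predicted != cl))

  if (average == "micro") {
    # Pool the counts over all classes, then compute the score once.
    return(sum(tp) / (sum(tp) + sum(fn)))
  }

  score <- tp / (tp + fn)
  if (average == "macro") {
    # Unweighted mean over the per-class scores.
    return(mean(score))
  }

  # "weighted": weight each per-class score by the prevalence of that class.
  weights <- as.numeric(table(observed)[classes]) / length(observed)
  sum(score * weights)
}
```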
### Accuracy
Accuracy quantifies the fraction of correctly predicted classes:
$s_{acc}=(TP + TN) / (TP + TN + FP + FN)$. The extension to more than 2 classes
is trivial. No averaging is performed for the accuracy metric.
### Balanced accuracy
Accuracy is known to be sensitive to imbalances in the class distribution. A
balanced accuracy was therefore defined [@Brodersen2010-vb], which is the
averaged within-class true positive rate (also known as recall or sensitivity):
$s_{bac}=0.5 (TP / (TP + FN) + TN / (TN + FP))$.
The extension to more than 2 classes involves summing the in-class true
positive rate $TP / (TP + FN)$ over each positive class and dividing by the
number of classes. No averaging is performed for balanced accuracy.
### Balanced error rate
The balanced error rate is closely related to balanced accuracy, but uses the
in-class false negative rate instead of the in-class true positive rate:
$s_{ber}=0.5 (FN / (TP + FN) + FP / (TN + FP))$.
The extension to more than 2 classes involves summing the in-class false
negative rate $FN / (TP + FN)$ over each positive class and dividing by the
number of classes. No averaging is performed for balanced error rate.
### F1 score
The F1 score is the harmonic mean of precision and sensitivity:
$s_{f1} = 2 \; TP / (2 \; TP + FP + FN)$.
The metric is not invariant to class permutation. Averaging is therefore
performed for `multinomial` outcomes.
### False discovery rate
The false discovery rate quantifies the proportion of false positives (type I
errors) among all predicted positives: $s_{fdr} = FP / (TP + FP)$.
The metric is not invariant to class permutation. Averaging is therefore
performed for `multinomial` outcomes.
### Informedness
Informedness is a generalisation of Youden's J statistic [@Powers2011-jt].
Informedness can be extended to multiple classes, and no averaging is therefore
required.
For `binomial` problems, informedness and the Youden J statistic are the same.
### Cohen's kappa
Cohen's kappa coefficient is a measure of correspondence between the observed
and predicted classes [@Cohen1960-kc]. Cohen's kappa coefficient is invariant to
class permutations and no averaging is performed for Cohen's kappa.
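A minimal base-R sketch of the coefficient, computed from the observed and
expected agreement, is shown below; it is illustrative rather than the
internal implementation.

```r
# observed, predicted: factors that share the same class levels.
cohen_kappa <- function(observed, predicted) {
  confusion <- table(observed, predicted)
  n <- sum(confusion)
  # Observed agreement: proportion of samples on the diagonal.
  p_observed <- sum(diag(confusion)) / n
  # Expected agreement if observed and predicted classes were independent.
  p_expected <- sum(rowSums(confusion) * colSums(confusion)) / n^2
  (p_observed - p_expected) / (1 - p_expected)
}
```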
### Markedness
Markedness is the predictive-value counterpart of informedness: it is the sum
of the positive and negative predictive values minus 1 [@Powers2011-jt].
### Matthews' correlation coefficient
Matthews' correlation coefficient measures the correlation between observed and
predicted classes [@Matthews1975-kh].
An extension to multiple classes, i.e. multinomial outcomes, was devised by
@Gorodkin2004-tx.
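The multi-class extension can be written down compactly from the confusion
matrix. The sketch below follows Gorodkin's $R_K$ statistic and is an
illustrative base-R example, not the `familiar` implementation.

```r
# Multi-class Matthews correlation coefficient (Gorodkin's R_K statistic).
# observed, predicted: factors that share the same class levels.
mcc_multiclass <- function(observed, predicted) {
  confusion <- table(observed, predicted)
  n <- sum(confusion)              # total number of samples
  correct <- sum(diag(confusion))  # correctly classified samples
  t_k <- rowSums(confusion)        # observed samples per class
  p_k <- colSums(confusion)        # predicted samples per class
  numerator <- correct * n - sum(t_k * p_k)
  denominator <- sqrt((n^2 - sum(p_k^2)) * (n^2 - sum(t_k^2)))
  # The coefficient is undefined if a class is never observed or predicted.
  if (denominator == 0) return(NA_real_)
  numerator / denominator
}
```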
### Negative predictive value
The negative predictive value (NPV) is the fraction of predicted negative
classes that were also observed to be negative: $s_{npv} = TN / (TN + FN)$.
The NPV is not invariant to class permutations. Averaging is performed for
`multinomial` outcomes.
### Positive predictive value
The positive predictive value (PPV) is the fraction of predicted positive
classes that were also observed to be positive: $s_{ppv} = TP / (TP + FP)$.
The PPV is also referred to as precision. The PPV is not invariant to class
permutations. Averaging is performed for `multinomial` outcomes.
`micro`-averaging effectively computes the accuracy.
### Recall
Recall, also known as sensitivity or true positive rate, is the fraction of
observed positive classes that were also predicted to be positive:
$s_{recall} = TP / (TP + FN)$.
Recall is not invariant to class permutations and averaging is performed for
`multinomial` outcomes. Both `micro` and `weighted` averaging effectively
compute the accuracy.
### Specificity
Specificity, also known as the true negative rate, is the fraction of observed
negative classes that were also predicted to be negative:
$s_{spec} = TN / (TN + FP)$.
Specificity is not invariant to class permutations and averaging is performed
for `multinomial` outcomes.
### Youden's J statistic
Youden's J statistic [@Youden1950-no] is the sum of recall and specificity minus
1: $s_{youden} = TP / (TP + FN) + TN / (TN + FP) - 1$.
Youden's J statistic is not invariant to class permutations and averaging is
performed for `multinomial` outcomes.
For `binomial` problems, informedness and the Youden J statistic are the same.
# Regression outcomes
Performance metrics for models with regression outcomes, i.e. `count` and
`continuous`, are listed below.
| **method** | **tag** |
|:---------------------------|:--------------------------------------|
| explained variance | `explained_variance` |
| mean absolute error | `mae`, `mean_absolute_error` |
| relative absolute error    | `rae`, `relative_absolute_error`       |
| mean log absolute error | `mlae`, `mean_log_absolute_error` |
| mean squared error | `mse`, `mean_squared_error` |
| relative squared error | `rse`, `relative_squared_error` |
| mean squared log error | `msle`, `mean_squared_log_error` |
| median absolute error | `medae`, `median_absolute_error` |
| R2 score | `r2_score`, `r_squared` |
| root mean square error | `rmse`, `root_mean_square_error` |
| root relative squared error| `rrse`, `root_relative_squared_error` |
| root mean square log error | `rmsle`, `root_mean_square_log_error` |
Each of the above metrics can be made more robust against rare outliers by
appending `_winsor` or `_trim` as a suffix to the metric name. This respectively
performs winsorising (clipping) and trimming (truncating) based on the absolute
prediction error, prior to computing the metric. Winsorising clips the predicted
values for 5% of the instances with the most extreme absolute errors prior to
computing the performance metric, whereas trimming removes these instances. For
example, winsorised and trimmed versions of the mean squared error metric are
specified as `mse_winsor` and `mse_trim`, respectively.
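The exact clipping scheme is not spelled out above, so the sketch below only
illustrates the general idea: winsorising clips, and trimming removes, the 5%
of instances with the most extreme absolute errors, here assuming the 95th
percentile of the absolute error as cut-off.

```r
# y: observed values, y_hat: predicted values.
robust_errors <- function(y, y_hat, method = c("winsor", "trim")) {
  method <- match.arg(method)
  error <- y - y_hat
  # Assumed cut-off: the 95th percentile of the absolute prediction error.
  cutoff <- quantile(abs(error), probs = 0.95)

  if (method == "winsor") {
    # Clip the most extreme errors to the cut-off value, keeping their sign.
    sign(error) * pmin(abs(error), cutoff)
  } else {
    # Drop the most extreme instances entirely.
    error[abs(error) <= cutoff]
  }
}

# A winsorised mean squared error, cf. `mse_winsor`, would then be
# mean(robust_errors(y, y_hat, method = "winsor")^2).
```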
Let $y$ be the set of observed values, and $\hat{y}$ the corresponding predicted
values. The error is then $\epsilon = y-\hat{y}$.
## Explained variance
The explained variance is defined as
$1 - \text{Var}\left(\epsilon\right) / \text{Var}\left(y\right)$. This metric is
not sensitive to differences in offset between observed and predicted values.
## Mean absolute error
The mean absolute error is defined as $1/N \sum_i^N \left|\epsilon_i\right|$,
with $N$ the number of samples.
## Relative absolute error
The relative absolute error is defined as $\sum_i^N \left|\epsilon_i\right|/ \sum_i^N \left|y_i - \bar{y}\right|$.
## Mean log absolute error
The mean log absolute error is defined as
$1/N \sum_i^N \log(\left|\epsilon_i\right| + 1)$.
## Mean squared error
The mean squared error is defined as $1/N \sum_i^N \left(\epsilon_i \right)^2$.
## Relative squared error
The relative squared error is defined as $\sum_i^N \left(\epsilon_i\right)^2/ \sum_i^N \left(y_i - \bar{y}\right)^2$.
## Mean squared log error
Mean squared log error is defined as
$1/N \sum_i^N \left(\log \left(y_i + 1\right) - \log\left(\hat{y}_i + 1\right)\right)^2$.
Note that this score only applies to observed and predicted values in the
positive domain. It is not defined for negative values.
## Median absolute error
The median absolute error is the median of absolute error
$\left|\epsilon\right|$.
## R2 score
The R2 score is defined as:
$$R^2 = 1 - \frac{\sum_i^N(\epsilon_i)^2}{\sum_i^N(y_i - \bar{y})^2}$$ Here
$\bar{y}$ denotes the mean value of $y$.
## Root mean square error
The root mean square error is defined as
$\sqrt{1/N \sum_i^N \left(\epsilon_i \right)^2}$.
## Root relative squared error
The root relative squared error is defined as $\sqrt{\sum_i^N \left(\epsilon_i\right)^2/ \sum_i^N \left(y_i - \bar{y}\right)^2}$.
## Root mean square log error
The root mean square log error is defined as
$\sqrt{1/N \sum_i^N \left(\log \left(y_i + 1\right) - \log\left(\hat{y}_i + 1\right)\right)^2}$.
Note that this score only applies to observed and predicted values in the
positive domain. It is not defined for negative values.
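As a worked illustration, several of the metrics above can be computed directly
from the observed and predicted values in base R; the numbers are purely
illustrative.

```r
# Observed and predicted values (illustrative numbers).
y     <- c(2.0, 3.5, 1.0, 4.0, 5.5)
y_hat <- c(2.5, 3.0, 1.5, 3.5, 6.0)
error <- y - y_hat

mae   <- mean(abs(error))                        # mean absolute error
mlae  <- mean(log(abs(error) + 1))               # mean log absolute error
mse   <- mean(error^2)                           # mean squared error
rmse  <- sqrt(mse)                               # root mean square error
rse   <- sum(error^2) / sum((y - mean(y))^2)     # relative squared error
r2    <- 1 - rse                                 # R2 score
explained_variance <- 1 - var(error) / var(y)    # explained variance
```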
# Survival outcomes
Performance metrics for models with survival outcomes, i.e. `survival`, are
listed below.
+--------------------------------------------+---------------------------------+
| **method** | **tag** |
+:===========================================+:================================+
| concordance index | `concordance_index`, `c_index`, |
| | `concordance_index_harrell`, |
| | `c_index_harrell` |
+--------------------------------------------+---------------------------------+
## Concordance index
The concordance index assesses ordering between observed and predicted values.
Let $T$ be observed times, $c$ the censoring status ($0$: no observed event;
$1$: event observed) and $\hat{T}$ predicted times. Concordance between all
pairs of values is determined as follows [@Pencina2004-ii]:
- Concordant: a pair is concordant if $T_i < T_j$ and $\hat{T}_i < \hat{T}_j$
(provided $c_i=1$), or if $T_i > T_j$ and $\hat{T}_i > \hat{T}_j$ (provided
$c_j=1$).
- Discordant: a pair is discordant if $T_i < T_j$ and $\hat{T}_i > \hat{T}_j$
(provided $c_i=1$), or if $T_i > T_j$ and $\hat{T}_i < \hat{T}_j$ (provided
$c_j=1$).
- Tied: a pair is tied if $\hat{T}_i = \hat{T}_j$, provided that $c_i=c_j=1$.
- Not comparable: otherwise. This occurs, for example, if sample $i$ was
censored before an event was observed in sample $j$, or both samples were
censored.
The concordance index is then computed as:
$$ci = \frac{n_{concord} + 0.5 n_{tied}}{n_{concord} + n_{discord} + n_{tied}}$$
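The pairwise rules above translate into a short base-R sketch; this is an
illustrative implementation of the listed rules that simply loops over all
pairs of samples, not the code used by `familiar`.

```r
# time:      observed times.
# status:    censoring status (1: event observed, 0: censored).
# predicted: predicted times.
concordance_index <- function(time, status, predicted) {
  n_concordant <- n_discordant <- n_tied <- 0
  n <- length(time)

  for (i in seq_len(n - 1)) {
    for (j in seq(i + 1, n)) {
      if (status[i] == 1 && status[j] == 1 && predicted[i] == predicted[j]) {
        # Tied: both events observed and identical predicted values.
        n_tied <- n_tied + 1
      } else if (time[i] < time[j] && status[i] == 1) {
        # Sample i experiences the earlier event.
        if (predicted[i] < predicted[j]) n_concordant <- n_concordant + 1
        if (predicted[i] > predicted[j]) n_discordant <- n_discordant + 1
      } else if (time[i] > time[j] && status[j] == 1) {
        # Sample j experiences the earlier event.
        if (predicted[j] < predicted[i]) n_concordant <- n_concordant + 1
        if (predicted[j] > predicted[i]) n_discordant <- n_discordant + 1
      }
      # Remaining pairs are not comparable and do not contribute.
    }
  }

  (n_concordant + 0.5 * n_tied) / (n_concordant + n_discordant + n_tied)
}
```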
# References