---
title: "Frequentist uncertainty visualization"
author: "Matthew Kay"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
    df_print: kable
vignette: >
  %\VignetteIndexEntry{Frequentist uncertainty visualization}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, child="children/chunk_options.txt"}
```

## Introduction
  
This vignette shows how to combine the `ggdist` geoms with output from the `broom` package to enable visualization of uncertainty from frequentist models. The general idea is to use `xdist` and `ydist` aesthetics supported by `ggdist` stats to visualize *confidence distributions* instead of visualizing posterior distributions as we might from a Bayesian model. For more information on the families of geoms and stats in ggdist, see `vignette("slabinterval")`, `vignette("dotsinterval")`, and `vignette("lineribbon")`.

*Confidence distributions* are a way of unifying the notion of sampling distributions, bootstrap distributions, and several other concepts in frequentist inference. They are a convenient tool for visualizing uncertainty in a way that generalizes across Bayesian and frequentist frameworks: where in a Bayesian framework we might visualize a *probability* distribution, in the frequentist framework we visualize a *confidence* distribution. This gives us a way to use the same geometries for uncertainty visualization in either framework.

For more on confidence distributions, see: Xie, Minâ€ge, and Kesar Singh. [Confidence distribution, the frequentist distribution estimator of a parameter: A review](https://statweb.rutgers.edu/mxie/RCPapers/insr.12000.pdf). _International Statistical Review_ 81.1 (2013): 3-39.

## Setup
  
The following libraries are required to run this vignette:
  
```{r setup, message = FALSE, warning = FALSE}
library(dplyr)
library(tidyr)
library(ggdist)
library(ggplot2)
library(broom)
library(distributional)

theme_set(theme_ggdist())
```

```{r hidden_options, include=FALSE}
.old_options = options(width = 100)
```


## Point estimates

We'll start with an ordinary least squares (OLS) linear regression analysis of this simple dataset:

```{r data_gen}
set.seed(5)
n = 10
n_condition = 5
ABC =
  tibble(
    condition = rep(c("A","B","C","D","E"), n),
    response = rnorm(n * 5, c(0,1,2,1,-1), 0.5)
  )
```

This is a typical tidy format data frame: one observation per row. Graphically:

```{r data_plot, fig.width = med_width, fig.height = med_height/1.5}
ABC %>%
  ggplot(aes(x = response, y = condition)) +
  geom_point(alpha = 0.5) +
  ylab("condition")
```

And a simple linear regression of the data is fit as follows:

```{r m_ABC}
m_ABC = lm(response ~ condition, data = ABC)
```

The default summary is not great from an uncertainty communication perspective:

```{r m_ABC_summary}
summary(m_ABC)
```

So let's try half-eye plots instead. The basic idea is that we need to get the three parameters for the sampling distribution of each parameter and then use `stat_halfeye()` to plot them. The confidence distribution for parameter $i$, $\tilde\beta_i$, from an `lm` model is a scaled-and-shifted t distribution:

$$
\tilde\beta_i \sim \textrm{student_t}\left(\nu, \hat\beta_i, \sigma_{\hat\beta_i}\right)
$$

With:

- $\nu$: *degrees of freedom*, equal to `df.residual(m_ABC)`
- $\hat\beta_i$: *location*, equal to the point estimate of the parameter (`estimate` column from `broom::tidy()`)
- $\sigma_{\hat\beta_i}$: *scale*, equal to the standard error of the parameter estimate (`std.error` column from `broom::tidy()`)

We can get the estimates and standard errors easily by using `broom::tidy()`:

```{r m_ABC_coefs}
tidy(m_ABC)
```

Finally, we can construct vectors of probability distributions using functions 
like `distributional::dist_student_t()` from 
the [distributional](https://pkg.mitchelloharawild.com/distributional/) package.
The `stat_slabinterval()` family of functions supports these objects.

Putting everything together, we have:

```{r halfeye, fig.width = tiny_width, fig.height = tiny_height}
m_ABC %>%
  tidy() %>%
  ggplot(aes(y = term)) +
  stat_halfeye(
    aes(xdist = dist_student_t(df = df.residual(m_ABC), mu = estimate, sigma = std.error))
  )
```

If we would rather see uncertainty in conditional means, we can instead use `tidyr::expand()` along with `broom::augment()` (similar to how we can use `tidyr::expand()` with `tidybayes::add_fitted_draws()` for Bayesian models). Here we want the confidence distribution for the mean in condition $c$, $\tilde\mu_c$:

$$
\tilde\mu_c \sim \textrm{student_t}\left(\nu, \hat\mu_c, \sigma_{\hat\mu_c} \right)
$$

With:

- $\nu$: *degrees of freedom*, equal to `df.residual(m_ABC)`
- $\hat\mu_c$: *location*, equal to the point estimate of the mean in condition $c$ (`.fitted` column from `broom::augment()`)
- $\sigma_{\hat\mu_c}$: *scale*, equal to the standard error of the mean in condition $c$ (`.se.fit` column from `broom::augment(..., se_fit = TRUE)`)

Putting everything together, we have:

```{r halfeye_with_data, fig.width = tiny_width, fig.height = tiny_height}
ABC %>%
  expand(condition) %>%
  augment(m_ABC, newdata = ., se_fit = TRUE) %>%
  ggplot(aes(y = condition)) +
  stat_halfeye(
    aes(xdist = dist_student_t(df = df.residual(m_ABC), mu = .fitted, sigma = .se.fit)),
    scale = .5
  ) +
  # we'll add the data back in too (scale = .5 above adjusts the halfeye height so
  # that the data fit in as well)
  geom_point(aes(x = response), data = ABC, pch = "|", size = 2, position = position_nudge(y = -.15))
```

Of course, this works with the entire `stat_slabinterval()` family. Here are gradient plots instead:

```{r gradientinterval, fig.width = tiny_width, fig.height = tiny_height}
ABC %>%
  expand(condition) %>%
  augment(m_ABC, newdata = ., se_fit = TRUE) %>%
  ggplot(aes(y = condition)) +
  stat_gradientinterval(
    aes(xdist = dist_student_t(df = df.residual(m_ABC), mu = .fitted, sigma = .se.fit)),
    scale = .5, fill_type = "gradient"
  )
```

**Note:** The example above uses the **experimental** `fill_type = "gradient"`
option. This can be omitted if your system does not support it; see further 
discussion in the section on gradient plots in `vignette("slabinterval")`.

Or complementary cumulative distribution function (CCDF) bar plots:

```{r ccdfinterval, fig.width = tiny_width, fig.height = tiny_height}
ABC %>%
  expand(condition) %>%
  augment(m_ABC, newdata = ., se_fit = TRUE) %>%
  ggplot(aes(y = condition)) +
  stat_ccdfinterval(
    aes(xdist = dist_student_t(df = df.residual(m_ABC), mu = .fitted, sigma = .se.fit))
  )
```

We can also create quantile dotplots by using the `dots` family of geoms. Quantile dotplots show quantiles from a distribution (in this case, the sampling distribution), employing a *frequency framing* approach to uncertainty communication that can be easier for people to interpret ([Kay et al. 2016](https://doi.org/10.1145/2858036.2858558), [Fernandes et al. 2018](https://doi.org/10.1145/3173574.3173718)):

```{r dotplot, fig.width = tiny_width, fig.height = tiny_height}
ABC %>%
  expand(condition) %>%
  augment(m_ABC, newdata = ., se_fit = TRUE) %>%
  ggplot(aes(y = condition)) +
  stat_dots(
    aes(xdist = dist_student_t(df = df.residual(m_ABC), mu = .fitted, sigma = .se.fit)),
    quantiles = 100
  )
```

See `vignette("slabinterval")` and `vignette("dotsinterval")` for more examples of uncertainty geoms and stats in the slabinterval family.

## For a fit line

The same principle of reconstructing the confidence distribution allows us to use `stat_lineribbon()` to construct uncertainty bands around regression fit lines. Here we'll reconstruct an example with the `mtcars` dataset from `vignette("tidy-brms", package = "tidybayes")`, but using `lm()` instead:

```{r m_mpg}
m_mpg = lm(mpg ~ hp * cyl, data = mtcars)
```

Again we'll use `tidyr::expand()` with `broom::tidy()`, but now we'll employ `stat_lineribbon()`:

```{r lineribbon, fig.width = tiny_width, fig.height = tiny_height}
mtcars %>%
  group_by(cyl) %>%
  expand(hp = seq(min(hp), max(hp), length.out = 101)) %>%
  augment(m_mpg, newdata = ., se_fit = TRUE) %>%
  ggplot(aes(x = hp, fill = ordered(cyl), color = ordered(cyl))) +
  stat_lineribbon(
    aes(ydist = dist_student_t(df = df.residual(m_mpg), mu = .fitted, sigma = .se.fit)),
    alpha = 1/4
  ) +
  geom_point(aes(y = mpg), data = mtcars) +
  scale_fill_brewer(palette = "Set2") +
  scale_color_brewer(palette = "Dark2") +
  labs(
    color = "cyl",
    fill = "cyl",
    y = "mpg"
  )
```

Another alternative to using `alpha` to create gradations of lineribbon colors in different
groups is to use the `fill_ramp` aesthetic provided by `ggdist` to "ramp" the fill
color of the ribbons from `"white"` to their full color (see `help("scale_fill_ramp")`).
Here we'll "whiten" the fill color of each band according to its `level` (the `level`
variable is computed by `stat_lineribbon()` and is an ordered factor version of
`.width`):

```{r lineribbon_lightened, fig.width = tiny_width, fig.height = tiny_height}
mtcars %>%
  group_by(cyl) %>%
  expand(hp = seq(min(hp), max(hp), length.out = 101)) %>%
  augment(m_mpg, newdata = ., se_fit = TRUE) %>%
  ggplot(aes(x = hp, color = ordered(cyl))) +
  stat_lineribbon(aes(
    ydist = dist_student_t(df = df.residual(m_mpg), mu = .fitted, sigma = .se.fit),
    fill = ordered(cyl),
    fill_ramp = after_stat(level)
  )) +
  geom_point(aes(y = mpg), data = mtcars) +
  scale_fill_brewer(palette = "Set2") +
  scale_color_brewer(palette = "Dark2") +
  labs(
    color = "cyl",
    fill = "cyl",
    y = "mpg"
  )
```

One way to avoid problems created by overlapping ribbons (besides using `alpha`, 
as above) is to use commutative blending modes available in R > 4.2. The 
[ggblend](https://mjskay.github.io/ggblend/) package enables support for such
blending modes, and its documentation includes examples of their use with 
`stat_lineribbon()`.

For more examples of using lineribbons, see `vignette("lineribbon")`.


```{r reset_options, include=FALSE}
options(.old_options)
```