---
title: "Heteroscedastic Censored and Truncated Regression with crch"
author:
  - name: Jakob W. Messner
    affiliation: Austro Control
    address: Austria
    orcid: 0000-0002-1027-3673
    email: Jakob.Messner@acds.at
    url: https://www.zeileis.org/
  - name: Georg J. Mayr
    affiliation: UniversitÃ¤t Innsbruck
    address: Austria
    orcid: 0000-0001-6661-9453
    email: Georg.Mayr@uibk.ac.at
  - name: Achim Zeileis
    affiliation: UniversitÃ¤t Innsbruck
    address: Austria
    orcid: 0000-0003-0918-3766
    email: Achim.Zeileis@R-project.org
    url: https://www.zeileis.org/
format: 
  html:
    html-math-method: mathjax
    toc: true
    number-sections: true
abstract: >
  This introduction to the R package **crch** is a (slightly)
  modified version of @messner2016, published in _The R Journal_.\  

  The **crch** package provides functions for maximum likelihood
  estimation of censored or truncated regression models with conditional
  heteroscedasticity along with suitable standard methods to summarize the
  fitted models and compute predictions, residuals, etc. The supported
  distributions include left- or right-censored or truncated Gaussian, logistic,
  or student-t distributions with potentially different sets of regressors for
  modeling the conditional location and scale. The models and their R
  implementation are introduced and illustrated by numerical weather prediction
  tasks using precipitation data for Innsbruck (Austria).
bibliography: crch.bib
vignette: >
  %\VignetteIndexEntry{Heteroscedastic Censored and Truncated Regression with crch}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
  %\VignetteDepends{crch,glmx}
  %\VignetteKeywords{censored regression, truncated regression, tobit model, Cragg model, heteroskedasticity, R}
  %\VignettePackage{crch}
---

```{r}
#| label: preliminaries
#| include: false
knitr::opts_chunk$set(
  engine = "R",
  collapse = TRUE,
  comment = "##",
  message = FALSE,
  warning = FALSE,
  echo = TRUE
)
options(width = 70, prompt = "R> ", continue = "+  ", useFancyQuotes = FALSE, digits = 5)
library("crch")
```

## Introduction {#sec-intro}

Censored or truncated response variables occur in a variety of applications.
Censored data arise if exact values are only reported in a restricted range.
Data may fall outside this range but are reported at the range limits. In
contrast, if data outside this range are omitted completely we call it
truncated. 
E.g., consider wind measurements with an instrument that needs a certain 
minimum wind speed to start working. If wind speeds below this minimum are
recorded as $\le$_minimum_ the data is censored. If only wind speeds
exceeding this limit are reported and those below are omitted the data is 
truncated.
Even if the generating process is not as clear, censoring or truncation
can be useful to consider limited data such as precipitation observations.

The tobit [@tobin1958] and truncated regression [@cragg1971] models
are common linear regression models for censored and truncated conditionally
normally distributed responses respectively.  
Beside truncated data, truncated regression is also
used in two-part models [@cragg1971] for censored type data: A binary
(e.g., probit) regression model fits the exceedance probability of the lower
limit and a truncated regression model fits the value given the lower limit is
exceeded.

Usually linear models like the tobit or truncated regression models assume
homoscedasticity which means that the variance of an underlying normal
distribution does not depend on covariates.
However, sometimes this assumption does not hold and models that can consider
conditional heteroscedasticity should be used. Such models have been proposed,
e.g., for generalized linear models [@nelder1987, @smyth1989], generalized
additive models [@rigby1996, @rigby2005], or beta regression
[@cribarineto2010]. There also exist several R packages with functions
implementing the above models, e.g., **dglm** [@dunn2014], 
**glmx** [@zeileis2013], **gamlss** [@rigby2005], 
**betareg** [@zeileis2012] amongst others.

The **crch** package provides functions to fit censored and truncated 
regression models that
consider conditional heteroscedasticity. It has a convenient interface to
estimate these models with maximum likelihood and provides several methods for
analysis and prediction. In addition to the typical conditional Gaussian distribution
assumptions it also allows for logistic and student-t
distributions with heavier tails.

The outline of the paper is as follows. @sec-models describes the
censored and truncated regression models, and @sec-impl presents
their R implementation. @sec-examples illustrates the package
functions with numerical weather prediction data of precipitation in Innsbruck
(Austria) and finally @sec-summary summarizes the paper.


## Regression models {#sec-models}

For both, censored and truncated regression, a normalized latent response
$(y^*-\mu)/\sigma$ is assumed to follow a certain distribution $D$

$$
  \frac{y^*-\mu}{\sigma} \sim  D
$$

The location parameter $\mu$ and a link function of the scale parameter
$g(\sigma)$ are assumed to relate linearly to covariates $\mathbf{x} = (1, x_1,
x_2, \ldots)^\top$ and $\mathbf{z} = (1, z_1, z_2, \ldots)^\top$:

$$
\begin{eqnarray}
	\mu &= &\mathbf{x}^{\top}\beta \\
	g(\sigma) & = & \mathbf{z}^{\top}\gamma
\end{eqnarray}
$$ {#eq-musigma}

where $\beta=(\beta_0, \beta_1, \beta_2, \ldots)^\top$ and $\gamma=(\gamma_0,
\gamma_1, \gamma_2, \ldots)^\top$ are coefficient vectors. The link function
$g(\cdot):\mathbb{R}^+ \mapsto \mathbb{R}$ is a strictly increasing and twice
differentiable function; e.g., the logarithm (i.e., \mbox{$g(\sigma)=\log(\sigma)$}) is a well
suited function. Although they only map to $\mathbb{R}^+$, the identity
$g(\sigma) = \sigma$ or the quadratic function $g(\sigma)=\sigma^2$ can be
usefull as well. However, problems in the numerical optimization can occur.

Commonly $D$ is the standard normal distribution so that $y^*$ is assumed to be
normally distributed with mean $\mu$ and variance $\sigma^2$.  $D$ might also be
assumed to be a standard logistic or a student-t distribution if heavier tails
are required. The tail weight of the student-t distribution can be controlled by
the degrees of freedom $\nu$ which can either be set to a certain value or
estimated as an additional parameter. To assure positive values, $\log(\nu)$ is
modeled in the latter case.

$$
  \log(\nu) = \delta
$$ {#eq-dfest}


### Censored regression (tobit)

The exact values of censored responses are only known in an interval defined by
$\text{left}$ and $\text{right}$. Observation outside this interval are
mapped to the interval limits

$$
  y = \begin{cases} \text{left} & y^* \le \text{left} \\
    y^* & \text{left} < y^* < \text{right}\\
    \text{right} & y^* \ge \text{right} \end{cases}
$$

The coefficients $\beta$, $\gamma$, and $\delta$
(@eq-musigma and @eq-dfest)) can be estimated by maximizing the
sum over the data set of the log-likelihood function 
$\log(f_{\text{cens}}(y, \mu, \sigma))$, where

$$
  f_{\text{cens}}(y, \mu, \sigma) =  
    \begin{cases} F\left(\frac{\text{left} - \mu}{\sigma}\right)
      & y \le \text{left} \\
    f\left(\frac{y - \mu}{\sigma}\right)
      & \text{left} < y < \text{right} \\
    \left(1 - F\left(\frac{\text{right} - \mu}{\sigma}\right)\right)
      & y \ge \text{right} \end{cases}
$$

$F()$
and $f()$ are the cumulative distribution function and the probability density
function of $D$, respectively. If $D$ is the normal distribution this model is a
heteroscedastic variant of the tobit model [@tobin1958].

### Truncated regression

Truncated responses occur when latent responses below or above some thresholds
are omitted. 

$$
  y = y^*|\text{left} < y^* < \text{right}
$$

Then $y$ follows a truncated distribution with probability density function

$$
  f_{\text{tr}}(y, \mu, \sigma) = 
    \frac{f\left(\frac{y - \mu}{\sigma}\right)}
    {F\left(\frac{\text{right} - \mu}{\sigma}\right) - 
    F\left(\frac{\text{left} - \mu}{\sigma}\right)}
$$

In that case the coefficients $\beta$, $\gamma$, and $\delta$ can be estimated by maximizing the sum over the data set of the log-likelihood function 

$$
  \log(f_{\text{tr}}(y, \mu, \sigma))
$$


## R implementation {#sec-impl}

The models from the previous section can both be fitted with the `crch()` function 
provided by the **crch** package.
This function takes a formula and data, sets up the likelihood function, gradients and Hessian
matrix and uses `optim()` to maximize the likelihood. It returns an
\proglang{S}3 object for which various standard methods are available. We tried
to build an interface as similar to `glm()` as possible to facilitate the
usage.

```{r}
#| eval: false
crch(formula, data, subset, na.action, weights, offset, link.scale = "log", 
  dist = "gaussian", df = NULL, left = -Inf, right = Inf, truncated = FALSE,
  type = "ml", control = crch.control(...),
  model = TRUE, x = FALSE, y = FALSE, ...)
```

Here `formula`, `data`, `na.action`, `weights`, and
`offset` have their standard model frame meanings
[e.g., @chambers1992]. However, as provided in the **Formula**
package [@zeileis2010] `formula` can have two parts separated by `|` where the first
part defines the location model and the second part the scale model. E.g., with
`y ~ x1 + x2 | z1 + z2` the location model is specified by
`y ~ x1 + x2` and the scale model by `~ z1 + z2`.
Known offsets can be specified for the location model by `offset` or
for both, the location and scale model, inside `formula`, i.e., 
`y ~ x1 + x2 + offset(x3) | z1 + z2 + offset(z3)`.

The link function $g(\cdot)$ for the scale model can be specified by
`link.scale`. The default is `"log"`, also supported are
`"identity"` and `"quadratic"`. Furthermore, an arbitrary link
function can be specified by supplying an object of class `"link-glm"`
containing `linkfun`, `linkinv`, `mu.eta`, and `name`.
Furthermore it must contain the second derivative `dmu.deta` if analytical
Hessians are employed.

`dist` specifies the used distribution. Currently supported are
`"gaussian"` (the default), `"logistic"`, and `"student"`. If
`dist = "student"` the degrees of freedom can be set by the `df`
argument. If set to `NULL` (the default) the degrees of freedom are
estimated by maximum likelihood (@eq-dfest).

`left` and `right` define the lower and upper censoring or truncation
points respectively. The logical argument `truncated` defines whether a
censored or truncated model is estimated. Note that also a wrapper function
`trch()` exists that is equivalent to `crch()` but with default
`truncated = TRUE`.

With `type = "ml"` maximum likelihood estimation is carried out with the R function
`optim()` using control options specified in `crch.control()`. By default
the `"BFGS"` method is applied. If no starting values are supplied,
coefficients from `lm()` are used as starting values for the location part.
For the scale model the intercept is initialized with the link function of the
residual standard deviation from `lm()` and the remaining scale coefficients are
initialized with 0. If the degrees of freedom of a \mbox{student-t} distribution are
estimated they are initialized by 10. For the \mbox{student-t} distribution with
estimated degrees of freedom the covariance matrix estimate is derived from the
numerical Hessian returned by `optim()`. For fixed degrees of freedom and
Gaussian and logistic distributions the covariance matrix is derived
analytically. However, by setting `hessian = TRUE` the numerical Hessian
can be employed for those models as well.

As alternative estimation methods `type = "crps"` and/or
`control = crch.boost(...)` are available. With `type = "crps"` the objective
function is the minimum continuous ranked probability score (CRPS) as
a more robust alternative to maximum likelihood. With `control` options set
by `crch.boost()` boosting is used for estimation, i.e., where the iterations
in the fitting of the model are stopped early (offering various criteria for
selecting the number of iterations automatically). Moreover, a convenience
interface `crch.stabsel()` for stability selection based on boosted `crch()`
models is provided.

Finally `model`, `y`, and `x` specify whether the model frame,
response, or model matrix are returned.

The returned model fit of class `"crch"` (and additionally `"crch.boost"` if
the model is fitted by boosting) is a list similar to `"glm"`
objects. Some components like `coefficients` are lists with elements for
location, scale, and degrees of freedom. The package also provides a set of
extractor methods for `"crch"` objects that are listed in
@tbl-methods.

| Function              | Description                                                                   |
|:----------------------|:------------------------------------------------------------------------------|
| `print()`             | Print function call and estimated coefficients. |
| `summary()`           | Standard regression output (coefficient estimates, standard errors, partial Wald tests). Returns an object of class `"summary.crch"` containing summary statistics which has a `print()` method. |
| `coef()`              | Extract model coefficients where `model` specifies whether a single vector containing all coefficients (`"full"`) or the coefficients for the location (`"location"`), scale (`"scale"`) or degrees of freedom (`"df"`) are returned. |
| `vcov()`              | Variance-covariance matrix of the estimated coefficients. |
| `predict()`           | Predictions for new data where `"type"` controls whether location (`"response"`/`"location"`), scale (`"scale"`) or quantiles (`"quantile"`) are predicted. Quantile probabilities are specified by `at`. |
| `fitted()`            | Fitted values for observed data where `"type"` controls whether location (`"location"`) or scale (`"scale"`) values are returned. |
| `residuals()`         | Extract various types of residuals where `type` can be `"standardized"` (default), `"pearson"`, `"response"`, or `"quantile"`. |
| `terms()`             | Extract terms of model components. |
| `logLik()`            | Extract fitted log-likelihood. |

: Functions and methods for objects of class `"crch"`. {#tbl-methods}

Additional to the `crch()` function and corresponding methods the
**crch** package also provides probability density, cumulative distribution,
random number, and quantile functions for censored and truncated normal,
logistic, and student-t distributions. Furthermore it also provides a function
`hxlr()` (heteroscedastic extended logistic regression) to fit
heteroscedastic interval-censored regression models [@messner2013a].

Note that alternative to `crch()` heteroscedastic censored and truncated models
could also be fitted by the R package **gamlss** [@rigby2005] with the add-on packages
**gamlss.cens** and **gamlss.tr**.
However, for the special case of linear censored of truncated regression models with
Gaussian, logistic, or student-t distribution **crch** provides a
fast and convenient interface and various useful methods for analysis and prediction.
Also, **crch** provides the alternative fitting methods minimum CRPS estimation,
boosting, and stability selection.


## Example {#sec-examples}

This section shows a weather forecast example application of censored and truncated regression
models fitted with `crch()`. Weather forecasts are usually based on
numerical weather prediction (NWP) models that take the current state of the
atmosphere and compute future weather by numerically simulating the most
important atmospheric processes. However, because of uncertain initial
conditions and unknown or unresolved processes these numerical predictions are
always subject to errors.  To estimate these errors, many weather centers provide
so called ensemble forecasts: several NWP runs that use different initial
conditions and model formulations. Unfortunately these ensemble forecasts cannot
consider all error sources so that they are often still biased and uncalibrated.
Thus they are often calibrated and corrected for systematic errors by
statistical post-processing.

One popular post-processing method is heteroscedastic linear regression where
the ensemble mean is used as regressor for the location and the ensemble
standard deviation or variance is used as regressor for the scale
[e.g., @gneiting2005]. Because not all meteorological variables can be
assumed to be normally distributed this idea has also been extended to other
distributions including truncated regression for wind
[@thorarinsdottir2010] and censored regression for wind power
[@messner2014b] or precipitation [@messner2014].

The following example applies heteroscedastic censored regression with a
logistic distribution assumption to precipitation data in Innsbruck (Austria).
Furthermore, a two-part model tests whether the occurrence of precipitation and
the precipitation amount are driven by the same process.

First, the **crch** package is loaded together with an included precipitation
data set with forecasts and observations for Innsbruck (Austria)

```{r}
library("crch")
data("RainIbk", package = "crch")
```

The `data.frame` `RainIbk` contains observed 3 day-accumulated
precipitation amounts (`rain`) and the corresponding 11 member ensemble
forecasts of total accumulated precipitation amount between 5 and 8 days in
advance (`rainfc.1`, `rainfc.2`, $\ldots$ `rainfc.11`). The
rownames are the end date of the 3 days over which the precipitation amounts are
accumulated respectively; i.e., the respective forecasts are issued 8 days
before these dates.

In previous studies it has been shown that it is of advantage to model the
square root of precipitation rather than precipitation itself. Thus all
precipitation amounts are square rooted before ensemble mean and standard
deviation are derived. Furthermore, events with no variation in the ensemble
are omitted:

```{r}
RainIbk <- sqrt(RainIbk)
RainIbk$ensmean <- apply(RainIbk[,grep('^rainfc',names(RainIbk))], 1, mean)
RainIbk$enssd <- apply(RainIbk[,grep('^rainfc',names(RainIbk))], 1, sd)
RainIbk <- subset(RainIbk, enssd > 0)
```

A scatterplot of `rain` against `ensmean` 

```{r}
#| eval: false
plot(rain ~ ensmean, data = RainIbk, pch = 19, col = gray(0, alpha = 0.2))
abline(0, 1, col = 2)
```

indicates a linear relationship that differs from a 1-to-1 relationship
(@fig-scatter). Precipitation is clearly non-negative with many zero
observations. Thus censored regression or a two-part model are suitable
to estimate this relationship.

First we fit a logistic censored model for `rain` with `ensmean` as
regressor for the location and `log(enssd)` as regressor for the scale.

```{r}
CRCH <- crch(rain ~ ensmean | log(enssd), data = RainIbk, left = 0, 
  dist = "logistic")
summary(CRCH)
```

Both, `ensmean` and `log(enssd)` are highly significant according to the
Wald test performed by the `summary()` method. The location model is also
shown in @fig-scatter:

```{r}
#| eval: false
abline(coef(CRCH)[1:2], col = 4)
```

```{r}
#| fig-width: 9
#| fig-height: 5
#| out-width: 100%
#| echo: false
#| label: fig-scatter
#| fig-cap: "Square rooted precipitation amount against ensemble mean forecasts. A line with intercept 0 and slope 1 is shown in red and the censored regression fit in blue."
plot(rain~ensmean, data = RainIbk, pch = 19, col = gray(0, alpha = 0.2))
abline(0,1, col = 2, lwd = 2)
abline(coef(CRCH)[1:2], col = 4, lwd = 2)
```

If we compare this model to a constant scale model (tobit model with logistic
distribution)

```{r}
CR <- crch(rain ~ ensmean, data = RainIbk, left = 0, dist = "logistic")
cbind(AIC(CR, CRCH), BIC = BIC(CR, CRCH)[,2])
```

we see that the scale model clearly improves the fit regarding AIC and BIC.

A comparison of the logistic model with a Gaussian and a student-t model

```{r}
CRCHgau <- crch(rain ~ ensmean | log(enssd), data = RainIbk, left = 0, 
  dist = "gaussian")
CRCHstud <- crch(rain ~ ensmean | log(enssd), data = RainIbk, left = 0, 
  dist = "student")
AIC(CRCH, CRCHgau, CRCHstud)
```

confirms the logistic distribution assumption. Note, that with the estimated
degrees of freedom of `r round(exp(tail(coef(CRCHstud), 1)), 2)` the
student-t distribution resembles the (scaled) logistic distribution quite well
(see @fig-dist). 

```{r}
#| fig-width: 9
#| fig-height: 5
#| out-width: 100%
#| echo: false
#| label: fig-dist
#| fig-cap: "Probability density functions of a student-t distribution with 9.56 degrees of freedom, a logistic, and a normal distribution. The densities of the logistic and normal distribution are scaled to facilitate comparison."
a <- seq(-5, 5, 0.01)
df <- exp(tail(coef(CRCHstud), 1))
#par(mfrow = c(1,2))
plot(a, dt(a, df), type = "l", xlab = "", ylab = "density", main = "Probability density function", lwd = 2)
lines(a, dlogis(a*dt(0, df)/dlogis(0))*dt(0, df)/dlogis(0), col = 2, lwd = 2, lty = 2)
lines(a, dnorm(a*dt(0, df)/dnorm(0))*dt(0, df)/dnorm(0), col = 4, lwd = 1, lty = 1)

#plot(a, pt(a, df), type = "l", xlab = "", ylab = "density", main = "Cumulative distribution function", lwd = 2)
#lines(a, plogis(a*dt(0, df)/dlogis(0)), col = 2, lwd = 2, lty = 2)
legend("topright", lwd = c(2,2,1), lty = c(1,2,1), col = c(1,2,4), c("student-t", "scaled logistic", "scaled normal"), bty = "n")
```

In the censored model the occurrence of precipitation and precipitation amount
are assumed to be driven by the same process. To test this assumption
we compare the censored model with a two-part model consisting of a
heteroscedastic logit model and a truncated regression model with logistic
distribution assumption. For the heteroscedastic logit model we use
`hetglm()` from the **glmx** package and for the truncated model we
employ the `crch()` function with the argument `truncated = TRUE`.

```{r}
library("glmx")
BIN <- hetglm(I(rain > 0) ~ ensmean | log(enssd), data = RainIbk,
  family = binomial(link = "logit"))
TRCH <- crch(rain~ensmean | log(enssd), data = RainIbk, subset = rain > 0, 
  left = 0, dist = "logistic", truncated = TRUE)
```

In the heteroscedastic logit model, the intercept of the scale model is not
identified. Thus, the location coefficients of the censored and truncated
regression models have to be scaled to compare them with the logit model.

```{r}
cbind("CRCH" = c(coef(CRCH, "location")/exp(coef(CRCH, "scale"))[1], 
    coef(CRCH, "scale")[2]), 
  "BIN" = coef(BIN), 
  "TRCH" = c(coef(TRCH, "location")/exp(coef(TRCH, "scale"))[1], 
    coef(TRCH, "scale")[2]))
```

The different (scaled) coefficients indicate that different processes drive the
occurrence of precipitation and precipitation amount.  This is also confirmed by AIC and
BIC that are clearly better for the two-part model than for the censored model:

```{r}
loglik <- c("Censored" = logLik(CRCH), "Two-Part" = logLik(BIN) + logLik(TRCH))
df <- c(4, 7)
aic <- -2 * loglik + 2 * df
bic <- -2 * loglik + log(nrow(RainIbk)) * df
cbind(df, AIC = aic, BIC = bic)
```

Finally, we can use the fitted models to predict future precipitation. Therefore
assume that the current NWP forecast of square rooted precipitation has an
ensemble mean of 1.8 and an ensemble standard deviation of 0.9. A median
precipitation forecast of the censored model can then easily be computed with

```{r}
newdata <- data.frame(ensmean = 1.8, enssd = 0.9)
predict(CRCH, newdata, type = "quantile", at = 0.5)^2
```

Note, that the prediction has to be squared since all models fit the square root
of precipitation. In the two-part model the probability to stay below a
threshold $q$ is composed of 

$$
  P(y \le q) = 1-P(y > 0) + P(y>0) \cdot P(y \le q|y>0)
$$

Thus median precipitation equals the $(P(y > 0) - 0.5)/P(y>0)$-quantile of the truncated distribution.
```{r}
p <- predict(BIN, newdata)
predict(TRCH, newdata, type = "quantile", at = (p - 0.5)/p)^2
```

Probabilities to exceed, e.g., 5mm can be predicted with cumulative distribution
functions (e.g., `pclogis()`, `ptlogis()`) that are also provided in
the **crch** package. 

```{r}
mu <- predict(CRCH, newdata, type = "location")
sigma <- predict(CRCH, newdata, type = "scale")
pclogis(sqrt(5), mu, sigma, lower.tail = FALSE, left = 0)

mu <- predict(TRCH, newdata, type = "location")
sigma <- predict(TRCH, newdata, type = "scale")
p * ptlogis(sqrt(5), mu, sigma, lower.tail = FALSE, left = 0)
```

Note, that `pclogis()` could also be replaced by `plogis()` since they
are equivalent between $\text{left}$ and $\text{right}$.

Clearly, other types of model misspecification or model generalization
(depending on the point of view) for the classical tobit model are possible. In
addition to heteroscedasticity, the type of response distribution, and the
presence of hurdle effects as explored in the application here, further aspects
might have to be addressed by the model.  Especially in economics and the
social sciences sample selection effects might be present in the two-part model
which can be addressed (in the homoscedastic normal case) using the R packages
**sampleSelection** [@toomet2008] or **mhurdle**
[@croissant2013]. Furthermore, the scale link function or potential
nonlinearities in the regression functions could be assessed, e.g., using the
**gamlss** suite of packages [@stasinopoulos2007]. 


## Summary {#sec-summary}

Censored and truncated response models are common in econometrics and other
statistical applications. However, often the homoscedasticity assumption of
these models is not fulfilled. This paper presented the **crch** package 
that provides functions to fit censored or truncated regression models
with conditional heteroscedasticity.
It supports Gaussian, logistic or student-t distributed censored or truncated 
responses and provides various convenient methods for analysis and prediction.
To illustrate the package we showed that heteroscedastic censored and truncated models 
are well suited to improve precipitation forecasts.


## References {.unnumbered}