---
output:
  pdf_document:
    citation_package: natbib
    latex_engine: pdflatex
    toc: true
    toc_depth: 2
    includes:
        in_header: vignette_head.tex
    keep_tex: true
title: "Robust Standard Errors in Small Samples"
author: "Michal Kolesár"
date: "`r format(Sys.time(), '%B %d, %Y')`"
geometry: margin=1in
fontfamily: mathpazo
bibliography: library.bib
fontsize: 11pt
vignette: >
  %\VignetteIndexEntry{dfadjust}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include=FALSE, cache=FALSE}
library("knitr")
knitr::opts_knit$set(self.contained = FALSE)
knitr::opts_chunk$set(tidy = TRUE, collapse=TRUE, comment = "#>",
                      tidy.opts=list(blank=FALSE, width.cutoff=55))
oldoptions <- options(digits=3)
```

# Description[^1]

This package implements small-sample degrees of freedom adjustments to robust
and cluster-robust standard errors in linear regression, as discussed in
@ImKo16. The implementation can handle models with fixed effects, and cases
with a large number of observations or clusters.

[^1]: We thank Bruce Hansen for comments and Ulrich Müller for suggesting to us a version of Lemma 2 below.

```{r setup}
library(dfadjust)
```

To give some examples, let us construct an artificial dataset with 11 clusters:
```{r}
set.seed(7)
d1 <- data.frame(y=rnorm(1000), x1=c(rep(1, 3), rep(0, 997)),
                 x2=c(rep(1, 150), rep(0, 850)),
                 x3=rnorm(1000),
                 cl=as.factor(c(rep(1:10, each=50), rep(11, 500))))
```

Let us first run a regression of `y` on `x1`. This is a case in which, in spite
of the moderate sample size, the effective number of observations is small,
since there are only three treated units:

```{r}
r1 <- lm(y~x1, data=d1)
## No clustering
dfadjustSE(r1)
```

We can see that the usual robust standard errors (`HC1 se`) are much smaller
than the effective standard errors (`Adj. se`), which are computed by taking the
`HC2` standard errors and applying a degrees of freedom adjustment.

Now consider a cluster-robust regression of `y` on `x2`. There are
only three treated clusters, so the effective number of observations is again small:

```{r}
r1 <- lm(y~x2, data=d1)
# Default Imbens-Kolesár method
dfadjustSE(r1, clustervar=d1$cl)
# Bell-McCaffrey method
dfadjustSE(r1, clustervar=d1$cl, IK=FALSE)
```

Now, let us run a regression of `y` on `x3`, with fixed effects. Since we're
only interested in `x3`, we specify that we only want inference on the second
element (the first one being the intercept):

```{r}
r1 <- lm(y~x3+cl, data=d1)
dfadjustSE(r1, clustervar=d1$cl, ell=2)
dfadjustSE(r1, clustervar=d1$cl, ell=2, IK=FALSE)
```

Finally, consider an example in which the clusters are large; here we have 500,000 observations:
```{r}
d2 <- do.call("rbind", replicate(500, d1, simplify = FALSE))
d2$y <- rnorm(length(d2$y))
r2 <- lm(y~x2, data=d2)
summary(r2)
# Default Imbens-Kolesár method
dfadjustSE(r2, clustervar=d2$cl)
# Bell-McCaffrey method
dfadjustSE(r2, clustervar=d2$cl, IK=FALSE)
```

# Methods

This section describes the implementation of the @ImKo16 and @BeMc02 degrees of
freedom adjustments.

There are $S$ clusters, and we observe $n_{s}$ observations in cluster $s$, for
a total of $n=\sum_{s=1}^{S}n_{s}$ observations. We handle the case with
independent observations by letting each observation be in its own cluster, with
$S=n$. Consider the linear regression
of a scalar outcome $Y_{i}$ onto a $p$-vector of regressors $X_{i}$,
\begin{equation*}
  Y_{i}=X_{i}'\beta+u_{i},\qquad E[u_{i}\mid X_{i}]=0.
\end{equation*}
We're interested in inference on $\ell'\beta$ for some fixed vector
$\ell\in\mathbb{R}^{p}$. Let $X$, $u$, and $Y$ denote the design matrix, and
error and outcome vectors, respectively. For any $n\times k$ matrix $M$, let
${M}_{s}$ denote the $n_{s}\times k$ block corresponding to cluster $s$, so
that, for instance, $Y_{s}$ corresponds to the outcome vector in cluster $s$.
For a positive semi-definite matrix ${M}$, let ${M}^{1/2}$ be a matrix
satisfying ${{M}^{1/2}}'{M}^{1/2}={M}$, such as its symmetric square root or its
Cholesky decomposition.
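
In `R`, for instance, the upper-triangular factor returned by `chol` satisfies
exactly this convention (a quick check, not part of the package code):

```{r}
## Quick check: chol(M) is one matrix square root in the sense used here,
## since t(chol(M)) %*% chol(M) equals M.
M <- crossprod(matrix(rnorm(9), 3, 3))   # a positive definite matrix
all.equal(t(chol(M)) %*% chol(M), M)
```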

Assume that
\begin{equation*}
  E[u_{s}u_{s}'\mid X]=\Omega_{s},\quad\text{and}\quad
  E[u_{s}u_{t}'\mid X]=0\quad\text{if $s\neq t$}.
\end{equation*}
Denote the conditional variance matrix of $u$ by $\Omega$, so that $\Omega_{s}$
is the block of $\Omega$ corresponding to cluster $s$. We estimate $\ell'\beta$
using OLS. In `R`, the OLS estimator is computed via a QR decomposition, $X=QR$,
where $Q' Q=I$ and $R$ is upper-triangular, so we can write the estimator as
\begin{equation*}
  \ell'\hat{\beta}=\ell'\left(\sum_{s}X_{s}'X_{s}\right)^{-1}\sum_{s}X_{s}'Y_{s}
  =\tilde{\ell}'\sum_{s}Q_{s}'Y_{s},\qquad \tilde{\ell}={R^{-1}}'\ell.
\end{equation*}
It has variance
\begin{equation*}
  V:=  \var(\ell'\hat{\beta}\mid X)
  =\ell'\left(X' X\right)^{-1}
  \sum_{s}X_{s}'\Omega_{s}X_{s}\left(X' X\right)^{-1}\ell
  =\tilde{\ell}'\sum_{s}Q_{s}'\Omega_{s}Q_{s}
  \tilde{\ell}.
\end{equation*}
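
As a quick sanity check (a sketch, not the package's internal code), we can
verify the QR-based expression for $\ell'\hat{\beta}$ for the regression of `y`
on `x2` from the examples above; we assume a full-rank design and ignore the
column pivoting that `qr()` may perform:

```{r}
## Minimal sketch (not part of the package): ell' beta-hat via the QR
## decomposition, for the regression of y on x2 above.
X <- cbind(1, d1$x2)                              # design matrix
ell <- c(0, 1)                                    # inference on the x2 coefficient
qrX <- qr(X)
Q <- qr.Q(qrX)
R <- qr.R(qrX)
ell.tilde <- backsolve(R, ell, transpose = TRUE)  # solves R' ell.tilde = ell
drop(crossprod(ell.tilde, crossprod(Q, d1$y)))    # ell' beta-hat via the QR form
drop(ell %*% coef(lm(y ~ x2, data = d1)))         # the same quantity via lm()
```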

## Variance estimate

We estimate $V$ using a variance estimator that generalizes the HC2 variance
estimator to clustering. Relative to the LZ2 estimator described in @ImKo16, we use a
slight modification that allows for fixed effects:
\begin{equation*}
  \hat{V}=\ell'(X' X)^{-1}\sum_{s}^{}X'_{s}{A}_{s}\hat{u}_{s}\hat{u}_{s}'
  {A}_{s}'X_{s}(X' X)^{-1}\ell
  =\ell' R^{-1}\sum_{s}^{}Q'_{s}{A}_{s}\hat{u}_{s}\hat{u}_{s}'
  {A}_{s}'Q_{s}{R'}^{-1}\ell
  =\sum_{s=1}^{S}(\hat{u}_{s}'a_{s})^{2},
\end{equation*}
where
\begin{equation*}
  \hat{u}_{s}:=Y_{s}-X_{s}\hat{\beta}
  =u_{s}-Q_{s}Q' u,\qquad
   a_{s}={A}_{s}'Q_{s}\tilde{\ell},
\end{equation*}

and $A_s$ is a generalized inverse of the symmetric square root of $I-Q_s Q_s'$,
the block of $I-QQ'$ corresponding to cluster $s$. In the presence of
cluster-specific fixed effects, $I-Q_s Q_s'$ is not generally invertible, which
necessitates taking a generalized inverse. So long as the vector $\ell$ doesn't
load on these fixed effects, $\hat{V}$ will be unbiased under homoskedasticity,
as the next result, which slightly generalizes Theorem 1 in @PuTi18, shows.

\begin{lemma}
  Suppose that $X=(W,L)$ is full rank, and suppose that the vector $\ell$ loads only on
  elements of $W$. Let $\ddot{W}$ denote the residual from projecting $W$ onto
  $L$, and suppose that for each cluster $s$, (i) $L_s'\ddot{W}_s=0$ and that
  (ii) $\sum_{k=1}^S I(k\neq s)\ddot{W}_k'\ddot{W}_k$ is full rank. Then $\hat{V}$ is
  unbiased under homoskedasticity.
\end{lemma}

The proof is given in the last section. By definition of projection, $L$ and
$\ddot{W}$ are orthogonal. Condition (i) of the lemma strengthens this
requirement to orthogonality within each cluster. It holds if $L$ corresponds to
a vector of cluster fixed effects, or more generally if $L$ contains
cluster-specific variables. Condition (ii) ensures that after partialling out
$L$, it is feasible to run leave-one-cluster-out regressions. Without
clustering, the condition is equivalent to the requirement that the partial
leverages associated with $\ddot{W}$ are smaller than one.[^2]

[^2]: To see this, let $H=\ddot{W}(\ddot{W}'\ddot{W})^{-1}\ddot{W}'$ denote the partial projection matrix. Since $H=H^{2}$, \begin{equation*}H_{ii}-H_{ii}^{2}=\sum_{j\neq i}H_{i j}H_{j i}=\ddot{W}_{i}'(\ddot{W}'\ddot{W})^{-1} \left[\sum_{j\neq i}\ddot{W}_{j}\ddot{W}_{j}'\right] (\ddot{W}'\ddot{W})^{-1}\ddot{W}_{i},\end{equation*} so that $H_{ii}=1$ iff $\sum_{j\neq i}\ddot{W}_{j}\ddot{W}_{j}'$ has reduced rank.

If the observations are independent, the vector of leverages $(Q_{1}'Q_{1},
\dotsc, Q_{n}'Q_{n})$ can be computed directly using the `stats::hatvalues`
function. In this case, we compute $A_{i}=1/\sqrt{1-Q_{i}'Q_{i}}$ from the
leverages, and then $a_{i}=A_{i}Q_{i}'\tilde{\ell}$ using vector operations.
With clustering, computing $A_{s}$ as a generalized inverse of the symmetric
square root of $I-Q_{s}Q_{s}'$ can be expensive or even infeasible if the
cluster size $n_s$ is large. We therefore use the following result, which
allows us to compute $a_{s}$ from a spectral decomposition of a $p\times p$
matrix.

\begin{lemma}
  Let $Q_{s}'Q_{s}=\sum_{i=1}^{p}\lambda_{is}r_{is}r_{is}'$ be the spectral
  decomposition of $Q_{s}'Q_{s}$. Then $a_s=Q_{s}D_{s}\tilde{\ell}$, where
  $D_{s}=\sum_{i\colon \lambda_{is}\neq
    1}(1-\lambda_{is})^{-1/2}r_{is}r_{is}'$.
\end{lemma}

The lemma follows from the fact that $I-Q_{s}Q_{s}'$ has eigenvalues
$1-\lambda_{is}$ and eigenvectors $Q_{s}r_{is}$. More precisely, let
$Q_{s}=\sum_{i}\lambda_{is}^{1/2}u_{is}r_{is}'$ denote the singular value
decomposition of $Q_{s}$, so that
$I-Q_{s}Q_{s}'=\sum_{i}(1-\lambda_{is})u_{is}u_{is}'$, and we can take
$A_{s}=\sum_{i\colon \lambda_{is}\neq 1}(1-\lambda_{is})^{-1/2}u_{is}u_{is}'$.
Then,
\begin{equation*}
  A_{s}'Q_{s}=\sum_{i\colon \lambda_{is}\neq 1}
  (1-\lambda_{is})^{-1/2}\lambda_{is}^{1/2} u_{is} r_{is}'
  =Q_{s}\sum_{i\colon \lambda_{is}\neq 1}(1-\lambda_{is})^{-1/2}r_{is} r_{is}',
\end{equation*}
where the second equality uses $Q_{s}r_{is}=\lambda_{is}^{1/2}u_{is}$.
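
To illustrate Lemma 2, the following sketch (again not the package's internal
code; it reuses `Q` and `ell.tilde` from the chunk above and applies a simple
numerical tolerance to drop unit eigenvalues) computes $a_{s}$ for the
regression of `y` on `x2` with clustering on `cl`, and then evaluates
$\hat{V}$:

```{r}
## Minimal sketch (not the package internals): a_s via Lemma 2 and V-hat for
## the regression of y on x2, clustering on cl.
idx <- split(seq_len(nrow(Q)), d1$cl)              # row indices by cluster
a_list <- lapply(idx, function(i) {
    Qs <- Q[i, , drop = FALSE]
    ev <- eigen(crossprod(Qs), symmetric = TRUE)   # spectral decomposition of Qs'Qs
    keep <- ev$values < 1 - 1e-7                   # drop eigenvalues equal to one
    Ds <- ev$vectors[, keep, drop = FALSE] %*%
        ((1 - ev$values[keep])^(-1/2) * t(ev$vectors[, keep, drop = FALSE]))
    drop(Qs %*% Ds %*% ell.tilde)                  # a_s = Q_s D_s ell-tilde
})
uhat <- residuals(lm(y ~ x2, data = d1))
Vhat <- sum(vapply(seq_along(idx),
                   function(s) sum(uhat[idx[[s]]] * a_list[[s]])^2, numeric(1)))
sqrt(Vhat)   # should match the cluster-robust HC2 se for x2 reported above
```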


## Degrees of freedom correction

Let $G$ be the $n\times S$ matrix whose $s$th column is $(I-QQ')_{s}'a_{s}$. Then the
@BeMc02 adjustment sets the degrees of freedom to
\begin{equation*}
  f_{\text{BM}}=\frac{\trace(G' G)^{2}}{\trace((G' G)^{2})}.
\end{equation*}
Since $(G' G)_{st}=a_{s}'(I-QQ')_{s}(I-QQ')_{t}'a_{t}=a_{s}'(\1{s=t}I-Q_{s}Q_{t}')a_{t}$,
the matrix $G' G$ can be efficiently computed as
\begin{equation*}
  G' G=\diag(a_{s}'a_{s})-BB',\qquad\text{where}\quad B_{s k}=a_{s}'Q_{s k}.
\end{equation*}
Note that $B$ is an $S\times p$ matrix, so that computing the degrees of freedom
adjustment only involves $p\times p$ matrices:
\begin{align*}
  f_{\text{BM}}=\frac{(\sum_{s}a_{s}'a_{s}-\sum_{s,k}B_{s k}^{2})^{2}}{
  \sum_{s}(a_{s}'a_{s})^{2}-2\sum_{s,k}(a_{s}'a_{s})B_{s k}^{2}+\sum_{s,t}(B_{s}'B_{t})^{2}
  }.
\end{align*}
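
Continuing the clustered sketch from above (and again not the package
internals), the same quantities give $f_{\text{BM}}$ directly:

```{r}
## Minimal sketch: f_BM for the clustered regression of y on x2, reusing Q,
## idx, and a_list from the sketch above.
p <- ncol(Q)
B <- t(vapply(seq_along(idx),
              function(s) drop(crossprod(a_list[[s]], Q[idx[[s]], , drop = FALSE])),
              numeric(p)))                               # B_{sk} = a_s' Q_{sk}
aa <- vapply(a_list, function(a) sum(a^2), numeric(1))   # a_s' a_s
num <- (sum(aa) - sum(B^2))^2
den <- sum(aa^2) - 2 * sum(aa * rowSums(B^2)) + sum(crossprod(B)^2)
num / den   # should match the BM df for x2 reported above (IK=FALSE)
```
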
If the observations are independent, we compute $B$ directly as `B <- a*Q`,
and since $a_{i}$ is a scalar, we have
\begin{equation*}
  f_{\text{BM}}=\frac{(\sum_{i}a_{i}^{2}-\sum_{i,k}B_{i k}^{2})^{2}}{
    \sum_{i}a_{i}^{4}-2\sum_{i}a_{i}^{2}B_{i}'B_{i}+\sum_{i, j}(B_{i}'B_{j})^{2}}.
\end{equation*}
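
As an illustration for the independent case, here is a sketch (not the package
code) that computes $f_{\text{BM}}$ for the regression of `y` on `x1` using
`stats::hatvalues`:

```{r}
## Minimal sketch: f_BM for the regression of y on x1, treating observations
## as independent.
r0 <- lm(y ~ x1, data = d1)
qr0 <- qr(model.matrix(r0))
Q0 <- qr.Q(qr0)
lt0 <- backsolve(qr.R(qr0), c(0, 1), transpose = TRUE)   # ell-tilde
a0 <- drop(Q0 %*% lt0) / sqrt(1 - hatvalues(r0))         # a_i = Q_i' ell-tilde / sqrt(1 - h_i)
B0 <- a0 * Q0                                            # B_{ik} = a_i Q_{ik}
num0 <- (sum(a0^2) - sum(B0^2))^2
den0 <- sum(a0^4) - 2 * sum(a0^2 * rowSums(B0^2)) + sum(crossprod(B0)^2)
num0 / den0   # should match the df reported above for the y ~ x1 regression
```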

The @ImKo16 degrees of freedom adjustment instead sets
\begin{equation*}
  f_{IK}=\frac{\trace({G}'\hat{\Omega}
    G)^{2}}{\trace(({G}'\hat{\Omega}
    G)^{2})},
\end{equation*}
where $\hat{\Omega}$ is an estimate of the covariance matrix based on the
@moulton86 model, under which
$\Omega_{s}=\sigma_{\epsilon}^{2}I_{n_{s}}+\rho\iota_{n_{s}}\iota_{n_{s}}'$.
Using simple algebra, one can show that in this case,
\begin{equation*}
  G'\Omega G=\sigma_{\epsilon}^{2}\diag(a_{s}'a_{s}) -\sigma_{\epsilon}^{2}BB'+\rho (D-BF')(D-BF')',
\end{equation*}
where $F$ and $D$ are the $S\times p$ and $S\times S$ matrices
\begin{equation*}
 F_{s k}=\iota_{n_{s}}'Q_{s k},\qquad D=\diag(a_{s}'\iota_{n_{s}}),
\end{equation*}
which can again be computed even if the clusters are large. The estimate
$\hat{\Omega}$ replaces $\sigma_{\epsilon}^{2}$ and $\rho$ with analog
estimates.
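
To make the formula concrete, the sketch below reuses `Q`, `idx`, `a_list`,
`aa`, and `B` from the chunks above and evaluates $G'\Omega G$ and the implied
degrees of freedom at placeholder values of $\sigma_{\epsilon}^{2}$ and $\rho$;
the package estimates these parameters from the residuals, so the values here
are illustrative only:

```{r}
## Illustration only: G' Omega G and f_IK under the Moulton model. sigma2 and
## rho are placeholder values, NOT the package's estimates.
Fmat <- t(vapply(idx, function(i) colSums(Q[i, , drop = FALSE]), numeric(ncol(Q))))
D <- diag(vapply(a_list, sum, numeric(1)))                # D = diag(a_s' iota)
sigma2 <- 1                                               # placeholder value
rho <- 0.1                                                # placeholder value
GOG <- sigma2 * (diag(aa) - tcrossprod(B)) + rho * tcrossprod(D - tcrossprod(B, Fmat))
sum(diag(GOG))^2 / sum(GOG^2)                             # f_IK at these parameter values
```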

```{r cleanup, include=FALSE}
options(oldoptions)
```

# Proof of Lemma 1

The estimator of the block of $V$ associated with $W$ implied by $\hat{V}$ is
given by
\begin{equation*}
  (\ddot{W}'\ddot{W})^{-1}  \sum_{s}\ddot{W}_{s}'A_{s} (I-QQ')_{s} u u'(I-QQ')_{s}'A_{s}'\ddot{W}_{s}
  (\ddot{W}'\ddot{W})^{-1},
\end{equation*}
which is unbiased under homoskedasticity if for each $s$,
\begin{equation}\label{equation:1}
  \ddot{W}_{s}'A_{s}(I-Q_{s}Q_{s}')A_{s}'\ddot{W}_{s}
  = \ddot{W}_{s}'\ddot{W}_{s}.
\end{equation}
We will show that \eqref{equation:1} holds. To this end, we first claim that under
conditions (i) and (ii), $\ddot{W}_{s}$ is in the column space of
$I-Q_{s}Q_{s}'$ (a claim that's trivial if this matrix is full rank). Decompose
$I-QQ'=I-H_{\ddot{W}}-H_{L}$, where $H_{\ddot{W}}$ and $H_{L}$ are hat matrices
associated with $\ddot{W}$ and $L$. The block associated with cluster $s$ can
thus be written as
$I-Q_{s}Q_{s}'=I-L_{s}(L' L)^{-1}L_{s}'-\ddot{W}_{s}(\ddot{W}'\ddot{W})^{-1}\ddot{W}_{s}'$.
Let
$B_{s}=\ddot{W}_{s}(\ddot{W}'\ddot{W}-\ddot{W}_{s}'\ddot{W}_{s})^{-1}\ddot{W}'\ddot{W}$,
which is well-defined under condition (ii). Then, using condition (i), we get
\begin{equation*}
\begin{split}
  (I-Q_{s}Q_{s}')B_{s}
  &=(I-\ddot{W}_{s}(\ddot{W}'\ddot{W})^{-1}\ddot{W}_{s}')B_{s}\\
  &=\ddot{W}_{s}(I-(\ddot{W}'\ddot{W})^{-1}\ddot{W}_{s}'\ddot{W}_{s})
  (\ddot{W}'\ddot{W}-\ddot{W}_{s}'\ddot{W}_{s})^{-1}\ddot{W}'\ddot{W}=\ddot{W}_{s},
\end{split}
\end{equation*}
proving the claim. Letting $C$ denote the symmetric square root of
$I-Q_{s}Q_{s}'$, the left-hand side of \eqref{equation:1} can therefore be written as
\begin{equation*}
  \ddot{W}_{s}'A_{s}(I-Q_{s}Q_{s}')A_{s}'\ddot{W}_{s}
  =  B_{s}'CC A_{s} CC A_{s}' CC B_{s}=\ddot{W}_{s}'\ddot{W}_{s},
\end{equation*}
where the second equality follows by the definition of a generalized inverse.




# References