---
title: "Vignette: Robust Horvitz-Thompson Estimator"
author: "Beat Hulliger and Tobias Schoch"
output:
    html_document:
        css: "fluent.css"
        highlight: tango
vignette: >
  %\VignetteIndexEntry{Robust Horvitz-Thompson Estimator}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "",
    prompt = TRUE,
    dpi = 36,
    fig.align = "center"
)
```

```{css, echo = FALSE}
.my-sidebar-orange {
    padding-left: 1.5rem;
    padding-top: 0.5rem;
    padding-bottom: 0.25rem;
    margin-top: 1.25rem;
    margin-bottom: 1.25rem;
    border: 1px solid #eee;
    border-left-width: 0.75rem;
    border-radius: .25rem;
    border-left-color: #ce5b00;
}

.my-sidebar-blue {
    padding-left: 1.5rem;
    padding-top: 0.5rem;
    padding-bottom: 0.25rem;
    margin-top: 1.25rem;
    margin-bottom: 1.25rem;
    border: 1px solid #eee;
    border-left-width: 0.75rem;
    border-radius: .25rem;
    border-left-color: #1f618d;
}
```

## Outline

In this vignette, we discuss the robust Horvitz-Thompson (RHT) estimator of
[Hulliger](#biblio) (1995, 1999). The vignette is organized as follows.

 - 1 Workplace data
 - 2 Robust Horvitz-Thompson estimator
 - References

<div class="my-sidebar-blue">
<p style="color: #1f618d;">
**Good to know.**
</p>
<p>The RHT estimating method is available in two "flavors":

 - bare-bone method
 - survey method

Bare-bone methods are stripped-down versions of the survey methods in terms of
functionality and informativeness. These functions may serve users and package
developers as building blocks. In particular, bare-bone functions cannot
compute variances.

See Vignette "Basic Robust Estimators" to learn more about other robust
estimators (trimming, winsorization, etc.).
</p>
</div>

First, we load the namespace of the package `robsurvey` and attach it to the
search path.

```{r}
library("robsurvey", quietly = TRUE)
```

The argument `quietly = TRUE` suppresses the start-up message in the call of
`library("robsurvey")`.

## 1. Workplace Data

The `workplace` sample consists of payroll data on n = 142 workplaces or
business establishments (with paid employees) in the retail sector of a
Canadian province.

 - The data display similar characteristics to the original 1999 Canadian
   Workplace and Employee Survey (WES), however, the `workplace` data are not
   those collected by Statistics Canada but have been generated by
   [Fuller](#biblio) (2009, Example 3.1.1, Table 6.3).
 - The sampling design is stratified by industry, geographic region, and size
   (size is defined using estimated employment). A sample of workplaces was
   then drawn independently in each stratum using simple random sample without
   replacement (sample size is determined by Neyman allocation).

The original weights of WES were about 2200 for the stratum of small
workplaces, about 750 for medium-sized, and about 35 for large workspaces.
Several strata containing very large workplaces were sampled exhaustively; see
[Patak et al](#biblio). (1998).

```{r}
attach(workplace)
```

The variable of interest is `payroll`, and the goal is to estimate the
population payroll total in the retail sector (in Canadian dollars).

```{r}
head(workplace, 3)
```

### 1.1 Survey design object

In order to use the **survey methods** (not bare-bone methods), we must
**attach** the `survey` package ([Lumley](#biblio), 2010, 2021) to the search
path

```{r, eval = FALSE}
library("survey")
```

```{r, echo = FALSE}
suppressPackageStartupMessages(library(survey))
```

and specify a survey or sampling design object

```{r, eval = FALSE}
dn <- svydesign(ids = ~ID, strata = ~strat, fpc = ~fpc, weights = ~weight,
                data = workplace, calibrate.formula = ~-1 + strat)
```

```{r, echo = FALSE}
dn <- if (packageVersion("survey") >= "4.2") {
        svydesign(ids = ~ID, strata = ~strat, fpc = ~fpc, weights = ~weight,
                  data = workplace, calibrate.formula = ~-1 + strat)
    } else {
        svydesign(ids = ~ID, strata = ~strat, fpc = ~fpc, weights = ~weight,
                  data = workplace)
    }
```

**Note.** Since **version 4.2**, the **survey** package allows the definition
of pre-calibrated weights (see argument `calibrate.formula` of the function
`svydesign()`). This vignette uses this functionality (in some places). If you
have installed an earlier version of the `survey` package, this vignette will
automatically fall back to calling `svydesign()` without the calibration
specification. See vignette [Pre-calibrated
weights](https://CRAN.R-project.org/package=survey/vignettes/precalibrated.pdf)
of the `survey` package to learn more.

```{r, echo = FALSE, results = "asis"}
survey_version <- packageVersion("survey")
if (survey_version < "4.2") {
cat(paste0('<div class="my-sidebar-orange">\n
<p style="color: #ce5b00;">
**IMPORTANT: PRE-CALIBRATED WEIGHTS ARE NOT SUPPORTED**
</p>
This vignette has been built with version **', survey_version,
'** of the **survey** package. Therefore, `svydesign()` is called without
the `calibrate.formula` argument. As a consequence, some of the variance and
standard error estimates may differ from those with pre-calibrated weights,
i.e., the default specification.<p>
</p>
</div>'))
}
```

### 1.2 Exploring the data

To get a first impression of the distribution of `payroll`, we examine two
(design-weighted) boxplots of `payroll` (on raw and logarithmic scale). The
boxplots are obtained using function `survey::svyboxplot`.


```{r, echo = FALSE, fig.show = "hold", out.width = "50%", fig.asp = 0.5}
layout(matrix(1:2, ncol = 2))
par(mar = c(4, 1, 1, 1))
svyboxplot(payroll~ 1, dn, all.outliers = TRUE, xlab = "payroll",
           horizontal = TRUE)
svyboxplot(log(payroll) ~ 1, dn, all.outliers = TRUE, xlab = "log(payroll)",
           horizontal = TRUE)
```

From the boxplot with `payroll` on raw scale, we recognise that the sample
distribution of `payroll` is skewed to the right; the boxplot on logarithmic
scale demonstrates that log-transform is not sufficient to turn the skewed
distribution into a symmetric distribution. The outliers need not be errors.
Following [Chambers](#biblio) (1986), we distinguish representative outliers
from non-representative outliers ($\rightarrow$ see vignette "Basic Robust
Estimators" for an introduction to the notion of non-/ representative
outliers).

The outliers visible in the boxplot refer to a few large workplaces. Moreover,
we assume that these outliers represent other workplaces in the population that
are similar in value (i.e., representative outliers).

<div class="my-sidebar-blue">
<p style="color: #1f618d;">
**Detailed analysis.**
</p>

We plotted the sampling weights against the logarithm of `payroll`.

```{r, echo = FALSE, out.width = "50%"}
par(mar = c(5, 4, 1, 0))
plot(weights(dn), dn$variables$payroll, ylab = "payroll (log scale)",
     log = "y", panel.first = grid(col = "grey", lty = 2))
```

In the scatter plot, we can observe the following tendency:

 - Large observations have rather small weights (and vice versa).
 - Or, if we think of a linear regression fit, we would draw the line from the
   top left to the bottom right corner.

This tendency reflects the survey designers' attempt to prevent *influential
outliers* at the stage of designing the survey, that is, to prevent
observations that are far from the bulk of values *and* have a large sampling
weight. The survey designers achieved this by carefully sampling workplaces
with large (anticipated) payroll with a sample inclusion probability equal or
close to unity.

Although the designers did a great job, we are faced with the following
problem: Because outliers have a sampling weight that is already close to
unity, there is little we can do by reducing the sampling weight in order to
limit the impact of an influential value. Hence, weight reduction
($\rightarrow$ see vignette "Basic Robust Estimators") is not a productive
strategy. Fortunately, this situation is where the RHT shines.
</div>

## 2 Robust Horvitz-Thompson Estimator

<div class="my-sidebar-orange">
<p style="color: #ce5b00;">
**IMPORTANT**
</p>
<p>
The RHT estimator is the method of choice for pps designs (i.e., designs
without replacement where the sample inclusion probabilities are proportional
to some measure of size). For equal-probability designs, the *M*-estimator of
`type = "rhj"` (robust Hajek type estimator) tends to be superior; see vignette
<b><a style = "color:#ce5b00;">Basic Robust Estimators</a></b> to learn more.
</p>
</div>

### 2.1 Bare-bone methods

The following bare-bone estimating methods are available:

 - `weighted_mean_huber()`
 - `weighted_total_huber()`
 - `weighted_mean_tukey()`
 - `weighted_total_tukey()`

The functions with postfix `_tukey` are *M*-estimators with the Tukey biweight
$\psi$-function. The Huber RHT *M*-estimator of the payroll total can be
computed with

```{r}
weighted_total_huber(payroll, weight, k = 8, type = "rht")
```

Note that we must specify `type = "rht"` for the RHT [the case `type = "rhj"`
is discussed in the vignette "Basic Robust Estimators"]. Here, we have chosen
the robustness tuning constant $k = 8$.

<div class="my-sidebar-blue">
<p style="color: #1f618d;">
**Good to know.**
</p>
<p>
In general, the tuning constant `k` must be chosen larger than (loosely
speaking) "we are used to choose it". More precisely, in the context of an
*infinite population* with a standard *Gaussian* distribution, the constant $k
= 1.345$ ensures that the Huber $M$-estimator of location achieves 95%
efficiency compared with the arithmetic mean under the Gaussian model. The
efficiency considerations underlying the choice of $k = 1.345$ *do not* carry
over to distributions other than the Gaussian.

All bare-bone methods can be called with the argument `info = TRUE` (default:
`FALSE`). This instructs the functions to return a list with estimate-specific
information $\rightarrow$ see vignette "Basic Robust Estimators" to learn more.

The Huber *M*-estimator can be computed for an asymmetric Huber $\psi$-function
by calling the function with the additional argument `asym = TRUE`.

The *M*-estimators are computed by iterative methods. If the algorithm fails to
converge, the functions return `NA`. By default, the algorithm uses a maximum
of `maxit = 50` iterations and a numerical tolerance criterion of `tol = 1e-5`
as a stopping rule. Other values of `maxit` and `tol` can be specified in the
function call; see `svyreg_control()`.
</p>
</div>

### 2.2 Survey methods

The following survey method are available;

 - `svymean_huber()`
 - `svytotal_huber()`
 - `svymean_tukey()`
 - `svytotal_tukey()`

The survey method of the RHT (and its standard error) is

```{r}
m <- svytotal_huber(~payroll, dn, k = 8, type = "rht")
m
```

The `summary()` method summarizes the most important facts about the estimate.

```{r}
summary(m)
```

The estimated location, variance, and standard error can be extracted from
object `m` with the following commands.

```{r}
coef(m)
vcov(m)
SE(m)
```

For *M*-estimators, the estimated scale (weighted MAD) can be extracted with
the `scale()` function.

```{r}
scale(m)
```

Additional utility functions are:

 - `residuals()` to extract the residuals
 - `fitted()` to extract the fitted values under the model in use
 - `robweights()` to extract the robustness weights

In the following figure, the robustness weights of object `m` are plotted
against the residuals. The Huber RHT *M*-estimator downweights observations at
both ends of the residuals' distribution.

```{r, eval = FALSE}
plot(residuals(m), robweights(m))
```

```{r, echo = FALSE, out.width = "50%"}
par(mar = c(5, 4, 1, 0))
plot(residuals(m), robweights(m))
```

### 2.3 Adaptive estimation

An adaptive *M*-estimator of the total (or mean) is defined by letting the data
chose the tuning constant $k$. This approach is available for the RHT estimator
$\rightarrow$ see vignette "Basic Robust Estimators", Chap. 5.3 on
*M*-estimators.

---

## References {#biblio}

CHAMBERS, R. (1986). Outlier Robust Finite Population Estimation. *Journal of
the American Statistical Association* **81**, 1063–1069, [DOI:
10.1080/01621459.1986.10478374](https://doi.org/10.1080/01621459.1986.10478374).

FULLER, W. A. (2009). *Sampling Statistics*, Hoboken (NJ): John Wiley & Sons,
[DOI: 10.1002/9780470523551](https://doi.org/10.1002/9780470523551).

HULLIGER, B. (1995). Outlier Robust Horvitz–Thompson Estimators. *Survey
Methodology* **21**, 79–87.

HULLIGER, B. (1999). Simple and robust estimators for sampling, in:
*Proceedings of the Survey Research Methods Section, American Statistical
Association*, pp. 54–63.

HULLIGER, B. (2006). Horvitz–Thompson Estimators, Robustified. In:
*Encyclopedia of Statistical Sciences* ed. by Kotz, S. Volume 5, Hoboken (NJ):
John Wiley and Sons, 2nd edition, [DOI:
10.1002/0471667196.ess1066.pub2](https://doi.org/10.1002/0471667196.ess1066.pub2).

LUMLEY, T. (2010). *Complex Surveys: A Guide to Analysis Using R: A Guide to
Analysis Using R*, Hoboken (NJ): John Wiley & Sons.

LUMLEY, T. (2021). survey: analysis of complex survey samples. R package
version 4.0, URL https://CRAN.R-project.org/package=survey.

PATAK, Z., HIDIROGLOU, M. and LAVALLEE, P. (1998). The methodology of the
Workplace and Employee Survey. *Proceedings of the Survey Research Methods
Section, American Statistical Association*, 83–91.