---
title: "Deaths by Horse-kick and other Hazards in the Prussian Army"
author: "Antony Unwin and Bill Venables"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
bibliography: gda.bib
vignette: >
  %\VignetteIndexEntry{Deaths by Horse-kick and other Hazards in the Prussian Army}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "",
  echo = FALSE, 
  fig.height = 6, 
  fig.width = 9, 
  out.width = "100%"
)
```

```{r setup, include=FALSE}
library(knitr)
library(Horsekicks)
library(dplyr)
library(tidyr)
library(ggplot2)
theme_set(theme_bw() + theme(plot.title = element_text(hjust = 0.5),
                             strip.background = element_rect(fill = "antiquewhite")))
```

## Introduction

In 1898 the Russian statistician Ladislaus von Bortkiewicz (1868-1931)
used a dataset on deaths in Prussian Army Corps to illustrate what he
called the _Law of Small Numbers_ (@vonBortk:1898).  At the time
Statistics could only be confidently applied to large datasets where,
for example, Gaussian approximations might apply.  Von Bortkiewicz
wanted to show that methods could be devised to deal with datasets
where such approximate methods clearly could _not_ be used as well.
He took annual data on deaths due to horse-kicks in 14 Army Corps over
20 years from 1875 to 1894.  They had been reported in the official
state publication _Preussische Statistik_.  At the start of the period
a unified Germany had just been created following their victory over
France in the Franco-Prussian war of 1870-71.  Germany was not
involved in a war again until 1914 when the First World War began,
although from the 1880s there were colonisation expeditions in Africa
and the western Pacific.

In more recent times the preference for most intermediate to advanced
textbooks has, for good reasons, shifted to larger and larger data
sets.  Nevertheless the fascination with historically interesting
datasets remains even if they are small.  One such dataset is the
horse-kick data and deserves a fresh look with modern tools,
particularly at the level of detail at which it was originally
reported and not as a gross summary of about 5 or 6 frequencies.

The original data set was of death by horse-kick only, and covered the
20 calendar years from 1875 to 1894.  This package offers an extension
of the data in two ways: Firstly, it extends the series to 1907,
making it for 33 years rather than 20, and, secondly, it provides data
in parallel on two further accidental causes of death, by falling from
a horse or by drowning.  The bigger story is set out in detail in the
article (@unwin:2025) which this package is intended to accompany.

## A glimpse at the data

The following table shows the first few lines of the data set, called
`hkdeaths`.

```{r}
kable(head(hkdeaths), align = "r")
```

Most of the column names are self-explanatory, some need further explanation:

* __regiments__ is the number of regiments in the __corps__ as given
  by von Bortkiewicz.  The __corps__ sizes are unequal and
  __regiments__ provides one relative measure of __corps__ size.
  
* __NCOs__ is the number of non-commissioned officer deaths by
  horse-kick up till 1899.  These deaths are included in the __kick__
  total.
  
* __vonB_kick__ is the original von Bortkiewicz data.  It spans only
  the first 20 years, and contains a minor transcription error when
  compared to the original source, _Preussiche Statistik_.
  
To get some idea of what is going on in the data, we can show the
change over time of the corps annual aggregate deaths for each of
the three causes.

```{r, echo = TRUE}
hazards <- hkdeaths |> 
  group_by(year) |> 
  summarise(horse_kicks = sum(kick),
            falls       = sum(fall),
            drownings   = sum(drown), .groups = "drop") 
hazards_long <- hazards |> 
  pivot_longer(cols = c(horse_kicks, falls, drownings), 
               names_to = "cause", values_to = "deaths")
ggplot(hazards_long) + aes(x = year, y = deaths, colour = cause) +
  geom_line(linewidth = 1.5) +
  scale_colour_viridis_d() + 
  labs(title = "Annual Deaths from three Causes", x = "Year", y = "Deaths")
```

There is a suprisingly consistent fall in the death rate by drowning
over the period, of about 2% per annum:

```{r, echo=TRUE}
drown_1 <- glm(drownings ~ I(year - 1891), data = hazards, family = poisson(link = log))
exp(coef(drown_1)) |> rbind() |> as.data.frame(check.names = FALSE) |> kable(digits = 2)
```

```{r}
b <- exp(coef(drown_1))
```

So the estimated mean deaths by drowning in the middle of the series,
1891, was `r round(b[1], 1)` and the estimated decline is
`r round((1 - b[2])*100, 1)`% per annum.

It is natural to ask if this decline in drowning death rate is also
consistent across corps.  This can be studied using a facet plot.

```{r, echo=TRUE}
ggplot(hkdeaths) + aes(x = year, y = drown) + 
  geom_line(colour = "#21908CFF", linewidth=1.0) + facet_wrap(~ corps)  +
  geom_smooth(method = "glm", formula = y ~ I(x-1891), method.args = list(family = poisson),
              linewidth = 1.0, colour = "#440154FF", se = FALSE) + 
  labs(x = "Year", y = "Deaths", title = "Annual Drowning Deaths per Corps")

```

```{r, eval=FALSE}
# ggplot(hkdeaths) + aes(x = year, y = drown) + 
#   geom_line(colour = "#DF536B", linewidth=1.0) + facet_wrap(~ corps)  +
#   geom_smooth(method = "glm", formula = y ~ I(x-1891), method.args = list(family = poisson),
#               linewidth = 2, colour = "#2297E6") + labs(y = "Deaths by drowning")
```

Perhaps corps XIV is the one exception, showing a slight
_increase_ in drowning deaths per annum over the period.  All other
corps show a decline to some degree.


## References