---
title: "Simulate Individual Data"
author: "Gabriele Pittarello"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Simulate Individual Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: '`r system.file("references.bib", package="ReSurv")`'
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(ReSurv)
library(ggplot2)
```

# Introduction

In this vignette we show how to simulate the individual data we included in the simulation study of @hiabu23. The simulations are based on the `SynthETIC` package and they can be used to replicate our results.
In the manuscript, we named the $5$ scenarios Alpha, Beta, Gamma, Delta, Epsilon. The $5$ scenarios have the same data features described in the following table. Conversely, they have specific characteristics that we will describe in the coming sections.

| Covariates                                       | Description        |
|--------------------------------------------------|--------------------|
| `claim_number`                                   | Policy identifier. |
| `claim_type` $\in  \left\{0, 1 \right\}$         | Type of claim.     |
| `AP`                                             | Accident month.    |
| `RP`                                             | Reporting month.   |

For each scenario we will show if they satisfy the chain ladder assumptions (CL), the proportionality assumption in @cox72 (PROP) and if interactions are present (INT). Details on the simulation mechanism and the simulation parameters can be found in the manuscript.


# Scenario Alpha

This scenario is a mix of `claim_type 0` and  `claim_type 1` with same number of claims at each accident month (i.e. the claims volume).

```{r eval=FALSE, include=TRUE}
# Input data

input_data_0 <- data_generator(
  random_seed = 1964,
  scenario = "alpha",
  time_unit = 1 / 360,
  years = 4,
  period_exposure = 200
)

```


```{r eval=FALSE, include=TRUE}
input_data_0 %>%
  as.data.frame() %>%
  mutate(claim_type = as.factor(claim_type)) %>%
  ggplot(aes(x = RT - AT, color = claim_type)) +
  stat_ecdf(size = 1) +
  labs(title = "Empirical distribution of simulated notification delays", x =
         "Notification delay (in days)", y = "Cumulative Density") +
  xlim(0, 1500) +
  scale_color_manual(
    values = c("royalblue", "#a71429"),
    labels = c("Claim type 0", "Claim type 1")
  ) +
  scale_linetype_manual(values = c(1, 3),
                        labels = c("Claim type 0", "Claim type 1")) +
  guides(
    color = guide_legend(title = "Claim type", override.aes = list(
      color = c("royalblue", "#a71429"), size = 2
    )),
    linetype = guide_legend(
      title = "Claim type",
      override.aes = list(linetype = c(1, 3), size = 0.7)
    )
  ) +
  theme_bw()
```


# Scenario Beta

This scenario is similar to simulation `Alpha` but the volume of `claim_type 1` is decreasing in the most recent accident dates. When the longer tailed bodily injuries have a decreasing claim volume, aggregated chain ladder methods will overestimate reserves, see @ajne94. 

```{r include=TRUE, eval =FALSE}

input_data_1 <- data_generator(
  random_seed = 1964,
  scenario = 1,
  time_unit = 1 / 360,
  years = 4,
  period_exposure  = 200
)

```

```{r eval=FALSE, include=TRUE}
input_data_1 %>%
  as.data.frame() %>%
  mutate(claim_type = as.factor(claim_type)) %>%
  ggplot(aes(x = RT - AT, color = claim_type)) +
  stat_ecdf(size = 1) +
  labs(title = "Empirical distribution of simulated notification delays", x =
         "Notification delay (in days)", y = "Cumulative Density") +
  xlim(0, 1500) +
  scale_color_manual(
    values = c("royalblue", "#a71429"),
    labels = c("Claim type 0", "Claim type 1")
  ) +
  scale_linetype_manual(values = c(1, 3),
                        labels = c("Claim type 0", "Claim type 1")) +
  guides(
    color = guide_legend(title = "Claim type", override.aes = list(
      color = c("royalblue", "#a71429"), size = 2
    )),
    linetype = guide_legend(
      title = "Claim type",
      override.aes = list(linetype = c(1, 3), size = 0.7)
    )
  ) +
  theme_bw()
```

# Scenario Gamma

An interaction between `claim_type 1` and accident period affects the claims occurrence. One could imagine a scenario, where a change in consumer behavior or company policies resulted in different reporting patterns over time.  For the last simulated accident month, the two reporting delay distributions will be identical.

```{r}
# Input data

input_data_2 <- data_generator(
  random_seed = 1964,
  scenario = 2,
  time_unit = 1 / 360,
  years = 4,
  period_exposure = 200
)

```

```{r eval=FALSE, include=TRUE}
input_data_2 %>%
  as.data.frame() %>%
  mutate(claim_type = as.factor(claim_type)) %>%
  ggplot(aes(x = RT - AT, color = claim_type)) +
  stat_ecdf(size = 1) +
  labs(title = "Empirical distribution of simulated notification delays", x =
         "Notification delay (in days)", y = "Cumulative Density") +
  xlim(0, 1500) +
  scale_color_manual(
    values = c("royalblue", "#a71429"),
    labels = c("Claim type 0", "Claim type 1")
  ) +
  scale_linetype_manual(values = c(1, 3),
                        labels = c("Claim type 0", "Claim type 1")) +
  guides(
    color = guide_legend(title = "Claim type", override.aes = list(
      color = c("royalblue", "#a71429"), size = 2
    )),
    linetype = guide_legend(
      title = "Claim type",
      override.aes = list(linetype = c(1, 3), size = 0.7)
    )
  ) +
  theme_bw()
```


# Scenario Delta

A seasonality effect dependent on the accident months for `claim_type 0` and `claim_type 1` is present. This could occur in a real world setting with increased work load during winter for certain claim types, or a decreased workforce during the summer holidays. 

```{r}

input_data_3 <- data_generator(
  random_seed = 1964,
  scenario = 3,
  time_unit = 1 / 360,
  years = 4,
  period_exposure = 200
)

```

```{r eval=FALSE, include=TRUE}
input_data_3 %>%
  as.data.frame() %>%
  mutate(claim_type = as.factor(claim_type)) %>%
  ggplot(aes(x = RT - AT, color = claim_type)) +
  stat_ecdf(size = 1) +
  labs(title = "Empirical distribution of simulated notification delays", x =
         "Notification delay (in days)", y = "Cumulative Density") +
  xlim(0, 1500) +
  scale_color_manual(
    values = c("royalblue", "#a71429"),
    labels = c("Claim type 0", "Claim type 1")
  ) +
  scale_linetype_manual(values = c(1, 3),
                        labels = c("Claim type 0", "Claim type 1")) +
  guides(
    color = guide_legend(title = "Claim type", override.aes = list(
      color = c("royalblue", "#a71429"), size = 2
    )),
    linetype = guide_legend(
      title = "Claim type",
      override.aes = list(linetype = c(1, 3), size = 0.7)
    )
  ) +
  theme_bw()
```

# Scenario Epsilon

The data generating process violates the proportional likelihood in @cox72. We generate the data assuming that a) there is an effect of the covariates on the baseline and b) the proportionality assumption is not valid.

```{r}
# Input data

input_data_4 <- data_generator(
  random_seed = 1964,
  scenario = 4,
  time_unit = 1 / 360,
  years = 4,
  period_exposure = 200
)

```


```{r eval=FALSE, include=TRUE}
input_data_4 %>%
  as.data.frame() %>%
  mutate(claim_type = as.factor(claim_type)) %>%
  ggplot(aes(x = RT - AT, color = claim_type)) +
  stat_ecdf(size = 1) +
  labs(title = "Empirical distribution of simulated notification delays", x =
         "Notification delay (in days)", y = "Cumulative Density") +
  xlim(0, 1500) +
  scale_color_manual(
    values = c("royalblue", "#a71429"),
    labels = c("Claim type 0", "Claim type 1")
  ) +
  scale_linetype_manual(values = c(1, 3),
                        labels = c("Claim type 0", "Claim type 1")) +
  guides(
    color = guide_legend(title = "Claim type", override.aes = list(
      color = c("royalblue", "#a71429"), size = 2
    )),
    linetype = guide_legend(
      title = "Claim type",
      override.aes = list(linetype = c(1, 3), size = 0.7)
    )
  ) +
  theme_bw()

```

# Bibliography