---
title: "Semi-Parametric Distribution Demo"
author: "Alexios Galanos"
date: "`r Sys.Date()`"
output: 
    rmarkdown::html_vignette:
        css: custom.css
        code_folding: show
        highlight: kate
vignette: >
  %\VignetteIndexEntry{Semi-Parametric Distribution Demo}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This provides a quick demo to illustrate the semi-parametric distribution. This
distribution estimates the tails of the data using the Generalized Pareto distribution 
(`gpd`) whilst the interior is estimated using a kernel density. The tails and 
interior are then smoothly joined together to form the semi-parametric distribution.

For this demo, we'll generate a random sample from the Generalized Hyperbolic 
Skew Student distribution and then model the top and bottom 1% of the data 
using the Generalized Pareto distribution. Choosing the correct threshold when 
applying the Generalized Pareto distribution is an important step to ensure that 
the asymptotic distribution fits the data well and the parameter estimates are 
stable. There are a number of methods to test for this, and numerous packages in R which can help. The [mev](https://CRAN.R-project.org/package=mev) 
package, which is used for estimating the GPD (except in the case of the probability weighted moments
approach) provides threshold stability plots and other diagnostics for aid in
choosing the right threshold.

There are a total of 4 parameters, 2 for each tail, estimated independently. As such,
the standard errors are based on a block diagonal variance-covariance matrix.

```{r,warning=FALSE,message=FALSE}
library(tsdistributions)
library(data.table)
library(knitr)
# simulate data
set.seed(200)
n <- 6000
sim <- rghst(n, mu = 0, sigma = 1, skew = -0.6, shape = 8.01)
spec <- spd_modelspec(sim, lower = 0.01, upper = 0.99, kernel_type = 'normal')
mod <- estimate(spec, method = "pwm")
print(summary(mod))
```

We'll investigate the goodness of fit of the distribution using some visuals:

```{r,fig.height=6,fig.width=7}
lower_treshold <- mod$gpd$lower$threshold
upper_treshold <- mod$gpd$upper$threshold
oldpar <- par(mfrow = c(1,1))
par(mfrow = c(3,1), mar = c(3,3,3,3))
hist(sim, breaks = 300, probability = TRUE, xlab = "r", 
     col = "steelblue", border = "whitesmoke",main = "PDF")
box()
abline(v = lower_treshold, col = 'mediumpurple3', lty = 2)
abline(v = upper_treshold, col = 'mediumpurple3', lty = 2)
lines(sort(sim), dspd(sort(sim), mod), col = "darkorange",lwd = 2)

plot(sort(sim), (1:length(sim)/length(sim)), 
     ylim = c(0, 1), pch = 19, 
     cex = 0.5, ylab = "p", xlab = "q", main = "CDF")
grid()
lines(sort(sim), pspd(sort(sim), mod), col = "darkorange",lwd = 2)

abline(v = lower_treshold, col = 'mediumpurple3', lty = 2)
abline(v = upper_treshold, col = 'mediumpurple3', lty = 2)

plot(sort(sim), qspd(ppoints(length(sim)), mod), ylab = "Model Simulated", xlab = "Observed", main = "Q-Q")
abline(0, 1, col = "darkorange")
grid()
abline(v = lower_treshold, col = 'mediumpurple3', lty = 2)
abline(v = upper_treshold, col = 'mediumpurple3', lty = 2)
abline(h = lower_treshold, col = 'mediumpurple3', lty = 2)
abline(h = upper_treshold, col = 'mediumpurple3', lty = 2)
par(oldpar)
```

The vertical purple lines show the region modeled by the kernel density, representing
the thresholds beyond which the GPD is used.

To calculate expectations from the estimated object we can use quadrature integration,
as shown below, with calculations for the mean, variance, skewness, kurtosis, value
at risk (1%) and expected shortfall (1%). We choose reasonably large but finite values
for the numerical integration limits.

```{r}
f <- function(x) x * dspd(x, mod)
mu <- integrate(f, -50, 50)$value
f <- function(x) x^2 * dspd(x, mod)
variance <- integrate(f, -50, 50)$value
f <- function(x) (x - mu)^3 * dspd(x, mod)
skewness <- integrate(f, -50, 50)$value/(variance^(3/2))
f <- function(x) (x - mu)^4 * dspd(x, mod)
kurtosis <- integrate(f, -50, 50)$value/(variance^2)
value_at_risk <- qspd(0.01, mod)
f <- function(x) x * dspd(x, mod)
expected_shortfall <- integrate(f, -50, value_at_risk)$value/0.01

tab <- data.table(stat = c("mean","variance","skewness","kurtosis","VaR(1%)","ES(1%)"),
                  spd = c(mu, variance, skewness, kurtosis, value_at_risk, expected_shortfall),
                  observed = c(mean(sim), var(sim), mean((sim - mean(sim))^3)/(sd(sim)^3), 
                               mean((sim - mean(sim))^4)/(sd(sim)^4),quantile(sim, 0.01),
                               mean(sim[sim <= quantile(sim, 0.01)])))
kable(tab, digits = 2)
```

The results appear promising and close to the observed values. Note that even though
we simulated the data from a known distribution, in practice we usually don't 
know which distribution the data came from. Therefore, the semi-parametric 
distribution provides a good trade-off between fully parametric and non-parametric
approaches. The caveat is that we still need to be careful in choosing the thresholds.
There is no free lunch!