---
title: "Diagnostics"
author: "Joshua F. Wiley"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Diagnostics}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

To start, load the package.

```{r setup}
library(JWileymisc)
```

# Univariate Diagnostics

The `testDistribution()` function can be used to evaluate whether a
variable comes from a specific, known parametric distribution.
By default, no outliers or "extreme values" are shown.
These can be used directly through the data output or plotted. The
plot includes both the empirical density, the density of the assumed
parametric distribution, a rug plot showing actual values, if not too
many data points, and a Quantile-Quantile plot (QQ Plot). The QQ Plot
is rotated to save space, by removing the diagonal and rotating to be
horizontal, which we call a QQ Deviates plot.

One other unique feature of the density plot is the x axis. 
It follows Tufte's principles about presenting data. 
To make the x axis more informative, it is a
"range frame", meaning it is only plotted over the range of the actual
data. Secondly, the breaks are a five number summary of the
data. These five numbers are:

* Minimum (0th percentile)
* Lower quartile (25th percentile)
* Median (50th percentile)
* Upper quartile (75th percentile)
* Maximum (100th percentile)

Most plots use breaks that are evenly spaced. These do not convey
information about the data. Using a five number summary the breaks
themselves are effectively like a boxplot with specific values
provided so you can immediately read what values the min and max are,
the median, etc.

```{r}

test <- testDistribution(mtcars$mpg, "normal")
head(test$Data)
table(test$Data$isEV)

plot(test)

``` 

It is possible to have extreme values automatically identified either
based on a theoretical distribution or empirically identified (e.g.,
the top or bottom XX percent are classified as extreme). We say that
.10 (10%) of data points are "extreme" just to illustrate. 
Extreme values are shown in black with non extreme values shown in
light grey, both in the rug plot and in the QQ plot.
Although 10% is higher than typically would be chosen, given the small
dataset, it is necessary to have extreme values at both ends of the
distribution for illustration.

```{r}
test <- testDistribution(mtcars$mpg, "normal",
                         extremevalues = "theoretical",
                         ev.perc = .10)
## view the data with extreme values
head(test$Data)

## count how many extreme values there are
table(test$Data$isEV)

## plot the distribution
plot(test)


## show which values are extreme
test$Data[isEV == "Yes"]

## view extreme values on mpg in the original dataset
## by use the original order, the original rows to select
## the correct rows from the original dataset
mtcars[test$Data[isEV == "Yes", OriginalOrder], ]


``` 

Extreme values taken off the empirical percentiles can be chosen as
well.

```{r}
test <- testDistribution(mtcars$mpg, "normal",
                         extremevalues = "empirical",
                         ev.perc = .10)
head(test$Data)
table(test$Data$isEV)

plot(test)

``` 

Many distributions beyond normal are possible. These can be compared
visually or based on log likelihoods. Here we compare results between
a normal and gamma distribution. The log likelihood values show that
the `mpg` variable is better fit by the gamma distribution than the
normal, although they are quite close.

```{r}
testN <- testDistribution(mtcars$mpg, "normal",
                          extremevalues = "theoretical",
                          ev.perc = .05)
testG <- testDistribution(mtcars$mpg, "gamma",
                          extremevalues = "theoretical",
                          ev.perc = .05)

## compare the log likelihood assuming a normal or gamma distribution
testN$Distribution$LL
testG$Distribution$LL

plot(testN)
plot(testG)

``` 

# Model Diagnostics

`JWileymisc` has features to support standard parametric model
diagnostics. The generic function `modelDiagnostics()` has a method
implemented for linear models.

In the code that follows, we use the `modelDiagnostics()` function to
calculate some diagnostics which are plotted using the matching
`plot()` method. The result is a plot of the standardized residuals
against the expected (normal) distribution, the same QQ deviates plot
we saw with `testDistribution()`. Together, these provide information
about whether the model assumption of normality is likely met, or not.
In addition, block dots and the black rug lines indicate relatively
extreme residual values. The label in the top left is the outcome
variable name.

Another plot is a scatter plot of predicted values (x axis) against
the standardized residuals (y axis). Again a five number summary is
used for the x and y axis. To assess whether there are any systematic
trends in the residuals, a loess line is added as a solid blue line.
To assess heterogeneity/heteroscedasticity of residuals, quantile
regression lines are added as dashed blue lines. These lines are for
the 10th and 90th percentile of the distribution. 
Depending on the model and data, these quantile regression lines may
use smoothing splines to allow non-linear patterns.
For the assumption that residual variance is homogenous, you would
expect the two lines to be approximately parallel and flat, with equal
distance between them, across the range of predicted values.

**Note** it is uncommon to decide that 5 percent of values are
extreme. This high value was chosen to illustrate only. More common
values in practice are `ev.perc = .001`, `ev.perc = .005` or 
`ev.perc = .01`, which capture 0.1%, 0.5%, and 1% of values,
respectively. However, in this small dataset, those would not give
many "extreme values" for illustration, so a higher value was used.

```{r, fig.height = 10}

m <- lm(mpg ~ hp * factor(cyl), data = mtcars)

md <- modelDiagnostics(m, ev.perc = .05)

plot(md, ncol = 1)

```

Finally, given that there are a few extreme values, we may want to
identify and exclude them. We can view the extreme values. Here we
extract these by effect type and with the outcome score associated
with each. The index shows the index, from the residuals where the
extreme values occurred. Note that this may not match up to the
original data if there are missing values. In those cases, the easiest
approach is to remove missing values on model variables first, then
fit the model and then calculate diagnostics.

```{r}
## show extreme values
md$extremeValues

## show extreme values in overall dataset
mtcars[md$extremeValues$Index, 1:4]

``` 

Using the extreme values index, we can exclude these and re-fit the
model to see if it changes. We can show a summary of coefficients from
each model and how much they change.

```{r, results = "hide"}

## exclude extreme values
m2 <- lm(mpg ~ hp * factor(cyl), data = mtcars[-md$extremeValues$Index, ])

## show a summary of coefficients from both models
## and the percent change
round(data.frame(
  M1 = coef(m),
  M2 = coef(m2),
  PercentChange = coef(m2) / coef(m) * 100 - 100), 2)

```

```{r, echo = FALSE, results = "asis"}
if (requireNamespace("pander", quietly = TRUE)) {
pander::pandoc.table(round(data.frame(
  M1 = coef(m),
  M2 = coef(m2),
  PercentChange = coef(m2) / coef(m) * 100 - 100), 2),
  justify = "left")
} else {
  ""
}
```

Once some values are removed, new values may be extreme. Although
additional passes are not that common, they can be done. In the
following code, we conduct new diagnostics on the model that had
extreme values from the first model removed.

```{r, fig.height = 10}

## diagnostics after removing outliers from first model
md2 <- modelDiagnostics(m2, ev.perc = .05)

plot(md2, ask = FALSE, ncol = 1)

## show (new) extreme values
md2$extremeValues

```