---
title: "MKdescr"
author: "Matthias Kohl"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
bibliography: MKdescr.bib
vignette: >
  %\VignetteIndexEntry{MKdescr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{utf8}
---


## Introduction  
Package MKdescr includes a collection of functions that I found useful in my 
daily work. It contains several functions for descriptive statistical data 
analysis. Most of the functions were extracted from package MKmisc. Due to
the growing number functions leading to an increasing number of dependencies 
required for MKmisc, I decided to split the package into smaller packages, 
where each package offers specific functionality.

We first load the package.
```{r}
library(MKdescr)
```

## Illustration of quantiles and box- and whisker plots
The following function should help to understand the meaning and computation
of quantiles.
```{r, fig.width=7, fig.height=7}
x <- 1:10
illustrate.quantile(x, alpha = 0.2)
illustrate.quantile(x, alpha = 0.5, type = 7)
illustrate.quantile(x, alpha = 0.8, type = c(2, 7))
```

In addition, there is a function to illustrate the meaning and computation
of box-and-whisker plots.
```{r, fig.width=7, fig.height=7}
set.seed(123)
illustrate.boxplot(rt(50, df = 5))
```


## IQR
I implemented function IQrange before the standard function IQR gained the 
type argument. Since 2010 (r53643, r53644) the function is identical to 
function IQR.
```{r}
x <- rnorm(100)
IQrange(x)
IQR(x)
```

It is also possible to compute a standardized version of the IQR leading to a
normal-consistent estimate of the standard deviation.
```{r}
sIQR(x)
sd(x)
```

## Mean Absolute Deviation
The mean absolute deviation under the assumption of symmetry is a robust alternative
to the sample standard deviation [see @Tukey1960].
```{r}
meanAD(x)
```

## Huber-type Skipped Mean and SD
Huber-type skipped mean and SD are robust alternatives to the arithmetic mean 
and sample standard deviation [see @Hampel1985].
```{r}
skippedMean(x)
skippedSD(x)
```

## Five Number Summary
This is a function that computes a so-called five number summary which in 
contrast to function fivenum uses the first and third quartile instead of the 
lower and upper hinge.
```{r}
fiveNS(x)
```

## Seven Number Summary
Analogously to the five number summary we use quantiles instead of hinges and
eighths as originally proposed by Tukey.
```{r}
sevenNS(x)
```

## Coefficient of Variation (CV)
There are functions to compute the (classical) coefficient of variation as well
as two robust variants. In case of the robust variants, the mean is replaced 
by the median and the SD is replaced by the (standardized) MAD and 
the (standardized) IQR, respectively [see Arachchige2019].
```{r}
## 5% outliers
out <- rbinom(100, prob = 0.05, size = 1)
sum(out)
x <- (1-out)*rnorm(100, mean = 10, sd = 2) + out*25
CV(x)
medCV(x)
iqrCV(x)
```


## Signal to Noise Ratio (SNR)
There are functions to compute the (classical) signal to noise ratio as well
as two robust variants. In case of the robust variants, the mean is replaced 
by the median and the SD is replaced by the (standardized) MAD and 
the (standardized) IQR, respectively.
```{r}
SNR(x)
medSNR(x)
iqrSNR(x)
```


## Standardized Mean Difference
We compute the standardized mean difference also known as Cohen's und Hogdes' g.
In addition, there is a new proposal in case of unequal variances.
```{r}
n1 <- 100
x <- rnorm(100)
n2 <- 150
y <- rnorm(n2, mean = 3, sd = 2)
## true value
(0-3)/sqrt((1 + n1/n2*2^2)/(n1/n2+1))
## estimates
## Aoki's e
SMD(x, y)
## Hedges' g
SMD(x, y, var.equal = TRUE)
## standardized test statistic of Welch's t-test
SMD(x, y, bias.cor = FALSE)
## Cohen's d
SMD(x, y, bias.cor = FALSE, var.equal = TRUE)
```


## z-Score
There are functions to compute z-score and robust variants of z-score based
on median and MAD, respectively median and IQR.
```{r}
## 10% outliers
out <- rbinom(100, prob = 0.1, size = 1)
sum(out)
x <- (1-out)*rnorm(100, mean = 10, sd = 2) + out*25
z <- zscore(x)
z.med <- medZscore(x)
z.iqr <- iqrZscore(x)
## mean without outliers (should by close to 0)
mean(z[!out])
mean(z.med[!out])
mean(z.iqr[!out])
## sd without outliers (should by close to 1)
sd(z[!out])
sd(z.med[!out])
sd(z.iqr[!out])
```


## Box- and Whisker-Plot
In contrast to the standard function boxplot which uses the lower and upper 
hinge for defining the box and the whiskers, the function qboxplot uses the
first and third quartile.
```{r, fig.width=7, fig.height=7}
x <- rt(10, df = 3)
par(mfrow = c(1,2))
qboxplot(x, main = "1st and 3rd quartile")
boxplot(x, main = "Lower and upper hinge")
```

The difference between the two versions often is hardly visible.


## Generalized Logarithm
The generalized logarithm, which corresponds to the shifted area hyperbolic sine,
$$
  \log(x + \sqrt{x^2 + 1}) - \log(2)
$$
may be useful as a variance stabilizing transformation when also negative values 
are present.
```{r, fig.width = 7, fig.height = 7}
curve(log, from = -3, to = 5)
curve(glog, from = -3, to = 5, add = TRUE, col = "orange")
legend("topleft", fill = c("black", "orange"), legend = c("log", "glog"))
```

As in case of function log there is also glog10 and glog2.
```{r, fig.width = 7, fig.height = 7}
curve(log10(x), from = -3, to = 5)
curve(glog10(x), from = -3, to = 5, add = TRUE, col = "orange")
legend("topleft", fill = c("black", "orange"), legend = c("log10", "glog10"))
```

There are also functions that compute the inverse of the generalized logarithm.
```{r}
inv.glog(glog(10))
inv.glog(glog(10, base = 3), base = 3)
inv.glog10(glog10(10))
inv.glog2(glog2(10))
```


## Simulate Correlated Variables
To demonstrate Pearson correlation in my lectures, I have written this simple
function to simulate correlated variables and to generate a scatter plot of
the data.
```{r, fig.width=7, fig.height=7}
res <- simCorVars(n = 500, r = 0.8)
cor(res$Var1, res$Var2)
```


## Plot TSH, fT3 and fT4 Values
The thyroid function is usually investigated by determining the values of 
TSH, fT3 and fT4. The function thyroid can be used to visualize the measured
values as relative values with respect to the provided reference ranges.
```{r, fig.width=7, fig.height=7}
thyroid(TSH = 1.5, fT3 = 2.5, fT4 = 14, TSHref = c(0.2, 3.0),
        fT3ref = c(1.7, 4.2), fT4ref = c(7.6, 15.0))
```


## Generalized and Negative Logarithm as Transformations
We can use the generalized logarithm for transforming the axes in ggplot2 plots [@Wickham2016].
```{r, fig.width=7, fig.height=7}
library(ggplot2)
data(mpg)
p1 <- ggplot(mpg, aes(displ, hwy)) + geom_point()
p1
p1 + scale_x_log10()
p1 + scale_x_glog10()
p1 + scale_y_log10()
p1 + scale_y_glog10()
```

The negative logrithm is for instance useful for displaying p values. The 
interesting values are on the top. This is for instance used in a so-called
volcano plot.
```{r, fig.width=7, fig.height=7}
x <- matrix(rnorm(1000, mean = 10), nrow = 10)
g1 <- rep("control", 10)
y1 <- matrix(rnorm(500, mean = 11.25), nrow = 10)
y2 <- matrix(rnorm(500, mean = 9.75), nrow = 10)
g2 <- rep("treatment", 10)
group <- factor(c(g1, g2))
Data <- rbind(x, cbind(y1, y2))
pvals <- apply(Data, 2, function(x, group) t.test(x ~ group)$p.value,
               group = group)
## compute log-fold change
logfc <- function(x, group){
  res <- tapply(x, group, mean)
  log2(res[1]/res[2])
}
lfcs <- apply(Data, 2, logfc, group = group)
ps <- data.frame(pvals = pvals, logfc = lfcs)
ggplot(ps, aes(x = logfc, y = pvals)) + geom_point() +
    geom_hline(yintercept = 0.05) + scale_y_neglog10() +
    geom_vline(xintercept = c(-0.1, 0.1)) + xlab("log-fold change") +
    ylab("-log10(p value)") + ggtitle("A Volcano Plot")
```


## Change Data from Wide to Long
Often it's better to have the data in a long format than in a wide format; e.g.,
when plotting with package ggplot2. The necessary transformation can be done
with function melt.long.
```{r, fig.width=7, fig.height=7}
## some random data
test <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))
test.long <- melt.long(test)
test.long
ggplot(test.long, aes(x = variable, y = value)) +
  geom_boxplot(aes(fill = variable))
## introducing an additional grouping variable
group <- factor(rep(c("a","b"), each = 5))
test.long.gr <- melt.long(test, select = 1:2, group = group)
test.long.gr
ggplot(test.long.gr, aes(x = variable, y = value, fill = group)) +
  geom_boxplot()
```


## sessionInfo
```{r}
sessionInfo()
```


## References