---
title: "Generating ready-to-analyze datasets with GRD"
bibliography: "../inst/REFERENCES.bib"
csl: "../inst/apa-6th.csl"
output: 
  rmarkdown::html_vignette
description: >
  This vignette covers the basics of GRD, a tool to generate random
  datasets. These data are ready to be analyzed. They can be generated
  from any population distribution, and exhibit any effect.
vignette: >
  %\VignetteIndexEntry{Generating ready-to-analyze datasets with GRD}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r, echo = FALSE, warning=FALSE, message = FALSE, results = 'hide'}
cat("this will be hidden; use for general initializations.\n")
library(superb)
library(ggplot2)
```

The package ``superb`` includes the function ``GRD()``. This function is used to easily generate
random data sets. With a few options, it is possible to obtain data from any design, with
any effects. This function, first created for SPSS [@hc14, @hc15] was exported to R [@ch19].
A brief report shows one possible use in the class for teaching statistics to undergrads [@c20].

This vignette illustrate some of its use.

## Simplest specification

The simplest use relies on the default value:

```{r}
dta <- GRD()
head(dta)
```

By default, one hundred scores are generated from a normal distribution with 
mean 0 and standard deviation of 1. In other words, it generate 100 z scores. 
The dependent variable, the last column in the dataframe that will be generated 
is called by default ``DV``. The first column is an "id" column containing a 
number identifying each *simulated* participant. To change the dependent 
variable's name, use

```{r}
dta <- GRD( RenameDV = "score" )
```

## Data from a design with between-subject factors and within-subject factors.

To add various groups to the dataset, use the argument ``BSFactors``, as in

```{r}
dta <- GRD( BSFactors = 'Group(3)')
```

There will be 100 random z scores in each of three groups, for a total of 300 data. The group
number will be given in an additional column, here called ``Group``.  A factorial
design can be generated with more than one factors, such as 

```{r}
dta <- GRD( BSFactors = c('Surgery(2)', 'Therapy(3)') )
```

which will results in 2 $\times$ 3, that is, 6 different groups, crossing all the levels of Surgery (1 and 2)
and all the levels of Therapy (1, 2 and 3). The levels can receive names rather than number, as in

```{r}
dta <- GRD(
    BSFactors = c('Surgery(yes, no)', 'Therapy(CBT,Control,Exercise)')
)
unique(dta$Surgery)
unique(dta$Therapy)
```

Finally, within-subject factors can also be given, as in

```{r}
dta <- GRD(
    BSFactors = c('Surgery(yes,no)', 'Therapy(CBT, Control,Exercise)'),
    WSFactors = 'Contrast(C1,C2,C3)',
)
```

For within-subject designs, the repeated measures will appear in distinct 
columns (here "DV.C1", "DV.C2", and "DV.C3" ). This format is called 
**wide** format, meaning that the repeated measures are all on the same line 
for a given *simulated* participant.

## Deciding the sample sizes

The default is to generate 100 participants in each between-subject groups. 
This default can be changed with ``SubjectsPerGroup``. The most straigthforward
specification is, e.g., ``SubjectsPerGroup = 25`` for 25 participants in each
groups. Unequal group sizes can be specified with:

```{r}
dta <- GRD(
    BSFactors = "Therapy(3)",
    SubjectsPerGroup = c(2, 5, 1)
)
dta
```

## Choosing the population distribution

To sample random data, it is necessary to specify a theoretical population distribution.
The default is to use a normal distribution (the famous "bell-shaped" curve). That population
has a grand mean (``GM``, $\mu$) given by the element ``mean`` and standard deviation ($\sigma$) given 
by the element ``stddev``. These can be redefined using the argument ``Population`` with a
list of the relevant elements. In the following example, IQ are being simulated with :

```{r, dpi=72, fig.height=3, fig.width=4}
dta <- GRD(
    RenameDV = "IQ",
    Population=list(mean=100,stddev=15)
)
hist(dta$IQ)
```

(increase the number of participants using ``SubjectsPerGroup`` to say 10,000, and the bell-shape
curve will be evident!).

Internally, the above call to ``GRD()`` will use ``rnorm`` to generate the 
scores, passing along for the mean parameter the grand mean (internally called
``GM``) and for the standard deviation parameter the provided standard deviation
(internally called ``STDDEV``). This can be explicitly stated using the element
``scores`` as in:

```{r}
dta <- GRD(
    BSFactors = "Group(2)",
    Population = list(
        mean   = 100,         # this set GM to 100
        stddev = 15,        # this set STDDEV to 15
        scores = "rnorm(1, mean = GM, sd = STDDEV )"
    )
)
```

Using ``scores``, it is possible to alter the parameters, for example, have a mean proportional
to the group number, or the standard deviation proportional to the group number, as in:

```{r, dpi=72, fig.height=3, fig.width=4}
dta <- GRD(
    BSFactors = "Group(2)",
    Population = list(
        mean   = 100,       # this set GM to 100
        stddev = 15,        # this set STDDEV to 15
        scores = "rnorm(1, mean = GM, sd = Group * STDDEV )"
    )
)
superb(
    DV ~ Group,
    dta,
    plotStyle = "pointjitterviolin" )
```

Any valid R instruction could be placed in the ``scores`` arguments, such
as ``scores = "rnorm(1, mean = GM, sd = ifelse(Group==1,10,50) )"`` to 
select the standard deviation according to ``Group`` or 
``scores = "1"`` to generate constants. Other theoretical distributions
can also be chosen, as in:

```{r, dpi=72, fig.height=3, fig.width=4}
dta <- GRD(SubjectsPerGroup = 5000,
    RenameDV = "RT",
    Population=list(
        scores = "rweibull(1, shape=2, scale=40)+250"
    )
)
hist(dta$RT,breaks=seq(250,425,by=5))
```

## Getting effects on one or some of the factors

It is possible to generate non-null effects on the factors using 
the argument ``Effects``. Effects can be ``slope(x)`` (an increase of x
points for each level of the factor), ``extent(x)`` (a total increase of 
``x`` over all the levels), ``custom(x, y, etc)`` for an effect of ``x`` point for
the first level of the factor, ``y`` point for the second, etc.

Here is a slope, effect:

```{r, message=FALSE, dpi=72, fig.height=3, fig.width=4}
dta <- GRD(
    BSFactors = 'Therapy(CBT, Control, Exercise)',
    WSFactors = 'Contrast(3)',
    SubjectsPerGroup = 1000,
    Effects = list('Contrast' = slope(2))
)
superb(
    crange(DV.1, DV.3) ~ Therapy,
    dta,
    WSFactors = "Contrast(3)",
    plotStyle = "line" )
```

Effects can also be any R code manipulating the factors, using ``Rexpression``.
One example:

```{r, message=FALSE, dpi=72, fig.height=3, fig.width=4}
dta <- GRD(
    BSFactors = 'Therapy(CBT,Control,Exercise)',
    WSFactors = 'Contrast(3) ',
    SubjectsPerGroup = 1000,
    Effects = list(
        "code1"=Rexpression("if (Therapy =='CBT'){-1} else {0}"),
        "code2"=Rexpression("if (Contrast ==3) {+1} else {0}")
    )
)
superb(
    crange(DV.1, DV.3) ~ Therapy,    
    dta,
    WSFactors = "Contrast(3)",
    plotStyle = "line" )
```

Repeated measures can also be generated from a multivariate normal 
distribution with a correlation ``rho``, with, e.g., 

```{r, dpi=72, fig.width=4, fig.height=4}
dta <- GRD(
    WSFactors = 'Difficulty(1, 2)',
    SubjectsPerGroup = 1000,
    Population=list(mean = 0,stddev = 20, rho = 0.5)
)
plot(dta$DV.1, dta$DV.2)
```

In the case of a multivariate normal distribution, the parameters
for the mean and the standard deviations can be vectors of length
equal to the number of repeated measures. However, covariances are
constants.

```{r, dpi=72, fig.width=4, fig.height=4}
dta <- GRD(
    WSFactors = 'Difficulty(1, 2)',
    SubjectsPerGroup = 1000,
    Population=list(mean = c(10,2),stddev= c(1,0.2),rho =-0.85)
)
plot(dta$DV.1, dta$DV.2)
```

## Contaminate your samples!

Contaminants can be inserted in the simulated data using ``Contaminant``.
This argument works exactly like ``Population`` except for the additional
option ``proportion`` which indicates the proportion of contaminants in
the samples:

```{r, dpi=72, fig.height=3, fig.width=4}
dta <- GRD(SubjectsPerGroup = 5000,
    Population= list( mean=100, stddev = 15 ),
    Contaminant=list( mean=200, stddev = 15, proportion = 0.10 )
)
hist(dta$DV,breaks=seq(-25,300,by=2.5))
```

Contaminants can be normally distributed (as above) or come from any
theoretical distribution which can be simulated in R:

```{r, dpi=72, fig.height=3, fig.width=4}
dta <- GRD(SubjectsPerGroup = 10000,
    Population=list( mean=100, stddev = 15 ),
    Contaminant=list( proportion = 0.10,
        scores="rweibull(1,shape=1.5, scale=30)+1.5*GM")
)
hist(dta$DV,breaks=seq(0,365,by=2.5))
```

Finally, contaminants can be used to add missing data (missing completely at 
random) with:

```{r}
dta <- GRD( BSFactors="grp(2)",
    WSFactors = "Moment (2)",
    SubjectsPerGroup = 1000,
    Effects = list("grp" = slope(100) ),
    Population=list(mean=0,stddev=20,rho= -0.85),
    Contaminant=list(scores = "NA", proportion=0.2)
)
```

## In summary

``GRD()`` is a convenient function to generate about any sorts of data sets 
with any form of effects. The data can simulate any factorial designs 
involving between-subject designs, repeated-measure designs, and 
multivariate data. 

One use if of course in the classroom: students can test their skill by 
generating random data sets and run statistical procedures. To illustrate
type-I errors, it become then easy to generate data with no effect whatsoever
and ask the students who obtain a rejection decision to raise their hand.


# References