---
title: "Ex. 4 - Generating cluster samples"
author: Kondwani Kajera Mughogho
header-includes:
    - \usepackage{setspace}\onehalfspacing
output:
  html_document:
    highlight: tango
vignette: >
  %\VignetteIndexEntry{Ex. 4 - Generating cluster samples}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE, warning = FALSE}
library(knitr)
library(formatR)
options(width = 90, tidy = TRUE, warning = FALSE, message = FALSE)
opts_chunk$set(
  comment = "", warning = FALSE, message = FALSE,
  echo = TRUE, tidy = TRUE
)
```

```{r load}
library(lsasim)
```

```{r packageVersion}
packageVersion("lsasim")
```

---

### **Generating background questionnaire data**

```{r equation, eval=FALSE}
cluster_gen(n,
  N = 1, cluster_labels = NULL, resp_labels = NULL,
  cat_prop = NULL, n_X = NULL, n_W = NULL, c_mean = NULL,
  sigma = NULL, cor_matrix = NULL, separate_questionnaires = TRUE,
  collapse = "none", sum_pop = sapply(N, sum), calc_weights = TRUE,
  sampling_method = "mixed", rho = NULL, theta = FALSE,
  verbose = TRUE, print_pop_structure = verbose
)
```

The function `cluster_gen` generates clustered samples that resemble the composition of participants in international large-scale assessments. Its single mandatory argument, `n`, is a numeric list or vector containing the hierarchical structure of the data. As a general rule, for both this first argument (`n`) and the second argument (`N`, representing the population structure), vectors represent symmetric structures and lists represent asymmetric structures. Examples follow the argument list.

The required argument is `n`; all other arguments are optional:

* `n`: a numeric vector with the number of sampled observations (clusters or subjects) on each level.
* `N`: a list of numeric vector(s) with the population size of each *sampled* cluster element on each level.
* `cluster_labels`: a character vector with the names of each cluster level.
* `resp_labels`: a character vector with the names of the questionnaire respondents on each level.
* `cat_prop`: a list of vectors where each vector contains the cumulative proportions for each category of a given item. If `theta = TRUE`, the first element of `cat_prop` must be a scalar 1, which corresponds to `theta`.
* `n_X`: the number of continuous (`X`) variables per cluster level.
* `n_W`: the number of ordinal (`W`) variables per cluster level.
* `cor_matrix`: a correlation matrix between all variables (except weights).
* `c_mean`: the vector of means for the continuous variables, or a list of such vectors, one for each level.
* `sigma`: the vector of standard deviations for the continuous variables, or a list of such vectors, one for each level.
* `separate_questionnaires`: if the logical argument `separate_questionnaires` is `TRUE`, each level will have its own questionnaire. Otherwise, it will be labeled 'q1'.
* `theta`: if the logical argument `theta` is `TRUE` then the latent trait will be generated as the first continuous variable and labeled 'theta'.
* `collapse`: controls whether the output is collapsed. If `collapse` is `TRUE`, the function output contains only one data frame with all answers; the default shown in the signature above is `"none"`.
* `sum_pop`: the total population at each level (sampled or not).
* `calc_weights`: if the logical argument `calc_weights` is `TRUE`, then sampling weights are calculated.
* `sampling_method`: can be `"SRS"` for Simple Random Sampling or `"PPS"` for Probabilities Proportional to Size; the default shown in the signature above is `"mixed"`.
* `rho`: specifies the estimated intraclass correlation.
* `verbose`: if the logical argument `verbose` is `TRUE`, then messages are printed in the output.
* `print_pop_structure`: if `print_pop_structure` is `TRUE`, then the population hierarchical structure is printed out (as long as it differs from the sample structure).
* `...`: additional parameters to be passed to `questionnaire_gen()`.
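
Because `cat_prop` expects cumulative proportions, it can help to build them from plain category proportions. The chunk below is a minimal base R sketch of that step. The proportions and the object names (`props`, `cat_prop_example`) are invented for illustration, and the resulting list is not passed to `cluster_gen` here.

```{r cat_prop sketch}
# Hypothetical category proportions for two ordinal items (illustration only)
props <- list(
  item1 = c(0.2, 0.5, 0.3), # three categories
  item2 = c(0.6, 0.4) # two categories
)

# cumsum() turns each set of proportions into cumulative proportions ending in 1
cat_prop_example <- lapply(props, cumsum)

# With theta = TRUE, the first element must be a scalar 1 (see the list above)
cat_prop_example <- c(list(theta = 1), cat_prop_example)
cat_prop_example
```

Each vector should end in 1, which is a quick sanity check before using a list like this.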

---

#### **Example 1**

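We can specify a simple structure of 3 schools with 5 students in each school. In the call below, the names `n` and `N` are labels attached to the elements of the vector passed as the first argument; they are not the `n` and `N` arguments of `cluster_gen`.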

```{r ex 1}
set.seed(4388)
cg <- cluster_gen(c(n = 3, N = 5))
```

```{r ex 1_str}
cg$n[[1]]
cg$n[[2]]
cg$n[[3]]
```

---

#### **Example 2**

We can specify a more complex structure of 3 schools (sampled from a population of 5) with different numbers of students, sampling weights, and custom numbers of questions.

```{r ex 2}
set.seed(4388)
n <- list(3, c(20, 15, 25))
N <- list(5, c(200, 500, 400, 100, 100))
cg <- cluster_gen(n, N, n_X = 5, n_W = 2)
```

```{r ex 2_str}
str(cg$school[[1]])
str(cg$school[[2]])
str(cg$school[[3]])
```
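
To make the role of `n` and `N` in sampling weights more concrete, the chunk below computes textbook design weights under simple random sampling at both levels, where each weight is the inverse of the selection probability. This is only a sketch in base R. It assumes the three sampled schools correspond to the first three entries of the school sizes, and it is not meant to reproduce the weights that `cluster_gen` calculates under its default `sampling_method`. All object names are hypothetical.

```{r ex 2_weights}
# Sample and population sizes taken from Example 2
n_schools <- 3
N_schools <- 5
n_students <- c(20, 15, 25)
N_students <- c(200, 500, 400) # assumption: sizes of the three sampled schools

# Under simple random sampling, weight = population size / sample size
w_school <- N_schools / n_schools
w_student <- N_students / n_students
w_final <- w_school * w_student

data.frame(school = 1:3, w_school, w_student, w_final)

# Weighted count of sampled students: an estimate of the student population
sum(w_final * n_students)
```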

---

#### **Example 3**

We can also control the intraclass correlation (`rho`) and the grand mean (`c_mean`).

```{r ex 3}
set.seed(4388)
cg <- cluster_gen(c(5, 1000), rho = .9, n_X = 2, n_W = 0, c_mean = 10)
sapply(1:5, function(s) mean(cg$school[[s]]$q1)) # means per school != 10
mean(sapply(1:5, function(s) mean(cg$school[[s]]$q1))) # closer to c_mean
```

```{r ex 3_str}
str(cg)
```
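
For readers who want to see what an intraclass correlation of .9 implies, the chunk below simulates a random-intercept structure in base R and recovers the intraclass correlation with the usual one-way ANOVA estimator. It is a sketch under stated assumptions (total variance fixed at 1, so the between-cluster variance equals `rho`), not a description of how `cluster_gen` generates data internally. It uses 200 clusters rather than 5 so that the estimate is stable, and all object names are hypothetical.

```{r ex 3_icc}
set.seed(4388)
n_clusters <- 200
n_per <- 50
rho_target <- 0.9

# Between-cluster variance = rho, within-cluster variance = 1 - rho
cluster_effect <- rnorm(n_clusters, mean = 0, sd = sqrt(rho_target))
cluster_id <- rep(seq_len(n_clusters), each = n_per)
y <- 10 + cluster_effect[cluster_id] +
  rnorm(n_clusters * n_per, mean = 0, sd = sqrt(1 - rho_target))

# One-way ANOVA estimator: (MSB - MSW) / (MSB + (k - 1) * MSW)
tab <- anova(lm(y ~ factor(cluster_id)))
ms_between <- tab[["Mean Sq"]][1]
ms_within <- tab[["Mean Sq"]][2]
icc_hat <- (ms_between - ms_within) / (ms_between + (n_per - 1) * ms_within)
icc_hat
```

With a high intraclass correlation, most of the variation lies between clusters, which is why the per-school means in the example above can differ noticeably from `c_mean` while their average stays closer to it.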

---

#### **Example 4**

We can make the intraclass variance explode by forcing "incompatible" `rho` and `c_mean`.

```{r ex 4}
set.seed(4388) # seed added so the output is reproducible, as in the other examples
x <- cluster_gen(c(5, 1000), rho = .5, n_X = 2, n_W = 0, c_mean = 1:5)
```

```{r ex 4_str}
anova(x)
```

---

The remaining examples show other specifications of `cluster_gen`.

#### **Example 5**

The named vector below represents a sampling structure of 1 country, 2 schools, 5 students per school. The naming of the vector is optional.

```{r ex 5}
set.seed(4388)
n <- c(cnt = 1, sch = 2, stu = 5)
cg <- cluster_gen(n = n)
```

```{r ex 5_str}
cg
```
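
A symmetric vector such as the one above can be read as shorthand for a list in which every cluster at a level has the same number of elements. The helper below, `expand_structure`, is hypothetical and not part of `lsasim`. It sketches that reading in base R, on the assumption that each level of the list holds one entry per cluster at the level above, as in the two-level list used in Example 2.

```{r ex 5_expand}
# Hypothetical helper: expand a symmetric vector into its list equivalent
expand_structure <- function(n) {
  out <- vector("list", length(n))
  out[[1]] <- n[[1]]
  for (lvl in seq_along(n)[-1]) {
    # one entry for every cluster that exists at the previous level
    out[[lvl]] <- rep(n[[lvl]], times = sum(out[[lvl - 1]]))
  }
  names(out) <- names(n)
  out
}

expand_structure(c(cnt = 1, sch = 2, stu = 5))
```

Writing the structure as a list becomes necessary only when clusters differ in size, for example two schools with 5 and 8 students.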

---

#### **Example 6**

The named vector below represents a sampling structure of 1 country, 2 schools, and 5 students per school. In this example, the number of continuous variables is set to `n_X = 10`, but only 5 means are supplied: `c_mean = c(0.3, 0.4, 0.5, 0.6, 0.7)`. The function still runs, recycling the means over the remaining five variables. In this case, it reports the warning `Warning: c_mean recycled to fit all continuous variables`.

```{r ex 6, warning = TRUE}
set.seed(4388)
n <- c(cnt = 1, sch = 2, stu = 5)
cg <- cluster_gen(n = n, n_X = 10, c_mean = c(0.3, 0.4, 0.5, 0.6, 0.7))
```

```{r ex 6_str}
cg
```
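
The recycling described above can be pictured with base R alone. The chunk below assumes the means are recycled in the usual R order (the sixth variable reuses the first mean, and so on). That is an assumption about the behaviour, not something taken from the `lsasim` source, and the object names are hypothetical.

```{r ex 6_recycle}
c_mean_short <- c(0.3, 0.4, 0.5, 0.6, 0.7)
n_X_example <- 10

# rep_len() repeats the vector until it reaches the requested length
c_mean_full <- rep_len(c_mean_short, n_X_example)
c_mean_full
```

Supplying all ten means explicitly avoids the warning and makes the intended mean of each variable unambiguous.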

---

#### **Example 7**

The named vector below represents a sampling structure of 3 schools, 2 classes per school, and 5 students per class. Again, naming the vector is optional. Here, `n_X` and `sigma` are specified per level (i.e., schools and classes): `n_X = c(1, 2)` sets the number of continuous variables at the school and class levels, and `sigma = list(.1, c(1, 2))` sets their standard deviations. That is, at level 1 (school) the standard deviation is .1, whereas at level 2 (class) the standard deviations are 1 and 2.

```{r ex 7}
set.seed(4388)
n <- c(school = 3, class = 2, student = 5)
cg <- cluster_gen(n, n_X = c(1, 2), sigma = list(.1, c(1, 2)))
```

```{r ex 7_summary, warning = TRUE}
summary(cg)
```

---

#### **Example 8**

The named vector below represents a sampling structure of 3 schools, 2 classes per school, and 5 students per class. Again, naming the vector is optional. Here, `c_mean` is expressed as a list whose elements correspond to the different levels (i.e., schools and classes). Specifically, `c_mean = list(.1, c(0.55, 0.32))` means that at level 1 (school) the mean of the continuous variable is .1, whereas at level 2 (class) the means are 0.55 and 0.32.

```{r ex 8}
set.seed(4388)
n <- c(school = 3, class = 2, student = 5)
cg <- cluster_gen(n, n_X = c(1, 2), n_W = c(0, 1), c_mean = list(.1, c(0.55, 0.32)))
```

```{r ex 8_summary}
cg
```