---
title: "Common errors & warnings encountered in DImulti()"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Common errors & warnings encountered in DImulti()}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{css styling, echo=FALSE}
span.R {
  font-family: Courier New;
}

span.bad {
  color: red;
}

```

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

options(crayon.enabled = TRUE)

ansi_aware_handler <- function(x, options)
{
  paste0(
    "<pre class=\"r-output\"><code>",
    fansi::sgr_to_html(x = x, warn = FALSE, term.cap = "256"),
    "</code></pre>"
  )
}

old_hooks <- fansi::set_knit_hooks(knitr::knit_hooks,
                                 which = c("output", "message", "error", "warning"))
knitr::knit_hooks$set(
  output = ansi_aware_handler,
  message = ansi_aware_handler,
  warning = ansi_aware_handler,
  error = ansi_aware_handler
)
```

```{r setup}
library(DImodelsMulti)
data("dataBEL"); data("dataSWE"); data("simMV"); data("simRM"); data("simMVRM")
```

<h2>Incorrect argument specifications </h2>

The <span class="R">DImulti()</span> function has a long list of possible parameters, which can make
it difficult to ensure that all arguments supplied are in the requested format. See below for an 
example of each parameter and the included datasets, 
<span class="R">dataBEL</span>, 
<span class="R">dataSWE</span>, 
<span class="R">simMV</span>, 
<span class="R">simRM</span> and 
<span class="R">simMVRM</span>, for examples of dataset formatting.



<h3><span class="R">y</span></h3>

This is a required argument. It can be passed in three forms, using either column names or indices:

* <h4>Univariate</h4>

In this case, we pass a single name or index of a numerical column containing the ecosystem function
response value. See <span class="R">?dataSWE</span> for an example of a univariate dataset suitable 
for use with DImulti().
```{r y_univar, eval = FALSE}
DImulti(y = "YIELD", ..., data = dataSWE)
DImulti(y = 9, ..., data = dataSWE)
```
```{r y_dataSWE, eval = TRUE, echo = FALSE}
dataSWE[1:6, "YIELD", drop = FALSE]
```


* <h4>Multivariate stacked</h4>

In this case, we pass a single name or index of a numerical column containing the ecosystem function
response value. See <span class="R">?dataBEL</span> for an example of a multivariate stacked dataset 
suitable for use with DImulti().
```{r y_MVstack, eval = FALSE}
DImulti(y = "Y", ..., data = dataBEL)
DImulti(y = 9, ..., data = dataBEL)
```
```{r y_dataBEL, eval = TRUE, echo = FALSE}
dataBEL[1:6, "Y", drop = FALSE]
```


* <h4>Multivariate wide</h4>

In this case, we pass multiple names or indices of numerical columns containing the 
ecosystem function response values. See <span class="R">?simMV</span> for an example of a 
multivariate wide dataset suitable for use with DImulti().
```{r y_MVwide, eval = FALSE}
DImulti(y = c("Y1", "Y2", "Y3", "Y4"), ..., data = simMV)
DImulti(y = 9:12, ..., data = simMV)
```
```{r y_simMV, eval = TRUE, echo = FALSE}
simMV[1:6, c("Y1", "Y2", "Y3", "Y4"), drop = FALSE]
```


<h3><span class="R">eco_func</span></h3>

This is a required argument for multivariate data and excluded for univariate data. It can be passed 
in two forms:

* <h4>Multivariate stacked</h4>

In this case, we pass a vector of size two. The first index contains a single column name for the 
ecosystem function response type, a factor, the second contains the autocorrelation structure 
desired, either <span class="R">"UN"</span> or <span class="R">"CS"</span>. See 
<span class="R">?dataBEL</span> for an example of a multivariate stacked dataset suitable for use 
with DImulti().
```{r ecoFunc_MVstack, eval = FALSE}
DImulti(eco_func = c("Var", "UN"), ..., data = dataBEL)
```
```{r ecoFunc_dataBEL, eval = TRUE, echo = FALSE}
dataBEL[1:6, "Var", drop = FALSE]
```


* <h4>Multivariate wide</h4>

In this case, we pass a vector of size two. The first index contains the string "NA", the second 
contains the autocorrelation structure desired, either <span class="R">"UN"</span> or <span class="R">"CS"</span>. See <span class="R">?simMV</span> for an example of a multivariate stacked 
dataset suitable for use with DImulti().
```{r ecoFunc_MVwide, eval = FALSE}
DImulti(eco_func = c("NA", "UN"), ..., data = simMV)
```


<h3><span class="R">time</span></h3>

This is a required argument for repeated measures data and excluded for data taken at a single time 
point. It can be passed in the following form:

* <h4>Repeated Measures</h4>

In this case, we pass a vector of size two. The first index contains a single column name for the 
time point reference, a factor, the second contains the autocorrelation structure 
desired, either <span class="R">"UN"</span>, <span class="R">"CS"</span>, or 
<span class="R">"AR1"</span>. See <span class="R">?dataSWE</span> for an example of a repeated 
measures dataset suitable for use with DImulti().
```{r time, eval = FALSE}
DImulti(time = c("YEARN", "AR1"), ..., data = dataSWE)
```
```{r time_dataSWE, eval = TRUE, echo = FALSE}
dataSWE[c(1, 49, 97, 2, 50, 98), "YEARN", drop = FALSE]
```


<h3><span class="R">unit_IDs</span></h3>

This is a required argument for any data. 

It can be passed a column name or index, referencing a unique reference for each experimental unit, 
used for grouping responses from different ecosystem functions or time points. 
```{r unit_IDs, eval = FALSE}
DImulti(unit_IDs = "plot", ..., data = simMVRM)
DImulti(unit_IDs = 1, ..., data = simMVRM)
```
```{r unit_IDs_simVMRM, eval = TRUE, echo = FALSE}
simMVRM[1:6, "plot", drop = FALSE]
```


<h3><span class="R">prop</span></h3>

This is a required argument for any data. 

It can be passed as a vector of column names or indices, referencing the (numerical) initial 
proportions of any species included in the experimental design. 
In the event that a species is included as a factor as opposed to a numerical value, e.g. if it only
appears at values 0 or 1, only the numerical species should be included here, while the factor 
species is passed through the <span class="R">extra_fixed</span> parameter. A warning will be 
printed to the console warning the user of not meeting the simplex requirement.
```{r prop, eval = FALSE}
DImulti(unit_IDs = paste0("p", 1:4), ..., data = simMVRM)
DImulti(unit_IDs = 2:5, ..., data = simMVRM)
```
```{r prop_simVMRM, eval = TRUE, echo = FALSE}
simMVRM[1:6, 2:5, drop = FALSE]
```


<h3><span class="R">data</span></h3>

This is a required argument for any data. 

This argument should be a dataframe or tibble which contains all columns referenced in the 
<span class="R">DImulti()</span> call. There are some restrictions on columns names allowed in this
dataset. An error will be printed to the console to inform a user of name changes required.
```{r data, eval = FALSE}
DImulti(..., data = simMVRM)
DImulti(..., data = simMVRM)
```
```{r data_simVMRM, eval = TRUE, echo = FALSE}
simMVRM[1:6, , drop = FALSE]
```


<h3><span class="R">DImodel</span></h3>

This is a required argument for any data. 

This argument should be a single string selected from the list of included interaction structures:
<span class="R">"STR"</span>,
<span class="R">"ID"</span>,
<span class="R">"FULL"</span>,
<span class="R">"E"</span>,
<span class="R">"AV"</span>,
<span class="R">"ADD"</span>,
<span class="R">"FG"</span>.
```{r DImodel, eval = FALSE}
DImulti(DImodel = "FULL", ...)
DImulti(DImodel = "AV", ...)
```


<h3><span class="R">FG</span></h3>

This is a required argument if the argument <span class="R">"FG"</span> is passed through 
<span class="R">DImodel</span> and can be excluded otherwise.

This argument should be a vector of strings, of the same length as <span class="R">prop</span>, 
which maps each species in the experiment to a functional grouping.
```{r FG, eval = FALSE}
DImulti(prop = c("G1", "G2", "L1", "L2"), DImodel = "FG", 
        FG = c("Grass", "Grass", "Legume", "Legume"), ..., data = dataBEL)
```


<h3><span class="R">ID</span></h3>

This is an optional argument.

This argument should be a vector of strings, of the same length as <span class="R">prop</span>, 
which maps each species in the experiment to a identity grouping, assuming functional redundancy 
between within-group species for their ID effects, this grouping does not affect interactions.
```{r ID, eval = FALSE}
DImulti(prop = c("G1", "G2", "L1", "L2"), ID = c("Grass", "Grass", "Legume", "Legume"), 
        ..., data = dataBEL)
```


<h3><span class="R">extra_fixed</span></h3>

This is an optional argument.

This parameter can be passed a formula of any additional fixed effects to be included, e.g. 
treatments. These effects can be easily crossed with the automatic DI model formula by prefacing 
the formula passed with <span class="R">1:</span> or <span class="R">1*</span>. See 
<span class="R">?dataSWE</span> for a dataset containing treatments suitable for use with 
<span class="R">DImulti()</span>.
```{r extra_fixed, eval = FALSE}
DImulti(extra_fixed = ~DENS, ..., data = dataSWE)
DImulti(extra_fixed = ~DENS + TREAT, ..., data = dataSWE)
DImulti(extra_fixed = ~1*DENS, ..., data = dataSWE)
DImulti(extra_fixed = ~1:(DENS+TREAT), ..., data = dataSWE)
DImulti(extra_fixed = ~1:DENS:TREAT, ..., data = dataSWE)
```
```{r extra_fixed_dataSWE, eval = TRUE, echo = FALSE}
dataSWE[1:6, c("DENS", "TREAT"), drop = FALSE]
```


<h3><span class="R">estimate_theta</span></h3>

This is an optional argument.

This parameter can be passed a boolean value (<span class="R>TRUE</span> or 
<span class="R>FALSE</span>), indicating whether the user wants the function to use profile 
likelihood to estimate values for the non-linear parameter $\theta$. Defaults to 
<span class="R>FALSE</span>, indicating that estimation will not occur, instead fixed values of 
$\theta$ will be used. 
If <span class="R">"STR"</span> or <span class="R">"ID"</span> was passed 
through the parameter <span class="R">DImodel</span>, then setting 
<span class="R">estimate_theta = TRUE</span> will cause a warning to be printed to the console, as
$\theta$ only applies to interaction terms, which do not exist for those options.
```{r estimate_theta, eval = FALSE}
DImulti(estimate_theta = TRUE, ...)
DImulti(estimate_theta = FALSE, ...)
```


<h3><span class="R">theta</span></h3>

This is an optional argument.

This argument should contain numerical values representing the non-linear parameter $\theta$, which
will be applied as a power to the products of pairs of species in the interaction terms of the 
model. A single value may be passed, which will be applied to each ecosystem function, or a 
different value may be passed for each ecosystem function.
If <span class="R">"STR"</span> or <span class="R">"ID"</span> was passed 
through the parameter <span class="R">DImodel</span>, then setting a non-1 value for
<span class="R">theta</span> will cause a warning to be printed to the console, as
$\theta$ only applies to interaction terms, which do not exist for those options.
```{r theta, eval = FALSE}
DImulti(theta = 1, ...)
DImulti(theta = c(0.5, 1, 1.2), ...)
```


<h3><span class="R">method</span></h3>

This is an optional argument.

This parameter is used to change the estimation method used by <span class="R">DImulti()</span>. The 
options are <span class="R">"REML"</span>, referring to restricted maximum likelihood, and 
<span class="R">"ML"</span>, referring to maximum likelihood. 
Defaults to <span class="R">"REML"</span>.
```{r method, eval = FALSE}
DImulti(method = "REML", ...)
DImulti(method = "ML", ...)
```



<h2>Singular fit </h2>

A singular fit error, thrown by <span class="R">nlme::gls()</span> when fitting the model, occurs 
when an element with value of exactly zero exists in the fixed effect variance covariance matrix of 
the model. This usually caused by rank deficiency in the model, i.e. two terms explaining the same
variance. The fix is usually to simplify/change the fixed effect terms, be it the interactions 
structure, treatment terms, or theta values. To aid in this change, the fixed effect formula and any 
estimated theta values are printed to the console with the error.

As an example, we use the dataset <span class="R">dataSWE</span, which is included in this package.


<span class="bad">XXDO NOT RUNXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</span>
```{r singular_dataSWE, eval = TRUE, error = TRUE}
DImulti(prop = 5:8, y = "YIELD", time = c("YEARN", "CS"), unit_IDs = "PLOT", data = dataSWE, 
        DImodel = "ID", extra_fixed = ~DENS+TREAT)
```
<span class="bad">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</span>
As we include no theta values or interaction terms, and use the simple structure 
<span class="R">"CS"</span>, compound symmetry, for our repeated measure, the only complication in 
this model is the inclusion of the two treatment terms, <span class="R">DENS</span> and 
<span class="R">TREAT</span>. When multiple treatments are included, it is usually good to check 
that one cannot be inferred from the other, e.g. if <span class="R">DENS = High</span> then we know 
that <span class="R">TREAT = 1</span>. Luckily this is not the case here.
```{r singular_dataSWE_DENSTREAT, eval = TRUE, echo = FALSE}
dataSWE[c(1, 16, 31, 40), c("DENS", "TREAT"), drop = FALSE]
```

Instead, we change the formatting of our formula, from creating intercepts to crossing the treatment
with our ID effects, but not each other.
```{r singular_dataSWE_fix, eval = TRUE}
DImulti(prop = 5:8, y = "YIELD", time = c("YEARN", "CS"), unit_IDs = "PLOT", data = dataSWE, 
        DImodel = "ID", extra_fixed = ~1:(DENS+TREAT))
```
Which has resolved our issue.


We show another example using the multivariate dataset <span class="R">dataBEL</span> and estimating
$\theta$. We would first set <span class="R">theta = 1</span> and use model selection to choose an
appropriate interaction structure, which gives us the following model, with a functional group
structure:
```{r singular_dataBEL_Works, eval = TRUE}
model1 <- DImulti(prop = 2:5, y = "Y", eco_func = c("Var", "UN"), unit_IDs = 1, data = dataBEL,
                  FG = c("Grass", "Grass", "Leg", "Leg"), DImodel = "FG", extra_fixed = ~Density, 
                  method = "ML", theta = 1)
```


<h2>Comparing models: ML vs REML </h2>

The optional parameter <span class="R">method</span> in the function 
<span class="R">DImulti()</span> allows a user to change the estimation method used, between the 
default <span class="R">"REML"</span>, restricted maximum likelihood, and 
<span class="R">"ML"</span>, maximum likelihood.

<span class="R">"REML"</span> was chosen as the default as it provides unbiased estimates, as 
opposed to <span class="R">"ML"</span>, however, <span class="R">"REML"</span> is not appropriate
for use with model comparison where fixed effects vary between models, as we can see from the 
following warning message:

<span class="bad">XXDO NOT RUNXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</span>
```{r comparison_dataBEL, eval = TRUE}
model1 <- DImulti(prop = 2:5, y = "Y", eco_func = c("Var", "UN"), unit_IDs = 1, data = dataBEL,
                  FG = c("Grass", "Grass", "Leg", "Leg"), DImodel = "FG", 
                  method = "REML")

model2 <- DImulti(prop = 2:5, y = "Y", eco_func = c("Var", "UN"), unit_IDs = 1, data = dataBEL,
                  FG = c("Grass", "Grass", "Leg", "Leg"), DImodel = "FG", extra_fixed = ~Density,
                  method = "REML")

anova(model1, model2)
```
<span class="bad">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</span>

In this case, we should use <span class="R">"ML"</span> to fit the models for comparison, then refit
the chosen model using <span class="R">"REML"</span>, for the unbiased estimates.