---
title: "Controlling for Variables"
author: "Donny Williams"
date: "5/25/2020"
bibliography: ../inst/REFERENCES.bib
output:
rmarkdown::html_vignette:
  toc: yes
vignette: >
  %\VignetteIndexEntry{Controlling for Variables}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

# Introduction
This vignette describes how to control for variables, a feature introduced in **BGGM** version `2.0.0`.

# Example 1: Multivariate Regression
When controlling for variables, **BGGM** fits a multivariate regression. In fact, a GGM itself can be understood as a multivariate regression in which each node is predicted by an intercept only.

## Notes about Implementation
**BGGM** does not use the typical `R` approach to multivariate regression, which requires writing out 
each outcome variable (and there are typically many in a GGM). Instead, **BGGM** assumes
that the data matrix includes only the variables to be included in the GGM plus the control variables.

### Correct 
Suppose that we want to control for education level, with five variables included in the 
GGM.

```r
library(BGGM)

# data
Y <- bfi[, c(1:5, 27)]

# head
head(Y)

#>       A1 A2 A3 A4 A5 education
#> 61617  2  4  3  4  4        NA
#> 61618  2  4  5  2  5        NA
#> 61620  5  4  5  4  4        NA
#> 61621  4  4  6  5  5        NA
#> 61622  2  3  3  4  5        NA
#> 61623  6  6  5  6  5         3
```

Notice that `Y` includes **only** the five variables and `education`.

## Fit Model
This model can then be fitted with

```r
fit <- estimate(Y, formula = ~ as.factor(education))
```

To show this is indeed a multivariate regression, here are the summarized regression coefficients for the first
outcome.

```r
summ_coef <- regression_summary(fit)

# outcome one
summ_coef$reg_summary[[1]]

#>                       Post.mean Post.sd Cred.lb Cred.ub
#> (Intercept)               0.256   0.095   0.072   0.442
#> as.factor(education)2     0.073   0.128  -0.177   0.323
#> as.factor(education)3    -0.202   0.104  -0.405  -0.001
#> as.factor(education)4    -0.462   0.119  -0.691  -0.233
#> as.factor(education)5    -0.578   0.117  -0.815  -0.346
```

And here are the coefficients from `lm` (a univariate regression for `A1`):

```r
round(
  cbind(
    # summary: coef and se
    summary(lm(scale(A1, scale = FALSE) ~ as.factor(education), data = Y))$coefficients[, 1:2],
    # confidence interval
    confint(lm(scale(A1, scale = FALSE) ~ as.factor(education), data = Y))
  ), 3)


#>                       Estimate Std. Error  2.5 % 97.5 %
#> (Intercept)              0.256      0.093  0.073  0.438
#> as.factor(education)2    0.072      0.125 -0.172  0.316
#> as.factor(education)3   -0.203      0.101 -0.401 -0.004
#> as.factor(education)4   -0.461      0.116 -0.690 -0.233
#> as.factor(education)5   -0.578      0.115 -0.804 -0.351
```
The estimates are very (very) similar.

## Summary
Note that all the other functions work just the same. For example, the relations controlling for education
are summarized with

```r
summary(fit)

#> BGGM: Bayesian Gaussian Graphical Models 
#> --- 
#> Type: continuous 
#> Analytic: FALSE 
#> Formula: ~ as.factor(education) 
#> Posterior Samples: 5000 
#> Observations (n):
#> Nodes (p): 5 
#> Relations: 10 
#> --- 
#> Call: 
#> estimate(Y = Y, formula = ~as.factor(education))
#> --- 
#> Estimates:
#>  Relation Post.mean Post.sd Cred.lb Cred.ub
#>    A1--A2    -0.239   0.020  -0.278  -0.200
#>    A1--A3    -0.109   0.020  -0.150  -0.070
#>    A2--A3     0.276   0.019   0.239   0.312
#>    A1--A4    -0.013   0.021  -0.055   0.026
#>    A2--A4     0.156   0.020   0.117   0.196
#>    A3--A4     0.173   0.020   0.134   0.214
#>    A1--A5    -0.010   0.020  -0.050   0.029
#>    A2--A5     0.150   0.020   0.111   0.189
#>    A3--A5     0.358   0.018   0.322   0.392
#>    A4--A5     0.121   0.020   0.082   0.159
#> --- 
```


### Incorrect
Now suppose we want to control for education, but `gender` is also in `Y`. The following is incorrect:

```r
Y <- bfi[,c(1:5, 26:27)]

head(Y)

#>       A1 A2 A3 A4 A5 gender education
#> 61617  2  4  3  4  4      1        NA
#> 61618  2  4  5  2  5      2        NA
#> 61620  5  4  5  4  4      2        NA
#> 61621  4  4  6  5  5      2        NA
#> 61622  2  3  3  4  5      1        NA
#> 61623  6  6  5  6  5      2         3
```


In this case, with `estimate(Y, formula = ~ as.factor(education))`, the GGM would also include `gender`
(six variables instead of the desired five). This is because all variables not included in `formula` are placed in the GGM. This convention was adopted in **BGGM** to save the user from having to write out each outcome.
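One option for keeping `gender` out of the network is to control for both variables. This is a sketch using the same `formula` interface shown above:

```r
# control for education AND gender, so the GGM includes only A1-A5
fit <- estimate(Y, formula = ~ as.factor(education) + as.factor(gender))
```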

This differs from `lm`, where each outcome needs to be written out, for example `cbind(A1, A2, A3, A4, A5) ~ as.factor(education)`. This is quite cumbersome for a model that includes many nodes.
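For comparison, the multivariate form in base `R` looks like the following sketch (assuming the `bfi` data from above are loaded):

```r
# multivariate regression in base R: every outcome written out by hand
Y <- na.omit(bfi[, c(1:5, 27)])

mv_fit <- lm(cbind(A1, A2, A3, A4, A5) ~ as.factor(education), data = Y)

# one column of coefficients per outcome
coef(mv_fit)
```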

# Example 2: Multivariate Probit
The above data are ordinal, so it is also possible to fit a multivariate probit model (this is the approach used for binary data in **BGGM** as well). The model is fitted with

```r
fit <- estimate(Y, formula = ~ as.factor(education), 
                type = "ordinal", iter = 1000)
```

Note that the multivariate probit model can also be summarized with `regression_summary`.

# Example 3: Gaussian Copula Graphical Model
This final example fits a Gaussian copula graphical model that can be used for mixed data. In this case,
`formula` is not used and instead all of the variables are included in the GGM. 

## Fit Model

This model is estimated with

```r
# data
Y <- na.omit(bfi[,c(1:5, 27)])

# fit type = "mixed"
fit <- estimate(Y, type = "mixed", iter = 1000)

# summary
summary(fit)

#> BGGM: Bayesian Gaussian Graphical Models 
#> --- 
#> Type: mixed 
#> Analytic: FALSE 
#> Formula:  
#> Posterior Samples: 1000 
#> Observations (n):
#> Nodes (p): 6 
#> Relations: 15 
#> --- 
#> Call: 
#> estimate(Y = Y, type = "mixed", iter = 1000)
#> --- 
#> Estimates:
#>       Relation Post.mean Post.sd Cred.lb Cred.ub
#>         A1--A2    -0.217   0.048  -0.294  -0.114
#>         A1--A3    -0.063   0.027  -0.113  -0.011
#>         A2--A3     0.364   0.023   0.317   0.410
#>         A1--A4     0.116   0.038   0.048   0.192
#>         A2--A4     0.241   0.031   0.182   0.303
#>         A3--A4     0.228   0.026   0.174   0.275
#>         A1--A5     0.057   0.031   0.003   0.120
#>         A2--A5     0.186   0.027   0.135   0.241
#>         A3--A5     0.438   0.019   0.399   0.474
#>         A4--A5     0.151   0.025   0.103   0.199
#>  A1--education    -0.016   0.069  -0.125   0.119
#>  A2--education     0.063   0.049  -0.016   0.162
#>  A3--education     0.049   0.025   0.002   0.099
#>  A4--education     0.053   0.026   0.005   0.105
#>  A5--education     0.072   0.024   0.024   0.120
#> --- 
```

Here it is clear that `education` is included in the GGM: its relations with the other nodes appear in the output.

## Select Graph
The graph is selected with

```r
select(fit)
```
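The selected object can also be inspected further. A minimal sketch (the `plot` method for selected objects is assumed here; check `?select` in your installed version):

```r
sel <- select(fit)

# plot the selected graph (assumes a plot method is available)
plot(sel)
```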


# Note
It is possible to control for variables with all methods in **BGGM**, including when comparing groups, Bayesian hypothesis testing, and so on.
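For example, here is a sketch of comparing two groups while controlling for education. It assumes that `ggm_compare_estimate` accepts the same `formula` argument (check that function's documentation):

```r
# split by gender, then compare networks controlling for education
Y <- na.omit(bfi[, c(1:5, 26:27)])
Y_male   <- subset(Y, gender == 1, select = -gender)
Y_female <- subset(Y, gender == 2, select = -gender)

fit <- ggm_compare_estimate(Y_male, Y_female,
                            formula = ~ as.factor(education))
```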