---
title: "Beta-Select Demonstration: SEM by 'lavaan'"
date: "2025-05-01"
output:
  rmarkdown::html_vignette:
    number_sections: true
vignette: >
  %\VignetteIndexEntry{Beta-Select Demonstration: SEM by 'lavaan'}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: "references.bib"
csl: apa.csl
---





# Introduction

This article demonstrates how to use
`lav_betaselect()` from the package
[`betaselectr`](https://sfcheung.github.io/betaselectr/)
to standardize
selected variables in a model fitted
by `lavaan` and forming confidence
intervals for the parameters.

# Data and Model

The sample dataset from the package
`betaselectr` will be used in this
demonstration:


``` r
library(betaselectr)
head(data_test_medmod)
#>          dv       iv      mod      med     cov1     cov2
#> 1  7.487873 11.42573 16.65805 42.28988 54.14051 15.56069
#> 2  8.474931 16.64790 22.66332 42.08692 39.21125 17.61286
#> 3 11.206539 14.81278 22.80955 32.76869 31.97963 20.77333
#> 4 10.148827 15.79632 22.94451 43.96807 42.72187 15.66971
#> 5  7.421606 14.29621 24.51562 37.10942 42.74174 21.97132
#> 6  6.846435 12.00819 25.22163 35.46051 30.85914 22.35444
```

Although mean-centering is not
necessary for testing an interaction,
the computation of the standardized
coefficients of variables involved
in an interaction requires that they
are mean-centered, for now.

We mean-center `iv` and `mod` first:


``` r
data_test_medmod$iv <- data_test_medmod$iv - mean(data_test_medmod$iv)
data_test_medmod$mod <- data_test_medmod$mod - mean(data_test_medmod$mod)
```

This is the path model, fitted by
`lavaan::sem()`:


``` r
library(lavaan)
#> This is lavaan 0.6-19
#> lavaan is FREE software! Please report any bugs.
mod <-
"
med ~ iv + mod + iv:mod + cov1 + cov2
dv ~ med + iv + cov1 + cov2
"
fit <- sem(mod,
           data_test_medmod)
```

The model has a moderator, `mod`, posited
to moderate the effect from `iv` to
`med`. The product term is `iv:mod`.

These are the results:


``` r
summary(fit)
#> lavaan 0.6-19 ended normally after 1 iteration
#> 
#>   Estimator                                         ML
#>   Optimization method                           NLMINB
#>   Number of model parameters                        11
#> 
#>   Number of observations                           200
#> 
#> Model Test User Model:
#>                                                       
#>   Test statistic                                 2.303
#>   Degrees of freedom                                 2
#>   P-value (Chi-square)                           0.316
#> 
#> Parameter Estimates:
#> 
#>   Standard errors                             Standard
#>   Information                                 Expected
#>   Information saturated (h1) model          Structured
#> 
#> Regressions:
#>                    Estimate  Std.Err  z-value  P(>|z|)
#>   med ~                                               
#>     iv                0.626    0.215    2.909    0.004
#>     mod               0.329    0.129    2.560    0.010
#>     iv:mod            0.286    0.039    7.340    0.000
#>     cov1             -0.093    0.070   -1.327    0.185
#>     cov2              0.242    0.133    1.823    0.068
#>   dv ~                                                
#>     med               0.092    0.011    8.098    0.000
#>     iv                0.227    0.038    5.896    0.000
#>     cov1             -0.006    0.013   -0.454    0.650
#>     cov2              0.030    0.025    1.230    0.219
#> 
#> Variances:
#>                    Estimate  Std.Err  z-value  P(>|z|)
#>    .med              60.292    6.029   10.000    0.000
#>    .dv                2.087    0.209   10.000    0.000
```

# Problems With Standardization

We can request the standardized solution
using `lavaan::standardizedSolution()`:


``` r
standardizedSolution(fit,
                     output = "text")
#> 
#> Regressions:
#>                     est.std  Std.Err  z-value  P(>|z|) ci.lower ci.upper
#>   med ~                                                                 
#>     iv                0.182    0.062    2.964    0.003    0.062    0.303
#>     mod               0.165    0.064    2.597    0.009    0.040    0.290
#>     iv:mod            0.432    0.051    8.390    0.000    0.331    0.533
#>     cov1             -0.077    0.058   -1.332    0.183   -0.189    0.036
#>     cov2              0.105    0.057    1.836    0.066   -0.007    0.218
#>   dv ~                                                                  
#>     med               0.459    0.052    8.845    0.000    0.357    0.560
#>     iv                0.331    0.053    6.279    0.000    0.228    0.434
#>     cov1             -0.024    0.054   -0.454    0.650   -0.130    0.081
#>     cov2              0.066    0.054    1.233    0.218   -0.039    0.171
#> 
#> Variances:
#>                     est.std  Std.Err  z-value  P(>|z|) ci.lower ci.upper
#>    .med               0.656    0.050   13.243    0.000    0.559    0.753
#>    .dv                0.569    0.050   11.353    0.000    0.471    0.667
```

However, for this model, there are several
problems:

- The product term, `iv:mod`, is also
  standardized. This is inappropriate.
  One simple but underused solution is
  to standardize the variables *before*
  forming the product term [@friedrich_defense_1982].

- The confidence intervals are formed using
  the delta-method, which has been found
  to be inferior to methods such as
  nonparametric percentile bootstrap
  confidence interval for the standardized
  solution [@falk_are_2018]. Although there
  are situations in which the delta-method
  confidence and the nonparametric
  percentile bootstrap confidences can be
  similar (e.g., sample size is large
  and the sample estimates are not extreme),
  it is still safe to at least try both
  methods and compare the results.

- There are cases in which some variables
  are measured by meaningful units and
  do not need to be standardized. for
  example, if `cov1` is age measured by
  year, then age is more
  meaningful than "standardized age".

- In path analysis, categorical variables
  are usually represented by dummy variables,
  each of them having only two possible
  values (0 or 1). It is not meaningful
  to standardize the dummy variables.

# Beta-Select `lav_betaselect()`

The function `lav_betaselect()` can be used
to solve this problem by:

- standardizing variables before product
  terms are formed,

- standardizing only variables for which
  standardization can facilitate
  interpretation, and

- forming confidence intervals that take
  into account selected standardization.

We call the coefficients computed by
this kind of standardization *beta*s-select
($\beta{s}_{Select}$, $\beta_{Select}$
in singular form),
to differentiate them from coefficients
computed by standardizing all variables,
including product terms.

## Estimates Only

Suppose we only need to
solve the first problem, with the product
term computed after `iv` and `mod`
are standardized:


``` r
fit_beta <- lav_betaselect(fit)
```




``` r
fit_beta
```

This is the output
if printed using the
default options:


```
#> 
#> Selected Standardization:
#>                     
#>  Standard Error: Nil
#> 
#> Parameter Estimates Settings:
#>                                              
#>  Standard errors:                  Standard  
#>  Information:                      Expected  
#>  Information saturated (h1) model: Structured
#> 
#> Regressions:
#>            BetaSelect
#>  med ~               
#>   iv            0.182
#>   mod           0.165
#>   iv:mod        0.400
#>   cov1         -0.077
#>   cov2          0.105
#>  dv ~                
#>   med           0.459
#>   iv            0.331
#>   cov1         -0.024
#>   cov2          0.066
#> 
#> Footnote:
#> - Variable(s) standardized: cov1, cov2, dv, iv, med, mod
#> - Call 'print()' and set 'standardized_only' to 'FALSE' to print both
#>   original estimates and betas-select.
#> - Product terms (iv:mod) have variables standardized before computing
#>   them. The product term(s) is/are not standardized.
```



Compared to the solution with the product
term standardized, the coefficient of
`iv:mod` changed substantially from
0.432 to
0.286. As shown by
@cheung_improving_2022, the coefficient
of *standardized* product term (`iv:mod`)
can be substantially different from the
properly standardized product term
(the product of standardized `iv` and
standardized `mod`).

The footnote will also indicate
variables that are standardized,
and remarked that product terms
are formed *after* standardization.

## Estimates and Bootstrap Confidence Interval

Suppose we want to address
both the first and the second problems,
with

- the product term computed after `iv` and `mod`
  standardized, and

- bootstrap confidence intervals used, that
  take into account the sampling variation
  of the standardizers (the standard deviations).

We can call `lav_betaselect()`
again, with additional arguments
set:


``` r
fit_beta <- lav_betaselect(fit,
                           std_se = "bootstrap",
                           bootstrap = 5000,
                           iseed = 2345,
                           parallel = "snow",
                           ncpus = 20)
#> Finding product terms in the model ...
#> Finished finding product terms.
#> 
#> Compute bootstrapping standardized solution:
```

These are the additional arguments:

- `std_se`: The method to compute the
  standard errors as well as confidence
  intervals. Set to `"bootstrap"` for
  nonparametric bootstrapping.

- `iseed`: The seed for the random number
  generator used for bootstrapping. Set
  this to an integer to make the results
  reproducible.

- `parallel`: The method to be used for
  parallel processing. It will be passed
  to `lavaan::bootstrapLavaan()`. Supported
  values are `"none"`, `"snow"`, and
  `"multicore"`.

- `ncpus`: The number of CPU cores to
  use if `parallel` processing is not
  `"none"`. Default is
  `parallel::detectCores(logical = FALSE) - 1`,
  or the number of *physical* cores
  minus one.



This is the output if
printed with default
options:


``` r
fit_beta
```


```
#> 
#> Selected Standardization:
#>                                              
#>  Standard Error:      Nonparametric bootstrap
#>  Bootstrap samples:   5000                   
#>  Confidence Interval: Percentile             
#>  Level of Confidence: 95.0%                  
#> 
#> Parameter Estimates Settings:
#>                                              
#>  Standard errors:                  Standard  
#>  Information:                      Expected  
#>  Information saturated (h1) model: Structured
#> 
#> Regressions:
#>            BetaSelect    SE      Z p-value Sig  CI.Lo CI.Hi CI.Sig
#>  med ~                                                            
#>   iv            0.182 0.061  2.992   0.005  **  0.063 0.300   Sig.
#>   mod           0.165 0.069  2.389   0.015   *  0.028 0.303   Sig.
#>   iv:mod        0.400 0.047  8.565   0.000 ***  0.298 0.481   Sig.
#>   cov1         -0.077 0.057 -1.353   0.185     -0.186 0.038   n.s.
#>   cov2          0.105 0.061  1.725   0.094   . -0.019 0.219   n.s.
#>  dv ~                                                             
#>   med           0.459 0.052  8.828   0.000 ***  0.348 0.553   Sig.
#>   iv            0.331 0.051  6.480   0.000 ***  0.229 0.431   Sig.
#>   cov1         -0.024 0.058 -0.418   0.686     -0.137 0.090   n.s.
#>   cov2          0.066 0.058  1.139   0.259     -0.050 0.178   n.s.
#> 
#> Footnote:
#> - Variable(s) standardized: cov1, cov2, dv, iv, med, mod
#> - Sig codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> - Standard errors, p-values, and confidence intervals are not computed
#>   for betas-select which are fixed in the standardized solution.
#> - P-values for betas-select are asymmetric bootstrap p-value computed
#>   by the method of Asparouhov and Muthén (2021).
#> - Call 'print()' and set 'standardized_only' to 'FALSE' to print both
#>   original estimates and betas-select.
#> - Product terms (iv:mod) have variables standardized before computing
#>   them. The product term(s) is/are not standardized.
```

In this dataset, with 200 cases,
the delta-method confidence
intervals are close to the
bootstrap confidence intervals,
except obviously for the
product term because the coefficient
of the product term has substantially
different values in the two
solutions.

## Estimates and Bootstrap Confidence Intervals, With Only Selected Variables Standardized

Suppose we want to address also the
the third issue, and standardize only
some of the variables. This can be
done using either `to_standardize`
or `not_to_standardize`.

- Use `to_standardize` when
the number of variables to standardize
is much fewer than the number of variables
not to standardize.

- Use `not_to_standardize`
when the number variables to standardize
is much more than the
the number of variables not to standardize.

For example, suppose we only
need to standardize `dv` and
`iv`, `cov1`, and `cov2`,
this is the call to do
this, setting
`to_standardize` to `c("iv", "dv", "cov1", "cov2")`:


``` r
fit_beta_select_1 <- lav_betaselect(fit,
                                    std_se = "bootstrap",
                                    to_standardize = c("iv", "dv", "cov1", "cov2"),
                                    bootstrap = 5000,
                                    iseed = 2345,
                                    parallel = "snow",
                                    ncpus = 20)
```

If we want to standardize all
variables except for `dv`
and `mod`, we can use
this call, and set
`not_to_standardize` to `c("mod", "dv")`:


``` r
fit_beta_select_2 <- lav_betaselect(fit,
                                    std_se = "bootstrap",
                                    not_to_standardize = c("mod", "dv"),
                                    bootstrap = 5000,
                                    iseed = 2345,
                                    parallel = "snow",
                                    ncpus = 20)
```

The results of these calls are identical,
and only those of the second version are
printed:


``` r
fit_beta_select_2
```


```
#> Selected Standardization:
#>                                              
#>  Standard Error:      Nonparametric bootstrap
#>  Bootstrap samples:   5000                   
#>  Confidence Interval: Percentile             
#>  Level of Confidence: 95.0%                  
#> 
#> Parameter Estimates Settings:
#>                                              
#>  Standard errors:                  Standard  
#>  Information:                      Expected  
#>  Information saturated (h1) model: Structured
#> 
#> Regressions:
#>            BetaSelect    SE      Z p-value Sig  CI.Lo CI.Hi CI.Sig
#>  med ~                                                            
#>   iv            0.182 0.061  2.992   0.005  **  0.063 0.300   Sig.
#>   mod           0.034 0.014  2.389   0.015   *  0.006 0.063   Sig.
#>   iv:mod        0.083 0.010  8.565   0.000 ***  0.062 0.100   Sig.
#>   cov1         -0.077 0.057 -1.353   0.185     -0.186 0.038   n.s.
#>   cov2          0.105 0.061  1.725   0.094   . -0.019 0.219   n.s.
#>  dv ~                                                             
#>   med           0.878 0.116  7.567   0.000 ***  0.635 1.092   Sig.
#>   iv            0.634 0.100  6.337   0.000 ***  0.430 0.826   Sig.
#>   cov1         -0.047 0.112 -0.418   0.686     -0.265 0.168   n.s.
#>   cov2          0.126 0.111  1.137   0.259     -0.093 0.345   n.s.
#> 
#> Footnote:
#> - Variable(s) standardized: cov1, cov2, iv, med
#> - Sig codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> - Standard errors, p-values, and confidence intervals are not computed
#>   for betas-select which are fixed in the standardized solution.
#> - P-values for betas-select are asymmetric bootstrap p-value computed
#>   by the method of Asparouhov and Muthén (2021).
#> - Call 'print()' and set 'standardized_only' to 'FALSE' to print both
#>   original estimates and betas-select.
#> - Product terms (iv:mod) have variables standardized before computing
#>   them. The product term(s) is/are not standardized.
```

The footnotes show that, by
specifying that `dv` and `mod` are not
standardized, all the other four variables
are standardized: `iv`, `med`, `cov1`, and `cov2`.
Therefore, in this case, it is more
convenient to use `not_to_standardize`.

When reporting *beta*s-*select*, researchers need
to state which variables
are standardized and which are not.
This can be done in table notes,
or in a column of the parameter estimate
tables. The output can of `lav_betaselect()`
can be printed with `show_Bs.by` set
to `TRUE` to demonstrate the second
approach:


``` r
print(fit_beta_select_2,
      show_Bs.by = TRUE)
```


```
#> Regressions:
#>            BetaSelect    SE      Z p-value Sig  CI.Lo CI.Hi CI.Sig  Selected
#>  med ~                                                                      
#>   iv            0.182 0.061  2.992   0.005  **  0.063 0.300   Sig.    iv,med
#>   mod           0.034 0.014  2.389   0.015   *  0.006 0.063   Sig.       med
#>   iv:mod        0.083 0.010  8.565   0.000 ***  0.062 0.100   Sig.    iv,med
#>   cov1         -0.077 0.057 -1.353   0.185     -0.186 0.038   n.s.  cov1,med
#>   cov2          0.105 0.061  1.725   0.094   . -0.019 0.219   n.s.  cov2,med
#>  dv ~                                                                       
#>   med           0.878 0.116  7.567   0.000 ***  0.635 1.092   Sig.       med
#>   iv            0.634 0.100  6.337   0.000 ***  0.430 0.826   Sig.        iv
#>   cov1         -0.047 0.112 -0.418   0.686     -0.265 0.168   n.s.      cov1
#>   cov2          0.126 0.111  1.137   0.259     -0.093 0.345   n.s.      cov2
#> 
#> Footnote:
#> - Variable(s) standardized: cov1, cov2, iv, med
#> - Sig codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> - Standard errors, p-values, and confidence intervals are not computed
#>   for betas-select which are fixed in the standardized solution.
#> - P-values for betas-select are asymmetric bootstrap p-value computed
#>   by the method of Asparouhov and Muthén (2021).
#> - Call 'print()' and set 'standardized_only' to 'FALSE' to print both
#>   original estimates and betas-select.
#> - The column 'Selected' lists variable(s) standardized when computing
#>   the standardized coefficient of a parameter. ('NA' for user-defined
#>   parameters because they are computed from other standardized
#>   parameters.)
#> - Product terms (iv:mod) have variables standardized before computing
#>   them. The product term(s) is/are not standardized.
```

## Categorical Variables

When calling `lav_betaselect()`,
variables with only two values in
the dataset are assumed to be categorical
and will not be standardized by default.
This can be overriden by setting
`skip_categorical_x` to `FALSE`, though
not recommended.

# Conclusion

In structural equation modeling, there
are situations in which standardizing
all variables is not appropriate, or
when standardization needs to be done
before forming product terms. We are
not aware of tools that can do appropriate
standardization *and* form confidence
intervals that takes into account the
selective standardization. By promoting
the use of *beta*s-*select* using
`lav_betaselect()`, we hope to make it
easier for researchers to do appropriate
Standardization in when reporting SEM
results.

# References