---
title: "Introduction to LMest"
author: "Bartolucci, F., Pandolfi, S., Pennoni, F.,  Serafini, A."
date: "`r Sys.Date()`"
output:
   bookdown::html_document2:
    base_format: rmarkdown::html_vignette
    toc: true
    toc_depth: 2
    number_sections: false
    fontsize: 11pt
vignette: >
  %\VignetteIndexEntry{Introduction to LMest}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


```{r, echo = FALSE, message = FALSE}
knitr::opts_chunk$set(collapse = T, comment = "#>")
options(tibble.print_min = 4L, tibble.print_max = 4L)
library(LMest)
library(knitr)
opts_chunk$set(fig.align = "center",
               fig.width = 6, fig.height = 5,
               dev.args = list(pointsize=10),
               collapse = TRUE, par = TRUE,
               warning = FALSE, message = FALSE,
               highlight = FALSE)
set.seed(1945)
```

## Introduction 

The package `LMest` allows users
to specify and fit Latent (or Hidden) Markov (LM) models for the analysis of longitudinal continuous data, 
categorical data and time-series.
It 
includes
functions for parameter estimation, via the Expectation-Maximization (EM) algorithm, 
covering
the basic LM model and its extensions with individual covariates 
through
suitable parameterizations. 
Additionally, it provides
functions for 
simulated data 
from these models, 
performing model selection 
(including searching 
for the global maximum likelihood), and 
conducting 
local and global decoding.
The package also offer 
standard errors for model parameters which, 
can be obtained using 
numerical or exact methods 
as well as 
parametric bootstrap
techniques.

This document introduces the `LMest`'s basic set of tools, and 
demonstrates how to apply them for the different model specifications,
using data frames in 
various formats. 

`LMest` contains the following main functions 
each with specific tasks:

| Functions | Description|
|------------|------------|
| `lmestData()`| to manipulate data in long format |
|`lmestFormula()`| to build the formulas specifying the model to be estimated |
|`lmest()`| to estimate LM models for categorical responses with or without covariates |
|`se()`| to obtain standard errors for the estimated LM model |
|`lmestCont()`| to estimate LM models for continuous outcomes with or without covariates|
|`lmestMixed()`| to estimate LM models for categorical responses with initial and transition probabilities of the latent process  allowed to vary across different latent sub-populations|
|`lmestSearch()`| to search for the global maximum of the model log-likelihood and to select the optimal number of latent states|
|`lmestDecoding()`| to perform local and global decoding (**Viterbi** algorithm)|
|`bootstrap()`| to  perform bootstrap parametric resampling to compute standard errors for the parameter estimates|
|`draw()`|  to draw samples from the LM models 
|------------------------------------------

Data available are the following:

| Name | Description|
|------------|------------|
|`data_criminal_sim`|Simulated dataset about crimes committed by a cohort of subjects|
|`data_drug`|Longitudinal dataset about marijuana consumption deriving from the National Youth Survey|
|`data_SRHS_long`|Dataset about self-reported health status deriving from the Health and Retirement Study conducted by the University of Michigan in long format|
|`PSIDlong`|Longitudinal dataset deriving from the  Panel Study of Income Dynamics
|`RLMSdat`|Longitudinal dataset deriving from the Russia Longitudinal Monitoring Survey (RLMS) about job satisfaction|
|`RLMSlong`|Longitudinal dataset deriving from the Russia Longitudinal Monitoring Survey (RLMS) about job satisfaction in long format|
|`NLSYlong`|Longitudinal dataset deriving from the National Longitudinal Survey of Youth (NLSY) about antisocial behavior and self-esteem|
|`data_employment_sim`|Simulated dataset assuming interviews conducted among a nationally representative sample of graduates to investigate their employment status after graduation|
|`data_market_sim`|Simulated dataset assuming observations of customers of four different brands along with the prices of each transaction.
|`data_heart_sim`|Simulated dataset assuming an observational retrospective study to assess the health state progression of individuals after treatment.
|------------------------------------------


See `help(package="LMest")` for further details and `citation("LMest")` for main references.



```{r, message = FALSE, echo=1}
library(LMest)
cat(LMest:::Startup.Message(), sep="")
```


## Data: RLMSlong

This dataset, 
included 
in the `LMest` package, concerns the evaluation of job satisfaction of $n$ = 1,718 individuals followed
over
$T$ = 7 years from 2008 to 2014.
The data come from the [Russia Longitudinal Monitoring Survey](https://www.hse.ru/org/hse/rlms), and are documented in `?RLMSlong`. 
The response variable (named `value`), corresponding to the reported job satisfaction at different time occasions, has five ordered categories from `1: absolutely satisfied` to `5: absolutely not satisfied`. 
The longitudinal dataset is in long format:


```{r}
data("RLMSlong")
dim(RLMSlong)
str(RLMSlong)
```


## Data: PSIDlong

This dataset 
is derived from the Panel Study of Income Dynamics (https://psidonline.isr.umich.edu).
The data used in the following application concern $n$ = 1,446 women who were followed from 1987 to 1993.
There are two binary response variables:
`Y1Fertility` (indicating whether a woman had given birth to a child in a certain year) and
`Y2Employment` (indicating whether she was employed).
The covariates are:  `X1Race` (dummy variable equal to 1 for a black woman);  `X2Age` (in 1986, rescaled by its maximum value); `X3Age2` (squared age);  `X4Education` (number of years of schooling);  `X5Child1_2` (number of children in the family aged between 1 and 2 years, referred to the previous year);  `X6Child3_5`;  `X7Child6_13`;  `X8Child14`;  `X9Income` of the husband (in dollars, referred to the previous year, divided by 1,000).


```{r}
data("PSIDlong")
dim(PSIDlong)
str(PSIDlong)
```

## Data: data_criminal_sim

This simulated dataset
contains $n$ = 60,000 observations related to the complete conviction histories of a cohort of offenders followed from the age of criminal responsibility, 10 years, until 40 years. It  also includes the proportion of non-offenders.
We consider $T=6$ age bands 
each of five years in length.
For every age band, we 
have a binary variable equal to 1 if the subject has been convicted for a crime of one of the following  ten typologies and to 0 otherwise. 
The  typologies of crime are: `y1 violence against the person`, `y2 sexual offenses`, `y3 burglary`, `y4 robbery`, `y5 theft and handling stolen goods`, `y6 fraud and forgery`, `y7 criminal damage`, `y8 drug offenses`, `y9 motoring offenses`, `y10 other offenses`.
`Gender` is a covariate coded equal to 1 for male and 2 for female:

```{r}
data(data_criminal_sim)
dim(data_criminal_sim)
str(data_criminal_sim)
```



## Data: NLSYlong

This dataset in the `LMest` package, also downloadable from the package `panelr`, is derived from the National Longitudinal Survey of Youth (https://www.nlsinfo.org/content/cohorts/nlsy79).
The data are a subset concerning  $n$ = 581 individuals who were followed from 1990 to 1994.
We consider two response variables: `anti`, providing a measure of antisocial behavior (measured on a scale ranging from 0 to 6), `self`, which is a measure of self-esteem  (measured on a scale ranging from 6 to 24).
The covariates are the following:  `momage`, a continuous variable indicating the age of the mother at birth; `gender`, a dummy variable equal to 1 for females;   `childage`, a continuous variable indicating the age of the child at the first interview;  `momwork`, a dummy variable equal to 1 if mother works; `married`, a dummy variable equal to 1 if parents are married;  `hispanic` and `black`, two dummy variables for ethnicity; `pov`, a time-varying binary variable indicating the poverty status of the family in the years 1990, 1992 and 1994.


```{r}
data("NLSYlong")
dim(NLSYlong)
```



## Prepare and explore data

Function `lmestData()` allows us to check and prepare the data. For example, for the `NLSYlong` dataset we have:

```{r}
dt <- lmestData(data = NLSYlong, id = "id", time="time",
                responsesFormula= anti+self ~NULL)
summary(dt, dataSummary="responses", varType=rep("c",ncol(dt$Y)))
```

We can display the summary of each response variable for every time occasion.
For the criminal data, if we select only females, we have:

```{r}
data_criminal_sim<-data.frame(data_criminal_sim)
crimf <- data_criminal_sim[data_criminal_sim$sex == 2,]
dt1 <- lmestData(data = crimf, id = "id", time = "time")
summary(dt1, varType = rep("d",ncol(dt1$Y)))  
```


## Building a formula

Function `lmestFormula()` allows us to specify the model to be estimated. In particular, the formula for the basic LM model is specified as: 

```{r}
fmBasic <- lmestFormula(data = RLMSlong, response = "value")
```

The formula for the LM model with all  covariates affecting  the distribution of the latent process may be specified as:

```{r}
fmLatent <- lmestFormula(data = PSIDlong, response = "Y", 
                         LatentInitial = "X", LatentTransition ="X")
```

in which the column names start with "Y" ar "X".

Alternatively, it is possible to specify subsets of covariates 
that influence 
the initial and transition probabilities of the latent process 
which
can be also different.
For example:

```{r}
fmLatent2 <- lmestFormula(data = PSIDlong, response = "Y", 
                          LatentInitial = c("X1Race","X2Age","X3Age2","X9Income"), 
                          LatentTransition =c("X1Race","X2Age","X3Age2","X9Income"))
```

## Latent Markov models for categorical responses

Function `lmest()` allows us to estimate LM models for categorical responses with different model specifications, 
both 
with and without covariates.
These models rely on 
the 
homogeneous 
first-order  Markov chain 
with a finite number of states.
Maximum likelihood estimation of model parameters is performed through the EM algorithm.
Standard errors for the parameter estimates are obtained by exact computation of the information matrix or through reliable numerical approximations of this matrix.  
This can be done by using option `out_se=TRUE` or by using the suitable function 
`se()`.



Using `PSIDlong`, the basic LM model with time  heterogenous transition probabilities can be estimated as follows:

```{r, eval=TRUE, add=TRUE, warning=FALSE,  results='hide'}
mod <- lmest(responsesFormula = fmLatent$responsesFormula,
             index = c("id","time"),
             data = PSIDlong, k = 2) 
```


The function requires a data in long format, so 
"id" column and a "time" column must be specified in the argument "index". The number of latent state may be specified as a single 
value
or a vector of integer values as follows: 

```{r, eval=TRUE, add=TRUE, warning=FALSE, results='hide'}
mod <- lmest(responsesFormula = fmLatent$responsesFormula,
             index = c("id","time"),
             data = PSIDlong, k = 1:3) 
```

The suitable number of latent states is selected using the BIC or AIC criterion and returned.

Print method shows the main results:

```{r}
print(mod)
```

Standard errors can be obtained with  function `se()` as:

```{r}
se(mod)
```


For the data `PSIDlong`, we can estimate an LM model with covariates affecting the distribution of the latent process by fixing $k$ = 2 latent states as follows:

```{r, eval=TRUE, add=TRUE, warning=FALSE}
mod2 <- lmest(responsesFormula = fmLatent$responsesFormula,
             latentFormula =  fmLatent$latentFormula,
             index = c("id","time"),
             data = PSIDlong, k = 2,
             paramLatent = "multilogit",
             start = 0, out_se=TRUE) 
```



Every 10 iterations of the EM algorithm, 
the function displays the following information:

-  the chosen model specification 
- number of latent states
- the type of starting values used
- the number of iterations 
- the value of the log-likelihood at the end of the current iteration 
- the difference 
in log-likelihood 
from the previous iteration
- the discrepancy between the corresponding parameter vectors.


The `summary()` method returns the estimation results:

```{r, echo=TRUE, eval=TRUE, include=TRUE}
summary(mod2)
```

A  plot of the conditional response probabilities referred to the categories of the multivariate response is obtained as:

```{r, fig.width = 5, fig.height = 4}
plot(mod2, what = "CondProb")
```

To  
gain insight the results, we  observe that the second latent state corresponds to women with the lowest propensity 
for fertility and the highest propensity 
for employment. 

The first state corresponds to women with both low propensity 
for 
fertility and 
a low likelihood of having a job.


A path diagram of the estimated transition probabilities is shown below:
 
 
```{r}
plot(mod2, what="transitions")
```

The averaged estimated transition matrix 
reveals 
a high level of persistence 
within
the same latent state. 
Specifically, 
 the probability of transitioning from the first state to the second state is approximately 0.06.

The estimated marginal distribution of the latent states for each time occasion can be represented  in the following plot:

\vspace{1cm}
```{r}
plot(mod2, what="marginal")
```
\vspace{1cm}


## Latent Markov model for continuous outcomes

The LM model for continuous outcomes may be estimated by using function `lmestCont()`, assuming a Gaussian distribution for the response variables given the latent process. 
For the data `NLSYlong`,  we estimate a multivariate LM model with covariates in the latent process. The selection of the number of latent states can be performed by setting option `k`
appropriately:

```{r,results='hide', warning=FALSE}
dt$data$id = as.numeric(dt$data$id)
dt$data$time = as.numeric(dt$data$time)
modc <- lmestCont(responsesFormula = anti + self  ~ NULL,
                  latentFormula = ~ gender + childage + hispanic + black + pov +
                    momwork + married|   gender + childage + hispanic + black+ pov+
                    momwork + married,
                  index = c("id", "time"), 
                  data = dt$data,
                  k = 1:3, 
                  modBasic=1,  
                  output = TRUE, 
                  tol=10^-1)
```

We can display the estimation results with a plot of the indices based on the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC):


```{r} 
plot(modc,what="modSel")
```

A plot of the ellipse of the estimated overall density, weighted by the estimated marginal distribution of the latent states, is obtained as follows:


```{r} 
plot(modc, what="density")
```

A density plot for each component (latent state) is obtained as:

```{r} 
plot(modc,what="density",components=c(1,2))
```

The latent states are ordered according to increasing values of antisocial behavior.

The path diagram of the estimated transition probabilities is obtained as follows:


```{r} 
plot(modc,what="transitions")
```


Other results and asymptotic tests can be obtained using the estimated standard errors


```{r}
semodc<-se(modc)
```


```{r}
TabBe <-cbind(modc$Be, semodc$seBe, modc$Be/semodc$seBe)
colnames(TabBe) <- c("estBe",  "s.e.Be","t-test") 
round(TabBe,3) 
```

The argument `Be`, returned by the function, contains the estimated regression parameters affecting the distribution of the initial probabilities. The gender log-odds (second row of `Be`) is  negative and significant, indicating that females tend to be allocated to the first latent state at the beginning of the survey.


```{r}
TabGa1 <- cbind(modc$Ga,semodc$seGa,modc$Ga/semodc$seGa) 
colnames(TabGa1) <- c("estGa(1)","estGa(2)", "s.e.Ga(1)","s.e.Ga(2)", "t-test(1)","t-test(2)")  
round(TabGa1,3) 
```

  
Output `Ga` contains the estimated parameters affecting the distribution of the transition probabilities.  
These parameters
measure
the influence of each covariate on the transition between  states.


## Mixed Latent Markov model

Function `lmestMixed()` allows us to estimate mixed LM models for categorical responses, 
taking 
into account additional sources of (time-fixed) dependence in the data.
For the data `data_criminal_sim` we are interested to evaluate the patterns of criminal behavior among individuals.
To this end, 
we estimate a model with $k_1$ = 2 latent classes and $k_2$ = 2 latent states, restricting the analysis to females.


```{r,results='hide'}
responsesFormula <- lmestFormula(data = crimf,response = "y")$responsesFormula

modm <- lmestMixed(responsesFormula =responsesFormula,
                  index = c("id","time"),
                  k1 = 2, k2 = 2,
                  tol = 10^-3,
                  data = crimf)

```

Results:

```{r}
summary(modm)
round(modm$Psi[2, , ], 3)
```

We can identify the first latent state as 
representing 
females with null or very low tendency to commit crimes, whereas the second latent state corresponds to criminals 
primarily engaged in 
theft, burglary, and other offences
According to the estimated transition matrix, females classified in the first cluster present a higher probability (around 0.5) 
of moving
from the second to the first latent state 
compared to 
those 
in
the second cluster (of around 0.4).  
This indicates 
a more pronounced tendency 
for individuals in the first cluster 
to commit less crimes over time.


## Search for the global maximum of the log-likelihood

Function `lmestSearch()` addresses both model selection and the multimodality of the likelihood function. It employs different initializations to search for the global maximum of the log-likelihood function.

Two main criteria are provided to select the number of latent states: AIC and BIC.
For example,  for the `RLMSlong` dataset we can estimate the basic LM model for increasing values of the latent states $k$ ranging from  1 to 4:

```{r search, include=TRUE, results='hide'}
out <- lmestSearch(responsesFormula =  fmBasic$responsesFormula,
                   index = c("id","time"), 
                   data = RLMSlong,version ="categorical", k = 1:4,
                   modBasic = 1, seed = 123)
```


We can display the results of the model selection using:


```{r}
summary(out)
```


The minimum BIC index corresponds to the model with $k$=4 latent states, and the model has 31 free parameters. 

The estimation results for the selected number of states can be displayed as follows:




```{r}
mod4 <- out$out.single[[4]]
summary(mod4)
```

A  plot of the conditional response probabilities referred to the categories of the univariate response is obtained with:

```{r} 
plot(mod4, what="CondProb")
```

## Local and global decoding

Function `lmestDecoding()` allows us to predict the sequence of latent states for the sample units on the basis of the output of the main estimation functions, thus enabling
"dynamic pattern recognition".

For the basic LM model estimated
by using  data `PSIDlong`  the local (`Ul`) and global (`Ug`) decoding (using the Viterbi algorithm) are given by: 

```{r} 
dec <- lmestDecoding(mod)

head(dec$Ul)

head(dec$Ug)
```

## Bootstrapping 

Function `bootstrap()` allows us to obtain standard errors through parametric bootstrap. A reasonable number of bootstrap samples is $B=200$.  
For illustrative purposes, we show how to obtain $B=2$ samples from the output `modc` of the model with covariates for the  `NLSYlong` dataset, estimated 
using the
function `lmestCont()`  as follows:

```{r, include=TRUE, results='hide'}
mboot <- bootstrap(modc, n = 581, B = 2, seed = 172)
```
\vspace{1cm}

Object `seMu` contains the estimated standard errors for the conditional means:

```{r}
mboot$seMu
```

## Draw samples 

Function `draw()`
allows us to draw samples from the estimated basic LM model. 
For example, considering the basic LM model illustrated  with the `RLMSlong` dataset,  we can obtain a sample of responses of size $n$ = 100  as follows: 


```{r}
draw3 <- draw(est = mod4, format = "matrices", seed = 4321, n = 100)
head(draw3$Y)
```


Each line of $Y$ contains the sample responses of each unit for the seven time occasions.

The package also provides functions to draw samples from  other LM models.