---
title: "Exercise 2: Nested logit model"
author: Kenneth Train and Yves Croissant
date: '`r Sys.Date()`'
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Exercise 2: Nested logit model}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r label = setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, widtht = 65)
options(width = 65)
```

The data set `HC` from `mlogit` contains data in `R` format on the
choice of heating and central cooling system for 250 single-family,
newly built houses in California.

The alternatives are:

- Gas central heat with cooling `gcc`,
- Electric central resistence heat with cooling `ecc`,
- Electric room resistence heat with cooling `erc`,
- Electric heat pump, which provides cooling also `hpc`,
- Gas central heat without cooling `gc`,
- Electric central resistence heat without cooling `ec`,
- Electric room resistence heat without cooling `er`.


Heat pumps necessarily provide both heating and cooling such that heat
pump without cooling is not an alternative.

The variables are:

- `depvar` gives the name of the chosen alternative,
- `ich.alt` are the installation cost for the heating portion
  of the system,
- `icca` is the installation cost for cooling
- `och.alt` are the operating cost for the heating portion of
  the system
- `occa` is the operating cost for cooling
- `income` is the annual income of the household

Note that the full installation cost of alternative `gcc` is
`ich.gcc+icca`, and similarly for the operating cost and for the
other alternatives with cooling.

@. Run a nested logit model on the data for two nests and one log-sum
coefficient that applies to both nests. Note that the model is
specified to have the cooling alternatives (`gcc}, `ecc},
`erc}, `hpc}) in one nest and the non-cooling alternatives
(`gc}, `ec}, `er}) in another nest.

```{r nest1}
library("mlogit")
data("HC", package = "mlogit")
HC <- dfidx(HC, varying = c(2:8, 10:16), choice = "depvar")
cooling.modes <- idx(HC, 2) %in% c('gcc', 'ecc', 'erc', 'hpc')
room.modes <- idx(HC, 2) %in% c('erc', 'er')
# installation / operating costs for cooling are constants, 
# only relevant for mixed systems
HC$icca[! cooling.modes] <- 0
HC$occa[! cooling.modes] <- 0
# create income variables for two sets cooling and rooms
HC$inc.cooling <- HC$inc.room <- 0
HC$inc.cooling[cooling.modes] <- HC$income[cooling.modes]
HC$inc.room[room.modes] <- HC$income[room.modes]
# create an intercet for cooling modes
HC$int.cooling <- as.numeric(cooling.modes)
# estimate the model with only one nest elasticity
nl <- mlogit(depvar ~ ich + och +icca + occa + inc.room + inc.cooling + int.cooling | 0, HC,
             nests = list(cooling = c('gcc','ecc','erc','hpc'), 
             other = c('gc', 'ec', 'er')), un.nest.el = TRUE)
summary(nl)
``` 

(a) The estimated log-sum coefficient is $0.59$. What does this
estimate tell you about the degree of correlation in unobserved
factors over alternatives within each nest?

> The correlation is approximately $1-0.59=0.41$. It's a moderate
> correlation.
  
(b) Test the hypothesis that the log-sum coefficient is 1.0 (the value
that it takes for a standard logit model.) Can the hypothesis that the
true model is standard logit be rejected?

> We can use a t-test of the hypothesis that the log-sum coefficient
> equal to 1. The t-statistic is :

```{r }
 (coef(nl)['iv'] - 1) / sqrt(vcov(nl)['iv', 'iv'])
``` 

> The critical value of t for 95\% confidence is 1.96. So we can reject
> the hypothesis at 95\% confidence.
 
> We can also use a likelihood ratio test because the multinomial logit
> is a special case of the nested model.

```{r }
# First estimate the multinomial logit model
ml <- update(nl, nests = NULL)
lrtest(nl, ml)
```  

> Note that the hypothesis is rejected at 95\% confidence, but not at
> 99\% confidence.

2. Re-estimate the model with the room alternatives in one nest and
the central alternatives in another nest. (Note that a heat pump is a
central system.)

```{r }
nl2 <- update(nl, nests = list(central = c('ec', 'ecc', 'gc', 'gcc', 'hpc'), 
                    room = c('er', 'erc')))
summary(nl2)
``` 

(a) What does the estimate imply about the substitution patterns
across alternatives? Do you think the estimate is plausible?

> The log-sum coefficient is over 1. This implies that there is more
> substitution across nests than within nests.  I don't think this is
> very reasonable, but people can differ on their concepts of what's
> reasonable.

(b) Is the log-sum coefficient significantly different from 1?

\begin{answer}[5]
  The t-statistic is :
```{r tstat}
 (coef(nl2)['iv'] - 1) / sqrt(vcov(nl2)['iv', 'iv'])
lrtest(nl2, ml)
```   

> We cannot reject the hypothesis at standard confidence levels.

(c) How does the value of the log-likelihood function compare for this
model relative to the model in exercise 1, where the cooling
alternatives are in one nest and the heating alternatives in the other
nest.

```{r }
logLik(nl)
logLik(nl2)
```   

> The $\ln L$ is worse (more negative.) All in all, this seems like a less
> appropriate nesting structure.

3. Rerun the model that has the cooling alternatives in one nest and
the non-cooling alternatives in the other nest (like for exercise 1),
with a separate log-sum coefficient for each nest.

```{r nl3}
nl3 <- update(nl, un.nest.el = FALSE)
```

(a) Which nest is estimated to have the higher correlation in
unobserved factors? Can you think of a real-world reason for this nest
to have a higher correlation?

> The correlation in the cooling nest is around 1-0.60 = 0.4 and that
> for the non-cooling nest is around 1-0.45 = 0.55. So the correlation
> is higher in the non-cooling nest.  Perhaps more variation in
> comfort when there is no cooling. This variation in comfort is the
> same for all the non-cooling alternatives.

(b) Are the two log-sum coefficients significantly different from each
other? That is, can you reject the hypothesis that the model in
exercise 1 is the true model?


> We can use a likelihood ratio tests with models `nl` and
> `nl3`.

```{r lrtest1}
lrtest(nl, nl3)
```   

> The restricted model is the one from exercise 1 that has one log-sum
> coefficient. The unrestricted model is the one we just estimated. The
> test statistics is 0.6299. The critical value of chi-squared with 1
> degree of freedom is 3.8 at the 95\% confidence level. We therefore
> cannot reject the hypothesis that the two nests have the same log-sum
> coefficient.

4. Rewrite the code to allow three nests. For simplicity, estimate
only one log-sum coefficient which is applied to all three
nests. Estimate a model with alternatives `gcc`, `ecc` and
`erc` in a nest, `hpc` in a nest alone, and alternatives
`gc`, `ec` and `er` in a nest. Does this model seem better or
worse than the model in exercise 1, which puts alternative `hpc` in
the same nest as alternatives `gcc`, `ecc` and `erc`?

```{r threentest}
nl4 <- update(nl, nests=list(n1 = c('gcc', 'ecc', 'erc'), n2 = c('hpc'),
                    n3 = c('gc', 'ec', 'er')))
summary(nl4)
``` 

> The $\ln L$ for this model is $-180.26$, which is lower (more negative)
> than for the model with two nests, which got $-178.12$.