---
title: Model components for fitted models with plm
author:
- name: Yves Croissant
date: '`r Sys.Date()`'
output: rmarkdown::html_vignette
bibliography: ../inst/REFERENCES.bib
vignette: >
  %\VignetteIndexEntry{Model components for fitted models with plm}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, echo=FALSE}
library("knitr")
opts_chunk$set(message = FALSE, warning = FALSE)
```

plm tries to follow as close as possible the way models are fitted
using `lm`. This relies on the following steps, using the
`formula`-`data` with some modifications:

- compute internally the `model.frame` by getting the relevant
  arguments (`formula`, `data`, `subset`, `weights`, `na.action` and
  `offset`) and the supplementary argument,
- extract from the `model.frame` the response `y` (with `pmodel.response`) and
  the model matrix `X` (with `model.matrix`),
- call the (non-exported) estimation function `plm.fit` with `X` and `y` as
  arguments.



Panel data has a special structure which is described by an
`index` argument. This argument can be used in the `pdata.frame` function which
returns a `pdata.frame` object. A `pdata.frame` can be used as input to the `data`
argument of `plm`. If the `data` argument of `plm` is an ordinary
`data.frame`, the `index` argument can also be supplied as an argument of
`plm`. In this case, the `pdata.frame` function is called internally to
transform the data.

Next, the `formula`, which is the first and mandatory argument of
`plm` is coerced to a `Formula` object. 

`model.frame` is then called, but with the `data` argument in the
first position (a `pdata.frame` object) and the `formula` in the
second position. This unusual order of the arguments enables to use a
specific `model.frame.pdata.frame` method defined in `plm`.

As for the `model.frame.formula` method, a `data.frame` is returned,
with a `terms` attribute.

Next, the `X` matrix is extracted using `model.matrix`. The usual way
to do so is to feed the function with two arguments, a `formula` or a
`terms` object and a `data.frame` created with `model.frame`. `lm` uses
something like `model.matrix(terms(mf), mf)` where `mf` is a
`data.frame` created with `model.frame`. Therefore, `model.matrix`
needs actually one argument and not two and we therefore wrote a
`model.matrix.pdata.frame` which does the job ; the method first checks
that the argument has a `term` attribute, extracts the `terms`
(actually the `formula`) and then computes the model's matrix `X`.

The response `y` is usually extracted using `model.response`, with a
`data.frame` created with `model.frame` as first argument, but it is
not generic. We therefore created a generic called `pmodel.response`
and provide a `pmodel.response.pdata.frame` method. We illustrate
these features using a simplified (in terms of covariates) example
with the `SeatBelt` data set:

```{r }
library("plm")
data("SeatBelt", package = "pder")
SeatBelt$occfat <- with(SeatBelt, log(farsocc / (vmtrural + vmturban)))
pSB <- pdata.frame(SeatBelt)
```

We start with an OLS (pooling) specification: 

```{r }
formols <- occfat ~ log(usage) + log(percapin)
mfols <- model.frame(pSB, formols)
Xols <- model.matrix(mfols)
y <- pmodel.response(mfols)
coef(lm.fit(Xols, y))
```

which is equivalent to:

```{r }
coef(plm(formols, SeatBelt, model = "pooling"))
```

Next, we use an instrumental variables specification. Variable `usage` is
endogenous and instrumented by three variables indicating the law
context: `ds`, `dp`, and `dsp`. 

The model is described using a two-parts formula, the first part of
the RHS describing the covariates and the second part the
instruments. The following two formulations can be used:

```{r }
formiv1 <- occfat ~ log(usage) + log(percapin) | log(percapin) + ds + dp + dsp
formiv2 <- occfat ~ log(usage) + log(percapin) | . - log(usage) + ds + dp + dsp
```
The second formulation has two advantages:

- in the common case when a lot of covariates are instruments, these
  covariates don't need to be indicated in the second RHS part of the
  formula,
- the endogenous variables clearly appear as they are proceeded by a
  `-` sign in the second RHS part of the formula.

The formula is coerced to a `Formula`, using the `Formula`
package. `model.matrix.pdata.frame` then internally calls 
`model.matrix.Formula` in order to extract the covariates and
instruments model matrices:


```{r }
mfSB1 <- model.frame(pSB, formiv1)
X1 <- model.matrix(mfSB1, rhs = 1)
W1 <- model.matrix(mfSB1, rhs = 2)
head(X1, 3) ; head(W1, 3)
```
For the second (and preferred formulation), the `dot` argument should
be set and is passed to the `Formula` methods. `.` has actually two
meanings: 

- all available covariates,
- the previous covariates used while updating a formula.

which correspond respectively to `dot = "seperate"` (the default) and
`dot = "previous"`. See the difference between the following two examples:


```{r }
library("Formula")
head(model.frame(Formula(formiv2), SeatBelt), 3)
head(model.frame(Formula(formiv2), SeatBelt, dot = "previous"), 3)
```
In the first case, all the covariates are returned by `model.frame` as
the `.` is understood by default as "everything".

In `plm`, the `dot` argument is internally set to
`previous` so that the end-user doesn't have to worry about these
subtleties. 

```{r }
mfSB2 <- model.frame(pSB, formiv2)
X2 <- model.matrix(mfSB2, rhs = 1)
W2 <- model.matrix(mfSB2, rhs = 2)
head(X2, 3) ; head(W2, 3)
```
The IV estimator can then be obtained as a 2SLS estimator: First,
regress the covariates on the instruments and get the fitted values:

```{r }
HX1 <- lm.fit(W1, X1)$fitted.values
head(HX1, 3)
```

Next, regress the response on these fitted values:

```{r }
coef(lm.fit(HX1, y))
```

The same can be achieved in one command by using the `formula`-`data` interface
with `plm`:

```{r }
coef(plm(formiv1, SeatBelt, model = "pooling"))
```


or with the `ivreg` function from package `AER` (or with the newer function `ivreg`
in package `ivreg` superseding `AER::ivreg()`):

```{r }
coef(AER::ivreg(formiv1, data = SeatBelt))
```



```{r eval = FALSE, include = FALSE}
X2 <- model.matrix(Formula(form1), mfSB, rhs = 2, dot = "previous")

formols <- occfat ~ log(usage) + log(percapin)  | . - log(usage) +  ds + dp + dsp

form1 <- occfat ~ log(usage) + log(percapin) + log(unemp) + log(meanage) + 
    log(precentb) + log(precenth) + log(densrur) + log(densurb) + 
    log(viopcap) + log(proppcap) + log(vmtrural) + log(vmturban) + 
    log(fueltax) + lim65 + lim70p + mlda21 + bac08
form2 <- . ~ . |  . - log(usage) + ds + dp +dsp

jorm1 <- occfat ~ log(usage) + log(percapin) + log(unemp) + log(meanage) + 
    log(precentb) + log(precenth) + log(densrur) + log(densurb) + 
    log(viopcap) + log(proppcap) + log(vmtrural) + log(vmturban) + 
    log(fueltax) + lim65 + lim70p + mlda21 + bac08 | . - log(usage) + 
    ds + dp + dsp
jorm2 <- noccfat ~ . | .
```