---
title: "ModelMatrixModel"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{demo_ModelMatrixModel}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{css, echo=FALSE}
pre code {
  white-space: pre-wrap;
}
```

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
The `model.matrix()` function in R is a convenient way to transform a training dataset for modeling. However, it does not save any of the parameters used in the transformation, so it is hard to apply the same transformation to a test dataset or to new data. The ModelMatrixModel package was created to solve this problem.

## setup
```{r}
# devtools::install_github("xinyongtian/R_ModelMatrixModel") # install from GitHub
rm(list = ls())
library(ModelMatrixModel)
set.seed(10)
traindf = data.frame(x1 = sample(LETTERS[1:5], replace = TRUE, 20),
                     x2 = rnorm(20, 100, 5),
                     x3 = factor(sample(c("U", "L", "P"), replace = TRUE, 20)),
                     y = rnorm(20, 10, 2))
set.seed(20)
newdf = data.frame(x1 = sample(LETTERS[1:5], replace = TRUE, 3),
                   x2 = rnorm(3, 100, 5),
                   x3 = sample(c("U", "L", "P"), replace = TRUE, 3))

head(traindf)
sapply(traindf, class) # categorical input variables can be either character or factor
```
## problem with model.matrix() function
```{r}
f1=formula("~x1+x2")
head(model.matrix(f1, traindf),2)
head(model.matrix(f1, newdf),2)
```
Note that the number of columns differs between the two outputs, which is a problem when applying a fitted model to new data. To avoid this, column `x1` in both datasets has to be converted to a factor with exactly the same levels. That becomes cumbersome when there are many categorical columns. In addition, the parameters of other transformations, such as orthogonal polynomials, also need to be saved.
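
The manual workaround described above can be sketched in base R as follows. The data frames `train_demo` and `new_demo` are made up for this illustration and are not the vignette's `traindf` and `newdf`; the point is only that the levels seen in training have to be stored and reused for every categorical column.

```{r}
# Illustrative data (hypothetical, separate from traindf/newdf)
train_demo = data.frame(x1 = c("A", "B", "C", "A"), x2 = c(1, 2, 3, 4))
new_demo = data.frame(x1 = c("B", "B"), x2 = c(5, 6))

# Store the levels seen in training, then reuse them for the new data
x1_levels = sort(unique(train_demo$x1))
train_demo$x1 = factor(train_demo$x1, levels = x1_levels)
new_demo$x1 = factor(new_demo$x1, levels = x1_levels)

mm_train_demo = model.matrix(~ x1 + x2, train_demo)
mm_new_demo = model.matrix(~ x1 + x2, new_demo)

# Both matrices now have the same columns
identical(colnames(mm_train_demo), colnames(mm_new_demo))
```

This has to be repeated, and the levels kept somewhere, for each categorical column, which is the bookkeeping the package is meant to take over.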

## ModelMatrixModel comes to the rescue
### fit data to create a ModelMatrixModel object
```{r}
f2 = formula("~ 1+x1+x2") # "1" is needed in order to output the intercept column
mm = ModelMatrixModel(f2, traindf, remove_1st_dummy = TRUE, sparse = FALSE)
```

```{r}
class(mm)
head(mm$x, 2) # note: "_Intercept_" is the intercept column
```
### transform new data
```{r}
mm_pred=predict(mm,newdf)
head(mm_pred$x,2)
```

## dummy variable
### keep first dummy variable
```{r}
mm = ModelMatrixModel(~x1+x2+x3, traindf, remove_1st_dummy = FALSE)
```
The default is to keep the first dummy variable.
```{r}
data.frame(as.matrix(head(mm$x,2)))
mm_pred=predict(mm,newdf)
data.frame(as.matrix(head(mm_pred$x,2)))
```
### dummy variable with interaction
#### keep 1st dummy variable
```{r}
mm=ModelMatrixModel(~x2+x3+x2:x3,traindf) 
data.frame(as.matrix(head(mm$x,2))) # ':' in column names is replaced with '_X_'
mm_pred=predict(mm,newdf)
data.frame(as.matrix(head(mm_pred$x,2)))
```
#### remove 1st dummy variable
```{r}
mm = ModelMatrixModel(~x2*x3, traindf, remove_1st_dummy = TRUE)
data.frame(as.matrix(head(mm$x,2)))
mm_pred=predict(mm,newdf)
data.frame(as.matrix(head(mm_pred$x,2)))
```
### invalid level in new data
It is common for a categorical column in new data to contain a level that was not seen in the training data (an invalid level). This can be handled as follows:
```{r}
mm=ModelMatrixModel(~x2+x3,traindf) 
data.frame(as.matrix(head(mm$x,2)))
newdf2=newdf
newdf2[1,'x3']='z'  #create invalid level
mm_pred=predict(mm,newdf2,handleInvalid = "keep") 
```
The default, `handleInvalid = "keep"`, keeps the row with the invalid level and sets all of its dummy variables to 0. With `handleInvalid = "error"`, an error is thrown instead.
```{r}
data.frame(as.matrix(head(mm_pred$x,2))) 
```
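
To make the "keep" behaviour concrete, here is a minimal base R sketch of what an all-zero dummy row looks like. It is not the package's implementation; `seen_levels` and `x3_new_demo` are hypothetical names used only for this illustration.

```{r}
# Levels assumed to have been seen in training (illustrative)
seen_levels = c("L", "P", "U")
# New values; "z" was never seen in training
x3_new_demo = c("z", "U", "L")

# One indicator column per training level
dummies_demo = sapply(seen_levels, function(lv) as.integer(x3_new_demo == lv))
rownames(dummies_demo) = x3_new_demo
dummies_demo

# The row for the unseen level "z" contains only zeros
rowSums(dummies_demo)
```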

## poly() in formula
ModelMatrixModel can save the parameters of an orthogonal polynomial transformation, so that the same basis is applied to new data.
```{r}
mm=ModelMatrixModel(~poly(x2,3)+x3,traindf) 
data.frame(as.matrix(head(mm$x,2)))
mm_pred=predict(mm,newdf)
data.frame(as.matrix(head(mm_pred$x,2)))
```
It also works with raw polynomial transformations.
```{r}
mm = ModelMatrixModel(~poly(x2,3,raw=TRUE)+x3, traindf)
data.frame(as.matrix(head(mm$x,2)))
mm_pred=predict(mm,newdf)
data.frame(as.matrix(head(mm_pred$x,2)))
```
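
For comparison, the parameters that have to be stored for an orthogonal polynomial can be inspected in base R. The sketch below uses only `stats::poly()` and its `predict()` method with made-up vectors; whether the package relies on the same mechanism internally is an assumption, not something this vignette states.

```{r}
# Illustrative vectors (hypothetical, separate from traindf/newdf)
x_train_demo = c(1, 2, 3, 4, 5, 6)
x_new_demo = c(2.5, 4, 7)

# Fit the orthogonal basis on the training values
p_train = poly(x_train_demo, 2)
attr(p_train, "coefs") # stored parameters: alpha and norm2

# Apply the stored basis to new values instead of refitting it
p_new = predict(p_train, x_new_demo)
p_new
```

Calling `poly(x_new_demo, 2)` directly would instead build a new basis from the new values, so its columns would not be comparable with the training columns.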

## scale and center
The training dataset can be centered and scaled, and the same centering and scaling parameters are then applied to new data.
```{r}
mm = ModelMatrixModel(~x2+x3, traindf, scale = TRUE, center = TRUE)
data.frame(as.matrix(head(mm$x,2)))
mm_pred=predict(mm,newdf)
data.frame(as.matrix(head(mm_pred$x,2)))
```
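
The idea of reusing training parameters can be sketched with base R's `scale()`. The vectors below are made up for illustration, and this is not necessarily how the package stores its parameters.

```{r}
# Illustrative vectors (hypothetical, separate from traindf/newdf)
x_train_demo2 = c(10, 20, 30, 40)
x_new_demo2 = c(25, 50)

# Center and scale the training values; the parameters are kept as attributes
scaled_train = scale(x_train_demo2)
ctr = attr(scaled_train, "scaled:center") # training mean
scl = attr(scaled_train, "scaled:scale")  # training standard deviation

# Apply the training mean and standard deviation to the new values
scaled_new = scale(x_new_demo2, center = ctr, scale = scl)
scaled_new
```

Scaling the new values with their own mean and standard deviation would put them on a different scale from the training data, which is why the training parameters are reused.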