% $Id: jss_np.Rnw,v 1.69 2008/07/22 19:01:59 jracine Exp jracine $

%\VignetteIndexEntry{The crs Package}
%\VignetteDepends{crs}
%\VignetteKeywords{nonparametric, spline, categorical}
%\VignettePackage{crs}

\documentclass[nojss]{jss}

\usepackage{amsmath,amsfonts,enumitem}
\usepackage[utf8]{inputenc} 

\newcommand{\field}[1]{\mathbb{#1}}
\newcommand{\R}{\field{R}}

%% Change the default page sizes.

\setlength{\topmargin}{-0.25in}
\setlength{\textheight}{8.5in}
\setlength{\oddsidemargin}{.0in}
\setlength{\evensidemargin}{.0in}
\setlength{\textwidth}{6.5in}
\setlength{\footskip}{.5in}

%% need no \usepackage{Sweave.sty}

\author{Jeffrey S.~Racine\\McMaster University}

\title{The \pkg{crs} Package}

\Plainauthor{Jeffrey S.~Racine}

\Plaintitle{The crs Package}

\Abstract{ 
  
  This vignette outlines the implementation of the regression spline
  method contained in the \proglang{R} \pkg{crs} package, and also
  presents a few illustrative examples.

}

\Keywords{nonparametric, semiparametric, regression spline,
  categorical data}

\Plainkeywords{nonparametric, semiparametric, regression spline, categorical data}

\Address{Jeffrey S.~Racine\\
  Department of Economics\\
  McMaster University\\
  Hamilton, Ontario, Canada, L8S 4L8\\
  E-mail: \email{racinej@mcmaster.ca}\\
  URL: \url{http://www.mcmaster.ca/economics/racine/}\\
}

\begin{document}

%% Note that you should use the \pkg{}, \proglang{} and \code{} commands.

%% Note - fragile using \label{} in \section{} - must be outside

%% For graphics

<<eval=TRUE,echo=FALSE,keep.source=TRUE,results=hide>>=
library(crs)
options(prompt = "R> ", crs.messages = FALSE, digits = 3)
@ 

%% <<fig=TRUE,eval=TRUE, height=, width=>>=

\section*{Introduction}

The \pkg{crs} package implements a framework for nonparametric
regression splines that admits both continuous and categorical
predictors. The categorical predictors can be handled in two ways: (i)
using kernel weighting, where the kernel functions are tailored to the
discrete support of the categorical predictors \citep{RACINE_LI:2004},
and (ii) using indicator basis functions. The fundamental difference
between these two approaches is that the use of indicator basis
functions consumes degrees of freedom via the number of columns in the
spline basis matrix, while kernel weighting does not.

This package implements the approaches described in
\citet{MA_RACINE_YANG:2011} and \citet{MA_RACINE:2013} when the option
\code{kernel=TRUE} is selected, as described below. In addition, this
package implements a range of related methods and has options that
(hopefully) make it appealing for applied projects, research, and
pedagogical purposes alike.

Data-driven methods can be used for selecting the spline degree,
number of segments/knots, and bandwidths: leave-one-out
cross-validation (\code{cv.func = "cv.ls"}) \citep{STONE:1974,STONE:1977},
generalized cross-validation (\code{cv.func="cv.gcv"})
\citep{CRAVEN_WAHBA:1979}, and the information-based criterion
(\code{cv.func="cv.aic"}) proposed by
\citet{HURVICH_SIMONOFF_TSAI:1998}. Details of the implementation are
as follows (a brief code sketch follows the list):

\begin{enumerate}
  [label=(\roman{*}),ref=(\roman{*})]

\item the degree of the spline and number of segments (i.e.~knots
  minus one) for each continuous predictor can be set manually, as can
  the bandwidths for each categorical predictor (if appropriate)
  
\item alternatively, any of the data-driven criteria (i.e.\
  \code{cv.func=}) can be used to select (a) the degree of the spline
  (holding the number of segments/knots fixed at any user-set value)
  and the bandwidths for the categorical predictors (if appropriate),
  (b) the number of segments (holding the degree of the spline fixed
  at any user-set value) and the bandwidths for the categorical
  predictors (if appropriate), or (c) both the number of segments and
  the degree of the spline for each continuous predictor together with
  the bandwidths for each categorical predictor (if appropriate)
  
\item when indicator basis functions are used instead of kernel
  smoothing, whether to include each categorical predictor or not can
  be specified manually or chosen via any \code{cv.func} method
  
\item we allow the degree of the spline for each continuous predictor
  to include zero, the inclusion indicator for each categorical
  predictor to equal zero, and the bandwidth for each categorical
  predictor to equal one; when the degree/inclusion indicator is zero
  or the bandwidth is one, the corresponding variable is removed from
  the regression entirely. In this manner, irrelevant predictors can
  be removed automatically by any \code{cv.func} method, negating the
  need for pre-testing \citep{HALL_RACINE_LI:2004,HALL_LI_RACINE:2007}
  
\end{enumerate}
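For example, the following sketch (not run) illustrates these options
using hypothetical data \code{y}, \code{x} (numeric), and \code{z} (a
factor); the argument names are those documented in \code{?crs}, and
the particular values shown are purely illustrative.

<<eval=FALSE,echo=TRUE,keep.source=TRUE>>=
## Not run: a minimal sketch assuming data y, x (numeric), and z (factor)
## (i) all smoothing parameters set manually, no data-driven selection
model <- crs(y ~ x + z, cv = "none", degree = 3, segments = 2, lambda = 0.1)
## (ii)-(iv) data-driven selection using one of the three criteria
model <- crs(y ~ x + z, cv.func = "cv.ls")   # least-squares cross-validation
model <- crs(y ~ x + z, cv.func = "cv.gcv")  # generalized cross-validation
model <- crs(y ~ x + z, cv.func = "cv.aic")  # corrected AIC
@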

The design philosophy of the \pkg{crs} package aims to closely mimic
the behavior of the \code{lm} function. Indeed, the implementation
relies on \code{lm} for computation of the spline coefficients,
obtaining fitted values, prediction and the like. 95\% confidence
bounds for the fit and derivatives are constructed from asymptotic
formulae and automatically generated. Below we describe in more detail
the specifics of the implementation for the interested reader.

\section*{Implementation}

Kernel-based methods such as local constant regression (i.e.\ the
Nadaraya--Watson estimator; \citealp{NADARAYA:1965,WATSON:1964}) and
local polynomial regression \citep{FAN_GIJBELS:1996} are based on the
concept of `local' weighted averaging. Regression spline methods, on
the other hand, are `global' in nature since a single least-squares
procedure delivers the function estimate over the entire data range
\citep{STONE:1994}. This `global nature' implies that constructing
regression splines is less computationally burdensome than
constructing their kernel-based counterparts, which contributes to
their practical appeal relative to kernel-based approaches.

However, while kernel-based regression methods admit a rich array of
predictor types, spline regression methods can be limited in their
potential applicability as they are predicated on \emph{continuous}
predictors only. In applied settings we often encounter
\emph{categorical} predictors such as strength of preference
(``strongly prefer'', ``weakly prefer'', ``indifferent'', and so
forth). When confronted with categorical predictors, researchers
typically break their data into subsets governed by the values of the
categorical predictors (i.e.~they break their data into `cells') and
then conduct regression using only the response and continuous
predictors lying in each cell. Though consistent, this `frequency'
approach can be inefficient. Recent developments in the kernel
smoothing of categorical data \citep{LI_RACINE:2007} suggest more
efficient estimation approaches in such settings. The \pkg{crs}
package considers two complementary approaches that seamlessly handle
the mix of continuous and categorical predictors often encountered in
applied settings.

\subsection*{The Underlying Model}

We presume the reader wishes to model the unknown conditional mean in
the following location-scale model,
\begin{equation*}
  Y=g\left( \mathbf{X},\mathbf{Z}\right) +\sigma \left( \mathbf{X},\mathbf{Z}
  \right) \varepsilon,
\end{equation*}
where $g(\cdot )$ is an unknown function, $\mathbf{X=}\left(
  X_{1},\ldots ,X_{q}\right) ^{\mathrm{T}}$ is a $q$-dimensional
vector of continuous predictors, and $\mathbf{Z}=\left( Z_{1},\ldots
  ,Z_{r}\right) ^{\mathrm{T}}$ is an $r$-dimensional vector of
categorical predictors. Letting $\mathbf{z} =\left( z_{s}\right)
_{s=1}^{r}$, we assume that $z_{s}$ takes $c_{s}$ different values in
$D_{s}\equiv \left\{ 0,1,\ldots ,c_{s}-1\right\} $, $ s=1,\ldots ,r$,
and let $c_{s}$ be a finite positive constant.

For the continuous predictors the regression spline model employs the
B-spline routines in the GNU Scientific Library
(\url{http://www.gnu.org/software/gsl/}). The B-spline function is the
maximally differentiable interpolative basis function (B-spline stands
for `basis-spline'), and a B-spline with no internal knots is a
B\'ezier curve.

Heuristically, we conduct linear (in parameters) regression using the
\proglang{R} function \code{lm}. However, we replace each continuous
predictor with a B-spline basis of potentially differing order. For
the tensor product bases we set \code{intercept=TRUE} for each
univariate spline basis, while for the additive spline bases we adopt
the \code{intercept=FALSE} variants and include instead an intercept
in the model (the B-splines will therefore not sum to one, i.e., an
order $m$ B-spline with one segment (two knots/breakpoints) has $m+1$
columns and we drop the first, as is often done, though see
\citet{ZHOU_WOLFE:2000} for an alternative approach based upon
shrinkage methods). This allows multiple bases to coexist when there
is more than one continuous predictor without introducing
singularities. The tensor product basis is given by
\begin{equation*}
B_1\otimes B_2\otimes \dots \otimes B_q,
\end{equation*}
where $\otimes$ is the Kronecker product in which the products operate
column-wise and $B_j$ is the basis matrix for continuous predictor
$j$, $j=1,\dots,q$, as outlined above. We also support a `generalized'
polynomial B-spline basis that consists of a varying-order polynomial
with appropriate interactions. A general $p$th order local polynomial
estimator for the \emph{multivariate} predictor case is notationally
more cumbersome (recall that $q$ denotes the number of continuous
predictors). The general multivariate case is considered by
\citet{MASRY:1996a}, who develops carefully considered notation and
establishes the uniform almost sure convergence rate and the pointwise
asymptotic normality result for the local polynomial estimator of
$g(x)$ and its derivatives up to order $p$. Borrowing from
\citet{MASRY:1996a}, we introduce the following notation:
\begin{equation*}
  r = (r_1,\dots ,r_q) , \quad r! = r_1! \times \dots
  \times r_q! ,
  \quad
  \bar r = \sum_{j=1}^q r_j,
\end{equation*}
\begin{align}
  \label{regression:eq:Masry 2}
  x^r = x_1^{r_1}\times \dots \times x_{q}^{r_q}, \quad \sum_{ 0 \leq
    \bar r \leq p} &= \sum_{j=0 }^p \sum_{r_1=0}^j \dots
  \sum_{r_q=0}^j\\
  &\quad\text{ (with $\bar r \equiv r_1+\dots +r_q = j$) }\nonumber
\end{align}
and
\begin{equation*}
  (D^r g)(x) = \frac{ \partial^r g(x) }{ \partial x_1^{r_1}
    \dots \partial x_q^{r_q} }.
\end{equation*}
Using this notation, and assuming that $g(x)$ has derivatives of total
order $p+1$ at a point $x$, we can approximate $g(z)$ locally using a
multivariate polynomial of total order $p$ given by
\begin{equation*}
  g(z) \stackrel{\sim}{=} \sum_{ 0 \leq \bar r \leq p }
  \frac{1}{r!} (D^r g)(v)\big|_{v=x}(z-x)^r.
\end{equation*}
The generalized (varying-order) polynomial basis involves using the
following expression in place of \eqref{regression:eq:Masry 2} above,
\begin{equation*}
  \sum_{ 0 \leq \bar r \leq {\max\{p_1,\dots,p_q\}}} = \sum_{j=0
  }^{\max\{p_1,\dots,p_q\}} \sum_{r_1=0}^{j\le p_1} \dots
  \sum_{r_q=0}^{j\le p_q}.
\end{equation*}
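To fix ideas regarding the tensor product construction above, the
following toy sketch (not run) builds a two-predictor tensor product
basis by taking the row-wise Kronecker product of two univariate
B-spline bases produced by \code{gsl.bs}; the helper
\code{rowwise.kronecker} is purely illustrative and is not part of the
package's internals.

<<eval=FALSE,echo=TRUE,keep.source=TRUE>>=
## Illustrative only: row-wise Kronecker product of two basis matrices,
## i.e., each row of the result is kronecker(B1[i, ], B2[i, ])
rowwise.kronecker <- function(B1, B2) {
  t(sapply(seq_len(nrow(B1)), function(i) kronecker(B1[i, ], B2[i, ])))
}
n <- 100
x1 <- runif(n)
x2 <- runif(n)
B1 <- as.matrix(gsl.bs(x1, degree = 3, intercept = TRUE))
B2 <- as.matrix(gsl.bs(x2, degree = 3, intercept = TRUE))
B.tensor <- rowwise.kronecker(B1, B2)
dim(B.tensor)  ## n rows, ncol(B1) * ncol(B2) columns
@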

When additive B-spline bases are employed we have a semiparametric
`additive' spline model (no interactions among variables), while when
the tensor product is employed we have a fully nonparametric model
(interactions among all variables). Whether to use the additive,
tensor product, or generalized polynomial basis can be determined
automatically via any \code{cv.func} method (see the options for
\code{basis=} in \code{?crs} and the sketch below).
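For instance, one might fit each basis type directly or let the
cross-validation criterion choose among them; the following sketch
(not run) uses the \code{basis=} values documented in \code{?crs} with
hypothetical data \code{y}, \code{x1}, \code{x2}, and \code{z}.

<<eval=FALSE,echo=TRUE,keep.source=TRUE>>=
## Not run: additive, tensor product, and generalized polynomial bases
model.add <- crs(y ~ x1 + x2 + z, basis = "additive")
model.ten <- crs(y ~ x1 + x2 + z, basis = "tensor")
model.glp <- crs(y ~ x1 + x2 + z, basis = "glp")
## or let the cross-validation criterion choose the basis type
model.cv  <- crs(y ~ x1 + x2 + z, basis = "auto")
@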

We offer the option to use categorical kernel weighting
(\code{lm(\dots,weights=L)}) to handle the presence of categorical
predictors (see below for a description of \code{L}). We also offer the
option of using indicator basis functions for the categorical
predictors (again taking care to remove one column to avoid
singularity given the presence of the intercept term in the
model). These bases are then treated similarly to the bases $B_j$ for
continuous predictors described above.

\subsection*{Example: A B-spline and its First Derivative}

  The figure below presents an example of a B-spline and its first
  derivative (the spline derivatives are required in order to compute
  derivatives from the spline regression model).

\setkeys{Gin}{width=0.45\textwidth}
\begin{center}
<<fig=TRUE,echo=FALSE,keep.source=TRUE>>=
degree <- 5
x <- seq(0,1,length=1000)
B <- gsl.bs(x,degree=degree,intercept=TRUE)
matplot(x,B,type="l")
@ 
<<fig=TRUE,echo=FALSE,keep.source=TRUE>>=
deriv <- 1
B <- gsl.bs(x,degree=degree,deriv=deriv,intercept=TRUE)
matplot(x,B,type="l")
@ 
\end{center}

Above we plot a degree-\Sexpr{degree} B-spline (left) with one segment
(two knots) and its \Sexpr{deriv}st-order derivative (right).

\subsection*{Least-Squares Estimation of the Underlying Model}

We estimate $\beta \left( \mathbf{z}\right) $ by minimizing the
following weighted least squares criterion,
\begin{equation*}
  \widehat{\beta }\left( \mathbf{z}\right) =\arg \min_{\beta \left( \mathbf{z}
    \right) \in \mathbb{R}^{\mathbf{K}_{n}}}\sum\limits_{i=1}^{n}\left\{ Y_{i}-
    \mathcal{B}\left( \mathbf{X}_{i}\right) ^{\mathrm{T}}\beta \left( \mathbf{z}\right) \right\} ^{2}L\left( \mathbf{Z}_{i},\mathbf{z},\mathbf{\lambda }
  \right) ,
\end{equation*}
where $\mathcal{B}\left( \mathbf{X}_{i}\right)$ is the
$\mathbf{K}_{n}$-dimensional vector of spline basis functions
(additive, tensor product, or generalized polynomial) evaluated at
$\mathbf{X}_{i}$ and $L\left( \mathbf{Z}_{i},\mathbf{z},\mathbf{\lambda }\right)$
is the categorical kernel weight function described below.
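In \proglang{R} terms this is simply a weighted \code{lm} fit. The
following sketch (not run) conveys the idea, with \code{B} an
illustrative spline basis matrix and \code{L} a vector of kernel
weights for a given evaluation point $\mathbf{z}$; it is not the
package's internal implementation.

<<eval=FALSE,echo=TRUE,keep.source=TRUE>>=
## Not run: weighted least squares for a given evaluation point z,
## with B the spline basis matrix and L the kernel weights L(Z_i, z, lambda)
fit <- lm(y ~ B - 1, weights = L)  # B already contains an intercept column
beta.hat.z <- coef(fit)            # estimate of beta(z)
fitted.z <- B %*% beta.hat.z       # fitted values at this evaluation point
@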

\subsection*{Placement of Knots}

The user can determine where knots are to be placed using one of two
methods (a brief sketch follows the list):

\begin{enumerate}
  [label=(\roman{*}),ref=(\roman{*})]
  
\item knots can be placed at equally spaced quantiles whereby an equal
  number of observations lie in each segment (`quantile knots')
  
\item knots can be placed at equally spaced intervals (`uniform knots')
  
\end{enumerate}
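The corresponding option is \code{knots=}; a sketch (not run) with
argument values as documented in \code{?crs}:

<<eval=FALSE,echo=TRUE,keep.source=TRUE>>=
## Not run: quantile versus uniform knot placement
model.q <- crs(y ~ x + z, knots = "quantiles")
model.u <- crs(y ~ x + z, knots = "uniform")
@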

\subsection*{Kernel Weighting}

Let $Z_i$ be an $r$-dimensional vector of categorical/discrete
predictors. We use $z_s$ to denote the $s$th component of $z$, assume
that $z_s$ takes $c_s$ different values in $D_s
\stackrel{def}{=} \{0,1,\dots ,c_s-1\}$, $s=1,\dots,r$, and let
$c_s\geq 2$ be a finite positive constant. For expositional simplicity
we will consider the case in which the components of $z$ are
unordered.

For an unordered categorical predictor, we use a variant of the kernel
function outlined in \citet{AITCHISON_AITKEN:1976} defined as
\begin{equation}
  \label{eq:barL(new)}
  l( Z_{is}, z_{s},\lambda_s) =
  \left\{ \begin{array}{ll} 1, & \text{ when $ Z_{is} = z_s$},\\
      \lambda_s, &  \text{  otherwise}.
    \end{array}  \right.
\end{equation}
Let ${\bf 1}(A)$ denote the usual indicator function, which assumes
the value one if $A$ holds true, zero otherwise. Using
\eqref{eq:barL(new)}, we can construct a product kernel function given
by
\begin{equation*}
  L(Z_i,z,\lambda) = \prod_{s=1}^r l( Z_{is},z_s,\lambda_s) =
  \prod_{s=1}^{r} \lambda_s^{ {\bf 1}( Z_{is} \neq z_s) },
\end{equation*}
while for an ordered categorical we use the function defined by
\begin{equation*}
  \tilde l(Z_{is},z_s,\lambda_s) = \lambda_s^{|Z_{is}-z_s| }
\end{equation*}
and modify the product kernel function appropriately. When $Z$
contains a mix of ordered and unordered variables we select the
appropriate kernel for each variable's type when constructing the
product kernel.

Note that when $\lambda_s=1$ all observations are `pooled', hence the
variable $z_s$ is removed from the resulting estimate, while when
$\lambda_s=0$ only observations lying in a given cell are used to form
the estimate.
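A toy implementation of this product kernel (not run, and not the
package's internal code) might read as follows; \code{Z} is an
$n\times r$ matrix of numerically coded categorical predictors,
\code{z} a length-$r$ evaluation point, \code{lambda} a length-$r$
bandwidth vector, and \code{ordered} a length-$r$ logical flag.

<<eval=FALSE,echo=TRUE,keep.source=TRUE>>=
## Illustrative only: product kernel weights for one evaluation point z
product.kernel <- function(Z, z, lambda, ordered) {
  L <- rep(1, nrow(Z))
  for (s in seq_along(z)) {
    if (ordered[s]) {
      l.s <- lambda[s]^abs(Z[, s] - z[s])          # ordered kernel
    } else {
      l.s <- ifelse(Z[, s] == z[s], 1, lambda[s])  # unordered kernel
    }
    L <- L * l.s
  }
  L  # e.g., passed to lm(y ~ B - 1, weights = L)
}
@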

\subsection*{Estimation Details}

Estimating the model requires construction of the spline bases and
their tensor product (if specified) along with the categorical kernel
weighting function. Then, for a given degree and number of segments
for each continuous predictor and bandwidth for each categorical
predictor (or indicator bases if \code{kernel=FALSE}), the model is
fit via least-squares.

All smoothing parameters can be set manually by the user if so
desired; to do so, the option \code{cv="none"} must be used, otherwise
the manually specified values will instead serve as the starting
points for the search when \code{cv="nomad"} (`nonsmooth mesh adaptive
direct search', see \citealp{NOMAD} and \citealp{LEDIGABEL:2011} and
the references therein).

The degree and bandwidth vector can be jointly determined via any
\code{cv.func} method by setting the option \code{cv="nomad"} or
\code{cv="exhaustive"} (exhaustive search).

Setting the option \code{cv="nomad"} computes NOMAD-based
cross-validation directed search while setting \code{cv="exhaustive"}
computes exhaustive cross-validation directed search for each unique
combination of the degree and segment vector for each continuous
predictor from \code{degree=degree.min} through
\code{degree=degree.max} (default 0 and 10, respectively) and from
\code{segments=segments.min} through \code{segments=segments.max}
(default 1 and 10, respectively).

When \code{kernel=TRUE} setting the option \code{cv="exhaustive"}
computes bandwidths ($\in[0,1]$) obtained via numerical minimization
(see \code{optim}) for each unique combination of the degree
and segment vectors (restarting can be conducted via
\code{restarts=}). When conducting \code{cv="nomad"} the number of
multiple starts can be controlled by \code{nmulti=}. The model
possessing the lowest criterion function value is then selected as the
final model.

Note that \code{cv="exhaustive"} is often infeasible (this
combinatoric problem can become impossibly large to compute in finite
time), hence \code{cv="nomad"} is the default. However, with
\code{cv="nomad"} one should set \code{nmulti=} to some sensible value
greater than zero (say, 10 or larger) to reduce the chance of becoming
trapped in local minima.
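By way of illustration, the following sketch (not run) contrasts the
search strategies; argument names are as documented in \code{?crs},
and the particular values are purely illustrative.

<<eval=FALSE,echo=TRUE,keep.source=TRUE>>=
## Not run: NOMAD-based search with multiple restarts (the default strategy)
model <- crs(y ~ x + z, cv = "nomad", nmulti = 10)
## exhaustive search over degree/segment combinations (can be very slow)
model <- crs(y ~ x + z, cv = "exhaustive", degree.max = 5, segments.max = 5)
## no search: use the smoothing parameters supplied by the user
model <- crs(y ~ x + z, cv = "none", degree = 4, segments = 2, lambda = 0.5)
@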

\subsection*{Data-Driven Smoothing Parameter Criteria}

We incorporate three popular approaches for setting the smoothing
parameters of the regression spline model, namely least-squares
cross-validation, generalized cross-validation, and an AIC method
corrected for use in nonparametric settings.

Let the fitted value from the spline regression model be denoted $\hat
Y_i=B_{m}(X_i)^T\hat \beta(Z_i)$.  Letting $\hat\varepsilon_i=Y_i-\hat
Y_i$ denote the $i$th residual from the categorical regression spline
model, the least-squares cross-validation function is given by
\begin{equation*}
CV=\frac{1}{n}\sum_{i=1}^n\frac{\hat\varepsilon_i^2}{(1-h_{ii})^2},
\end{equation*}
where $h_{ii}$ denotes the $i$th diagonal element of the spline basis
projection matrix (see below for details). This computation can be
done with effectively one pass through the data set. Since $h_{ii}$ is
computed routinely for robust diagnostics by many statistics programs,
it can be obtained along with (and hence as cheaply as) the vector of
spline coefficients themselves. Thus least-squares cross-validation is
computationally appealing, particularly for large data sets.
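For a single \proglang{R} \code{lm} fit this amounts to the following
sketch (not run), where \code{B} denotes the spline basis matrix and,
for simplicity, the unweighted (\code{kernel=FALSE}) case is shown;
the diagonal elements $h_{ii}$ are available via \code{hatvalues}.

<<eval=FALSE,echo=TRUE,keep.source=TRUE>>=
## Not run: leave-one-out cross-validation from a single fit
fit <- lm(y ~ B - 1)   # B: the spline basis matrix
h <- hatvalues(fit)    # h_ii, the diagonal of the hat matrix
cv.ls <- mean((residuals(fit)/(1 - h))^2)
@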

Let $H$ denote the $n\times n$ weighting matrix such that $\hat Y=HY$,
whose $i$th diagonal element is the $h_{ii}$ appearing above, and let
$\text{tr}(H)=\sum_{i=1}^n h_{ii}$ denote its trace. The matrix $H$ is
often called the `hat matrix' or `smoother matrix' and depends on $X$
but not on $Y$. The `generalized' cross-validation function is
obtained by replacing $h_{ii}$ in the above formula with its average
value, $\text{tr}(H)/n$ \citep{CRAVEN_WAHBA:1979}.

The information-based criterion proposed by
\citet{HURVICH_SIMONOFF_TSAI:1998} is given by
\begin{equation*}
  \text{AIC}_c=\ln(\hat\sigma^2)+
  \frac{1+\text{tr}(H)/n}{1-\{\text{tr}(H)+2\}/n},
\end{equation*}
where
\begin{equation*}
  \hat\sigma^2=\frac{1}{n}\sum_{i=1}^n\hat\varepsilon_i^2
  =Y'(I-H)'(I-H)Y/n.
\end{equation*}

Each of these criterion functions can be minimized with respect to the
unknown smoothing parameters either by numerical optimization
procedures or by exhaustive search.

Though the above criteria are asymptotically equivalent in terms of
the bandwidths they deliver ($\text{tr}(H)/n\to 0$ as $n\to\infty$),
they may differ in finite-sample settings when the smoothing parameter
is small (i.e.\ $\text{tr}(H)/n$ is large), with the AIC$_c$ criterion
penalizing undersmoothing more heavily than either the least-squares
cross-validation or generalized cross-validation criteria (the AIC$_c$
criterion effectively applies an infinite penalty for
$\text{tr}(H)/n\ge 1/2$).

\subsection*{Pruning}

Once a model has been selected via cross-validation
(i.e.~\code{degree}, \code{segments}, \code{include} or \code{lambda}
have been selected), there is the opportunity to (potentially) further
refine the model by adding the option \code{prune=TRUE} to the
\code{crs} function call. Pruning is accomplished by conducting
stepwise cross-validated variable selection using a modified version
of the \code{stepAIC} function in the \proglang{R} \pkg{MASS} package,
where the function \code{extractAIC} is replaced with the function
\code{extractCV}, with additional modifications where
necessary. Pruning of potentially superfluous bases is undertaken;
however, the pruned model (potentially containing a subset of the
bases) is returned \emph{only if its cross-validation score is lower
  than that of the model being pruned}. When this is not the case a
warning is issued to this effect. A final pruning stage is commonplace
in the spline framework and may positively impact the finite-sample
efficiency of the resulting estimator, depending on the rank of the
model being pruned. Note that this option can only be applied when
\code{kernel=FALSE}.
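A sketch of this option (not run; recall that pruning requires
\code{kernel=FALSE}):

<<eval=FALSE,echo=TRUE,keep.source=TRUE>>=
## Not run: cross-validated fit followed by stepwise pruning of the bases;
## the pruned fit is returned only if it improves the cross-validation score
model <- crs(y ~ x1 + x2 + z, kernel = FALSE, prune = TRUE)
summary(model)
@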

\section*{Illustrative Examples}

Next we provide a few illustrative examples that may be of interest to
the reader.

\subsection*{Example: One Categorical/One Continuous Predictor}

By way of illustration we consider a simple example involving one
continuous and one discrete predictor.

<<echo=TRUE,keep.source=TRUE>>=
set.seed(42)
n <- 1000
x <- runif(n)
z <- rbinom(n,1,.5)
y <- cos(2*pi*x) + z + rnorm(n,sd=0.25)
z <- factor(z)
model <- crs(y~x+z)
summary(model)
@ 

The function \code{crs} called in this example returns a \code{crs}
object.  The generic functions \code{fitted} and \code{residuals}
extract (or generate) estimated values and residuals. Furthermore, the
functions \code{summary}, \code{predict}, and \code{plot} (options
\code{mean=FALSE}, \code{deriv=FALSE}, \code{ci=FALSE}, and
\code{plot.behavior = c("plot", "plot-data", "data")}) support objects
of this type. The figure below
presents summary output in the form of partial regression surfaces
(predictors not appearing on the axes are held constant at their
medians/modes). Note that for this simple example we used the option
\code{plot(model,mean=TRUE)}.

\setkeys{Gin}{width=0.75\textwidth}
\begin{center}
<<fig=TRUE,multifig=TRUE,echo=FALSE,keep.source=TRUE>>=
options(SweaveHooks = list(multifig = function() par(mfrow=c(2,2))))
plot(model,mean=TRUE)
@ 
\end{center}
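For completeness, a brief sketch (not run) of extracting results from
the fitted object in this example; the plot options shown were
described above.

<<eval=FALSE,echo=TRUE,keep.source=TRUE>>=
## Not run: extracting results from the fitted crs object
head(fitted(model))
head(residuals(model))
newdata <- data.frame(x = c(0.25, 0.75), z = factor(c(0, 1), levels = c(0, 1)))
predict(model, newdata = newdata)
plot(model, mean = TRUE, ci = TRUE)  # partial means with confidence bounds
@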

\subsection*{Example: Regression Discontinuity Design}

By way of illustration we consider a simple example involving two
continuous predictors and one categorical predictor. In this example
there is a `discontinuity' in the regression surface potentially
demarcated by the discrete predictor.

<<echo=TRUE,keep.source=TRUE>>=
set.seed(1234)
n <- 1000
x1 <- runif(n)
x2 <- runif(n)
z <- ifelse(x1>.5,1,0)
dgp <- cos(2*pi*x1)+sin(2*pi*x2)+2*z
z <- factor(z)
y <- dgp + rnorm(n,sd=1)
model <- crs(y~x1+x2+z)
summary(model)
@ 

The figure below plots the resulting estimate. The discontinuity
occurs when $x_1>0.5$, but the nature of the discontinuity is unknown,
as is the functional form on either side of the potential
discontinuity. The categorical regression spline is able to detect
this `break'.

On a related note, testing for a significant break could be
accomplished with an (asymptotic) F-test (to do so we must set
\code{kernel=FALSE}, however), as the following illustrates. Note that
the argument \code{include=0} drops the single categorical predictor;
with multiple categorical predictors one would instead supply a vector
such as \code{c(1,1,\dots,0,1,\dots,1)}.

<<echo=TRUE,keep.source=TRUE>>=
## When kernel=FALSE, we could use the anova() function
## and set model.return=TRUE.
## Unrestricted model:
model <- crs(y~x1+x2+z,cv="none",
             degree=model$degree,
             segments=model$segments,
             basis=model$basis,
             kernel=FALSE,
             include=1,
             model.return=TRUE)
## Restricted model:
model.res <- crs(y~x1+x2+z,cv="none",
                 degree=model$degree,
                 segments=model$segments,
                 basis=model$basis,
                 kernel=FALSE,
                 include=0,
                 model.return=TRUE)
anova(model.res$model.lm,model$model.lm)
@ 

\setkeys{Gin}{width=0.75\textwidth}
\begin{center}
<<fig=TRUE,echo=FALSE,keep.source=TRUE>>=
num.eval <- 50
x1.seq.0 <- seq(min(x1[z==0]),max(x1[z==0]),length=num.eval)
x2.seq.0 <- seq(min(x2[z==0]),max(x2[z==0]),length=num.eval)
x.grid <- expand.grid(x1.seq.0,x2.seq.0)
newdata <- data.frame(x1=x.grid[,1],x2=x.grid[,2],z=factor(rep(0,num.eval**2),levels=c(0,1)))
z0 <- matrix(predict(model,newdata=newdata),num.eval,num.eval)

x1.seq.1 <- seq(min(x1[z==1]),max(x1[z==1]),length=num.eval)
x2.seq.1 <- seq(min(x2[z==1]),max(x2[z==1]),length=num.eval)
x.grid <- expand.grid(x1.seq.1,x2.seq.1)
newdata <- data.frame(x1=x.grid[,1],x2=x.grid[,2],z=factor(rep(1,num.eval**2),levels=c(0,1)))
z1 <- matrix(predict(model,newdata=newdata),num.eval,num.eval)
xlim <- c(0,1)
zlim <- c(min(z0,z1),max(z0,z1))
theta <- 15
phi <- 10
persp(x=x1.seq.0,y=x2.seq.0,z=z0,
      xlab="x1",ylab="x2",zlab="y",
      xlim=xlim,
      ylim=xlim,      
      zlim=zlim,
      ticktype="detailed",      
      border="red",
      theta=theta,phi=phi)
par(new=TRUE)
persp(x=x1.seq.1,y=x2.seq.1,z=z1,
      xlab="x1",ylab="x2",zlab="y",
      xlim=xlim,
      ylim=xlim,      
      zlim=zlim,
      theta=theta,phi=phi,
      ticktype="detailed",
      border="blue")
@ 
\end{center}

\section*{Acknowledgements}

I would like to gratefully acknowledge support from the Natural
Sciences and Engineering Research Council of Canada
(\url{http://www.nserc-crsng.gc.ca}), the Social Sciences and Humanities
Research Council of Canada (\url{http://www.sshrc-crsh.gc.ca}), and the Shared
Hierarchical Academic Research Computing Network
(\url{http://www.sharcnet.ca}).

\bibliography{crs}

\end{document}