\documentclass[a4paper]{article}
\usepackage{amsmath, amssymb}
\usepackage{color}
\usepackage[latin1]{inputenc}


%\VignetteIndexEntry{subselect} 

\setlength{\textwidth}{16cm}
\setlength{\textheight}{21cm}
\setlength{\evensidemargin}{0cm}
\setlength{\oddsidemargin}{0cm}

\newcommand{\bvec}{\ensuremath{\mathrm{\mathbf{b}}}}
\newcommand{\A}{\ensuremath{\mathrm{\mathbf{A}}}}
\newcommand{\C}{\ensuremath{\mathrm{\mathbf{C}}}}
\newcommand{\E}{\ensuremath{\mathrm{\mathbf{E}}}}
\newcommand{\F}{\ensuremath{\mathrm{\mathbf{F}}}}
\newcommand{\Hmat}{\ensuremath\mathrm{\mathbf{H}}}
\newcommand{\I}{\ensuremath{\mathrm{\mathbf{I}}}}
\newcommand{\M}{\ensuremath{\mathrm{\mathbf{M}}}}
\renewcommand{\P}{\ensuremath{\mathrm{\mathbf{P}}}}
\renewcommand{\S}{\ensuremath{\mathrm{\mathbf{S}}}}
\newcommand{\T}{\ensuremath{\mathrm{\mathbf{T}}}}
\newcommand{\U}{\ensuremath{\mathrm{\mathbf{U}}}}
\newcommand{\X}{\ensuremath{\mathrm{\mathbf{X}}}}
\newcommand{\Y}{\ensuremath{\mathrm{\mathbf{Y}}}}
\newcommand{\umn}{\ensuremath{\mathrm{\mathbf{1}_n}}}

\newcommand{\SsubGp}{\mbox{${\mathrm{\mathbf{S}}}_{\{\cal G\}}$}}
\newcommand{\tr}{\ensuremath{\mathrm{tr}}}

\title{The \texttt{subselect} R package}
\author{Jorge Cadima, Jorge Orestes Cerdeira, Pedro Duarte Silva, Manuel Minhoto}
\begin{document}


\maketitle


\begin{center}
\textbf{Version 0.12}
\end{center}



\hrule

\vskip 0.5cm

\abstract{The \texttt{subselect} package addresses the issue of variable
selection in different statistical contexts, among which exploratory data 
analyses, univariate or multivariate linear models, generalized linear 
models, principal components analysis, linear discriminant analysis 
and canonical correlation analysis. 
Selecting variable subsets requires the definition of a numerical
criterion which measures the quality of any given variable
subset as a surrogate for the full set of variables. The current version 
of the \texttt{subselect} package provides eight different  
criteria. For each available criterion, the package provides a function 
that computes the criterion value of any given subset. More significantly, 
the package provides efficient search functions that seek the 
best subsets of any given size, for a specified criterion.
}


\vskip 0.5cm

\hrule

\tableofcontents 


\vskip 0.5cm

\section{Introduction}


\subsection{The variable selection problem}

Selecting a subset of variables which can be used as a surrogate for a
full set of variables, without major loss of quality, is a problem
that arises in numerous contexts. 

Multiple linear regression has been the classical statistical setting 
for variable selection and the greedy algorithms of backward elimination, 
forward selection and stepwise selection have long been used
to choose a subset of linear predictors. However, their limitations are 
well known  and more efficient search algorithms have also been devised,
such as Furnival and Wilson's Leaps and Bounds Algorithm (reference
\cite{Furnival:74}).   

The \texttt{subselect} package addresses the issue of variable
selection in different statistical contexts, among which:
\begin{itemize}
\item exploratory data analyses;
\item univariate or multivariate linear models;
\item generalized linear models;
\item principal components analysis;
\item linear discriminant analysis;
\item canonical correlation analysis.
\end{itemize}


\subsection{Measuring the quality of a subset}

Selecting variable subsets requires the definition of a numerical
\textbf{criterion} which measures the quality of any given variable
subset. In a univariate multiple linear regression, for example, possible
measures of the quality of a subset of predictors include the coefficient
of determination $R^2$,  the $F$ statistic of a goodness-of-fit test, 
its corresponding $p$-value, or Akaike's Information Criterion (AIC).  

The \texttt{subselect} package assumes that all potential variable
subsets can be ranked according to a well-defined numerical criterion
that is relevant for the problem at hand, and that \textit{the ultimate 
goal is to select the best subsets for any given cardinality}, i.e., 
the subsets of any given size which have the highest criterion values, 
for a given criterion.

The criteria currently available in the package will be further
discussed in Section \ref{s criteria}. For each available criterion,
the package provides a function that computes the criterion value of any given 
subset. More importantly, the package provides efficient functions 
to search for the best subsets of any given size.

\subsection{Selecting the best subsets}

Identifying the best variable subsets of a given cardinality is 
-- for many criteria -- a computationally intensive problem with
datasets of even moderate size, as is well known in the standard
linear regression context. 

The \texttt{subselect} package has a function (\texttt{eleaps})
implementing Duarte Silva's adaptations (references \cite{Pedro:01}
and \cite{Pedro:02}) of Furnival and Wilson's Leaps and Bounds
Algorithm \cite{Furnival:74} for 
variable selection, using the criteria considered in this
package. This function is very fast for datasets with up to
approximately $p=30$ variables and is guaranteed to identify the $m$  best
subsets of any size $k$ from $1$ to $p$, with $m$ in the range from 
$1$ to $\binom{p}{k}$. It is \textbf{the  recommended
function for variable selection with small or moderately-sized data
sets} and is discussed in subsection \ref{ss eleaps}.   

For \textbf{larger data sets} (roughly $p>35$) the \texttt{eleaps} function is
no longer computationally feasible and alternative search heuristics
are needed. The \texttt{subselect} package provides \textbf{three functions
with different search algorithms}: 
\begin{description}
\item[\texttt{anneal}] a simulated annealing-type search algorithm; 
\item[\texttt{genetic}] a genetic algorithm; and 
\item[\texttt{improve}] a modified local improvement algorithm.
\end{description}
These algorithms are described in reference \cite{CCM:04} and are 
discussed in Section \ref{s algoritmos}. They perform better
than the greedy-type algorithms, such as the standard stepwise
selection algorithms widely used in linear regression.

All four search functions invoke code written in either C++
(\texttt{eleaps}) or Fortran (\texttt{anneal},  \texttt{genetic} and
\texttt{improve}), to speed up computation times.
The search functions are further discussed in Section \ref{s algoritmos}.


\subsection{Installing and loading \texttt{subselect}} 


The \texttt{subselect} package is on CRAN (cran.r-project.org) and can
be installed like any other CRAN package, namely using R's
\texttt{install.packages} facility:


\begin{verbatim}
> install.packages("subselect")
\end{verbatim}

This vignette assumes that the \texttt{subselect} package has already
been installed in the user's system. 
Once installed, \texttt{subselect} must be loaded into an R session:

<<>>=
library(subselect) 
@ 

A preliminary view of the available functions and of their purpose 
can, as usual, be obtained via:

\begin{verbatim}
> library(help=subselect)
\end{verbatim}



\section{Criteria for variable selection} 
\label{s criteria}

Currently, the package accepts \textbf{eight different criteria}
measuring the quality of any given variable subset. These 
criteria can be (simplistically) classified into three groups: 
(i) criteria that are useful in an exploratory analysis or principal 
component analysis of a data set;  (ii) criteria that 
are model-based and are relevant in contexts that can be framed as a
multivariate linear model  (multivariate linear regressions, MANOVAs,
linear discriminant analysis, canonical correlation analysis, etc.); and 
(iii) criteria that are useful for variable selection in the context 
of a generalized linear model. 

\subsection{Criteria for exploratory data analyses and PCA}

The three criteria relevant for exploratory data analyses  have been 
previously suggested in the literature (see references \cite{CadJol:01} 
and \cite{CCM:04} for a fuller discussion). They can all be
framed as matrix correlations \cite{Ramsay:84} involving an $n\times p$ 
data matrix (where $p$ is the number of variables and $n$ the number of
observations on those variables) 
and projections of its columns on some subspace of $\mathbb{R}^n$.
In the \texttt{subselect} package, these three criteria are called: 

\begin{description}
\item[RM] The square of this matrix correlation is the
proportion of total variance (inertia) that is preserved 
if the $p$ variables are orthogonally projected onto the subspace 
spanned by a given $k$-variable subset. \emph{Choosing the $k$-variable
subset that maximizes this criterion is therefore akin to what is 
done in a Principal Component Analysis, with the difference that 
the optimal $k$-dimensional subspace is chosen only from among
the  $\binom{p}{k}$  subspaces spanned by $k$ of the $p$
variables}. This is the second of McCabe's  four criteria for what he
terms \textbf{\textit{principal variables}} \cite{McCabe:84}. See Subsection \ref{sss RM} for more details.

\item[GCD] Yanai's \emph{Generalized Coefficient of Determination} 
\cite{Ramsay:84} is  a measure of the closeness of two subspaces that
is closely related to the \emph{gap metric} \cite{Golub:96}. 
More precisely, it is the average of the squared canonical 
correlations between two sets of variables spanning each of the 
subspaces \cite{Ramsay:84}. In this context, these subspaces
are  the subspace spanned by $g$ Principal Components of 
the full, $p$-variable, data set; and the subspace spanned by a 
$k$-variable subset of those $p$ variables. By default the PCs 
considered are the \emph{first} $g$ PCs, and the number of variables
and PCs considered is the same ($k=g$), but other options can be 
specified by the user. \emph{Maximizing the GCD
corresponds to choosing the $k$ variables that span a subspace that is
as close as possible to the principal subspace spanned by the $g$ 
principal components that were specified}. See Subsection \ref{sss GCD} for more details.

\item[RV] Escoufier's RV-criterion \cite{RobEsc:76}  measures the
similarity of the $n$-point configurations defined by the $n$ rows of two
(comparable) data matrices, allowing for translations of the origin, rigid
rotations and global changes of scale. In this case, the two matrices
that are compared are the original data matrix $\X$ and the matrix
$\P_k\X$ which results from projecting the $p$ columns of $\X$ onto
a $k$-dimensional subspace spanned by $k$ of the $p$ original
variables. \emph{Maximizing this criterion, among all $k$-variable subsets,
corresponds to choosing the $k$ variables that span the subspace which 
best reproduces the original $n$-point configuration, allowing for 
translations of the origin, rigid rotations and global changes of scale}. 
See Subsection \ref{sss RV} for more details.
\end{description}

The package functions that compute the values of these indices 
(described below) all have the following two input \emph{arguments}:

\begin{description}
\item[\texttt{mat}] denotes the covariance or correlation matrix of 
the full data set;
\item[\texttt{indices}] is a vector, matrix or 3-d array of integers
identifying the variables in the subsets of interest.
\end{description}

The function for the $GCD$ coefficient has an additional argument,  
described in subsection \ref{sss GCD}.

\subsubsection{The RM coefficient}
\label{sss RM}

\paragraph{Motivation and definition}

\textbf{ }\newline

It is well-known that the first $k$ principal components of an $n\times p$ data set are the $k$ linear combinations of the $p$ original variables which span the subspace that maximizes the proportion of total variance retained, if the data are orthogonally projected onto that subspace. A similar problem, but considering only the subspaces that are \emph{spanned by $k$-variable subsets of the $p$ original variables}, can be formulated \cite{CadJol:01} as follows: determine the $k$-variable subset which maximizes the square of the RM coefficient, 
%the matrix correlation between the $n\times p$ data
%matrix $\X$ and $\P_k\X$, where $\P_k$ is the $n\times n$ matrix of
%orthogonal projections on the subspace spanned by some subset ${\cal K}$ of $k$
%among the $p$ variables:
\begin{equation}
\label{eq RM}
RM = \mathrm{corr}(\X,\P_k \X) =
\sqrt{\frac{\mathrm{tr}(\X^t \P_k \X)}{\mathrm{tr}(\X^t \X)}}=
\sqrt{\frac{\sum\limits_{i=1}^{p}\lambda_i (r_m)_i^2}{\sum\limits_{j=1}^{p}\lambda_j}}
 = \sqrt{\frac{\mathrm{tr}([\S^2]_{({\cal K})}\S_{\cal K}^{-1})}{\mathrm{tr}(\S)}} ,
\end{equation}
with $\mathrm{corr}$ denoting the matrix correlation, $\mathrm{tr}$ the matrix trace, and where
\begin{itemize}
\item $\X$ is the full (column-centered or possibly standardized) data matrix; 
\item $\P_k$ is the matrix of  orthogonal projections on the subspace 
spanned by a given $k$-variable subset;  
\item $\S=\frac{1}{n}\X^t\X$ is the $p\times p$ covariance (or correlation) matrix of the full data set;
\item ${\cal K}$ denotes the index set of the $k$ variables in the variable subset; 
\item $\S_{\cal K}$ is the $k\times k$ principal submatrix of matrix $\S$ which 
  results from retaining the $k$ rows/columns whose indices belong to ${\cal K}$;
\item $[\S^2]_{({\cal K})}$ is the $k\times k$ principal submatrix of $\S^2$ 
obtained by retaining the $k$ rows/columns associated with set ${\cal K}$;
\item $\lambda_i$ stands for the $i$-th largest
eigenvalue of the covariance (or correlation) matrix defined by $\X$;
\item $(r_m)_i$ stands for the multiple correlation between the
$i$-th principal component of the full data set and the $k$-variable subset.
\end{itemize}

The values of the RM coefficient will lie in the interval $[0,1]$, with 
larger values indicating a higher proportion of explained variance. Note that it is not relevant whether 
sample (co)variances (with denominator $n\!-\!1$), or even the sums of squares 
and products matrices, are used instead of matrix $\S$ as defined above: 
RM is insensitive to multiplication of $\S$ by a constant.

\paragraph{The \texttt{rm.coef} function}

\textbf{ }\newline

The \texttt{subselect} package provides the function \textbf{rm.coef} which
computes the RM coefficient, given a positive definite matrix 
(function argument \texttt{mat}, corresponding to $\S$ in 
the final expression of equation (\ref{eq RM})) and a vector
with $k$ indices (function argument \texttt{indices}, corresponding to 
${\cal K}$ in (\ref{eq RM})):

\begin{verbatim}
> rm.coef(mat,indices)
\end{verbatim}

The \textbf{rm.coef} function uses the final expression in equation (\ref{eq RM}). 
In the standard setting, matrix $\S$ is a covariance or correlation matrix for the full data set,
but it may be any positive definite matrix, such as a matrix of non-central second moments. 
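
The final expression can also be checked directly with a few lines of
ordinary R code. The following minimal sketch (not part of the package;
the objects \texttt{S} and \texttt{K} are chosen here purely for
illustration) evaluates equation (\ref{eq RM}) from first principles:

\begin{verbatim}
> S  <- var(iris[,-5])        # covariance matrix of the full data set
> K  <- c(3,4)                # indices of the variable subset
> S2 <- S %*% S               # the matrix S^2
> # final expression above: sqrt( tr([S^2]_K S_K^{-1}) / tr(S) )
> sqrt( sum(diag(S2[K,K] %*% solve(S[K,K]))) / sum(diag(S)) )
\end{verbatim}

The result should agree with \texttt{rm.coef(S, K)}.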


\paragraph{Examples}

\textbf{ }\newline

As an example of the use of the \textbf{rm.coef} function, let us compute the 
value of the RM coefficient for the two petal 
measurements in Fisher's \emph{iris} data (variables 3 and 4, respectively, 
\texttt{Petal.Length} and \texttt{Petal.Width}; see \texttt{help(iris)} 
in a standard R session for more information on this data set). This can be
obtained as follows, using \textbf{rm.coef}'s only two arguments:

<<>>=
rm.coef(mat=var(iris[,-5]),indices=c(3,4))

@ 
The square of this value,
<<>>=
rm.coef(var(iris[,-5]),c(3,4))^2

@ 
gives the proportion of total variability that would be preserved if the
four morphometric variables were orthogonally projected  onto the
subspace spanned by the petal measurements. It is quite often the case that this value
is not very much smaller than the proportion of total variability that
is accounted for by the same number of principal components \cite{CadJol:01}.

If more than one subset of a given cardinality is desired, the \texttt{indices} 
argument should be given as a matrix whose rows provide the indices of the variables 
in each subset. For example, the RM coefficients for the three-variable 
subsets of the \emph{iris} data given by variables $\{1,2,3\}$ and $\{1,2,4\}$ 
are requested by the command:

<<>>=
rm.coef(var(iris[,-5]), indices=matrix(nrow=2,ncol=3,byrow=TRUE,c(1,2,3,1,2,4)))
@

The argument \texttt{indices} can also be a three-dimensional array, if subsets 
of different cardinalities are desired. In this case, the third dimension of 
\texttt{indices} is associated with different cardinalities. This option 
is especially useful when applying the \texttt{rm.coef} function to the output of the 
search algorithms (see Section \ref{s algoritmos}), if more than one cardinality 
has been requested. 

An example for the less frequent case, where the request is built from scratch, is 
now given. It computes the
value of RM for two 1-variable subsets and two 3-variable subsets:

<<>>=
subsets  <- array(data=c(3,2,0,0,0,0,1,1,2,2,3,4), dim=c(2,3,2))
colnames(subsets) <- paste("V",1:3,sep="")
rownames(subsets) <- paste("Solution",1:2)
dimnames(subsets)[[3]]<-paste("Size",c(1,3))
subsets
rm.coef(var(iris[,-5]),indices=subsets)
@ 

The output places the values for each cardinality in a different column.
Notice how the missing variable indices for the lower cardinalities are given as zeroes.



\subsubsection{The GCD coefficient}
\label{sss GCD}


\paragraph{Motivation and definition}

\textbf{ }\newline

In the standard setting for the \texttt{subselect} package, given a $p$-variable 
data set and a subset of $g$ of its principal components, the GCD criterion 
is a measure of similarity between the  principal subspace spanned by the $g$ 
specified principal components and the subspace spanned by a given 
$k$-variable subset of the original variables \cite{CadJol:01}.

The GCD is the matrix correlation between  the matrix  $\P_k$ of  orthogonal 
projections on the subspace  
spanned by a given $k$-variable subset and  the matrix  $\P_g$ of
orthogonal projections on the subspace spanned by the $g$ given
principal components of the full data set \cite{CadJol:01}: 
  
\begin{equation}
  \label{eq GCD}
GCD  = \mathrm{corr}(\P_k , \P_g) = 
\frac{\mathrm{tr}(\P_k\cdot \P_g)}{\sqrt{k\cdot g}} =
 \frac{1}{\sqrt{k g}} \sum\limits_{i\in{\cal G}}(r_m)_i^2 = 
\frac{\tr\left([\SsubGp]_{({\cal K})}\,\S_{\cal K}^{-1}\right)}{\sqrt{k g}},
\end{equation}
where 
\begin{itemize}
\item  ${\cal K}$, $\S$,  $\S_{\cal K}$ and  $(r_m)_i$ are all defined
as above (subsection \ref{sss RM});
\item ${\cal G}$ denotes the index set of the $g$ principal components 
in the PC subset; 
\item $\SsubGp$ is the $p\times p$ matrix of rank $g$ that results from 
retaining only the $g$ terms in the spectral decomposition of $\S$ that 
are associated with the PC indices in the set ${\cal G}$;
\item $[\SsubGp]_{({\cal K})}$ is the $k\times k$ principal submatrix 
of $\SsubGp$ that results from retaining only the rows/columns whose 
indices are in ${\cal K}$.
\end{itemize}

The values of the GCD coefficient are in the interval $[0,1]$, with 
larger values indicating greater similarity between the subspaces.

\paragraph{The \texttt{gcd.coef} function}

\textbf{ }\newline

The \textbf{gcd.coef} function computes the GCD coefficient for a given
positive definite matrix (function argument \texttt{mat}, corresponding 
to matrix $\S$ in the final expression of equation (\ref{eq GCD})), 
a given vector of $k$ variable indices (function argument \texttt{indices}, 
corresponding to ${\cal K}$ in  (\ref{eq GCD})),  
and a given vector of $g$ PC indices (function argument \texttt{pcindices},
corresponding to ${\cal G}$ in  (\ref{eq GCD})):

\begin{verbatim}
> gcd.coef(mat,indices,pcindices)
\end{verbatim}

If the \texttt{pcindices} argument is not specified, by default it 
is set to the first $k$ PCs, where $k$ is the size of the variable 
subset defined by argument \texttt{indices}.

The value of the GCD is computed 
using the final expression in equation (\ref{eq GCD}). Matrix $\S$ (the 
input argument \texttt{mat}) is usually the covariance or correlation 
matrix for the full $p$-variable data set, but there may be other contexts where 
the final expression in (\ref{eq GCD}) makes sense.
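
As with the RM coefficient, the final expression can be checked directly in
a few lines of plain R. The following sketch (not package code; \texttt{S},
\texttt{K} and \texttt{G} are chosen purely for illustration) reconstructs
$\SsubGp$ from the spectral decomposition of $\S$ and evaluates
equation (\ref{eq GCD}):

\begin{verbatim}
> S <- var(iris[,-5]);  K <- c(3,4);  G <- c(1,2)
> e  <- eigen(S)                    # spectral decomposition of S
> SG <- e$vectors[,G] %*% diag(e$values[G]) %*% t(e$vectors[,G])
> # tr( [S_G]_K S_K^{-1} ) / sqrt(g k)
> sum(diag(SG[K,K] %*% solve(S[K,K]))) / sqrt(length(G)*length(K))
\end{verbatim}

The result should agree with \texttt{gcd.coef(S, indices=K, pcindices=G)}.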

\paragraph{Examples}

\textbf{ }\newline

The value of the GCD coefficient for the first two covariance-matrix PCs in  
Fisher's \emph{iris} data and the  two petal measurements
can be obtained as follows:

<<>>=
gcd.coef(var(iris[,-5]),ind=c(3,4),pcind=c(1,2))

@ 

The two sepal measurements span a subspace that is much closer to the principal plane:

<<>>=
gcd.coef(var(iris[,-5]),ind=c(1,2))

@ 


As with the RM coefficient, the value of the GCD coefficient 
for multiple subsets can be requested by providing the argument \texttt{indices} as a matrix
(if the subsets are all of the same cardinality) or as a three-dimensional array (for subsets 
of different cardinalities); see subsection \ref{sss RM} for examples. In this case, 
however, a single vector of PC indices, given by the argument \texttt{pcindices},
is used for all the subsets in the same function call.
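
For instance, the following sketch (the particular subsets are chosen purely
for illustration) requests the GCD values of the \emph{iris} variable subsets
$\{1,2\}$ and $\{3,4\}$, both compared with the principal plane spanned by
PCs 1 and 2:

\begin{verbatim}
> gcd.coef(var(iris[,-5]),
+          indices=matrix(nrow=2,ncol=2,byrow=TRUE,c(1,2,3,4)),
+          pcindices=c(1,2))
\end{verbatim}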


\subsubsection{The RV coefficient}
\label{sss RV}

\paragraph{Motivation and definition}

\textbf{ }\newline

As already stated, Escoufier's RV-criterion \cite{RobEsc:76}  measures the
similarity of the $n$-point configurations defined by the $n$ rows of two
(comparable) data matrices, allowing for translations of the origin, rigid
rotations and global changes of scale. The criterion is defined, in this 
context, as the matrix correlation 
between the matrices  $\X\X^t$ and $\P_k\X\X^t\P_k$ (where $\X$ and
$\P_k$ are defined as above) \cite{CCM:04}:

\begin{equation}
  \label{eq RV}
\mathrm{RV}=\mathrm{corr}(\X\X^t,\P_{k}\X\X^t\P_{k})=
\frac{1}{\sqrt{\tr(\S^2)}}\cdot
\sqrt{\tr\left(\left(\left[\S^2\right]_{({\cal K})}\left[\S_{\cal K}\right]^{-1}\right)^{2}\right)},
\end{equation}
where $\S$, $\S_{\cal K}$ and $\left[\S^2\right]_{({\cal K})}$ are defined as in
Subsection \ref{sss RM}.

Possible values for the RV coefficient lie in the interval $[0,1]$, with 
larger values indicating more similarity between the two $n$-point configurations.

\paragraph{The \texttt{rv.coef} function}

\textbf{ }\newline

The value of the RV coefficient, for a given positive definite matrix 
(function argument \texttt{mat}, corresponding 
to matrix $\S$ in the final expression of equation (\ref{eq RV})) and 
a given vector of $k$ variable indices (function argument \texttt{indices}, 
corresponding to ${\cal K}$ in equation (\ref{eq RV})), can be
computed using the \textbf{rv.coef} function: 

\begin{verbatim}
> rv.coef(mat,indices)
\end{verbatim}
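
As with the previous coefficients, the final expression in equation
(\ref{eq RV}) can be verified directly; a minimal sketch in plain R (with
\texttt{S} and \texttt{K} chosen only for illustration) is:

\begin{verbatim}
> S  <- var(iris[,-5]);  K <- c(3,4)
> S2 <- S %*% S
> M  <- S2[K,K] %*% solve(S[K,K])
> # sqrt( tr( ([S^2]_K S_K^{-1})^2 ) / tr(S^2) )
> sqrt( sum(diag(M %*% M)) / sum(diag(S2)) )
\end{verbatim}

The result should agree with \texttt{rv.coef(S, K)}.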


\paragraph{Examples}

\textbf{ }\newline

The \texttt{farm} data set (included in this package) has $n=99$ individuals
observed on each of $p=62$ variables. The RV coefficient
reveals that the $99$-point configuration of the standardized data in
$\mathbb{R}^{62}$ is fairly similar to the configuration that results
from the orthogonal projection of those $99$ points onto the subspace
spanned by variables number 2, 37, 57 and 59:

<<>>=
data(farm)
rv.coef(cor(farm),ind=c(2,37,57,59))
@ 

For two different variable subsets of size 4, the argument \texttt{indices} 
is given as a matrix whose rows indicate the variable indices for 
each subset. For example, the RV-coefficients of the four-variable subsets 
$\{2,12,56,59\}$ and  $\{2,3,11,59\}$ are requested by the command:

<<>>=
rv.coef(cor(farm), indices=matrix(nrow=2,ncol=4,byrow=TRUE,c(2,12,56,59,2,3,11,59)))
@ 


\subsection{Criteria for a Multivariate Linear model context}
\label{ss linear model criteria}

Different statistical methods arise within the general
framework of a multivariate linear model \cite{Huberty:94}:
$$\X = \A \pmb{\Psi} + \U ,$$ 
where $\X$ is an $n\times p$ data matrix of
original variables, $\A$ is a known ($n\times q$) design matrix, 
$\pmb{\Psi}$ is a ($q\times p$) matrix of unknown parameters 
and $\U$ is an ($n\times p$) matrix of error vectors. 
Particular cases in this setting include, among others,
[Multivariate] Analysis of Variance ([M]ANOVA),  Linear 
Discriminant Analysis (LDA) and Canonical Correlation
Analysis (CCA). Here we will be particularly concerned with 
contexts where a selection of subsets of $\X$ 
is warranted, the classical cases being LDA and CCA. 

In these statistical methods, variable subsets are often  
assessed according to their contribution to the violation of an 
additional reference hypothesis, $H_0 \,:\, \C \, \pmb{\Psi} = 0$, 
where $\C$ is a known coefficient matrix of rank $r$ \cite{Huberty:94}.

For example, in LDA $\X$ is a matrix of $n$ observations, divided 
into $q$ groups, on $p$ attributes; $\A$ is a matrix of group indicators; 
and $\pmb{\Psi}$ is a matrix of group-specific population means. In this 
case, the rows of $\C$ specify $q-1$ contrasts, stating the equality 
of population means across groups. Hence, $r = \min(p,q-1)$, and any index 
of the extent of $H_0$ violations can be interpreted as a measure 
of group separation.   

In Canonical Correlation Analysis, $\A = \left[ \umn \, \Y \right]$, 
where  $\umn$ is a vector of ones and the columns of $\X$ and $\Y$ 
are $n$ observations on two different sets of variables. The reference hypothesis, 
$H_0: \C \ \pmb{\Psi} = \left[0 \, \I_q \right] \left[\pmb{\Psi}_0^t \, 
  \pmb{\Psi}_{\Y}^t \right]^t = 0 $, 
states that the two sets are uncorrelated, and 
$r = \min\left(\mathrm{rank}(\X),\mathrm{rank}(\Y)\right)$. Indicators for the violation of 
$H_0$ are measures of linear association between the $\X$ and $\Y$ variable 
sets. We note that in this context only the variables associated 
with the $\X$ group are selected, while the $\Y$ group of variables 
remains fixed.

When $\Y = y$ consists of a single (response) variable this problem 
becomes the traditional variable selection problem in Linear Regression, 
usually modelled as 
$y = \left[1_n \, \X \right] \left[\beta_0 \ \pmb{\beta}_X \right]^t + \pmb{\epsilon} $. 
While \texttt{subselect} can also be used with reasonable results 
in multivariate regression models 
$\Y =\left[1_n \, \X \right] \left[\pmb{\beta_0} \, \pmb{\beta}_X \right]^t + \pmb{\epsilon}$,
with $\Y$ having more than one column, the symmetric association indices  
considered in \texttt{subselect} were not designed to measure the 
quality of $\Y$ predictions. In multivariate prediction problems 
other (non-symmetric) measures such as those discussed in Rencher 
\cite{Rencher:95} and McQuarrie and Tsai \cite{McQuarrieTsai:98} may 
be more relevant. In univariate regression all these measures are 
monotone functions of the traditional coefficient of determination, 
which is also the index considered by \texttt{subselect}.

  
It is well known that, under classical Gaussian assumptions, test 
statistics for $H_0$ are given by several increasing functions of 
the $r$ positive eigenvalues of a product matrix $\T^{-1}\Hmat$, 
with $\T$ and $\Hmat$ the total and effect matrices of Sum of 
Squares and Cross Product (SSCP) deviations associated with $H_0$, such that
$$ \T = \Hmat + \E,$$
where matrix $\E$ is the matrix of residual SSCP.  The matrices $\T$ and $\Hmat$ are given by 
$\T = \X^t(\I_n - \P_{\omega}) \X$ and $\Hmat = \X^t(\P_{\Omega} -
\P_{\omega}) \X$,  where $\I_n$ is the $n\times n$ identity matrix, and $\P_{\Omega} =
\A(\A^t\A)^-\A^t$ and  
$$\P_{\omega} = \A(\A^t\A)^-\A^t -
\A(\A^t\A)^-\C^t[\C(\A^t\A)^-\C^t]^-\C(\A^t\A)^-\A^t$$ 
are projection matrices on the spaces spanned by the columns of $\A$
(space $\Omega$) and by the linear combinations of these columns that
satisfy the reference hypothesis (space $\omega$). In these
formulas $\M^t$ denotes the transpose of matrix $\M$, and $\M^-$ a
generalized inverse. Furthermore, whether
or not the classical assumptions hold, the same eigenvalues can be
used to define descriptive indices that measure an ``effect''
characterized by the violation of $H_0$.

\subsubsection{Four criteria}

In the \texttt{subselect} package four effect-size indices, which 
are monotone functions of the traditional test statistics, are 
used as comparison criteria. These indices are called:

\begin{description} 
\item[Ccr12] The $ccr_1^2$  index is a function of the traditional 
\emph{Roy first root} test statistic \cite{Roy:39}, $\lambda_1$, for
the violation of  the linear hypothesis of the form 
$H_0\,:\, \C \pmb{\Psi} = 0$. 
%Specifically, $$ccr_1^2 = \frac{\lambda_1}{1+\lambda_1}.$$
\emph{Maximizing this criterion is equivalent to maximizing
Roy's first root}  (see Subsection \ref{sss ccr12} for more details).
\item[Tau2] The $\tau^2$ index is a  function of the traditional \emph{Wilks'
Lambda} statistic  \cite{Wilks:32}.
%, defined as $\Lambda=\frac{det(\E)}{det(\T)}$: $$\tau^2= 1-\Lambda^{(1/r)}.$$
\emph{Maximizing this criterion is equivalent to minimizing Wilks' Lambda} (see Subsection \ref{sss tau2}).
\item[Xi2] The $\xi^2$ coefficient is a function of the
\emph{Bartlett-Pillai trace} test statistic  \cite{Pillai:55}.
%, $P=\tr(\Hmat\T^{-1})$: $$\xi^2 = \frac{P}{r}.$$
\emph{Maximizing this criterion is equivalent to maximizing
the Bartlett-Pillai statistic} (see Subsection \ref{sss xi2}).
\item[Zeta2] The index $\zeta^2$ is a function of the traditional
\emph{Lawley-Hotelling trace} test statistic  \cite{Lawley:38} \cite{Hotelling:51}.
%,  $V=\tr(\Hmat\E^{-1})$: $$\zeta^2= \frac{V}{V+r}.$$
\emph{Maximizing this criterion is equivalent to maximizing
the Lawley-Hotelling statistic} (see Subsection \ref{sss zeta2}).
\end{description} 

Cramer and Nicewander \cite{CramerNicewander:79} 
introduced these indices in the context of CCA, and Huberty \cite{Huberty:94} 
discusses their use in the context of LDA and MANOVA. 
\textit{In all four cases, the indices are defined so that their 
values lie between $0$ and $1$, and the larger the index value, the better 
the variable subset} as a surrogate for the full data set.  
In the specific case of a multiple linear regression (with a single
response variable), all four indices are equal to the standard
coefficient of determination, $R^2$. However, in multivariate problems with $r > 1$
the indices differ because they place different emphasis on canonical directions
associated with the effect under study. In particular, $ccr_1^2$ only considers the
first canonical direction, $\xi^2$ weights all $r$ directions in a balanced way, and
$\tau^2$, $\zeta^2$ are intermediate indices that emphasize the main
directions (see reference \cite{Pedro:01} for further details).  
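
The following sketch in plain R (not package code) makes these relations
concrete. It assumes that \texttt{mat}, \texttt{H} and \texttt{r} have been
produced by one of the helper functions described in Subsection
\ref{sss Hmat} below, and that \texttt{K} is a vector with the indices of the
variable subset under study; \texttt{theta} then holds the eigenvalues of
$\T^{-1}\Hmat$ restricted to that subset:

\begin{verbatim}
> theta <- Re(eigen(solve(mat[K,K]) %*% H[K,K])$values)
> theta[1]                                 # ccr12: Roy's first root index
> 1 - prod(1 - theta)^(1/r)                # tau2:  Wilks' Lambda index
> sum(theta)/r                             # xi2:   Bartlett-Pillai index
> V <- sum(theta/(1 - theta)); V/(V + r)   # zeta2: Lawley-Hotelling index
\end{verbatim}

Each value should agree with the output of the corresponding criterion
function described below.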

The four package functions that compute the values of these indices are 
called \texttt{*.coef}, where the asterisk denotes the above criterion name. 
These functions have \textit{six input arguments}: 
\begin{description}
\item[mat]  denotes the Total SSCP matrix $\T$;
\item[H] denotes the effects SSCP matrix  of relevance for
the particular case at hand (see subsection \ref{sss Hmat} for more details);
\item[r] denotes the rank of the $\Hmat$ matrix (except in degenerate cases; 
see the \texttt{help} files for each \texttt{*.coef} function);
\item[indices] is the vector, matrix or 3-d array of integers
that identify the variables in the subsets of interest; 
\item[tolval \texttt{and} tolsym] are parameters used
to check for ill-conditioning of the matrix \texttt{mat},
and the symmetry of matrices \texttt{mat} and $\Hmat$ (see the
  \texttt{help} files for more details).  
\end{description}

\subsubsection{Creating the SSCP matrices} 
\label{sss Hmat} 

The relevant Sum of Squares and Cross-Products Effect matrix $\Hmat$
is specific to each context in which model-based variable selection
arises. Computing these matrices can sometimes be time-consuming and
the direct application of the formulas described in the previous
subsection can lead to large rounding errors. The \texttt{subselect} 
package provides  helper functions which, for standard contexts, 
create both the SSCP Effects matrix $\Hmat$ and the SSCP Total
matrix $\T$, using sound numerical procedures based on
singular value decompositions, 
as well as computing the presumed rank $r$ of $\Hmat$. 
%There is also a more 
%general helper function (\texttt{glhHmat}) for the general 
%multivariate linear hypothesis  case.

The output from these functions can be used as input for
both the functions that compute the model-based criteria of quality
for a given variable subset (discussed in the subsections below) and
for the search functions that seek the best subsets of variables
for a given data set (see Section \ref{s algoritmos}). 

For the multivariate linear model context, \texttt{subselect} 
provides three such helper functions. For all three functions, 
the output is a list of four objects: 
\begin{description}
\item[mat] is the relevant Total SSCP matrix $\T$ divided by
$n-1$ (where $n$ is the number of rows in the data matrix);
\item[H] is the relevant Effect SSCP matrix divided by $n-1$;
\item[r] is the rank of matrix $\Hmat$;
\item[call] is the function call which generated the output.
\end{description}

The fact that \texttt{mat} and \texttt{H} are not the 
standard SSCP matrices, but rather these matrices divided by $n-1$, is of no consequence,  
since multiplying $\T$ and $\Hmat$ by a common scalar does not affect the criteria values.  

The three helper functions currently available for the multivariate 
linear model context are the following:


\paragraph{The \textbf{lmHmat} function for linear regression and CCA}

\textbf{ }\newline

The function \textbf{lmHmat} creates the necessary SSCP matrices for 
a \textbf{\emph{linear regression}} or \emph{\textbf{canonical correlation
analysis}} context. In a (possibly multivariate response) linear
regression context, it is assumed that the response variables are
fixed and a subset of the predictor variables is being assessed. 
In a canonical correlation analysis context, it is
assumed that one of the two groups of variables is fixed, and it is a
subset of the second group of variables  that is being assessed 
(variable selection in both groups will currently have to be done via
a 2-step approach). 
This function takes by default the following arguments:
\begin{description}
  \item[x] is a matrix containing the full set
of predictor variables, in the regression context, or the group of
variables in which a variable subset is to be chosen in the CCA context. 
  \item[y] is a matrix or vector containing the set of
fixed response variables, in the regression context, or the set of
fixed variables in the CCA context.
\end{description}

There is an S3 \emph{method} for arguments \texttt{x} and 
\texttt{y} of class  \texttt{data.frame}, as well as methods 
for classes \texttt{formula} and \texttt{lm}, that replace 
the input arguments by:
\begin{description}
  \item[formula] a  standard linear model formula  $y \sim
x1 + x2 + ...$ defining the model relation between response and
predictor variables (in the regression context) or fixed and
assessed variables in the CCA context. 
  \item[data] a data frame from which variables specified in formula 
are preferentially to be taken. 
\end{description}
or
\begin{description}
  \item[fitdlmmodel] an object of class \texttt{lm}, as produced by R's
  \texttt{lm} function. 
\end{description}

In this context, the output object \texttt{mat} is the covariance
matrix  $T_x$ of the $x$ variables; object \texttt{H} is 
the covariance matrix  $H_{x\vert y}$ of the projections of the $x$
variables on the space spanned by the $y$ variables;  and $r$ is the
expected rank of the $\Hmat$ matrix which, under the assumption of
linear independence, equals the minimum between the number of
variables in the x and y sets (the true rank of $\Hmat$ can be
different if the linear independence condition fails). See the
\texttt{lmHmat} help file for more information.

\paragraph{Example.} An example of the use of the helper function
\texttt{lmHmat} involves the \texttt{iris} data set. The goal is to
study the \textbf{(univariate response) multiple linear regression} of variable
\texttt{Sepal.Length} (the first variable in the \texttt{iris} data
frame) on the three remaining predictors.  

<<>>=
lmHmat(x=iris[,2:4], y=iris[,1])
@
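
The same matrices could also be requested through the formula method
mentioned above; a sketch of the equivalent call is:

\begin{verbatim}
> lmHmat(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data=iris)
\end{verbatim}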

\paragraph{The \textbf{ldaHmat} function for linear discriminant analysis}

\textbf{ }\newline

This function takes by default the following arguments:
\begin{description}
  \item[x] is a matrix containing the discriminating
variables, from which a subset is being considered. 
  \item[grouping] is a factor specifying the class to which each
observation belongs.
\end{description}

There are S3 \emph{methods} for 
\begin{itemize}
\item an input argument \texttt{x} of class \texttt{data.frame};
\item an input object of class \texttt{formula}.
\end{itemize}
With these methods, the input arguments for \texttt{ldaHmat} can be given as:
\begin{description}
  \item[formula] a formula of the form $grouping \sim x1 + x2 + ...$
where the $x$ variables denote the discriminating variables. 
  \item[data] a data frame from which variables specified in formula are preferentially to be taken. 
\end{description}

In this context, the output objects \texttt{mat} and \texttt{H} are
the standard total ($\T$) and between-group ($\Hmat$) SSCP matrices of
Fisher's linear discriminant analysis; and output object \texttt{r} is the
rank of the between-group matrix $\Hmat$, which equals the minimum 
between the number of discriminators and the number of groups minus
one (although the true rank of $\Hmat$ can be different, if the discriminators
are linearly dependent).

\paragraph{Examples}

A simple example of the use of function \texttt{ldaHmat} again
involves the \texttt{iris} data set. We seek to discriminate
the three iris species, using the four morphometric variables as
discriminators.

<<>>=
ldaHmat(x=iris[,1:4], grouping=iris$Species)
@
%$  

In the same example, the function could have been invoked as:

<<>>=
attach(iris)
ldaHmat(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width)
detach(iris)
@
%$


\paragraph{The \textbf{glhHmat} function for a general linear
hypothesis context}  

\textbf{ }\newline 

The function \textbf{glhHmat} creates the appropriate SSCP matrices 
for any problem that can be described as an instance of the 
\textbf{\emph{general multivariate linear model}} $\X = \A \pmb{\Psi} +
\U$, \emph{with a \textbf{reference hypothesis}} $H_0: \C \pmb{\Psi} =
\mathbf{0}$, where $\pmb{\Psi}$ is a matrix of unknown parameters and $\C$
is a known rank-$r$ coefficient matrix. 
By default, this function takes the following arguments:
\begin{description}
  \item[x] is a matrix containing the response variables. 
  \item[A] is a design matrix specifying a linear model in which $\X$ is the response.
  \item[C] is a matrix or vector containing the coefficients of the reference hypothesis. 
\end{description}

There is an S3 \emph{method} for input of class \texttt{data.frame}, 
in which \texttt{x} and \texttt{A} are given as data frames. A further method accepts 
input of class \texttt{formula}, with input arguments:
\begin{description}
  \item[formula] a formula of the form $X \sim A1 + A2 + ...+ Ap$
where the terms of the right hand side specify the relevant columns of the design matrix. 
  \item[C] a matrix or vector containing the coefficients of the reference hypothesis. 
  \item[data] a data frame from which variables specified in formula are preferentially to be taken. 
\end{description}

In this context, the $\T$ and $\Hmat$ matrices have the generic form
given at the beginning of this subsection and \texttt{r} is the
rank of $\Hmat$, which equals the rank of $\C$ (the
true rank of $\Hmat$ can be different from \texttt{r} if the $\X$
variables are linearly dependent). 

\paragraph{Example.} 

The following example creates the Total and Effects SSCP matrices, 
$\T$ and $\Hmat$, for an analysis of the data set \texttt{crabs} in  
the MASS package. This data set records physical measurements on 200
specimens of \textit{Leptograpsus variegatus} crabs observed on the
shores of  Western Australia. The crabs are classified by two factors,
both with two levels each: \texttt{sex}  and \texttt{sp} (crab species,
as defined by its colour: blue or orange). The measurement
variables include the carapace length (\texttt{CL}),  
the carapace width (\texttt{CW}), the size of the frontal lobe  
(\texttt{FL}) and the rear width (\texttt{RW}). 
We assume that there is an interest in 
comparing subsets of these variables, measured on both their original  
and logarithmic scales.   In particular, we assume that we wish 
to create the $\T$ and $\Hmat$ matrices associated with an analysis 
of the effect of the \texttt{sp} factor after controlling for 
\texttt{sex}. Only the \texttt{formula}, \texttt{C} and \texttt{data} arguments
are explicitly given in this function call.


<<>>=
library(MASS)
data(crabs)
lFL <- log(crabs$FL) ; lRW <- log(crabs$RW); lCL <- log(crabs$CL); lCW <- log(crabs$CW)
C <- matrix(0.,nrow=2,ncol=4)
C[1,3] = C[2,4] = 1.
C
Hmat5 <- glhHmat(cbind(FL,RW,CL,CW,lFL,lRW,lCL,lCW) ~ sp*sex,C=C,data=crabs)
Hmat5
@ 


\subsubsection{The $ccr_1^2$ (ccr12) coefficient and Roy's first root statistic}
\label{sss ccr12}

\paragraph{Motivation and definition}

\textbf{ }\newline

The $ccr_1^2$ coefficient is an increasing function of Roy's first root test
statistic for the reference hypothesis in the standard multivariate
linear model. Roy's first root is the largest eigenvalue of $\Hmat\E^{-1}$, where
$\Hmat$ is the Effect matrix and $\E$ is the Error (residual) matrix. 
The index $ccr_1^2$ is related to Roy's first root $\lambda_1$ by: 
$$ccr_1^2 =\frac{\lambda_1}{1+\lambda_1}.$$ 

\paragraph{The \texttt{ccr12.coef} function}

\textbf{ }\newline

The \texttt{subselect} package provides the function \textbf{ccr12.coef} which
computes the $ccr_1^2$ coefficient, given the variance or total SSCP matrix for the full data set, \texttt{mat}, the effects SSCP matrix \texttt{H}, the expected rank of the $\Hmat$ matrix, \texttt{r}, and a vector \texttt{indices} with the indices of the variable subset that is being considered. These arguments may be defined with the helper functions described in Subsection \ref{sss Hmat}. A standard function call looks like:

\begin{verbatim}
> ccr12.coef(mat, H, r, indices)
\end{verbatim}

For further arguments and options, see the function \texttt{help} page.


\paragraph{Example}

\textbf{ }\newline

The following example of the use of function \texttt{ccr12.coef}, in
the context of a (univariate response) Multiple Linear Regression, uses 
the \texttt{Cars93} data set from the MASS package. Variable 5 (average price)
is regressed on 13 other variables. The goal is to compare subsets of
these 13 variables according to their ability to predict car
prices. The helper function \texttt{lmHmat} creates the relevant input
to compute the value of the $ccr_1^2$ criterion for the subset of the
fourth, fifth, tenth and eleventh predictors.

<<>>=
library(MASS)
data(Cars93)
CarsHmat <- lmHmat(x=Cars93[c(7:8,12:15,17:22,25)],y=Cars93[5])
ccr12.coef(mat=CarsHmat$mat, H=CarsHmat$H, r=CarsHmat$r, indices=c(4,5,10,11))
@
%$

\subsubsection{The $\tau^2$ (tau2) coefficient and Wilks' Lambda} 
\label{sss tau2} 

\paragraph{Motivation and definition}

\textbf{ }\newline

The Tau squared index $\tau^2$ is a decreasing function of the 
standard Wilks' Lambda statistic for the multivariate linear model and
its reference hypothesis. The Wilks' Lambda statistic ($\Lambda$) is 
given by:   
$$\Lambda=\frac{\det(\E)}{\det(\T)},$$ 
where $\E$ is the Error (residual) SSCP matrix and $\T$ is the Total
SSCP matrix.   The index $\tau^2$ is related to the Wilks' Lambda
statistic by:  
$$\tau^2 = 1 - \Lambda^{1/r},$$ 
where $r$ is the rank of  the Effect SSCP matrix $\Hmat$.

\paragraph{The \texttt{tau2.coef} function}

\textbf{ }\newline

Function \textbf{tau2.coef} is similar to its counterpart \texttt{ccr12.coef} described in the previous Subsection. 
A standard function call looks like:

\begin{verbatim}
> tau2.coef(mat, H, r, indices)
\end{verbatim}


\paragraph{Example}

\textbf{ }\newline

A very simple example of the use of the $\tau^2$ criterion is given by the 
Linear Discriminant Analysis of the \emph{iris} data set, 
using all four morphometric variables to discriminate the three species. 
The subset consisting of variables 1 and 3 is then considered as a surrogate for all four variables.
 
<<>>=
irisHmat <- ldaHmat(iris[1:4],iris$Species)
tau2.coef(irisHmat$mat,H=irisHmat$H,r=irisHmat$r,c(1,3))
@ 

\subsubsection{The $\xi^2$ (xi2) coefficient and the Bartlett-Pillai statistic}
\label{sss xi2}

\paragraph{Motivation and definition}

\textbf{ }\newline

The Xi squared index is an increasing function of the traditional
Bartlett-Pillai trace test statistic. The Bartlett-Pillai trace $P$ is
given by  
$P=\tr(\Hmat\T^{-1})$, where $\Hmat$ is the Effects SSCP matrix and $\T$
is the Total SSCP matrix.  The Xi squared index $\xi^2$  is related 
to the Bartlett-Pillai trace by: 
$$\xi^2 =\frac{P}{r},$$ 
where $r$ is the rank of $\Hmat$.
 
\paragraph{The \texttt{xi2.coef} function}

\textbf{ }\newline

Function \textbf{xi2.coef} is similar to the previous criterion functions described in this Section. 
A standard function call looks like:

\begin{verbatim}
> xi2.coef(mat, H, r, indices)
\end{verbatim}



\paragraph{Example}

\textbf{ }\newline

The same example considered in Subsection \ref{sss tau2}, only 
this time using the $\xi^2$ index instead of $\tau^2$.

<<>>=
irisHmat <- ldaHmat(iris[1:4],iris$Species)
xi2.coef(irisHmat$mat,H=irisHmat$H,r=irisHmat$r,c(1,3))
@

\subsubsection{The $\zeta^2$ (zeta2) coefficient and the Lawley-Hotelling statistic}
\label{sss zeta2} 

\paragraph{Motivation and definition}

\textbf{ }\newline

The Zeta squared index $\zeta^2$ is an increasing function of the 
traditional Lawley-Hotelling trace test statistic. The 
Lawley-Hotelling  trace is given by $V=\tr(\Hmat\E^{-1})$, where $\Hmat$ 
is the Effect SSCP matrix and $\E$ is the Error SSCP matrix.  
 The index $\zeta^2$ is related to the Lawley-Hotelling  trace by:   
$$\zeta^2 =\frac{V}{V+r} ,$$ 
where $r$ is the rank of $\Hmat$. 

\paragraph{The \texttt{zeta2.coef} function}

\textbf{ }\newline


Again, function \textbf{zeta2.coef} has arguments similar to those of the other related functions in this Section. 
A standard function call looks like:

\begin{verbatim}
> zeta2.coef(mat, H, r, indices)
\end{verbatim}



\paragraph{Example}

\textbf{ }\newline

Again, the same example as in the previous two subsections:

<<>>=
irisHmat <- ldaHmat(iris[1:4],iris$Species)
zeta2.coef(irisHmat$mat,H=irisHmat$H,r=irisHmat$r,c(1,3))
@  


\subsection{Criterion for generalized linear models}  
\label{ss glm}  

\paragraph{Motivation and definition}

\textbf{ }\newline

Variable selection in the context of generalized linear models is
typically based on the minimization of statistics that test the
significance of the excluded variables.  In particular, the likelihood
ratio, Wald and Rao statistics, or some monotone function of those
statistics, are often proposed as comparison criteria for variable
subsets of the same dimensionality. All these statistics are
asymptotically equivalent and, given suitable assumptions,
can be converted into information criteria, such as the AIC, which 
also compare  subsets of different dimensionalities 
(see references \cite{Lawless:78} and \cite{Lawless:87}  for further details). 

Among these criteria, Wald's statistic has some computational
advantages because it can always be derived from the same  maximum
likelihood and Fisher  information estimates (concerning the full
model). In particular, let $W_{all}$ be the value of the Wald statistic
testing the significance of the full covariate vector, let $\bvec$ and
$\F$ be the coefficient and Fisher information estimates,
and let $\Hmat$ be an auxiliary rank-one matrix given by $\Hmat =  \F
\bvec \bvec^t \F$. It follows that the value of Wald's statistic for
the excluded variables ($W_{exc}$) in a given subset is given by  
$W_{exc} = W_{all} - \tr (\F_{indices}^{-1} \Hmat_{indices})$, where 
$\F_{indices}$ and $\Hmat_{indices}$ are the portions of the $\F$ and
$\Hmat$ matrices associated with the selected variables. 

The \texttt{subselect} package provides a function \texttt{wald.coef}
that computes the value of Wald's statistic, testing the significance
of the excluded  variables, in the context of variable subset
selection in generalized linear models (see Subsection \ref{sss wald} for more details). 

\subsubsection{A helper function for the GLM context} 
\label{sss helper GLM} 

As with the multivariate linear model context, creating the Fisher
Information matrix and the auxiliary matrix $\Hmat$ described above
may be time-consuming. The \texttt{subselect} package provides the
helper function  \textbf{glmHmat}, which accepts  a glm object
(\texttt{fitdglmmodel}) and retrieves an estimate of  Fisher's Information
($\F$) matrix together with an auxiliary rank-one positive semi-definite
matrix ($\Hmat$), such that the positive eigenvalue of $\F^{-1}\Hmat$
equals the value of Wald's statistic  for testing the global
significance of the fitted model. These matrices may be used as input
to the function that computes the Wald criterion, as well as to the
variable selection search routines described in Section \ref{s
algoritmos}. 

As an example of the use of this helper function, in the context of
binary response  logistic regression models, consider the last 100
observations of the \emph{iris} data set (retaining only observations
for the \emph{versicolor} and \emph{virginica} species). Assume that
the goal is to judge subsets of the four morphometric variables (petal
and sepal lengths and widths), in models with the binary response
variable given by an indicator variable of the two remaining species.  
The helper function \texttt{glmHmat} will produce the Fisher Information  
matrix (output object \texttt{mat}), the auxiliary matrix $\Hmat$
(output object \texttt{H}), the function \texttt{call}, and the 
rank of $\Hmat$ (output object \texttt{r}), which is set to $1$,  as follows: 

<<>>=
iris2sp <- iris[iris$Species != "setosa",]
modelfit <- glm(Species ~ Sepal.Length + Sepal.Width + Petal.Length +
Petal.Width, data=iris2sp, family=binomial)  
Hmat <- glmHmat(modelfit)
Hmat
@ %$
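
Using the \texttt{Hmat} object just created, these relations can be sketched
directly in plain R (an illustration only, not package code): since $\Hmat$
has rank one, the trace of $\F^{-1}\Hmat$ equals its single positive
eigenvalue, i.e.\ the Wald statistic for the full covariate vector, and
subtracting the corresponding trace for a variable subset \texttt{K} gives
$W_{exc}$, the statistic for the excluded variables.

\begin{verbatim}
> W.all <- sum(diag(solve(Hmat$mat) %*% Hmat$H))   # Wald statistic, full model
> K <- c(1,3)                                      # an arbitrary variable subset
> W.all - sum(diag(solve(Hmat$mat[K,K]) %*% Hmat$H[K,K]))   # W_exc for subset K
\end{verbatim}

The last value should agree with the one returned by the \texttt{wald.coef}
function, described next.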



\subsubsection{Function \texttt{wald.coef} and the Wald coefficient}    
\label{sss wald}

The function \textbf{wald.coef} computes the value ($W_{exc}$)    
of the Wald statistic, as described above. 
The function takes as input arguments:
\begin{description} 
  \item[mat] An estimate of Fisher's information matrix $\F$ for the  
  full model variable-coefficient estimates; 
  \item[H] A matrix product of the form $\Hmat =  \F \bvec \bvec^t \F$ 
  where $\bvec$ is a vector of variable-coefficient estimates; 
   \item[indices] a numerical vector, matrix or 3-d array of integers 
  giving the indices of the variables in the subset. If a matrix is 
  specified, each row is taken to represent a different 
  \emph{k}-variable subset. If a 3-d array is given, it is assumed 
  that the third dimension corresponds to different cardinalities. 
  \item[tolval \texttt{and} tolsym] are parameters used in checks for  
   ill-conditioning and positive-definiteness of the Fisher Information and 
   the auxiliary ($\Hmat$) matrices (see the \texttt{wald.coef} help file for further details).  
\end{description} 

The values of arguments \texttt{mat} and \texttt{H} can be created  
using the \texttt{glmHmat} helper function, described in subsection \ref{sss helper GLM}.   


\paragraph{Example}

\textbf{ }\newline

An example of variable selection in the context of binary response 
regression models can be given using the same \texttt{crabs} data set  
from the MASS package that was already discussed in subsection \ref{sss Hmat}.  
The logarithms and original physical measurements 
of the \textit{Leptograpsus variegatus} crabs considered in the MASS crabs   
data set are used to fit a logistic model where each crab's sex forms 
the response variable. The quality of the variable subset made up by the  
first, sixth and seventh predictor variables is measured via the Wald coefficient.   

<<>>= 
library(MASS) 
lFL <- log(crabs$FL)
lRW <- log(crabs$RW)
lCL <- log(crabs$CL)
lCW <- log(crabs$CW)
logrfit <- glm(sex ~ FL + RW + CL + CW  + lFL + lRW + lCL + lCW,data=crabs,family=binomial)
lHmat <- glmHmat(logrfit) 
wald.coef(lHmat$mat,lHmat$H,indices=c(1,6,7),tolsym=1E-06)  
@  

 
It should be stressed that, unlike the criteria considered in the     
previous problems, $W_{exc}$ is not bounded above by 1, and it   
is a \textit{decreasing} function of subset quality.    

\section{Search algorithms}
\label{s algoritmos}

Given any data set for which the variable selection problem is 
relevant and a criterion that measures how well any 
given variable subset approximates the full data set, the 
problem of \emph{finding the best $k$-variable subsets for that     
criterion} arises naturally. Such problems are computationally 
intensive for the criteria considered in this package. 
A complete search among all $k$-variable subsets is a task which  
quickly becomes infeasible even for moderately-sized data 
sets, unless $k$ is very small or very large when compared with $p$.  

It may happen that
a criterion has special properties which render enumeration
methods possible for some moderate-sized data sets. Furnival and
Wilson's \cite{Furnival:74} Leaps and Bounds algorithm exploits such
properties in the context of subset selection in Linear Regression, and 
package co-author Duarte Silva 
(see references \cite{Pedro:01} and \cite{Pedro:02})
has discussed the application of this algorithm to various
methods of Multivariate Statistics. The package function \textbf{eleaps} implements 
these algorithms for the criteria discussed above, using C++ code 
(see Subsection \ref{ss eleaps}). This function is very fast for small or 
moderately sized datasets (with roughly $p<30$ variables). For larger data sets
(roughly $p>35$ variables) it becomes computationally infeasible and alternative 
search heuristics become necessary.

Traditional heuristics for the variable selection problem in the context
of linear regression, such as the backward elimination, forward selection or 
stepwise selection algorithms, belong to a general class of algorithms called 
\emph{greedy}. Their limitations are well-known. The search algorithms considered 
in this package do not fall in this category. They are of three types: a 
simulated annealing algorithm (implemented by
the \textbf{anneal} function and discussed in section \ref{ss anneal}); a 
genetic algorithm (function \textbf{genetic}, discussed in section \ref{ss genetic}) 
and a modified local search algorithm (function \textbf{improve}, discussed in 
section \ref{ss improve}). These algorithms were discussed in \cite{CCM:04}. 
The three corresponding package functions use Fortran code for computational efficiency.


\subsection{Common input arguments}
\label{ss input}

The four search functions described above share several input arguments. These are:

\begin{description}
  \item[mat] a covariance/correlation, information or sums of squares and products
    matrix of the variables from which the k-subset is to be selected (equivalent 
    to the homonymous argument in the criteria functions discussed in 
    Section \ref{s criteria}). 
  \item[kmin] the cardinality of the smallest subset that is sought.
  \item[kmax] the cardinality of the largest subset that is sought.
  \item[exclude] a vector of variables (referenced by their row/column
  numbers in matrix \texttt{mat}) that are to be forcibly excluded from
  the subsets considered.
  \item[include] a vector of variables (referenced by their row/column
  numbers in matrix \texttt{mat}) that are to be forcibly included in
  the subsets considered.
  \item[criterion] A character variable, indicating which criterion
  is to be used in judging the quality of the subsets. These are discussed
  in Section \ref{s criteria}. The default
  criterion is \texttt{rm} for exploratory and PCA analysis, \texttt{tau2} in
  the multivariate linear model context, and \texttt{Wald} for generalized linear models.
  \item[pcindices] (for the GCD criterion only) is a vector of ranks of 
  Principal Components that are to be used for comparison with the 
  k-variable subsets. The default value is the character string
  \texttt{first\_k}, which associates PCs 1 to \emph{k} to each
  cardinality \emph{k}  requested by the user.
  \item[H] Effect description matrix. Not used with the RM, RV or GCD
    criteria, hence the NULL default value.
  \item[r] Rank of the effects  matrix, $\Hmat$. Not used with the
    RM, RV or GCD criteria. 
  \item[tolval] a parameter to fine-tune the approach to ill-conditioned 
    \texttt{mat} arguments (see the search function help files for more details).
  \item[tolsym] a parameter to fine-tune the approach to non-symmetric 
    \texttt{mat} and $\Hmat$ arguments (see the search function help files for more details).
\end{description}

Each search function has additional specific input arguments that are discussed 
in the subsequent sections and in the appropriate help files.
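
As a purely schematic illustration of how these common arguments combine (not meant to
be run as given), a typical call might look as follows, where \texttt{myHmat} stands for
the output of one of the \texttt{*Hmat} helper functions discussed in subsection \ref{sss Hmat}:

<<eval=FALSE>>=
# Schematic call (not run); 'myHmat' stands for the output of one of the
# *Hmat helper functions (e.g. lmHmat or ldaHmat):
anneal(mat=myHmat$mat, kmin=3, kmax=5, exclude=c(2), include=c(1),
       criterion="tau2", H=myHmat$H, r=myHmat$r)
@ 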


\subsection{Common output objects}
\label{ss output}

The four search functions have a \textit{common output structure}: a \texttt{list} with 
five components:

\begin{description}
   \item[subsets] An $m\times kmax \times length(kmin:kmax)$ 3-dimensional 
   array, where $m$ is the number of solutions that are to be 
   produced for each cardinality (see the subsequent subsections for details). 
   For each cardinality (dimension 3) and each solution
   (dimension 1) the list of variables (referenced by their row/column
   numbers in matrix \texttt{mat}) in the subset is given (dimension 2). 
    For cardinalities  smaller than \texttt{kmax}, the extra final positions are set
   to zero.
   These output objects can be directly 
   passed to the \texttt{*.coef} functions discussed in Section \ref{s criteria}, 
   in order to compute the performance of these subsets with criteria other than 
   the one used to select them; 
   \item[values] An $m\times  length(kmin:kmax)$ matrix, giving for each cardinality 
   (columns), the criterion values  of the best $m$ (rows) subsets selected according to 
   the chosen criterion;
   \item[bestvalues] A $length(kmin:kmax)$ vector giving
   the overall best values of the criterion for each cardinality;
   \item[bestsets] A $length(kmin:kmax) \times kmax$ matrix, giving, for each 
   cardinality (rows), the variables (referenced by their row/column numbers in 
   matrix \texttt{mat}) in the best $k$-variable subset (can also be fed to the 
   \texttt{*.coef} functions).
   \item[call] The function call which generated the output.
\end{description}
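
For instance, if \texttt{out} is the object returned by any of the four search
functions, its components can be inspected as in the following sketch (not run here):

<<eval=FALSE>>=
out <- eleaps(cor(swiss), kmin=2, kmax=3, nsol=3)  # any search function returns this structure
out$bestvalues     # best criterion value for each cardinality
out$bestsets       # variables in the best subset of each cardinality
out$values         # criterion values of all nsol solutions, by cardinality
out$subsets[,,1]   # the nsol solutions for the first cardinality requested (here, k=2)
@ 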

%This output is of \textbf{class} \emph{subselect} (which inherits from class \texttt{list}). 
%Methods for this class are discussed in Section \ref{s methods}.



\subsection{The \texttt{eleaps} function: an efficient complete search}
\label{ss eleaps}

For each cardinality $k$ (with $k$ ranging from \texttt{kmin} to \texttt{kmax}),
% \texttt{eleaps} performs a branch and bound search for the best \texttt{nsol}-variable 
\texttt{eleaps} ("Extended Leaps and Bounds") performs a branch and bound search for 
the best \texttt{nsol}-variable 
subsets (\texttt{nsol} being a user-specified function argument), according to a 
specified criterion. The function \texttt{eleaps} implements Duarte Silva's 
adaptation for various multivariate analysis contexts (references 
\cite{Pedro:01} and \cite{Pedro:02}) of Furnival and 
Wilson's Leaps and Bounds Algorithm (reference \cite{Furnival:74}) for 
variable selection in Regression Analysis. If the search is not completed within 
a user-defined time limit (input argument \texttt{timelimit}), \texttt{eleaps} 
exits with a warning message, returning the best (but not necessarily optimal)
solutions found in the partial search performed.

In order to improve computation times, the bulk of computations are
carried out by C++ routines. Further details about the algorithm can 
be found in references \cite{Pedro:01} and \cite{Pedro:02} and in the 
comments  to the C++ code (in the package's \texttt{src} directory). 
A discussion of the criteria considered can be found in  
Section \ref{s criteria} above. The function checks for ill-conditioning of the 
input matrix (see the \texttt{eleaps} help file for details).

\paragraph{Examples}

\textbf{ }\newline

To illustrate the use of \texttt{eleaps}, we now consider five examples.

\begin{description}
\item[Example 1] deals with a small data set provided in the standard distribution 
of R. The \texttt{swiss} data set is a 6-variable data set with Swiss 
fertility and socioeconomic indicators (1888). Subsets of variables of all cardinalities 
are sought, since the \texttt{eleaps} function sets, by default,
\texttt{kmin} to 1 and \texttt{kmax} to one less than the number of columns in the 
input argument \texttt{mat}. The function call requests the best three
subsets of each cardinality, using the RM criterion (subsection \ref{sss RM}). 

<<>>=
data(swiss)
eleaps(cor(swiss),nsol=3, criterion="RM")
@ 

In this example, it is not necessary to explicitly specify the RM criterion, 
since it is the default criterion for any function call that does not explicitly set 
the \texttt{r} input argument (except in a GLM context; see Section \ref{ss glm}).

\item[Example 2] illustrates the use of the \texttt{include} and \texttt{exclude} 
arguments that are common to all the search functions provided by the package \texttt{subselect}. 
Here, we request only the 2- and 3-variable subsets that exclude
variable number 6 and include variable number 1. 
For each cardinality, three solutions are requested (argument \texttt{nsol}). 
The criterion requested is the GCD (see subsection \ref{sss GCD}) and the 
subspace used to gauge our solutions is the principal 
subspace spanned by the first three principal components of the full data set 
(argument \texttt{pcindices}). Our solutions will be the 2- and 3-variable subsets that 
span subspaces that are closest to this 3-d principal subspace.
 
<<>>=
data(swiss)
swiss.gcd <- eleaps(cor(swiss),kmin=2,kmax=3,exclude=6,include=1,nsol=3,criterion="gcd",pcindices=1:3)
swiss.gcd
@ 

\emph{The output of this function call can be used as input for a function call requesting 
the values of the chosen solutions under a different criterion}. For example, the 
values of the RM coefficient for the above solutions are given by:

<<>>=
rm.coef(mat=cor(swiss), indices=swiss.gcd$subsets)
@ 


\item[Example 3] involves a \textbf{Linear Discriminant Analysis} example with a very 
small data set, which has already been discussed in subsection \ref{sss tau2}.
We consider the Iris data and three groups, defined by species (setosa, 
versicolor and virginica). The goal is to select the 2- and 3-variable
subsets that are optimal for the linear discrimination (as measured 
by the \texttt{ccr12} criterion, \emph{i.e.}, by Roy's first root statistic).
 
<<>>=
irisHmat <- ldaHmat(iris[1:4],iris$Species)
eleaps(irisHmat$mat,kmin=2,kmax=3,H=irisHmat$H,r=irisHmat$r,crit="ccr12")
@ 

\item[Example 4] \label{ex Cars eleaps}
involves the MASS package \texttt{Cars93} data set
that was  already discussed in subsection \ref{sss ccr12}. The context
here is a \textbf{Canonical
Correlation Analysis}. Two groups of variables within the Cars93
data set are compared: the $\X$ group (variables 7, 8, 12 to 15, 17 to 22 and 25) 
and the $\Y$ group (variables 4 and 6). The latter variables are, respectively, the
car prices for a basic (Min.Price) and a premium (Max.Price) version of the car models. 
The goal is to select 4- to 6-variable subsets of the 13-variable $\X$ group that 
are optimal in terms of preserving the canonical correlations, according to the
\texttt{zeta2} criterion (warning: the $\Y$ group, with the two price variables, is kept 
intact; subset selection is carried out in the $\X$ group only).
The \texttt{tolsym} parameter is used to relax the symmetry
requirements on the effect matrix $\Hmat$ which, for numerical reasons,
is slightly asymmetric. Since corresponding off-diagonal entries of
matrix $\Hmat$ are different, but by less than \texttt{tolsym}, $\Hmat$ is 
replaced  by its symmetric part: $(\Hmat+\Hmat^t)/2$.

<<>>=
library(MASS)
data(Cars93)
Cars93.xgroup <- Cars93[,c(7:8,12:15,17:22,25)]
CarsHmat <- lmHmat(Cars93.xgroup,Cars93[,c(4,6)])
colnames(Cars93[,c(4,6)])
colnames(Cars93.xgroup)
#colnames(CarsHmat$mat)
Cars.eleaps <- eleaps(CarsHmat$mat, kmin=4, kmax=6, H=CarsHmat$H, r=CarsHmat$r, crit="zeta2", tolsym=1e-9)
Cars.eleaps$bestvalues
Cars.eleaps$bestsets
@ 
%$
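
The symmetrization triggered by \texttt{tolsym} simply replaces the effect matrix by
its symmetric part, which in R could be computed as in the following sketch (not run):

<<eval=FALSE>>=
Hsym <- (CarsHmat$H + t(CarsHmat$H))/2   # symmetric part of the effect matrix
@ 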

\item[Example 5.] A final example involves the use of the \texttt{eleaps} function for variable   
selection in the  context of a \textbf{generalized linear model}, more precisely, of a logistic  
regression model. We consider the last 100 observations of 
the \emph{iris} data set (the versicolor and virginica species) and seek the best variable  
subsets for the logistic regression that models the indicator variable distinguishing 
these two species as a function of the four morphometric variables.  

<<>>=
iris2sp <- iris[iris$Species != "setosa",]
logrfit <- glm(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,iris2sp,family=binomial)
Hmat <- glmHmat(logrfit)
eleaps(Hmat$mat, H=Hmat$H, r=Hmat$r, criterion="Wald", nsol=3)
@ 

It should be stressed that, unlike other criteria in the
\texttt{subselect} package, the Wald criterion is not bounded above by
1 and is a \emph{decreasing} function of subset quality, so that the
3-variable subsets do, in fact, perform better than their smaller-sized
counterparts.


\end{description}



\subsection{The \texttt{anneal} function: a simulated annealing algorithm}
\label{ss anneal}

Given a full data set, the \textbf{anneal} function uses a Simulated Annealing
algorithm to seek a $k$-variable subset which is
optimal, as a surrogate for the full set, with respect to a given criterion.  
The algorithm is described in detail in \cite{CCM:04}.

In brief, for each of the  solutions requested by the user (via the
\texttt{nsol} function argument), 
an initial $k$-variable subset (for k ranging from
\texttt{kmin} to \texttt{kmax}) of a full set of $p$  variables is
randomly selected and passed on to a Simulated Annealing algorithm. 
The algorithm then selects a random subset in the neighbourhood of the
current subset (neighbourhood of a subset S being defined as the
family of all $k$-variable subsets which differ from S by a
single variable), and decides whether to replace the current subset 
according to the Simulated Annealing rule, i.e., either (i) always,
if the alternative subset's criterion value is larger; or (ii)
with probability $\exp\left(\frac{ac-cc}{t}\right)$ if the alternative subset's
criterion value ($ac$) is smaller than that of the current
solution ($cc$), where $t$ (the temperature parameter) decreases
throughout the iterations of the algorithm. For each cardinality $k$,
the stopping criterion for the algorithm is the number of iterations,  
which is controlled by the user (function argument \texttt{niter}). 
Also controlled by the user are the 
initial temperature (argument \texttt{temp}), the rate of geometric
cooling of the temperature (argument \texttt{cooling}) and the  
frequency with which the temperature is cooled (argument \texttt{coolfreq}, 
which specifies the number of iterations after which the temperature is 
multiplied by $1-$\texttt{cooling}).
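
The following minimal sketch, in plain R, illustrates the acceptance rule and the
geometric cooling schedule just described (an illustration only; the package performs
these steps in compiled Fortran code):

<<eval=FALSE>>=
# Illustration only: acceptance rule and geometric cooling of the temperature.
accept.move <- function(ac, cc, t) {
  if (ac >= cc) TRUE                 # always accept a subset with a larger criterion value
  else runif(1) < exp((ac - cc)/t)   # otherwise accept with probability exp((ac-cc)/t)
}
# every 'coolfreq' iterations the temperature is multiplied by (1 - cooling):
# t <- t*(1 - cooling)
@ 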

Optionally, the best $k$-variable subset produced by the simulated
annealing algorithm may be used as input in a restricted local search
algorithm, for possible further improvement (this option is controlled
by the logical argument \texttt{improvement}  which, by default, is
TRUE). The user may force variables to be included and/or  excluded
from the $k$-variable subsets (arguments \texttt{include} and
\texttt{exclude}), and may specify initial solutions
(argument  \texttt{initialsol}). 

\paragraph{Computational effort}

\textbf{ } \newline
For each cardinality $k$, the total
number of calls to the procedure which computes the criterion values
is $nsol \times niter + 1$. These calls are the dominant computational
effort in each iteration of the algorithm.   In order to improve
computation times, the bulk of computations is carried out by a
Fortran routine. Further details about the Simulated Annealing
algorithm can  be found in reference \cite{CCM:04} and in the comments
to the Fortran code (in the \texttt{src} subdirectory for this
package).  
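
For instance, the call in Example 1 below, with \texttt{nsol}=4 and \texttt{niter}=10,
involves $4 \times 10 + 1 = 41$ criterion evaluations for each of the two cardinalities
requested.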


\paragraph{The \texttt{force} argument}

\textbf{ } \newline
For datasets with a very large number of variables
(currently $p>400$), it is necessary to set the \texttt{force}
argument to TRUE for the function to run, but this may cause a session
crash if there is not enough memory available. The function checks for
ill-conditioning of the input matrix (see details  in the
\texttt{anneal} help file). 


\paragraph{Reproducible solutions}

\textbf{ } \newline
The \texttt{anneal} algorithm is a random algorithm, so that solutions
are, in general,  different each time the algorithm is run. For
reproducible solutions, the logical argument \texttt{setseed} should
be set to TRUE during a session and left as TRUE in subsequent
function calls whose results are to be reproduced. 


\paragraph{Uniqueness of solutions}

\textbf{ } \newline
The requested \texttt{nsol} solutions are \emph{not necessarily different
solutions}. The \texttt{nsol} solutions are computed separately, so
that (unlike what happens with the \texttt{eleaps} function), the same 
variable subset may appear more than once among the \texttt{nsol}
solutions. 

\paragraph{Examples}

\textbf{ } \newline
Four \textbf{examples} of usage of the \texttt{anneal} function are now given. 

\begin{description}
\item[Example 1] is a very simple example, using a small data set and very few iterations
of the algorithm, with the RM criterion (although for a data set of this size it is 
preferable to use the \texttt{eleaps} complete search, described in Section \ref{ss eleaps}).


<<>>=
data(swiss)
anneal(cor(swiss),kmin=2,kmax=3,nsol=4,niter=10,criterion="RM")
@ 

\item[Example 2] uses the 62-variable farm data set, included in this package (see \texttt{help(farm)} for details).

<<>>=
data(farm)
anneal(cor(farm), kmin=6, nsol=5, criterion="rv")
@ 

Since the \texttt{kmax} argument was not specified, the \texttt{anneal} function by default 
assigns it the same value as \texttt{kmin}. Notice that there may be repeated subsets among 
the 5 solutions produced by the function call.
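
The distinct subsets can be extracted with R's \texttt{unique} function, for example by
storing the output in an object (say, \texttt{farm.rv}), as in the following sketch (not run):

<<eval=FALSE>>=
farm.rv <- anneal(cor(farm), kmin=6, nsol=5, criterion="rv")
unique(farm.rv$subsets)   # keep only the distinct subsets among the 5 solutions
@ 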

\item[Example 3] \label{ex CysticFibrosis anneal}
involves subset selection in the context of a
\textbf{(univariate)  Multiple Linear Regression}. The data set \texttt{cystfibr},
included in the ISwR package, contains lung function data for cystic fibrosis
patients (7-23 years old). The data consists of 25 observations on 10 variables.
The objective is to predict the variable \texttt{pemax} (maximum expiratory pressure) from
relevant patient characteristics.  A best subset of linear predictors is sought, 
using the \texttt{tau2} criterion  which, in the case of a univariate linear regression, is
just  the standard Coefficient of Determination, $R^2$. 

<<>>=
library(ISwR)
cystfibrHmat <- lmHmat(pemax ~ age+sex+height+weight+bmp+fev1+rv+frc+tlc, data=cystfibr) 
colnames(cystfibrHmat$mat)
cystfibr.tau2 <- anneal(cystfibrHmat$mat, kmin=4, kmax=6, H=cystfibrHmat$H, r=cystfibrHmat$r, crit="tau2")  
cystfibr.tau2$bestvalues  
cystfibr.tau2$bestsets  
@
 

The algorithm underlying the \texttt{anneal} function is a random algorithm, so its 
solutions are not necessarily reproducible. Note also that the solutions for different 
cardinalities need not be nested, unlike those produced by standard greedy algorithms 
such as forward selection or backward elimination.
 
That the value of the $\tau^2$ coefficient in the context of a     
univariate linear regression is the coefficient of determination can   
be confirmed via R's standard \texttt{lm} function:  

<<>>= 
summary(lm(pemax ~ weight+bmp+fev1+rv, data=cystfibr))$r.squared
@ 

The other three  multivariate linear hypothesis
criteria discussed in Section \ref{ss linear model criteria} also give
the value of $R^2$ in a univariate multiple regression, as is
illustrated by the following command, which requests the
value of the $\xi^2$ criterion for the above solutions.

<<>>=
xi2.coef(mat=cystfibrHmat$mat, indices=cystfibr.tau2$bestsets,  H=cystfibrHmat$H, r=cystfibrHmat$r)
@ 
%$

\item[Example 4] considers variable selection in the context of a 
\textbf{logistic regression} model. We consider the last 100 
observations of the \emph{iris} data set (that is, the observations 
for the versicolor and virginica species) and seek the best 1- to 
3-variable subsets for the logistic regression that uses
the four morphometric variables to model the probability of each of 
these two species. 

<<>>= 
data(iris)
iris2sp <- iris[iris$Species != "setosa",]
logrfit <- glm(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,iris2sp,family=binomial)
Hmat <- glmHmat(logrfit)
iris2p.Wald <- anneal(Hmat$mat,1,3,H=Hmat$H,r=1,nsol=5,criterion="Wald")
iris2p.Wald$bestsets
iris2p.Wald$bestvalues
@
\end{description}
%$


\subsection{The \texttt{genetic} function: a genetic algorithm}
\label{ss genetic}


Given a full data set, the \textbf{genetic} function uses a Genetic Algorithm 
to seek a $k$-variable subset which is
optimal, as a surrogate for the full set, with respect to a given criterion.  
The algorithm is described in detail in \cite{CCM:04}.

In brief, for each cardinality $k$ (with $k$ ranging from the function
arguments \texttt{kmin} to \texttt{kmax}),
an initial population of  $k$-variable subsets is randomly
selected from a full set of $p$  variables. The size of this initial
population is specified by the function argument \texttt{popsize}
(\texttt{popsize}=100 by default). 
In each iteration, \texttt{popsize}/2 couples
are formed from among the population and each couple generates a child
(a new $k$-variable subset)
which inherits properties of its parents (specifically, it inherits
all variables common to both parents and a random selection of
variables in the symmetric difference of its parents' genetic makeup).
Each offspring may optionally undergo a mutation in the form of a
local improvement algorithm  (see subsection \ref{ss improve}),
with a user-specified probability. Whether or not mutations occur is
controlled by the logical variable \texttt{mutate} (which is FALSE by
default), and the respective probability is given by the argument 
\texttt{mutprob}.  The parents
and offspring are ranked according to their criterion value, and the
best \texttt{popsize} of these k-subsets  will make up the next
generation, which is used as the current population in the subsequent
iteration. 
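
The crossover step can be sketched in a few lines of plain R (an illustration only; the
package's compiled Fortran code is more elaborate):

<<eval=FALSE>>=
# Illustration only: a child inherits all variables common to both parents and a
# random selection from the symmetric difference of the parents' variable sets.
crossover <- function(parent1, parent2, k) {
  common  <- intersect(parent1, parent2)
  symdiff <- setdiff(union(parent1, parent2), common)
  sort(c(common, sample(symdiff, k - length(common))))
}
@ 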

The stopping rule for the algorithm is the number of generations,
which is specified by the function argument \texttt{nger}.

Optionally, the best \emph{k}-variable subset produced by the Genetic
Algorithm may be passed as input to a restricted local improvement
algorithm, for possible further improvement (see subsection \ref{ss
improve}).  

The user may force variables to be included and/or excluded from the
\emph{k}-subsets (function arguments \texttt{include} and \texttt{exclude}), 
and may specify an initial population (function argument \texttt{initialpop}).


The function checks for ill-conditioning of the input matrix (see the
\texttt{genetic} help file for details).


\paragraph{Genetic diversity}

\textbf{ } \newline
For this algorithm to run, it needs genetic diversity in the
population (i.e., a large number of different variables in the
variable subsets that are being considered). This means that the
function will not run on data sets with a small number of variables, $p$. 
In an attempt to ensure this genetic diversity, the
function has an optional argument \texttt{maxclone}, which is an  
integer variable specifying the maximum number of identical replicates 
(clones) of individuals (variable subsets) that is acceptable in the population. 
 However, even \texttt{maxclone}=0 does not guarantee that there are
no repetitions: only the offspring  of couples are tested for
clones. If any such clones are rejected, they are replaced by a
k-variable subset chosen at random, without any further clone tests. 
%See Example \ref{exemp 1 genetic}. 

\paragraph{Computational effort}

\textbf{ } \newline
For each cardinality \emph{k}, the total number of calls to the
procedure which computes the criterion 
values is $popsize + nger \times popsize/2$. These calls are the
dominant computational effort in each iteration of the algorithm.  
In order to improve computation times, the bulk of computations are
carried out by a Fortran routine. Further details about the Genetic
Algorithm can be found in \cite{CCM:04} and in the comments to the 
Fortran code (in the \texttt{src} subdirectory for this package).  
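
With the default \texttt{popsize} of 100 and, for example, \texttt{nger}=20 generations,
this amounts to $100 + 20 \times 50 = 1100$ criterion evaluations for each cardinality.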

\paragraph{The \texttt{force} argument}

\textbf{ } \newline
For datasets with a very
large number of variables (currently $p>400$), it is 
necessary to set the \texttt{force} argument to TRUE for the function to
run, but this may cause a session crash if there is not enough memory
available. 


\paragraph{Reproducible solutions}

\textbf{ } \newline
The \texttt{genetic} algorithm is a random algorithm, so that solutions
are, in general,  different each time the algorithm is run. For
reproducible solutions, the logical argument \texttt{setseed} should
be set to TRUE during a session and left as TRUE in subsequent
function calls whose results are to be reproduced. 

\paragraph{Uniqueness of solutions}

\textbf{ } \newline
The requested \texttt{nsol} solutions are \emph{not necessarily different
solutions}. The \texttt{nsol} solutions are computed separately, so
that (unlike what happens with the \texttt{eleaps} function), the same 
variable subset may appear more than once among the \texttt{nsol}
solutions.   


\paragraph{Examples}

\textbf{ } \newline
Two \textbf{examples} of use of the \texttt{genetic} function are now given:

\begin{description}
\item[Example 1.] \label{exemp 1 genetic} 
Consider the 62-variable Portuguese \texttt{farm} dataset, already discussed in
subsection \ref{sss RV} (page \pageref{sss RV}). We seek a 10-variable
subset which spans a subspace as close as possible to the first 10-d
principal subspace, so that criterion GCD (subsection \ref{sss GCD})
is relevant. In the \texttt{genetic} function (as in the
\texttt{anneal} and \texttt{improve} functions, but unlike the
\texttt{eleaps} function), the default value of \texttt{kmax} is
equal to the value of \texttt{kmin}, so \texttt{kmax} does not need to be
specified in this case.

<<>>=
farm.gcd <- genetic(cor(farm), kmin=10, crit="gcd")
farm.gcd$bestsets
farm.gcd$bestvalues
@ 

Since the argument \texttt{popsize} was not specified, the default
population size ($100$) was used. As stated above, not all 100
solutions are necessarily different. To select only the different
solutions R's function \texttt{unique} may be used:

<<>>=
unique(farm.gcd$subsets)
@ 

If more than just a few different solutions are desired,    
it may be possible to increase the diversity of solutions by using the 
\texttt{maxclone} argument. The following example compares the number 
of different solutions obtained using the default value of 
\texttt{maxclone}=5 and a user-specified value 
\texttt{maxclone}=0. Since the \texttt{subsets} output object is a 3-d 
array, \texttt{unique} also produces a 3-d object, whose first 
dimension gives the number of different solutions. 

<<>>=
dim(unique(genetic(cor(farm), kmin=10, crit="gcd")$subsets))
dim(unique(genetic(cor(farm), kmin=10, maxclone=0, crit="gcd")$subsets))
@ 

\item[Example 2.] We consider subset selection in the context of a  
\textbf{Canonical Correlation Analysis}. The data set in this example 
is the same as in Example 1 above.  
Two groups of variables within the farm data set are compared: the 
$X$ group (all variables except the first four) and the $Y$ group 
(variables 1 to 4).  The goal is to select 4- to 6-variable 
subsets of the 58-variable $X$ group that  are optimal in terms of 
preserving the canonical correlations, according to the \texttt{zeta2} 
criterion. 
(Warning: the 4-variable $Y$ group is kept  intact; subset 
selection is carried out in the $X$  group only).  
As can be seen from the results, even a four-variable $X$ subgroup 
essentially reproduces the canonical correlations with the $Y$ group of variables.
%The \texttt{tolsym} 
%parameter is used to relax the symmetry requirements on the effect 
%matrix $\Hmat$ which, for numerical reasons, is slightly 
%asymmetric. Since corresponding off-diagonal entries of matrix $\Hmat$ 
%are different, but by less than \texttt{tolsym}, $\Hmat$ is  replaced 
%by its symmetric part: $(\Hmat+\Hmat^t)/2$.  

<<>>= 
data(farm) 
farm.xgroup <- farm[,-c(1,2,3,4)]
farmHmat <- lmHmat(farm.xgroup,farm[,1:4])
colnames(farmHmat$mat)
farm.gen <- genetic(farmHmat$mat, kmin=4, kmax=6, H=farmHmat$H, r=farmHmat$r,crit="zeta2", maxclone=0, popsize=150)
farm.gen$bestvalues
farm.gen$bestsets
@ 
\end{description}

WARNING: The column numbers given by the last command are relative only to the $X$ group of variables, and so must be read off the list of variable names produced by the \texttt{colnames} command above (which are the same as those that would be printed out by the command \texttt{colnames(farm.xgroup)}). 



\subsection{The \texttt{improve} function: a restricted improvement algorithm}
\label{ss improve}

The function \textbf{improve} implements a Restricted Local
Improvement search for an optimal $k$-variable subset. Given a set of
variables, the algorithm seeks a $k$-variable subset which is
optimal, as a surrogate for the whole set, with respect to a given
criterion.  

In brief, for each  solution requested by the user (the number of
which is set by the \texttt{nsol} function argument),
an initial $k$-variable subset (for $k$ ranging from 
argument \texttt{kmin} to argument \texttt{kmax}) of a full set of $p$
variables is 
randomly selected and the variables not belonging to this subset are
placed in a queue. The possibility of replacing a variable in the
current $k$-subset with a variable from the queue is then explored.  
More precisely, a variable is selected, removed from the queue, and the 
$k$ values of the criterion which would result from swapping this selected
variable with each variable in the current subset are computed. If the
best of these values improves the current criterion value, the current
subset is updated accordingly. In this case, the variable which leaves
the subset is added to the queue, but only if it has not previously
been in the queue (i.e., no variable can enter the queue twice). The
algorithm proceeds until the queue is emptied  (see reference
\cite{CCM:04} for more details).   
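
The following sketch, in plain R, conveys the logic of this restricted search. It is an
illustration only, assuming a generic function \texttt{crit} that returns the criterion
value of a subset; the package performs these steps in compiled Fortran code.

<<eval=FALSE>>=
# Illustration only: restricted local improvement of a k-variable 'subset', with
# 'queue' holding the remaining variables and 'crit' a generic criterion function.
local.improve <- function(subset, queue, crit) {
  seen <- queue                            # variables that have already entered the queue
  while (length(queue) > 0) {
    cand <- queue[1]; queue <- queue[-1]   # take the next variable from the queue
    vals <- sapply(seq_along(subset), function(i) crit(replace(subset, i, cand)))
    if (max(vals) > crit(subset)) {        # best swap improves the current criterion value
      i <- which.max(vals)
      out <- subset[i]; subset[i] <- cand
      if (!(out %in% seen)) {              # no variable may enter the queue twice
        queue <- c(queue, out); seen <- c(seen, out)
      }
    }
  }
  subset
}
@ 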

The user may force variables to be included and/or excluded (arguments
\texttt{include} and \texttt{exclude}) from the
$k$-variable subsets, and may specify initial solutions (using the
\texttt{initialsol} argument).


The function checks for ill-conditioning of the input matrix (see the
\texttt{improve} help function for more details).

\paragraph{Computational effort}

\textbf{ } \newline
For each cardinality k, the total number of calls to the procedure
which computes the criterion values is O($nsol \times k \times p$). 
These calls are the dominant computational
effort in each iteration of the algorithm.  
In order to improve computation times, the bulk of computations are
carried out in a Fortran routine. Further details about the algorithm can
be found in Reference \cite{CCM:04} and in the comments to the Fortran code (in
the \texttt{src} subdirectory for this package). 


\paragraph{The \texttt{force} argument}

\textbf{ } \newline
For datasets with a very
large number of variables (currently $p>400$), it is 
necessary to set the \texttt{force} argument to \texttt{TRUE} for the
function to run, but this may cause a session crash if there is not
enough memory available. 


\paragraph{Reproducible solutions}

\textbf{ } \newline
Like the \texttt{anneal} and \texttt{genetic} algorithms, the
\texttt{improve} algorithm is a random algorithm, so that solutions
are, in general, different each time the algorithm is run. For
reproducible solutions, the logical argument \texttt{setseed} should
be set to TRUE during a session and left as TRUE in subsequent
function calls whose results are to be reproduced. 

\paragraph{Uniqueness of solutions}

\textbf{ } \newline
The requested \texttt{nsol} solutions are \emph{not necessarily different
solutions}. The \texttt{nsol} solutions are computed separately, so
that (unlike what happens with the \texttt{eleaps} function), the same 
variable subset may appear more than once among the \texttt{nsol}
solutions. 


\paragraph{Examples}

\textbf{ } \newline
Two \textbf{examples} of use of the \texttt{improve} function are now given:

\begin{description}
\item[Example 1.] A very simple example, using the standardized \texttt{swiss} data
set from R's general distribution (see \texttt{help(swiss)} for more
information on this data set, which was also used in the first example
of Section \ref{ss eleaps}). The purpose is to determine the best 
2-variable set, so that orthogonally projecting the 47-point (standardized) 
configuration on the
subspace spanned by those two  variables would provide a configuration
as similar as possible (allowing for rigid rotations, translation of
the origin and global changes of scale) to the original 47-point
configuration defined by all six (standardized) variables in the data set. Two
different requests are made, the first without any restrictions, while
the second forces the inclusion of variable 1, and the exclusion of
variable 6, from the selected subsets.

<<>>=
swiss.imp1 <- improve(mat=cor(swiss),kmin=2,kmax=3,nsol=4,criterion="GCD")
swiss.imp2 <- improve(cor(swiss),2,3,nsol=4,criterion="GCD",include=c(1),exclude=6)
swiss.imp1$bestvalues
swiss.imp1$bestsets
swiss.imp2$bestvalues
swiss.imp2$bestsets
@ 


\item[Example 2.] A second very simple example in a \textbf{Linear
Discriminant Analysis} setting. Here we consider the same situation 
as in the example for the discussion of the \texttt{ldaHmat} function,
in subsection \ref{sss Hmat}. The goal is to select the 2- and 3-variable
subsets that are optimal for the linear discrimination of the three
\emph{iris} species in Fisher's \emph{iris} data (as measured 
by the \texttt{ccr12} criterion).

<<>>=
data(iris)
irisHmat <- ldaHmat(iris[1:4],iris$Species)
improve(irisHmat$mat,kmin=2,kmax=3,H=irisHmat$H,r=irisHmat$r,crit="ccr12")
@ 

\item[Example 3] is a more involved example, for a \textbf{MANOVA}  
context. It follows up on the example used when introducing the 
\textbf{glhHmat} function, in subsection \ref{sss Hmat}. More 
precisely, in this data set 200 crabs are classified by two factors, 
\texttt{sex} and \texttt{sp}, with two levels each. There are also four 
measurement variables, considered in both their original  
and logarithmic scales. The goal is to detect the variables that are 
prominent in an analysis of the effect of the \texttt{sp} factor after 
controlling for the \texttt{sex} factor. The initial commands in this example were already 
discussed in subsection \ref{sss Hmat}. 


<<>>= 
library(MASS)
data(crabs)
lFL <- log(crabs$FL) ; lRW <- log(crabs$RW); lCL <- log(crabs$CL); lCW <- log(crabs$CW)
C <- matrix(0.,nrow=2,ncol=4)
C[1,3] = C[2,4] = 1.
C
Hmat5 <- glhHmat(cbind(FL,RW,CL,CW,lFL,lRW,lCL,lCW) ~ sp*sex,C=C,data=crabs)
improve(mat=Hmat5$mat, kmin=4, nsol=3, H=Hmat5$H, r=Hmat5$r, crit="xi2",tolsym=1e-06)
@ 
\end{description}


%\section{Methods}
%\label{s methods}


\bibliographystyle{plain}
\bibliography{subbib}

\end{document}