\documentclass[a4paper]{article} \usepackage{amsmath, amssymb} \usepackage{color} \usepackage[latin1]{inputenc} %\VignetteIndexEntry{subselect} \setlength{\textwidth}{16cm} \setlength{\textheight}{21cm} \setlength{\evensidemargin}{0cm} \setlength{\oddsidemargin}{0cm} \newcommand{\bvec}{\ensuremath{\mathrm{\mathbf{b}}}} \newcommand{\A}{\ensuremath{\mathrm{\mathbf{A}}}} \newcommand{\C}{\ensuremath{\mathrm{\mathbf{C}}}} \newcommand{\E}{\ensuremath{\mathrm{\mathbf{E}}}} \newcommand{\F}{\ensuremath{\mathrm{\mathbf{F}}}} \newcommand{\Hmat}{\ensuremath\mathrm{\mathbf{H}}} \newcommand{\I}{\ensuremath{\mathrm{\mathbf{I}}}} \newcommand{\M}{\ensuremath{\mathrm{\mathbf{M}}}} \renewcommand{\P}{\ensuremath{\mathrm{\mathbf{P}}}} \renewcommand{\S}{\ensuremath{\mathrm{\mathbf{S}}}} \newcommand{\T}{\ensuremath{\mathrm{\mathbf{T}}}} \newcommand{\U}{\ensuremath{\mathrm{\mathbf{U}}}} \newcommand{\X}{\ensuremath{\mathrm{\mathbf{X}}}} \newcommand{\Y}{\ensuremath{\mathrm{\mathbf{Y}}}} \newcommand{\umn}{\ensuremath{\mathrm{\mathbf{1}_n}}} \newcommand{\SsubGp}{\mbox{${\mathrm{\mathbf{S}}}_{\{\cal G\}}$}} \newcommand{\tr}{\ensuremath{\mathrm{tr}}} \title{The \texttt{subselect} R package} \author{Jorge Cadima, Jorge Orestes Cerdeira, Pedro Duarte Silva, Manuel Minhoto} \begin{document} \maketitle \begin{center} \textbf{Version 0.12} \end{center} \hrule \vskip 0.5cm \abstract{The \texttt{subselect} package addresses the issue of variable selection in different statistical contexts, among which exploratory data analyses; univariate or multivariate linear models; generalized linear models; principal components analysis; linear discriminant analysis, canonical correlation analysis. Selecting variable subsets requires the definition of a numerical criterion which measures the quality of any given variable subset as a surrogate for the full set of variables. The current version of the \texttt{subselect} package provides eight different criteria. For each available criterion, the package provides a function that computes the criterion value of any given subset. More significantly, the package provides efficient search functions that seek the best subsets of any given size, for a specified criterion. } \vskip 0.5cm \hrule \tableofcontents \vskip 0.5cm \section{Introduction} \subsection{The variable selection problem} Selecting a subset of variables which can be used as a surrogate for a full set of variables, without major loss of quality, is a problem that arises in numerous contexts. Multiple linear regression has been the classical statistical setting for variable selection and the greedy algorithms of backward elimination, forward selection and stepwise selection have long been used to choose a subset of linear predictors. However, their limitations are well known and more efficient search algorithms have also been devised, such as Furnival and Wilson's Leaps and Bounds Algorithm (reference \cite{Furnival:74}). The \texttt{subselect} package addresses the issue of variable selection in different statistical contexts, among which: \begin{itemize} \item exploratory data analyses; \item univariate or multivariate linear models; \item generalized linear models; \item principal components analysis; \item linear discriminant analysis; \item canonical correlation analysis. \end{itemize} \subsection{Measuring the quality of a subset} Selecting variable subsets requires the definition of a numerical \textbf{criterion} which measures the quality of any given variable subset. 
In a univariate multiple linear regression, for example, possible measures of the quality of a subset of predictors are the coefficient of determination $R^2$, the $F$ statistic in a goodness-of-fit test, its corresponding $p$-value or Akaike's Information Criterion (AIC), to give a few examples. The \texttt{subselect} package assumes that all potential variable subsets can be ranked according to a well-defined numerical criterion that is relevant for the problem at hand, and that \textit{the ultimate goal is to select the best subsets for any given cardinality}, i.e., the subsets of any given size which have the highest criterion values. The criteria currently available in the package will be further discussed in Section \ref{s criteria}. For each available criterion, the package provides a function that computes the criterion value of any given subset. More importantly, the package provides efficient functions to search for the best subsets of any given size.
\subsection{Selecting the best subsets}
Identifying the best variable subsets of a given cardinality is -- for many criteria -- a computationally intensive problem with datasets of even moderate size, as is well known in the standard linear regression context. The \texttt{subselect} package has a function (\texttt{eleaps}) implementing Duarte Silva's adaptations (references \cite{Pedro:01} and \cite{Pedro:02}) of Furnival and Wilson's Leaps and Bounds Algorithm \cite{Furnival:74} for variable selection, using the criteria considered in this package. This function is very fast for datasets with up to approximately $p=30$ variables and is guaranteed to identify the $m$ best subsets of any size $k$ from $1$ to $p$, with $m$ in the range from $1$ to $\binom{p}{k}$. It is \textbf{the recommended function for variable selection with small or moderately-sized data sets} and is discussed in subsection \ref{ss eleaps}. For \textbf{larger data sets} (roughly $p>35$) the \texttt{eleaps} function is no longer computationally feasible and alternative search heuristics are needed. The \texttt{subselect} package provides \textbf{three functions with different search algorithms}:
\begin{description}
\item[\texttt{anneal}] a simulated annealing-type search algorithm;
\item[\texttt{genetic}] a genetic algorithm; and
\item[\texttt{improve}] a modified local improvement algorithm.
\end{description}
These algorithms are described in reference \cite{CCM:04} and are discussed in Section \ref{s algoritmos}. They perform better than the greedy-type algorithms, such as the standard stepwise selection algorithms widely used in linear regression. All four search functions invoke code written in either C++ (\texttt{eleaps}) or Fortran (\texttt{anneal}, \texttt{genetic} and \texttt{improve}), to speed up computation times. The search functions are further discussed in Section \ref{s algoritmos}.
\subsection{Installing and loading \texttt{subselect}}
The \texttt{subselect} package is on CRAN (cran.r-project.org) and can be installed like any other CRAN package, namely using R's \texttt{install.packages} facility:
\begin{verbatim}
> install.packages("subselect")
\end{verbatim}
This vignette assumes that the \texttt{subselect} package has already been installed in the user's system.
Once installed, \texttt{subselect} must be loaded into an R session:
<<>>=
library(subselect)
@
A preliminary view of the available functions and of their purpose can, as usual, be obtained via:
\begin{verbatim}
> library(help=subselect)
\end{verbatim}
\section{Criteria for variable selection}
\label{s criteria}
Currently, the package accepts \textbf{eight different criteria} measuring the quality of any given variable subset. These criteria can be (simplistically) classified into three groups: (i) criteria that are useful in an exploratory analysis or principal component analysis of a data set; (ii) criteria that are model-based and are relevant in contexts that can be framed as a multivariate linear model (multivariate linear regressions, MANOVAs, linear discriminant analysis, canonical correlation analysis, etc.); and (iii) criteria that are useful for variable selection in the context of a generalized linear model.
\subsection{Criteria for exploratory data analyses and PCA}
The three criteria relevant for exploratory data analyses have been previously suggested in the literature (see references \cite{CadJol:01} and \cite{CCM:04} for a fuller discussion). They can all be framed as matrix correlations \cite{Ramsay:84} involving an $n\times p$ data matrix (where $p$ is the number of variables and $n$ the number of observations on those variables) and projections of its columns on some subspace of $\mathbb{R}^n$. In the \texttt{subselect} package, these three criteria are called:
\begin{description}
\item[RM] The square of this matrix correlation is the proportion of total variance (inertia) that is preserved if the $p$ variables are orthogonally projected onto the subspace spanned by a given $k$-variable subset. \emph{Choosing the $k$-variable subset that maximizes this criterion is therefore akin to what is done in a Principal Component Analysis, with the difference that the optimal $k$-dimensional subspace is chosen only from among the $\binom{p}{k}$ subspaces spanned by $k$ of the $p$ variables}. This is the second of McCabe's four criteria for what he terms \textbf{\textit{principal variables}} \cite{McCabe:84}. See Subsection \ref{sss RM} for more details.
\item[GCD] Yanai's \emph{Generalized Coefficient of Determination} \cite{Ramsay:84} is a measure of the closeness of two subspaces that is closely related to the \emph{gap metric} \cite{Golub:96}. More precisely, it is the average of the squared canonical correlations between two sets of variables spanning each of the subspaces \cite{Ramsay:84}. In this context, these subspaces are the subspace spanned by $g$ Principal Components of the full, $p$-variable, data set; and the subspace spanned by a $k$-variable subset of those $p$ variables. By default the PCs considered are the \emph{first} $g$ PCs, and the number of variables and PCs considered is the same ($k=g$), but other options can be specified by the user. \emph{Maximizing the GCD corresponds to choosing the $k$ variables that span a subspace that is as close as possible to the principal subspace spanned by the $g$ principal components that were specified}. See Subsection \ref{sss GCD} for more details.
\item[RV] Escoufier's RV-criterion \cite{RobEsc:76} measures the similarity of the $n$-point configurations defined by the $n$ rows of two (comparable) data matrices, allowing for translations of the origin, rigid rotations and global changes of scale.
In this case, the two matrices that are compared are the original data matrix $\X$ and the matrix $\P_k\X$ which results from projecting the $p$ columns of $\X$ onto a $k$-dimensional subspace spanned by $k$ of the $p$ original variables. \emph{Maximizing this criterion, among all $k$-variable subsets, corresponds to choosing the $k$ variables that span the subspace which best reproduces the original $n$-point configuration, allowing for translations of the origin, rigid rotations and global changes of scale}. See Subsection \ref{sss RV} for more details.
\end{description}
The package functions that compute the values of these indices (described below) all have the following two input \emph{arguments}:
\begin{description}
\item[\texttt{mat}] denotes the covariance or correlation matrix of the full data set;
\item[\texttt{indices}] is a vector, matrix or 3-d array of integers identifying the variables in the subsets of interest.
\end{description}
The function for the $GCD$ coefficient has an additional argument, described in subsection \ref{sss GCD}.
\subsubsection{The RM coefficient}
\label{sss RM}
\paragraph{Motivation and definition} \textbf{ }\newline
It is well-known that the first $k$ principal components of an $n\times p$ data set are the $k$ linear combinations of the $p$ original variables which span the subspace that maximizes the proportion of total variance retained, if the data are orthogonally projected onto that subspace. A similar problem, but considering only the subspaces that are \emph{spanned by $k$-variable subsets of the $p$ original variables}, can be formulated \cite{CadJol:01} as follows: determine the $k$-variable subset which maximizes the square of the RM coefficient,
\begin{equation}
\label{eq RM}
RM = \mathrm{corr}(\X,\P_k \X) = \sqrt{\frac{\mathrm{tr}(\X^t \P_k \X)}{\mathrm{tr}(\X^t \X)}}= \sqrt{\frac{\sum\limits_{i=1}^{p}\lambda_i (r_m)_i^2}{\sum\limits_{j=1}^{p}\lambda_j}} = \sqrt{\frac{\mathrm{tr}([\S^2]_{({\cal K})}\S_{\cal K}^{-1})}{\mathrm{tr}(\S)}} ,
\end{equation}
with \texttt{corr} denoting the matrix correlation; \texttt{tr} the matrix trace; and where
\begin{itemize}
\item $\X$ is the full (column-centered or possibly standardized) data matrix;
\item $\P_k$ is the matrix of orthogonal projections on the subspace spanned by a given $k$-variable subset;
\item $\S=\frac{1}{n}\X^t\X$ is the $p\times p$ covariance (or correlation) matrix of the full data set;
\item ${\cal K}$ denotes the index set of the $k$ variables in the variable subset;
\item $\S_{\cal K}$ is the $k\times k$ principal submatrix of matrix $\S$ which results from retaining the $k$ rows/columns whose indices belong to ${\cal K}$;
\item $[\S^2]_{({\cal K})}$ is the $k\times k$ principal submatrix of $\S^2$ obtained by retaining the $k$ rows/columns associated with set ${\cal K}$;
\item $\lambda_i$ stands for the $i$-th largest eigenvalue of the covariance (or correlation) matrix defined by $\X$;
\item $(r_m)_i$ stands for the multiple correlation between the $i$-th principal component of the full data set and the $k$-variable subset.
\end{itemize}
The values of the RM coefficient will lie in the interval $[0,1]$, with larger values indicating a higher proportion of explained variance.
Note that it is not relevant whether sample (co)variances (with denominator $n\!-\!1$), or even the sums of squares and products matrices, are used instead of matrix $\S$ as defined above: RM is insensitive to multiplication of $\S$ by a constant.
\paragraph{The \texttt{rm.coef} function} \textbf{ }\newline
The \texttt{subselect} package provides the function \textbf{rm.coef} which computes the RM coefficient, given a positive definite matrix (function argument \texttt{mat}, corresponding to $\S$ in the final expression of equation (\ref{eq RM})) and a vector with $k$ indices (function argument \texttt{indices}, corresponding to ${\cal K}$ in (\ref{eq RM})):
\begin{verbatim}
> rm.coef(mat,indices)
\end{verbatim}
The \textbf{rm.coef} function uses the final expression in equation (\ref{eq RM}). In the standard setting, matrix $\S$ is a covariance or correlation matrix for the full data set, but it may be any positive definite matrix, such as a matrix of non-central second moments.
\paragraph{Examples} \textbf{ }\newline
As an example of the use of the \textbf{rm.coef} function, let us compute the value of the RM coefficient for the two petal measurements in Fisher's \emph{iris} data (variables 3 and 4, respectively, \texttt{Petal.Length} and \texttt{Petal.Width}; see \texttt{help(iris)} in a standard R session for more information on this data set). This can be obtained as follows, using \textbf{rm.coef}'s only two arguments:
<<>>=
rm.coef(mat=var(iris[,-5]),indices=c(3,4))
@
The square of this value,
<<>>=
rm.coef(var(iris[,-5]),c(3,4))^2
@
gives the proportion of total variability that would be preserved if the four morphometric variables were orthogonally projected onto the subspace spanned by the petal measurements. It is quite often the case that this value is not very much smaller than the proportion of total variability that is accounted for by the same number of principal components \cite{CadJol:01}. If more than one subset of a given cardinality is desired, the \texttt{indices} argument should be given as a matrix whose rows provide the indices of the variables in each subset. For example, the RM coefficients for the three-variable subsets of the \emph{iris} data, given by variables $\{1,2,3\}$ and $\{1,2,4\}$, are requested by the command:
<<>>=
rm.coef(var(iris[,-5]), indices=matrix(nrow=2,ncol=3,byrow=TRUE,c(1,2,3,1,2,4)))
@
The argument \texttt{indices} can also be a three-dimensional array, if subsets of different cardinalities are desired. In this case, the third dimension of \texttt{indices} is associated with different cardinalities. This option is especially useful when applying the \texttt{rm.coef} function to the output of the search algorithms (see Section \ref{s algoritmos}), if more than one cardinality has been requested. An example for the less frequent case, where the request is built from scratch, is now given. It computes the value of RM for two 1-variable subsets and two 3-variable subsets:
<<>>=
subsets <- array(data=c(3,2,0,0,0,0,1,1,2,2,3,4), dim=c(2,3,2))
colnames(subsets) <- paste("V",1:3,sep="")
rownames(subsets) <- paste("Solution",1:2)
dimnames(subsets)[[3]]<-paste("Size",c(1,3))
subsets
rm.coef(var(iris[,-5]),indices=subsets)
@
The output places the values for each cardinality in a different column. Notice how the missing variable indices for the lower cardinalities are given as zeroes.
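To make the final expression in equation (\ref{eq RM}) more concrete, the following sketch (purely illustrative, and not run as part of this vignette) computes RM for the petal subset directly from the covariance matrix; its result should agree with the value of \texttt{rm.coef(var(iris[,-5]),c(3,4))} shown above:
\begin{verbatim}
> S <- var(iris[,-5])
> K <- c(3,4)
> S2K <- (S %*% S)[K,K]    # principal submatrix of S^2 defined by K
> sqrt(sum(diag(S2K %*% solve(S[K,K]))) / sum(diag(S)))
\end{verbatim}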
\subsubsection{The GCD coefficient}
\label{sss GCD}
\paragraph{Motivation and definition} \textbf{ }\newline
In the standard setting for the \texttt{subselect} package, given a $p$-variable data set and a subset of $g$ of its principal components, the GCD criterion is a measure of similarity between the principal subspace spanned by the $g$ specified principal components and the subspace spanned by a given $k$-variable subset of the original variables \cite{CadJol:01}. The GCD is the matrix correlation between the matrix $\P_k$ of orthogonal projections on the subspace spanned by a given $k$-variable subset and the matrix $\P_g$ of orthogonal projections on the subspace spanned by the $g$ given principal components of the full data set \cite{CadJol:01}:
\begin{equation}
\label{eq GCD}
GCD = \mathrm{corr}(\P_k , \P_g) = \frac{\mathrm{tr}(\P_k\cdot \P_g)}{\sqrt{k\cdot g}} = \frac{1}{\sqrt{k g}} \sum\limits_{i\in{\cal G}}(r_m)_i^2 = \tr([\SsubGp]_{({\cal K})}\S_{\cal K}^{-1})/\sqrt{gk},
\end{equation}
where
\begin{itemize}
\item ${\cal K}$, $\S$, $\S_{\cal K}$ and $(r_m)_i$ are all defined as above (subsection \ref{sss RM});
\item ${\cal G}$ denotes the index set of the $g$ principal components in the PC subset;
\item $\SsubGp$ is the $p\times p$ matrix of rank $g$ that results from retaining only the $g$ terms in the spectral decomposition of $\S$ that are associated with the PC indices in the set ${\cal G}$;
\item $[\SsubGp]_{({\cal K})}$ is the $k\times k$ principal submatrix of $\SsubGp$ that results from retaining only the rows/columns whose indices are in ${\cal K}$.
\end{itemize}
The values of the GCD coefficient are in the interval $[0,1]$, with larger values indicating greater similarity between the subspaces.
\paragraph{The \texttt{gcd.coef} function} \textbf{ }\newline
The \textbf{gcd.coef} function computes the GCD coefficient for a given positive definite matrix (function argument \texttt{mat}, corresponding to matrix $\S$ in the final expression of equation (\ref{eq GCD})), a given vector of $k$ variable indices (function argument \texttt{indices}, corresponding to ${\cal K}$ in (\ref{eq GCD})), and a given vector of $g$ PC indices (function argument \texttt{pcindices}, corresponding to ${\cal G}$ in (\ref{eq GCD})):
\begin{verbatim}
> gcd.coef(mat,indices,pcindices)
\end{verbatim}
If the \texttt{pcindices} argument is not specified, by default it is set to the first $k$ PCs, where $k$ is the size of the variable subset defined by argument \texttt{indices}. The value of the GCD is computed using the final expression in equation (\ref{eq GCD}). Matrix $\S$ (the input argument \texttt{mat}) is usually the covariance or correlation matrix for the full $p$-variable data set, but there may be other contexts where the final expression in (\ref{eq GCD}) makes sense.
\paragraph{Examples} \textbf{ }\newline
The value of the GCD coefficient for the first two covariance-matrix PCs in Fisher's \emph{iris} data and the two petal measurements can be obtained as follows:
<<>>=
gcd.coef(var(iris[,-5]),ind=c(3,4),pcind=c(1,2))
@
The two sepal measurements span a subspace that is much closer to the principal plane:
<<>>=
gcd.coef(var(iris[,-5]),ind=c(1,2))
@
As with the RM coefficient, the values of the GCD coefficient for multiple subsets can be requested by providing the argument \texttt{indices} as a matrix (if the subsets are all of the same cardinality) or a three-dimensional array (for subsets of different cardinalities) (see subsection \ref{sss RM} for examples). In this case, however, the same vector of PC indices (argument \texttt{pcindices}) is used for all the subsets assessed in a single function call.
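The number of principal components in \texttt{pcindices} need not match the cardinality of the variable subsets. For instance, the closeness between the subspace spanned by the first three variables and the plane of the first two principal components could be assessed with a call of the following form (shown only as an illustration of the syntax):
\begin{verbatim}
> gcd.coef(var(iris[,-5]),ind=c(1,2,3),pcind=c(1,2))
\end{verbatim}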
\subsubsection{The RV coefficient}
\label{sss RV}
\paragraph{Motivation and definition} \textbf{ }\newline
As already stated, Escoufier's RV-criterion \cite{RobEsc:76} measures the similarity of the $n$-point configurations defined by the $n$ rows of two (comparable) data matrices, allowing for translations of the origin, rigid rotations and global changes of scale. The criterion is defined, in this context, as the matrix correlation between the matrices $\X\X^t$ and $\P_k\X\X^t\P_k$ (where $\X$ and $\P_k$ are defined as above) \cite{CCM:04}:
\begin{equation}
\label{eq RV}
\mathrm{RV}=\mathrm{corr}(\X\X^t,\P_{k}\X\X^t\P_{k})= \frac{1}{\sqrt{\tr(\S^2)}}\cdot\sqrt{\tr \left(\left(\left[\S^2\right]_{({\cal K})}\left[\S_{\cal K}\right]^{-1}\right)^2\right)},
\end{equation}
($\S$, $\S_{\cal K}$ and $\left[\S^2\right]_{({\cal K})}$ are defined as in subsection \ref{sss RM}). Possible values for the RV coefficient lie in the interval $[0,1]$, with larger values indicating more similarity between the two $n$-point configurations.
\paragraph{The \texttt{rv.coef} function} \textbf{ }\newline
The value of the RV coefficient, for a given positive definite matrix (function argument \texttt{mat}, corresponding to matrix $\S$ in the final expression of equation (\ref{eq RV})) and a given vector of $k$ variable indices (function argument \texttt{indices}, corresponding to ${\cal K}$ in equation (\ref{eq RV})), can be computed using the \textbf{rv.coef} function:
\begin{verbatim}
> rv.coef(mat,indices)
\end{verbatim}
\paragraph{Examples} \textbf{ }\newline
The \texttt{farm} data set (included in this package) has $n=99$ individuals observed on $p=62$ variables. The RV coefficient reveals that the $99$-point configuration of the standardized data in $\mathbb{R}^{62}$ is fairly similar to the configuration that results from the orthogonal projection of those $99$ points on the subspace spanned by variables number 2, 37, 57 and 59:
<<>>=
data(farm)
rv.coef(cor(farm),ind=c(2,37,57,59))
@
For two different variable subsets of size 4, the argument \texttt{indices} is given as a matrix, whose rows indicate the variable indices for each subset. For example, the RV-coefficients of the four-variable subsets $\{2,12,56,59\}$ and $\{2,3,11,59\}$ are requested by the command:
<<>>=
rv.coef(cor(farm), indices=matrix(nrow=2,ncol=4,byrow=TRUE,c(2,12,56,59,2,3,11,59)))
@
\subsection{Criteria for a Multivariate Linear model context}
\label{ss linear model criteria}
Different statistical methods arise within the general framework of a multivariate linear model \cite{Huberty:94}:
$$\X = \A \pmb{\Psi} + \U ,$$
where $\X$ is an $n\times p$ data matrix of original variables, $\A$ is a known ($n\times q$) design matrix, $\pmb{\Psi}$ is a ($q\times p$) matrix of unknown parameters and $\U$ is an ($n\times p$) matrix of error vectors. Particular cases in this setting include, among others, [Multivariate] Analysis of Variance ([M]ANOVA), Linear Discriminant Analysis (LDA) and Canonical Correlation Analysis (CCA). Here we will be particularly concerned with contexts where the selection of subsets of the $\X$ variables is warranted, the classical cases being LDA and CCA. In these statistical methods, variable subsets are often assessed according to their contribution to the violation of an additional reference hypothesis, $H_0 \,:\, \C \, \pmb{\Psi} = 0$, where $\C$ is a known coefficient matrix of rank $r$ \cite{Huberty:94}.
For example, in LDA $\X$ is a matrix of $n$ observations, divided into $q$ groups, on $p$ attributes; $\A$ is a matrix of group indicators; and $\pmb{\Psi}$ is a matrix of group-specific population means. In this case, the rows of $\C$ specify $q-1$ contrasts, stating the equality of population means across groups. Hence, $r = min(p,q-1)$, and any index of the extent of $H_0$ violations can be interpreted as a measure of group separation. In Canonical Correlation Analysis, $\A = \left[ \umn \, \Y \right]$ where $\umn $ is a vector of ones, the columns of $\X$ and $\Y$ are $n$ observations on two different sets of variables, the reference hypothesis, $H_0: \C \ \pmb{\Psi} = \left[0 \, \I_q \right] \left[\pmb{\Psi}_0^t \, \pmb{\Psi}_{\Y}^t \right]^t = 0 $, states that the two sets are uncorrelated, and $r = min\left(rank(\X),rank(\Y)\right)$. Indicators for the violation of $H_0$ are measures of linear association between the $\X$ and $\Y$ variable sets. We note that in this context only the variables associated with the $\X$ group are selected, while the $\Y$ group of variables remains fixed. When $\Y = y$ consists of a single (response) variable this problem becomes the traditional variable selection problem in Linear Regression, usually modelled as $y = \left[1_n \, \X \right] \left[\beta_0 \ \pmb{\beta}_X \right]^t + \pmb{\epsilon} $. While \texttt{subselect} can also be used with reasonable results in multivariate regression models $\Y =\left[1_n \, \X \right] \left[\pmb{\beta_0} \, \pmb{\beta}_X \right]^t + \pmb{\epsilon}$, with $\Y$ having more than one column, the symmetric association indices considered in \texttt{subselect} were not designed to measure the quality of $\Y$ predictions. In multivariate prediction problems other (non-symmetric) measures such as those discussed in Rencher \cite{Rencher:95} and McQuarrie and Tsai \cite{McQuarrieTsai:98} may be more relevant. In univariate regression all these measures are monotone functions of the traditional coefficient of determination, which is also the index considered by \texttt{subselect}. It is well known that, under classical Gaussian assumptions, test statistics for $H_0$ are given by several increasing functions of the $r$ positive eigenvalues of a product matrix $\T^{-1}\Hmat$, with $\T$ and $\Hmat$ the total and effect matrices of Sum of Squares and Cross Product (SSCP) deviations associated with $H_0$, such that
$$ \T = \Hmat + \E,$$
where matrix $\E$ is the matrix of residual SSCP. The $\T$ and $\Hmat$ matrices are given by $\T = \X^t(\I_n - \P_{\omega}) \X$ and $\Hmat = \X^t(\P_{\Omega} - \P_{\omega}) \X$, where $\I_n$ is the $n\times n$ identity matrix, $\P_{\Omega} = \A(\A^t\A)^-\A^t$ and
$$\P_{\omega} = \A(\A^t\A)^-\A^t - \A(\A^t\A)^-\C^t[\C(\A^t\A)^-\C^t]^-\C(\A^t\A)^-\A^t,$$
are projection matrices on the spaces spanned by the columns of $\A$ (space $\Omega$) and by the linear combinations of these columns that satisfy the reference hypothesis (space $\omega$). In these formulas $\M^t$ denotes the transpose of matrix $\M$, and $\M^-$ a generalized inverse. Furthermore, whether or not the classical assumptions hold, the same eigenvalues can be used to define descriptive indices that measure an ``effect'' characterized by the violation of $H_0$.
\subsubsection{Four criteria}
In the \texttt{subselect} package, four effect-size indices, which are monotone functions of the traditional test statistics, are used as comparison criteria.
These indices are called: \begin{description} \item[Ccr12] The $ccr_1^2$ index is a function of the traditional \emph{Roy first root} test statistic \cite{Roy:39}, $\lambda_1$, for the violation of the linear hypothesis of the form $H_0\,:\, \C \pmb{\Psi} = 0$. %Specifically, $$ccr_1^2 = \frac{\lambda_1}{1+\lambda_1}.$$ \emph{Maximizing this criterion is equivalent to maximizing Roy's first root} (see Subsection \ref{sss ccr12} for more details). \item[Tau2] The $\tau^2$ index is a function of the traditional \emph{Wilks' Lambda} statistic \cite{Wilks:32}. %, defined as $\Lambda=\frac{det(\E)}{det(\T)}$: $$\tau^2= 1-\Lambda^{(1/r)}.$$ \emph{Maximizing this criterion is equivalent to minimizing Wilks' Lambda} (see Subsection \ref{sss tau2}). \item[Xi2] The $\xi^2$ coefficient is a function of the \emph{Bartlett-Pillai trace} test statistic \cite{Pillai:55} %, $P=\tr(\Hmat\T^{-1})$: $$\xi^2 = \frac{P}{r}.$$ \emph{Maximizing this criterion is equivalent to maximizing the Bartlett-Pillai statistic} (see Subsection \ref{sss xi2}). \item[Zeta2] The index $\zeta^2$ is a function of the traditional \emph{Lawley-Hotelling trace} test statistic \cite{Lawley:38} \cite{Hotelling:51}. %, $V=\tr(\Hmat\E^{-1})$: $$\zeta^2= \frac{V}{V+r}.$$ \emph{Maximizing this criterion is equivalent to maximizing the Lawley-Hotelling statistic} (see Subsection \ref{sss zeta2}). \end{description} Cramer and Nicewander \cite{CramerNicewander:79} introduced these indices in the context of CCA, and Huberty \cite{Huberty:94} discusses their use in the context of LDA and MANOVA. \textit{In all four cases, the indices are defined so that their values lie between $0$ and $1$, and the larger the index value, the better the variable subset} as a surrogate for the full data set. In the specific case of a multiple linear regression (with a single response variable), all four indices are equal to the standard coefficient of determination, $R^2$. However, in multivariate problems with $r > 1$ the indices differ because they place different emphasis on canonical directions associated with the effect under study. In particular, $ccr_1^2$ only considers the first canonical direction, $\xi^2$ weights all $r$ directions in a balanced way, and $\tau^2$, $\zeta^2$ are intermediate indices that emphasize the main directions (see reference \cite{Pedro:01} for further details). The four package functions that compute the values of these indices are called \texttt{*.coef}, where the asterisk denotes the above criterion name. These functions have \textit{six input arguments}: \begin{description} \item[mat] denotes the Total SSCP matrix $\T$; \item[H] denotes the effects SSCP matrix of relevance for the particular case at hand (see subsection \ref{sss Hmat} for more details); \item[r] denotes the rank of the $\Hmat$ matrix (except in degenerate cases - see the \texttt{help} files for each \texttt{*.coef} function); \item[indices] is the vector, matrix or 3-d array of integers that identify the variables in the subsets of interest; \item[tolval \texttt{and} tolsym] are parameters used to check for ill-conditioning of the matrix \texttt{mat}, and the symmetry of matrices \texttt{mat} and $\Hmat$ (see the \texttt{help} files for more details). \end{description} \subsubsection{Creating the SSCP matrices} \label{sss Hmat} The relevant Sum of Squares and Cross-Products Effect matrix $\Hmat$ is specific to each context in which model-based variable selection arises. 
Computing these matrices can sometimes be time-consuming and the direct application of the formulas described in the previous subsection can lead to large rounding errors. The \texttt{subselect} package provides helper functions which, for standard contexts, create both the SSCP Effects matrix $\Hmat$ and the SSCP Total matrix $\T$, using sound numerical procedures based on singular value decompositions, as well as computing the presumed rank $r$ of $\Hmat$. The output from these functions can be used as input for both the functions that compute the model-based criteria of quality for a given variable subset (discussed in the subsections below) and for the search functions that seek the best subsets of variables for a given data set (see Section \ref{s algoritmos}). For the multivariate linear model context, \texttt{subselect} provides three such helper functions. For all three functions, the output is a list of four objects:
\begin{description}
\item[mat] is the relevant Total SSCP matrix $\T$ divided by $n-1$ (where $n$ is the number of rows in the data matrix);
\item[H] is the relevant Effect SSCP matrix divided by $n-1$;
\item[r] is the rank of matrix $\Hmat$;
\item[call] is the function call which generated the output.
\end{description}
The fact that \texttt{mat} and \texttt{H} are not the standard SSCP matrices, but rather those matrices divided by $n-1$, is of no consequence, since multiplying $\T$ and $\Hmat$ by a common scalar does not affect the criteria values. The three helper functions currently available for the multivariate linear model context are the following:
\paragraph{The \textbf{lmHmat} function for linear regression and CCA} \textbf{ }\newline
The function \textbf{lmHmat} creates the necessary SSCP matrices for a \textbf{\emph{linear regression}} or \emph{\textbf{canonical correlation analysis}} context. In a (possibly multivariate response) linear regression context, it is assumed that the response variables are fixed and a subset of the predictor variables is being assessed. In a canonical correlation analysis context, it is assumed that one of the two groups of variables is fixed, and it is a subset of the second group of variables that is being assessed (variable selection in both groups will currently have to be done via a 2-step approach). This function takes by default the following arguments:
\begin{description}
\item[x] is a matrix containing the full set of predictor variables, in the regression context, or the group of variables in which a variable subset is to be chosen in the CCA context.
\item[y] is a matrix or vector containing the set of fixed response variables, in the regression context, or the set of fixed variables in the CCA context.
\end{description}
There is an S3 \emph{method} for arguments \texttt{x} and \texttt{y} of class \texttt{data.frame}, as well as methods for classes \texttt{formula} and \texttt{lm}, that replace the input arguments by:
\begin{description}
\item[formula] a standard linear model formula $y \sim x1 + x2 + ...$ defining the model relation between response and predictor variables (in the regression context) or fixed and assessed variables in the CCA context.
\item[data] a data frame from which variables specified in formula are preferentially to be taken.
\end{description}
or
\begin{description}
\item[fitdlmmodel] an object of class \texttt{lm}, as produced by R's \texttt{lm} function.
\end{description}
In this context, the output object \texttt{mat} is the covariance matrix $T_x$ of the $x$ variables; object \texttt{H} is the covariance matrix $H_{x\vert y}$ of the projections of the $x$ variables on the space spanned by the $y$ variables; and $r$ is the expected rank of the $\Hmat$ matrix which, under the assumption of linear independence, equals the minimum between the number of variables in the x and y sets (the true rank of $\Hmat$ can be different if the linear independence condition fails). See the \texttt{lmHmat} help file for more information.
\paragraph{Example.} An example of the use of the helper function \texttt{lmHmat} involves the \texttt{iris} data set. The goal is to study the \textbf{(univariate response) multiple linear regression} of variable \texttt{Sepal.Length} (the first variable in the \texttt{iris} data frame) on the three remaining predictors.
<<>>=
lmHmat(x=iris[,2:4], y=iris[,1])
@
\paragraph{The \textbf{ldaHmat} function for linear discriminant analysis} \textbf{ }\newline
This function takes by default the following arguments:
\begin{description}
\item[x] is a matrix containing the discriminating variables, from which a subset is being considered.
\item[grouping] is a factor specifying the class to which each observation belongs.
\end{description}
There are S3 \emph{methods} for
\begin{itemize}
\item an input argument \texttt{x} of class \texttt{data.frame};
\item an input object of class \texttt{formula}.
\end{itemize}
With these methods, the input arguments for \texttt{ldaHmat} can be given as:
\begin{description}
\item[formula] a formula of the form $grouping \sim x1 + x2 + ...$ where the $x$ variables denote the discriminating variables.
\item[data] a data frame from which variables specified in formula are preferentially to be taken.
\end{description}
In this context, the output objects \texttt{mat} and \texttt{H} are the standard total ($\T$) and between-group ($\Hmat$) SSCP matrices of Fisher's linear discriminant analysis; and output object \texttt{r} is the rank of the between-group matrix $\Hmat$, which equals the minimum between the number of discriminators and the number of groups minus one (although the true rank of $\Hmat$ can be different, if the discriminators are linearly dependent).
\paragraph{Examples} A simple example of the use of function \texttt{ldaHmat} again involves the \texttt{iris} data set. We seek to discriminate the three iris species, using the four morphometric variables as discriminators.
<<>>=
ldaHmat(x=iris[,1:4], grouping=iris$Species)
@
%$
In the same example, the function could have been invoked as:
<<>>=
attach(iris)
ldaHmat(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width)
detach(iris)
@
%$
\paragraph{The \textbf{glhHmat} function for a general linear hypothesis context} \textbf{ }\newline
The function \textbf{glhHmat} creates the appropriate SSCP matrices for any problem that can be described as an instance of the \textbf{\emph{general multivariate linear model}} $\X = \A \pmb{\Psi} + \U$, \emph{with a \textbf{reference hypothesis}} $H_0: \C \pmb{\Psi} = \mathbf{0}$, where $\pmb{\Psi}$ is a matrix of unknown parameters and $\C$ is a known rank-$r$ coefficient matrix. By default, this function takes the following arguments:
\begin{description}
\item[x] is a matrix containing the response variables.
\item[A] is a design matrix specifying a linear model in which $\X$ is the response.
\item[C] is a matrix or vector containing the coefficients of the reference hypothesis.
\end{description}
There is an S3 \emph{method} for input of class \texttt{data.frame} in which x and \A~ are defined as data frames. A further method accepts input of class \texttt{formula}, with input arguments:
\begin{description}
\item[formula] a formula of the form $X \sim A1 + A2 + ...+ Ap$ where the terms of the right hand side specify the relevant columns of the design matrix.
\item[C] a matrix or vector containing the coefficients of the reference hypothesis.
\item[data] a data frame from which variables specified in formula are preferentially to be taken.
\end{description}
In this context, the $\T$ and $\Hmat$ matrices have the generic form given at the beginning of this subsection and \texttt{r} is the rank of $\Hmat$, which equals the rank of $\C$ (the true rank of $\Hmat$ can be different from \texttt{r} if the $\X$ variables are linearly dependent).
\paragraph{Example.} The following example creates the Total and Effects SSCP matrices, $\T$ and $\Hmat$, for an analysis of the data set \texttt{crabs} in the MASS package. This data set records physical measurements on 200 specimens of \textit{Leptograpsus variegatus} crabs observed on the shores of Western Australia. The crabs are classified by two factors, both with two levels each: \texttt{sex} and \texttt{sp} (crab species, as defined by its colour: blue or orange). The measurement variables include the carapace length (\texttt{CL}), the carapace width (\texttt{CW}), the size of the frontal lobe (\texttt{FL}) and the rear width (\texttt{RW}). We assume that there is an interest in comparing the subsets of these variables measured in their original and logarithmic scales. In particular, we assume that we wish to create the $\T$ and $\Hmat$ matrices associated with an analysis of the effect of the \texttt{sp} factor after controlling for sex. Only the \texttt{formula}, \texttt{C} and \texttt{data} arguments are explicitly given in this function call.
<<>>=
library(MASS)
data(crabs)
lFL <- log(crabs$FL) ; lRW <- log(crabs$RW); lCL <- log(crabs$CL); lCW <- log(crabs$CW)
C <- matrix(0.,nrow=2,ncol=4)
C[1,3] = C[2,4] = 1.
C
Hmat5 <- glhHmat(cbind(FL,RW,CL,CW,lFL,lRW,lCL,lCW) ~ sp*sex,C=C,data=crabs)
Hmat5
@
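The objects returned by these helper functions can be passed directly to the criterion functions discussed in the following subsections (and to the search functions of Section \ref{s algoritmos}). As a purely illustrative example of the syntax, the subset formed by variables 1 and 5 (\texttt{FL} and \texttt{lFL}) could be assessed, under the $ccr_1^2$ criterion defined next, with a call of the form:
\begin{verbatim}
> ccr12.coef(Hmat5$mat, H=Hmat5$H, r=Hmat5$r, indices=c(1,5))
\end{verbatim}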
\subsubsection{The $ccr_1^2$ (ccr12) coefficient and Roy's first root statistic}
\label{sss ccr12}
\paragraph{Motivation and definition} \textbf{ }\newline
The $ccr_1^2$ coefficient is an increasing function of Roy's first root test statistic for the reference hypothesis in the standard multivariate linear model. Roy's first root is the largest eigenvalue of $\Hmat\E^{-1}$, where $\Hmat$ is the Effect matrix and $\E$ is the Error (residual) matrix. The index $ccr_1^2$ is related to Roy's first root $\lambda_1$ by:
$$ccr_1^2 =\frac{\lambda_1}{1+\lambda_1}.$$
\paragraph{The \texttt{ccr12.coef} function} \textbf{ }\newline
The \texttt{subselect} package provides the function \textbf{ccr12.coef} which computes the $ccr_1^2$ coefficient, given the variance or total SSCP matrix for the full data set, \texttt{mat}, the effects SSCP matrix \texttt{H}, the expected rank of the $\Hmat$ matrix, \texttt{r}, and a vector \texttt{indices} with the indices of the variable subset that is being considered. These arguments may be defined with the helper functions described in Subsection \ref{sss Hmat}. A standard function call looks like:
\begin{verbatim}
> ccr12.coef(mat, H, r, indices)
\end{verbatim}
For further arguments and options, see the function \texttt{help} page.
\paragraph{Example} \textbf{ }\newline
The following example illustrates the use of function \texttt{ccr12.coef} in the context of a (univariate response) Multiple Linear Regression, using the \texttt{Cars93} data set from the MASS library. Variable 5 (average price) is regressed on 13 other variables. The goal is to compare subsets of these 13 variables according to their ability to predict car prices. The helper function \texttt{lmHmat} creates the relevant input for computing the value of the $ccr_1^2$ criterion for the subset of the fourth, fifth, tenth and eleventh predictors.
<<>>=
library(MASS)
data(Cars93)
CarsHmat <- lmHmat(x=Cars93[c(7:8,12:15,17:22,25)],y=Cars93[5])
ccr12.coef(mat=CarsHmat$mat, H=CarsHmat$H, r=CarsHmat$r, indices=c(4,5,10,11))
@
%$
\subsubsection{The $\tau^2$ (tau2) coefficient and Wilks' Lambda}
\label{sss tau2}
\paragraph{Motivation and definition} \textbf{ }\newline
The Tau squared index $\tau^2$ is a decreasing function of the standard Wilks' Lambda statistic for the multivariate linear model and its reference hypothesis. Wilks' Lambda statistic ($\Lambda$) is given by:
$$\Lambda=\frac{det(\E)}{det(\T)},$$
where $\E$ is the Error (residual) SSCP matrix and $\T$ is the Total SSCP matrix. The index $\tau^2$ is related to Wilks' Lambda by:
$$\tau^2 = 1 - \Lambda^{1/r},$$
where $r$ is the rank of the Effect SSCP matrix $\Hmat$.
\paragraph{The \texttt{tau2.coef} function} \textbf{ }\newline
Function \textbf{tau2.coef} is similar to its counterpart \texttt{ccr12.coef} described in the previous Subsection. A standard function call looks like:
\begin{verbatim}
> tau2.coef(mat, H, r, indices)
\end{verbatim}
\paragraph{Example} \textbf{ }\newline
A very simple example of the use of the $\tau^2$ criterion involves the Linear Discriminant Analysis of the \emph{iris} data set, using all four morphometric variables to discriminate the three species. The subset consisting of variables 1 and 3 is then considered as a surrogate for all four variables.
<<>>=
irisHmat <- ldaHmat(iris[1:4],iris$Species)
tau2.coef(irisHmat$mat,H=irisHmat$H,r=irisHmat$r,c(1,3))
@
\subsubsection{The $\xi^2$ (xi2) coefficient and the Bartlett-Pillai statistic}
\label{sss xi2}
\paragraph{Motivation and definition} \textbf{ }\newline
The Xi squared index is an increasing function of the traditional Bartlett-Pillai trace test statistic. The Bartlett-Pillai trace $P$ is given by: $P=\tr(\Hmat\T^{-1})$ where $\Hmat$ is the Effects SSCP matrix and $\T$ is the Total SSCP matrix.
The Xi squared index $\xi^2$ is related to the Bartlett-Pillai trace by:
$$\xi^2 =\frac{P}{r},$$
where $r$ is the rank of $\Hmat$.
\paragraph{The \texttt{xi2.coef} function} \textbf{ }\newline
Function \textbf{xi2.coef} is similar to the previous criterion functions described in this Section. A standard function call looks like:
\begin{verbatim}
> xi2.coef(mat, H, r, indices)
\end{verbatim}
\paragraph{Example} \textbf{ }\newline
The same example considered in Subsection \ref{sss tau2}, only this time using the $\xi^2$ index instead of $\tau^2$.
<<>>=
irisHmat <- ldaHmat(iris[1:4],iris$Species)
xi2.coef(irisHmat$mat,H=irisHmat$H,r=irisHmat$r,c(1,3))
@
\subsubsection{The $\zeta^2$ (zeta2) coefficient and the Lawley-Hotelling statistic}
\label{sss zeta2}
\paragraph{Motivation and definition} \textbf{ }\newline
The Zeta squared index $\zeta^2$ is an increasing function of the traditional Lawley-Hotelling trace test statistic. The Lawley-Hotelling trace is given by $V=\tr(\Hmat\E^{-1})$ where $\Hmat$ is the Effect SSCP matrix and $\E$ is the Error SSCP matrix. The index $\zeta^2$ is related to the Lawley-Hotelling trace by:
$$\zeta^2 =\frac{V}{V+r} ,$$
where $r$ is the rank of $\Hmat$.
\paragraph{The \texttt{zeta2.coef} function} \textbf{ }\newline
Again, function \textbf{zeta2.coef} has arguments similar to those of the other related functions in this Section. A standard function call looks like:
\begin{verbatim}
> zeta2.coef(mat, H, r, indices)
\end{verbatim}
\paragraph{Example} \textbf{ }\newline
Again, the same example as in the previous two subsections:
<<>>=
irisHmat <- ldaHmat(iris[1:4],iris$Species)
zeta2.coef(irisHmat$mat,H=irisHmat$H,r=irisHmat$r,c(1,3))
@
\subsection{Criterion for generalized linear models}
\label{ss glm}
\paragraph{Motivation and definition} \textbf{ }\newline
Variable selection in the context of generalized linear models is typically based on the minimization of statistics that test the significance of the excluded variables. In particular, the likelihood ratio, Wald and Rao statistics, or some monotone function of those statistics, are often proposed as comparison criteria for variable subsets of the same dimensionality. All these statistics are asymptotically equivalent and, given suitable assumptions, can be converted into information criteria, such as the AIC, which also compare subsets of different dimensionalities (see references \cite{Lawless:78} and \cite{Lawless:87} for further details). Among these criteria, Wald's statistic has some computational advantages because it can always be derived from the same maximum likelihood and Fisher information estimates (concerning the full model). In particular, let $W_{all}$ be the value of the Wald statistic testing the significance of the full covariate vector, let $\bvec$ and $\F$ be the coefficient and Fisher information estimates, and let $\Hmat$ be an auxiliary rank-one matrix given by $\Hmat = \F \bvec \bvec^t \F$. It follows that the value of Wald's statistic for the excluded variables ($W_{exc}$) in a given subset is given by $W_{exc} = W_{all} - \tr (\F_{indices}^{-1} \Hmat_{indices})$, where $\F_{indices}$ and $\Hmat_{indices}$ are the portions of the $\F$ and $\Hmat$ matrices associated with the selected variables. The \texttt{subselect} package provides a function \texttt{wald.coef} that computes the value of Wald's statistic, testing the significance of the excluded variables, in the context of variable subset selection in generalized linear models (see Subsection \ref{sss wald} for more details).
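As a minimal, purely schematic illustration of this last expression (\texttt{Fmat}, \texttt{H} and \texttt{K} stand for $\F$, $\Hmat$ and the vector of indices of the selected variables, and are assumed to have been created beforehand), $W_{exc}$ could be computed in R as:
\begin{verbatim}
> W_all <- sum(diag(solve(Fmat) %*% H))    # tr(F^-1 H) = b^t F b
> W_exc <- W_all - sum(diag(solve(Fmat[K,K]) %*% H[K,K]))
\end{verbatim}
The \texttt{wald.coef} function described in Subsection \ref{sss wald} performs, in essence, this computation, adding checks for ill-conditioning of the matrices involved.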
\subsubsection{A helper function for the GLM context}
\label{sss helper GLM}
As with the multivariate linear model context, creating the Fisher Information matrix and the auxiliary matrix $\Hmat$ described above may be time-consuming. The \texttt{subselect} package provides the helper function \textbf{glmHmat} which accepts a \texttt{glm} object (argument \texttt{fitdglmmodel}) and retrieves from it an estimate of the Fisher Information matrix ($\F$), together with an auxiliary rank-one positive semi-definite matrix ($\Hmat$), such that the positive eigenvalue of $\F^{-1}\Hmat$ equals the value of Wald's statistic for testing the global significance of the fitted model. These matrices may be used as input to the function that computes the Wald criterion, as well as to the variable selection search routines described in Section \ref{s algoritmos}. As an example of the use of this helper function, in the context of binary response logistic regression models, consider the last 100 observations of the \emph{iris} data set (retaining only observations for the \emph{versicolor} and \emph{virginica} species). Assume that the goal is to judge subsets of the four morphometric variables (petal and sepal lengths and widths), in models with the binary response variable given by an indicator variable distinguishing the two remaining species. The helper function \texttt{glmHmat} will produce the Fisher Information matrix (output object \texttt{mat}), the auxiliary matrix $\Hmat$ (output object \texttt{H}), the function \texttt{call} and the rank of $\Hmat$ (output object \texttt{r}, equal to $1$ in this context), as follows:
<<>>=
iris2sp <- iris[iris$Species != "setosa",]
modelfit <- glm(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris2sp, family=binomial)
Hmat <- glmHmat(modelfit)
Hmat
@
%$
\subsubsection{Function \texttt{wald.coef} and the Wald coefficient}
\label{sss wald}
The function \textbf{wald.coef} computes the value ($W_{exc}$) of the Wald statistic, as described above. The function takes as input arguments:
\begin{description}
\item[mat] An estimate of Fisher's information matrix $\F$ for the full model variable-coefficient estimates;
\item[H] A matrix product of the form $\Hmat = \F \bvec \bvec^t \F$ where $\bvec$ is a vector of variable-coefficient estimates;
\item[indices] a numerical vector, matrix or 3-d array of integers giving the indices of the variables in the subset. If a matrix is specified, each row is taken to represent a different \emph{k}-variable subset. If a 3-d array is given, it is assumed that the third dimension corresponds to different cardinalities.
\item[tolval \texttt{and} tolsym] are parameters used in checks for ill-conditioning and positive-definiteness of the Fisher Information and the auxiliary ($\Hmat$) matrices (see the \texttt{wald.coef} help file for further details).
\end{description}
The values of arguments \texttt{mat} and \texttt{H} can be created using the \texttt{glmHmat} helper function, described in subsection \ref{sss helper GLM}.
\paragraph{Example} \textbf{ }\newline
An example of variable selection in the context of binary response regression models can be given using the same \texttt{crabs} data set from the MASS package that was already discussed in subsection \ref{sss Hmat}. The logarithms and original physical measurements of the \textit{Leptograpsus variegatus} crabs considered in the MASS crabs data set are used to fit a logistic model where each crab's sex forms the response variable.
The quality of the variable subset made up by the first, sixth and seventh predictor variables is measured via the Wald coefficient.
<<>>=
library(MASS)
lFL <- log(crabs$FL)
lRW <- log(crabs$RW)
lCL <- log(crabs$CL)
lCW <- log(crabs$CW)
logrfit <- glm(sex ~ FL + RW + CL + CW + lFL + lRW + lCL + lCW,data=crabs,family=binomial)
lHmat <- glmHmat(logrfit)
wald.coef(lHmat$mat,lHmat$H,indices=c(1,6,7),tolsym=1E-06)
@
It should be stressed that, contrary to the criteria considered in the previous problems, $W_{exc}$ is not bounded above by 1 and is a \textit{decreasing} function of subset quality.
\section{Search algorithms}
\label{s algoritmos}
Given any data set for which the variable selection problem is relevant and a criterion that measures how well any given variable subset approximates the full data set, the problem of \emph{finding the best $k$-variable subsets for that criterion} arises naturally. Such problems are computationally intensive for the criteria considered in this package. A complete search among all $k$-variable subsets is a task which quickly becomes unfeasible even for moderately-sized data sets, unless $k$ is very small, or very large, when compared with $p$. It may happen that a criterion has special properties which render enumeration methods possible for some moderate-sized data sets. Furnival and Wilson's \cite{Furnival:74} Leaps and Bounds algorithm did precisely this in the context of subset selection in Linear Regression, and package co-author Duarte Silva (see references \cite{Pedro:01} and \cite{Pedro:02}) has discussed the application of this algorithm to various methods of Multivariate Statistics. The package function \textbf{eleaps} implements these algorithms for the criteria discussed above, using C++ code (see Section \ref{ss eleaps}). This function is very fast for small or moderately sized datasets (with roughly $p<30$ variables). For larger data sets (roughly $p>35$ variables) it becomes computationally unfeasible and alternative search heuristics become necessary. Traditional heuristics for the variable selection problem in the context of linear regression, such as the backward elimination, forward selection or stepwise selection algorithms, belong to a general class of algorithms called \emph{greedy}. Their limitations are well-known. The search algorithms considered in this package do not fall in this category. They are of three types: a simulated annealing algorithm (implemented by the \textbf{anneal} function and discussed in section \ref{ss anneal}); a genetic algorithm (function \textbf{genetic}, discussed in section \ref{ss genetic}); and a modified local search algorithm (function \textbf{improve}, discussed in section \ref{ss improve}). These algorithms were discussed in \cite{CCM:04}. The three corresponding package functions use Fortran code for computational efficiency.
\subsection{Common input arguments}
\label{ss input}
The four search functions described above share several input arguments. These are:
\begin{description}
\item[mat] a covariance/correlation, information or sums of squares and products matrix of the variables from which the k-subset is to be selected (equivalent to the homonymous argument in the criteria functions discussed in Section \ref{s criteria}).
\item[kmin] the cardinality of the smallest subset that is sought.
\item[kmax] the cardinality of the largest subset that is sought.
\item[exclude] a vector of variables (referenced by their row/column numbers in matrix \texttt{mat}) that are to be forcibly excluded from the subsets considered.
\item[include] a vector of variables (referenced by their row/column numbers in matrix \texttt{mat}) that are to be forcibly included in the subsets considered.
\item[criterion] A character variable, indicating which criterion is to be used in judging the quality of the subsets. These are discussed in Section \ref{s criteria}. The default criterion is \texttt{rm} for exploratory and PCA analyses, \texttt{tau2} in the multivariate linear model context, and \texttt{Wald} for generalized linear models.
\item[pcindices] (for the GCD criterion only) is a vector of ranks of Principal Components that are to be used for comparison with the k-variable subsets. The default value is the character string \texttt{first\_k}, which associates PCs 1 to \emph{k} to each cardinality \emph{k} requested by the user.
\item[H] Effect description matrix. Not used with the RM, RV or GCD criteria, hence the NULL default value.
\item[r] Rank of the effects matrix, $\Hmat$. Not used with the RM, RV or GCD criteria.
\item[tolval] a parameter to fine-tune the approach to ill-conditioned \texttt{mat} arguments (see the search function help files for more details).
\item[tolsym] a parameter to fine-tune the approach to non-symmetric \texttt{mat} and $\Hmat$ arguments (see the search function help files for more details).
\end{description}
Each search function has additional specific input arguments that are discussed in the subsequent sections and in the appropriate help files.
\subsection{Common output objects}
\label{ss output}
The four search functions have a \textit{common output structure}: a \texttt{list} with five components:
\begin{description}
\item[subsets] An $m\times kmax \times length(kmin:kmax)$ 3-dimensional array, where $m$ is the number of different solutions that are to be produced for each cardinality (see the subsequent subsections for details). For each cardinality (dimension 3) and each solution (dimension 1) the list of variables (referenced by their row/column numbers in matrix \texttt{mat}) in the subset is given (dimension 2). For cardinalities smaller than \texttt{kmax}, the extra final positions are set to zero. These output objects can be directly passed to the \texttt{*.coef} functions discussed in Section \ref{s criteria}, in order to compute the performance of these subsets with criteria other than the one used to select them;
\item[values] An $m\times length(kmin:kmax)$ matrix, giving for each cardinality (columns), the criterion values of the best $m$ (rows) subsets selected according to the chosen criterion;
\item[bestvalues] A $length(kmin:kmax)$ vector giving the overall best values of the criterion for each cardinality;
\item[bestsets] A $length(kmin:kmax) \times kmax$ matrix, giving, for each cardinality (rows), the variables (referenced by their row/column numbers in matrix \texttt{mat}) in the best $k$-variable subset (can also be fed to the \texttt{*.coef} functions).
\item[call] The function call which generated the output.
\end{description}
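For instance, if \texttt{out} denotes the object returned by any of the search functions, its components can be inspected in the usual way (purely illustrative):
\begin{verbatim}
> out$bestvalues   # best criterion value found, for each cardinality
> out$bestsets     # the corresponding variable subsets
> out$values       # criterion values of all requested solutions
> out$subsets      # the corresponding subsets, as a 3-d array
\end{verbatim}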
\subsection{The \texttt{eleaps} function: an efficient complete search}
\label{ss eleaps}

For each cardinality $k$ (with $k$ ranging from \texttt{kmin} to \texttt{kmax}), \texttt{eleaps} (``Extended Leaps and Bounds'') performs a branch and bound search for the \texttt{nsol} best $k$-variable subsets (\texttt{nsol} being a user-specified function argument), according to a specified criterion. The function \texttt{eleaps} implements Duarte Silva's adaptation for various multivariate analysis contexts (references \cite{Pedro:01} and \cite{Pedro:02}) of Furnival and Wilson's Leaps and Bounds Algorithm (reference \cite{Furnival:74}) for variable selection in Regression Analysis. If the search is not completed within a user-defined time limit (input argument \texttt{timelimit}), \texttt{eleaps} exits with a warning message, returning the best (but not necessarily optimal) solutions found in the partial search performed.

In order to improve computation times, the bulk of computations is carried out by C++ routines. Further details about the algorithm can be found in references \cite{Pedro:01} and \cite{Pedro:02} and in the comments to the C++ code (in the package's \texttt{src} directory). A discussion of the criteria considered can be found in Section \ref{s criteria} above. The function checks for ill-conditioning of the input matrix (see the \texttt{eleaps} help file for details).

\paragraph{Examples} \textbf{ }\newline
For illustration of use, we now consider five examples.
\begin{description}
\item[Example 1] deals with a small data set provided in the standard distributions of R. The \texttt{swiss} data set is a 6-variable data set with Swiss fertility and socioeconomic indicators (1888). Subsets of variables of all cardinalities are sought, since the \texttt{eleaps} function sets, by default, \texttt{kmin} to 1 and \texttt{kmax} to one less than the number of columns in the input argument \texttt{mat}. The function call requests the best three subsets of each cardinality, using the RM criterion (subsection \ref{sss RM}).
<<>>=
data(swiss)
eleaps(cor(swiss),nsol=3, criterion="RM")
@
In this example, it is not necessary to explicitly specify the RM criterion, since it is the default criterion for any function call which does not explicitly set the \texttt{r} input argument (except in a GLM context -- see Section \ref{ss glm}).
\item[Example 2] illustrates the use of the \texttt{include} and \texttt{exclude} arguments that are common to all the search functions provided by the package \texttt{subselect}. Here, we request only 2- and 3-dimensional subsets that exclude variable number 6 and include variable number 1. For each cardinality, three solutions are requested (argument \texttt{nsol}). The criterion requested is the GCD (see subsection \ref{sss GCD}) and the subspace used to gauge our solutions is the principal subspace spanned by the first three principal components of the full data set (argument \texttt{pcindices}). Our solutions will be the 2- and 3-variable subsets that span subspaces that are closest to this 3-d principal subspace.
<<>>=
data(swiss)
swiss.gcd <- eleaps(cor(swiss),kmin=2,kmax=3,exclude=6,include=1,nsol=3,criterion="gcd",pcindices=1:3)
swiss.gcd
@
\emph{The output of this function call can be used as input for a function call requesting the values of the chosen solutions under a different criterion}.
For example, the values of the RM coefficient for the above solutions are given by:
<<>>=
rm.coef(mat=cor(swiss), indices=swiss.gcd$subsets)
@
\item[Example 3] is a \textbf{Linear Discriminant Analysis} example with a very small data set, which has already been discussed in subsection \ref{sss tau2}. We consider the Iris data and three groups, defined by species (setosa, versicolor and virginica). The goal is to select the 2- and 3-variable subsets that are optimal for the linear discrimination (as measured by the \texttt{ccr12} criterion, \emph{i.e.}, by Roy's first root statistic).
<<>>=
irisHmat <- ldaHmat(iris[1:4],iris$Species)
eleaps(irisHmat$mat,kmin=2,kmax=3,H=irisHmat$H,r=irisHmat$r,crit="ccr12")
@
\item[Example 4] \label{ex Cars eleaps} involves the MASS package \texttt{Cars93} data set that was already discussed in subsection \ref{sss ccr12}. The context here is a \textbf{Canonical Correlation Analysis}. Two groups of variables within the Cars93 data set are compared: the $X$ group (variables 7, 8, 12 to 15, 17 to 22 and 25) and the $Y$ group (variables 4 and 6). The latter variables are, respectively, the car prices for a basic (Min.Price) and a premium (Max.Price) version of each car model. The goal is to select 4- to 6-variable subsets of the 13-variable $\X$ group that are optimal in terms of preserving the canonical correlations, according to the \texttt{zeta2} criterion (Warning: the $\Y$ group with the two price variables is kept intact; subset selection is carried out in the $\X$ group only). The \texttt{tolsym} parameter is used to relax the symmetry requirements on the effect matrix $\Hmat$ which, for numerical reasons, is slightly asymmetric. Since corresponding off-diagonal entries of matrix $\Hmat$ are different, but by less than \texttt{tolsym}, $\Hmat$ is replaced by its symmetric part: $(\Hmat+\Hmat^t)/2$.
<<>>=
library(MASS)
data(Cars93)
Cars93.xgroup <- Cars93[,c(7:8,12:15,17:22,25)]
CarsHmat <- lmHmat(Cars93.xgroup,Cars93[,c(4,6)])
colnames(Cars93[,c(4,6)])
colnames(Cars93.xgroup)
#colnames(CarsHmat$mat)
Cars.eleaps <- eleaps(CarsHmat$mat, kmin=4, kmax=6, H=CarsHmat$H, r=CarsHmat$r, crit="zeta2", tolsym=1e-9)
Cars.eleaps$bestvalues
Cars.eleaps$bestsets
@
%$
\item[Example 5] is a final example involving the use of the \texttt{eleaps} function for variable selection in the context of a \textbf{generalized linear model}, more precisely, of a logistic regression model. We consider the last 100 observations of the \emph{iris} data set (the versicolor and virginica species) and seek the best variable subsets for the logistic regression model that has the indicator variable distinguishing these two species as its binary response.
<<>>=
iris2sp <- iris[iris$Species != "setosa",]
logrfit <- glm(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,iris2sp,family=binomial)
Hmat <- glmHmat(logrfit)
eleaps(Hmat$mat, H=Hmat$H, r=Hmat$r, criterion="Wald", nsol=3)
@
It should be stressed that, unlike other criteria in the \texttt{subselect} package, the Wald criterion is not bounded above by 1 and is a \emph{decreasing} function of subset quality, so that the 3-variable subsets do, in fact, perform better than their smaller-sized counterparts.
The algorithm is described in detail in \cite{CCM:04}. In brief, for each of the solutions requested by the user (via the \texttt{nsol} function argument), an initial $k$-variable subset (for $k$ ranging from \texttt{kmin} to \texttt{kmax}) of a full set of $p$ variables is randomly selected and passed on to a Simulated Annealing algorithm. The algorithm then selects a random subset in the neighbourhood of the current subset (the neighbourhood of a subset $S$ being defined as the family of all $k$-variable subsets which differ from $S$ by a single variable), and decides whether to replace the current subset according to the Simulated Annealing rule, i.e., either (i) always, if the alternative subset's criterion value is larger; or (ii) with probability $e^{\frac{ac-cc}{t}}$ if the alternative subset's criterion value ($ac$) is smaller than that of the current solution ($cc$), where $t$ (the temperature parameter) decreases throughout the iterations of the algorithm. For each cardinality $k$, the stopping criterion for the algorithm is the number of iterations, which is controlled by the user (function argument \texttt{niter}). Also controlled by the user are the initial temperature (argument \texttt{temp}), the rate of geometric cooling of the temperature (argument \texttt{cooling}) and the frequency with which the temperature is cooled, as measured by argument \texttt{coolfreq}, which gives the number of iterations after which the temperature is multiplied by $1-$\texttt{cooling}.

Optionally, the best $k$-variable subset produced by the simulated annealing algorithm may be used as input in a restricted local search algorithm, for possible further improvement (this option is controlled by the logical argument \texttt{improvement} which, by default, is TRUE). The user may force variables to be included and/or excluded from the $k$-variable subsets (arguments \texttt{include} and \texttt{exclude}), and may specify initial solutions (argument \texttt{initialsol}).

\paragraph{Computational effort} \textbf{ } \newline
For each cardinality $k$, the total number of calls to the procedure which computes the criterion values is $nsol \times niter + 1$. These calls are the dominant computational effort in each iteration of the algorithm. In order to improve computation times, the bulk of computations is carried out by a Fortran routine. Further details about the Simulated Annealing algorithm can be found in reference \cite{CCM:04} and in the comments to the Fortran code (in the \texttt{src} subdirectory for this package).

\paragraph{The \texttt{force} argument} \textbf{ } \newline
For datasets with a very large number of variables (currently $p>400$), it is necessary to set the \texttt{force} argument to TRUE for the function to run, but this may cause a session crash if there is not enough memory available. The function checks for ill-conditioning of the input matrix (see details in the \texttt{anneal} help file).

\paragraph{Reproducible solutions} \textbf{ } \newline
The \texttt{anneal} algorithm is a random algorithm, so that solutions are, in general, different each time the algorithm is run. For reproducible solutions, the logical argument \texttt{setseed} should be set to TRUE during a session and left as TRUE in subsequent function calls whose results are to be reproduced.

\paragraph{Uniqueness of solutions} \textbf{ } \newline
The requested \texttt{nsol} solutions are \emph{not necessarily different solutions}.
The \texttt{nsol} solutions are computed separately, so that (unlike what happens with the \texttt{eleaps} function), the same variable subset may appear more than once among the \texttt{nsol} solutions. \paragraph{Examples} \textbf{ } \newline Four \textbf{examples} of usage of the \texttt{anneal} function are now given. \begin{description} \item[Example 1] is a very simple example, with a small data set with very few iterations of the algorithm, using the RM criterion (although for a data set of this size it is best to use the \texttt{eleaps} complete search, described in section \ref{ss eleaps}). <<>>= data(swiss) anneal(cor(swiss),kmin=2,kmax=3,nsol=4,niter=10,criterion="RM") @ \item[Example 2] uses the 62-variable farm data set, included in this package (see \texttt{help(farm)} for details). <<>>= data(farm) anneal(cor(farm), kmin=6, nsol=5, criterion="rv") @ Since the \texttt{kmax} argument was not specified, the \texttt{anneal} function by default assigns it the same value as \texttt{kmin}. Notice that there may be repeated subsets among the 5 solutions produced by the function call. \item[Example 3] \label{ex CysticFibrosis anneal} involves subset selection in the context of a \textbf{(univariate) Multiple Linear Regression}. The data set \texttt{cystfibr}, included in the ISwR package, contains lung function data for cystic fibrosis patients (7-23 years old). The data consists of 25 observations on 10 variables. The objective is to predict the variable \texttt{pemax} (maximum expiratory pressure) from relevant patient characteristics. A best subset of linear predictors is sought, using the \texttt{tau2} criterion which, in the case of a univariate linear regression, is just the standard Coefficient of Determination, $R^2$. <<>>= library(ISwR) cystfibrHmat <- lmHmat(pemax ~ age+sex+height+weight+bmp+fev1+rv+frc+tlc, data=cystfibr) colnames(cystfibrHmat$mat) cystfibr.tau2 <- anneal(cystfibrHmat$mat, kmin=4, kmax=6, H=cystfibrHmat$H, r=cystfibrHmat$r, crit="tau2") cystfibr.tau2$bestvalues cystfibr.tau2$bestsets @ The algorithm underlying the \texttt{anneal} function is a random algorithm, whose solutions are not necessarily reproducible. It may happen that the solutions for different cardinalities are not nested. This illustrates that the algorithm can produce different results from the standard greedy algorithms, such as forward selection or backward elimination. That the value of the $\tau^2$ coefficient in the context of a univariate linear regression is the coefficient of determination can be confirmed via R's standard \texttt{lm} function: <<>>= summary(lm(pemax ~ weight+bmp+fev1+rv, data=cystfibr))$r.squared @ The other three multivariate linear hypothesis criteria discussed in Section \ref{ss linear model criteria} also give the value of $R^2$ in a univariate multiple regression, as is illustrated by the following command, which requests the value of the $\xi^2$ criterion for the above solutions. <<>>= xi2.coef(mat=cystfibrHmat$mat, indices=cystfibr.tau2$bestsets, H=cystfibrHmat$H, r=cystfibrHmat$r) @ %$ \item[Example 4] considers variable selection in the context of a \textbf{logistic regression} model. We consider the last 100 observations of the \emph{iris} data set (that is, the observations for the versicolor and virginica species) and seek the best 1- to 3-variable subsets for the logistic regression that uses the four morphometric variables to model the probability of each of these two species. 
<<>>=
data(iris)
iris2sp <- iris[iris$Species != "setosa",]
logrfit <- glm(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,iris2sp,family=binomial)
Hmat <- glmHmat(logrfit)
iris2p.Wald <- anneal(Hmat$mat,1,3,H=Hmat$H,r=1,nsol=5,criterion="Wald")
iris2p.Wald$bestsets
iris2p.Wald$bestvalues
@
\end{description}
%$

\subsection{The \texttt{genetic} function: a genetic algorithm}
\label{ss genetic}

Given a full data set, the \textbf{genetic} function uses a Genetic Algorithm to seek a $k$-variable subset which is optimal, as a surrogate for the full set, with respect to a given criterion. The algorithm is described in detail in \cite{CCM:04}. In brief, for each cardinality $k$ (with $k$ ranging from the function arguments \texttt{kmin} to \texttt{kmax}), an initial population of $k$-variable subsets is randomly selected from a full set of $p$ variables. The size of this initial population is specified by the function argument \texttt{popsize} (\texttt{popsize}=100 by default). In each iteration, \texttt{popsize}/2 couples are formed from among the population and each couple generates a child (a new $k$-variable subset) which inherits properties of its parents (specifically, it inherits all variables common to both parents and a random selection of variables in the symmetric difference of its parents' genetic makeup). Each offspring may optionally undergo a mutation in the form of a local improvement algorithm (see subsection \ref{ss improve}), with a user-specified probability. Whether or not mutations occur is controlled by the logical variable \texttt{mutate} (which is FALSE by default), and the respective probability is given by the argument \texttt{mutprob}. The parents and offspring are ranked according to their criterion value, and the best \texttt{popsize} of these $k$-subsets make up the next generation, which is used as the current population in the subsequent iteration. The stopping rule for the algorithm is the number of generations, which is specified by the function argument \texttt{nger}. Optionally, the best $k$-variable subset produced by the Genetic Algorithm may be passed as input to a restricted local improvement algorithm, for possible further improvement (see subsection \ref{ss improve}). The user may force variables to be included and/or excluded from the $k$-subsets (function arguments \texttt{include} and \texttt{exclude}), and may specify an initial population (function argument \texttt{initialpop}). The function checks for ill-conditioning of the input matrix (see the \texttt{genetic} help file for details).

\paragraph{Genetic diversity} \textbf{ } \newline
For this algorithm to run, it needs genetic diversity in the population (i.e., a large number of different variables in the variable subsets that are being considered). This means that the function will not run on data sets with a small number of variables, $p$. In an attempt to ensure this genetic diversity, the function has an optional argument \texttt{maxclone}, an integer specifying the maximum number of identical replicates (clones) of individuals (variable subsets) that is acceptable in the population. However, even \texttt{maxclone}=0 does not guarantee that there are no repetitions: only the offspring of couples are tested for clones. Any such clones are rejected and replaced by $k$-variable subsets chosen at random, which are not subjected to further clone tests.
%See Example \ref{exemp 1 genetic}.
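
Before turning to computational effort and worked examples, the following sketch (not run here; the argument values are arbitrary illustrations rather than recommendations) shows how the population, generation, mutation and clone-control arguments described above might be combined in a single call, using the 62-variable \texttt{farm} data set included in the package:
\begin{verbatim}
> ## illustrative sketch only: arbitrary tuning values, not recommendations
> data(farm)
> farm.sketch <- genetic(cor(farm), kmin=10, crit="gcd",
+                        popsize=200, nger=150,
+                        mutate=TRUE, mutprob=0.05, maxclone=0)
> farm.sketch$bestvalues
> farm.sketch$bestsets
\end{verbatim}
Larger values of \texttt{popsize} and \texttt{nger} may improve the quality of the selected subsets, but they also increase the number of calls to the procedure which computes the criterion values (see the next paragraph).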
\paragraph{Computational effort} \textbf{ } \newline
For each cardinality $k$, the total number of calls to the procedure which computes the criterion values is $popsize + nger \times popsize/2$. These calls are the dominant computational effort in each iteration of the algorithm. In order to improve computation times, the bulk of computations is carried out by a Fortran routine. Further details about the Genetic Algorithm can be found in \cite{CCM:04} and in the comments to the Fortran code (in the \texttt{src} subdirectory for this package).

\paragraph{The \texttt{force} argument} \textbf{ } \newline
For datasets with a very large number of variables (currently $p>400$), it is necessary to set the \texttt{force} argument to TRUE for the function to run, but this may cause a session crash if there is not enough memory available.

\paragraph{Reproducible solutions} \textbf{ } \newline
The \texttt{genetic} algorithm is a random algorithm, so that solutions are, in general, different each time the algorithm is run. For reproducible solutions, the logical argument \texttt{setseed} should be set to TRUE during a session and left as TRUE in subsequent function calls whose results are to be reproduced.

\paragraph{Uniqueness of solutions} \textbf{ } \newline
The requested \texttt{nsol} solutions are \emph{not necessarily different solutions}. The \texttt{nsol} solutions are computed separately, so that (unlike what happens with the \texttt{eleaps} function) the same variable subset may appear more than once among the \texttt{nsol} solutions.

\paragraph{Examples} \textbf{ } \newline
Two \textbf{examples} of use of the \texttt{genetic} function are now given:
\begin{description}
\item[Example 1.] \label{exemp 1 genetic} Consider the 62-variable Portuguese \texttt{farm} dataset, already discussed in subsection \ref{sss RV} (page \pageref{sss RV}). We seek a 10-variable subset which spans a subspace as close as possible to the principal subspace spanned by the first ten principal components, so that the GCD criterion (subsection \ref{sss GCD}) is relevant. In the \texttt{genetic} function (as in the \texttt{anneal} and \texttt{improve} functions, but unlike the \texttt{eleaps} function), the default value of \texttt{kmax} is the value of \texttt{kmin}, and so \texttt{kmax} does not need to be specified in this case.
<<>>=
farm.gcd <- genetic(cor(farm), kmin=10, crit="gcd")
farm.gcd$bestsets
farm.gcd$bestvalues
@
Since the argument \texttt{popsize} was not specified, the default population size ($100$) was used. As stated above, not all 100 solutions are necessarily different. To select only the different solutions, R's function \texttt{unique} may be used:
<<>>=
unique(farm.gcd$subsets)
@
If more than just a few different solutions are desired, it may be possible to increase the diversity of solutions by using the \texttt{maxclone} argument. The following example compares the number of different solutions obtained using the default value \texttt{maxclone}=5 and a user-specified value \texttt{maxclone}=0. Since the \texttt{subsets} output object is a 3-d array, \texttt{unique} also produces a 3-d object, whose first dimension gives the number of different solutions.
<<>>=
dim(unique(genetic(cor(farm), kmin=10, crit="gcd")$subsets))
dim(unique(genetic(cor(farm), kmin=10, maxclone=0, crit="gcd")$subsets))
@
\item[Example 2.] We consider subset selection in the context of a \textbf{Canonical Correlation Analysis}. The data set in this example is the same as in Example 1 above.
Two groups of variables within the farm data set are compared: the $X$ group (all variables except the first four) and the $Y$ group (variables 1 to 4). The goal is to select 4- to 6-variable subsets of the 58-variable $X$ group that are optimal in terms of preserving the canonical correlations, according to the \texttt{zeta2} criterion. (Warning: the 4-variable $Y$ group is kept intact; subset selection is carried out in the $X$ group only). As can be seen from the results, even a four-variable $X$ subgroup essentially reproduces the canonical correlations with the $Y$ group of variables.
%The \texttt{tolsym}
%parameter is used to relax the symmetry requirements on the effect
%matrix $\Hmat$ which, for numerical reasons, is slightly
%asymmetric. Since corresponding off-diagonal entries of matrix $\Hmat$
%are different, but by less than \texttt{tolsym}, $\Hmat$ is replaced
%by its symmetric part: $(\Hmat+\Hmat^t)/2$.
<<>>=
data(farm)
farm.xgroup <- farm[,-c(1,2,3,4)]
farmHmat <- lmHmat(farm.xgroup,farm[,1:4])
colnames(farmHmat$mat)
farm.gen <- genetic(farmHmat$mat, kmin=4, kmax=6, H=farmHmat$H, r=farmHmat$r, crit="zeta2", maxclone=0, popsize=150)
farm.gen$bestvalues
farm.gen$bestsets
@
\end{description}
WARNING: The column numbers given by the last command are relative only to the $X$ group of variables, and so must be read off the list of variable names produced by the \texttt{colnames} command above (which are the same as those that would be printed out by the command \texttt{colnames(farm.xgroup)}).

\subsection{The \texttt{improve} function: a restricted improvement algorithm}
\label{ss improve}

The function \textbf{improve} implements a Restricted Local Improvement search for an optimal $k$-variable subset. Given a set of variables, the algorithm seeks a $k$-variable subset which is optimal, as a surrogate for the whole set, with respect to a given criterion. In brief, for each solution requested by the user (the number of which is set by the \texttt{nsol} function argument), an initial $k$-variable subset (for $k$ ranging from argument \texttt{kmin} to argument \texttt{kmax}) of a full set of $p$ variables is randomly selected and the variables not belonging to this subset are placed in a queue. The possibility of replacing a variable in the current $k$-subset with a variable from the queue is then explored. More precisely, a variable is selected, removed from the queue, and the $k$ values of the criterion which would result from swapping this selected variable with each variable in the current subset are computed. If the best of these values improves the current criterion value, the current subset is updated accordingly. In this case, the variable which leaves the subset is added to the queue, but only if it has not previously been in the queue (i.e., no variable can enter the queue twice). The algorithm proceeds until the queue is emptied (see reference \cite{CCM:04} for more details). The user may force variables to be included in and/or excluded from the $k$-variable subsets (arguments \texttt{include} and \texttt{exclude}), and may specify initial solutions (using the \texttt{initialsol} argument). The function checks for ill-conditioning of the input matrix (see the \texttt{improve} help file for more details).

\paragraph{Computational effort} \textbf{ } \newline
For each cardinality $k$, the total number of calls to the procedure which computes the criterion values is $O(nsol \times k \times p)$. These calls are the dominant computational effort in each iteration of the algorithm.
In order to improve computation times, the bulk of computations is carried out by a Fortran routine. Further details about the algorithm can be found in reference \cite{CCM:04} and in the comments to the Fortran code (in the \texttt{src} subdirectory for this package).

\paragraph{The \texttt{force} argument} \textbf{ } \newline
For datasets with a very large number of variables (currently $p>400$), it is necessary to set the \texttt{force} argument to \texttt{TRUE} for the function to run, but this may cause a session crash if there is not enough memory available.

\paragraph{Reproducible solutions} \textbf{ } \newline
As with the \texttt{anneal} and \texttt{genetic} functions, the \texttt{improve} algorithm is a random algorithm, so solutions are, in general, different each time the algorithm is run. For reproducible solutions, the logical argument \texttt{setseed} should be set to TRUE during a session and left as TRUE in subsequent function calls whose results are to be reproduced.

\paragraph{Uniqueness of solutions} \textbf{ } \newline
The requested \texttt{nsol} solutions are \emph{not necessarily different solutions}. The \texttt{nsol} solutions are computed separately, so that (unlike what happens with the \texttt{eleaps} function) the same variable subset may appear more than once among the \texttt{nsol} solutions.

\paragraph{Examples} \textbf{ } \newline
Three \textbf{examples} of use of the \texttt{improve} function are now given:
\begin{description}
\item[Example 1.] A very simple example, using the standardized \texttt{swiss} data set from R's general distribution (see \texttt{help(swiss)} for more information on this data set, which was also used in the first example of Section \ref{ss eleaps}). The purpose is to determine the best 2- and 3-variable subsets, such that orthogonally projecting the 47-point (standardized) configuration on the subspace spanned by the selected variables provides a configuration as similar as possible (allowing for rigid rotations, translation of the origin and global changes of scale) to the original 47-point configuration defined by all six (standardized) variables in the data set. Two different requests are made: the first without any restrictions, while the second forces the inclusion of variable 1 and the exclusion of variable 6 from the selected subsets.
<<>>=
swiss.imp1 <- improve(mat=cor(swiss),kmin=2,kmax=3,nsol=4,criterion="GCD")
swiss.imp2 <- improve(cor(swiss),2,3,nsol=4,criterion="GCD",include=c(1),exclude=6)
swiss.imp1$bestvalues
swiss.imp1$bestsets
swiss.imp2$bestvalues
swiss.imp2$bestsets
@
\item[Example 2.] A second very simple example, in a \textbf{Linear Discriminant Analysis} setting. Here we consider the same situation as in the example used for the discussion of the \texttt{ldaHmat} function, in subsection \ref{sss Hmat}. The goal is to select the 2- and 3-variable subsets that are optimal for the linear discrimination of the three \emph{iris} species in Fisher's \emph{iris} data (as measured by the \texttt{ccr12} criterion).
<<>>=
data(iris)
irisHmat <- ldaHmat(iris[1:4],iris$Species)
improve(irisHmat$mat,kmin=2,kmax=3,H=irisHmat$H,r=irisHmat$r,crit="ccr12")
@
\item[Example 3.] A more involved example, in a \textbf{MANOVA} context. It follows up on the example used when introducing the \textbf{glhHmat} function, in subsection \ref{sss Hmat}. More precisely, in this data set 200 crabs are classified by two factors, \texttt{sex} and \texttt{sp}, with two levels each.
There are also measurement variables, considered on both their original and logarithmic scales. The goal is to detect the variables that are prominent in an analysis of the effect of the \texttt{sp} factor, after controlling for \texttt{sex}. The initial commands in this example were already discussed in subsection \ref{sss Hmat}.
<<>>=
library(MASS)
data(crabs)
lFL <- log(crabs$FL) ; lRW <- log(crabs$RW); lCL <- log(crabs$CL); lCW <- log(crabs$CW)
C <- matrix(0.,nrow=2,ncol=4)
C[1,3] = C[2,4] = 1.
C
Hmat5 <- glhHmat(cbind(FL,RW,CL,CW,lFL,lRW,lCL,lCW) ~ sp*sex,C=C,data=crabs)
improve(mat=Hmat5$mat, kmin=4, nsol=3, H=Hmat5$H, r=Hmat5$r, crit="xi2",tolsym=1e-06)
@
\end{description}

%\section{Methods}
%\label{s methods}
%{\color{red}FALTA}

\bibliographystyle{plain}
\bibliography{subbib}

\end{document}