%\VignetteIndexEntry{Different models for different purposes: focused model comparison in R}
%\VignetteEngine{knitr::knitr}
% \documentclass[article,shortnames]{jss}
\documentclass[article,shortnames,nojss,nofooter]{jss}
\usepackage{bm}
\usepackage{tabularx}
\usepackage{graphics}
\usepackage{alltt}
\usepackage{amsmath}
<<echo=FALSE>>=
options(prompt = "R> ", continue = "+ ", width = 70, useFancyQuotes = FALSE, digits = 3)
library("knitr")
opts_chunk$set(fig.path = "fic-")
@
\title{Different Models for Different Purposes: Focused Model Comparison in R}
\author{
~\\Christopher Jackson <chris.jackson@mrc-bsu.cam.ac.uk>,\\
Howard Thom <howard.thom@bristol.ac.uk>,\\
Gerda Claeskens <gerda.claeskens@kuleuven.be>
}
\Plainauthor{Christopher Jackson, Howard Thom, Gerda Claeskens}
\newcommand{\btheta}{\boldsymbol{\theta}}
\newcommand{\bbeta}{\boldsymbol{\beta}}
\newcommand{\bgamma}{\boldsymbol{\gamma}}
\newcommand{\bdelta}{\boldsymbol{\delta}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\z}{\mathbf{z}}
\Abstract{
Typical methods of model comparison are used to pick one ``best'' model, no matter what the estimates from the model are used for. ``Focused'' model comparison, by contrast, considers that different models may be better for different purposes. Different models may give better estimates of different ``focus'' quantities, functions of the basic parameters. In the ``focused information criterion'' of Claeskens and Hjort (2003), data are assumed to be generated by a ``wide'' model, in which all the models under consideration are nested. Fitting the wide model to the observed data, however, may give estimates that are not sufficiently precise. Therefore we might accept some bias in the estimate in return for greater precision. The optimal submodel for a particular focus is the one which minimises the mean squared error of the estimate of that focus from the submodel, assuming that the wide model is true.
The \code{fic} package calculates this error for comparisons within any class of models fitted by maximum likelihood, for any focus quantity. The bias-variance tradeoff is shown directly. There are shortcuts for commonly-used model classes such as generalised linear models and parametric survival models. Covariate selection problems in Cox regression models are also supported.
}
\Keywords{FIC, model comparison, AIC, BIC}
\begin{document}

\section{Introduction: principles for model comparison}

To compare a set of statistical models fitted to the same data by maximum likelihood, it is common to rank them according to some ``criterion''. For example, Akaike's information criterion (AIC, \citet{aic}) takes the form
$$ -2\log \ell(\hat\theta ; \mathbf{x}) + 2p $$
where $\ell(\hat\theta ; \mathbf{x})$ is the maximised likelihood for the model fitted to the dataset $\mathbf{x}$, the likelihood is maximised at parameters $\hat\theta$, and $p$ is the number of parameters. The Bayesian information criterion (BIC, \citet{schwarz}) is
$$ -2\log \ell(\hat\theta ; \mathbf{x}) + p\log(n) $$
where $n$ is the number of observations.

These two criteria are based on very different principles, so they often rank models differently. The AIC is designed to choose models with better predictive ability, thus it tends to favour bigger models as the sample size increases. BIC is an approximation to Bayesian model comparison by Bayes factors, and prefers models with higher posterior probability under an implicit weak prior \citep[with an amount of information equivalent to one observation, see][]{kass:wasserman}. If there is a ``true'' model, the BIC will tend to select it as the sample size increases. In many situations there may not be a true model, and collecting more data will uncover more complexity in the process generating the data, in which case AIC may be more suitable.
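Both criteria are computed by the \code{AIC} and \code{BIC} functions in base \proglang{R}. As a toy sketch only (using the built-in \code{mtcars} data, which is not part of this paper's examples):
<<eval=FALSE>>=
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp + qsec, data = mtcars)
## AIC = -2 log-likelihood + 2p;  BIC = -2 log-likelihood + p log(n)
AIC(m1, m2)
BIC(m1, m2)
@
Whichever criterion is used, the result is a single ranking of the two models, regardless of what the fitted models will be used to estimate.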
See, e.g., \citet{burnham:anderson:book} and \citet{claeskens:hjort:book} for more of the theory behind these and other similar model comparison criteria.

Both of these methods give a single ranking of models according to how well they fit a given dataset. However, different models may be better for different purposes, for example to estimate different quantities of interest. Such a quantity is termed the ``focus'' of a model, and this is the idea behind ``focused'' model comparison. This paper describes the \pkg{fic} \proglang{R} package, which compares a set of models according to how accurately they estimate a focus quantity. The models should all be nested in a single ``wide'' model that is assumed to generate the data.

Section \ref{sec:formulae} gives an informal introduction to the principles, and sets out the formulae for various related ``focused'' model comparison criteria, as developed by \citet{fic} and \citet{claeskens:hjort:book}. Section \ref{sec:software} explains how general-purpose software is constructed to enable these quantities to be evaluated for any class of models and focuses, with the minimum of user effort. A worked example of using the \pkg{fic} package for covariate selection in generalised linear models is given in Section \ref{sec:example}. The extension of the theory to deal with Cox proportional hazards regression models is described in Section \ref{sec:cox}, with a worked example. A set of additional package vignettes demonstrates the use of the package in a variety of other situations.

\section{Focused model comparison: principles and formulae}
\label{sec:formulae}

Suppose the range of models we are willing to use is bounded by
\begin{itemize}
\item a \emph{wide model}, in which all models we would use are nested, with parameters $(\btheta,\bgamma)$,
\item a \emph{narrow model}, the smallest model we might be willing to use, defined by setting $\bgamma=\bgamma_0$ in the wide model.
\end{itemize}
A typical example is covariate selection in regression models, where $\btheta$ are the coefficients of ``protected'' covariates which are always included, $\bgamma$ are the coefficients of optional covariates that may or may not be included, and $\bgamma_0=0$. More generally, we wish to choose the appropriate level of flexibility for the distribution of some outcome: for example, choosing between a Poisson versus a negative binomial model for a count outcome, or an exponential versus a Weibull survival model, where the former is a constrained version of the latter.

Suppose also that the purpose of the model is to estimate some \emph{focus} quantity, which could be any function of the basic parameters
$$\mu = g(\btheta, \bgamma).$$
In focused model comparison, we prefer models which give better estimates of $\mu$. A typical way to define ``better'' is by the \emph{mean squared error}. The mean squared error of the estimate $\hat\mu_S$ under a submodel $S$ of the wide model, compared to the true value $\mu$, is
$$E\left\{(\hat\mu_S - \mu)^2\right\}.$$
This expectation is calculated under the assumption that the data are generated from the wide model. While we believe the wide model is the most realistic, we also accept that there may not be enough data to give sufficiently precise estimates of $\mu$. Therefore we are willing to accept some bias in this estimate, in return for a smaller variance, by selecting a smaller model than the wide model. The submodel $S$ with the lowest mean squared error is the one which makes the optimal trade-off between bias and variance. The mean squared error $MSE_S$ under model $S$ can be decomposed as a sum of the squared bias $B_S^2$ and the variance $V_S$:
\begin{equation}
\label{eq:mse}
\begin{aligned}
MSE_S = E\left\{(\hat\mu_S - \mu)^2\right\} & = \left\{E(\hat\mu_S)- \mu\right\}^2 + E\left[\left\{\hat\mu_S - E(\hat\mu_S)\right\}^2\right] \\
& = B_S^2 + V_S
\end{aligned}
\end{equation}
Estimators for these quantities are constructed by \citet{fic} under an asymptotic framework in which the data are assumed to be $n$ independent identically distributed observations from the wide model, but reparameterised so that $\bgamma = \bgamma_0 + \bdelta / \sqrt{n}$. Thus as the sample size $n$ increases, we aim to detect more subtle departures from the narrow model.

An obvious estimate for the bias $B_S$ is \( \hat{B}_S = \hat{\mu}_S - \hat{\mu}_W \), where $\hat\mu_W$ is the estimate of the focus quantity under the wide model, which is assumed to be unbiased. However, \citet{fic} derive a more accurate, asymptotically unbiased, estimate of the \emph{squared} bias as
\begin{equation}
\label{eq:sqbias:est}
\widehat{B_{S}^{2}} = \hat\omega^\top (I - G_S) (\hat\bdelta \hat\bdelta^\top - Q) (I - G_S)^\top \hat\omega
\end{equation}
where:
\begin{itemize}
\item $\hat\bdelta = \hat\bgamma\sqrt{n}$, where $\hat\bgamma$ is the estimate of $\bgamma$ under the wide model.
\item $\omega^\top\bdelta$ is the bias of the estimate of $\sqrt{n}\mu$ under the narrow model $N$, that is, the asymptotic mean of $\sqrt{n}(\hat\mu_N - \mu)$. Thus $\omega$ acts as a linear transformation from the biases of the basic parameters $\bgamma$ to the biases of the focus parameter $\mu$.
\item $\omega$ is estimated as $\hat\omega = J_{10}J_{00}^{-1}\frac{d\mu}{d\btheta} - \frac{d\mu}{d\bgamma}$ using Taylor approximation arguments, where $J$ is the information (inverse covariance) matrix under the wide model,\footnote{Note that \citet{claeskens:hjort:book} calculate the MSE of the focus multiplied by $\sqrt{n}$ rather than the MSE of the focus, thus they define $J$ instead as the information matrix divided by $n$. Thus their definition of $\omega$ is the same as ours since $n$ cancels, but their definitions of $Q_S$ and $Q$ are different.} and subscripts $0$ and $1$ select the rows and columns forming the submatrices of $J$ that correspond to the parameters $\btheta$ and $\bgamma$ respectively. The partial derivatives of the focus $\mu$ are evaluated at the estimates from the wide model.
\item $G_S = \pi^\top Q_S \pi Q^{-1}$ is an estimate of the transformation that maps the wide model estimate of $\bdelta$ to the submodel $S$ estimate, where $Q_S = (\pi Q^{-1} \pi^\top)^{-1}$, $Q^{-1} = J_{11}$ and $\pi$ is the projection matrix consisting of 0s and 1s which maps a vector of the same length as $(\btheta,\bgamma)$ to a subvector containing the elements corresponding to submodel $S$.
\end{itemize}

Occasionally the estimate~(\ref{eq:sqbias:est}) of the squared bias is negative. \citet{fic} therefore also present an adjusted version of~(\ref{eq:sqbias:est}), which assumes the bias is zero in these cases:
\begin{equation}
\label{eq:sqbias:adj}
\widehat{B^{2*}_S} = \max\left\{0, \widehat{B^2_S} \right\}
\end{equation}
The corresponding estimate of the bias is
\begin{equation}
\label{eq:bias:adj}
\widehat{B^*_S} = \mathrm{sign}(\hat\psi_{W} - \hat\psi_S)\sqrt{\widehat{B_S^{2*}}}
\end{equation}
where $\hat\psi_{W} = \hat\omega^\top\hat\bdelta$ and $\hat\psi_S = \hat\omega^\top G_S \hat\bdelta$ are estimates of $\omega^\top\bdelta$ under the wide model and submodel respectively.
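The truncation and sign adjustment in (\ref{eq:sqbias:adj})--(\ref{eq:bias:adj}) amount to a couple of lines of \proglang{R}. A minimal sketch, in which the hypothetical variables \code{B2}, \code{psiW} and \code{psiS} are assumed to hold $\widehat{B^2_S}$, $\hat\psi_W$ and $\hat\psi_S$:
<<eval=FALSE>>=
B2adj <- max(0, B2)                      # truncate negative squared bias at zero
Badj  <- sign(psiW - psiS) * sqrt(B2adj) # adjusted bias estimate
@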
The estimate for the variance of $\hat{\mu}_S$ under the wide model, derived by \citet{fic}, is
\[\widehat{V_S} = \hat{\tau}_0^2 + \hat{\omega}^\top Q_S^0 \hat{\omega} \]
where $\hat{\tau}_0^2$ estimates the variance of the narrow model focus estimate (using ``delta method'' principles, $\hat\tau_0^2 = \frac{d\mu}{d\btheta}^\top J_{00}^{-1} \frac{d\mu}{d\btheta}$), and the additional term $\hat{\omega}^\top Q_S^0 \hat{\omega}$ is the increase in variance we accept by using a wider but still misspecified model $S$, with $Q_S^0 = \pi^\top Q_S \pi$.

Thus we compare models on the basis of the root mean squared error, estimated by
\begin{equation}
\label{eq:rmse}
\sqrt{\widehat{MSE_S}} = \sqrt{\widehat{B_S^2} + \widehat{V_S}}
\end{equation}
or the alternative version which, while not asymptotically unbiased, is based on an interpretable estimate~(\ref{eq:bias:adj}) of the bias:
\begin{equation}
\label{eq:rmse:adj}
\sqrt{\widehat{MSE_S^*}} = \sqrt{\widehat{B_S^{2*}} + \widehat{V_S}}
\end{equation}
\citet{fic} define the ``focused information criterion'' (FIC), which has a slightly simpler form due to excluding terms common to all submodels $S$, and is related to the MSE as
\begin{equation}
\label{eq:fic}
FIC_S = n \widehat{MSE_S} - \hat\tau_0^2 + \hat\omega^\top Q \hat\omega
\end{equation}
Models with lower FIC give better estimates of the focus quantity. However, we prefer to use the (root) MSE as the model comparison statistic, due to its direct interpretation as the error of the focus estimate.

\subsection{Model comparison averaged over a range of focuses}

Often we want a model that performs well in a range of situations. In covariate selection problems, for example, we might want to estimate a focus quantity accurately for a defined range of covariate values. Thus the quantities defined above may now depend on the covariate value $u$.
We might simply define the ``averaged MSE'' for submodel $S$,
\[ MSE_S^{(ave)} = \int MSE_S(u) \, dW(u) \]
as a weighted average of the mean squared errors~(\ref{eq:mse}) for focuses defined by different covariate values $u$, weighted by their prevalence $W(u)$. However, \citet{claeskens:hjort:book} derived an alternative formula, so that if correction of the squared bias estimate, as in~(\ref{eq:sqbias:adj}), is required, it only needs to be performed once:
\begin{equation}
\label{eq:amse}
\widehat{MSE_S^{(ave)}} = I_S + \widehat{V_S^{(ave)}}
\end{equation}
where
\begin{equation}
\label{eq:bias:ave}
I_S = \mathrm{Tr}\left((I - G_S)(\hat\bdelta \hat\bdelta^\top - Q)(I - G_S)^\top A\right)
\end{equation}
is an estimate of the squared bias, and
\begin{equation}
\widehat{V_S^{(ave)}} = \tau_0^{2(ave)} + II_S
\end{equation}
is an estimate of the variance, with $\tau_0^{2(ave)}=\int \tau_0^2(u) \, dW(u)$, $II_S = \mathrm{Tr}(Q_S^0 A)$, and
\begin{eqnarray*}
A & = & J_{10} J_{00}^{-1}B_{00}J_{00}^{-1}J_{01} - J_{10}J_{00}^{-1}B_{01} - B_{10}J_{00}^{-1}J_{01} + B_{11} \\
B & = & \int \left( \begin{array}{l} d\mu(u)/d\btheta\\ d\mu(u)/d\bgamma \end{array} \right) \left( \begin{array}{l} d\mu(u)/d\btheta\\ d\mu(u)/d\bgamma \end{array} \right)^\top dW(u) \quad = \quad \left( \begin{array}{ll} B_{00} & B_{01}\\ B_{10} & B_{11} \end{array} \right)
\end{eqnarray*}
An analogue of the alternative MSE estimate~(\ref{eq:rmse:adj}) can then be defined, based on a bias estimator which is corrected when~(\ref{eq:bias:ave}) is negative, by substituting $\max(I_S, 0)$ for $I_S$ in~(\ref{eq:amse}).

If the focus is defined as the log density for one observation $y$, and if we average over the observed distribution of $y$ in the data, then model comparison using this procedure is asymptotically equivalent to model comparison by AIC \citep{claeskens:hjort:book,fic}. See the additional \pkg{fic} package vignette on linear models for an example.
\section{Software for focused model comparison} \label{sec:software} In order to calculate the MSE~(\ref{eq:mse}) for focused model comparison of a submodel $S$ against a wide model, we simply need to know \begin{itemize} \item the estimates $\hat{\btheta}_W$ and $\hat{\bgamma}_W$ and their covariance matrix under the wide model, \item the focus function $\mu(\btheta,\bgamma)$ \item the definition of which parameters are included in submodel $S$ and which are included in the narrow model $N$. \end{itemize} Additionally if we want to know the focused information criterion~(\ref{eq:fic}) we will need the sample size $n$. Derivatives of the focus function, required by~(\ref{eq:sqbias:est}), can be calculated numerically in general, for which robust software exists --- \pkg{numDeriv} \citep{numDeriv} is used here. Analytic derivatives are implemented in \pkg{fic} for two built-in focuses (the outcome probability in logistic regression, and the mean outcome in linear regression), but we have noticed no loss in accuracy from using numerical methods. This knowledge allows the \pkg{fic} package to implement focused model comparison for any class of models and focuses. The estimates and covariance matrix are routinely computed by functions for fitting models by maximum likelihood. Therefore the user simply needs to supply \begin{itemize} \item an \proglang{R} object containing the wide model \item a definition of the focus function $\mu()$ \item indicators for what submodels they want to compare \end{itemize} In addition the software needs to know where to look inside the wide model object for the estimates and covariance matrix, but this information can be built into the software for a range of commonly-used models. \section{Example: Covariate selection in logistic regression } \label{sec:example} The use of the \code{fic} package is illustrated for covariate selection in logistic regression, using the example from \citet{claeskens:hjort:book} (Example 6.1). 
The dataset was originally presented by \citet{hosmer:lemeshow:firstedition}. Data are taken from $n=189$ women with newborn babies, and the binary outcome is whether the baby is born with a weight of less than 2500g. We build a logistic regression model to predict the outcome, but are uncertain about which covariates should be included. The data are provided as an object \code{birthwt} in the \pkg{fic} package. This is the same as \code{birthwt} in \pkg{MASS} \citep{venables:ripley} with the addition of a few extra columns defining interactions and transformations as in \citet{claeskens:hjort:book}.

The following covariate is always included (coefficient vector $\btheta$):
\begin{itemize}
\item $x_1$ weight of mother in kg, \code{lwtkg}
\end{itemize}
The following covariates will be selected from (coefficient vector $\bgamma$):
\begin{itemize}
\item $z_1$ age in years, \code{age}
\item $z_2$ indicator for smoking, \code{smoke}
\item $z_3$ history of hypertension, \code{ht}
\item $z_4$ uterine irritability, \code{ui}
\item interaction $z_5 = z_1 z_2$ between smoking and age, \code{smokeage}
\item interaction $z_6 = z_2 z_4$ between smoking and uterine irritability, \code{smokeui}
\end{itemize}

Firstly the wide model, which includes all the above covariates, is defined and fitted.
<<message=FALSE>>=
library("fic")
wide.glm <- glm(low ~ lwtkg + age + smoke + ht + ui + smokeage + smokeui,
                data = birthwt, family = binomial)
@
The \emph{focus function} is then defined. This should be an \proglang{R} function, mapping the parameters \code{par} of the wide model to the quantity of interest. The focus can have any number of additional arguments. Typically an argument called \code{X} is used to supply covariate values at which the focus function should be evaluated.
\footnote{Note we could also have written the focus function as \code{function(par, X) plogis(par[1] + X \%*\% par[-1])}, in which case we could have omitted the dummy covariate values of 1 for the intercept at the start of \code{vals.smoke} and \code{vals.nonsmoke}.}
Here we take the probability of low birth weight as the focus, for two covariate categories:
\begin{enumerate}
\item smokers with average or typical values of the other covariates. These values are given in the order supplied when specifying the model (for smokers: intercept, \code{lwtkg}=58.24, \code{age}=22.95, \code{smoke}=1, \code{ht}=0, \code{ui}=0, \code{smokeage}=22.95, \code{smokeui}=0),
\item non-smokers with average values of the other covariates.
\end{enumerate}
<<>>=
prob_logistic <- function(par, X) plogis(X %*% par)
vals.smoke    <- c(1, 58.24, 22.95, 1, 0, 0, 22.95, 0)
vals.nonsmoke <- c(1, 59.50, 23.43, 0, 0, 0, 0, 0)
X <- rbind("Smokers" = vals.smoke, "Non-smokers" = vals.nonsmoke)
@
We can illustrate this function by calculating the probability of low birth weight, given the parameters of the fitted wide model, for each group. This is about twice as high for smokers.
<<>>=
prob_logistic(coef(wide.glm), X = X)
@
The \code{fic} function can then be used to calculate the mean squared error of the focus for one or more given submodels. For illustration we will compare two models, both including maternal weight: one also including age and smoking, and the other including age, smoking and hypertension.
<<>>=
mod1.glm <- glm(low ~ lwtkg + age + smoke, data = birthwt, family = binomial)
mod2.glm <- glm(low ~ lwtkg + age + smoke + ht, data = birthwt, family = binomial)
@
We supply the following arguments to the \code{fic} function.
\begin{itemize}
\item \code{wide}: the fitted wide model. All the model fit statistics are computed using the estimates and covariance matrix from this model.
\code{fic} will automatically recognise that this is a GLM fitted by the \code{glm} function in \proglang{R}, and extract the relevant information.
\item \code{inds}: indicators for which parameters are included in each submodel (parameters excluded from a submodel are fixed at their narrow-model values $\bgamma_0$). This should have number of rows equal to the number of submodels to be assessed, and number of columns equal to $\dim(\btheta) + \dim(\bgamma)$, the total number of parameters in the wide model, 8 in the case of \code{wide.glm}, which includes the intercept and the coefficients of seven covariates. It contains 1s in the positions where the parameter is included in the submodel, and 0s in positions where the parameter is excluded. This should always be 1 in the positions defining the narrow model, as specified in \code{inds0} below. If just one submodel is to be assessed, \code{inds} can also be supplied as a vector of length $\dim(\btheta) + \dim(\bgamma)$.

Note that \code{inds} indexes \emph{parameters} rather than \emph{linear model terms}, that is, in covariate selection problems where a variable is a factor with more than two levels, \code{inds} should contain separate entries for the coefficient of each factor level relative to the baseline level, not just one entry indicating the presence of the factor as a whole. A utility to construct this in the presence of factors is illustrated in Section~\ref{sec:cox:example}.
\item \code{inds0}: vector of indicators for which parameters are included in the narrow model, in the same format as \code{inds}. This can be omitted, in which case the narrow model is assumed to be given by the first row of \code{inds}. In this case, just the first two parameters are included: the intercept and the coefficient of \code{lwtkg}.
\end{itemize}
<<>>=
inds <- rbind(mod1 = c(1,1,1,1,0,0,0,0),
              mod2 = c(1,1,1,1,1,0,0,0))
inds0 <- c(1,1,0,0,0,0,0,0)
@
\begin{itemize}
\item \code{focus}: the focus function, and \code{X}, the alternative covariate values to evaluate it at. As well as an \proglang{R} function, this argument can alternatively be supplied as a character string naming a built-in focus function supplied by the \pkg{fic} package. Currently these just include \code{"prob_logistic"}, the outcome probability in a logistic regression, and \code{"mean_normal"}, the mean outcome in a normal linear regression.
\end{itemize}
The main \code{fic} function then returns the model fit statistics and the estimate of the focus quantity for each model.
<<>>=
fic1 <- fic(wide = wide.glm, inds = inds, inds0 = inds0,
            focus = prob_logistic, X = X)
fic1
@
The object returned by \code{fic} is a data frame containing one row for each combination of focus covariate values, indicated in the column \code{vals}, and submodels, indicated in the column \code{mods}. The focus estimate is returned in the final column \code{focus}, while the remaining columns contain the following model comparison statistics:
\begin{itemize}
\item \code{rmse}: the root mean squared error of the submodel focus estimate, calculated assuming the wide model is true (Equation~\ref{eq:rmse}),
\item \code{rmse.adj}: an alternative estimate of the root mean squared error (Equation~\ref{eq:rmse:adj}), based on the adjusted bias estimator (\ref{eq:sqbias:adj}--\ref{eq:bias:adj}),
\item \code{bias}: the estimated bias $\widehat{B^*_S}$ (Equation~\ref{eq:bias:adj}), which will be zero if $\widehat{B_{S}^{2}}$~(\ref{eq:sqbias:est}) is negative,
\item \code{se}: the standard error $\sqrt{\widehat{V_S}}$ of the submodel focus estimate, calculated assuming the wide model is true,
\item \code{FIC}: the FIC as originally defined by \citet{fic} (Equation~\ref{eq:fic}).
\end{itemize}
The submodels are fitted automatically within the \code{fic} function in order to produce the focus estimate, so it was not really necessary to fit \code{mod1.glm} and \code{mod2.glm} by hand, as above. As the wide model has class \code{glm}, it is recognised as a GLM, so \code{fic} assumes that our submodels correspond to models with different covariates included, as indicated by \code{inds}. The focus estimates from the submodels can then be returned alongside the model comparison statistics.

As well as the specific covariate categories, \code{fic} calculates model comparison statistics which are averaged over the categories, indicated by a value of \code{ave} in the column \code{vals}. An equally-weighted average is computed by default; arbitrary weights can be supplied in the argument \code{wt} to \code{fic}.

Recall that \code{mod2} contains one more covariate than \code{mod1}. For each of the two focuses, and their average, the unadjusted and adjusted bias estimates are lower under \code{mod2} due to the inclusion of this covariate, while the standard error \code{se} is higher. Given the lower \code{rmse} and \code{rmse.adj} under \code{mod2}, the reduction in bias is deemed to be worth the increase in uncertainty.

\subsection{Comparing a wide range of models}

We may want to examine a broad range of models, particularly in regression contexts. The function \code{all_inds} (a wrapper around \code{expand.grid}) creates a matrix of indicators that defines all submodels with different covariates, spanned by a given wide model (here \code{wide.glm}) and a narrow model (here defined by \code{inds0}). This function works only for classes of model objects \code{x} for which the \code{terms(x)} function is understood, which includes standard \proglang{R} regression models such as \code{lm} and \code{glm}. Factors are handled naturally.
<<>>=
combs <- all_inds(wide.glm, inds0)
@
The resulting matrix can be used as the \code{inds} argument to \code{fic} to compare all submodels in this example, again for a focus defined by the probability of low birth weight at covariate values defined by \code{X}. Before calling \code{fic} again, we redefine \code{combs} to exclude models with interactions but without both corresponding main effects.
<<>>=
combs <- subset(combs, !((smoke==0 & smokeage==1) |
                         (smoke==0 & smokeui==1) |
                         (age==0 & smokeage==1) |
                         (ui==0 & smokeui==1)))
ficres <- fic(wide = wide.glm, inds = combs, inds0 = inds0,
              focus = prob_logistic, X = X)
@
Notice that some of the \code{rmse} elements of \code{ficres} are \code{NaN}, since the first squared bias estimate $\widehat{B_S^2}$ is negative. The alternative estimate~(\ref{eq:rmse:adj}), \code{rmse.adj}, might be preferred in these cases, since it is consistent with the bias and variance estimates.

A comparison of many models can be illustrated by a scatterplot of the focus estimate against the root MSE of each submodel. The default \code{plot} method for \code{fic} objects accomplishes this using base \proglang{R} graphics: try \code{plot(ficres)}. Alternatively, a graph can be drawn with the \code{ggplot_fic} function if the \pkg{ggplot2} package is installed. This is illustrated in Figure~\ref{fig:ficres}.
<<birthwt,include=FALSE>>=
ggplot_fic(ficres)
@
\begin{figure}[h]
\centering
\includegraphics{fic-birthwt-1}
\caption{Focused comparison of logistic regression models for low birth weight with different covariates}
\label{fig:ficres}
\end{figure}
There is one panel for each of the two covariate categories (smokers and non-smokers) defining the focus (probability of low birth weight), and one for the average over the two categories. The solid blue line is the focus estimate under the wide model, and the dashed blue line is the focus estimate under the narrow model.
An informal illustration of the uncertainty around the estimate of the focus quantity from each submodel is given by the estimate $\pm 1.96 \times \sqrt{\widehat{V_S}}$. Note that this underestimates the uncertainty if inference is based on a selected model --- a better confidence interval would then account for the range of models being searched over (``post-selection'' inference, see, e.g., \citet{hjort:claeskens,berk:postselection}). Each submodel is labelled faintly using the row names of the matrix supplied as the \code{inds} argument to \code{fic}. In this case, these names were automatically constructed by the function \code{all_inds} and contain a string of binary 0/1 indicators for the inclusion of the eight parameters.

For smokers, a group of smaller models including the narrow model (labelled \code{11000000}) give estimates of the probability of low birth weight with the lowest MSE, while by contrast, for non-smokers, the wide model (labelled \code{11111111}) and similar larger models give the most accurate estimates of the focus quantity. Note that in this dataset there are 115 non-smokers and 74 smokers, thus the greater amount of data enables bigger models to be identified for non-smokers. Wider models are also preferred for describing the average population.

The model with the optimal MSE can be identified, if necessary, with the \code{summary} method for objects returned by \code{fic}. This lists the parameters included in the optimal model for each focus.
<<>>=
summary(ficres)
@
In general, though, if there are multiple models with similar estimation performance but different results, the reasons for their differences should be explored in greater depth with the aid of background knowledge.
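The averaged statistics use equal weights over the focuses by default, and alternative weights can be supplied through the \code{wt} argument to \code{fic}. For instance, to weight the smoker and non-smoker focuses by their observed proportions in the data (74/189 and 115/189), a sketch not evaluated here:
<<eval=FALSE>>=
## weight the two focuses by the observed prevalence of smoking
fic(wide = wide.glm, inds = combs, inds0 = inds0,
    focus = prob_logistic, X = X, wt = c(74/189, 115/189))
@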
\subsection[Calling ``fic'' for an unfamiliar class of models]{Calling \pkg{fic} for an unfamiliar class of models}

Above, the \code{fic} function recognised the fitted model objects as GLMs, that is, objects of class \code{"glm"} returned by the \code{glm()} function in base \proglang{R}. But the package can be used to calculate focused model comparison statistics for any class of models, not just the special classes it recognises. To do this, it needs to know where two things are stored inside the fitted model objects:
\begin{enumerate}
\item the vector of maximum likelihood estimates $(\hat\btheta,\hat\bgamma)$,
\item the covariance matrix of the maximum likelihood estimates, $J^{-1}$,
\end{enumerate}
plus, if the ``classic'' FIC is required, the number of observations $n$ contributing to the model fit.

Given a fitted model object called \code{mod}, the \code{fic()} function assumes by default that \code{coef(mod)} and \code{vcov(mod)} respectively return the estimates and covariance matrix. Likewise it assumes that \code{nobs(mod)} returns the number of observations if FIC is required. If one or more of these assumptions is not true, the defaults can be changed by supplying the argument \code{fns} to \code{fic()}, which should be a named list of three components. Each component should be a function taking one argument (the fitted model) which extracts and returns the required information, as in the following example. Here the first component of \code{fns} is a function which, when applied to a \code{glm} object, returns the maximum likelihood estimates of the regression coefficients.
<<eval=FALSE>>=
fns <- list(coef = function(x) coef(x),
            nobs = function(x) nobs(x),
            vcov = function(x) vcov(x))
fic1 <- fic(wide = wide.glm, inds = inds, inds0 = inds0,
            focus = prob_logistic, fns = fns, X = X)
@
A full worked example of using \code{fic} for a novel class of models, defined and fitted by custom \proglang{R} functions, is given in the package vignette ``Examples of focused model comparison: skew-normal models''.

\subsection{Other classes of models}

The \pkg{fic} package also has built-in methods for the following classes of models.
\begin{itemize}
\item Linear models fitted with \code{lm} in base \proglang{R}. Examples of covariate selection and polynomial order selection are given in an additional \pkg{fic} vignette on linear models, which also illustrates the use of quantiles of the outcome distribution as focuses, and focuses that depend on variance parameters as well as linear model coefficients.
\item Parametric survival models fitted with \code{flexsurvreg} in the \pkg{flexsurv} package \citep{flexsurv}, or \code{survreg} in the \pkg{survival} package \citep{survival-package}. The additional \pkg{fic} vignette ``Examples of focused model comparison: parametric survival models'' illustrates the use of \code{fic} to compare parametric baseline survival functions of different levels of complexity. \code{fic} needs to be set up carefully here, since the same model can be presented in several different parameterisations, and focused comparison requires the submodels to be defined by fixing parameters of the wide model to special values.
\item Multi-state models fitted with the \pkg{msm} package \citep{msmjss}. A multi-state model might consist of several regression models, one for each transition between states in a multi-state structure, which poses a challenge to defining and interpreting a focused model comparison. The package vignette ``Examples of focused model comparison: multi-state models'' gives a worked example.
Another feature of this example is that the focus of interest is a complicated model summary function provided by the \pkg{msm} package, which needs to be converted carefully into the format expected by \pkg{fic}.
\end{itemize}

\section{Focused covariate selection in Cox proportional hazards regression}
\label{sec:cox}

In a Cox regression model, time-to-event outcomes $t_i$ are observed on individuals $i$, potentially with right-censoring. At time $t$, individual $i$ is assumed to have a hazard $h_i(t)$ which depends multiplicatively on their covariate values. We compare models that have different sets of covariates. In the most general ``wide'' model, $h_i(t) = h_0(t)\exp(\btheta^\top\x_i + \bgamma^\top \z_i)$. The baseline hazard $h_0(t)$ is left unspecified, while $\btheta$ and $\bgamma$ are estimated by maximum partial likelihood. We compare submodels $S$ of this wide model, which include different subsets of covariates, according to how accurately they estimate some focus quantity $\mu = \mu(\btheta, \bgamma, H_0() | \mathbf{x}, t)$, where $H_0()$ is the cumulative baseline hazard function, which can be estimated nonparametrically by various methods. Typical focus quantities might depend on time $t$ as well as covariate values $\mathbf{x}$, e.g., the probability $S(t|\mathbf{x},\btheta, \bgamma, H_0())$ that a person with covariates $\mathbf{x}$ will survive $t$ years.
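This survival focus, for example, has a closed form under the proportional hazards model above:
\[
S(t \mid \x, \z, \btheta, \bgamma, H_0()) = \exp\left\{-H_0(t) \exp(\btheta^\top\x + \bgamma^\top\z)\right\},
\]
so it depends on the parameters both through the log hazard ratios $(\btheta, \bgamma)$ and through the nonparametric estimate of $H_0(t)$.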
Again the mean squared error $MSE_S = B_S^2 + V_S$ of $\mu_S$, the focus quantity under submodel $S$, has an asymptotically unbiased estimator $\widehat{B_S^2} + \widehat{V_S}$, analogous to the one in Section~\ref{sec:formulae}, which was derived by \citet{fic:cox} under the same theoretical principles:
\begin{eqnarray*}
\widehat{B_S^2} & = & (\hat\omega - \hat\kappa)^\top (I - G_S) (\hat\bdelta \hat\bdelta^\top - Q) (I - G_S)^\top (\hat\omega - \hat\kappa) / n \\
\widehat{V_S} & = & \left\{\hat\tau_0(t)^2 + (\hat\omega - \hat\kappa)^\top Q_{S}^0 (\hat\omega - \hat\kappa)\right\}/n \\
\hat\omega & = & J_{10} J_{00}^{-1} \frac{d\mu}{d\btheta} - \frac{d\mu}{d\bgamma}\\
\hat\kappa(t) & = & \left(J_{10} J_{00}^{-1} F_0(t) - F_1(t)\right)\frac{d\mu}{dH_0},
\end{eqnarray*}
where $Q_S^0,G_S,\hat\bdelta,J_{10},J_{00}$ are defined as in Section~\ref{sec:formulae}, except in terms of the partial likelihood instead of the likelihood.
Newly-defined quantities are
\[F(t) = \int_0^t \left\{G_n^{(1)}(u) / G_n^{(0)}(u)\right\} dH_0(u) = \left( \begin{array}{l} F_0(t) \\ F_1(t) \end{array}\right)\]
where $F_0(t)$ and $F_1(t)$ have $p$ and $q$ components respectively,
\begin{eqnarray*}
G_n^{(0)}(u) &=& \frac{1}{n}\sum_{i=1}^n Y_i(u) \exp(\btheta^\top\x_i + \bgamma^\top\z_i)\\
G_n^{(1)}(u) &=& \frac{1}{n}\sum_{i=1}^n Y_i(u) \exp(\btheta^\top\x_i + \bgamma^\top\z_i) \left(\begin{array}{l}x_i\\z_i\end{array}\right)
\end{eqnarray*}
both evaluated at the estimates of $\btheta$, $\bgamma$ and $H_0()$ from the wide model,
\[ \tau_0(t)^2 = \Big(\frac{d\mu}{dH_0}\Big)^2\int_0^t \frac{dH_0(u)}{g^{(0)}(u,\btheta,0)} + \left\{\frac{d\mu}{d\btheta} - \frac{d\mu}{dH_0} F_0(t)\right\}^\top J_{00}^{-1}\left\{\frac{d\mu}{d\btheta} - \frac{d\mu}{dH_0} F_0(t)\right\} \]
in which $g^{(0)}(u,\btheta,0)$ denotes the large-sample limit of $G_n^{(0)}(u)$ evaluated under the narrow model, with $\bgamma = 0$, and $Y_i(t) = I(t_i \geq t)$ is the indicator for individual $i$ being at risk at time $t$. As before, if $\widehat{B_S^2} < 0$, we could also use an alternative estimator for $MSE_S$ which assumes that the bias is zero in these cases.

\subsection{Example: Malignant melanoma}
\label{sec:cox:example}

To illustrate the method, \citet{fic:cox} study a dataset of 205 patients with malignant melanoma, analysed in detail earlier by \citet{andersen:1993}. This dataset is also provided in the \pkg{fic} package. We compare models ranging from a wide model \code{wide}, which includes 7 terms in the regression model formula, to a narrow model that includes only sex.

<<>>=
library("survival")
wide <- coxph(Surv(years, death==1) ~ sex + thick_centred + infilt +
              epith + ulcer + depth + age, data = melanoma)
@

In this example, we need to deal with \emph{factor} terms in the model when setting up the \code{inds} and \code{inds0} indicators to supply to \code{fic}. Specifically, \code{infilt} and \code{depth} are factors with 4 and 3 levels, represented in the model by 3 and 2 parameters respectively, rather than one parameter each.
The remaining terms in the model are each associated with one parameter, so the wide model, with 7 terms, includes 10 parameters. The function \code{expand_inds} can be used to construct \code{inds} or \code{inds0} terms in the presence of factors\footnote{This function only works for classes of models for which the \code{model.matrix} function is understood and returns objects with an \code{"assign"} attribute. This includes all the commonly-used models in base \proglang{R}.}. We supply a vector of 7 elements, indicating the presence or absence of each of the 7 terms in the model formula. In this case, only the first term, \code{sex}, is included in the narrow model. Then to create an \code{inds0} vector of 10 elements, indicating the presence or absence of each of the wide model's 10 parameters in the narrow model, we call

<<>>=
inds0 <- expand_inds(c(1,0,0,0,0,0,0), wide)
inds0
@

Note that Cox regression has no intercept parameter, so the model parameters comprise only the regression coefficients. In fully parametric regression models, for example GLMs, the vector supplied to \code{expand_inds} should contain an additional element indicating the presence of the intercept.

The \code{fic} package includes three built-in alternative focuses for Cox models, specified through the \code{focus} argument to \code{fic}.
\begin{itemize}
\item \code{focus="hr"}: the hazard ratio between an individual with covariates $X$ and an individual with covariates 0 (which, by the definition of the Cox model, is independent of time $t$)
\item \code{focus="survival"}: the survival probability at time $t$, for an individual with covariates $X$
\item \code{focus="cumhaz"}: the cumulative hazard at time $t$, for an individual with covariates $X$
\end{itemize}
The required covariate values and/or time point(s) are supplied as the \code{X} and \code{t} arguments to \code{fic}.
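For intuition, the survival focus and its derivatives can be sketched as plain \proglang{R} functions of the parameters, cumulative baseline hazard, covariates and time, using the closed form $S(t|X) = \exp\{-H_0(t)\exp(X\bbeta)\}$. This sketch illustrates the mathematical form only: the names \code{surv_focus}, \code{surv_deriv} and \code{surv_dH} are hypothetical, and the package's own internal versions (such as \code{fic:::cox_survival}) may use different argument and output formats.

<<eval=FALSE>>=
## Hypothetical sketch of a survival focus for Cox models:
## S(t | X) = exp(-H0(t) * exp(X %*% par))
surv_focus <- function(par, H0, X, t) {
    exp(-H0 * exp(X %*% par))
}
## Derivative with respect to the log hazard ratios `par`:
## dS/dpar = -H0 * exp(X'par) * S * X
surv_deriv <- function(par, H0, X, t) {
    hr <- as.vector(exp(X %*% par))
    -H0 * hr * as.vector(exp(-H0 * hr)) * t(X)
}
## Derivative with respect to the cumulative baseline hazard H0:
## dS/dH0 = -exp(X'par) * S
surv_dH <- function(par, H0, X, t) {
    hr <- as.vector(exp(X %*% par))
    -hr * exp(-H0 * hr)
}
focus_sketch <- list(focus = surv_focus, deriv = surv_deriv, dH = surv_dH)
@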
It is possible to supply alternative focuses, though this is slightly trickier than for standard full-likelihood models. It requires the user to supply a list of three functions as the \code{focus} argument to \code{fic}: one returning the focus, and two returning its derivatives with respect to $(\btheta,\bgamma)$ and $H_0(t)$ respectively. The functions take arguments \code{par}, \code{H0}, \code{X} and \code{t}, representing the parameter vector, cumulative hazard, covariate matrix and time. To see some examples, print the following lists of functions, which are used for the three built-in focuses above.

<<eval=FALSE>>=
list(focus = fic:::cox_hr, deriv = fic:::cox_hr_deriv,
     dH = fic:::cox_hr_dH)
list(focus = fic:::cox_cumhaz, deriv = fic:::cox_cumhaz_deriv,
     dH = fic:::cox_cumhaz_dH)
list(focus = fic:::cox_survival, deriv = fic:::cox_survival_deriv,
     dH = fic:::cox_survival_dH)
@

In the melanoma example, all possible submodels spanning the wide and narrow model are compared. As before, a matrix of indicators describing these models is constructed, and the submodels are fitted automatically within the \code{fic} function.

<<>>=
combs <- all_inds(wide, inds0)
@

The focus is the 5-year survival probability (\code{focus = "survival", t = 5} in the call to \code{fic}) for the covariate values defined by \code{X}, here taken as men and women separately, with average age and mean observed tumour thickness among men and women, infiltration level 4, epithelioid cells and ulceration present, and invasion depth 2. The utility \code{newdata_to_X} is used to convert the user-defined data frame \code{newdata} that identifies these covariate values, with one variable per covariate or factor, to a design matrix \code{X}, with one column for each of the 10 parameter values.\footnote{Note the intercept is excluded here, as Cox models do not have a regression intercept term.
In GLMs, the design matrix usually includes an intercept.}

<<>>=
newdata <- with(melanoma,
                data.frame(sex = c("female","male"),
                           thick_centred = tapply(thick_centred, sex, mean),
                           infilt = 4, epith = 1, ulcer = 1, depth = 2,
                           age = tapply(age, sex, mean)))
X <- newdata_to_X(newdata, wide, intercept=FALSE)
ficall <- fic(wide, inds = combs, inds0 = inds0, focus = "survival",
              X = X, t = 5)
@

<<melanoma,include=FALSE>>=
ggplot_fic(ficall, ci = FALSE, xlim = c(0,1))
@

\begin{figure}[h]
\centering
\includegraphics{fic-melanoma-1}
\caption{Focused comparison of Cox regression models for the melanoma data.}
\label{fig:melanoma}
\end{figure}

The models give a wide range of estimates of the focus survival probability. Generally, the models returning higher survival estimates are those closer to the narrow model, and these estimates have high error. The most accurate models, as judged by the MSE of this focus, return survival estimates of around 0.6--0.7 for men and 0.4--0.5 for women. If desired, the model with the lowest root MSE could be identified with the \code{summary} method.

<<>>=
summary(ficall)
@

However, the reasons for the variation in results between models should be investigated further, with the aid of clinical background knowledge.

% We can compare with Fig 6.5 of \cite{claeskens:hjort:book}
<<echo=FALSE,eval=FALSE>>=
par(mfrow = c(1,2))
plot(ficall$FIC[ficall$vals=="female"], ficall$focus[ficall$vals=="female"],
     xlim = c(0,30), ylim = c(0, 1),
     ylab = "5yr survival estimates, women", xlab = "FIC")
plot(ficall$FIC[ficall$vals=="male"], ficall$focus[ficall$vals=="male"],
     xlim = c(0,30), ylim = c(0.2, 0.9),
     ylab = "5yr survival estimates, men", xlab = "FIC")
@

\section{Discussion}

In focused model comparison, we explicitly connect statistical models with the scientific questions that the models were designed to address. Different models might give better estimates of different focus quantities of interest.
Focused information criteria enable models to be compared according to how well they balance the bias and variance of an estimate of interest. The bias and variance terms that contribute to the criteria can be examined directly to illustrate the trade-off. The \pkg{fic} package performs focused model comparison easily for any class of models fitted by maximum likelihood. The package is designed to be easily extensible, by adding new model classes and focuses. The vignettes illustrate some examples of particular model classes, focuses and model comparison problems. Further contributions of code and worked examples would be welcome. More advanced models, beyond standard maximum likelihood, might need novel focused criteria to be implemented. Some generalisations were discussed by \citet{claeskens:hjort:book}, for example, for mixed models, missing data, and where a submodel is on the boundary of the parameter space. \citet{gueuning2018} developed focused information criteria for high-dimensional situations where one or more of the models being compared is not identifiable from the data. \citet{jullum2017parametric,jullum:coxvspar} used FIC to choose between parametric and nonparametric estimators, \citet{hellton2018fridge} used ideas of focused selection to set the tuning parameter in ridge regression, while \citet{ko:fic:copula} used it for copula model selection. Model uncertainty is an area of ongoing research. It is often not wise to highlight only the results of the single model that is selected by a criterion. If multiple well-performing models give different estimates, then the uncertainty about the model choice should be acknowledged. If a single ``best'' estimate is desired, then this could be produced by a model-averaged estimator (see, e.g., \citet{hjort:claeskens}, \citet{claeskens:hjort:book}, \citet{hansen:averaging:2014}, \citet{zhang:averaging}). 
Alternatively, if we wish to highlight a single best \emph{model}, this model could provide the point estimate, coupled with a confidence interval that acknowledges the range of models that have been selected from. In either case, determining appropriate confidence intervals or standard errors is challenging \citep[see, e.g.,][]{charkhi:claeskens,bachoc:postsel}. Bootstrapping, e.g., by resampling the data, or by resampling parameter estimates from their asymptotic sampling distributions under a wide model, is an attractive method of handling model uncertainty that is simple to implement in software, though its theoretical properties in this context are not well understood. See, e.g., \citet{claeskens:hjort:book}, \citet{breiman1996bagging} and \citet{my:averaging:bayesian} for examples and discussion.

Estimation is often followed by decision making, and formal decision theory might also lend itself to focused model comparison. For example, we might consider the ``focus'' to be the decision among discrete actions with optimal expected loss. Again, bootstrapping might provide a route to focused model comparison, e.g., by calculating the average loss, over a set of bootstrap resamples, of decisions made under competing models.

Focused model comparison principles might also be used for Bayesian inference. In realistic situations, the true process is typically more complex than the biggest model that can be identified from the data. From a Bayesian perspective, we might use a theoretically realistic model, but with a prior that encodes external information to enable the parameters to be identified from the data, e.g., through shrinkage, regularisation or elicited judgements. However, the appropriate model or prior might still be uncertain, and we may have multiple competing models. \citet{claeskens:hjort:book} present some equivalents of FIC for Bayesian estimators where there is prior information about departures from the narrow model.
Various criteria, with justifications related to cross-validation or predictive loss, have been developed to compare Bayesian models \citep{dic,plummer:dic,watanabe2013widely,vehtari2017practical} or average their estimates \citep{yao:stacking,my:averaging:bayesian}. These might also be extended to enable models to be compared for different focuses. \pkg{fic} is available from \url{http://CRAN.R-project.org/package=fic}. Development versions are available on \url{https://github.com/chjackson/fic}, and contributions are welcome. \bibliography{fic} \end{document}