%\VignetteIndexEntry{Different models for different purposes: focused model comparison in R}
%\VignetteEngine{knitr::knitr}

% \documentclass[article,shortnames]{jss}
\documentclass[article,shortnames,nojss,nofooter]{jss}
\usepackage{bm}
\usepackage{tabularx}
\usepackage{graphics}
\usepackage{alltt}
\usepackage{amsmath}

<<echo=FALSE>>=
options(prompt = "R> ",
        continue = "+  ",
        width = 70,
        useFancyQuotes = FALSE,
        digits = 3)
library("knitr")
opts_chunk$set(fig.path="fic-")
@

\title{Different Models for Different Purposes: Focused Model Comparison in R}
\author{
  ~\\Christopher Jackson <chris.jackson@mrc-bsu.cam.ac.uk>,\\
  Howard Thom <howard.thom@bristol.ac.uk>,\\
  Gerda Claeskens <gerda.claeskens@kuleuven.be>
}
\Plainauthor{Christopher Jackson, MRC Biostatistics Unit}

\newcommand{\btheta}{\boldsymbol{\theta}}
\newcommand{\bbeta}{\boldsymbol{\beta}}
\newcommand{\bgamma}{\boldsymbol{\gamma}}
\newcommand{\bdelta}{\boldsymbol{\delta}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\z}{\mathbf{z}}


\Abstract{
  Typical methods of model comparison are used to pick one ``best'' model, no
  matter what the estimates from the model are used for.  ``Focused''
  model comparison, by contrast, considers that different models may
  be better for different purposes.  Different models may give better 
  estimates of different ``focus'' quantities, functions of the
  basic parameters.

  In the ``focused information criterion'' of Claeskens and Hjort
  (2003), data are assumed to be generated by a ``wide'' model, in
  which all models we would consider are nested.  Fitting the wide
  model to the observed data, however, may give estimates that are not
  sufficiently precise.  Therefore we might accept some bias in the
  estimate in return for greater precision.  The optimal submodel for a
  particular focus is the one which minimises the mean squared error
  of the estimate of that focus from the submodel, assuming that the
  wide model is true.

  The \code{fic} package calculates this error for comparisons within
  any class of models fitted by maximum likelihood, for any focus
  quantity.   The bias-variance
  tradeoff is shown directly. There are shortcuts for commonly-used
  model classes such as generalised linear models and parametric survival models.
  Covariate selection problems in Cox regression models are also
  supported.  }


\Keywords{FIC,model comparison,AIC,BIC}

\begin{document}

\section{Introduction: principles for model comparison}

To compare a set of statistical models fitted to the same data by maximum likelihood, it is common to rank them according to some ``criterion".  For example, Akaike's information criterion (AIC, \citet{aic}) takes the form 

$$ -2\log \ell(\hat\theta ; \mathbf{x}) + 2p $$
where $\ell(\hat\theta ; \mathbf{x})$ is the maximised likelihood for the model fitted to the dataset $\mathbf{x}$, the likelihood is maximised at parameters $\hat\theta$, and $p$ is the number of parameters. 

The Bayesian information criterion (BIC, \citet{schwarz}) is

$$ -2\log \ell(\hat\theta ; \mathbf{x}) + p\log(n) $$

where $n$ is the number of observations. These two criteria are based on very different principles.  Thus they often rank models differently.  The AIC is designed to choose models with better predictive ability, thus it tends to favour bigger models as the sample size increases.  BIC is an approximation to Bayesian model comparison by Bayes factors, and prefers models with higher posterior probability under an implicit weak prior \citep[with an amount of information equivalent to one observation, see][]{kass:wasserman}.  If there is a ``true" model, the BIC will tend to select it as the sample size increases.  In many situations there may not be a true model, and collecting more data will uncover more complexity in the process generating the data, in which case AIC may be more suitable.   See, e.g., \citet{burnham:anderson:book}, \citet{claeskens:hjort:book} for more theory behind these, and other similar model comparison criteria.

Both of these methods give a single ranking of models according how well they fit a given dataset.   However, different models may be better for different purposes, for example to estimate different quantities of interest.  Such quantities are termed the ``focus'' of a model, and this is the idea behind ``focused" model comparison.

This paper describes the \pkg{fic} \proglang{R} package, available
from   This compares a set of
models according to how accurately they estimate a focus quantity.
The models should all be nested in a single ``wide'' model that is
assumed to generate the data.   Section \ref{sec:formulae} gives an
informal introduction to the principles, and sets out the formulae for
various related ``focused'' model comparison criteria,  as developed
by \citet{fic} and \citet{claeskens:hjort:book}.  Section
\ref{sec:software} explains how general-purpose software is
constructed to enable these quantities to be evaluated for any class
of models and focuses, with the minimum of user effort.   A worked
example of using the \pkg{fic} package for covariate selection in
generalised linear models is given in Section \ref{sec:example}. The extension of the theory to deal with Cox proportional hazards regression models is described in Section \ref{sec:cox} with a worked example. 
  A set of additional package vignettes demonstrate the use of the package in a variety of other situations. 

\section{Focused model comparison: principles and formulae}
\label{sec:formulae}

Suppose the range of models we are willing to use is bounded by 

\begin{itemize}
\item a \emph{wide model}, in which all models we would use are nested, with parameters $(\btheta,\bgamma)$,

\item a \emph{narrow model}, the smallest model we might be willing to use, defined by setting $\bgamma=\bgamma_0$ in the wide model. 
\end{itemize}

A typical example is covariate selection in regression models, where
$\btheta$ are the coefficients of ``protected'' covariates which are
always included, $\bgamma$ are the coefficients of optional covariates
that may or may not be included, and $\bgamma_0=0$.   More generally, we wish to choose the appropriate level of flexibility for the distribution of some outcome.  For example, choosing between a Poisson versus a Negative Binomial model for a count outcome, or an exponential versus a Weibull survival model, where the former is a constrained version of the latter.

Suppose also that the purpose of the model is to estimate some \emph{focus} quantity, which could be any function of the basic parameters 

$$\mu = g(\btheta, \bgamma)$$

In focused model comparison, we prefer models which give better estimates of $\mu$.  A typical way to define ``better" is by the \emph{mean squared error}.  The mean squared error of the estimate $\hat\mu_S$ under a submodel $S$ of the wide model, compared to the true value $\mu$, is 

$$E\left\{(\hat\mu_S - \mu)^2\right\}$$ 

This expectation is calculated under the assumption that the data are generated from the wide model.   While we believe the wide model is the most realistic, we also accept that there may not be enough data to give sufficiently precise estimates of $\mu$.  Therefore we are willing to accept some bias in this estimate, in return for a smaller variance, by selecting a smaller model than the wide model.  The submodel $S$ with the lowest mean squared error is the one which makes the optimal trade-off between bias and variance. 

The mean squared error $MSE_S$ under model $S$ can be decomposed as a sum of the squared bias $B_S^2$ and the variance $V_S$. 

\begin{equation}
  \label{eq:mse}
\begin{aligned}
MSE_S = E\left\{(\hat\mu_S - \mu\right)^2\}  & = \left\{E(\hat\mu_S)- \mu\right\}^2 + E\left\{(\hat\mu_S -  E(\hat\mu_S))^2\right\} \\
    & = B_S^2 + V_S
\end{aligned}
\end{equation}

Estimators for these quantities are constructed by \citet{fic} under
an asymptotic framework in which the data are assumed to be $n$
independent identically distributed observations from the wide model, but
reparameterised so that $\bgamma = \gamma_0 + \bdelta / \sqrt{n}$.
Thus as the sample size $n$ increases, we aim to detect more subtle
departures from the narrow model.

An obvious estimate for the bias $B_S$ is  
\( \hat{B_S} = \hat{\mu_S} - \hat{\mu_W} \),
where $\hat\mu_W$ is the estimate of the focus quantity under the wide model, which is assumed to be unbiased. 
However, \citet[][]{fic} derive a more accurate, asymptotically unbiased, estimate for the
\emph{squared} bias as 

\begin{equation}
  \label{eq:sqbias:est}
%  \widehat{B_S^2} = ( \hat\psi_{W} - \hat\psi_S )^2 / n
  \widehat{B_{S}^{2}} = \hat\omega^\top (I - G_S) (\hat\bdelta \hat\bdelta^\top - Q)  (I - G_S) \hat\omega 
%  \widehat{B_S^2} = ( \hat\psi_{W} - \hat\psi_S ) - omega (I - G_S) Q (I - G_s) \omega
\end{equation}

where:
\begin{itemize}

% bdelta is bias of par ests in narrow model on sqrtn scale  
% omegabdelta is bias of focus estimate in narrow model on sqrtn scale 

\item 
$\hat\bdelta = \hat\bgamma\sqrt{n}$, where $\hat\bgamma$ is the
estimate of $\bgamma$ under the wide model.

\item $\omega^\top\bdelta$ is the bias of the estimate of $\sqrt{n}\mu$
  under the narrow model $N$, that is, the asymptotic mean of
  $\sqrt{n}(\hat\mu_N - \mu)$.  Thus $\omega$ acts as a linear
  transformation from the biases of the basic parameters $\bgamma$ to
  the biases of the focus parameter $\mu$.

%\item 
%  and 

\item $\omega$ is estimated as $\hat\omega = J_{10}J_{00}^{-1}\frac{d\mu}{d\theta} -
  \frac{d\mu}{d\gamma}$ using Taylor approximation arguments, where
  $J$ is the information (inverse covariance) matrix under the wide
  model \footnote{Note that \citet{claeskens:hjort:book} calculate the
    MSE of the focus multiplied by $\sqrt{n}$ rather than the MSE of
    the focus, thus they define $J$ instead as the information matrix divided
    by $n$.  Thus their definition of $\omega$ is the same as ours
    since $n$ cancels, but their definitions of $Q_S$ and $Q$ are different. }
and subscripts $0$ and $1$ select the rows and columns forming the submatrices of $J$ that correspond to parameters $\btheta$ and $\bgamma$ respectively.  The partial derivatives of the focus $\mu$ are evaluated at the estimates from the wide model.

%% TODO should it be \pi_S not \pi ? 

\item $G_S = \pi^\top Q_S \pi  Q^{-1}$ is an estimate of the transformation that maps the wide model estimate of $\bdelta$ to the submodel $S$ estimate, where 
    $Q_S = (\pi Q^{-1} \pi^\top)^{-1}$, $Q^{-1} = J_{11}$ and $\pi$ is the projection matrix consisting of 0s and 1s which maps a vector of the same length as $(\btheta,\bgamma)$ to a subvector containing the elements corresponding to submodel $S$. 
  
\end{itemize}

Occasionally the estimate~(\ref{eq:sqbias:est}) of squared bias is
negative.  \citet{fic} also present an adjusted version of~(\ref{eq:sqbias:est}), 
which assumes the bias is zero in these cases. 
\begin{equation}
  \label{eq:sqbias:adj}
  \widehat{B^{2*}_S} = max\left\{0, \widehat{B^2_S} \right\} 
\end{equation}

The corresponding estimate of the bias is 

\begin{equation}
  \label{eq:bias:adj}
  \widehat{B_S*} = sign(\hat\psi_{W} - \hat\psi_S)\sqrt{\widehat{B_S^{2*}}}  
\end{equation}
  where $\hat\psi_{W} = \hat\omega^\top\hat\bdelta$ and
 $\hat\psi_S = \hat\omega^\top G_S \hat\bdelta$ are estimates of
 $\omega^\top\bdelta$ under the wide model and submodel respectively.

The estimate for the variance of $\hat{\mu}_S$ under the wide model, derived by \citet{fic}, is 
\[\widehat{V_S} = (\hat{\tau_0}^2 + \hat{\omega}^\top Q_S^0 \hat{\omega}) \] 

where $\hat{\tau_0}^2$ estimates the variance of the narrow model focus estimate (using ``delta method" principles, $\hat\tau_0^2 = \frac{d\mu}{d\theta}^\top J_{00}^{-1} \frac{d\mu}{d\theta}$), and the additional term $(\hat{\omega}^\top Q_S^0 \hat{\omega})$ is the increase in variance we accept by using a wider but still misspecified model $S$, with $Q_S^0 = \pi^\top Q_S \pi$.

Thus we compare models on the basis of the root mean squared error, estimated by
\begin{equation}
  \label{eq:rmse}
  \sqrt{\widehat{MSE_S}} = \sqrt{\widehat{B_S^2} + \hat{V_S}}
\end{equation}

or the alternative version which, while not asymptotically unbiased, is based on an interpretable estimate~(\ref{eq:bias:adj}) of the bias. 
\begin{equation}
  \label{eq:rmse:adj}
  \sqrt{\widehat{MSE_S^*}} = \sqrt{\widehat{B_S^{2*}} + \hat{V_S}}
\end{equation}

\citet{fic} define the ``focused information criterion'' (FIC), which has a slightly simpler form due to excluding terms common to all submodels $S$, and is related to the MSE as
\begin{equation}
  \label{eq:fic}
  FIC_S = n \widehat{MSE_S} - \hat\tau_0^2 + \hat\omega^\top Q \hat\omega  
\end{equation}
Models with lower FIC give better estimates of the focus quantity.  However we prefer to use the (root) MSE as the model comparison statistic, due to its direct interpretation as the error of the focus estimate. 




\subsection{Model comparison averaged over a range of focuses}

Often we want a model that performs well in a range of situations.    In covariate selection problems, for example, we might want to estimate a focus quantity accurately for a defined range of covariate values.  Thus the quantities defined above may now depend on the covariate value $u$. We might simply define the ``averaged MSE'' for submodel $S$,

\[ MSE_S^{(ave)} = \int MSE_S(u) dW(u) du \] 

as a weighted average of the mean squared errors~(\ref{eq:mse}) for focuses defined by different covariate values $u$, weighted by their prevalence $W(u)$.   However \citet{claeskens:hjort:book} derived an alternative formula, so that if correction of the squared bias estimate, as in~(\ref{eq:sqbias:adj}), is required, it only needs to be performed once:
\begin{equation}
  \label{eq:amse}
%  AMSE = max(IS,0) + IIS
  \widehat{MSE_S^{(ave)}} = I_S + \widehat{V_S^{(ave)}}
\end{equation}
where
\begin{equation}
  \label{eq:bias:ave}
  I_S =  Tr((I - G_S)(\hat\bdelta \hat\bdelta^\top - Q)(I - G_S)^\top A)  
\end{equation}
is an estimate of the squared bias, and 
\begin{equation}
  \widehat{V_S^{(ave)}} = \tau_0^{2(ave)} + II_S
\end{equation}
is an estimate of the variance, with $\tau_0^{2(ave)}=\int \tau_0(u) du$, $II_S = Tr(Q_S^0 A)$, and 
\begin{eqnarray*}
  A & = & J_{10} J_{00}^{-1}B_{00}J_{00}^{-1}J_{01} - J_{10}J_{00}^{-1}B_{01} - B_{10}J_{00}^{-1}J_{01} + B_{11} \\
B & =& \int 
\left( 
  \begin{array}{l}
  d\mu(u)/d\btheta\\
  d\mu(u)/d\bgamma
\end{array}
\right)
\left( 
\begin{array}{l}
  d\mu(u)/d\btheta\\
  d\mu(u)/d\bgamma
\end{array}
\right)
^\top dW(u) \quad = \quad
\left( 
\begin{array}{ll}
  B_{00} & B_{01}\\
  B_{10} & B_{11}
\end{array}
\right)
\end{eqnarray*}

An analogue of the alternative MSE estimate~(\ref{eq:rmse:adj}) can then be defined, based on a bias estimator which is corrected when~(\ref{eq:bias:ave}) is negative, by substituting $max(IS,0)$ for $IS$ in~(\ref{eq:amse}). 

If the focus is defined as the log density for one observation $y$, and if we average over
the observed distribution of $y$ in the data, then model comparison using this
procedure is asymptotically equivalent to model comparison by 
AIC \citep{claeskens:hjort:book,fic}.  See the additional \pkg{fic} package
vignette on linear models for an example. 


\section{Software for focused model comparison}
\label{sec:software} 

In order to calculate the MSE~(\ref{eq:mse}) for focused model comparison
of a submodel $S$ against a wide model, we simply need to know
\begin{itemize}
\item the estimates $\hat{\btheta}_W$ and $\hat{\bgamma}_W$ and their covariance matrix under the wide model,

\item the focus function $\mu(\btheta,\bgamma)$ 

\item the definition of which parameters are included in submodel $S$
  and which are included in the narrow model $N$. 

\end{itemize} 

Additionally if we want to know the focused information criterion~(\ref{eq:fic}) 
we will need the sample size $n$. 

Derivatives of the focus function, required by~(\ref{eq:sqbias:est}), can
be calculated numerically in general, for which robust software exists
--- \pkg{numDeriv} \citep{numDeriv} is used here.  Analytic
derivatives are implemented in \pkg{fic} for two built-in focuses (the
outcome probability in logistic regression, and the mean outcome in
linear regression), but we have noticed no loss in accuracy from using
numerical methods.

This knowledge allows the \pkg{fic} package to implement focused model
comparison for any class of models and focuses.  The estimates and
covariance matrix are routinely computed by functions for fitting
models by maximum likelihood.  Therefore the user simply needs to
supply

\begin{itemize}
\item an \proglang{R} object containing the wide model 
\item a definition of the focus function $\mu()$
\item indicators for what submodels they want to compare 
\end{itemize}

In addition the software needs to know where to look inside the wide
model object for the estimates and covariance matrix, but this
information can be built into the software for a range of
commonly-used models.



\section{Example: Covariate selection in logistic regression }
\label{sec:example}

The use of the \code{fic} package is illustrated for covariate selection in logistic regression, using the example from \citet{claeskens:hjort:book} (Example 6.1).  The dataset was originally presented by \citet{hosmer:lemeshow:firstedition}. Data are taken from $n=189$ women with newborn babies, and the binary outcome is whether the baby is born with a weight of less than 2500g.   We build a logistic regression model to predict the outcome, but are uncertain about what covariates should be included. 

The data are provided as an object \code{birthwt} in the \pkg{fic} package.  This is the same as \code{birthwt} in \pkg{MASS} \citep{venables:ripley} with the addition of a few extra columns defining interactions and transformations as in \citet{claeskens:hjort:book}. 

The following covariates are always included (coefficient vector $\btheta$)

\begin{itemize}
\item  $x_1$ Weight of mother in kg, \code{lwtkg}
\end{itemize}


The following covariates will be selected from (coefficient vector $\bgamma$)

\begin{itemize}

\item $z_1$ age, in years, \code{age}

\item $z_2$ indicator for smoking, \code{smoke}

\item $z_3$ history of hypertension, \code{ht}

\item $z_4$ uterine irritability, \code{ui}

\item interaction $z_5 = z_1 z_2$ between smoking and age, \code{smokeage}

\item interaction $z_6 = z_2 z_4$ between smoking and uterine irritability, \code{smokeui}
\end{itemize}

Firstly the wide model, that includes all the above covariates, is defined and fitted.  

<<message=FALSE>>=
library("fic")
wide.glm <- glm(low ~ lwtkg + age + smoke + ht + ui + smokeage + smokeui, 
                data = birthwt, family = binomial)
@ 

The \emph{focus function} is then defined.  This should be an \proglang{R}
function, mapping the parameters \code{par} of the wide model to the
quantity of interest.   The focus can have any number of additional 
arguments.  Typically an argument called \code{X} is used
to supply covariate values at which the focus function should be
evaluated.  \footnote{Note we could also have written the focus
  function as \code{function(par,X)plogis(par[1] + X \%*\% par[-1])}
then we could have omitted the dummy covariate values of 1 for the intercept at the start of
\code{vals.smoke} and \code{vals.nonsmoke}.}
 Here we take the probability of low birth weight as the focus, for two covariate categories: 
 
\begin{enumerate}
\item smokers with average or typical values of the other covariates.  These values are given in the order supplied when specifying the model (for smokers: intercept, \code{lwtkg}=58.24, \code{age}=22.95, \code{smoke}=1, \code{ht}=0, \code{ui}=0, \code{smokeage}=22.95, \code{smokeui}=0).

\item non-smokers with average values of the other covariates

\end{enumerate}

<<>>=
prob_logistic <- function(par, X)plogis(X %*% par)
vals.smoke <-    c(1, 58.24, 22.95, 1, 0, 0, 22.95, 0)
vals.nonsmoke <- c(1, 59.50, 23.43, 0, 0, 0, 0, 0)
X <- rbind("Smokers" = vals.smoke, "Non-smokers" = vals.nonsmoke)
@
We can illustrate this function by calculating the probability of low birth weight, given the parameters of the fitted wide model, for each group.  This is about twice as high for smokers. 
<<>>=
prob_logistic(coef(wide.glm), X=X)
@

The \code{fic} function can then be used to calculate the mean squared error of the focus for one or more given submodels.  For illustration we will compare two models, both including maternal weight, one including age and smoking, but the other including age, smoking and hypertension. 

<<>>=
mod1.glm <- glm(low ~ lwtkg + age + smoke, data = birthwt, family = binomial)
mod2.glm <- glm(low ~ lwtkg + age + smoke + ht, data = birthwt,
                family = binomial)
@

We supply the following arguments to the \code{fic} function.

\begin{itemize}

\item \code{wide}: the fitted wide model.  All the model fit statistics are computed using the estimates and covariance matrix from this model.   \code{fic} will automatically recognise that this is a GLM fitted by the \code{glm} function in R, and extract the relevant information. 

\item \code{inds}: indicators for which parameters are included in the submodels, that is, which elements of $(\btheta,\bgamma)$ are fixed to $\bgamma_0$.  This should have number of rows equal to the number of submodels to be assessed, and number of columns equal to $dim(\btheta) + dim(\bgamma)$, the total number of parameters in the wide model, 8 in the case of \code{wide.glm}, which includes the intercept and the coefficients of seven covariates.   It contains 1s in the positions where the parameter is included in the submodel, and 0s in positions where the parameter is excluded.  This should always be 1 in the positions defining the narrow model, as specified in \code{inds0} below.  If just one submodel is to be assessed, \code{inds} can also be supplied as a vector of length $dim(\btheta) + dim(\bgamma)$.    
  
  Note that \code{inds} indexes \emph{parameters} rather than \emph{linear model terms}, that is, in covariate selection problems where a variable is a factor with more than two levels, \code{inds} should contain separate entries for the coefficient of each factor level relative to the baseline level, not just one entry indicating the presence of the factor as a whole.   A utility to construct this in the presence of factors is illustrated in Section~\ref{sec:cox:example}. 

\item \code{inds0} vector of indicators for which parameters are included in the narrow model, in the same format as \code{inds}.  This can be omitted, in which case the narrow model is assumed to be given by the first row of \code{inds}.  In this case, just the first two parameters are included, the intercept and the coefficient of \code{lwtkg}.
\end{itemize}
<<>>=
inds <- rbind(mod1 = c(1,1,1,1,0,0,0,0),
              mod2 = c(1,1,1,1,1,0,0,0))
inds0 <- c(1,1,0,0,0,0,0,0)
@ 
\begin{itemize}
\item \code{focus} the focus function, and \code{X}, the alternative
  covariate values to evaluate it at.  As well as an \proglang{R} function, 
  this argument can alternatively be supplied as a character string
  naming a built-in focus function supplied by the \pkg{fic} package.
  Currently these just include \code{"prob_logistic"}, the outcome
  probability in a logistic regression, and \code{"mean_normal"}, the
  mean outcome in a normal linear regression.

\end{itemize}

The main \code{fic} function then returns the model fit statistics and the estimate of the focus quantity for each model. 

<<>>=
fic1 <- fic(wide = wide.glm, inds = inds, inds0 = inds0,
            focus = prob_logistic, X=X)
fic1
@

The object returned by \code{fic} is a data frame containing one row for
each combination of focus covariate values indicated in the column \code{vals} and submodels indicated in the column \code{mods}.   The focus estimate is returned in the final column \code{focus}, while the remaining columns contain the following model comparison statistics:
\begin{itemize}
\item \code{rmse} The root mean squared error of the submodel focus
  estimate, calculated assuming the wide model is true (Equation~\ref{eq:rmse}),
\item \code{rmse.adj} Alternative estimate of the root mean squared error (Equation \ref{eq:rmse:adj}) based on the adjusted bias estimator (\ref{eq:sqbias:adj}--\ref{eq:bias:adj}).
\item \code{bias} The estimated bias $\widehat{B_S*}$ (Equation \ref{eq:bias:adj}), which will be zero if $\widehat{B_{S}^{2}}$ (\ref{eq:sqbias:est}) is negative. 
\item \code{se} The standard error $\sqrt{\hat{V}}$ of the submodel focus estimate, calculated assuming the wide model is true.
\item \code{FIC} The FIC as originally defined by \citet{fic} (Equation~\ref{eq:fic}).
\end{itemize}

The submodels are fitted automatically within the \code{fic} function
in order to produce the focus estimate, so it was not really necessary
to fit \code{mod1.glm} and \code{mod2.glm} by hand, as above.  As the
wide model has class \code{glm}, it is recognised as a GLM, so
\code{fic} assumes that our submodels correspond to models with
different covariates included, as indicated by \code{inds}.  The focus
estimates from the submodels can then be returned alongside the model
comparison statistics.

As well as the specific covariate categories, \code{fic} calculates 
model comparison statistics which are averaged over the categories, indicated by a value of \code{ave} in the column \code{vals}.  An equally-weighted average is computed by default. Arbitrary weights can be supplied in the argument \code{wt} to \code{fic}.

Recall that \code{mod2} contains one more covariate than \code{mod1}.  For each of the
two focuses, and the average, the unadjusted and adjusted bias
estimates are lower due to the inclusion of this covariate, while the
standard error \code{se} is higher.  Given the lower \code{rmse} and
\code{rmse.adj} under \code{mod2}, the reduction in bias is deemed to
be worth the increase in uncertainty.


\subsection{Comparing a wide range of models }

We may want to examine a broad range
of models, particularly in regression contexts.  The function \code{all_inds} (a wrapper around
\code{expand.grid}) creates a matrix of indicators that defines all
submodels with different covariates, spanned by a given wide model (here \code{wide.glm}) and a
narrow model (here defined by \code{inds0}).  This
function works only for classes of model objects \code{x} for which the
\code{terms(x)} function is understood, which includes standard \proglang{R}
regression models such as \code{lm} and \code{glm}. Factors are handled naturally. 
<<>>=
combs <- all_inds(wide.glm, inds0)
@
The resulting matrix can be used as the \code{inds} argument to
\code{fic} to compare all submodels in this example, again for a focus defined by the
probability of low birth weight at covariate values defined by
\code{X}.   Before calling \code{fic} again, we redefine
\code{combs} to exclude
models with interactions but not both corresponding main effects. 
<<>>=
combs <- subset(combs,
                !((smoke==0 & smokeage==1) |
                  (smoke==0 & smokeui==1) |
                  (age==0 & smokeage==1) |
                  (ui==0 & smokeui==1)))
ficres <- fic(wide = wide.glm, inds = combs, inds0 = inds0,
              focus = prob_logistic, X = X)
@

Notice that some of the \code{rmse} elements of \code{ficres} are
\code{NaN}, since the first squared bias estimate $\widehat{B^2}$ is
negative.  The alternative estimate~(\ref{eq:rmse:adj}),
\code{rmse.adj}, might be preferred in these cases, since it is
consistent with the bias and variance estimates.

A comparison of many models can be illustrated by a scatterplot of
the focus estimate against the root MSE of each submodel.  The default
\code{plot} method for \code{fic} objects accomplishes this using base
\proglang{R} graphics: try \code{plot(ficres)}. 
Alternatively a graph can be plotted using \code{ggplot2} if this
package is installed.   This is illustrated in Figure~\ref{fig:ficres}. 
<<birthwt,include=FALSE>>=
ggplot_fic(ficres)
@ 
\begin{figure}[h]
  \centering
  \includegraphics{fic-birthwt-1}
  \caption{Focused comparison of logistic regression models for
    low birth weight with different covariates}
  \label{fig:ficres}
\end{figure}
There is one panel for each of the two covariate categories (smokers and
non-smokers) defining the focus (probability of low birth weight) and
an average over the two categories. 
The solid blue line is the focus estimate under the wide model, and
the dashed blue line is the focus estimate under the narrow model.
 An informal illustration of the uncertainty around the estimate of
 the focus quantity from
each submodel is given by the estimate $\pm 1.96 \times
\sqrt{\hat{V}}$.  Note that this underestimates the uncertainty 
if inference is based on a selected model --- a 
better confidence interval would then account for the range of models
being searched over (``post-selection'' inference, see,
e.g., \citet{hjort:claeskens,berk:postselection}).

Each submodel is labelled faintly using the row
names of the matrix supplied as the \code{inds} argument to
\code{fic}.  In this case, these names were automatically
constructed by the function \code{all_inds} and contain a string of binary 0/1 
indicators for the inclusion of eight parameters.
For smokers, a group of smaller models including the narrow model
(labelled \code{11000000}) give
estimates of the probability of low birth weight with the 
lowest MSE, while by contrast, for non-smokers, the wide model
(labelled \code{11111111}) and similar larger
models give the most accurate estimates of the focus quantity.   Note that in this
dataset, there are 115 non-smokers and 74 smokers, thus more data
enables bigger models to be identified for non-smokers.  Wider 
models are also preferred for describing the average population.

The model with the optimal MSE could be identified if necessary,
with the \code{summary} method for objects returned by \code{fic}.
This lists the parameters included in the optimal model for each
focus. 
<<>>=
summary(ficres)
@ 
Though in general, if there are
multiple models with similar estimation performance and different
results, the reasons for their differences should be explored in
greater depth with the aid of background knowledge. 

\subsection[Calling ``fic'' for an unfamiliar class of models ]{Calling \pkg{fic} for an unfamiliar class of models }

Above, the \code{fic} function recognised the fitted model objects as GLMs, that is, objects of class \code{"glm"} returned by the \code{glm()} function in base \proglang{R}.
But the package can be used to calculate focused model comparison statistics for any class of models, not just the special classes it recognises. To do this, it needs to know where two things are stored inside the fitted model objects:

\begin{enumerate}
\item the vector of maximum likelihood estimates $(\hat\btheta,\hat\bgamma)$,

\item  the covariance matrix of the maximum likelihood estimates, $J^{-1}$.  
\end{enumerate}

plus, if the ``classic'' FIC is required, also the number of
observations $n$ contributing to the model fit.  Given a fitted model
object called \code{mod}, the \code{fic()} function assumes by default
that \code{coef(mod)}, and \code{vcov(mod)} respectively return
the estimates and covariance matrix.  Likewise it assumes that \code{nobs(mod)}
returns the number of observations if FIC is required. 

If one or more of these assumptions is not true, the defaults can be
changed by supplying the argument \code{fns} to \code{fic()}, which
should be a named list of three components.  Each component should be
a function taking one argument (the fitted model) which extracts the
required information from the fitted model and returns it, as in the
following example.  Here the first component of \code{fns} is a function which, when applied to a \code{glm} object, returns the maximum likelihood estimates of the regression coefficients. 

<<eval=FALSE>>=
fns <- list(coef = function(x)coef(x),
            nobs = function(x)nobs(x),
            vcov = function(x)vcov(x))
fic1 <- fic(wide = wide.glm, inds = inds, inds0 = inds0,
            focus = prob_logistic, fns = fns, X = X)
@

A full worked example of a using \code{fic} for a novel class of models, defined and fitted by custom \proglang{R} functions, is given in the package vignette ``Examples of focused model comparison: skew-normal models''. 


\subsection{Other classes of models} 

The \pkg{fic} package also has built-in methods for the following
classes of models. 

\begin{itemize}

\item Linear models fitted with \code{lm} in base \proglang{R}.  Examples  
  of covariate selection and polynomial order 
  selection are given in an additional \pkg{fic}
  vignette on linear models, which 
  also illustrates the use of quantiles of the outcome distribution
  as focuses, and focuses that depend on variance parameters as well
  as linear model coefficients. 

\item Parametric survival models fitted with \code{flexsurvreg} in the
  \pkg{flexsurv} package \citep{flexsurv}, or \code{survreg} in the
  \pkg{survival} package \citep{survival-package}.

  The additional \pkg{fic} vignette ``Examples of focused model
  comparison: parametric survival models'' illustrates the use of \code{fic}
  to compare parametric baseline survival functions of different levels
  of complexity.  \code{fic} needs to be set up carefully here since 
  the same model can be presented in several
  different parameterisations.  Focused comparison requires the
  submodels to be defined by fixing parameters of the wide model to
  special values. 

\item Multi-state models fitted with the \pkg{msm} package \citep{msmjss}.

  A multi-state model might consist of several regression models, one
  for each transition between states in a multi-state structure, which
  poses a challenge to defining and interpreting a focused model
  comparison.  The package vignette ``Examples of focused model
  comparison: multi-state models'' gives a worked example.  Another
  feature of this example is that the focus of interest is a complicated model summary
  function provided by the \pkg{msm} package, which needs to be
  converted carefully into the format expected by \pkg{fic}.

\end{itemize}


\section{Focused covariate selection in Cox proportional hazards regression} 
\label{sec:cox} 

In a Cox regression model, time-to-event outcomes $t_i$ are observed
on individuals $i$, potentially with right-censoring.  At time $t$,
individual $i$ is assumed to have a hazard $h_i(t)$ which is
proportional to their covariate values.  We compare
models that have different sets of covariates.  In the most general
``wide'' model, $h_i(t) = h_0(t)\exp(\btheta^\top\x_i + \bgamma^\top \z_i)$.
The baseline hazard $h_0(t)$ is left unspecified, while $\btheta$ and
$\bgamma$ are estimated by maximum partial likelihood.

We compare submodels $S$ of this wide model, which include different
subsets of covariates, according to how accurately they estimate some
focus quantity $\mu = \mu(\btheta, \bgamma, H_0() | \mathbf{x}, t)$,
where $H_0()$ is the cumulative baseline hazard function, which 
can be estimated nonparametrically by various methods. 
Typical focus quantities might depend on time $t$ as well as covariate
values $\mathbf{x}$, e.g., the probability
$S(t|\mathbf{x},\btheta, \bgamma, H_0())$ that a person with
covariates $\mathbf{x}$ will survive $t$ years.

Again the mean squared error $MSE_S = B_S^2 + V_S$ of $\mu_S$, the focus
quantity under submodel $S$, has an asymptotically unbiased estimator
$\widehat{B_S^2} + \widehat{V_S}$, 
like that in Section~\ref{sec:formulae}, which was derived by \citet{fic:cox} 
under the same theoretical principles:
\begin{eqnarray*}
   % \widehat{B_S^2} & = & (\hat\psi_W - \hat\psi_S)^2/n, \quad
   %                       \hat\psi_W = (\hat\omega -
   %                       \hat\kappa)^\top\hat\bdelta, \quad \hat\psi_S =
   %                       (\hat\omega - \hat\kappa)^\topG_S\hat\bdelta\\
\widehat{B_S^2} & = & (\hat\omega - \hat\kappa)(I - G_S) (\hat\bdelta
                      \hat\bdelta^\top - Q)  (I - G_S) (\hat\omega  - \hat\kappa)\\
  \widehat{V_S} & = &  \left\{\hat\tau_0(t)^2 + (\hat\omega - \hat\kappa)^\top Q_{S}^0 (\hat\omega - \hat\kappa)\right\}/n \\
 \hat\omega & = &  J_{10} J_{00}^{-1} \frac{d\mu}{d\btheta} - \frac{d\mu}{d\bgamma}\\
 \hat\kappa(t) & = &  (J_{10} J_{00}^{-1} F_0(t) - F_1(t))\frac{d\mu}{dH_0},
\end{eqnarray*}
where $Q_S^0,G_S,\hat\bdelta,J_{10},J_{00}$ are defined as in Section~\ref{sec:formulae}, except in terms of the partial likelihood instead of the likelihood.   Newly-defined quantities are
\[F(t) = \int_0^t \left\{G_n^{(1)}(u) / G_n^{(0)}(u)\right\}  dH_0(u) = \left(
  \begin{array}{l}
    F_0(t) \\
    F_1(t)
  \end{array}\right)\]
  where $F_0(t)$ and $F_1(t)$ have $p$ and $q$ components respectively, 
\begin{eqnarray*}
  G_n^{(0)}(u) &=& \frac{1}{n}\sum_{i=1}^n Y_i(u) \exp(\btheta^\top\x_i + \bgamma^\top\z_i)\\
  G_n^{(1)}(u) &=& \frac{1}{n}\sum_{i=1}^n Y_i(u) \exp(\btheta^\top\x_i + \bgamma^\top\z_i) \left(\begin{array}{l}x_i\\z_i\end{array}\right)
\end{eqnarray*}
both evaluated at the estimates of $\btheta$, $\bgamma$ and $H_0()$
from the wide model, 
\[
\tau_0(t)^2 = \Big(\frac{d\mu}{dH_0}\Big)^2\int_0^t
\frac{dH_0(u)}{g^{(0)}(u,\beta,0)} + \left\{\frac{d\mu}{d\beta} -
\frac{d\mu}{dH_0} F_0(t)\right\}^\top J_{00}^{-1}\left\{\frac{d\mu}{d\beta} -
\frac{d\mu}{dH_0} F_0(t)\right\}
\]
and  $Y_i(t) = I(t_i \geq t)$ is the indicator
for individual $i$ being at risk at time $t$.

As before, if $\widehat{B_S^2} < 0$, we could also use an alternative
estimator for $MSE_S$ which assumes that the bias is zero in these cases. 


\subsection{Example: Malignant melanoma}
\label{sec:cox:example} 

To illustrate the method, \citet{fic:cox} study a dataset from 205 patients with malignant melanoma, earlier analysed in detail by \citet{andersen:1993}.  This dataset is also provided in the \pkg{fic} package.

We compare models ranging from a wide model \code{wide} that includes 7 terms in the regression model formula, to a narrow model that includes only sex.  
<<>>=
library("survival")
wide <- coxph(Surv(years, death==1) ~ sex + thick_centred + infilt + epith + 
                ulcer + depth + age, data = melanoma)
@ 
In this example, we need to deal with \emph{factor} terms in the model when setting up the \code{inds} and \code{inds0} indicators to supply to \code{fic}.    Specifically, \code{infilt} and \code{depth} are factors with 4 and 3 levels respectively, represented in the model by 3 and 2 model parameters respectively, instead of one parameter for each.  The remaining terms in the model are each associated with one parameter.  Thus the wide model, with 7 terms, includes 10 parameters. 

The function \code{expand_inds} can be used to construct \code{inds} or \code{inds0} terms in the presence of factors\footnote{This function only works for classes of models for which the \code{model.matrix} function is understood and returns objects with an \code{"assign"} attribute. This includes all the commonly-used models in base \proglang{R}.}.   We supply a vector of 7 elements, indicating the presence or absence of each of the 7 terms in the model formula.  In this case, only the first term, \code{sex}, is included in the narrow model.  Then to create an \code{inds0} vector of 10 elements, indicating the presence or absence of each of the wide model's 10 parameters in the narrow model, we call 

<<>>=
inds0 <- expand_inds(c(1,0,0,0,0,0,0), wide)
inds0
@ 
Note that in Cox regression there is no intercept parameter, therefore the model parameters include only the regression coefficients.   In fully parametric regression models, for example GLMs, the vector supplied to \code{expand_inds} should contain an additional element indicating the presence of the intercept.

The \code{fic} package includes three built-in alternative focuses, as specified through the \code{focus} argument to \code{fic}.
\begin{itemize}
\item \code{focus="hr"}: the hazard ratio between an individual with covariates $X$ and an individual with covariates 0 (which by definition of the Cox model is independent of time $t$)
\item \code{focus="survival"}: the survival probability at time $t$, for an individual with covariates $X$
\item \code{focus="cumhaz"}: the cumulative hazard at time $t$, for an individual with covariates $X$
\end{itemize}
The required covariate values and/or time point(s) are supplied as the \code{X} and \code{t} arguments to \code{fic}.

It is possible to supply alternative focuses, though this is slightly
trickier than for standard full likelihood models. 
It requires the user to supply a list of three functions as the
\code{focus} argument to \code{fic}, one returning the focus, and two returning
its derivatives with respect to $(\btheta,\bgamma)$ and $H_0(t)$
respectively.  The functions take arguments \code{par}, \code{H0},
\code{X} and \code{t} representing the parameter vector, cumulative
hazard, covariate matrix and time.   To see some examples, print the following
lists of functions, which are used for the three built-in focuses above.

<<eval=FALSE>>=
list(focus = fic:::cox_hr,
     deriv = fic:::cox_hr_deriv,
     dH = fic:::cox_hr_dH)
list(focus = fic:::cox_cumhaz,
     deriv = fic:::cox_cumhaz_deriv,
     dH = fic:::cox_cumhaz_dH)
list(focus = fic:::cox_survival,
     deriv = fic:::cox_survival_deriv,
     dH = fic:::cox_survival_dH)
@ 

In the melanoma example, all possible submodels spanning the wide and
narrow model are compared. As before, a matrix of indicators
describing these models is constructed, and the submodels are fitted
automatically within the \code{fic} function.
<<>>=
combs <- all_inds(wide, inds0)
@

The focus is defined as the 5 year survival probability
\code{fic(...,focus = "survival", t = 5)} for the covariate values defined by
\code{X}, here taken as men and women separately, with
average age and mean observed tumour thickness among men and women,
infiltration level 4,
epithelioid cells and ulceration present, and invasion depth 2.  The
utility \code{newdata_to_X}  is used to convert the user-defined data
frame \code{newdata} that identifies these covariate values, with one 
variable per covariate or factor, to a design matrix \code{X}, with
one column for each of the 10 parameter values.\footnote{Note the
  intercept is excluded here, as Cox models don't have a regression
  intercept term.  In GLMs, the design matrix usually includes an intercept.}
<<>>=
newdata <- with(melanoma,
                data.frame(sex = c("female","male"),
                           thick_centred = tapply(thick_centred, sex, mean),
                           infilt = 4, epith = 1, ulcer = 1, depth = 2,
                           age = tapply(age, sex, mean)))
X <- newdata_to_X(newdata, wide, intercept=FALSE)
ficall <- fic(wide, inds = combs, inds0 = inds0,
              focus = "survival", X = X, t = 5)
@ 
<<melanoma,include=FALSE>>=
ggplot_fic(ficall, ci = FALSE, xlim = c(0,1))
@ 
\begin{figure}[h]
  \centering
  \includegraphics{fic-melanoma-1}
  \caption{Focused comparison of Cox regression models for
    the melanoma data.  }
  \label{fig:melanoma}
\end{figure}

The models give a big range of estimates for the focus survival
probability. Generally, the models returning higher survival estimates 
are those models closer to the narrow model, and the error of these
estimates is high.  The most accurate models, as judged by the MSE of
this focus, return survival estimates of around 0.6--0.7 for men and 
0.4--0.5 for women.  If desired, the model with the lowest root MSE could
be identified with the \code{summary} method.
<<>>=
summary(ficall)
@ 
However, the reasons for the variation in results between models should be 
investigated further, with the aid of clinical background knowledge.
% We can compare with Fig 6.5 of \cite{claeskens:hjort:book}

<<echo=FALSE,eval=FALSE>>=
par(mfrow = c(1,2))
plot(ficall$FIC[ficall$vals=="female"], ficall$focus[ficall$vals=="female"], 
     xlim = c(0,30), ylim = c(0, 1), 
     ylab = "5yr survival estimates, women", xlab="FIC")
plot(ficall$FIC[ficall$vals=="male"], ficall$focus[ficall$vals=="male"], 
     xlim = c(0,30), ylim = c(0.2, 0.9),
     ylab = "5yr survival estimates, men", xlab = "FIC")
@ 



\section{Discussion}

In focused model comparison, we explicitly connect 
statistical models with the scientific questions that the models were
designed to address.  Different models might give better estimates of
different focus quantities of interest.  Focused information
criteria enable models to be compared according to how well they balance the bias and variance of an
estimate of interest.  The bias and variance terms that
contribute to the criteria can be examined directly to illustrate the
trade-off. 

The \pkg{fic} package performs focused model comparison easily for any
class of models fitted by maximum likelihood.  The package is designed
to be easily extensible, by adding new model classes and focuses.  The
vignettes illustrate some examples of particular model classes,
focuses and model comparison problems.  Further contributions of code
and worked examples would be welcome.  More advanced models, beyond
standard maximum likelihood, might need novel focused criteria to be
implemented.  Some generalisations were discussed by
\citet{claeskens:hjort:book}, for example, for mixed models, missing
data, and where a submodel is on the boundary of the parameter
space. \citet{gueuning2018} developed focused information criteria for
high-dimensional situations where one or more of the models being
compared is not identifiable from the data.
\citet{jullum2017parametric,jullum:coxvspar} used FIC to choose
between parametric and nonparametric estimators,
\citet{hellton2018fridge} used ideas of focused selection to set the
tuning parameter in ridge regression, while \citet{ko:fic:copula} used
it for copula model selection.

Model uncertainty is an area of ongoing research.  It is often not
wise to highlight only the results of the single model that is
selected by a criterion.  If multiple well-performing models give
different estimates, then the uncertainty about the model choice
should be acknowledged.  If a single ``best'' estimate is desired,
then this could be produced by a model-averaged estimator (see, e.g.,
\citet{hjort:claeskens}, \citet{claeskens:hjort:book}, \citet{hansen:averaging:2014},
\citet{zhang:averaging}).
Alternatively, if we wish to highlight a
single best \emph{model}, this could provide the point estimate, but coupled
with a confidence interval that acknowledges the range of models that
have been selected from.  In either case, determining appropriate
confidence intervals or standard errors is challenging \citep[see, e.g.,][]{charkhi:claeskens,bachoc:postsel}. 
Bootstrapping, e.g., by resampling the data, or
resampling parameter estimates from their asymptotic sampling distributions
under a wide model, is an attractive method of handling model
uncertainty that is simple to implement in software, though its
theoretical properties in this context are not well understood.  See,
e.g., \citet{claeskens:hjort:book}, \citet{breiman1996bagging},
\citet{my:averaging:bayesian}, for examples and
discussion.

Estimation is often followed by decision making.  Formal decision
theory might also lend itself to focused model comparison.  For
example we might consider the ``focus'' to be the decision among
discrete actions with optimal expected loss.  Again, bootstrapping
might provide a route to focused model comparison, e.g., by calculating
the average loss, over a set of bootstrap resamples, of decisions made
under competing models.

Focused model comparison principles might also be used for Bayesian
inference.  In realistic situations, the true process is typically
more complex than the biggest model that can be identified from the
data.  From a Bayesian perspective, we might use a
theoretically-realistic model, but with a prior that encodes external
information to enable the parameters to be identified from the data,
e.g., through shrinkage, regularisation or elicited judgements.
However, the appropriate model or prior might still be uncertain, and
we may have multiple competing models.  \citet{claeskens:hjort:book}
present some equivalents of FIC for Bayesian estimators where there is
prior information about departures from the narrow model.  Various
criteria, with justifications related to cross-validation or
predictive loss, have been developed to compare Bayesian models
\citep{dic,plummer:dic,watanabe2013widely,vehtari2017practical} or
average their estimates \citep{yao:stacking,my:averaging:bayesian}.  
These might also be extended to enable models to be compared for 
different focuses.

\pkg{fic} is available from \url{http://CRAN.R-project.org/package=fic}.  Development versions are available on \url{https://github.com/chjackson/fic}, and contributions are welcome.

\bibliography{fic}

\end{document}