\documentclass[notitlepage,11pt]{report}
\usepackage{Sweave}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{makeidx}

\addtolength{\textwidth}{1in}
\addtolength{\oddsidemargin}{-.5in}
\setlength{\evensidemargin}{\oddsidemargin}
%\VignetteIndexEntry{Survival package methods}

\SweaveOpts{keep.source=TRUE, fig=FALSE}
% Ross Ihaka suggestions
\DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em}
\DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em}
\fvset{listparameters={\setlength{\topsep}{0pt}}}
%\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topep}}

\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\Lhat}{\hat\Lambda}
\newcommand{\lhat}{\hat\lambda}
\newcommand{\Shat}{\hat S}
\newcommand{\KM}{{\rm KM}}
\newcommand{\NA}{{\rm NA}}
\newcommand{\FH}{{\rm FH}}
\newcommand{\Ybar}{\overline Y}
\newcommand{\Nbar}{\overline N}
\newcommand{\xbar}{\overline x}
\newcommand{\tmu}{\tilde \mu}
\newcommand{\bhat}{\hat\beta}


\title{Further documentation on code and methods in the survival package}
\author{Terry Therneau}
\date{March 2024}

<<setup, echo=FALSE>>=
library(survival)
@ 
\makeindex
\begin{document}
\maketitle
\chapter{Introduction}
\begin{quote}
Kernighan's Law:  ``Everyone knows that debugging is twice as
hard as writing a program in the first place. So if you're as clever as 
you can be when you write it, how will you ever debug it?''
\end{quote}

The most complex code in the survival package arises from two sources.
First, some of the mathematical formulas underlying the code are themselves
complex; second is the work I have done to avoid $O(n^2)$ computations
in the algorithms.
A consequence of this is that a few comment lines interspersed into the source
code will never be enough information.  Anyone reading it, including myself,
is likely to ask ``what the heck is he trying to \emph{do} here?''
A higher level overview is needed, one that can include equations and
talk about broader concepts.

  For many years I have used the noweb package to intersperse technical
documentation with the source code for the survival package.
However, despite its advantages, the uptake of noweb by the R community
in general has been nearly nil.
It is increasingly clear that no future maintainer will continue the
work; for example,
on the github page I have yet to receive a suggested update that actually
fixed the .Rnw source file instead of the .R file derived from it.  This means
that I can't merge a suggested change automatically, but have to replicate it
myself.

This document is a start at addressing this.  As routines undergo maintenance,
I will remove the relevant .Rnw file in the noweb directory and work directly
on the C and R code, migrating the extra material into this document.
In this
vignette are discussions of design issues, algorithms, and detailed
formulas.  In the .R and .c code I will place comments of the form
``See methods document, abc:def'', adding an abc:def entry to the index
of this document. 
I am essentially splitting each noweb document into two parts. 
The three advantages are 
\begin{itemize}
  \item Ease the transition to community involvement and maintenance.
  \item The methods document will be a more ready source of documentation for
    those who want to know technical details, but are not currently modifying
    the code.
  \item (minor) I expect it will be useful to have the methods document and the
    R or C code open simultaneously in 2 windows, when editing. 
\end{itemize}
However, this conversion will be a long process.  

\paragraph{Notation}
Throughout this document I will use formal counting process notation,
so you may as well get used to it.  The primary advantage is that it
is completely precise, and part of the goal for this document is to give
a fully accurate description of what is computed.
In this notation we have
\begin{itemize}
  \item $Y_i(t)$ = 1 if observation $i$ is at risk at time $t$, 0 otherwise.
  \item $N_i(t)$ = total number of events, up to time $t$ for subject $i$.
  \item $dN_i(t)$ = 1 if there was an event at exactly time $t$
  \item $X$ = the matrix of covariates, with $n$ rows, one per observation, and
    a column per covariate.
  \item $X(s)$ = the time-dependent covariate matrix, if there are time-dependent
    covariates
\end{itemize}

For multistate models the notation extends to $Y_{ij}(t)$, which is 1 if subject $i$ is
at risk and in state $j$ at time $t$, and $N_{ijk}(t)$, which is the number of
$j$ to $k$ transitions that have occurred, for subject $i$, up to time $t$.
The number at risk and number of events at time $t$ can be written as
\begin{align*}
  n(t) &= \sum_{i=1}^n w_iY_i(t) \\
       &= \Ybar(t) \\
  d(t) &= \sum_{i=1}^n w_i dN_i(t) \\
       &= d\Nbar(t)
\end{align*}
where $w_i$ are optional weights for each observation.
$\Nbar(t)$ is the cumulative number of events up to time $t$.
I will often also use $r_i(t) = \exp(X_i(t)\beta)$ as the per observation risk
score in a Cox model.


\chapter{Survival Curves}
\index{survfit}
The survfit function was set up as a method so that we could apply the
function to both formulas (to compute the Kaplan-Meier) and to coxph
objects.
The downside to this is that the manual pages get a little odd:
\code{survfit} is generic and \code{survfit.formula} and \code{survfit.coxph}
are the ones a user will want.  But from
a programming perspective it has mostly been a good idea.
At one time, long long ago, we allowed the function to be called with
``Surv(time, status)'' as the formula, i.e., without a right hand side
of \textasciitilde 1.  That was
a bad idea, now abandoned: \code{survfit.Surv} is now a simple stub that
prints an error message.

\section{Roundoff and tied times}
\index{tied times}
 One of the things that drove me nuts was the problem of
``tied but not quite tied'' times. It is a particular issue for both
survival curves and the Cox model, as both treat a tied time differently than
two close but untied values. 
As an example consider the two values 24173 = 23805 + 368. These are values from
an actual study with times in days: enrollment at age 23805 days and then 368
days of follow-up.
However, the user chose to use age in years and saved those values out
in a CSV file; the left hand side of the above equation then becomes
66.18206708000000 while the right hand side addition yields 66.18206708000001.
The R phrase \code{unique(x)} sees these two values as distinct, but
\code{table(x)} and \code{tapply} see them as a single value since they
first apply \code{factor} to the values, and that in turn uses
\code{as.character}.
A transition through CSV is not necessary to create the problem.  Consider the
small code chunk below.  For someone born on 1960-01-01, it calculates the
time interval between a study enrollment and a follow-up that is exactly 29
days later, doing so on age scale, for the enrollment date 2010-01-01 and each
of the 200 dates that follow it.
<<test, echo=TRUE>>=
tfun <- function(start, gap, birth= as.Date("1960-01-01")) {
    as.numeric(start-birth)/365.25 - as.numeric((start + gap)-birth)/365.25
}

test <- logical(200)
for (i in 1:200) {
    test[i] <- tfun(as.Date("2010/01/01"), 29) ==
               tfun(as.Date("2010/01/01") + i, 29)
}
table(test)
@ 
The number of FALSE entries in the table depends on machine, compiler,
and possibly several other issues. 
There is discussion of this general issue in the R FAQ: ``why doesn't R
think these numbers are equal''.
The Kaplan-Meier and Cox model both pay careful attention to ties, and
so both now use the \code{aeqSurv} routine to first preprocess
the time data.  It uses the same rules as \code{all.equal} to
adjudicate ties and near ties.  See the vignette on tied times for more
detail.
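
A small illustration with made-up values: \code{aeqSurv} maps times that
differ by less than the \code{all.equal} tolerance onto a common value.
<<aeqdemo, echo=TRUE>>=
ttime  <- c(66.18206708000000, 66.18206708000001, 70)
status <- c(1, 1, 0)
length(unique(ttime))                  # R sees three distinct time values
y2 <- aeqSurv(Surv(ttime, status))
length(unique(unclass(y2)[, "time"]))  # after adjudication, only two remain
@ 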


The survfit routine has been rewritten more times than any other in the package,
as we trade off simplicity of the code against execution speed.
The current version does all of the organizational work in S and calls a C
routine for each separate curve.
The first code did everything in C but was too hard to maintain, and the most
recent prior version did nearly everything in S.
The introduction of the robust variance
prompted a movement of more of the code into C, since that calculation
is computationally intensive.

The \code{survfit.formula} routine does a number of data checks, then
hands the actual work off to one of three computational routes: simple survival
curves using the \code{survfitKM.R} and \code{survfitkm.c} functions,
interval censored data to \code{survfitTurnbull.R}, 
and multi-state curves using the \code{survfitAJ.R} and \code{survfitaj.c} pair.

The addition of \code{+ cluster(id)} to the formula was the suggested form
at one time; we now prefer \code{id} as a separate argument.  Due to the
long comet's tail of usage we are unlikely to formally deprecate the older
form any time soon.
The istate argument applies to multi-state models (Aalen-Johansen), or any data
set with multiple rows per subject, but the code refrains from complaining if
it is present when not needed.
   
\section{Single outcome} 
\index{Kaplan-Meier}
We begin with classic survival, where there is a single outcome.  Multistate
models will follow in a separate section.

At each event time we have 
\begin{itemize}
  \item n(t) = weighted number at risk
  \item d(t) = weighted number of events
  \item e(t) = unweighted number of events
\end{itemize}
leading to the Nelson-Aalen estimate of the cumulative hazard, the
Kaplan-Meier estimate of survival, and the estimate of the hazard at each
time point.
\begin{align*}
  KM(t) &= KM(t-) \left(1- d(t)/n(t)\right) \\
  NA(t) &= NA(t-) + d(t)/n(t) \\
  h(t)  &= d(t)/n(t)
\end{align*}

An alternative estimate in the case of tied times is the Fleming-Harrington.
When there are no case weights the FH idea is quite simple.
The assumption is that the real data is not tied, but we saw a coarsened version.
If we see 3 events out of 10 subjects at risk the NA increment is 3/10, but the
FH is 1/10 + 1/9 + 1/8, it is what we would have seen with the 
uncoarsened data.
If there are case weights we give each of the 3 subjects a 1/3 chance of being
the first, second, or third event
\begin{align*}
  KM(t) &= KM(t-) \left(1- d(t)/n(t)\right) \\
  NA(t) &= NA(t-) + d(t)/n(t) \\
  FH(t) &= FH(t-) + \sum_{i=1}^{3} \frac{d(t)/3}{n(t)- d(t)(i-1)/3}
\end{align*}
If we think of the size of the denominator as a random variable $Z$, an
exact solution would use $E(1/Z)$, the FH uses $1/E(Z)$ and the NA uses
$1/\max(Z)$ as the denominator for each of the 3 deaths. 
Although Fleming and Harrington were able to show that the FH has a lower
MSE than the KM, it is little used.  The primary reasons for its inclusion
in the package are first, that I was collaborating with them at the time and
so knew of these results, and second and more importantly, that this is
identical to the argument for the Efron approximation in a Cox model.  (The NA
hazard corresponds to the Breslow approximation.)
The Efron approximation is the default for the coxph function, so post-coxph
survival curves need to deal with it.

When one of these 3 subjects has an event but continues to be at risk,
which can happen with start/stop data, then the argument gets trickier.
Say that the first of the 3 continues, the others do not.  We can argue that
subject 1 remains at risk for all 3 denominators, or not.
The first is more a mathematical viewpoint, the second more medical; e.g.,
in a study with repeated infections you will never have a second one
recorded on the same day.
For multi-state data we finally ``threw in the towel'' and now use the Breslow
approximation as the default for multi-state hazard (coxph) models.
For single state data the FH estimate is rarely requested, but we still do our
best to handle all aspects of it (pride and history); there would be few tears
if it were dropped.

When there is (time1, time2) data the code uses a \code{position} vector,
explained further below, which is 1 if this observation is the leftmost for a
subject, 2 if it is the rightmost, and 3 if it is both.  A primary purpose was
to not count a subject as an extra entry and censoring in the middle of a
string of times such as (0, 10), (20, 35), (35, 50), but we also use it to
moderate the FH estimate: only those with an event at the end of their
intervals participate in the special computation for ties.

The survfitKM call has arguments of
\begin{itemize}
  \item y: a matrix y containing survival times, either 2 or 3 columns
  \item weight: vector of case weights
  \item ctype: computation for the cumulative hazard: either the Nelson-Aalen (1)
    or the Efron approximation for ties (2).  Option 2 is very rarely used.
  \item stype: computation for the survival curve: either the product-limit (1),
    also known as the Kaplan-Meier, or exp(-cumulative hazard) (2).
    Use of ctype=2, stype=2 matches a Cox model using the Efron approximation
    for ties; ctype=1, stype=2 is the Fleming-Harrington estimate of survival,
    or a Cox model with the Breslow approximation for ties (see the example
    after this list).
  \item type: older form of the ctype/stype argument, retained for backwards
    compatibility. Type 1= (ctype 1, stype 1), 2= (2, 1), 3 = (1, 2), 4= (2, 2).
  \item id: subject id, used for data checks when there are multiple rows per
    subject
  \item cluster: clustering used for the robust variance. Clustering is based
    on id if cluster is missing (the usual case)
  \item influence: return the influence matrix for the survival
  \item start.time: optional starting time for the curve
  \item entry: return entry times for (time1, time2) data
  \item time0: include time 0 in the output
  \item se.fit: compute the standard errors
  \item conf.type, conf.int, conf.lower: options for confidence intervals
\end{itemize}
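
These same options are available from the user-facing \code{survfit.formula}
call.  A small illustration of the argument mapping, using the built-in
\code{aml} data (the data set and the rows shown are arbitrary):
<<stypedemo, echo=TRUE>>=
km  <- survfit(Surv(time, status) ~ 1, aml)                    # stype=1, ctype=1
fh  <- survfit(Surv(time, status) ~ 1, aml, stype=2, ctype=1)  # Fleming-Harrington
efr <- survfit(Surv(time, status) ~ 1, aml, stype=2, ctype=2)  # Cox/Efron analog
round(cbind(time= km$time, KM= km$surv, FH= fh$surv, Efron= efr$surv)[1:6,], 4)
@ 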


\subsection{Confidence intervals}
If $p$ is the survival probability and $s(p)$ its standard error,
we can do confidence intervals on the simple scale of
$ p \pm 1.96 s(p)$, but that does not have very good properties.
Instead use a transformation $y = f(p)$ for which the standard error is
$s(p) f'(p)$, leading to the confidence interval
\begin{equation*}
 f^{-1}\left(f(p) \pm 1.96\, s(p)f'(p) \right)
 \end{equation*}
Here are the supported transformations.
\begin{center}
  \begin{tabular}{rccc} 
    &$f$& $f'$ & $f^{-1}$ \\ \hline
log & $\log(p)$ & $1/p$ & $ \exp(y)$ \\
log-log & $\log(-\log(p))$ & $1/\left[ p \log(p) \right]$ &
   $\exp(-\exp(y)) $  \\
logit & $\log\left(p/(1-p)\right)$ & $1/[p (1-p)]$ & $1- 1/\left[1+ \exp(y)\right]$ \\
arcsin & $\arcsin(\sqrt{p})$ & $1/(2 \sqrt{p(1-p)})$ &$\sin^2(y)$ \\

\end{tabular} \end{center}
Plain intervals can give limits outside of (0,1); we truncate them when this
happens.  The log intervals can give an upper limit greater than 1 (rare),
again truncated to be $\le 1$; the lower limit is always valid.
The log-log and logit are always valid.  The arcsin requires some fiddling at
0 and $\pi/2$ due to how the R function is defined, but that is only a
nuisance and not a real flaw in the math.  In practice, all the intervals
except plain appear to work well.
In all cases we return NA as the CI for survival=0: it makes the graphs look
better.
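
The transformation is chosen by the \code{conf.type} option of survfit; a
quick comparison of the resulting lower limits at a single, arbitrarily
chosen, time point:
<<conftype, echo=TRUE>>=
temp <- sapply(c("plain", "log", "log-log", "logit", "arcsin"),
               function(ct) {
                   fit <- survfit(Surv(time, status) ~ 1, aml, conf.type = ct)
                   summary(fit, times = 30)$lower
               })
round(temp, 3)
@ 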

Some of the underlying routines compute the standard error of $p$ and some
the standard error of $\log(p)$.  The \code{selow} argument is used for the
modified lower limits of Dorey and Korn.  When this is used for cumulative
hazards the \code{ulimit} argument will be FALSE: we don't want to impose an
upper limit of 1.

\subsection{Robust variance}
\index{survfitkm!robust variance}
For an ordinary Kaplan-Meier curve it can be shown that the infinitesimal
jackknife (IJ) variance is identical to the Greenwood estimate, so the extra
computational burden of a robust estimate is unnecessary.
The proof does not carry over to curves with (time1, time2) data.
The use of (time1, time2) data for a single outcome arises in 3 cases, with
illustrations shown below.
\begin{itemize}
  \item Delayed entry for a subset of subjects.  The robust variance in
    this case is not identical to the Greenwood formula, but is very close in
    all the situations I have seen.
  \item Multiple events of the same type, as arises in reliability data. In this
    case the cumulative hazard rather than P(state) is the estimate of most
    interest; it is known as the mean cumulative function (MCF)
    in the reliability literature.  If there are multiple events per subject
    (the usual) an id variable is required. 
    A study with repeated glycemic events as the endpoint (not shown) had a
    mean of 48 events per subject over 5.5 months.  In this case
    accounting for within-subject correlation was crucial;
    the ratio of robust/asymptotic standard error was 4.4. For the valveSeat
    data shown below the difference is minor.
  \item Survival curves for a time-dependent covariate, the so called ``extended
    Kaplan-Meier'' \cite{Snapinn05}. 
    I have strong statistical reservations about this method.
\end{itemize}
Here are examples of each.  In the myeloma data set the desired time scale was
time since diagnosis, and some of the patients in the study had been diagnosed
at another institution before referral to the study institution.

<<survrepeat, fig=TRUE, echo=TRUE>>=
fit1a <- survfit(Surv(entry, futime, death) ~ 1, myeloma)
fit1b <- survfit(Surv(entry, futime, death) ~ 1, myeloma, id=id, robust=TRUE)
matplot(fit1a$time/365.25, cbind(fit1a$std.err, fit1b$std.err/fit1b$surv), 
        type='s',lwd=2, lty=1, col=2:3, #ylim=c(0, .6),
        xlab="Years post diagnosis", ylab="Estimated sd of log(surv)")
#
# when two valve seats failed at the same inspection, we need to jitter one
#  of the times, to avoid a (time1, time2) interval of length 0
ties <- which(with(valveSeat, diff(id)==0 & diff(time)==0))  #first of a tie
temp <- valveSeat$time
temp[ties] <- temp[ties] - .1
vdata <- valveSeat
vdata$time1 <- ifelse(!duplicated(vdata$id), 0, c(0, temp[-length(temp)]))
vdata$time2 <- temp
fit2a <- survfit(Surv(time1, time2, status) ~1, vdata)
fit2b <- survfit(Surv(time1, time2, status) ~1, vdata, id=id)
plot(fit2a, cumhaz=TRUE, xscale=365.25, xlab="Years in service",
     ylab="Estimated number of repairs")
lines(fit2b, cumhaz=TRUE, lty=c(1,3,3))
legend(150, 1.5, c("Estimate", "asymptotic se", "robust se"), lty=1:3, bty='n')

#
# PBC data, categorized by most recent bilirubin
# as an example of the EKM
pdata <- tmerge(subset(pbcseq, !duplicated(id), c(id, trt, age, sex, stage)),
                subset(pbcseq, !duplicated(id, fromLast=TRUE)), id,
                death= event(futime, status==2))
bcut <- cut(pbcseq$bili, c(0, 1.1, 5, 100), c('normal', 'moderate', 'high'))
pdata <- tmerge(pdata, pbcseq, id, cbili = tdc(day, bcut))
pdata$ibili <- pdata$cbili[match(pdata$id, pdata$id)] # initial bilirubin

ekm <- survfit(Surv(tstart, tstop, death) ~ cbili, pdata, id=id)
km  <- survfit(Surv(tstart, tstop, death) ~ ibili, pdata, id=id)
plot(ekm, fun='event', xscale=365.25, lwd=2, col=1:3, conf.int=TRUE,
     lty=2, conf.time=c(4,8,12)*365.25,
     xlab="Years post enrollment", ylab="Death")
lines(km, fun='event', lwd=1, col=1:3, lty=1)
#      conf.time= c(4.1, 8.1, 12.1)*365.25)
text(c(4600, 4300, 2600), c(.23, .56, .78), c("Normal", "Moderate", "High"),
     col=1:3, adj=0)
legend("topleft", c("KM", "EKM"), lty=1:2, col=1, lwd=2, bty='n')
@ 

The EKM plot is complex: dashed lines are the EKM along with
confidence intervals, solid lines are the KM stratified by enrollment bilirubin.
The EKM bias for normal and moderate is large, but the confidence intervals
differ little between the robust and asymptotic se (latter not shown).

As shown above, more often than not the robust IJ variance is not needed for
the KM curves themselves.
However, it is the central computation for pseudovalues, which are steadily
increasing in popularity.
Let $U_k(t)$ be the IJ for observation $k$ at time $t$.
This is a vector, but in section \ref{sect:IJ2} for multi-state data it will
be a matrix with 1 column per state.
Likewise let $C(t)$ and $A(t)$ be influence vectors for the cumulative hazard 
and the area under the curve at time $t$.
Let $h(t)$ be the hazard increment at time $t$.
Then

\begin{align}
  h(t) &= \frac{\sum_i w_idN_i(t)}{\sum_i w_i Y_i(t)}   \label{haz1}\\
  \frac{\partial h(t)}{\partial w_k} &= \frac{dN_k(t) - Y_k(t)h(t)}
                               {\sum_i w_i Y_i(t)} \label{dhaz1}\\
       &= C_k(t) - C_k(t-) \nonumber\\
  S(t) &= \prod_{s\le t} \left[1-h(s)\right]  \label{KM1}\\
       &= S(t-) [1-h(t)] \nonumber \\
  U_k(t) &= \frac{\partial S(t)}{\partial w_k} \\
         &= U_k(t-)[1-h(t)] - S(t-)\frac{dN_k(t) - Y_k(t)h(t)}{\sum_i w_i Y_i(t)}
            \label{dsurv1} \\
  U_k(t) &=  \frac{\partial \prod_{s\le t} \left[1-h(s)\right]}{\partial w_k} \nonumber \\
         &=  -\sum_{s\le t} \frac{S(t)}{1-h(s)}
               \frac{dN_k(s) - Y_k(s)h(s)}{\sum_i w_i Y_i(s)} \nonumber \\
          &=  -S(t) \sum_{s\le t} \frac{dN_k(s) - Y_k(s)h(s)}
               {[1-h(s)] \sum_i w_i Y_i(s)} \label{dsurv2} \\
         &= S(t) \int_0^t \frac{\partial \log[1-h(s)]}{\partial w_k} d\Nbar(s)
               \label{dsurv3}
\end{align}

The simplest case is $C_k(\tau)$ at a single user requested reporting time 
$\tau$, i.e., a simple use of the pseudo routine. 
The obvious code will update all $n$ elements of $C$ at each event time,
an $O(nd)$ computation where $d$ is the number of unique event times $\le \tau$.
For most data sets $d$ and $n$ grow together, and $O(n^2)$ computations are
something we want to avoid at all costs; a large data set will essentially
freeze the computer.

The code solution is to create a running total $z$ of the right hand term of
\eqref{dhaz1}, which applies to all subjects at risk.  Then the leverage 
for an observation at risk over the interval $(a,b)$ is
\begin{align*}
  z(t) &= \sum_{s\le t} \frac{h(s)}{\sum_i w_i Y_i(s)} \\
  C_k(\tau) &= \frac{N_k(\tau)}{\sum_i w_i Y_i(b)} +z(\min(a,\tau)) -
               z(\min(b,\tau))
\end{align*}
In R code this becomes an indexing problem.  We can create three $n$ by $m$
matrices, where $m$ is the number of reporting times, that point to the
correct elements of the vectors
\code{c(0, 1/n.risk)} and \code{c(0, n.event/n.risk\^2)} obtained from the
survfit object.
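
Below is a minimal R sketch of this running-sum computation, for the leverage
of each observation on $\Lambda(\tau)$ at a single reporting time $\tau$, with
unit weights and no delayed entry.  The function name \code{chazlev} is purely
illustrative; it is not part of the package, whose actual computation is done
in C.
<<chazlev, echo=TRUE, eval=FALSE>>=
chazlev <- function(time, status, tau) {
    fit   <- survfit(Surv(time, status) ~ 1)
    keep  <- (fit$n.event > 0 & fit$time <= tau)
    dtime <- fit$time[keep]                      # event times up to tau
    nrisk <- fit$n.risk[keep]
    # running sum z(t) of h(s)/Ybar(s), as a right-continuous step function
    zfun  <- stepfun(dtime, c(0, cumsum(fit$n.event[keep]/nrisk^2)))
    dterm <- numeric(length(time))               # dN_k/Ybar at the event time
    dead  <- which(status == 1 & time <= tau)
    dterm[dead] <- 1/nrisk[match(time[dead], dtime)]
    dterm - zfun(pmin(time, tau))                # C_k(tau), with entry time a=0
}
# e.g., with(aml, chazlev(time, status, tau = 30))
@ 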

The computation for survival starts out exactly the same if using the
exponential estimate $S(t) = \exp(-\Lambda(t))$; otherwise do the same
computation but accumulate the fraction that appears inside the sum of
equation \eqref{dsurv2}, i.e., with the extra $[1-h(s)]$ term in the
denominator.
At the end multiply the column for reporting time $\tau$ by $-S(\tau)$, for
each $\tau$.

There are two ways to look at the influence on the sojourn time, which is
calculated as the area under the curve.
\index{survfit!AUC}
They are shown in figure \ref{AUCfig} for an AUC at time 55.

\begin{figure}
<<auc, echo=FALSE, fig=TRUE>>=
test <- survfit(Surv(time, status) ~1, aml, subset=(x=="Maintained"))
ntime <- length(test$time)
oldpar <- par(mfrow=c(1,2), mar=c(5,5,1,1))
plot(test, conf.int=FALSE, xmax=60)
jj <- (test$n.event > 0)
segments(test$time[jj], test$surv[jj], test$time[jj], 0, lty=2)
segments(55, test$surv[9], 55, 0, lty=2)
points(c(0, test$time[jj]), c(1, test$surv[jj]))
segments(0,0,55,0)
segments(0,0,0,1)

plot(test, conf.int=FALSE, xmax=60)
segments(pmin(test$time,55), test$surv, 55, test$surv, lty=2)
segments(55,test$surv[ntime],55,1)
segments(test$time[1], 1, 55, 1, lty=2)
points(c(test$time[jj],100),  .5*(c(1, test$surv[jj]) + c(test$surv[jj], 0)))
par(oldpar)
@ 
  \caption{Two ways of looking at the AUC}
  \label{AUCfig}
\end{figure}

The first uses the obvious rectangle rule.  The leverage of observation
$i$ on AUC(55) is the sum of its leverage on the height of each of the circles
times the width of the associated rectangle.
The leverages on the heights of the circles are the elements of $U$.
The second figure depends on the fact that the influence on the AUC must be
the same as the influence on 55-AUC.  It will be the influence of each
observation on the length
of each circled segment, times the distance to 55.
This is closely related to the asymptotic method used for the ordinary KM,
which assumes that the increments are uncorrelated.
 
Using the first approach, let
$w_0, w_1, \ldots, w_m$ be the widths of the $m+1$ rectangles under the
survival curve $S(t)$ from 0 to $\tau$, and $d_1, \ldots, d_m$ the set of
event times that are $\le \tau$.
Further, let $u_{ij}$ be the additive elements that make up $U$ as found in
the prior formulas, i.e.,
\begin{align*}
  U_{ik} &= S(d_k) \sum_{j=1}^k u_{ij} \\
  u_{ij} &= \frac{\partial \log(1 - h(d_j))}{\partial w_i} \; \mbox{KM}\\
        &=  -\frac{dN_i(d_j) - Y_i(d_j) h(d_j)}{[1-h(d_j)] \Ybar(d_j)} \\
  u_{ij}^*&= \frac{\partial - h(d_j)}{\partial w_i}  \; \mbox{exp(chaz)}\\
        &=  -\frac{dN_i(d_j) - Y_i(d_j) h(d_j)}{\Ybar(d_j)} \\
\end{align*}
The second variant holds when using the exponential form of $S$.

Finally, let $A(a,b)$ be the integral under $S$ from $a$ to $b$.
\begin{align}
  A(0,\tau) &=  w_0 \cdot 1 +
          \sum_{j=1}^m w_j S(d_j) \nonumber \\
  \frac{\partial A(0,\tau)}{\partial w_k} &=
            \frac{\partial A(d_1,\tau)}{\partial w_k} \nonumber \\
  \frac{\partial A(d_1,\tau)}{\partial w_k} &=
   \sum_{j=1}^m w_j U_{kj}  \nonumber \\
  &= \sum_{j=1}^m w_j S(d_j) \sum_{i=1}^j u_{ki} \nonumber \\
  &= \sum_{i=1}^m u_{ki} \left[\sum_{j=i}^m w_j S(d_j) \right] \nonumber \\
  &= \sum_{i=1}^m A(d_i, \tau) u_{ki} \label{eq:auctrick}
\end{align}

This is a similar sum to before, with a new set of weights.  Any given
(time1, time2) observation involves $u$ terms over that (time1, time2) range,
so we can form a single cumulative sum and use the value at time2 minus the
value at time1.
A single running total will not work for multiple $\tau$ values, however,
since the weights depend on $\tau$.
Write $A(d_i, \tau) = A(d_1, \tau) - A(d_1,d_i)$ and separate the two sums.
For the first, $A(d_1,\tau)$ can be moved outside the sum, and so plays
the same role as $S(d)$ in $U$; the second becomes a weighted cumulative
sum with weights that are independent of $\tau$.

Before declaring victory with this last modification, take a moment to assess
the computational impact of replacing $A(d_i,\tau)$ with two terms.
Assume $m$ reporting time points, $d$ deaths, and $n$ observations.
Method 1 needs $O(d)$ to compute the weights, $O(d)$ for the weighted
sum, and $O(2n)$ to create each column of the output for (time1, time2) data,
for a total of $O(2m (n+d))$.  The second needs to create two running sums, each
$O(d)$, and $O(4n)$ for each column of output, for $O(2d + 4mn)$.
The more `clever' approach is always slightly slower!
(The naive method of adding up rectangles is much slower than either, however,
because it needs to compute $U$ at every death time.)

\paragraph{Variance}

Efficient computation of the variance of the cumulative hazard within
survfit is more complex,
since it is computed at every event time.  We consider three cases below.

Case 1: The simplest case is unclustered data without delayed entry; 
not (time1,time2) data.
\begin{itemize}
  \item Let $W= \sum_i w_i^2$ be the sum of squared case weights for all those
    at risk.  At the start everyone is in the risk set.  Let $v(t)=0$ and
    $z(t)=0$ be running sums.
  \item At each event time $t$
    \begin{enumerate}
      \item Update $z(t) = z(t-) - h(t)/\Ybar(t)$, the running sum of the
        right hand term in equation \eqref{dhaz1}.
      \item For all $i$= (event or censored at $t$), fill in their element
        in $C_i$, add $(w_i C_i)^2$ to $v(t)$, and subtract $w_i^2$ from $W$.
      \item The variance at this time point is $v(t)+ Wz^2(t)$; all those still
        at risk have the same value of $C_k$ at this time point.
    \end{enumerate}
\end{itemize}
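
Below is an illustrative R sketch of the Case 1 algorithm just described (the
package does this in C, with more bookkeeping); the function name
\code{najack} and its arguments are invented for the example.  It returns the
robust standard error of the Nelson-Aalen estimate at each event time for
simple (time, status) data.
<<najack, echo=TRUE, eval=FALSE>>=
najack <- function(time, status, wt = rep(1, length(time))) {
    dtime <- sort(unique(time[status == 1]))                # event times
    nrisk <- sapply(dtime, function(t) sum(wt[time >= t]))  # weighted n at risk
    haz   <- sapply(dtime, function(t) sum(wt[time == t & status == 1]))/nrisk
    W <- sum(wt[time >= dtime[1]]^2)   # squared weights of those still at risk
    v <- 0; z <- 0
    upper <- c(dtime[-1], Inf)
    variance <- double(length(dtime))
    for (j in seq_along(dtime)) {
        z <- z + haz[j]/nrisk[j]                  # running sum of h(t)/Ybar(t)
        out <- which(time >= dtime[j] & time < upper[j])  # leave the risk set
        Ci  <- ifelse(time[out] == dtime[j] & status[out] == 1,
                      1/nrisk[j], 0) - z          # their leverage, now frozen
        v <- v + sum((wt[out] * Ci)^2)
        W <- W - sum(wt[out]^2)
        variance[j] <- v + W * z^2                # those still at risk share -z
    }
    list(time = dtime, std.err = sqrt(variance))
}
@ 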

Case 2: Unclustered with delayed entry.
Let the time interval for each observation $i$ be $(s_i, t_i)$.  In this case
the contribution at each event time, in step 3 above, for those currently at
risk is
\begin{equation*}
  \sum_j w_j^2 [z(t)- z(s_j)]^2 = z^2(t)\sum_jw_j^2  + \sum_jw_j^2z^2(s_j)
                 - 2 z(t) \sum_j w_j^2 z(s_j)
\end{equation*}
This requires two more running sums.  When an observation leaves the risk set
all three of them are updated.
In both case 1 and 2 our algorithm is $O(d +n)$.

Case 3: Clustered variance.  The variance at each time point is now
\begin{equation}
  \rm{var}\NA(t) = \sum_g\left( \sum_{i \in g} w_i \frac{\partial \NA(t)}
   {\partial w_i} \right)^2 \label{groupedC}
\end{equation}
The observations within a cluster do not necessarily have the same case
weight, so this does not collapse to one of the prior cases. 
It is also more difficult to identify when a group is no longer at risk
and could be moved over to the $v(t)$ sum.
In the common case of (time1, time2) data the cluster is usually a single
subject and the weight will stay constant, but we cannot count on that.
In a marginal structural model (Robins), for instance, the inverse probability
of censoring weights (IPCW) will change over time.

The above, plus the fact that the number of groups may be far less than the
number of observations, suggests a different approach.
Keep the elements of the weighted grouped $C$ vector, one row per group rather
than one row per observation.
This corresponds to the inner term of equation \eqref{groupedC}. 
Now update each element at each event time.
This leads to an $O(gd)$ algorithm.
A slight improvement comes from realizing that the increment for group $g$
at a given time involves the sum of $Y_i(t)w_i$, and only updating those
groups with a non-zero weight.

For the Fleming-Harrington update, the key realization is that if there are $d$
deaths at a given timepoint, the increment to the cumulative hazard NA(t) is
a sum of $d$ `normal' updates, but with perturbed weights for the tied deaths.
Everyone else at risk gets a common update to the hazard.
The basic computation and equations for $C$ involve a small loop over $d$ to
create the increments, sort of like an aside in a stage play, but then
adding them into $C$ is the same as before.  The basic steps for cases 1--3
do not change.

The C code for survfit has adopted case 3 above, for all three of the hazard,
survival, and AUC.
One rationale is that if the data set is very large the user can choose
\code{std.err=FALSE} as an option, then later use the residuals function to
get values at selected times.  It is when the number of deaths gets over 1000
or so that we begin to really notice the $O(nd)$ slowdown, but a plot only
needs about 100 points to look smooth, and most use cases will only need the
variance at 1--3 points.
Another reason is that for ordinary survival, the robust variance is almost
never requested unless there is clustering.

\section{Aalen-Johansen}
In a multistate model let the hazard for the $jk$ transition be 
\begin{align*}
  \hat\lambda_{jk}(t) &= \frac{\sum_i dN_{ijk}(t)}{\sum_i Y_{ij}(t)}\\
                 &= \frac{{\overline N}_{jk}(t)}{\Ybar(t)}
\end{align*}
and gather terms together into the nstate by nstate matrix $A(t)$ 
with $A_{jk}(t) = \hat\lambda_{jk}(t)$, with $A_{jj}(t) = -\sum_{k \ne j} A_{jk}(t)$.
That is, the row sums of $A$ are constrainted to be 0.

The cumulative hazard estimate for the $j:k$ transition is the cumulative sum of
$\hat\lambda_{jk}(t)$.  Each is treated separately, there is no change in 
methods or formula from the single state model.

The estimated probability in state $p(t)$ is a vector with one element per 
state and the natural constraint that the elements sum to 1; it is a probability
distribution over the states.
Two estimates are
\begin{align}
  p(t) &= p(0) \prod_{s \le t} e^{A(s)}   \label{AJexp}\\
  p(t) &= p(0) \prod_{s\le t}  (I + A(s)) \nonumber \\
       &= p(0) \prod_{s\le t}  H(s)  \label{AJmat}
\end{align}
Here $p(0)$ is the estimate at the starting time point.
Both $\exp(A(s))$ and $I + A(s)$ are transition matrices, i.e., all elements are
positive and rows sum to 1. They encode the state transition probabilities
at time $s$; the diagonal is the probability of no transition for observations
in the given state. At any time point with no observed transitions $A(s)=0$
and $H(s)= I$; without loss of generality the product is only over the
discrete times $s$ at which an event (transition) actually occurred.
For matrices $\exp(C)\exp(D) \ne \exp(C+D)$ unless $CD= DC$, i.e.,
equation \eqref{AJexp} does not simplify.

Equation \eqref{AJmat} is known as the Aalen-Johansen estimate.  In the case
of a single state it reduces to the Kaplan-Meier, for competing risks it
reduces to the cumulative incidence (CI) estimator.  
Interestingly, the exponential form \eqref{AJexp} is
almost never used for survfit.formula, but is always used for the curves after
a Cox model.
Both of the forms obey the basic probability rules that
$0 \le p_k(t) \le 1$ and $\sum_k p_k(t) =1$; all probabilities are between 0 and
1 and the sum is 1.
The analog of \eqref{AJmat} has been proposed by some authors for Cox model 
curves, but it can lead to negative elements of $p(t)$ in certain cases, so the
survival package does not support it. 
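
A small competing risks illustration, set up in the same way as the package's
competing risks vignette, shows that the Aalen-Johansen estimate really is a
probability distribution over the states at each time point:
<<ajdemo, echo=TRUE>>=
etime <- with(mgus2, ifelse(pstat == 0, futime, ptime))
event <- with(mgus2, ifelse(pstat == 0, 2*death, 1))
event <- factor(event, 0:2, labels = c("censor", "pcm", "death"))
ajfit <- survfit(Surv(etime, event) ~ 1, data = mgus2)
ajfit$states
range(rowSums(ajfit$pstate))     # each row of pstate sums to 1
@ 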

The computations are straightforward.  The code makes two passes through the
data, the first to create all the counts: number at risk, number of events of
each type, number censored, etc.  Since for most data sets the total number at
risk decreases over time, this pass is done in reverse time order: that leads
to more additions than subtractions in the running number at risk and so is
less prone to round off error.
(Unless there are extreme weights this is gilding the lily; there will not be
round off error in either case.)
The rule for tied times is that events happen first, censors second, and
entries third; imagine the censors at $t+\epsilon$ and entries at $t+ 2\epsilon$.

When a particular subject has multiple observations, say times of (0,10), (10,15)
and (15,24), we don't want the output to count this as a ``censoring''
and/or entry at time 10 or 15.
The \code{position} vector is 1= first obs for a
subject, 2= last obs,  3= both (someone with only one row of data), 0= neither.
This subject would be 1, 0, 2.
The position vector is created by the \code{survflag} routine.
If the same subject id were used in two curves the counts are separate for each
curve; for example curves by enrolling institution in a multi-center trial,
where two centers happened to have an overlapping id value.

A variant of this is if a given subject changed curves at time 15, which occurs
when users are estimating the ``extended Kaplan-Meier'' curves proposed by 
Snapinn \cite{Snapinn05}, in which case the subject will be counted as censored
at time 15 wrt the first curve, and an entry at time 15 for the second.
(I myself consider Snapinn's estimate to be statistically unsound.)

If a subject changed case weights between the (0,10) and (10,15) intervals we
do not count them as a separate entry and exit for the weighted n.enter
and n.censor counts.  This means that the running total of weighted n.enter,
n.event, and n.censor will not recreate the weighted n.risk value; the latter
\emph{does} change when weights change in this way.  The rationale for this
is that the entry and censoring counts are mostly used for display; they
do not participate in further computations.


\subsection{Data}
The survfit routine uses the \code{survcheck} routine, internally, to verify
that the data set follows an overall rule  that each
subject in the data set follows a path which is physically possible:
\begin{enumerate}
 \item Cannot be in two places at once (no overlaps)
 \item Over the interval from first entry to last follow-up, they have to be
   somewhere (no gaps).
 \item Any given state, if entered, must be occupied for a finite amount of
   time (no zero length intervals).
 \item States must be consistent.  For an interval (t1, t2, B) = entered state B
   at time t2, the current state for the next interval must be B 
   (no teleporting).
\end{enumerate}
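
As an illustration of these checks, here is a small constructed data set
(entirely hypothetical) in which subject 2 has a gap in follow-up from time 5
to time 8; a standalone call to \code{survcheck} reports it:
<<gapdemo, echo=TRUE>>=
gdata <- data.frame(id    = c(1, 1, 2, 2),
                    t1    = c(0, 5, 0, 8),
                    t2    = c(5, 9, 5, 12),
                    state = factor(c(0, 1, 0, 1), 0:1, c("censor", "event")))
survcheck(Surv(t1, t2, state) ~ 1, gdata, id = id)
@ 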

I spent some time thinking about whether I should allow the survfit routine to
bend the rules, and allow some of these.
First off, number 1 and 3 above are not negotiable; otherwise it becomes
impossible to clearly define the number at risk. 
But perhaps there are use cases for 2 and/or 4?
 
One data example I have used in the past for the Cox
model is a subject on a research protocol, over 30 years ago now, 
who was completely lost to follow-up for nearly 2 years and then reappeared.
Should they be counted as ``at risk'' in that interim?  
I argued no, on the grounds that they were not at risk for an ``event that would
have been captured'' in our data set --- we would never have learned of a
disease progression.
Someone should not be in the denominator of the Cox model (or KM) at a
given time point if they cannot be in the numerator.
We decided to put them into the data set with a gap in their follow-up time.

But in the end I've decided against bending the rules.  The largest reason
is that in my own experience a data set that breaks any of 1--4 above
is almost universally the outcome of a programming mistake.  The above example
is one of only 2 non-error cases I can think of, in nearly 40 years of
clinical research:
I \emph{want} the routine to complain. (Remember, I'd like the package to be
helpful to others, but I wrote the code for me.)
Second is that a gap can easily be accommodated by creating an extra state
which plays the role of a waiting room.
The P(state) estimate for that new state is not interesting, but the
others will be identical to what we would have obtained by allowing a gap.
Instances of case 4 can often be solved by relabeling the states;
I cannot think of an actual use case.

\subsection{Counting the number at risk}
\index{survfit!number at risk}
For a model with $k$ states, counting the number at risk is fairly 
straightforward, and results in a matrix with $k$ columns and
one row per time point.
Each element contains the number who are currently in that state and
not censored.  Consider the simple data set shown in figure \ref{entrydata}.

\begin{figure}
<<entrydata, fig=TRUE, echo=FALSE>>=
mtest <- data.frame(id= c(1, 1, 1,  2,  3,  4, 4, 4,  5, 5, 5, 5),
                    t1= c(0, 4, 9,  0,  2,  0, 2, 8,  1, 3, 6, 8),
                    t2= c(4, 9, 10, 5,  9,  2, 8, 9,  3, 6, 8, 11),
                    st= c(1, 2,  1, 2,  3,  1, 3, 0,  2,  0,2, 0))

mtest$state <- factor(mtest$st, 0:3, c("censor", "a", "b", "c"))

temp <- survcheck(Surv(t1, t2, state) ~1, mtest, id=id)
plot(c(0,11), c(1,5.1), type='n', xlab="Time", ylab= "Subject")
with(mtest, segments(t1+.1, id, t2, id, lwd=2, col=as.numeric(temp$istate)))
event <- subset(mtest, state!='censor')
text(event$t2, event$id+.2, as.character(event$state))
@ 
  \caption{A simple multi-state data set. Colors show the current state of
    entry (black), a (red), b(green) or c (blue), letters show the occurrence
    of an event.}
  \label{entrydata}
\end{figure}

The obvious algorithm is simple: at any given time point draw a vertical line
and count the number in each state who intersect it.
At a given time $t$ the rule is that events happen first, then censor,
then entry.  At time 2 subject 4 has an event and subject 3 enters:
the number at risk for the four states at time point 2 is (4, 0, 0, 0).
At time point $2+\epsilon$ it is (4, 1, 0, 1). 
At time 9 subject 3 is still at risk.
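
The counts can be verified directly from a survfit call on the \code{mtest}
data frame that was built for figure \ref{entrydata}:
<<mtestrisk, echo=TRUE>>=
mfit <- survfit(Surv(t1, t2, state) ~ 1, mtest, id = id)
cbind(time = mfit$time, mfit$n.risk)
@ 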

The default is to report a line of output at each unique censoring and
event time, adding rows for the unique entry time is an optional argument.
Subject 5 has 4 intervals of (1,3b), (3,6+), (6,8b) and (8,11+):
time 6 is not reported as a censoring time, nor as an entry time.
Sometimes a data set will have been preprocessed by the user
with a tool like survSplit, resulting in hundreds of rows
for some subjects, most of which are `no event here', and there is no reason
to include all these intermediate times in the output.
We make use of an ancillary vector \code{position} which is 1 if this interval is
the leftmost of a subject sequence, 2 if rightmost, 3 if both.
If someone had a gap in followup we would treat their re-entry to the risk
sets as an entry, however; the survcheck routine will have already generated
an error in that case.

The code does its summation starting at the largest time and moving left,
adding and subtracting from the number at risk (by state) as it goes. 
In most data sets the number at risk decreases over time, going from right
to left results in more additions than subtractions and thus less
potential roundoff error. 
Since it is possible for a subject's intervals to have different case
weights, the number at risk \emph{does} need to be updated at time 6 for
subject 5, though that time point is not in the output.

In the example, outcome b can be a repeated event, e.g., something like repeated
infections.  At time 8 the cumulative hazard for event b will change, but
the probability in state will not.
It would be quite unusual to have a data set where some states are repeatable
and others are not, but the code allows it.


Consider the simple illness-death model in figure \ref{illdeath}.
There are 3 states and 4 transitions in the model. The number at risk (n.risk),
number censored (n.censor), probability in state (pstate) and number of
entries (n.enter) components of the survfit object will have 3 columns, one
per state.
The number of events (n.event) and cumulative hazard (cumhaz) components
will have 4 columns, one per transition, labeled as 1.2, 1.3, 2.1, and 2.3.
In a matrix of counts with row = starting state and column = ending state,
this is the order in which the non-zero elements appear.
The order of the transitions is yet another consequence of the
fact that R stores matrices in column major order.

There are two tasks for which the standard output is insufficient: drawing the
curves and computing the area under the curve.
For both these we need to know $t0$, the starting point for the curves.
There are 4 cases:
\begin{enumerate}
  \item This was specified by the user using \code{start.time}
  \item All subjects start at the same time 
  \item All subjects start in the same state
  \item Staggered entry in diverse states
\end{enumerate}
For case 1 use what the user specified, of course.
For (time, status) data, i.e., competing risks, use the same rule as for simple
survival, which is min(time, 0): curves are assumed to start at 0 unless there
are negative time values.
Otherwise, for cases 2 and 3, use min(time1), the first time point found; in
many cases this will be 0 as well.

The hardest case most often arises for data on age scale, where there
may be a smattering of subjects at early ages.
Imagine that we have 3 states A, B, and C; subjects at ages 19, 23, and 28;
another dozen from 30--40; the first transition at age 41; and that the user
did not specify a starting time.
Suppose we start at age 19 and that subject is in state B.
Then $p(19)= (0,1,0)$, and the AJ estimate $p(t)$ will remain $(0,1,0)$ until
there is a B:A or B:C transition, no matter what the overall prevalence of
initial states is as enrollment progresses.  Such an outcome is not useful.
The default for case 4 is to use the smallest event time as the starting time,
with initial prevalence the distribution of states among those observations
which are at risk at that time. The prevalence at the starting time is saved
in the fit object as p0.
The best choice for case 4 is a user specified time, however.

Since we do not have an estimate of $p$ = probability in state before time 0,
the output will not contain any rows before that time point.
Note that if there is an event exactly at the starting time (a death at time
0 for competing risks, a transition at start.time, or case 4 above), then
the first row of the pstate matrix will not be the same as p0.
The output never has repeats of the same time value (for any given curve).
One side effect of this is that a plotted curve never begins with a vertical
drop.

\section{Influence}
\index{Andersen-Gill!influence}
\index{multi-state!influence}
Let $C(t)$ be the influence matrix for the cumulative hazard $\Lambda(t)$ and 
$U(t)$ that for the probability in state $p(t)$.
In a multistate model with $s$ states there are $s^2$ potential transitions,
but only a few are observed in any given data set. 
The code returns the estimated cumulative hazard $\Lhat$ as a matrix with
one column for each $j:k$ transition that is actually observed, the same
will be true for $C$.
We will slightly abuse the usual subscript notation; read
$C_{ijk}$ as the $i$th row and the $j:k$ transition (column).
Remember that 
a transition from state $j$ to the same state can occur with multiple events of 
the same type.  
Let
\begin{equation*}
  \Ybar_j(t) = \sum_i Y_{ij}(t) 
\end{equation*} 
be the number at risk in state $j$ at time $t$.  Then

\begin{align}
  C_i(t) &=  \frac{\partial \Lambda(t)}{\partial w_i} \nonumber \\
         &=  C_i(t-) + \frac{\partial \lambda(t)}{\partial w_i} \nonumber\\
  \lambda_{jk}(t) & = \frac{\sum w_i dN_{ijk}}{\Ybar_j(t)} \label{ihaz1}\\
  \frac{\partial \lambda_{jk}(t)}{\partial w_i} &= 
            \frac{dN_{ijk}(t)- Y_{ij}(t)\lambda_{jk}(t)}{\Ybar_j(t)} 
              \label{hazlev} \\
        V_c(t) &= \sum_i [w_i C_i(t)]^2 \label{varhaz}
\end{align}
At any time $t$, formula \eqref{hazlev} captures the fact that an observation has
influence only on potential transitions from its current state at time $t$,
and only over the (time1, time2) interval it spans.  
At any given event time, each non-zero $h_{jk}$ at that time will add an
increment to only those rows of $C$ representing observations currently
at risk and in state $j$.  The IJ variance is the weighted sum of 
squares. (For replication weights, where w=2 means two actual observations,
the weight is outside the brackets.  We will only deal with sampling weights.)

The (weighted) grouped estimates are
\begin{align}
  C_{gjk}(t) &= \sum_{i \in g} w_i C_{ijk}(t) \nonumber \\
         &= C_{gjk}(t-) + \sum_{i \in g} w_i
       \frac{dN_{ijk}(t) - Y_{ij}(t)\lambda_{jk}(t)}{\Ybar_j(t)} \label{hazlev2} \\
          &=C_{gjk}(t-) +\sum_{i \in g} w_i\frac{dN_{ijk}(t)}{\Ybar_j(t)} -
        \left(\sum_{i \in g}Y_{ij}(t)w_i \right) \frac{\lambda_{jk}(t)}{\Ybar_j(t)}
            \label{hazlev2b}
\end{align}
A $j:k$ event at time $t$ adds only to the $jk$ column of $C$.
The last line above, \eqref{hazlev2b}, splits out the $dN$ portion, which will
normally be 1 or at most a few events at this time point, from the second term,
which is identical for all subjects at risk in group $g$.
This allows us to split out the sum of weights.
Let $w_{gj}(t)$ be the sum of weights by group and state, which is kept in
a matrix of the same form as $U$ and can be efficiently updated. 
Each row of the grouped $C$ is computed with the same effort as a row of
the ungrouped matrix. 
  
Let $C$ be the per observation influence matrix, $W$ a diagonal matrix
of per-observation weights, and $B$ the $n$ by $g$ 0/1 design matrix that
collapses observations by group.
For instance, \code{B = model.matrix(\textasciitilde{} factor(grp) -1)}.
Then the weighted and collapsed influence is $V= B'WC$, and the infinitesimal
jackknife variances are the column sums of the squared elements of $V$,
\code{colSums(V*V)}. A feature of the IJ is that the column sums of $WC$ and of
$B'WC$ are zero.
In the special case where each subject is a group and the weight for a 
subject is constant over time, then one can collapse and then weight rather than
weight and then collapse. 
The survfitci.c routine (now replaced by survfitaj.c) made this assumption,
but unfortunately the code had no test cases with disparate weights within
a group and so the error was not caught for several years. 
There is now a test case.
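
A toy numerical illustration of the weight-then-collapse arithmetic (the
influence matrix here is random noise, purely to show the bookkeeping; a true
influence matrix would have weighted column sums of zero):
<<collapse, echo=TRUE>>=
set.seed(1953)
C   <- matrix(rnorm(15), nrow = 5)       # pretend per-observation influence
wt  <- c(1, 2, 1, 1, 3)                  # sampling weights
grp <- c(1, 1, 2, 3, 3)                  # cluster membership
B   <- model.matrix(~ factor(grp) - 1)   # n by g 0/1 design matrix
V   <- t(B) %*% (wt * C)                 # weight, then collapse by group
colSums(V * V)                           # IJ variance, one per column of C
@ 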

For $C$ above, where each row is a simple sum over event times,
formula \eqref{hazlev2b} essentially does the scale + sum step at each event
time and so only needs to keep one row per group.
The parent routine will warn if any subject id appears in
two different groups.
Doing otherwise makes no statistical sense, IMHO, so any instance of this
is almost certainly an error in the data setup.
However, someone may one day find a counterexample, hence a warning rather than
an error.

For the probability in state $p(t)$,  
if the starting estimate $p(0)$ is provided by the user, then $U(0) =0$. If
not, $p(0)$ is defined as the distribution of states among those at risk at
the first event time $\tau$.
\begin{align*}
 p_j &= \frac{\sum w_i Y_{ij}(\tau)}{\sum w_i Y_i(\tau)} \\
 \frac{\partial p_j}{\partial w_k} &=
     \frac{Y_{kj}(\tau) - Y_k(\tau)p_j}{\sum w_i Y_i(\tau)}
\end{align*}
Assume 4 observations at risk at the starting point, with weights of 
(1, 4, 6, 9) in states (a, b, a, c), respectively.
Then $p(0) = (7/20, 4/20, 9/20)$ and the unweighted $U$ matrix for those 
observations is
\begin{equation*}
U(0) = \left( \begin{array}{ccc}
  (1- 7/20)/20 & (0 - 4/20)/20 & (0- 9/20)/20 \\
  (0- 7/20)/20 & (1 - 4/20)/20 & (0- 9/20)/20 \\
  (1- 7/20)/20 & (0 - 4/20)/20 & (0- 9/20)/20 \\
  (0- 7/20)/20 & (0-  4/20)/20 & (1- 9/20)/20 
  \end{array} \right)
\end{equation*}
with rows of 0 for all observations not at risk at the starting time.
Weighted column sums are $wU =0$.
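
A few lines of R reproduce this small example and verify the zero weighted
column sums:
<<p0demo, echo=TRUE>>=
wt    <- c(1, 4, 6, 9)
state <- factor(c("a", "b", "a", "c"))
p0    <- as.vector(tapply(wt, state, sum))/sum(wt)        # (7, 4, 9)/20
Y     <- outer(as.character(state), levels(state), "==") + 0
U0    <- (Y - matrix(p0, nrow = 4, ncol = 3, byrow = TRUE))/sum(wt)
colSums(wt * U0)                   # weighted column sums are zero
@ 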

The AJ estimate of the probability in state vector $p(t)$ is
defined by the recursive formula $p(t)= p(t-)H(t)$.
Remember that the derivative of a matrix product $AB$ is $d(A)B + Ad(B)$ where
$d(A)$ is the elementwise derivative of $A$, and similarly for $B$.
(Write out each element of the matrix product.)
Then the $i$th row of $U$ satisfies
\begin{align}
  U_i(t) &= \frac{\partial p(t)}{\partial w_i} \nonumber \\
         &= \frac{\partial p(t-)}{\partial w_i} H(t) + 
           p(t-) \frac{\partial H(t)}{\partial w_i} \nonumber \\
         &= U_i(t-) H(t) +  p(t-) \frac{\partial H(t)}{\partial w_i} 
           \label{ajci}
\end{align}  
The first term of \ref{ajci} collapses to ordinary matrix multiplication,
the second to a sparse multiplication.
Consider the second term for any chosen observation $i$, which is in state $j$.
This observation appears only in row $j$ of $H(t)$, and thus $dH$ is zero
for all other rows of $H(t)$ by definition.  If observation $i$ is not at risk
at time $t$ then $dH$ is zero.
The derivative vector thus collapses to an elementwise multiplication of
$p(t-)$ and the appropriate row of $dH$, or 0 for those not at risk.

Let $Q(t)$ be the matrix containing $p(t-) Y_i(t)dH_{j(i).}(t)/dw_i$ as its
$i$th row, where $j(i)$ is the current state of observation $i$.
Then
\begin{align*} 
  U(t_2) &= U(t_1)H(t_2) + Q(t_2) \\
  U(t_3) &= \left(U(t_1)H(t_2) + Q(t_2) \right)H(t_3) + Q(t_3) \\
  \vdots &= \vdots
\end{align*}
Can we collapse this in the same way as $C$, retaining only one row of $U$
per group containing the weighted sum?  The equations above are more complex
due to the matrix multiplication.  The answer is yes,
because the weight+collapse operation can be viewed as multiplication by a
design matrix $B$ on the
left, and matrix multiplication is associative and distributive.
That is $B(U + Q) = BU + BQ$ and $B(UH) = (BU)H$.
However, we cannot do the rearrangement that was used in the single endpoint
case to turn this into a sum, equation \eqref{dsurv2}, as matrix multiplication
is not commutative.

For the grouped IJ, we have
\begin{align*}
  U_{g}(t) &= \sum_{i \in g} w_i U_{i}(t) \\
         &= U_{g}(t-)H(t) + \sum_{i \in g}w_i Y_{ij}(t)
              \frac{\partial H(t)}{\partial w_i}
\end{align*}

The above has the potential to be a very slow computation. If $U$ has
$g$ rows and $p$ columns, each event time requires a matrix multiplication
$UH$, creation of the $H$ and $H'$ matrices,
addition of the correct row of $H'$ to each row of $U$,
and finally adding up the squared elements of $U$ to get a variance.
This gives $O(d(gp^2 + 2p^2 + gp + gp))$.
Due to the matrix multiplication, every element of $U$ is modified at
every event time, so there is no escaping the final $O(gp)$ sum of
squares for each column. The primary gain is to make $g$ small.

At each event time $t$ the leverage for the AUC must be updated by
$\delta U(t-)$, where $\delta$ is the amount of time between this event and
the last. At each reporting time which does not coincide with an event time,
there is a further update with $\delta$ the time between the last event and
the reporting time, before the sum of squares is computed.
The AUC at the left endpoint of the curve is 0 and likewise its leverage.
If a reporting time and event time coincide, do the AUC update first.

At each event time
\begin{enumerate}
  \item Update $w_{gj}(\tau) = \sum_{i \in g} Y_{ij}(\tau) w_i$, the number 
    currently at risk in each group.
  \item Loop across the events ($dN_{ijk}(\tau)=1$). Update $C$ using equation
    \eqref{e1} and create the transformation matrix $H$.
  \item Compute $U(\tau) = U(\tau-)H$ using sparse methods.
  \item Loop a second time over the $dN_{ijk}(\tau)$ terms, and for all those
    with $j\ne k$ apply \eqref{e2} and\eqref{e3}.  
  \item For each type of transition $jk$ at this time, apply equation
    \eqref{e4}, and if $j \ne k$, \eqref{e5} and \eqref{e6}. 
  \item Update the variance estimates, which are column sums of squared elements
    of C, U, and AUC leverage.
\end{enumerate}

\begin{align}
  \lambda_{jk}(\tau) &= \frac{\sum_i dN_{ijk}(\tau)}{\sum_i w_i Y_{ij}(\tau)}
    \nonumber \\
  C_{g(i)jk}(\tau) &= C_{gjk}(\tau-) + w_{g(i)}\frac{dN_{ijk}(\tau)}
         {\sum_i w_i Y_{ij}(\tau)} \label{e1} \\
  U_{g(i)j}(\tau) &=U_{gj}(\tau-) - w_{g(i)}\frac{dN_{ijk}(\tau) p_j(\tau-)}
        {\sum_i w_i Y_{ij}(\tau)} \label{e2} \\
  U_{g(i)k}(\tau) &=U_{gk}(\tau-) + w_{g(i)}\frac{dN_{ijk}(\tau) p_j(\tau-)}
        {\sum_i w_i Y_{ij}(\tau)} \label{e3} \\
  C_{gjk}(\tau) &= C_{gjk}(\tau-) -  w_{gj}(\tau)\frac{\lambda_{jk}(\tau)}
      {\sum_i w_i Y_{ij}(\tau)} \label{e4} \\
  U_{gj}(\tau) &=U_{gj}(\tau-) + w_{gj}(\tau)\frac{\lambda_{jk}(\tau) p_j(\tau-)}
      {\sum_i w_i Y_{ij}(\tau)} \label{e5} \\
  U_{gk}(\tau) &=U_{gk}(\tau-) - w_{gj}(\tau)\frac{\lambda_{jk}(\tau) p_j(\tau-)}
      {\sum_i w_i Y_{ij}(\tau)} \label{e6} 
\end{align}

In equations \eqref{e1}--\eqref{e3} above $g(i)$ is the group to which
observed event $dN_i(\tau)$ belongs.
At any reporting time there is often 1 or at most a handful of events.
In equations \eqref{e4}--\eqref{e6} there is an update for each row of
$C$ and $U$, each row a group.
Most often the routine will be called with the default where each unique
subject is their own group: if only 10\% of the subjects are in state $j$
at time $\tau$ then 90\% of the $w_{gj}$ weights will be 0, and it may be
slightly faster to test for a positive $w_{gj}$ than to multiply by 0.

For the `sparse multiplication' above we employ a simple approach: if the
$H-I$ matrix has only 1 non-zero diagonal element, then $U$ can be updated
in place.
For the NAFLD example in the survival vignette, for instance, this is true for
all but 161 out of the 6108 unique event times (5\%).  For the remainder of
the event times it suffices to create a temporary copy of $U(t-)$.

\chapter{Cox model}
\section{Fitting}
To be filled in.
\section{Absolute risk}
\index{coxph!survival}
\subsection{Cumulative hazard and survival curves}
The survival curves from a fitted coxph model default to the Breslow
estimate = exp(-cumulative hazard).
We also support the simple product-limit estimate (for which I have concerns,
see more in the multi-state section), and the Kalbfleisch-Prentice modification
of the product-limit form.

Predicted survival curves are always for a particular set of covariates.
We do not have much sympathy for the notion of \emph{the} baseline hazard,
which textbooks define as the predicted survival for all $x$ equal to zero,
since it can easily lead to overflow or underflow errors in the exponential
function.
The coxph object includes a \code{means} component, which contains a
workable vector of centering constants, i.e., a set for which the exponential
will not be a problem.
The label ``means'' is historical, as individual elements of the vector are
not necessarily a mean; 0/1 covariates for instance are given a centering
constant of zero.  Let $c$ be that centering vector, and define
$r(z;c) = \exp[(z-c)\beta]$ to be the recentered risk score for a hypothetical
subject with covariate vector $z$.
The predicted cumulative hazard for a new subject with covariate vector
$z$ is
\begin{align} 
  \lhat(t;z) &= \frac{\sum_i w_i dN_i(t)}{\sum_i Y_i(t)w_i r(x_i;z)} \nonumber \\
             &= e^{(z-c)\beta} \lhat(t; c) \nonumber \\
  \Lambda(t; z) &= \int_0^t \lhat(s; z) ds \nonumber \\
               &=  e^{(z-c)\beta}\Lambda(t;c) \label{coxcumhaz}\\
   \Shat(t; z) &= e^{-\Lambda(t; z)} \nonumber \\
           &= \Shat(t; c)^{e^{(z-c)\beta}} \label{coxsurv}
\end{align}

Per the above we can compute the survival curve for any covariate vector $z$
from the ``baseline'' curve for $x=c$.
Any vector $c$ that ensures the above can be calculated
without roundoff error will suffice; the survival code uses the
\code{means} component from the fit.
To be clear, for most data sets use of $c=0$ as the centering vector
will be just fine; it is sufficient to keep the $r(x_i;c)$ values far enough 
away from the \code{double.min.exp} and \code{double.max.exp} components of
\code{.Machine} (around $\pm 1000$ on most machines) that weighted means and
variances, using $r(x_i;c)$ as the weights, do not lose precision.
One case we have seen is where someone is looking at a seasonal effect
and uses a Date variable for the covariate.  If the study were centered
in 2022, then the mean $x$ value is approximately 19000.
(Even then we would be okay if $\beta$ is small enough.)
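
The following sketch, which is not part of the package code, checks
equation \eqref{coxcumhaz} numerically; the lung data set and the particular
covariate values are arbitrary choices for illustration.
<<recenter, eval=FALSE>>=
fit  <- coxph(Surv(time, status) ~ age + ph.ecog, data= lung)
base <- basehaz(fit, centered= TRUE)           # Lambda(t; c), with c = fit$means
z    <- c(age= 70, ph.ecog= 1)                 # an arbitrary new subject
risk <- exp(sum((z - fit$means) * coef(fit)))  # exp((z-c) beta)
surv.z <- exp(-base$hazard * risk)             # Breslow S(t; z), equation (coxcumhaz)

sfit <- survfit(fit, newdata= data.frame(age= 70, ph.ecog= 1))
all.equal(surv.z, as.vector(sfit$surv))        # should agree to numerical precision
@ 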

If a user requests the product-limit form, then the predicted survival is
\begin{align*} 
  \Shat(t; z) &= \prod_{s\le t} [1- \lhat(s;z)] \\
              &\ne \left(\prod_{s\le t} [1- \lhat(s;c)]\right)^{e^{(z-c)\beta}}
\end{align*}
 
A key issue is that if $r(x_i;z)$ is very small, e.g., when 
computing the predicted curve for a very high risk subject,
then $1- \lhat(t;z)$ may be negative, leading to an invalid survival estimate.
This is the source of my concern with the simple product-limit form.

\index{survival:{Kalbfleisch-Prentice estimate}}
An alternative estimate of $\lhat$ proposed by Kalbfleisch and Prentice 
avoids this issue of an invalid estimate.  
Also, it collapses to the Kaplan-Meier when $\beta=0$, something that some 
consider to be a major plus. 
In this author's opinion it is a ``meh''; a numerical difference that is tiny
until the number at risk falls below 20, and even then much smaller than the
se of the estimate. 
For ordinary survival, Fleming and Harrington showed the exp(-cumulative hazard)
actually had a smaller MSE than the KM, but again the difference is so small
that no one really cares.
For completeness, our software allows the KP as an option.  The variance of
$\log(S)$, however, is assumed to be exactly the same as that of the
exponential estimate, i.e., the variance of $\Lambda$; creating a special
variance estimate is just too much work.


The KP approach computes a multiplicative increment $z(t)$ as the 
solution to equation \eqref{KPhaz1}.
If there is a single event at time $t$, this reduces to the closed form
of \eqref{KPhaz2}, where $d$ is shorthand for the index of the death at
that time point.  Otherwise the code uses simple bisection on the interval [0,1),
which must contain the solution.
A bit of algebra confirms that the final estimate is mathematically independent
of the centering vector $c$. 

\begin{align}
  \sum  w_i Y_i(t) r(x_i;c) &= \sum dN_i(t) w_i \frac{r(x_i;c)}{1- z(t)^{r(x_i;c)}}
      \label{KPhaz1}  \\
    z(t) &= \left(1- \frac{w_d\, r(x_d;c)}{\sum  w_i Y_i(t) r(x_i;c)}\right)^
        {1/r(x_d;c)} \label{KPhaz2} \\
      \Shat(t; z) &= \left(\prod_{s \le t} z(s)\right)^{e^{(z-c)\beta}} \label{KPhaz3}
\end{align}
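
Here is a sketch of the computation of $z(t)$ at a single event time, using
hypothetical vectors \code{w}, \code{r} ($= r(x_i;c)$), and \code{dN} for the
subjects at risk at that time; \code{uniroot} stands in for the bisection
search used by the C code.
<<kpsketch, eval=FALSE>>=
kp.increment <- function(w, r, dN) {
    rtot <- sum(w * r)                  # left hand side of (KPhaz1)
    if (sum(dN) == 1) {                 # single event: closed form (KPhaz2)
        d <- which(dN == 1)
        (1 - w[d]*r[d]/rtot)^(1/r[d])
    }
    else {                              # tied events: solve (KPhaz1) numerically
        f <- function(z) sum(dN * w * r/(1 - z^r)) - rtot
        uniroot(f, c(1e-8, 1 - 1e-8))$root
    }
}
@ 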

\subsection{Variance}

The variance of $\Lhat(t;z)$ has two terms, the first an analog to the
variance increment for the Nelson-Aalen estimate, the second accounts for the
variance of $\bhat$. 
Let $\tmu(t;z)$ be the cumulative weighted difference between $z$ and the
running mean
\begin{equation}
  \tmu(t;z) = \int_0^t (z- \xbar(s))\, d\Lambda(s; z) 
            = e^{(z-c)\beta} \int_0^t (z- \xbar(s))\, d\Lambda(s; c) \label{tmu}
\end{equation}
This is a weighted average distance between $z$ and the center of the data.
The variance is
\begin{equation*}
  {\rm var}\,\Lhat(t;z) = \left[e^{2(z-c)\beta} \int_0^t \frac{\lhat(s;c)}
    {\sum_i w_i Y_i(s) r(x_i; c)} \right] + \tmu(t;z)'V \tmu(t;z)
\end{equation*}

The \code{agsurv} routine returns individual terms of the integral as
a vector \code{varhaz}, the first term of the variance is then a weighted
cumulative sum.
A matrix with a row for each time point is returned as \code{xbar}, containing
the product of $\lhat(t;c)$ with $\xbar(t)$.  (Note that $\xbar$ itself does not
depend on centering.) The routine also returns the vector $\lhat(t;c)$.

The second term of the variance cannot be treated as a cumulative sum of
squares; we need to first add and \emph{then} multiply. 
In the pseudo code below, \code{hazard} and \code{xbar} are results from
\code{agsurv}, and \code{vmat} is the variance matrix from the coxph model.
<<echo=TRUE, eval=FALSE>>=
temp1 <- outer(hazard, z, '*') - xbar    # row t = lambda(t;c) (z - xbar(t))
temp2 <- apply(temp1, 2, cumsum)         # running integral, one column per coefficient
v2 <-  rowSums((temp2 %*% vmat)* temp2)  # row i = temp2[i,] %*% vmat %*% temp2[i,]
@ 
\begin{itemize}
  \item temp1 is a matrix where each row is $\lambda(t;c) (z- \xbar(t))$
  \item each row of temp2 is the integral up to that point
  \item element $i$ of v2 contains the result of 
    \code{temp2[i,] \%*\% vmat \%*\% t(temp2[i,])}.  
    The use of both matrix multiplication and elementwise multiplication, in
    the right order, is important.
\end{itemize}  
We can correct for the use of $\lambda(t;c)$ instead of $\lambda(t;z)$ at the 
end, but creation and summation of the $z-\xbar$ matrix needs to be
done separately for each $z$.  

\subsection{Time-dependent coefficients}
\index{survival:{time-dependent coefficients}}
Look at an example from the time-dependent vignette.

<<tdcoef>>=
vet2 <- survSplit(Surv(time, status) ~ ., data= veteran, cut=c(90, 180), 
                  episode= "tgroup", id="id", start="time1", end="time2")
vfit2 <- coxph(Surv(time1, time2, status) ~ trt + prior +
                  I(karno/10):strata(tgroup), data=vet2)
vfit2

cdata <- data.frame(time1 = rep(c(0,90,180), 2),
                    time2 =  rep(c(90,180, 365), 2),
                    status= rep(0,6),   #necessary, but ignored
                    tgroup= rep(1:3, 2),
                    trt  =  rep(1,6),
                    prior=  rep(0,6),
                    karno=  rep(c(40, 75), each=3),
                    curve=  rep(1:2, each=3))
cdata
vsurv <- survfit(vfit2, newdata=cdata, id=curve)
@   

In the code \code{slist} will have one curve per stratum, 3 in the above case.
The \code{id} for this new curve has nothing to do with any id variable in
the original fit; here it is an identifier for the resulting curve.
The algorithm is executed once per new curve:
\begin{itemize}
  \item For each row of data corresponding to this id,
    pluck off the appropriate rows of slist[[stratum of this row]],
        i.e.  the time, n.censor, n.event, etc components.
  \item Glue these together into a new curve
\end{itemize}

The (time1, time2) values in the new data correspond to the time values in the
chosen stratum; in this case those are consistent, but one could have
strata on different time scales.  The time values in the new curve will
be 
\begin{itemize}
  \item time values for the stratum of row 1
  \item (time2[1] - time1[2]) + time values for the stratum of row 2
  \item (time2[2] - time1[3]) + time values for the stratum of row 3
  \item \ldots
\end{itemize}
The idea is that if stratum 1 went from 30 to 100 and stratum 2 from 150 to
200, we want our new time scale to be from 30 to 150. 
In the case above these splicing constants are all zero, which is what I
expect, but users do things I have never thought of.

The underlying routine agsurv returns results for the hazard function centered
at $c$ = the means component of the coxph result.
But from this point forward, we need to do separate computations for each
new id, a simple loop since there are unlikely to be many separate id values.
Let $j$ be the rows of newdata that correspond to the first id. 
For each of these rows pull off the hazard increments, the increments of term 1
of the variance, and the $\tmu$ values of \eqref{tmu}; multiply the first
and third by risk[j], the second by risk[j] squared, and save them.
When that is done,
\begin{itemize}
  \item Compute the cumulative hazard, cumsum(unlist(hazard increments))
  \item Exponential survival is exp(-cum haz); KP survival is the cumprod of the
    KP contributions
  \item The first term of the variance is again cumsum(unlist(var1))
  \item For the second term we need the $\tmu$ matrix, using cumsum(unlist)
    for each column separately.  Then the quadratic form can be computed.
\end{itemize}
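
A sketch of this assembly for a single id, using hypothetical lists
\code{hazlist}, \code{var1list}, and \code{tmulist} that hold the already
rescaled increments from each of the id's rows, and \code{vmat} = the coxph
variance matrix.
<<idassemble, eval=FALSE>>=
cumhaz <- cumsum(unlist(hazlist))        # cumulative hazard for this id
surv   <- exp(-cumhaz)                   # exponential form of the survival
var1   <- cumsum(unlist(var1list))       # first term of the variance
tmu    <- apply(do.call(rbind, tmulist), 2, cumsum)  # running tmu, one column per coefficient
var2   <- rowSums((tmu %*% vmat) * tmu)  # quadratic form, second term
std.err <- sqrt(var1 + var2)
@ 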

I am moderately confident of the statistical theory for this in the case of a
time-dependent covariate, where each new id covers the same range of time, 
without holes or overlaps.  For the more complex cases I am much less certain,
and would myself use a jackknife or bootstrap.


\chapter{Multistate hazard model}
\section{Absolute risk}
\index{cox:survival}
To begin, first remember that predictions from a Cox model are always for
\emph{someone}, that is, some particular covariate vector $z$.
This includes survival curves and other estimates of absolute risk.  
Put aside the notion of ``the'' baseline survival curve from the start.
For a multistate model the predicted curve is a matrix product
\begin{align*}
  r_i(z) &= \exp((x_i-z) \beta) \\
  \lambda_{jk}(t;z) & = \frac{\sum_i w_i dN_{ijk}(t)}{\sum_i w_i r_i(z) Y_{ij}(t)} 
      \\
  p(t;z) &= p(0) \prod_{s\le t} e^{A(s;z)} \\
  A(s;z) &= \left( \begin{array}{cccc}
        \square & \lambda_{12}(s;z) & \lambda_{13}(s;z) & \ldots \\
        \lambda_{21}(s;z) & \square & \lambda_{23}(s;z) & \ldots \\
        \vdots & \vdots & \vdots & \ddots \end{array} \right)
\end{align*}
In the above $j$, $k$ are states, $i$ is an observation.  
The diagonal elements are such that row sums are zero.

The mstate package uses the first order Taylor series $\exp(A) \approx I + A$
for $p(t;z)$, which parallels the Aalen-Johansen formula.
We do not, for a couple of reasons.
\begin{itemize}
  \item If the chosen $x$ corresponds to a particularly high relative hazard,
    then a diagonal element of $I+A$ can be negative, and 
    $I+A$ is no longer a valid transition matrix.
  \item Consistency with the single state Cox model's Breslow estimate,
    which uses the exponential.
    The I+A approximation has never been seriously proposed for the simple 
    Cox model, probably because of the point just above.
\end{itemize}

For matrices, $\exp(A)\exp(B) \ne \exp(A+B)$
unless $AB= BA$, which does not hold for this application: we cannot avoid
the matrix multiplication.
For transition matrices such as these $\exp(A(s))$ will have row sums of 1
and all elements non-negative, numerically they are very well behaved 
arguments to the matrix exponential function.
Per section \ref{matexp}, for non-tied death times (only 1 event at this time)
$A$ and $\exp(A)$ have a simple closed form and computation will be fast.
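
A minimal sketch of one step of this product, assuming a hypothetical matrix
\code{rates} holding the $\lambda_{jk}(s;z)$ values at a single event time
(zero diagonal) and the current probability vector \code{p}; the expm package
is used purely for illustration.
<<msprod, eval=FALSE>>=
library(expm)
onestep <- function(p, rates) {
    A <- rates
    diag(A) <- -rowSums(rates)     # diagonal chosen so that rows of A sum to zero
    as.vector(p %*% expm(A))       # p(s) = p(s-) exp(A(s;z))
}
@ 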

For simple survival the above reduces to $\exp(-\Lambda(t;z))$, 
and in that case the
variance of the cumulative hazard estimate has been derived by (ref),
not using the IJ, as
\begin{align*}
  v(t;z) &= d(t;z)'{\cal I}^{-1} d(t;z) +
      \int_0^t \frac{\sum_i dN_i(s)}{\left(\sum_i Y_i(s) r_i(z)\right)^2} \\
  d(t;z) &= \int_0^t (\xbar(s)-z)\, d\Lambda(s; z)
\end{align*}


\chapter{Residuals}
\section{Survival}
\index{Aalen-Johansen!residuals}
The residuals.survfit function will return the IJ values at selected times,
and is the more useful interface for a user.
One reason is that since the user specifies the times, the result can be 
returned as an array with (observation, state, time) as the dimensions, even for
a fit that has multiple curves.
The full influence matrices from the \code{survfit} routine, on the other hand,
are returned as a list with one element per curve, because each curve may have
a different set of event times.

Because only a few times are reported, an important question is whether a 
more efficient algorithm can be created. 
For the cumulative hazard, each transition is a separate computation, so we
can adapt the prior fast computation. 
We now have separate hazards for each observed transition $j:k$, with numerators
in the n.transition matrix and denominators in the n.risk matrix.
An observation in state $j$ is at risk for all $j:k$ transitions for the entire
(time1, time2) interval; the code can use a single set of yindex, tindex, 
sindex and dindex intervals. 
The largest change is to iterate over the transitions, with the 0/1 indicator
for each $j:k$ transition replacing the simple status variable.

The code for this section is a bit opaque; here is some further explanation.
Any given observation is in a single state over the (time1, time2) interval
for that transition and accumulates the $-\lambda_{jk}(t)/n_j(t)$ portion of the
IJ for all $j:k$ transitions that it overlaps, but only those that occur before
the chosen reporting time $\tau$. 
Create a matrix hsum, with one column for each observed $j:k$ transition and
one row for each event time in the survfit object; each column
contains the cumulative sum of $-\lambda_{jk}(t)/n_j(t)$ for that transition.

To index the hsum matrix create a matrix ymin which has a row per observation
and a column per reporting time (so is the same size as the desired output),
whose i,j element contains the index of the largest event time which is
$\le$ min(time2[i], $\tau_j$), that is, the last row of hsum that would apply
to this (observation time, reporting time) pair. Likewise let smin be a
matrix which points to the largest event time which does \emph{not} apply.
The ymin and smin matrices apply to all the transitions; they are created
using the outer and pmin functions.  
The desired $d\lambda$ portion of the residual for the $j:k$ transition will be 
\code{hsum[smin, jk] - hsum[ymin, jk]}, where $jk$ is a shorthand for the
column of hsum that encodes the $j:k$ transition.  A similar sort of trickery
is used for the $dN_{jk}$ part of the residual.
Matrix subscripts could be used to make it slightly faster, at the price of
further impenetrability.
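
A sketch of the construction of ymin and smin, with hypothetical vectors
\code{etime} (the sorted event times of the survfit object), \code{time1} and
\code{time2} (the observation intervals), and \code{tau} (the reporting times).
<<yminsketch, eval=FALSE>>=
# findInterval(x, etime) = index of the largest event time <= x (0 if none)
ymin <- matrix(findInterval(outer(time2, tau, pmin), etime),
               nrow= length(time2))    # last row of hsum that applies
smin <- matrix(findInterval(outer(time1, tau, pmin), etime),
               nrow= length(time1))    # last row that does not apply
@ 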

Efficient computation of residuals for the probability in state is more 
difficult.  The key issue is that at
every event time there is a matrix multiplication $U(t-)H(t)$ which visits
all $n$ by $p$ elements of $U$. 
For illustration assume 3 event times at 1, 2, and 3, and that the user only
wants to report the leverage at time 3.  Let $B$ be the additions at each time point.
Then 
\begin{align*}
U(1) &=  U(0)H(1) + B(1) \\
U(2) &=  U(1)H(2) + B(2) \\
U(3) &=  U(2)H(3) + B(3) \\
     &= \left([U(0)H(1) + B(1)] H(2) + B(2)\right) H(3) + B(3)\\
     &= U(0)[H(1)H(2)H(3)] + B(1)[H(2)H(3)] + B(2)H(3) + B(3) 
\end{align*}

The first three rows are the standard iteration. The $U$ matrix will quickly
become dense, since any $j:k$ transition adds to any at-risk row in state $j$.
Reporting back at only a few time points does not change the computational 
burden of the matrix multiplications.
We cannot reuse the logarithm trick from the KM residuals, as 
  $\log(AB) \ne \log(A) + \log(B)$ when $A$ and $B$ are matrices.

The expansion in the last row above shows another way to arrange the computation.
For any time $t$ with only a $j:k$ event, $B(t)$ is non-zero only in columns
$j$ and $k$, and only for rows with $Y_j(t) =1$, 
those at risk and in state $j$ at time $t$.
For example, assume that $d_{50}$ is the largest death time less than or equal
to the first reporting time.
Accumulate the product in reverse order as $B(50) + B(49)H(50) +
B(48)[H(49)H(50)], \ldots$, updating the product of $H$ terms at each step.
Most updates to $H$ involve a sparse matrix (only one event) so are $O(s^2)$
where $s$ is the number of states.  Each $B$ multiplication is also sparse.
We can then step ahead to the next reporting time using the same algorithm,
along with a final matrix update to carry forward $U(50)$.
At the end there will have been $d$ (the number of event times) matrix
multiplications of a sparse update $H(t)$ times the dense product of later $H$
matrices, and each row $i$ of $U$ will have been `touched'
once for every death time $k$ such that $Y_i(d_k) =1$ (i.e., each $B$ term 
that it is part of), and once more for each prior reporting time.
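
A sketch of this reverse-order accumulation, with hypothetical lists
\code{Hlist} and \code{Blist} holding $H(d_k)$ and $B(d_k)$ for the event times
up to the reporting time, \code{U0} the influence carried forward from any
prior reporting time, \code{nstate} the number of states, and \code{d50} the
index of the last event time at or before the reporting time.
<<revaccum, eval=FALSE>>=
Hprod <- diag(nstate)                   # running product H(k+1) H(k+2) ... H(d50)
U <- Blist[[d50]]
for (k in (d50-1):1) {
    Hprod <- Hlist[[k+1]] %*% Hprod
    U <- U + Blist[[k]] %*% Hprod       # adds B(k)[H(k+1) ... H(d50)]
}
U <- U + U0 %*% (Hlist[[1]] %*% Hprod)  # carry forward the prior influence
@ 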

I have not explored whether this idea will work in practice, nor do I see how
to extend it to the AUC computation.


<<competecheck, echo=FALSE>>=
m2 <- mgus2
m2$etime <- with(m2, ifelse(pstat==0, futime, ptime))
m2$event <- with(m2, ifelse(pstat==0, 2*death, 1))
m2$event <- factor(m2$event, 0:2, c('censor', 'pcm', 'death'))
m2$id <-1:nrow(m2)

# 30 year reporting horizon (360 months)
mfit <- survfit(Surv(etime, event) ~1, m2, id=id)

y3 <- (mfit$time <= 360 & rowSums(mfit$n.event) >0)  # rows of mfit, of interest
etot <- sum(mfit$n.event[y3,])
nrisk <- mean(mfit$n.risk[y3,1])
@ 

As a first example consider the mgus2 data set, a case with two competing
risks of death and plasma cell malignancy (PCM),
and use reporting times of 10, 20, and 30 years.
There are $n=$ \Sexpr{nrow(m2)} observations, \Sexpr{etot} events and
\Sexpr{sum(y3)} unique event times before 30 years, with an average of
$r=$ \Sexpr{round(nrisk,1)} subjects at risk at each of these event times.
As part of data blinding, follow-up times in the MGUS data set were rounded to
months, and as a consequence there are very few singleton event times.

Both the $H$ matrix and products of $H$ remain sparse for any row corresponding
to an absorbing state (a single 1 on the diagonal), so for a competing risk 
problem the $UH$ multiplication requires $O(3n)$ multiplications, while those
for the nested algorithm will be $O(3r)$ where $r$ is the average number at risk.
In this case the improvement for the nested algorithm is modest, and comes
at an additional price of $9d$ multiplications to accumulate $H$.

<<checknafld>>=
ndata <- tmerge(nafld1[,1:8], nafld1, id=id, death= event(futime, status))
ndata <- tmerge(ndata, subset(nafld3, event=="nafld"), id, 
                nafld= tdc(days))
ndata <- tmerge(ndata, subset(nafld3, event=="diabetes"), id = id,
                diabetes = tdc(days), e1= cumevent(days))
ndata <- tmerge(ndata, subset(nafld3, event=="htn"),  id = id,
                htn = tdc(days), e2 = cumevent(days))
ndata <- tmerge(ndata, subset(nafld3, event=="dyslipidemia"), id=id,
                lipid = tdc(days), e3= cumevent(days))
ndata <- tmerge(ndata, subset(nafld3, event %in% c("diabetes", "htn", 
                                                   "dyslipidemia")), 
                id=id, comorbid= cumevent(days))
ndata$cstate <- with(ndata, factor(diabetes + htn + lipid, 0:3, 
                                   c("0mc", "1mc", "2mc", "3mc")))
temp <- with(ndata, ifelse(death, 4, comorbid))
ndata$event <- factor(temp, 0:4, 
         c("censored", "1mc", "2mc", "3mc", "death"))
ndata$age1 <- ndata$age + ndata$tstart/365.25   # analysis on age scale
ndata$age2 <- ndata$age + ndata$tstop/365.25

ndata2 <- subset(ndata, age2 > 50 & age1 < 90)
nfit <- survfit(Surv(age1, age2, event) ~1, ndata, id=id, start.time=50,
                p0=c(1,0,0,0,0), istate=cstate, se.fit = FALSE)
netime <- (nfit$time <=90 & rowSums(nfit$n.event) > 0)
# the number at risk at any time is all those in the initial state of a
#  transition at that time
from <- as.numeric(sub(":[0-9]*", "", colnames(nfit$n.transition)))
fmat <- model.matrix(~ factor(from, 1:5) -1)
temp <- (nfit$n.transition %*% fmat) >0 # TRUE for any transition 'from' state
frisk <- (nfit$n.risk * ifelse(temp,1, 0))
nrisk <- rowSums(frisk[netime,])
maxrisk <- apply(frisk[netime,],2,max)
@ 

As a second case consider the NAFLD data used as an example in the survival
vignette, a 5 state model with death and 0, 1, 2, or 3 metabolic comorbidities
as the states.
For a survival curve on the age scale, using age 50 as a starting point and
computing the influence at age 90, the data set has \Sexpr{nrow(ndata2)} rows
that overlap the age 50--90 interval, while the
average risk set size is $r=$ \Sexpr{round(mean(nrisk))}, $<12$\% of the data rows.
There are \Sexpr{sum(netime)} unique event times between 50 and 90, of
which \Sexpr{sum(rowSums(nfit$n.transition[netime,]) ==1)} have only a single %$
event type; the remainder have two.

In this case the first algorithm can take advantage of sparse $H$ matrices,
and the second of sparse $B$ matrices.
The big gain is the rows of $B$ that are 0.
To take maximal advantage of this we can keep an updated vector of
the indices of the rows that are currently at risk; see the section on skiplists.
Note: this has not yet been implemented.

For the AUC we can use a rearrangement.  Below we write out the leverage on
the AUC at time $d_4$, the fourth event, with $w_i= d_{i+1}-d_i$ the width
of the adjacent rectangles that make up the area under the curve.

\begin{align*}
A(d_4) &= U(0)\left[w_0 + H(d_1) w_1 + H(d_1)H(d_2)w_2 + H(d_1)H(d_2)H(d_3)w_3\right]+ \\
       &\phantom{=} B(d_1)\left[w_1 + H(d_2)w_2 + H(d_2)H(d_3)w_3\right] + \\
       &\phantom{=} B(d_2)\left[w_2 + H(d_3) w_3 \right] + B(d_3)w_3
\end{align*}

For $U$ the matrix sequence was $I$, $H_3I$, $H_2H_3I$, \ldots; it now starts
with $w_3I$, at the next step we multiply the prior matrix by $H_3$ and add 
$w_2I$, at the third multiply the prior matrix by $H_2$ and add $w_1I$, etc.
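
A sketch of the AUC version, again with hypothetical \code{Hlist}, \code{Blist},
\code{U0}, and \code{nstate}, plus rectangle widths \code{w} (with \code{w[k]}
$= w_k$) and \code{w0}; at each step \code{M} is the bracketed multiplier of
$B(d_k)$ in the display above.
<<aucaccum, eval=FALSE>>=
M <- w[3] * diag(nstate)                # multiplier of B(d_3)
A <- Blist[[3]] %*% M
for (k in 2:1) {
    M <- w[k] * diag(nstate) + Hlist[[k+1]] %*% M
    A <- A + Blist[[k]] %*% M           # adds B(d_k)[w_k + H(d_{k+1}) w_{k+1} + ...]
}
A <- A + U0 %*% (w0 * diag(nstate) + Hlist[[1]] %*% M)   # the U(0) term
@ 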

\section{Multistate hazard models}

For multistate models, or for simple models with multiple events per
subject,  we will use the IJ estimate of variance. 
As a preliminary, write down the martingale process for observation $m$
\begin{align*}
 M_{mjk}(t) &= \int_0^t dN_{mjk}(s) - Y_{mj}(s) r_m(z)\, d\Lambda_{jk}(s;z) \\
          &= \int_0^t dN_{mjk}(s) - Y_{mj}(s) r_m(z)\frac{\sum_i w_i dN_{ijk}(s)}
                          {\sum_i w_i Y_{ij}(s) r_i(z)}
\end{align*}
Because the $\exp(-z\beta)$ factor cancels between the numerator and the
denominator, the second expression is independent of the target value $z$.
Computationally, we will
always use a centering constant $c$ = the \code{means} component of the fit
object to forestall any round-off issues in the exp function.
 
\begin{align}
  \frac{\partial \sum_i w_i Y_{ij}(t) r_i(t;z)}{\partial w_m} &=
  Y_{mj}(t)r_m(t;z) + \sum_i w_iY_{ij}(t) r_i(t;z) 
    \left[ \sum_p(x_{ip}- z_p)
     \frac{\partial \beta_p}{\partial w_m} \right] \label{ajij1} \\
     &= Y_m(t)r_m(t;z) + \sum_i w_iY_i(t) r_i(t;z) (x_i- z)D_m' \label{ajij2} \\
 \frac{\partial\lambda_{jk}(t;z)}{\partial w_m} &= 
    \frac{dN_{mjk}(t) -  dM_m(t)} {\sum_i w_i r_i(z) Y_{ij}(t)} 
    - \frac{\left[\sum_iw_i dN_{ijk}(t)\right] 
           \left[\sum_i  w_iY_i(t) r_i(t;z) (x_i- z)D_m' \right]}
           {\left[\sum_i w_i r_i(z) Y_{ij}(t) \right]^2}      \nonumber \\
    &= a - b \label{ajij3} \\
    a &= e^{(z-c)\beta}\frac{dN_{mjk}(t) -  dM_m(t)} {\sum_i w_i r_i(c) Y_{ij}(t)} 
             \nonumber \\
    b &= e^{(z-c)\beta}\lambda_{jk}(t;c) (\xbar(t) -z)D'_m \label{ajij4}
\end{align}

$D_m$ is row $m$ of the 
dfbeta matrix for the Cox model's coefficient vector $\beta$; $D$ is an $n$
by $p$ matrix where $n$ is the number of observations and $p$ the number
of coefficients; treat $x_i-z$ as a row vector, 
the $i$th row of the $X$ matrix, recentered.
The first term of equation \eqref{ajij3} is essentially the leverage of the
ordinary cumulative hazard, with $w_i r_i$ replacing $w_i$. 
The second term is the impact of $w_m$ as mediated through
its effect on the model coefficients.
Much like confidence intervals for the fitted line in a linear regression,
the impact is greater when $z$ is far from the mean of the covariates.
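
As a reminder of where $D$ comes from in practice, it can be obtained directly
from a fitted model; in the sketch below \code{fit} is a hypothetical coxph
fit, and \code{z}, \code{lam} ($=\lambda_{jk}(t;c)$) and \code{xbart}
($=\xbar(t)$) are hypothetical values for one event time.
<<dfbetaD, eval=FALSE>>=
D <- residuals(fit, type= "dfbeta")     # n x p matrix, row m = D_m of (ajij2)
# term b of (ajij4), for every observation at one event time
b <- exp(sum((z - fit$means) * coef(fit))) * lam * drop(D %*% (xbart - z))
@ 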

Any row of the data will be at
risk in exactly one state for the (time1, time2) interval of that data row,
so $Y_{ij}(t)$ is non-zero for only one $j$.
In the AJ estimate this meant that any row of the data only has an
effect on hazards that originate from that state.
For the MSH model, a coefficient that is common to two transitions means that
all observations at risk for either transition will also have an IJ term
for both.

When the same covariate affects two transitions but with disjoint coefficients,
the \code{stacker} routine will have created an expanded $X$ matrix which
has separate columns for the two coefficients.
Likewise, the $D$ matrix will have a separate column for each coefficient and
will be semi-sparse, in that it has zeros for data rows that do not contribute
to a given coefficient.
Over all rows, the right hand portion of equation \eqref{ajij4} is
\code{rowSums((X- rep(z, each=nrow(X)))* D)}. 



\section{Skiplists}
\index{skiplist}
An efficient structure for a continually updated index to the rows currently
at risk appears to be a skiplist, see table II of Pugh
\cite{Pugh90}.  Each observation is added (at time2) and deleted (at
time1) just once from the risk set, so we have $n$ additions and $n$ deletions,
which are faster for skip lists than for binary trees. Reading out the entire
list, which is done at each event time, has the same cost for either structure.
Searching is faster for trees, but we don't need that operation.
If there are $k$ states we keep a separate skiplist containing those at
risk for each state.

When used in the residuals function, 
we can modify the usual skiplist algorithm to take advantage of two things:
the maximum size of the list is known from the n.risk component,
and the additions arrive in nearly random order.
The latter follows from the fact that most data sets are in subject id order,
so that ordering by an observation's ending time skips wildly across data rows.
For the nafld example above, the next figure shows the row number of the
first 200 additions to the risk set, when going from largest time to smallest.
The maximum number at risk in each of the 5
states is (\Sexpr{paste(apply(nfit$n.risk, 2, max), collapse=', ')}).  %$

<<skiplist1, fig=TRUE>>=
sort2 <- order(ndata$age2)
plot(1:200, rev(sort2)[1:200], xlab="Addition to the list",
     ylab="Row index of the addition")
@ 

Using $p=4$ as recommended by Pugh \cite{Pugh90} leads to $\log_4(1420) \approx 5$
linked lists; 1420 is the maximum number of the $n$ =\Sexpr{nrow(ndata)} rows at
risk in any one state at any one time.  
The first is a forward linked list containing all those currently at risk,
the second has 1/4 the elements, the third 1/16, 1/64, 1/256.  
More precisely, 3/4 of the nodes have a single forward pointer, 3/16 have 2 
forward pointers corresponding to levels 1 and 2, 3/64 have 3 forward pointers,
etc.  All told there are 4/3 as many pointers as a singly linked list, a modest
memory cost.
To add a new element to the list, first find the point of insertion in list 4,
list 3, list 2, and list 1 sequentially, each of which requires on average
2 forward steps.  Randomly choose whether the new entry should participate in
1, 2, 3 or 4 of the levels, and insert it.
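
A sketch of the random level choice with $p=1/4$; the function below is
illustrative rather than the package's C code.
<<skiplevel, eval=FALSE>>=
newlevel <- function(maxlevel= 5, p= 1/4) {
    lev <- 1
    while (runif(1) < p && lev < maxlevel) lev <- lev + 1
    lev
}
table(replicate(10000, newlevel()))  # roughly 3/4, 3/16, 3/64, ... of the total
@ 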

\begin{figure}
<<skiplist2, fig=TRUE, echo=FALSE>>=
# simulate a skiplist with period 3
set.seed(1953)
n <- 30
yval <- sort(sample(1:200, n))
phat <- c(2/3, 2/9, 2/27)
y.ht <- rep(c(1,1,2,1,1,1,1,3,1,1,2,1,1,1,2,1), length=n)
plot(yval, rep(1,n), ylim=c(1,3), xlab="Data", ylab="")
indx <- which(y.ht > 1)
segments(yval[indx], rep(1, length(indx)), yval[indx], y.ht[indx])

y1 <- yval[indx]
y2 <- yval[y.ht==3]
lines(c(0, y2[1], y2[1], y1[5], y1[5], max(yval[yval < 100])),
      c(3,3, 2,2,1,1), lwd=2, col=2, lty=3)
@ 
 \caption{A simplified skiplist with 30 elements and 3 levels, showing the
   path used to insert a new element with value 100.}
 \label{skiplist2}
\end{figure}

Figure \ref{skiplist2} shows a simplified list with 30 elements and 3 levels,
along with the search path for adding a new entry with value 100.
The search starts at the head (shown here at x= 0), takes one step at level 3 to
find the largest element $< 100$ at level 3, then 3 steps at level 2,
then one more at level 1 to find the insertion point.  This is compared to
20 steps using only level 1.
Since the height of an inserted node is chosen randomly there is always the 
possibility
of a long final search on level 1, but even a string of 10 is unlikely:
$(3/4)^{10} < .06$.  


\chapter{Misc numerics}
\section{Matrix exponentials and transition matrices}
For multi-state models, we need to compute the exponential of the transition
matrix, sometimes many times.
The matrix exponential is formally defined as
\begin{equation*}
  \exp(A) = I + \sum_{i=1}^\infty A^i/i!
  \end{equation*}
The computation would be nicely handled by the expm package
\emph{if} we did not need derivatives and/or high speed.  
We want both.

We take advantage of three special cases to speed things up.  Each
depends on the structure of $A$, which has the transition rates from
state $j$ to state $k$ in off diagonal element $A_{jk}$ (non negative),
and diagonal elements such that each row sums to zero.
$P= \exp(A)$ is the transition probability matrix for this time point:
each row of $P$ sums to 1 and all elements are non-negative.

\begin{enumerate}
  \item  If all the non-zero rates fall into a single row, or they all
    fall into a single column, then there are simple closed form formulas
    shown below. 
  \item If the rate matrix $A$ is upper triangular and the (non-zero) diagonal
    elements are distinct, there is a fast matrix decomposition algorithm.
    If the transition structure is acyclic then the states can be rearranged
    so that $A$ is upper
    triangular.  The decomposition also gives a simple expression for
    the derivative.
  \item In the general case we use a Pad\'e-Laplace algorithm: the same as found 
    in the matexp package, but we also supply derivatives if requested.
\end{enumerate}

When derivatives are needed, the routine is also passed an array \code{dR} whose
$(j,k,i)$ element is the derivative of $A_{jk}$ with respect to parameter 
$\theta_i$.
In the case of infinitesimal jackknife computations, which is what is most
used in the survival package, $\theta$ will be the per-subject leverage
on each of the underlying hazards at that time
(or a group of subjects, for a grouped jackknife).
Since the rows of $P$ must sum to one, the rows of any derivative of
$P$ will sum to zero, i.e., 
$dR_{jji}= -\sum_{k\ne j} dR_{jki}$.
Since zero rates only occur for transitions that are not possible
($\exp(X\beta) \lambda(t)$ is 0 only if $\lambda(t)$ is 0)
the derivative will be non-zero only if the rate is non-zero. 

Let $P(t)= \exp(A(t))$. 
If there is only one non-zero diagonal element, $A_{jj}$ say, then
\begin{align*} 
    P_{jj} &= e^{A_{jj}} \\
    P_{jk} &= \left(1- e^{A_{jj}}\right) \frac{A_{jk}}{\sum_{l\ne j} A_{jl}} \\
    P_{kk} &= 1; \quad k\ne j 
\end{align*}
and all other elements of $P$ are zero.
If there is only a single event at this time, which turns out to be the most
common case with continuous time data, then $P$ differs from the identity, and
its derivative from zero, in only two elements: positions $(j,j)$ and $(j,k)$,
where $j$ is the current state of the observation with the event and $k$ the
state to which it moves.  It is not true that the derivative is zero for
subjects other than the one who had the event, as everyone has an impact on
$\hat \beta$ and thus on the estimated rate at that time.

\begin{align*}
  \frac{\partial P_{jj}}{\partial \theta_{jki}} &=
     -P_{jj} \frac{\partial A_{jk}}{\partial \theta_{jki}} \\
 \frac{\partial P_{jk}}{\partial \theta_{jki}} &=
     P_{jj} \frac{\partial A_{jk}}{\partial \theta_{jki}}
\end{align*}

If there are multiple events at this time point, but all have the same
starting state $j$, then the derivative of $P$ with respect to $\theta_{jki}$
will be 0 for all rows except row $j$.
\begin{align*} 
 \frac{\partial P_{jk}}{\partial \theta_{jki}} &= \eta_{jk}A_{jj} 
     \;\mbox{single event type} \\
    \frac{\partial A_{jk}}{\partial \eta_{jm}}&=  
      A_{jj} \eta_{jm}\frac{A_{jm}}{\sum_{l\ne j} A_{jl}} +
      (A_{jj} -1) \frac{\eta_{jm} (1- \sum_{l\ne j} A_{jl})}{(\sum_{l\ne j} A_{jl})^2}
\end{align*}


If all non-zero rates lie in a single column $k$, then 
\begin{align*} 
  P_{jj} &= \exp(-A_{jk}) \quad j\ne k\\
  P_{jk} &= 1- P_{jj} \quad j \ne k\\
  P_{kk} &=1
\end{align*}

If time is continuous then most events will be at a unique event time, and the
first formula will thus be the most common case.  This formula also holds
for competing risks.  If there is a shared hazard
for a set of transitions to a single state $k$ (often death), then for a 
unique event time (or tied events of the same type) the second formula will hold.
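
A small check of the single-row closed form against a general matrix
exponential (the expm package), for an arbitrary 3-state rate matrix with
transitions only out of state 1.
<<expmcheck, eval=FALSE>>=
library(expm)
A <- matrix(0, 3, 3)
A[1, 2:3] <- c(.1, .3)                  # arbitrary rates out of state 1
diag(A) <- -rowSums(A)                  # rows sum to zero
P1 <- expm(A)                           # general algorithm
P2 <- diag(3)                           # single-row closed form from above
P2[1, 1]   <- exp(A[1, 1])
P2[1, 2:3] <- (1 - exp(A[1, 1])) * A[1, 2:3]/ sum(A[1, 2:3])
all.equal(P1, P2)
@ 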

The matrix decomposition is used when
the state space is acyclic, the case for many survival problems.  
The states can then be reordered so that $A$ is upper triangular.
In that case the diagonal elements of $A$ are its eigenvalues.  If these
are unique (ignoring the zeros), then an algorithm of Kalbfleisch and Lawless
gives both $\exp(A)$ and its derivatives in terms of a matrix decomposition.
For the remaining cases we use the Pad\'e approximation as found in the
matexp package.

The overall strategy is the following:
\begin{enumerate} 
  \item Call \code{survexpmsetup} once, which will decide if the matrix is
    acyclic, and return a reorder vector if so or a flag if it is not.
    This determination is based on the possible transitions, e.g., on the
    transitions matrix from survcheck.
  \item Call \code{survexpm} for each individual transition matrix.
    In that routine
    \begin{itemize}
      \item First check for the simple cases, otherwise
      \item Do not need derivatives: call survexpm
      \item Do need derivatives
        \begin{itemize}
          \item If upper triangular and no tied values, use the deriv routine
          \item Otherwise use the Pade routine
        \end{itemize}
    \end{itemize}
\end{enumerate}



\printindex
\bibliographystyle{plain}
\bibliography{refer}
\end{document}