\documentclass[article, nojss]{jss}
\usepackage{amsthm}
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{color}
\usepackage{afterpage}

\newtheorem{exmp}{Example}[section]
\newtheorem{rexample}{R Example}[section]
% !Rnw weave = knitr
% \VignetteIndexEntry{The glarma package}
% \VignetteEngine{knitr::knitr}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% declarations for jss.cls %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% almost as usual
\author{William T.M. Dunsmuir\\
  University of New South Wales
  \And
  David J. Scott\\
  University of Auckland
}

\title{The \pkg{glarma} Package for Observation Driven Time Series
  Regression of Counts}

%% for pretty printing and a nice hypersummary also set:
\Plainauthor{William T.M. Dunsmuir, David J. Scott} %% comma-separated

\Plaintitle{The glarma Package for Observation Driven Time Series
  Regression of Counts} %% without formatting

\Shorttitle{\pkg{glarma}: Count Time Series} %% a short title (if necessary)

%% an abstract and keywords
\Abstract{
  We review the theory and application of generalized linear
  autoregressive moving average observation driven models for time
  series of counts with explanatory variables, and describe the
  estimation of these models using the \pkg{glarma} package for
  \proglang{R}. Diagnostic and graphical methods are also
  illustrated by several examples.
}

\Keywords{observation driven count time series, generalized linear
  ARMA models, \pkg{glarma}, \proglang{R}}

\Plainkeywords{observation driven count time series, generalized
  linear ARMA models, glarma, R}

%% publication information
%% NOTE: Typically, this can be left commented and will be filled out by the technical editor
%% \Volume{50}
%% \Issue{9}
%% \Month{June}
%% \Year{2012}
%% \Submitdate{2012-06-04}
%% \Acceptdate{2012-06-04}

%% The address of (at least) one author should be given
%% in the following format:
\Address{
  William T.M. Dunsmuir\\
  Department of Statistics \\
  School of Mathematics and Statistics\\
  University of New South Wales\\
  Sydney, NSW, 2052, Australia\\
  E-mail: \email{W.Dunsmuir@unsw.edu.au}\\
  URL: \url{http://web.maths.unsw.edu.au/~dunsmuir/}
}
%% It is also possible to add a telephone and fax number
%% before the e-mail in the following format:
%% Telephone: +43/512/507-7103
%% Fax: +43/512/507-2851

%% for those who use Sweave please include the following line (with % symbols):
%% need no \usepackage{Sweave.sty}

%% end of declarations %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{document}

<<setup, include=FALSE>>=
library(knitr)
opts_chunk$set(comment = NA, fig.path='Figures/glarma',
               fig.align = 'center', fig.show = 'hold')
options(replace.assign = TRUE, width = 60)
knit_hooks$set(rexample = function(before, options, envir) {
    if (before) {
        sprintf('\\begin{rexample}\\label{%s}\\hfill{}', options$label)
    } else {
        '\\end{rexample}'
    }
}
)
@

<<echo=FALSE, warning = FALSE>>=
library(MASS)
library(glarma)
@


%% include your article here, just as usual
%% Note that you should use the \pkg{}, \proglang{} and \code{} commands.

\section{Introduction}
\label{sec:Intro}

In the past 15 years there has been substantial progress made in
developing regression models with serial dependence for discrete
valued response time series such as arise for modelling Bernoulli,
binomial, Poisson or negative binomial counts. In this paper we
consider the GLARMA (generalized linear autoregressive moving average)
subclass of observation driven models in detail. Assessing and
modelling dependence when the outcomes are discrete random variables
is particularly challenging. A major objective of using GLARMA models
is to make inferences about regression variables while ensuring that
serial dependence is detected and properly accounted for. GLARMA
models are relatively easy to fit and provide an accessible and rapid
way to detect and account for serial dependence in regression
modelling of time series.

\subsection{Generalized state space models}
\label{sec:Generalized SS models}

The GLARMA models considered here are a subclass of generalized state
space models for non-Gaussian time series described in
\cite{davis1999modeling}, \cite{brockwell2010introduction} and
\cite{durbin2012time} for example. A generalized state-space model for
a time series $\{Y_{t},t=1,2,\ldots \}$ consists of an observation
variable and a state variable. The model is expressed in terms of
conditional probability distributions for the observation and state
variables. Such models can be loosely characterized as either
\textbf{parameter driven} or \textbf{observation driven}. The
observation specification is the same for both types of model.

For parameter driven models the serial dependence in the state
equation is governed by a latent, usually stationary, time series that
cannot be observed directly and which evolves independently of past
and present values of the observed responses or the covariates. On the
other hand, as the name implies, in observation driven models, the
random component of $W_{t}$ depends on past observations $\left\{
  Y_{s},s<t\right\}$.

Estimation of parameter driven models requires very high dimensional
integrals to be evaluated or approximated using asymptotic expansions,
simulation methods, numerical integration, or combinations of
these. Because of this they can be difficult to fit, and for routine
model building, in which many potential regressors need to be
considered and evaluated for significance, parameter driven models for
count time series are not yet ready for general use.

On the other hand, the observation driven models considered here are
much easier to fit because the likelihood is conditionally specified
as a product of conditional distributions which belong to the
exponential family and for which the natural parameter is readily
calculated via recursion. As a result they are relatively
straightforward to apply in practical settings with numerous
regressors and long time series.

The outline of the remainder of the paper is as follows. Section
\ref{sec:Theory GLARMA Models} provides the necessary theoretical
background for GLARMA models. It describes the various combinations of
model elements (response distributions, dependence structures and
predictive residuals) that are currently supported in the \pkg{glarma}
package. Options for initializing and obtaining maximum likelihood
estimates using a form of Fisher scoring or Newton-Raphson iterations
are described. Issues of parameter identifiability and convergence
properties of GLARMA models and the maximum likelihood estimates are
also reviewed to guide users in the application of these
models. Section \ref{sec:modelling} describes the various modelling
functions available in the package. Section \ref{sec:diagnostics}
describes the built-in model diagnostic procedures and plotting
functions. Section \ref{sec:examples} provides several examples
illustrating the use of the package on real data sets.

\section{Theory of GLARMA models}
\label{sec:Theory GLARMA Models}

The \pkg{glarma} package provides functionality for estimating
regression relationships between a vector of regressors (covariates,
predictors) and a discrete valued response. In time series regression
modelling it is typically the case that there is serial
dependence. This package models the serial dependence using the GLARMA
class of observation driven models and provides valid inference for
the regression model components.

Let there be $N$ consecutive times at which the response and regressor
series are observed. The response series are observations on the
random variables $\{Y_t: t=1, \ldots, N \} $ and associated with these
are $K$-dimensional vectors of regressors $x_{t}$ observed also for
$t=1,\ldots, N$. We let $\mathcal{F}_t = \{Y_s: s < t, x_s: s \le t\}$
be the past information on the response series and the past and
present information on the regressors. In general the conditional
distribution of $Y_{t}$ given $\mathcal{F}_{t}$ is given in
exponential family form as
\begin{equation}
f(y_{t}|W_{t})=\exp\left\{  y_{t}W_{t}-a_{t}b(W_{t})+c_{t}\right\}
  \label{Eq: EFDensity}%
\end{equation}
where $a_t$ and $c_t$ are sequences of constants possibly depending on
the observations $y_t$. Information in $\mathcal{F}_{t}$ is
summarized in the state variable $W_t$. Details are provided below for
specific distributions available in \pkg{glarma}.

Note that (\ref{Eq: EFDensity}) is not the fully general form of the
exponential family \citep[see][]{mccullagh1989generalized} in that it
does not include an over-dispersion parameter and the canonical link is
used. It follows from (\ref{Eq: EFDensity}) that the conditional means
and variances of the responses are $\mu_t:= E(Y_t|W_t)=a_t\dot{b}(W_t)$
and $\sigma_t^2:=\text{var}(Y_t|W_t)=a_t\ddot{b}(W_t)$. The negative
binomial case is special---see below.

Observation driven models take various forms \citep[see][for a general
discussion]{Benjamin2003}. Here we focus on the case where the state
vector in (\ref{Eq: EFDensity}) is of the general form
\begin{equation}
W_{t}=x_{t}^{T}\beta+Z_{t}. \label{Eq: LinPredWt}%
\end{equation}

In addition to the regression parameters $\beta$ we assume that there
are other parameters $\psi$ which specify the process $\left\{
  Z_{t}\right\} $ as discussed below.

\subsection{GLARMA dependence structure}
\label{sec:GLARMA Dependence Structure}

Serial dependence in the response process can be introduced via $Z_t$
in the state process using linear combinations of past predictive
residuals $e_t$ as
\begin{equation}
Z_{t}=\sum_{j=1}^{\infty}\gamma_{j} e_{t-j}. \label{Eq: ODprocess}%
\end{equation}

The predictive residuals are defined as
\begin{equation}
e_{t}=\frac{Y_{t}-\mu_{t}}{\nu_{t}}
\label{Eq: GeneralGLARMAResids}%
\end{equation}
for some scaling sequence $\{\nu_t\}$---see Section \ref{sec:Types
  Residuals} for the choices currently supported. Note that the $e_t$
are martingale differences, and hence are zero mean and
uncorrelated. When $\nu_t$ is set to the conditional standard
deviation of $Y_t$ the $e_t$ also have unit variance, and hence form
weakly stationary white noise.

One parsimonious way to parameterize the infinite moving average
weights $\gamma_{j}$ in (\ref{Eq: ODprocess}) is to allow them to be
the coefficients in an autoregressive-moving average filter.
Specifically, set
\begin{equation}
\sum_{j=1}^{\infty }\gamma_{j} \zeta^{j}=\theta
(\zeta)/\phi(\zeta)-1, \nonumber
\end{equation}
where $\phi (\zeta)=1-\phi _{1}\zeta-\cdots -\phi _{p}\zeta^{p}$ and
$\theta (\zeta)=1+\theta _{1}\zeta+\cdots +\theta _{q}\zeta^{q}$ are
the respective autoregressive and moving average polynomials of the
ARMA filter, each having all zeros outside the unit circle. It follows
that $\{Z_{t}\}$ satisfies the ARMA-like recursions,
\begin{equation}
Z_{t}=\sum_{i=1}^{p}\phi _{i}(Z_{t-i}+e_{t-i})+\sum_{i=1}^{q}\theta
_{i}e_{t-i}.  \label{Eq: Zt recursion}
\end{equation}
The $\{Z_t\}$ defined in this way can be thought of as the best linear
predictor of a stationary invertible ARMA process with driving noise
specified by the sequence $\{e_t\}$ of scaled deviations of count
responses from their conditional mean given the past responses and the
past and current regressors.  This specification allows recursive
calculation (in time $t$) of the state equation.  The model is
referred to as a GLARMA model \citep[see][]{davis2003observation}.
\cite{Shephard1995GLARmod} provides the first example of such a model
class.
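
To make the recursion concrete, the following sketch (illustrative
only, and not code from the package) simulates an intercept-only
GLARMA$(1,1)$ series directly from (\ref{Eq: Zt recursion}), using the
Poisson response distribution discussed in the next subsection and
Pearson-scaled residuals; the parameter values are arbitrary.

<<simPoisGLARMA, eval = FALSE, tidy = FALSE>>=
## Simulate an intercept-only Poisson GLARMA(1,1) series from the
## recursion Z_t = phi * (Z_{t-1} + e_{t-1}) + theta * e_{t-1},
## with Pearson-scaled residuals. Illustrative values only.
set.seed(1)
n <- 200
beta0 <- 1                             # W_t = beta0 + Z_t
phi <- 0.3; theta <- 0.2               # arbitrary ARMA coefficients
Z <- e <- mu <- y <- numeric(n)
for (t in 1:n) {
    Zlag <- if (t > 1) Z[t - 1] else 0 # Z_t = 0 and e_t = 0 for t <= 0
    elag <- if (t > 1) e[t - 1] else 0
    Z[t] <- phi * (Zlag + elag) + theta * elag
    mu[t] <- exp(beta0 + Z[t])         # Poisson: mu_t = exp(W_t)
    y[t] <- rpois(1, mu[t])
    e[t] <- (y[t] - mu[t]) / sqrt(mu[t]) # Pearson residual
}
@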

\subsection{Response distributions}
\label{sec:Response Distributions}

Specific examples of exponential family members of the form (\ref{Eq:
  EFDensity}) that are currently supported are Poisson, negative
binomial and binomial which includes Bernoulli as a special case.

\emph{Poisson:} Here $a_{t}\equiv 1$, $b(W_t)=\exp(W_t)$,
$c_t=-\log(y_t!)$ and the canonical link is $g(\mu)=\log(\mu)$. Note
that $\mu_t=\exp(W_t)$ and $\sigma_t^2=\exp(W_t)$.

\emph{Binomial/Bernoulli:} Let the number of trials at time $t$ be
$m_t$ and let $\pi_t$ be the success probability for each trial given
$W_t$. Then $a_{t}=m_t$, $b(W_t)=\log(1+\exp(W_t))$ and $c_t=\log {m_t
  \choose y_t}$.  The canonical link is the logit, so that
$W_t=\log(\pi_t/(1-\pi_t))$. Note that $\mu_t=m_t \pi_t$ and
$\sigma_t^2=m_t \pi_t (1-\pi_t)$. The Bernoulli case has $m_t\equiv 1$.

\emph{Negative binomial:} Let $\mu_t=\exp(W_t)$. The \pkg{glarma}
package uses the negative binomial density in the form
\begin{equation}
f(y_{t}|W_{t},\alpha)=\frac{\Gamma (\alpha +y_{t})}{\Gamma (\alpha
)\Gamma (y_{t}+1)}\left[ \frac{\alpha }{\alpha +\mu _{t}}\right] ^{\alpha }%
\left[ \frac{\mu _{t}}{\alpha +\mu _{t}}\right] ^{y_{t}}.
\label{Eq: NegBin Density}
\end{equation}
Note that $\mu_t=\exp(W_t)$ and $\sigma_t^2=\mu_t+\mu_t^2/\alpha$. As
$\alpha \to \infty$ the negative binomial density converges to the
Poisson density.  Also note that if $\alpha$ is known, this density
can be put in the one parameter exponential family form of (\ref{Eq:
  EFDensity}) with appropriate definitions of $W_{t}$, $a_{t}$,
$b(W_{t})$ and $c_{t}$. If $\alpha$ is not known then (\ref{Eq: NegBin
  Density}) is not a member of the one parameter exponential family.
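
For reference, (\ref{Eq: NegBin Density}) is exactly the density
computed by \proglang{R}'s \code{dnbinom()} in its \code{(size, mu)}
parametrization with \code{size} equal to $\alpha$, as the following
quick check (which returns \code{TRUE}) illustrates.

<<negbinDensity, eval = FALSE, tidy = FALSE>>=
## Check the density in the text against dnbinom(size = alpha, mu = mu)
alpha <- 2; mu <- 3; y <- 0:5
f <- gamma(alpha + y) / (gamma(alpha) * gamma(y + 1)) *
    (alpha / (alpha + mu))^alpha * (mu / (alpha + mu))^y
all.equal(f, dnbinom(y, size = alpha, mu = mu))
@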

\subsection{Types of GLARMA residuals}
\label{sec:Types Residuals}
GLARMA predictive residuals are of the form (\ref{Eq:
  GeneralGLARMAResids}) where $\nu_{t}$ is a scaling
sequence. Several choices for this are currently supported, as
illustrated in the sketch following the list.
\begin{description}
\item[Pearson Scaling] Here $\nu_{t}=\nu_{P,t}$ where
  \[
  \nu_{P,t}=[a_t\ddot{b}(W_{t})]^{0.5}%
  \]
  in which case Pearson residuals result.
\item[Score-type Scaling] These replace the conditional standard
  deviation by the conditional variance
  \[
  \nu_{S,t}=a_t\ddot{b}(W_{t})
  \]
  resulting in the `score-type' residuals used in
  \cite{creal2008general}.
\item[Identity Scaling] A third option, which allows some forms of the
  BARMA (binary ARMA) models considered in
  \cite{wang2011autopersistence} to be fitted, is to use no scaling:
  \[
  \nu_{I,t}=1.
  \]
\end{description}
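
As a concrete illustration of these choices, the following sketch
(with made-up values, not package code) computes the three residual
types for a Poisson model, for which $a_t\ddot{b}(W_t) = \exp(W_t) =
\mu_t$.

<<residScalings, eval = FALSE, tidy = FALSE>>=
## The three scalings for a Poisson model; y and mu are illustrative.
y <- c(2, 0, 5, 3)
mu <- c(1.8, 0.9, 4.2, 3.1)        # conditional means exp(W_t)
e.pearson  <- (y - mu) / sqrt(mu)  # Pearson: conditional std deviation
e.score    <- (y - mu) / mu        # score-type: conditional variance
e.identity <- (y - mu)             # identity: no scaling
@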

For the Poisson response distribution GLARMA model, failure to scale
by the variance or standard deviation function will lead to unstable
Poisson means (that diverge to infinity or collapse to zero as an
absorbing state, for instance), and the existence of stationary and
ergodic solutions to the recursive state equation is not assured---see
\cite{davis1999modeling}, \cite{davis2003observation} and
\cite{davis2005maximum} for details.  For the binomial situation this
lack of scaling should not necessarily lead to instability in the
success probability as time evolves, since the success probabilities,
$\pi_{t}$, and observed responses, $Y_{t}$, are both bounded between
$0$ and $1$. Thus degeneracy can only arise if the regressors $x_{t}$
become unbounded from below or above. As recommended in
\cite{davis1999modeling} temporal trend regressors should be scaled
using a factor relating to sample size $n$.

\subsection{The GLARMA likelihood}
\label{sec:GLARMA Likelihood}
Given $n$ successive observations $\{y_{t}: t=1,\ldots,n\}$ on the
response series the likelihood is constructed as the product of
conditional densities of $Y_t$ given $\mathcal{F}_{t}$. The state
vector $W_t$ at each time embodies these conditioning variables and so
the log likelihood is given by
\begin{equation}
l(\delta)=\sum_{t=1}^{n} \log f_{Y_t|W_t}(y_t| W_t; \delta).
\label{Eq: General log likelihood}
\end{equation}

For the Poisson and binomial response distributions the log-likelihood
(\ref{Eq: General log likelihood}) is
\begin{equation}
l(\delta)=\sum_{t=1}^{n} \{ y_{t}W_{t}(\delta)-a_{t}b(W_{t}(\delta))+c_{t} \}
\end{equation}
where $\delta =(\beta,\phi,\theta)$.

For the negative binomial response distribution the log-likelihood is
more complicated because the shape parameter $\alpha$ also has to be
estimated along with $\beta$, $\phi$ and $\theta$. We then let $\delta
=(\beta,\phi,\theta,\alpha)$.



Note that $e_{t}$ in (\ref{Eq: GeneralGLARMAResids}), the $Z_t$ in
(\ref{Eq: Zt recursion}) and thus the $W_{t}$ in (\ref{Eq:
  LinPredWt}) are functions of the unknown parameter $\delta$ and
hence need to be recomputed for each iteration of the likelihood
optimization. Thus in order to calculate the likelihood and its
derivatives, recursive expressions are needed to calculate $e_{t}$,
$Z_t$ and $W_{t}$, as well as their first and second partial derivatives
with respect to $\delta$.  Expressions for these recursive formulae
are available in \cite{davis2005maximum} for the Poisson case.
Corresponding formulae for the binomial case were derived in
\cite{HaolanLu2002} and for the negative binomial case in
\cite{BoWang2004}. The essential computational cost is in the
recursions for $Z_t$ and $W_t$ and their first and second derivatives
with respect to $\delta$. Fortunately, these require identical code
for the various response distributions and definitions of the
predictive residuals $e_t$.

For calculation of the $Z_t$ in (\ref{Eq: Zt recursion}), initializing
conditions for the recursions must be used. The current implementation
in \pkg{glarma} is to set $e_t=0$ and $Z_t=0$ for $t \le 0$ ensuring
that the conditional and unconditional expected values of $e_t$ are
zero for all $t$.

The likelihood is maximized from a suitable starting value of the
parameter $\delta$ using a version of Fisher scoring iteration or by
Newton-Raphson iteration. For a given value of $\delta$ let the vector
of first derivatives with respect to $\delta$ of the log-likelihood
(\ref{Eq: General log likelihood}) be
\begin{equation} \nonumber
d(\delta)=\frac{\partial l(\delta)}{\partial\delta}
\end{equation}
and the second derivative matrix be
\begin{equation} \label{Eq: D General}
D_{NR}(\delta) = \frac{\partial^{2}l(\delta)}{\partial
\delta\partial\delta^{\top}},
\end{equation}
where the matrix of second derivatives of the log-likelihood is (in
the Poisson and binomial response cases) given by
\begin{equation} \label{Eq: D NR PoisBin cases}
D_{NR}(\delta)
=\sum_{t=1}^{n}[y_t-a_t \dot{b}(W_t)]\frac{\partial^{2}W_{t}}{
\partial \delta\partial\delta^{\top}}-\sum_{t=1}^{n} a_t \ddot{b}(W_{t})\frac{\partial W_{t}}{%
\partial \delta}\frac{\partial W_{t}}{\partial\delta^{\top}}
\end{equation}
and $\dot{b}(u)$ and $\ddot{b}(u)$ are the first and second
derivatives respectively of the function $b(u)$ with respect to the
argument $u$.

Using the fact that, at the true parameter value $\delta$, $E[y_t -
a_t \dot{b}(W_t)|\mathcal{F}_{t}]=0$, the expected value of the first
summation in (\ref{Eq: D NR PoisBin cases}) is zero, and hence the
expected value of the matrix of second derivatives is
$E[D_{FS}(\delta)]$ where
\begin{equation} \label{Eq: D FS PoisBin cases}
D_{FS}(\delta)
= - \sum_{t=1}^{n} a_t \ddot{b}(W_{t})\frac{\partial W_{t}}{%
\partial \delta}\frac{\partial W_{t}}{\partial\delta^{\top}} .
\end{equation}
Note also that, due to the martingale difference property of the
predictive residuals, $E[D_{NR}(\delta)]= -
E[d(\delta)d(\delta)^\top]$.  While these expectations cannot be
computed in closed form, expression (\ref{Eq: D FS PoisBin cases})
requires first derivatives only and is used in package \pkg{glarma} as
the basis for the approximate Fisher scoring method.

Thus, if $\delta^{(k)}$ is the parameter vector at the current iterate
$k$, the Newton-Raphson updates proceed using
\begin{equation}\label{Eq: MLE Iterations}
  \delta^{(k+1)}= \delta^{(k)}-D_{NR}(\delta^{(k)})^{-1} d(\delta^{(k)})
\end{equation}
and the approximate Fisher scoring updates use $D_{FS}$ in place of
$D_{NR}$.

Given a specified tolerance $TOL$, iterations continue until the
largest gradient of the log-likelihood satisfies $\max_i
|d_i(\delta^{(k)})| \le TOL$ or a maximum number of iterations MAXITER
is surpassed. At termination we let $\hat{\delta}= \delta^{(k+1)}$ and
call this the ``maximum likelihood estimate'' of $\delta$.
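
Schematically, the iteration just described amounts to the following
loop, in which \code{gradFn()} and \code{hessFn()} are hypothetical
stand-ins for the recursive calculations of $d(\delta)$ and
$D_{NR}(\delta)$ (or $D_{FS}(\delta)$) outlined above, and
\code{delta0} is a starting value.

<<newtonRaphsonSketch, eval = FALSE, tidy = FALSE>>=
## Schematic Newton-Raphson loop; gradFn(), hessFn() and delta0 are
## hypothetical stand-ins, not functions from the glarma package.
TOL <- 1e-6; MAXITER <- 30
delta <- delta0
for (k in 1:MAXITER) {
    d <- gradFn(delta)            # score vector d(delta)
    if (max(abs(d)) <= TOL) break # stop when largest gradient is small
    D <- hessFn(delta)            # D_NR (or D_FS for Fisher scoring)
    delta <- delta - solve(D, d)  # the update step
}
@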

By default, the iterations in (\ref{Eq: MLE Iterations}) are
initialized using the generalized linear model (GLM) estimates for
$\beta$ and zero initial values for the autoregressive moving average
terms. For the negative binomial case $\beta$ and $\alpha$ are
initialized using a call to \code{glm.nb()} from the package
\pkg{MASS} \citep[see][]{VenRip02}. Convergence in the majority of cases
is rapid. Users may optionally specify initial parameter values of
their own choice.

\subsection{Parameter identifiability}
\label{sec:Parameter Identifiability}

The GLARMA component $Z_t$ of the state variable given in (\ref{Eq: Zt
  recursion}) can be rewritten as
\begin{equation}
Z_{t}=\sum_{i=1}^{p}\phi_{i}Z_{t-i}+\sum_{i=1}^{\tilde{q}}\tilde{\theta}_{i}e_{t-i},
\label{Eq: Zt recursion alternate}
\end{equation}
where $\tilde{q}=\max(p,q)$ and
\begin{enumerate}
\item If $p \le q$, $\tilde{\theta}_j=\theta_j+\phi_j$ for
  $j=1,\ldots,p$ and $\tilde{\theta}_j=\theta_j$ for $j=p+1,
  \ldots,q$.
\item If $p>q$, $\tilde{\theta}_j=\theta_j+\phi_j$ for $j=1,\ldots,q$ and
  $\tilde{\theta}_j=\phi_j$ for $j=q+1,\ldots,p$.
\end{enumerate}
When pre-observation period values are set to zero (that is, $Z_t=0$
and $e_t=0$ for $t \le 0$), the recursion (\ref{Eq: Zt recursion
  alternate}) results in $Z_t=0$ for all $t$, and hence in no serial
dependence in the GLARMA model, if and only if $\tilde{\theta}_j = 0$
for $j=1,\ldots,\tilde{q}$. This is equivalent to $\phi_j=-\theta_j$
for $j=1,\ldots,p$ and $\theta_j=0$ for $j=p+1,\ldots,\tilde{q}$.

Consequently a null hypothesis of no serial dependence requires only
these constraints on the $\theta$ and $\phi$ parameters. In some
situations this means that under the null hypothesis of no serial
dependence there are nuisance parameters which cannot be
estimated. This has implications for convergence of the iterations
required to optimize the likelihood and on testing that there is no
serial dependence in the observations (other than induced by the
regression component $x_t^\top \beta$).

When $p>0$ and $q=0$ (equivalent to an ARMA$(p,p)$ specification with
the constraint $\tilde{\theta}_j=\phi_j$), or for a pure MA
specification with $p=0$ and $q>0$, identification issues do not arise
and the hypothesis of no serial dependence corresponds to the
hypothesis that $\phi_j=0$ for $j=1,\ldots,p$ in the first case and
$\theta_j=0$ for $j=1,\ldots,q$ in the second case. The provided
likelihood ratio and Wald tests (see Section \ref{sec:tests} for
further details) will have an asymptotic chi-squared distribution with
the correct degrees of freedom.

In cases where $p>0$ and $q>0$ some caution is advised when fitting
models and testing that serial dependence is not present. To simplify
the discussion we focus on the case where $p=q$:

\begin{enumerate}
\item If there is no serial dependence in the observations but $p=q>0$
  is specified then there is a strong possibility that the likelihood
  optimization for this overspecified model will not converge because
  the likelihood surface will be `ridge-like' along the line where
  $\phi_j=-\theta_j$. This issue is classical for standard ARMA
  models. Similarly, if the degree of serial dependence is of lower
  order than that specified for the GLARMA model, identifiability
  issues and lack of convergence of the likelihood optimizing
  recursions are likely to occur. Following from this, it is highly
  recommended that users start with low orders for $p$ and $q$ and
  initially avoid specifying them to be equal. Once stability of
  estimation is reached for a lower order specification, increasing
  the values of $p$ or $q$ can be attempted. Lack of identifiability
  typically manifests itself in the matrix of second derivatives
  $D_{NR}$, or the approximate Fisher scoring version $D_{FS}$,
  becoming close to singular or even non-positive definite. The state
  variable $W_t$ can also degenerate to $\pm \infty$, in which case an
  error code is provided in the output from the \texttt{glarma()}
  call.
\item The likelihood ratio test that there is no serial dependence
  versus the alternative that there is GLARMA like serial dependence
  with $p=q>0$ will not have a standard chi-squared distribution
  because the parameters $\phi_j$ for $j=1,\ldots,p$ are nuisance
  parameters which cannot be estimated under the null
  hypothesis. Testing methods such as proposed in
  \cite{hansen1996inference} need to be developed for this situation.
\end{enumerate}

\subsection{Stochastic properties of GLARMA models}
\label{sec:Stoch Props GLARMA Mods}

Means, variances and autocovariances for the state process $\{W_t\}$
can be readily derived using the definition of $Z_t$ in (\ref{Eq:
  ODprocess})---see \cite{davis1999modeling}. For the Poisson
response case the corresponding means, variances and autocovariances
for the count response series $\{Y_t\}$ can be derived
approximately. Additionally an approximate interpretation of the
regression coefficients $\beta$ can be given---see
\cite{davis2003observation}. Similar results could be derived for the
negative binomial response case.  For binomial and Bernoulli responses,
calculation of means, variances, autocovariances for the response
series and interpretation of regression coefficients is not
straightforward. This is a typical issue for interpretation of random
effects models and transition models in the binomial or Bernoulli
case---see \cite{diggle2002analysis} for example.

To date the stationarity and ergodicity properties of the GLARMA model
are only partially understood. These properties are important in order
to ensure that the process is capable of generating sample paths that
do not degenerate to zero or explode as time progresses, as well as
for establishing the large sample distributional properties of
estimates of the parameters. \cite{davis2003observation} provide partial
results for perhaps the simplest of all possible models for Poisson
responses specified with $p=0$, $q=1$ and $x_{t}^\top\beta =\beta
$. Results for simple examples of the stationary Bernoulli case are given
in \cite{streett2000some}.

\subsection{Fitted values}
\label{sec:fitted values}
There are two concepts of fitted values currently supported for the
GLARMA model.  The first is defined as the estimated conditional mean
function $\hat{\mu}_t$ at time $t$ calculated using the maximum
likelihood estimates $\hat{\delta}$. Thus
\begin{equation} \label{Eq: Fitted Est Cond Mean}
\hat{\mu}_t =m_t \dot{b} (x^\top_t \hat{\beta} + \hat{Z}_t)
\end{equation}
where $\hat{Z}_t$ are calculated using $\hat{\delta}$ in
(\ref{Eq: Zt recursion}). These fitted values combine the regression
fit (fixed effects) together with the contribution from weighted sums
of past estimated predictive residuals.

Because the unconditional mean function of a GLARMA model is
difficult to obtain exactly in all cases, an estimated unconditional
mean function of $t$ is not provided. Instead, for the second concept
of fitted values, the fitted value from the regression term only is
suggested as a guide to the fit without the effect of random variation
due to $Z_t$.  This is defined to be
\begin{equation} \label{Eq: Fitted Fixed Effects}
\tilde{\mu}_t =m_t \dot{b} (x^\top_t \hat{\beta}).
\end{equation}
We refer to this as the ``fixed effects fit'' in plotting functions
below. Note that this is not an estimate of the unconditional mean
even in the Poisson case (arguably the most tractable for this
calculation)---the theoretical unconditional mean for this case is
approximated by $\exp(x^\top_t \beta + \nu^2/2)$ where $\nu^2 =
\sum_{j=1}^{\infty} \gamma_j^2$---see \cite{davis2003observation} for
details. A similar calculation for the binomial case is not
available. Hence, in view of these theoretical constraints, the use of
the fixed effects fit seems a simple and sensible alternative to the
conditional mean $\hat{\mu}_t$ given by (\ref{Eq: Fitted Est Cond
  Mean}).

\subsection{Distribution theory for likelihood estimation}
\label{sec:Asymptotic Dist MLEs}

For inference in the GLARMA model it is assumed that the central limit
theorem holds so that
\begin{equation}\label{Eq: CLT}
\hat{\delta} \overset{d}{\approx} N(\delta , \hat{\Omega})
\end{equation}
where the approximate covariance matrix is estimated by
\begin{equation}\label{Eq: Asymptotic Cov Mat}
  \hat{\Omega} = - D_{NR}(\hat{\delta})^{-1}
\end{equation}
in the case of Newton-Raphson and similarly with $D_{NR}$ replaced by
$D_{FS}$ in the case of Fisher scoring. Thus a standard error for the
maximum likelihood estimates of the $i$th component of $\delta$ is
computed using $\hat{\Omega}^{1/2}_{ii}$.
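
Since the fitted object stores the estimated covariance matrix as
\texttt{glarmamod\$cov} (see Section \ref{sec:tests}), standard errors
can be extracted directly, as in this one-line sketch.

<<glarmaSE, eval = FALSE, tidy = FALSE>>=
## Standard errors from the estimated covariance matrix of a fitted model
se <- sqrt(diag(glarmamod$cov))
@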

There have been a number of claims in the literature concerning a
central limit theorem for models of this type. However, all of these
make assumptions concerning the convergence of key quantities, which
in turn require ergodicity to be established, and this has not yet
been done in generality. The central limit theorem for the maximum
likelihood parameter estimates is rigorously established only in the
stationary Poisson response case in \cite{davis2003observation} and in
the stationary Bernoulli case in \cite{streett2000some}. Simulation
results are also reported in
\cite{davis1999modeling,davis2003observation} for non-stationary
Poisson models. Other simulations not reported in the literature
support the supposition that the estimates $\hat{\delta}$ have a
multivariate normal distribution for large samples for a range of
regression designs and for the various response distributions
considered here.

A central limit theorem for the maximum likelihood estimators is
currently not available for the general model. Regardless of the
technical issues involved in establishing a general central limit
theorem, the above approximate result seems plausible since, for these
models, the log-likelihood is a sum of elements in a triangular array
of martingale differences.

For nested models likelihood ratio test statistics can be calculated
and compared to the assumed chi-squared asymptotic distribution in the
usual way. The above asymptotic result can be used to obtain an
approximate chi-squared distribution for a Wald test that subsets of
$\delta$ take specified values. Let $\delta^{(1)}$ specify a subset of
$\delta$ that is hypothesized to take a specific value
$\delta_0^{(1)}$. The Wald test is constructed as
\begin{equation} \label{Eq: Wald Test General From}
W^2 = [\hat{\delta}^{(1)}-\delta_0^{(1)}]^\top [\hat{\Omega}^{(1)}]^{-1} [\hat{\delta}^{(1)}-\delta_0^{(1)}]
\end{equation}
where $\hat{\Omega}^{(1)}$ is the submatrix corresponding to
$\delta_0^{(1)}$ of the estimated asymptotic covariance matrix of
(\ref{Eq: Asymptotic Cov Mat}).
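
A minimal sketch of this calculation for a fitted model follows; here
\code{idx} and \code{delta0} are hypothetical (the indices of the
tested components and their hypothesized values), and the coefficient
order returned by \code{coef()} is assumed to match the rows of
\texttt{glarmamod\$cov}.

<<waldSketch, eval = FALSE, tidy = FALSE>>=
## Wald statistic for a subset of the parameters; idx and delta0 are
## hypothetical, and coefficient order is assumed to match glarmamod$cov.
Omega <- glarmamod$cov[idx, idx]          # submatrix for tested subset
d1 <- coef(glarmamod)[idx] - delta0       # deviation from hypothesized value
W2 <- drop(t(d1) %*% solve(Omega) %*% d1) # quadratic form
pchisq(W2, df = length(idx), lower.tail = FALSE) # approximate p value
@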

Further details on implementation of these tests in \pkg{glarma} are
given in Section \ref{sec:tests}.



\section[Modelling]{Modelling functions}
\label{sec:modelling}

There are seven modelling functions for fitting GLARMA models, falling
into three groups:
\begin{description}
\item[Poisson:] \texttt{glarmaPoissonPearson()} and
\texttt{glarmaPoissonScore()}.
\item[Binomial:] \texttt{glarmaBinomialIdentity()},
\texttt{glarmaBinomialPearson()} and\\ \texttt{glarmaBinomialScore()}.
\item[Negative Binomial:] \texttt{glarmaNegBinPearson()} and
\texttt{glarmaNegBinScore()}.
\end{description}

The second component of the name indicates the distribution used for
the counts and the third component the residuals used in the fitting
routine. A call to \texttt{glarma()} results in a call to the
appropriate fitting routine, as determined by the values of the
arguments \texttt{type} and \texttt{residuals} supplied to the
\texttt{glarma()} call. Pearson residuals are used by default. Two
iterative methods are available for the optimization of the
log-likelihood, Fisher Scoring (\texttt{method = "FS"}) and
Newton-Raphson (\texttt{method = "NR"}), the default method being
Fisher Scoring. The object returned by any of the fitting routines is
of class \code{"glarma"}.

To specify the model in a call to \texttt{glarma()}, the response
variable is given by the argument \texttt{y}, and the matrix of
predictors for the regression part of the model is given by the
argument \texttt{X}. The matrix \texttt{X} must include a column of
ones to enable the fitting of a mean term in the regression component
of the model. Initial values can be given for the coefficients in the
regression component using the argument \texttt{beta}. If no initial
values are provided, a call is made to the corresponding generalized
linear model to obtain initial regression coefficient values.

The ARMA component of the model is specified using the arguments
\texttt{phiLags} and \texttt{phiInit} (for the AR terms) and
\texttt{thetaLags} and \texttt{thetaInit} (for the MA terms). For both
the AR and MA terms, the first argument of the pair of arguments
specifies the orders of the lags which are to be included in the
model, and the second argument the initial values of the coefficients
for those lags.

When the counts are modeled using the negative binomial distribution,
there is an additional parameter, the shape parameter of the negative
binomial, designated as $\alpha$ in the GLARMA model. This parameter
is called $\theta$ in the function \texttt{glm.nb()} from the package
\pkg{MASS}, but for GLARMA models $\theta$ refers to the moving
average terms in the ARMA component of the model. An initial value for
$\alpha$ can be provided using the argument \texttt{alphaInit}. If no
initial value is provided, a call is made to \texttt{glm.nb()} from
\pkg{MASS}. An initial value for the call to \texttt{glm.nb()} can be
supplied by giving a value to the argument \texttt{alpha} of
\texttt{glarma()}. The default value for \texttt{alpha} is 1.
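
Putting these arguments together, a call specifying an AR term at lag
1 and an MA term at lag 7 for negative binomial counts might look as
follows (illustrative only; \texttt{y} and \texttt{X} are the response
and regressor matrix described above).

<<glarmaCallSketch, eval = FALSE, tidy = FALSE>>=
## Illustrative call: AR lag 1 and MA lag 7, negative binomial counts,
## zero starting values for the ARMA coefficients.
glarmamod <- glarma(y, X,
                    phiLags = c(1), phiInit = c(0),
                    thetaLags = c(7), thetaInit = c(0),
                    type = "NegBin", method = "FS",
                    residuals = "Pearson", alphaInit = 1)
@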

Because the GLARMA model is fitted using numerical non-linear
optimization, non-convergence is a possibility. Two error codes are
included in the object returned by \texttt{glarma()} to alert
users to numerical problems with fitting. If the Fisher Scoring or
Newton-Raphson iterations fail to converge, \texttt{errCode} will be
set to 1. This can result from non-identifiability of the ARMA
component of the model such as when the degrees and lags of both the
AR and MA components are specified to be the same, as discussed in
Section \ref{sec:Parameter Identifiability}. It is possible that for
certain values of the ARMA parameters the recursions calculating
$\{W_t\}$ diverge to $\pm \infty$. In that case the value of
\texttt{WError} will be set to 1 allowing the user to check for this
condition when the likelihood optimization fails to converge.

Once a fitted model object has been obtained, there are accessor
functions available using S3 methods to extract the coefficients
(\texttt{coef()}, or the alias \texttt{coefficients()}), the fitted
values (\texttt{fitted()} or the alias \texttt{fitted.values()}), the
residuals (\texttt{residuals()} or the alias \texttt{resid()}), the
model frame (\texttt{model.frame()}), the number of observations
(\texttt{nobs()}), the log-likelihood (\texttt{logLik()}), and the AIC
(\texttt{extractAIC()}). These are standard implementations of these
methods with the exception of \texttt{coef()}. This method takes an
argument \texttt{types} which allows the extraction of the ARMA
coefficients (\texttt{types = "ARMA"}), or the regression coefficients
(\texttt{types = "beta"}), or both sets of coefficients (\texttt{types
  = "all"}), the default.

Other S3 methods available for an object of class \code{"glarma"} are
\texttt{print}, \texttt{summary}, \texttt{print.summary}, and
\texttt{plot}.

\section{Diagnostics}
\label{sec:diagnostics}

\subsection[Test]{Likelihood ratio and Wald tests}
\label{sec:tests}

In \pkg{glarma}, likelihood ratio and Wald tests of the hypothesis
that the serial dependence parameters
$\psi=(\phi^\top,\theta^\top)^\top$ are all equal to zero (that is,
tests of $H_{0}:\psi=0$ versus $H_{a}:\psi\neq0$) are provided by the
function \texttt{likTests()}, which operates on an object of class
\code{"glarma"}. The likelihood ratio test compares the likelihood of
the fitted GLARMA model with the likelihood of the GLM model with the
same regression structure. The same null hypothesis applies to the
Wald test, which is based on the Wald statistic defined in (\ref{Eq:
  Wald Test General From}). Values of both statistics are compared to
the chi-squared distribution with degrees of freedom given by the
number of ARMA parameters. These degrees of freedom and associated
chi-squared $p$~values are correct under the situations discussed in
Section \ref{sec:Parameter Identifiability}.

Package users may also construct their own tailor-made likelihood
ratio tests by using the reported log-likelihood (\texttt{logLik()})
for the two models under comparison, and Wald tests $W^2$ in (\ref{Eq:
  Wald Test General From}) using the appropriate submatrix of the
reported estimated covariance matrix in (\ref{Eq: Asymptotic Cov
  Mat}), available as \texttt{glarmamod\$cov}.
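
For instance, a hand-rolled likelihood ratio test against the GLM with
the same regression structure might be sketched as follows, where
\code{glm.fit0} is a hypothetical \code{glm()} fit.

<<customLRT, eval = FALSE, tidy = FALSE>>=
## Hand-rolled LR test of the GLARMA fit against a GLM with the same
## regression structure (glm.fit0 is a hypothetical glm() fit).
LR <- 2 * (as.numeric(logLik(glarmamod)) - as.numeric(logLik(glm.fit0)))
df <- length(coef(glarmamod, types = "ARMA"))  # number of ARMA parameters
pchisq(LR, df = df, lower.tail = FALSE)        # approximate p value
@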


\subsection[PIT]{Probability integral transformation}
\label{sec:pit}

To examine the validity of the assumed distribution in the GLARMA
model a number of authors have suggested the use of the probability
integral transformation (PIT); see, for example,
\cite{Czado2009}. Although the PIT applies to continuous distributions
and the distributions in GLARMA models are discrete, \cite{Czado2009}
have provided a non-randomized approach which has been implemented in
the \pkg{glarma} package. There are four functions involved:
\texttt{glarmaPredProb} calculates conditional predictive
probabilities; \texttt{glarmaPIT} calculates the non-randomized PIT;
\texttt{histPIT} plots a histogram of the PIT; and \texttt{qqPIT}
draws a Q-Q plot of the PIT. If the distribution selected for the
model is correct, then the histogram and Q-Q plot should resemble the
histogram and Q-Q plot obtained when sampling from the uniform
distribution on $[0,1]$. Of the two plots, the histogram is generally
more revealing. Deviations from the expected form of the Q-Q plot are
often difficult to discern.
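
Typical usage on a fitted model is sketched below; we assume each
plotting function accepts the fitted \code{"glarma"} object directly.

<<pitSketch, eval = FALSE, tidy = FALSE>>=
## PIT diagnostics for a fitted model (signatures assumed)
histPIT(glarmamod)  # PIT histogram, 10 bins by default
qqPIT(glarmamod)    # uniform Q-Q plot of the PIT
@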

To calculate the conditional predictive probabilities and the PIT,
the following formulae from \cite{Czado2009} are used.

Given the counts $\{y_t\}$, the conditional predictive probability
function $F^{(t)}(\cdot|y_t)$ is given by
\begin{equation}
  \label{eq:condpit}
F^{(t)}(u|y_t) =
\begin{cases}
  0, & u \leq F(y_t - 1),\\
  \frac{\displaystyle u - F(y_t - 1)}
  {\displaystyle F(y_t) - F(y_t - 1)}, &
  F(y_t - 1) \leq u \leq F(y_t),\\
  1, & u > F(y_t).
\end{cases}
\end{equation}
Here $F(y_t)$ and $F(y_t - 1)$ are the upper and lower conditional
predictive probabilities respectively.

Then the non-randomized PIT is
defined as
\begin{equation}
  \label{eq:pit}
{\bar{F}}(u) = \frac{1}{T-1}\sum_{t=2}^T F^{(t)}(u|y_t).
\end{equation}

To draw the PIT histogram, the number of bins, $I$, is chosen, then
the height of the $i$th bin is
\begin{equation}
  \label{eq:histogram}
  f_i = \bar{F}\left(\frac{i}{I}\right) - \bar{F}\left(\frac{i-1}{I}\right).
\end{equation}
The default number of bins in \texttt{histPIT} is 10. To help with
assessment of the distribution, a horizontal line is drawn on the
histogram at height 1, representing the density function of the
uniform distribution on $[0,1]$.

The Q-Q plot of the PIT plots $\bar{F}(u)$ against $u$, for $u \in
[0,1]$. The quantile function of the uniform distribution on $[0,1]$ is
also drawn on the plot for reference.

\cite{jung2011useful} employ the above diagnostics as well as the
randomized version of PIT residuals to compare alternative competing
count time series models for several data sets.

\subsection{Plots}
\label{sec:plots}

The plot method for objects of class \code{"glarma"} produces six plots by
default: a time series plot with the observed values of the dependent
variable, the fixed effects fit, and the GLARMA fit; an ACF plot of
the residuals; a plot of the residuals against time; a normal Q-Q plot;
the PIT histogram; and the Q-Q plot for the PIT. Any subset of these
six plots can be produced using the \texttt{which} argument. For
example, to omit both of the Q-Q plots (plots 4 and 6), set
\texttt{which = c(1:3, 5)}. Arguments to the plot method are also
provided to change properties of lines in these plots, namely line
types, widths, and colours.

\section{Examples}
\label{sec:examples}

There are four example data sets included in the \pkg{glarma}
package. Sample analyses for all these data sets are provided in
either the help pages for the data sets or for the \texttt{glarma()}
function.

GLARMA models with Poisson counts have appeared previously in the
literature; however, analyses using the binomial and negative binomial
distributions are novel, so we concentrate on those cases in this
section.


\subsection{Asthma data}
\label{sec:asthma-data}

This data set arose from a single hospital (at Campbelltown) as part
of a larger study into the relationship between atmospheric pollution
and the number of asthma cases presenting at emergency departments in
the South West region of Sydney, Australia; see
\cite{davis2003observation}. A description of the columns in the data
set is given in Table~\ref{tab:asthma}.


\begin{table}[!ht]
  \centering
  \begin{tabular}{|c|l|l|}
    \hline
    Column & Variable & Description \\
    \hline
    1 & \texttt{Count} & Daily asthma counts \\
    2 & \texttt{Intercept} & Vector of 1s \\
    3 & \texttt{Sunday} & Dummy variable for Sundays \\
    4 & \texttt{Monday} & Dummy variable for Mondays \\
    5 & \texttt{CosAnnual} & $\cos(2\pi t/365)$, annual cosine term\\
    6 & \texttt{SinAnnual} & $\sin(2\pi t/365)$, annual sine term\\
    7 & \texttt{H7} & Scaled, lagged and smoothed humidity\\
    8 & \texttt{NO2max} & Maximum daily nitrogen oxide\\
    9--16 & \texttt{T1.1990--T2.1993} & Smooth shapes to capture
    school terms in each year\\
    \hline
  \end{tabular}
  \caption{The asthma data set}
  \label{tab:asthma}
\end{table}

We fit a model with a moving average term at lag 7 and negative
binomial counts. The initial values of the regression coefficients are
found by fitting the corresponding GLM, and the initial value of the
shape parameter $\alpha$ of the negative binomial distribution is
taken as 0. Pearson residuals are used and fitting is by
Newton-Raphson.

<<asthma, echo = TRUE, prompt = TRUE, tidy = FALSE, cache = TRUE>>=
data(Asthma)
y <- Asthma[, 1]
X <- as.matrix(Asthma[, 2:16])
glarmamod <- glarma(y, X, thetaLags = 7, type = "NegBin", method = "NR",
                    residuals = "Pearson", alphaInit = 0,
                    maxit = 100, grad = 1e-6)
glarmamod
summary(glarmamod)
@ %def
We note that virtually all the regression terms in the model are
significant, most being highly significant. The moving average term is
significant and both tests indicate that there is a need to fit a
GLARMA model rather than a simple GLM. The value of $\alpha$ is quite
large, suggesting that a Poisson model might provide an adequate fit.

As noted in Section~\ref{sec:plots}, the plot method for an object of
class \code{"glarma"} produces six diagnostic plots by default. As an
example, in Figure~\ref{fig:asthmaplot}, we show just four of these
plots. Since the default title for the PIT histogram is too long for
the available space we use the \code{titles} argument to abbreviate
it.

<<asthmaplot, fig.width = 4, fig.height = 4, out.width = '.4\\linewidth', fig.cap = "Diagnostic plots for the asthma model">>=
par(mar = c(4,4,3,.1), cex.lab = 0.95, cex.axis = 0.9,
    mgp = c(2,.7,0), tcl = -0.3, las = 1)
plot(glarmamod, which = c(1,2,3,5),
     titles = list(NULL, NULL, NULL, "PIT for GLARMA (Neg. Binomial)"))
@ %def

The ACF plot indicates that the model has dealt adequately with any
serial correlation present, and the PIT histogram suggests that the
negative binomial provides a suitable model for the counts.
\newpage


\subsection{Court conviction data}
\label{sec:court-data}

This data set records monthly counts of charges laid and convictions
made in the Local Courts and the Higher Court for armed robbery in New
South Wales, Australia, from 1995--2007; see
\cite{dunsmuir2008assessing}. A description of the columns in the data
set is given in Table~\ref{tab:court}.

\begin{table}[!ht]
  \centering
  \begin{tabular}{|c|l|l|}
    \hline
    Column & Variable & Description \\
    \hline
    1 & \texttt{Date} & Date in month/year format\\
    2 & \texttt{Incpt} & Vector of 1s \\
    3 & \texttt{Trend} & Scaled time trend \\
    4 & \texttt{Step.2001} & Step change from 2001 onwards \\
    5 & \texttt{Trend.2001} & Change in trend from 2001 onwards\\
    6 & \texttt{HC.N} & Monthly number of cases, Higher Court\\
    7 & \texttt{HC.Y} & Monthly number of convictions, Higher Court\\
    8 & \texttt{HC.P} & Monthly proportion of convictions, Higher Court\\
    9 & \texttt{LC.N} & Monthly number of cases, Lower Court\\
    10 & \texttt{LC.Y} & Monthly number of convictions, Lower Court\\
    11 & \texttt{LC.P} & Monthly proportion of convictions, Lower Court\\
    \hline
  \end{tabular}
  \caption{The court conviction data set}
  \label{tab:court}
\end{table}

The first step is to set up dummy variables for months.
<<courtmonths, echo = TRUE, prompt = TRUE, tidy = FALSE, cache = TRUE>>=
data(RobberyConvict)
datalen <- dim(RobberyConvict)[1]
monthmat <- matrix(0, nrow = datalen, ncol = 12)
dimnames(monthmat) <- list(NULL, c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                                   "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
months <- unique(months(strptime(RobberyConvict$Date, format = "%m/%d/%Y"),
                        abbreviate=TRUE))
for (j in 1:12) {
  monthmat[months(strptime(RobberyConvict$Date,  "%m/%d/%Y"),
                  abbreviate = TRUE) == months[j], j] <- 1
}

RobberyConvict <- cbind(rep(1, datalen), RobberyConvict, monthmat)
rm(monthmat)
@ %def

Similar analyses can be carried out for both the Lower Court and the
Higher Court data. Here we consider only the Lower Court data. The
ARMA component of the model is chosen to be AR(1) and the model for
the conviction counts is binomial. A GLM is fitted first to obtain
initial values for the regression coefficients. The initial value of
the AR parameter is set at 0. Pearson residuals are used with
Newton-Raphson iteration.

<<courtmodel, echo = TRUE, prompt = TRUE, tidy = FALSE, cache = TRUE>>=
### Prepare the data for fitting a binomial
y1 <- RobberyConvict$LC.Y
n1 <- RobberyConvict$LC.N
Y <- cbind(y1, n1-y1)
head(Y, 5)
### Fit the GLM
glm.LCRobbery <- glm(Y ~ Step.2001 +
                        I(Feb + Mar + Apr + May + Jun + Jul) +
                        I(Aug + Sep + Oct + Nov + Dec),
                     data = RobberyConvict, family = binomial(link = logit),
                     na.action = na.omit, x = TRUE)
summary(glm.LCRobbery, corr = FALSE)
X <- glm.LCRobbery$x
colnames(X)[3:4] <- c("Feb-Jul","Aug-Dec")
head(X, 5)
glarmamod <- glarma(Y, X, phiLags = c(1), type = "Bin", method = "NR",
                    residuals = "Pearson", maxit = 100, grad = 1e-6)
summary(glarmamod)
@ %def

We observe that the regression coefficients for the GLARMA model are
quite similar to those for the GLM model. In particular, the step
change in 2001 is highly significant. The likelihood ratio and
Wald tests both suggest the need to deal with autocorrelation.
<<courtplot, fig.width = 4, fig.height = 4, out.width = '.4\\linewidth', fig.cap = "Diagnostic plots for the court conviction model">>=
par(mar = c(4,4,3,.1), cex.lab = 0.95, cex.axis = 0.9,
    mgp = c(2,.7,0), tcl = -0.3, las = 1)
plot(glarmamod)
@ %def
In the diagnostic plots shown in Figure~\ref{fig:courtplot}, the ACF
plot shows little residual autocorrelation, and the Q-Q plot of the
residuals shows reasonable conformity with normality. However, the PIT
histogram suggests that the binomial model for the counts is not
appropriate for these data.
\clearpage
\bibliography{glarma}



\end{document}