---
author:
  - name: Rob J Hyndman
    affiliation: Monash University
    address: >
      Department of Econometrics \& Business Statistics
      Monash University
      Clayton VIC 3800, Australia
    email: Rob.Hyndman@monash.edu
    url: https://robjhyndman.com
  - name: Yeasmin Khandakar
    affiliation: Monash University
    address: >
      Department of Econometrics \& Business Statistics
      Monash University
      Clayton VIC 3800, Australia
title:
  formatted: "Automatic Time Series Forecasting:\\newline the \\pkg{forecast} Package for \\proglang{R}"
  # If you use tex in the formatted title, also supply version without
  plain:     "Automatic Time Series Forecasting: the forecast Package for R"
  # For running headers, if needed
  short:     "\\pkg{forecast}: Automatic Time Series Forecasting"
abstract: >
  This vignette for the \proglang{R} package \pkg{forecast} is an updated version of @HK2008, published in the *Journal of Statistical Software*.

  Automatic forecasts of large numbers of univariate time series are often needed in business and other contexts. We describe two automatic forecasting algorithms that have been implemented in the \pkg{forecast} package for \proglang{R}. The first is based on innovations state space models that underlie exponential smoothing methods. The second is a step-wise algorithm for forecasting with ARIMA models. The algorithms are applicable to both seasonal and non-seasonal data, and are compared and illustrated using four real time series. We also briefly describe some of the other functionality available in the \pkg{forecast} package.

keywords:
  # at least one keyword must be supplied
  formatted: [ARIMA models, automatic forecasting, exponential smoothing, prediction intervals, state space models, time series, "\\proglang{R}"]
  plain:     [ARIMA models, automatic forecasting, exponential smoothing, prediction intervals, state space models, time series, R]
preamble: >
  \usepackage{amsmath,rotating,bm,fancyvrb,paralist,thumbpdf}
  \Volume{27}
  \Issue{3}
  \Month{July}
  \Year{2008}
  \Submitdate{2007-05-29}
  \Acceptdate{2008-03-22}
  \def\damped{$_{\mbox{\footnotesize d}}$}
  \let\var=\VAR
  \def\R{\proglang{R}}
  \def\dampfactor{\phi_h}
  \raggedbottom
bibliography: JSS-paper.bib
vignette: >
  %\VignetteIndexEntry{Automatic Time Series Forecasting: the forecast Package for R (Hyndman & Khandakar, JSS 2008)}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
documentclass: jss
output: if (rmarkdown::pandoc_version() >= "2") rticles::jss_article else rmarkdown::html_vignette
fig_width: 7
fig_height: 6
fig_caption: true
---

```{r load_forecast, echo=FALSE, message=FALSE}
library('forecast')
```
```{r load_expsmooth, echo=FALSE, message=FALSE, eval=FALSE}
library('expsmooth')
```
```{r expsmooth_datasets, echo=FALSE, message=FALSE}
bonds <-
structure(c(5.83, 6.06, 6.58, 7.09, 7.31, 7.23, 7.43, 7.37, 7.6,
7.89, 8.12, 7.96, 7.93, 7.61, 7.33, 7.18, 6.74, 6.27, 6.38, 6.6,
6.3, 6.13, 6.02, 5.79, 5.73, 5.89, 6.37, 6.62, 6.85, 7.03, 6.99,
6.75, 6.95, 6.64, 6.3, 6.4, 6.69, 6.52, 6.8, 7.01, 6.82, 6.6,
6.32, 6.4, 6.11, 5.82, 5.87, 5.89, 5.63, 5.65, 5.73, 5.72, 5.73,
5.58, 5.53, 5.41, 4.87, 4.58, 4.89, 4.69, 4.78, 4.99, 5.23, 5.18,
5.54, 5.9, 5.8, 5.94, 5.91, 6.1, 6.03, 6.26, 6.66, 6.52, 6.26,
6, 6.42, 6.1, 6.04, 5.83, 5.8, 5.74, 5.72, 5.23, 5.14, 5.1, 4.89,
5.13, 5.37, 5.26, 5.23, 4.97, 4.76, 4.55, 4.61, 5.07, 5, 4.9,
5.28, 5.21, 5.15, 4.9, 4.62, 4.24, 3.88, 3.91, 4.04, 4.03, 4.02,
3.9, 3.79, 3.94, 3.56, 3.32, 3.93, 4.44, 4.29, 4.27, 4.29, 4.26,
4.13, 4.06, 3.81, 4.32, 4.7), .Tsp = c(1994, 2004.33333333333,
12), class = "ts")
usnetelec <-
structure(c(296.1, 334.1, 375.3, 403.8, 447, 476.3, 550.3, 603.9,
634.6, 648.5, 713.4, 759.2, 797.1, 857.9, 920, 987.2, 1058.4,
1147.5, 1217.8, 1332.8, 1445.5, 1535.1, 1615.9, 1753, 1864.1,
1870.3, 1920.8, 2040.9, 2127.4, 2209.4, 2250.7, 2289.6, 2298,
2244.4, 2313.4, 2419.5, 2473, 2490.5, 2575.3, 2707.4, 2967.3,
3038, 3073.8, 3083.9, 3197.2, 3247.5, 3353.5, 3444.2, 3492.2,
3620.3, 3694.8, 3802.1, 3736.6, 3858.5, 3848), .Tsp = c(1949,
2003, 1), class = "ts")
ukcars <-
structure(c(330.371, 371.051, 270.67, 343.88, 358.491, 362.822,
261.281, 240.355, 325.382, 316.7, 171.153, 257.217, 298.127,
251.464, 181.555, 192.598, 245.652, 245.526, 225.261, 238.211,
257.385, 228.461, 175.371, 226.462, 266.15, 287.251, 225.883,
265.313, 272.759, 234.134, 196.462, 205.551, 291.283, 284.422,
221.571, 250.697, 253.757, 267.016, 220.388, 277.801, 283.233,
302.072, 259.72, 297.658, 306.129, 322.106, 256.723, 341.877,
356.004, 361.54, 270.433, 311.105, 326.688, 327.059, 274.257,
367.606, 346.163, 348.211, 250.008, 292.518, 343.318, 343.429,
275.386, 329.747, 364.521, 378.448, 300.798, 331.757, 362.536,
389.133, 323.322, 391.832, 421.646, 416.823, 311.713, 381.902,
422.982, 427.722, 376.85, 458.58, 436.225, 441.487, 369.566,
450.723, 462.442, 468.232, 403.636, 413.948, 460.496, 448.932,
407.787, 469.408, 494.311, 433.24, 335.106, 378.795, 387.1, 372.395,
335.79, 397.08, 449.755, 402.252, 391.847, 385.89, 424.325, 433.28,
391.213, 408.74, 445.458, 428.202, 379.048, 394.042, 432.796), .Tsp = c(1977,
2005, 4), class = "ts")
visitors <-
structure(c(75.7, 75.4, 83.1, 82.9, 77.3, 105.7, 121.9, 150,
98, 118, 129.5, 110.6, 91.7, 94.8, 109.5, 105.1, 95, 130.3, 156.7,
190.1, 139.7, 147.8, 145.2, 132.7, 120.7, 116.5, 142, 140.4,
128, 165.7, 183.1, 222.8, 161.3, 180.4, 185.2, 160.5, 157.1,
163.8, 203.3, 196.9, 179.6, 207.3, 208, 245.8, 168.9, 191.1,
180, 160.1, 136.6, 142.7, 175.4, 161.4, 149.9, 174.1, 192.7,
247.4, 176.2, 192.8, 189.1, 181.1, 149.9, 157.3, 185.3, 178.2,
162.7, 190.6, 198.6, 253.1, 177.4, 190.6, 189.2, 168, 161.4,
172.2, 208.3, 199.3, 197.4, 216, 223.9, 266.8, 196.1, 238.2,
217.8, 203.8, 175.2, 176.9, 219.3, 199.1, 190, 229.3, 255, 302.4,
242.8, 245.5, 257.9, 226.3, 213.4, 204.6, 244.6, 239.9, 224,
267.2, 285.9, 344, 250.5, 304.3, 307.4, 255.1, 214.9, 230.9,
282.5, 265.4, 254, 301.6, 311, 384, 303.8, 319.1, 313.5, 294.2,
244.8, 261.4, 329.7, 304.9, 268.6, 320.7, 342.9, 422.3, 317.2,
392.7, 365.6, 333.2, 261.5, 306.9, 358.2, 329.2, 309.2, 350.4,
375.6, 465.2, 342.9, 408, 390.9, 325.9, 289.1, 308.2, 397.4,
330.4, 330.9, 366.5, 379.5, 448.3, 346.2, 353.6, 338.6, 341.1,
283.4, 304.2, 372.3, 323.7, 323.9, 354.8, 367.9, 457.6, 351,
398.6, 389, 334.1, 298.1, 317.1, 388.5, 355.6, 353.1, 397, 416.7,
460.8, 360.8, 434.6, 411.9, 405.6, 319.3, 347.9, 429, 372.9,
403, 426.5, 459.9, 559.9, 416.6, 429.2, 428.7, 405.4, 330.2,
370, 446.9, 384.6, 366.3, 378.5, 376.2, 523.2, 379.3, 437.2,
446.5, 360.3, 329.9, 339.4, 418.2, 371.9, 358.6, 428.9, 437,
534, 396.6, 427.5, 392.5, 321.5, 260.9, 308.3, 415.5, 362.2,
385.6, 435.3, 473.3, 566.6, 420.2, 454.8, 432.3, 402.8, 341.3,
367.3, 472, 405.8, 395.6, 449.9, 479.9, 593.1, 462.4, 501.6,
504.7, 409.5), .Tsp = c(1985.33333333333, 2005.25, 12), class = "ts")
```

# Introduction

Automatic forecasts of large numbers of univariate time series are often needed in business. It is common to have over one thousand product lines that need forecasting at least monthly. Even when a smaller number of forecasts are required, there may be nobody suitably trained in the use of time series models to produce them. In these circumstances, an automatic forecasting algorithm is an essential tool. Automatic forecasting algorithms must determine an appropriate time series model, estimate the parameters and compute the forecasts. They must be robust to unusual time series patterns, and applicable to large numbers of series without user intervention. The most popular automatic forecasting algorithms are based on either exponential smoothing or ARIMA models.

In this article, we discuss the implementation of two automatic univariate forecasting methods in the \pkg{forecast} package for \proglang{R}. We also briefly describe some of the other univariate forecasting methods available in the \pkg{forecast} package.

The \pkg{forecast} package for the \proglang{R} system for statistical computing [@R] is available from the Comprehensive \proglang{R} Archive Network at \url{https://CRAN.R-project.org/package=forecast}. Version `r packageVersion('forecast')` of the package was used for this paper. The \pkg{forecast} package contains functions for univariate forecasting and a few examples of real time series data. For more extensive testing of forecasting methods, the \pkg{fma} package contains the 90 data sets from @MWH3, the \pkg{expsmooth} package contains 24 data sets from @expsmooth08, and the \pkg{Mcomp} package contains the 1001 time series from the M-competition [@Mcomp82] and the 3003 time series from the M3-competition [@M3comp00].

The \pkg{forecast} package implements automatic forecasting using exponential smoothing, ARIMA models, the Theta method [@AN00], cubic splines [@HKPB05], as well as other common forecasting methods. In this article, we primarily discuss the exponential smoothing approach (in Section \ref{sec:expsmooth}) and the ARIMA modelling approach (in Section \ref{sec:arima}) to automatic forecasting. In Section \ref{sec:package}, we describe the implementation of these methods in the \pkg{forecast} package, along with other features of the package.

# Exponential smoothing {#sec:expsmooth}

Although exponential smoothing methods have been around since the
1950s, a modelling framework incorporating procedures for model
selection was not developed until relatively recently.
@OKS97, @HKSG02 and @HKOS05 have shown that all
exponential smoothing methods (including non-linear methods) are
optimal forecasts from innovations state space models.

Exponential smoothing methods were originally classified by Pegels'
(1969)\nocite{Pegels69} taxonomy. This was later extended by
@Gardner85, modified by @HKSG02, and extended again by
@Taylor03a, giving the fifteen methods shown in the
following table.

\begin{table}[!hbt]
\begin{center}\vspace{0.2cm}
\begin{tabular}{|ll|ccc|} \hline
& &\multicolumn{3}{c|}{Seasonal Component} \\
\multicolumn{2}{|c|}{Trend}& N & A & M\\
\multicolumn{2}{|c|}{Component}  & (None)    & (Additive)  & (Multiplicative)\\
\cline{3-5} &&&\\[-0.3cm]
N & (None) & N,N & N,A & N,M\\
&&&&\\[-0.3cm]
A & (Additive) & A,N & A,A & A,M\\
&&&&\\[-0.3cm]
A\damped & (Additive damped) & A\damped,N & A\damped,A & A\damped,M\\
&&&&\\[-0.3cm]
M & (Multiplicative) & M,N & M,A & M,M\\
&&&&\\[-0.3cm]
M\damped  & (Multiplicative damped) & M\damped,N & M\damped,A & M\damped,M\\
\hline
\end{tabular}\vspace{0.2cm}
\end{center}
\caption{The fifteen exponential smoothing methods.}
\end{table}

Some of these methods are better known under other names. For
example, cell (N,N) describes the simple exponential smoothing (or
SES) method, cell (A,N) describes Holt's linear method, and cell
(A\damped,N) describes the damped trend method. The additive
Holt-Winters' method is given by cell (A,A) and the multiplicative
Holt-Winters' method is given by cell (A,M). The other cells
correspond to less commonly used but analogous methods.

## Point forecasts for all methods

We denote the observed time series by $y_1,y_2,\dots,y_n$. A
forecast of $y_{t+h}$ based on all of the data up to time $t$ is
denoted by $\hat{y}_{t+h|t}$.
To illustrate the method, we give the point forecasts and updating
equations for method (A,A), the Holt-Winters' additive method:
\begin{subequations}\label{eq:AMmethod}
\begin{align}
\mbox{Level:}\quad &\ell_t  =  \alpha (y_t - s_{t-m}) + (1-\alpha)(\ell_{t-1} + b_{t-1})\hspace*{1cm} \label{eq:3-44a}\\
\mbox{Growth:}\quad &b_t  =  \beta^*(\ell_t - \ell_{t-1}) + (1-\beta^*)b_{t-1} \label{eq:3-45a}\\
\mbox{Seasonal:}\quad &s_t  =  \gamma(y_t - \ell_{t-1} -b_{t-1}) + (1-\gamma)s_{t-m}\label{eq:3-46a}\\
\mbox{Forecast:}\quad &\hat{y}_{t+h|t}  =  \ell_t + b_th
+s_{t-m+h_m^+}. \label{eq:3-47a}
\end{align}
\end{subequations}
where $m$ is the length of seasonality (e.g., the number of months
or quarters in a year), $\ell_t$ represents the level of the series,
$b_t$ denotes the growth, $s_t$ is the seasonal component,
$\hat{y}_{t+h|t}$ is the forecast for $h$ periods ahead, and $h_m^+
= \big[(h-1) \mbox{ mod } m\big] + 1$. To use method
\eqref{eq:AMmethod}, we need values for the initial states $\ell_0$,
$b_0$ and $s_{1-m},\dots,s_0$, and for the smoothing parameters
$\alpha$, $\beta^*$ and $\gamma$. All of these will be estimated
from the observed data.

Equation \eqref{eq:3-46a} is slightly different from the usual
Holt-Winters equations such as those in @MWH3 or
@BOK05. These authors replace \eqref{eq:3-46a} with
$$
s_t = \gamma^*(y_t - \ell_{t}) + (1-\gamma^*)s_{t-m}.
$$
If $\ell_t$ is substituted using \eqref{eq:3-44a}, we obtain
$$s_t = \gamma^*(1-\alpha)(y_t - \ell_{t-1}-b_{t-1}) +
\{1-\gamma^*(1-\alpha)\}s_{t-m}.
$$
Thus, we obtain identical forecasts using this approach by replacing
$\gamma$ in \eqref{eq:3-46a} with $\gamma^*(1-\alpha)$. The
modification given in \eqref{eq:3-46a} was proposed by @OKS97
to make the state space formulation simpler. It is equivalent to
Archibald's (1990)\nocite{Archibald90} variation of the
Holt-Winters' method.

\begin{sidewaystable}
\begin{small}
\begin{center}
\begin{tabular}{|c|lll|} \hline
  & \multicolumn{3}{c|}{Seasonal} \\
{Trend} & \multicolumn{1}{c}{N} & \multicolumn{1}{c}{A}
& \multicolumn{1}{c|}{M}\\
\cline{2-4}
    & $\ell_t = \alpha y_t + (1-\alpha) \ell_{t-1}$
    & $\ell_t = \alpha (y_t - s_{t-m}) + (1-\alpha) \ell_{t-1}$
    & $\ell_t = \alpha (y_t / s_{t-m}) + (1-\alpha) \ell_{t-1}$\\
{N}
    &
    & $s_t = \gamma (y_t - \ell_{t-1}) + (1-\gamma) s_{t-m}$
    & $s_t = \gamma (y_t / \ell_{t-1}) + (1-\gamma) s_{t-m}$ \\
    & $\hat{y}_{t+h|t} = \ell_t$
    & $\hat{y}_{t+h|t} = \ell_t + s_{t-m+h_m^+}$
    & $\hat{y}_{t+h|t}= \ell_ts_{t-m+h_m^+}$ \\
\hline
    & $\ell_t = \alpha y_t + (1-\alpha) (\ell_{t-1}+b_{t-1})$
    & $\ell_t = \alpha (y_t - s_{t-m}) + (1-\alpha)    (\ell_{t-1}+b_{t-1})$
    & $\ell_t = \alpha (y_t / s_{t-m}) + (1-\alpha)    (\ell_{t-1}+b_{t-1})$\\
{A}
    & $b_t = \beta^* (\ell_t-\ell_{t-1}) + (1-\beta^*) b_{t-1}$
    & $b_t = \beta^* (\ell_t-\ell_{t-1}) + (1-\beta^*) b_{t-1}$
    & $b_t = \beta^* (\ell_t-\ell_{t-1}) + (1-\beta^*) b_{t-1}$\\
    &
    & $s_t = \gamma (y_t - \ell_{t-1}-b_{t-1}) + (1-\gamma)    s_{t-m}$
    & $s_t = \gamma (y_t / (\ell_{t-1}+b_{t-1})) + (1-\gamma)    s_{t-m}$\\
    & $\hat{y}_{t+h|t} = \ell_t+hb_t$
    & $\hat{y}_{t+h|t} = \ell_t +hb_t +s_{t-m+h_m^+}$
    & $\hat{y}_{t+h|t}= (\ell_t+hb_t)s_{t-m+h_m^+}$ \\
\hline
    & $\ell_t = \alpha y_t + (1-\alpha) (\ell_{t-1}+\phi b_{t-1})$
    & $\ell_t = \alpha (y_t - s_{t-m}) + (1-\alpha) (\ell_{t-1}+\phi    b_{t-1})$
    & $\ell_t = \alpha (y_t / s_{t-m}) + (1-\alpha) (\ell_{t-1}+\phi    b_{t-1})$\\
{A\damped }
    & $b_t = \beta^* (\ell_t-\ell_{t-1}) + (1-\beta^*) \phi    b_{t-1}$
    & $b_t = \beta^* (\ell_t-\ell_{t-1}) + (1-\beta^*) \phi    b_{t-1}$
    & $b_t = \beta^* (\ell_t-\ell_{t-1}) + (1-\beta^*) \phi    b_{t-1}$\\
    &
    & $s_t = \gamma (y_t - \ell_{t-1}-\phi b_{t-1}) + (1-\gamma)    s_{t-m}$
    & $s_t = \gamma (y_t / (\ell_{t-1}+\phi b_{t-1})) + (1-\gamma)    s_{t-m}$\\
    & $\hat{y}_{t+h|t} = \ell_t+\dampfactor b_t$
    & $\hat{y}_{t+h|t} = \ell_t+\dampfactor b_t+s_{t-m+h_m^+}$
    & $\hat{y}_{t+h|t}= (\ell_t+\dampfactor b_t)s_{t-m+h_m^+}$
\\
\hline
    & $\ell_t = \alpha y_t + (1-\alpha) \ell_{t-1}b_{t-1}$
    & $\ell_t = \alpha (y_t - s_{t-m}) + (1-\alpha)    \ell_{t-1}b_{t-1}$
    & $\ell_t = \alpha (y_t / s_{t-m}) + (1-\alpha)    \ell_{t-1}b_{t-1}$\\
{M}
    & $b_t = \beta^* (\ell_t/\ell_{t-1}) + (1-\beta^*) b_{t-1}$
    & $b_t = \beta^* (\ell_t/\ell_{t-1}) + (1-\beta^*) b_{t-1}$
    & $b_t = \beta^* (\ell_t/\ell_{t-1}) + (1-\beta^*) b_{t-1}$\\
    &
    & $s_t = \gamma (y_t - \ell_{t-1}b_{t-1}) + (1-\gamma) s_{t-m}$
    & $s_t = \gamma (y_t / (\ell_{t-1}b_{t-1})) + (1-\gamma) s_{t-m}$\\
    & $\hat{y}_{t+h|t} = \ell_tb_t^h$
    & $\hat{y}_{t+h|t} = \ell_tb_t^h + s_{t-m+h_m^+}$
    & $\hat{y}_{t+h|t}= \ell_tb_t^hs_{t-m+h_m^+}$ \\
\hline
    & $\ell_t = \alpha y_t + (1-\alpha) \ell_{t-1}b^\phi_{t-1}$
    & $\ell_t = \alpha (y_t - s_{t-m}) + (1-\alpha)\ell_{t-1}b^\phi_{t-1}$
    & $\ell_t = \alpha (y_t / s_{t-m}) + (1-\alpha)\ell_{t-1}b^\phi_{t-1}$\\
{M\damped }
    & $b_t = \beta^* (\ell_t/\ell_{t-1}) + (1-\beta^*)    b^\phi_{t-1}$
    & $b_t = \beta^* (\ell_t/\ell_{t-1}) + (1-\beta^*)    b^\phi_{t-1}$
    & $b_t = \beta^* (\ell_t/\ell_{t-1}) + (1-\beta^*)    b^\phi_{t-1}$\\
    &
    & $s_t = \gamma (y_t - \ell_{t-1}b^\phi_{t-1}) + (1-\gamma) s_{t-m}$
    & $s_t = \gamma (y_t / (\ell_{t-1}b^\phi_{t-1})) + (1-\gamma) s_{t-m}$\\
    & $\hat{y}_{t+h|t} = \ell_tb_t^{\phi_h}$
    & $\hat{y}_{t+h|t} = \ell_tb_t^{\phi_h} + s_{t-m+h_m^+}$
    & $\hat{y}_{t+h|t}= \ell_tb_t^{\phi_h}s_{t-m+h_m^+}$ \\
\hline
\end{tabular}
\end{center}
\end{small}

\caption{Formulae for recursive calculations and point forecasts. In
each case,  $\ell_t$ denotes the series level at time $t$, $b_t$
denotes the slope at time $t$, $s_t$ denotes the seasonal component
of the series at time $t$, and $m$ denotes the number of seasons in
a year; $\alpha$, $\beta^*$, $\gamma$ and $\phi$ are constants,
$\phi_h = \phi+\phi^2+\dots+\phi^{h}$ and $h_m^+ = \big[(h-1)
\mbox{ mod } m\big] + 1$.}\label{table:pegels}
\end{sidewaystable}

Table \ref{table:pegels} gives recursive formulae for computing
point forecasts  $h$ periods ahead for all of the exponential
smoothing methods. Some interesting special cases can be obtained by
setting the smoothing parameters to extreme values. For example, if
$\alpha=0$, the level is constant over time; if $\beta^*=0$, the
slope is constant over time; and if $\gamma=0$, the seasonal pattern
is constant over time.  At the other extreme, naïve forecasts (i.e.,
$\hat{y}_{t+h|t}=y_t$ for all $h$) are obtained using the (N,N)
method with $\alpha=1$.  Finally, the additive and multiplicative
trend methods are special cases of their damped counterparts
obtained by letting $\phi=1$.
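
To make the recursions concrete, the following sketch (not part of the package; all smoothing parameters and initial states are supplied by the user rather than estimated) implements the Holt-Winters additive method \eqref{eq:AMmethod} directly in \R.

```{r holtwinters-recursion, eval=FALSE}
# Minimal sketch of the additive Holt-Winters (A,A) recursions.
# alpha, beta_star and gamma are smoothing parameters; l0 and b0 are the
# initial level and growth; s0 is a vector of the m initial seasonal
# indices (s_{1-m}, ..., s_0).
hw_additive <- function(y, m, alpha, beta_star, gamma, l0, b0, s0, h = 4) {
  n <- length(y)
  level <- growth <- numeric(n)
  s <- c(s0, numeric(n))              # s[t + m] will hold s_t
  l_prev <- l0; b_prev <- b0
  for (t in seq_len(n)) {
    level[t]  <- alpha * (y[t] - s[t]) + (1 - alpha) * (l_prev + b_prev)
    growth[t] <- beta_star * (level[t] - l_prev) + (1 - beta_star) * b_prev
    s[t + m]  <- gamma * (y[t] - l_prev - b_prev) + (1 - gamma) * s[t]
    l_prev <- level[t]; b_prev <- growth[t]
  }
  # Point forecasts: l_n + h*b_n plus the seasonal index from the last cycle
  hplus <- ((seq_len(h) - 1) %% m) + 1
  level[n] + seq_len(h) * growth[n] + s[n + hplus]
}
```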

## Innovations state space models {#sec:statespace}

For each exponential smoothing method in Table \ref{table:pegels},
@expsmooth08 describe two possible innovations state space
models, one corresponding to a model with additive errors and the
other to a model with multiplicative errors. If the same parameter
values are used, these two models give equivalent point forecasts,
although different prediction intervals. Thus there are 30 potential
models described in this classification.

Historically, the nature of the error component has often been
ignored, because the distinction between additive and multiplicative
errors makes no difference to point forecasts.

We are careful to distinguish exponential smoothing \emph{methods}
from the underlying state space \emph{models}. An exponential
smoothing method is an algorithm for producing point forecasts only.
The underlying stochastic state space model gives the same point
forecasts, but also provides a framework for computing prediction
intervals and other properties.

To distinguish the models with additive and multiplicative errors,
we add an extra letter to the front of the method notation. The
triplet (E,T,S) refers to the three components: error, trend and
seasonality. So the model ETS(A,A,N) has additive errors, additive
trend and no seasonality---in other words, this is Holt's linear
method with additive errors. Similarly, ETS(M,M\damped,M) refers to
a model with multiplicative errors, a damped multiplicative trend
and multiplicative seasonality. The notation
ETS($\cdot$,$\cdot$,$\cdot$) helps in remembering the order in which
the components are specified.

Once a model is specified, we can study the probability distribution
of future values of the series and find, for example, the
conditional mean of a future observation given knowledge of the
past. We denote this as $\mu_{t+h|t} = \E(y_{t+h} \mid \bm{x}_t)$,
where $\bm{x}_t$ contains the unobserved components such as
$\ell_t$, $b_t$ and $s_t$. For $h=1$ we use $\mu_t\equiv\mu_{t+1|t}$
as a shorthand notation. For many models, these conditional means
will be identical to the point forecasts given in
Table \ref{table:pegels}, so that $\mu_{t+h|t}=\hat{y}_{t+h|t}$.
However, for other models (those with multiplicative trend or
multiplicative seasonality), the conditional mean and the point
forecast will differ slightly for $h\ge 2$.

We illustrate these ideas using the damped trend method of
@GM85.

\subsubsection{Additive error model:  ETS(A,A$_d$,N)}

Let $\mu_t = \hat{y}_t = \ell_{t-1}+\phi b_{t-1}$ denote the one-step
forecast of $y_{t}$ assuming that we know the values of all
parameters. Also, let $\varepsilon_t = y_t - \mu_t$ denote the
one-step forecast error at time $t$. From the equations in
Table \ref{table:pegels}, we find that
\begin{align}
\label{ss1}
y_t    &= \ell_{t-1} + \phi b_{t-1} + \varepsilon_t\\
\ell_t &= \ell_{t-1} + \phi b_{t-1} + \alpha \varepsilon_t
\label{ss2}\\
b_t    &= \phi b_{t-1} + \beta^*(\ell_t - \ell_{t-1}- \phi b_{t-1})
    = \phi b_{t-1} + \alpha\beta^*\varepsilon_t. \label{ss3}
\end{align}
We simplify the last expression by setting $\beta=\alpha\beta^*$. The three equations above constitute a state space model underlying the damped Holt's method. Note that it is an \emph{innovations} state space model [@AM79;@Aoki87] because the same error term appears in each equation. We can write it in standard state space notation by defining the state vector as $\bm{x}_t = (\ell_t,b_t)'$ and expressing \eqref{ss1}--\eqref{ss3} as
\begin{subequations}
\begin{align}
y_t &= \left[\begin{array}{cc} 1 & \phi \end{array}\right] \bm{x}_{t-1} +
\varepsilon_t\label{obseq}\\
\bm{x}_t &= \left[\begin{array}{ll}
                        1 & \phi\\
                        0 & \phi
                   \end{array}\right]\bm{x}_{t-1}
            + \left[\begin{array}{l}
                          \alpha\\
                          \beta
                    \end{array}\right]\varepsilon_t.\label{stateeq}
\end{align}
\end{subequations}
The model is fully specified once we state the distribution of the
error term $\varepsilon_t$. Usually we assume that these are
independent and identically distributed, following a normal
distribution with mean 0 and variance $\sigma^2$, which we write as
$\varepsilon_t \sim\mbox{NID}(0, \sigma^2)$.
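
As a concrete illustration (a sketch only, with the parameters assumed known rather than estimated), the following \R\ code simulates a series from equations \eqref{ss1}--\eqref{ss3}.

```{r ets-aadn-simulation, eval=FALSE}
# Simulate n observations from the ETS(A,Ad,N) model, assuming
# NID(0, sigma^2) errors and known parameter values.
simulate_aadn <- function(n, alpha, beta, phi, sigma, l0, b0) {
  y <- numeric(n)
  l <- l0; b <- b0
  for (t in seq_len(n)) {
    e    <- rnorm(1, mean = 0, sd = sigma)  # epsilon_t
    mu   <- l + phi * b                     # one-step forecast mu_t
    y[t] <- mu + e                          # observation equation
    l    <- mu + alpha * e                  # level update
    b    <- phi * b + beta * e              # growth update (beta = alpha * beta^*)
  }
  ts(y)
}
```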

\subsubsection{Multiplicative error model: ETS(M,A$_d$,N)}

A model with multiplicative error can be derived similarly, by first
setting $\varepsilon_t = (y_t-\mu_t)/\mu_t$, so that $\varepsilon_t$
is the relative error. Then, following a similar approach to that
for additive errors, we find
\begin{align*}
y_t &= (\ell_{t-1} + \phi b_{t-1})(1 + \varepsilon_t)\\
\ell_t &= (\ell_{t-1} + \phi b_{t-1})(1 + \alpha \varepsilon_t)\\
b_t &= \phi b_{t-1} + \beta(\ell_{t-1}+\phi b_{t-1})\varepsilon_t,
\end{align*}
or
\begin{align*}
y_t &= \left[\begin{array}{cc} 1 & \phi \end{array}\right] \bm{x}_{t-1}(1 + \varepsilon_t)\\
\bm{x}_t &= \left[\begin{array}{ll}
                        1 & \phi \\
                        0 & \phi
                   \end{array}\right]\bm{x}_{t-1}
            + \left[\begin{array}{cc} 1 & \phi \end{array}\right] \bm{x}_{t-1}
             \left[\begin{array}{l}
                          \alpha\\
                          \beta
                    \end{array}\right]\varepsilon_t.
\end{align*}
Again we assume that $\varepsilon_t \sim \mbox{NID}(0,\sigma^2)$.

Of course, this is a nonlinear state space model, which is usually
considered difficult to handle for estimation and forecasting.
However, that is one of the many advantages of the innovations form
of state space models --- we can still compute forecasts, the
likelihood and prediction intervals for this nonlinear model with no
more effort than is required for the additive error model.

## State space models for all exponential smoothing methods {#sec:ssmodels}

There are similar state space models for all 30 exponential
smoothing variations. The general model involves a state vector
$\bm{x}_t = (\ell_t, b_t$, $s_t, s_{t-1}, \dots, s_{t-m+1})'$ and
state space equations of the form
\begin{subequations}\label{eq:ss}
\begin{align}
y_t &= w(\bm{x}_{t-1}) + r(\bm{x}_{t-1})\varepsilon_t
\label{eq:ss1}\\
\bm{x}_t &= f(\bm{x}_{t-1}) + g(\bm{x}_{t-1})\varepsilon_t
\label{eq:ss2}
\end{align}
\end{subequations}
where $\{\varepsilon_t\}$ is a Gaussian white noise process with
mean zero and variance $\sigma^2$, and $\mu_t = w(\bm{x}_{t-1})$.
The model with additive errors has $r(\bm{x}_{t-1})=1$, so that $y_t
= \mu_{t} + \varepsilon_t$. The model with multiplicative errors has
$r(\bm{x}_{t-1})=\mu_t$, so that $y_t = \mu_{t}(1 + \varepsilon_t)$.
Thus,  $\varepsilon_t = (y_t - \mu_t)/\mu_t$ is the relative error
for the multiplicative model. The models are not unique. Clearly,
any value of $r(\bm{x}_{t-1})$ will lead to identical point
forecasts for $y_t$.

All of the methods in Table \ref{table:pegels} can be written in the
form \eqref{eq:ss1} and \eqref{eq:ss2}. The specific form for each
model is given in @expsmooth08.

Some of the combinations of trend, seasonality and error can
occasionally lead to numerical difficulties; specifically, any model
equation that requires division by a state component could involve
division by zero. This is a problem for models with additive errors
and either multiplicative trend or multiplicative seasonality, as
well as for the model with multiplicative errors, multiplicative
trend and additive seasonality. These models should therefore be
used with caution.

The multiplicative error models are useful when the data are
strictly positive, but are not numerically stable when the data
contain zeros or negative values. So when the time series is not
strictly positive, only the six fully additive models may be
applied.

The point forecasts given in Table \ref{table:pegels} are easily
obtained from these models by iterating equations \eqref{eq:ss1}
and \eqref{eq:ss2} for $t=n+1, n+2,\dots,n+h$, setting
$\varepsilon_{n+j}=0$ for $j=1,\dots,h$. In most cases (notable
exceptions being models with multiplicative seasonality or
multiplicative trend for $h\ge2$), the point forecasts can be shown
to be equal to $\mu_{t+h|t} = \E(y_{t+h} \mid \bm{x}_t)$, the
conditional expectation of the corresponding state space model.

The models also provide a means of obtaining prediction intervals.
In the case of the linear models, where the forecast distributions
are normal, we can derive the conditional variance $v_{t+h|t} =
\var(y_{t+h} \mid \bm{x}_t)$ and obtain prediction intervals
accordingly. This approach also works for many of the nonlinear
models. Detailed derivations of the results for many models are
given in @HKOS05.

A more direct approach that works for all of the models is to simply
simulate many future sample paths conditional on the last estimate
of the state vector, $\bm{x}_t$. Then prediction intervals can be
obtained from the percentiles of the simulated sample paths. Point
forecasts can also be obtained in this way by taking the average of
the simulated values at each future time period. An advantage of
this approach is that we generate an estimate of the complete
predictive distribution, which is especially useful in applications
such as inventory planning, where expected costs depend on the whole
distribution.
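
In the \pkg{forecast} package, such sample paths can be generated with the \code{simulate()} method for fitted \code{ets} objects. A brief sketch follows; the horizon and the number of paths are arbitrary choices here.

```{r simulated-intervals, eval=FALSE}
fit <- ets(usnetelec)
h <- 10
# 5000 simulated future sample paths of length h, conditional on the
# final state estimate
paths <- replicate(5000, simulate(fit, nsim = h, future = TRUE))
point_forecasts <- rowMeans(paths)                 # mean at each horizon
lower <- apply(paths, 1, quantile, probs = 0.025)  # 95% interval limits
upper <- apply(paths, 1, quantile, probs = 0.975)
```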

## Estimation {#sec:estimation}

In order to use these models for forecasting, we need to know the
values of $\bm{x}_0$ and the parameters $\alpha$, $\beta$, $\gamma$
and $\phi$. It is easy to compute the likelihood of the innovations
state space model \eqref{eq:ss}, and so obtain maximum likelihood
estimates. @OKS97 show that
\begin{equation}\label{likelihood}
L^*(\bm\theta,\bm{x}_0) = n\log\Big(\sum_{t=1}^n
\varepsilon^2_t\Big) + 2\sum_{t=1}^n \log|r(\bm{x}_{t-1})|
\end{equation}
is equal to twice the negative logarithm of the likelihood function
(with constant terms eliminated), conditional on the parameters
$\bm\theta = (\alpha,\beta,\gamma,\phi)'$ and the initial states
$\bm{x}_0 = (\ell_0,b_0,s_0,s_{-1},\dots,s_{-m+1})'$, where $n$ is
the number of observations. This is easily computed using
the recursive equations in Table \ref{table:pegels}. Unlike state
space models with multiple sources of error, we do not need to use
the Kalman filter to compute the likelihood.

The parameters $\bm\theta$ and the initial states $\bm{x}_0$ can be
estimated by minimizing $L^*$. Most implementations of exponential
smoothing use an ad hoc heuristic scheme to estimate $\bm{x}_0$.
However, with modern computers, there is no reason why we cannot
estimate $\bm{x}_0$ along with $\bm\theta$, and the resulting
forecasts are often substantially better when we do.

We constrain the initial states $\bm{x}_0$ so that the seasonal
indices add to zero for additive seasonality, and add to $m$ for
multiplicative seasonality. There have been several suggestions for
restricting the parameter space for  $\alpha$, $\beta$ and $\gamma$.
The traditional approach is to ensure that the various equations can
be interpreted as weighted averages, thus requiring $\alpha$,
$\beta^*=\beta/\alpha$, $\gamma^*=\gamma/(1-\alpha)$ and $\phi$ to
all lie within $(0,1)$. This suggests
$$0<\alpha<1,\qquad 0<\beta<\alpha,\qquad 0<\gamma <
1-\alpha,\qquad\mbox{and}\qquad 0<\phi<1.
$$
However, @HAA08 show that these restrictions are usually
stricter than necessary (although in a few cases they are not
restrictive enough).
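
In \code{ets()}, the parameter space used during estimation can be chosen through its \code{bounds} argument; a brief sketch:

```{r ets-bounds, eval=FALSE}
# Traditional "weighted average" restrictions on the smoothing parameters
fit1 <- ets(usnetelec, bounds = "usual")
# The (usually wider) admissible stability region discussed above
fit2 <- ets(usnetelec, bounds = "admissible")
```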

## Model selection

Forecast accuracy measures such as mean squared error (MSE) can
be used for selecting a model for a given set of data, provided the
errors are computed from data in a hold-out set and not from the
same data as were used for model estimation. However, there are
often too few out-of-sample errors to draw reliable conclusions.
Consequently, a penalized method based on the in-sample fit is
usually better.

One such approach uses a penalized likelihood such as Akaike's
Information Criterion:
$$\mbox{AIC} = L^*(\hat{\bm\theta},\hat{\bm{x}}_0) + 2q,
$$
where $q$ is the number of parameters in $\bm\theta$ plus the number
of free states in $\bm{x}_0$, and $\hat{\bm\theta}$ and
$\hat{\bm{x}}_0$ denote the estimates of $\bm\theta$ and $\bm{x}_0$.
We select the model that minimizes the AIC amongst all of the models
that are appropriate for the data.

The AIC also provides a method for selecting between the additive
and multiplicative error models. The point forecasts from the two
models are identical so that standard forecast accuracy measures
such as the MSE or mean absolute percentage
error (MAPE) are unable to select between the error types. The AIC
is able to select between the error types because it is based on
likelihood rather than one-step forecasts.

Obviously, other model selection criteria (such as the BIC) could
also be used in a similar manner.
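
In \code{ets()}, the criterion used for model selection is set with the \code{ic} argument; for example:

```{r ets-ic, eval=FALSE}
ets(usnetelec, ic = "aic")   # select by the AIC
ets(usnetelec, ic = "bic")   # select by the BIC instead
```

The default in recent versions of the package is the bias-corrected AICc.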

## Automatic forecasting {#sec:algorithm}

We combine the preceding ideas to obtain a robust and widely
applicable automatic forecasting algorithm.  The steps involved are
summarized below.
\begin{compactenum}
\item For each series, apply all models that are appropriate,
optimizing the parameters (both smoothing parameters and the initial
state variable) of the model in each case.

\item Select the best of the models according to the AIC.

\item Produce point forecasts using the best model (with optimized
parameters) for as many steps ahead as required.

\item Obtain prediction intervals for the best model either using the analytical results of @HKOS05, or by simulating future sample paths for $\{y_{n+1},\dots,y_{n+h}\}$ and finding the $\alpha/2$ and $1-\alpha/2$ percentiles of the simulated data at each forecasting horizon.  If simulation is used, the sample paths may be generated using the normal distribution for errors (parametric bootstrap) or using the resampled errors (ordinary bootstrap); see the sketch following this list.
\end{compactenum}
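
In the package, these choices are available through arguments of the \code{forecast()} method for \code{ets} objects; a brief sketch (the horizon and number of paths are arbitrary):

```{r ets-interval-options, eval=FALSE}
fit <- ets(ukcars)
forecast(fit, h = 8)                                   # analytical intervals where available
forecast(fit, h = 8, simulate = TRUE, npaths = 5000)   # simulated paths, Gaussian errors
forecast(fit, h = 8, simulate = TRUE, bootstrap = TRUE,
         npaths = 5000)                                # simulated paths, resampled errors
```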

@HKSG02 applied this automatic forecasting strategy to the M-competition data [@Mcomp82] and the IJF-M3 competition data [@M3comp00] using a restricted set of exponential smoothing models, and demonstrated that the methodology is particularly good at short-term forecasts (up to about 6 periods ahead), especially for short-term seasonal series (beating all other methods in the competitions for these series).

# ARIMA models {#sec:arima}

A common obstacle for many people in using Autoregressive Integrated
Moving Average (ARIMA) models for forecasting is that the order
selection process is usually considered subjective and difficult to
apply. But it does not have to be. There have been several attempts
to automate ARIMA modelling in the last 25 years.

@HR82 proposed a method to identify the order of an ARMA
model for a stationary series. In their method the innovations can
be obtained by fitting a long autoregressive model to the data, and
then the likelihood of potential models is computed via a series of
standard regressions. They established the asymptotic properties of
the procedure under very general conditions.

@Gomez98 extended the Hannan-Rissanen identification method
to include multiplicative seasonal ARIMA model identification.
@TRAMOSEATS98 implemented this automatic identification
procedure in the software \pkg{TRAMO} and \pkg{SEATS}. For a
given series, the algorithm attempts to find the model with the
minimum BIC.

@Liu89 proposed a method for identification of seasonal ARIMA
models using a filtering method and certain heuristic rules; this
algorithm is used in the \pkg{SCA-Expert} software. Another
approach is described by @MP00a whose algorithm for
univariate ARIMA models also allows intervention analysis. It is
implemented in the software package ``Time Series Expert''
(\pkg{TSE-AX}).

Other algorithms are in use in commercial software, although they
are not documented in the public domain literature. In particular,
\pkg{Forecast Pro} [@ForecastPro00] is well-known for its excellent
automatic ARIMA algorithm which was used in the M3-forecasting
competition [@M3comp00]. Another proprietary algorithm is
implemented in \pkg{Autobox} [@Reilly00]. @OL96 provide an
early review of some of the commercial software that implement
automatic ARIMA forecasting.

## Choosing the model order using unit root tests and the AIC

A non-seasonal ARIMA($p,d,q$) process is given by
$$
\phi(B)(1-B)^dy_{t} = c + \theta(B)\varepsilon_t
$$
where $\{\varepsilon_t\}$ is a white noise process with mean zero
and variance $\sigma^2$, $B$ is the backshift operator, and
$\phi(z)$ and $\theta(z)$ are polynomials of order $p$ and $q$
respectively. To ensure causality and invertibility, it is assumed
that $\phi(z)$ and $\theta(z)$ have no roots for $|z|<1$
[@BDbook91]. If $c\ne0$, there is an implied polynomial of
order $d$ in the forecast function.

The seasonal ARIMA$(p,d,q)(P,D,Q)_m$ process is given by
$$
\Phi(B^m)\phi(B)(1-B^{m})^D(1-B)^dy_{t} = c + \Theta(B^m)\theta(B)\varepsilon_t
$$
where $\Phi(z)$ and $\Theta(z)$ are polynomials of orders $P$ and
$Q$ respectively, each containing no roots inside the unit circle.
If $c\ne0$, there is an implied polynomial of order $d+D$ in the
forecast function.

The main task in automatic ARIMA forecasting is selecting an
appropriate model order, that is the values $p$, $q$, $P$, $Q$, $D$,
$d$. If $d$ and $D$ are known, we can select the orders $p$, $q$,
$P$ and $Q$ via an information criterion such as the AIC:
$$\mbox{AIC} = -2\log(L) + 2(p+q+P+Q+k)$$
where $k=1$ if $c\ne0$ and 0 otherwise, and $L$ is the maximized
likelihood of the model fitted to the \emph{differenced} data
$(1-B^m)^D(1-B)^dy_t$.  The likelihood of the full model for $y_t$
is not actually defined and so the AIC values for different levels of
differencing are not comparable.

One solution to this difficulty is the ``diffuse prior'' approach
which is outlined in @DKbook01 and implemented in the
\code{arima()} function [@Ripley:2002] in \R. In this approach, the initial values
of the time series (before the observed values) are assumed to have
mean zero and a large variance. However, choosing $d$ and $D$ by
minimizing the AIC using this approach tends to lead to
over-differencing. For forecasting purposes, we believe it is better
to make as few differences as possible because over-differencing
harms forecasts [@SY94] and widens prediction intervals.
[Although, see @Hendry97 for a contrary view.]

Consequently, we need some other approach to choose $d$ and $D$. We
prefer unit-root tests. However, most unit-root tests are based on a
null hypothesis that a unit root exists, which biases results towards
more differences rather than fewer. For example,
variations on the Dickey-Fuller test [@DF81] all assume there
is a unit root at lag 1, and the HEGY test of @HEGY90 is
based on a null hypothesis that there is a seasonal unit root.
Instead, we prefer unit-root tests based on a null hypothesis of no
unit-root.

For non-seasonal data, we consider ARIMA($p,d,q$) models where $d$
is selected based on successive KPSS unit-root tests [@KPSS92].
That is, we test the data for a unit root; if the test result is
significant, we test the differenced data for a unit root; and so
on. We stop this procedure when we obtain our first insignificant
result.
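
The number of differences selected by this sequence of KPSS tests can be obtained directly with the \code{ndiffs()} function; for example:

```{r ndiffs-example, eval=FALSE}
# Number of first differences chosen by successive KPSS unit-root tests
ndiffs(usnetelec, test = "kpss")
```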

For seasonal data, we consider ARIMA$(p,d,q)(P,D,Q)_m$ models where
$m$ is the seasonal frequency and $D=0$ or $D=1$ depending on an
extended Canova-Hansen test [@CH95]. Canova and Hansen only
provide critical values for $2<m<13$. In our implementation of their
test, we allow any value of $m>1$. Let $C_m$ be the critical value
for seasonal period $m$. We plotted $C_m$ against $m$ for values of
$m$ up to 365 and noted that they fit the line $C_m = 0.269
m^{0.928}$ almost exactly. So for $m>12$, we use this simple
expression to obtain the critical value.

We note in passing that the null hypothesis for the Canova-Hansen
test is not an ARIMA model as it includes seasonal dummy terms. It
is a test for whether the seasonal pattern changes sufficiently over
time to warrant a seasonal unit root, or whether a stable seasonal
pattern modelled using fixed dummy variables is more appropriate.
Nevertheless, we have found that the test is still useful for
choosing $D$ in a strictly ARIMA framework (i.e., without seasonal
dummy variables). If a stable seasonal pattern is selected (i.e.,
the null hypothesis is not rejected), the seasonality is effectively
handled by stationary seasonal AR and MA terms.

After $D$ is selected, we choose $d$ by applying successive KPSS
unit-root tests to the seasonally differenced data (if $D=1$) or the
original data (if $D=0$). Once $d$ (and possibly $D$) are selected,
we proceed to select the values of $p$, $q$, $P$ and $Q$ by
minimizing the AIC. We allow $c\ne0$ for models where $d+D < 2$.
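
In the package, the functions \code{nsdiffs()} and \code{ndiffs()} carry out these tests; a brief sketch of the two-stage choice of $D$ and then $d$ for the monthly \code{visitors} series follows.

```{r nsdiffs-ndiffs, eval=FALSE}
D <- nsdiffs(visitors, test = "ch")   # seasonal differences via the Canova-Hansen test
if (D > 0) {
  d <- ndiffs(diff(visitors, lag = frequency(visitors)))  # KPSS tests on seasonally differenced data
} else {
  d <- ndiffs(visitors)                                    # KPSS tests on the original data
}
```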

## A step-wise procedure for traversing the model space

Suppose we have seasonal data and we consider
ARIMA$(p,d,q)(P,D,Q)_m$ models where $p$ and $q$ can take values
from 0 to 3, and $P$ and $Q$ can take values from 0 to 1. When $c=0$
there is a total of 288 possible models, and when $c\ne 0$ there is
a total of 192 possible models, giving 480 models altogether. If the
values of $p$, $d$, $q$, $P$, $D$ and $Q$ are allowed to range more
widely, the number of possible models increases rapidly.
Consequently, it is often not feasible to simply fit every potential
model and choose the one with the lowest AIC. Instead, we need a
way of traversing the space of models efficiently in order to arrive
at the model with the lowest AIC value.

We propose a step-wise algorithm as follows.
\begin{description}
\item[Step 1:]  We try four possible models to start with.
\begin{itemize}
\item ARIMA($2,d,2$) if $m=1$ and ARIMA($2,d,2)(1,D,1)$ if $m>1$.

\item ARIMA($0,d,0$) if $m=1$ and ARIMA($0,d,0)(0,D,0)$ if $m>1$.

\item ARIMA($1,d,0$) if $m=1$ and ARIMA($1,d,0)(1,D,0)$ if $m>1$.

\item ARIMA($0,d,1$) if $m=1$ and ARIMA($0,d,1)(0,D,1)$ if $m>1$.
\end{itemize}
If $d+D \le 1$, these models are fitted with $c\ne0$. Otherwise, we
set $c=0$. Of these four models, we select the one with the smallest
AIC value. This is called the ``current'' model and is denoted by
ARIMA($p,d,q$) if $m=1$ or ARIMA($p,d,q)(P,D,Q)_m$ if $m>1$.

\item[Step 2:] We consider up to seventeen variations on the current
model:
\begin{itemize}
\item where one of $p$, $q$, $P$ and $Q$ is allowed to vary by
$\pm1$ from the current model;

\item where $p$ and $q$ both vary by $\pm1$ from the current model;

\item where $P$ and $Q$ both vary by $\pm1$ from the current model;

\item where the constant $c$ is included if the current model
      has $c=0$ or excluded if the current model has $c\ne0$.
\end{itemize}
Whenever a model with lower AIC is found, it becomes the new
``current'' model and the procedure is repeated. This process
finishes when we cannot find a model close to the current model with
lower AIC.
\end{description}

There are several constraints on the fitted models to avoid problems
with convergence or near unit-roots. The constraints are outlined
below.
\begin{compactitem}\itemsep=8pt
\item The values of $p$ and $q$ are not allowed to exceed specified
upper bounds (with default values of 5 in each case).

\item The values of $P$ and $Q$ are not allowed to exceed specified
upper bounds (with default values of 2 in each case).

\item We reject any model which is ``close'' to non-invertible or
non-causal. Specifically, we compute the roots of $\phi(B)\Phi(B)$
and $\theta(B)\Theta(B)$. If either has a root that is smaller than
1.001 in absolute value, the model is rejected.

\item If there are any errors arising in the non-linear optimization
routine used for estimation, the model is rejected. The rationale
here is that any model that is difficult to fit is probably not a
good model for the data.
\end{compactitem}

The algorithm is guaranteed to return a valid model because the
model space is finite and at least one of the starting models will
be accepted (the model with no AR or MA parameters). The selected
model is used to produce forecasts.
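
The order bounds and the step-wise search itself can be controlled through arguments of \code{auto.arima()}; for example:

```{r auto-arima-options, eval=FALSE}
auto.arima(visitors, trace = TRUE)                     # step-wise search, printing each candidate
auto.arima(visitors, max.p = 6, max.q = 6,
           max.P = 3, max.Q = 3)                       # raise the default order bounds
auto.arima(visitors, stepwise = FALSE,
           approximation = FALSE)                      # exhaustive search of the model space
```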

## Comparisons with exponential smoothing

There is a widespread myth that ARIMA models are more general than
exponential smoothing. This is not true. The two classes of models
overlap. The linear exponential smoothing models are all special
cases of ARIMA models---the equivalences are discussed in
@HAA08. However, the non-linear exponential smoothing models
have no equivalent ARIMA counterpart. On the other hand, there are
many ARIMA models which have no exponential smoothing counterpart.
Thus, the two model classes overlap and are complementary; each has
its strengths and weaknesses.
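
For example, simple exponential smoothing (the ETS(A,N,N) model) is equivalent to an ARIMA(0,1,1) model, so the two fits sketched below should give very similar forecasts; small differences arise only because the estimation schemes differ.

```{r ets-arima-equivalence, eval=FALSE}
fit_ets   <- ets(bonds, model = "ANN")         # simple exponential smoothing
fit_arima <- Arima(bonds, order = c(0, 1, 1))  # the equivalent ARIMA model
forecast(fit_ets, h = 12)
forecast(fit_arima, h = 12)
```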

The exponential smoothing state space models are all non-stationary.
Models with seasonality or non-damped trend (or both) have two unit
roots; all other models---that is, non-seasonal models with either
no trend or damped trend---have one unit root. It is possible to
define a stationary model with similar characteristics to
exponential smoothing, but this is not normally done. The philosophy
of exponential smoothing is that the world is non-stationary. So if
a stationary model is required, ARIMA models are better.

One advantage of the exponential smoothing models is that they can
be non-linear. So time series that exhibit non-linear
characteristics including heteroscedasticity  may be better modelled
using exponential smoothing state space models.

For seasonal data, there are many more ARIMA models than the 30
possible models in the exponential smoothing class of
Section \ref{sec:expsmooth}. It may be thought that the larger model
class is advantageous. However, the results in @HKSG02 show
that the exponential smoothing models performed better than the
ARIMA models for the seasonal M3 competition data. (For the annual
M3 data, the ARIMA models performed better.) In a discussion of
these results, @Hyndman01 speculates that the larger model
space of ARIMA models actually harms forecasting performance because
it introduces additional uncertainty. The smaller exponential
smoothing class is sufficiently rich to capture the dynamics of
almost all real business and economic time series.

# The forecast package {#sec:package}

The algorithms and modelling frameworks for automatic univariate
time series forecasting are implemented in the \pkg{forecast}
package in \R. We illustrate the methods using the following four
real time series shown in Figure \ref{fig:etsexamples}.
\begin{compactitem}
\item Figure \ref{fig:etsexamples}(a) shows 125 monthly US government
bond yields (percent per annum) from January 1994 to May 2004.

\item Figure \ref{fig:etsexamples}(b) displays 55 observations of annual
US net electricity generation (billion kwh) for 1949 through 2003.

\item Figure \ref{fig:etsexamples}(c) presents 113 quarterly observations
of passenger motor vehicle production in the U.K. (thousands of
cars) for the first quarter of 1977 through the first quarter of
2005.

\item Figure \ref{fig:etsexamples}(d) shows 240 monthly observations of
the number of short term overseas visitors to Australia from May
1985 to April 2005.
\end{compactitem}

```{r etsexamples, fig.height=7, fig.width=9, echo=FALSE, fig.cap="Four time series showing point forecasts and 80\\% \\& 95\\% prediction intervals obtained using exponential smoothing state space models."}
par(mfrow = c(2,2))
mod1 <- ets(bonds)
mod2 <- ets(usnetelec)
mod3 <- ets(ukcars)
mod4 <- ets(visitors)

plot(forecast(mod1), main="(a) US 10-year bonds yield", xlab="Year", ylab="Percentage per annum")
plot(forecast(mod2), main="(b) US net electricity generation", xlab="Year", ylab="Billion kwh")
plot(forecast(mod3), main="(c) UK passenger motor vehicle production", xlab="Year", ylab="Thousands of cars")
plot(forecast(mod4), main="(d) Overseas visitors to Australia", xlab="Year", ylab="Thousands of people")
```

```{r etsnames, echo=FALSE}
etsnames <- c(mod1$method, mod2$method, mod3$method, mod4$method)
etsnames <- gsub("Ad","A\\\\damped",etsnames)
```

## Implementation of the automatic exponential smoothing algorithm

The innovations state space modelling framework described in
Section \ref{sec:expsmooth} is implemented via the \code{ets()}
function in the \pkg{forecast} package. (The default settings of \code{ets()} do not allow models with multiplicative trend, but they can be included using \code{allow.multiplicative.trend=TRUE}.) The models chosen via the
algorithm for the four data sets were:
\begin{compactitem}
\item `r etsnames[1]` for monthly US 10-year bonds yield\\
($\alpha=`r format(coef(mod1)['alpha'], digits=4, nsmall=4)`$,
 $\beta=`r format(coef(mod1)['beta'], digits=4, nsmall=4)`$,
 $\phi=`r format(coef(mod1)['phi'], digits=4, nsmall=4)`$,
 $\ell_0 = `r format(coef(mod1)['l'], digits=4, nsmall=4)`$,
 $b_0=`r format(coef(mod1)['b'], digits=4, nsmall=4)`$);

\item `r etsnames[2]` for annual US net electricity generation\\
($\alpha=`r format(coef(mod2)['alpha'], digits=4, nsmall=4)`$,
 $\beta=`r format(coef(mod2)['beta'], digits=4, nsmall=4)`$,
 $\ell_0 = `r format(coef(mod2)['l'], digits=4, nsmall=4)`$,
 $b_0=`r format(coef(mod2)['b'], digits=4, nsmall=4)`$);

\item `r etsnames[3]` for quarterly UK motor vehicle production\\
($\alpha=`r format(coef(mod3)['alpha'], digits=4, nsmall=4)`$,
 $\gamma=`r format(coef(mod3)['gamma'], digits=4, nsmall=4)`$,
 $\ell_0 = `r format(coef(mod3)['l'], digits=4, nsmall=4)`$,
 $s_{-3}=`r format(-sum(coef(mod3)[c('s0','s1','s2')]), digits=4, nsmall=4)`$,
 $s_{-2}=`r format(coef(mod3)['s2'], digits=4, nsmall=4)`$,
 $s_{-1}=`r format(coef(mod3)['s1'], digits=4, nsmall=4)`$,
 $s_0=`r format(coef(mod3)['s0'], digits=4, nsmall=4)`$);

\item `r etsnames[4]` for monthly Australian overseas visitors\\
($\alpha=`r format(coef(mod4)['alpha'], digits=4, nsmall=4)`$,
 $\beta=`r format(coef(mod4)['beta'], digits=2, nsmall=4)`$,
 $\gamma=`r format(coef(mod4)['gamma'], digits=4, nsmall=4)`$,
 $\ell_0 = `r format(coef(mod4)['l'], digits=4, nsmall=4)`$,
 $b_0 = `r format(coef(mod4)['b'], digits=4, nsmall=4)`$,
 $s_{-11}=`r format(12-sum(tail(coef(mod4),11)), digits=4, nsmall=4)`$,
 $s_{-10}=`r format(coef(mod4)['s10'], digits=4, nsmall=4)`$,
 $s_{-9}=`r format(coef(mod4)['s9'], digits=4, nsmall=4)`$,
 $s_{-8}=`r format(coef(mod4)['s8'], digits=4, nsmall=4)`$,
 $s_{-7}=`r format(coef(mod4)['s7'], digits=4, nsmall=4)`$,
 $s_{-6}=`r format(coef(mod4)['s6'], digits=4, nsmall=4)`$,
 $s_{-5}=`r format(coef(mod4)['s5'], digits=4, nsmall=4)`$,
 $s_{-4}=`r format(coef(mod4)['s4'], digits=4, nsmall=4)`$,
 $s_{-3}=`r format(coef(mod4)['s3'], digits=4, nsmall=4)`$,
 $s_{-2}=`r format(coef(mod4)['s2'], digits=4, nsmall=4)`$,
 $s_{-1}=`r format(coef(mod4)['s1'], digits=4, nsmall=4)`$,
 $s_0=`r format(coef(mod4)['s0'], digits=4, nsmall=4)`$).
\end{compactitem}
Although there is a lot of computation involved, it can be handled
remarkably quickly on modern computers. Each of the forecasts shown
in Figure \ref{fig:etsexamples} took no more than a few seconds on a
standard PC. The US electricity generation series took the longest
as there are no analytical prediction intervals available for the
ETS(M,M\damped,N) model. Consequently, the prediction intervals for
this series were computed using simulation of 5000 future sample
paths.

To apply the algorithm to the US net electricity generation time
series \code{usnetelec}, we use the following command.

```{r ets-usnetelec, echo=TRUE}
etsfit <- ets(usnetelec)
```

The object \code{etsfit} is of class ``\code{ets}'' and contains all
of the necessary information about the fitted model including model
parameters, the value of the state vector $\bm{x}_t$ for all $t$,
residuals and so on. Printing the \code{etsfit} object shows the
main items of interest.

```{r ets-usnetelec-print,echo=TRUE}
etsfit
```

Some goodness-of-fit measures [defined in @HK06] are obtained using \code{accuracy()}.

```{r ets-usnetelec-accuracy,eval=TRUE,echo=TRUE}
accuracy(etsfit)
```

There are also \code{coef()}, \code{plot()}, \code{summary()}, \code{residuals()}, \code{fitted()} and
\code{simulate()} methods for objects of class ``\code{ets}''. The
\code{plot()} function shows time plots of the original time series
along with the extracted components (level, growth and seasonal).

The \code{forecast()} function computes the required forecasts which
are then plotted as in Figure \ref{fig:etsexamples}(b).

```{r ets-usnetelec-fcast, fig.height=5, fig.width=8, message=FALSE, warning=FALSE, include=FALSE, output=FALSE}
fcast <- forecast(etsfit)
plot(fcast)
```

Printing the \code{fcast} object gives a table showing the
prediction intervals.

```{r ets-usnetelec-fcast-print,eval=TRUE,echo=TRUE}
fcast
```

The \code{ets()} function also provides the useful feature of
applying a fitted model to a new data set. For example, we could
withhold 10 observations from the \code{usnetelec} data set when
fitting, then compute the one-step forecast errors for the
out-of-sample data.

```{r ets-usnetelec-newdata,eval=FALSE,echo=TRUE}
fit <- ets(usnetelec[1:45])
test <- ets(usnetelec[46:55], model = fit)
accuracy(test)
```

We can also look at the measures of forecast accuracy where the
forecasts are based on only the fitting data.

```{r ets-usnetelec-fcast-accuracy,eval=FALSE,echo=TRUE}
accuracy(forecast(fit,10), usnetelec[46:55])
```

## The HoltWinters() function

There is another implementation of exponential smoothing in \R\ via
the \code{HoltWinters()} function [@Meyer:2002] in the \pkg{stats} package. It
implements only the (N,N), (A,N), (A,A) and (A,M) methods. The
initial states $\bm{x}_0$ are fixed using a heuristic algorithm.
Because of the way the initial states are estimated, a full three
years of seasonal data are required to implement the seasonal
forecasts using \code{HoltWinters()}. (See @shortseasonal for
the minimal sample size required.) The smoothing parameters are
optimized by minimizing the average squared prediction errors, which
is equivalent to minimizing \eqref{likelihood} in the case of
additive errors.

There is a \code{predict()} method for the resulting object which
can produce point forecasts and prediction intervals. Although it is
nowhere documented, it appears that the prediction intervals
produced by \code{predict()} for an object of class
\code{HoltWinters} are based on an equivalent ARIMA model in the
case of the (N,N), (A,N) and (A,A) methods, assuming additive
errors. These prediction intervals are equivalent to the prediction
intervals that arise from the (A,N,N), (A,A,N) and (A,A,A) state
space models. For the (A,M) method, the prediction interval provided
by \code{predict()} appears to be based on @CY91 which is an
approximation to the true prediction interval arising from the
(A,A,M) model. Prediction intervals with multiplicative errors are
not possible using the \code{HoltWinters()} function.
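
To compare the two implementations on the same data, one might use something like the following sketch; the additive Holt-Winters method corresponds to the ETS(A,A,A) model, although the fits will generally differ because \code{HoltWinters()} fixes the initial states heuristically.

```{r holtwinters-vs-ets, eval=FALSE}
hwfit <- HoltWinters(ukcars)
predict(hwfit, n.ahead = 8, prediction.interval = TRUE)
etsfit <- ets(ukcars, model = "AAA", damped = FALSE)
forecast(etsfit, h = 8)
```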

## Implementation of the automatic ARIMA algorithm

```{r arimaexamples, fig.height=7, fig.width=9, echo=FALSE, fig.cap="Four time series showing point forecasts and 80\\% \\& 95\\% prediction intervals obtained using ARIMA models."}
mod1 <- auto.arima(bonds, seasonal=FALSE, approximation=FALSE)
mod2 <- auto.arima(usnetelec)
mod3 <- auto.arima(ukcars)
mod4 <- auto.arima(visitors)
par(mfrow = c(2,2))
plot(forecast(mod1), main="(a) US 10-year bonds yield", xlab="Year", ylab="Percentage per annum")
plot(forecast(mod2), main="(b) US net electricity generation", xlab="Year", ylab="Billion kwh")
plot(forecast(mod3), main="(c) UK passenger motor vehicle production", xlab="Year", ylab="Thousands of cars")
plot(forecast(mod4), main="(d) Overseas visitors to Australia", xlab="Year", ylab="Thousands of people")
```

The algorithm of Section \ref{sec:arima} is applied to the same four
time series. Unlike the exponential smoothing algorithm, the ARIMA
class of models assumes homoscedasticity, which is not always
appropriate. Consequently, transformations are sometimes necessary.
For these four time series, we model the raw data for series
(a)--(c), but the logged data for series (d). The prediction
intervals are back-transformed with the point forecasts to preserve
the probability coverage.

To apply this algorithm to the US net electricity generation time
series \code{usnetelec}, we use the following commands.

```{r arima-auto-fcast,eval=TRUE,echo=TRUE,fig.show="hide"}
arimafit <- auto.arima(usnetelec)
fcast <- forecast(arimafit)
plot(fcast)
```

```{r arimanames, echo=FALSE}
# Convert character strings to latex
arimanames <- c(as.character(mod1),
  as.character(mod2),
  as.character(mod3),
  as.character(mod4))
arimanames <-
    gsub("\\[([0-9]*)\\]", "$_{\\1}$", arimanames)
```

The function \code{auto.arima()} implements the algorithm of
Section \ref{sec:arima} and returns an object of class \code{Arima}.
The resulting forecasts are shown in Figure \ref{fig:arimaexamples}. The
fitted models are as follows:
\begin{compactitem}
\item `r arimanames[1]` for monthly US 10-year bonds yield\\
($\theta_1= `r format(coef(mod1)['ma1'], digits=4, nsmall=4)`$);

\item `r arimanames[2]` for annual US net electricity generation\\
($\phi_1= `r format(coef(mod2)['ar1'], digits=4, nsmall=4)`$;
$\phi_2= `r format(coef(mod2)['ar2'], digits=4, nsmall=4)`$;
$\theta_1= `r format(coef(mod2)['ma1'], digits=4, nsmall=4)`$;
$\theta_2= `r format(coef(mod2)['ma2'], digits=4, nsmall=4)`$;
$c= `r format(coef(mod2)['drift'], digits=4, nsmall=4)`$);

\item `r arimanames[3]` for quarterly UK motor vehicle production\\
($\phi_1= `r format(coef(mod3)['ar1'], digits=4, nsmall=4)`$;
$\phi_2= `r format(coef(mod3)['ar2'], digits=4, nsmall=4)`$;
$\Phi_1= `r format(coef(mod3)['sar1'], digits=4, nsmall=4)`$;
$\Phi_2= `r format(coef(mod3)['sar2'], digits=4, nsmall=4)`$);

\item `r arimanames[4]` for monthly Australian overseas visitors\\
($\phi_1= `r format(coef(mod4)['ar1'], digits=4, nsmall=4)`$;
$\theta_1= `r format(coef(mod4)['ma1'], digits=4, nsmall=4)`$;
$\Theta_1= `r format(coef(mod4)['sma1'], digits=4, nsmall=4)`$;
$\Theta_2= `r format(coef(mod4)['sma2'], digits=4, nsmall=4)`$;
$c= `r format(coef(mod4)['drift'], digits=4, nsmall=4)`$).
\end{compactitem}
Note that the \R\ parameterization has $\theta(B) = (1 + \theta_1B +
\dots + \theta_qB^q)$ and $\phi(B) = (1 - \phi_1B - \dots - \phi_pB^p)$,
and similarly for the seasonal terms.

A summary of the forecasts is available, part of which is shown
below.

<!--
For some reason, this doesn't work.
```{r arimafcastsummary, echo=TRUE, message=FALSE, warning=FALSE, as.is=TRUE}
summary(fcast)
```
 -->

```
Forecast method: ARIMA(2,1,2) with drift
Series: usnetelec

Coefficients:
          ar1      ar2     ma1     ma2    drift
      -1.3032  -0.4332  1.5284  0.8340  66.1585
s.e.   0.2122   0.2084  0.1417  0.1185   7.5595

sigma^2 estimated as 2262:  log likelihood=-283.34
AIC=578.67   AICc=580.46   BIC=590.61

Error measures:
                   ME   RMSE    MAE      MPE   MAPE    MASE     ACF1
Training set 0.046402 44.894 32.333 -0.61771 2.1012 0.45813 0.022492

Forecasts:
     Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
2004       3968.957 3908.002 4029.912 3875.734 4062.180
2005       3970.350 3873.950 4066.751 3822.919 4117.782
2006       4097.171 3971.114 4223.228 3904.383 4289.959
2007       4112.332 3969.691 4254.973 3894.182 4330.482
2008       4218.671 4053.751 4383.591 3966.448 4470.894
2009       4254.559 4076.108 4433.010 3981.641 4527.476
2010       4342.760 4147.088 4538.431 4043.505 4642.014
2011       4393.306 4185.211 4601.401 4075.052 4711.560
2012       4470.261 4248.068 4692.455 4130.446 4810.077
2013       4529.113 4295.305 4762.920 4171.535 4886.690
```

The training set error measures for the two models are very similar.
Note that the information criteria are not comparable.

The \pkg{forecast} package also contains the function \code{Arima()}
which is largely a wrapper to the \code{arima()} function in the
\pkg{stats} package. The \code{Arima()} function in the
\pkg{forecast} package makes it easier to include a drift term when
$d+D=1$. (Setting \code{include.mean=TRUE} in the \code{arima()}
function from the \pkg{stats} package will only work when $d+D=0$.)
It also provides the facility for fitting an existing ARIMA model to
a new data set (as was demonstrated for the \code{ets()} function
earlier).
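
For example, a model estimated on the first 45 observations of \code{usnetelec} can be applied to the remaining observations without re-estimation (here using the ARIMA(2,1,2) with drift specification selected for the full series above):

```{r arima-newdata, eval=FALSE}
fit  <- Arima(usnetelec[1:45], order = c(2, 1, 2), include.drift = TRUE)
test <- Arima(usnetelec[46:55], model = fit)   # apply the fitted model to new data
accuracy(test)                                 # one-step errors on the withheld data
```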

One-step forecasts for ARIMA models are now available via a
\code{fitted()} function. We also provide a new function
\code{arima.errors()} which returns the original time series after
adjusting for regression variables. If there are no regression
variables in the ARIMA model, then the errors will be identical to
the original series. If there are regression variables in the ARIMA
model, then the errors will be equal to the original series minus
the effect of the regression variables, but leaving in the serial
correlation that is modelled with the AR and MA terms. In contrast,
\code{residuals()} provides true residuals, removing the AR and MA
terms as well.

The generic functions \code{summary()}, \code{print()}, \code{fitted()}  and
\code{forecast()} apply to models obtained from either the
\code{Arima()} or \code{arima()} functions.

## The forecast() function

The \code{forecast()} function is generic and has S3 methods for a
wide range of time series models. It computes point forecasts and
prediction intervals from the time series model. Methods exist for
models fitted using \code{ets()}, \code{auto.arima()},
\code{Arima()}, \code{arima()}, \code{ar()}, \code{HoltWinters()}
and \code{StructTS()}.

There is also a method for a \code{ts} object. If a time series
object is passed as the first argument to \code{forecast()}, the
function will produce forecasts based on the exponential smoothing
algorithm of Section \ref{sec:expsmooth}.

In most cases, there is an existing \code{predict()} function which
is intended to do much the same thing. Unfortunately, the resulting
objects from the \code{predict()} function contain different
information in each case and so it is not possible to build generic
functions (such as \code{plot()} and \code{summary()}) for the
results. So, instead, \code{forecast()} acts as a wrapper to
\code{predict()}, and packages the information obtained in a common
format (the \code{forecast} class). We also define
a default \code{predict()} method which is used when no existing
\code{predict()} function exists, and calls the relevant
\code{forecast()} function. Thus, \code{predict()} methods parallel
\code{forecast()} methods, but the latter provide consistent output
that is more usable.
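
For example, models fitted using functions from the \pkg{stats} package can be passed directly to \code{forecast()}, which returns objects in the common ``\code{forecast}'' format:

```{r forecast-stats-models, eval=FALSE}
forecast(ar(usnetelec), h = 10)
forecast(StructTS(usnetelec, type = "trend"), h = 10)
forecast(HoltWinters(ukcars), h = 8)
```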

\subsection[The forecast class]{The \code{forecast} class}

The output from the \code{forecast()} function is an object of class
``\code{forecast}'' and includes at least the following information:
\begin{compactitem}
\item the original series;

\item point forecasts;

\item prediction intervals of specified coverage;

\item the forecasting method used and information about the fitted
model;

\item residuals from the fitted model;

\item one-step forecasts from the fitted model for the period of the
observed data.
\end{compactitem}

There are \code{print()}, \code{plot()} and \code{summary()} methods
for the ``\code{forecast}'' class. Figures \ref{fig:etsexamples}
and \ref{fig:arimaexamples} were produced using the \code{plot()} method.

The prediction intervals are, by default, computed
for 80\% and 95\% coverage, although other values are possible if
requested. Fan charts [@Wallis99] are possible using the
combination \verb|plot(forecast(model.object, fan = TRUE))|.

## Other functions {#sec:other}

We now briefly describe some of the other features of the
\pkg{forecast} package. Each of the following functions produces an
object of class ``\code{forecast}''.

\code{croston()}
:  implements the method of @Croston72 for intermittent demand forecasting. In this method, the time series is decomposed into two separate sequences: the non-zero values and the time intervals between non-zero values. These are then independently forecast using simple exponential smoothing and the forecasts of the original series are obtained as ratios of the two sets of forecasts. No prediction intervals are provided because there is no underlying stochastic model [@SH05].

\code{thetaf()}
: provides forecasts from the Theta method [@AN00]. @HB03 showed that these were equivalent to a special case of simple exponential smoothing with drift.

\code{splinef()}
: gives cubic-spline forecasts, based on fitting a cubic spline to the historical data and extrapolating it linearly. The details of this method, and the associated prediction intervals, are discussed in @HKPB05.

\code{meanf()}
: returns forecasts based on the historical mean.

\code{rwf()}
: gives ``naïve'' forecasts equal to the most recent observation assuming a random walk model. This function also allows forecasting using a random walk with drift.
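
A few illustrative calls to these functions (the horizons are arbitrary):

```{r other-functions, eval=FALSE}
meanf(ukcars, h = 8)               # historical mean
rwf(ukcars, h = 8, drift = TRUE)   # random walk with drift
splinef(usnetelec, h = 10)         # cubic spline forecasts
thetaf(bonds, h = 12)              # Theta method
```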

In addition, there are some new plotting functions for time series.

\code{tsdisplay()}
: provides a time plot along with an ACF and PACF.

\code{seasonplot()}
: produces a seasonal plot as described in @MWH3.

\newpage

# Bibliography