\documentclass{article}
\setlength{\parindent}{0pt}	% Eliminate the indent at the beginning of a new paragraph
\setcounter{secnumdepth}{0}	% Eliminate the section numbering starting from a specific depth (see WikiBook)
\usepackage[round,sort]{natbib}
\usepackage{fixltx2e}
\usepackage{graphicx}	% To manage external pictures
\usepackage{float}
\usepackage{subfig}	% To add subfigures within figures, with labels (see WikiBooks)
\usepackage{verbatim}
\usepackage[colorlinks=true,linkcolor=blue,citecolor=blue,urlcolor=blue]{hyperref}
\usepackage{amssymb,amsbsy,amsmath}
\usepackage{epsfig}
\usepackage[left=3cm,top=3cm,bottom=3.5cm,right=3cm]{geometry}	% For easy management of document margins
\usepackage{fancyhdr} % To customize the header/footer (see WikiBooks)
%\usepackage{rotating}
\numberwithin{equation}{section}	% Equation numbers relative to sections

% ---------------------------------------------------------------------------------------------------------------------------------------

% \VignetteIndexEntry{ABC}
%\VignetteDepends{quantreg,nnet}
%\VignettePackage{abc}
%\documentclass{amsart}
\newcommand{\code}[1]{{\texttt{#1}}}
\newcommand{\pkg}[1]{{\texttt{#1}}}
\newcommand{\class}[1]{{\textit{#1}}}
\newcommand{\R}{{\normalfont\textsf{R }}{}}

\begin{document}

<<label=Roptions,echo=FALSE>>=
options(width = 60)
options(SweaveHooks = list(fig = function() par(mar=c(3,3,1,0.5),mgp = c(2,1,0))))
@

\SweaveOpts{prefix.string=fig,include=F,keep.source=T,eps=FALSE,resolution=72}

<<echo=false>>=
options(continue="  ")
@
% The option above eliminates the "+" prompt on consecutive script lines

\title{Approximate Bayesian Computation (ABC) in R: A Vignette}
\author{K Csill\'ery, L Lemaire, O Fran\c cois, MGB Blum}
\date{\pkg{abc} version \Sexpr{packageDescription("abc")[["Version"]]}, \Sexpr{Sys.Date()}}
\maketitle

\tableofcontents
\setcounter{footnote}{1} \footnotetext{This document is included as a
  vignette (a \LaTeX\ document created using the \R function
  \code{Sweave}) of the package \pkg{abc}. It is automatically
  downloaded together with the package and can be accessed in \R
  by typing \code{vignette("abc")}.}  \newpage
\setlength{\parskip}{4pt} % Space between paragraphs

<<echo=FALSE>>=
rm(list=ls())
@ 

\section{Summary}

An implementation of Approximate Bayesian Computation (ABC) methods in
the \R language is available in the package \pkg{abc} with associated
example data sets in the \pkg{abc.data} package.  The aim of this
vignette is to provide an extended overview of the capabilities of the
package, with a detailed example of the analysis of real data. Some
general information on installation procedures and object oriented
programming in \R are given in Section \emph{Introduction}. The ABC
methodology is briefly illustrated in Section \emph{A brief
  introduction to ABC}. A detailed example of an ABC analysis, in
which we aim to estimate the ancestral human population size, is
provided in Section \emph{A walk-through example: human demographic
  history}. Users already familiar with \R and ABC methods may skip
the first two sections and start with the example.

\section{Introduction}

\subsection{Installation}

\R is an open source software project and can be freely downloaded from
the CRAN website. There are already several resources for an
introduction to \R.  The contributed documentation section of the CRAN
website also offers excellent introductions in several
languages. Once \R is running, the installation of additional packages
is quite straightforward.  To install the \pkg{abc} and \pkg{abc.data} packages from \R
simply type:

\vspace{2mm}
\noindent
\texttt{> install.packages("abc")}\\
\texttt{> install.packages("abc.data")}

\vspace{2mm}
\noindent
Once the packages are installed, they need to be made
accessible to the current \R session by the commands:
<<results = hide>>=
library(abc)
require(abc.data)
@

For online help facilities or the details of a particular command
(such as the function \code{abc}), you can type:
<<eval=FALSE>>=
help(package="abc")
help(abc)
@ 

The first command gives a brief summary of the available commands in
the package, the second gives more detailed information about a
specific command. \R help files can also be used to illustrate how commands
are executed, so they can be pasted into an \R session, or run as a whole with
the following command:
<<eval=FALSE>>=
example(abc)
@ 



\subsection{Methods and classes in \R}

This is only a brief reminder that expressions in \R manipulate
objects, which may be data (a scalar or a matrix for example), but
objects may also be functions, or more complex collections of objects.
All objects have a class, which enables functions to act on them
appropriately. Thus, for example, when the function \code{summary}
is applied to an object of class \verb|"abc"| (i.e.\ an object
generated by the function \code{abc}), it acts
differently than on a simple matrix. In fact, the function
\code{summary} on an \verb|"abc"| object calls the function
\code{summary.abc}. Likewise, the \code{plot}, \code{hist} and
\code{print} functions will recognize the classes of objects
generated by the various functions of the \pkg{abc} package and act
accordingly.
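To see this dispatch mechanism in action, one can inspect the class of
a returned object and list the methods registered for it. In the
following sketch, \code{res} is a placeholder for the output of any
call to the function \code{abc}:

<<eval=FALSE>>=
class(res)               # "abc"
summary(res)             # dispatches to summary.abc
methods(class = "abc")   # all methods available for "abc" objects
@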


\section{A brief introduction to ABC}

This section contains a short introduction to the ABC
methodology. Recall that the main steps of an ABC analysis follow the
general scheme of any Bayesian analysis: formulating a model, fitting
the model to data (parameter estimation), and improving the model by
checking its fit (posterior predictive checks) and comparing it to
other models \citep{gelmanetal03, csilleryetal10}. In the following
sections, we detail each of these steps and highlight the appropriate
functions of the \pkg{abc} package \citep{csilleryetal12}.

\subsection{Parameter inference}

Suppose that we want to compute the posterior probability distribution
of a univariate or multivariate parameter, $\theta$.  A parameter
value $\theta_i$, is sampled from its prior distribution to simulate a
dataset $y_i$, for $i=1,\dots, n$ where $n$ is the number of
simulations. A set of summary statistics $S(y_i)$ is computed from the
simulated data and compared to the summary statistics obtained from
the actual data $S(y_0)$ using a distance measure $d$. We consider the
Euclidean distance for $d$, where each summary statistic is
standardized by a robust estimate of the standard deviation (the
median absolute deviation). If $d(S(y_i),S(y_0))$ (i.e. the distance
between $S(y_i)$ and $S(y_0)$) is less than a given threshold, the
parameter value $\theta_i$ is accepted. In order to set a threshold 
above which simulations are rejected, the user has to provide
the tolerance rate, which is defined as the percentage of accepted
simulations. The accepted $\theta_i$'s form a sample from an
approximation of the posterior distribution. The estimation of the
posterior distribution can be improved by the use of regression
techniques (see below).
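The rejection step above can be sketched in a few lines of \R code.
This is an illustrative simplification, not the package's actual
implementation; \code{param}, \code{sumstat} and \code{target} are
placeholder names for the simulated parameters, the simulated summary
statistics and the observed summary statistics, respectively:

<<eval=FALSE>>=
## scale each summary statistic by its median absolute deviation
mads <- apply(sumstat, 2, mad)
scaled <- sweep(sumstat, 2, mads, "/")
target.scaled <- target / mads
## Euclidean distances between simulated and observed statistics
dist <- sqrt(rowSums(sweep(scaled, 2, target.scaled, "-")^2))
## keep the parameters of the proportion 'tol' of closest simulations
tol <- 0.01
accepted <- param[dist <= quantile(dist, tol), , drop = FALSE]
@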

The \pkg{abc} package implements three ABC algorithms for constructing
the posterior distribution from the accepted $\theta_i$'s: a rejection
method, and regression-based correction methods that use either local
linear regression \citep{beaumontetal02} or neural networks
\citep{blumfrancois10}. When the rejection method is selected, the
accepted $\theta_i$'s are considered as a sample from the posterior
distribution \citep{pritchardetal99}. The two regression methods 
implement an additional step to correct for the imperfect match
between the accepted, $S(y_i)$, and observed summary statistics,
$S(y_0)$, using the following regression equation in the vicinity of
$S(y_0)$
$$
\theta_i=m(S(y_i))+ \epsilon_i,
$$ where $m$ is the regression function, and the $\epsilon_i$'s are
centered random variables with a common variance. Simulations that
closely match $S(y_0)$ are given more weight by assigning to each
simulation $(\theta_i,S(y_i))$ the weight $K[d(S(y_i),S(y_0))]$, where
$K$ is a statistical kernel. The local linear model assumes a linear
function for $m$ (\code{"loclinear"} method in the \code{abc}
function), while neural networks (\code{"neuralnet"} method) account
for the non-linearity of $m$ and allow users to reduce the dimension
of the set of summary statistics. Once the regression is performed, a
weighted sample from the posterior distribution is obtained by
correcting the $\theta_i$'s as follows,
$$
\theta_i^*=\hat{m}(S(y_0))+\hat{\epsilon_i},
$$ where $\hat{m}(\cdot)$ is the estimated conditional mean and the
$\hat{\epsilon_i}$'s are the empirical residuals of the regression
\citep{beaumontetal02}. Additionally, a correction for
heteroscedasticity is applied (by default)
$$
\theta_i^*=\hat{m}(S(y_0))+\frac{\hat{\sigma}(S(y_0))}{\hat{\sigma}(S(y_i))}
\hat{\epsilon_i}
$$ where $\hat{\sigma}(\cdot)$ is the estimated conditional standard
deviation \citep{blumfrancois10}. When using the \code{"loclinear"}
method, a warning about the collinearity of the design matrix of the
regression might be issued. In that situation, we recommend using
the related \code{"ridge"} method instead, which performs
local-linear ridge regression and deals with the collinearity issue.
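Within the package, these algorithms are selected via the
\code{method} argument of the \code{abc} function. Schematically, with
placeholder objects \code{target}, \code{param} and \code{sumstat}:

<<eval=FALSE>>=
## simple rejection
abc(target, param, sumstat, tol = 0.05, method = "rejection")
## local linear regression (heteroscedasticity correction by default)
abc(target, param, sumstat, tol = 0.05, method = "loclinear")
## local-linear ridge regression, useful when "loclinear"
## issues collinearity warnings
abc(target, param, sumstat, tol = 0.05, method = "ridge")
@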

Finally, we note that alternative algorithms exist that sample from an
updated distribution that is closer in shape to the posterior than to
the prior \citep{beaumontatal09, marjorametal03,
  wegmannetal10}. However, we do not implement these methods in
the \pkg{abc} package because they require the repeated use of the
simulation software.

\subsection{Cross-validation}

The \pkg{abc} package implements a leave-one-out cross-validation to evaluate
the accuracy of parameter estimates and the robustness of the
estimates to the tolerance rate. To perform cross-validation, the
$i^{th}$ simulation is randomly selected as a validation simulation,
its summary statistic(s) $S(y_i)$ are used as pseudo-observed
summary statistics, and its parameters are estimated with the function
\code{abc} using all simulations except the $i^{th}$
simulation. Ideally, the process is repeated $n$ times, where $n$ is
the number of simulations (so-called $n$-fold
cross-validation). However, performing an $n$-fold cross-validation
could be very time consuming, so the cross-validation is often
performed for a subset of a few hundred randomly selected
simulations. The prediction error is calculated as
$$ E_{\rm pred}=\frac{\sum_i(\tilde{\theta}_i-\theta_i)^2}{{\rm Var}(\theta_i)},
$$ where $\theta_i$ is the true parameter value of the $i^{th}$
simulated data set and $\tilde{\theta}_i$ is the estimated parameter
value (the posterior median, mean or mode).
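Given the output of such a cross-validation, the prediction error
defined above could be computed by hand as follows. This is an
illustrative sketch (the names are placeholders); in practice,
\code{summary} applied to a \code{cv4abc} object reports it directly:

<<eval=FALSE>>=
## theta.true: true parameter values of the validation simulations
## theta.est:  the corresponding estimates (e.g. posterior medians)
Epred <- sum((theta.est - theta.true)^2) / var(theta.true)
@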

\subsection{Model selection}

Model selection is part of any Bayesian analysis, and can be performed
in an ABC framework. The aim is generally to estimate the posterior
probability of a model $M$ as $Pr(M | S(y_0))$. Three different
methods are implemented in the \pkg{abc} package. With the rejection
method, the posterior probability of a given model is approximated by
the proportion of accepted simulations given this model. The two other
methods are based on multinomial logistic regression or neural
networks. In these two approaches, the model indicator is treated as
the response variable of a polychotomous regression, where the summary
statistics are the independent variables \citep{fagundesetal2007,beaumont08}. Using
neural networks can be efficient when highly dimensional statistics
are used. These methods are valid only when the different models to
be compared are, a priori, equally likely, and the same number of
simulations is performed under each model.
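These approaches correspond to the \code{"rejection"},
\code{"mnlogistic"} and \code{"neuralnet"} values of the \code{method}
argument of the function \code{postpr}. A schematic call, with
placeholder objects, is:

<<eval=FALSE>>=
## 'index' contains the model indicator of each simulation
postpr(target, index, sumstat, tol = 0.05, method = "mnlogistic")
@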

A further function, \code{expected.deviance}, is implemented to guide
the model selection procedure. The function computes an approximate
expected deviance from the posterior predictive distribution. Thus, in
order to use the function, users have to re-use the simulation tool
and to simulate data from the posterior parameter values. The
method is particularly advantageous when it is used with one of the
regression methods. Further details on the method can be found in
\citet{francoislaval11} and fully worked out examples are provided in
the package manual.


\subsection{Model misclassification}

A cross-validation tool is available for model selection as well. The
objective is to evaluate if model selection with ABC is able to
distinguish between the proposed models by making use of the existing
simulations. The summary statistics from one of the simulations are
considered as pseudo-observed summary statistics and classified using
all the remaining simulations. Then, one expects that a large
posterior probability should be assigned to the model that generated
the pseudo-observed summary statistics. Two versions of the
cross-validation are implemented. The first version is a ``hard''
model classification. We consider a given simulation as the
pseudo-observed data and assign it to the model with the highest
posterior model probability. This procedure is repeated for a given
number of simulations for each model. The results are summarized in a
so-called {\it confusion matrix} \citep{hastieetal03}. Each row of the
confusion matrix represents the number of simulations under a true
model, while each column represents the number of simulations under a
model assigned by \code{postpr}. If all simulations had been correctly
classified, only the diagonal elements of the matrix are non-zero. The
second version is called ``soft'' classification. Here we do not
assign a simulation to the model with the highest posterior
probability, but average the posterior probabilities over many
simulations for a given model. This procedure is again summarized as a
matrix, which is similar to the confusion matrix. However, the
elements of the matrix do not give model counts, but the average
posterior probabilities across simulations for a given model.


\subsection{Goodness-of-fit}

A formal hypothesis testing procedure for goodness-of-fit has been available since version 2.0 of
the package \pkg{abc}. The test statistic is the median of the distances between accepted summary
statistics and the observed ones. Instead of the median, other choices are possible, including the mean or the maximum.
To obtain the null distribution of the test statistic, we consider a given number of pseudo-observed summary statistics
as in the cross-validation experiments. Ideally, simulations should be performed with the posterior distributions
but it is more convenient to use the prior distribution because it does not require an additional use of the
simulation software.

In addition to the hypothesis-testing procedure, we provide a graphical method to check the goodness of fit.
Our advocated approach is based on Principal Component Analysis, which is useful for graphical display when there are many
summary statistics.




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{A walk-through example: human demographic history}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 


\subsection{Getting started}

In the following sections we show how the \pkg{abc} package can be used
for the inference and model comparison steps, and we indicate how
generic \R tools can be used for model checking. The
package is a so-called ``post-processing tool'', so the simulations
and the calculation of the summary statistics have to be performed by
the user's preferred software. Alternatively, simulation software
might be called in an \R session, which opens up the possibility for a
highly interactive ABC analysis.  For example, for coalescent models,
users might want to use one of the many existing programs for
simulating genetic data, such as \pkg{ms} \citep{hudson02} or
\pkg{simcoal2} \citep{simcoal2}.  Note that the package's data set
\code{human}, for example, has been generated using \pkg{ms}. The
exact code can be found in \texttt{inst/doc/runms.R} and may be re-run
or modified by interested users. Since the release of the
\pkg{phyclust} package, a function called \code{ms} is also available,
which allows users to call \pkg{ms} directly from an \R session. The
calculation of summary statistics could either be done using one's
own \R code or, for \pkg{ms} outputs, one could use the software
\pkg{msABC} \citep{pavlidisetal10}
(\url{http://www.bio.lmu.de/~pavlidis/home/?Software:msABC}).

In order to use the package certain \R objects always have to be
prepared. These are a vector of the observed summary statistics, a
matrix of the simulated summary statistics, where each row corresponds
to a simulation and each column corresponds to a summary statistic,
and a matrix of the simulated parameter values, where each row
corresponds to a simulation and each column corresponds to a
parameter. If users want to perform model selection, an additional
vector with the model indices has to be ready.
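For concreteness, with $n$ simulations, $p$ summary statistics and $k$
parameters, the prepared objects should satisfy the following checks
(the object names here are purely illustrative):

<<eval=FALSE>>=
length(target)   # p   -- observed summary statistics
dim(sumstat)     # n p -- simulated summary statistics
dim(param)       # n k -- simulated parameter values
length(index)    # n   -- model indices (for model selection only)
@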

The \pkg{abc} package contains two examples: \code{musigma2} and
\code{human}. The \code{musigma2} data is very simple, so users might
prefer to try this first. The example is motivated by Edgar
Anderson's iris data: the observed statistics are the mean and
variance of the sepal of \emph{Iris setosa}, estimated from part of
the \code{iris} data. The aim is to estimate two parameters: $\mu$ and
$\sigma^2$. The advantage of this simple example is that the true
posterior distributions are known, so the accuracy of parameter
inference can be easily checked. The data set can be loaded using the
following code:

<<>>=
library(abc)
require(abc.data)
data(musigma2)
@

The data set contains five \R objects. The manual page of these \R
objects can be accessed by typing the name of one of the five objects,
such as \code{stat.obs}, which contains the observed summary
statistics (\texttt{?stat.obs}). At the end of the manual page of the
function \code{abc}, several examples can be found using the
\code{musigma2} data set, so we leave the user to discover this simple
example on their own. This example can be accessed by typing
\texttt{?abc}.


\subsection{The human data}

The \code{human} data set is more realistic. Our aim is to estimate
the human ancestral population size and to discriminate between different
demographic models. We discuss this example in detail.

\subsection{Background}

Several population genetic studies have found evidence that African
human populations have been expanding, while human populations outside
of Africa went through an out-of-Africa population bottleneck and then expanded. We
re-examined this problem by re-analyzing published data of 50 unlinked
autosomal non-coding regions from a Hausa (Africa), a Chinese (Asia),
and an Italian (Europe) population \citep{voightetal05}. Data is summarized
using three summary statistics: the average nucleotide diversity
($\bar\pi$), and the mean and the variance of Tajima's D (Table 1 in
\cite{voightetal05}). Both Tajima's D and $\bar\pi$ have been
classically used to detect historical changes in population size. The
observed summary statistics (our data!) can be accessed with the
following code:

<<>>=
require(abc.data)
data(human)
stat.voight
@

A negative Tajima's D signifies an excess of low frequency
polymorphisms, indicating population size expansion, while a positive
Tajima's D indicates low levels of both low and high frequency
polymorphisms, a sign of a population bottleneck. In constant
size populations, Tajima's D is expected to be zero.

The data set \code{human} contains four objects, and
\code{stat.voight} is just one of them. The manual page for all of the
objects can be accessed by referring to any one of them, for
example, by typing \texttt{?stat.voight}. Other objects
of \code{human} contain the simulated summary statistics
(\code{stat.3pops.sim}) and the model indices (\code{models}). Using
these three objects, we can estimate the posterior probabilities of
different demographic scenarios (see below) in the three human
populations: Hausa, Italian, and Chinese. The fourth object,
\code{par.italy.sim} may be used to estimate the ancestral population
size of the European population (represented by an Italian sample).


\subsection{The demographic models}

Since we do not know which demographic model would be the most
appropriate to estimate the ancestral human population size, we first
propose three different models and check which one is best
supported by the data. The three models of demographic history are:
constant population size, exponential growth after a period of
constant population size, and population bottleneck, where, after the
bottleneck, the population recovers to its original size. All three
models can be defined by a number of demographic parameters, such as
population sizes, rates and timings of changes in population size,
which we do not detail here. The data have been generated for the
users; however, they can easily be re-generated and the demographic
parameters may be altered. We simulated $50,000$ data sets under each
of the three demographic models using the software \pkg{ms}
\citep{hudson02}, and calculated the three summary statistics in each
model. The exact \R code to simulate the \code{human} data can be
found in \verb|/inst/doc/runms.R|; thus, it can be viewed, modified and
re-run by interested users, given that \pkg{ms} and its sister package
\verb|sample_stats| are installed and configured.

To illustrate the informativeness of our summary statistics we can
plot them as a function of the model (Figure \ref{fig:ss}).

<<label=ssplot, include=FALSE>>=
par(mfcol = c(1,3), mar=c(5,3,4,.5))
boxplot(stat.3pops.sim[,"pi"]~models, main="Mean nucleotide diversity")
boxplot(stat.3pops.sim[,"TajD.m"]~models, main="Mean Tajima's D")
boxplot(stat.3pops.sim[,"TajD.v"]~models, main="Var in Tajima's D")
@

\setkeys{Gin}{width=5in, height=5in}
\begin{figure} 
  \begin{center} 
<<label=ss,fig=TRUE,echo=FALSE>>= 
<<ssplot>> 
@ 
\end{center} 
\caption{Summary statistics under the three demographic models:
  population bottleneck, constant population size, exponential growth}
\label{fig:ss}
\end{figure}


\subsection{Model selection}

The main model selection function is called \code{postpr}. However, before
applying this function on the real data, we perform a cross-validation
for model selection (\code{cv4postpr}) to evaluate if ABC can, at all,
distinguish between the three models. The cross-validation might take a
long time to run. At this point, we might want just a quick result to
illustrate the use of the function, so we run only a few
cross-validation simulations (\code{nval=5} per model), and summarize
the results using the following commands:

<<label=modsel>>=
cv.modsel <- cv4postpr(models, stat.3pops.sim, nval=5, tol=.01, method="mnlogistic")
s <- summary(cv.modsel)
@ 

The resulting confusion matrix may also be plotted using the following command:
<<label=cv4postprplot, include=FALSE>>=
plot(cv.modsel, names.arg=c("Bottleneck", "Constant", "Exponential"))
@ 

\setkeys{Gin}{width=4in, height=4in}
\begin{figure} 
  \begin{center} 
<<label=cv4postpr,fig=TRUE,echo=FALSE>>= 
<<cv4postprplot>>
@ 
\end{center} 
\caption{Misclassification proportions for the three models. If all
  three models were classified correctly $100\%$ of the time, each
  bar would have a single colour.}
\label{fig:cv4postpr}
\end{figure}

<<echo=FALSE, include=FALSE>>=
mytotal <- length(cv.modsel$cvsamples)/length(unique(models))
myexp <- s$conf.matrix$tol0.01[3,3]
misclasstot <- 1-(sum(s$conf.matrix$tol0.01[1,1],s$conf.matrix$tol0.01[2,2],s$conf.matrix$tol0.01[3,3])/sum(s$conf.matrix$tol0.01))
@ 

Both the confusion matrix and the barplot (Figure \ref{fig:cv4postpr})
illustrate that ABC is able to distinguish between these three
models. Notably, the exponential expansion model can be classified
correctly the most frequently: \Sexpr{myexp} times out of
\Sexpr{mytotal}. However, the total misclassification rate of
\Sexpr{round(misclasstot,2)} prompts us to be cautious about the
results obtained with model selection and suggests that additional
data or summary statistics are required to better discriminate between
the models. Remember that simulations are randomly selected, so every
user will get slightly different figures for the misclassification
rates, especially if \code{nval} is small.

Now, we may calculate the posterior probabilities of each demographic
scenario using the rejection (\verb|"rejection"|) and the multinomial
logistic regression method (\verb|"mnlogistic"|) of the function
\code{postpr} with a tolerance rate of $5\%$. The function
\code{summary} prints out posterior model probabilities and ratios of
model probabilities (the Bayes factors) in a user-friendly way:

<<>>= 
modsel.ha <- postpr(stat.voight["hausa",], models, stat.3pops.sim, tol=.05, method="mnlogistic")
modsel.it <- postpr(stat.voight["italian",], models, stat.3pops.sim, tol=.05, method="mnlogistic")
modsel.ch <- postpr(stat.voight["chinese",], models, stat.3pops.sim, tol=.05, method="mnlogistic")
summary(modsel.ha)
summary(modsel.it)
summary(modsel.ch)
@

These results show that the Hausa data best support the exponential
growth model, while the Italian and Chinese data best support the
population bottleneck model.

In the following, we only focus on the data set from
Italy.


\subsection{Goodness-of-fit}


Before turning to parameter inference, it is important to check that the preferred
model provides a good fit to the data.

For the Italian data, we can plot the histogram of the null distribution under a bottleneck model on which we superimpose
the observed value (Figure \ref{fig:distance}).

<<label=distanceplot>>=
res.gfit.bott=gfit(target=stat.voight["italian",], sumstat=stat.3pops.sim[models=="bott",],
statistic=mean, nb.replicate=100)
plot(res.gfit.bott, main="Histogram under H0")
@

\setkeys{Gin}{width=4in, height=4in}
\begin{figure} 
  \begin{center} 
<<label=distance,fig=TRUE,echo=FALSE>>= 
<<distanceplot>> 
@ 
\end{center} 
\caption{Histogram of the null distribution of the test statistic for goodness of fit assuming a bottleneck model.}
\label{fig:distance}
\end{figure}

We can compute a P-value to test the fit to the bottleneck model.
We can additionally check that the other demographic models do not provide a good fit to the Italian data, which 
would confirm the results obtained with model selection.

<<>>=
res.gfit.exp=gfit(target=stat.voight["italian",], sumstat=stat.3pops.sim[models=="exp",],
statistic=mean, nb.replicate=100)
res.gfit.const=gfit(target=stat.voight["italian",], sumstat=stat.3pops.sim[models=="const",],
statistic=mean, nb.replicate=100)
summary(res.gfit.bott)
summary(res.gfit.exp)
summary(res.gfit.const)
@

Alternative graphical procedures are also possible to assess goodness of fit. We can perform Principal
Component Analysis (PCA) with 2 components to construct new summary statistics. We can then display the $90\%$
(or, more generally, $1-$\code{cprob}) envelope of the 2 PCs obtained under each demographic model (Figure \ref{fig:pca}).
Simulations from the prior are used to avoid an additional use of the
simulation software, but similar procedures can be applied to posterior simulations.


<<label=pcaplot>>=
gfitpca(target=stat.voight["italian",], sumstat=stat.3pops.sim, index=models, cprob=.1)
@

\setkeys{Gin}{width=4in, height=4in}
\begin{figure} 
  \begin{center} 
<<label=pca,fig=TRUE,echo=FALSE>>= 
<<pcaplot>> 
@ 
\end{center} 
\caption{$90\%$ envelope of the 2 Principal Components obtained under each demographic model.}
\label{fig:pca}
\end{figure}

\subsection{Posterior predictive checks}

To further confirm that the bottleneck model provided the best fit to
the Italian data, we can also consider posterior predictive checks
\citep{gelmanetal03}. Note that there is no specific function in
\pkg{abc} for posterior predictive checks, nevertheless the task can
be easily carried out. First, we estimate the posterior distributions
of the parameters of the three models using the function \code{abc}
(see details in the following section). Then, we sample a set of 1,000
multivariate parameters from their posterior distribution. Last, we
obtain a sample from the distribution of the three summary statistics
a posteriori by simulating data sets with the 1,000 sampled
multivariate parameters using \pkg{ms}. The exact code to run \pkg{ms}
again can be found in \verb|inst/doc/runms4ppc.R|. By running this
code one can re-generate the package's data set \code{ppc}.

The results of the posterior predictive checks can best be displayed
as histograms:
\setkeys{Gin}{width=5in, height=5in}
<<label=ppcplot, include=FALSE>>=
require(abc.data)
data(ppc)
mylabels <- c("Mean nucleotide diversity","Mean Tajima's D", "Var Tajima's D")
par(mfrow = c(1,3), mar=c(5,2,4,0))
for (i in c(1:3)){
    hist(post.bott[,i],breaks=40, xlab=mylabels[i], main="")
    abline(v = stat.voight["italian", i], col = 2)
}
@

\begin{figure} 
  \begin{center}
<<label=ppc,fig=TRUE,echo=FALSE>>= 
<<ppcplot>> 
@ 
\end{center} 
\caption{Posterior predictive checks for the Italian data under the
  bottleneck model}
\label{fig:ppc} 
\end{figure}

Figure \ref{fig:ppc} illustrates the posterior predictive checks
under the bottleneck model. We can see that the bottleneck model is
able to reproduce the observed values of the summary statistics in the
Italian data. Note that the constant population size model is also
able to reproduce the observed summary statistics of the Italian
data. However, the model of exponential growth is unable to account
for the observed level of heterozygosity when accounting for the
remaining two summary statistics in the Italian sample (results not
shown, but they may be generated by the user).

Finally, note that such posterior predictive checks use the summary
statistics twice; once for sampling from the posterior and once for
comparing the marginal posterior predictive distributions to the
observed values of the summary statistics. To avoid this circularity,
we might consider using different summary statistics for posterior
predictive checks than for parameter estimation. Nevertheless, this
could be difficult in many applications, since we already used our
``best'' summary statistics for inference.

Now we can move to our original objective, which is to estimate the
ancestral population size.

\subsection{Cross-validation}

Now, we are almost ready to infer the ancestral population size under
the bottleneck model for the Italian data. So, we select the
simulated parameter values that correspond to the bottleneck model:

<<>>=
stat.italy.sim <- subset(stat.3pops.sim, subset=models=="bott")
head(stat.italy.sim)
@

The bottleneck model can be described with four parameters, which can
be found in the four columns of \code{par.italy.sim}: the ancestral
population size $N_a$, the ratio of the population sizes before and
during the bottleneck (\code{a}), the duration of the bottleneck
(\code{duration}), and the time since the beginning of the bottleneck
(\code{start}).

<<>>=
head(par.italy.sim)
@

Before moving to the inference step, we first assess if ABC is able to
estimate the parameter $N_a$ at all. We use the function \code{cv4abc}
to determine the accuracy of ABC and the sensitivity of estimates to
the tolerance rate. The following code evaluates the accuracy of
estimates of $N_a$ under three tolerance rates using the rejection and
the local linear regression method. The \verb|"neuralnet"| method is
not recommended for cross-validation, as it takes considerably longer
to run than the other methods. Since the cross-validation
may take a long time to run, we might prefer to start with a
smaller value of \code{nval} to compare different tolerance
rates and different estimation methods.

<<>>=
cv.res.rej <- cv4abc(data.frame(Na=par.italy.sim[,"Ne"]), stat.italy.sim, nval=10,
tols=c(.005,.01, 0.05), method="rejection")
cv.res.reg <- cv4abc(data.frame(Na=par.italy.sim[,"Ne"]), stat.italy.sim, nval=10,
tols=c(.005,.01, 0.05), method="loclinear")
summary(cv.res.rej)
summary(cv.res.reg)
@ 

We can also plot the results using the following code:

\setkeys{Gin}{width=6in, height=6in}
<<label=cv4abcplot, include=FALSE>>=
par(mfrow=c(1,2), mar=c(5,3,4,.5), cex=.8)
plot(cv.res.rej, caption="Rejection")
plot(cv.res.reg, caption="Local linear regression")
@

\begin{figure}
  \begin{center} 
<<label=cv4abc,fig=TRUE,echo=FALSE>>= 
<<cv4abcplot>> 
@ 
\end{center}
\caption{Cross-validation for parameter inference. The colors
  correspond to different levels of the tolerance rate in an
  increasing order from red to yellow.} 
\label{fig:cv4abc} 
\end{figure}

The function \code{summary} shows the prediction error under the three
different tolerance rates. \code{plot} compares the two methods
(\verb|"rejection"| and \verb|"loclinear"|) under the three different
tolerance rates. Users can choose how to summarize the posterior
distribution by setting the argument \code{statistic} of the function
\code{cv4abc}. By default, the posterior distribution is summarized by
its median. Thus, the plots show the posterior distribution medians of
$N_a$ for each cross-validation sample.
Points of the cross-validation plot are scattered around the identity
line indicating that $N_a$ can be well estimated using the three
summary statistics (Figure \ref{fig:cv4abc}). Further, estimates were
not only accurate for $N_a$, but also insensitive to the tolerance
rate (Figure \ref{fig:cv4abc}). Accordingly, the prediction error is
relatively low and independent of the tolerance rate.
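For example, if we prefer to summarize each cross-validation posterior
by its mean rather than the default median, we can set the
\code{statistic} argument of \code{cv4abc}. The sketch below reuses the
objects created above and is not evaluated here; the accepted values of
\code{statistic} are documented on the help page of \code{cv4abc}.

<<eval=FALSE>>=
## Summarize each cross-validation posterior by its mean instead of
## the default median (see ?cv4abc for the accepted values)
cv.res.rej.mean <- cv4abc(data.frame(Na=par.italy.sim[,"Ne"]), stat.italy.sim,
nval=10, tols=c(.005,.01,.05), method="rejection", statistic="mean")
summary(cv.res.rej.mean)
@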

\subsection{Parameter inference}

Now, we finally estimate the posterior distribution of $N_a$ using the
main function of the package: \code{abc}. The following code shows how
to use \code{abc} with the method \verb|"neuralnet"|. We chose to
use a \verb|"log"| transformation of the parameter. The correction for
heteroscedasticity is performed by default.

<<>>=
res <- abc(target=stat.voight["italian",], param=data.frame(Na=par.italy.sim[, "Ne"]),
sumstat=stat.italy.sim, tol=0.05, transf=c("log"), method="neuralnet")
res
@

The function \code{abc} returns an object of class \verb|"abc"|.  The
function \code{print} returns a simple summary of the object (see
above). Using the function \code{summary} we can calculate summaries
of the posterior distributions, such as the mode, mean, median, and
credible intervals, taking into account the posterior weights, when
appropriate. We obtain the following summary of the posterior
distribution using the function \code{summary}:

<<>>=
summary(res)
@
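
If a different credible interval level is needed, the level can be
passed to \code{summary} via its \code{intvl} argument. The argument
name is taken from the help page of \code{summary.abc} and should be
checked against the installed version of the package; the sketch below
is not evaluated here.

<<eval=FALSE>>=
## 90% credible intervals instead of the default 95%
## (assumes summary.abc accepts an 'intvl' argument; see ?summary.abc)
summary(res, intvl = 0.9)
@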

Two functions are available to visualize the results, \code{hist} and
\code{plot}. \code{hist} simply displays a histogram of the weighted
posterior sample (Figure \ref{fig:abchist}).

<<label=abchistplot, include=FALSE>>=
hist(res)
@ 

\setkeys{Gin}{width=4in, height=4in}
\begin{figure} 
  \begin{center} 
<<label=abchist,fig=TRUE,echo=FALSE>>= 
<<abchistplot>> 
@
\end{center} 
\caption{Histogram of the weighted posterior sample of $N_a$.}
\label{fig:abchist} 
\end{figure}

The function \code{plot} can be used as a diagnostic tool when one of
the regression methods is applied. We can run the following code to
generate the diagnostic plots of the estimation of the posterior
distribution of $N_a$:
<<label=abcplot, include=FALSE>>=
par(cex=.8)
plot(res, param=par.italy.sim[, "Ne"])
@ 

\setkeys{Gin}{width=6in, height=6in}
\begin{figure} 
  \begin{center} 
<<label=abc,fig=TRUE,echo=FALSE>>= 
<<abcplot>> 
@
\end{center} 
\caption{ABC regression diagnostics for the estimation of the
  posterior distribution of $N_a$.}
\label{fig:abc} 
\end{figure}

Figure \ref{fig:abc} shows that \code{plot} returns various plots that
allow the evaluation of the quality of estimation when one of the
regression methods is used. The following four plots are generated: a
density plot of the prior distribution, a density plot of the
posterior distribution estimated with and without regression-based
correction, a scatter plot of the Euclidean distances as a function of
the parameter values, and a normal Q-Q plot of the residuals from the
regression. When the heteroscedastic regression model is used, a
normal Q-Q plot of the standardized residuals is displayed.

The lower left panel of Figure \ref{fig:abc} shows that the posterior
distribution is very different from the prior distribution, confirming
that the three summary statistics convey information about the
ancestral population size. The upper right panel of
Figure \ref{fig:abc} shows the distance between the simulated and
observed summary statistics as a function of the prior values of
$N_a$. Again, this plot confirms that the summary statistics convey
information about $N_a$, because the distances corresponding to the
accepted values are clustered and not spread around the prior range of
$N_a$. The lower right panel displays a standard Q-Q plot of the
residuals of the regression, $\hat{\epsilon}_i$. This plot serves as a
diagnostic of the linear or non-linear regression when the method
is \verb|"loclinear"| or \verb|"neuralnet"|.


\section{Conclusions}

A few of the capabilities of the \pkg{abc} package have been
described. Inevitably, new applications will demand new features and
might reveal unforeseen bugs. Please send comments or suggestions and
report bugs to \texttt{kati.csillery@gmail.com} and/or
\texttt{michael.blum@imag.fr}.

\bibliographystyle{plainnat}
\bibliography{abcvignette}
\addcontentsline{toc}{section}{Bibliography} % To add bibliography to the TOC


\end{document}