%\VignetteIndexEntry{JoSAE: Functions for unit-level small area estimators and their variance}
\documentclass[english,11pt]{article}
%\documentclass[a4paper,11pt]{article}
%%%% v0.2-1: spell errors and bib corrected
\usepackage[a4paper]{geometry}
\geometry{verbose,tmargin=2cm,bmargin=2cm,lmargin=2cm,rmargin=2cm,footskip=1cm}

\usepackage{Sweave}
\usepackage{color}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage[authoryear]{natbib}
\usepackage{url}
\usepackage[utf8]{inputenc}

% \setlength{\oddsidemargin}{1cm}
% \setlength{\textwidth}{15cm}
% \setlength{\topmargin}{-2.2cm}
% \setlength{\textheight}{22cm}

% \usepackage{amssymb}
% \usepackage{amsmath}


%\usepackage{hyperref}
\usepackage[linkcolor=blue, bookmarks=true, citecolor=blue, colorlinks=true, linktocpage, a4paper]{hyperref}



\title{Vignette of the JoSAE package}
\author{Johannes Breidenbach\thanks{Norwegian Forest and Landscape
    Institute, 1431 {\AA}s, Norway, job@skogoglandskap.no, Tel.: +47 64 94 89 81}}
\date{6 October 2011: JoSAE 0.2}
%\address{Norwegian Institute for Forest and Landscape}

\begin{document}
\SweaveOpts{concordance=TRUE}
\maketitle

%\tableofcontents


\section{Introduction}
The aim in the analysis of sample surveys is frequently to derive
estimates of subpopulation characteristics. This task is denoted small
area estimation (SAE) \citep{rao2003}. Often, the sample
available for the subpopulation is, however, too small to allow a
reliable estimate. Frequently, auxiliary variables exist that are correlated
with the variable of interest. Several estimators can make
use of auxiliary information which may reduce the variance of the
estimate \citep{rao2003}. Another term for \emph{small area} is
\emph{domain}. These two terms will be used interchangeable in the following.


The \texttt{JoSAE} package implements the \emph{generalized
regression} (GREG) \citep{sarndal1984} and unit level \emph{empirical best
linear unbiased prediction} (EBLUP) \citep{battese1988}
estimators and their variances. The \emph{synthetic regression} and
the \emph{simple random sample} (SRS) estimates are also calculated. The
purpose of the \texttt{JoSAE} package is to document the functions used in
the publication of \citep{bre2011}. The data used in
that study are also provided.


If \texttt{R} is running, the \texttt{JoSAE} package can be installed
by typing

<<eval=F, echo=TRUE>>=
install.packages("JoSAE")

@
into the console\footnote{The character "\texttt{>}" is not
  part of the command. A working Internet connection is required.}.

The command
<<>>=
library(JoSAE)
@
loads the package into the current workspace. We can get an overview of
the packages' contents by typing
<<eval=F, echo=T>>=
?JoSAE
@

\section{Using the provided functions - small area estimates}
For our small area estimates, we need
\begin{itemize}
   \item sample data which contain the variable of interest and the
     auxiliary variables of all sampled population elements and
   \item domain data which contain the mean of the auxiliary variables
     of all population elements within each domain of interest. It is
     assumed that auxiliary information is available for every
     population element.
\end{itemize}
Both data sets need to have a corresponding domain ID.


\subsection{Mean forest biomass within Norwegian municipalities}

To load and plot the data used by \citep{bre2011} we write:
<<fig=TRUE>>=
	#mean auxiliary variables for the populations in the domains
data(JoSAE.domain.data)
	#data for the sampled elements
data(JoSAE.sample.data)

#plot(biomass.ha~mean.canopy.ht,JoSAE.sample.data)

library(lattice)

print(xyplot(biomass.ha ~ mean.canopy.ht | domain.ID, data = JoSAE.sample.data))
@

The data set \texttt{JoSAE.sample.data} contains the above-ground
forest biomass (the variable of interest) observed on
sample plots of the Norwegian National Forest Inventory (NNFI) and the mean
canopy height derived from overlapping digital aerial images (the
auxiliary variable). The domain ID indicates in which of 14
municipalities (i.e., our small areas) the sample plot was located.

The data set \texttt{JoSAE.domain.data} contains the mean
canopy height, photogrammetrically obtained from overlapping digital aerial images within
the forest of a municipality. All population elements (i.e., not only
those elements where field data from the NNFI were available) were
used to derive this mean.

In order to make use of the auxiliary variables, a statistical model
needs to be fit that links the variable of interest to the auxiliary
variables. We fit a linear mixed-effects model \citep{pinheiro2011}
with a random intercept on the municipality level to our data:

<<>>=
    #lme model
summary(fit.lme <- lme(biomass.ha ~ mean.canopy.ht, data=JoSAE.sample.data
                       , random=~1|domain.ID))
@

In combination with the domain-level data, the functions provided in
the \texttt{JoSAE} package can now be used to calculate domain level
EBLUP estimates and their variances. Since the functions expect
variable names in the domain data and the sample data to be the same,
we first have to do some renaming:

<<>>=
    #domain data need to have the same column names as sample data or vice versa
d.data <- JoSAE.domain.data
names(d.data)[3] <- "mean.canopy.ht"
@

Then we can use the \texttt{eblup.mse.f.wrap} function, which does all
the work. This function is a wrapper function that calls several other
\texttt{JoSAE} functions. All attributes the function needs are the domain data
and the fitted model (an lme object).

<<>>=
result <- eblup.mse.f.wrap(domain.data = d.data, lme.obj = fit.lme)
@

Besides the EBLUP estimate and its variance, the function calculates
the GREG and SRS estimate as well as a synthetic regression estimate based on a linear
model fitted with the fixed-part of the lme formula. Many other
domain characteristics are calculated by the \texttt{eblup.mse.f.wrap}
function. The help page lists the details. Let's print some of the
most interesting results in Tables~\ref{tab:one} and \ref{tab:two}.

<<echo=FALSE,results=tex>>=
library(xtable)
#tmp <- result[,c(1:2,6,4,9:11)]
tmp <- result[,c("domain.ID","N.i.domain","n.i.sample", "biomass.ha.sample.mean", "GREG"
                 , "EBLUP", "Synth")]
names(tmp)[4] <- c("sample.mean")
print(xtable(tmp, label = "tab:one", caption="Number of population and sampled elements as well as simple random sample, synthetic, GREG and EBLUP estimates of the mean above-ground forest biomass within 14 Norwegian municipalities."), include.rownames=FALSE)

tmp <- result[,c("domain.ID","n.i.sample","sample.se", "GREG.se", "EBLUP.se.1", "EBLUP.se.2")]#result[,c(1:2,6,23:26)]
print(xtable(tmp, label = "tab:two", caption="Number of population and sampled elements as well as standard errors of the simple random sample, GREG and EBLUP estimates of the mean above-ground forest biomass within 14 Norwegian municipalities."), include.rownames=FALSE)
@

\clearpage

The \texttt{eblup.mse.f.wrap} function does not return a standard error for the
synthetic regression estimate, since no estimators exist that consider
its model bias. In Table~\ref{tab:two}, it needs to be noted that variances for the SRS and GREG estimates are
unstable for small sample sizes within domains (say <6 observations). A variance
estimate is technically impossible for domains with just one
observation. The EBLUP variances are frequently smaller than the GREG
variances and stable even for domains with just one
observation. However, the EBLUP variance is model-based and thus
relies on the correctness of the fitted model. \citet{rao2003} suggests two
different EBLUP variances estimates. Both are returned by the
\texttt{eblup.mse.f.wrap} function (Table~\ref{tab:two}).

The data can be visualized by:%height=8,
<<fig=T,width=7,height=6>>=

tmp <- result[,c("biomass.ha.sample.mean", "Synth", "GREG", "EBLUP")]
    #actual plot
tmp1 <- barplot(t(as.matrix(tmp)), beside=T
            , names.arg=result$domain.ID
            , xlab="Municipalities"
                , ylab=expression(paste("Estimated biomass (Mg ", ha^{-1}, ")" ))
                , ylim=c(0,200))
    #print n.sample plots
text(tmp1[2,]+.5, y = 50, labels = result$n.i.sample,cex=1.5)
    #error bars
tmp2<- result[,c("sample.se", "sample.se", "GREG.se", "EBLUP.se.2")]#sample.se twice to fill the column, only used once.
tmp2[is.na(tmp2)] <- 0
        #plot error bars
            #sample mean
arrows(x0=tmp1[1,], y0=tmp[,1]+tmp2[,1], x1=tmp1[1,], y1 = tmp[,1]-tmp2[,1]
           , length = 0.01, angle = 90, code = 3)
            #GREG
arrows(x0=tmp1[3,], y0=tmp[,3]+tmp2[,3], x1=tmp1[3,], y1 = tmp[,3]-tmp2[,3]
           , length = 0.01, angle = 90, code = 3)
            #EBLUP
arrows(x0=tmp1[4,], y0=tmp[,4]+tmp2[,4], x1=tmp1[4,], y1 = tmp[,4]-tmp2[,4]
           , length = 0.01, angle = 90, code = 3)
    #legend
legend(13,200, fill=grey(c(.3, .6, .8, .9)), legend=c("SRS", "Synth", "GREG", "EBLUP"), bty="n")

@


\subsection{County crop areas in Iowa}
\citet{battese1988} were the first to describe the EBLUP estimator. They demonstrated its
application using Landsat data to estimate the mean hectares of corn and
soybeans within counties (small areas) in north-central Iowa. Thanks to
\citet{schoch2011}, the Landsat data are available in \texttt{R}. The functions
in the \texttt{JoSAE} package should give approximately similar results as
those presented by \citet{battese1988} and \citet[Table 7.3,p.144]{rao2003}.

Let's get the data, split the data sets into a domain-specific and
sample specific data frame and add a numeric domain ID to both. We will also exclude an ``outlying''
domain\footnote{The \texttt{rsae} package was specifically developed for robust
estimation where outliers do not need to be excluded. As of R 3.0.2,
rsae is archived. Therefore, the landsat data were included in JoSAE.} in row 33 as
was suggested by \citet{battese1988}:
<<>>=


data(landsat)

    #prepare the domain data - exclude "outlying" domain
landsat.domains <- unique(landsat[-33,c(1, 7:8,10)])
        #add a numeric domain ID
landsat.domains$domain.ID <- 1:nrow(landsat.domains)
        #change names to the names in the sample data
names(landsat.domains)[2:3] <- c("PixelsCorn", "PixelsSoybeans")

    #prepare the unit-level sample data
tmp <- landsat[-33,c(2:6, 10)]
        #add numeric domain ID
landsat.sample <- merge(landsat.domains[4:5], tmp, by="CountyName")

@


Now we can fit a linear mixed-effects model and obtain our small area
estimates:
<<>>=
summary(landsat.lme <- lme(HACorn ~ PixelsCorn + PixelsSoybeans
                           , data=landsat.sample
                           , random=~1|domain.ID))
    #obtain EBLUP estimates and MSE
result <- eblup.mse.f.wrap(domain.data = landsat.domains
             , lme.obj = landsat.lme)

@

<<echo=FALSE,results=tex>>=

tmp <- result[,c("CountyName.domain","n.i.sample","EBLUP", "EBLUP.se.1", "EBLUP.se.2", "GREG.se")]
#tmp <- result[,c(5,9,13,28,29,27)]
names(tmp)[1:2] <- c("County.name","n_i")#, "EBLUP.mean", "EBLUP.se.1", "EBLUP.se.2", "GREG.se")
#names(tmp) <- c("County.name","n_i", "EBLUP.mean", "EBLUP.se.1", "EBLUP.se.2", "GREG.se")
print(xtable(tmp, label = "tab:landsat"
             , caption="EBLUP estimates of county means of hectares under corn and estimated standard errors of the EBLUP and GREG estimates.")
      , include.rownames=FALSE)


@

Comparing Table~\ref{tab:landsat} with the reference
\citep{battese1988, rao2003} suggests that the results
are quite similar but not exactly the same. The EBLUP estimates for the
county means are slightly different because \citet{battese1988} adjusted the
estimates to sum up to the approximately unbiased Survey-Regression
estimate for the total area. The standard errors are slightly different since \citet{battese1988}
used the method of \emph{fitting of constants} to estimate the model
parameters but REML was used here. Finally, \citet{battese1988} obtained standard
errors also for Survey-Regression estimates within domains with
just one observation. Unfortunately, no details were
elaborated. Given that the Survey-Regression estimator
should be the same as the GREG \citep[p. 20]{rao2003}, it is unclear to me how this
was done (any hints would be appreciated).

All in all, it looks like
the functions in the \texttt{JoSAE} package are correctly implemented.



\section{Synthetic estimation}

This section documents the estimators in \cite{bre2015}. R-code is given rather than implemented functions since the implementation is rather straight forward. The validation data are not given here except for one stand for which the variance estimation is explained.

It should again be noted that the synthetic estimators should be avoided, if observations are available within the small areas. This is because regression models can be biased for specific small areas. 

Load NFI data, fit the linking model, and create data for one validation stand. Elev.Mean is the vegetation height, N and E are northing and easting.
<<>>=
data(nfi)
    #fit the model
fit.nfi.iw <- lm(vol.2011~Elev.Mean, nfi, weights=1/Elev.Mean)
  #data (model matrix, X) of one validation stand
stand <- cbind(Intercept=1, Elev.Mean=c(147.41,127.48,98.66,118.85,124,120.81,119.7),
               N=c(0,23,0,55,27,80,56), E=c(73,77,0,39,37,54,54) )
  #aggregate to obtain X-bar
stand.agg <- apply(stand[,1:2], 2, mean)
@

As indicated by one reviewer: If just variance estimator (3) is of interest, also a White estimator could be used.
For all other estimators, a model for the residual variance is needed.

Synthetic variance estimators consider the uncertainty in the model. The uncertainty in the model parameters is 
the covariance matrix and will be called Sigma. The residual variance will be called sig. Due to the assumed heteroskedasticity, 
it needs to be multiplied with $x_i$ to be meaningful.
<<>>=
    #obtain covariance matrix 
Sigma <- vcov(fit.nfi.iw)

    #residual variance
sig <- summary(fit.nfi.iw)$sigma
@

Variance estimator (3) is based on the concept of the estimation of superpopulation parameters and can be obtained as follows for the example stand.

<<>>=
var.p <- t(stand.agg) %*% Sigma %*% stand.agg
@


Variance estimator (5) does not make much sense for heteroskedastic models and is therefore not shown here. Variance estimator (6) can be implemented as:

<<>>=
var.prh <- var.p + sum(sig^2 * stand[,2])/nrow(stand)^2
@

For variance estimator (7), we need some model that describes the spatial autocorrelation of grid cells within a stand. In the paper, we use one global model. This may not be the best solution, as it is likely that the structure of the autocorrelation is different from stand to stand. This is, however, out of the scope of the paper. A spatial range of 23 m was estimated for the spatial model based on the validation data. This process is not shown here. First we create the spatial object:

<<>>=
library(nlme)
spG <- corGaus(23, form = ~N+E)
@


Then we create the correlation matrix given the distance between the observations and the autocorrelation structure. Furthermore, we create the matrix of expected variances.

<<>>=
cormat <- corMatrix(Initialize(spG, data.frame(stand)))

varmat <- (sig * sqrt(stand[,2])) %o% (sig * sqrt(stand[,2]))
@

This finally results in estimator (7):

<<>>=
var.prhs <- var.p + sum(cormat * varmat)/nrow(stand)^2
@


The square root of the variances (ignoring bias) results in the standard error (SE). The SE of the different estimators increases from (5)-(7) because more error components are considered.

<<>>=
sqrt(c(var.p, var.prh, var.prhs))
@

The larger the number of population elements (i.e., grid cells or pixels) is, the smaller will be the difference between the estimators. Again, beware of bias in synthetic estimates!


\section{Acknowledgments}
I would like to thank Tobias Schoch, the author of the \texttt{rsae}
package \citep{schoch2011} for the provision of
the Landsat data set.

\begin{thebibliography}{---}
\bibitem[Battese et al.(1988)]{battese1988} Battese, G.E., R.M. Harter, and W.A. Fuller (1988): \emph{An Error-Components Model for Prediction of County Crop Areas Using Survey and Satellite Data}, Journal of the American Statistical Association 83, pp. 28-36.
%
\bibitem[Breidenbach and Astrup(2011)]{bre2011} Breidenbach, J. and
  R. Astrup (2012): \emph{Small area estimation of forest attributes
    in the Norwegian National Forest Inventory}. European Journal of
  Forest Research, 131:1255-1267.
  %
\bibitem[Breidenbach, et al.(2015)]{bre2015} Breidenbach, J. McRoberts, R. E.,
  R. Astrup (2015): \emph{Empirical
  coverage of model-based variance estimators for 
  remote sensing assisted estimation of stand-level timber volume.}  Remote
  Sensing of Environment. In press.
%
\bibitem[Pinheiro et al.(2011)]{pinheiro2011} Pinheiro, J., D. Bates, S. DebRoy, D. Sarkar and the R Development Core Team (2011): \emph{nlme: Linear and Nonlinear Mixed Effects Models}. R package version 3.1-101.
%
\bibitem[Rao(2003)]{rao2003} Rao, J.N.K. (2003): \emph{Small Area Estimation}, New Work: John Wiley and Sons.
%
\bibitem[S{\"a}rndal(1984)]{sarndal1984} S{\"a}rndal, C. (1984): \emph{Design-consistent versus model-dependent estimation for small domains}. Journal of the American Statistical Association, JSTOR, 624-631
%
\bibitem[Schoch(2011)]{schoch2011} Schoch, T. (2011): \emph{rsae: Robust Small Area Estimation}, R package version 0.1-3.
\end{thebibliography}


\end{document}