---
title: "Spatial sampling with SamplingStrata"
author: "Marco Ballin, Giulio Barcaroli"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
bibliography: SamplingStrata.bib
vignette: >
  %\VignetteIndexEntry{Spatial sampling with SamplingStrata}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  \usepackage[utf8]{inputenc}
---

```{r setup, include=FALSE}
options(width = 999)
knitr::opts_chunk$set(echo = TRUE, fig.width=6, fig.height=4, out.width = 6, out.heigth = 4)

```

```{r load, eval=TRUE, echo=FALSE, include=FALSE}
load("spatial_vignette.RData")
```


# Optimization with the *spatial* method

Let us suppose we want to design a sample survey with $k$ $Z$ target variables, each one of them correlated to one or more of the available $Y$ frame variables. 

When frame units are georeferenced or geocoded, the presence of spatial auto-correlation can be investigated. This can be done by executing for instance the Moran test on the target variables: if the null hypothesis is rejected (i.e. the hypothesis of the presence of spatial auto-correlation is accepted) then we should take into account also this variance component.

As indicated by @degruijter2016 and @degruijter2019, in case $Z$ is the target variable, omitting as negligible the *fpc* factor, the sampling variance of its estimated mean is:

\begin{equation} \label{eq1}
V(\hat{\bar{Z}}) = \sum_{h=1}^{H}(N_{h}/N)^{2} S_{h}^{2}/n_{h}
\end{equation}

We can write the variance in each stratum $h$ as:

\begin{equation} \label{eq2}
S_{h}^{2} = \dfrac{1}{N_{h}^{2}} \sum_{i=1}^{N_{h-1}}\sum_{j=i+1}^{N_{h}}(z_{i}-z_{j})^{2}
\end{equation} 

The optimal determination of strata is obtained by minimizing the quantity $O$:

\begin{equation} \label{eq3}
O = \sum_{h=1}^{H} \dfrac{1}{N_{h}^{2}} \{ \sum_{i=1}^{N_{h-1}}  \sum_{j=i+1}^{N_{h}} (z_{i}-z_{j})^{2}\}^{1/2}
\end{equation}

Obviously, values $z$ are not known, but only their predictions, obtained by means of a regression model. So, in Equation \ref{eq3} we can substitute $(z_{i}-z_{j})^{2}$ with 

\begin{equation} \label{eq5}
D_{ij}^{2} = \dfrac{(\tilde{z}_{i}-\tilde{z}_{j})^{2}}{R^{2}} + V(e_{i}) + V(e_{j}) - 2Cov(e_{i},e_{j})
\end{equation}

where $R^{2}$ is the squared correlation coefficient indicating the fitting of the regression model, and $V(e_{i})$, $V(e_{j})$ are the model variances of the residuals. The spatial auto-correlation component is contained in the term $Cov(e_{i},e_{j})$.

In particular, the quantity $D_{ij}$ is calculated in this way:

\begin{equation} \label{eq6}
D_{ij}^{2} = \dfrac{(\tilde{z}_{i}-\tilde{z}_{j})^{2}}{R^{2}} + (s_{i}^{2} + s_{j}^{2}) - 2 s_{i}  s_{j}  e^{-k (d_{ij}/range)}
\end{equation}

where $d_{ij}$ is the Euclidean distance between two units i and j in the frame (calculated using their geographical coordinates, that must be expressed in meters), the $s_{i}$ and $s_{j}$ are estimates of the prediction errors in the single points and *range* is the maximum distance below which spatial auto-correlation can be observed among points. The value of *range* can be determined by an analysis of the spatial *variogram*.


To summarize, when frame units can be geo-referenced, the proposed procedure is the following:

* acquire coordinates of the geographic location of the units in the population of interest;
*	fit a *kriging* model (or any other spatial model) on data for each $Z$;
* obtain predicted values together with prediction errors for each $Z$ and associate them to each unit in the frame;
* perform the optimization step.


# Example

In order to illustrate the "*spatial*" method, we make use of a dataset generally employed as an example of spatially correlated phenomena (in this case, the concentration of four heavy metals in a portion of the river Meuse). This dataset comes with the library "*sp*":

```{r, eval = FALSE,echo=FALSE, message=FALSE}
library(sp)
# locations (155 observed points)
data("meuse")
# grid of points (3103)
data("meuse.grid")
meuse.grid$id <- c(1:nrow(meuse.grid))
coordinates(meuse)<-c("x","y")
coordinates(meuse.grid)<-c("x","y")
```

```{r, eval = FALSE,echo=FALSE,include=FALSE}
par(mfrow=c(1,2))
plot(meuse.grid)
title("Meuse territory (3103 points)",sub="with reported distance from the river",cex.main=0.8,cex.sub=0.8)
plot(meuse)
title("Subset of 155 points",sub="with also observed metals concentration",cex.main=0.8,cex.sub=0.8)
par(mfrow=c(1,1))
```

![](spatial_figures/unnamed-chunk-2-1.png)

We analyse the territorial distribution of the *lead* and *zinc* concentration, and model (by using the *universal kriging*) their relations with distance from the river, using the subset of 155 points on which these values have been jointly observed: 

```{r, eval = F,echo=TRUE,message=FALSE,fig.width=6,fig.height=8}
library(automap)
kriging_lead = autoKrige(log(lead) ~ dist, meuse, meuse.grid)
plot(kriging_lead,sp.layout = NULL, justPosition = TRUE)
```

![](spatial_figures/unnamed-chunk-3-1.png)
```{r, eval = F,echo=TRUE,message=FALSE,fig.width=6,fig.height=8}
kriging_zinc = autoKrige(log(zinc) ~ dist, meuse, meuse.grid)
plot(kriging_zinc, sp.layout = list(pts = list("sp.points", meuse)))
```

![](spatial_figures/unnamed-chunk-3-2.png)

It is possible to calculate the fitting of the two models in this way:

```{r, eval = F,echo=TRUE,message=FALSE,fig.width=6,fig.height=8}
r2_lead <- 1 - kriging_lead$sserr/sum((log(meuse$lead)-mean(log(meuse$lead)))^2)
r2_lead
## [1] 0.9999997
r2_zinc <- 1 - kriging_zinc$sserr/sum((log(meuse$zinc)-mean(log(meuse$zinc)))^2)
r2_zinc
## [1] 0.9999999
```


Using these *kriging* models, we are able to predict the values of lead and zinc concentration on the totality of the 3,103 points in the Meuse territory:

```{r, eval = F,echo=TRUE}
df <- NULL
df$id <- meuse.grid$id
df$lead.pred <- kriging_lead$krige_output@data$var1.pred
df$lead.var <- kriging_lead$krige_output@data$var1.var
df$zinc.pred <- kriging_zinc$krige_output@data$var1.pred
df$zinc.var <- kriging_zinc$krige_output@data$var1.var
df$lon <- meuse.grid$x
df$lat <- meuse.grid$y
df$dom1 <- 1
df <- as.data.frame(df)
head(df)
##   id lead.pred  lead.var zinc.pred  zinc.var    lon    lat dom1
## 1  1  5.509360 0.1954937  6.736502 0.2007150 181180 333740    1
## 2  2  5.546006 0.1716895  6.785460 0.1749260 181140 333700    1
## 3  3  5.488913 0.1784052  6.698883 0.1826314 181180 333700    1
## 4  4  5.388320 0.1855561  6.558216 0.1906426 181220 333700    1
## 5  5  5.584415 0.1463018  6.841612 0.1465346 181100 333660    1
## 6  6  5.525538 0.1533757  6.749216 0.1549663 181140 333660    1
```

The aim is now to produce the optimal stratification of the 3,103 points under a precision constraint of 1% on the target estimates of the mean *lead*  and *zinc* concentrations:

```{r, eval = F,echo=TRUE, message=FALSE, warning=FALSE}
library(SamplingStrata)
frame <- buildFrameSpatial(df=df,
                      id="id",
                      X=c("lead.pred","zinc.pred"),
                      Y=c("lead.pred","zinc.pred"),
                      variance=c("lead.var","zinc.var"),
                      lon="lon",
                      lat="lat",
                      domainvalue = "dom1")
cv <- as.data.frame(list(DOM=rep("DOM1",1),
                         CV1=rep(0.01,1),
                         CV2=rep(0.01,1),
                         domainvalue=c(1:1) ))
cv
#    DOM  CV1  CV2 domainvalue
# 1 DOM1 0.01 0.01           1
```

To this aim, we carry out the optimization step by indicating the method *spatial*:

```{r, eval = F,echo=TRUE, message=FALSE, warning=FALSE}
set.seed(1234)
solution <- optimStrata (
  method = "spatial",
  errors=cv, 
  framesamp=frame,
  iter = 15,
  pops = 10,
  nStrata = 5,
  fitting = c(r2_lead,r2_zinc),
  range = c(kriging_lead$var_model$range[2],kriging_zinc$var_model$range[2]),
  kappa=1,
  writeFiles = FALSE,
  showPlot = FALSE,
  parallel = FALSE)
framenew <- solution$framenew
outstrata <- solution$aggr_strata
## 
## Input data have been checked and are compliant with requirements
## Sequential optimization as parallel = FALSE, defaulting number of cores = 1
##  *** Domain :  1   1
##  Number of strata :  3103
##  *** Sample cost:  61.92901
##  *** Number of strata:  4
```

![](spatial_figures/unnamed-chunk-7-1.png)


obtaining the following optimized strata:

```{r, eval = F,echo=TRUE}
plotStrata2d(framenew,outstrata,domain=1,vars=c("X1","X2"),
             labels=c("Lead","Zinc"))
```

![](spatial_figures/unnamed-chunk-8-1.png)

that can be visualised in this way:

```{r, eval = F,echo=TRUE}
frameres <- SpatialPointsDataFrame(data=framenew, coords=cbind(framenew$LON,framenew$LAT) )
frameres2 <- SpatialPixelsDataFrame(points=frameres[c("LON","LAT")], data=framenew)
frameres2$LABEL <- as.factor(frameres2$LABEL)
spplot(frameres2,c("LABEL"), col.regions=bpy.colors(5))
```

![](spatial_figures/unnamed-chunk-9-1.png)

We can now proceed with the selection of the sample, using the function 'selectSampleSpatial' (which is a wrapper of function 'lpm2_kdtree' in package 'SamplingBigData', see @lisic2018b), in order to ensure a spatially balanced sample:

```{r, eval = F,echo=TRUE}
s <- selectSampleSpatial(framenew,outstrata,coord_names=c("LON","LAT"))
## 
## *** Sample has been drawn successfully ***
##  62  units have been selected from  4  strata
```

whose units are so distributed in the territory:

```{r, eval = F,echo=TRUE,message=FALSE,fig.width=6, fig.height=8}
s <- selectSampleSpatial(framenew,outstrata,coord_names = c("LON","LAT"))
coordinates(s) <- ~LON+LAT
proj4string(s) <- CRS("+init=epsg:28992")
s$LABEL <- as.factor(s$LABEL)
library(mapview)
mapview(s,zcol="LABEL", map.types = c("OpenStreetMap"))
```

![](spatial_figures/unnamed-chunk-10-1.jpeg)



# References