\documentclass{article}
\usepackage{amstext}
\usepackage{amsfonts}
\usepackage{amsmath}
\usepackage{hyperref}
\usepackage[round]{natbib}
\usepackage{hyperref}
\usepackage{graphicx}
\usepackage{rotating}
\usepackage{authblk}
\usepackage[left=25mm, right=25mm, top=20mm, bottom=20mm]{geometry}
%\usepackage[nolists]{endfloat}
%\usepackage{mathspec}
\usepackage{dsfont}
\usepackage{bbm}

%\VignetteEngine{knitr::knitr}
%\VignetteDepends{FDboost, fda, fields, maps, mapdata}
%\VignetteIndexEntry{FDboost density-on-scalar births}

\newcommand{\Rpackage}[1]{{\normalfont\fontseries{b}\selectfont #1}}
\newcommand{\Robject}[1]{\texttt{#1}}
\newcommand{\Rclass}[1]{\textit{#1}}
\newcommand{\Rcmd}[1]{\texttt{#1}}
\newcommand{\Roperator}[1]{\texttt{#1}}
\newcommand{\Rarg}[1]{\texttt{#1}}
\newcommand{\Rlevel}[1]{\texttt{#1}}

\newcommand{\RR}{\textsf{R}}
\renewcommand{\S}{\textsf{S}}
\newcommand{\df}{\mbox{df}}
\DeclareMathOperator{\clr}{clr}
\newcommand{\ddelta}{\, \mathrm{d}\delta}

\RequirePackage[T1]{fontenc}
\RequirePackage{graphicx,ae,fancyvrb}
\IfFileExists{upquote.sty}{\RequirePackage{upquote}}{}
\usepackage{relsize}

\renewcommand{\baselinestretch}{1}
\setlength\parindent{0pt}


\hypersetup{%
  pdftitle = {density-on-scalar births},
  pdfsubject = {package vignette},
  pdfauthor = {Eva-Maria Maier},
%% change colorlinks to false for pretty printing
  colorlinks = {true},
  linkcolor = {blue},
  citecolor = {blue},
  urlcolor = {red},
  hyperindex = {true},
  linktocpage = {true},
}

\begin{document}

\setkeys{Gin}{width=\textwidth}

\title{Live births in Germany: density-on-scalar regression}
\author{Eva-Maria Maier
\thanks{E-mail: eva-maria.maier@hu-berlin.de}}
\affil{\textit{Wirtschaftswissenschaftliche Fakult\"at, \\
Humboldt-Universit\"at zu Berlin, \\
Unter den Linden 6, D-10099 Berlin, Germany.}}
% Spandauer Stra{\ss}e 1, D-10178 Berlin, Germany.}}
\date{}
\maketitle

% Inline code evaluation with \Sexpr{}, e.g., \Sexpr{length(1:10)}

% \noindent$^1$ 
%            \newline

% To Do: Referenzen, wenn mein Paper auf Arxiv

\noindent
This vignette illustrates how to use \Rpackage{FDboost}, which was designed for functional regression (Brockhaus et al., 2015), 
for density-on-scalar regression.
Despite being a special case of function-on-scalar regression (at least for densities defined on a nontrivial interval with respect to the Lebesgue measure, which we refer to as \emph{continuous case}), it has to be treated differently due to the special properties of probability density functions, namely nonnegativity and integration to one. 
Our vignette is based on the approach by Maier et al. (2021).

\section{Load and plot data}

We use the data set \texttt{birthDistribution} from the package \Rpackage{FDboost}, containing densities of live births in Germany over the months per year (1950-2019) and sex (male and female), resulting in 140 densities.
It is a list with the following elements:
\begin{itemize}
\item
\texttt{birth\_densities}:
A 140 x 12 matrix containing the birth densities in its rows. The first 70 rows correspond to male newborns, the second 70 rows to female ones. Within both of these, the years are ordered increasingly (1950-2019).
\item
\texttt{birth\_densities\_clr}:
A 140 x 12 matrix containing the clr transformed densities in its rows. Same structure as \texttt{birth\_densities}.
\item
\texttt{sex}:
A factor vector of length 140 with levels \texttt{"m"} (male) and \texttt{"f"} (female), corresponding to the sex of the newborns for the rows of \texttt{birth\_densities} and \texttt{birth\_densities\_clr}. The first 70 elements are \texttt{"m"}, the second 70 \texttt{"f"}.
\item
\texttt{year}:
A vector of length 140 containing the integers $1950, \ldots, 2019, 1950, \ldots, 2019$, corresponding to the years for the rows of \texttt{birth\_densities} and \texttt{birth\_densities\_clr}.
\item
\texttt{month}:
A vector containing the integers from 1 to 12, corresponding to the months for the columns of \texttt{birth\_densities} and \texttt{birth\_densities\_clr} (domain $\mathcal{T}$ of the (clr-)densities).
\end{itemize}
This list already is in the format needed to pass it to \Roperator{FDboost}.
Note that to compensate for the different lengths of the months, the average number of births per day for each month (by sex and year) was used to compute the birth shares from the absolute birth counts.
The 12 shares corresponding to one year and sex form one density in the Bayes Hilbert space $B^2(\delta) = B^2\left( \mathcal{T}, \mathcal{A}, \delta\right)$, where $\mathcal{T} = \{1, \ldots, 12\}$ corresponds to the set of the 12 months, $\mathcal{A} := \mathcal{P}(\mathcal{T})$ corresponds to the power set of $\mathcal{T}$, and the reference measure $\delta := \sum_{t = 1}^{12} \delta_t$ corresponds to the sum of dirac measures at $t \in \mathcal{T}$.
Thus, our analysis is an example for the discrete case and the integral of a density is simply the sum of all 12 share values.
We indicate how to proceed in the continuous case, whenever it is distinct from the discrete one over the course of this vignette.
We denote the density contained in the $i$-th row of \texttt{birth\_densities} with $f_i = f_{sex_i, year_i}$, where $sex_i$ and $year_i$ denote the $i$-th elements of \texttt{sex} and \texttt{year}, respectively, $i = 1, \ldots, 140$.
We load the package and the data and plot the densities:
<<plot-data, echo=TRUE, warning=FALSE, message=FALSE, fig.cap='Densities of births in Germany per year and sex in $B^2(\\delta)$. Years are coded by different colors and line types, see Figure \\ref{fig:legend}.', fig.height=3.5, fig.width=11, fig.pos='!h'>>=
# load FDboost package
library(FDboost)
# load birth_densities
data("birthDistribution", package = "FDboost")

# function to plot a matrix or vector containing functions in B^2(delta) or L^2_0(delta);
# Is used for densities, effects, predictions (also clr transformed)
plot_function <- function(plot_matrix, ...) {
  funplot(1:12, plot_matrix, xlab = "month", xaxp = c(1, 12, 11), pch = 20, ...)
  abline( h = 0, col = "grey", lwd = 0.5)
}

# function to create two plots (by sex) from a matrix containing densities or predictions
# (also clr transformed) for males in first half of rows and females in second half
plot_birth_densities <- function(birth_matrix, ylim = range(birth_matrix), ...) {
  par(mfrow = c(1, 2))
  for (k in 1:2) {
    n_obs <- nrow(birth_matrix) / 2
    obs <- 1:n_obs + (k - 1) * n_obs
    plot_function(birth_matrix[obs, ], main = c("Male", "Female")[k], 
                   ylim = ylim, col = rainbow(n_obs, start = 0.5, end = 1), 
                   lty = c(1, 2, 4, 5), ...)
  }
}

# Plot densities
plot_birth_densities(birthDistribution$birth_densities, ylab = "densities")
@

<<legend, echo=TRUE, fig.cap='Coding of the years.', fig.height=1.2, fig.width=8, fig.pos='!h'>>=
# legend (also for later plots)
year_col <- rainbow(70, start = 0.5, end = 1)
year_lty <- c(1, 2, 4, 5)
par(mar = c(0, 0, 0, 0) + 0.1)
plot(NULL, xaxt = "n", yaxt = "n", bty = "n", ylab = "", xlab = "", xlim = 0:1, ylim = 0:1)
legend("top", xpd = TRUE, legend = 1950:2019, lty = year_lty, ncol = 10, bty = "n",
       text.col = year_col, col = year_col, cex = 0.7)
@

\pagebreak
Overall, the range of the density values is quite small (from \Sexpr{round(min(birthDistribution$birth_densities), 3)} to \Sexpr{round(max(birthDistribution$birth_densities), 3)}).
While there is hardly a visible difference between the two sexes, we see a trend over the years:
In the early months of the years the density values tend to decrease, in the later ones it is vice versa.

%\clearpage

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Model equation and clr transformation}

We consider the model
\begin{align}
f_i = &\beta_0 \oplus I(sex_i = sex) \odot \beta_{sex} \oplus g(year_i) \oplus \varepsilon_i, && i= 1, \ldots , 140, \label{model_bayes}
\intertext{with a group-specific intercept $\beta_{sex}$ for $sex \in \{ \text{male, female} \}$, a flexible effect $g(year)$ for $year \in [1950, 2019]$, and functional error terms $\varepsilon_i \in B^2(\delta)$ with $\mathbb{E}(\varepsilon_i) = 0$ the additive neutral element of $B^2(\delta)$, corresponding to a constant density.
Equivalently, we can consider the centered log-ratio (clr) transformed model}
\clr \left[ f_i \right]
= &\clr \left[ \beta_0 \right] + I(sex_i = sex) \cdot \clr \left[ \beta_{sex} \right] + \clr \left[ g(year_i)\right] + \clr \left[ \varepsilon_i \right] && i= 1, \ldots , 140, \label{model_clr}
\end{align}
for estimation, which is part of $L_0^2(\delta) = L_0^2\left( \mathcal{T}, \mathcal{A}, \delta\right) = \left\{ f \in L^2\left( \delta \right) ~|~ \int_{\mathcal{T}} f \, \mathrm{d}\delta = 0 \right\}$, a closed subspace of $L^2(\delta) = L^2\left( \mathcal{T}, \mathcal{A}, \delta\right)$.
\Rpackage{FDboost} was desiged for functions in $L^2(\mathbb{R}, \mathfrak{B}, \lambda)$, where $\mathfrak{B}$ denotes the Borel $\sigma$-algebra and $\lambda$ the Legesgue measure.
However, with some unfamiliar specifications, \Rpackage{FDboost} can be used to estimate model \eqref{model_clr}.
% Estimating a density-on-scalar model in the continuous case works similarly.
Thus, our first step towards estimation is to apply the clr transformation on our densities, which is given by
\begin{align}
\clr \left[ f \right]
:= \log f - \frac1{\delta(\mathcal{T})} \int_{\mathcal{T}} \log f \ddelta
= \log f_ - \frac1{12} \sum_{t = 1}^{12} \log f(t). \label{definition_clr} %, i = 1, \ldots, 140
\end{align}
We call the resulting clr transformed densities \emph{clr-densities} in the following.
The data set \texttt{birthDistribution} already contains the clr-densities.
Whenever that's not the case, one can use the function \Roperator{clr()} to compute the clr-densities, which we include here for the sake of completeness.
Note that the choice of appropriate integration weights \texttt{w} for the corresponding Bayes Hilbert space is crucial to get a reasonable result.
In our discrete case, equal weights \texttt{w = 1} are appropriate.
In the continuous case, the choice of the weights depends on the grid on which the function was evaluated.
The weight for each function value must correspond to the length of the subinterval it represents.
E.g., for a function defined on $\mathcal{T} = [a, b]$ evaluated on a grid with equidistant distance $d$, where the boundary grid values are $a + \frac{d}{2}$ and $b - \frac{d}{2}$ (i.e., the grid points are centers of subintervals of size $d$), equal weights $d$ should be chosen for \texttt{w}.

<<clr-data, echo=TRUE, fig.cap='Clr transformed densities in $L^2_0(\\delta)$. Years are coded by different colors and line types, see Figure \\ref{fig:legend}.', fig.height=3.5, fig.width=11, fig.pos='!h'>>=
# The function clr() can be used to compute the clr-densities; Our reference measure delta
# corresponds to equal integration weights w = 1 for all density values
birth_densities_clr_test <- t(apply(birthDistribution$birth_densities, 1, clr, w = 1))
# Compare with clr-densities contained in data set
sum(birth_densities_clr_test != birthDistribution$birth_densities_clr)
# Plot clr-densities
plot_birth_densities(birthDistribution$birth_densities_clr, ylab = "clr-densities")
@

\pagebreak
Due to the small range of the density values in this example, their shape is very similar to the original densities, see Figure \ref{fig:plot-data}.
In general, this is not the case.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Estimation}

When fitting model \eqref{model_clr} with the function \texttt{FDboost}, the specification of the \texttt{timeformula} needs some special attention.
First, we must respect the integrate-to-zero constraint of $L_0^2(\delta)$.
This is achieved by using the constrained base-learner \texttt{bbsc} in the \texttt{timeformula} (instead of the unconstrained \texttt{bbs} as usual), which transforms the basis such that it fulfills the sum-to-zero constraint.
In our discrete case, this corresponds to the integrate-to-zero constraint directly.
In the continuous case, the sum is proportional to the integral numerically, if the grid points where the function is evaluated are selected appropriately (e.g., centers of equal sized subintervals).
Thus, using \texttt{bbsc} is suitable in this case, as well.
Second, we must specify the B-spline basis in \texttt{bbsc} appropriately. 
The continuous case is straightforward, e.g., by using cubic B-splines.
In our discrete case, a suitable (unconstrained) basis is $(\mathds{1}_{\{1\}}, \ldots, \mathds{1}_{\{12\}}) \in L^2 ( \delta)^{12}$, where $\mathds{1}_{A}$ denotes the indicator function of $A \in \mathcal{A}$.
This results in the identity matrix as design matrix.
In \texttt{bbs()} (or \texttt{bbsc()}, which yields the corresponding constrained basis), this can be achieved using \texttt{degree = 1} with knots equal to $\mathcal{T}$.

<<fit-model, echo = TRUE, message=FALSE>>=
model <- FDboost(birth_densities_clr ~ 1 + bolsc(sex, df = 1) + 
                   bbsc(year, df = 1, differences = 1),
                 # use bbsc() in timeformula to ensure integrate-to-zero constraint
                 timeformula = ~bbsc(month, df = 4, 
                                     # December is followed by January of subsequent year
                                     cyclic = TRUE, 
                                     # knots = {1, ..., 12} with additional boundary knot
                                     # 0 (coinciding with 12) due to cyclic = TRUE
                                     knots = 1:11, boundary.knots = c(0, 12), 
                                     # degree = 1 with these knots yields identity matrix 
                                     # as design matrix
                                     degree = 1),
                 data = birthDistribution, offset = 0, 
                 control = boost_control(mstop = 1000))
@

To determine the optimal stopping iteration we perform a $10$-fold bootstrap.
This is rather time-consuming (especially in the continuous case, when the response densities are evaluated at many grid-values) and preferably should be executed parallelized on multiple cores. In order to avoid long compilation times for the vignette, the following code is commented out, but it should be possible to obtain the same stopping iteration within a few minutes.
<<get-mstop, echo = TRUE, message=FALSE>>=
# set.seed(1708)
# folds <- applyFolds(model, folds = cv(rep(1, model$ydim[1]), type = "bootstrap", B = 10))
# ms <- mstop(folds) # = 999
ms <- 999 
model <- model[ms] 
@

Our final object \texttt{model} contains the fit of model~\eqref{model_clr}, i.e., on clr-level.
<<plot_effects_clr, echo = TRUE, fig.cap='Estimated effects in $L_0^2(\\delta)$. Years are coded by different colors and line types, see Figure \\ref{fig:legend}.', fig.height=3, fig.width=11, fig.pos='!h'>>=
# Plotting 'model' yields the clr-transformed effects
par(mfrow = c(1, 3))
plot(model, n1 = 12, n2 = 12)
@

Since model~\eqref{model_clr} is equivalent to model~\eqref{model_bayes} via the clr transformation, we have to use the inverse clr transformation to get densities of interest (like estimated effects or predictions) for model~\eqref{model_bayes} in the Bayes Hilbert space, after extracting them from \texttt{model}.
The inverse clr transformation is given by
\[
\clr^{-1} (\tilde{f}) 
:= \frac{\exp\tilde{f}}{\int_{\mathcal{T}} \exp\tilde{f} \ddelta}
= \frac{\exp\tilde{f}}{\sum_{t=1}^{12} \exp\tilde{f}(t)}.
\]
for $\tilde{f} \in L_0^2(\delta)$ and can be computed using \texttt{clr(..., inverse = TRUE)}, again specifying appropriate integration weights \texttt{w}.
Note that in contrast to Maier et al. (2021), the definition above includes normalization to obtain the probability density function (which is the representative of the equivalence class of proportional functions in $B^2(\delta)$).
<<get_and_plot_effects, echo = TRUE, fig.cap='Estimated effects in $B^2(\\delta)$. Years are coded by different colors and line types, see Figure \\ref{fig:legend}.', fig.height=3, fig.width=11, fig.pos='!h'>>=
# Get estimated clr transformed effects; we use predict(), which returns a matrix of the
# same dimension as the response (140 x 12), i.e., we have to extract the respective rows;
# Alternatively, one could use coef(), but has to specify n1 = 12, n2 = 12 to get the den-
# sities at 1, ..., 12, which also only yields the year effect on a grid of 12 years

# all rows contain intercept
intercept_clr <- predict(model, which = 1)[1, ] 
# first 70 rows contain effect for sex = male, second 70 rows for sex = female
sex_clr <- predict(model, which = 2)[c(1, 71), ] 
# first 70 rows contain effect for years from 1950 to 2019, second 70 rows are repetition
year_clr <- predict(model, which = 3)[1:70, ]

sex_col <- c("blue", "red")
par(mfrow = c(1, 3), mar = c(5, 5, 4, 2) + 0.1)

# Retransform to Bayes Hilbert space using clr(..., inverse = TRUE); Our reference measure
# delta corresponds to equal integration weights w = 1 for all function values
intercept <- clr(intercept_clr, w = 1, inverse = TRUE)
sex <- t(apply(sex_clr, 1, clr, w = 1, inverse = TRUE))
year <- t(apply(year_clr, 1, clr, w = 1, inverse = TRUE))

# Plot retransformed effects
plot_function(intercept, main = "Intercept", ylab = expression(hat(beta)[0]), 
              id = rep(1, 12)) # id is passed to funplot since intercept is a vector
plot_function(sex, main = "Effect of sex", col = sex_col, 
              ylab = expression(hat(beta)["sex"]))
legend("topleft", legend = c("sex = male", "sex = female"), text.col = sex_col, bty = "n")
plot_function(year, main = "Effect of year", col = year_col, 
              ylab = expression(hat(g)("year")), lty = year_lty)
@

While all effects get selected by the algorithm, the effects of sex are very small.
We plot the predictions using the same range as in Figure \ref{fig:plot-data} for better comparison:
<<get_and_plot_predictions, echo = TRUE, fig.cap='Predicted densities in $B^2(\\delta)$. Years are coded by different colors and line types, see Figure \\ref{fig:legend}.', fig.height=3.5, fig.width=11, fig.pos='!h'>>=
predictions_clr <- predict(model)
predictions <- t(apply(predictions_clr, 1, clr, inverse = TRUE))
plot_birth_densities(predictions, ylim = range(birthDistribution$birth_densities), 
                     ylab = "predictions")
@
\section*{References}
\begin{itemize}
\item[] Brockhaus, S., Scheipl, F., Hothorn, T., and Greven, S. (2015). The functional linear array model.
        \textit{Statistical Modelling} 15(3), 279--300.
\item[] Maier, E.-M., St\"ocker, A., Fitzenberger, B., Greven, S. (2021). Additive Density-on-Scalar Regression in Bayes Hilbert Spaces with an Application to Gender Economics. \textit{arXiv preprint arXiv:2110.11771.}
\end{itemize}



\end{document}