%\VignetteIndexEntry{Introducing the float package: 32-Bit Floats for R}
\documentclass[]{article}


\input{./include/settings}


\mytitle{Introducing the {float} package: 32-Bit Floats for {R}}
\mysubtitle{}
\myversion{0.2-4}
\myauthor{
\centering
Drew Schmidt \\ 
\texttt{wrathematics@gmail.com} 
}

\newcommand{\pkg}[1]{{\fontseries{b}\selectfont #1}}

\begin{document}
\makefirstfew



\section{Introduction}\label{introduction}

R has a ``numeric'' type for vectors and matrices, which is stored as either 
integer or double precision. As such, R has no real ability to work with 
32-bit floats. However, sometimes single precision (or less!) is more than 
enough for a particular task.  The \pkg{float} package~\cite{float} 
extends R's linear algebra facilities to include single precision
(float) data.  Float vectors/matrices have half the precision of their
``numeric''-type counterparts, in exchange for a performance vs.\ accuracy trade-off.

The internal representation is an S4 class, which allows us to keep the syntax
identical to that of base R.  Interaction between base types for binary
operators is generally possible.  In these cases, type promotion always
defaults to the higher precision (more on this in Section~\ref{typepromotion}). 
 The package ships with copies of the single precision BLAS and LAPACK routines, 
which are automatically built in the event they are not available on the system.

\subsection{Installation}\label{installation}

You can install the stable version from CRAN using the usual
\texttt{install.packages()}:

\begin{lstlisting}[language=rr]
install.packages("float")
\end{lstlisting}

The development version is maintained on GitHub. You can install this
version using any of the well-known installer packages available to R:

\begin{lstlisting}[language=rr]
remotes::install_github("wrathematics/float")
\end{lstlisting}

Note that for best performance, you will need to build the package from source, 
either from the GitHub repository or from the CRAN source distribution.  See 
Section~\ref{blasnlapack} for more details as to why a source installation is 
recommended.
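
For example, on a machine with a working compiler toolchain, you can request a 
source build of the CRAN release explicitly (a minimal sketch):

\begin{lstlisting}[language=rr]
# force a source (rather than binary) installation from CRAN
install.packages("float", type="source")
\end{lstlisting}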


\subsection{BLAS and LAPACK Libraries}\label{blasnlapack}

The linear algebra operations in the \pkg{float} package are handled by the 
BLAS and LAPACK~\cite{lawson1979basic,anderson1999lapack}.  R does not ship with 
the single precision versions of these routines by default, so we include 
a source code distribution within the package.  This is the ``reference'' 
(NetLib) implementation, which is not particularly efficient.  Additionally, 
compiling it can take a very long time.

To take advantage of the enhanced run-time performance and reduced compilation 
times of using tuned BLAS/LAPACK with \pkg{float}, you will need to choose 
an implementation and install it.  Typical implementations include 
\pkg{OpenBLAS}~\cite{OpenBLAS}, Intel \pkg{MKL}~\cite{mkl}, AMD 
\pkg{ACML}~\cite{acml}, \pkg{Atlas}~\cite{atlas}, and Apple's 
\pkg{Accelerate}~\cite{accelerate}. You can read more about using different 
BLAS implementations with R in the R Installation and Administration 
manual~\cite{Rblas}.
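
If you are unsure which BLAS/LAPACK libraries your copy of R is currently 
using, recent versions of R will report them (a quick check; the exact output 
depends on your platform and build):

\begin{lstlisting}[language=rr]
# available on R >= 3.4.0; reports the BLAS/LAPACK shared libraries in use
si = sessionInfo()
si$BLAS
si$LAPACK
\end{lstlisting}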

Once you switch BLAS implementations with R, you will need to rebuild the 
\pkg{float} package from source.



\section{For Users}

\subsection{Basics}

R does not have a 32-bit float type (hence the package). You can cast your data 
from integer/numeric to float using \code{fl()} (you can also cast a float to a 
numeric via \code{dbl()}):

\begin{lstlisting}[language=rr]
library(float)

x = matrix(1:9, 3)
x
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

s = fl(x)
s
## # A float32 matrix: 3x3
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
\end{lstlisting}

This will of course require 1.5x the memory of the input matrix (storing it as 
both a double and as a float). For workflows requiring many operations, the 
memory savings will still be substantial. At this time, the package does not 
include a file reader that produces floats directly, so casting is the best way 
to create float data. However, once you cast the matrix to a float, you can 
serialize it as usual with \code{save()} and/or \code{saveRDS()}.
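
For example, a float matrix can be written to disk and read back like any other 
R object (a minimal sketch; the file name here is purely illustrative):

\begin{lstlisting}[language=rr]
s = fl(matrix(rnorm(10), 5, 2))
saveRDS(s, "s.rds")     # serialize the float32 object
s2 = readRDS("s.rds")   # comes back as a float32 matrix
\end{lstlisting}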

For testing or other cases where random matrices are needed (e.g., PCA via 
random normal projections~\cite{halko2011finding}), we include several random 
generators. The functions \code{flrunif()} and \code{flrnorm()} are somewhat 
like R's \code{runif()} and \code{rnorm()} in that they produce vectors (or 
matrices) of uniform/normal random values, but in single precision:

\begin{lstlisting}[language=rr]
set.seed(1234)

flrunif(5)
## # A float32 vector: 5
## [1] 0.1137034 0.6222994 0.6092747 0.6233795 0.8609154
flrunif(2, 3)
## # A float32 matrix: 2x3
##             [,1]      [,2]      [,3]
## [1,] 0.640310585 0.2325505 0.5142511
## [2,] 0.009495757 0.6660838 0.6935913
flrunif(5, min=10, max=20)
## # A float32 vector: 5
## [1] 15.44975 12.82734 19.23433 12.92316 18.37296
\end{lstlisting}

Arbitrary generators can be used with the \code{flrand()} interface. It behaves 
much like R's \code{runif()}, \code{rnorm()}, etc., except that it takes a 
generator function as its first argument. For example:

\begin{lstlisting}[language=rr]
flrand(generator=rexp, n=5, rate=.1)
## # A float32 vector: 5
## [1]  8.624105  6.745913  8.380404  7.604303 18.800766
flrand(function(n) sample(5, size=n, replace=TRUE), 5)
## # A float32 vector: 5
## [1] 2 2 2 1 1
\end{lstlisting}

This is conceptually similar to first generating \code{n} random values and 
then casting them over to floats, but more memory efficient. The function 
processes the generator data in 4KiB chunks (for double precision generators).
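
As a rough illustration of the difference (the two calls below need not produce 
identical values, since the chunked generator may walk the RNG stream 
differently):

\begin{lstlisting}[language=rr]
# chunked: at most ~4 KiB of doubles alive at any time, plus the float result
x1 = flrand(rnorm, 1e6)

# generate-then-cast: the full 8 MB double vector exists before casting down
x2 = fl(rnorm(1e6))
\end{lstlisting}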

\subsection{Arithmetic and Type Promotion}\label{typepromotion}

Perhaps a mistake in hindsight, but floats and numeric vectors/matrices will 
interoperate with each other in binary arithmetic operations. So you can 
multiply \code{2L}, \code{2.0}, and \code{fl(2)} in any binary combination you 
like. However, the type of the output is determined by the highest precision 
input; in fact, the arithmetic itself is carried out in that higher precision.  
So adding a float matrix to a double matrix really just adds two double 
matrices together after casting the float up (which uses quite a bit of 
additional memory), returning a double matrix.
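
A quick way to see the promotion rules in action is to check the class of the 
results (a sketch; per the rules above, integer is promoted up to float, and 
float up to double):

\begin{lstlisting}[language=rr]
is.float(fl(2) * 2L)    # expect TRUE:  the integer is promoted to float
is.float(fl(2) * 2.0)   # expect FALSE: the float is promoted to double
\end{lstlisting}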

This even works for more complicated functions like \code{\%*\%} (matrix 
multiplication). For example:

\begin{lstlisting}[language=rr]
set.seed(1234)
x = matrix(1:4, 2)
y = flrunif(2, 2)

x
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
y
## # A float32 matrix: 2x2
##           [,1]      [,2]
## [1,] 0.1137034 0.6092747
## [2,] 0.6222994 0.6233795

x %*% y
## # A float32 matrix: 2x2
##          [,1]     [,2]
## [1,] 1.980602 2.479413
## [2,] 2.716604 3.712067

storage.mode(x) = "double"
x %*% y
##          [,1]     [,2]
## [1,] 1.980602 2.479413
## [2,] 2.716605 3.712067
\end{lstlisting}

Long story short, be careful when mixing types.



\section{For Developers}

\subsection{Basics}

A \code{float32} matrix/vector is really a very simple S4 class. It has one 
slot, \code{@Data}, which should be an ordinary R integer vector or matrix. 
The values of that integer matrix will be interpreted as floats in the provided 
methods. If you wish to create your own method, say using C kernels or 
Rcpp~\cite{Rcpp}, then you will have to play the same game. More on that later.

To create a \code{float32} object, use \code{float::float32()}:

\begin{lstlisting}[language=rr]
Data = 1:3
x = float32(Data)
x
## # A float32 vector: 3
## [1] 1.401298e-45 2.802597e-45 4.203895e-45
\end{lstlisting}

To access the integer data of a \code{float32}, just grab the \code{@Data} slot:

\begin{lstlisting}[language=rr]
x@Data
## [1] 1 2 3
\end{lstlisting}

In general there's no relationship between the integer vs float interpretations
of the values residing in the same block of memory, with the exception of 0:

\begin{lstlisting}[language=rr]
x = fl(0:3)
x@Data
## [1]          0 1065353216 1073741824 1077936128
dbl(x+x)
## [1] 0 2 4 6
(x+x)@Data
## [1]          0 1073741824 1082130432 1086324736
x@Data + x@Data
## [1]          0 2130706432         NA         NA
## Warning message:
## In x@Data + x@Data : NAs produced by integer overflow
\end{lstlisting}

So when creating new functionality not provided by existing \pkg{float} 
package methods, you will probably have to move to compiled code.


\subsection{Compiled Code}

Using 32-bit floats from \pkg{float} in compiled code is not terribly 
difficult, but maybe a bit annoying.  The general way to proceed for a 32-bit 
float \code{x} is:

\begin{itemize}
  \item Pass \code{x@Data} (an integer vector/matrix) to \code{.Call()}.
  \item Inside the C/C++ function, use a \code{float} pointer to access the 
integer data.
  \item Return an integer vector/matrix from \code{.Call()}.
  \item Wrap the return from \code{.Call()} (say \code{ret}) in the 
\code{float32} S4 class: \code{float32(ret)}.
\end{itemize}

One can access the data with the \code{FLOAT()} macro.  If writing an R package, 
add \pkg{float} to the \code{LinkingTo} list in the package DESCRIPTION file. 
Then add \code{\#include <float/float32.h>} and the macro will be available.  
If you are working outside the construct of a package (not recommended), then 
you can define the macro as follows:

\begin{lstlisting}[language=cc]
#define FLOAT(x) ((float*) INTEGER(x))
\end{lstlisting}

This is the ``\code{DATAPTR}'' way of doing things, similar to \code{REAL()} for 
double precision and \code{INTEGER()} for integers. There is no Rcpp-like idiom 
for floats similar to \code{NumericVector} and \code{NumericMatrix} at this 
time.

Here's a basic example of how one would create a new function \code{add1()} 
(ignoring that we could just do \code{x+1}) using C.  We will do this outside 
of a package framework for simplicity of demonstration, but again, it is 
recommended that you use the \code{LinkingTo} way mentioned above.

\begin{lstlisting}[language=cc, title=add1.c]
#include <Rinternals.h>
#include <R.h>

#define FLOAT(x) ((float*) INTEGER(x))

SEXP R_add1(SEXP x_)
{
  SEXP ret;
  PROTECT(ret = allocVector(INTSXP, 1));
  
  float *x = FLOAT(x_);
  FLOAT(ret)[0] = x[0] + 1.0f;
  
  UNPROTECT(1);
  return ret;
}
\end{lstlisting}

Note that using \code{INTEGER(ret)[0]} instead of \code{FLOAT(ret)[0]} on line 
12 above is not correct: that would first cast the value to an integer before 
storing it.  Then back at the R level, once put in the \code{float32} class, 
that integer value would have its bits treated as though they were those of a 
\code{float}. If that explanation doesn't make sense, try making the change and 
see what happens.
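
You can get a feel for the consequence without recompiling anything: wrapping 
an honest integer in the \code{float32} class reinterprets its bits as a float, 
giving a meaningless (denormal) value rather than the number you stored, just 
as in the \code{float32(Data)} example earlier:

\begin{lstlisting}[language=rr]
# the integer 2, with its bits reinterpreted as a float -- not the value 2
float32(2L)
\end{lstlisting}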

We can build that function with \code{R CMD SHLIB add1.c}, and then call it via:

\begin{lstlisting}[language=rr, title=add1.r]
dyn.load("add1.so")
library(float)

add1 = function(x)
{
  ret = .Call("R_add1", x@Data)
  float32(ret)
}

add1(fl(1))
## # A float32 vector: 1
## [1] 2
add1(fl(pi))
## # A float32 vector: 1
## [1] 4.141593
\end{lstlisting}

Like I said, not really difficult, but annoying.


\subsection{Linking and Additional Functions}
\label{sec:linking}

If you are writing C/C++ code on single precision vectors and matrices, there 
is a good chance that you will need to link with the \pkg{float} package.  This 
is certainly the case if you want to do linear algebra efficiently (say via 
BLAS/LAPACK, or using the float interface from Armadillo via 
\pkg{RcppArmadillo}~\cite{RcppArmadillo}), where linking is required for CRAN 
safety\footnote{If you are just working on your own machine and linking with 
high-performance BLAS/LAPACK, then no linking is necessary. For portability, 
you need to link.}.  To do this, you will need to set the \code{PKG_LIBS} line 
of your \code{src/Makevars} file to include something like this:

\begin{lstlisting}
FLOAT_LIBS = `${R_HOME}/bin${R_ARCH_BIN}/Rscript -e "float:::ldflags()"`

PKG_LIBS = $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS) $(FLOAT_LIBS)
\end{lstlisting}

By default, \code{float:::ldflags()} will try to dynamically link on Linux and
Mac, but you can force static linking via \code{float:::ldflags(static=TRUE)}.
In my opinion, dynamic linking is preferable, but you are free to make up your
own mind about that. However, dynamic linking to an R package shared library is
(I think?) impossible on Windows, so Windows builds will always link statically.

In addition to BLAS and LAPACK symbols, there are a few helpers available.  
First, we include the float values \code{NA_FLOAT} and \code{R_NaNf}, which are 
32-bit analogues of \code{NA_REAL} and \code{R_NaN}.  Second, there are 
functions for testing whether a float is \code{NA} or \code{NaN}:

\begin{lstlisting}[language=cc]
int ISNAf(const float x);
int ISNANf(const float x);
\end{lstlisting}

All of these can be found in the \code{float/float32.h} header.

Finally, we also provide \code{float:::cppflags()} for the \code{PKG_CPPFLAGS}
include flags. Using the \code{LinkingTo} field should usually be sufficient
(i.e., you don't need the extra flags most of the time). One notable exception
is if you are doing something goofy with a different compiler, like \code{nvcc},
where you may need to pass the include flags explicitly.
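
If you do find yourself needing them, the include flags can be pulled into 
\code{src/Makevars} in the same style as the library flags above (a sketch, 
assuming \code{float:::cppflags()} prints its flags the way 
\code{float:::ldflags()} does):

\begin{lstlisting}
FLOAT_CPPFLAGS = `${R_HOME}/bin${R_ARCH_BIN}/Rscript -e "float:::cppflags()"`

PKG_CPPFLAGS = $(FLOAT_CPPFLAGS)
\end{lstlisting}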




\section{Some Benchmarks}

We will be examining two common applications from statistics which are 
dominated by linear algebra computations: covariance and principal components 
analysis.  The setup for each of these benchmarks is:

\begin{lstlisting}[language=rr]
library(float)
library(rbenchmark)
set.seed(1234)

reps = 5
cols = c("test", "replications", "elapsed", "relative")

m = 7500
n = 500
x = matrix(rnorm(m*n), m, n)
s = fl(x)
\end{lstlisting}

All benchmarks were performed using 2 cores of an Intel Core i5-5200U 
(2.20GHz) laptop running Linux, with:

\begin{itemize}
  \item gcc 7.2.0
  \item R version 3.4.2
  \item libopenblas 0.2.20
\end{itemize}

Note that the benchmarks are highly dependent on the choice of BLAS library and 
hardware used. The cache sizes for this machine are:

\begin{lstlisting}[language=rr]
memuse::Sys.cachesize()
## L1I:   32.000 KiB 
## L1D:   32.000 KiB 
## L2:   256.000 KiB 
## L3:     3.000 MiB 
\end{lstlisting}




\subsection{Covariance}

Since the covariance matrix is just the crossproducts matrix $x^T x$ of the 
mean-centered data, scaled by $1/(n-1)$, we can very easily create a custom 
covariance function:

\begin{lstlisting}[language=rr]
custcov = function(x)
{
  s = scale(x, TRUE, FALSE)
  crossprod(s) / max(1L, nrow(x)-1)
}
\end{lstlisting}

This function will work for numeric inputs as well as 32-bit floats.  We can 
compare these two cases against R's internal covariance function:

\begin{lstlisting}[language=rr]
benchmark(custcov(x), custcov(s), cov(x), replications=reps, columns=cols)
##         test replications elapsed relative
## 3     cov(x)            5   8.113   43.385
## 2 custcov(s)            5   0.187    1.000
## 1 custcov(x)            5   0.719    3.845
\end{lstlisting}

R's \code{cov()} is clearly not designed with performance in mind.  We would 
expect the difference between the \code{custcov(s)} (float) and 
\code{custcov(x)} (double) calls to be only about 2x; the larger gap we see 
here is likely because our implementation of \code{scale()} is faster than 
R's.  We can compare this to a highly optimized implementation of covariance, 
namely \code{covar()} from the \pkg{coop} package~\cite{coop}:

\begin{lstlisting}[language=rr]
benchmark(custcov(s), coop::covar(x), replications=reps, columns=cols)
##             test replications elapsed relative
## 2 coop::covar(x)            5   0.358    1.817
## 1     custcov(s)            5   0.197    1.000
\end{lstlisting}

This looks more in line with what we would expect moving from double to single 
precision.
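
As a quick sanity check on the accuracy side of the trade-off, you can cast the 
float result back up and compare it to the double precision answer (a sketch; 
the tolerance required will depend on the scale of the data):

\begin{lstlisting}[language=rr]
# compare single and double precision covariance estimates
all.equal(dbl(custcov(s)), cov(x), tolerance=1e-5)
\end{lstlisting}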


\subsection{Principal Components Analysis}

PCA is just SVD with some statistical window dressing:

\begin{lstlisting}[language=rr]
pca = function(x)
{
  p = svd(scale(x, TRUE, FALSE), nu=0)
  p$d = p$d / max(1, sqrt(nrow(x) - 1))
  names(p) = c("sdev", "rotation")
  
  p
}
\end{lstlisting}

Once again, our function will work for both numeric inputs and 32-bit 
floats. We again compare the performance of these two cases against R's 
internal function (in this case, \code{prcomp()}):

\begin{lstlisting}[language=rr]
benchmark(pca(x), pca(s), prcomp(x), replications=reps, columns=cols)
##        test replications elapsed relative
## 2    pca(s)            5   1.592    1.000
## 3    pca(x)            5   3.663    2.301
## 1 prcomp(x)            5   4.293    2.697
\end{lstlisting}

Again, our improved \code{scale()} implementation gives an edge (and possibly 
\code{prcomp()} is doing more \emph{useful} work than \code{cov()}\dots), 
although the effect is much less pronounced here since the SVD calculation 
dominates.  Indeed, the overall run time for the single precision PCA is 
roughly 10x that of the single precision covariance calculation.
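
As with the covariance example, you can check the single precision results 
against the double precision ones (a sketch; the rotation columns are only 
defined up to sign, so we compare the standard deviations only):

\begin{lstlisting}[language=rr]
p_f = pca(s)
p_d = prcomp(x)
all.equal(dbl(p_f$sdev), p_d$sdev, tolerance=1e-4)
\end{lstlisting}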


\addcontentsline{toc}{section}{References}
\bibliography{./include/float}
\bibliographystyle{plain}

\end{document}