--- title: "doby: Groupwise computations and miscellaneous utilities" author: "Søren Højsgaard" output: html_document: toc: true number_sections: true vignette: > %\VignetteIndexEntry{doby: Groupwise computations and miscellaneous utilities} %\VignettePackage{doBy} %\VignetteEngine{knitr::knitr} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) options("digits"=3) library(doBy) library(boot) #devtools::load_all() ``` The \doby{} package contains a variety of utility functions. This working document describes some of these functions. The package originally grew out of a need to calculate groupwise summary statistics (much in the spirit of \code{PROC SUMMARY} of the \proglang{SAS} system), but today the package contains many different utilities. ```{r} library(doBy) ``` \section{Data used for illustration} \label{sec:co2data} The description of the \code{doBy} package is based on the \code{mtcars} dataset. ```{r} head(mtcars) tail(mtcars) ``` # Groupwise computations ## summaryBy() and summary_by() \label{sec:summaryBy} The \summaryby{} function is used for calculating quantities like ``the mean and variance of numerical variables $x$ and $y$ for each combination of two factors $A$ and $B$''. Notice: A functionality similar to \summaryby\ is provided by \code{aggregate()} from base \R. ```{r} myfun1 <- function(x){ c(m=mean(x), s=sd(x)) } summaryBy(cbind(mpg, cyl, lh=log(hp)) ~ vs, data=mtcars, FUN=myfun1) ``` A simpler call is ```{r} summaryBy(mpg ~ vs, data=mtcars, FUN=mean) ``` Instead of formula we may specify a list containing the left hand side and the right hand side of a formula\footnote{This is a feature of \summaryby\ and it does not work with \code{aggregate}.} but that is possible only for variables already in the dataframe: ```{r} summaryBy(list(c("mpg", "cyl"), "vs"), data=mtcars, FUN=myfun1) ``` Inspired by the \pkg{dplyr} package, there is a \verb|summary_by| function which does the samme as \summaryby{} but with the data argument being the first so that one may write ```{r} mtcars |> summary_by(cbind(mpg, cyl, lh=log(hp)) ~ vs, FUN=myfun1) ``` ## orderBy() and order_by() Ordering (or sorting) a data frame is possible with the \code{orderBy} function. For example, we order the rows according to \code{gear} and \code{carb} (within \code{gear}): ```{r} x1 <- orderBy(~ gear + carb, data=mtcars) head(x1, 4) tail(x1, 4) ``` If we want the ordering to be by decreasing values of one of the variables, we can do ```{r} x2 <- orderBy(~ -gear + carb, data=mtcars) ``` Alternative forms are: ```{r} x3 <- orderBy(c("gear", "carb"), data=mtcars) x4 <- orderBy(c("-gear", "carb"), data=mtcars) x5 <- mtcars |> order_by(c("gear", "carb")) x6 <- mtcars |> order_by(~ -gear + carb) ``` ## splitBy() and split_by() Suppose we want to split the \code{airquality} data into a list of dataframes, e.g.\ one dataframe for each month. This can be achieved by: ```{r} x <- splitBy(~ Month, data=airquality) x <- splitBy(~ vs, data=mtcars) lapply(x, head, 4) attributes(x) ``` Alternative forms are: ```{r} splitBy("vs", data=mtcars) mtcars |> split_by(~ vs) ``` ## subsetBy() and subset_by() Suppose we want to select those rows within each month for which the the wind speed is larger than the mean wind speed (within the month). This is achieved by: ```{r} x <- subsetBy(~am, subset=mpg > mean(mpg), data=mtcars) head(x) ``` Note that the statement \code{Wind > mean(Wind)} is evaluated within each month. Alternative forms are ```{r} x <- subsetBy("am", subset=mpg > mean(mpg), data=mtcars) x <- mtcars |> subset_by("vs", subset=mpg > mean(mpg)) x <- mtcars |> subset_by(~vs, subset=mpg > mean(mpg)) ``` ## transformBy() and transform_by() The \code{transformBy} function is analogous to the \code{transform} function except that it works within groups. For example: ```{r} head(x) x <- transformBy(~vs, data=mtcars, min.mpg=min(mpg), max.mpg=max(mpg)) head(x) ``` Alternative forms: ```{r} x <- transformBy("vs", data=mtcars, min.mpg=min(mpg), max.mpg=max(mpg)) x <- mtcars |> transform_by("vs", min.mpg=min(mpg), max.mpg=max(mpg)) ``` ## lapplyBy() and lapply_by() This \code{lapplyBy} function is a wrapper for first splitting data into a list according to the formula (using splitBy) and then applying a function to each element of the list (using lapply). ```{r} lapplyBy(~vs, data=mtcars, FUN=function(d) lm(mpg~cyl, data=d) |> summary() |> coef()) ``` # Miscellaneous utilities ## firstobs() and lastobs() To obtain the indices of the first/last occurences of an item in a vector do: ```{r} x <- c(1, 1, 1, 2, 2, 2, 1, 1, 1, 3) firstobs(x) lastobs(x) ``` The same can be done on variables in a data frame, e.g. ```{r} firstobs(~vs, data=mtcars) lastobs(~vs, data=mtcars) ``` \subsection{The \code{which.maxn()} and \code{which.minn()} functions} \label{sec:whichmaxn} The location of the $n$ largest / smallest entries in a numeric vector can be obtained with ```{r} x <- c(1:4, 0:5, 11, NA, NA) which.maxn(x, 3) which.minn(x, 5) ``` ## Subsequences - subSeq() Find (sub) sequences in a vector: ```{r} x <- c(1, 1, 2, 2, 2, 1, 1, 3, 3, 3, 3, 1, 1, 1) subSeq(x) subSeq(x, item=1) subSeq(letters[x]) subSeq(letters[x], item="a") ``` ## Recoding values of a vector - recodeVar() ```{r} x <- c("dec", "jan", "feb", "mar", "apr", "may") src1 <- list(c("dec", "jan", "feb"), c("mar", "apr", "may")) tgt1 <- list("winter", "spring") recodeVar(x, src=src1, tgt=tgt1) ``` ## Renaming columns of a dataframe or matrix - renameCol() ```{r} head(renameCol(mtcars, c("vs", "mpg"), c("vs_", "mpg_"))) ``` ## Time since an event - timeSinceEvent() Consider the vector ```{r} yvar <- c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0) ``` Imagine that "1" indicates an event of some kind which takes place at a certain time point. By default time points are assumed equidistant but for illustration we define time time variable ```{r} tvar <- seq_along(yvar) + c(0.1, 0.2) ``` Now we find time since event as ```{r} tse <- timeSinceEvent(yvar, tvar) tse ``` The output reads as follows: \begin{itemize} \item \verb"abs.tse": Absolute time since (nearest) event. \item \verb"sign.tse": Signed time since (nearest) event. \item \verb"ewin": Event window: Gives a symmetric window around each event. \item \verb"run": The value of \verb"run" is set to $1$ when the first event occurs and is increased by $1$ at each subsequent event. \item \verb"tae": Time after event. \item \verb"tbe": Time before event. \end{itemize} ```{r} plot(sign.tse ~ tvar, data=tse, type="b") grid() rug(tse$tvar[tse$yvar == 1], col="blue",lwd=4) points(scale(tse$run), col=tse$run, lwd=2) lines(abs.tse + .2 ~ tvar, data=tse, type="b",col=3) ``` ```{r} plot(tae ~ tvar, data=tse, ylim=c(-6,6), type="b") grid() lines(tbe ~ tvar, data=tse, type="b", col="red") rug(tse$tvar[tse$yvar==1], col="blue", lwd=4) lines(run ~ tvar, data=tse, col="cyan", lwd=2) ``` ```{r} plot(ewin ~ tvar, data=tse, ylim=c(1, 4)) rug(tse$tvar[tse$yvar==1], col="blue", lwd=4) grid() lines(run ~ tvar, data=tse, col="red") ``` We may now find times for which time since an event is at most 1 as ```{r} tse$tvar[tse$abs <= 1] ``` ## Example: Using subSeq() and timeSinceEvent() Consider the \verb|lynx| data: ```{r} lynx <- as.numeric(lynx) tvar <- 1821:1934 plot(tvar, lynx, type="l") ``` Suppose we want to estimate the cycle lengths. One way of doing this is as follows: ```{r} yyy <- lynx > mean(lynx) head(yyy) sss <- subSeq(yyy, TRUE) sss ``` ```{r} plot(tvar, lynx, type="l") rug(tvar[sss$midpoint], col="blue", lwd=4) ``` Create the "event vector" ```{r} yvar <- rep(0, length(lynx)) yvar[sss$midpoint] <- 1 str(yvar) ``` ```{r} tse <- timeSinceEvent(yvar,tvar) head(tse, 20) ``` We get two different (not that different) estimates of period lengths: ```{r} len1 <- tapply(tse$ewin, tse$ewin, length) len2 <- tapply(tse$run, tse$run, length) c(median(len1), median(len2), mean(len1), mean(len2)) ``` We can overlay the cycles as: ```{r} tse$lynx <- lynx tse2 <- na.omit(tse) plot(lynx ~ tae, data=tse2) ``` ```{r} plot(tvar, lynx, type="l", lty=2) mm <- lm(lynx ~ tae + I(tae^2) + I(tae^3), data=tse2) lines(fitted(mm) ~ tvar, data=tse2, col="red") ``` \section{Acknowledgements} \label{discussion} Credit is due to Dennis Chabot, Gabor Grothendieck, Paul Murrell and Jim Robison-Cox for reporting various bugs and making various suggestions to the functionality in the \doby{} package.