---
title: "Capture melt"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Capture melt}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This vignette explains how to use functions for "melting" wide data
tables, i.e. converting to tall/long data tables. To clarify the
discussion we introduce the following three terms:

* Reshape columns contain the data which are present in equal amounts
  in the input and output, but in fewer columns in the output.
  These are `measure.vars` in reshape2/data.table.
* Copy columns contain data which are copied from a single row of the
  input into several rows of the output. These are `id.vars`
  in reshape2/data.table.
* Capture columns are only present in the output, and contain data
  extracted from the names of the input reshape columns,
  using a capturing regular expression.
  
The `nc` functions use `data.table::melt` internally:

* `nc::capture_melt_single` is useful when you have a wide data table
  with lots of columns, all of the same type, that you would like to
  melt/reshape into a single result/output column. This uses `melt`
  with a character vector for `measure.vars`.
* `nc::capture_melt_multiple` is useful when you have a wide data
  table with columns of possibly different types that you would like
  to melt/reshape into multiple result/output columns. This uses
  `melt` with a list for `measure.vars`.

Both are useful when you want to use a regular expression to specify
both (1) the set of input columns to reshape and (2) some information
to extract from those column names.

## Reshaping several input columns into a single output column

Sometimes you want to melt a "wide" data table which has several
distinct pieces of information encoded in each column name. One
example is the familiar iris data, which have flower part and
measurement dimension encoded in each of four column names:

```{r}
head(iris)
```

Those four reshape column names can be specified via a regex in
`nc::capture_melt_single`. The first argument is the input data table
to reshape, and the subsequent arguments are interpreted as a pattern
which is passed to `nc::capture_first_vec`. Any input column names
which match the specified regex will be passed as `measure.vars` to
`melt`:

```{r}
(iris.tall <- nc::capture_melt_single(
  iris,
  part=".*",
  "[.]",
  dim=".*",
  value.name="cm"))
```

Note the output above has one copy column (Species), two capture
columns (part, dim), and one reshape column (cm). Internally the
function joins the result of `nc::capture_first_vec` (on column names)
to the result of `melt` (on the data).

The reshaped data can be plotted with different parts on rows and
different dimensions on columns:

```{r, fig.width=10}

if(require(ggplot2)){
  ggplot()+
    theme_bw()+
    theme(panel.spacing=grid::unit(0, "lines"))+
    facet_grid(part ~ dim)+
    geom_bar(aes(cm, fill=Species), data=iris.tall)
}

```

## Reshaping several input columns into multiple output columns

We could instead use `capture_melt_multiple` to get multiple output
columns. Like `capture_melt_single`, the first argument of
`capture_melt_multiple` is the subject data table and the following
arguments form a pattern which is matched to the input data column
names. However the pattern must have at least two groups:

* One group must be named `column`. This group is used to capture the
  part of the input column names which will be used for the output
  column names. The output will contain one column for each unique
  value captured in this group.
* One or more other groups (with any names) specify an identifier for
  each observation.

```{r}
(iris.part.cols <- nc::capture_melt_multiple(
  iris,
  column=".*",
  "[.]",
  dim=".*"))
```

Note that the reshaped table above contains one copy column (Species),
one capture column (dim), and two reshape columns (Petal, Sepal). We
can plot these data to see whether or not sepals are bigger than
petals:

```{r, fig.width=10}

if(require(ggplot2)){
  ggplot()+
    theme_bw()+
    theme(panel.spacing=grid::unit(0, "lines"))+
    facet_grid(dim ~ Species)+
    coord_equal()+
    geom_abline(slope=1, intercept=0, color="grey")+
    geom_point(aes(
      Petal, Sepal),
      data=iris.part.cols)
}

```

It is clear from the plot above that sepals are indeed both longer and
wider than petals, on each measured plant.

## Melting WHO data with a more complex pattern

Another data set where it is useful to do column name pattern matching
followed by melting is the World Health Organization data:

```{r}

if(requireNamespace("tidyr")){
  data(who, package="tidyr")
}else{
  who <- data.frame(id=1, new_sp_m5564=2, newrel_f65=3)
}
names(who)

```

Each column which starts with `new` has three distinct pieces of
information encoded in its name: diagnosis type (e.g. sp or rel),
gender (m or f), and age range (e.g. 5564 or 1524). We would like to
use a regex to match these column names, then using the matching
columns as measure.vars in a melt, then join the two results. 

```{r}
new.diag.gender <- list(
  "new_?",
  diagnosis=".*",
  "_",
  gender=".")
nc::capture_melt_single(who, new.diag.gender, ages=".*")
```

Note the output includes the new reshape column called `value` by
default, as in `melt`. The input reshape column names which matched
the specified pattern, and there is a new column for each group in
that pattern. The following example shows how to rename the `value`
column and use numeric type conversion functions:

```{r}
years.pattern <- list(new.diag.gender, ages=list(
  min.years="0|[0-9]{2}", as.numeric,
  max.years="[0-9]{0,2}", function(x)ifelse(x=="", Inf, as.numeric(x))))
(who.typed <- nc::capture_melt_single(
  who, years.pattern,
  value.name="count"))
str(who.typed)
```

Note in the code/result above that non-character captured output
columns can be obtained by specifying type conversion functions in the
pattern.

## Melting monthly Eurostat data

Another example is [exchange rate data from
Eurostat](https://ec.europa.eu/eurostat).
We first use read the entire data set into R:

```{r}
ert.gz <- system.file(
  "extdata", "ert_eff_ic_m.tsv.gz", package="nc", mustWork=TRUE)
ert.all <- data.table::fread(ert.gz, na.strings=":")
ert.all[1:5, 1:5]
```

We see that the first column has some CSV data which we can parse via:

```{r}
ert.first <- ert.all[, 1]
csv.lines <- c(sub("\\\\.*", "", names(ert.first)), ert.first[[1]])
ert.first.dt <- data.table::fread(text=paste(csv.lines, collapse="\n"))
ert.wide <- data.table::data.table(ert.first.dt, ert.all[,-1])
ert.wide[1:5, 1:5]
```

The wide data table can then be melted:


```{r}
(ert.tall <- nc::capture_melt_single(
  ert.wide,
  year="[0-9]{4}", as.integer,
  "M",
  month="[0-9]{2}", as.integer))
```

After that we can create a time variable and plot via

```{r}

ert.tall[, month.IDate := data.table::as.IDate(
  sprintf("%d-%d-15", year, month))]
if(require("ggplot2")){
  ggplot()+
    geom_hline(aes(
      yintercept=value),
      color="grey",
      data=data.frame(value=100))+
    geom_line(aes(
      month.IDate, value, color=geo),
      data=ert.tall[geo %in% c("CA", "US", "JP", "FR")])+
    facet_grid(exch_rt ~ .)+
    theme_bw()+
    theme(panel.spacing=grid::unit(0, "lines"))
}

```

Another way to do it would be via

```{r}
nc::capture_melt_single(ert.wide, month.POSIXct="[0-9].*", function(x){
  as.POSIXct(strptime(paste0(x,"15"), "%YM%m%d"))
})
```

## Melting into multiple output columns with missing input columns

What if the input data set has "missing" input columns?

```{r}
iris.missing <- iris[, names(iris) != "Sepal.Length"]
head(iris.missing)
```

In that case melting into multiple columns is an error by default:

```{r, error=TRUE, purl=FALSE}
iris.pattern <- list(column=".*", "[.]", dim=".*")
nc::capture_melt_multiple(iris.missing, iris.pattern)
```

The error message explains that the number of input columns for each
value of `dim` must be the same, but there is one for `Length` and two
for `Width`. To ignore the error and fill the output with missing
values,

```{r}
nc::capture_melt_multiple(iris.missing, iris.pattern, fill=TRUE)
```

Note the missing values in the table above, which correspond to the
missing input column in the original/wide data set.

## Multiple output columns of different types

Some real-world data sets can be reshaped into output columns with
different types. An example data set from the
[PROVEDIt](https://lftdi.camden.rutgers.edu/repository/PROVEDIt_1-5-Person%20CSVs%20Filtered.zip)
benchmark in criminology:

```{r}
peaks.csv <- system.file(
  "extdata", "RD12-0002_PP16HS_5sec_GM_F_1P.csv",
  package="nc", mustWork=TRUE)
peaks.wide <- data.table::fread(peaks.csv)
print(data.table::data.table(
  names=names(peaks.wide),
  class=sapply(peaks.wide, class)),
  topn=10)
```

There are 303 columns, with info for 100 peaks. Each peak has three
features: Allele=character, Size=numeric, and Height=integer. The
ending peaks are class logical because they are all missing. These
data can be reshaped via

```{r}
peaks.tall <- nc::capture_melt_multiple(
  peaks.wide,
  column=".*",
  " ",
  peak="[0-9]+", as.integer,
  na.rm=TRUE)
old.opt <- options(width=90)
print(peaks.tall)
str(peaks.tall)
options(old.opt)
```