---
title: "Small Multiples"
author: "Duncan Garmonsway"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Small Multiples}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

This vignette for the [unpivotr](https://github.com/nacnudus/unpivotr) package
demonstrates unpivoting multiple similar tables from a spreadsheet via the
[tidyxl](https://github.com/nacnudus/tidyxl) package.  It is best read with the
spreadsheet open in a spreadsheet program, e.g. Excel, LibreOffice Calc or
Gnumeric.

## Introduction

The spreadsheet is from the famous Enron subpoena, made available by [Felienne
Hermans](https://www.felienne.com/archives/3634), and has has previously been
publicised by Jenny Bryan and David Robinson, in particular in Robinson's
article ['Tidying an untidyable
dataset'](https://rpubs.com/dgrtwo/tidying-enron).

Here's a screenshot:

```{r, out.width = "850px"}
knitr::include_graphics("enron-screenshot.png")
```

## Preparation

This vignette uses several common packages.

```{r}
library(unpivotr)
library(tidyxl)
library(dplyr)
library(purrr)
library(tidyr)
library(stringr)
```

The spreadsheet is distributed with the unpivotr package, so can be loaded as a
system file.

```{r}
path <- system.file("extdata/enron.xlsx", package = "unpivotr")
```

## Main

### Importing the data

Spreadsheet cells are imported with the `xlsx_cells()` function, which returns a
data frame of all the cells in all the requested sheets.  By default, every
sheet is imported, but we don't have to worry about that in this case because
there is only one sheet in the file.  We can also straightaway discard rows
above 14 and below 56, and columns beyond 20.

```{r}
cells <-
  xlsx_cells(path) %>%
  dplyr::filter(!is_blank, between(row, 14L, 56L), col <= 20) %>%
  select(row, col, data_type, numeric, character, date)
```

Cell formatting isn't required for this vignette, but if it were, it would be
imported via `xlsx_formats(path)`.

```{r, eval = FALSE}
formatting <- xlsx_formats(path)
```

### Importing one of the multiples

The small multiples each have exactly one 'Fixed Price' header cell, so begin by
filtering for those cells, and then move the selection up one row to get the
title cells.  The title cells are the top-left corner cell of each table.

```{r}
title <-
  dplyr::filter(cells, character == "Fixed Price") %>%
  select(row, col) %>%
  mutate(row = row - 1L) %>%
  inner_join(cells, by = c("row", "col"))
```

Use these title cells to partition the sheet.

```{r}
partitions <- partition(cells, title)
```

Taking one of the partitions, unpivot with `behead()`.  The compass directions
`"NNW"` and `"N"` express the direction from each data cell to its header.
`"NNW"` means "look up and then left to find the nearest header."

```{r}
partitions$cells[[1]] %>%
  behead("NNW", "title") %>%
  behead("NNW", "price") %>%
  behead("N", "bid_offer") %>%
  print(n = Inf)
```

The same procedure can be mapped to every small multiple.

```{r}
unpivoted <-
  purrr::map_dfr(partitions$cells,
                 ~ .x %>%
                   behead("NNW", "title") %>%
                   behead("NNW", "price") %>%
                   behead("N", "bid_offer")) %>%
  select(-data_type, -character, -date)
unpivoted
```

So far, only the column headers have been joined, but there are also row headers
on the left-hand side of the spreadsheet.  The following code incorporates these
into the final dataset.

```{r}
row_headers <-
  cells %>%
  dplyr::filter(between(row, 17, 56), between(col, 2, 4)) %>%
  # Concatenate rows like "Dec-01", "to", "Mar-02"
  mutate(character = ifelse(!is.na(character),
                            character,
                            format(date, origin="1899-12-30", "%b-%y"))) %>%
  select(row, col, character) %>%
  nest(-row) %>%
  mutate(row_header = map(data,
                          ~ str_trim(paste(.x$character, collapse = " ")))) %>%
  unnest(row_header) %>%
  mutate(col = 2L) %>%
  select(row, row_header)
unpivoted <- left_join(unpivoted, row_headers, by = "row")
unpivoted
```

## 34-line code listing

```{r, eval = FALSE}
library(unpivotr)
library(tidyxl)
library(dplyr)
library(purrr)
library(tidyr)
library(stringr)

cells <-
  xlsx_cells(system.file("extdata/enron.xlsx", package = "unpivotr")) %>%
  dplyr::filter(!is_blank, between(row, 14L, 56L), col <= 20) %>%
  select(row, col, data_type, numeric, character, date)

row_headers <-
  dplyr::filter(cells, between(row, 17, 56), between(col, 2, 4)) %>%
  mutate(character = ifelse(!is.na(character),
                            character,
                            format(date, origin="1899-12-30", "%b-%y"))) %>%
  select(row, col, character) %>%
  nest(-row) %>%
  mutate(row_header = map(data,
                          ~ str_trim(paste(.x$character, collapse = " ")))) %>%
  unnest(row_header) %>%
  mutate(col = 2L) %>%
  select(row, row_header)

titles <-
  dplyr::filter(cells, character == "Fixed Price") %>%
  select(row, col) %>%
  mutate(row = row - 1L) %>%
  inner_join(cells, by = c("row", "col"))

partition(cells, titles)$cells %>%
  purrr::map_dfr(~ .x %>%
                 behead("NNW", "title") %>%
                 behead("NNW", "price") %>%
                 behead("N", "bid_offer")) %>%
  select(-data_type, -character, -date) %>%
  left_join(row_headers, by = "row")
```