---
title: "Helper functions"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Helper functions}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

The nc package provides several "helper" functions that simplify the
definition of complex patterns. First, we define two functions that
display the regex string generated by each pattern:

```{r}
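## Return the regex string for one pattern, given either as a
## character string or as nc pattern arguments (a list).
one.pattern <- function(pat){
  if(is.character(pat)){
    pat
  }else{
    nc::var_args_list(pat)[["pattern"]]
  }
}
## Display the regex string generated by each pattern.
show.patterns <- function(...){
  L <- list(...)
  str(lapply(L, one.pattern))
}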
```

# `nc::field` for reducing repetition

The `nc::field` function can be used to avoid repetition when defining
patterns of the form `variable: value`. The example below shows three
(mostly) equivalent ways to write a regex that captures the text after
the colon and space; the captured text is stored in the `variable`
group or output column:

```{r}
show.patterns(
  "variable: (?<variable>.*)",      #repetitive regex string
  list("variable: ", variable=".*"),#repetitive nc R code
  nc::field("variable", ": ", ".*"))#helper function avoids repetition
```

Note that the first version above has a named capture group, whereas
the second and third patterns, generated by nc, have an un-named
capture group and some non-capturing groups (but all three match the
same text).
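To see one of these patterns in use, the sketch below passes the
`nc::field` pattern to `nc::capture_first_vec`. The subject strings
and the variable names `field.subjects` and `field.dt` are made up
for illustration; the expected result is a data table with a single
`variable` column that holds the text after the colon and space.

```{r}
## Hypothetical subjects, not taken from any real data set.
field.subjects <- c("variable: foo", "variable: bar baz")
field.dt <- nc::capture_first_vec(
  field.subjects,
  nc::field("variable", ": ", ".*"))
field.dt
```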

Another example, where a space separates the field name from its value:

```{r}
show.patterns(
  "Alignment (?<Alignment>[0-9]+)",
  list("Alignment ", Alignment="[0-9]+"),
  nc::field("Alignment", " ", "[0-9]+"))
```

A third example, where a colon and one or more tabs separate the field
name from its value:

```{r}
show.patterns(
  "Chromosome:\t+(?<Chromosome>.*)",
  list("Chromosome:\t+", Chromosome=".*"),
  nc::field("Chromosome", ":\t+", ".*"))
```

# `nc::quantifier` for fewer parentheses

Another helper function is `nc::quantifier`, which makes patterns
easier to read by reducing the number of parentheses required to
define sub-patterns with quantifiers. For example, all three patterns
below create an optional non-capturing group which contains a named
capture group:

```{r}
show.patterns(
  "(?:-(?<chromEnd>[0-9]+))?",                #regex string
  list(list("-", chromEnd="[0-9]+"), "?"),    #nc pattern using lists
  nc::quantifier("-", chromEnd="[0-9]+", "?"))#quantifier helper function
```

Another example with a named capture group inside an optional
non-capturing group:

```{r}
show.patterns(
  "(?: (?<name>[^,}]+))?",
  list(list(" ", name="[^,}]+"), "?"),
  nc::quantifier(" ", name="[^,}]+", "?"))
```
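As a sketch of how such an optional group behaves when matching, the
code below uses `nc::quantifier` inside a larger pattern for genomic
ranges. The subjects and the names `range.subjects` and `range.dt`
are hypothetical. The second subject has no `-end` part, so the
optional group does not match, and the `as.integer` conversion is
expected to yield a missing value for `chromEnd` in that row.

```{r}
## Hypothetical subjects: the second one has no "-end" part.
range.subjects <- c("chr10:213054000-213055000", "chrM:111000")
range.dt <- nc::capture_first_vec(
  range.subjects,
  chrom="chr.*?",
  ":",
  chromStart="[0-9]+", as.integer,
  ## optional group: dash followed by the end position.
  nc::quantifier("-", chromEnd="[0-9]+", as.integer, "?"))
range.dt
```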

# `nc::alternatives` for simplified alternation

We also provide a helper function for defining regex patterns with
[alternation](https://www.regular-expressions.info/alternation.html). The
three patterns below are equivalent:

```{r}
show.patterns(
  "(?:(?<first>bar+)|(?<second>fo+))",
  list(first="bar+", "|", second="fo+"),
  nc::alternatives(first="bar+", second="fo+"))
```
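A minimal sketch of matching with this pattern is shown below, using
made-up subjects and the hypothetical variable names `alt.subjects`
and `alt.dt`. Each subject matches only one of the two alternatives,
so only one of the `first` and `second` columns should contain the
matched text in each row; the other is assumed to be empty.

```{r}
## Hypothetical subjects: one for each alternative.
alt.subjects <- c("foooo", "barrr")
alt.dt <- nc::capture_first_vec(
  alt.subjects,
  nc::alternatives(first="bar+", second="fo+"))
alt.dt
```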

# `nc::alternatives_with_shared_groups` for alternatives with identical named sub-pattern groups

Sometimes each alternative is just a re-arrangement of the same
sub-patterns. For example, consider the following subjects, each of
which is a date in one of two formats.

```{r}
subject.vec <- c("mar 17, 1983", "26 sep 2017", "17 mar 1984")
```

In each of the two formats, the month consists of three lower-case
letters, the day consists of two digits, and the year consists of four
digits. Is there a single pattern that can match each of these
subjects? Yes, such a pattern can be defined using the code below:

```{r}
pattern <- nc::alternatives_with_shared_groups(
  month="[a-z]{3}",
  day=list("[0-9]{2}", as.integer),
  year=list("[0-9]{4}", as.integer),
  list(american=list(month, " ", day, ", ", year)),
  list(european=list(day, " ", month, " ", year)))
```

In the code above, we used `nc::alternatives_with_shared_groups`,
which requires two kinds of arguments:

* named arguments (month, day, year) define sub-pattern groups that
  are used in each alternative.
* un-named arguments (last two) define alternative patterns, each of
  which can use the sub-pattern group names (month, day, year).

The pattern can be used for matching, and the result is a data table
with one column for each unique name:

```{r}
(match.dt <- nc::capture_first_vec(subject.vec, pattern))
```

After having parsed the dates into these three columns, we can add a
date column:

```{r}
Sys.setlocale(locale="C")#to recognize months in English.
match.dt[, date := data.table::as.IDate(
  paste(month, day, year), format="%b %d %Y")]
print(match.dt, class=TRUE)
```
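The `american` and `european` columns can also be used to tell which
alternative matched each subject. The sketch below repeats the date
pattern on two of the subjects so that it runs on its own; the names
`date.subjects`, `date.dt` and `which.format` are hypothetical, and
the code assumes that the column of the alternative that did not
match contains an empty string.

```{r}
date.subjects <- c("mar 17, 1983", "26 sep 2017")
date.dt <- nc::capture_first_vec(
  date.subjects,
  nc::alternatives_with_shared_groups(
    month="[a-z]{3}",
    day=list("[0-9]{2}", as.integer),
    year=list("[0-9]{4}", as.integer),
    list(american=list(month, " ", day, ", ", year)),
    list(european=list(day, " ", month, " ", year))))
## Assumption: an alternative that did not match is reported as "".
which.format <- ifelse(date.dt$american != "", "american", "european")
data.frame(subject=date.subjects, format=which.format)
```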

Another example is parsing given and family names, in two different
formats:

```{r}
nc::capture_first_vec(
  c("Toby Dylan Hocking","Hocking, Toby Dylan"),
  nc::alternatives_with_shared_groups(
    family="[A-Z][a-z]+",
    given="[^,]+",
    list(given_first=list(given, " ", family)),
    list(family_first=list(family, ", ", given))
  )
)
```