---
title: "Overview of nc functionality"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Overview of nc functionality}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

`nc` is a package for named capture regular expressions (regex), which
are useful for parsing/converting text data to tabular data (one row
per match, one column per capture group). In the terminology of regex,
we attempt to match a regex/pattern to a subject, which is a string of
text data. The regex/pattern is typically defined using a single
string (in other frameworks/packages/languages), but in `nc` we use a
special syntax: one or more R arguments are concatenated to define a
regex/pattern, and named arguments are used as capture groups.  For
more info about regex in general see
[regular-expressions.info](https://www.regular-expressions.info/reference.html)
and/or the Friedl book. For more
info about the special `nc` syntax, see `help("nc",package="nc")`.

Below is an index of topics which are explained in the different
vignettes, along with an overview of functionality using simple
examples.

## Capture first match in several subjects

[Capture first](v1-capture-first.html) is for the situation when your
input is a character vector (each element is a different subject to
parse), you want find the first match of a regex to each subject, and
your desired output is a data table (one row per subject, one column
per capture group in the regex).

```{r}
subject.vec <- c(
  "chr10:213054000-213,055,000",
  "chrM:111000",
  "chr1:110-111 chr2:220-222")
nc::capture_first_vec(
  subject.vec, chrom="chr.*?", ":", chromStart="[0-9,]+", as.integer)
```

A variant is doing the same thing, but with input
subjects coming from a data table/frame with character columns.

```{r}
library(data.table)
subject.dt <- data.table(
  JobID = c("13937810_25", "14022192_1"),
  Elapsed = c("07:04:42", "07:04:49"))
int.pat <- list("[0-9]+", as.integer)
nc::capture_first_df(
  subject.dt,
  JobID=list(job=int.pat, "_", task=int.pat),
  Elapsed=list(hours=int.pat, ":", minutes=int.pat, ":", seconds=int.pat))
```

## Capture all matches in a single subject
  
[Capture all](v2-capture-all.html) is for the situation when your
input is a single character string or text file subject, you want to
find all matches of a regex to that subject, and your desired output
is a data table (one row per match, one column per capture group in
the regex).

```{r}
nc::capture_all_str(
  subject.vec, chrom="chr.*?", ":", chromStart="[0-9,]+", as.integer)
```

## Reshape a data table with regularly named columns 

[Capture melt](v3-capture-melt.html) is for the situation when your
input is a data table/frame that has regularly named columns, and your
desired output is a data table with those columns reshaped into a
taller/longer form. In that case you can use a regex to identify the
columns to reshape.

```{r}
(one.iris <- data.frame(iris[1,]))
nc::capture_melt_single  (one.iris, part  =".*", "[.]", dim   =".*")
nc::capture_melt_multiple(one.iris, column=".*", "[.]", dim   =".*")
nc::capture_melt_multiple(one.iris, part  =".*", "[.]", column=".*")
```

## Reading regularly named data files

[Capture glob](v7-capture-glob.html) is for the situation when you
have several data files on disk, with regular names that you can match
with a glob/regex. In the example below we first write one CSV file
for each iris Species,

```{r}
dir.create(iris.dir <- tempfile())
icsv <- function(sp)file.path(iris.dir, paste0(sp, ".csv"))
data.table(iris)[, fwrite(.SD, icsv(Species)), by=Species]
dir(iris.dir)
```

We then use a glob and a regex to read those files in the code below:

```{r}
nc::capture_first_glob(file.path(iris.dir,"*.csv"), Species="[^/]+", "[.]csv")
```

## Helper functions for defining complex pattterns
  
[Helpers](v5-helpers.html) describes various functions that simplify
the definition of complex regex patterns. For example `nc::field`
helps avoid repetition below,

```{r}
subject.vec <- c("sex_child1", "age_child1", "sex_child2")
pattern <- list(
  variable="age|sex", "_",
  nc::field("child", "", "[12]", as.integer))
nc::capture_first_vec(subject.vec, pattern)
```

It also explains how to define common sub-patterns which are used in
several different alternatives.

```{r}
subject.vec <- c("mar 17, 1983", "26 sep 2017", "17 mar 1984")
pattern <- nc::alternatives_with_shared_groups(
  month="[a-z]{3}", day="[0-9]{2}", year="[0-9]{4}",
  list(month, " ", day, ", ", year),
  list(day, " ", month, " ", year))
nc::capture_first_vec(subject.vec, pattern)
```