---
title: "Regular expression flags"
author: "Laurent R. BergÃ©"
date: "`r Sys.Date()`"
output:
  html_document:
    theme: journal
    highlight: haddock
    number_sections: yes
    toc: yes
    toc_depth: 3
    toc_float:
      collapsed: no
      smooth_scroll: no
  pdf_document:
    toc: yes
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{ref_regex_flags}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(stringmagic)
```

All functions in `stringmagic` accept optional regular expressions (regex) flags when regular expressions
are expected. The idea is similar to 
[regular regex flags](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions#advanced_searching_with_flags),
but the flags are different in name and effect.

# Regex flags: Syntax

Use `"flag1, flag2/regex"` to add the flags `flag1` and `flag2` to the regular expression `regex`.
For example `"ignore, fixed/dt["` will add the flags `ignore` and `fixed` to the regex `dt[`.

Alternatively, use only the initials of the flags. Hence, `"if/dt["` would also add the 
flags `ignore` and `fixed`.

If the regex does not contain a slash (`/`), no flags are added. If your regex should 
contain a slash, see the [section on escaping](#flag_escaping).

Ex: let's find lines containing `"dt["`:
```{r}
code = c("DT = as.data.table(iris)", 
         "DT[, .(pl_sl = string_magic('PL/SL = {Petal.Length / Sepal.Length}')]")

string_get(code, "if/dt[")
```

# Regex flags: Reference

There are 6 flags:

- [ignore](#flag_ignore): always available
- [fixed](#flag_fixed): always available
- [word](#flag_word): always available
- [magic](#flag_magic): always available
- [total](#flag_total): only available in functions performing a replacement
- [single](#flag_single): only available in functions performing a replacement

### ignore {#flag_ignore}

The flag `"ignore"` leads to a case-insensitive search. 

Ex: let's extract words starting with the last letters of the alphabet.
```{r}
unhappy = "Rumble thy bellyful! Spit, fire! spout, rain!
Nor rain, wind, thunder, fire are my daughters.
I tax not you, you elements, with unkindness.
I never gave you kingdom, call'd you children,
You owe me no subscription. Then let fall
Your horrible pleasure. Here I stand your slave,
A poor, infirm, weak, and despis'd old man."

# the ignore flag allows to retain words starting with the
# upper cased letters
# ex: getting words starting with the letter 'r' to 'z'
cat_magic("{'ignore/\\b[r-z]\\w+'extract, c, 60 swidth ? unhappy}")
```

*Technically*, the [perl](https://www.pcre.org/) expression `"(?i)"` is added at the 
beginning of the pattern.

### fixed {#flag_fixed}

The flag `"fixed"` removes any special regular expression meaning from the pattern, 
and treats it as verbatim.

Ex: let's fix the equation by changing the operators. 
```{r}
x = "50 + 5 * 5 = 40"
string_clean(x, "f/+", "f/*", replacement = "-")

# Without the fixed flag, we would have gotten an error since '+' or '*'
# have a special meaning in regular expressions (it is a quantifier)
# and expects something before

# Here's the error
try(string_clean(x, "+", "*", replacement = "-"))
```

*Technically*, if `"fixed"` is the only flag, then the functions `base::grepl` or `base::gsub`
are run with the argument `fixed = TRUE`. If there are also the flags `"ignore"` or `"word"`, 
the pattern is nested into the perl boundaries `\\Q` and `\\E` which strip any special meaning
from the pattern.

### word {#flag_word}

The flag `"word"`:

- adds word boundaries to the pattern
- accepts comma-separated enumerations of words which are concatenated with a logical 'or'

The logic of accepting comma-separated enumerations is to increase readability.
For example, with the flag `"word"`, `"is, are, were"` is equivalent to `"\\b(is|are|were)\\b"`.

Ex: we hide a few words from Alfred de Vigny's poem.
```{r}
le_mont_des_oliviers = "S'il est vrai qu'au Jardin sacrÃ© des Ã‰critures,
Le Fils de l'homme ai dit ce qu'on voit rapportÃ© ;
Muet, aveugle et sourd au cri des crÃ©atures,
Si le Ciel nous laissa comme un monde avortÃ©,
Alors le Juste opposera le dÃ©dain Ã  l'absence
Et ne rÃ©pondra plus que par un froid silence
Au silence Ã©ternel de la DivinitÃ©."

# we hide a few words from this poem
string_magic("{'wi/et, le, il, au, des?, ce => _'replace ? le_mont_des_oliviers}")
```

*Technically*, first the pattern is split with respect to `",[ \t\n]+"`, then all elements
are collapsed with `"|"`. If the flag `"fixed"` was also present, each element is first wrapped 
into `"\\Q"` and `"\\E"`. Finally, we add parentheses (to enable capture) and word 
boundaries (`"\\b"`) on both sides.

### magic {#flag_magic}

Use the `"magic"` flag to interpolate variables inside the regular expression
before the regex is evaluated.

Ex: interpolating variables inside regular expressions.
```{r}
vowels ="aeiouy"
# let's keep only the vowels
# we want the pattern: "[^aeiouy]"
lmb = "'Tis safer to be that which we destroy
Than by destruction dwell in doubtful joy."
string_replace(lmb, "magic/[^{vowels}]", "_")

#
# Illustration of `string_magic` operations before regex application
#

cars = row.names(mtcars)
# Which of these models contain a digit?
models = c("Toyota", "Hornet", "Porsche")
# we want the pattern "(Toyota|Hornet|"Porsche).+\\d"
# we collapse the models with a pipe using '|'c
string_get(cars, "m/({'|'c ? models}).+\\d")

# alternative: same as above but we first comma-split the vector
models_comma = "Toyota, Hornet, Porsche"
string_get(cars, "m/({S, '|'c ? models_comma}).+\\d")

#
# Interpolation does not apply to regex-specific curly brackets
#

# We delete only successions of 2+ vowels
# {2,} has a rexex meaning and is not interpolated:
string_replace(lmb, "magic/[{vowels}]{2,}", "_")
```

*Technically*, the algorithm does not interpolate curly brackets having a 
regular expression meaning. The expression of the form `"{a, b}"` with `"a"` and `"b"`
digits means a repetition of the previous symbol of at least `"a"` times and at most `"b"` times.
The variables are fetched in the calling environment. To fetch them from a different location,
you can use the argument `envir`.

### total {#flag_total}

The flag `"total"` is only available to functions performing a replacement. 
In that case, if a pattern is detected, *the full character string* is replaced 
(instead of just the pattern). 

Ex: let's replace a few car models.
```{r}
cars_small = head(row.names(mtcars))
print(cars_small)

string_replace(cars_small, "ti/mazda", "Mazda: sold out!")
```

On top of this, the `"total"` flag allows to perform logical operations across
several regex patterns. You have more information on this in the [dedicated vignette](https://lrberge.github.io/stringmagic/articles/ref_regex_logic.html).
In a nutshell, you can write `"pat1 & !pat2 | pat3"` with `"patx"` regular 
expresion patterns. This means: contains `pat1` and does not contain `pat2`,
or contains `pat3`.

Ex: detect car brands with a digit and no 'e'.
```{r}
cars_small = head(row.names(mtcars))
print(cars_small)

string_replace(cars_small, "total, ignore/\\d & !e", "I don't like that brand!")
```

*Technically*, instead of using `gsub` to replace the pattern, [`string_is`](https://lrberge.github.io/stringmagic/articles/guide_string_tools.html#sec_detect) is used
to detect which element contains the pattern. Each element with the pattern
is then substituted with the replacement.

### single {#flag_single}

The flag `"single"` is only available to functions performing a replacement.
It allows only a single substitution to take place. Said differently, only the 
first replacement is performed.

Ex: single substitutions.
```{r}
encounter = string_vec("Hi Cyclops., Hi you. What's your name?, Odysseus is my name.")
# we only remove the first word
string_replace(encounter, "single/\\w+", "...")
```

*Technically*, the function `base::sub` is used instead of `base::gsub`.

## Escaping flags: How to, and a word of caution with paths {#flag_escaping}

If your regular expression contains a slash (`"/"`), this will come in conflict
with the parsing of the optional flags. 
At the moment a `/` is present in a pattern, the algorithm will throw an error
if the expected flags are not written correctly.

To use a slash in the regex without adding flags there is only one solutions:

- escape the first `"/"` with a double backslash

Ex: let's invert the numerator and denominator of a division.
```{r}
eq = "5/x = 3/2"
# escaping with backslashes
string_replace(eq, "(\\w)\\/(\\w)", "\\2/\\1")
```

**Warning:** when applying regular expressions on file paths, to avoid unexpected behavior,
the flags algorithm is very strict. Everytime a pattern contains a slash that is not 
associated with valid flags, an error will be thrown unless that slash is escaped.

```{r}
path = "my/path/to/the/file.tex"
# we keep the directory only

# first try: an error bc flags are expected before the first '/'
try(string_replace(path, "/[^/]+$"))

# after escaping: works (only the first slash requires escaping)
string_replace(path, "\\/[^/]+$")

# if we did add a flag, we would need to double the slash
# compare...
string_replace(path, "i//[^/]+$")
# to...
string_replace(path, "i/[^/]+$")
```

Hence if you need to write path related regexes, you very likely need to escape the first slash.