---
title: 'valr overview'
date: '`r Sys.Date()`'
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    vignette: >
      %\VignetteIndexEntry{valr-overview}
      %\VignetteEngine{knitr::rmarkdown}
      %\VignetteEncoding{UTF-8}
---

```{r}
#| label: knitr-opts
#| echo: false
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align = "center"
)
```

```{r}
#| label: init
#| echo: false
#| message: false
library(valr)
library(dplyr)
library(ggplot2)
library(tibble)
```

### Familiar tools, natively in R

The functions in `valr` have similar names to their `BEDtools` counterparts, and so will be familiar to users coming from the `BEDtools` suite. Similar to [`pybedtools`](https://daler.github.io/pybedtools/#why-pybedtools), `valr` has a terse syntax:

```{r}
#| label: valr-demo
#| message: false
library(valr)
library(dplyr)

snps <- read_bed(valr_example("hg19.snps147.chr22.bed.gz"))
genes <- read_bed(valr_example("genes.hg19.chr22.bed.gz"))

# find snps in intergenic regions
intergenic <- bed_subtract(snps, genes)
# distance from intergenic snps to nearest gene
nearby <- bed_closest(intergenic, genes)

nearby |>
  select(starts_with("name"), .overlap, .dist) |>
  filter(abs(.dist) < 1000)
```

### Input data

`valr` assigns common column names to facilitate comparisons between tbls. All tbls will have `chrom`, `start`, and `end` columns, and some tbls from multi-column formats will have additional pre-determined column names. See the `read_bed()` documentation for details.

```{r}
#| label: file-io
bed_file <- valr_example("3fields.bed.gz")
read_bed(bed_file) # accepts filepaths or URLs
```

`valr` can also operate on BED-like data.frames already constructed in R, provided that columns named `chrom`, `start` and `end` are present. New tbls can also be constructed as either `tibbles` or base R `data.frames`.

```{r}
#| label: trbl-ivls
bed <- tribble(
  ~chrom, ~start,  ~end,
  "chr1", 1657492, 2657492,
  "chr2", 2501324, 3094650
)

bed
```

### Interval coordinates

`valr` adheres to the BED [format](https://genome.ucsc.edu/FAQ/FAQformat#format1) which specifies that the start position for an interval is zero based and the end position is one-based. The first position in a chromosome is 0. The end position for a chromosome is one position passed the last base, and is not included in the interval. For example:

```{r}
#| label: zero-based
# a chromosome 100 basepairs in length
chrom <- tribble(
  ~chrom, ~start, ~end,
  "chr1", 0,      100
)

chrom

# single base-pair intervals
bases <- tribble(
  ~chrom, ~start, ~end,
  "chr1", 0,      1, # first base of chromosome
  "chr1", 1,      2, # second base of chromosome
  "chr1", 99,     100 # last base of chromosome
)

bases
```

### Remote databases

Remote databases can be accessed with `db_ucsc()` (to access the UCSC Browser) and `db_ensembl()` (to access Ensembl databases).

```{r}
#| label: db
#| eval: false
# access the `refGene` tbl on the `hg38` assembly.
if (require(RMariaDB)) {
  ucsc <- db_ucsc("hg38")
  tbl(ucsc, "refGene")
}
```

### Visual documentation

The `bed_glyph()` tool illustrates the results of operations in `valr`, similar to those found in the `BEDtools` documentation. This glyph shows the result of intersecting `x` and `y` intervals with `bed_intersect()`:

```{r}
#| label: intersect-glyph
x <- tribble(
  ~chrom, ~start, ~end,
  "chr1", 25,     50,
  "chr1", 100,    125
)

y <- tribble(
  ~chrom, ~start, ~end,
  "chr1", 30,     75
)

bed_glyph(bed_intersect(x, y))
```

And this glyph illustrates `bed_merge()`:

```{r}
#| label: merge-glyph
x <- tribble(
  ~chrom, ~start, ~end,
  "chr1", 1, 50,
  "chr1", 10, 75,
  "chr1", 100, 120
)

bed_glyph(bed_merge(x))
```

### Grouping data

The `group_by` function in dplyr can be used to perform functions on subsets of single and multiple `data_frame`s. Functions in `valr` leverage grouping to enable a variety of comparisons. For example, intervals can be grouped by `strand` to perform comparisons among intervals on the same strand.

```{r}
#| label: group-strand
x <- tribble(
  ~chrom, ~start, ~end, ~strand,
  "chr1", 1,      100,  "+",
  "chr1", 50,     150,  "+",
  "chr2", 100,    200,  "-"
)

y <- tribble(
  ~chrom, ~start, ~end, ~strand,
  "chr1", 50,     125,  "+",
  "chr1", 50,     150,  "-",
  "chr2", 50,     150,  "+"
)

# intersect tbls by strand
x <- group_by(x, strand)
y <- group_by(y, strand)

bed_intersect(x, y)
```

Comparisons between intervals on opposite strands are done using the `flip_strands()` function:

```{r}
#| label: strand-opp
x <- group_by(x, strand)

y <- flip_strands(y)
y <- group_by(y, strand)

bed_intersect(x, y)
```

Both single set (e.g. `bed_merge()`) and multi set operations will respect groupings in the input intervals.

### Column specification

Columns in `BEDtools` are referred to by position:

``` bash
# calculate the mean of column 6 for intervals in `b` that overlap with `a`
bedtools map -a a.bed -b b.bed -c 6 -o mean
```

In `valr`, columns are referred to by name and can be used in multiple name/value expressions for summaries.

```{r}
#| label: tidy-eval
#| eval: false
# calculate the mean and variance for a `value` column
bed_map(a, b, .mean = mean(value), .var = var(value))

# report concatenated and max values for merged intervals
bed_merge(a, .concat = concat(value), .max = max(value))
```

## Getting started

### Meta-analysis

This demonstration illustrates how to use `valr` tools to perform a "meta-analysis" of signals relative to genomic features. Here we to analyze the distribution of histone marks surrounding transcription start sites.

First we load libraries and relevant data.

```{r}
#| label: tss-demo
#| warning: false
#| message: false
# `valr_example()` identifies the path of example files
bedfile <- valr_example("genes.hg19.chr22.bed.gz")
genomefile <- valr_example("hg19.chrom.sizes.gz")
bgfile <- valr_example("hela.h3k4.chip.bg.gz")

genes <- read_bed(bedfile)
genome <- read_genome(genomefile)
y <- read_bedgraph(bgfile)
```

Then we generate 1 bp intervals to represent transcription start sites (TSSs). We focus on `+` strand genes, but `-` genes are easily accommodated by filtering them and using `bed_makewindows()` with `reversed` window numbers.

```{r}
#| label: make-tss
# generate 1 bp TSS intervals, `+` strand only
tss <- genes |>
  filter(strand == "+") |>
  mutate(end = start + 1)

# 1000 bp up and downstream
region_size <- 1000
# 50 bp windows
win_size <- 50

# add slop to the TSS, break into windows and add a group
x <- tss |>
  bed_slop(genome, both = region_size) |>
  bed_makewindows(win_size)

x
```

Now we use the `.win_id` group with `bed_map()` to calculate a sum by mapping `y` signals onto the intervals in `x`. These data are regrouped by `.win_id` and a summary with `mean` and `sd` values is calculated.

```{r}
#| label: bed-map
# map signals to TSS regions and calculate summary statistics.
res <- bed_map(x, y, win_sum = sum(value, na.rm = TRUE)) |>
  group_by(.win_id) |>
  summarize(
    win_mean = mean(win_sum, na.rm = TRUE),
    win_sd = sd(win_sum, na.rm = TRUE)
  )

res
```

Finally, these summary statistics are used to construct a plot that illustrates histone density surrounding TSSs.

```{r}
#| lable: plot-tss
#| warning: false
#| message: false
x_labels <- seq(
  -region_size,
  region_size,
  by = win_size * 5
)

x_breaks <- seq(1, 41, by = 5)

sd_limits <- aes(
  ymax = win_mean + win_sd,
  ymin = win_mean - win_sd
)

ggplot(
  res,
  aes(
    x = .win_id,
    y = win_mean
  )
) +
  geom_point() +
  geom_pointrange(sd_limits) +
  scale_x_continuous(
    labels = x_labels,
    breaks = x_breaks
  ) +
  labs(
    x = "Position (bp from TSS)",
    y = "Signal",
    title = "Human H3K4me3 signal near transcription start sites"
  ) +
  theme_classic()
```

## Related work

* Command-line tools [BEDtools][1] and [bedops][5].

* The Python library [pybedtools][4] wraps BEDtools.

* The R packages [GenomicRanges][6], [bedr][7], [IRanges][8] and [GenometriCorr][9] provide similar capability with a different philosophy.

[1]: https://bedtools.readthedocs.io/en/latest/
[2]: https://github.com/hadley/dplyr
[3]: https://www.rcpp.org/
[4]: https://pythonhosted.org/pybedtools/
[5]: https://bedops.readthedocs.io/en/latest/index.html
[6]: https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html
[7]: https://CRAN.R-project.org/package=bedr
[8]: https://bioconductor.org/packages/release/bioc/html/IRanges.html
[9]: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002529
[10]: https://rmarkdown.rstudio.com/
[12]: https://bitbucket.org/snakemake/snakemake/wiki/Home
[13]: https://shiny.posit.co/