---
title: "RVenn: An R package for set operations on multiple sets"
author: "Turgut Yigit Akyol"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{RVenn: An R package for set operations on multiple sets}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

This tutorial shows how to use  `RVenn`, a package for dealing with multiple
sets. The base R functions (`intersect`, `union` and `setdiff`) only work with
two sets. `%>%` can be used from `magrittr` but, for many sets this can be
tedious. `reduce` function from `purrr` package also provides a solution, which
is the function that is used for set operations in this package. The functions
`overlap`, `unite` and `discern` abstract away the details, so one can just
construct the universe and choose the sets to operate by index or set name.
Further, by using `ggvenn` Venn diagram can be drawn for 2-3 sets. As you can
notice from the name of the function, `ggvenn` is based on `ggplot2`, so it is a
neat way to show the relationship among a reduced number sets. For many sets, it
is much better to use [UpSet](http://caleydo.org/tools/upset/) or `setmap`
function provided within this package. Finally, by using `enrichment_test`
function, the p-value of an overlap between two sets can be calculated. Here,
the usage of all these functions will be shown.

## Creating toy data

This chunk of code will create 10 sets with sizes ranging from 5 to 25.

```{r, message=FALSE}
library(purrr)
library(RVenn)
library(ggplot2)
```


```{r}
set.seed(42)
toy = map(sample(5:25, replace = TRUE, size = 10),
          function(x) sample(letters, size = x))
toy[1:3]  # First 3 of the sets.
```

### Construct the Venn object

```{r}
toy = Venn(toy)
```


## Set operations

### Intersection

Intersection of all sets:

```{r}
overlap(toy)
```

Intersection of selected sets (chosen with set names or indices, respectively):

```{r}
overlap(toy, c("Set_1", "Set_2", "Set_5", "Set_8"))
```

```{r}
overlap(toy, c(1, 2, 5, 8))
```

### Pairwise intersections

```{r}
overlap_pairs(toy, slice = 1:4)
```

### Union

Union of all sets:

```{r}
unite(toy)
```

Union of selected sets (chosen with set names or indices, respectively):

```{r}
unite(toy, c("Set_3", "Set_8"))
```

```{r}
unite(toy, c(3, 8))
```

### Pairwise unions

```{r}
unite_pairs(toy, slice = 1:4)
```


### Set difference

```{r}
discern(toy, 1, 8)
```

```{r}
discern(toy, "Set_1", "Set_8")
```

```{r}
discern(toy, c(3, 4), c(7, 8))
```

### Pairwise differences

```{r}
discern_pairs(toy, slice = 1:4)
```

## Venn Diagram

For two sets:

```{r, fig.height=5, fig.width=8, fig.retina=3}
ggvenn(toy, slice = c(1, 5))
```

For three sets:

```{r, fig.height=8, fig.width=8, fig.retina=3}
ggvenn(toy, slice = c(3, 6, 8))
```

## Heatmap

```{r, fig.height=8, fig.width=8, fig.retina=3}
setmap(toy)
```

Without clustering

```{r, fig.height=8, fig.width=8, fig.retina=3}
setmap(toy, element_clustering = FALSE, set_clustering = FALSE)
```


## Enrichment test

```{r, fig.width=8, fig.height=5, fig.retina=3}
er = enrichment_test(toy, 6, 7)
er$Significance

qplot(er$Overlap_Counts, geom = "blank") +
  geom_histogram(fill = "lemonchiffon4", bins = 8, color = "black") +
  geom_vline(xintercept = length(overlap(toy, c(6, 7))), color = "firebrick2",
             size = 2, linetype = "dashed", alpha = 0.7) +
  ggtitle("Null Distribution") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_x_continuous(name = "Overlap Counts") +
  scale_y_continuous(name = "Frequency")
```

The test above, of course, is not very meaningful as we randomly created the
sets; therefore, we get a high p-value. However, when you are working with
actual data, e.g. to check if a motif is enriched in the promoter regions of the
genes in a gene set, you can use this test. In that case, `set1` will be the
gene set of interest, `set2` will be the all the genes that the motif is found
in the genome and `univ` will be all the genes of a genome.