---
title: "Biological Entity Dictionary (BED): exploring and converting identifiers of biological entities such as genes, transcripts or peptides"
package: "BED (version `r packageVersion('BED')`)"
output:
   rmarkdown::html_document:
      number_sections: yes
      self_contained: true
      theme: cerulean
      toc: yes
      toc_float: yes
      fig_width: 7
      fig_height: 5
vignette: >
   %\VignetteIndexEntry{BED}
   %\VignetteEncoding{UTF-8}
   %\VignetteEngine{knitr::rmarkdown}
editor_options:
   chunk_output_type: console
---


```{r setup, echo=FALSE}
library(knitr)
## The following line is to avoid building errors on CRAN
knitr::opts_chunk$set(eval=Sys.getenv("USER") %in% c("pgodard"))

vn_as_png <- function(vn){
  html_file <- tempfile(fileext = ".html")
  png_file <- tempfile(fileext = ".png")
  visSave(vn, html_file)
  invisible(webshot2::webshot(
    html_file, file=png_file, selector=".visNetwork"#, vwidth="100%"
  ))
  im <- base64enc::dataURI(file=png_file, mime="image/png")
  invisible(file.remove(c(html_file,png_file)))
  htmltools::div(
     width="100%",
     htmltools::img(src=im, alt="visNetwork", width="100%")
  )
}
```

::: {style="width:200px;"}
![](img/BED.png){width="100%"}
:::

<!----------------------------------------------------------------------------->
<!----------------------------------------------------------------------------->
# Introduction

This Biological Entity Dictionary (BED)
has been developed to address three main challenges.
The first one is related to the completeness of identifier mappings.
Indeed, direct mapping information provided by the different systems
are not always complete and can be enriched by mappings provided by other
resources.
More interestingly, direct mappings not identified by any of these
resources can be indirectly inferred by using mappings to a third reference.
For example, many human Ensembl gene ID are not directly mapped to any
Entrez gene ID but such mappings can be inferred using respective mappings
to HGNC ID. The second challenge is related to the mapping of deprecated
identifiers. Indeed, entity identifiers can change from one resource
release to another. The identifier history is provided by some resources,
such as Ensembl or the NCBI, but it is generally not used by mapping tools.
The third challenge is related to the automation of the mapping process
according to the relationships between the biological entities of interest.
Indeed, mapping between gene and protein ID scopes should not be done
the same way than between two scopes regarding gene ID.
Also, converting identifiers from different organisms should be possible
using gene orthologs information.

This document shows how to use the **BED (Biological Entity Dictionary)**
R package to get and explore mapping between
identifiers of biological entities (BE).
This package provides a way to connect to a BED Neo4j database in which
the relationships between the identifiers from different sources are recorded.

## Citing BED

This package and the underlying research has been published in
this peer reviewed article:

<a href="`r citation("BED")[1]$url`" target="_blank">
`r sub('[[]', '(', sub('[]]', ')', format(citation("BED"), style="textVersion")))`
</a>

## Installation

### Dependencies

This BED package depends on the following packages available in the CRAN
repository:

- **neo2R**
- **visNetwork**
- **dplyr**
- **readr**
- **stringr**
- **utils**
- **shiny**
- **DT**
- **miniUI**
- **rstudioapi**

All these packages must be installed before installing BED.

### Installation from github

```{r, eval=FALSE}
devtools::install_github("patzaw/BED")
```

### Possible issue when updating from releases <= 1.3.0

If you get an error like the following...

```
Error: package or namespace load failed for â€˜BEDâ€™:
 .onLoad failed in loadNamespace() for 'BED', details:
  call: connections[[connection]][["cache"]]
  error: subscript out of bounds
```

... remove the BED folder located here:

```{r, echo=TRUE, eval=FALSE}
file.exists(file.path(Sys.getenv("HOME"), "R", "BED"))
```

## Connection

Before using BED, the connection needs to be established with the
underlying Neo4j DB.
`url`, `username` and `password` should be adapted.

```{r, message=FALSE, eval=TRUE}
library(BED)
```

```{r, message=FALSE, eval=TRUE, echo=FALSE}
connectToBed()
```

```{r, message=FALSE, eval=FALSE}
connectToBed(url="localhost:5454", remember=FALSE, useCache=FALSE)
```

The `remember` parameter can be set to `TRUE` in order to save connection
information that will be automatically used the next time the `connectToBed()`
function is called.
By default, this parameter is set to `FALSE` to comply with CRAN policies.
Saved connection can be managed with the `lsBedConnections()` and
the `forgetBedConnection()` functions.

The `useCache` parameter is by default set to `FALSE` to comply with
CRAN policies.
However, it is recommended to set it to `TRUE` to improve the speed
of recurrent queries: the results of some large queries are saved locally
in a file.

The connection can be checked the following way.

```{r, message=TRUE}
checkBedConn(verbose=TRUE)
```

If the `verbose` parameter is set to TRUE, the URL and the content version
are displayed as messages.

The following function list saved connections.

```{r, eval=FALSE}
lsBedConnections()
```

The `connection` param of the `connectToBed` function can be used to
connect to a saved connection other than the last one.

## Data model

The BED underlying data model can be shown at any time using
the following command.

```{r, eval=FALSE}
showBedDataModel()
```

![](img/BED-data-model.png)

## Direct calls

Cypher queries can be run directly on the Neo4j database using the
`cypher` function from the **neo2R** package through the `bedCall` function.

```{r}
results <- bedCall(
    cypher,
    query=prepCql(
       'MATCH (n:BEID)',
       'WHERE n.value IN $values',
       'RETURN DISTINCT n.value AS value, labels(n), n.database'
    ),
    parameters=list(values=c("10", "100"))
)
results
```

## Feeding the database

Many functions are provided within the package to build your own BED database
instance. These functions are not exported in order to avoid their use when
interacting with BED normally.
Information about how to get an instance of the BED neo4j database is
provided here:

- <https://github.com/patzaw/BED#bed-database-instance-available-as-a-docker-image>
- <https://github.com/patzaw/BED#build-a-bed-database-instance>

It can be adapted to user needs.

## Caching

This part is relevant if the `useCache` parameter is set to TRUE when
calling `connectToBed()`.

Functions of the BED package used to retrieve thousands of identifiers
can take some time (generally a few seconds) before returning a result.
Thus for this kind of query, the query is run for all the relevant ID in the DB
and thanks to a cache system implemented in the package same queries
with different filters should be much faster the following times.

By default the cache is flushed when the system detect inconsistencies
with the BED database. However, it can also be manualy flushed if needed using
the `clearBedCache()` function.

Queries already in cache can be listed using the `lsBedCache()` function which
also return the occupied disk space.


<!----------------------------------------------------------------------------->
<!----------------------------------------------------------------------------->
# Exploring available data

## Biological entities

BED is organized around the central concept of **Biological Entity** (BE).
All supported types of BE can be listed.

```{r}
listBe()
```

These BE are organized according to how they are related to each other.
For example a *Gene* *is_expressed_as* a *Transcript*.
This organization allows to find the first upstream BE common to a set of
BE.

```{r}
firstCommonUpstreamBe(c("Object", "Transcript"))
firstCommonUpstreamBe(c("Peptide", "Transcript"))
```

## Organisms

Several organims can be supported by the BED underlying database.
They can be listed the following way.

```{r}
listOrganisms()
```

Common names are also supported and the corresponding taxonomic identifiers
can be retrieved. Conversely the organism names corresponding to a
taxonomic ID can be listed.

```{r}
getOrgNames(getTaxId("human"))
```

## Identifiers of biological entities

The main aim of BED is to allow the mapping of identifiers from different
sources such as Ensembl or Entrez. Supported sources can be listed the
following way for each supported organism.

```{r}
listBeIdSources(be="Transcript", organism="human")
```

The database gathering the largest number of BE of specific type can also
be identified.

```{r}
largestBeSource(be="Transcript", organism="human", restricted=TRUE)
```

Finally, the `getAllBeIdSources()` function returns all
the source databases of BE identifiers whatever the BE type.

## Experimental platforms and probes

BED also supports experimental platforms and provides mapping betweens
probes and BE identifiers (BEID).

The supported platforms can be listed the following way.
The `getTargetedBe()` function returns the type of BE on which a specific
platform focus.

```{r}
head(listPlatforms())
getTargetedBe("GPL570")
```


<!----------------------------------------------------------------------------->
<!----------------------------------------------------------------------------->
# Managing identifiers

## Retrieving all identifiers from a source

All identifiers of an organism BEs from one source can be retrieved.

```{r}
beids <- getBeIds(
    be="Gene", source="EntrezGene", organism="human",
    restricted=FALSE
)
dim(beids)
head(beids)
```

The first column, *id*, corresponds to the identifiers of the BE in the source.
The column named according to the BE type (in this case *Gene*)
corresponds to the internal identifier of the related BE.
**BE CAREFUL, THIS INTERNAL ID IS NOT STABLE AND CANNOT BE USED AS A REFERENCE**.
This internal identifier is useful to identify BEIDS corresponding to the
same BE. The following code can be used to have an overview of such
redundancy.

```{r}
sort(table(table(beids$Gene)), decreasing = TRUE)
ambId <- sum(table(table(beids$Gene)[which(table(beids$Gene)>=10)]))
```

In the example above we can see that most of Gene BE are identified by only
one EntrezGene ID. However many of them are identified by two or more
ID; `r ifelse(exists("ambId"), ambId, "XXX")` BE
are even identified by 10 or more EntrezGeneID.
In this case, most of these redundancies come from ID history extracted
from Entrez. Legacy ID can be excluded from the retrieved ID
by setting the `restricted` parameter to TRUE.

```{r}
beids <- getBeIds(
    be="Gene", source="EntrezGene", organism="human",
    restricted = TRUE
)
dim(beids)
```

The same code as above can be used to identify remaining redundancies.

```{r}
sort(table(table(beids$Gene)), decreasing = TRUE)
```

In the example above we can see that allmost all Gene BE are identified by only
one EntrezGene ID. However some of them are identified by two or more ID.
This result comes from how the BED database is constructed according to
the ID mapping provided by the different source databases.
The graph below shows how the mapping was done for such a BE with
redundant EntrezGene IDs.

<strong>
This issue has been mainly solved by not taking into account ambigous mappings
between NCBI Entrez gene identifiers and Ensembl gene identifier provided
by Ensembl. It has been achieved using the `cleanDubiousXRef()` function
from the 2019.10.11 version of the BED-UCB-Human database.
</strong>

<div style="border:solid;overflow:hidden;">
```{r}
eid <- beids$id[which(beids$Gene %in% names(which(table(beids$Gene)>=3)))][1]
print(eid)
exploreBe(id=eid, source="EntrezGene", be="Gene") %>%
   visPhysics(solver="repulsion")
```
</div>
<br>

The way the ID correspondances are reported in the different source databases
leads to this mapping ambiguity which has to be taken into account
when comparing identifiers from different databases.

The `getBeIds()` returns other columns providing additional
information about the *id*.
The same function can be used to retrieved symbols or probe identifiers.

### Preferred identifier

The BED database is constructed according to the relationships between
identifiers provided by the different sources. Biological entities (BE) are
identified as clusters of identifiers which correspond to each other
directly or indirectly (`corresponds_to` relationship).
Because of this design a BE can be identified by multiple identifiers (BEID)
from the same database as shown above.
These BEID are often related to alternate version of an entity.

For example, Ensembl provides different version (alternative sequences)
of some chromosomes parts. And genes are also annotated on these alternative
sequences. In Uniprot some *unreviewed* identifiers can correspond
to *reviewed* proteins.

When available such kind of information is associated to
an **Attribute** node through a `has` relationship providing the
value of the attribute for the BEID. This information can also
be used to define if a BEID is a *preferred* identifier for
a BE.

The example below shows the case of the MAPT gene annotated on different
version of human chromosome 17.

<div style="border:solid;overflow:hidden;">
```{r}
mapt <- convBeIds(
   "MAPT", from="Gene", from.source="Symbol", from.org="human",
   to.source="Ens_gene", restricted=TRUE
)
exploreBe(
   mapt[1, "to"],
   source="Ens_gene",
   be="Gene"
)
getBeIds(
   be="Gene", source="Ens_gene", organism="human",
   restricted=TRUE,
   attributes=listDBAttributes("Ens_gene"),
   filter=mapt$to
)
```
</div>

## Checking identifiers

The origin of identifiers can be guessed as following.

```{r}
oriId <- c(
    "17237", "105886298", "76429", "80985", "230514", "66459",
    "93696", "72514", "20352", "13347", "100462961", "100043346",
    "12400", "106582", "19062", "245607", "79196", "16878", "320727",
    "230649", "66880", "66245", "103742", "320145", "140795"
)
idOrigin <- guessIdScope(oriId)
print(idOrigin$be)
print(idOrigin$source)
print(idOrigin$organism)
```

The best guess is returned as a list but other possible origins are listed in
the *details* attribute.

```{r}
print(attr(idOrigin, "details"))
```

If the origin of identifiers is already known, it can also be tested.

```{r}
checkBeIds(ids=oriId, be="Gene", source="EntrezGene", organism="mouse")
```

```{r}
checkBeIds(ids=oriId, be="Gene", source="HGNC", organism="human")
```

## Identifier annotation

Identifiers can be annotated with symbols and names according to available
information. 
The following code returns the most relevant symbol and the most relevant name
for each ID.
Source URL can also be generated with the `getBeIdURL()` function.

```{r}
toShow <- getBeIdDescription(
    ids=oriId, be="Gene", source="EntrezGene", organism="mouse"
)
toShow$id <- paste0(
    sprintf(
        '<a href="%s" target="_blank">',
        getBeIdURL(toShow$id, "EntrezGene")
    ),
    toShow$id,
    '<a>'
)
kable(toShow, escape=FALSE, row.names=FALSE)
```

All possible symbols and all possible names for each ID can also be retrieved
using the following functions.

```{r}
res <- getBeIdSymbols(
    ids=oriId, be="Gene", source="EntrezGene", organism="mouse",
    restricted=FALSE
)
head(res)
```
```{r}
res <- getBeIdNames(
    ids=oriId, be="Gene", source="EntrezGene", organism="mouse",
    restricted=FALSE
)
head(res)
```

Also probes and some biological entities do not have directly associated
symbols or names. These elements can also be annotated according to information
related to relevant genes.

```{r}
someProbes <- c(
    "238834_at", "1569297_at", "213021_at", "225480_at",
    "216016_at", "35685_at", "217969_at", "211359_s_at"
)
toShow <- getGeneDescription(
    ids=someProbes, be="Probe", source="GPL570", organism="human"
)
kable(toShow, escape=FALSE, row.names=FALSE)
```

## Products of molecular biology processes

The BED data model has beeing built to fulfill molecular biology processes:

- **is_expressed_as** relationships correspond to the transcription process.
- **is_translated_in** relationships correspond to the translation process.
- **codes_for** is a fuzzy relationship allowing the mapping of genes on
object not necessary corresonpding to the same kind of biological molecule.

These processes are described in different databases with different level of
granularity. For exemple, Ensembl provides possible transcripts for each gene
specifying which one of them is canonical.

The following functions are used to retrieve direct products or direct
origins of molecular biology processes.

```{r}
getDirectProduct("ENSG00000145335", process="is_expressed_as")
getDirectProduct("ENST00000336904", process="is_translated_in")
getDirectOrigin("NM_001146055", process="is_expressed_as")
```


<!----------------------------------------------------------------------------->
<!----------------------------------------------------------------------------->
# Converting identifiers

## Same entity and same organism: from one source to another

```{r}
res <- convBeIds(
    ids=oriId,
    from="Gene",
    from.source="EntrezGene",
    from.org="mouse",
    to.source="Ens_gene",
    restricted=TRUE,
    prefFilter=TRUE
)
head(res)
```

## Same organism: from one entity to another

```{r}
res <- convBeIds(
    ids=oriId,
    from="Gene",
    from.source="EntrezGene",
    from.org="mouse",
    to="Peptide",
    to.source="Ens_translation",
    restricted=TRUE,
    prefFilter=TRUE
)
head(res)
```

## From one organism to another

```{r}
res <- convBeIds(
    ids=oriId,
    from="Gene",
    from.source="EntrezGene",
    from.org="mouse",
    to="Peptide",
    to.source="Ens_translation",
    to.org="human",
    restricted=TRUE,
    prefFilter=TRUE
)
head(res)
```

## Converting lists of identifiers

List of identifiers can be converted the following way.
Only converted IDs are returned in this case.

```{r}
humanEnsPeptides <- convBeIdLists(
    idList=list(a=oriId[1:5], b=oriId[-c(1:5)]),
    from="Gene",
    from.source="EntrezGene",
    from.org="mouse",
    to="Peptide",
    to.source="Ens_translation",
    to.org="human",
    restricted=TRUE,
    prefFilter=TRUE
)
unlist(lapply(humanEnsPeptides, length))
lapply(humanEnsPeptides, head)
```

### BEIDList

`BEIDList` objects are used to manage lists of BEID with
an attached explicit scope,
and metadata provided in a data frame.
The `focusOnScope()` function is used to easily convert such object to another
scope. For example, in the code below, Entrez gene identifiers are converted
in Ensembl identifiers.

```{r}
entrezGenes <- BEIDList(
   list(a=oriId[1:5], b=oriId[-c(1:5)]),
   scope=list(be="Gene", source="EntrezGene", organism="Mus musculus"),
   metadata=data.frame(
      .lname=c("a", "b"),
      description=c("Identifiers in a", "Identifiers in b"),
      stringsAsFactors=FALSE
   )
)
entrezGenes
entrezGenes$a
ensemblGenes <- focusOnScope(entrezGenes, source="Ens_gene")
ensemblGenes$a
```

## Converting data frames

IDs in data frames can also be converted.

```{r}
toConv <- data.frame(a=1:25, b=runif(25))
rownames(toConv) <- oriId
res <- convDfBeIds(
    df=toConv,
    from="Gene",
    from.source="EntrezGene",
    from.org="mouse",
    to.source="Ens_gene",
    restricted=TRUE,
    prefFilter=TRUE
)
head(res)
```

## Explore convertion shortest path between two identifiers

Because the conversion process takes into account several resources,
it might be useful to explore the path between two identifiers
which have been mapped. This can be achieved by the `exploreConvPath`
function.

<div style="border:solid;overflow:hidden;">
```{r}
from.id <- "ILMN_1220595"
res <- convBeIds(
   ids=from.id, from="Probe", from.source="GPL6885", from.org="mouse",
   to="Peptide", to.source="Uniprot", to.org="human",
   prefFilter=TRUE
)
res
exploreConvPath(
   from.id=from.id, from="Probe", from.source="GPL6885",
   to.id=res$to[1], to="Peptide", to.source="Uniprot"
)
```
</div>

The figure above shows how the `r ifelse(exists("from.id"), from.id, "XXX")`
ProbeID, targeting
the mouse NM_010552 transcript, can be associated
to the `r ifelse(exists("res"), res$to[1], "XXX")` human protein ID in Uniprot.

## Notes about converting from and to gene symbols

Canonical and non-canonical symbols are associated to genes.
In some cases the same symbol (canonical or not) can be associated to
several genes. This can lead to ambiguous mapping.
The strategy to apply for such mapping depends
on the aim of the user and his knowledge about the origin of the
symbols to consider.

The complete mapping between Ensembl gene identifiers and symbols is
retrieved by using the `getBeIDSymbolTable` function.

```{r}
compMap <- getBeIdSymbolTable(
   be="Gene", source="Ens_gene", organism="rat",
   restricted=FALSE
)
dim(compMap)
head(compMap)
```

The canonical field indicates if the symbol is canonical for the identifier.
The direct field indicates if the symbol is directly associated to the
identifier or indirectly through a relationship with another identifier.

As an example, let's consider the "Snca" symbol in rat. As shown below, this
symbol is associated to 2 genes; it is canonical for one gene and
not for another. These 2 genes are also associated to other symbols.

```{r}
sncaEid <- compMap[which(compMap$symbol=="Snca"),]
sncaEid
compMap[which(compMap$id %in% sncaEid$id),]
```

The `getBeIdDescription` function described before, reports only one symbol
for each identifier. Canonical and direct symbols are prioritized. 

```{r}
getBeIdDescription(
   sncaEid$id,
   be="Gene", source="Ens_gene", organism="rat"
)
```

The `convBeIds` works differently in order to provide a mapping as exhaustive
as possible. If a symbol is associated to several input identifiers,
non-canonical associations with this symbol are removed if a canonical
association exists for any other identifier. This can lead to inconsistent
results, depending on the user input, as show below.

```{r}
convBeIds(
   sncaEid$id[1],
   from="Gene", from.source="Ens_gene", from.org="rat",
   to.source="Symbol"
)
convBeIds(
   sncaEid$id[2],
   from="Gene", from.source="Ens_gene", from.org="rat",
   to.source="Symbol"
)
convBeIds(
   sncaEid$id,
   from="Gene", from.source="Ens_gene", from.org="rat",
   to.source="Symbol"
)
```

In the example above, when the query is run for each identifier independently,
the association to the "Snca" symbol is reported for both.
However, when running the same query with the 2 identifiers at the same time,
the "Snca" symbol is reported only for one gene corresponding to the canonical
association. An additional filter can be used to only keep canonical
symbols:

```{r}
convBeIds(
   sncaEid$id,
   from="Gene", from.source="Ens_gene", from.org="rat",
   to.source="Symbol",
   canonical=TRUE
)
```

Finally, as shown below, when running the query the other way,
"Snca" is only associated to the gene for which it is the canonical symbol.

```{r}
convBeIds(
   "Snca",
   from="Gene", from.source="Symbol", from.org="rat",
   to.source="Ens_gene"
)
```

<strong>
Therefore, the user should chose the function to use with care when needing
to convert from or to gene symbol.
</strong>

<!----------------------------------------------------------------------------->
<!----------------------------------------------------------------------------->
# An interactive dictionary: Shiny module

IDs, symbols and names can be seeked without knowing the original biological
entity or probe. Then the results can be converted to the context of interest.

```{r}
searched <- searchBeid("sv2A")
toTake <- which(searched$organism=="Homo sapiens")[1]
relIds <- geneIDsToAllScopes(
  geneids=searched$GeneID[toTake],
  source=searched$Gene_source[toTake],
  organism=searched$organism[toTake]
)
```

A Shiny gadget integrating these two function has been developped and is also
available as an Rstudio addins.

```{r, eval=FALSE}
relIds <- findBeids()
```

It relies on
a Shiny module (`beidsServer()` and `beidsUI()` functions)
made to facilitate the development
of applications focused on biological entity related information.
The code below shows a minimum example of such an application.

```{r, eval=FALSE}
library(shiny)
library(BED)
library(DT)

ui <- fluidPage(
   beidsUI("be"),
   fluidRow(
      column(
         12,
         tags$br(),
         h3("Selected gene entities"),
         DTOutput("result")
      )
   )
)

server <- function(input, output){
    found <- beidsServer("be", toGene=TRUE, multiple=TRUE, tableHeight=250)
    output$result <- renderDT({
       req(found())
       toRet <- found()
       datatable(toRet, rownames=FALSE)
    })
}

shinyApp(ui = ui, server = server)
```


<!----------------------------------------------------------------------------->
<!----------------------------------------------------------------------------->
# Session info

```{r, echo=FALSE, eval=TRUE}
sessionInfo()
```