---
title: "Introduction to tidypmc"
author: "Chris Stubben"
date: '`r gsub("  ", " ", format(Sys.time(), "%B %e, %Y"))`'
output: rmarkdown::html_vignette
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Introduction to tidypmc}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "# "
)
```

The `tidypmc` package parses XML documents in the Open Access subset of [PubMed Central].
Download the full text using `pmc_xml`.

```{r epmc_ftxt}
library(tidypmc)
doc <- pmc_xml("PMC2231364")
doc
```

The package includes five functions to parse the `xml_document`.


|R function     |Description                                                                |
|:--------------|:--------------------------------------------------------------------------|
|`pmc_text`     |Split section paragraphs into sentences with full path to subsection titles|
|`pmc_caption`  |Split figure, table and supplementary material captions into sentences     |
|`pmc_table`    |Convert table nodes into a list of tibbles                                 |
|`pmc_reference`|Format references cited into a tibble                                      |
|`pmc_metadata` |List journal and article metadata in front node                            |


`pmc_text` splits paragraphs into sentences and removes any tables, figures or
formulas that are nested within paragraph tags. It also replaces superscripted
references with brackets, adds carets and underscores to other superscripts and
subscripts, and includes the full path to the subsection title.

```{r pmc_text, message=FALSE, echo=-1}
options(width=100)
library(dplyr)
txt <- pmc_text(doc)
txt
count(txt, section)
```
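
The bracket, caret and underscore markup can be illustrated with `xml2` on a
small hand-written fragment. This is only a sketch of the idea, not the code
that `pmc_text` runs internally; the XML snippet and all object names are made
up for the example.

```{r sketch_markup}
library(xml2)

# Hypothetical paragraph with a superscript, a subscript and a citation
frag <- read_xml(paste0(
  "<p>Growth in 10<sup>-3</sup> M FeCl<sub>3</sub> matched earlier results",
  "<xref ref-type='bibr' rid='B9'>9</xref>.</p>"
))

# Wrap cited reference numbers in brackets
cite_nodes <- xml_find_all(frag, "//xref[@ref-type='bibr']")
xml_text(cite_nodes) <- paste0(" [", xml_text(cite_nodes), "]")

# Prefix superscripts with a caret and subscripts with an underscore
sup_nodes <- xml_find_all(frag, "//sup")
xml_text(sup_nodes) <- paste0("^", xml_text(sup_nodes))
sub_nodes <- xml_find_all(frag, "//sub")
xml_text(sub_nodes) <- paste0("_", xml_text(sub_nodes))

marked <- xml_text(frag)
marked
```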

`pmc_caption` splits figure, table and supplementary material captions into sentences.


```{r pmc_caption, echo=-1}
options(width=100)
cap1 <- pmc_caption(doc)
filter(cap1, sentence == 1)
```

`pmc_table` formats tables by collapsing multiline headers, expanding rowspan and
colspan attributes and adding subheadings into a new column.

```{r pmc_table, echo=-1}
options(width=100)
tab1 <- pmc_table(doc)
sapply(tab1, nrow)
tab1[[1]]
```

Captions and footnotes are added as attributes.

```{r attributes}
attributes(tab1[[1]])
```


Use `collapse_rows` to join column names and cell values into a semicolon-delimited string,
which can then be searched using the functions in the next section.

```{r collapserows, echo=-1}
options(width=100)
collapse_rows(tab1, na.string="-")
```
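
The idea behind collapsing a row can be sketched in base R. The toy table, the
helper `collapse_one` and the `name=value` layout are assumptions made for
illustration; check the output above for the exact format that `collapse_rows`
returns.

```{r sketch_collapse}
# Hypothetical two-row table standing in for one element of a pmc_table list
toy <- data.frame(
  gene = c("YPO0283", "YPO0284"),
  fold_change = c("2.1", "-"),
  stringsAsFactors = FALSE
)

# Illustrative helper: join "column=value" pairs per row, skipping
# missing cells and cells equal to na.string
collapse_one <- function(df, na.string = "-") {
  vapply(seq_len(nrow(df)), function(i) {
    vals <- vapply(df[i, , drop = FALSE], as.character, character(1))
    keep <- !is.na(vals) & vals != na.string
    paste(paste0(names(df)[keep], "=", vals[keep]), collapse = "; ")
  }, character(1))
}

collapsed <- collapse_one(toy)
collapsed
```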


`pmc_reference` extracts the id, pmid, authors, year, title, journal, volume, pages,
and DOI from reference tags.


```{r pmc_ref, echo=-1}
options(width=100)
ref1 <- pmc_reference(doc)
ref1
```


Finally, `pmc_metadata` saves journal and article metadata to a list.

```{r pmc_metadata}
pmc_metadata(doc)
```


## Searching text

A few functions search within the `pmc_text` or collapsed `pmc_table` output.
`separate_text` uses the [stringr] package to extract any text matching a regular expression.


```{r separate_text, echo=-1}
options(width=100)
separate_text(txt, "[ATCGN]{5,}")
```
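
As a rough sketch of the underlying step, `stringr::str_extract_all` can pull
every match out of a character vector, keeping track of which sentence each
match came from. The example sentences and object names are invented, and
`separate_text` returns more columns than this.

```{r sketch_extract}
library(stringr)

# Made-up sentences; only the first and third contain nucleotide strings
sentences <- c(
  "The primer GATTACAGGT was used for amplification.",
  "No sequences are mentioned here.",
  "Sites ACGTN and TTGACAAT flank the promoter."
)

# One list element per sentence, holding zero or more matches
hits <- str_extract_all(sentences, "[ATCGN]{5,}")

# One row per match, with the index of the source sentence
matches <- data.frame(
  sentence = rep(seq_along(sentences), lengths(hits)),
  match = unlist(hits),
  stringsAsFactors = FALSE
)
matches
```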

A few wrappers search for predefined patterns and add an extra step to expand matched ranges. `separate_refs`
matches references within brackets using `\\[[0-9, -]+\\]` and expands ranges like `[7-9]`.

```{r separate_refs, echo=-1}
options(width=100)
x <- separate_refs(txt)
x
filter(x, id == 8)
```
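
The range expansion can be sketched in base R as follows. The helper
`expand_refs` and the example sentence are hypothetical and only show the
idea of turning `[7-9]` into 7, 8 and 9; they are not the package's own code.

```{r sketch_ref_ranges}
# Illustrative helper: turn one bracketed citation into integer ids
expand_refs <- function(x) {
  # Drop the brackets, then split on commas
  parts <- strsplit(gsub("\\[|\\]", "", x), ",")[[1]]
  ids <- lapply(trimws(parts), function(p) {
    ends <- as.integer(strsplit(p, "-")[[1]])
    # Two endpoints mean a range; otherwise a single id
    if (length(ends) == 2) seq(ends[1], ends[2]) else ends
  })
  unlist(ids)
}

# Made-up sentence with a range and a comma-separated list
s <- "Iron uptake is regulated by Fur [7-9] and other factors [12, 15]."
cited <- regmatches(s, gregexpr("\\[[0-9, -]+\\]", s))[[1]]
all_ids <- unlist(lapply(cited, expand_refs))
all_ids
```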


`separate_tags` expands locus tag ranges.


```{r locus_tags, echo=-1}
options(width=100)
collapse_rows(tab1, na.string="-") %>%
  separate_tags("YPO")
```
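
Locus tags differ from reference numbers in that they carry a prefix and
zero-padded digits. A minimal sketch of expanding one range, assuming both
endpoints use the same number of digits, is shown below; `expand_tags` is a
hypothetical helper, not part of `tidypmc`.

```{r sketch_tag_ranges}
# Illustrative helper: expand "YPO0283-YPO0285" into every tag in the range
expand_tags <- function(x, prefix) {
  # Numeric parts of the first and last tag
  nums <- regmatches(x, gregexpr("[0-9]+", x))[[1]]
  width <- nchar(nums[1])
  ids <- seq(as.integer(nums[1]), as.integer(nums[length(nums)]))
  # Restore the prefix and the zero padding
  paste0(prefix, formatC(ids, width = width, flag = "0"))
}

tags_expanded <- expand_tags("YPO0283-YPO0285", "YPO")
tags_expanded
```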


### Using `xml2`

The `pmc_*` functions use the [xml2] package for parsing and may fail in some situations, so
it helps to know how to parse an `xml_document` directly. Use `cat` and `as.character` to view
nodes returned by `xml_find_all`.

```{r catchar}
library(xml2)
refs <- xml_find_all(doc, "//ref")
refs[1]
cat(as.character(refs[1]))
```


Many journals use superscripts for cited references, so in the extracted text
the reference number runs into the preceding word, as in `results9` below.

```{r pmcdoc1, message=FALSE}
# doc1 <- pmc_xml("PMC6385181")
doc1 <- read_xml(system.file("extdata/PMC6385181.xml", package = "tidypmc"))
gsub(".*\\. ", "", xml_text(xml_find_all(doc1, "//sec/p"))[2])
```

Find the tags using `xml_find_all` and then update the nodes by adding brackets
or other text.

```{r bib}
bib <- xml_find_all(doc1, "//xref[@ref-type='bibr']")
bib[1]
xml_text(bib) <- paste0(" [", xml_text(bib), "]")
bib[1]
```

The text is now separated from the reference. Note that `pmc_text` adds the brackets by default.

```{r pmc_text2, message=FALSE}
gsub(".*\\. ", "", xml_text(xml_find_all(doc1, "//sec/p"))[2])
```


Genes, species and many other terms are often included within italic tags. You
can mark these nodes using the same approach as above, or simply list all the
names in italics and search the text or tables for matches, for example the
three-letter gene names in the text below.


```{r italicgenes}
library(tibble)
x <- xml_name(xml_find_all(doc, "//*"))
tibble(tag=x) %>%
  count(tag, sort=TRUE)
it <- xml_text(xml_find_all(doc, "//sec//p//italic"), trim=TRUE)
it2 <- tibble(italic=it) %>%
  count(italic, sort=TRUE)
it2
filter(it2, nchar(italic) == 3)
separate_text(txt, c("fur", "cys", "hmu", "ybt", "yfe", "yfu", "ymt"))
```


[stringr]: https://stringr.tidyverse.org/
[xml2]: https://github.com/r-lib/xml2
[europepmc]: https://github.com/ropensci/europepmc
[Pubmed Central]: https://europepmc.org