---
title: "Parsing Europe PMC FTP files"
author: "Chris Stubben"
date: '`r gsub("  ", " ", format(Sys.time(), "%B %e, %Y"))`'
output: rmarkdown::html_vignette
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Parse PMC FTP files}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "# "
)
```

The [Europe PMC FTP] includes 2.5 million open access articles separated into files with 10K articles each. Download and unzip a recent series of PMC ids and load into R using the `readr` package. A sample file with the first 10 articles is included in the `tidypmc` package.

```{r load}
library(readr)
pmcfile <- system.file("extdata/PMC6358576_PMC6358589.xml", package = "tidypmc")
pmc <- read_lines(pmcfile)
```

Find the start of the article nodes.

```{r startnode}
a1 <- grep("^