--- title: "An Introduction to `PGRdup` Package" author: "Aravind, J.^1^, Radhamani, J.^1^, Kalyani Srinivasan^1^, Ananda Subhash, B.^2^, and Tyagi, R. K.^1^" date: '`r Sys.Date()`' classoption: table, twoside output: pdf_document: fig_caption: no toc: no html_document: df_print: paged toc: yes header-includes: - \usepackage{fancyhdr} - \usepackage{wrapfig} - \pagestyle{fancy} - \fancyhead[LE,RO]{\slshape \rightmark} - \fancyhead[LO,RE]{An Introduction to \texttt{PGRdup} Package} - \fancyfoot[C]{\thepage} - \usepackage{hyperref} - \hypersetup{colorlinks=true} - \hypersetup{linktoc=all} - \hypersetup{linkcolor=blue} bibliography: bibliography.bibtex link-citations: yes vignette: | %\VignetteIndexEntry{Introduction} %\usepackage[utf8]{inputenc} %\VignetteEngine{knitr::rmarkdown_notangle} --- ```{r, include=FALSE, eval = FALSE} options(tinytex.verbose = TRUE) ``` ```{r, echo=FALSE} out_type <- knitr::opts_knit$get("rmarkdown.pandoc.to") r = getOption("repos") r["CRAN"] = "https://cran.rstudio.com/" #r["CRAN"] = "https://cloud.r-project.org/" #r["CRAN"] = "https://ftp.iitm.ac.in/cran/" options(repos = r) # Workaround for missing pandoc in CRAN OSX build machines out_type <- ifelse(out_type == "", "latex", out_type) # Workaround for missing pandoc in Solaris build machines out_type <- ifelse(identical (out_type, vector(mode = "logical", length = 0)), "latex", out_type) ``` ```{r, echo=FALSE} # Restrict threads for avoid elapsed threshold error in CRAN check threads_dt <- data.table::getDTthreads() threads_OMP <- Sys.getenv("OMP_THREAD_LIMIT") data.table::setDTthreads(2) Sys.setenv(`OMP_THREAD_LIMIT` = 2) ``` ```{r, results='asis', echo=FALSE} switch(out_type, html = {cat("<p>1. ICAR-National Bureau of Plant Genetic Resources, New Delhi, India.</p> <p>2. Centre for Development of Advanced Computing, Thiruvananthapuram, Kerala, India.</p>")}, latex = cat("\\begin{center} 1. ICAR-National Bureau of Plant Genetic Resources, New Delhi, India. 2. Centre for Development of Advanced Computing, Thiruvananthapuram, Kerala, India. \\end{center}" ) ) ``` \begin{center} \vspace{6pt} \hrule \end{center} ```{r, echo = FALSE} knitr::opts_chunk$set(comment = "", fig.cap= "") ``` \tableofcontents \pagebreak \begin{wrapfigure}{r}{0.35\textwidth} \vspace{1cm} \begin{center} \includegraphics[width=0.33\textwidth]{`r system.file("extdata", "PGRdup_v2.png", package = "PGRdup")`} \end{center} \vspace{-1.5cm} \end{wrapfigure}\leavevmode <img src="https://raw.githubusercontent.com/aravind-j/PGRdup/master/inst/extdata/PGRdup_v2.png" align="left" alt="logo" width="173" height = "200" style = "padding: 10px; border: none; float: right;"> ## Introduction **PGRdup** is an `R` package to facilitate the search for probable/possible duplicate accessions in Plant Genetic Resources (PGR) collections using passport databases. Primarily this package implements a workflow (Fig. 1) designed to fetch groups or sets of germplasm accessions with similar passport data particularly in fields associated with accession names within or across PGR passport databases. It offers a suite of functions for data pre-processing, creation of a searchable Key Word in Context (KWIC) index of keywords associated with accession records and the identification of probable duplicate sets by fuzzy, phonetic and semantic matching of keywords. It also has functions to enable the user to review, modify and validate the probable duplicate sets retrieved. 
The goal of this document is to introduce users to these functions and familiarise them with the workflow intended to fetch probable duplicate sets. This document assumes a basic knowledge of the `R` programming language.

The functions in this package are primarily built using the `R` packages [`data.table`](https://CRAN.R-project.org/package=data.table), [`igraph`](https://CRAN.R-project.org/package=igraph), [`stringdist`](https://CRAN.R-project.org/package=stringdist) and [`stringi`](https://CRAN.R-project.org/package=stringi).

\clearpage

\begin{center}
\includegraphics{`r system.file("extdata", "PGRdup.png", package = "PGRdup")`}
\end{center}

<img src="https://raw.githubusercontent.com/aravind-j/PGRdup/master/inst/extdata/PGRdup.png" align="center" alt="logo" width="750" height = "85" style = "border: none;">

```{r, echo=FALSE, warning=FALSE, fig.cap = NULL}
if (requireNamespace("diagram", quietly = TRUE)) {
  suppressMessages(library(diagram))
  elpos <- coordinates(pos = c(2, 2, 2, 4, 2, 2, 1, 3))
  elpos[8,1] <- elpos[5,1]
  elpos[9,1] <- 0.375
  elpos[10,1] <- elpos[6,1]
  elpos[10,2] <- 0.5
  elpos[16,1] <- elpos[16,1] - 0.05
  elpos[4,2] <- elpos[4,2] + 0.0020
  elpos[15,2] <- elpos[15,2] + 0.0200
  elpos[16,2] <- elpos[16,2] + 0.0250
  elpos[17,2] <- elpos[17,2] + 0.0250
  elpos[18,2] <- elpos[18,2] + 0.0250
  t <- as.data.frame(elpos)
  t[t$V1<0.5,][,1] <- t[t$V1<0.5,][,1] + 0.1
  t[t$V1>0.5,][,1] <- t[t$V1>0.5,][,1] - 0.1
  elpos <- as.matrix(t)
  colnames(elpos) <- NULL
  elpos1 <- elpos[c(3,5,8,11,13),]
  elpos2 <- elpos[c(5,7,9),]
  elpos3 <- elpos[c(7,9,11),]
  par(mar = c(1, 1, 1, 1))
  openplotmat()
  #text(elpos, lab = as.character(c(1:18)), cex = 2)
  textrect(c(0.5,0.49), radx = 0.35, rady = 0.49, lab = "" ,
           box.col = "ivory", shadow.size=0)
  textrect(c(0.5,0.9375), radx = 0.35, rady = 0.045, lab = "" ,
           box.col = "tan2", shadow.size=0, lcol="transparent")
  textrect(c(0.5,0.2075), radx = 0.35, rady = 0.045, lab = "" ,
           box.col = "tan2", shadow.size=0, lcol="transparent")
  textrect(c(0.5,0.6865), radx = 0.35, rady = 0.05, lab = "" ,
           box.col = "wheat2", shadow.size=0, lcol="transparent")
  textrect(c(0.5,0.3115), radx = 0.35, rady = 0.065, lab = "" ,
           box.col = "wheat2", shadow.size=0, lcol="transparent")
  textrect(c(0.5,0.49), radx = 0.35, rady = 0.49, lab = "" ,
           box.col = "transparent", shadow.size=0)
  # Arrows between consecutive workflow steps
  for (i in seq_len(dim(elpos1)[1] - 1)) {
    straightarrow(to = elpos1[i+1, ], from = elpos1[i, ], arr.type = "curved",
                  arr.lwd = 0.5, lwd = 2, arr.pos = 0.5, arr.length = 0.2,
                  arr.width = 0.15)
  }
  for (i in 2:dim(elpos2)[1]) {
    straightarrow(to = elpos2[i, ], from = elpos2[1, ], arr.type = "curved",
                  arr.lwd = 0.5, lwd = 2, arr.pos = 0.5, arr.length = 0.2,
                  arr.width = 0.15)
  }
  for (i in seq_len(dim(elpos3)[1] - 1)) {
    straightarrow(to = elpos3[3, ], from = elpos3[i, ], arr.type = "curved",
                  arr.lwd = 0.5, lwd = 2, arr.pos = 0.5, arr.length = 0.2,
                  arr.width = 0.15)
  }
  elpostitles <- elpos[c(1,2,15),]
  titles <- c("Workflow", "Core functions", "Helper functions")
  for (i in 1:dim(elpostitles)[1]){
    textellipse(elpostitles[i,], radx=nchar(titles[i])*0.01, rady=0.03,
                lab = titles[i], cex = 0.7, adj=c(0.5, 0.5),
                shadow.col = "darkolivegreen4", shadow.size = 0.003,
                box.col = "white")
  }
  elposflow1 <- elpos[c(3,5,11,13),]
  elposflow2 <- elpos[c(7,8,9),]
  flow1 <- c("Data pre-processing", "Generation of\nKWIC Index",
             "Probable duplicate\nset retrieval",
             "Set review, modification\n& validation")
  flow2 <- c("Fuzzy\nmatching", "Phonetic\nmatching", "Semantic\nmatching")
  for (i in 1:dim(elposflow1)[1]){
    textrect(elposflow1[i,], radx = 0.13, rady = 0.04, lab = flow1[i] ,
box.col = "white", shadow.col = "indianred1", shadow.size = 0.005, cex = 0.7) } for (i in 1:dim(elposflow2)[1]){ textrect(elposflow2[i,], radx = 0.05, rady = 0.04, lab = flow2[i] , box.col = "white", shadow.col = "indianred1", shadow.size = 0.005, cex = 0.7) } elposfunc <- elpos[c(4,6,10,14,16,17, 18),] elposfunc[-c(5),][,1] <- elposfunc[-c(5),][,1] - 0.06 func <- c("DataClean\nMergeKW\nMergePrefix\nMergeSuffix", "KWIC", "ProbDup", "DisProbDup\nReviewProbDup\nReconstructProbDup", "read.genesys\nValidatePrimKey\nDoubleMetaphone\nKWCounts", "ParseProbDup\nSplitProbDup\nMergeProbDup", "AddProbDup\nViewProbDup\nprint.KWIC\nprint.ProbDup") for (i in 1:dim(elposfunc)[1]){ textplain(elposfunc[i,], lab = func[i], cex = 0.7, adj=c(0, 0.5), font=1, family = "sans", col="steelblue4") } } else { print("package 'diagram' is required to generate this figure") } ``` **Fig. 1.** PGRdup workflow and associated functions ## Version History ```{r, results='asis', echo=FALSE} # Fetch release version rver <- getNamespaceVersion("PGRdup") # rver <- ifelse(test = gsub("(.\\.)(\\d+)(\\..)", "", getNamespaceVersion("PGRdup")) == "", # yes = getNamespaceVersion("PGRdup"), # no = as.vector(available.packages()["PGRdup",]["Version"])) ``` The current version of the package is `r rver`. The previous versions are as follows. **Table 1.** Version history of `PGRdup` `R` package. ```{r, echo=FALSE, message=FALSE} if (requireNamespace("RCurl", quietly = TRUE) & requireNamespace("httr", quietly = TRUE) & requireNamespace("XML", quietly = TRUE)) { pkg <- "PGRdup" link <- paste0("https://cran.r-project.org/src/contrib/Archive/", pkg, "/") if (RCurl::url.exists(link)) { # cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl") # page <- httr::GET(link, httr::config(cainfo = cafile)) page <- httr::GET(link) page <- httr::content(page, as = 'text') # page <- RCurl::getURL(link) VerHistory <- XML::readHTMLTable(page)[[1]][,2:3] colnames(VerHistory) <- c("Version", "Date") VerHistory <- VerHistory[VerHistory$Version != "Parent Directory",] VerHistory <- VerHistory[!is.na(VerHistory$Version), ] VerHistory$Date <- as.Date(VerHistory$Date) VerHistory$Version <- gsub("PGRdup_", "", VerHistory$Version) VerHistory$Version <- gsub(".tar.gz", "", VerHistory$Version) VerHistory <- VerHistory[order(VerHistory$Date), c("Version", "Date")] rownames(VerHistory) <- NULL knitr::kable(VerHistory) } else { print("Access to CRAN page for 'PGRdup' is required to generate this table.'") } } else { print("Packages 'RCurl', 'httr' and 'XML' are required to generate this table.") } ``` To know detailed history of changes use `news(package='PGRdup')`. ## Installation The package can be installed using the following functions: ```{r, eval=FALSE} # Install from CRAN install.packages('PGRdup', dependencies=TRUE) ``` Uninstalled dependencies (packages which `PGRdup` depends on *viz*- [`data.table`](https://CRAN.R-project.org/package=data.table), [`igraph`](https://CRAN.R-project.org/package=igraph), [`stringdist`](https://CRAN.R-project.org/package=stringdist) and [`stringi`](https://CRAN.R-project.org/package=stringi) are also installed because of the argument `dependencies=TRUE`. Then the package can be loaded using the function ```{r, eval=FALSE} library(PGRdup) ``` ## Data Format The package is essentially designed to operate on PGR passport data present in a [data frame object](http://google.com/#q=[R]+data.frame), with each row holding one record and columns representing the attribute fields. 
For example, consider the dataset `GN1000` supplied along with the package.

```{r}
library(PGRdup)
# Load the dataset to the environment
data(GN1000)
# Show the class of the object
class(GN1000)
# View the first few records in the data frame
head(GN1000)
```

If the passport data exists as an Excel sheet, it can first be converted to a comma-separated values (csv) file or a tab-delimited file and then easily imported into the `R` environment using the base functions `read.csv` and `read.table` respectively. Similarly, `read_csv()` and `read_tsv()` from the [`readr`](https://CRAN.R-project.org/package=readr) package can also be used. Alternatively, the package [`readxl`](https://CRAN.R-project.org/package=readxl) can be used to directly read the data from Excel. In case of large csv files, the function `fread` in the [`data.table`](https://CRAN.R-project.org/package=data.table) package can be used to rapidly load the data.
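For instance, a passport data file could be imported in any of the following ways (a minimal sketch; the file names used here are hypothetical).

```{r, eval=FALSE}
# NOTE: "passport.csv" and "passport.xlsx" are hypothetical file names
# Import a csv file using base R
passport <- read.csv("passport.csv", stringsAsFactors = FALSE)

# Import a csv file using 'readr'
passport <- readr::read_csv("passport.csv")

# Read the data directly from an Excel sheet using 'readxl'
passport <- readxl::read_excel("passport.xlsx", sheet = 1)

# Rapidly load a large csv file using 'data.table'
passport <- data.table::fread("passport.csv", data.table = FALSE)
```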
857", "U/4/47/13") x # Remove punctuation DataClean(x, fix.comma=FALSE, fix.semcol=FALSE, fix.col=FALSE, fix.bracket=FALSE, fix.punct=TRUE, fix.space=FALSE, fix.sep=FALSE, fix.leadzero=FALSE) ``` `fix.space` can be used to convert all space characters such as tab, newline, vertical tab, form feed and carriage return to spaces and finally convert multiple spaces to single space. ```{r} x <- c("RS 1", "GKSPScGb 208 PI 475855") x # Replace all space characters to space and convert multiple spaces to single space DataClean(x, fix.comma=FALSE, fix.semcol=FALSE, fix.col=FALSE, fix.bracket=FALSE, fix.punct=FALSE, fix.space=TRUE, fix.sep=FALSE, fix.leadzero=FALSE) ``` `fix.sep` can be used to merge together accession identifiers composed of alphabetic characters separated from a series of digits by a space character. ```{r} x <- c("NCAC 18078", "AH 6481", "ICG 2791") x # Merge alphabetic character separated from a series of digits by a space DataClean(x, fix.comma=FALSE, fix.semcol=FALSE, fix.col=FALSE, fix.bracket=FALSE, fix.punct=FALSE, fix.space=FALSE, fix.sep=TRUE, fix.leadzero=FALSE) ``` `fix.leadzero` can be used to remove leading zeros from accession name fields to facilitate matching to identify probable duplicates. ```{r} x <- c("EC 0016664", "EC0001690") x # Remove leading zeros DataClean(x, fix.comma=FALSE, fix.semcol=FALSE, fix.col=FALSE, fix.bracket=FALSE, fix.punct=FALSE, fix.space=FALSE, fix.sep=FALSE, fix.leadzero=TRUE) ``` This function can hence be made use of in tidying up multiple forms of messy data existing in fields associated with accession names in PGR passport databases (Table 1). ```{r} names <- c("S7-12-6", "ICG-3505", "U 4-47-18;EC 21127", "AH 6481", "RS 1", "AK 12-24", "2-5 (NRCG-4053)", "T78, Mwitunde", "ICG 3410", "#648-4 (Gwalior)", "TG4;U/4/47/13", "EC0021003") names # Clean the data DataClean(names) ``` ```{r, eval=FALSE, echo = FALSE} # DC <- data.frame(names = names, `DataClean(names)` = DataClean(names)) # knitr::kable(DC) ``` **Table 2.** Data pre-processing using `DataClean`. |**names** |**DataClean(names)** | |:------------------|:--------------------| |S7-12-6 |S7126 | |ICG-3505 |ICG3505 | |U 4-47-18;EC 21127 |U44718 EC21127 | |AH 6481 |AH6481 | |RS 1 |RS1 | |AK 12-24 |AK1224 | |2-5 (NRCG-4053) |25 NRCG4053 | |T78, Mwitunde |T78 MWITUNDE | |ICG 3410 |ICG3410 | |#648-4 (Gwalior) |6484 GWALIOR | |TG4;U/4/47/13 |TG4 U44713 | |EC0021003 |EC21003 | Several common keyword string pairs or keyword prefixes and suffixes exist in fields associated with accession names in PGR passport databases. They can be merged using the functions `MergeKW`, `MergePrefix` and `MergeSuffix` respectively. The keyword string pairs, prefixes and suffixes can be supplied as a [list](http://google.com/#q=[R]+list) or a [vector](http://google.com/#q=[R]+vector) to the argument `y` in these functions. 
```{r} names <- c("Punjab Bold", "Gujarat- Dwarf", "Nagpur.local", "SAM COL 144", "SAM COL--280", "NIZAMABAD-LOCAL", "Dark Green Mutant", "Dixie-Giant", "Georgia- Bunch", "Uganda-erect", "Small Japan", "Castle Cary", "Punjab erect", "Improved small japan", "Dark Purple") names # Merge pairs of strings y1 <- list(c("Gujarat", "Dwarf"), c("Castle", "Cary"), c("Small", "Japan"), c("Big", "Japan"), c("Mani", "Blanco"), c("Uganda", "Erect"), c("Mota", "Company")) names <- MergeKW(names, y1, delim = c("space", "dash", "period")) # Merge prefix strings y2 <- c("Light", "Small", "Improved", "Punjab", "SAM", "Dark") names <- MergePrefix(names, y2, delim = c("space", "dash", "period")) # Merge suffix strings y3 <- c("Local", "Bold", "Cary", "Mutant", "Runner", "Giant", "No.", "Bunch", "Peanut") names <- MergeSuffix(names, y3, delim = c("space", "dash", "period")) names ``` These functions can be applied over multiple columns(fields) in a data frame using the [`lapply`](http://google.com/#q=[R]+lapply) function. ```{r} # Load example dataset GN <- GN1000 # Specify as a vector the database fields to be used GNfields <- c("NationalID", "CollNo", "DonorID", "OtherID1", "OtherID2") head(GN[GNfields]) # Clean the data GN[GNfields] <- lapply(GN[GNfields], function(x) DataClean(x)) y1 <- list(c("Gujarat", "Dwarf"), c("Castle", "Cary"), c("Small", "Japan"), c("Big", "Japan"), c("Mani", "Blanco"), c("Uganda", "Erect"), c("Mota", "Company")) y2 <- c("Dark", "Light", "Small", "Improved", "Punjab", "SAM") y3 <- c("Local", "Bold", "Cary", "Mutant", "Runner", "Giant", "No.", "Bunch", "Peanut") GN[GNfields] <- lapply(GN[GNfields], function(x) MergeKW(x, y1, delim = c("space", "dash"))) GN[GNfields] <- lapply(GN[GNfields], function(x) MergePrefix(x, y2, delim = c("space", "dash"))) GN[GNfields] <- lapply(GN[GNfields], function(x) MergeSuffix(x, y3, delim = c("space", "dash"))) head(GN[GNfields]) ``` ## Generation of KWIC Index The function `KWIC` generates a Key Word in Context index [@knupffer1988european; @kfj97] from the data frame of a PGR passport database based on the fields(columns) specified in the argument `fields` along with the keyword frequencies and gives the output as a list of class `KWIC`. The first element of the vector specified in `fields` is considered as the primary key or identifier which uniquely identifies all rows in the data frame. This function fetches keywords from different fields specified, which can be subsequently used for matching to identify probable duplicates. The frequencies of the keywords retrieved can help in determining if further data pre-processing is required and also to decide whether any common keywords can be exempted from matching (Fig. 2). 
```{r}
# Load example dataset
GN <- GN1000

# Specify as a vector the database fields to be used
GNfields <- c("NationalID", "CollNo", "DonorID", "OtherID1", "OtherID2")

# Clean the data
GN[GNfields] <- lapply(GN[GNfields], function(x) DataClean(x))
y1 <- list(c("Gujarat", "Dwarf"), c("Castle", "Cary"), c("Small", "Japan"),
           c("Big", "Japan"), c("Mani", "Blanco"), c("Uganda", "Erect"),
           c("Mota", "Company"))
y2 <- c("Dark", "Light", "Small", "Improved", "Punjab", "SAM")
y3 <- c("Local", "Bold", "Cary", "Mutant", "Runner", "Giant", "No.",
        "Bunch", "Peanut")
GN[GNfields] <- lapply(GN[GNfields], function(x) MergeKW(x, y1, delim = c("space", "dash")))
GN[GNfields] <- lapply(GN[GNfields], function(x) MergePrefix(x, y2, delim = c("space", "dash")))
GN[GNfields] <- lapply(GN[GNfields], function(x) MergeSuffix(x, y3, delim = c("space", "dash")))

# Generate the KWIC index
GNKWIC <- KWIC(GN, GNfields, min.freq = 1)
class(GNKWIC)
GNKWIC

# Retrieve the KWIC index from the KWIC object
KWIC <- GNKWIC[[1]]
KWIC <- KWIC[order(KWIC$KEYWORD, decreasing = TRUE),]
head(KWIC[,c("PRIM_ID", "KWIC_L", "KWIC_KW", "KWIC_R")], n = 10)

# Retrieve the keyword frequencies from the KWIC object
KeywordFreq <- GNKWIC[[2]]
head(KeywordFreq)
```

```{r, echo=FALSE, warning=FALSE, fig.height = 4, eval=TRUE}
# Plot wordcloud of keyword frequencies
if (requireNamespace("wordcloud", quietly = TRUE)) {
  suppressMessages(library(wordcloud))
  par(mar = c(0,0,0,0))
  # pal <- brewer.pal(8,"Dark2")
  pal <- c("#1B9E77", "#D95F02", "#7570B3", "#E7298A", "#66A61E", "#E6AB02",
           "#A6761D", "#666666")
  wordcloud(words = KeywordFreq[,1], freq = KeywordFreq[,2], min.freq = 4,
            colors = pal, random.order = FALSE, rot.per = 0, fixed.asp = FALSE)
} else {
  print("package 'wordcloud' is required to generate this figure")
}
```

**Fig. 2.** Word cloud of keywords retrieved

The function will throw an error in case of duplicates or NULL values in the primary key/ID field mentioned.

```{r}
GN <- GN1000
GN[GNfields] <- lapply(GN[GNfields], function(x) DataClean(x))

# Generate dummy duplicates for illustration
GN[1001:1005,] <- GN[1:5,]

# Generate dummy NULL values for illustration
GN[1001,3] <- ""
GN[1002,3] <- ""
GN[1001:1005,]
```

```{r, eval = FALSE}
GNKWIC <- KWIC(GN, GNfields, min.freq=1)
```

```{r, echo = FALSE}
message(paste("Error in KWIC(GN, GNfields, min.freq = 1) :",
              " Primary key/ID field should be unique and not NULL",
              " Use PGRdup::ValidatePrimKey() to identify and rectify the aberrant records first",
              sep = "\n"))
```

The erroneous records can be identified using the helper function `ValidatePrimKey`.

```{r, message=FALSE}
# Validate the primary key/ID field for duplication or existence of NULL values
ValidatePrimKey(x = GN, prim.key = "NationalID")

# Remove the offending records
GN <- GN[-c(1001:1005), ]

# Validate again
ValidatePrimKey(x = GN, prim.key = "NationalID")
```

## Retrieval of Probable Duplicate Sets

Once KWIC indexes are generated, probable duplicates of germplasm accessions can be identified by fuzzy, phonetic and semantic matching of the associated keywords using the function `ProbDup`. The sets are retrieved as a list of data frames of class `ProbDup`.

Keywords that are not to be used for matching can be specified as a vector in the `excep` argument.

### Methods

The function can execute matching according to one of the following three methods as specified by the `method` argument.
**Method `"a"`** : Performs string matching of keywords in a single KWIC index to identify probable duplicates of accessions in a single PGR passport database. ```{r} # Load example dataset GN <- GN1000 # Specify as a vector the database fields to be used GNfields <- c("NationalID", "CollNo", "DonorID", "OtherID1", "OtherID2") # Clean the data GN[GNfields] <- lapply(GN[GNfields], function(x) DataClean(x)) y1 <- list(c("Gujarat", "Dwarf"), c("Castle", "Cary"), c("Small", "Japan"), c("Big", "Japan"), c("Mani", "Blanco"), c("Uganda", "Erect"), c("Mota", "Company")) y2 <- c("Dark", "Light", "Small", "Improved", "Punjab", "SAM") y3 <- c("Local", "Bold", "Cary", "Mutant", "Runner", "Giant", "No.", "Bunch", "Peanut") GN[GNfields] <- lapply(GN[GNfields], function(x) MergeKW(x, y1, delim = c("space", "dash"))) GN[GNfields] <- lapply(GN[GNfields], function(x) MergePrefix(x, y2, delim = c("space", "dash"))) GN[GNfields] <- lapply(GN[GNfields], function(x) MergeSuffix(x, y3, delim = c("space", "dash"))) # Generate the KWIC index GNKWIC <- KWIC(GN, GNfields) ``` ```{r} # Specify the exceptions as a vector exep <- c("A", "B", "BIG", "BOLD", "BUNCH", "C", "COMPANY", "CULTURE", "DARK", "E", "EARLY", "EC", "ERECT", "EXOTIC", "FLESH", "GROUNDNUT", "GUTHUKAI", "IMPROVED", "K", "KUTHUKADAL", "KUTHUKAI", "LARGE", "LIGHT", "LOCAL", "OF", "OVERO", "P", "PEANUT", "PURPLE", "R", "RED", "RUNNER", "S1", "SAM", "SMALL", "SPANISH", "TAN", "TYPE", "U", "VALENCIA", "VIRGINIA", "WHITE") # Fetch fuzzy duplicates by method 'a' GNdup <- ProbDup(kwic1 = GNKWIC, method = "a", excep = exep, fuzzy = TRUE, phonetic = FALSE, semantic = FALSE) class(GNdup) GNdup ``` ```{r} # Fetch phonetic duplicates by method 'a' GNdup <- ProbDup(kwic1 = GNKWIC, method = "a", excep = exep, fuzzy = FALSE, phonetic = TRUE, semantic = FALSE) class(GNdup) GNdup ``` 2. **Method `"b"`** : Performs string matching of keywords in the first KWIC index (query) with that of the keywords in the second index (source) to identify probable duplicates of accessions of the first PGR passport database among the accessions in the second database. 3. **Method `"c"`** : Performs string matching of keywords in two different KWIC indexes jointly to identify probable duplicates of accessions from among two PGR passport databases. 
```{r}
# Load PGR passport databases
GN1 <- GN1000[!grepl("^ICG", GN1000$DonorID), ]
GN1$DonorID <- NULL
GN2 <- GN1000[grepl("^ICG", GN1000$DonorID), ]
GN2$NationalID <- NULL

# Specify database fields to use
GN1fields <- c("NationalID", "CollNo", "OtherID1", "OtherID2")
GN2fields <- c("DonorID", "CollNo", "OtherID1", "OtherID2")

# Clean the data
GN1[GN1fields] <- lapply(GN1[GN1fields], function(x) DataClean(x))
GN2[GN2fields] <- lapply(GN2[GN2fields], function(x) DataClean(x))
y1 <- list(c("Gujarat", "Dwarf"), c("Castle", "Cary"), c("Small", "Japan"),
           c("Big", "Japan"), c("Mani", "Blanco"), c("Uganda", "Erect"),
           c("Mota", "Company"))
y2 <- c("Dark", "Light", "Small", "Improved", "Punjab", "SAM")
y3 <- c("Local", "Bold", "Cary", "Mutant", "Runner", "Giant", "No.",
        "Bunch", "Peanut")
GN1[GN1fields] <- lapply(GN1[GN1fields], function(x) MergeKW(x, y1, delim = c("space", "dash")))
GN1[GN1fields] <- lapply(GN1[GN1fields], function(x) MergePrefix(x, y2, delim = c("space", "dash")))
GN1[GN1fields] <- lapply(GN1[GN1fields], function(x) MergeSuffix(x, y3, delim = c("space", "dash")))
GN2[GN2fields] <- lapply(GN2[GN2fields], function(x) MergeKW(x, y1, delim = c("space", "dash")))
GN2[GN2fields] <- lapply(GN2[GN2fields], function(x) MergePrefix(x, y2, delim = c("space", "dash")))
GN2[GN2fields] <- lapply(GN2[GN2fields], function(x) MergeSuffix(x, y3, delim = c("space", "dash")))

# Remove duplicated DonorID records in GN2
GN2 <- GN2[!duplicated(GN2$DonorID), ]

# Generate KWIC index
GN1KWIC <- KWIC(GN1, GN1fields)
GN2KWIC <- KWIC(GN2, GN2fields)

# Specify the exceptions as a vector
exep <- c("A", "B", "BIG", "BOLD", "BUNCH", "C", "COMPANY", "CULTURE",
          "DARK", "E", "EARLY", "EC", "ERECT", "EXOTIC", "FLESH", "GROUNDNUT",
          "GUTHUKAI", "IMPROVED", "K", "KUTHUKADAL", "KUTHUKAI", "LARGE",
          "LIGHT", "LOCAL", "OF", "OVERO", "P", "PEANUT", "PURPLE", "R",
          "RED", "RUNNER", "S1", "SAM", "SMALL", "SPANISH", "TAN", "TYPE",
          "U", "VALENCIA", "VIRGINIA", "WHITE")

# Fetch fuzzy and phonetic duplicate sets by method b
GNdupb <- ProbDup(kwic1 = GN1KWIC, kwic2 = GN2KWIC, method = "b",
                  excep = exep, fuzzy = TRUE, phonetic = TRUE,
                  encoding = "primary", semantic = FALSE)
class(GNdupb)
GNdupb
```

```{r}
# Fetch fuzzy and phonetic duplicate sets by method c
GNdupc <- ProbDup(kwic1 = GN1KWIC, kwic2 = GN2KWIC, method = "c",
                  excep = exep, fuzzy = TRUE, phonetic = TRUE,
                  encoding = "primary", semantic = FALSE)
class(GNdupc)
GNdupc
```

### Matching Strategies

1. **Fuzzy matching** or approximate string matching of keywords is carried out by computing the [generalized Levenshtein (edit) distance](https://en.wikipedia.org/wiki/Levenshtein_distance) between them. This distance measure counts the number of deletions, insertions and substitutions necessary to turn one string into another.
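For illustration, the edit distance between pairs of keywords can be computed directly with the [`stringdist`](https://CRAN.R-project.org/package=stringdist) package used internally by `ProbDup` (the keyword pairs below are arbitrary examples, not drawn from `GN1000`).

```{r}
# "EC21078" -> "EC21278" needs one substitution;
# "ICG3410" -> "ICG341" needs one deletion
stringdist::stringdist(c("EC21078", "ICG3410"), c("EC21278", "ICG341"),
                       method = "lv")
```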
```{r}
# Load example dataset
GN <- GN1000

# Specify as a vector the database fields to be used
GNfields <- c("NationalID", "CollNo", "DonorID", "OtherID1", "OtherID2")

# Clean the data
GN[GNfields] <- lapply(GN[GNfields], function(x) DataClean(x))
y1 <- list(c("Gujarat", "Dwarf"), c("Castle", "Cary"), c("Small", "Japan"),
           c("Big", "Japan"), c("Mani", "Blanco"), c("Uganda", "Erect"),
           c("Mota", "Company"))
y2 <- c("Dark", "Light", "Small", "Improved", "Punjab", "SAM")
y3 <- c("Local", "Bold", "Cary", "Mutant", "Runner", "Giant", "No.",
        "Bunch", "Peanut")
GN[GNfields] <- lapply(GN[GNfields], function(x) MergeKW(x, y1, delim = c("space", "dash")))
GN[GNfields] <- lapply(GN[GNfields], function(x) MergePrefix(x, y2, delim = c("space", "dash")))
GN[GNfields] <- lapply(GN[GNfields], function(x) MergeSuffix(x, y3, delim = c("space", "dash")))

# Generate the KWIC index
GNKWIC <- KWIC(GN, GNfields)

# Specify the exceptions as a vector
exep <- c("A", "B", "BIG", "BOLD", "BUNCH", "C", "COMPANY", "CULTURE",
          "DARK", "E", "EARLY", "EC", "ERECT", "EXOTIC", "FLESH", "GROUNDNUT",
          "GUTHUKAI", "IMPROVED", "K", "KUTHUKADAL", "KUTHUKAI", "LARGE",
          "LIGHT", "LOCAL", "OF", "OVERO", "P", "PEANUT", "PURPLE", "R",
          "RED", "RUNNER", "S1", "SAM", "SMALL", "SPANISH", "TAN", "TYPE",
          "U", "VALENCIA", "VIRGINIA", "WHITE")

# Fetch fuzzy duplicates
GNdup <- ProbDup(kwic1 = GNKWIC, method = "a", excep = exep,
                 fuzzy = TRUE, max.dist = 3,
                 phonetic = FALSE, semantic = FALSE)
GNdup
```

The maximum distance to be considered for a match can be specified by the `max.dist` argument.

```{r}
GNdup <- ProbDup(kwic1 = GNKWIC, method = "a", excep = exep,
                 fuzzy = TRUE, max.dist = 1,
                 phonetic = FALSE, semantic = FALSE)
GNdup
```

Exact matching can be enforced with the argument `force.exact` set to `TRUE`. It can be used to avoid fuzzy matching when the number of alphabet characters in keywords is less than a critical value (`max.alpha`). Similarly, the value of `max.digit` can also be set according to the requirements to enforce exact matching. The default value of `Inf` avoids fuzzy matching and enforces exact matching for all keywords having any numerical characters. If `max.digit` and `max.alpha` are both set to `Inf`, exact matching will be enforced for all the keywords.

When exact matching is enforced, for keywords having both alphabet and numeric characters and with the number of alphabet characters greater than `max.alpha`, matching will be carried out separately for the alphabet and numeric characters present.

```{r}
GNdup <- ProbDup(kwic1 = GNKWIC, method = "a", excep = exep,
                 fuzzy = TRUE, force.exact = TRUE, max.alpha = 4, max.digit = Inf,
                 phonetic = FALSE, semantic = FALSE)
GNdup
```

2. **Phonetic matching** of keywords is carried out using the Double Metaphone phonetic algorithm [@p00], which is implemented as the helper function `DoubleMetaphone`, to identify keywords that have similar pronunciation.

```{r}
GNdup <- ProbDup(kwic1 = GNKWIC, method = "a", excep = exep,
                 fuzzy = FALSE, phonetic = TRUE, semantic = FALSE)
GNdup
```

Either the primary or alternate encodings can be used by specifying the `encoding` argument.

```{r}
GNdup <- ProbDup(kwic1 = GNKWIC, method = "a", excep = exep,
                 fuzzy = FALSE, phonetic = TRUE, encoding = "alternate",
                 semantic = FALSE)
GNdup
```
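The encodings being compared can also be inspected directly using the helper function `DoubleMetaphone` (a minimal illustration; the keywords below are arbitrary examples).

```{r}
# Primary and alternate Double Metaphone encodings of some example keywords
DoubleMetaphone(c("CHANDRA", "GWALIOR", "MWITUNDE"))
```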
The argument `phon.min.alpha` sets the limits for the number of alphabet characters to be present in a string for executing phonetic matching.

```{r}
GNdup <- ProbDup(kwic1 = GNKWIC, method = "a", excep = exep,
                 fuzzy = FALSE, phonetic = TRUE, encoding = "alternate",
                 phon.min.alpha = 4, semantic = FALSE)
GNdup
```

Similarly `min.enc` sets the limits for the number of characters to be present in the encoding of a keyword for phonetic matching.

```{r}
GNdup <- ProbDup(kwic1 = GNKWIC, method = "a", excep = exep,
                 fuzzy = FALSE, phonetic = TRUE, encoding = "alternate",
                 min.enc = 4, semantic = FALSE)
GNdup
```

3. **Semantic matching** matches keywords based on a list of accession name synonyms, supplied as a list of character vectors of synonym sets (synsets) to the `syn` argument. Synonyms in this context refer to interchangeable identifiers or names by which an accession is recognized. Multiple keywords specified as members of the same synset in `syn` are matched. To facilitate accurate identification of synonyms from the KWIC index, identical data standardization operations using the `Merge*` and `DataClean` functions for both the original database fields and the synset list are recommended.

```{r}
# Specify the synsets as a list
syn <- list(c("CHANDRA", "AH 114"), c("TG-1", "VIKRAM"))

# Clean the data in the synsets
syn <- lapply(syn, DataClean)

GNdup <- ProbDup(kwic1 = GNKWIC, method = "a", excep = exep,
                 fuzzy = FALSE, phonetic = FALSE,
                 semantic = TRUE, syn = syn)
GNdup
```

### Memory and Speed Constraints

As the number of keywords in the KWIC indexes increases, the memory consumption by the function also increases proportionally. This is because, for string matching, this function relies upon the creation of an _n_$\times$_m_ matrix of all possible keyword pairs for comparison, where _n_ and _m_ are the numbers of keywords in the query and source indexes respectively. This can lead to `cannot allocate vector of size...` errors in the case of large KWIC indexes, where the comparison matrix is too large to reside in memory. In such a case, the `chunksize` argument can be reduced from the default 1000 to get an appropriate size of the KWIC index keyword block to be used for searching for matches at a time. However, a smaller `chunksize` may lead to longer computation time due to the memory-time trade-off.

The progress of matching is displayed in the console as the number of keyword blocks completed out of the total number of blocks, the percentage completed and a text-based progress bar.

In case of multi-byte characters in keywords, the speed of keyword matching is further dependent upon the `useBytes` argument as described in `help("stringdist-encoding")` for the `stringdist` function in the namesake [package](https://CRAN.R-project.org/package=stringdist) [@van2014stringdist], which is used here for string matching.

The CPU time taken for retrieval of probable duplicate sets under different options for the arguments `chunksize` and `useBytes` can be visualized using the [`microbenchmark`](https://CRAN.R-project.org/package=microbenchmark) package (Fig. 3).
```{r, eval = FALSE}
# Load example dataset
GN <- GN1000

# Specify as a vector the database fields to be used
GNfields <- c("NationalID", "CollNo", "DonorID", "OtherID1", "OtherID2")

# Clean the data
GN[GNfields] <- lapply(GN[GNfields], function(x) DataClean(x))
y1 <- list(c("Gujarat", "Dwarf"), c("Castle", "Cary"), c("Small", "Japan"),
           c("Big", "Japan"), c("Mani", "Blanco"), c("Uganda", "Erect"),
           c("Mota", "Company"))
y2 <- c("Dark", "Light", "Small", "Improved", "Punjab", "SAM")
y3 <- c("Local", "Bold", "Cary", "Mutant", "Runner", "Giant", "No.",
        "Bunch", "Peanut")
GN[GNfields] <- lapply(GN[GNfields], function(x) MergeKW(x, y1, delim = c("space", "dash")))
GN[GNfields] <- lapply(GN[GNfields], function(x) MergePrefix(x, y2, delim = c("space", "dash")))
GN[GNfields] <- lapply(GN[GNfields], function(x) MergeSuffix(x, y3, delim = c("space", "dash")))

# Generate the KWIC index
GNKWIC <- KWIC(GN, GNfields)

# Specify the exceptions as a vector
exep <- c("A", "B", "BIG", "BOLD", "BUNCH", "C", "COMPANY", "CULTURE",
          "DARK", "E", "EARLY", "EC", "ERECT", "EXOTIC", "FLESH", "GROUNDNUT",
          "GUTHUKAI", "IMPROVED", "K", "KUTHUKADAL", "KUTHUKAI", "LARGE",
          "LIGHT", "LOCAL", "OF", "OVERO", "P", "PEANUT", "PURPLE", "R",
          "RED", "RUNNER", "S1", "SAM", "SMALL", "SPANISH", "TAN", "TYPE",
          "U", "VALENCIA", "VIRGINIA", "WHITE")

# Specify the synsets as a list
syn <- list(c("CHANDRA", "AH 114"), c("TG-1", "VIKRAM"))
syn <- lapply(syn, DataClean)
```

```{r, eval=TRUE, echo=FALSE}
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  mbm <- TRUE
} else {
  mbm <- FALSE
}
```

```{r, eval=FALSE, echo=TRUE}
timings <- microbenchmark::microbenchmark(
  # Fetch duplicate sets with default chunksize
  t1 = ProbDup(kwic1 = GNKWIC, method = "a", excep = exep, chunksize = 1000,
               useBytes = TRUE, fuzzy = TRUE, phonetic = TRUE,
               semantic = TRUE, syn = syn),
  # Fetch duplicate sets with chunksize 2000
  t2 = ProbDup(kwic1 = GNKWIC, method = "a", excep = exep, chunksize = 2000,
               useBytes = TRUE, fuzzy = TRUE, phonetic = TRUE,
               semantic = TRUE, syn = syn),
  # Fetch duplicate sets with chunksize 500
  t3 = ProbDup(kwic1 = GNKWIC, method = "a", excep = exep, chunksize = 500,
               useBytes = TRUE, fuzzy = TRUE, phonetic = TRUE,
               semantic = TRUE, syn = syn),
  # Fetch duplicate sets with useBytes = FALSE
  t4 = ProbDup(kwic1 = GNKWIC, method = "a", excep = exep, chunksize = 1000,
               useBytes = FALSE, fuzzy = TRUE, phonetic = TRUE,
               semantic = TRUE, syn = syn),
  times = 10)
```

```{r, eval = FALSE, echo=TRUE}
plot(timings, col = c("#1B9E77", "#D95F02", "#7570B3", "#E7298A"),
     xlab = "Expression", ylab = "Time")
legend("topright", c("t1 : chunksize = 1000,\n useBytes = T (default)\n",
                     "t2 : chunksize = 2000,\n useBytes = T\n",
                     "t3 : chunksize = 500,\n useBytes = T\n",
                     "t4 : chunksize = 1000,\n useBytes = F\n"),
       bty = "n", cex = 0.6)
```
```{r, eval=TRUE, echo=FALSE}
timings <- data.frame(expr = structure(c(2L, 1L, 1L, 3L, 1L, 2L, 3L, 4L, 3L,
                                          3L, 4L, 2L, 2L, 1L, 2L, 4L, 4L, 4L,
                                          1L, 4L, 3L, 2L, 1L, 1L, 3L, 2L, 1L,
                                          4L, 2L, 3L, 1L, 2L, 4L, 3L, 4L, 3L,
                                          2L, 3L, 4L, 1L),
                                        .Label = c("t1", "t2", "t3", "t4"),
                                        class = "factor"),
                      time = c(1408220600, 1363532801, 1400142541, 3000887552,
                               1399254608, 1224892145, 3013227764, 1328487587,
                               2916351114, 2950914731, 1353476659, 1509574164,
                               1278119453, 1413596470, 1303064433, 1350284131,
                               1467990710, 1444774083, 1375139405, 1396005444,
                               3105281223, 1224878080, 1360539070, 1421216118,
                               3245590555, 1367002429, 1877008017, 1622301327,
                               1251792020, 2949414824, 1362527416, 1263918992,
                               1415981648, 2955363516, 1341522638, 2957571945,
                               1402744001, 3177392306, 1349590813, 1565604772))
boxplot(time ~ expr, timings, xlab = "Expression", ylab = "Time",
        col = c("#1B9E77", "#D95F02", "#7570B3", "#E7298A"))
legend("topright", c("t1 : chunksize = 1000,\n useBytes = T (default)\n",
                     "t2 : chunksize = 2000,\n useBytes = T\n",
                     "t3 : chunksize = 500,\n useBytes = T\n",
                     "t4 : chunksize = 1000,\n useBytes = F\n"),
       bty = "n", cex = 0.6)
```

**Fig. 3.** CPU time with different `ProbDup` arguments estimated using the `microbenchmark` package.

## Set Review, Modification and Validation

The initially retrieved sets may intersect with each other because there might be accessions which occur in more than one duplicate set. Disjoint sets can be generated by merging such overlapping sets using the function `DisProbDup`.

Disjoint sets are retrieved either individually for each type of probable duplicate set or by considering all types of sets simultaneously. In the case of the latter, only the disjoint sets obtained by considering all the types together are returned in the output as an additional data frame `DisjointDuplicates` in an object of class `ProbDup`.

```{r, results="hide", message = FALSE}
# Load example dataset
GN <- GN1000

# Specify as a vector the database fields to be used
GNfields <- c("NationalID", "CollNo", "DonorID", "OtherID1", "OtherID2")

# Clean the data
GN[GNfields] <- lapply(GN[GNfields], function(x) DataClean(x))
y1 <- list(c("Gujarat", "Dwarf"), c("Castle", "Cary"), c("Small", "Japan"),
           c("Big", "Japan"), c("Mani", "Blanco"), c("Uganda", "Erect"),
           c("Mota", "Company"))
y2 <- c("Dark", "Light", "Small", "Improved", "Punjab", "SAM")
y3 <- c("Local", "Bold", "Cary", "Mutant", "Runner", "Giant", "No.",
        "Bunch", "Peanut")
GN[GNfields] <- lapply(GN[GNfields], function(x) MergeKW(x, y1, delim = c("space", "dash")))
GN[GNfields] <- lapply(GN[GNfields], function(x) MergePrefix(x, y2, delim = c("space", "dash")))
GN[GNfields] <- lapply(GN[GNfields], function(x) MergeSuffix(x, y3, delim = c("space", "dash")))

# Generate KWIC index
GNKWIC <- KWIC(GN, GNfields)

# Specify the exceptions as a vector
exep <- c("A", "B", "BIG", "BOLD", "BUNCH", "C", "COMPANY", "CULTURE",
          "DARK", "E", "EARLY", "EC", "ERECT", "EXOTIC", "FLESH", "GROUNDNUT",
          "GUTHUKAI", "IMPROVED", "K", "KUTHUKADAL", "KUTHUKAI", "LARGE",
          "LIGHT", "LOCAL", "OF", "OVERO", "P", "PEANUT", "PURPLE", "R",
          "RED", "RUNNER", "S1", "SAM", "SMALL", "SPANISH", "TAN", "TYPE",
          "U", "VALENCIA", "VIRGINIA", "WHITE")

# Specify the synsets as a list
syn <- list(c("CHANDRA", "AH114"), c("TG1", "VIKRAM"))

# Fetch probable duplicate sets
GNdup <- ProbDup(kwic1 = GNKWIC, method = "a", excep = exep,
                 fuzzy = TRUE, phonetic = TRUE, encoding = "primary",
                 semantic = TRUE, syn = syn)
```

```{r}
# Initial number of sets
GNdup

# Get disjoint probable duplicate sets of each kind
disGNdup1 <- DisProbDup(GNdup, combine = NULL)

# Number of sets after combining intersecting sets
disGNdup1

# Get disjoint probable duplicate sets combining all the kinds of sets
disGNdup2 <- DisProbDup(GNdup, combine = c("F", "P", "S"))

# Number of sets after combining intersecting sets
disGNdup2
```

Once duplicate sets are retrieved, they can be validated by manual clerical review, comparing them with the original PGR passport database(s) using the `ReviewProbDup` function. This function helps to retrieve the PGR passport information associated with fuzzy, phonetic or semantic probable duplicate sets in an object of class `ProbDup` from the original database(s) from which they were identified.
The original information of the accessions comprising a set, which has not been subjected to data standardization, can be compared during manual clerical review for the validation of the set. By default, only the fields(columns) which were used initially for the creation of the KWIC indexes using the `KWIC` function are retrieved. Additional fields(columns), if necessary, can be specified using the `extra.db1` and `extra.db2` arguments.

When any primary ID/key records in the fuzzy, phonetic or semantic duplicate sets are found to be missing from the original databases specified in `db1` and `db2`, they are ignored and only the matching records are considered for retrieving the information, with a warning. This may be due to data standardization of the primary ID/key field using the function `DataClean` before creation of the KWIC index and subsequent identification of probable duplicate sets. In such a case, it is recommended to use an identical data standardization operation on the primary ID/key field of the databases specified in `db1` and `db2` before running this function.

With `R` <= v3.0.2, due to copying of named objects by `list()`, an `Invalid .internal.selfref detected and fixed...` warning can appear, which may be safely ignored.

The output data frame can be subjected to clerical review either after exporting into an external spreadsheet using the `write.csv` function or by using the `edit` function.

The column `DEL` can be used to indicate whether a record has to be deleted from a set or not. `Y` indicates "Yes", and the default `N` indicates "No".

The column `SPLIT` similarly can be used to indicate whether a record in a set has to be branched into a new set. A set of identical integers in this column other than the default `0` can be used to indicate that they are to be removed and assembled into a new set.

```{r}
# Load the original database and clean the Primary ID/key field
GN1000 <- GN1000
GN1000$NationalID <- DataClean(GN1000$NationalID)

# Get the data frame for reviewing the duplicate sets identified
RevGNdup <- ReviewProbDup(pdup = disGNdup1, db1 = GN1000,
                          extra.db1 = c("SourceCountry", "TransferYear"),
                          max.count = 30, insert.blanks = TRUE)
```

```{r}
head(RevGNdup)
```

```{r, eval=FALSE}
# Examine and review the duplicate sets using edit function
RevGNdup <- edit(RevGNdup)

# OR examine and review the duplicate sets after exporting them as a csv file
write.csv(file="Duplicate sets for review.csv", x=RevGNdup)
```

After clerical review, the data frame created using the function `ReviewProbDup` from an object of class `ProbDup` can be reconstituted back into an object of the same class using the function `ReconstructProbDup`. The instructions for modifying the sets, entered in the appropriate format in the columns `DEL` and `SPLIT` during clerical review, are taken into account for reconstituting the probable duplicate sets. Any records with `Y` in column `DEL` are deleted and records with identical integers in the column `SPLIT` other than the default `0` are reassembled into a new set.

```{r}
# The original set data
subset(RevGNdup, SET_NO==13 & TYPE=="P", select= c(IDKW, DEL, SPLIT))

# Make dummy changes to the set for illustration
RevGNdup[c(113, 116), 6] <- "Y"
RevGNdup[c(111, 114), 7] <- 1
RevGNdup[c(112, 115, 117), 7] <- 2

# The instruction for modification in columns DEL and SPLIT
subset(RevGNdup, SET_NO==13 & TYPE=="P", select= c(IDKW, DEL, SPLIT))

# Reconstruct ProbDup object
GNdup2 <- ReconstructProbDup(RevGNdup)

# Initial no. of sets
disGNdup1
# No. of sets after modifications
GNdup2
```

## Other Functions

The `ProbDup` object is a list of data frames of different kinds of probable duplicate sets _viz_- `FuzzyDuplicates`, `PhoneticDuplicates`, `SemanticDuplicates` and `DisjointDuplicates`. Each row of the component data frame will have information of a set, the type of set, the set members as well as the keywords based on which the set was formed. This data can be reshaped into long form using the function `ParseProbDup`, which transforms a `ProbDup` object into a single data frame.

```{r}
# Convert 'ProbDup' object to a long form data frame of sets
GNdupParsed <- ParseProbDup(GNdup)
head(GNdupParsed)
```

The prefix `K*` here indicates the KWIC index of origin. This is useful in ascertaining the database of origin of the accessions when method `"b"` or `"c"` was used to create the input `ProbDup` object.

Once the sets are reviewed and modified, the validated set data fields from the `ProbDup` object can be added to the original PGR passport database using the function `AddProbDup`. The associated data fields such as `SET_NO`, `ID` and `IDKW` are added based on the `PRIM_ID` field(column).

```{r}
# Loading original database
GN2 <- GN1000

# Add the duplicates set data to the original database
GNwithdup <- AddProbDup(pdup = GNdup, db = GN2, addto = "I")
```

In case more than one KWIC index was used to generate the object of class `ProbDup`, the argument `addto` can be used to specify to which database the data fields are to be added. The default `"I"` indicates the database from which the first KWIC index was created and `"II"` indicates the database from which the second index was created.

The function `SplitProbDup` can be used to split an object of class `ProbDup` into two on the basis of set counts. This is useful for reviewing separately the sets with larger set counts.
```{r, results="hide", message = FALSE} # Load PGR passport database GN <- GN1000 # Specify as a vector the database fields to be used GNfields <- c("NationalID", "CollNo", "DonorID", "OtherID1", "OtherID2") # Clean the data GN[GNfields] <- lapply(GN[GNfields], function(x) DataClean(x)) y1 <- list(c("Gujarat", "Dwarf"), c("Castle", "Cary"), c("Small", "Japan"), c("Big", "Japan"), c("Mani", "Blanco"), c("Uganda", "Erect"), c("Mota", "Company")) y2 <- c("Dark", "Light", "Small", "Improved", "Punjab", "SAM") y3 <- c("Local", "Bold", "Cary", "Mutant", "Runner", "Giant", "No.", "Bunch", "Peanut") GN[GNfields] <- lapply(GN[GNfields], function(x) MergeKW(x, y1, delim = c("space", "dash"))) GN[GNfields] <- lapply(GN[GNfields], function(x) MergePrefix(x, y2, delim = c("space", "dash"))) GN[GNfields] <- lapply(GN[GNfields], function(x) MergeSuffix(x, y3, delim = c("space", "dash"))) # Generate KWIC index GNKWIC <- KWIC(GN, GNfields) # Specify the exceptions as a vector exep <- c("A", "B", "BIG", "BOLD", "BUNCH", "C", "COMPANY", "CULTURE", "DARK", "E", "EARLY", "EC", "ERECT", "EXOTIC", "FLESH", "GROUNDNUT", "GUTHUKAI", "IMPROVED", "K", "KUTHUKADAL", "KUTHUKAI", "LARGE", "LIGHT", "LOCAL", "OF", "OVERO", "P", "PEANUT", "PURPLE", "R", "RED", "RUNNER", "S1", "SAM", "SMALL", "SPANISH", "TAN", "TYPE", "U", "VALENCIA", "VIRGINIA", "WHITE") # Specify the synsets as a list syn <- list(c("CHANDRA", "AH114"), c("TG1", "VIKRAM")) # Fetch probable duplicate sets GNdup <- ProbDup(kwic1 = GNKWIC, method = "a", excep = exep, fuzzy = TRUE, phonetic = TRUE, encoding = "primary", semantic = TRUE, syn = syn) ``` ```{r} # Split the probable duplicate sets GNdupSplit <- SplitProbDup(GNdup, splitat = c(10, 10, 10)) GNdupSplit[[1]] GNdupSplit[[3]] ``` Alternatively, two different `ProbDup` objects can be merged together using the function `MergeProbDup`. ```{r} GNdupMerged <- MergeProbDup(GNdupSplit[[1]], GNdupSplit[[3]]) GNdupMerged ``` The summary of accessions according to a grouping factor field(column) in the original database(s) within the probable duplicate sets retrieved in a `ProbDup` object can be visualized by the `ViewProbDup` function. The resulting plot can be used to examine the extent of probable duplication within and between groups of accessions records. 
```{r}
# Load PGR passport databases
GN1 <- GN1000[!grepl("^ICG", GN1000$DonorID), ]
GN1$DonorID <- NULL
GN2 <- GN1000[grepl("^ICG", GN1000$DonorID), ]
GN2 <- GN2[!grepl("S", GN2$DonorID), ]
GN2$NationalID <- NULL

GN1$SourceCountry <- toupper(GN1$SourceCountry)
GN2$SourceCountry <- toupper(GN2$SourceCountry)

GN1$SourceCountry <- gsub("UNITED STATES OF AMERICA", "USA", GN1$SourceCountry)
GN2$SourceCountry <- gsub("UNITED STATES OF AMERICA", "USA", GN2$SourceCountry)

# Specify as a vector the database fields to be used
GN1fields <- c("NationalID", "CollNo", "OtherID1", "OtherID2")
GN2fields <- c("DonorID", "CollNo", "OtherID1", "OtherID2")

# Clean the data
GN1[GN1fields] <- lapply(GN1[GN1fields], function(x) DataClean(x))
GN2[GN2fields] <- lapply(GN2[GN2fields], function(x) DataClean(x))
y1 <- list(c("Gujarat", "Dwarf"), c("Castle", "Cary"), c("Small", "Japan"),
           c("Big", "Japan"), c("Mani", "Blanco"), c("Uganda", "Erect"),
           c("Mota", "Company"))
y2 <- c("Dark", "Light", "Small", "Improved", "Punjab", "SAM")
y3 <- c("Local", "Bold", "Cary", "Mutant", "Runner", "Giant", "No.",
        "Bunch", "Peanut")
GN1[GN1fields] <- lapply(GN1[GN1fields], function(x) MergeKW(x, y1, delim = c("space", "dash")))
GN1[GN1fields] <- lapply(GN1[GN1fields], function(x) MergePrefix(x, y2, delim = c("space", "dash")))
GN1[GN1fields] <- lapply(GN1[GN1fields], function(x) MergeSuffix(x, y3, delim = c("space", "dash")))
GN2[GN2fields] <- lapply(GN2[GN2fields], function(x) MergeKW(x, y1, delim = c("space", "dash")))
GN2[GN2fields] <- lapply(GN2[GN2fields], function(x) MergePrefix(x, y2, delim = c("space", "dash")))
GN2[GN2fields] <- lapply(GN2[GN2fields], function(x) MergeSuffix(x, y3, delim = c("space", "dash")))

# Remove duplicated DonorID records in GN2
GN2 <- GN2[!duplicated(GN2$DonorID), ]

# Generate KWIC index
GN1KWIC <- KWIC(GN1, GN1fields)
GN2KWIC <- KWIC(GN2, GN2fields)

# Specify the exceptions as a vector
exep <- c("A", "B", "BIG", "BOLD", "BUNCH", "C", "COMPANY", "CULTURE",
          "DARK", "E", "EARLY", "EC", "ERECT", "EXOTIC", "FLESH", "GROUNDNUT",
          "GUTHUKAI", "IMPROVED", "K", "KUTHUKADAL", "KUTHUKAI", "LARGE",
          "LIGHT", "LOCAL", "OF", "OVERO", "P", "PEANUT", "PURPLE", "R",
          "RED", "RUNNER", "S1", "SAM", "SMALL", "SPANISH", "TAN", "TYPE",
          "U", "VALENCIA", "VIRGINIA", "WHITE")

# Specify the synsets as a list
syn <- list(c("CHANDRA", "AH114"), c("TG1", "VIKRAM"))
```

```{r}
GNdupc <- ProbDup(kwic1 = GN1KWIC, kwic2 = GN2KWIC, method = "c",
                  excep = exep, fuzzy = TRUE, phonetic = TRUE,
                  encoding = "primary", semantic = TRUE, syn = syn)

# Get the summary data.frames and Grob
GNdupcView <- ViewProbDup(GNdupc, GN1, GN2, "SourceCountry", "SourceCountry",
                          max.count = 30, select = c("INDIA", "USA"),
                          order = "type", main = "Groundnut Probable Duplicates")
```

```{r, eval = FALSE}
# View the summary data.frames
GNdupcView[[1]]
GNdupcView[[2]]
```

```{r, fig.height = 7, fig.width=10}
# Plot the summary visualization
library(gridExtra)
grid.arrange(GNdupcView[[3]])
```

**Fig. 4.** Summary visualization of groundnut probable duplicate sets retrieved according to the `SourceCountry` field.

The function `KWCounts` can be used to compute the keyword counts from the PGR passport database fields(columns) which are considered for the identification of probable duplicates. These keyword counts can give a rough indication of the completeness of the data in such fields (Fig. 5).
```{r, fig.height = 8}
# Compute the keyword counts for the whole data
GNKWCounts <- KWCounts(GN, GNfields, exep)

# Compute the keyword counts for 'duplicated' records
GND <- ParseProbDup(disGNdup2, Inf, FALSE)$PRIM_ID
GNDKWCounts <- KWCounts(GN[GN$NationalID %in% GND, ], GNfields, exep)

# Compute the keyword counts for 'unique' records
GNUKWCounts <- KWCounts(GN[!GN$NationalID %in% GND, ], GNfields, exep)

# Plot the counts as barplot
par(mfrow = c(3,1))

bp1 <- barplot(table(GNKWCounts$COUNT), xlab = "Word count",
               ylab = "Frequency", main = "A", col = "#1B9E77")
text(bp1, 0, table(GNKWCounts$COUNT), cex = 1, pos = 3)
legend("topright", paste("No. of records =", nrow(GN)), bty = "n")

bp2 <- barplot(table(GNDKWCounts$COUNT), xlab = "Word count",
               ylab = "Frequency", main = "B", col = "#D95F02")
text(bp2, 0, table(GNDKWCounts$COUNT), cex = 1, pos = 3)
legend("topright", paste("No. of records =", nrow(GN[GN$NationalID %in% GND, ])),
       bty = "n")

bp3 <- barplot(table(GNUKWCounts$COUNT), xlab = "Word count",
               ylab = "Frequency", main = "C", col = "#7570B3")
text(bp3, 0, table(GNUKWCounts$COUNT), cex = 1, pos = 3)
legend("topright", paste("No. of records =", nrow(GN[!GN$NationalID %in% GND, ])),
       bty = "n")
```

**Fig. 5.** The keyword counts in the database fields considered for identification of probable duplicates for **A.** the entire `GN1000` dataset, **B.** the probable duplicate records alone and **C.** the unique records alone.

## Citing `PGRdup`

```{r, eval = FALSE}
citation("PGRdup")
```

```{r, echo = FALSE, collapse = TRUE}
detach("package:PGRdup", unload=TRUE)
suppressPackageStartupMessages(library(PGRdup))
cit <- citation("PGRdup")
# yr <- format(Sys.Date(), "%Y")
# cit[1]$year <- yr
# oc <- class(cit)
#
# cit <- unclass(cit)
# attr(cit[[1]],"textVersion") <- gsub("\\(\\)",
#                                      paste("\\(", yr, "\\)", sep = ""),
#                                      attr(cit[[1]],"textVersion"))
# class(cit) <- oc
cit
```

## Session Info

```{r}
sessionInfo()
```

## References

```{r, echo=FALSE}
data.table::setDTthreads(threads_dt)
Sys.setenv(`OMP_THREAD_LIMIT` = threads_OMP)
```