---
title: "RFLPtools: Analysis of DNA fragment samples and standalone BLAST report files"
author: "Matthias Kohl"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
bibliography: RFLPtools.bib
vignette: >
  %\VignetteIndexEntry{RFLPtools}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{utf8}
---


## Introduction  
Package RFLPtools aims at 

- the detection of similar band patterns based on DNA fingerprint fragment sizes 
(i.e. derived from RFLP-analysis)

- the analysis of standalone BLAST report files (i.e. DNA sequence analysis)

see also [@Flessa13].

In this short vignette we describe and demonstrate the available functions. 

We start with loading the package.

```{r}
library(RFLPtools)
```


## RFLP data
As a first step we can perform a QC of the samples using function RFLPqc.
This is possible, if the first band corresponds to the total length of the fragment.
The QC consists of a comparison of the length of the first band with the sum of 
the lengths of the remaining bands for each sample. If the sum is smaller than 
QC.lo times the length of the first band or larger than QC.up times the length 
of the first band, respectively, a text message is printed.

```{r}
Dir <- system.file("extdata", package = "RFLPtools") # input directory 
filename <- file.path(Dir, "AZ091016_report.txt")
RFLP1 <- read.rflp(file = filename)
str(RFLP1)

RFLP2 <- RFLPqc(RFLP1, rm.band1 = FALSE) # identical to RFLP1
identical(RFLP1, RFLP2)

RFLP3 <- RFLPqc(RFLP1)
str(RFLP3)

RFLP4 <- RFLPqc(RFLP1, rm.band1 = TRUE, QC.rm = TRUE)
str(RFLP4)
```

We load some example data and compute the Euclidean distance ...
```{r}
data(RFLPdata)
res <- RFLPdist(RFLPdata)
names(res) ## number of bands
str(res$"6")
```

Of course, we can also use other well-known distances implemented in function dist.
```{r}
res1 <- RFLPdist(RFLPdata, distfun = function(x) dist(x, method = "manhattan"))
res2 <- RFLPdist(RFLPdata, distfun = function(x) dist(x, method = "maximum"))
str(res[[1]])
str(res1[[1]])
str(res2[[1]])
```

Correlation distances can for instance be applied by using function corDist of 
package MKomics.
```{r}
library(MKomics)
res3 <- RFLPdist(RFLPdata, distfun = corDist)
str(res3$"9")
```

As we obtain a list of dist objects we can easily perform hierarchical clustering.
```{r, fig.height=7, fig.width=7}
plot(hclust(res[[1]]), main = "Euclidean distance")
plot(hclust(res1[[1]]), main = "Manhattan distance")
plot(hclust(res2[[1]]), main = "Maximum distance")
plot(hclust(res3[[1]]), main = "Pearson correlation distance")
```

For splitting the dendrogram into clusters we apply function cutree.
```{r}
clust4bd <- hclust(res[[2]])
cgroups50 <- cutree(clust4bd, h=50)
cgroups50
```

Another possibility to display the similarity of the samples are so-called 
(dis-)similarity matrices that can for instance be generated by function simPlot 
of package \pkg{MKomics}.
```{r, fig.height=7, fig.width=7}
library(RColorBrewer)
library(MKomics)
myCol <- colorRampPalette(brewer.pal(8, "RdYlGn"))(128)
ord <- order.dendrogram(as.dendrogram(hclust(res[[1]])))
temp <- as.matrix(res[[1]])
simPlot(temp[ord,ord], col = rev(myCol), minVal = 0, 
        labels = colnames(temp), title = "(Dis-)Similarity Plot")
```

We can also use function levelplot of package lattice to display (dis-)similarity
matrices.
```{r, fig.height=7, fig.width=7}
library(lattice)
print(levelplot(temp[ord,ord], col.regions = rev(myCol),
          at = do.breaks(c(0, max(temp)), 128),
          xlab = "", ylab = "",
          ## Rotate labels of x-axis
          scales = list(x = list(rot = 90)),
          main = "(Dis-)Similarity Plot"))
```

If some bands may be missing, we can apply function RFLPdist2 specifying the
number of missing bands we expect.
```{r}
## Euclidean distance
data(RFLPdata)
data(RFLPref)
nrBands(RFLPdata)
res0 <- RFLPdist(RFLPdata, nrBands = 9)
res1 <- RFLPdist2(RFLPdata, nrBands = 9, nrMissing = 1)
res2 <- RFLPdist2(RFLPdata, nrBands = 9, nrMissing = 2)
res3 <- RFLPdist2(RFLPdata, nrBands = 9, nrMissing = 3)
```

Again hierarchical clustering of the results is straight forward.
```{r, fig.height=7, fig.width=7}
plot(hclust(res0), main = "0 bands missing")
plot(hclust(res1), main = "1 band missing")
plot(hclust(res2), main = "2 bands missing")
plot(hclust(res3), main = "3 bands missing")
```

There are also ways to handle low-bp bands, which are likely to be absent in some
of the samples. First, one can use function RFLPlod to remove all bands 
below a given threshold LOD before further analyses. For displaying the
data we use function RFLPplot.
```{r, fig.height=7, fig.width=7}
RFLPdata.lod <- RFLPlod(RFLPdata, LOD = 60)
par(mfrow = c(1, 2))
RFLPplot(RFLPdata, nrBands = 4, ylim = c(40, 670))
RFLPplot(RFLPdata.lod, nrBands = 4, ylim = c(40, 670))
title(sub = "After applying RFLPlod")
```

In addition, one can specify LOD in a call to function RFLPdist2. 
If LOD is specified, it is assumed that missing bands only occur for 
molecular weights smaller than LOD.
```{r, fig.height=7, fig.width=7}
res0 <- RFLPdist(RFLPdata, nrBands = 4)
res1.lod <- RFLPdist2(RFLPdata, nrBands = 4, nrMissing = 1, LOD = 60)
ord <- order.dendrogram(as.dendrogram(hclust(res1.lod)))
temp <- as.matrix(res1.lod)
simPlot(temp[ord,ord], col = rev(myCol), minVal = 0, 
        labels = colnames(temp), 
        title = "(Dis-)Similarity Plot\n1 band missing below LOD")
```

We can also make a comparison to reference data.
```{r, fig.height=7, fig.width=7}
RFLPrefplot(RFLPdata, RFLPref, nrBands = 9, cex.axis = 0.8)
```


## BLAST data
To analyze tabular report files generated with standalone BLAST from NCBI 
(see https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download), 
a function for reading the BLAST report files is included (read.blast). 

Possible steps are:

1) Install NCBI BLAST
2) Generate and import database(s)
3) Apply BLAST with options outfmt and out; e.g.

> blastn -query Testquery -db Testdatabase -outfmt 6 -out out.txt

or

> blastn -query Testquery -db Testdatabase -outfmt 10 -out out.csv

One could also call BLAST from inside R by using function system

```{r, eval=FALSE}
system("blastn -query Testquery -db Testdatabase -outfmt 6 -out out.txt")
```

4) Read in the results

```{r, eval=FALSE}
## -outfmt 6
test.res <- read.blast(file = "out.txt")
```

or

```{r, eval=FALSE}
## -outfmt 10
test.res <- read.blast(file = "out.csv", sep = ",")
```


We now read in a example file included in folder extdata of our package.
```{r}
Dir <- system.file("extdata", package = "RFLPtools") # input directory 
filename <- file.path(Dir, "BLASTexample.txt")
BLAST1 <- read.blast(file = filename)
str(BLAST1)
```

This example BLAST data is also available as loadable example data.
```{r}
data(BLASTdata)
```

The loaded data.frame can be used to compute similarities between the 
BLASTed sequences via function simMatrix. This function includes the following 
steps:

1) the length of each sequence (LS) comprised in the input data file is extracted.

2) if there is more than one comparison for one sequence including different parts of 
the respective sequence, that one with maximum base length is chosen.

3) the number of matching bases (mB) is calculated by multiplying two variables 
given in the BLAST output: the identity between sequences (%) and the number of nucleotides 
divided by 100. 

4) the resulting value is rounded to the next integer. 

5) the similarity is calculated by dividing mB by LS and saved in the corresponding 
similarity matrix. 

If the similarity of a combination is not shown in the BLAST report file (because 
the similarity was lower than 70%), this comparison is included in the similarity 
matrix with the result zero.
```{r}
res <- simMatrix(BLASTdata)
```

Optionally, the range of sequence length can be specified to exclude sequences, 
which were too short or too long, respectively.
```{r}
res1 <- simMatrix(BLASTdata, sequence.range = TRUE, Min = 100, Max = 450)
res2 <- simMatrix(BLASTdata, sequence.range = TRUE, Min = 500)
```

We display the similarity matrix.
```{r, fig.height=7, fig.width=7}
library(MKomics)
simPlot(res2, col = myCol, minVal = 0, cex.axis = 0.5,
        labels = colnames(res2), title = "(Dis-)Similarity Plot")
```

Alternatively, we can again use function levelplot of package lattice.
```{r, fig.height=7, fig.width=7}
library(lattice)
txt <- trellis.par.get("add.text")
txt$cex <- 0.5
trellis.par.set("add.text" = txt)
myCol <- colorRampPalette(brewer.pal(8, "RdYlGn"))(128)
print(levelplot(res2, col.regions = myCol,
          at = do.breaks(c(0, max(res2)), 128),
          xlab = "", ylab = "", 
          ## Rotate labels of x axis
          scales = list(x = list(rot = 90)),
          main = "(Dis-)Similarity Plot"))
```

We can also convert the similarity matrix to an object of S3 class "dist".
```{r}
res.d <- sim2dist(res2)
```

After the conversion we can for instance perform hierarchical clustering.
```{r, fig.height=7, fig.width=7}
## hierarchical clustering
plot(hclust(res.d), cex = 0.7)
```


## sessionInfo
```{r}
sessionInfo()
```


## References