title: "Introduction to Silly Putty"
author: "Dwayne Tally, Zachary B. Abrams, and Kevin R. Coombes"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_document:
    theme: journal
    highlight: kate
vignette: >
  %\VignetteIndexEntry{Introduction to SillyPutty}
  %\VignetteKeywords{SillyPutty,Clustering Algorithm,Clusters,Graphics}
  %\VignetteDepends{SillyPutty}
  %\VignettePackage{SillyPutty}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r opts, echo=FALSE}
knitr::opts_chunk$set(fig.width=8, fig.height=8)
options(width=96)
.format <- knitr::opts_knit$get("rmarkdown.pandoc.to")
.tag <- function(N, cap ) ifelse(.format == "html",
                                 paste("Figure", N, ":",  cap),
                                 cap)
```
```{r mycss, results="asis", echo=FALSE}
cat('
<style type="text/css">
.figure { text-align: center; }
.caption { font-weight: bold; }
</style>
')
```
# Introduction
In many diseases, such as cancer, it is important to have a clear understanding of what potential
clinical subgroup an individual patient belongs to. Unsupervised clustering is a useful analytic
tool to address this problem. A variety of clustering methods already exist and are differentiated
by the kinds of outcome measures they are intended to optimize. For example, K-means is designed
to minimize the within-cluster sum of square errors. Partitioning around medoids generalizes this
idea from the Euclidean distance metric defined by sums of squares to an arbitrary distance metric. 
The different linkage rules used in hierarchical clustering methods also change the nature of the
value being optimized.

Ever since Kaufman and Rousseeuw introduced the idea of the silhouette width,
researchers have used its average value to select the best method when applying different clustering
methods to the same data set. To our knowledge, no one has tried to use silhouette width as a
quantity to be optimized directly when finding clusters. To test the idea that optimizing the
silhouette width could be used to cluster elements, we developed a novel algorithm that we call
"SillyPutty". In brief, after elements have been assigned to clusters, we can calculate the
silhouette width (SW) of each element, yielding numbers between -1 and +1. A positive value of SW
indicates that an element is likely to be properly clustered, while a negative value of SW indicates
the element is probably not in the correct cluster. The repeated step in the SillyPutty algorithm
is to reclassify the element with the most negative silhouette width by placing it into the
cluster to which it is closest. This process can (usually) be repeated until there are no negative
silhouette widths present in the data. (There is a small chance that the algorithm will fail to
converge because it enters an infinite loop in which the same elements are rearranged to return to
an earlier configuration.)
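
To make the repeated step concrete, here is a minimal sketch of one reclassification on toy data. For element $i$, let $a(i)$ be its mean distance to the other members of its own cluster and $b(i)$ the smallest mean distance to the members of any other cluster; the silhouette width is $s(i) = (b(i) - a(i)) / \max(a(i), b(i))$. The helper `moveWorst` is a hypothetical name used only for illustration and is not part of the SillyPutty API; it relies on `silhouette` from the cluster package, which reports the nearest other cluster in its `neighbor` column.

```r
library(cluster)  # provides silhouette()

set.seed(1)
# Toy data: two well-separated groups of 10 points in the plane
x <- rbind(matrix(rnorm(20, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(20, mean = 5, sd = 0.5), ncol = 2))
d <- dist(x)

# Start from the correct labels, then deliberately misplace element 1
cl <- rep(1:2, each = 10)
cl[1] <- 2L

# One "move the worst element" step (hypothetical helper, for illustration only)
moveWorst <- function(cl, d) {
  sw <- silhouette(cl, d)
  worst <- which.min(sw[, "sil_width"])      # most negative silhouette width
  if (sw[worst, "sil_width"] < 0) {
    # reassign the element to the cluster it is closest to
    cl[worst] <- as.integer(sw[worst, "neighbor"])
  }
  cl
}

cl2 <- moveWorst(cl, d)
table(before = cl, after = cl2)
```

In this toy case a single step is enough to undo the deliberate misplacement; on real data the step is repeated until no negative silhouette widths remain.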

## Setup
We must first load the necessary packages.
```{r Setup}
library(SillyPutty)
library(Umpire)
suppressMessages( library(Mercator) )
suppressMessages( library(mclust) ) # for adjusted Rand index
```

# Generating and Formatting Data
We use the Umpire R package (version `r packageVersion("Umpire")`) to generate complex, realistic
synthetic data. We then compute the Euclidean distances between elements. Next, we use the Mercator 
R package (version `r packageVersion("Mercator")`) to visualize the data. Finally, we use the mclust R package
(version `r packageVersion("mclust")`) to compute the adjusted Rand index (ARI), a measure of cluster quality
that compares clusters to externally known truth.
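
For readers unfamiliar with the ARI, the following sketch computes it directly from the contingency table of two labelings, using base R only. The function name `ariSketch` is hypothetical; the analysis in this vignette uses `adjustedRandIndex` from mclust instead.

```r
# Adjusted Rand index from a contingency table (illustrative sketch)
ariSketch <- function(truth, pred) {
  tab <- unclass(table(truth, pred))
  n <- sum(tab)
  sumCells <- sum(choose(tab, 2))            # pairs placed together in both labelings
  sumRows  <- sum(choose(rowSums(tab), 2))   # pairs placed together in 'truth'
  sumCols  <- sum(choose(colSums(tab), 2))   # pairs placed together in 'pred'
  expected <- sumRows * sumCols / choose(n, 2)
  maxIndex <- (sumRows + sumCols) / 2
  (sumCells - expected) / (maxIndex - expected)
}

truth <- c(1, 1, 1, 2, 2, 2)
ariSketch(truth, c("a", "a", "a", "b", "b", "b"))  # same partition, different names
ariSketch(truth, c(1, 1, 2, 2, 3, 3))              # partial agreement
```

The index equals 1 when the two partitions agree exactly, regardless of how the clusters are named, and is close to 0 when the agreement is no better than chance.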

## Assign Umpire Model Parameters
The next chunk of code creates the objects that we will use to simulate a data set. We set things
up to represent a kind of cancer with four subtypes corresponding to recurrent sets of "hits", where
each hit can be thought of as an abstract "mutation" that affects the expression of a pathway of related
genes.
```{r genData}
set.seed(21315)
trueK <- 4
## Set up survival outcome; baseline is exponential
sm <- SurvivalModel(baseHazard=1/5, accrual=5, followUp=1)
## Build a CancerModel with four subtypes
nBlocks <- 20    
cm <- CancerModel(name="cansim",
                  nPossible=nBlocks,
                  nPattern=trueK,
                  OUT = function(n) rnorm(n, 0, 1), 
                  SURV= function(n) rnorm(n, 0, 1),
                  survivalModel=sm)
## Include 100 blocks/pathways that are not hit by cancer
nTotalBlocks <- nBlocks + 100
## Assign values to hyperparameters
## block size
blockSize <- round(rnorm(nTotalBlocks, 100, 30))
## log normal mean hyperparameters
mu0    <- 6
sigma0 <- 1.5
## log normal sigma hyperparameters
rate   <- 28.11
shape  <- 44.25
## block correlation
p <- 0.6
w <- 5
## Set up the baseline Engine
rho <- rbeta(nTotalBlocks, p*w, (1-p)*w)
base <- lapply(1:nTotalBlocks,
               function(i) {
                 bs <- blockSize[i]
                 co <- matrix(rho[i], nrow=bs, ncol=bs)
                 diag(co) <- 1
                 mu <- rnorm(bs, mu0, sigma0)
                 sigma <- matrix(1/rgamma(bs, rate=rate, shape=shape), nrow=1)
                 covo <- co *(t(sigma) %*% sigma)
                 MVN(mu, covo)
               })
eng <- Engine(base)
## Alter the means if there is a hit
altered <- alterMean(eng, normalOffset, delta=0, sigma=1)
## Build the CancerEngine using character strings
object <- CancerEngine(cm, "eng", "altered")
rm(sm, nBlocks, cm, nTotalBlocks, blockSize, mu0, sigma0, rate, shape, p, w, rho, base, eng, altered)
```

## Simulate Data
Now we can take a random sample of 144 elements from the distribution that we just defined.
```{r simData}
trueN <- 144
dset <- rand(object, trueN, keepall = TRUE) # contains two objects
labels <- dset$clinical$CancerSubType # the true clusters/types
d1 <- dset$data # the noise-free simulated data
```

To make our data set even more realistic, we are going to add noise that mimics what happens in
some biological assays.
```{r noiseModel}
SpecialNoise <- function(nFeat, nu = 0.1, shape = 1.02, scale = 0.05/shape) {
  NoiseModel(nu = nu,
             tau = rgamma(nFeat, shape = shape, scale = scale),
             phi = 0)
}
nm <- SpecialNoise(nrow(d1), nu = 0)
d1 <- blur(nm, d1)
dim(d1)
```


## Euclidean Distance Matrix
Now we compute the Euclidean distances between pairs of elements in our simulated data set.
```{r distancematrix}
tdis <- t(d1)
dimnames(tdis) <- list(paste("Sample", 1:nrow(tdis), sep=''),
                     paste("Feature", 1:ncol(tdis), sep=''))
dis <- dist(tdis)   ## This step is the rate-limiting factor; the only way to speed it up is to use fewer samples
names(labels) <- rownames(tdis)
```
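
The transpose matters because `dist` computes distances between the rows of its argument, while the simulated data stores features in rows and samples in columns. A small sketch with made-up numbers illustrates the difference:

```r
# Toy matrix laid out like the simulated data: 3 features (rows) by 4 samples (columns)
m <- matrix(c(1, 2, 3,
              1, 2, 4,
              5, 6, 7,
              5, 6, 9), nrow = 3,
            dimnames = list(paste0("Feature", 1:3), paste0("Sample", 1:4)))
dFeat <- dist(m)      # distances between the 3 features
dSamp <- dist(t(m))   # distances between the 4 samples, which is what we cluster
attr(dFeat, "Size")
attr(dSamp, "Size")
as.matrix(dSamp)["Sample1", "Sample2"]
```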

```{r eval=FALSE, echo=FALSE, results='hide'}
dataset <- tdis
eucdist <- dis
trueGroups <- labels
save(eucdist, trueGroups, file="../data/eucdist.rda")
```

## Mercator Visualization
As noted above, we will use the Mercator package for visualization. This function will ensure that
we generate consistent sets of pictures.
```{r mercViews}
mercViews <- function(object, main, tag = NULL) {
  opar <- par(mfrow = c(2, 2))
  on.exit(par(opar))
  pts <- barplot(object, main = main)
  if (!is.null(tag)) {
    gt <- as.vector(as.matrix(table(getClusters(object))))
    loc <- pts[round((c(0, cumsum(gt))[-(1 + length(gt))] + cumsum(gt))/2)]
    mtext(tag, side =1, line = 0, at = loc, col = object@palette, font = 2)
  }
  plot(object, view = "tsne", main = "t-SNE")
  plot(object, view = "hclust")
  plot(object, view = "mds", main = "MDS")
}
```

# Different Clustering Methods
We will apply various clustering methods to the data (represented primarily through its distance
matrix). We want to demonstrate with this example that SillyPutty clustering can do a better job 
than hierarchical clustering or PAM.

## Hierarchical Clustering
**Figure 1** presents multiple views of the Euclidean distances between our simulated data. Since
we know that we started with `r trueK` clusters, we chose that as the number to find using the default method
of hierarchical clustering with Ward's linkage rule. (We will later illustrate how to use SillyPutty
to find the number of clusters.)

The silhouette width plot in the upper left panel indicates that each of the clusters contains
some poorly-classified elements, identified by their negative silhouette widths. Both the multidimensional scaling
(MDS) plot in the lower right and the t-distributed stochastic neighbor embedding (t-SNE) plot in the upper right
clearly display colored points that appear to be in the wrong regions.
```{r fig01, fig.cap = .tag(1, "Hierarchical Clustering, with four clusters.")}
set.seed(1987)
vis <- Mercator(dis, "euclid", "hclust", K = trueK)
palette <- vis@palette[c(1:3, 7, 8, 6, 10, 4, 11, 5, 15, 14, 17:18, 9, 12, 16, 19:24)]
vis@palette <- palette
vis <- addVisualization(vis, "mds")
vis <- addVisualization(vis, "tsne")
mercViews(vis, "Hierarchical Clustering, Four Clusters")
```
The adjusted Rand index isn't very good, either.
```{r ari}
ari.hier <- adjustedRandIndex(labels, vis@clusters)
ari.hier
```

## Graphing Truth
Since we know the truth, we can reassign the clusters inside the Mercator object to see what
everything is supposed to look like (**Figure 2**). Notice that the silhouette width plot
agrees that everything is in the right place, and that the MDS and t-SNE plots are also
consistent.
```{r fig02, fig.cap = .tag(2, "Visualization of true cancer clusters.")}
truebin <- remapColors(vis, setClusters(vis, labels))
mercViews(truebin, main = "True Cluster Types", 
          tag = unique(sort(labels)))
```

## PAM Clustering
Here we apply PAM clustering to the same distance matrix (**Figure 3**). These results are
clearly much worse than those from hierarchical clustering.
```{r fig03, fig.cap = .tag(3, "PAM Clustering, K = 4.")}
pc <- cluster::pam(dis, k = trueK, diss=TRUE) # pam() comes from the cluster package
pamc <- remapColors(vis, setClusters(vis, pc$clustering))
mercViews(pamc, main = "PAM, K = 4", 
          tag = paste("P", 1:trueK, sep = ""))
ari.pam <- adjustedRandIndex(labels, pamc@clusters)
ari.pam
```


# SillyPutty Clustering
RandomSillyPutty is the core of the SillyPutty package. It takes a distance matrix, the desired number
of clusters K, and the number N of times you want to apply SillyPutty to the data set. Each time, you
start with _different_ random cluster assignments. You then apply the "move the worst element"
algorithm described above. RandomSillyPutty saves the best and worst silhouette width scores along
with their associated data clusters.
```{r RandomSilly}
set.seed(12)
y2 <- suppressWarnings(RandomSillyPutty(dis, trueK, N = 100)) ## this is also slow
ari.max <- adjustedRandIndex(truebin@clusters, y2@MX)
ari.min <- adjustedRandIndex(truebin@clusters, y2@MN)
ari.max
ari.min
```
The adjusted Rand index of the best SillyPutty clustering is 0.98, meaning that we have almost completely
recovered the true cluster assignments present in the data. Note that even the worst result that
we obtained from SillyPutty has a better ARI (0.61) than PAM (0.43), though not quite as good as 
hierarchical clustering (0.71).
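
The random-restart strategy can be sketched in a few lines of R. This is a simplified illustration on toy data, not the implementation used by `RandomSillyPutty`; the helpers `sillySketch` and `meanSW` are hypothetical names, and the `maxIter` cap is an assumption added here to guard against the non-convergent loops mentioned in the introduction.

```r
library(cluster)  # provides silhouette()

# Hypothetical helper: repeat the "move the worst element" step until no
# silhouette width is negative, or until maxIter moves have been made.
sillySketch <- function(cl, d, maxIter = 100) {
  for (iter in seq_len(maxIter)) {
    sw <- silhouette(cl, d)
    if (!is.matrix(sw)) break                 # fewer than two clusters remain
    worst <- which.min(sw[, "sil_width"])
    if (sw[worst, "sil_width"] >= 0) break    # nothing left to fix
    cl[worst] <- as.integer(sw[worst, "neighbor"])
  }
  cl
}

# Mean silhouette width of a clustering (NA if it has collapsed to one cluster)
meanSW <- function(cl, d) {
  sw <- silhouette(cl, d)
  if (is.matrix(sw)) mean(sw[, "sil_width"]) else NA_real_
}

set.seed(2)
# Toy data: two groups of 15 points each
x <- rbind(matrix(rnorm(30, 0, 0.5), ncol = 2),
           matrix(rnorm(30, 5, 0.5), ncol = 2))
d <- dist(x)
K <- 2

# 20 different random starting assignments, each refined independently
starts <- replicate(20, sample(K, nrow(x), replace = TRUE), simplify = FALSE)
finals <- lapply(starts, sillySketch, d = d)
scores <- vapply(finals, meanSW, numeric(1), d = d)

best  <- finals[[which.max(scores)]]   # assignments with the highest mean silhouette width
worst <- finals[[which.min(scores)]]   # assignments with the lowest mean silhouette width
range(scores, na.rm = TRUE)
```

Keeping both extremes, as the package does, gives a sense of how sensitive the final clustering is to the starting point.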

We can now update the Mercator object using the cluster assignments defined by the best SillyPutty
result (**Figure 4**). The silhouette width plot now indicates that all elements are well
clustered, and both the MDS and t-SNE plots support that conclusion.
```{r fig04, fig.cap = .tag(4, "Random SillyPutty clustering,  K = 4.")}
randSillyBin <- remapColors(vis, setClusters(vis, y2@MX))
mercViews(randSillyBin, main = "SillyPutty Cluster Types, K = 4", 
          tag = paste("C", 1:trueK, sep = ""))
```

We can also plot the cluster assignments that had the maximum and minimum silhouette widths from
running the `RandomSillyPutty` algorithm. We will use the multidimensional scaling layout and
the color palette from the Mercator object.

```{r fig05, fig.cap = .tag(5, "Cluster assignments with best and worst silhouette widths after random starts.")}
plot(y2, randSillyBin@view[["mds"]], distobj = dis, col = randSillyBin@palette)
summary(y2)
```

## Combining SillyPutty With Hierarchical Clustering
Since SillyPutty can start with any existing cluster assignments (even random ones, as we just saw),
we can combine it with any other method. Here we start with the results of hierarchical
clustering and take one pass of SillyPutty to "improve" them. To apply SillyPutty to the output of
another clustering algorithm, you need the cluster assignments produced by that algorithm and the
distance matrix of the data set. SillyPutty then uses those assignments as its starting point,
reclassifies elements, and returns the best silhouette width score together with the new cluster
identities.
```{r fig06, fig.cap = .tag(6, "Hierarchical Clustering + SillyPutty, K = 4.")}
hierSilly <- SillyPutty(vis@clusters, dis)
hierSillyBin <- remapColors(vis, setClusters(vis, hierSilly@cluster))
mercViews(hierSillyBin, main = "HClust + Silly, k = 4",tag = paste("C", 1:trueK, sep = ""))
ari.Sillyhier <- adjustedRandIndex(labels, hierSillyBin@clusters)
ari.Sillyhier
```

# Finding the Number of Clusters With SillyPutty
RangeSillyPutty uses RandomSillyPutty to determine the best mean silhouette width for a range 
of cluster numbers. You can then use the best silhouette widths to approximate the actual number
of clusters in the data set.


**Figure 7** shows the best silhouette width achieved with each possible number of clusters. The best
overall value occurs when K = 4, which is the true number of clusters.

```{r fig07, fig.width=6, fig.height=5, fig.cap=.tag(7, "Best mean silhouette width, by number of clusters, found by combining hierarchical clustering with SillyPutty.")}
y <- findClusterNumber(dis, start = 2, end = 12, method = "HCSP")
plot(names(y), y, xlab = "K", ylab = "Silhouette Width", type = "b", lwd = 2, pch = 16)
```
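
The underlying idea of scanning K and keeping the value with the highest mean silhouette width can be sketched with base R and the cluster package alone. This simplified version uses plain hierarchical clustering on toy data and omits the SillyPutty refinement step, so it illustrates the selection rule rather than the package's own procedure.

```r
library(cluster)  # provides silhouette()

set.seed(3)
# Toy data with three well-separated groups of 15 points each
centers <- rbind(c(0, 0), c(6, 0), c(0, 6))
x <- do.call(rbind, lapply(1:3, function(i)
  cbind(rnorm(15, centers[i, 1], 0.5), rnorm(15, centers[i, 2], 0.5))))
d <- dist(x)

hc <- hclust(d, method = "ward.D2")
ks <- 2:8
# Mean silhouette width after cutting the tree into k clusters
meanSW <- vapply(ks, function(k) {
  mean(silhouette(cutree(hc, k), d)[, "sil_width"])
}, numeric(1))
names(meanSW) <- ks

bestK <- ks[which.max(meanSW)]   # K with the highest mean silhouette width
bestK
```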


# Appendix
```{r}
sessionInfo()
```