---
title: "Data Integration using Unsupervised Multiple Kernel Learning"
author: "Jérôme Mariette, Céline Brouard, Rémi Flamary and Nathalie Vialaneix"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  html_document:
    toc: yes
    code_folding: show
    highlight: haddock
    df_print: kable
vignette: >
  %\VignetteIndexEntry{Integrative exploratory analysis with mixKernel}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction

The TARA Oceans expedition facilitated the study of plankton communities by 
providing ocean metagenomic data combined with environmental measures to the 
scientific community. This study focuses on 139 prokaryotic-enriched samples 
collected from 68 stations and spread across three depth layers: the surface 
(SRF), the deep chlorophyll maximum (DCM) layer and the mesopelagic (MES) zones.
Samples were located in 8 different oceans or seas: Indian Ocean (IO), 
Mediterranean Sea (MS), North Atlantic Ocean (NAO), North Pacific Ocean (NPO), 
Red Sea (RS), South Atlantic Ocean (SAO), South Pacific Ocean (SPO) and South 
Ocean (SO). 

In this vignette, we consider a subset of the original data analyzed in the 
article [(Mariette & Villa-Vialaneix, 2018)](http://dx.doi.org/10.1093/bioinformatics/btx682)
and illustrate the usefulness of mixKernel to 1/ perform an integrative 
exploratory analysis as in (Mariette & Villa-Vialaneix, 2018) and to 2/ select
relevant variables for unsupervised analysis.

The data include 1% of the 35,650 prokaryotic OTUs and of the 39,246 bacterial 
genes that were randomly selected. **The aim is to integrate prokaryotic 
abundances and functional processes to environmental measure with an 
unsupervised method**.

```{r global_options, include=FALSE}
knitr::opts_chunk$set(dpi = 100, echo = TRUE, warning = FALSE, message = FALSE)
```

Install and load the mixOmics and mixKernel packages:

```{r load_lib}
## required python modules: autograd, numpy, scipy, sklearn
## To properly install packages, run:
# install.packages("BiocManager")
# BiocManager::install("mixOmics")
# BiocManager::install("phyloseq")
# install.packages("mixKernel")
library(mixOmics)
library(mixKernel)
```

# Loading TARA Ocean datasets

The (previously normalized) datasets are provided as matrices with matching 
sample names (rownames):

```{r input_data}
data(TARAoceans)
# more details with: ?TARAOceans
# we check the dimension of the data:
lapply(list("phychem" = TARAoceans$phychem, "pro.phylo" = TARAoceans$pro.phylo, 
            "pro.NOGs" = TARAoceans$pro.NOGs), dim)
```


# Multiple kernel computation

## Individual kernel computation

For each input dataset, a kernel is computed using the function `compute.kernel`
with the choice of linear, phylogenic or abundance kernels. A user defined 
function can also be provided as input(argument `kernel.func`, see 
`?compute.kernel`).

The results are lists with a 'kernel' entry that stores the kernel matrix. The 
resulting kernels are symmetric matrices with a size equal to the number of 
observations (rows) in the input datasets.

```{r compute_kernel, echo=TRUE}
phychem.kernel <- compute.kernel(TARAoceans$phychem, kernel.func = "linear")
pro.phylo.kernel <- compute.kernel(TARAoceans$pro.phylo, kernel.func = "abundance")
pro.NOGs.kernel <- compute.kernel(TARAoceans$pro.NOGs, kernel.func = "abundance")

# check dimensions
dim(pro.NOGs.kernel$kernel)
```

A general overview of the correlation structure between datasets is obtained as 
described in Mariette and Villa-Vialaneix (2018) and displayed using the 
function `cim.kernel`:

```{r cim_kernel, fig.width=4}
cim.kernel(phychem = phychem.kernel,
           pro.phylo = pro.phylo.kernel,
           pro.NOGs = pro.NOGs.kernel, 
           method = "square")
```

The figure shows that `pro.phylo` and `pro.NOGs` is the most correlated pair of 
kernels. This result is expected as both kernels provide a summary of
prokaryotic communities.


## Combined kernel computation

The function ```combine.kernels``` implements 3 different methods for combining
kernels: STATIS-UMKL, sparse-UMKL and full-UMKL (see more details in Mariette 
and Villa-Vialaneix, 2018). It returns a meta-kernel that can be used as an 
input for the function ```kernel.pca``` (kernel PCA). The three methods bring 
complementary information and must be chosen according to the research question.

The ```STATIS-UMKL``` approach gives an overview on the common information 
between the different datasets. The ```full-UMKL``` computes a kernel that 
minimizes the distortion between all input kernels. The ```sparse-UMKL``` is a 
sparse variant of ```full-UMKL``` that selects the most relevant kernels in 
addition to distortion minimization.

```{r meta_kernel}
meta.kernel <- combine.kernels(phychem = phychem.kernel,
                               pro.phylo = pro.phylo.kernel,
                               pro.NOGs = pro.NOGs.kernel, 
                               method = "full-UMKL")
```

# Exploratory analysis: Kernel Principal Component Analysis (KPCA)

## Perform KPCA

A kernel PCA can be performed from the combined kernel with the function 
`kernel.pca``. The argument `ncomp` allows to choose how many components to 
extract from KPCA.

```{r KPCA}
kernel.pca.result <- kernel.pca(meta.kernel, ncomp = 10)
```

Sample plots using the ```plotIndiv``` function from ```mixOmics```:

```{r plotIndiv_PCA, fig.keep='all'}
all.depths <- levels(factor(TARAoceans$sample$depth))
depth.pch <- c(20, 17, 4, 3)[match(TARAoceans$sample$depth, all.depths)]
plotIndiv(kernel.pca.result,
          comp = c(1, 2),
          ind.names = FALSE,
          legend = TRUE,
          group = as.vector(TARAoceans$sample$ocean),
          col = c("#f99943", "#44a7c4", "#05b052", "#2f6395", "#bb5352", 
                  "#87c242", "#07080a", "#92bbdb"),
          pch = TARAoceans$sample$depth,
          legend.title = "Ocean / Sea",
          title = "Projection of TARA Oceans stations",
          size.title = 10,
          legend.title.pch = "Depth")
```

The explained variance supported by each axis of KPCA is displayed with the 
`plot`  function, and can help choosing the number of components in KPCA.

```{r tune_pca}
plot(kernel.pca.result)
```

The first axis summarizes ~ 20% of the total variance.


## Assessing important variables

Here we focus on the information summarized on the first component. Variable 
values are randomly permuted with the function `permute.kernel.pca`.

In the following example, physical variable are permuted at the variable level 
(kernel `phychem`), OTU abundances from `pro.phylo` kernel are permuted at the 
phylum level (OTU phyla are stored in the second column, named `Phylum`, of the
taxonomy annotation provided in `TARAoceans` object in the entry `taxonomy`) and
gene abundances from `pro.NOGs` are permuted at the GO level (GO are provided in
the entry `GO` of the dataset):

```{r permute_kpca}
head(TARAoceans$taxonomy[ ,"Phylum"], 10)
head(TARAoceans$GO, 10)
# here we set a seed for reproducible results with this tutorial
set.seed(17051753)
kernel.pca.result <- kernel.pca.permute(kernel.pca.result, ncomp = 1,
                                        phychem = colnames(TARAoceans$phychem),
                                        pro.phylo = TARAoceans$taxonomy[, "Phylum"],
                                        pro.NOGs = TARAoceans$GO)
```

Results are displayed with the function `plotVar.kernel.pca`. The argument 
`ndisplay` indicates the number of variables to display for each kernel:

```{r display_var}
plotVar.kernel.pca(kernel.pca.result, ndisplay = 10, ncol = 3)
```

`Proteobacteria` is the most important variable for the `pro.phylo` kernel. 

The relative abundance of `Proteobacteria`` is then extracted in each of our 
`r nrow(TARAoceans$phychem)` samples, and each sample is colored according to 
the value of this variable in the KPCA projection plot:

```{r proteobacteria_display, fig.keep='all'}
selected <- which(TARAoceans$taxonomy[, "Phylum"] == "Proteobacteria")
proteobacteria.per.sample <- apply(TARAoceans$pro.phylo[, selected], 1, sum) /
  apply(TARAoceans$pro.phylo, 1, sum)
colfunc <- colorRampPalette(c("royalblue", "red"))
col.proteo <- colfunc(length(proteobacteria.per.sample))
col.proteo <- col.proteo[rank(proteobacteria.per.sample, ties = "first")]
plotIndiv(kernel.pca.result,
          comp = c(1, 2),
          ind.names = FALSE,
          legend = FALSE,
          col = col.proteo,
          pch = TARAoceans$sample$depth,
          legend.title = "Ocean / Sea",
          title = "Representation of Proteobacteria abundance",
          legend.title.pch = "Depth")
```

Similarly, the temperature is the most important variable for the `phychem`
kernel. The temperature values can be displayed on the kernel PCA projection as 
follows:

```{r temperature_display, fig.keep='all'}
col.temp <- colfunc(length(TARAoceans$phychem[, 4]))
col.temp <- col.temp[rank(TARAoceans$phychem[, 4], ties = "first")]
plotIndiv(kernel.pca.result,
          comp = c(1, 2),
          ind.names = FALSE,
          legend = FALSE,
          col = col.temp,
          pch = TARAoceans$sample$depth,
          legend.title = "Ocean / Sea",
          title = "Representation of mean temperature",
          legend.title.pch = "Depth")
```

## Selecting relevant variables

Here, we use a feature selection approach that does not rely on any assumption 
but explicitly takes advantage of the kernel structure in an unsupervised 
fashion. The idea is to preserve at best the similarity structure between 
samples. These examples requires the installation of the python modules
`autograd`, `scipy`, `numpy`, and `sklearn`. See detailed instructions in the
installation vignette or on mixKernel website : http://mixkernel.clementine.wf

```{r dependencies}
have_depend <- reticulate::py_module_available("autograd") &
  reticulate::py_module_available("scipy") &
  reticulate::py_module_available("numpy") &
  reticulate::py_module_available("sklearn") 
```

```{r select_ukfs}
if (have_depend) {
  ukfs.res <- select.features(TARAoceans$pro.phylo, kx.func = "bray", lambda = 1, 
                              keepX = 5, nstep = 1)
  selected <- sort(ukfs.res, decreasing = TRUE, index.return = TRUE)$ix[1:5]
  TARAoceans$taxonomy[selected, ]
}
```

The `select.features` function allows to add a structure constraint to the 
variable selection. The adjacency matrix of the graph representing relations 
between OTUs can be obtained by computing the Pearson correlation matrix as 
follows:

```{r comput_correlation_graph, eval=FALSE}
library("MASS")
library("igraph")
library("correlationtree")

pro.phylo.alist <- data.frame("names" = colnames(TARAoceans$pro.phylo), 
                              t(TARAoceans$pro.phylo))
L <- mat2list(df2mat(pro.phylo.alist, 1))
corr.mat <- as.matrix(cross_cor(L, remove = TRUE))
pro.phylo.graph <- graph_from_adjacency_matrix(corr.mat, 
                                               mode = "undirected",
                                               weighted = TRUE)
Lg <- laplacian_matrix(pro.phylo.graph, sparse=TRUE)
```

```{r select_ukfsg}
if (have_depend) {
  load(file = file.path(system.file(package = "mixKernel"), "loaddata", "Lg.rda"))
  ukfsg.res <- select.features(TARAoceans$pro.phylo, kx.func = "bray", 
                               lambda = 1, method = "graph", Lg = Lg, keepX = 5,
                               nstep = 1)
  
  selected <- sort(ukfsg.res, decreasing = TRUE, index.return = TRUE)$ix[1:5]
  TARAoceans$taxonomy[selected, ]
}
```

# References

1. Mariette, J. and Villa-Vialaneix, N. (2018). Unsupervised multiple kernel 
learning for heterogeneous data integration. *Bioinformatics*, **34**(6), 
1009-1015.

2. Zhuang, J., Wang, J., Hoi, S., and Lan, X. (2011). Unsupervised multiple 
kernel clustering. *Journal of Machine Learning Research* (Workshop and 
Conference Proceedings), **20**, 129–144.

3. Lavit, C., Escoufier, Y., Sabatier, R., and Traissac, P. (1994). The act 
(statis method). *Computational Statistics & Data Analysis*, **18**(1), 97–119.

# Session information

```{r session_information}
sessionInfo()
```