---
title: 'Consensus OPLS for Multi-Block Data Fusion'
author: "Celine Bougel, Julien Boccard, Florence Mehl, Marie Tremblay-Franco, Mark Ibberson, Van Du T. Tran"
date: "`r format(Sys.time(), '%B %d, %Y')`"
bibliography: ConsensusOPLS.bib
biblio-style: apalike
link-citations: yes
nocite: '@*'
vignette: >
  %\VignetteIndexEntry{Consensus OPLS for Multi-Block Data Fusion}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  html_document:
    code_folding: show
    df_print: kable
    highlight: pygments
    number_sections: yes
    self_contained: yes
    theme: journal
    toc: yes
    toc_depth: 4
    toc_float:
      collapsed: true
      smooth_scroll: true
  editor_options:
    chunk_output_type: console
---

<style>
body {
text-align: justify}
</style>

# The Consensus OPLS method

Omics approaches have proven their value to provide a broad monitoring of 
biological systems. However, despite the wealth of data generated by modern 
analytical platforms, the analysis of a single dataset is still limited and 
insufficient to reveal the full biochemical complexity of biological samples. 
The fusion of information from several data sources constitutes therefore a 
relevant approach to assess biochemical events more comprehensively. However, 
inherent problems encountered when analyzing single tables are amplified with 
the generation of multiblock datasets and finding the relationships between 
data layers of increasing complexity constitutes a challenging task. For that 
purpose, a versatile methodology is proposed by combining the strengths of 
established data analysis strategies, i.e. multiblock approaches and the OPLS-DA 
framework to offer an efficient tool for the fusion of Omics data obtained from 
multiple sources [@BOCCARD2013].

The method, already available in MATLAB (available at
[Gitlab repository](https://gitlab.unige.ch/Julien.Boccard/consensusopls)),
has been translated into an R package (available at
[GitHub repository](https://github.com/sib-swiss/consensusOPLS)) that includes 
quality metrics for optimal model selection, such as the R-squared (R²) 
coefficient, the Stone-Geisser Q² coefficient, the discriminant Q² index (DQ²) 
[@WESTERHUIS2008], the permutation **diagnostics statistics** 
[@SZYMANSKA2012], as well as many graphical outputs (scores plot, block 
contributions, individual loadings, permutation results, etc.). It has been 
enhanced with additional functionalities such as the computation of the Variable 
Importance in Projection **(VIP) values** [@WOLD2001]. Moreover, the new 
implementation now offers the possibility of using different kernels, i.e. 
linear (previously the only option was a kernel-based reformulations of the 
**NIPALS** algorithm [@LINDGREN1993]), polynomial, or Gaussian, which greatly 
enhances the versatility of the method and extends its scope to a wide range of 
applications. The package also includes a function to **predict new samples** 
using an already computed model. 

A demonstration case study available from a public repository of the National 
Cancer Institute, namely the NCI-60 data set, was used to illustrate the 
method's potential for omics data fusion. A subset of NCI-60 data 
(transcriptomics, proteomics and metabolomics) involving experimental data from 
14 cancer cell lines from two tissue origins, i.e. colon and ovary, was used 
[@SHOEMAKER2006]. Results from the consensusOPLS R package and Matlab on this 
dataset were strictly identical (tolerance of 10e-06). 

The combination of these data sources was excepted to provide a global profiling 
of the cell lines in an integrative systems biology perspective. The Consensus 
OPLS-DA strategy was applied for the differential analysis of the two selected 
tumor origins and the simultaneous analysis of the three blocks of data.


# R environment preparation

```{r setup, class.source='fold-hide', warning=FALSE}
#install.packages("knitr")
library(knitr)
opts_chunk$set(echo = TRUE)

# To ensure reproducibility
set.seed(12)
```

Before any action, it is necessary to verify that the needed packages were 
installed (the code chunks are not shown, click on `Show` to open them). The 
code below has been designed to have as few dependencies as possible on R 
packages, except for the stable packages.

```{r packages_installation, eval=FALSE, class.source='fold-hide'}
pkgs <- c("ggplot2", "ggrepel", "DT", "psych", "ConsensusOPLS")
sapply(pkgs, function(x) {
  if (!requireNamespace(x, quietly = TRUE)) {
    install.packages(x)
  }
})
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
if (!requireNamespace("ComplexHeatmap", quietly = TRUE)) {
  BiocManager::install("ComplexHeatmap")
}
```

```{r packages_load, warning=FALSE, message=FALSE, class.source='fold-hide'}
library(ggplot2)        # to make beautiful graphs
library(ggrepel)        # to annotate ggplot2 graph
library(DT)             # to make interactive data tables
library(psych)          # to make specific quantitative summaries
library(ComplexHeatmap) # to make heatmap with density plot
library(ConsensusOPLS)  # to load ConsensusOPLS
```

Then we create a uniform theme (`theme_graphs`) that will be used for all 
graphic outputs.

```{r theme_ggplot2}
require(ggplot2)
theme_graphs <- theme_bw() + theme(strip.text = element_text(size=14),
                                   axis.title = element_text(size=16),
                                   axis.text = element_text(size=14),
                                   plot.title = element_text(size=16),
                                   legend.title = element_text(size=14))
```

# Data preprocessing

As mentioned earlier, the demonstration dataset proposed in Matlab was used for
the package.

```{r import_demo_3_Omics}
data(demo_3_Omics, package = "ConsensusOPLS")
```

This will load an data object of type `r class(demo_3_Omics)` with
`r length(demo_3_Omics)` matrices `r paste(ls(demo_3_Omics), collapse=', ')`.

In other words, this list contains three data blocks, a list of observation
names (samples), and the binary response matrix Y. Since the `ConsensusOPLS`
method performs **horizontal integration**, with **kernel-based data fusion** 
[@BOCCARD2014_INTEGRATION], all data blocks should have exactly the same samples 
(rows). The block dimension can be checked with the following command:

```{r check_dims, class.source='fold-hide'}
# Check dimension
BlockNames <- c("MetaboData", "MicroData", "ProteoData")
nbrBlocs <- length(BlockNames)
dims <- lapply(X=demo_3_Omics[BlockNames], FUN=dim)
names(dims) <- BlockNames
dims

# Remove unuseful object for the next steps
rm(dims)
```

Now that the number of lines is identical, we need to check that they are the 
same subjects and in the same order for the different blocks.

```{r check_orders_and_names, class.source='fold-hide'}
# Check rows names in any order
row_names <- lapply(X=demo_3_Omics[BlockNames], FUN=rownames)
rns <- do.call(cbind, row_names)
rns.unique <- apply(rns, 1, function(x) length(unique(x)))
if (max(rns.unique) > 1) {
  stop("Rows names are not identical between blocks.")
}

# Check order of samples
check_row_names <- all(sapply(X=row_names, FUN=identical, y = row_names[[1]]))
if (!check_row_names && max(rns.unique) == 1) {
  print("Rows names are not in the same order for all blocks.")
}

# Remove unuseful object for the next steps
rm(row_names, rns, rns.unique, check_row_names)
```


The identical order of samples in the three omics blocks should be ensured.

The list of the data blocks `demo_3_Omics[BlockNames]` and the response
`demo_3_Omics$Y` are required as input to the ConsensusOPLS analysis.

# Data visualization

## Summary by Y groups

Before performing the multiblock analysis, we might investigate the nature of
variables in each omics block w.r.t the response. The interactive tables are
produced here below, so that the variables can be sorted in ascending or
descending order. A variable of interest can also be looked up.


```{r describe_data_by_Y_function, class.source='fold-hide'}
require(psych)
require(DT)
describe_data_by_Y <- function(data, group) {
  bloc_by_Y <- describeBy(x = data, group = group,
                          mat = TRUE)[, c("group1", "n", "mean", "sd",
                                          "median", "min", "max", "range",
                                          "se")]
  bloc_by_Y[3:ncol(bloc_by_Y)] <- round(bloc_by_Y[3:ncol(bloc_by_Y)], 
                                        digits = 2)
  return (datatable(bloc_by_Y))
}
```

For metabolomic data,

```{r describe_data_by_Y_bloc1}
describe_data_by_Y(data = demo_3_Omics[[BlockNames[1]]],
                   group = demo_3_Omics$ObsNames[,1])
```

For microarray data,

```{r describe_data_by_Y_bloc2}
describe_data_by_Y(data = demo_3_Omics[[BlockNames[2]]],
                   group = demo_3_Omics$ObsNames[,1])
```

For proteomic data,

```{r describe_data_by_Y_bloc3}
describe_data_by_Y(data = demo_3_Omics[[BlockNames[3]]],
                   group = demo_3_Omics$ObsNames[,1])
```

What information do these tables provide? To begin with, we see that there are 
the same number of subjects in the two groups defined by the Y response variable.
Secondly, there is a great deal of variability in the data, both within and 
between blocks. For example, let's focus on the range of values. The order of
magnitude for the :

- `r BlockNames[1]` is `r range(demo_3_Omics[[BlockNames[1]]])`,

- `r BlockNames[2]` is `r range(demo_3_Omics[[BlockNames[2]]])`, and

- `r BlockNames[3]` is `r range(demo_3_Omics[[BlockNames[3]]])`.

A data transformation is therefore recommended before proceeding.

## Unit variance scaling

To use the **Consensus OPLS-DA** method, it is possible to calculate the 
Z-score of the data, i.e. each columns of the data are centered to have mean 0,
and scaled to have standard deviation 1. The user is free to perform it before 
executing the method, just after loading the data, and using the method of his 
choice.

According to previous results, the scales of the variables in the data blocks 
are highly variable. So, the data needs to be standardized.

```{r scale_data, class.source='fold-hide'}
# Save not scaled data
demo_3_Omics_not_scaled <- demo_3_Omics

# Scaling data
demo_3_Omics[BlockNames] <- lapply(X = demo_3_Omics[BlockNames], 
                                   FUN = function(x){
                                     scale(x, center = TRUE, scale = TRUE)
                                   }
)
```

## Heatmap and density plots

Heat maps can be used to compare results before and after scaling. Here, the 
interest factor is categorical, so it was interesting to create a heat map for 
each of these groups. The function used to create the heat map is based on the 
following code (code hidden).
 
```{r heatmap_function, message = FALSE, class.source='fold-hide'}
heatmap_data <- function(data, bloc_name, factor = NULL){
  if(!is.null(factor)){
    ht <- Heatmap(
      matrix = data, name = "Values",
      row_dend_width = unit(3, "cm"),
      column_dend_height = unit(3, "cm"),
      column_title = paste0("Heatmap of ", bloc_name),
      row_split = factor,
      row_title = "Y = %s",
      row_title_rot = 0
    )
  } else{
    ht <- Heatmap(
      matrix = data, name = "Values",
      row_dend_width = unit(3, "cm"),
      column_dend_height = unit(3, "cm"),
      column_title = paste0("Heatmap of ", bloc_name)
    )
  }
  return(ht)
}
```

Let's apply this function to the demo data:

```{r heatmap_no_scale, message = FALSE, class.source='fold-hide'}
# Heat map for each data block
lapply(X = 1:nbrBlocs,
       FUN = function(X){
         bloc <- BlockNames[X]
         heatmap_data(data = demo_3_Omics_not_scaled[[bloc]],
                      bloc_name = bloc,
                      factor = demo_3_Omics_not_scaled$Y[,1])})
```

And on the scaled data:

```{r heatmap_scale, message = FALSE, class.source='fold-hide'}
# Heat map for each data block
lapply(X = 1:nbrBlocs,
       FUN = function(X){
         bloc <- BlockNames[X]
         heatmap_data(data = demo_3_Omics[[bloc]],
                      bloc_name = bloc,
                      factor = demo_3_Omics$Y[,1])})
```

By comparing these graphs, several observations can be made. To begin with, the 
unscaled data had a weak signal for the proteomics and transcriptomics blocks. 
The metabolomics block seemed to contain a relatively usable signal as it stood. 
These graphs therefore confirm that it was wise to perform this transformation 
prior to the analyses. And secondly, the profiles seem to differ according to 
the Y response variable.

In the same way, the user can visualize density distribution using a heat map
(here on scaled data):

```{r heatmap_density, class.source='fold-hide'}
# Heatmap with density for each data bloc
lapply(X = 1:nbrBlocs,
       FUN = function(X){
         bloc <- BlockNames[X]
         factor <- demo_3_Omics$Y[, 1]
         densityHeatmap(t(demo_3_Omics[[bloc]]),
                        ylab = bloc,
                        column_split  = factor,
                        column_title = "Y = %s")})
```

In the light of these graphs, it would appear that the Y = 0 data is denser 
than the Y = 1 data. This means that the discriminant model (DA) should be able 
to detect the signal contained in this data.

```{r rm_unscale_data, class.source='fold-hide'}
# Remove unscaled data
rm(demo_3_Omics_not_scaled)
```





# Consensus OPLS-DA model

A model with a predictor variable and an orthogonal latent variable was 
evaluated. For this, the following parameters were defined:

```{r define_cv_parameters}
# Number of predictive component(s)
LVsPred <- 1

# Maximum number of orthogonal components
LVsOrtho <- 3

# Number of cross-validation folds
CVfolds <- nrow(demo_3_Omics[[BlockNames[[1]]]])
CVfolds
```

Then, to use the ConsensusOPLS method proposed by the package of the same name, 
**only one function** needs to be called. This function, `ConsensusOPLS`, 
takes as arguments the data blocks, the response variable, the maximum number 
of predictive and orthogonal components allowed in the model, the number of 
partitions for n-fold cross-validation, and the model type to indicate 
discriminant analysis. The result is the optimal model, without permutation.

```{r run_consensusOPLSmodel}
copls.da <- ConsensusOPLS(data = demo_3_Omics[BlockNames],
                          Y = demo_3_Omics$Y,
                          maxPcomp = LVsPred,
                          maxOcomp  = LVsOrtho,
                          modelType = "da",
                          nperm = 1000,
                          cvType = "nfold",
                          nfold = 14,
                          nMC = 100,
                          cvFrac = 4/5,
                          kernelParams = list(type = "p", 
                                              params = c(order = 1)),
                          mc.cores = 1)
```

The summary information of the model can be obtained as 

```{r outputs_model_summary, class.source='fold-hide'}
copls.da
```

The list of available attributes in the produced ConsensusOPLS object:

```{r outputs_model_attributes, class.source='fold-hide'}
summary(attributes(copls.da))
```


# Display the main results

As indicated at the beginning of the file, the R package `ConsensusOPLS` 
calculates:

- the R-squared (R²) coefficient, gives a measure of how predictive the 
model is and how much variation is explained by the model. The lowest 
R-squared is 0 and means that the points are not explained by the regression 
whereas the highest R-squared is 1 and means that all the points are explained 
by the regression line.

```{r print_main_results_R2, class.source='fold-hide'}
position <- copls.da@nOcomp

paste0('R2: ', round(copls.da@R2Y[paste0("po", position)], 4))
```

Here, that means the model explain 
`r round(copls.da@R2Y[paste0("po", position)], 4)*100`$\%$ of the 
variation in the Y response variable.

- The Stone-Geisser Q² coefficient, also known as the redundancy index in 
cross-validation, is used to evaluate the quality of each structural equation, 
and thus to assess, independently of each other, the predictive quality of each 
model construct [@TENENHAUS2005]. If the Q² is positive, the model has 
predictive validity, whereas if it is negative, the model has no (absence of) 
predictive validity. It is defined as `1 - (PRESS/ TSS)`, with `PRESS` is the 
prediction error sum of squares, and `TSS` is the total sum of squares of the 
response vector Y [@WESTERHUIS2008]. This coefficient can take values between 
-1 and 1, but a positive coefficient with a high value is expected for 
predictive validity.

```{r print_main_results_Q2, class.source='fold-hide'}
paste0('Q2: ', round(copls.da@Q2[paste0("po", position)], 4))
```

Here, this means that the model has a predictive validity.

- the discriminant Q² index (`DQ2`) to assess the model fit as it does not 
penalize class predictions beyond the class label value. The `DQ2` is defined 
as `1 - (PRESSD/ TSS)`, with PRESSD is the prediction error sum of squares, 
disregarded when the class prediction is beyond the class label  (i.e. `>1` or 
`<0`, for two classes named 0 and 1), and `TSS` is the total sum of squares of 
the response vector Y. This value is a measure for class prediction ability 
[@WESTERHUIS2008]. As with Q², this coefficient can take values between -1 and 1, 
but a positive coefficient with a high value is expected to have predictive 
validity.

```{r print_main_results_DQ2, class.source='fold-hide'}
paste0('DQ2: ', round(copls.da@DQ2[paste0("po", position)], 4))
```

Here, this means that the model can predict classes.

- the variable Importance in projection (VIP) for each block of data [@WOLD2001].
Within each block, the relevance of the variables in explaining variation in the 
Y response was assessed using the VIP parameter, which reflects the importance 
of the variables in relation to both response and projection quality 
[@GALINDOPRIETO2015]. 

Similarly, `individual loadings` represent the contribution of variables in the 
space defined by the predictive and orthogonal components. Their `sign` 
indicates the direction of the contribution (positive for correlated with response; 
negative for anti-correlated with response), and the value indicates the 
intensity of the contribution.

Using the `VIP* sign(loadings)` value [@MEHL2024], the relevant features, i.e. 
those with higher `|VIP|` values, can be represented as follows:

```{r extract_VIP, class.source='fold-hide'}
# Compute the VIP
VIP <- copls.da@VIP

# Multiply VIP * sign(loadings for predictive component)
VIP_plot <- lapply(X = 1:nbrBlocs,
                   FUN = function(X){
                     sign_loadings <- sign(copls.da@loadings[[X]][, "p_1"])
                     result <- VIP[[X]][, "p"]*sign_loadings
                     return(sort(result, decreasing = TRUE))})
names(VIP_plot) <- BlockNames
```

```{r plot_VIP, class.source='fold-hide'}
# Metabo data
ggplot(data = data.frame(
  "variables" = factor(names(VIP_plot[[1]]),
                       levels=names(VIP_plot[[1]])[order(abs(VIP_plot[[1]]), 
                                                         decreasing=T)]), 
  "valeur" = VIP_plot[[1]]), 
  aes(x = variables, y = valeur)) +
  geom_bar(stat = "identity") +
  labs(title = paste0("Barplot of ", names(VIP_plot)[1])) +
  xlab("Predictive variables") +
  ylab("VIP x loading sign") +
  theme_graphs +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) 

# Microarray data
ggplot(data = data.frame(
  "variables" = factor(names(VIP_plot[[2]]),
                       levels=names(VIP_plot[[2]])[order(abs(VIP_plot[[2]]), 
                                                         decreasing=T)]), 
  "valeur" = VIP_plot[[2]]), 
  aes(x = variables, y = valeur)) +
  geom_bar(stat = "identity") +
  labs(title = paste0("Barplot of ", names(VIP_plot)[2])) +
  xlab("Predictive variables") +
  ylab("VIP x loading sign") +
  theme_graphs +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) 

# Proteo data
ggplot(data = data.frame(
  "variables" = factor(names(VIP_plot[[3]]),
                       levels=names(VIP_plot[[3]])[order(abs(VIP_plot[[3]]), 
                                                         decreasing=T)]), 
  "valeur" = VIP_plot[[3]]), 
  aes(x = variables, y = valeur)) +
  geom_bar(stat = "identity") +
  labs(title = paste0("Barplot of ", names(VIP_plot)[3])) +
  xlab("Predictive variables") +
  ylab("VIP x loading sign") +
  theme_graphs +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) 
```

One possibility might be to select only the 20 most important components (the 
first 10 and the last 10). The user is free to do this.

The advantage of using this representation, compared to individual loading, is 
that there is a usual selection threshold set at 1. In other words, variables 
with a VIP $\geq$ 1 are usually considered important.


# Plot the main results

## Consensus Score plot

The scores plot shows the representation of the samples in the two new 
components calculated by the optimal model. A horizontal separation (i.e. 
according to the predictive component) is expected.

```{r ggplot_score_data, class.source='fold-hide', warning=FALSE}
ggplot(data = data.frame("p_1" = copls.da@scores[, "p_1"],
                         "o_1" = copls.da@scores[, "o_1"],
                         "Labs" = as.matrix(unlist(demo_3_Omics$ObsNames[, 1]))),
       aes(x = p_1, y = o_1, label = Labs, 
           shape = Labs, colour = Labs)) +
  xlab("Predictive component") +
  ylab("Orthogonal component") +
  ggtitle("ConsensusOPLS Score plot")+
  geom_point(size = 2.5) + 
  geom_text_repel(size = 4, show.legend = FALSE) + 
  theme_graphs+
  scale_color_manual(values = c("#7F3C8D", "#11A579"))
```

Graph of scores obtained by the optimal ConsensusOPLS model for ovarian tissue 
(triangle) and colon tissue (circle) from NCI-60 data, for three data blocks 
(metabolomics, proteomics, and transcriptomics). Each cancer cell is represented 
by a unique symbol whose location is determined by the contributions of the 
predictive and orthogonal components of the ConsensusOPLS-DA model. A clear 
partition of the classes was obtained.

## Block contributions to the predictive component

The contribution of the blocks represents the position of the samples in the 
new space of predictive and orthogonal components. In other words, intuitively, 
they represent the magnitude/importance of each block's transformation in the
new space.

Here, for the predictive component, this amounts to quantifying the information 
contribution of each block in the ConsensusOPLS model.

```{r ggplot_data_pred_compo, class.source='fold-hide'}
ggplot(
  data = data.frame("Values" = copls.da@blockContribution[, "p_1"],
                    "Blocks" = as.factor(labels(demo_3_Omics[1:nbrBlocs]))),
  aes(x = Blocks, y = Values,
      fill = Blocks, labels = Values)) +
  geom_bar(stat = 'identity') + 
  geom_text(aes(label = round(Values, 2), y = Values), 
            vjust = 1.5, color = "black", fontface = "bold") +
  ggtitle("Block contributions to the predictive component")+
  xlab("Data blocks") +
  ylab("Weight") +
  theme_graphs +
  scale_color_manual(values = c("#1B9E77", "#D95F02", "#7570B3"))
```

The block contributions of the predictive latent variable indicated the specific 
importance of the proteomic block (38.5$\%$), the transcriptomic block (34.7$\%$) 
and the metabolomic block (26.8$\%$).

## Block contributions to the first orthogonal component

For the orthogonal component, the block contribution quantifies the noise 
contribution of each block in the ConsensusOPLS model.

```{r ggplot_data_1st_ortho_compo, class.source='fold-hide'}
ggplot(
  data = 
    data.frame("Values" = copls.da@blockContribution[, "o_1"],
               "Blocks" = as.factor(labels(demo_3_Omics[1:nbrBlocs]))),
  aes(x = Blocks, y = Values,
      fill = Blocks, labels = Values)) +
  geom_bar(stat = 'identity') + 
  geom_text(aes(label = round(Values, 2), y = Values), 
            vjust = 1.5, color = "black", fontface = "bold") +
  ggtitle("Block contributions to the first orthogonal component") +
  xlab("Data blocks") +
  ylab("Weight") +
  theme_graphs +
  scale_color_manual(values = c("#1B9E77", "#D95F02", "#7570B3"))
```

The block contributions of first orthogonal component indicated the specific 
importance of the metabolomic block (41.8$\%$), the transcriptomic block (31.3$\%$) 
and the proteomic block (26.9$\%$).

## Block contributions: the two previous plots into one

```{r ggplot_data_pred_ortho, message = FALSE, class.source='fold-hide'}
data_two_plots <- data.frame("Values" = copls.da@blockContribution[, "p_1"],
                             "Type" = "Pred",
                             "Blocks" = labels(demo_3_Omics[1:nbrBlocs]))
data_two_plots <- data.frame("Values" = c(data_two_plots$Values,
                                          copls.da@blockContribution[, "o_1"]),
                             "Type" = c(data_two_plots$Type,
                                        rep("Ortho", times = length(copls.da@blockContribution[, "o_1"]))),
                             "Blocks" = c(data_two_plots$Blocks,
                                          labels(demo_3_Omics[1:nbrBlocs])))

ggplot(data = data_two_plots,
       aes(x = factor(Type), 
           y = Values, 
           fill = factor(Type))) +
  geom_bar(stat = 'identity') + 
  ggtitle("Block contributions to each component")+
  geom_text(aes(label = round(Values, 2), y = Values), 
            vjust = 1.5, color = "black", fontface = "bold") +
  xlab("Data blocks") +
  ylab("Weight") +
  facet_wrap(. ~ Blocks)+
  theme_graphs+
  scale_fill_discrete(name = "Component")+
  scale_fill_manual(values = c("#7F3C8D", "#11A579"))
```

In the same way, the previous graph can be represented as:

```{r plot_bloc_PredVSOrtho_bis, message = FALSE, class.source='fold-hide'}
ggplot(data = data_two_plots,
       aes(x = Blocks, 
           y = Values, 
           fill = Blocks)) +
  geom_bar(stat = 'identity') +
  geom_text(aes(label = round(Values, 2), y = Values), 
            vjust = 1.5, color = "black", fontface = "bold") +
  ggtitle("Block contributions to each component") +
  xlab("Components") +
  ylab("Weight") +
  facet_wrap(. ~ factor(Type, levels = c("Pred", "Ortho"))) +
  theme_graphs +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
        plot.title = element_text(hjust = 0.5, 
                                  margin = margin(t = 5, r = 0, b = 0, l = 100))) +
  scale_fill_manual(values = c("#1B9E77", "#D95F02", "#7570B3"))
```

This synthetic figure is generally used in presentations.




## Block contributions predictive vs. orthogonal

```{r ggplot_data_pred_vs_ortho, message = FALSE, warning = FALSE, class.source='fold-hide'}
ggplot(data = data.frame("Pred" = copls.da@blockContribution[, "p_1"],
                         "Ortho" = copls.da@blockContribution[, "o_1"],
                         "Labels" = labels(demo_3_Omics[1:nbrBlocs])),
       aes(x = Pred, y = Ortho, label = Labels, 
           shape = Labels, colour = Labels)) +
  xlab("Predictive component") +
  ylab("Orthogonal component") +
  ggtitle("Block contributions predictive vs. orthogonal") +
  geom_point(size = 2.5) + 
  geom_text_repel(size = 4, show.legend = FALSE) + 
  theme_graphs +
  scale_color_manual(values = c("#1B9E77", "#D95F02", "#7570B3"))
```

## Loading plots (one for each data set)

Individual loading of each block were calculated for the predictive latent 
variable of the optimal model, to detect metabolite, protein and transcript 
level differences between the two groups of tissues cell lines.

```{r create_data_loadings}
loadings <- copls.da@loadings
data_loads <- sapply(X = 1:nbrBlocs,
                     FUN = function(X){
                       data.frame("Pred" = 
                                    loadings[[X]][, grep(pattern = "p_",
                                                         x = colnames(loadings[[X]]),
                                                         fixed = TRUE)],
                                  "Ortho" = 
                                    loadings[[X]][, grep(pattern = "o_",
                                                         x = colnames(loadings[[X]]),
                                                         fixed = TRUE)],
                                  "Labels" = labels(demo_3_Omics[1:nbrBlocs])[[X]])
                     })
data_loads <- as.data.frame(data_loads)
```

The loading plot shows the representation of variables in the two new components 
calculated by the optimal model.

```{r ggplot_data_loadings, class.source='fold-hide'}
ggplot() +
  geom_point(data = as.data.frame(data_loads$V1),
             aes(x = Pred, y = Ortho, colour = Labels), 
             size = 2.5, alpha = 0.5) + 
  geom_point(data = as.data.frame(data_loads$V2),
             aes(x = Pred, y = Ortho, colour = Labels),
             size = 2.5, alpha = 0.5) +
  geom_point(data = as.data.frame(data_loads$V3),
             aes(x = Pred, y = Ortho, colour = Labels),
             size = 2.5, alpha = 0.5) +
  xlab("Predictive component") +
  ylab("Orthogonal component") +
  ggtitle("Loadings plot on first orthogonal and predictive component")+
  theme_graphs+
  scale_color_manual(values = c("#1B9E77", "#D95F02", "#7570B3"))
```

The graph here represents the dispersion of variable contributions on orthogonal 
and predictive components, displaying intra- and inter-block variation. 
Intuitively, we would expect significant distinct variables to stand out 
horizontally (i.e. according to the predictive component). In omics data, due to 
the large amount of information, this is rarely the case: the explanatory 
importance is not clearly distinguishable, so we get an unstructured 
scatterplot (distributed in ellipsoidal form).

## Loading and VIP of the optimal model

```{r create_data_loadings_VIP}
loadings <- do.call(rbind.data.frame, copls.da@loadings)
loadings$block <- do.call(c, lapply(names(copls.da@loadings), function(x) 
  rep(x, nrow(copls.da@loadings[[x]]))))
loadings$variable <- gsub(paste(paste0(names(copls.da@loadings), '.'), 
                                collapse='|'), '', 
                          rownames(loadings))

VIP <- do.call(rbind.data.frame, copls.da@VIP)
VIP$block <- do.call(c, lapply(names(copls.da@VIP), function(x) 
  rep(x, nrow(copls.da@VIP[[x]]))))
VIP$variable <- gsub(paste(paste0(names(copls.da@VIP), '.'), 
                           collapse='|'), '', 
                     rownames(VIP))

loadings_VIP <- merge(x = loadings[, c("p_1", "variable")], 
                      y = VIP[, c("p", "variable")], 
                      by = "variable", all = TRUE)
colnames(loadings_VIP) <- c("variable", "loadings", "VIP")
loadings_VIP <- merge(x = loadings_VIP, 
                      y = loadings[, c("block", "variable")], 
                      by = "variable", all = TRUE)
loadings_VIP$label <- ifelse(loadings_VIP$VIP > 1, loadings_VIP$variable, NA)
```

```{r ggplot_data_loadings_VIP, class.source='fold-hide'}
ggplot(data = loadings_VIP,
       aes(x=loadings, y=VIP, col=block, label = label)) +
  geom_point(size = 2.5, alpha = 0.5) + 
  xlab("Predictive component") +
  ylab("Variable Importance in Projection") +
  ggtitle("VIP versus loadings on predictive components")+
  theme_graphs+
  scale_color_manual(values = c("#1B9E77", "#D95F02", "#7570B3"))
```

Again, intuitively, VIPs are linearly dependent on individual loadings, which 
means that the scatterplot should have a V-shape: a positive slope for loadings 
correlated to the response variable and a negative one for loadings 
anti-correlated to the response variable. In practice, this is not always the 
case. In fact, this graph can be used to :

- detect outliers if the V shape is not respected ;

- assess the robustness of the model in a different way from the indicators 
presented above ;

- verify the expected linear dependence between VIPs and loadings

- select top features to contribute to the model.



# Permutations

The permutations test [@SZYMANSKA2012] (for both the R² and Q²/ DQ² indicators) 
assesses the robustness of the model. The test hypothesis determines whether the 
model is statistically significant or has picked up noise in the data. For this, 
a permutation (random mixing) of the values of the response variable Y is 
performed, with the aim of destroying the structural relationship existing 
between X and Y. With each permutation, the ConsensusOPLS model is re-evaluated. 
In the end, for the optimal model to be robust, we need :

- the value of the optimal model must be within the high values of the 
permutations (the true value, represented by the vertical dotted line, must be 
on the right-hand side of the graph),

- the distribution of all permuted values to be relatively Gaussian (see density, 
shown in blue on the graph).

If these two criteria are met, then the model is robust and the results 
interpretive.

According to [@SZYMANSKA2012], authors have suggested that to estimate a 
permutation  `P` value of 0.01, up to `10^4` permutations are required in 
genetic applications. In our case, permutation tests were done with `10^3` 
replicates to test model validity.

```{r run_permutations, warning=FALSE}
PermRes <- copls.da@permStats
```

```{r plot_R2_perm, warning=FALSE, class.source='fold-hide'}
ggplot(data = data.frame("R2Yperm" = PermRes$R2Y),
       aes(x = R2Yperm)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 color="grey", fill="grey") +
  geom_density(color = "blue", linewidth = 0.5) +
  geom_vline(aes(xintercept=PermRes$R2Y[1]), 
             color="blue", linetype="dashed", size=1) +
  xlab("R2 values") +
  ylab("Frequency") +
  ggtitle("R2 Permutation test")+
  theme_graphs
```

```{r plot_Q2_perm, warning=FALSE, class.source='fold-hide'}
ggplot(data = data.frame("Q2Yperm" = PermRes$Q2Y),
       aes(x = Q2Yperm)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 color="grey", fill="grey") +
  geom_density(color = "blue", size = 0.5) +
  geom_vline(aes(xintercept=PermRes$Q2Y[1]), 
             color="blue", linetype="dashed", size=1) +
  xlab("Q2 values") +
  ylab("Frequency") +
  ggtitle("Q2 Permutation test")+
  theme_graphs
```

```{r plot_DQ2_perm, warning=FALSE, class.source='fold-hide'}
ggplot(data = data.frame("DQ2Yperm" = PermRes$DQ2Y),
       aes(x = DQ2Yperm)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 color="grey", fill="grey") +
  geom_density(color = "blue", size = 0.5) +
  geom_vline(aes(xintercept=PermRes$DQ2Y[1]), 
             color="blue", linetype="dashed", size=1) +
  xlab("DQ2 values") +
  ylab("Frequency") +
  ggtitle("DQ2 Permutation test")+
  theme_graphs
```

We can then estimate the robustness of the built model with such plots.

# Prediction

The model can then be used to predict the response of new data with the same
structure as the input data. For instance, an attempt to repredict the response
of `demo_3_Omics` can be done as follows:

```{r prediction}
reprediction <- predict(copls.da, newdata = demo_3_Omics[BlockNames])
```

`Y` shows the estimated response from the model and `class` indicates the
class determined by the highest estimated response, along with the confidence
score measured by the margin between the highest estimated response and the
second highest value, and also by the softmax probability.

```{r prediction_output}
reprediction$Y
reprediction$class
```

# Reproducibility

This vignette was produced with the following R session configuration.

```{r reproducibility}
sessionInfo()
```

# References