% \VignetteIndexEntry{CePa Vignette}
% \VignetteDepends{CePa}
% \VignetteKeywords{Pathway Enrichment Analysis}
% \VignettePackage{CePa}



\documentclass[a4paper]{article}

\title{Centrality-based Pathway Enrichment}

\author{Zuguang Gu}

\usepackage{Sweave}
\begin{document}

\maketitle 
\section{Introduction}
Gene set enrichment analysis is broadly used in microarray data analysis \cite{Khatri2005,Huang2009a}.
It aimes to find which biological functions are affected by a group of 
related genes behind the massive information. The most used methotology is finding
these significant gene set from a 2 $\times$ 2 contingency table, usually by Fisher's
exact test or chi-square test. This kind of analysis is known as Over-represented
Analysis (ORA). It takes a list of differential expressed gene, and returns significant
gene sets that the differential genes are enriched in. A lot of methods have been
developed under the framework of ORA such as {DAVID} \cite{Huang2009b} (https://david.abcc.ncifcrf.gov/)
and \texttt{GOstats} package \cite{Falcon2007}. The second methodology to find significant
pathways is to use whole expression matrix, named Gene-set Analysis (GSA). 
GSA methods are implemented via either a univariate or a multivariate procedure \cite{Ackermann2009}. 
In univariate analysis, gene level statistics are initially calculated from fold changes 
or statistical tests (e.g., {\it t}-test). These statistics are then combined into 
a pathway level statistic by summation or averaging. GSEA \cite{Subramanian2005} 
is a widely used univariate tool that utilizes a weighted Kolmogorov-Smirnov test 
to measure the degree of differential expression of a gene set by calculating a 
running sum from the top of a ranked gene list. Multivariate analysis considers 
the correlations between genes in the pathway and calculates the pathway level 
statistic directly from the expression value matrix using Hotelling's $T^2$ test \cite{Song2006} 
or MANOVA models \cite{Hummel2008}.

For a specific form of gene sets, biological pathways are collections of correlated genes/proteins,
RNAs and compounds that work together to regulate specific biological
processes. Instead of just being a list of genes, a pathway contains 
the most important information that is how the member genes interact 
with each other. Thus network structure information is necessary for
the intepretation of the importance of the pathways.

In this package, the original pathway enrichment method
(ORA and GSA) is extended by introducing network centralities as the weight 
of nodes which have been mapped from differentially expressed genes 
in pathways \cite{Gu2012}. There are two advantages compared to former methods.
First, for the diversity of genes' characters and the difficulties of 
covering the importance of genes from all aspects, we do not design a 
fixed measurement for each gene but set it as an optional parameter in the model. 
Researchers can select from candidate choices where different measurement 
reflects different aspect of the importance of genes. 
In our model, network centralities are used to measure the importance of genes in pathways. 
Different centrality measurements assign the importance to nodes from different aspects. 
For example, degree centrality measures the amount of neighbours that 
a node directly connects to, and betweenness centrality measures how many 
information streams must pass through a certain node. Generally speaking, 
nodes having large centrality values are central nodes in the network. 
It's observed that nodes represented as metabolites, proteins or genes 
with high centralities are essential to keep the steady state of biological networks. 
Moreover, different centrality measurements may relate to different biological functions. 
The selection of centralities for researchers depends on what kind of genes 
they think important. Second, we use nodes as the basic units of pathways 
instead of genes. We observe that nodes in the pathways include different 
types of molecules, such as single gene, complex and protein families. 
Assuming a complex or family contains ten differentially expressed member genes, 
in traditional ORA, these ten genes behave as the same position as other
genes represented as single nodes, and thus they have effect of ten. 
It is not proper because these ten genes stay in a same node in the 
pathway and make functions with the effect of one node. Also, 
a same gene may locate in different complexes in a pathway and if 
taking the gene with effect of one, it would greatly decrease the importance 
of the gene. Therefore a mapping procedure from genes to pathway nodes 
is applied in our model. What's more, the nodes in pathways also include 
non-gene nodes such as microRNAs and compounds. These nodes also 
contribute to the topology of the pathway. So, when analyzing pathways, 
all types of nodes are retained.

\section{Pathway Catalogue}
Pathways are collected from public databases, such as PID, KEGG, BioCarta etc.
In \texttt{CePa} package, four catalogues (PID, KEGG, BioCarta and Reactome) from PID database have been integrated.
The pathway data are parsed from XML format file provided by the PID FTP site. 
The Perl code for parsing can be obtained from the author's website 
(https://mcube.nju.edu.cn/jwang/lab/soft/cepa/). The pathway data is stored
in \texttt{PID.db}. Note only part of pathways in the XML file are listed on the PID website. Also, 
we have set the minimum and maximum connected nodes when extracting pathways 
from PID, so not all the pathways listed on the PID website are in \texttt{PID.db}.
\begin{Schunk}
\begin{Sinput}
> library(CePa)
> data(PID.db)
> names(PID.db)
\end{Sinput}
\begin{Soutput}
[1] "NCI"      "BioCarta" "KEGG"     "Reactome"
\end{Soutput}
\end{Schunk}

Each pathway catalogue has been stored as a \texttt{pathway.catalogue} class object.
The \texttt{print.pathway.catalogue} function simply prints the number of pathways in the catalogue.
The \texttt{plot.pathway.catalogue} function visulizes general information of the catalogue (figure \ref{f1}).
It plot:  A) Distribution of the number of member genes in each node; 
B) Distribution of the number of nodes in which a single gene resides; C) Relationship 
between node count and gene count in biological pathways. 
\begin{Schunk}
\begin{Sinput}
> class(PID.db$NCI)
\end{Sinput}
\begin{Soutput}
[1] "pathway.catalogue"
\end{Soutput}
\begin{Sinput}
> PID.db$NCI
\end{Sinput}
\begin{Soutput}
  The catalogue contains 206 pathways.
\end{Soutput}
\begin{Sinput}
> plot(PID.db$NCI)
\end{Sinput}
\end{Schunk}

\begin{figure}[htbp]
\begin{center}
\includegraphics[width=0.99\textwidth]{f1.pdf}
\caption{Meta analysis of pathway catalogue}\label{f1}
\end{center}
\end{figure}

The pathway catalogue data contains a list of pathways and each pathway contains 
a list of interactions. There are several parts in the pathway data where three of them is must: the pathway list,
the interaction list and the mapping list. The corresponding list name are \texttt{pathList},
\texttt{interactionList} and \texttt{mapping}.
\begin{Schunk}
\begin{Sinput}
> names(PID.db$NCI)
\end{Sinput}
\begin{Soutput}
[1] "pathList"        "interactionList" "mapping"         "node.name"      
[5] "node.type"       "version" 
\end{Soutput}
\end{Schunk}

You can find the version of NCI data.
\begin{Schunk}
\begin{Sinput}
> PID.db$NCI$version
\end{Sinput}
\begin{Soutput}
[1] "2012_07_19 09:34::20"
\end{Soutput}
\end{Schunk}

The \texttt{pathList} is a list in which each item is a list of interaction IDs
\begin{Schunk}
\begin{Sinput}
> head(PID.db$NCI$pathList, n = 2)
\end{Sinput}
\begin{Soutput}
$wnt_signaling_pathway
 [1] "203098" "203087" "203104" "203106" "203125" "203127" "203092" "203142"
 [9] "203097" "203111" "203099" "203118" "203103" "203137" "203143" "203091"
[17] "203088" "203141" "203128" "203089" "203101" "203117" "203126" "203140"
[25] "203095" "203108" "203119" "203129" "203113" "203120" "203094" "203102"
[33] "203122" "203136" "203145" "203105" "203123" "203144" "203130" "203132"
[41] "203124" "203107" "203110" "203093" "203133" "203090" "203096" "203109"
[49] "203100" "203121" "203116" "203112"

$cdc42_reg_pathway
 [1] "203416" "203396" "203420" "203393" "203405" "203418" "203392" "203415"
 [9] "203388" "203408" "203389" "203390" "203403" "203412" "203398" "203410"
[17] "203406" "203395" "203391" "203401" "203394" "203419" "203404" "203397"
[25] "203399" "203407" "203402" "203400" "203413" "203411" "203409" "203414"
\end{Soutput}
\end{Schunk}

The \texttt{interactionList} is a three-column matrix in which the first column
is the interaction ID, the second column is the input node ID and the third column
is the output node ID.
\begin{Schunk}
\begin{Sinput}
> head(PID.db$NCI$interactionList)
\end{Sinput}
\begin{Soutput}
  interaction.id  input output
1         503376 507485 506711
2         503376 507487 507485
3         204164 202538 208490
4         204164 208487 208490
5         100688 101169 101176
6         100688 101177 101176
\end{Soutput}
\end{Schunk}

The \texttt{mapping} is the two-column matrix in which the first column is the node ID
and the second column is the gene ID.
\begin{Schunk}
\begin{Sinput}
> head(PID.db$NCI$mapping)
\end{Sinput}
\begin{Soutput}
  node.id  symbol
1  202230 ARHGAP6
2  201405    XIAP
3  503376  SLC7A2
4  203548   SATB1
5  201647    CRY2
6  508774     CRH
\end{Soutput}
\end{Schunk}

The pathway catalogue can also be self-defined by \texttt{set.pathway.catalogue} function. The
function returns a \texttt{pathway.catalogue} class object. E.g. we only need the
first ten pathways in NCI catalogue.
\begin{Schunk}
\begin{Sinput}
> new.catalogue = set.pathway.catalogue(pathList = PID.db$NCI$pathList[1:10],
+                 interactionList = PID.db$NCI$interactionList,
+                 mapping = PID.db$NCI$mapping)
\end{Sinput}
\end{Schunk}

In the following examples, we will use NCI catalogue as the default pathway catalogue.

\section{ORA Extension}
The pathway score is defined as the summation of the weights of differentially affected 
nodes in the pathway:
\begin{equation}\label{equation:1}
    s = \sum^n_{i = 1}{w_id_i}
\end{equation}

where $s$ is the score of the pathway, $w_i$ is the weight of the $i^{th}$ node 
and reflects the importance of the node, $n$ is the number of nodes in the pathway, 
and $d_i$ identifies whether the $i^{th}$ node is differentially affected ( $= 1$) or not ( $= 0$).

The \texttt{CePa} package needs a differentially expressed gene list and a background gene list.
The differential gene list can be obtained through variaty of methods such as
{\it t}-test, SAM \cite{Tusher2001} and limma \cite{Smyth2005}. The background gene list is the complete category of genes
that exist on a certain microarray platform or from the whole genome. The \texttt{CePa} package
contains an example gene list and a background gene list. The gene list is obtained
from a microarray study by {\it t}-test \cite{Burchard2010}.
\begin{Schunk}
\begin{Sinput}
> data(gene.list)
> names(gene.list)
\end{Sinput}
\begin{Soutput}
[1] "bk"  "dif"
\end{Soutput}
\end{Schunk}

In order to find significant pathways under several centrality measurements, we use 
\texttt{cepa.all} function.In the function, \texttt{dif} refers to the differential
gene list, \texttt{bk} refers to the background gene list and the \texttt{pc} refers
to the pathway catalogue.
\begin{Schunk}
\begin{Sinput}
> res = cepa.all(dif = gene.list$dif, bk = gene.list$bk,
+                pc = PID.db$NCI)
\end{Sinput}
\begin{Soutput}
  Calculate pathway scores...
    1/211, wnt_signaling_pathway...
      - equal.weight: 0.878
      - in.degree: 0.864
      - out.degree: 0.921
      - betweenness: 0.89
      - in.reach: 0.81
      - out.reach: 0.91
    ...
\end{Soutput}
\end{Schunk}

The differential gene list and the background gene list should be indicated
with the same identifiers (e.g. gene symbol or refseq ID). All genes in
the differential gene list should exist in the background gene list. In this
example, since \texttt{PID.db} is applied, gene list must be formatted as gene symbol. If background gene list is
not specified, the function use whole human genome genes as default.

By default, \texttt{cepa.all} calls \texttt{equal.weight}, \texttt{in.degree}, \texttt{out.degree},
\texttt{betweenness}, \texttt{in.reach} and \texttt{out.reach} centralities as pathway nodes' weight.
More centrality measurements can be used by setting it as a function (such as closeness,
cluster coefficient). The non-default centralities can be set by \texttt{cen} argument, and remember
to set the \texttt{cen.name} argument to get the name of the centrality. Note you can mix the centralities
in string format and function format. When you set the function object, the only parameter for the
function is the network in \texttt{igraph} object format. The following codes are examples to set the centralities.
\begin{Schunk}
\begin{Sinput}
> # if you use the function, you should quote the function
> # because we need the function name as the centrality name
> cepa.all(dif = gene.list$dif, bk = gene.list$bk,
+          pc = PID.db$NCI,
+          cen = list("in.degree", quote(closeness)))
\end{Sinput}
\end{Schunk}

Moreover, if your centrality function contains more than one argument, you must
wrap to a function that only have one \texttt{igraph} object argument.
\begin{Schunk}
\begin{Sinput}
> in.closeness = function(g) closeness(g, mode = "in")
> cepa.all(dif = gene.list$dif, bk = gene.list$bk,
+          pc = PID.db$NCI,
+          cen = list("in.degree", quote(in.closeness)))
> # If you don't like the function name to be centrality name
> # you can set by cen.name argument
> cepa.all(dif = gene.list$dif, bk = gene.list$bk,
+          pc = PID.db$NCI,
+          cen = list("in.degree", quote(in.closeness)),
+          cen.name = c("In-degree", "In-closeness"))
\end{Sinput}
\end{Schunk}

In order to generate the null distribution of the pathway score, novel differential
gene list is sampled from the background gene list. P-values are calculated from 1000 simulations by default.

The calculation would spend about 12 min. \texttt{res} is a \texttt{cepa.all} class object. To see the general information
of this object:
\begin{Schunk}
\begin{Sinput}
> res
\end{Sinput}
\begin{Soutput}
number of pathways: 211 

Significant pathways (p.value <= 0.01):
             Number
equal.weight     20
in.degree        19
out.degree       19
betweenness      14
in.reach         19
out.reach        20
\end{Soutput}
\end{Schunk}

It will print the number of significant pathways under different centralities.
For ORA extension, \texttt{cepa.all} in fact calls \texttt{cepa.ora.all} function.
So the following code is same as the former code.
\begin{Schunk}
\begin{Sinput}
> res = cepa.ora.all(dif = gene.list$dif, bk = gene.list$bk,
+       pc = PID.db$NCI)
\end{Sinput}
\end{Schunk}

The p-values or adjusted p-values of all pathways under different centralities
can be compared through the heatmap of p-values (Figure \ref{f2}). Users can select methods to adjust
raw p-values.
\begin{Schunk}
\begin{Sinput}
> plot(res)
\end{Sinput}
\end{Schunk}
\begin{figure}[htbp]
\begin{center}
\includegraphics[width=0.99\textwidth]{f2.pdf}
\caption{Heatmap of p-values of all pathways}\label{f2}
\end{center}
\end{figure}

By default, \texttt{CePa} use \texttt{p.adjust} to calculate adjusted p-values,
so only methods valid for \texttt{p.adjust} can be applied to \texttt{CePa}. However,
there is another popular method to adjust p-values: \texttt{qvalue}. \texttt{CePa}
did not implement it since errors may occur when evaluating some kind of p-values.
Nevertheless, users can override the default \texttt{p.adjust} to support \texttt{qvalue} 
by themselves, use code below:
\begin{Schunk}
\begin{Sinput}
> library(qvalue)
> p.adjust = function(p, method = c("holm", "hochberg", "hommel", "bonferroni",
+                   "BH", "BY", "fdr", "none", "qvalue"), ...) {
+   if(method == "qvalue") {
+       # qvalue has more arguments, pass them by ...
+       qvalue(p, ...)$qvalue
+   } else {
+       stats::p.adjust(p, method)
+   }
+ }
\end{Sinput}
\end{Schunk}

R will first look for \texttt{p.adjust} in \texttt{.GlobalEnv} environment and get your own \texttt{p.adjust}.

By default, \texttt{plot} generates the heatmap containing all pathways.
If only significant pathways are of interest, the \texttt{only.sig} argument
can be set to \texttt{TRUE}. (Figure \ref{f3}). Here we do not set \texttt{cutoff}
arguments because if adjusted method is used, the default cutoff is 0.05, while if user
just wants the raw p-values, the default cutoff is 0.01.
\begin{Schunk}
\begin{Sinput}
> plot(res, adj.method = "BH", only.sig = TRUE)
\end{Sinput}
\end{Schunk}
\begin{figure}[htbp]
\begin{center}
\includegraphics[width=0.99\textwidth]{f3.pdf}
\caption{Heatmap of p-values of significant pathways}\label{f3}
\end{center}
\end{figure}

The numeric values of p-values can be obtained via \texttt{p.table}. The function
just returns the raw p-values.
\begin{Schunk}
\begin{Sinput}
> pt = p.table(res)
> head(pt)
\end{Sinput}
\begin{Soutput}
                        equal.weight  in.degree  out.degree betweenness
wnt_signaling_pathway    0.878121878 0.86413586 0.921078921 0.890109890
cdc42_reg_pathway        0.858141858 0.84515485 0.818181818 0.852147852
mtor_4pathway            0.777222777 0.80319680 0.707292707 0.596403596
plk3_pathway             0.007992008 0.01598402 0.007992008 0.002997003
era_genomic_pathway      0.079920080 0.08891109 0.047952048 0.074925075
insulin_glucose_pathway  1.000000000 1.00000000 1.000000000 1.000000000
                          in.reach   out.reach
wnt_signaling_pathway   0.81018981 0.910089910
cdc42_reg_pathway       0.84815185 0.859140859
mtor_4pathway           0.83616384 0.422577423
plk3_pathway            0.01398601 0.005994006
era_genomic_pathway     0.09290709 0.017982018
insulin_glucose_pathway 1.00000000 1.000000000
\end{Soutput}
\end{Schunk}

We can get the result for single pathway under specific centrality from the 
\texttt{cepa.all} object by identifying the index for the pathway and the index
for the centrality.
\begin{Schunk}
\begin{Sinput}
> g = get.cepa(res, id = "mapktrkpathway", cen = "in.reach")
> g
\end{Sinput}
\begin{Soutput}
  procedure: ora 
  weight: in.reach 
  p-value: 0.002 
\end{Soutput}
\end{Schunk}

\texttt{g} is a \texttt{cepa} class object. It stores information of the
evaluation of a single pathway under a single centrality. The distribution of the
pathway score and the network graph can be generated by \texttt{plot} function on the 
\texttt{cepa} object by specifying \texttt{type} argument (figure \ref{f4} and figure \ref{f5}).
\begin{Schunk}
\begin{Sinput}
> plot(g, type = "graph")
\end{Sinput}
\end{Schunk}
\begin{figure}[htbp]
\begin{center}
\includegraphics[width=0.99\textwidth]{f4.pdf}
\caption{Network visualization of a pathway}\label{f4}
\end{center}
\end{figure}

\begin{Schunk}
\begin{Sinput}
> plot(g, type = "null")
\end{Sinput}
\end{Schunk}
\begin{figure}[htbp]
\begin{center}
\includegraphics[width=0.99\textwidth]{f5.pdf}
\caption{Null distribution of pathway score}\label{f5}
\end{center}
\end{figure}

By default, \texttt{type} is set to \texttt{graph}, and the node labels is combined from member genes. The exact name for each
node can be set by \texttt{node.name} argument. Also, more detailed categories
of the nodes can be set by \texttt{node.type} argument (Figure \ref{f6}). 
\begin{Schunk}
\begin{Sinput}
> plot(g, node.name = PID.db$NCI$node.name,
+      node.type = PID.db$NCI$node.type)
\end{Sinput}
\end{Schunk}
\begin{figure}[htbp]
\begin{center}
\includegraphics[width=0.99\textwidth]{f6.pdf}
\caption{Network visualization of a pathway, with node name and node type specified}\label{f6}
\end{center}
\end{figure}

For simplicity, the plotting for the \texttt{cepa} object can be directly applied
on the \texttt{cepa.all} object by specifying the index of the pathway and the 
index of the centrality (Figure \ref{f6}).
\begin{Schunk}
\begin{Sinput}
> plot(res, id = "mapktrkpathway", cen = "in.reach")
> plot(res, id = "mapktrkpathway", cen = "in.reach", type = "null")
> plot(res, id = "mapktrkpathway", cen = "in.reach",
+      node.name = PID.db$NCI$node.name,
+      node.type = PID.db$NCI$node.type)
\end{Sinput}
\end{Schunk}

If users use \texttt{plot} to draw network graphs, the function would return an
\texttt{igraph} object. So if users are not satisfy with the default graph, they
can visulize by their own methods. An example of custumize your network can be found
at the next section.
\begin{Schunk}
\begin{Sinput}
> obj = plot(res, id = "mapktrkpathway", cen = "in.reach")
> class(obj)
[1] "igraph"
\end{Sinput}
\end{Schunk}

The \texttt{igraph} package provides a \texttt{write.graph} function to output
graph into several formats. As I have tried, with \texttt{graphml} format, 
Cytoscape Web \cite{Lopes15092010} (https://cytoscapeweb.cytoscape.org/) can make a more beautiful
and interactive visualization of the network.
\begin{Schunk}
\begin{Sinput}
> write.graph(obj, file = "example-network.xml", format = "graphml")
> write.graph(obj, file = "example-network.gml", format = "gml")
\end{Sinput}
\end{Schunk}


Instead of analysis a list of pathways, users can also be focused on a single pathway
under a single centrality by identifying the id of the pathway in the catalogue.
\begin{Schunk}
\begin{Sinput}
> res.pathway = cepa(dif = gene.list$dif, bk = gene.list$bk,
+               pc = PID.db$NCI, "mapktrkpathway",
+               cen = "in.reach")
\end{Sinput}
\end{Schunk}

Similarly, \texttt{cepa} function here directly calls \texttt{cepa.ora}.

\section{GSA extension}

In the traditional univariate GSA procedure, the score $s$ of the pathway is defined as:
\begin{equation}\label{equation:3}
    s = f(\mathbf{g})
\end{equation}

where $f$ transforms the gene-level statistic to a pathway-level statistic 
(e.g. by summation, averaging) and $\mathbf{g}$ is the gene-level statistic vector which typically 
comprises $t$-values. In ORA, $\mathbf{g}$ is a binary variant and $f(\mathbf{g})$ is summation. 
In our model to extend GSA, gene-level statistic is first transformed to node-level 
statistic. We define the vector of the node-level statistics as $\mathbf{d}$. 
When nodes in pathways comprise multiple genes, the node-level statistic can be 
considered as the largest principle component of the corresponding member genes. 
Using centrality as the weight, the score is defined as
\begin{equation}\label{equation:4}
    s = f(\mathbf{wd})
\end{equation}

where $\mathbf{w}$ is the weight vector and the transformation function $f$ acts upon 
the product of $\mathbf{w}$ and $\mathbf{d}$. Equation \ref{equation:4} incorporates centrality weight into the original node-level statistic. 
The null distribution of the pathway score could then be generated by permuting the gene expression matrix.

Since GSA procedure need a complete expression matrix, we first read the P53 microarray data set.
The \texttt{P53\_symbol.gct} and \texttt{P53.cls} can be downloaded from
https://mcube.nju.edu.cn/jwang/lab/soft/cepa/. \texttt{read.gct} and \texttt{read.cls}
are simple functions to read expression data and phenotype data.
\begin{Schunk}
\begin{Sinput}
> eset = read.gct("P53_symbol.gct")
> # some process of the names of genes
> rownames(eset) = gsub("\\s+.*$", "", rownames(eset))
> label = read.cls("P53.cls", treatment="MUT", control="WT")
\end{Sinput}
\end{Schunk}

Here, we also use \texttt{cepa.all} to do batch pathway analysis. The following
code spent about 38 min with 1000 sample permutations.
\begin{Schunk}
\begin{Sinput}
> res = cepa.all(mat = eset, label = label, pc = PID.db$NCI,
+                nlevel = "tvalue_sq", plevel = "mean")
\end{Sinput}
\begin{Soutput}
  Calculate gene level values.
  Calculate pathway score...
    1/211, wnt_signaling_pathway...
      Calculate node level value and permutate sample labels...
      17 genes measured in the pathway...
      - equal.weight: 0.587
      - in.degree: 0.652
      - out.degree: 0.777
      - betweenness: 0.466
      - in.reach: 0.56
      - out.reach: 0.696
    ...
\end{Soutput}
\end{Schunk}

Here, we use \texttt{mat} and \texttt{label} arguments instead of \texttt{dif} and 
\texttt{bk} arguments. In fact, when specifying \texttt{mat} and \texttt{label} arguments,
\texttt{cepa.all} calls \texttt{cepa.univaraite.all}.

In GSA procedure, first a node level statistic should be calculated. In \texttt{CePa} package,
there are three methods to calculate node level statistics. User can choose from \texttt{tvalue},
\texttt{tvalue\_abs} and \texttt{tvalue\_sq}. \texttt{tvalue\_abs} is choosen as
the default node level method because it can capture two directional regulations.
After we get the node level statistics in the pathway, a pathway level transformation
should be applied. User can choose from \texttt{max}, \texttt{min}, \texttt{median}, 
\texttt{sum}, \texttt{mean} and \texttt{rank}. \texttt{mean} is taken as default.

The node level statistic can be self-defined. The self-defined function should only
contain two argumetns, one for vector of expression value in treatment class and one
for that in control class. E.g. we set the node level statistic as kind of robust t-value:
\begin{Schunk}
\begin{Sinput}
> robust_tvalue = function(x, y) {
+    qx = quantile(x, c(0.1, 0.9))
+    qy = quantile(y, c(0.1, 0.9))
+
+    x = x[(x <= qx[2]) & (x >= qx[1])]
+    y = y[(y <= qy[2]) & (y >= qy[1])]
+
+    n1 = length(x)
+    n2 = length(y)
+    v1 = var(x)
+    v2 = var(y)
+    ifelse(v1 + v2 == 0, 0, (mean(x) - mean(y)) / sqrt(v1/n1 + v2/n2))
+ }
> res = cepa.all(mat = eset, label = label, pc = PID.db$NCI,
+                nlevel = robust_tvalue, plevel = "mean")
\end{Sinput}
\end{Schunk}

Similarly, the pathway level transformation can also be self-defined:
\begin{Schunk}
\begin{Sinput}
> trim_mean = function(x) mean(x, trim = 0.2)
> res = cepa.all(mat = eset, label = label, pc = PID.db$NCI,
+                nlevel = "tvalue_abs", plevel = trim_mean)
\end{Sinput}
\end{Schunk}

Print the general result of the analysis and plot figures (figure \ref{f8}).
\begin{Schunk}
\begin{Sinput}
> res
\end{Sinput}
\begin{Soutput}
number of pathways: 211 

Significant pathways (p.value <= 0.01):
             Number
equal.weight      6
in.degree         6
out.degree        7
betweenness       5
in.reach          7
out.reach         7
\end{Soutput}
\end{Schunk}

\begin{Schunk}
\begin{Sinput}
> plot(res, only.sig = TRUE, adj.method = "BH", cutoff = 0.1)
\end{Sinput}
\end{Schunk}
\begin{figure}[htbp]
\begin{center}
\includegraphics[width=0.6\textwidth]{f8}
\caption{Heatmap of FDRs of significant pathways}\label{f8}
\end{center}
\end{figure}

If we are instread in p53 downstream pathway. First we extract this pathway under "in.degree"
centrality from \texttt{res}.

\begin{Schunk}
\begin{Sinput}
> g = get.cepa(res, id = "p53downstreampathway", cen="in.degree")
> g
\end{Sinput}
\begin{Soutput}
  procedure: gsa.univariate 
  weight: in.degree 
  p-value: 9.990e-04 
\end{Soutput}
\begin{Sinput}
> plot(g)
\end{Sinput}
\end{Schunk}
\begin{figure}[htbp]
\begin{center}
\includegraphics[width=0.99\textwidth]{f9}
\caption{Network visulization of a pathway}\label{f9}
\end{center}
\end{figure}

Figure \ref{f9} illustrates the graph of p53 downstream pathway. Since the pathway is evaluated
under GSA procedure, the color of each node is continues in which red refers to up-regulated,
green refers to down-regulated and white refers to no-change.

Maybe due to to many nodes in the graph, Figure \ref{f9} is really hard to read.
However, since the plotting function returns an \texttt{igraph} object. We can customize
it handly (Figure \ref{f12}).
\begin{Schunk}
\begin{Sinput}
> g2 = plot(g)
> # only label those nodes with high centralities
> V(g2)$label[V(g2)$size < quantile(V(g2)$size, 0.95)] = ""
> # we do not need margins
> par(mar = c(0,0,0,0))
> plot(g2)
\end{Sinput}
\end{Schunk}
\begin{figure}[htbp]
\begin{center}
\includegraphics[width=0.99\textwidth]{f12}
\caption{customize the network graph}\label{f12}
\end{center}
\end{figure}

\section{The \texttt{report} function}
One of the advantages of \texttt{CePa} package is that it can generate a detailed
report in HTML format. The function \texttt{report} is used to generate report.
The report will locate in the current working directory. By default it only generate
figures of the significant pathways, but this can be changed by setting \texttt{only.sig}
argument to \texttt{FALSE}.
\begin{Schunk}
\begin{Sinput}
> report(res)
\end{Sinput}
\begin{Soutput}
  generate images for ap1_pathway ...
  generate images for epopathway ...
  generate images for il12_stat4pathway ...
  generate images for foxm1pathway ...
  generate images for mapktrkpathway ...
  generate images for aurora_a_pathway ...
  ...

\end{Soutput}
\begin{Sinput}
> report(res, adj.method = "BH", cutoff = 0.2)
> report(res, only.sig = FALSE)
\end{Sinput}
\end{Schunk}

An example of the report can be found in figure \ref{f10}. After \texttt{CePa} version
0.4, the network for pathways can be viewed interactively by Cytoscape Web \cite{Lopes15092010} (figure \ref{f11}).
\begin{figure}[htbp]
\begin{center}
\includegraphics[width=0.8\textwidth]{f10}
\caption{An report of the CePa analysis}\label{f10}
\end{center}
\end{figure}

\begin{figure}[htbp]
\begin{center}
\includegraphics[width=0.7\textwidth]{f11}
\caption{Network is visualized by Cytoscape Web}\label{f11}
\end{center}
\end{figure}

\section{Parallel computing}
Since \texttt{CePa} evaluates pathways independently, the process can be realized
through parallel computing. In R statistical environment, there are many packages
focusing on parallel computing such as \texttt{snow}, \texttt{multicore}, etc. 
After version 4.0, the package implemented a \texttt{cepa.all.parallel} function to do
parallel computing.

\texttt{cepa.all.parallel} use \texttt{snow} package.

All the arguments for \texttt{cepa.all.parallel} are same as the arguments for \texttt{cepa.all}
except the \texttt{ncores} arguments. The \texttt{ncores} specifies the number of cores for
parallel computing.
\begin{Schunk}
\begin{Sinput}
> res = cepa.all.parallel(dif = gene.list$dif, bk = gene.list$bk,
+       pc = PID.db$NCI, ncores = 4)
> res = cepa.all.parallel(mat = eset, label = label, pc = PID.db$NCI,
        nlevel = "tvalue_sq", plevel = "mean", ncores = 4)
\end{Sinput}
\end{Schunk}

The returned value \texttt{res} is a \texttt{cepa.all} class object.

\bibliographystyle{abbrv}
\bibliography{bibliography}

\end{document}