%\VignetteIndexEntry{SIBER Vignette}
%\VignetteDepends{mclust,edgeR,doParallel}
%\VignetteKeywords{SIBERG}
%\VignettePackage{SIBERG}

\documentclass{article}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{amsmath}
\usepackage{natbib}

\pagestyle{myheadings}
\markright{Pan Tong}

\setlength{\topmargin}{0in}
\setlength{\textheight}{8in}
\setlength{\textwidth}{6.5in}
\setlength{\oddsidemargin}{0in}
\setlength{\evensidemargin}{0in}

\title{SIBERG User Manual} 
\author{Pan Tong and Kevin R Coombes}
\date{\today}

\usepackage{Sweave}
\begin{document}

\SweaveOpts{prefix.string=SIBER}

\maketitle

\tableofcontents

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% alien part for debuging bibtex (from GRTS package)

%\chapter{Short introduction to Generalized Random Tessellation Stratified sampling (GRTS)}
%Yet to be written\ldots \citep{Stevens_Olsen_2004, Stevens_Olsen_2003, Stevens_Olsen_1999, Theobald_etal_2007}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% alien part for debuging bibtex


<<echo=FALSE>>=
options(width=80)
options(continue=' ')
@ 

\section{Introduction}
\texttt{SIBERG} (\textbf{S}ystematic \textbf{I}dentification of
\textbf{B}imodally \textbf{E}xp\textbf{R}essed \textbf{G}enes using RNAseq
data) is an R package that effectively identifies bimodally expressed
genes from RNAseq data based on Bimodality Index. SIBER models the
RNAseq data in the finite mixture modeling framework and incorporates
mechanisms for dealing with RNAseq normalization. Three types of
mixture models are implemented, namely, the mixture of log normal,
negative binomial or generalized poisson distribution. For
completeness, we also add the normal mixture model that has been used
to identify bimodal genes from microarray data.

\texttt{SIBER} proceeds in two steps. The first step fits a
two-component mixture model.  The second step calculates the
Bimodality Index corresponding to the assumed mixture distribution.
Four types of mixture models are implemented: log normal (LN),
Negative Binomial (NB), Generalized Poisson (GP) and normal mixture
(NL).

Besides identifying bimodally expressed genes, \texttt{SIBER} provides
functionalities to fit 2-component mixture distribution from LN, NB
and GP models. A degenerate case where one component becomes a point
mass at zero (called 0-inflation) is also incorporated. The 0-inflated
model is designed specificaly to deal with the observed zero count in
real RNAseq data.

\section{Using SIBER}

\subsection{A Quick Example}

Of course, we need to load the \texttt{SIBER} package.
<<loadLibrary>>=
library(SIBERG)
@ 

We simulate RNAseq count data from 1-component Negative Binomial
distribution as below:

<<keep.source=TRUE>>=
set.seed(1000)
N <- 100 # sample size
G <- 200 # number of simulated genes
# RNAseq count data simulated from NB model with mean 1000, dispersion=0.2
Dat <-  matrix(rnbinom(G*N, mu=1000, size=1/0.2), nrow=G) 
@

We use the first gene for our illustration.  We first fit the LN
mixture model and calculate BI:

<<>>=
SIBER(y=Dat[1, ], model='LN')
@ 

To apply the NB model: 

<<keep.source=TRUE>>=
SIBER(y=Dat[1, ], model='NB')
@

To apply the GP model: 

<<>>=
SIBER(y=Dat[1, ], model='GP')
@

For the NL model, we first transform the data such that it follows
normal mixture distribution.

<<>>=
SIBER(y=log(Dat[1, ]+1), model='NL')
@ 


Since the data is simulated from 1-component model, all of the
calculated BIs are small indicating lack of bimodality.

\subsection{Dealing With RNAseq Normalization}
Previously, only the raw RNAseq count data is passed to
\texttt{SIBER}. It is easy to incorporate RNAseq normalization in the
mixture modeling. Currently, the RPKM \citep{Mortazavi2008}, TMM
\citep{Robinson2010a} and RLE \citep{Anders2010} methods have been
widely used to normalize RNAseq data.  Once the normalization constant
is estimated, i.e. using the \texttt{edgeR} package
\citep{Robinson2010}, we can easily calculate the BI after adjusting
for the normalization.

In the following, we use \texttt{edgeR} package to calculate the
normalization factor using TMM approach.
<<>>=
if (require(edgeR)) {
  TMM <- calcNormFactors(Dat, method='TMM')
} else {
  # manually set factors from previous computations
  TMM <- c(1.0390711, 0.9813734, 1.0091593, 0.9641022, 1.0137000,
           1.0188657, 0.9648757, 0.9956814, 0.9689530, 0.9774278,
           1.0059115, 1.0076910, 0.9923854, 1.0121838, 1.0249094,
           1.0403172, 0.9887074, 1.0003546, 0.9998479, 0.9844905,
           1.0040203, 0.9692244, 0.9987567, 1.0063895, 0.9954510,
           1.0204917, 0.9717720, 1.0317981, 0.9826344, 0.9817171,
           0.9949059, 0.9745569, 0.9652138, 1.0075196, 0.9879748,
           0.9929244, 0.9895606, 1.0144117, 1.0612923, 0.9626716,
           1.0049376, 1.0192416, 0.9826612, 1.0234523, 0.9921186,
           1.0029780, 1.0199930, 1.0054256, 1.0152748, 0.9655475,
           0.9919175, 1.0231102, 0.9750882, 0.9958528, 1.0268000,
           0.9651300, 1.0158949, 0.9803130, 1.0385707, 0.9870510,
           1.0211765, 1.0326759, 1.0234579, 0.9524254, 0.9742719,
           0.9887936, 1.0476640, 0.9787385, 0.9992178, 1.0046021,
           0.9929379, 0.9595237, 1.0690364, 0.9910940, 1.0158325,
           0.9799790, 1.0316363, 1.0341890, 1.0036944, 0.9728850,
           1.0080238, 1.0190104, 0.9735436, 0.9744903, 0.9974915,
           0.9804733, 1.0243671, 0.9881085, 0.9923432, 0.9638553,
           1.0178705, 1.0476191, 1.0260725, 1.0474791, 1.0449745,
           0.9987096, 1.0028339, 0.9971751, 0.9487246, 0.9696386)
}
@

We now incorporate the TMM normalization into SIBER. We use the LN
model below. The calculation with other models is similar.  Note that
our definition of the normalization factor differs from \texttt{edgeR}
package. In our notation, $\textrm{E} [C_s]=d_s \mu_{c(s)}$ where
$C_s$ is the observed raw count for sample s, $d_s$ is the
normalization factor applied to sample s, c(s)=\{1, 2\} denotes which
of the two components sample s comes from and $\mu_1, \mu_2$ are mean
parameters for the two components. Therefore, our definition of $d_s$
maps the true expression level to the observed counts.  In contrast,
the normalization constant estimated by \texttt{edgeR} maps the
observed counts to the estimated true expression. As a result, we need
to pass the reciprocal of the normalization vector estimated by
\texttt{edgeR} to SIBER.

<<>>=
SIBER(y=Dat[1, ], d=1/TMM, model='LN')
@

\subsection{Parallelizing \texttt{SIBER}}
When there are many genes to be fitted, we can easily parallel
\texttt{SIBER} to speed up the computation.  There are several ways
for parallelization. Here we choose the \texttt{foreach} package for
the backend. The workers are requested and registered by the
\texttt{doSNOW} package.

%Note that, although the parallel code works fine on Windows after the
%program is installed, it causes an error when checking the package on
%Windows.  As a result, we disable this section of the code if running
%on Windows.
%<<notWin>>=
%notwin <- .Platform$OS != "windows"
%@ 


%
% parallel is commented out so that we can pass R-Forge check due to snow package is not installed on the server
%
\begin{verbatim}
    library(doParallel)
    cl <- makeCluster(3, type = "SOCK")
    registerDoParallel(cl)
\end{verbatim}


Note that the above command also works on Linux servers. However, it
requests master nodes when run within R. For good practice, we can use
qsub such that the computation is done in the compute nodes.

Below we illustrate how to use SIBER with parallel computation.

%<<parallelAnalysis>>=
%if (notwin) {
\begin{verbatim}
    func <- function(i) {
	SIBER(y=Dat[i, ], model='LN')
    }
    BIinfo_LN <- foreach(i=1:nrow(Dat), 
                         .combine='rbind', 
                         .packages='SIBER') %dopar% {
	func(i)
    }
\end{verbatim}
%    BIinfo_LN[1:3, ]
%}
%@


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Fitting Two-component Mixture Models}
\texttt{SIBER} package provides functions to fit three types of
mixture models besides detecting bimodally expressed genes. These
include: (1) 2-component mixture with equal dispersion or variance (E
model); (2) 2-component mixture with unequal dispersion or variance (V
model); (3) 0-inflated model.  All three types of distributions are
implemented.

The rule to fit 0-inflated model is that the observed percentage of
count exceeds the user specified threshold. This rule overrides the
model argument (E or V) when observed percentae of zero count exceeds
the threshold.

First, we illustrate how to fit the E and V models. We use the
simulated data from LN model. The gene we use is not 0-inflated. By
default, the minimum observed percentage of zero is not achieved
(zeroPercentThr=0.2 ). Hence, the 0-inflated model is disabled. In
this case, the model specification will be effective.

<<keep.source=TRUE>>=
data(simDat)
ind <- 1
# true parameter generating the simulated data
parList$LN[ind, ]
# fit by E model
fitLN(y=dataList$LN[ind, ], base=exp(1), eps=1, model='E')
# fit by V model. 
fitLN(y=dataList$LN[ind, ], base=exp(1), eps=1, model='V')

@

Now we choose a gene that has zero inflation and illustrate how to fit
a 0-inflated model:

<<keep.source=TRUE>>=
ind <- 5 # 0-inflated gene
# true parameter generating the simulated data
parList$LN[ind, ]
# fit by E model. 0-inflated model is disabled by setting zeroPercentThr=1.
# the result is biased. 
fitLN(y=dataList$LN[ind, ], base=exp(1), eps=1, model='E', zeroPercentThr=1)
# fit by 0-inflated model. 0-inflated model overrides the E model since percentage
# of observed zero counts exceeds the threshold.
fitLN(y=dataList$LN[ind, ], base=exp(1), eps=1, model='E', zeroPercentThr=0.2)

@

Here we see that when there is severe 0-inflation, fitting a E (or V)
model gives biased estimate. Instead, our 0-inflated model works
pretty well.

The usage of fitNB(), fitGP() is quite similar and is omitted in this
manual.

\section{Session Info}
After all the computations, we close the connection to the workers.

%<<>>=
%if(notwin) {
\begin{verbatim}
    stopCluster(cl)
\end{verbatim}
%    }
%@


<<sessionInfo>>=
getwd()
sessionInfo()
@ 


\bibliographystyle{plainnat}
\bibliography{siber}

\end{document}