%\VignetteIndexEntry{SIBER Vignette} %\VignetteDepends{mclust,edgeR,doParallel} %\VignetteKeywords{SIBERG} %\VignettePackage{SIBERG} \documentclass{article} \usepackage{graphicx} \usepackage{hyperref} \usepackage{amsmath} \usepackage{natbib} \pagestyle{myheadings} \markright{Pan Tong} \setlength{\topmargin}{0in} \setlength{\textheight}{8in} \setlength{\textwidth}{6.5in} \setlength{\oddsidemargin}{0in} \setlength{\evensidemargin}{0in} \title{SIBERG User Manual} \author{Pan Tong and Kevin R Coombes} \date{\today} \usepackage{Sweave} \begin{document} \SweaveOpts{prefix.string=SIBER} \maketitle \tableofcontents %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% alien part for debuging bibtex (from GRTS package) %\chapter{Short introduction to Generalized Random Tessellation Stratified sampling (GRTS)} %Yet to be written\ldots \citep{Stevens_Olsen_2004, Stevens_Olsen_2003, Stevens_Olsen_1999, Theobald_etal_2007} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% alien part for debuging bibtex <<echo=FALSE>>= options(width=80) options(continue=' ') @ \section{Introduction} \texttt{SIBERG} (\textbf{S}ystematic \textbf{I}dentification of \textbf{B}imodally \textbf{E}xp\textbf{R}essed \textbf{G}enes using RNAseq data) is an R package that effectively identifies bimodally expressed genes from RNAseq data based on Bimodality Index. SIBER models the RNAseq data in the finite mixture modeling framework and incorporates mechanisms for dealing with RNAseq normalization. Three types of mixture models are implemented, namely, the mixture of log normal, negative binomial or generalized poisson distribution. For completeness, we also add the normal mixture model that has been used to identify bimodal genes from microarray data. \texttt{SIBER} proceeds in two steps. The first step fits a two-component mixture model. The second step calculates the Bimodality Index corresponding to the assumed mixture distribution. Four types of mixture models are implemented: log normal (LN), Negative Binomial (NB), Generalized Poisson (GP) and normal mixture (NL). Besides identifying bimodally expressed genes, \texttt{SIBER} provides functionalities to fit 2-component mixture distribution from LN, NB and GP models. A degenerate case where one component becomes a point mass at zero (called 0-inflation) is also incorporated. The 0-inflated model is designed specificaly to deal with the observed zero count in real RNAseq data. \section{Using SIBER} \subsection{A Quick Example} Of course, we need to load the \texttt{SIBER} package. <<loadLibrary>>= library(SIBERG) @ We simulate RNAseq count data from 1-component Negative Binomial distribution as below: <<keep.source=TRUE>>= set.seed(1000) N <- 100 # sample size G <- 200 # number of simulated genes # RNAseq count data simulated from NB model with mean 1000, dispersion=0.2 Dat <- matrix(rnbinom(G*N, mu=1000, size=1/0.2), nrow=G) @ We use the first gene for our illustration. We first fit the LN mixture model and calculate BI: <<>>= SIBER(y=Dat[1, ], model='LN') @ To apply the NB model: <<keep.source=TRUE>>= SIBER(y=Dat[1, ], model='NB') @ To apply the GP model: <<>>= SIBER(y=Dat[1, ], model='GP') @ For the NL model, we first transform the data such that it follows normal mixture distribution. <<>>= SIBER(y=log(Dat[1, ]+1), model='NL') @ Since the data is simulated from 1-component model, all of the calculated BIs are small indicating lack of bimodality. \subsection{Dealing With RNAseq Normalization} Previously, only the raw RNAseq count data is passed to \texttt{SIBER}. It is easy to incorporate RNAseq normalization in the mixture modeling. Currently, the RPKM \citep{Mortazavi2008}, TMM \citep{Robinson2010a} and RLE \citep{Anders2010} methods have been widely used to normalize RNAseq data. Once the normalization constant is estimated, i.e. using the \texttt{edgeR} package \citep{Robinson2010}, we can easily calculate the BI after adjusting for the normalization. In the following, we use \texttt{edgeR} package to calculate the normalization factor using TMM approach. <<>>= if (require(edgeR)) { TMM <- calcNormFactors(Dat, method='TMM') } else { # manually set factors from previous computations TMM <- c(1.0390711, 0.9813734, 1.0091593, 0.9641022, 1.0137000, 1.0188657, 0.9648757, 0.9956814, 0.9689530, 0.9774278, 1.0059115, 1.0076910, 0.9923854, 1.0121838, 1.0249094, 1.0403172, 0.9887074, 1.0003546, 0.9998479, 0.9844905, 1.0040203, 0.9692244, 0.9987567, 1.0063895, 0.9954510, 1.0204917, 0.9717720, 1.0317981, 0.9826344, 0.9817171, 0.9949059, 0.9745569, 0.9652138, 1.0075196, 0.9879748, 0.9929244, 0.9895606, 1.0144117, 1.0612923, 0.9626716, 1.0049376, 1.0192416, 0.9826612, 1.0234523, 0.9921186, 1.0029780, 1.0199930, 1.0054256, 1.0152748, 0.9655475, 0.9919175, 1.0231102, 0.9750882, 0.9958528, 1.0268000, 0.9651300, 1.0158949, 0.9803130, 1.0385707, 0.9870510, 1.0211765, 1.0326759, 1.0234579, 0.9524254, 0.9742719, 0.9887936, 1.0476640, 0.9787385, 0.9992178, 1.0046021, 0.9929379, 0.9595237, 1.0690364, 0.9910940, 1.0158325, 0.9799790, 1.0316363, 1.0341890, 1.0036944, 0.9728850, 1.0080238, 1.0190104, 0.9735436, 0.9744903, 0.9974915, 0.9804733, 1.0243671, 0.9881085, 0.9923432, 0.9638553, 1.0178705, 1.0476191, 1.0260725, 1.0474791, 1.0449745, 0.9987096, 1.0028339, 0.9971751, 0.9487246, 0.9696386) } @ We now incorporate the TMM normalization into SIBER. We use the LN model below. The calculation with other models is similar. Note that our definition of the normalization factor differs from \texttt{edgeR} package. In our notation, $\textrm{E} [C_s]=d_s \mu_{c(s)}$ where $C_s$ is the observed raw count for sample s, $d_s$ is the normalization factor applied to sample s, c(s)=\{1, 2\} denotes which of the two components sample s comes from and $\mu_1, \mu_2$ are mean parameters for the two components. Therefore, our definition of $d_s$ maps the true expression level to the observed counts. In contrast, the normalization constant estimated by \texttt{edgeR} maps the observed counts to the estimated true expression. As a result, we need to pass the reciprocal of the normalization vector estimated by \texttt{edgeR} to SIBER. <<>>= SIBER(y=Dat[1, ], d=1/TMM, model='LN') @ \subsection{Parallelizing \texttt{SIBER}} When there are many genes to be fitted, we can easily parallel \texttt{SIBER} to speed up the computation. There are several ways for parallelization. Here we choose the \texttt{foreach} package for the backend. The workers are requested and registered by the \texttt{doSNOW} package. %Note that, although the parallel code works fine on Windows after the %program is installed, it causes an error when checking the package on %Windows. As a result, we disable this section of the code if running %on Windows. %<<notWin>>= %notwin <- .Platform$OS != "windows" %@ % % parallel is commented out so that we can pass R-Forge check due to snow package is not installed on the server % \begin{verbatim} library(doParallel) cl <- makeCluster(3, type = "SOCK") registerDoParallel(cl) \end{verbatim} Note that the above command also works on Linux servers. However, it requests master nodes when run within R. For good practice, we can use qsub such that the computation is done in the compute nodes. Below we illustrate how to use SIBER with parallel computation. %<<parallelAnalysis>>= %if (notwin) { \begin{verbatim} func <- function(i) { SIBER(y=Dat[i, ], model='LN') } BIinfo_LN <- foreach(i=1:nrow(Dat), .combine='rbind', .packages='SIBER') %dopar% { func(i) } \end{verbatim} % BIinfo_LN[1:3, ] %} %@ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Fitting Two-component Mixture Models} \texttt{SIBER} package provides functions to fit three types of mixture models besides detecting bimodally expressed genes. These include: (1) 2-component mixture with equal dispersion or variance (E model); (2) 2-component mixture with unequal dispersion or variance (V model); (3) 0-inflated model. All three types of distributions are implemented. The rule to fit 0-inflated model is that the observed percentage of count exceeds the user specified threshold. This rule overrides the model argument (E or V) when observed percentae of zero count exceeds the threshold. First, we illustrate how to fit the E and V models. We use the simulated data from LN model. The gene we use is not 0-inflated. By default, the minimum observed percentage of zero is not achieved (zeroPercentThr=0.2 ). Hence, the 0-inflated model is disabled. In this case, the model specification will be effective. <<keep.source=TRUE>>= data(simDat) ind <- 1 # true parameter generating the simulated data parList$LN[ind, ] # fit by E model fitLN(y=dataList$LN[ind, ], base=exp(1), eps=1, model='E') # fit by V model. fitLN(y=dataList$LN[ind, ], base=exp(1), eps=1, model='V') @ Now we choose a gene that has zero inflation and illustrate how to fit a 0-inflated model: <<keep.source=TRUE>>= ind <- 5 # 0-inflated gene # true parameter generating the simulated data parList$LN[ind, ] # fit by E model. 0-inflated model is disabled by setting zeroPercentThr=1. # the result is biased. fitLN(y=dataList$LN[ind, ], base=exp(1), eps=1, model='E', zeroPercentThr=1) # fit by 0-inflated model. 0-inflated model overrides the E model since percentage # of observed zero counts exceeds the threshold. fitLN(y=dataList$LN[ind, ], base=exp(1), eps=1, model='E', zeroPercentThr=0.2) @ Here we see that when there is severe 0-inflation, fitting a E (or V) model gives biased estimate. Instead, our 0-inflated model works pretty well. The usage of fitNB(), fitGP() is quite similar and is omitted in this manual. \section{Session Info} After all the computations, we close the connection to the workers. %<<>>= %if(notwin) { \begin{verbatim} stopCluster(cl) \end{verbatim} % } %@ <<sessionInfo>>= getwd() sessionInfo() @ \bibliographystyle{plainnat} \bibliography{siber} \end{document}