%\VignetteIndexEntry{integIRTy Vignette} %\VignetteDepends{ltm, foreach, doParallel, mclust, MASS, abind} %\VignetteKeywords{integIRTy} %\VignettePackage{integIRTy} \documentclass{article} \usepackage{graphicx} \usepackage{hyperref} \usepackage{amsmath} \usepackage{natbib} \pagestyle{myheadings} \markright{Pan Tong} \setlength{\topmargin}{0in} \setlength{\textheight}{8in} \setlength{\textwidth}{6.5in} \setlength{\oddsidemargin}{0in} \setlength{\evensidemargin}{0in} \title{integIRTy: a method to identify altered genes in cancer accounting for multiple mechanisms of regulation using Item Response Theory} \author{Pan Tong and Kevin R Coombes} \date{\today} \usepackage{Sweave} \begin{document} \SweaveOpts{prefix.string=integIRTy} \maketitle \tableofcontents \listoffigures \listoftables <<echo=FALSE>>= options(width=80) options(continue=' ') @ \section{Introduction} \texttt{integIRTy} is an R package that integrates multiple modalities of high throughput assays using Item Response Theory (IRT) \citep{integIRTy}. The goal is to identify genes that are altered in cancer either marginally in individual assay or jointly across different assays. \texttt{integIRTy} is able to systematically integrate across assay types while automatically adjusting for the heterogeneity among different platforms and different samples. When applied to a single assay, \texttt{integIRTy} is more robust and reliable than conventional methods such as student's t-test or Wilcoxon rank-sum test. When applied to integrate multiple assays, \texttt{integIRTy} can identify novel altered genes that cannot be found by looking at individual assays separately. \section{A Quick Example} \texttt{integIRTy} provides two top-level interfaces to the user. One is \texttt{intIRTeasyRun}, which accepts dichotomized matrices prepared by the user; the other is \texttt{intIRTeasyRunFromRaw}, which accepts original data and uses the dichotomization methods proposed in \citep{integIRTy} to transform the data. Both interfaces perform a series of computations: fitting IRT models in individual assays, estimating latent traits from individual assay, inferring latent traits from integrated data, and performing permutations to assess statistical significance. We will present a simple example to illustrate how to use \texttt{integIRTy}. We begin with the \texttt{intIRTeasyRunFromRaw} function. The usage of \texttt{intIRTeasyRun} is similar. Later, we will break down the procedures step by step. As usual, the first step is to load the package from the library. <<loadLibrary>>= library(integIRTy) @ We use a subset of TCGA ovarian cancer data ~\citep{TCGAov} (shipped along with the \texttt{integIRTy} package). We use trhe \texttt{data} function to load expression, methylation, and copy number data for both tumor and normal samples. <<loadData>>= data(OV) ls() @ The \texttt{intIRTeasyRunFromRaw} function requires two lists of matrices corresponding to tumor and normal samples. The lists must contain data from matching assays in the same order. The rows of each matrix should also be in the same order, corresponding to the same set of genes. The two lists are prepared below. <<lists>>= controlList <- list(Expr_N, Methy_N, CN_N) tumorList <- list(Expr_T, Methy_T, CN_T) @ To avoid repeated computation, we define a function that detects if an object is already present: <<testObject>>= testObject <- function(object) { exists(as.character(substitute(object))) } @ Now we can call \texttt{intIRTeasyRunFromRaw} to integrate the data. Note that the assay type is specified in the order of the two lists so that the program can choose the right dichotomization methods. We also add gene sampling for significance assessment. <<runFromRaw>>= if(!testObject(runFromRaw)) { runFromRaw <- intIRTeasyRunFromRaw(platforms=tumorList, platformsCtr=controlList, assayType=c("Expr", "Methy", "CN"), permutationMethod="gene sampling") } class(runFromRaw) attributes(runFromRaw) @ \texttt{intIRTeasyRunFromRaw} returns a list of the following elements: \begin{itemize} \item \texttt{fits}: a list of model fits for each platform as returned by \texttt{fitOnSinglePlat} function; specifically, these include \begin{itemize} \item \texttt{fit}: An object returned by calling ltm package. Item parameters and other auxillary inforamtion (i.e. loglikelihood, convergence, Hessian) can be accessed from this object. For more details, please refer to ltm package \item \texttt{model}: The model type \item \texttt{guessing}: The guessing parameter \item \texttt{sampleIndices}: The sample indices used in the model \item \texttt{geneIndices}: The gene indices used in the model \end{itemize} \item \texttt{estimatedScoreMat}: A matrix of estimated latent traits. The first several columns correspond to the individual assays; the last column represents the integrated latent trait with all data. \item \texttt{permutedScoreMat}: A matrix of latent trait estimates after permuting the binary matrix within columns. This is only available if addPermutedScore is set to TRUE. The first several columns correspond to the individual assays; the last column represents the integrated data. \item \texttt{dscrmnList}: A list of discrimination parameters. Each element contains all of the discrimination parameters as a vector for each assay. The last element contains the discrimination parameters for the integrated data which is formed by combining discrimination parameters from each assay sequentially. \item \texttt{dffcltList}: Same format as dscrmnList except it contains difficulty parameter. \item \texttt{gussngList}: Same format as dscrmnList except it contains guessing parameter. Be default, this is just all 0's. \item \texttt{permutedScoreMatWithLabelPerm}: A matrix of latent trait estimates using sample label permutation. This is only available if permutationMethod=`sample label permutation' is used. The first several columns correspond to the individual assays; the last column represents the integrated data. \end{itemize} Below, we just extract the estimated latent traits for each assay and visualize them by matrix plot. The upper diagonal panels show the pairwise smoothed scatter where dark clouds indicate the density. The lower panels show the correlation of the latent traits. <<echo=FALSE>>= panel.hist <- function(x, ...) { usr <- par("usr"); on.exit(par(usr)) par(usr = c(usr[1:2], 0, 1.5) ) h <- hist(x, breaks=50, plot = FALSE) breaks <- h$breaks; nB <- length(breaks) y <- h$counts; y <- y/max(y) rect(breaks[-nB], 0, breaks[-1], y, col="cyan", ...) } panel.cor <- function(x, y, digits=2, prefix="", cex.cor) { usr <- par("usr"); on.exit(par(usr)) par(usr = c(0, 1, 0, 1)) r <- abs(cor(x, y, use='complete.obs')) txt <- format(c(r, 0.123456789), digits=digits)[1] txt <- paste(prefix, txt, sep="") if(missing(cex.cor)) cex <- 0.6/strwidth(txt) text(0.5, 0.5, txt, cex = cex) } panel.smooth <- function(...){ par(new=TRUE); smoothScatter(...) abline(0, 1, col=2, lty=2) } @ <<MatrixScatter, fig=TRUE>>= runFromRaw_ScoreMat <- runFromRaw$estimatedScoreMat pairs(runFromRaw_ScoreMat, lower.panel=panel.cor, upper.panel=panel.smooth, labels=c('Expression', 'Methylation', 'Copy \n Number', 'Integrated'), cex.labels=1.2, gap=1.7) @ \section{Building The Pipeline Step By Step} Usually the \texttt{intIRTeasyRunFromRaw} function is enough for most applications. In case of assays rather than expression, methylation and copy number is present, the user might need to build the pipeline themselves. Also, we allow the user to define their own dichotomization methods. Below we show step by step how to build the pipeline for data integration. \subsection{Data Dichotomization} We implement 3 dichotomization methods proposed in \citep{integIRTy}. The dichotomize() function is a wrapper for each method. Many options can be specified. For details, please refer to the package documentation. <<DataDichotomization>>= if(!testObject(binDat_CN)){ binDat_expr <- dichotomize(Expr_T, Expr_N, assayType='Expr') binDat_methy <- dichotomize(Methy_T, Methy_N, assayType='Methy') binDat_CN <- dichotomize(CN_T, CN_N, assayType='CN') } @ \subsection{Fit IRT Model On Each Assay} With dichotomized data, we can fit an IRT model on each assay. Three IRT models are implemented: the Rasch model where all item discrination are set to 1; the constrained 2PL model where all item discrimation are set to be equal but not necessarily 1; the 2PL model where no constraint is put on the item difficulty and discrimination parameter. By default, the 2PL model will be used. <<FitOnSingleAssay>>= if(!testObject(fit2PL_CN)){ fit2PL_Expr <- fitOnSinglePlat(binDat_expr, model=3) fit2PL_Methy <- fitOnSinglePlat(binDat_methy, model=3) fit2PL_CN <- fitOnSinglePlat(binDat_CN, model=3) } @ \subsection{Estimating Latent Traits} Result returned by fitOnSinglePlat() contains item parameters, which can be used to estimate the latent traits through the computeAbility() function. First we extract the item parameters: <<>>= dffclt_expr <- coef(fit2PL_Expr$fit)[, 'Dffclt'] dscrmn_expr <- coef(fit2PL_Expr$fit)[, 'Dscrmn'] dffclt_methy <- coef(fit2PL_Methy$fit)[, 'Dffclt'] dscrmn_methy <- coef(fit2PL_Methy$fit)[, 'Dscrmn'] dffclt_CN <- coef(fit2PL_CN$fit)[, 'Dffclt'] dscrmn_CN <- coef(fit2PL_CN$fit)[, 'Dscrmn'] @ Then, latent traits from each assay can be estimated separately. <<LatentTraitEstimation>>= if(!testObject(score_expr)){ score_expr <- computeAbility(binDat_expr, dscrmn=dscrmn_expr, dffclt=dffclt_expr) score_methy <- computeAbility(binDat_methy, dscrmn=dscrmn_methy, dffclt=dffclt_methy) score_CN <- computeAbility(binDat_CN, dscrmn=dscrmn_CN, dffclt=dffclt_CN) } @ \subsection{The Integrated Latent Trait} The integrated latent trait can be estimated similarby by combining the items into a larger test: <<integratedLatentTrait>>= if(!testObject(score_integrated)){ score_integrated <- computeAbility(respMat=cbind(binDat_expr, binDat_methy, binDat_CN), dscrmn=c(dscrmn_expr, dscrmn_methy, dscrmn_CN), dffclt=c(dffclt_expr, dffclt_methy, dffclt_CN)) } @ By building the pipeline step by step, we get the same result as using the intIRTeasyRunFromRaw() function: <<>>= all(score_integrated==runFromRaw_ScoreMat[, 4]) @ \subsection{The intIRTeasyRun() Function} The intIRTeasyRun() function is a pipeline for binary input. It can be used as follows: <<easy,eval=FALSE>>= # not run to allow vignette to complete more quickly runFromBinary <- intIRTeasyRun(platforms=list(binDat_expr, binDat_methy, binDat_CN)) @ \section{Parallelizing \texttt{integIRTy}} We add an option for parallel computing to speed up the computation. Parallel computing uses \texttt{foreach} as the backend while workers are requested by \texttt{doParallel}. The parallelism can happen when fitting serveral IRT models (due to multiple assay types to integrate) or estimating the latent traits or performing permutation to assess statistical significance. %% parallel is disabled to pass R Forge check %Note that, although the parallel code works fine on Windows after the %program is installed, it causes an error when checking the package on %Windows. As a result, we disable this section of te code if running %on Windows. %<<notWin>>= %notwin <- .Platform$OS != "windows" %@ The following functions can be paralleled which can be controlled by the \textit{parallel} option: intIRTeasyRun(), intIRTeasyRunFromRaw(), calculatePermutedScoreByGeneSampling(), computeAbility() and dichotomizeExpr(). Enabling parallelism is quite simple for the user. The first step involves requesting workers. The second step is to set the option \textit{parallel} to TRUE in a specified function. By default, all parallel options are set to be FALSE. Several cluster types can be requested including MPI, SOCK, PVM and NWS. For example, on a windows machine with multiple cores, the following code requests 3 workers through SOCK: % % parallel is commented out so that we can pass R-Forge check due to snow package is not installed on the server % \begin{verbatim} library(doParallel) cl <- makeCluster(3, type = "SOCK") registerDoParallel(cl) \end{verbatim} Note that the above command also works on Linux servers. However, it requests master nodes when run within R. For Linux servers, especially linux clusters with MPI installed, we can request more workers as: %<<register,eval=FALSE>>= %# not run to allow vignette to complete more quickly \begin{verbatim} registerDoParallel(makeCluster(30)) \end{verbatim} %@ Usually it's not a good practice to request the master nodes on the cluster. To avoid this, one can use qsub command to submit a PBS job using the R script written without further modification. We proceed with the 3 nodes requested and call the intIRTeasyRun() to illustrate how to use the parallel computing: %<<parallelExample>>= %if(notwin) { \begin{verbatim} runFromBinary <- intIRTeasyRun(platforms=list(binDat_expr, binDat_methy, binDat_CN), addPermutedScore=TRUE, parallel=TRUE) \end{verbatim} %} %@ Now we close the connection to the workers. %<<>>= %if(notwin) { \begin{verbatim} stopCluster(cl) \end{verbatim} % } %@ \section{File Location and Session Info} <<sessionInfo>>= getwd() sessionInfo() @ \bibliographystyle{plainnat} \bibliography{integIRTy} \end{document}