%\VignetteIndexEntry{Using Oncotree for the ovarian cancer CGH data}
%\VignetteDepends{lattice,boot}
\documentclass[reqno]{amsart}
\usepackage[margin=0.8in]{geometry}
\usepackage{graphicx}

\title{Using the Oncotree package}
\author[A. Szabo]{Aniko Szabo, Kenneth Boucher and Lisa Pappas}
\SweaveOpts{echo=TRUE}

\begin{document}
\setkeys{Gin}{width=0.5\textwidth}
\begin{abstract}
This paper shows a short example of building and exploring oncogenetic trees
using the Oncotree package. A detailed
description of the theory of oncogenetic trees can be found in 
\begin{itemize}
\item Desper, R.; Jiang, F.; Kallioniemi, O.; Moch, H.; Papadimitriou, C. and Sch\"{a}ffer, A. A. ``Inferring tree models for oncogenesis from comparative genome hybridization data.'' \emph{Journal of Computational Biology}, 1999, 6, 37-51.
\item Szabo, A. and Boucher, K. ``Estimating an oncogenetic tree when false negatives and positives are present.'' \emph{Mathematical Biosciences}, 2002, 176, 219-236.
\item Szabo, A. and Boucher, K. ``Oncogenetic trees'' in \emph{Handbook of cancer models with applications} Tan, Hanin (ed.) World Scientific, 2008.
\end{itemize}
A short introduction is given in doc/Oncotree.pdf.

\end{abstract}
\maketitle


We start by loading a dataset. The package contains an example dataset:
<<Intro, echo=false>>=
 options(width=100)
 ps.options(colormodel="rgb")
<<DataLoad>>=
 library(Oncotree)
 data(ov.cgh)
 str(ov.cgh)
@

Based on these data, we construct the oncogenetic tree using the default 
$\ell_2$-distance error function to estimate the false-positive and false-negative 
error rates.
<<TreeFit>>=
  ov.tree <- oncotree.fit(ov.cgh)
@

The fitted tree can be examined several ways: \texttt{print}ing it produces a
quick summary, but the result of \texttt{plot}ting is easier to interpret (the
plots are shown in Figure~\ref{F:treeplot}). 
<<TreePrint>>=
  ov.tree
  plot(ov.tree, edge.weights="est")   
<<TreePrint2, results=hide>>=
  pstree.oncotree(ov.tree, edge.weights="est", shape="oval")
@
\begin{figure}
\begin{minipage}{0.5\textwidth}
\setkeys{Gin}{width=\textwidth}
<<TreePlot,fig=true, echo=false>>=
  plot(ov.tree, edge.weights="est")
@
\end{minipage}
\begin{minipage}{0.45\textwidth}
\includegraphics[width=\textwidth]{PlotForVignette}
\end{minipage}
\caption{Fitted oncogenetic tree for the \texttt{ov.cgh} data set.}\label{F:treeplot}
\end{figure}
\setkeys{Gin}{width=0.5\textwidth}

We can compare the observed and fitted marginal occurrence frequencies of the
mutations (the distance between these two was minimized for the error-rate
estimation).  The plot is shown in Figure~\ref{F:marg}.

<<TreeMarg>>=
  print(obs <- colMeans(ov.tree$data))
  print(est <- marginal.distr(ov.tree, with.errors=TRUE))
  #plot is in Figure 2
  barplot(rbind(obs[-1],est[-1]), beside=T,  legend.text=c("Observed","Fitted"),
	        main="Marginal frequencies of occurrence")
@
\begin{figure}
<<TreeMargPlot,fig=true, width=7, height=4, echo=false>>=
  barplot(rbind(obs[-1],est[-1]), beside=T,  legend.text=c("Observed","Fitted"),
	        main="Marginal frequencies of occurrence")
@
\caption{Observed and fitted frequencies of occurrence of each event.}\label{F:marg}
\end{figure} 

In addition to the marginal frequencies, it is possible to estimate the entire
joint distribution generated by the tree:
<<TreeDist>>=
  dd <- distribution.oncotree(ov.tree, with.errors=TRUE)
	head(dd)
@  
Using the overall joint distribution, it is straightforward to obtain marginal 
joint distributions (2- or higher way) if needed (the plot is shown in Figure~\ref{F:marg2}).

<<Marg2way>>=
  #estimated probabilities of 2 events
  print(est2way <- t(data.matrix(dd[2:8])) %*% diag(dd$Prob) %*% data.matrix(dd[2:8]))
  #observed probabilities of 2 events
  print(obs2way <- t(ov.tree$data[,-1]) %*% ov.tree$data[,-1]/nrow(ov.tree$data))
  oe.diff <- obs2way-est2way
  oe.diff[lower.tri(oe.diff)] <- NA  #clear half of symmetric matrix for plotting
  
  require(lattice)  #the plot is in Figure 3
  levelplot(oe.diff, xlab="", ylab="", scales=list(x=list(alternating=2), tck=0),
	          main="Observed - Expected probabilities of co-occurrence of events")
@

\begin{figure}
<<Marg2wayPlot,fig=true, echo=false>>=
  print(levelplot(oe.diff, xlab="", ylab="", scales=list(x=list(alternating=2), tck=0),
	          main="Observed - Expected probabilities of co-occurrence of events"))
@
\caption{Goodness-of-fit plot: difference between observed and expected 
 probabilities of two events being observed.}\label{F:marg2}
\end{figure} 

Another way to evaluate goodness-of-fit is through bootstrap resampling of the
data. Two approaches are implemented: a parametric bootstrap that assumes that 
the model is correct and a non-parametric bootstrap. The plot is shown in Figure~\ref{F:boot}.

<<BootNP>>=
  set.seed(43636)
  ov.boot <- bootstrap.oncotree(ov.tree, type="nonparam", R=1000)
  ov.boot
  opar <- par(mfrow=c(3,2))   #the plot is in Figure 4
    plot(ov.boot, minfreq=45)
  par(opar)  
@

\setkeys{Gin}{width=0.8\textwidth}

\begin{figure}
<<BootNPplot, fig=true, width=7, height=10, echo=false>>=
  opar <- par(mfrow=c(3,2))   #the plot is in Figure 4
    plot(ov.boot, minfreq=45, cex=1)
  par(opar)  
@
\caption{The most frequently occurring bootstrap trees.} \label{F:boot}
\end{figure}   
\setkeys{Gin}{width=0.5\textwidth}

The non-parametric bootstrap gives an estimate of the reconstruction confidence:
the original tree was obtained 83 times out of 1000 resamples, so the estimated
confidence is 8.3\%.

We can look at the frequency of edge occurrences in the bootstrapped trees:

<<BootFreq>>=
  ov.boot$parent.freq
@

It is clear that some edges are really stable: Root $\rightarrow$ 8q+, 
8q+ $\rightarrow$  3q+, root $\rightarrow$ 1q+, all with confidence $>80\%$, while
other edges are less stable (for example, 8p- is the child of 8q+ about as often
as of 5q-).
\end{document}