\documentclass[10pt,a4paper]{article}
\usepackage[T1]{fontenc}
\usepackage{natbib, url}
\usepackage{ucs}
\usepackage{longtable}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{color, colortbl}
\usepackage{fullpage}% no margins
\renewcommand{\baselinestretch}{1}% line spacing
\usepackage{lmodern}
\usepackage{float}
\usepackage{multirow}
\usepackage{lscape}
\usepackage{graphicx}
\usepackage{xcolor}
\usepackage{multicol}
\newcommand{\quotes}[1]{``#1''}
\parindent=0pt \parskip=0pt
%\VignetteIndexEntry{Using the MareyMap package}

\title{The \texttt{MareyMap} package version 1.3}
\author{Aurélie Siberchicot, Clément Rezvoy, Delphine Charif, \\
Laurent Guéguen and Gabriel Marais}

\begin{document}
\SweaveOpts{concordance=TRUE}
\maketitle

\texttt{MareyMap} is an \texttt{R} package to estimate local recombination rates along the genome. 
\texttt{MareyMap} relies on comparing the genetic and the physical maps of a given chromosome to estimate local recombination rates (given by the slope of the curve describing the relationship between both variables), 
a graphical method called the Marey map method introduced by A. Chakravarti in 1991\footnote{Chakravarti A. (1991) A graphical representation of genetic and physical maps: the Marey map. Genomics 11(1):219-22.}. 
\texttt{MareyMap} accepts Marey map data as input (genetic and physical positions of markers for a set of chromosomes of a species) and will return local recombination rate estimates.\\

\texttt{MareyMap} has many features and options (detailed in this user guide), including:
\begin{itemize}
  \item taking Marey map data from any species, including some Marey map data for a few species provided with the package
  \item estimating local recombination rates using different interpolation methods
  \item automatically providing local recombination rates for any given gene (or set of genes) in the genome
\end{itemize}
\vspace{0.6cm}

If you use \texttt{MareyMap}, please cite: \\
Rezvoy C, Charif D, Guéguen L, Marais GAB. (2007) MareyMap: an R-based tool with graphical interface for estimating recombination rates. \textit{Bioinformatics} 23(16):2188-9. \\
\url{https://doi.org/10.1093/bioinformatics/btm315}\\

If you use \texttt{MareyMapOnline}, please cite: \\
Siberchicot A, Bessy A, Guéguen L, Marais G. (2017) MareyMap Online: A User-Friendly Web Application and Database Service for Estimating Recombination Rates Using Physical and Genetic Maps. \textit{Genome Biology and Evolution} 9(10):2506-2509. \url{https://doi.org/10.1093/gbe/evx178}

\vspace{0.6cm}
\tableofcontents
\vspace{0.8cm}

\section{Installing and starting \texttt{MareyMap}}

\subsection{Initial installation}
\texttt{MareyMap} is a package developed for the \texttt{R} software;
its sources are available on \url{http://cran.r-project.org/}. The
\texttt{R} software must be installed in such a way that graphical
interfaces can work. On Windows and Mac OS, this is done automatically
when the \texttt{R} software is installed. On Linux, the two
libraries \textit{tcl} and \textit{tk} must be installed, and
\texttt{R} must be installed with the \texttt{-{}-with-tcltk} option.\\

Once \texttt{R} is installed, the package \texttt{MareyMap} and its
dependencies \texttt{tcltk}, \texttt{tkrplot} and \texttt{tools} must be available.
\texttt{tcltk} and \texttt{tools} are distributed with \texttt{R} itself;
\texttt{MareyMap} and \texttt{tkrplot} are installed by typing the commands 

\begin{verbatim}
  install.packages("MareyMap")
  install.packages("tkrplot")
\end{verbatim}

in an \texttt{R} console.



\subsection{Starting MareyMap}
In an \texttt{R} console, first load the
package:

\begin{verbatim}
  library(MareyMap)
\end{verbatim}

Then, open a graphical interface
with the command:
\begin{verbatim}
  startMareyMapGUI()
\end{verbatim}

A new window, as shown in Figure~\ref{fig:1}, should open. If not, close your
\texttt{R} console, open a new one, then load the package and start the interface again.

\begin{figure}[H]
\centering
\includegraphics[width=0.85\textwidth]{start.png}
\caption{The \texttt{MareyMap} graphical interface}
\label{fig:1}
\end{figure}


\section{Data}


\subsection{Loading data}
The user must either choose at least one dataset among
those available in the \texttt{MareyMap} package or import their own
dataset.\\

Six ready-to-use datasets are provided with the package. 
They are Marey maps for:
\textit{Arabidopsis thaliana}\footnote{Wright SI, Agrawal N, Bureau TE. 
  Effects of recombination rate and gene density on transposable
  element distributions in Arabidopsis thaliana. 
  Genome Res. 2003. 13:1897-1903.}, 
\textit{Caenorhabditis elegans}\footnote{Wormbase Release WS160
  \url{https://wormbase.org/}. See Rizzon C, Martin E, Marais G, Duret L, Segalat L, Biemont C. 
  Patterns of selection against transposons inferred from the distribution of Tc1, Tc3 and Tc5 
  insertions in the mut-7 line of the nematode Caenorhabditis elegans. 
  Genetics. 2003. 165:1127-1135.}, 
\textit{Drosophila melanogaster}\footnote{Marais G, Piganeau G. 
  Hill-Robertson interference is a minor determinant of variations in codon bias
  across Drosophila melanogaster and Caenorhabditis elegans genomes.
  Mol Biol Evol. 2002. 19:1399-1406.} 
and \textit{Homo sapiens}\footnote{Rutgers Combined Linkage-Physical Maps, version
  2.0 (Build 35). Xiangyang Kong and Tara Matise 12/08/2004.
  See Kong et al. A high-resolution recombination map of the human genome. 
  Nat Genet. 2002. 31:241-247.} 
  (male, female and sex-averaged).\\


When the input dataset does not come from the package, the extension of the data file must be \textit{txt}, \textit{rda}, \textit{Rda}, \textit{rdata} or \textit{Rdata}. 
When using a text file, the input data must be a data frame with the following columns:
\begin{itemize}
  \item \quotes{set}: the organism,
  \item \quotes{map}: the chromosome,
  \item \quotes{mkr}: the name of the marker,
  \item \quotes{phys}: the physical position of the marker,
  \item \quotes{gen}: the genetic position of the marker,
  \item \quotes{vld}: whether the marker is valid or not. This column is not mandatory; if it is missing, it is added with the value \texttt{TRUE} by default.
\end{itemize}

Column names must be in the first row, values must be separated by a single space, and each character string (including column names) must be enclosed in double quotes, as in the example below.
Rows that contain NA entries are removed. 

\begin{verbatim}
"set" "map" "mkr" "phys" "gen" "vld"
"Arabidopsis thaliana" "Chromosome 1" "GST1" 663291 3.99 TRUE
"Arabidopsis thaliana" "Chromosome 1" "SGCSNP151" 1148355 3.35 TRUE
"Arabidopsis thaliana" "Chromosome 1" "AtEAT1" 1435872 3.87 TRUE
"Arabidopsis thaliana" "Chromosome 1" "ve002" 1521308 7.15 TRUE
"Arabidopsis thaliana" "Chromosome 1" "SGCSNP388" 1526933 7.66 TRUE
"Arabidopsis thaliana" "Chromosome 1" "SGCSNP170" 1642565 7.66 TRUE
"Arabidopsis thaliana" "Chromosome 1" "ve003" 2032443 7.76 TRUE
"Arabidopsis thaliana" "Chromosome 1" "SGCSNP308" 2664435 0.89 TRUE
\end{verbatim}
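
As a minimal sketch of how such a file might be produced from \texttt{R}, the
base function \texttt{write.table} can be used: with its default settings it
quotes character strings and column names and separates values with a single
space. The marker values below are copied from the example above, and the file
name \texttt{my\_map.txt} is an arbitrary choice made for this illustration.

\begin{verbatim}
  # Build a small data frame with the expected columns
  # (values taken from the example above).
  marey <- data.frame(
    set  = "Arabidopsis thaliana",
    map  = "Chromosome 1",
    mkr  = c("GST1", "SGCSNP151", "AtEAT1"),
    phys = c(663291, 1148355, 1435872),
    gen  = c(3.99, 3.35, 3.87),
    vld  = TRUE,
    stringsAsFactors = FALSE
  )

  # The defaults quote = TRUE and sep = " " give the expected layout;
  # row.names = FALSE avoids an extra, unnamed first column.
  out_file <- file.path(tempdir(), "my_map.txt")  # arbitrary file name
  write.table(marey, file = out_file, row.names = FALSE)

  # Read the file back to check its structure.
  check <- read.table(out_file, header = TRUE, stringsAsFactors = FALSE)
\end{verbatim}

The first line of the resulting file holds the quoted column names, followed by
one line per marker.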



Choose and open a dataset with the \quotes{\textbf{File}} and \quotes{\textbf{Open}}
menus. The \quotes{\textbf{Data}} menu lists all the opened datasets. When one
dataset is selected, the \quotes{{\footnotesize \textbf{MAPS}}} left frame is
updated and shows the Marey maps (one for each chromosome) available
in the dataset.\\

In the \quotes{{\footnotesize \textbf{MAPS}}} frame, the user selects one map
(\textit{i.e.} one chromosome) by clicking on it. The selected map is displayed
(the physical positions on $x$-axis and the genetic distances on
$y$-axis) in the central part of the interface. The {\footnotesize
\quotes{\textbf{INTERPOLATIONS}}} right frame becomes active and the user can
perform interpolations. Figure~\ref{fig:2} shows the
\textit{Arabidopsis\_thaliana} \textit{Chromosome 1} Marey map as
an example.

\begin{figure}[H]
\centering
\includegraphics[width=0.85\textwidth]{opendata.png}
\caption{Displayed Marey map of a chromosome}
\label{fig:2}
\end{figure}



\subsection{Map cleaning}

Physical or genetic maps occasionally include errors. Those will appear 
as outliers in a Marey map of a chromosome, disrupting the 
monotonically increasing behaviour expected from a Marey map function. 
Clicking on a marker on
the map (the point becomes filled red) will display information about
this marker in the \quotes{{\footnotesize \textbf{MARKER}}} left frame. If you
un-select the \quotes{\textbf{Valid}} option (a red cross covers the point),
this marker will not be included in the interpolations. This operation
is reversible.\\

Deleting the marker is also possible. The marker will be removed from
the rest of the analysis, but not from the raw data.  The marker will be
included again if the dataset is reloaded.


\section{Interpolation methods}

\subsection{Selecting and running an interpolation method}

\subsubsection{Selecting a method}
\newpage
\begin{multicols}{2}
  To run an interpolation method on a Marey map, click on the
  \includegraphics[width=0.03\textwidth]{gtk-add} icon in the \quotes{{\footnotesize
  \textbf{INTERPOLATIONS}}} right frame and select an interpolation
  method from the list (see Figure~\ref{fig:3}).\\
  After interpolation is done, the results are displayed in the central frame.
  \columnbreak
 
  \begin{figure}[H]
    \centering
    \includegraphics[width=0.3\textwidth]{listmethods.png}
    \caption{List of interpolation methods.}
    \label{fig:3}
  \end{figure}
 
\end{multicols}


\subsubsection{Changing and deleting interpolations}

You can change the parameters of an interpolation by clicking on the
\includegraphics[width=0.03\textwidth]{gnome-settings} icon in the \quotes{{\footnotesize
\textbf{INTERPOLATIONS}}} frame and delete an interpolation by
clicking on the \includegraphics[width=0.03\textwidth]{stock_delete} icon. Interpolations
can be shown in the central displaying frame,
using the \includegraphics[width=0.03\textwidth]{stock_show-all} checkbox. The
\includegraphics[width=0.03\textwidth]{stock_save} checkbox indicates whether the
interpolation results should be included when saving the results into a text file.

\subsubsection{Common parameters}

Some parameters are common to all interpolation methods. By default, 
each interpolation is given a name (\quotes{\textbf{Name}} parameter, which can be changed by the
user), its results are saved 
(\quotes{\textbf{Saved}} parameter) and displayed (\quotes{\textbf{Displayed}} parameter), and
a line color (\quotes{\textbf{Line color}} parameter) is chosen automatically. These parameters
can be changed at any time.

See Section~\ref{specific} for the parameters specific to each interpolation
method.

\subsubsection{Running a method on every map in a set}

It is possible to run the same
interpolation method (with the same parameters) on all the Marey maps (all the
chromosomes) of a dataset. Just click on the \quotes{\textbf{Apply to every map
in the set}} checkbox in the window that opens when a new
interpolation is being set (see Figure~\ref{fig:3}). In this case, the interpolation 
will have the same name for all the Marey maps.

Similarly, changing or deleting an interpolation will affect all the maps if you
use the \quotes{\textbf{Apply to every map in the set}} checkbox.



\subsection{Available interpolation methods}
\label{specific}
The \texttt{MareyMap} package currently provides three interpolation
methods: Loess, Sliding window and Cubic splines.

\subsubsection{Loess}
Loess (or Lowess, for LOcally WEighted Scatterplot Smoothing) estimates
the recombination rates by locally fitting a polynomial curve
($1^{st}$ or $2^{nd}$ degree). The size of the window is defined as a
percentage of the total number of markers and therefore adapts to
variations in marker density across the map. Inside a
given window, each marker is given a weight that depends on its distance
from the center of the window. The parameters $\beta$ of the
curve are those that minimize the weighted sum of squared deviations between the
data and the curve:
\begin{center}
$Q = \sum_{i=1}^{n} \omega_i [y_i - f(x_i , \hat{\beta})]^2$
\end{center}

where ($x_i$, $y_i$) are the observed data and $\omega_i = \omega(u_i)$ is the
weight of marker $i$, calculated by:
\begin{center}
$\omega(u) = (1 - u^3)^3$
\end{center}
with:
\begin{center}
$u_i=\frac{|x_0 - x_i|}{\max_{x_j \in N(x_0)}|x_0 - x_j|}$
\end{center}
where $x_0$ is the center of the window and $N(x_0)$ denotes the set of markers in the window around $x_0$.

For this method, you can select the degree of the fitted curve
(\quotes{\textbf{Degree}} parameter) and the size of the window (\quotes{\textbf{Span}}
parameter). The span parameter is the percentage of the total number
of points taken into account when computing the local polynomial in
the vicinity of a marker. Span controls the degree of smoothing.
The same span value is applied to all the maps, which may not be optimal if the error
variance or the curvature of the underlying function $f$ varies.\\
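
The following lines give a minimal sketch of the idea using the base
\texttt{R} function \texttt{loess} directly. The marker positions are invented
for the illustration, the span of 0.3 (in \texttt{loess}, the span is given as
a proportion) and the step \texttt{h} are arbitrary choices, and the slope is
approximated by a finite difference on the fitted curve; this is not
necessarily how \texttt{MareyMap} derives the rate internally.

\begin{verbatim}
  # Invented Marey map: physical positions (bp) and genetic positions.
  phys <- seq(1e6, 30e6, length.out = 60)
  gen  <- 100 * (phys / 30e6)^2          # smooth, increasing curve

  # Local quadratic fit using 30% of the markers in each window.
  fit <- loess(gen ~ phys, span = 0.3, degree = 2)

  # Approximate the slope by a central finite difference on the
  # fitted curve, keeping only positions where both neighbours
  # stay inside the range covered by the markers.
  h     <- 1e4
  inner <- phys[phys - h >= min(phys) & phys + h <= max(phys)]
  up    <- predict(fit, data.frame(phys = inner + h))
  down  <- predict(fit, data.frame(phys = inner - h))
  rate  <- (up - down) / (2 * h) * 1e6   # genetic units per Mb
\end{verbatim}

Increasing \texttt{span} gives a smoother fitted curve and therefore a smoother
rate profile; decreasing it lets the estimate follow local variations more closely.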

\vspace{0.6cm}
\begin{multicols}{2}
  This method is based on the \texttt{R} \texttt{loess} function. For more information about this method, type
  \texttt{?loess} in an \texttt{R} console.\\
  Selecting this method will open a window as shown in Figure~\ref{fig:4}.
  
  \columnbreak
  \vspace{0.6cm}
  \begin{figure}[H]
  \centering
  \includegraphics[width=0.35\textwidth]{loessmethod.png}
  \caption{The Loess method}
  \label{fig:4}
  \end{figure}
\end{multicols}


\subsubsection{Sliding window}

\begin{multicols}{2}
  This method estimates the local recombination rates by carrying out
  linear regressions within a sliding window of a given physical size. You may
  adjust the size of the window (\quotes{\textbf{Size}} parameter), the distance
  between two successive windows (\quotes{\textbf{Shift}} parameter), as well as the
  minimum number of markers per window required for the interpolation to be carried out
  (\quotes{\textbf{Threshold}} parameter).

  Selecting this method will open a window as shown in Figure~\ref{fig:5}.
  \columnbreak
  \begin{figure}[H]
  \centering
  \includegraphics[width=0.35\textwidth]{slidingwindowmethod.png}
  \caption{The sliding window method}
  \label{fig:5}
  \end{figure}

\end{multicols}
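
The principle can be sketched in a few lines of base \texttt{R}. The function
\texttt{sliding\_rate} below is a hypothetical helper written for this
illustration, not the code used by the package; the way windows are positioned
(starting at the first marker, half-open intervals) is an assumption, and the
marker values are invented.

\begin{verbatim}
  # Hypothetical helper: slope of a linear regression of the genetic
  # position on the physical position, inside windows of width `size`
  # moved by steps of `shift`. Windows holding fewer than `threshold`
  # markers yield NA.
  sliding_rate <- function(phys, gen, size, shift, threshold) {
    starts <- seq(min(phys), max(phys), by = shift)
    rate <- vapply(starts, function(s) {
      inside <- phys >= s & phys < s + size
      if (sum(inside) < threshold) return(NA_real_)
      unname(coef(lm(gen[inside] ~ phys[inside]))[2])
    }, numeric(1))
    data.frame(start = starts, end = starts + size, rate = rate)
  }

  # Invented markers lying on a straight line of slope 2e-6 per bp.
  phys <- seq(0, 9e6, by = 5e5)
  gen  <- 2e-6 * phys
  res  <- sliding_rate(phys, gen, size = 2e6, shift = 1e6,
                       threshold = 3)
\end{verbatim}

In this example the last window contains a single marker, which is below the
threshold, so no rate is estimated for it.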


\subsubsection{Cubic splines}
A cubic smoothing spline behaves approximately like a kernel smoother,
but it corresponds to the function $\hat{f}$ that minimizes the penalized
residual sum of squares given by:
\begin{center}
$PRSS= \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \int (f''(t))^2 dt $
\end{center}

$\lambda$ is the smoothing parameter; it plays a role similar to that of the span in
loess. A specific $\lambda$ can be set using the \quotes{\textbf{Spar}} parameter. \\

The \quotes{\textbf{Degree of freedom}} parameter also controls the amount of
smoothing and corresponds to the trace of the smoothing matrix. If it is
not given, it follows from the value of spar or is estimated by cross-validation.\\

When they are not set by the user, these two parameters are estimated automatically by \texttt{R}, either by leave-one-out or by generalized cross-validation.\\

The leave-one-out cross-validation criterion is:
\begin{center}
$CV(\lambda) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{f}_{\lambda}^{-i} (x_i))^2$
\end{center}

Here $\hat{f}_{\lambda}^{-i} (x_i)$ is the leave-one-out smooth at
$x_i$: the spline is fitted using all the data except ($x_i$,
$y_i$) and the resulting fit is then evaluated at
$x_i$. CV is calculated for different values of $\lambda$ and the
$\lambda$ that minimizes this criterion is chosen. Generalized
cross-validation relies on an approximation of this criterion. The
\quotes{\textbf{Generalized cross-validation}} method should be used when there
are several markers with identical physical position.\\
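
As an illustration, the different ways of choosing the amount of smoothing can
be sketched with the base \texttt{R} function \texttt{smooth.spline}. The data
are simulated, the values \texttt{spar = 0.5} and \texttt{df = 8} are arbitrary,
and taking the first derivative of the fitted spline as the local rate is an
assumption of this sketch rather than a description of the package internals.

\begin{verbatim}
  # Simulated Marey map, for illustration only.
  set.seed(1)
  phys <- seq(1e6, 30e6, length.out = 50)
  gen  <- 100 * (phys / 30e6)^2 + rnorm(50, sd = 0.5)

  # cv = FALSE selects the smoothing parameter by generalized
  # cross-validation; cv = TRUE would use leave-one-out instead.
  fit_gcv <- smooth.spline(phys, gen, cv = FALSE)

  # Alternatively, fix the amount of smoothing directly.
  fit_spar <- smooth.spline(phys, gen, spar = 0.5)
  fit_df   <- smooth.spline(phys, gen, df = 8)

  # First derivative of the fitted spline at each marker,
  # expressed per megabase.
  rate <- predict(fit_gcv, x = phys, deriv = 1)$y * 1e6
\end{verbatim}

The equivalent degrees of freedom actually reached by a fit are available in
its \texttt{df} component, for instance \texttt{fit\_gcv\$df}.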

\begin{multicols}{2}
  In the graphical interface, you must fill in the parameter selected in the
  \quotes{\textbf{Type}} list.\\ 
  
  This method is directly based on the \texttt{R} function \texttt{smooth.spline}.
  For more information about this method, type
  \texttt{?smooth.spline} in an \texttt{R} console.\\
  Selecting this method will open a window as shown in Figure~\ref{fig:6}.
  \columnbreak
  \begin{figure}[H]
  \centering
  \includegraphics[width=0.35\textwidth]{splinemethod.png}
  \caption{The cubic splines method}
  \label{fig:6}
  \end{figure}
\end{multicols}

\section{Queries}

Once an interpolation method has been run on a map, you can make queries about local
recombination rates using the \quotes{{\footnotesize \textbf{LOCAL RECOMBINATION RATE}}} right frame. 
There are four different ways of using this frame. \\

1. You may want to know the recombination rate at a given
physical position on the currently displayed map. The position is
specified in base pairs (e.g. 31564623) but can also be expressed using
Mb or Kb (e.g. 31Mb, 564Kb or even 31Mb+564Kb+623). When the
\quotes{\textbf{Query}} button is pressed, the local recombination 
rate from each available interpolation is computed for this position and shown in the
updated \quotes{{\footnotesize \textbf{LOCAL RECOMBINATION RATE}}} frame. A
vertical red line is then displayed on the two central graphics, at the
physical position of interest.\\

\newpage
\begin{multicols}{2}
  2. You may want to know the recombination rate at several positions 
  on the currently displayed map. List them, separated by \quotes{\string:} (e.g.
  31Mb\string:12287456\string:44Kb+564). When clicking on the
  \quotes{\textbf{Query}} button, results will be displayed in a separate
  window (see Figure~\ref{fig:7}) and can be saved into a text file. The results 
  will include one column per interpolation available for the displayed map.
  
  \columnbreak
  \begin{figure}[H]
  \centering
  \includegraphics[width=0.45\textwidth]{severalqueries.png}
  \caption{Example of output for several queries.}
  \label{fig:7}
  \end{figure}
\end{multicols}

3. You may want to know the recombination rate at many positions 
(for instance all the genes of a genome). This can be done by
loading a text file that lists all the positions. To do this, you can
\begin{itemize}
 \item enter the path of this file and click on the \quotes{\textbf{Query}} button,
 \item or click on \quotes{\textbf{Read positions from file}} and select the file using the file selector dialog window.
\end{itemize}

The input file must be a text file (\textit{txt} extension) containing at
least a \quotes{map} column and a \quotes{phys} column giving,
respectively, the map and the physical position of each gene. An example
file \textit{test\_query.txt} is provided with the package. This file
may also include a \quotes{set} column, for instance if there are genes from several
genomes (if this column is absent, all the genes are
assumed to come from the same genome and are treated as a single query). Any other column is ignored
by the program but kept in the output file.\\

4. It is also possible to know the recombination rate at a position in an
interactive way. When one marker is selected (by clicking) on the
displayed map (in the top central frame), some details are updated in
the \quotes{{\footnotesize \textbf{MARKERS}}} left frame. You will be able to click
on the \quotes{\textbf{Query recombination rate}} button. As before, 
results are shown in the updated \quotes{{\footnotesize \textbf{LOCAL
RECOMBINATION RATE}}} frame and a vertical red line is displayed on
the two central graphics, at the physical position of interest (see
Figure~\ref{fig:8}).
    
\begin{figure}[H]
\centering
\includegraphics[width=0.85\textwidth]{interactivequery.png}
\caption{Example of an interactive query.}
\label{fig:8}
\end{figure}
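
To make the position syntax used in the first two kinds of query concrete, the
following sketch converts such expressions into base pairs. The function
\texttt{parse\_position} is a hypothetical helper written for this illustration;
it is not the parser used by \texttt{MareyMap}, and accepting the units in any
letter case is an assumption of the sketch.

\begin{verbatim}
  # Hypothetical helper, not part of MareyMap: convert a position
  # such as "31Mb+564Kb+623" into a number of base pairs.
  parse_position <- function(txt) {
    terms <- strsplit(txt, "+", fixed = TRUE)[[1]]
    to_bp <- function(term) {
      term  <- trimws(term)
      unit  <- tolower(sub("^[0-9.]+", "", term))
      value <- as.numeric(sub("[A-Za-z]+$", "", term))
      value * (if (unit == "mb") 1e6 else if (unit == "kb") 1e3 else 1)
    }
    sum(vapply(terms, to_bp, numeric(1)))
  }

  # A single position, as in the first kind of query.
  single <- parse_position("31Mb+564Kb+623")

  # Several positions separated by ":", as in the second kind.
  positions <- strsplit("31Mb:12287456:44Kb+564", ":",
                        fixed = TRUE)[[1]]
  bp <- vapply(positions, parse_position, numeric(1),
               USE.NAMES = FALSE)
\end{verbatim}

Here \texttt{single} corresponds to position 31564623, the same position as in
the base pair example given for the first kind of query.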


\section{Saving your results}

\subsection{Saving data}
Maps can be saved to \texttt{R} data files (\textit{rda}, \textit{Rda}, 
\textit{rdata} or \textit{Rdata}) or to text files (\textit{txt}).

All the interpolations created (applied on either a map or a set of
maps) in the current \texttt{R} session are saved in the file.\\

If the file is a text file, it will include a line per marker with
columns \quotes{set} (for the dataset name), \quotes{map} (for the
map name, \textit{i.e.} the chromosome name), \quotes{phys} (for the physical
position of the marker), \quotes{gen} (for the genetic position of the
marker) and \quotes{vld} (indicating if the marker is valid or not).
If interpolation methods are included (those for which the
\includegraphics[width=0.03\textwidth]{stock_save} checkbox is checked), the file also
contains a column per interpolation (the column name is the
interpolation method name) with the local recombination rate computed
for each marker. Functions used to build the interpolations are also
saved as comments at the beginning of the text file.


\subsection{Exporting pictures}
Maps can also be graphically exported in \textit{jpeg}, \textit{png},
\textit{pdf} or \textit{eps} formats. Only the currently displayed map
is exported, with only the interpolations which are checked as
\quotes{\textbf{Displayed}}. You can choose to export either the Marey map (on
the top), or the recombination rate display (on the bottom), or both.


\subsection{Loading previous analyses}
You may want to resume work on a dataset. 

If the work was saved in \textit{txt} format, you can re-run the
interpolation methods using the \texttt{R} commands 
previously used, which can be found at the top of the \textit{txt} file.

If it was saved in \textit{rda} format, the \quotes{\textbf{Open}} command loads all previously saved interpolations.



\end{document}