%\VignetteIndexEntry{Using Rankcluster} \documentclass[preprint,authoryear,12pt]{elsarticle} % \usepackage{mathptmx} % use Times fonts if available on your TeX system \usepackage[OT1]{fontenc} \usepackage{graphicx} \usepackage{rotating} \usepackage{array} \usepackage{textcomp} \usepackage[latin1]{inputenc} \usepackage{multirow} \usepackage{graphicx,epsfig,amssymb,amstext,amsmath,amsthm,color} \usepackage[english]{babel} \def\1{\mbox{1\hspace{-.22em}I}} \newcommand{\X}{\mathbb{X}} \newcommand{\R}{\mathbb{R}} \newcommand{\Rd}{\mathbb{R}^d} \newcommand{\Xd}{\mathbb{X}^d} \newcommand{\argmin}{\mathop{\mathrm{argmin}}} \newcommand{\argmax}{\mathop{\mathrm{argmax}}} \newcommand{\trace}{\mathop{\mathrm{Tr}}} \newcommand{\diag}{\mathop{\mathrm{diag}}} %\newcommand{\mix}{\textsc{mixmod}} \newcommand{\II}{1 \! \! 1} \newcommand{\IR}{\mathbb{R}} \newcommand{\IZ}{\mathbb{Z}} \newcommand{\IN}{\mathbb{N}} \newcommand{\IE}{\mathbb{E}} \newcommand{\mat}[4]{\begin{array}{cc}#1 & #2 \\#3 & #4 \end{array}} \newcommand{\matb}[4]{\begin{array}{cc}{\bf #1} & {\bf #2} \\{\bf #3} & {\bf #4} \end{array}} \newcommand{\med}{\mathrm{med}} \newcommand{\tr}{\mbox{trace}} \newcommand{\tra}[1]{\mbox{tr}{\bf #1}} \newcommand{\var}{\mbox{var}} % Lettre en gras \newcommand{\ba}{\mathbf{a}} \newcommand{\bc}{\mathbf{c}} \newcommand{\bg}{\mathbf{g}} \newcommand{\be}{\mathbf{e}} \newcommand{\bp}{\mathbf{p}} \newcommand{\bu}{\mathbf{u}} \newcommand{\bx}{\mathbf{x}} \newcommand{\bX}{\mathbf{X}} \newcommand{\bZ}{\mathbf{Z}} \newcommand{\bz}{\mathbf{z}} \newcommand{\bt}{\mathbf{t}} \newcommand{\by}{\mathbf{y}} \newcommand{\x}{x} \newcommand{\mx}{x} \newcommand{\muu}{u} \newcommand{\mv}{v} \newcommand{\my}{y} \newcommand{\mz}{z} \newcommand{\z}{z} \newcommand{\y}{y} \newcommand{\Y}{Y} % \newcommand{\X}{X} \newcommand{\Z}{Z} % Lettre grecque en gras % requiert \usepackage{amsbsy} \newcommand{\balpha}{\boldsymbol{\alpha}} \newcommand{\bDelta}{\boldsymbol{\Delta}} \newcommand{\bepsilon}{\boldsymbol{\epsilon}} \newcommand{\bGamma}{\boldsymbol{\Gamma}} \newcommand{\blambda}{\boldsymbol{\lambda}} \newcommand{\bmu}{\boldsymbol{\mu}} \newcommand{\bpi}{\boldsymbol{\pi}} \newcommand{\bphi}{\boldsymbol{\phi}} \newcommand{\brho}{\boldsymbol{\rho}} \newcommand{\btheta}{\boldsymbol{\theta}} \newcommand{\bTheta}{\boldsymbol{\Theta}} \newcommand{\bvarepsilon}{\boldsymbol{\varepsilon}} \def\1{\mbox{1\hspace{-.22em}I}} % \newcommand{\btheta}{\boldsymbol{\theta}} \newtheorem{defn}{Definition}[section] \newtheorem{thm}{Theorem}[section] \newtheorem{proposition}{Proposition}[section] \newtheorem{corollary}{Corollary}[section] \newtheorem{remark}{Remark}[section] \newtheorem{lemma}{Lemma}[section] % \newcommand{\argmin}{\mathop{\mathrm{argmin}}} % \newcommand{\argmax}{\mathop{\mathrm{argmax}}} \newcommand{\demi}{\mbox{$\frac{1}{2}$}} \addtolength{\hoffset}{-1cm} \addtolength{\textwidth}{2cm} \addtolength{\voffset}{-1cm} \addtolength{\textheight}{2cm} % utile pour la revision \usepackage{ulem} % caracteres barres \sout{} et soulignes \uwave{} \usepackage{cancel} % barre dans un environnement math \renewcommand{\sout}[1]{} % supprime le texte barre (hors eq. math) %\renewcommand{\cancel}[1]{} % supprime les eq barrees \renewcommand{\uwave}[1]{#1} \begin{document} \SweaveOpts{concordance=TRUE} \begin{frontmatter} \title{Rankcluster: An \textbf{R} package for clustering multivariate partial rankings} \author{Julien Jacques$^{1,2}$\corref{cor1}} \ead{julien.jacques@polytech-lille.fr} \cortext[cor1]{Corresponding author. 
Tel.: +33 320 436 760, Fax: +33 320 434 302} \author{Quentin Grimonprez$^{2}$} \ead{quentin.grimonprez@inria.fr} \author{Christophe Biernacki$^{1,2}$} \ead{christophe.biernacki@math.univ-lille1.fr} \address{$^{1}$Laboratoire Paul Painlev\'e, UMR CNRS 8524, University Lille I, Lille, France} \address{$^{2}$MODAL team, Inria Lille-Nord Europe}

\begin{abstract}
\textit{Rankcluster} is the first \textbf{R} package proposing both modelling and clustering tools for ranking data, potentially multivariate and partial. Ranking data are modelled by the Insertion Sorting Rank (\textsc{isr}) model, a meaningful model parametrized by a central ranking and a dispersion parameter. A conditional independence assumption makes it possible to take multivariate rankings into account, and clustering is performed by means of mixtures of multivariate \textsc{isr} models. The cluster parameters (central rankings and dispersion parameters) help practitioners interpret the clustering. Moreover, the \textit{Rankcluster} package provides an estimation of the missing ranking positions when rankings are partial. After an overview of the mixture of multivariate \textsc{isr} models, the \textit{Rankcluster} package is described and its use is illustrated through the analysis of two real datasets.
\end{abstract}

\begin{keyword}
model-based clustering, multivariate rankings, partial rankings, \textbf{R}, Rankcluster.
\end{keyword}
\end{frontmatter}

\textbf{All results from sections \ref{sec:Rankcluster} and \ref{sec:Examples} were obtained with Rankcluster 0.91.6. Since Rankcluster 0.92, the data format has changed: ranks must be provided in their ranking representation (and not ordering representation). This change was made to manage tied objects.}

\section{Introduction}

Ranking data occur when a number of subjects are asked to rank a list of objects according to their personal preference order. Such data are of great interest in human activities involving preferences, attitudes or choices, in fields like Psychology, Sociology, Politics, Marketing, \textit{etc}. For instance, the voting system \textit{single transferable vote}, used in Ireland, Australia and New Zealand, is based on preferential voting \citep{Gor2008a}. In many applications, the study of ranking data reveals heterogeneity, due for instance to different political opinions, different human preferences, \textit{etc}. Recently, \cite{Jac2012b} proposed a model-based clustering algorithm in order to analyse and explore such ranking data. This algorithm is able to take into account multivariate rankings with potentially partial rankings (when a subject did not rank all the objects). To the best of our knowledge, this is the only clustering algorithm for ranking data with such a wide application scope. This algorithm is based on an extension of the Insertion Sorting Rank (\textsc{isr}) model~\citep{Jac2012} for ranking data, which is a meaningful and effective model obtained by modelling the ranking generating process, assumed to be a sorting algorithm. The \textsc{isr} model is parametrized by a location parameter (the modal ranking) and a dispersion parameter. The heterogeneity of the rank population is modelled by a mixture of \textsc{isr} distributions, whereas a conditional independence assumption allows the extension to multivariate rankings.
Maximum likelihood estimation is performed through a SEM-Gibbs algorithm, in which partial rankings are considered as missing data, which allows them to be simulated during the estimation process.\\
This algorithm has been implemented in C++ and is available through the \textit{Rankcluster} package for {R}, available on {R-forge} (and soon on the {CRAN} website) and presented at length in the remainder of this paper.

% organisation du papier
The paper is organised as follows: Section \ref{sec:model} briefly presents the clustering algorithm proposed in \cite{Jac2012b}. Section~\ref{sec:OtherPackages} describes the existing {R} packages dedicated to ranking data, whereas Section~\ref{sec:Rankcluster} discusses the functionalities of the \textit{Rankcluster} package. Section~\ref{sec:Examples} illustrates the use of \textit{Rankcluster} through the cluster analysis of two datasets: 1. Fligner and Verducci's \textit{words} univariate dataset without any missing ranking positions \citep{Fli1986}, and 2. the rankings of the Big Four of English football (Manchester United, Liverpool, Arsenal and Chelsea) according to the Premier League results and to their UEFA coefficients between 1993 and 2013 (a bivariate dataset with missing ranking positions).

\section[Overview of the model-based clustering algorithm]{Overview of the model-based clustering algorithm}
\label{sec:model}

This section gives an overview of the model-based clustering algorithm for multivariate partial rankings proposed in \cite{Jac2012b}. It relies on the univariate \textsc{isr} model, which we introduce first.

\subsection[The univariate isr model]{The univariate ISR model}
\label{subsec:ISR}

Rank data arise when judges or subjects are asked to rank several objects $\mathcal{O}_1,\ldots,\mathcal{O}_m$ according to a given order of preference. The resulting ranking can be described by its \textit{ordering} representation $\mx=(x^1,\ldots,x^m)\in\mathcal{P}_m$, which signifies that Object $\mathcal{O}_{x^h}$ is ranked in the $h$th position ($h=1,\ldots,m$), where $\mathcal{P}_m$ is the set of the permutations of the first $m$ integers.
Based on the assumption that a rank datum is the result of a sorting algorithm based on paired comparisons, and that the judge who ranks the objects uses the insertion sort because of its optimality properties (minimum number of paired comparisons), \cite{Jac2012} state the following so-called \textsc{isr} model:
\begin{eqnarray}
\label{eq:model}
\mbox{p}(\mx;\mu,\pi) = \frac{1}{m!} \sum_{\my\in \mathcal{P}_m}\mbox{p}(\mx|\my;\mu,\pi) = \frac{1}{m!} \sum_{\my\in \mathcal{P}_m} \pi^{G(\mx,\my,\mu)}(1-\pi)^{A(\mx,\my)-G(\mx,\my,\mu)},
\end{eqnarray}
where
\begin{itemize}
\item $\mu\in\mathcal{P}_m$, the modal ranking, is a \textit{location parameter}.
Its \textit{opposite} ranking $\bar{\mu}$ ($\bar{\mu}=\mu\circ \bar e$ with $\bar e=(m,\ldots,1)$) is the rank of smallest probability,
\item $\pi\in[\demi,1]$, the probability of a good paired comparison according to $\mu$ in the sorting algorithm, is a \textit{scale parameter}: the distribution is uniform when $\pi=\demi$, and the mode $\mu$ of the distribution becomes more pronounced as $\pi$ grows, the distribution reducing to a Dirac mass at $\mu$ when $\pi=1$,
\item the sum over $\my \in \mathcal{P}_m$ corresponds to all the possible initial presentation orders of the objects to rank (with identical prior probabilities equal to $1/m!$),
\item $G(\mx,\my,\mu)$ is the number of good paired comparisons during the sorting process that returns $\mx$ when the presentation order is $\my$,
\item $A(\mx,\my)$ corresponds to the total number of paired comparisons (good or wrong).
\end{itemize}
The precise definitions of $G(\mx,\my,\mu)$ and $A(\mx,\my)$ can be found in \cite{Jac2012}.

\subsection[Mixture of multivariate isr]{Mixture of multivariate ISR}

Let us now redefine $\x=(\mx^1,\ldots,\mx^p)\in \mathcal{P}_{m_1}\times\ldots\times \mathcal{P}_{m_p}$ as a \textit{multivariate} rank, in which $\mx^j=(x^{j1},\ldots,x^{jm_j})$ is a rank of $m_j$ objects ($1\leq j \leq p$). The population of multivariate ranks is assumed to be composed of $K$ groups in proportions $p_k$ ($p_k\in[0,1]$ and $\sum_{k=1}^{K}p_k=1$).
Given a group $k$, the $p$ components $\mx^1,\ldots,\mx^p$ of the multivariate rank datum $\x$ are assumed to be sampled from independent \textsc{isr} distributions with corresponding modal rankings $\mu^1_k,\ldots,\mu^{p}_k$ (each $\mu^j_k\in\mathcal{P}_{m_j}$) and good paired comparison probabilities $\pi^{1}_k,\ldots,\pi^{p}_k\in[\demi,1]$.
The unconditional probability of a rank $\x$ is then
\begin{eqnarray}
\label{eq:model-melange}
\mbox{p}(\x;\btheta) = \sum_{k=1}^{K}p_k\prod_{j=1}^{p} \frac{1}{m_j!}\sum_{\my\in \mathcal{P}_{m_j}}\mbox{p}(\mx^j|\my;\mu^{j}_k,\pi^{j}_k),
\end{eqnarray}
where $\btheta=(\pi^{j}_k,\mu^{j}_k,p_k)_{k=1,\ldots,K\,, j=1,\ldots,p}$ and $\mbox{p}(\mx^j|\my;\mu^{j}_k,\pi^{j}_k)$ is defined by (\ref{eq:model}).
Each component $\mx^j$ of $\x$ can be full or partial. Frequently, the objects in the top positions will be ranked and the missing ones will be at the end of the ranking, but our model does not impose such a situation and is able to work with partial rankings whatever the positions of the missing data (see details in \cite{Jac2012b}).

\subsection[Estimation algorithm]{Estimation algorithm}

Let $\boldsymbol{\mathrm{x}}=\{\x_1,\ldots,\x_n\}$ be a sample of $n$ multivariate rankings, and $\boldsymbol{\mathrm{z}}=\{\z_1,\ldots,\z_n\}$ the corresponding latent cluster memberships. Let $\check{I}^j_i$ and $\hat{I}^j_i$ be respectively the sets of indices of observed and unobserved positions in the $j$th component $\mx_i^j$ of the $i$th observation $\x_i$. Similarly, let $\check{\mx}_i^{j}$ and $\hat{\mx}_i^{j}$ correspond to the previous notations for the $j$th component of the $i$th observation, $\check{\x}_i=\{\check{\mx}_i^1,\ldots,\check{\mx}_i^p\}$ and $\hat{\x}_i=\{\hat{\mx}_i^1,\ldots,\hat{\mx}_i^p\}$. Let us also define $\boldsymbol{\mathrm{\check{x}}}=\{\check{\x}_i; i=1,\ldots,n\}$ and $\boldsymbol{\mathrm{\hat{x}}}=\{\hat{\x}_i; i=1,\ldots,n\}$.
Finally $\y_i=(\my_i^1,\ldots,\my_i^p)\in\mathcal{P}_{m_1}\times\ldots\times\mathcal{P}_{m_p}$ denotes the presentation orders of the objects for the $i$th observation and $\boldsymbol{\mathrm{y}}=\{\y_1,\ldots,\y_n\}$.\\
Assuming that the triplets $(\x_i,\y_i,\z_i)$ arise independently ($i=1,\ldots,n$), the observed-data log-likelihood of model (\ref{eq:model-melange}) is:
\begin{eqnarray*}
l(\btheta;\boldsymbol{\mathrm{\check{x}}}) = \sum_{i=1}^{n}\ln\left( \sum_{k=1}^{K} p_k\prod_{j=1}^{p}\frac{1}{m_j!} \sum_{\my\in \mathcal{P}_{m_j}}\sum_{\mx\in \mathcal{X}_{i}^j}\mbox{p}(\mx|\my;\mu^{j}_k,\pi^{j}_k)\right),
\end{eqnarray*}
where $\mathcal{X}_{i}^j=\{\mx \in \mathcal{P}_{m_j}:x^{h}=\check{x}^{jh}_i, \,\forall h\in \check{I}^j_i\}$ is the set of all the rankings compatible with the observed part $\check{\mx}^j_i$ of ${\mx}^j_i$.

Maximum likelihood estimation is not straightforward since several types of missing data occur: the cluster memberships $\z_i$ of the observations, the presentation orders $\y_i$ and the unobserved ranking positions $\hat{\x}_i$ (for partial rankings). In such a situation, a convenient way to maximize the likelihood is to consider an EM algorithm \citep{Dem1977}. This algorithm relies on the completed-data log-likelihood, and proceeds by iterating an E step, in which the conditional expectation of the completed-data log-likelihood is computed, and an M step, in which the model parameters are estimated by maximizing the conditional expectation computed in the E step. Unfortunately, the EM algorithm is tractable only for univariate full rankings (for mathematical reasons) and only for moderate $m$ ($m\leq 7$, for numerical reasons). In particular, when partial rankings occur, the E step is intractable since the completed-data log-likelihood is not linear in the three types of missing data (refer to \cite{Jac2012b} for its expression).
A SEM-Gibbs approach is then proposed in \cite{Jac2012b} to overcome these problems. The fundamental idea of this algorithm is to reduce the computational complexity present in both the E and M steps of EM by removing all explicit and extensive use of the conditional expectations of any product of missing data. First, it relies on the SEM algorithm \citep{geman84,celeux85}, which generates the latent variables at a so-called stochastic step (S step) from the conditional probabilities computed at the E step. These latent variables are then directly used in the M step. Second, the advantage of the SEM-Gibbs algorithm in comparison with the basic SEM algorithm is that the latent variables are generated without calculating conditional probabilities at the E step, thanks to a Gibbs algorithm. Refer to \cite{Jac2012b} for more details.\\
Let us note that label switching can occur with the SEM algorithm when clusters are not well separated. To avoid this situation, we recommend using model selection criteria such as BIC~\citep{Sch1978} or ICL~\citep{Bie2000} to select the number $K$ of clusters.
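Although these criteria are not specific to \textit{Rankcluster}, it may help to recall their usual form, both being minimized (here $\nu$ denotes the number of free model parameters, $\hat{\btheta}$ the maximum likelihood estimate, and the ICL expression is the classical entropy-penalized approximation of \cite{Bie2000}):
\begin{eqnarray*}
\mathrm{BIC} = -2\, l(\hat{\btheta};\boldsymbol{\mathrm{\check{x}}}) + \nu \ln n, \qquad
\mathrm{ICL} = \mathrm{BIC} - 2\sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}\ln t_{ik},
\end{eqnarray*}
where $t_{ik}$ denotes the conditional probability for the $i$th observation to belong to cluster $k$.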
\section[Existing packages]{Existing {R} packages for ranking data}
\label{sec:OtherPackages}

To the best of our knowledge, there exist only two packages dedicated to the analysis of ranking data, both available on the {CRAN} website, but neither of them seems to work to date. More importantly, their announced functionalities are significantly limited in comparison to our package \textit{Rankcluster}, as we discuss now:
\begin{itemize}
\item the \textit{pmr} package for {R} provides some descriptive statistics and modelling tools using classical rank data models for full univariate ranking data: Luce models, distance-based models, and rank-ordered logit (refer to \cite{Mar1995} for a description of these models). Visualization of ranking data using polytopes is also available for less than four objects to rank ($m\leq 4$).\\
Note that even after correcting their \texttt{NAMESPACE} file (by adding the missing command line \texttt{exportPattern("\textasciicircum[[:alpha:]]+")}) and recompiling the package, the main function does not seem to work, even with the proposed examples.
\item the \textit{RMallow} package for {R} is supposed to perform clustering of univariate ranking data using mixtures of Mallows models \citep{Mur2003}.
\end{itemize}
\textit{Rankcluster} proposes modelling and clustering tools on the basis of the mixture of multivariate \textsc{isr} models presented in Section \ref{sec:model}. Compared to the existing packages, \textit{Rankcluster} is the only one taking into account multivariate and partial ranking data.

\section[Overview of the Rankcluster functions]{Overview of the \textit{Rankcluster} functions}
\label{sec:Rankcluster}

\textbf{This section presents functions from Rankcluster 0.91.6. Since Rankcluster 0.92, the data format has changed: ranks (\texttt{data} parameter) must be provided in their ranking representation\footnote{The \textit{ranking} representation $x^{-1}=(x_1^{-1},\ldots,x_m^{-1})$ contains the ranks assigned to the objects, and means that Object $\mathcal{O}_i$ is in the $x_i^{-1}$th position ($i=1,\ldots,m$). Notice that $x$ is associated to the ordering representation.} (and not ordering representation). This change was made to manage tied objects.}

This section presents, firstly, the main function \texttt{rankclust()}, which performs the cluster analysis, and, secondly, several companion functions which can be helpful for additional ranking data analysis.

\subsection[Main function]{The main function: \texttt{rankclust()}}

Cluster analysis can be performed with the \texttt{rankclust()} function. Illustrations of its use are given in Section \ref{sec:Examples}.

\subsubsection[Inputs]{Input arguments}

This function has only one mandatory argument, \texttt{data}, which is a matrix composed of the $n$ observed ranks in their ordering representation. For \textbf{univariate rankings} the number of columns of \texttt{data} is $m$ (default value of argument \texttt{m}). For \textbf{multivariate rankings}, \texttt{data} has $m_1+\ldots+m_p$ columns: the first $m_1$ columns contain $\mx^1$ (first dimension), the columns $m_1+1$ to $m_1+m_2$ contain $\mx^2$ (second dimension), and so on. In this case, the argument \texttt{m} must be filled with the vector of sizes $(m_1,\ldots,m_p)$. If the user works with a \textbf{ranking representation}\footnote{The \textit{ranking} representation $x^{-1}=(x_1^{-1},\ldots,x_m^{-1})$ contains the ranks assigned to the objects, and means that Object $\mathcal{O}_i$ is in the $x_i^{-1}$th position ($i=1,\ldots,m$). Notice that $x$ is associated to the ordering representation.} of the ranks, the \texttt{convertRank()} function can be used to transform the ranking representation into the ordering one (see the sketch below).
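To illustrate the expected layout, here is a minimal sketch with hypothetical data: $n=4$ judges and two dimensions with $m_1=3$ and $m_2=4$ objects, so that \texttt{data} has $3+4=7$ columns, each row concatenating a permutation of $\{1,2,3\}$ and a permutation of $\{1,2,3,4\}$ in ordering representation:\\
\texttt{R> data=rbind(c(1,3,2,2,1,4,3),c(3,2,1,1,2,3,4),c(2,1,3,4,3,2,1),\\c(1,2,3,2,1,3,4))}\\
\texttt{R> res=rankclust(data,m=c(3,4),K=2)}\\
This toy call is only meant to show the layout of the \texttt{data} and \texttt{m} arguments; a real analysis obviously requires more observations.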
The number of clusters (1 by default) can be set with the option \texttt{K}. A vector of numbers of clusters can also be given (for instance \texttt{K=1:10}).
%The estimation algorithm can be specified by \texttt{algorithm = "SEM"} (for SEM-Gibbs algorithm) or \texttt{algorithm = "EM"}. Notice that the EM algorithm is available only for univariate full rankings with $m\leq 8$. Nevertheless, we encourage to use the SEM-Gibbs algorithm in all situations, since it leads to the same estimation as EM when this last is tractable (refer to \cite{Jac2012} for numerical illustration of equivalence between EM and SEM-Gibbs).
In order to select the number of clusters, two criteria are available: BIC, by default, and ICL, selected by \texttt{criterion = "icl"}.

Additionally, several arguments allow the tuning parameters (numbers of iterations) used in the SEM-Gibbs estimation to be set. Refer to \cite{Jac2012b} and to the \texttt{rankclust()} help for more details. Section \ref{sec:Examples} also gives some examples of choices for the numbers of iterations. The option \texttt{run} sets the number of initializations of the SEM-Gibbs algorithm (1 by default). In the case of multiple initializations, the best solution according to the approximated log-likelihood is retained.

Finally, the computing times (total, for the SE and M steps, and for the likelihood approximation) can be printed by setting the option \texttt{detail} to \texttt{TRUE} (\texttt{FALSE} by default).

\subsubsection[Outputs]{Output arguments}

The \texttt{rankclust()} function returns an instance of the \texttt{ResultTab} class, which contains the following slots:
\begin{itemize}
\item \texttt{K}: the vector of the numbers of clusters,
\item \texttt{results}: a list of instances of the \texttt{ResultList} class, containing the results for each number of clusters (one element of the list is associated to one number of clusters),
\item \texttt{data}: the data used for clustering,
\item \texttt{criterion}: the model selection criterion used,
\item \texttt{convergence}: a boolean indicating that no empty-class problem was encountered (whatever the number of clusters).
\end{itemize}
Each element of the list \texttt{results} contains all the results for a given number $K$ of clusters, summarized in the following slots:%, reachable by \texttt{res[k]@slotname} if \texttt{res} is the result of \texttt{rankclust()} execution:
\begin{itemize}
\item \texttt{proportion}: a $K$-vector of proportions $p_1,\ldots,p_K$,
\item \texttt{pi}: a $K\times p$-matrix composed of the scale parameters $\pi^{j}_{k}$ ($1\leq k \leq K$ and $1\leq j \leq p$),
\item \texttt{mu}: a matrix with $K$ rows and $m_1+\ldots+m_p$ columns, in which row $k$ is composed of the location parameters $(\mu^{1}_{k},\ldots,\mu^{p}_{k})$ of cluster $k$,
\item \texttt{ll, bic, icl}: the values of the log-likelihood, BIC criterion and ICL criterion,
\item \texttt{tik}: a $n\times K$-matrix containing the estimated conditional probabilities for the observed ranks to belong to each cluster,
\item \texttt{partition}: a $n$-vector containing the partition estimation resulting from the clustering,
\item \texttt{entropy}: a $n\times 2$-matrix containing, for each observation, its entropy (column 1), defined as $-\sum_{k=1}^{K}t_{ik}\log(t_{ik})$ where $t_{ik}$ is the conditional probability for the $i$th observation to belong to cluster $k$ (given by \texttt{tik}), and its estimated cluster (column 2, similar to \texttt{partition}). The \texttt{entropy} output illustrates the confidence in the clustering of each observation (a high entropy means a low confidence in the clustering),
\item \texttt{probability}: a $n\times 2$-matrix similar to the \texttt{entropy} output, containing, for each observation, its probability $p(x_i;\mu_k,\pi_k)$ given its cluster (column 1) and its estimated cluster (column 2, similar to \texttt{partition}). This probability is estimated using the last simulation of the presentation orders used for the likelihood approximation. The \texttt{probability} output exhibits the best representatives of each cluster,
\item \texttt{convergence}: a boolean indicating that no empty-class problem was encountered,
\item \texttt{partial}: a boolean indicating the presence of partial rankings,
\item \texttt{partialRank}: a matrix containing the full rankings, estimated using the within-cluster \textsc{isr} parameters when the ranking is partial. When the ranking is full, \texttt{partialRank} simply contains the observed ranking. Available only in the presence of at least one partial ranking,
\item \texttt{distanceProp, distancePi, distanceMu}: the distances between the final estimation and the current value at each iteration of the SEM-Gibbs algorithm (except the burning phase) for, respectively, the proportions $p_k$, the scale parameters $\pi^{j}_{k}$ and the location parameters $\mu^{j}_{k}$. For $\mu^{j}_{k}$, the Kendall distance for rankings has been considered \citep{Mar1995}. For $p_k$ and $\pi^{j}_{k}$, the distance is the mean squared difference. \texttt{distanceProp}, \texttt{distancePi} and \texttt{distanceMu} are lists of \texttt{Qsem-Bsem} elements, each element being a $K\times p$-matrix.\\
These elements are reachable by \texttt{res[k]@slotname[[iteration]]},
\item \texttt{distanceZ}: a vector of size \texttt{Qsem-Bsem} containing the Rand index \citep{Ran1971} between the final estimated partition and the current value at each iteration of the SEM-Gibbs algorithm (except the burning phase). Note that the Rand index is not affected by label switching,
\item \texttt{distancePartialRank}: the Kendall distance between the final estimation of the partial rankings (missing positions in such rankings are estimated) and the current value at each iteration of the SEM-Gibbs algorithm (except the burning phase). \texttt{distancePartialRank} is a list of \texttt{Qsem-Bsem} elements, each element being a matrix of size $n\times p$. Available only in the presence of at least one partial ranking,
\item \texttt{proportionInitial, piInitial, muInitial, partialRankInitial}: the initializations of the parameters in the SEM-Gibbs algorithm (for expert use only).
\end{itemize}
If \texttt{res} is the result of \texttt{rankclust()}, each slot of \texttt{results} can be reached by \texttt{res[k]@slotname}, where \texttt{k} is the number of clusters and \texttt{slotname} is the name of the slot to reach (\texttt{proportion}, \texttt{pi} ...). For the slots \texttt{ll, bic, icl}, \texttt{res[``slotname'']} returns a vector containing the values of the slot for each number of clusters.
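For instance, assuming \texttt{res} has been obtained with \texttt{K=1:5}, typical accesses are (a sketch, output not shown):\\
\texttt{R> res[3]@proportion \# mixing proportions of the 3-cluster solution}\\
\texttt{R> res[3]@mu \# modal rankings of the 3-cluster solution}\\
\texttt{R> res["bic"] \# BIC values for K=1,...,5}\\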
\subsection[Companion functions]{Companion functions}

In addition to the main function \texttt{rankclust()}, several companion functions are available in \textit{Rankcluster} (a short sketch of their use is given after the list):
\begin{itemize}
\item \texttt{convertRank()}: converts the ranking representation $x^{-1}$ of a rank to its ordering representation $x$, and \textit{vice-versa}, since $x \circ x^{-1}=x^{-1}\circ x$ is the identity permutation,
% \item \texttt{correlSpearman()}: compute Spearman correlation between rankings (refer to \cite{Mar1995}),
\item \texttt{criteria()}: estimates the log-likelihood, BIC and ICL criteria from a dataset and a corresponding estimation of the \textsc{isr} model parameters (see details with \texttt{help(criteria)}),
\item \texttt{distCayley(), distHamming(), distKendall(), distSpearman()}: compute the usual distances between rankings (refer to \cite{Mar1995}) for either the ranking or the ordering representation (refer to the help of these functions for details),
\item \texttt{frequence()}: transforms a raw dataset composed of a list of ranks into a rank/frequency matrix, in which the last column is the observation frequency of the rank. Conversely, \texttt{unfrequence()} transforms a rank/frequency dataset into a raw dataset, as required by the input argument of \texttt{rankclust()},
\item \texttt{khi2()}: performs a chi-squared goodness-of-fit test and returns the p-value of the test (refer to \cite{Jac2012} for details),
\item \texttt{kullback()}: estimates the Kullback-Leibler divergence between two \textsc{isr} models,
\item \texttt{simulISR()}: simulates a univariate and unimodal dataset of rankings according to the \textsc{isr} model.
\end{itemize}
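The following lines sketch a typical use of some of these functions; the exact argument names of \texttt{distKendall()} and \texttt{simulISR()} should be checked in their help pages, as they are only assumed here:\\
\texttt{R> x=c(2,5,3,4,1) \# a rank of m=5 objects in ordering representation}\\
\texttt{R> convertRank(x) \# its ranking representation: 5 1 3 4 2}\\
\texttt{R> distKendall(x,1:5) \# Kendall distance between x and the identity rank}\\
\texttt{R> simu=simulISR(n=100,m=5,mu=c(2,5,3,4,1),pi=0.9) \# simulated sample}\\
Simulating with increasing values of \texttt{pi} is a simple way to visualize the role of the dispersion parameter: the closer \texttt{pi} is to 1, the more the simulated ranks concentrate around \texttt{mu}.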
\section[Rankcluster through examples]{\textit{Rankcluster} through examples}
\label{sec:Examples}

\textbf{All results from this section were obtained with Rankcluster 0.91.6. Since Rankcluster 0.92, the data format has changed: ranks must be provided in their ranking representation (and not ordering representation). This change was made to manage tied objects.}

This section illustrates the use of the \texttt{rankclust()} function on two real datasets. The first one, \texttt{words}, is a well-known dataset in ranking studies, due to \cite{Fli1986}, which consists of word associations by students. The second one, \texttt{big4}, consists of the rankings of the Big Four of English football (Manchester United, Liverpool, Arsenal and Chelsea) according to the Premier League results and to their UEFA coefficients between 1993 and 2013 (see Appendix \ref{sec:big4}).
% The second one, \textit{Eurovision}, presented in \cite{Jac2012b}, consists of the votes of European countries during the Eurovision Song Contest.
Both datasets are available in the \textit{Rankcluster} package, as well as the three other datasets analysed and described in \cite{Jac2012b}: \texttt{APA}, \texttt{quiz} and \texttt{sports}.

\subsection[Words]{The \texttt{words} dataset}

This dataset was collected under the auspices of the Graduate Record Examination Board \citep{Fli1986}. A sample of 98 college students were asked to rank five words according to the strength of association (least to most associated) with the target word ``Idea'': 1 = Thought, 2 = Play, 3 = Theory, 4 = Dream and 5 = Attention.

\begin{figure}[!h]
\begin{center}
\includegraphics[width=8cm]{images/Words-BIC}
\caption{\label{fig:Words-bic}Value of the BIC criterion with mixture of \textsc{isr} for the \texttt{words} dataset.}
\end{center}
\end{figure}

First we start by installing and loading the \textit{Rankcluster} package:\\
\texttt{R> install.packages("Rankcluster")}\\
\texttt{R> library(Rankcluster)}\\
and then loading the \texttt{words} dataset:\\
\texttt{R> data(words)}\\
Using the \texttt{rankclust()} function, clusterings with respectively $1$ to $5$ clusters are estimated:\\
\texttt{R> res=rankclust(words\$data,m=words\$m,K=1:5,Qsem=1000,Bsem=100,Ql=500,Bl=50,\\maxTry=20,run=5)}\\
The number of SEM-Gibbs iterations (\texttt{Qsem}) has been set to 1000, with a burning phase of 100 iterations (\texttt{Bsem}). For the likelihood approximation, the numbers of iterations (\texttt{Ql} and \texttt{Bl}) have been divided by two. The option \texttt{maxTry=20} allows the estimation algorithm to be restarted up to 20 times if one cluster becomes empty (frequent for $K=5$). Finally, the SEM-Gibbs algorithm is initialized 5 times (\texttt{run=5}), and the best solution (according to the approximated likelihood) is retained. The computing time on a laptop with a 2.80GHz CPU is about 3 minutes (7 seconds per run and per number of clusters). The reader who wants to test the package more quickly can reduce the number of runs.

The values of the BIC criterion, obtained by \texttt{res[``bic'']} and plotted on Figure \ref{fig:Words-bic}, tend to select three clusters. The parameter estimates for $K=3$ are given below for the proportions $p_k$, the probabilities $\pi_k$ and the modes $\mu_k$:\\
% \begin{figure}[!h]
\texttt{> res[3]@proportion\\
$[1]$ 0.3061224 0.4918367 0.2020408\\
> res[3]@pi\\
dim 1\\
cl 1 0.9060649\\
cl 2 0.9416822\\
cl 3 0.8642753\\
> res[3]@mu\\
dim 1 \\
cl 1 2 5 3 4 1\\
cl 2 2 5 4 3 1\\
cl 3 5 2 4 3 1\\
}
The word Thought is the most associated with Idea in all clusters. The rankings of the four other words suggest an interesting interpretation of the clusters. Indeed, the first cluster, composed of about $30\%$ of the students, is characterized by the following modal ranking: Play, Attention, Theory, Dream, Thought. Students of this cluster are probably literary-minded students, ranking the word Dream just after Thought. Students of the second cluster (about half of the students) are probably more scientific, since they rank the word Theory just after Thought, and so before the word Dream: Play, Attention, Dream, Theory, Thought. This cluster is also the most homogeneous, with a high scale parameter value (low dispersion): $\pi_2\simeq 0.94$. Finally, the last cluster is characterized by the following mode: Attention, Play, Dream, Theory, Thought. The only difference in the modal ranking with the scientific students is the preference for Play over Attention. This cluster, which is the smallest ($20\%$ of the students), can be qualified as an intermediate cluster, probably composed of students neither markedly scientific nor markedly literary-minded, as evidenced by the smallest of the three scale parameter values ($\pi_3\simeq 0.86$).
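The BIC curve of Figure \ref{fig:Words-bic} can be reproduced along the following lines (a sketch; the exact graphical settings of the original figure may differ):\\
\texttt{R> plot(1:5,res["bic"],type="b",xlab="Number of clusters",ylab="BIC")}\\
The selected number of clusters corresponds to the minimum of the curve.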
\subsection[Big4]{The \texttt{big4} dataset}

In the last two decades, English football has been dominated by four clubs, forming the ``Big Four'': Manchester United, Liverpool, Arsenal and Chelsea. In this application, we analyse the rankings of these four teams in the English championship (Premier League) and their rankings according to the UEFA coefficients. This coefficient is a European statistic on football teams, based on the results of European football competitions and used for ranking and seeding teams in international competitions.

The \texttt{big4} dataset is composed of the Premier League rankings and UEFA rankings from 1993 to 2013, in ordering notation (see Table \ref{tab:big4} in Appendix \ref{sec:big4}, in which club ``1'' is Manchester United, ``2'' is Liverpool, ``3'' is Arsenal and ``4'' is Chelsea). In 2001 Arsenal and Chelsea had the same UEFA coefficient and are thus tied in the first ranking dimension. With \textit{Rankcluster}, one way to take such ties into account is to consider the corresponding ranking positions as missing: the UEFA ranking for 2001 then becomes $(1,0,0,2)$, which means that Manchester United is first, Liverpool is last, and the two intermediate positions are occupied by Arsenal and Chelsea in an unknown order.
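In practice, such a missing position is simply encoded by a \texttt{0} in the data matrix, for instance (a sketch; once the dataset is loaded as below, the second line lists the observations containing missing positions, using only base {R} functions):\\
\texttt{R> uefa2001=c(1,0,0,2) \# 0 = unknown position (Arsenal/Chelsea tie)}\\
\texttt{R> which(apply(big4\$data==0,1,any)) \# rows with at least one missing position}\\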
First, the \texttt{big4} dataset is loaded:\\
\texttt{R> data(big4)}

Clusterings with $1$ to $3$ clusters are estimated with the \texttt{rankclust()} function in about 25 seconds on a laptop with a 2.80GHz CPU (less than 2 seconds per run and per number of clusters):\\
\texttt{R> res=rankclust(big4\$data,m=big4\$m,K=1:3,Qsem=1000,Bsem=100,Ql=500,Bl=50,\\maxTry=20,run=5)}
% For more than $3$ groups, the clustering tends to empty the additional groups.
The values of the BIC criterion are plotted on Figure \ref{fig:big4-bic} and tend to select two groups.

\begin{figure}[h!]
\begin{center}
\includegraphics[width=8cm]{images/big4-bic2}
\caption{\label{fig:big4-bic}Values of the BIC criterion with mixture of \textsc{isr} for the \texttt{big4} dataset.}
\end{center}
\end{figure}

The printed outputs for $K=2$ are given below: the value of the log-likelihood (\texttt{ll}), the values of the BIC and ICL criteria, the estimations of the proportions $p_k$, the dispersion parameters $\pi_k^j$ and the location parameters $\mu_k^j$, the estimated partition and finally the conditional probabilities of the observations to belong to each cluster ($t_{ik}$).
\newpage
\noindent\texttt{R> res[2]\\
******************************************************************\\
Number of clusters: 2\\
******************************************************************\\
ll= -108.5782\\
bic = 244.5571\\
icl = 253.5601\\
proportion: 0.3854034 0.6145966\\
mu:\\
dim1 dim2 \\
cl1 1 3 2 4 1 3 2 4\\
cl2 1 4 2 3 1 4 3 2\\
pi:\\
dim1 dim2\\
cl1 0.9698771 0.7759781\\
cl2 0.6945405 0.7707101\\
partition:\\
$[1]$ 2 1 1 1 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2\\
tik:\\
\indent\indent\indent\indent \hspace{0.22cm} $[,1]$ \indent\indent\indent $[,2]$\\
$[1,]$ 6.280426e-01 0.371957427\\
$[2,]$ 9.933149e-01 0.006685094\\
$[3,]$ 9.565052e-01 0.043494814\\
$[4,]$ 9.923445e-01 0.007655450\\
$[5,]$ 6.609592e-01 0.339040841\\
$[6,]$ 9.916482e-01 0.008351815\\
$[7,]$ 1.215925e-03 0.998784075\\
$[8,]$ 3.500066e-01 0.649993363\\
$[9,]$ 3.441535e-02 0.965584647\\
$[10,]$ 4.063712e-01 0.593628767\\
$[11,]$ 9.589745e-01 0.041025518\\
$[12,]$ 9.751644e-01 0.024835551\\
$[13,]$ 6.806917e-02 0.931930833\\
$[14,]$ 3.175113e-03 0.996824887\\
$[15,]$ 1.263591e-03 0.998736409\\
$[16,]$ 7.713598e-06 0.999992286\\
$[17,]$ 9.115424e-04 0.999088458\\
$[18,]$ 1.734860e-01 0.826513968\\
$[19,]$ 1.605939e-04 0.999839406\\
$[20,]$ 7.425989e-02 0.925740107\\
$[21,]$ 2.171738e-02 0.978282624\\
******************************************************************\\
}
% analyse des resultats
The estimated clustering exhibits two groups, the second one being larger than the first one ($p_1\simeq 0.39$ and $p_2\simeq 0.61$). The values of the location parameters for each cluster and dimension lead to two interesting remarks. First, the rankings in the two dimensions are very similar within each cluster: they are exactly the same in cluster 1, and differ by just one transposition of the last two positions in cluster 2. This means that the performance of the clubs in the Premier League is highly correlated with their UEFA rankings, which are related to the results of the clubs in the European competitions over the previous five seasons. This first comment shows a certain inertia in the performance of the clubs. Secondly, the distinction between the two clusters is essentially due to the position of club ``4'', Chelsea: indeed, in the first cluster Chelsea is last in both rankings, whereas it is in second position in the second cluster. Moreover, looking at the partition, we find that cluster 2 is mainly present in the second half of the period 1993--2013 (see for instance the conditional probabilities of cluster membership on Figure \ref{fig:big4-partition}). This rise of Chelsea's results can be explained by the acquisition of the club by a Russian investor (Abramovich) in 2003, who brought great players into the club.

\begin{figure}[h!]
\begin{center}
\includegraphics[width=12cm]{images/big4-partition2}
\caption{\label{fig:big4-partition}Conditional probabilities for each observation (year) to belong to cluster 1 (black circle) and cluster 2 (red filled circle).}
\end{center}
\end{figure}

In addition to this information, the \texttt{summary()} function gives an overview of the partition by printing the five ranks of highest probability and the five ranks of highest entropy for each cluster:\\
\texttt{R> summary(res)}\\
The ranks of highest probability are the best representatives of the cluster, whereas the ranks of highest entropy are those whose membership of the cluster is the least obvious.
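The underlying values can also be inspected directly (a sketch using the base {R} function \texttt{head()}):\\
\texttt{R> head(res[2]@entropy) \# entropy (column 1) and estimated cluster (column 2)}\\
\texttt{R> head(res[2]@probability) \# probability (column 1) and cluster (column 2)}\\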
Notice that the full list of cluster members with their probabilities and entropies is available through the slots \texttt{probability} and \texttt{entropy}. Table \ref{tab:big4-result3} gives an example of these outputs for cluster 2. The rankings of 2011 are the most representative of the cluster, and the five most representative ones correspond to rankings after 2003. Similarly, the four observations whose membership of cluster 2 is the most questionable correspond to observations before 2003. This information confirms the previous analysis indicating that cluster 2 is due to the advent of Chelsea in the first positions.

\begin{table}[!h]
\begin{center}
{\begin{tabular}{lccl}
\hline
year& UEFA & Prem. League & entropy \\\hline
2002 & (1,3,4,2) & (3,2,1,4) & 0.6755106 \\
1993 & (1,2,3,4) & (1,2,3,4) & 0.6599892 \\
2000 & (1,4,3,2) & (1,3,2,4) & 0.6474507 \\
1997 & (1,2,3,4) & (1,3,2,4) & 0.6403972 \\
2010 & (1,0,0,4) & (4,1,3,2) & 0.4613709\\\hline
year& UEFA & Prem. League & probability \\
2011& (1,4,2,3) & (1,4,3,2)& 2.367e-04 \\
2008& (4,2,3,1) & (1,4,3,2)& 6.862e-05 \\
2013& (4,1,3,2) & (1,4,3,2)& 3.529e-05 \\
2009& (4,2,1,3) & (1,2,4,3)& 3.097e-05 \\
2005& (1,2,3,4) & (4,3,1,2)& 2.151e-05\\\hline
\end{tabular}}
\caption{\label{tab:big4-result3}Rankings with the highest entropies and probabilities in the second cluster.}
\end{center}
\end{table}

The \texttt{summary()} function also prints the estimated full ranking for each partial ranking. For instance, in 2001 Arsenal and Chelsea had the same UEFA coefficient, and when our model is asked to differentiate these two teams, Arsenal is ranked before Chelsea, which is not surprising since we already remarked that the results of Chelsea were generally among the worst of the Big Four before 2003.

Finally, the variability of the estimation of the model parameters can be assessed by means of the distances between the final estimation and the current value at each step of the SEM-Gibbs algorithm (refer to \cite{Jac2012b} for the precise definitions of these distances). These distances are available in the slots \texttt{distanceProp}, \texttt{distancePi} and \texttt{distanceMu} of the output \texttt{res[2]}. The standard deviation of these distances can be used, for instance, as an indicator of estimation variability. For instance, the standard deviation of the Kendall distance between the final estimation of the location parameters and its current value at each step of the SEM-Gibbs algorithm is, for cluster 2, 0.52 for the UEFA coefficient rankings and 0.43 for the Premier League rankings. Similarly, the standard deviation of the dispersion parameter estimates is, for cluster 2, about 0.002 for both the UEFA coefficient and the Premier League rankings. Note that these variabilities are relatively small, due to the low overlap of the two clusters (the dispersion coefficients are quite high).

In a similar way, the slot \texttt{distanceZ} illustrates the convergence of the SEM-Gibbs algorithm by giving the Rand index between the final partition and the current partition at each SEM-Gibbs iteration (Figure \ref{fig:big4-partition-evol}).
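The variability indicators above can be computed along the following lines (a sketch; recall that \texttt{distanceMu} and \texttt{distancePi} are lists of \texttt{Qsem-Bsem} matrices of size $K\times p$, so the values for cluster 2 and dimension 1 are extracted with the base {R} function \texttt{sapply()}):\\
\texttt{R> sd(sapply(res[2]@distanceMu,function(d) d[2,1])) \# location, cluster 2, dim 1}\\
\texttt{R> sd(sapply(res[2]@distancePi,function(d) d[2,1])) \# dispersion, cluster 2, dim 1}\\
\texttt{R> plot(res[2]@distanceZ,type="l") \# Rand index along the SEM-Gibbs iterations}\\
\begin{figure}[h!]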
\begin{center} \includegraphics[width=10cm]{images/big4-partition-evolution} \caption{\label{fig:big4-partition-evol}Evolution of the partition along with the SEM-Gibbs iterations: values of the Rand index between current and final partitions.} \end{center} \end{figure} % \subsection[Eurovision]{The Eurovision dataset} % The Eurovision Song Contest is an annual competition held among active member countries of the European Broadcasting Union. Each member country submits a song to be performed on live television and then casts votes for the other countries' songs to determine the most popular song in the competition. The vote consists in ranking ten preferred song in order of preference. We consider in these experiments the votes of the $n=34$ countries who participate to the competitions from 2007 to 2012. During these six years, only 8 countries have participated to the six finals of the competition: 1: France, 2: Germany, 3: Greece, 4: Romania, 5: Russia, 6: Spain, 7: Ukraine and 8: United Kingdom. The studied dataset is then composed of multivariate rankings ($p=6$ corresponding to the six contests between 2007 and 2012), each rank being of size $m=8$ (only the votes for the 8 countries which participated to the six finals are considered) and all rankings being partial. The absence of full ranking signifies that none % country % participating to the votes has ranked all of the 8 previously cited countries in its 10 preferences. % % % \subsection[The analysis]{The analysis} % % \begin{figure}[h!] % \begin{center} % \includegraphics[width=8cm]{images/eurovision-bic} % \caption{\label{fig:eurovision-bic}Value of the BIC criterion with mixture of \textsc{isr} for the Eurovision dataset.} % \end{center} % \end{figure} % % The \texttt{eurovision} dataset is loaded by:\\ % \texttt{R> data(eurovision)}\\ % % This dataset is challenging since the number of observations ($n=34$) is small compared to the size of the ranks ($m=8$ and $p=6$) and the presence of partial rankings (precisely, $57.7\%$ of the rankings elements are missing). For this reason, large iterations numbers (\texttt{Qsem=Ql=1000}) have been chosen to estimate a clustering with respectively $1$ to $9$ clusters:\\ % \texttt{R> res=rankclust(eurovision\$data,m=eurovision\$m,K=1:9,Qsem=1000,Bsem=100,\\Ql=1000,Bl=100,maxTry=30,run=20) \#caution: can be time consuming (see below)}\\ % % With these large iterations numbers, \texttt{rankclust()} takes about 4 hours per number of clusters and per run (laptop 2.80GHz CPU). Most of this computing time is due to the likelihood approximation (detail of the computing time can be obtained by setting the input option \texttt{detail} to \texttt{TRUE}). The reason is the high proportion of missing elements, which leads to a large number of different multivariate modal rankings $\mu^1_k,\ldots,\mu^{p}_k$ simulated during the SEM-Gibbs algorithm and then to a large number of likelihood approximations (let recall that the retained parameters at the end of the estimation algorithm are those leading to the highest approximated likelihood). % % In the present work, large iterations numbers have intentionally be used in order to provide results with reduced variability, which can be interpreted with confidence. But the reader who want to test the package on this dataset can use less iteration numbers (dividing for instance by ten the iterations numbers leads to 1 minute of computing time per cluster and per run). 
In this case, the variability of parameter estimation will be larger, about two times larger for 100 SEM-Gibbs iterations rather than 1000 according to the variability indicator described at the end of this section (standard deviation of the distances between the final parameter estimation and the current value at each step of the SEM-Gibbs algorithm). % Moreover, the total computing time can be easily reduced by parallelizing the \texttt{rankclust()} executions for each run and each number of clusters, thanks to the \textit{doMC} package for {R}. For instance, the following code launch parallel executions of \texttt{rankclust()} for each number of clusters (\texttt{nbcore} being the number of available clusters): % % \texttt{R> library('doMC')\\ % R> registerDoMC(nbcore)\\ % R> ResFinal = foreach(k=1:9,.combine='c') \%dopar\% \{\\ % R> res=rankclust(eurovision\$data,m=eurovision\$m,K=k,Qsem=100,Bsem=10,Ql=100\\,Bl=10,maxTry=30,run=20)\\ % R> \} % } % % The values of the BIC criterion (obtained with \texttt{Qsem=Ql=1000}) are plotted on Figure \ref{fig:eurovision-bic}. They tend to select five groups. % % The printed outputs for $K=5$ are given below: value of the log-likelihood (ll), values of BIC and ICL criteria, estimation of the proportions $p_k$'s, the probabilities $\pi_k^j$'s, the modes $\mu_k^j$'s, the estimated partition and finally the conditional probability of the observations to belong to each cluster ($t_{ik}$). % % % \begin{figure}[!h] % % \texttt{\scriptsize{ % \texttt{R> res=rankclust(eurovision\$data,m=eurovision\$m,K=5,Qsem=1000,Bsem=100,Ql=1000,\\ % Bl=100,maxTry=30,run=20) \#caution: can be time consuming (see below)\\ % R> print(res)} % % % \\\begin{texttt} % ****************************************************************** % Number of clusters: 5 % ****************************************************************** % ll= -1400.816 % bic = 3027.319 % icl = 3027.321 % proportion: 0.3235294 0.1470588 0.2352941 0.1176471 0.1764706 % pi: % dim 1 dim 2 dim 3 dim 4 dim 5 dim 6 % cl 1 0.8373494 0.9195402 0.8596491 0.8588235 0.7891566 0.8400000 % cl 2 0.9090909 0.9493671 0.8064516 0.8493151 0.7916667 0.8428571 % cl 3 0.8771930 0.8728814 0.8800000 0.7920792 0.7983871 0.8914729 % cl 4 0.9740260 0.8095238 0.9634146 0.8955224 0.8181818 0.8545455 % cl 5 0.8543689 0.9047619 0.8333333 0.9263158 0.8200000 0.8854167 % mu: % dim 1 dim 2 dim 3 dim 4 % cl 1 3 5 7 4 2 6 1 8 3 5 7 6 4 2 8 1 3 8 1 4 5 2 6 7 3 2 6 4 1 8 5 7 % cl 2 5 7 3 2 4 1 6 8 5 7 3 1 4 2 6 8 8 4 5 1 7 3 6 2 4 2 5 8 7 6 1 3 % cl 3 7 3 5 2 4 6 1 8 3 5 7 1 6 4 8 2 1 8 2 3 5 6 7 4 3 4 2 1 6 8 7 5 % cl 4 7 5 4 6 3 1 2 8 5 7 3 1 4 8 6 2 5 1 8 6 7 2 4 3 2 5 7 6 4 1 3 8 % cl 5 7 5 8 4 2 3 6 1 7 5 3 4 8 1 2 6 8 1 4 3 2 5 7 6 2 4 1 8 3 6 7 5 % dim 5 dim 6 % cl 1 7 3 8 5 1 4 2 6 3 5 6 4 1 2 7 8 % cl 2 7 5 8 6 4 3 1 2 4 5 7 2 1 3 8 6 % cl 3 6 2 8 1 3 4 5 7 6 5 2 4 1 7 3 8 % cl 4 5 8 3 1 2 7 4 6 5 2 3 7 1 8 6 4 % cl 5 2 8 7 1 3 4 5 6 5 2 4 7 1 3 8 6 % partition: % 1 2 1 3 1 5 1 5 2 2 3 3 1 2 3 5 4 4 4 5 2 5 3 1 1 1 3 5 1 1 3 1 4 3 % tik: % 1 2 3 4 5 % 1 1.000000e+00 4.926687e-15 3.395961e-11 2.175299e-20 2.172081e-14 % 2 1.020925e-12 1.000000e+0 0 3.596271e-15 9.629061e-14 1.176418e-16 % 3 9.999993e-01 1.631053e-1 3 7.254845e-07 8.621185e-17 6.955543e-13 % 4 1.120337e-09 5.431846e-1 6 1.000000e+00 1.290113e-17 6.791088e-14 % 5 1.000000e+00 5.245435e-1 8 9.260641e-13 2.667931e-21 8.113361e-16 % 6 5.738516e-16 3.275921e-16 2.041697e-15 8.810364e-27 1.000000e+00 % 7 1.000000e+00 2.177675e-19 7.848022e-13 6.013325e-29 5.084076e-18 % 8 
4.028680e-13 2.044741e-16 4.920668e-10 2.912902e-22 1.000000e+00 % 9 1.547418e-09 1.000000e+00 2.268076e-11 1.711549e-10 4.627540e-12 % 10 2.535008e-12 1.000000e+00 1.524028e-17 5.826940e-17 1.425966e-14 % 11 3.249961e-09 7.008065e-12 1.000000e+00 2.493183e-17 2.333494e-10 % 12 3.438573e-12 2.480877e-20 1.000000e+00 6.300352e-18 8.826780e-15 % 13 1.000000e+00 1.241974e-15 6.370072e-09 5.750960e-25 1.884747e-17 % 14 6.535477e-15 1.000000e+00 5.803484e-13 1.043131e-12 9.389025e-15 % 15 2.237704e-09 2.923127e-13 1.000000e+00 1.137870e-12 4.878962e-13 % 16 1.303839e-16 8.068278e-12 3.667421e-14 1.131581e-18 1.000000e+00 % 17 1.142291e-15 7.192854e-11 2.523839e-13 1.000000e+00 1.012907e-19 % 18 9.853473e-23 9.429721e-17 8.455583e-18 1.000000e+00 1.746191e-21 % 19 1.320047e-15 2.978251e-16 1.266509e-14 1.000000e+00 7.754970e-11 % 20 8.550912e-12 2.466856e-13 4.301909e-10 5.708839e-23 1.000000e+00 % 21 5.288867e-13 1.000000e+00 1.220869e-14 1.212887e-15 1.316850e-14 % 22 7.037856e-14 1.914019e-18 2.234911e-14 4.996731e-23 1.000000e+00 % 23 3.835737e-03 1.161211e-03 9.949931e-01 5.409898e-11 9.921546e-06 % 24 1.000000e+00 3.673471e-22 3.072776e-11 1.504338e-24 1.937220e-16 % 25 1.000000e+00 2.579782e-13 2.734699e-10 2.205859e-15 2.174011e-09 % 26 1.000000e+00 4.005485e-17 2.967356e-13 3.898933e-20 7.700084e-15 % 27 1.807390e-02 3.457956e-07 5.457201e-01 9.152708e-10 4.362056e-01 % 28 4.262774e-15 3.084315e-13 2.330521e-13 1.932227e-20 1.000000e+00 % 29 9.999994e-01 1.060360e-09 5.685554e-07 2.121323e-16 5.110376e-08 % 30 8.980571e-01 2.670018e-11 1.019404e-01 1.149813e-15 2.475373e-06 % 31 1.112656e-13 2.914716e-19 1.000000e+00 2.719565e-17 6.941053e-17 % 32 1.000000e+00 1.530419e-11 1.757512e-12 3.221608e-13 3.433776e-11 % 33 5.265654e-15 1.111768e-14 1.023056e-17 1.000000e+00 2.158626e-20 % 34 2.900612e-14 1.798547e-20 1.000000e+00 1.473098e-26 6.098085e-16 % ****************************************************************** % \end{texttt} % % % % \begin{figure}[!h] % % \texttt{\scriptsize{ % % ******************************************************************\\ % % Number of clusters: 5\\ % % ******************************************************************\\ % % LL= -1400.816\\ % % bic = 3027.319\\ % % icl = 3027.321\\ % % proportion: 0.3235294 0.1470588 0.2352941 0.1176471 0.1764706\\ % % pi:\\ % % \begin{tabular}{ccccccc} % % &dim 1 & dim 2 & dim 3& dim 4 & dim 5 & dim 6\\ % % cl 1 &0.8373494 &0.9195402 &0.8596491 &0.8588235 &0.7891566 &0.8400000\\ % % cl 2 &0.9090909 &0.9493671 &0.8064516 &0.8493151 &0.7916667 &0.8428571\\ % % cl 3 &0.8771930 &0.8728814 &0.8800000 &0.7920792 &0.7983871 &0.8914729\\ % % cl 4 &0.9740260 &0.8095238 &0.9634146 &0.8955224 &0.8181818 &0.8545455\\ % % cl 5 &0.8543689 &0.9047619 &0.8333333 &0.9263158 &0.8200000 &0.8854167\\ % % \end{tabular} % % \\ % % mu:\\ % % \begin{tabular}{ccccccc} % % & dim 1 & dim 2 & dim 3 & dim 4 & dim 5 & dim 6\\ % % cl 1 & 3 5 7 4 2 6 1 8 & 3 5 7 6 4 2 8 1 & 3 8 1 4 5 2 6 7 & 3 2 6 4 1 8 5 7 & 7 3 8 5 1 4 2 6 & 3 5 6 4 1 2 7 8\\ % % cl 2 & 5 7 3 2 4 1 6 8 & 5 7 3 1 4 2 6 8 & 8 4 5 1 7 3 6 2 & 4 2 5 8 7 6 1 3 & 7 5 8 6 4 3 1 2 & 4 5 7 2 1 3 8 6\\ % % cl 3 & 7 3 5 2 4 6 1 8 & 3 5 7 1 6 4 8 2 & 1 8 2 3 5 6 7 4 & 3 4 2 1 6 8 7 5 & 6 2 8 1 3 4 5 7 & 6 5 2 4 1 7 3 8\\ % % cl 4 & 7 5 4 6 3 1 2 8 & 5 7 3 1 4 8 6 2 & 5 1 8 6 7 2 4 3 & 2 5 7 6 4 1 3 8 & 5 8 3 1 2 7 4 6 & 5 2 3 7 1 8 6 4\\ % % cl 5 & 7 5 8 4 2 3 6 1 & 7 5 3 4 8 1 2 6 & 8 1 4 3 2 5 7 6 & 2 4 1 8 3 6 7 5 & 2 8 7 1 3 4 5 6 & 5 2 4 7 1 3 8 6\\ % % \end{tabular} % % partition:\\ % % 1 
2 1 3 1 5 1 5 2 2 3 3 1 2 3 5 4 4 4 5 2 5 3 1 1 1 3 5 1 1 3 1 4 3\\ % % tik:\\ % % \begin{tabular}{cccccc} % % & 1 & 2 & 3& 4 & 5\\ % % 1& 1.000000e+00& 4.926687e-15& 3.395961e-11& 2.175299e-20& 2.172081e-14\\ % % 2& 1.020925e-12& 1.000000e+0&0 3.596271e-15& 9.629061e-14& 1.176418e-16\\ % % 3& 9.999993e-01& 1.631053e-1&3 7.254845e-07& 8.621185e-17& 6.955543e-13\\ % % 4& 1.120337e-09& 5.431846e-1&6 1.000000e+00& 1.290113e-17& 6.791088e-14\\ % % 5& 1.000000e+00& 5.245435e-1&8 9.260641e-13& 2.667931e-21& 8.113361e-16\\ % % 6& 5.738516e-16& 3.275921e-16& 2.041697e-15& 8.810364e-27& 1.000000e+00\\ % % 7& 1.000000e+00& 2.177675e-19& 7.848022e-13& 6.013325e-29& 5.084076e-18\\ % % 8& 4.028680e-13& 2.044741e-16& 4.920668e-10& 2.912902e-22& 1.000000e+00\\ % % 9& 1.547418e-09& 1.000000e+00& 2.268076e-11& 1.711549e-10& 4.627540e-12\\ % % 10& 2.535008e-12& 1.000000e+00& 1.524028e-17& 5.826940e-17& 1.425966e-14\\ % % 11& 3.249961e-09& 7.008065e-12& 1.000000e+00& 2.493183e-17& 2.333494e-10\\ % % 12& 3.438573e-12& 2.480877e-20& 1.000000e+00& 6.300352e-18& 8.826780e-15\\ % % 13& 1.000000e+00& 1.241974e-15& 6.370072e-09& 5.750960e-25& 1.884747e-17\\ % % 14& 6.535477e-15& 1.000000e+00& 5.803484e-13& 1.043131e-12& 9.389025e-15\\ % % 15& 2.237704e-09& 2.923127e-13& 1.000000e+00& 1.137870e-12& 4.878962e-13\\ % % 16& 1.303839e-16& 8.068278e-12& 3.667421e-14& 1.131581e-18& 1.000000e+00\\ % % 17& 1.142291e-15& 7.192854e-11& 2.523839e-13& 1.000000e+00& 1.012907e-19\\ % % 18& 9.853473e-23& 9.429721e-17& 8.455583e-18& 1.000000e+00& 1.746191e-21\\ % % 19& 1.320047e-15& 2.978251e-16& 1.266509e-14& 1.000000e+00& 7.754970e-11\\ % % 20& 8.550912e-12& 2.466856e-13& 4.301909e-10& 5.708839e-23& 1.000000e+00\\ % % 21& 5.288867e-13& 1.000000e+00& 1.220869e-14& 1.212887e-15& 1.316850e-14\\ % % 22& 7.037856e-14& 1.914019e-18& 2.234911e-14& 4.996731e-23& 1.000000e+00\\ % % 23& 3.835737e-03& 1.161211e-03& 9.949931e-01& 5.409898e-11& 9.921546e-06\hline\\ % % 24& 1.000000e+00& 3.673471e-22& 3.072776e-11& 1.504338e-24& 1.937220e-16\\ % % 25& 1.000000e+00& 2.579782e-13& 2.734699e-10& 2.205859e-15& 2.174011e-09\\ % % 26& 1.000000e+00& 4.005485e-17& 2.967356e-13& 3.898933e-20& 7.700084e-15\\ % % 27& 1.807390e-02& 3.457956e-07& 5.457201e-01& 9.152708e-10& 4.362056e-01\\ % % 28& 4.262774e-15& 3.084315e-13& 2.330521e-13& 1.932227e-20& 1.000000e+00\\ % % 29& 9.999994e-01& 1.060360e-09& 5.685554e-07& 2.121323e-16& 5.110376e-08\\ % % 30 &8.980571e-01& 2.670018e-11& 1.019404e-01& 1.149813e-15& 2.475373e-06\\ % % 31 &1.112656e-13& 2.914716e-19& 1.000000e+00& 2.719565e-17& 6.941053e-17\\ % % 32& 1.000000e+00& 1.530419e-11& 1.757512e-12& 3.221608e-13& 3.433776e-11\\ % % 33& 5.265654e-15& 1.111768e-14& 1.023056e-17& 1.000000e+00& 2.158626e-20\\ % % 34& 2.900612e-14& 1.798547e-20& 1.000000e+00& 1.473098e-26& 6.098085e-16\\ % % \end{tabular} % % \\******************************************************************\\ % % }} % % \caption{\label{fig:euro-result1}Printed outputs of \texttt{rankclust()} with $K=5$ clusters.} % % \end{figure} % % In addition to such information, the \texttt{summary()} function % \\ % \texttt{R> summary(res)}\\ % gives an overview of the partition by printing the five ranks of highest probability (output \texttt{res[5]@probability}) and the five ranks of highest entropy (output \texttt{res[5]@entropy}) for each cluster. % The ranks of highest probability are the best representatives of the cluster, whereas the ranks of highest entropy are those for which their belonging to the cluster are the less obvious. 
Notice that the full list of the cluster member with their probability and entropy are available through the slots \texttt{probability} and \texttt{entropy}. Table \ref{tab:euro-result3} gives an example of these outputs for cluster 1, which is composed of: Albania, Belgium, Bulgaria, Cyprus, Germany, Romania, Russia, Serbia, Sweden, Switzerland and Turkey. Cyprus is the best representative of the cluster, whereas the belonging of Switzerland to this cluster is relatively questionable (high entropy in comparison to others). % % \begin{table}[!h] % \begin{center} % {\begin{tabular}{lc|lc} % country & entropy & country & probability\\\hline % Switzerland & 6.587196e-01 & Cyprus & 8.290355e-14 \\ % Belgium & 2.196254e-05 &Germany &4.805708e-14 \\ % Sweden & 1.935311e-05 &Turkey & 4.594954e-15\\ % Germany & 2.531678e-07 &Belgium& 2.308477e-15\\ % Russia & 1.036828e-07 &Romania & 6.515448e-16 \\ % \end{tabular}} % \caption{\label{tab:euro-result3}Countries with the highest probabilities and entropy in the first cluster.} % \end{center} % \end{table} % % % \begin{table}[!h] % % \begin{center} % % \scriptsize{\begin{tabular}{lc|lc} % % country & entropy & country & probability\\\hline % % Slovenia & 1.529900e+00&United Kingdom & 7.610601e-13 \\ % % Portugal & 6.859227e-02&France & 6.360517e-14 \\ % % Finland & 1.447298e-07&Bosnia Herzegovina &1.731317e-15 \\ % % Iceland & 9.372669e-08&Iceland & 1.201603e-15 \\ % % Bosnia Herzegovina& 4.842444e-08&The Netherlands & 1.429658e-16 \\ % % \end{tabular}} % % \caption{\label{tab:euro-result3}Countries with the highest probabilities and entropy in the first cluster.} % % \end{center} % % \end{table} % % The \texttt{summary()} function prints also the estimated full ranking for each partial ranking. For instance, Table \ref{fig:euro-result2} gives the real votes of France to the Eurovision contest between 2007 and 2012 (top line of the table) and the estimated ones (bottom line). In the top line (real votes), a ``0'' in the ranking means that the country ranked by France at this position does not belong to the eight studied countries (1: France, 2: Germany, 3: Greece, 4: Romania, 5: Russia, 6: Spain, 7: Ukraine and 8: United Kingdom). This example suggests an interesting comment. Of course, countries can not vote for themselves. However, if France could vote for itself, we show in the estimated full rankings that France will often rank itself in a good position. Indeed, France appears almost always in the second place of the missing positions. % % \begin{table}[!h] % \begin{center} % \begin{tabular}{lcccccc} % &2007&2008&2009\\ % observed ranking & 4 6 7 3 5 0 0 0 & 6 4 3 5 0 0 0 0 & 8 2 0 0 0 0 0 0 \\ % estimated ranking & 4 6 7 3 5 2 1 8 & 6 4 3 5 7 1 8 2 & 8 2 3 1 5 6 7 4 \\ % &2010&2011&2012\\ % observed ranking & 3 2 7 0 0 0 0 0 & 6 2 8 0 0 0 0 0 & 2 6 5 4 0 0 0 0\\ % estimated ranking & 3 2 7 4 1 6 8 5 & 6 2 8 4 5 1 3 7 & 2 6 5 4 7 1 3 8\\ % \end{tabular} % \caption{\label{fig:euro-result2}Votes of France to the Eurovision contest between 2007 and 2012: the top line is the real ranking with missing position (0) and the bottom line is the estimated full ranking. (1: France, 2: Germany, 3: Greece, 4: Romania, 5: Russia, 6: Spain, 7: Ukraine and 8: United Kingdom)} % \end{center} % \end{table} % % A geographical representation of the estimated clustering is given by Figure \ref{fig:eurovision-partition}. 
A geographical representation of the estimated clustering is given in Figure \ref{fig:eurovision-partition}. As noticed in \cite{Jac2012b}, this clustering suggests a geographical interpretation: the countries of a same group tend to be geographically close. For instance, cluster 1 (black) is essentially composed of Eastern European countries, whereas cluster 3 (green) mainly contains Western European countries. The geographical proximity of cluster members may be due either to the cultural proximity of these countries or to geographical alliances, often suspected in this contest.

\begin{figure}[h!]
\begin{center}
\includegraphics[width=13cm]{images/carteArticle}
\caption{\label{fig:eurovision-partition}Clustering of the European countries according to their votes at the Eurovision contest between 2007 and 2012.}
\end{center}
\end{figure}

Finally, the variability of the model parameter estimates can be assessed by means of the distances between the final estimate and the current value at each step of the SEM-Gibbs algorithm. These distances are available in the slots \texttt{distanceProp}, \texttt{distancePi} and \texttt{distanceMu} of the output \texttt{res[5]}. The standard deviation of these distances can be used, for instance, as an indicator of estimation variability. Note also that a plot of these distances (Figure \ref{fig:eurovision-variability}) can be used to judge empirically whether the burn-in phase of the SEM-Gibbs algorithm is long enough. Indeed, Figure \ref{fig:eurovision-variability} shows that the estimation variability is larger during the first 50 iterations than during the following ones (recall that the first 100 iterations, corresponding to the burn-in phase, are not plotted). A burn-in phase of 150 iterations rather than 100 would probably have been a wiser choice.
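As a quick numerical summary of this variability, the standard deviations mentioned above can be computed directly (a minimal sketch, assuming \texttt{distanceProp} is an iterations-by-clusters matrix, as the plotting code of Appendix \ref{sec:Rcode} suggests):
\\
\texttt{R> apply(res[5]@distanceProp, 2, sd) \ \# one variability indicator per cluster}\\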
Similarly, the slot \texttt{distancePartition} illustrates the convergence of the SEM-Gibbs algorithm by giving the Rand index between the final partition and the current partition at each SEM-Gibbs iteration (Figure \ref{fig:eurovision-partition-evol}). Notice that the partition is relatively stable at the end of the iterations, even though the cluster memberships of the observations are simulated at each iteration of the SEM-Gibbs algorithm. This phenomenon is due to the high dimension of the data space, a consequence of the complexity of the dataset (high proportion of missing ranking positions, low number of observations, \textit{etc.}). Thus, even if the location and scale parameters move during the SEM-Gibbs iterations (Figure \ref{fig:eurovision-variability}), the clusters remain well separated and the partition stays stable.

\begin{figure}[h!]
\begin{center}
\includegraphics[width=16cm]{images/Eurovision-Variability}
\caption{\label{fig:eurovision-variability}Variability of the parameter estimates (proportions, and scale and location parameters for the first dimension) along the SEM-Gibbs iterations (burn-in phase excluded): cluster 1 in black, 2 in red, 3 in green, 4 in blue, 5 in cyan.}
\end{center}
\end{figure}

\begin{figure}[h!]
\begin{center}
\includegraphics[width=10cm]{images/Eurovision-Partition-evol}
\caption{\label{fig:eurovision-partition-evol}Evolution of the partition along the SEM-Gibbs iterations: values of the Rand index between the current and final partitions.}
\end{center}
\end{figure}

\section{Conclusion}
\textit{Rankcluster} is the first \textbf{R} package dedicated to ranking data, allowing both modelling and cluster analysis of multivariate, possibly partial, ranking data. Available on R-forge, the package is simple to use: its main function, \texttt{rankclust()}, has only one mandatory argument, the ranking dataset. By default a simple modelling is performed, whereas specifying the number of desired clusters leads to a cluster analysis, with selection of the number of clusters when several candidate numbers are given. The analysis of two real datasets presented in this paper illustrates the possibilities of the package and also constitutes a user guide for practitioners.

\newpage
\appendix
\section{R code for Figure \ref{fig:eurovision-variability}}
\label{sec:Rcode}
This appendix gives the R code used to plot the distances between the final estimate and the current value at each step of the SEM-Gibbs algorithm (Figure \ref{fig:eurovision-variability}); the object \texttt{res} is the one returned by \texttt{rankclust()} in Section \ref{sec:Examples}.
\\
\texttt{R> par(mfrow=c(2,2))}\\
\texttt{R> \# proportions: one column of distanceProp per cluster}\\
\texttt{R> plot(res[5]@distanceProp[,1], type='l', ylim=c(0,0.15),}\\
\texttt{\ \ \ \ ylab='Estimation variability', main='Proportion', xlab='SEM-Gibbs iterations')}\\
\texttt{R> for (j in 2:5) lines(res[5]@distanceProp[,j], col=j)}\\
\texttt{R> \# dispersion parameters pi, first dimension (one matrix per iteration)}\\
\texttt{R> distancePidim1 <- t(sapply(res[5]@distancePi, function(d) d[1,]))}\\
\texttt{R> plot(distancePidim1[,1], type='l', ylim=c(0,0.04),}\\
\texttt{\ \ \ \ ylab='Estimation variability', main='Probability pi', xlab='SEM-Gibbs iterations')}\\
\texttt{R> for (j in 2:5) lines(distancePidim1[,j], col=j)}\\
\texttt{R> \# modal rankings mu, first dimension}\\
\texttt{R> distanceMudim1 <- t(sapply(res[5]@distanceMu, function(d) d[1,]))}\\
\texttt{R> plot(distanceMudim1[,1], type='l', ylim=c(0,17),}\\
\texttt{\ \ \ \ ylab='Estimation variability', main='Modal ranking', xlab='SEM-Gibbs iterations')}\\
\texttt{R> for (j in 2:5) lines(distanceMudim1[,j], col=j)}\\
\texttt{R> \# partition}\\
\texttt{R> plot(1 - res[5]@distanceZ, ylab='Estimation variability',}\\
\texttt{\ \ \ \ main='Partition', xlab='SEM-Gibbs iterations')}\\

\newpage
\section[big4]{The \texttt{big4} dataset}
\label{sec:big4}
The \texttt{big4} dataset\footnote{Sources: Wikipedia website \texttt{http://en.wikipedia.org/wiki/Big\_Four\_(English\_football)} and UEFA website \texttt{http://fr.uefa.com/memberassociations/uefarankings/club/index.html}} (Table \ref{tab:big4}) is composed of the rankings (in ordering notation) of the ``Big Four'' of English football, Manchester United (denoted by 1), Liverpool (2), Arsenal (3) and Chelsea (4), in the English football championship (Premier League) and according to their UEFA coefficients (statistics based on European competitions), from 1993 to 2013.
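The dataset is shipped with the package. The following minimal sketch of its analysis assumes, as for the other datasets of the package, that \texttt{big4} is a list whose \texttt{data} component contains the rankings; the argument \texttt{m = c(4, 4)} encodes the two ranking dimensions of four objects each:
\\
\texttt{R> library(Rankcluster)}\\
\texttt{R> data(big4)}\\
\texttt{R> str(big4) \ \# inspect the components before calling rankclust()}\\
\texttt{R> res <- rankclust(big4\$data, m=c(4,4), K=1:3)}\\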
\begin{table}[h!]
\begin{center}
\begin{tabular}{ccc}
\hline
year & Premier League & UEFA coefficient\\\hline
1993&(1,2,3,4)&(1,2,3,4)\\
1994&(1,3,2,4)&(1,3,2,4)\\
1995&(1,3,2,4)&(1,2,4,3)\\
1996&(1,3,2,4)&(1,2,3,4)\\
1997&(1,2,3,4)&(1,3,2,4)\\
1998&(1,3,2,4)&(3,1,2,4)\\
1999&(1,4,2,3)&(1,3,4,2)\\
2000&(1,4,3,2)&(1,3,2,4)\\
2001&(1,0,0,2)&(1,4,2,3)\\
2002&(1,3,4,2)&(3,2,1,4)\\
2003&(1,3,2,4)&(1,3,4,2)\\
2004&(1,3,2,4)&(3,4,1,2)\\
2005&(1,2,3,4)&(4,3,1,2)\\
2006&(2,3,1,4)&(4,1,2,3)\\
2007&(2,3,4,1)&(1,4,2,3)\\
2008&(4,2,3,1)&(1,4,3,2)\\
2009&(4,2,1,3)&(1,2,4,3)\\
2010&(1,0,0,4)&(4,1,3,2)\\
2011&(1,4,2,3)&(1,4,3,2)\\
2012&(1,4,3,2)&(1,3,4,2)\\
2013&(4,1,3,2)&(1,4,3,2)\\\hline
\end{tabular}
\caption{\label{tab:big4}Rankings of the ``Big Four'' of English football, Manchester United (1), Liverpool (2), Arsenal (3) and Chelsea (4), according to the Premier League results and to their UEFA coefficients, from 1993 to 2013.}
\end{center}
\end{table}
%--------------------------------------------------------------------------------------------------------------------
\bibliographystyle{elsarticle-harv}
\bibliography{Rank-biblio}
%--------------------------------------------------------------------------------------------------------------------
\end{document}