---
title: "Introduction to Flexible Clustering of (Mixed-With-)Ordinal Data"
output:
  rmarkdown::html_vignette:
    fig_width: 8 
    fig_height: 6
    toc: true
    toc_depth: 2
bibliography: vignettes.bib
vignette: >
  %\VignetteIndexEntry{Introduction to Flexible Clustering of (Mixed-With-)Ordinal Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  out.width = "60%",
  comment = "#>"
)
```

## Package description and contents

Package **flexord** is an add on-package to packages **flexclust** and
**flexmix** that provide suites for partitioning and model-based
clustering with flexible method switching and comparison.

We provide additional distance and centroid calculation functions, and
additional model drivers for component distributions that are tailored
towards ordinal, or mixed-with-ordinal data. These new methods can
easily be plugged into the capabilities for clustering provided by
**flexclust** and **flexmix**.

By plugging them into the *flex-scheme*, they can be used for:

- one-off K-centroids and model-based clustering (via `flexclust::kcca` and `flexmix::flexmix`),
- repeated clustering runs with various cluster numbers `k` (via `flexclust::stepFlexclust` and `flexmix::stepFlexmix`),
- bootstrapping repeated clustering runs with various cluster numbers `k` for K-centroids clustering (via `flexclust::bootFlexclust`),
- applying the various methods for the resulting objects, such as `predict`, `plot`, `barchart`, ...

The new methods provided are:
```{r table, echo=FALSE, results='asis'}
knitr::kable(
  data.frame(
    clusType = c("Partitioning (K-centroids)", "", "", "", "", "", "", "Model-based", "", "", ""),
    Funtype = c("distance", "", "", "centroid", "", "", "wrapper", "driver", "", "", ""),
    Fun = c(
      "`distSimMatch`",
      "`distGDM2`",
      "`distGower`",
      "`centMode`",
      "`centMin`",
      "`centOptimNA`",
      "`kccaExtendedFamily`",
      "`FLXMCregnorm`",
      "`FLXMCregmultinom`",
      "`FLXMCregbinom`",
      "`FLXMCbetabinomial`"
    ),    
    Method = c(
      "Simple Matching Distance", 
      "GDM2 distance for ordinal data",
      "Gower's distance",
      "Mode as centroid",
      "Factor level with minimal distance as centroid",
      "Centroid calculation by general purpose optimizer",
      "Creates a `kccaFamily` object pre-configured for kModes-, kGDM2- or kGower clustering",
      "Regularized multivariate normal distribution",
      "Regularized multivariate multinomial distribution",
      "Regularized multivariate binomial distribution",
      "Regularized multivariate beta-binomial distribution"
    ),
    Scale = c(
      "nominal", 
      "ordinal",
      "mixed-with-ordinal",
      "nominal", 
      "nominal/ordinal",
      "numeric",
      "",
      "numeric", 
      "nominal",
      "ordinal",
      "ordinal"
    ),
    NAs = c(
      "not implemented", 
      "not implemented",
      "upweighing of present variables",
      "not implemented", 
      "not implemented",
      "complete-case analysis",
      "",
      "not implemented", 
      "not implemented",
      "not implemented", 
      "not implemented"
    ),
    Source = c(
      "@kaufman_finding_1990, p. 19",
      "@walesiak_finding_2010; @ernst_ordinal_2025",
      "@kaufman_finding_1990, p. 32-37", 
      "@weihs_klaR_2005; @leisch_toolbox_2006",
      "@ernst_ordinal_2025",
      "@leisch_toolbox_2006",
      "",
      "@fraley2007bayesian; @ernst_ordinal_2025",
      "@galindo2006avoiding; @ernst_ordinal_2025",
      "@ernst_ordinal_2025",
      "@kondofersky2008; @ernst_ordinal_2025"
    )
  ), 
  format = "html", 
  escape = FALSE, 
  col.names = c(
    "Clustering Type", "Function Type", "Function Name", 
    "Method", "Scale Assumptions", "NA Handling", "Source"
  )
)
```

## Example 1: Clustering purely nominal data

We load necessary packages and set a random seed for reproducibility.

```{r setup, message=FALSE}
library("flexord")
library("flexclust")
library("flexmix")
set.seed(1111)
```

As an example for purely nominal data, we will use the classic `Titanic` data set:
```{r nominal_1}
titanic_df <- data.frame(Titanic)
titanic_df <- titanic_df[rep(1:nrow(titanic_df), titanic_df$Freq), -5]
str(titanic_df)
```

### Partitioning approach
We can conduct K-centroids clustering with the kModes algorithm directly on the
data frame^[Internally, it will be converted to a `data.matrix`. However, as only
equality operations and frequency counts are used, this is of no consequence.]:

```{r nominal_p2}
kcca(titanic_df, k = 4, family = kccaExtendedFamily('kModes'))
```
Let us assume that for some reason we are unhappy with the mode as a centroid,
and rather want to use an optimized centroid value, by choosing the factor level
for which Simple Matching distance^[I.e., the mean disagreement count.] is minimal:
```{r nominal_p3}
kcca(titanic_df, k = 4,
     family = kccaFamily(dist = distSimMatch, 
                         cent = \(y) centMin(y, dist = distSimMatch,
                                             xrange = 'columnwise')))
```

This already showcases one of the advantages of package **flexclust**:
As the name suggests, we are quickly able to mix and match our
distance and centroid functions, and quickly create our own
K-centroids algorithms.

Furthermore, **flexclust** allows us to decrease the influence of
randomness via running the algorithm several times, and keeping only
the solution with the minimum within cluster distance. This can be
done for one specific number of clusters `k` or several values `k`:

```{r nominal_p4}
titanic_dm <- data.matrix(titanic_df)
stepFlexclust(titanic_dm, k = 2:4, nrep = 1, 
              family = kccaExtendedFamily('kModes')) 
```

The output above shows the solutions with lowest within cluster
distance out of 1 run for 2 to 4 clusters, in comparison to 1 big
cluster. The number of runs was reduced to 1 to reduce run time. The
results show that none of the algorithms converged. Presumably this
is due to observations which have the same distance to two centroids
and which are randomly assigned to one of the two centroids, implying
that the partitions are still changing in each iteration, even if the
centroids do not change.

Selecting a suitable number of clusters based on the output of
`stepFlexclust` might be still difficult. This is where
`bootFlexclust` comes in. In `bootFlexclust`, `nboot` bootstrap
samples of the original data are drawn, on which `stepFlexclust` is
performed for each `k`. This results in `k`$\times$`nboot` best out of
`nrep` clustering solutions obtained for each bootstrap data
set. Based on these solutions cluster memberships are predicted for
the original data set, and the stability of these partitions is tested
via the Adjusted Rand Index [@hubert_arabie_1985]:

```{r nominal_p5}
(nom <- bootFlexclust(titanic_dm, k = 2:4, nrep = 1, nboot = 5, 
                      family = kccaExtendedFamily('kModes')))
```
Note that ridiculously few repetitions are used for the sake of having
a short run time. Clearly `nboot` should be increased in applications. 
Also `nrep` has been set to 1 to reduce run time. 

The resulting ARIs can be quickly visualized via a predefined plotting method:
```{r nominal_p6}
plot(nom)
```

This plot indicates that out of the 2 to 4 cluster solutions, no
solution has a clearly better performance with respect ot he median
ARI in case only 5 bootstrap pairs are considered. Clearly one should
investigate this for more pairs and also after increasing `nrep`.

After deciding on a suitable number of clusters, we could select the
corresponding cluster solution from `kcca` or `stepFlexclust`, and
make use of the further visualization, prediction, and other
tools. For this, we refer to the documentation available in
@leisch_toolbox_2006 and @dolnicar_market_2018.

### Model-based approach

We also offer an algorithm specifically designed for model-based
clustering of unordered categorical data via a regularized multinomial
distribution. The multinomial driver also supports varying number of
categories between variables. We call **flexmix** using the model
driver `FLXMCregmultinom()` where we specify via the argument `r` the
number of categories for each variable:

```{r nominal_m2}
titanic_ncats <- apply(titanic_dm, 2, max)
flexmix(formula = titanic_dm ~ 1, k = 3,
        model = FLXMCregmultinom(r = titanic_ncats)) 
```

As we are estimating many category probabilities across multiple
clusters, some of those may become numerically zero, resulting in a
degenerate distribution. To avoid this we may use the regularization
parameter $\alpha$, which acts if we added $\alpha$ observations
according to the population mean to each component:

```{r nominal_m3}
flexmix(titanic_dm ~ 1, k = 3,
        model = FLXMCregmultinom(r = titanic_ncats, alpha = 1))
```

**flexmix** also provides function `stepFlexmix()`, where the EM
algorithm for each `k` is restarted `nrep` times, and only the maximum
likelihood solution is retained:
```{r nominal_m4}
(nom <- stepFlexmix(titanic_dm ~ 1, k = 2:4,
                    nrep = 1, # please increase for real-life use
                    model = FLXMCregmultinom(r = titanic_ncats)))
```

The output provides an overview on the best solutions out of three EM
runs for 3 different values of `k`. For each solution, it is indicated
how many iterations of the EM algorithm were performed (`iter`), if
the EM algorithm converged (`converged`), the number of components in
the final solution (`k`) and the number of components the EM algorithm
was initialized with (`k0`) as well as the maximum log-likelihood
(`logLik`) and results for different model selection criteria, namely,
AIC, BIC and ICL.

Similar to package **flexclust** in the partitioning case, package
**flexmix** also offers various plotting methods for the returned
objects. We just showcase here one:

```{r nominal_m5}
plot(nom)
```

For more information on the further methods and utilities offered,
check out the documentation for **flexmix**
(see for example `browseVignettes('flexmix')`).

## Example 2: Clustering purely ordinal data

Our next example data set is from a survey conducted among 563
Australians in 2015 where they indicated on a scale from 1-5 how
inclined they are to take 6 types of risks.  It consists of purely
ordinal variables without missing values, and the response level
length is the same for all variables. We load the data set and inspect
it:

```{r ordinal_1}
data("risk", package = "flexord")
str(risk)
colnames(risk)
```

### Partitioning approach

In our package, we offer two partitioning methods designed for ordinal
data based on either Gower's distance or GDM2 distance. 

Applying Gower's distance as implemented in `distGower` to purely
ordinal data corresponds to using Manhattan distance (as provided also
in `flexclust::distManhattan`) with previous scaling as described by
@kaufman_finding_1990 and Gower's upweighing of non-missing
values. The clustering can be performed using:

```{r ordinal_p2}
kcca(risk, k = 3, family = kccaExtendedFamily('kGower'))
```

The default centroid for this family is the general purpose optimizer
`centOptimNA`, which is the general purpose optimizer
`flexclust::centOptim`, just with NA removal. In our case of purely
ordinal data with no missing values, we could also choose the median
as a centroid:

```{r ordinal_p3}
kcca(risk, k = 3,
     family = kccaExtendedFamily('kGower', cent = centMedian))
```

This results in kMedians with previous scaling, and non-missing value
upweighing. In our `risk` example with no NAs and equal response level
lengths for all variables, `flexclust::kccaFamily('kmedians')` would
suffice, but there are still many data situations where the `"kGower"`
approach will be preferable.

As a second alternative designed specifically for ordinal data without
missing values, we implement the GDM2 distance for ordinal data
suggested by @walesiak_finding_2010, which conducts only relational
operations on ordinal variables.  We reformulated this distance for
use in K-centroids clustering in @ernst_ordinal_2025, and implemented
it in the package such that one can use it for clustering via:

```{r ordinal_p4}
kcca(risk, k = 3, family = kccaExtendedFamily('kGDM2'))
```

Similar to `"kGower"`, a default general optimizer centroid is applied,
which we could replace as desired.

Another parameter used in both `"kGower"` and `"kGDM2"` is
`xrange`. Both algorithms require information on the range of the
variables of the data object for data pre-processing: `"kGower"` uses
this for scaling, while `"kGDM2"` for transforming the data to
empirical distributions. The range calculation can be influenced in
the following ways: We can use the range of the whole `x` (argument
`all`, the default for `"kGDM2"`), columnwise ranges (`xrange =
"columnwise"`), a vector specifying the range across all variables in
the data set, or a list of length `ncol(x)` with range vectors for
each column.  Let us assume that the highest possible response to the
`risk` questions was `Extremely often (6)`, but it was never chosen by
any of the respondents. We can take the new assumed full range of the
data into account:

```{r ordinal_p5}
kcca(risk, k = 3,
     family = kccaExtendedFamily('kGDM2', xrange = c(1, 6)))
```

Again, the distances, centroids, and wrapper alternatives presented
can be used also in the further capabilities of **flexclust**.

### Model-based approach

We also offer model drivers for two component distributions suitable
for ordinal data, which are the binomial distribution and its
extension the beta-binomial distribution:
```{r ordinal_m2}
risk1 <- risk - 1
flexmix(risk1 ~ 1, k = 3, model = FLXMCregbinom(size = 4))
flexmix(risk1 ~ 1, k = 3, model = FLXMCregbetabinom(size = 4, alpha = 1))
```

In both cases we specify the number of trials of the binomial distribution
(`size`). For both distributions we can also use a regularization parameter
`alpha` that shrinkgs the component estimates towards the population mean. While
this incurs small distortions it can be helpful to avoid
boundary estimates.

The beta-binomial distribution is parameterized by two parameters `a` and
`b` and is therefore more flexible than the binomial. It may potentially
perform better in more difficult clustering scenarios even if we assume
the original data was drawn from a binomial mixture [@ernst_ordinal_2025].
Your mileage may vary.

We can further use the capabilities of `stepFlexmix` and the
corresponding plot functions. 

### Treating the data as purely nominal

Treating ordered categorical data as unordered is a frequent approach. In fact,
in our simulation study it was a quite competitive approach for model-based
methods. However, applying `kmodes` to ordered data brought subpar results in
the partitioning ambit [@ernst_ordinal_2025].

### Treating the data as equidistant (=integer)

Also treating ordered categorical data as integer values is at least
as common as nominalization. In fact, some of the methods presented
above, such as `"kGower"` - as used above on purely ordinal data without
missing values - make only lax concessions towards
ordinality. Depending on data characteristics and specific method
applied, this approach may also be a very good choice
[@ernst_ordinal_2025].

We do not offer any new methods for this in the partitioning ambit, as
already many options are available in **flexclust**. In the
model-based ambit we offer additional capabilities via `FLXMCregnorm`,
which, as mentioned, is a driver for clustering with multivariate
normal distributions (assuming conditional independence) while
allowing for regularization (as in the case for `FLXMCregmultinom` to
help avoid degenerate solutions). One can proceed as follows to use
the data-driven default regularization parameters as proposed by
@fraley2007bayesian:

```{r numerical}
params <- FLXMCregnorm_defaults(risk, kappa_p = 0.1, k = 3)
flexmix(risk ~ 1, k = 3, model = FLXMCregnorm(params = params))
```

For details on the parameters used for regularization, see
@fraley2007bayesian or package **mclust**. Here, we would only like to
point out that the shrinkage parameter `kappa_p` (the suffix `_p`
stands for prior), acts as if we added `kappa_p` observations
according to the population mean to each component. Using
`FLXMCregnorm_defaults` with the data and the number of components
determines the scale parameter `zeta_p` by dividing the empirical
variance by the square of the number of components. Thus we need to
pass the data and `k = 3`. Alternatively, we could also specify a
value for `zeta_p` and then omit the parameter `k`. Note that we
cannot set both parameters at the same time and therefore, `zeta_p`
takes precedence if both are given.

Again, the model can be plugged into all of the further tools offered
by **flexmix**.

## Example 3: Clustering mixed-type data with missing values
```{r mixed_1}
data("vacmot", package = "flexclust")
vacmot2 <- cbind(vacmotdesc,
                 apply(vacmot, 2, as.logical))
vacmot2 <- vacmot2[, c('Gender', 'Age', 'Income2', 'Relationship.Status', 'Vacation.Behaviour',
                       sample(colnames(vacmot), 3, replace = FALSE))]
vacmot2$Income2 <- as.ordered(vacmot2$Income2) 
str(vacmot2)
colMeans(is.na(vacmot2))*100 
```

For our last example, we use a data set which is obtained by merging
two data sets shared in package **flexclust**.  **flexclust** provides
object `vacmot` consisting of a $1000 \times 20$ matrix of binary
responses to questions on travel motives asked Australian tourists in
2006, plus a separate data frame `vacmotdesc` with 12 demographic
variables for each respondent.

This data set has been thoroughly explored using clustering methods in
the field of market segmentation research, see for example
@dolnicar_investigation_2008. We now use it as a data example for a
mixed-data case with a moderate amount of missingness.  For this, we
select one symmetric binary variable (Gender, which was collected as
Male/Female in 2006), two numeric variables (Age and Vacation
Behaviour^[Mean environmental friendly behavior score, ranging from 1
to 5.]), one unordered categorical variable (Relationship Status), one
ordered categorical variable (Income2, which is a recoding of
`Income`), and three randomly selected asymmetric binary variables (3 of
the 20 questions on whether a specific travel motive applies to a
respondent). Missing values are present, but the percentage is
low^[This is by choice. While Gower's distance is designed to handle
missingness via variable weighting, and the general optimizer used
here is written to omit NAs, both methods will degenerate with high
percentages of missing values. While we have not yet determined the
critical limit, we have successfully run the algorithm on purely
ordinal data with MCAR missingness percentages of up to 30%. However,
common sense dictates that solutions obtained for such high
missingness percentages need to be treated with caution.].

### Partitioning approach

Currently, we only offer one method for mixed-type data with missing
values, which is `"kGower"` (scaling and distances as proposed by
@gower_1971 and @kaufman_finding_1990, and a general purpose optimizer
centroid as provided in **flexclust**, but with NA omission): 

```{r mixed_2} 
kcca(vacmot2, k = 3, family = kccaExtendedFamily('kGower'),
     control = list(iter.max = 5))
```
We set the maximum number of iterations to 5 to reduce run time.

In our example above, the default methods for each variable type are
used (Simple Matching Distance for the categorical variables, squared
Euclidean distance for the numerical/integer variables, Manhattan
distance for ordinal variables, and Jaccard distance for logical
variables). 

We could instead provide a vector of length `ncol(vacmot2)` where each
distance measure to be used is specified. Let us assume that we have
many outliers in the variable `Age`, that we woud like to consider
`Vacation.Behaviour` an ordered factor as well in addition to
`Income2`, and that the three binary responses to vacation motives
should be treated symmetric instead of asymmetric^[Meaning that 2
disagreements are just as important as 2 agreements.], and for this
reason want to evaluate the first three with Manhattan distance, and
the latter three with Euclidean distance^[We could achieve symmetric
treatment also via Simple Matching Distance.]:

```{r mixed_3}
colnames(vacmot2)
xmthds <- c('distSimMatch', rep('distManhattan', 3),
            'distSimMatch', rep('distEuclidean', 3))
kcca(vacmot2, k = 3,
     family = kccaExtendedFamily('kGower', xmethods = xmthds),
     control = list(iter.max = 5))
```
Again the maximum number of iterations was reduced. 

For `"kGower"`, all numeric/integer and ordered variables are scaled as
proposed by @kaufman_finding_1990, by shifting by the minimum
and dividing by the range. This means that also for `"kGower"`, the
range of the variables will influence the clustering results. Same as
for `"kGDM2"` (**Example 2**), we can specify the range to be used in
parameter `xrange`. In the case of `"kGower"`, the default value is
`"columnwise"`, where the range for each column is calculated
separately.

Again, the distance, centroid and wrapper functions can be used in the
further tools provided by **flexclust**, for examples on that see
**Example 1**.

## References