---
title: "Clustering through the epigraph and hypograph indices"
output:
  rmarkdown::html_vignette:
    toc: yes
vignette: >
  %\VignetteIndexEntry{Clustering through the epigraph and hypograph indices}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
editor_options: 
  markdown: 
    wrap: 78
---




```r
set.seed(42)

library(ehymet)
```

# The EHyClus function

One of the main utilities of the epigraph and the hypograph index is to
measure the "extremality" of a curve with respect to a set of curves and to
provide an ordering of the curves from top to bottom or vice versa.

EHyClus is a methodology for clustering functional data which is based on
four steps:

1.  Smooth the functional dataset.
2.  Apply the indices to data, first and second derivatives.
3.  Apply classical multivariate clustering technique.
4.  Obtain a clustering partition in $k$ groups ($k$ fixed in advance).

In `ehymet`, the function that allows us to perform this process is `EHyClus`.
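
Before turning to `EHyClus` itself, here is a conceptual sketch of steps 2
to 4 on toy data. It assumes that the `MEI()` and `MHI()` index functions
described in the first vignette accept an $n \times p$ matrix of curves and
return one index value per curve; smoothing and the derivative-based indices
are omitted for brevity.

```r
# A conceptual sketch of steps 2-4; see EHyClus below for the full pipeline.
t_grid <- seq(0, 1, length.out = 30)
toy <- rbind(
  t(replicate(25, sin(2 * pi * t_grid) + rnorm(30, sd = 0.1))),    # group 1
  t(replicate(25, sin(2 * pi * t_grid) + 1 + rnorm(30, sd = 0.1))) # group 2
)
ind <- cbind(MEI = MEI(toy), MHI = MHI(toy)) # step 2: indices per curve
cl <- kmeans(ind, centers = 2)               # step 3: multivariate clustering
table(cl$cluster, rep(1:2, each = 25))       # step 4: partition in k = 2 groups
```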

The `EHyClus` function is designed for clustering functional datasets. The
input data can be a one-dimensional dataset, an $n \times p$ matrix, or a
multidimensional one, an $n \times p \times q$ array. The function
transforms the initial dataset by first smoothing the data and then applying
the different indices analyzed in the first vignette, such as the epigraph,
the hypograph, and their modified versions, and then applies various
clustering algorithms to the transformed data. The one-dimensional or
multidimensional versions of the indices are applied depending on the shape
of the input data.
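
For example, a quick way to check which case applies is to look at the
dimensions of the input; here with the simulation function used later in
this vignette:

```r
# An n x p matrix triggers the one-dimensional indices; an n x p x q array
# triggers the multidimensional ones. sim_model_ex2() returns the latter.
dim(sim_model_ex2(n = 10, i_sim = 4))
```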

It supports multiple clustering methods: **hierarchical clustering**
("hierarch"), **k-means** ("kmeans"), **kernel k-means** ("kkmeans"), and
**spectral clustering** ("spc"). It also allows customizing the hierarchical
clustering methods, distance metrics, and kernels through parameters such as
`l_dist_hierarch`. An example of such a customized call follows.
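
The following sketch restricts the run to hierarchical clustering and
k-means, with the Manhattan distance for the hierarchical variants. It only
uses parameters named above, and the simulated data are purely for
illustration.

```r
# Sketch: limit the clustering methods and the hierarchical distance.
demo_curves <- sim_model_ex2(n = 20, i_sim = 4)
res_demo <- EHyClus(demo_curves,
  clustering_methods = c("hierarch", "kmeans"),
  l_dist_hierarch = "manhattan"
)
```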

For smoothing, it uses B-splines with a specified or automatically selected
number of basis functions.

To check the quality of the results, if true labels are provided, the
function can validate the clustering results and compute performance metrics
such as purity, F-measure, and the Rand Index (RI). It also records the time
taken by each clustering method, in case you want to assess the trade-off
between clustering quality and execution time.
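
For intuition, purity is the fraction of curves that fall in the majority
true class of their assigned cluster. A minimal hand-rolled sketch of that
metric (not the package's implementation):

```r
# Purity: for each cluster, count the size of its majority true class, then
# divide the sum of those counts by the total number of observations.
purity <- function(clusters, labels) {
  tab <- table(clusters, labels)
  sum(apply(tab, 1, max)) / length(clusters)
}
purity(c(1, 1, 2, 2), c("a", "b", "b", "b")) # 0.75
```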

## Parameters

All the parameters and their functionality can be found in the function
documentation, but arguably the most important ones are:

-   **curves**: The dataset containing the curves to be clustered.
-   **vars_combinations**: This parameter is optional. If provided, it can
    be either a list that determines the combinations of variables to use,
    or "auto", in which case a single combination of data and indices is
    selected trying to optimize the results. If not provided, a default list
    with generic combinations of variables is used.
-   **clustering_methods**: A vector specifying which clustering methods to
    apply.
-   **n_clusters**: Number of clusters to generate.
-   **k**: Number of basis functions for B-splines. Not to be confused with
    **n_clusters**.
-   **bs**: Type of penalized smoothing basis to use.
-   **true_labels**: If provided, evaluation metrics are returned along
    with the result.

## Example of usage and validation

In this subsection, we show how to obtain the results of the `EHyClus`
function and how to validate them (when `true_labels` are available). First,
we generate multidimensional data using `sim_model_ex2`:


```r
n <- 50
curves <- sim_model_ex2(n = n, i_sim = 4)
```

Since we are generating $n = 50$ curves per group and there are two groups,
the true labels are:


```r
true_labels <- c(rep(1, n), rep(2, n))
```

We can run the algorithm with or without the true labels. For the
combination of variables, we will use "d2dtaMEI" together with "d2dtaMHI".


```r
res <- EHyClus(curves, vars_combinations = list(c("d2dtaMEI", "d2dtaMHI")))
res_with_labels <- EHyClus(curves,
  vars_combinations = list(c("d2dtaMEI", "d2dtaMHI")),
  true_labels = true_labels
)
```

For the result without labels, it can be seen that we only have the generated
clusters:


```r
names(res)
#> [1] "cluster"
```

Whereas the one we generated with labels has both `cluster` and `metrics`:


```r
names(res_with_labels)
#> [1] "cluster" "metrics"
```

Let's explore both components of the latter. We're going to start with
`cluster`.


```
#>                                               levelName
#> 1  cluster                                             
#> 2   ¦--hierarch                                        
#> 3   ¦   ¦--hierarch_single_euclidean_d2dtaMEId2dtaMHI  
#> 4   ¦   ¦   ¦--valid                                   
#> 5   ¦   ¦   °--internal_metrics                        
#> 6   ¦   ¦--hierarch_complete_euclidean_d2dtaMEId2dtaMHI
#> 7   ¦   ¦   ¦--valid                                   
#> 8   ¦   ¦   °--internal_metrics                        
#> 9   ¦   ¦--hierarch_average_euclidean_d2dtaMEId2dtaMHI 
#> 10  ¦   ¦   ¦--valid                                   
#> 11  ¦   ¦   °--internal_metrics                        
#> 12  ¦   ¦--hierarch_centroid_euclidean_d2dtaMEId2dtaMHI
#> 13  ¦   ¦   ¦--valid                                   
#> 14  ¦   ¦   °--internal_metrics                        
#> 15  ¦   ¦--hierarch_ward.D2_euclidean_d2dtaMEId2dtaMHI 
#> 16  ¦   ¦   ¦--valid                                   
#> 17  ¦   ¦   °--internal_metrics                        
#> 18  ¦   ¦--hierarch_single_manhattan_d2dtaMEId2dtaMHI  
#> 19  ¦   ¦   ¦--valid                                   
#> 20  ¦   ¦   °--internal_metrics                        
#> 21  ¦   ¦--hierarch_complete_manhattan_d2dtaMEId2dtaMHI
#> 22  ¦   ¦   ¦--valid                                   
#> 23  ¦   ¦   °--internal_metrics                        
#> 24  ¦   ¦--hierarch_average_manhattan_d2dtaMEId2dtaMHI 
#> 25  ¦   ¦   ¦--valid                                   
#> 26  ¦   ¦   °--internal_metrics                        
#> 27  ¦   ¦--hierarch_centroid_manhattan_d2dtaMEId2dtaMHI
#> 28  ¦   ¦   ¦--valid                                   
#> 29  ¦   ¦   °--internal_metrics                        
#> 30  ¦   °--hierarch_ward.D2_manhattan_d2dtaMEId2dtaMHI 
#> 31  ¦       ¦--valid                                   
#> 32  ¦       °--internal_metrics                        
#> 33  ¦--kmeans                                          
#> 34  ¦   ¦--kmeans_euclidean_d2dtaMEId2dtaMHI           
#> 35  ¦   ¦   ¦--valid                                   
#> 36  ¦   ¦   °--internal_metrics                        
#> 37  ¦   °--kmeans_mahalanobis_d2dtaMEId2dtaMHI         
#> 38  ¦       ¦--valid                                   
#> 39  ¦       °--internal_metrics                        
#> 40  ¦--kkmeans                                         
#> 41  ¦   ¦--kkmeans_rbfdot_d2dtaMEId2dtaMHI             
#> 42  ¦   ¦   ¦--valid                                   
#> 43  ¦   ¦   °--internal_metrics                        
#> 44  ¦   °--kkmeans_polydot_d2dtaMEId2dtaMHI            
#> 45  ¦       ¦--valid                                   
#> 46  ¦       °--internal_metrics                        
#> 47  °--spc                                             
#> 48      ¦--spc_rbfdot_d2dtaMEId2dtaMHI                 
#> 49      ¦   ¦--valid                                   
#> 50      ¦   °--internal_metrics                        
#> 51      °--spc_polydot_d2dtaMEId2dtaMHI                
#> 52          ¦--valid                                   
#> 53          °--internal_metrics
```

The suffix of each name shows the combination of variables used, that is,
"d2dtaMEI" with "d2dtaMHI". We can also see the different parameters used.
For example, for k-means it is easy to see that it has been run with both
the Euclidean and the Mahalanobis distances.
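
These names can also be retrieved programmatically; for example, the
k-means branch of the result contains:

```r
# Configuration names for the k-means methods.
names(res_with_labels$cluster$kmeans)
```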

Looking at one of these elements in particular, we see that it contains the
following:


```r
str(res_with_labels$cluster$hierarch$hierarch_ward.D2_euclidean_d2dtaMEId2dtaMHI)
#> List of 4
#>  $ cluster         : int [1:100] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ valid           :List of 4
#>   ..$ Purity  : num 0.93
#>   ..$ Fmeasure: num 0.868
#>   ..$ RI      : num 0.869
#>   ..$ ARI     : num 0.737
#>  $ internal_metrics:List of 4
#>   ..$ davies_bouldin: num 1.01
#>   ..$ dunn          : num 0.06
#>   ..$ silhouette    : num 0.409
#>   ..$ infomax       : num 0.993
#>  $ time            : num 0.000373
```

First we have `cluster`, which is the vector that assigns each of the curves
to a cluster. Then we have `valid`, which contains the external validation
metrics, and `internal_metrics`, which contains internal cluster quality
measures. Finally, `time` is the time that this particular method has taken
to execute. Let's take a look at `valid`:


```r
head(res_with_labels$cluster$hierarch$hierarch_ward.D2_euclidean_d2dtaMEId2dtaMHI$valid)
#> $Purity
#> [1] 0.93
#> 
#> $Fmeasure
#> [1] 0.8678
#> 
#> $RI
#> [1] 0.8685
#> 
#> $ARI
#> [1] 0.737
```

It gives us four metrics: Purity, F-measure, the Rand Index (RI) and the
Adjusted Rand Index (ARI). However, we can obtain this information in
another way. Going back to the second element of `res_with_labels`:
`metrics` gives us a summary of all metrics:


```r
head(res_with_labels$metrics, 3)
#>                                             Purity Fmeasure     RI   ARI         Time
#> kmeans_euclidean_d2dtaMEId2dtaMHI             1.00   1.0000 1.0000 1.000 0.0032920837
#> kmeans_mahalanobis_d2dtaMEId2dtaMHI           1.00   1.0000 1.0000 1.000 0.0038609505
#> hierarch_ward.D2_euclidean_d2dtaMEId2dtaMHI   0.93   0.8678 0.8685 0.737 0.0003728867
```

It gives us the Purity, F-measure, RI, ARI and Time for every clustering
method with every combination of parameters that has been executed. We can
look up a specific method, for example
"hierarch_single_euclidean_d2dtaMEId2dtaMHI":


```r
res_with_labels$metrics["hierarch_single_euclidean_d2dtaMEId2dtaMHI", ]
#>                                            Purity Fmeasure     RI   ARI         Time
#> hierarch_single_euclidean_d2dtaMEId2dtaMHI   0.52   0.6535 0.4958 8e-04 0.0004339218
```

But now imagine that we executed `EHyClus` without the `true_labels`
parameter and we want to compute the metrics afterwards. For that purpose,
we can use the `clustering_validation` function, passing the labels
generated by the clustering method as the first argument and the true labels
as the second one. The result matches the row we just looked up:


```r
clustering_validation(res$cluster$hierarch$hierarch_single_euclidean_d2dtaMEId2dtaMHI$cluster, true_labels)
#> $Purity
#> [1] 0.52
#> 
#> $Fmeasure
#> [1] 0.6535
#> 
#> $RI
#> [1] 0.4958
#> 
#> $ARI
#> [1] 8e-04
```

Note that we have only looked at some of the results so far. Let's see
whether any method achieved better metrics. Results are sorted by RI:


```r
head(res_with_labels$metrics, 5)
#>                                             Purity Fmeasure     RI    ARI         Time
#> kmeans_euclidean_d2dtaMEId2dtaMHI             1.00   1.0000 1.0000 1.0000 0.0032920837
#> kmeans_mahalanobis_d2dtaMEId2dtaMHI           1.00   1.0000 1.0000 1.0000 0.0038609505
#> hierarch_ward.D2_euclidean_d2dtaMEId2dtaMHI   0.93   0.8678 0.8685 0.7370 0.0003728867
#> hierarch_ward.D2_manhattan_d2dtaMEId2dtaMHI   0.93   0.8685 0.8685 0.7370 0.0005681515
#> kkmeans_rbfdot_d2dtaMEId2dtaMHI               0.77   0.6684 0.6422 0.2857 0.0494818687
```

Indeed, we can see that some methods have given us excellent results: both
k-means configurations achieve a perfect clustering on this dataset.
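
Since `metrics` is a regular data frame, we can also re-rank it by any other
column ourselves, for example by execution time:

```r
# Re-order the summary table by execution time instead of the default RI.
head(res_with_labels$metrics[order(res_with_labels$metrics$Time), ], 3)
```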

In addition, if we only want to obtain the results for the best clustering
method, we can use the `only_best` parameter. Note that this parameter only
works when `true_labels` is provided.


```r
res_only_best <- EHyClus(curves,
  vars_combinations = list(c("d2dtaMEI", "d2dtaMHI")),
  true_labels = true_labels,
  only_best = TRUE
)
```

If we inspect the object, we see that it does indeed contain only the output
of the best-performing clustering method.


```r
res_only_best$cluster
#> $kmeans_euclidean_d2dtaMEId2dtaMHI
#> $kmeans_euclidean_d2dtaMEId2dtaMHI$cluster
#>   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1
#>  [55] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 
#> $kmeans_euclidean_d2dtaMEId2dtaMHI$valid
#> $kmeans_euclidean_d2dtaMEId2dtaMHI$valid$Purity
#> [1] 1
#> 
#> $kmeans_euclidean_d2dtaMEId2dtaMHI$valid$Fmeasure
#> [1] 1
#> 
#> $kmeans_euclidean_d2dtaMEId2dtaMHI$valid$RI
#> [1] 1
#> 
#> $kmeans_euclidean_d2dtaMEId2dtaMHI$valid$ARI
#> [1] 1
#> 
#> 
#> $kmeans_euclidean_d2dtaMEId2dtaMHI$internal_metrics
#> $kmeans_euclidean_d2dtaMEId2dtaMHI$internal_metrics$davies_bouldin
#> [1] 0.9024109
#> 
#> $kmeans_euclidean_d2dtaMEId2dtaMHI$internal_metrics$dunn
#> [1] 0.07203434
#> 
#> $kmeans_euclidean_d2dtaMEId2dtaMHI$internal_metrics$silhouette
#> [1] 0.4593925
#> 
#> $kmeans_euclidean_d2dtaMEId2dtaMHI$internal_metrics$infomax
#> [1] 1
#> 
#> 
#> $kmeans_euclidean_d2dtaMEId2dtaMHI$time
#> [1] 0.004937172
```


```r
res_only_best$metrics
#>                                   Purity Fmeasure RI ARI        Time
#> kmeans_euclidean_d2dtaMEId2dtaMHI      1        1  1   1 0.004937172
```

This can be useful if you want to use the function as if it were a typical
clustering method.
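
For example, since the result then contains a single configuration, the
assignment vector can be extracted directly, much as with `kmeans()`:

```r
# With only_best = TRUE there is exactly one entry under $cluster, so its
# assignment vector can be pulled out directly.
best_assignments <- res_only_best$cluster[[1]]$cluster
table(best_assignments, true_labels)
```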

## Automatic variable selection

When `vars_combinations = "auto"`, `EHyClus` automatically selects an
optimal combination of variables through a data-driven approach. This
process comprises two main steps, each illustrated with a sketch below.

The first step focuses on identifying variables that show enough variation to
be meaningful discriminators. For each variable, the function calculates how
many unique values it contains relative to its total number of observations.
Only variables where this ratio exceeds 50% are retained. This ensures that we
focus on variables that have sufficient variation to be useful in
distinguishing between different groups.
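
A minimal sketch of this filter, assuming the index variables are stored as
columns of a matrix (it mirrors the description above, not the package's
internal code):

```r
# Keep only columns whose ratio of unique values to observations exceeds 0.5.
uniqueness_filter <- function(X, threshold = 0.5) {
  ratios <- apply(X, 2, function(v) length(unique(v)) / length(v))
  X[, ratios > threshold, drop = FALSE]
}
```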

The second step deals with redundancy in our selected variables. After all,
having multiple variables that essentially contain the same information won't
improve our clustering results. To address this, the function examines how
correlated our remaining variables are with each other. If two variables show
a correlation higher than 0.75, we keep only one of them. Specifically, we
look at how correlated each variable is with all other variables on average,
and remove the one that shows higher average correlation. This helps ensure
our final set of variables provides unique, non-redundant information for
clustering.
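
Again as a sketch of the described rule rather than the exact internal
implementation:

```r
# Repeatedly find the most correlated pair above the threshold and drop the
# member with the higher average absolute correlation to all other columns.
drop_redundant <- function(X, threshold = 0.75) {
  repeat {
    if (ncol(X) <= 1) return(X)
    C <- abs(cor(X))
    diag(C) <- 0
    if (max(C) <= threshold) return(X)
    pair <- which(C == max(C), arr.ind = TRUE)[1, ]
    avg_cor <- colMeans(C)
    X <- X[, -pair[which.max(avg_cor[pair])], drop = FALSE]
  }
}
```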

Here's an example using automatic variable selection:


```r
# Generate sample data
n <- 50
curves <- sim_model_ex2(n = n, i_sim = 4)
true_labels <- c(rep(1, n), rep(2, n))

# Use automatic variable selection
res_auto <- EHyClus(curves, 
                    vars_combinations = "auto",
                    true_labels = true_labels)

# Look at which variables were selected
attr(res_auto, "vars_combinations")[[1]]
#> [1] "ddtaMEI"  "dtaMHI"   "ddtaMHI"  "d2dtaMHI"
```