---
title: "Principal Balances"
author: "Marc Comas-Cufí"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Principal Balances}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
references:
- id: fernandez2017
  title: Advances in Principal Balances for Compositional Data
  author:
  - family: Martín-Fernández
    given: Josep A.
  - family: Pawlowsky-Glahn
    given: Vera
  - family: Egozcue
    given: Juan J.
  - family: Tolosana-Delgado
    given: Raimon
  container-title: Mathematical Geosciences
  volume: 50
  issue: 3
  URL: 'https://link.springer.com/article/10.1007/s11004-017-9712-z'
  DOI: 10.1007/s11004-017-9712-z
  page: 273-298
  type: article-journal
  issued:
    year: 2018
- id: ruskey1993
  title: Simple Combinatorial Gray Codes Constructed by Reversing Sublists
  author:
  - family: Ruskey
    given: Frank
  container-title: Lecture Notes in Computer Science
  volume: 762
  DOI: 10.1007/3-540-57568-5_250
  page: 201-208
  type: article-journal
  issued:
    year: 1993
- id: pawlowsky2011
  title: Principal balances
  author:
  - family: Pawlowsky-Glahn
    given: Vera
  - family: Egozcue
    given: Juan J.
  - family: Tolosana-Delgado
    given: Raimon
  container-title: 'Proceedings of the 4th International Workshop on Compositional Data Analysis (Egozcue JJ, Tolosana-Delgado R, Ortego M, eds), Girona, Spain'
  type: paper-conference
  page: 1-10
  issued:
    year: 2011
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(digits = 4)
```

_Principal balances_ are defined as follows [@pawlowsky2011]:

> Given an $n$-sample of a $D$-part random composition, the set of Principal Balances (PB) is a set of $D-1$ balances satisfying the following conditions:
> 
> * Each sample PB is obtained as the projection of a sample composition on a unitary composition or balancing element associated to the PB.
> * The first PB is the balance with maximum sample variance.
> * The $i$-th PB has maximum variance conditional to its balancing element being orthogonal to the previous 1st, 2nd, ..., $(i-1)$-th balancing elements.
>
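
In this definition, a balance between two disjoint groups of parts $x_+$ (with $r$ parts) and $x_-$ (with $s$ parts) is the normalized logratio of their geometric means,

$$
b = \sqrt{\frac{rs}{r+s}}\,\ln\frac{g(x_+)}{g(x_-)},
$$

where $g(\cdot)$ denotes the geometric mean of the parts in each group.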

Because of the large number of possibilities, finding the PB of a given compositional sample is a computationally demanding task. @fernandez2017 proposed different approaches to compute or approximate such a basis. `coda.base` implements all the methods presented in [@fernandez2017].

To illustrate `coda.base` functionalities, we use the following dataset:

```{r}
library(coda.base)
X = parliament2017[,-1]
```

consisting of the parties standing in the 2017 Catalan parliamentary election (the first column, which is not compositional, is removed).
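
A quick look at the first rows:

```{r}
head(X)
```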

## Exact Principal Balances

`coda.base` calculates the PB with the function `pb_basis()`, which requires the argument `method` to be set. To obtain the exact PB, users need to set `method = "exact"`.

```{r}
B1 = pb_basis(X, method = "exact")
```
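
The columns of `B1` contain the log-contrast coefficients defining each balancing element:

```{r}
B1
```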

The resulting sequential binary partition can be summarized with the following tree:

```{r, fig.width=7.5, fig.height=4.5}
plot_balance(B1)
```

The variances of the resulting balances:

```{r}
apply(coordinates(X, B1), 2, var)
```
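
Note that the variances appear in decreasing order. Since the PB form an orthonormal basis, their variances add up to the total variance of the sample, which does not depend on the chosen orthonormal basis; compare, for instance, with the coordinates in the default basis:

```{r}
sum(apply(coordinates(X, B1), 2, var))
sum(apply(coordinates(X), 2, var))
```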

When `method` is set to `"exact"`, an exhaustive search is performed to obtain the PB. The time needed to calculate the PB grows exponentially with the number of parts of `X`. To iterate through all the possible partitions, `coda.base` uses the algorithm of @ruskey1993. Currently, the exhaustive search can find the PB of a compositional dataset with 20 parts in a reasonable amount of time (approximately 5 minutes).
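
As an illustration (not evaluated when building this vignette), a similar timing can be measured on simulated data with `system.time()`; the figure below reports such timings for 10 to 20 parts:

```{r, eval=FALSE}
set.seed(1)
# 100 simulated compositions with 15 parts
x = matrix(rlnorm(100 * 15), ncol = 15)
system.time(pb_basis(x, method = "exact"))
```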


```{r, echo=FALSE, fig.width=7.5, fig.height=4.5, warning=FALSE}
library(ggplot2)

# Timings below were obtained with the following code (requires the tictoc package):
# time_ = sapply(5:20, function(D){
#   x = matrix(rlnorm(100*D), ncol=D)
#   tictoc::tic()
#   r = pb_basis(x, method = 'exact')
#   b = tictoc::toc()
#   b$toc - b$tic
# })
# dplot = data.frame(
#   parts = 10:20,
#   time = tail(time_, 11)
# )

dplot = structure(list(parts = 10:20, time = c(0.00499999999999545, 0.0119999999999436, 
0.0260000000000673, 0.0749999999999318, 0.243000000000052, 0.831999999999994, 
2.64400000000001, 9.56200000000001, 28.787, 95.4560000000001, 
311.133)), class = "data.frame", row.names = c(NA, -11L))

ggplot(data=dplot) +
  geom_point(aes(x=parts, y=time)) +
  geom_segment(aes(x=parts, xend=parts, y=0, yend=time)) +
  labs(x = 'Number of parts', y = 'Time in seconds (logarithmic scale)',
       title = 'Time needed to calculate Principal Balances',
       caption = "Times measured on a MacBook Pro (13-inch, 2017),\n2.3 GHz Dual-Core Intel Core i5, 16 GB 2133 MHz LPDDR3") +
  scale_x_continuous(breaks = 10:20) +
  scale_y_continuous(trans = 'log', breaks = c(0.004, 0.04, 0.4, 4, 40, 400),
                     labels = c("0.004", "0.04", "0.4", "4", "40", "400")) +
  theme_minimal()
```

When the number of parts is larger, the exact method should be discarded in favor of other alternatives. Several are discussed in @pawlowsky2011 and @fernandez2017; among them, `coda.base` implements the agglomerative partition approach based on the Ward method [@pawlowsky2011; @fernandez2017] and an approximation based on constrained principal components [@fernandez2017].

## Ward Method for Parts

For a composition $X$, @pawlowsky2011 proposed to build the variation array and use it as a distance matrix between parts. The variation array contains the variance of each pairwise logratio: $\text{Var}\left(\ln(x_i/x_j)\right)$ for parts $i$ and $j$.

```{r}
D = as.dist(variation_array(X))
D
```

By merging first the pairs of parts whose logratios have the lowest variance, the balances created in the final merges are expected to have high variance.


```{r, fig.width=7.5, fig.height=4.5}
B2 = pb_basis(X, method = 'cluster')
plot_balance(B2)
```


The variances of the balances obtained with the Ward approach:

```{r}
apply(coordinates(X, B2), 2, var)
```
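
To sketch the underlying idea (the internal merging performed by `pb_basis()` may differ in linkage details), the hierarchy can be reproduced with base R's `hclust()` applied to the variation distances `D` computed above:

```{r, fig.width=7.5, fig.height=4.5}
hc = hclust(D, method = "ward.D2")
plot(hc, main = "Ward clustering based on the variation array")
```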

## Constrained PCs Algorithm

Alternatively, @fernandez2017 proposed to approximate the PB using principal components: loosely speaking, at each step the balance whose direction is closest to the first principal component of the remaining variability is selected. This approximation is obtained by setting `method = 'constrained'`:

```{r, fig.width=7.5, fig.height=4.5}
B3 = pb_basis(X, method = 'constrained')
plot_balance(B3)
```

The variances of the balances obtained with the constrained approach:

```{r}
apply(coordinates(X, B3), 2, var)
```
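
The three bases obtained above can be compared through the fraction of the total variance captured by their first balances:

```{r}
# Total variance is invariant under the choice of orthonormal basis
total_var = sum(apply(coordinates(X, B1), 2, var))
sapply(list(exact = B1, cluster = B2, constrained = B3),
       function(B) var(coordinates(X, B)[, 1]) / total_var)
```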




# References

