---
title: "Use Case: Florentine Families"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{08 use case}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

This vignette describes a typical use case of the `netrankr` package. It contrasts
the usual centrality analysis based on indices with the dominance based assessment.
It is advisable to read the other vignettes before going through this example.

________________________________________________________________________________


```{r libs, warning=FALSE,message=FALSE}
library(igraph)
library(netrankr)
library(magrittr)
```

## Data
For this tutorial, we use the famous *florentine families* dataset (Padget & Ansell, 1993). The marriage links
of families together with the wealth attribute are included in the `netrankr` package.

```{r plot,fig.height=5,fig.width=5,fig.align='center'}
data("florentine_m")
# Delete Pucci family (isolated)
florentine_m <- delete_vertices(florentine_m, which(degree(florentine_m) == 0))

# plot the graph (label size proportional to wealth)
set.seed(111)
plot(florentine_m,
  vertex.label.cex = V(florentine_m)$wealth * 0.01,
  vertex.label.color = "black",
  vertex.color = "white",
  vertex.frame.color = "gray"
)
```

We use this dataset below to illustrate how a dominance based assessment of centrality
offers an alternative approach to centrality in two ways:

* Are the Medici universally the most central family?
* Using centrality as explanatory variable (Can centrality explain the wealth attribute?)

________________________________________________________________________________


## Most central family (Index approach)

The network is often used to benchmark new centrality indices. The premise is that
the Medici should always emerge as one of the most central once (if not **the** most central).  

We start by applying some of the standard centrality indices given in the `igraph` package.

```{r cent}
cent.df <- data.frame(
  degree = degree(florentine_m),
  betweenness = betweenness(florentine_m),
  closeness = closeness(florentine_m),
  eigenvector = eigen_centrality(florentine_m)$vector,
  subgraph = subgraph_centrality(florentine_m)
)

# most central family according to the 5 indices
V(florentine_m)$name[apply(cent.df, 2, which.max)]
```

In all cases, the Medici are considered to be the most central family. However,
it is possible to find indices that rank other families on top. An example is
*odd subgraph centrality*, which can be assembled with the `netrankr` package.

```{r cent_new}
# odd subgraph centrality
sc_odd <- florentine_m %>%
  indirect_relations(type = "walks", FUN = walks_exp_odd) %>%
  aggregate_positions(type = "self")

# family with highest score
V(florentine_m)$name[which.max(sc_odd)]
```

In this example, the Strozzi family are considered to be the most central family
and the Medici are ranked third.

Although we have found $5$ indices that consider the Medici the most central family,
we can not guarantee, that there do not exist hundreds (or thousands?) of indices
that would give an entirely different result. 

________________________________________________________________________________


## Most central family (Dominance approach)

We start by calculating the neighborhood-inclusion preorder, the most general
requirement for any centrality index.

```{r ni}
P <- neighborhood_inclusion(florentine_m)
```

With the function `comparable_pairs()` we can assess how many pairs of families are already
ordered, before applying any index.
```{r ni_comp}
comparable_pairs(P)
```

Only around `r round(comparable_pairs(P)*100,0)`% of pairs of families are comparable,
leaving `r 100-round(comparable_pairs(P)*100,0)`% of pairs of families that could be 
ordered (basically) arbitrarily.

If we want to visually assess the dominance relations we can use the function `dominance_graph()`.

```{r dom_graph,fig.height=5,fig.width=5,fig.align='center'}
d <- dominance_graph(P)
V(d)$name <- V(florentine_m)$name
set.seed(113)
plot(d,
  vertex.label.color = "black",
  vertex.color = "white",
  vertex.frame.color = "gray",
  edge.arrow.size = 0.5
)
```

The Castellan family neither dominates nor is dominated by any other family. This means, that we
can find indices that potentially rank them on top or on the bottom, or anything in between.

To better assess the potential ranks of nodes, we can plot the rank intervals

```{r rk_inter,fig.align='center',fig.width=5,fig.height=5}
plot(rank_intervals(P))
```

Observe how big the intervals are, indicating that there is ample scope to rank the
families differently. These intervals, however, only give us a rough estimate of this
arbitrariness and not any rank probabilities. To get all exact probabilities,
we use the function `exact_rank_prob()`.

```{r probs}
res <- exact_rank_prob(P)
```
There are `r format(res$lin.ext,scientific = F,big.mark = ",")` different possibilities to rank
the families! (*the value is stored in `res$lin.ext`. *). This means that theoretically, we
can find almost 4 Billion indices that rank the families differently. 

The *rank probabilities* of families can be found in `res$rank.prob`. They are returned 
as a matrix where rows are families and columns are ranks. That is, an entry in 
row $u$ and column $k$ gives the probability that $u$ has rank $k$ (larger $k$ indicate higher ranks)
Mostly, you will be interested in the probability to be the most central node of a network.
Below, we calculate these probability for all families and return the one's that have
a higher probability than $0.1$.

```{r likely_most_central}
top_rank_prob <- res$rank.prob[, 15]
names(top_rank_prob) <- V(florentine_m)$name
round(top_rank_prob[top_rank_prob > 0.1], 3)
```

The Strozzi family, with $0.13$, has the highest probability to be top ranked, followed
by the Medici with $0.12$. 

If we are only interested in a subset of nodes, in our case maybe the Strozzi and Medici,
we can assess the *relative rank probabilities* in `res$relative.rank`. Again, 
probabilities are returned as matrix objects, where an entry in row $u$ and column $v$
gives the probability that $u$ is ranked below $v$. Below we calculate this probability
for the Strozzi and Medici.

```{r medici_strozzi,eval=TRUE}
id_strozzi <- which(V(florentine_m)$name == "Strozzi")
id_medici <- which(V(florentine_m)$name == "Medici")
res$relative.rank[id_strozzi, id_medici]
```

The probability that the Strozzi are less central than the Medici is 
`r round(res$relative.rank[id_strozzi,id_medici],2)` and thus very close to a "fifty-fifty" chance.

The last result of interest returned by `exact_rank_prob()` are the *expected ranks* in
`res$expected.rank`. The expected ranks, as the name indicates, returns the ranks that we
expect families to have in a centrality ranking.

```{r exp_rank,echo=FALSE}
tab <- data.frame(Name = V(florentine_m)$name, Expected = round(res$expected.rank, 2))
tab <- tab[order(tab[, 2], decreasing = TRUE), ]
knitr::kable(tab, row.names = F)
```

Although the Strozzi have a higher probability to be the most central family, over all we still
expect the Medici to be the most central.  

This very general assessment gives us a general idea of the scope of potential centrality
analyses. The more possible rankings we have (as in this case!) the more unreliable an
index driven approach can be. We will explore this in more detail in the following section. 

________________________________________________________________________________


## Centrality as explanatory variable (Index approach)

Usually, we are not simply interested in a ranking of nodes, but we rather would 
like to use centrality to explain certain node attributes. In our case, we might 
be interested in the question: "can an index explain the wealth of families?", or if
we already have a more concrete idea "can proximity to other families explain the wealth attribute?" 

"Proximity" can be translated to the graph-theoretic concept of shortest path distances, 
such that closeness centrality would be an adequate candidate as an index. We here
use the pipeline approach of the `netrankr` package instead of the `closeness()`
function of `igraph`. The reasons will become evident in the next section.

```{r closeness_vs_wealth}
# Closeness
c_C <- florentine_m %>%
  indirect_relations(type = "dist_sp") %>%
  aggregate_positions(type = "invsum")

cor(c_C, V(florentine_m)$wealth, method = "kendall")
```

The correlation between closeness and wealth (`r round(cor(c_C,V(florentine_m)$wealth,method="kendall"),4)`) is far to low to constitute that "proximity" is related to wealth. However, there 
exist various other indices, that are based on the shortest path distances in a graph.
Refer to the literature for more details on these indices.

```{r harmclos_wealth}
# harmonic closeness
c_HC <- florentine_m %>%
  indirect_relations(type = "dist_sp", FUN = dist_inv) %>%
  aggregate_positions(type = "sum")
# residual closeness (Dangalchev,2006)
c_RC <- florentine_m %>%
  indirect_relations(type = "dist_sp", FUN = dist_2pow) %>%
  aggregate_positions(type = "sum")

# integration centrality (Valente & Foreman, 1998)
dist_integration <- function(x) {
  x <- 1 - (x - 1) / max(x)
}
c_IN <- florentine_m %>%
  indirect_relations(type = "dist_sp", FUN = dist_integration) %>%
  aggregate_positions(type = "sum")

c(
  cor(c_HC, V(florentine_m)$wealth, method = "kendall"),
  cor(c_RC, V(florentine_m)$wealth, method = "kendall"),
  cor(c_IN, V(florentine_m)$wealth, method = "kendall")
)
```

The highest correlation (0.1159) is achieved for residual closeness, however, this is 
still too low to conclude that proximity is related to wealth.

Besides the already considered indices, there exist further one's that include a
free parameter. The idea is that the parameter can be tuned to maximize the correlation
between the index and the attribute under consideration. Again, the mathematical
details can be found in the respective literature.

```{r distalpha_wealth,warning=FALSE}
# generalized closeness (Agneessens et al.,2017) (alpha>0) sum(dist^-alpha)
alpha <- c(seq(0.01, 0.99, 0.01), seq(1, 10, 0.1))
scores <-
  sapply(alpha, function(x) {
    florentine_m %>%
      indirect_relations(type = "dist_sp", FUN = dist_dpow, alpha = x) %>%
      aggregate_positions(type = "sum")
  })
cors_gc <- apply(
  scores, 2,
  function(x) cor(x, V(florentine_m)$wealth, method = "kendall")
)

res_gc <- c(max(cors_gc), alpha[which.max(cors_gc)])

# decay centrality (Jackson, 2010) (alpha in [0,1]) sum(alpha^dist)
alpha <- seq(0.01, 0.99, 0.01)
scores <-
  sapply(alpha, function(x) {
    florentine_m %>%
      indirect_relations(type = "dist_sp", FUN = dist_powd, alpha = x) %>%
      aggregate_positions(type = "sum")
  })
cors_dc <- apply(
  scores, 2,
  function(x) cor(x, V(florentine_m)$wealth, method = "kendall")
)

res_dc <- c(max(cors_dc), alpha[which.max(cors_dc)])
```

The highest correlation for generalized closeness is `r res_gc[1]` achieved for 
$\alpha$= `r res_gc[2]`  

The highest correlation for decay centrality is `r res_dc[1]` achieved for 
$\alpha$=`r res_dc[2]`

We could know accept that there is no index based on shortest path distances that could 
explain wealth. Or, we could start to craft new indices that might yield a better
correlation with wealth. However, we then enter the dilemma that was mentioned at the end of
the last section. If we find one, we can not be certain that there might not even be a 
better one out there. In contrast, if we do not succeed, we can not guarantee that 
there does not exist an index with a higher correlation. 

________________________________________________________________________________

## Centrality as explanatory variable (Dominance approach)

Since we are postulating a connection between proximity and wealth, we compute 
the pairwise shortest path distances as our *indirect relation* of interest and calculate
the positional dominance relations.

```{r pos_dom_hetero}
D <- florentine_m %>%
  indirect_relations(type = "dist_sp") %>%
  positional_dominance(benefit = F)

comparable_pairs(D)
```

Note that exactly the same pairs are comparable as for neighborhood-inclusion. However,
with one additional assumption, we will be able to increase the number of comparable pairs
significantly (and thus reduce the space of potential rankings). By summing up distances
in various ways, as done by the indices above, we assume families to be *homogeneous*.
It doesn't matter to whom we have a small distance, it just matters that they are small.

If we can safely comply with this assumption, we can use positional dominance 
*under total homogeneity*. It is important to note, that if a family is dominated 
by another under this premise, it will have a lower score in **any** distance based 
centrality index.

```{r pos_dom_homo}
D <- florentine_m %>%
  indirect_relations(type = "dist_sp") %>%
  positional_dominance(benefit = F, map = T)

comparable_pairs(D)
```

The number of comparable pairs increased from $0.15$ to $0.82$, thus reducing the 
space of potential centrality rankings based on distances significantly. 

We proceed to explore if there is potential for a distance based ranking to explain
wealth perfectly. This is only possible, if families with lower wealth do not dominate 
wealthier family. Otherwise they would always be ranked higher, prohibiting a perfect 
correlation.  

The figure below shows the dominance relations as a directed graph, where the x coordinate
of nodes is proportional to the wealth attribute and the y coordinate to the number of
dominated families.  Any edge pointing to the left (shown in red) denotes a pair of
"wrongly" ordered families, i.e. a wealthy family is dominated by a less wealthy
one.

```{r plot dist_dom,fig.align="center",fig.height=5,fig.width=5,echo=F}
d <- dominance_graph(D)
V(d)$name <- V(florentine_m)$name
x <- V(florentine_m)$wealth
x[9] <- x[9] - 50
x[14] <- x[14] - 50
y <- colSums(D)
el <- get.edgelist(d, names = F)
E(d)$color <- "gray"
col <- apply(el, 1, function(x) V(florentine_m)$wealth[x[1]] > V(florentine_m)$wealth[x[2]])
# wrong:41
E(d)$color[col] <- "indianred"

plot(d,
  layout = cbind(x, y),
  vertex.label.cex = 0.75,
  vertex.label.color = "black",
  vertex.color = "white",
  vertex.frame.color = "gray",
  edge.arrow.size = 0.4
)
```

In total, we find $41$ such pairs ($39$% of all pairs). This implies that we are
(potentially) quite far from being able to explain wealth perfectly with shortest path 
distances. In the following we will explore *how* far away. 

We start by calculating the rank intervals to illustrate the difference to 
neighborhood-inclusion.
```{r rk_inter_dist,fig.align='center',fig.width=5,fig.height=5}
plot(rank_intervals(D))
```
All intervals shrunk significantly and even correspond to a single point for two
families (Pazzi and Medici). This implies that no matter which distance based index we
use, the Pazzi family will always be ranked last and the Medici always on top.

For a more exact assessment we again use the function `exact_rank_prob()`.

```{r dist_probs}
res <- exact_rank_prob(D)
```
In total, there are `r format(res$lin.ext,scientific = F,big.mark = ",")` distance
based rankings possible. This is a huge reduction from the general case where almost
$4$ billion are possible. 

To determine the best possible correlation between wealth and any distance based ranking,
we first need to determine all `r format(res$lin.ext,scientific = F,big.mark = ",")`
rankings. For this purpose, we rerun the previous analysis with `only.results=FALSE` to obtain 
the necessary data structure.
```{r dist_probs_lat}
res <- exact_rank_prob(D, only.results = FALSE)
```

Now, we can use the function `get_rankings()` which returns all rankings as a matrix.

```{r all_rankings}
all_ranks <- get_rankings(res)
dim(all_ranks)
```

No we can simply loop over all rankings and calculate the correlation between the 
ranking and the wealth attribute.
```{r all_cors}
dist_cor <- apply(
  all_ranks, 2,
  function(x) cor(V(florentine_m)$wealth, x, method = "kendall")
)
c(max_cor = max(dist_cor), mean_cor = mean(dist_cor))
```
The highest achievable correlation is `r round(max(dist_cor),4)`.

We can conclude, that there can not be any distance based centrality index that can
reasonably explain the wealth attribute. 

We can additionally consider the correlation between degree and wealth, calculated below.
```{r cor_deg}
cor(degree(florentine_m), V(florentine_m)$wealth, method = "kendall")
```

The correlation is higher than any distance based index can have. Thus, we can 
additionally conclude that marriage ties are more indicative for wealth than proximity
in the marriage network.