Heterogeneity

We have learnt how to exploit the network topology to gauge the similarity (or dissimilarity) of pairs of nodes. This opens the possibility of defining a measure of heterogeneity for a node in terms of the dissimilarity of its neighbors:

A node is heterogeneous if it has dissimilar neighbors.

We will work on weighted undirected graphs. In the case of directed graphs, one must choose an edge direction, either incoming or outgoing. The unweighted case is a special case of the weighted one. Let $A = (a_{i,j})$ be the adjacency matrix of a weighted undirected graph. Let $i$ be a node with positive strength (weighted degree) $d_i = \sum_k a_{i,k}$. We set: $$p_{i,j} = \frac{a_{i,j}}{\sum_k a_{i,k}} = \frac{a_{i,j}}{d_i}$$ Notice that $0 \leq p_{i,j} \leq 1$ and $\sum_j p_{i,j} = 1$, hence $(p_{i,1}, \ldots, p_{i,n})$ is a probability distribution for every node $i$.
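For instance, here is a minimal R sketch (using the igraph package; the toy graph and its weights are made up for the example) that builds these distributions for all nodes:

```r
library(igraph)

# toy weighted undirected graph (hypothetical example)
g = make_graph(c(1,2, 1,3, 2,3), directed = FALSE)
E(g)$weight = c(1, 2, 3)

A = as_adjacency_matrix(g, attr = "weight", sparse = FALSE)

# row i of P is the probability distribution of node i:
# its edge weights divided by its strength d_i
P = diag(1 / rowSums(A)) %*% A
rowSums(P)  # every row sums to 1
```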

We will work out the math for a general probability distribution $p = (p_1, \ldots, p_n)$, with $0 \leq p_{i} \leq 1$ and $\sum_i p_{i} = 1$. A measure of heterogeneity of $p$ is Shannon entropy: $$ H(p) = -\sum_i p_i \log p_i $$ where, by convention, $p_i \log p_i = 0$ if $p_i = 0$. Another measure of heterogeneity of $p$ is Simpson diversity: $$ S(p) = \sum_{i \neq j} p_i p_j = \sum_{i, j} p_i p_j - \sum_i p_{i}^{2} = 1 - \sum_i p_{i}^{2} $$ where in the last step we have used the fact that $$ \sum_{i, j} p_{i} p_{j} = \sum_{i} p_{i} \sum_j p_{j} = \sum_{i} p_{i} \cdot 1 = 1 $$ Both measures are minimized when $p$ is totally concentrated on some element $k$, that is, $p_k = 1$ and $p_i = 0$ for every $i \neq k$. In that case $H(p) = S(p) = 0$. Both measures are maximized when $p$ is evenly distributed among the elements, that is, $p_i = 1/n$ for every $i$. In that case $H(p) = \log n$ and $S(p) = 1 - 1/n$. Notice that both maxima grow with $n$, although Simpson diversity is bounded by $1$.
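These extreme cases are easy to check numerically; a minimal sketch in R (with natural logarithm, so that the maximum entropy is $\log n$):

```r
n = 4
p.min = c(1, rep(0, n - 1))  # concentrated: all mass on one element
p.max = rep(1/n, n)          # uniform

# Shannon entropy (0 log 0 = 0 by convention)
H = function(p) {x = p * log(p); -sum(replace(x, is.nan(x), 0))}
# Simpson diversity
S = function(p) 1 - sum(p^2)

H(p.min); S(p.min)  # both 0
H(p.max); S(p.max)  # log(4) = 1.386... and 1 - 1/4 = 0.75
```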

If we have information about pairwise distance (dissimilarity) $d_{i,j}$ of elements $i$ and $j$, then another measure of heterogeneity is Rao quadratic entropy: $$ R(p) = \sum_{i,j} p_{i} \, p_{j} \, d_{i,j} $$ There are two components in this definition of heterogeneity:

  1. the evenness of the distribution $p$, and
  2. the pairwise distances among the elements of $p$.

To isolate both components of the definition, let us first assume that the distribution $p$ is uniform, that is, $p_i = 1/n$ for every $i$. Hence: $$ R(p) = \sum_{i,j} p_{i} \, p_{j} \, d_{i,j} = \frac{1}{n^2} \sum_{i,j} d_{i,j} $$ is the arithmetic mean distance among the elements of $p$. Thus, the more distant the elements of $p$ are from each other, the more heterogeneous $p$ is.

Let us now assume the simplest distance among nodes: $d_{i,j} = 1$ for $i \neq j$ and $d_{i,i} = 0$. In this case: $$ R(p) = \sum_{i,j} p_{i} \, p_{j} \, d_{i,j} = \sum_{i \neq j} p_{i} \, p_{j} = S(p) $$ Hence, with this distance, the more uniform $p$ is, the more heterogeneous it is.
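Both special cases can be checked numerically; a minimal sketch in R (the distributions and distance matrices below are made up for the check):

```r
n = 4
rao = function(p, D) sum(diag(p) %*% D %*% diag(p))

# a random symmetric distance matrix with zero diagonal
set.seed(42)
D = matrix(runif(n^2), n, n)
D = (D + t(D)) / 2
diag(D) = 0

# uniform distribution: Rao equals the mean pairwise distance
p = rep(1/n, n)
all.equal(rao(p, D), mean(D))  # TRUE

# 0/1 distance: Rao equals Simpson diversity
p = c(0.5, 0.3, 0.1, 0.1)
all.equal(rao(p, 1 - diag(n)), 1 - sum(p^2))  # TRUE
```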

It follows that, in general, $R(p)$ is large, and hence $p$ is heterogeneous, when $p$ evenly distributes its probability among dissimilar elements. Conversely, $p$ is homogeneous when it concentrates its probability on similar elements. Going back to the heterogeneity of nodes of a graph and applying $R(p)$ as a measure, we have that a node is heterogeneous when it evenly distributes its strength among dissimilar neighbors, and it is homogeneous when it concentrates its strength on similar neighbors.

All three measures, together with a function that computes them for every node of a weighted graph, can be coded in R with the igraph package:

```r
library(igraph)

# Shannon entropy of a distribution p (base-2 logarithm; 0 log 0 = 0)
shannon = function(p) {
  x = p * log2(p)
  x = replace(x, is.nan(x), 0)
  return(-sum(x))
}

# Simpson diversity of a distribution p
simpson = function(p) {
  x = 1 - sum(p * p)
  return(x)
}

# Rao quadratic entropy of a distribution p with distance matrix D
rao = function(p, D) {
  x = diag(p) %*% D %*% diag(p)
  return(sum(c(x)))
}

# heterogeneity of all nodes of a weighted graph g
# with node distance matrix D
heterogeneity = function(g, D, mode = "col") {
  A = as_adjacency_matrix(g, attr = "weight", sparse = FALSE)
  if (mode == "col") {
    # normalize columns to probability distributions
    A = A %*% diag(1/colSums(A))
    dim = 2
  } else {
    # normalize rows to probability distributions
    A = diag(1/rowSums(A)) %*% A
    dim = 1
  }
  return(list(shannon = apply(A, dim, shannon), 
              simpson = apply(A, dim, simpson), 
              rao = apply(A, dim, rao, D)))
}
```
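As a usage sketch (the small graph, its weights, and the 0/1 distance matrix below are hypothetical, chosen only for illustration):

```r
# toy weighted undirected graph (hypothetical example)
g = make_graph(c(1,2, 1,3, 2,3, 3,4), directed = FALSE)
E(g)$weight = c(1, 1, 2, 3)

# 0/1 distance among nodes
D = 1 - diag(vcount(g))

heterogeneity(g, D)
```

With this distance matrix, the `rao` and `simpson` components of the result coincide, as derived above; node 4, whose strength is concentrated on its single neighbor, scores 0 on all three measures.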