Overview

Another central concept in social network analysis is that of similarity between vertices.

In what ways can two actors in a network be similar, and how can we quantify this similarity?

Similarity is relevant in at least two important applications:

  • recommender systems, where the goal is to suggest to a user new products to buy according to what similar users bought or liked, and
  • link prediction, where the task is to predict new connections for a user on social networks according to the connections of similar users.

Similarity

Similarity is based on the following thesis:

Two nodes are similar to each other if they share many neighbors.

For instance, two customers are similar if they buy the same products (Amazon), or view the same movies (Netflix), or listen to the same music (Spotify).

Math

  • let \(A = (a_{i,j})\) be the adjacency matrix of a possibly weighted graph
  • the general idea is to associate each node \(i\) with the \(i\)-th row (or column) of the adjacency matrix (a vector whose length is the number of nodes of the graph)
  • the similarity (or dissimilarity) of two nodes \(i\) and \(j\) is then measured using a similarity (or distance) function among the associated vectors

Math

Let us focus on unweighted undirected graphs with \(n\) nodes.

We denote with \[k_i = \sum_k A_{i,k}\] the degree of node \(i\) and with \[n_{i,j} = \sum_k A_{i,k} A_{j,k}\] the number of neighbors that nodes \(i\) and \(j\) have in common.

Moreover, \(A_i\) is the \(i\)-th row of \(A\), a boolean (0-1) vector of length \(n\).

Cosine similarity

We are going to describe a pool of similarity measures.

The first one is cosine similarity:

The similarity \(\sigma_{i,j}\) of nodes \(i\) and \(j\) is the cosine of the angle between vectors \(A_i\) and \(A_j\).

\[ \sigma_{i,j} = \cos (A_i, A_j) = \frac{A_i \cdot A_j}{\|A_i\| \|A_j\|} = \frac{\sum_k A_{i,k} A_{j,k}}{\sqrt{\sum_k A_{i,k}^{2}} \sqrt{\sum_k A_{j,k}^{2}}} \]

The measure runs from 0 (orthogonal vectors or maximum dissimilarity) to 1 (parallel vectors or maximum similarity).

Since the involved vectors are 0-1 vectors, we have that:

\[ \sigma_{i,j} = \frac{n_{i,j}}{\sqrt{k_i k_j}} \]

That is, cosine similarity between \(i\) and \(j\) is the number of neighbors shared by \(i\) and \(j\) divided by the geometric mean of their degrees.
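
As a quick check of this closed form, here is a minimal sketch (assuming the igraph package and its built-in Zachary karate club graph, used purely as an example): for a symmetric 0-1 adjacency matrix, the \((i,j)\) entry of \(A^2\) is exactly \(n_{i,j}\).

library(igraph)
g = make_graph("Zachary")
A = as_adjacency_matrix(g, sparse = FALSE)
k = degree(g)
N = A %*% A                    # N[i, j] = n_ij, the number of common neighbors
S = N / sqrt(outer(k, k))      # sigma_ij = n_ij / sqrt(k_i * k_j)

# same value as the cosine of the corresponding rows, for instance nodes 1 and 2
S[1, 2]
sum(A[1, ] * A[2, ]) / (sqrt(sum(A[1, ]^2)) * sqrt(sum(A[2, ]^2)))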

Pearson similarity

An alternative similarity measure is Pearson correlation coefficient:

The similarity \(\sigma_{i,j}\) of nodes \(i\) and \(j\) is the correlation coefficient between vectors \(A_i\) and \(A_j\):

\[ \sigma_{i,j} = \frac{cov(A_i, A_j)}{sd(A_i) \, sd(A_j)} = \frac{\sum_k (A_{i,k} - \langle A_i \rangle) (A_{j,k} - \langle A_j \rangle)} {\sqrt{\sum_k (A_{i,k} - \langle A_i \rangle)^{2}} \sqrt{\sum_k (A_{j,k} - \langle A_j \rangle)^{2}}} \]

The measure runs from -1 (maximum negative correlation or maximum dissimilarity) to 1 (maximum positive correlation or maximum similarity). Notice that values close to \(0\) indicate no correlation, hence neither similarity nor dissimilarity.

Again, since the involved vectors are 0-1 vectors, it is not difficult to see that the numerator of the correlation coefficient, that is, the covariance between vectors \(A_i\) and \(A_j\), is:

\[ cov(A_i, A_j) = n_{i,j} - \frac{k_i k_j}{n} \]

Notice that \(k_i k_j/n\) is the expected number of common neighbors between \(i\) and \(j\) if they chose their neighbors at random: the probability that a random neighbor of \(i\) is also a neighbor of \(j\) is \(k_j / n\), hence the expected number of common neighbors between \(i\) and \(j\) is \(k_i k_j/n\).

Hence a positive covariance, or a similarity between \(i\) and \(j\), arises when \(i\) and \(j\) share more neighbors than we would expect by chance, while a negative covariance, or a dissimilarity between \(i\) and \(j\), arises when \(i\) and \(j\) have fewer neighbors in common than we would expect by chance.
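
A minimal numeric check of this identity (on the same Zachary karate club graph; any unweighted undirected graph would do):

library(igraph)
g = make_graph("Zachary")
A = as_adjacency_matrix(g, sparse = FALSE)
n = vcount(g)
i = 1; j = 2
n_ij = sum(A[i, ] * A[j, ])            # common neighbors of i and j
k_i = sum(A[i, ]); k_j = sum(A[j, ])   # degrees of i and j

# the numerator of the correlation coefficient ...
sum((A[i, ] - mean(A[i, ])) * (A[j, ] - mean(A[j, ])))
# ... equals the covariance expressed with common neighbors and degrees
n_ij - k_i * k_j / n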

Code

The user-defined function similarity computes both cosine and Pearson similarity on the rows or columns of the adjacency matrix:

similarity = function(g, type = "cosine", mode = "col") {
  A = as_adjacency_matrix(g, sparse = FALSE)
  # to work on rows, transpose the matrix and proceed column-wise
  if (mode == "row") {A = t(A)}
  if (type == "cosine") {
    # normalize each column by its Euclidean norm, then take all inner products
    euclidean = function(x) {sqrt(x %*% x)}
    d = apply(A, 2, euclidean)
    D = diag(1/d)
    S = D %*% t(A) %*% A %*% D
  }
  if (type == "pearson") {
    # Pearson correlation between columns
    S = cor(A)
  }
  return(S)
}
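
A possible usage sketch (the Zachary graph and the node indices are only illustrative; the graph is introduced properly in the Play section below):

library(igraph)
g = make_graph("Zachary")
S_cos = similarity(g, type = "cosine")
S_cor = similarity(g, type = "pearson")

# most cosine-similar pair of distinct nodes
diag(S_cos) = 0
which(S_cos == max(S_cos), arr.ind = TRUE)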

Global similarity

  • all similarity measures so far are local: they count the number of common neighbors, or number of paths of length two, between two nodes
  • however, two nodes might be similar in a more general way, using longer paths, or indirect similarity
  • two nodes may have few or no neighbors in common, but they still may be similar in an indirect, global way
  • paths between nodes, of any length, are hints of similarity; nevertheless, the shorter the path, the stronger the contribution to similarity.

Global similarity

The recursive thesis of global similarity is the following:

Two nodes \(i\) and \(j\) are similar if the neighbors of \(i\) are similar to \(j\).

Mathematically, the recursive thesis of global similarity can be written as follows:

\[ \sigma_{i,j} = \alpha \sum_k A_{i,k} \sigma_{k,j} + \delta_{i,j} \]

or in matrix form as:

\[ \sigma = \alpha A \sigma + I \]

Evaluating the expression by iteration, starting from \(\sigma^{(0)} = 0\), we get:

\[ \begin{array}{lcl} \sigma^{(1)} & = & I \\ \sigma^{(2)} & = & \alpha A + I \\ \sigma^{(3)} & = & \alpha^2 A^2 + \alpha A + I \\ & \ldots & \end{array} \]

Hence, \(\sigma^{(k)}\) counts the contributions of paths of length less than \(k\), where the contribution of a path of length \(l\) is weighted by the attenuation factor \(\alpha^l\); since \(\alpha < 1\), longer paths contribute less.

Notice that the contribution of the identity matrix \(I\) is that each node is similar to itself; this self-similarity is necessary to propagate similarity through the network using the topology represented by the adjacency matrix \(A\).

It follows that the similarity matrix \(\sigma\) is the following:

\[ \sigma = \sum_{k = 0}^{\infty} (\alpha A)^k = (I - \alpha A)^{-1} \]
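
The convergence of the iteration to this closed form can be checked numerically; a minimal sketch, assuming the Zachary graph and a value of \(\alpha\) below \(1 / \rho(A)\) (the condition discussed below):

library(igraph)
g = make_graph("Zachary")
A = as_adjacency_matrix(g, sparse = FALSE)
alpha = 0.85 / max(abs(eigen(A)$values))   # safely below 1 / spectral radius
I = diag(vcount(g))

# closed form
S_exact = solve(I - alpha * A)

# iterate sigma <- alpha * A * sigma + I starting from the zero matrix
S = matrix(0, vcount(g), vcount(g))
for (t in 1:100) {S = alpha * (A %*% S) + I}

max(abs(S - S_exact))   # small, and it shrinks as the number of iterations grows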

Variants of global similarity

As defined, the metric tends to give high similarity to nodes with high degree. However, two hermits might be similar in an interesting way as well.

If we wish, we can remove the bias in favor of hubs by dividing by node degree, hence:

\[ \sigma_{i,j} = \frac{\alpha}{k_i} \sum_k A_{i,k} \sigma_{k,j} + \delta_{i,j} \]

In matrix form, with \(D\) the diagonal matrix of degrees, this reads \(D \sigma = \alpha A \sigma + D\), whose solution is \(\sigma = (D - \alpha A)^{-1} D\); dropping the trailing diagonal factor, which merely rescales each column by the corresponding degree, we get:

\[ \sigma = (D - \alpha A)^{-1} \]
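
A minimal sketch of this variant, again on the Zachary graph, with an arbitrary \(\alpha < 1\):

library(igraph)
g = make_graph("Zachary")
A = as_adjacency_matrix(g, sparse = FALSE)
D = diag(degree(g))   # diagonal matrix of degrees
alpha = 0.85          # any alpha < 1 works: D - alpha * A is strictly diagonally dominant
S = solve(D - alpha * A)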

Variants of global similarity

  • another variant allows for cases where the term \(\delta_{i,j}\) is not simply diagonal, but includes off-diagonal items
  • this would allow us to specify explicitly that particular pairs of nodes are similar based on some external (non-network) information we have at our disposal

Global similarity and Katz centrality

  • this solution is reminiscent of Katz centrality: \(x = \beta (I - \alpha A)^{-1} \mathbf{1}\)
  • indeed, the Katz centrality of node \(i\) is exactly the sum of the similarities of \(i\) with all other nodes (see the sketch after this list)
  • hence, a node is central in the Katz view if it is similar to many other vertices
  • as with Katz centrality, the parameter \(\alpha\) must be chosen less than \(1 / \rho(A)\), with \(\rho(A)\) the spectral radius of \(A\)
  • in the same way, the degree-controlled version of the similarity is reminiscent of PageRank centrality
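
The correspondence with Katz centrality can be checked numerically; a minimal sketch, assuming that igraph's alpha_centrality solves \(x = \alpha A x + e\) with \(e = \mathbf{1}\) by default, which is Katz centrality with \(\beta = 1\):

library(igraph)
g = make_graph("Zachary")
A = as_adjacency_matrix(g, sparse = FALSE)
alpha = 0.85 / max(abs(eigen(A)$values))   # below 1 / spectral radius
S = solve(diag(vcount(g)) - alpha * A)     # global similarity matrix

# Katz centrality as the row sums of the global similarity matrix
katz_sim = rowSums(S)
katz_igraph = alpha_centrality(g, alpha = alpha)
max(abs(katz_sim - katz_igraph))           # should be numerically zero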

Play

Compute and visualize the global similarity on the Zachary graph

library(tidyverse)
library(igraph)
library(ggraph)
g = make_graph("Zachary")

# similarity matrix
A = as_adjacency_matrix(g, sparse = FALSE)
alpha = 0.85 / max(abs(eigen(A)$values))   # alpha below 1 / spectral radius
I = diag(1, vcount(g))
S = solve(I - alpha * A)
S = S - diag(diag(S))                      # remove self-similarity

# similarity graph
s = graph_from_adjacency_matrix(S, mode = "undirected", weighted = TRUE)
V(s)$name = 1:vcount(s)

ggraph(s, layout = "circle") + 
  geom_edge_link(aes(alpha = weight, 
                     filter = (weight > quantile(weight, 0.9)))) + 
  geom_node_point() +
  geom_node_text(aes(label = name, x = x * 1.05, y = y * 1.05)) + 
  theme_graph() +
  coord_fixed()

Heterogeneity

We have learnt how to exploit the network topology to gauge the similarity (or dissimilarity) of pairs of nodes.

This opens up the possibility of defining a measure of heterogeneity for a node in terms of the dissimilarity of its neighbors:

A node is heterogeneous if it has dissimilar neighbors.

Heterogeneity can be integrated into centrality. All else equal, nodes that receive endorsements from heterogeneous peers are considered more important than nodes that are endorsed by peers that are very similar among each other.

For instance, in the context of Quadratic Voting, if the voting patterns of two (or more) individuals are highly positively correlated, the individuals are assumed to be coordinated and to act as a single agent, hence their voting contribution is discounted. On the other hand, if the voting patterns are statistically independent, the full voting contribution of each individual is counted.

Math

We will work on weighted undirected graphs. Let \(A = (a_{i,j})\) be the adjacency matrix of a weighted undirected graph. Let \(i\) be a node with positive degree \(d_i\). We set:

\[p_{i,j} = \frac{a_{i,j}}{\sum_k a_{i,k}} = \frac{a_{i,j}}{d_i}\]

Notice that \(0 \leq p_{i,j} \leq 1\) and \(\sum_j p_{i,j} = 1\), hence \[p_i = (p_{i,1}, \ldots, p_{i,n})\] is a probability distribution for every node \(i\).
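
For instance, a minimal sketch on a small weighted triangle (the weights are arbitrary) that extracts these distributions from the rows of the weighted adjacency matrix:

library(igraph)
g = make_full_graph(3)
E(g)$weight = c(1, 2, 3)
A = as_adjacency_matrix(g, attr = "weight", sparse = FALSE)

# P[i, ] is the probability distribution p_i of node i over its neighbors
P = A / rowSums(A)
rowSums(P)   # each row sums to 1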

Shannon entropy

We will work out the math for a general probability distribution \(p = (p_1, \ldots, p_n)\), with \(0 \leq p_{i} \leq 1\) and \(\sum_i p_{i} = 1\).

A measure of heterogeneity of \(p\) is Shannon entropy:

\[ H(p) = -\sum_i p_i \log p_i \]

where \(p_i \log p_i = 0\) if \(p_i = 0\).

Simpson diversity

Another measure of heterogeneity of \(p\) is Simpson diversity:

\[ S(p) = \sum_{i \neq j} p_i p_j = \sum_{i, j} p_i p_j - \sum_i p_{i}^{2} = 1 - \sum_i p_{i}^{2} \]

where in the latter we have used the fact that

\[ \sum_{i, j} p_{i} p_{j} = \sum_{i} p_{i} \sum_j p_{j} = \sum_{i} p_{i} 1 = 1 \]

Minimum and maximum

  • Both measures are minimized when \(p\) is fully concentrated on some element \(k\), that is \(p_k = 1\) and \(p_i = 0\) for every \(i \neq k\). In such case \[H(p) = S(p) = 0\]
  • Both measures are maximized when \(p\) is evenly distributed among the elements, that is \(p_i = 1/n\) for every \(i\). In such case \[H(p) = \log n\] and \[S(p) = 1 - 1/n\] Notice that in this case both measures grow with \(n\), although Simpson diversity is bounded by \(1\) (see the quick check after this list).
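
A quick numeric check of both extreme cases with \(n = 4\), using the natural logarithm as in the formula above (the Code section below uses base-2 logarithms instead):

p_concentrated = c(1, 0, 0, 0)
p_uniform = rep(1/4, 4)

# Shannon entropy (natural log), with the convention 0 * log 0 = 0
H = function(p) {x = p * log(p); -sum(replace(x, is.nan(x), 0))}
# Simpson diversity
S = function(p) {1 - sum(p^2)}

c(H(p_concentrated), S(p_concentrated))   # 0 and 0
c(H(p_uniform), S(p_uniform))             # log(4) = 1.386... and 1 - 1/4 = 0.75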

Rao quadratic entropy

If we have information about pairwise distance (dissimilarity) \(d_{i,j}\) of elements \(i\) and \(j\), then another measure of heterogeneity is Rao quadratic entropy:

\[ R(p) = \sum_{i,j} p_{i} \, p_{j} \, d_{i,j} \]

There are two components in this definition of heterogeneity:

  1. the evenness of the distribution \(p\)
  2. the distances \(d\) among elements

Rao quadratic entropy

To isolate both components of the definition, let us first assume that the distribution \(p\) is uniform, that is \(p_i = 1/n\) for every \(i\). Hence:

\[ R(p) = \sum_{i,j} p_{i} \, p_{j} \, d_{i,j} = \frac{1}{n^2} \sum_{i,j} d_{i,j} \]

is the arithmetic mean distance among elements in \(p\). Hence, the more distant are the elements in \(p\) among each other, the more heterogeneous is \(p\).

Rao quadratic entropy

Let us now assume the simplest distance among nodes: \(d_{i,j} = 1\) for \(i \neq j\) and \(d_{i,i} = 0\). In this case:

\[ R(p) = \sum_{i,j} p_{i} \, p_{j} \, d_{i,j} = \sum_{i \neq j} p_{i} \, p_{j} = S(p) \]

Hence in this case the more uniform is \(p\), the more heterogeneous is \(p\).
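
A quick numeric check of this special case, with an arbitrary distribution over three elements:

p = c(0.5, 0.3, 0.2)
D = 1 - diag(3)           # d[i, j] = 1 for i != j, d[i, i] = 0
sum(outer(p, p) * D)      # Rao quadratic entropy: sum of p_i * p_j * d_ij
1 - sum(p^2)              # Simpson diversity: both give 0.62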

Rao quadratic entropy

It follows that, in general:

  • \(R(p)\) is large, hence \(p\) is heterogeneous, when \(p\) evenly distributes its probability among dissimilar elements
  • on the contrary, \(p\) is homogeneous when it concentrates its probability on similar elements
  • If we go back to heterogeneity of nodes of a graph and apply \(R(p)\) as a measure, we have that a node is heterogeneous when it evenly distributes its strength among dissimilar neighbors
  • a node is homogeneous when it concentrates its strength on similar neighbors

Code

# Shannon entropy of a probability distribution (base-2 logarithm)
shannon = function(p) {
  x = p * log2(p)
  x = replace(x, is.nan(x), 0)   # convention: 0 * log 0 = 0
  return(-sum(x))
}

# Simpson diversity of a probability distribution
simpson = function(p) {
  x = 1 - sum(p * p)
  return(x)
}

# Rao quadratic entropy of a distribution p with distance matrix D
rao = function(p, D) {
  x = diag(p) %*% D %*% diag(p)  # entries p_i * d_ij * p_j
  return(sum(c(x)))
}


# heterogeneity of all nodes of a weighted graph g, given a distance matrix D
heterogeneity = function(g, D, mode = "col") {
  A = as_adjacency_matrix(g, attr = "weight", sparse = FALSE)
  if (mode == "col") {
    A = A %*% diag(1/colSums(A))   # normalize columns into distributions
    dim = 2
  } else {
    A = diag(1/rowSums(A)) %*% A   # normalize rows into distributions
    dim = 1
  }
  return(list(shannon = apply(A, dim, shannon), 
              simpson = apply(A, dim, simpson), 
              rao = apply(A, dim, rao, D)))
}
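
A possible usage sketch of heterogeneity, assuming unit weights on the Zachary graph (just to make it weighted) and one minus Pearson similarity as the dissimilarity matrix \(D\); both choices are only illustrative:

library(igraph)
g = make_graph("Zachary")
E(g)$weight = 1                               # trivially weighted graph
A = as_adjacency_matrix(g, attr = "weight", sparse = FALSE)
D = 1 - cor(A)                                # dissimilarity between nodes (columns)
h = heterogeneity(g, D, mode = "col")
head(h$shannon)
head(h$simpson)
head(h$rao)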