Similarity

Similarity is based on the following thesis:

Two nodes are similar to each other if they share many neighbors.

In the case of directed graphs, we may replace neighbors with either predecessor or successor nodes. For instance, in a friendship network like Facebook, two people are similar if they have many friends in common. In a directed social network like Twitter, two users are similar if they share many followers or many followees.

Let $A = (A_{i,j})$ be the adjacency matrix of a possibly weighted graph. The general idea is to associate each node $i$ with the $i$-th row of the adjacency matrix. If the graph is directed, we can associate node $i$ with either the $i$-th adjacency row, if we focus on outgoing links, or with the $i$-th adjacency column, if we focus on incoming links. Hence each node is associated with a vector in $\mathbb{R}^n$, where $n$ is the number of nodes. The similarity (or dissimilarity) of two nodes $i$ and $j$ is then measured using a similarity (or distance) function on the associated vectors.

Let us focus on unweighted undirected graphs with $n$ nodes. We denote by $k_i = \sum_k A_{i,k}$ the degree of node $i$ and by $n_{i,j} = \sum_k A_{i,k} A_{j,k}$ the number of neighbors that nodes $i$ and $j$ have in common. Moreover, $A_i$ is the $i$-th row of $A$, a boolean (0-1) vector of length $n$.
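Both quantities can be obtained with elementary matrix operations: since $A$ is symmetric and 0-1, we have $n_{i,j} = (A^2)_{i,j}$. Here is a minimal sketch in R, assuming the igraph package; the toy graph `g` is an arbitrary example, not taken from the text:

```r
library(igraph)

# an arbitrary toy undirected graph
g = make_graph(~ a-b, a-c, b-c, b-d, c-d)
A = as_adjacency_matrix(g, sparse = FALSE)

k = rowSums(A)   # degrees: k_i = sum_k A[i, k]
N = A %*% A      # common neighbor counts: N[i, j] = n_{i,j}; note diag(N) = k
```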

We are going to describe a pool of similarity measures. The first one is cosine similarity: the similarity $\sigma_{i,j}$ of nodes $i$ and $j$ is the cosine of the angle between vectors $A_i$ and $A_j$: $$ \sigma_{i,j} = \cos (A_i, A_j) = \frac{A_i \cdot A_j}{\|A_i\| \|A_j\|} = \frac{\sum_k A_{i,k} A_{j,k}}{\sqrt{\sum_k A_{i,k}^{2}} \sqrt{\sum_k A_{j,k}^{2}}} $$ The measure runs from 0 (orthogonal vectors or maximum dissimilarity) to 1 (parallel vectors or maximum similarity). Since the involved vectors are 0-1 vectors, we have that: $$ \sigma_{i,j} = \frac{n_{i,j}}{\sqrt{k_i k_j}} $$ That is, cosine similarity between $i$ and $j$ is the number of neighbors shared by $i$ and $j$ divided by the geometric mean of their degrees.
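As a quick numerical check of this simplification, the following sketch (same assumptions as above: igraph and an arbitrary toy graph) computes the cosine both from the definition and from the simplified formula:

```r
library(igraph)

g = make_graph(~ a-b, a-c, b-c, b-d, c-d)
A = as_adjacency_matrix(g, sparse = FALSE)

# cosine from the definition: normalize rows to unit length, take dot products
norms = sqrt(rowSums(A^2))
S1 = (A / norms) %*% t(A / norms)

# cosine from the simplified formula: n_ij / sqrt(k_i * k_j)
k = rowSums(A)
S2 = (A %*% A) / sqrt(outer(k, k))

all.equal(S1, S2)   # TRUE
```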

An alternative similarity measure is the Pearson correlation coefficient: the similarity $\sigma_{i,j}$ of nodes $i$ and $j$ is the correlation coefficient between vectors $A_i$ and $A_j$: $$ \sigma_{i,j} = \frac{\mathrm{cov}(A_i, A_j)}{\mathrm{sd}(A_i) \, \mathrm{sd}(A_j)} = \frac{\sum_k (A_{i,k} - \langle A_i \rangle) (A_{j,k} - \langle A_j \rangle)} {\sqrt{\sum_k (A_{i,k} - \langle A_i \rangle)^{2}} \sqrt{\sum_k (A_{j,k} - \langle A_j \rangle)^{2}}} $$ The measure runs from -1 (maximum negative correlation, or maximum dissimilarity) to 1 (maximum positive correlation, or maximum similarity). Notice that values close to $0$ indicate no correlation, hence neither similarity nor dissimilarity. Again, since the involved vectors are 0-1 vectors, it is not difficult to see that the numerator of the correlation coefficient, that is, the covariance between vectors $A_i$ and $A_j$, is: $$ \mathrm{cov}(A_i, A_j) = n_{i,j} - \frac{k_i k_j}{n} $$ Notice that $k_i k_j/n$ is the expected number of common neighbors between $i$ and $j$ if they chose their neighbors at random: the probability that a random neighbor of $i$ is also a neighbor of $j$ is $k_j / n$, hence the expected number of common neighbors between $i$ and $j$ is $k_i k_j/n$. Hence a positive covariance, or a similarity between $i$ and $j$, holds when $i$ and $j$ share more neighbors than we would expect by chance, while a negative covariance, or a dissimilarity between $i$ and $j$, occurs when $i$ and $j$ have fewer neighbors in common than we would expect by chance.
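The covariance identity can likewise be verified numerically; a minimal sketch under the same assumptions (igraph, an arbitrary toy graph):

```r
library(igraph)

g = make_graph(~ a-b, a-c, b-c, b-d, c-d)
A = as_adjacency_matrix(g, sparse = FALSE)
n = vcount(g)

i = "a"; j = "b"
k_i = sum(A[i, ]); k_j = sum(A[j, ])
n_ij = sum(A[i, ] * A[j, ])

# unnormalized covariance: sum_k (A_ik - <A_i>) (A_jk - <A_j>)
lhs = sum((A[i, ] - mean(A[i, ])) * (A[j, ] - mean(A[j, ])))
rhs = n_ij - k_i * k_j / n
all.equal(lhs, rhs)   # TRUE: shared neighbors minus the number expected by chance
```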

The following R function, which uses the igraph package, computes both similarity measures on the columns (or rows) of the adjacency matrix:

```r
library(igraph)

similarity = function(g, type = "cosine", mode = "col") {
  A = as_adjacency_matrix(g, sparse = FALSE)
  # work on columns by default; transpose to work on rows instead
  if (mode == "row") {A = t(A)}
  if (type == "cosine") {
    # normalize each column to unit Euclidean length, then
    # all pairwise dot products give the cosines
    euclidean = function(x) {sqrt(x %*% x)}
    d = apply(A, 2, euclidean)
    D = diag(1 / d)
    S = D %*% t(A) %*% A %*% D
  }
  if (type == "pearson") {
    # Pearson correlation between the columns of A
    S = cor(A)
  }
  return(S)
}
```
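As a usage example, we might apply the function to Zachary's karate club network, which ships with igraph, and check the cosine output against the simplified formula derived above. The check is illustrative, not part of the function; note also that this user-defined `similarity` masks igraph's built-in function of the same name:

```r
g = make_graph("Zachary")   # Zachary's karate club network, bundled with igraph

S = similarity(g, type = "cosine")

# check against the simplified formula n_ij / sqrt(k_i * k_j)
A = as_adjacency_matrix(g, sparse = FALSE)
k = rowSums(A)
all.equal(S, (A %*% A) / sqrt(outer(k, k)), check.attributes = FALSE)  # TRUE
```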