Overview

Loosely stated, community detection is the problem of finding the natural divisions of a network into groups of vertices, called communities, such that there are many edges within groups and few edges between groups.

For instance:

  • in social networks, communities represent groups of individuals that are socially close-knit
  • in a citation network among academic journals, communities might correspond to topics or disciplines
  • communities of nodes in a web graph might indicate groups of related web pages
  • communities of nodes in a metabolic network might indicate functional units within the network

Modularity

  • our goal is to find a measure that quantifies how many edges lie within groups in our network relative to the number of such edges expected on the basis of chance
  • a good division of nodes into communities is one that maximizes such a measure
  • equivalently, we want a measure that quantifies how many edges lie between groups in our network relative to the expected number of such links
  • a good division of nodes into communities is one that minimizes such a measure
  • we will concentrate on the former measure, called the modularity of the network

Modularity

Let us focus on undirected multigraphs, that is, graphs that allow self-edges (edges connecting a vertex to itself) and multi-edges (more than one edge between the same pair of vertices).

A measure of modularity of a network is the number of edges that run between vertices of the same community minus the number of such edges we would expect to find if edges were positioned at random while preserving the vertex degrees (configuration model).

Let \(c_i\) denote the community of vertex \(i\), and define \(\delta(c_i,c_j) = 1\) if \(c_i = c_j\) and \(\delta(c_i,c_j) = 0\) otherwise.

Hence, the actual number of edges that run between vertices of the same group is:

\[\sum_{(i,j) \in E} \delta(c_i, c_j) = \frac{1}{2} \sum_{i,j} A_{i,j} \delta(c_i, c_j)\]

where \(E\) is the set of edges of the graph and \(A_{i,j}\) is the actual number of edges between \(i\) and \(j\), which is zero or more (notice that each undirected edge is represented by two pairs in the second sum, hence the factor one-half).
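
This identity is easy to verify numerically. Here is a minimal R sketch; the membership vector comes from an arbitrary clustering (cluster_louvain) and is just for illustration:

library(igraph)

g = make_graph("Zachary")
memb = as.vector(membership(cluster_louvain(g)))  # any membership vector will do

A = as_adjacency_matrix(g, sparse=FALSE)
delta = outer(memb, memb, "==")                   # delta(c_i, c_j)

# (1/2) sum_{i,j} A_{ij} delta(c_i, c_j): number of within-group edges
sum(A * delta) / 2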

Modularity

The expected number of edges that run between vertices of the same group is:

\[\frac{1}{2} \sum_{i,j} \frac{k_i k_j}{2m} \delta(c_i, c_j)\]

where \(k_i\) and \(k_j\) are the (weighted) degrees of \(i\) and \(j\), while \(m\) is the number of edges of the graph.

Notice that \(k_i k_j / 2m\) is the expected number of edges between vertices \(i\) and \(j\) in the configuration model assumption.

Indeed, consider a particular edge attached to vertex \(i\):

  • the probability that this edge goes to node \(j\) is \(k_j / 2m\), since the number of edges attached to \(j\) is \(k_j\) and the total number of edge ends in the network is \(2m\) (the sum of all node degrees)
  • since node \(i\) has \(k_i\) edges attached to it, the expected number of edges between \(i\) and \(j\) is \(k_i k_j / 2m\)

Modularity

The difference between the actual and expected number of edges connecting nodes of the same group, expressed as a fraction of the total number of edges \(m\), is called modularity, and is given by:

\[Q = \frac{1}{2m} \sum_{i,j} \left(A_{i,j} - \frac{k_i k_j}{2m}\right) \delta(c_i, c_j) = \frac{1}{2m} \sum_{i,j} B_{i,j} \delta(c_i, c_j)\]

where:

\[ B_{i,j} = A_{i,j} - \frac{k_i k_j}{2m} \]

and \(B\) is called the modularity matrix.

The modularity \(Q\) takes positive values if there are more edges between same-group vertices than expected, and negative values if there are fewer.
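
Both the modularity matrix and \(Q\) itself can be computed directly from this definition. A minimal sketch, again using an arbitrary membership vector for illustration:

library(igraph)

g = make_graph("Zachary")
memb = as.vector(membership(cluster_louvain(g)))

A = as_adjacency_matrix(g, sparse=FALSE)
k = degree(g)
m = ecount(g)

# modularity matrix B_{ij} = A_{ij} - k_i k_j / 2m
B = A - outer(k, k) / (2*m)

delta = outer(memb, memb, "==")
Q = sum(B * delta) / (2*m)

# should agree with igraph's built-in modularity()
Q
modularity(g, memb)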

Modularity

Our goal is to find the partition of network nodes into communities such that the modularity of the division is maximum.

Unfortunately, this is a computationally hard problem.

Indeed, the number of ways a network of \(n\) nodes can be divided into two groups of \(n_1\) and \(n_2\) nodes, with \(n_1 + n_2 = n\), is:

\[ \binom{n}{n_1} = \binom{n}{n_2} = \frac{n!}{n_1! \, n_2!} \sim \frac{2^{n+1}}{\sqrt{2 \pi n}} \]

when \(n_1 = n_2 = n/2\), which is exponential in \(n\).
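
To get a sense of this growth, we can compare the exact count with the approximation for a few values of \(n\):

n = c(10, 20, 30, 40)
choose(n, n/2)             # exact number of balanced bisections
2^(n+1) / sqrt(2*pi*n)     # the Stirling approximation above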

Instead, therefore, we turn to heuristic algorithms: algorithms that attempt to maximize the modularity in an intelligent way and give reasonably good results in a reasonable amount of time.

Spectral modularity maximization

This method tries to maximize the modularity of a partition by exploiting the spectral properties of the modularity matrix.

Recall that modularity is defined as:

\[ Q = \frac{1}{2m} \sum_{i,j} \left(A_{i,j} - \frac{k_i k_j}{2m}\right) \delta(c_i, c_j) = \frac{1}{2m} \sum_{i,j} B_{i,j} \delta(c_i, c_j) \]

Notice that the modularity matrix \(B\) is symmetric and all rows and all columns of \(B\) sum to \(0\). Indeed:

\[ \sum_j B_{i,j} = \sum_j (A_{i,j} - \frac{k_i k_j}{2m}) = \sum_j A_{i,j} - \frac{k_i}{2m} \sum_j k_j = k_i - \frac{k_i}{2m} 2m = 0 \]

Spectral modularity maximization

Let us consider the division of a network into just two parts, namely group 1 and group 2 (the more general case can be treated with repeated bisection).

We represent the assignment of node \(i\) to a group with the variable \(s_i\) such that \(s_i = 1\) if \(i\) belongs to group 1 and \(s_i = -1\) if \(i\) belongs to group 2. Since \(s_i s_j = +1\) when \(i\) and \(j\) are in the same group and \(s_i s_j = -1\) otherwise, it follows that:

\[\delta(c_i, c_j) = \frac{1}{2} (s_i s_j + 1)\]

Substituting this expression in the modularity formula we have:

\[ Q = \frac{1}{4m} \sum_{i,j} B_{i,j} (s_i s_j + 1) = \frac{1}{4m} \sum_{i,j} B_{i,j} s_i s_j \]

where we have used the fact that \(\sum_{i,j} B_{i,j} = 0\).

In matrix form, writing \(s\) for the column vector with elements \(s_i\), we have:

\[ Q = \frac{1}{4m} s^T B s \]

Spectral modularity maximization

  • we wish to find the division of a given network, that is the value of \(s\), that maximizes the modularity \(Q\)
  • the elements of \(s\) are constrained to take integer values \(1\) or \(-1\), so that the vector always points to one of the corners of an n-dimensional hypercube
  • unfortunately, this optimization problem is a hard one, but it can be tackled approximately by a relaxation method
  • we relax the constraint that \(s\) must point to a corner of the hypercube and allow it to point in any direction, though keeping its length the same: \[s^T s = \sum_i s_{i}^{2} = n\]

Spectral modularity maximization

We maximize the modularity by differentiating with respect to each \(s_i\), imposing the constraint with a single Lagrange multiplier \(\beta\):

\[ \frac{\partial}{\partial s_i} \left[ \sum_{j,k} B_{j,k} s_j s_k + \beta (n - \sum_j s_{j}^{2}) \right] = 0 \]

Notice that the derivatives all vanish except when \(j = i\) or \(k = i\); using the symmetry of \(B\) we get:

\[ \sum_{k} B_{i,k} s_k + \sum_{j} B_{j,i} s_j - 2 \beta s_i = 2 \sum_{j} B_{i,j} s_j - 2 \beta s_i = 0 \]

which is:

\[ \sum_{j} B_{i,j} s_j = \beta s_i \]

or in matrix notation:

\[ B s = \beta s \]

Spectral modularity maximization

Hence the solution \(s\) is an eigenvector of the modularity matrix.

Therefore the modularity is given by:

\[ Q = \frac{1}{4m} s^T B s = \frac{1}{4m} \beta \, s^T s = \frac{n}{4m} \beta \]

which leads us to the conclusion that for maximum modularity we have to choose \(s\) as the eigenvector \(u\) corresponding to the largest (positive) eigenvalue of the modularity matrix.

We typically cannot in fact choose \(s = u\), since the elements of \(s\) are subject to the constraint \(s_i \in \{+1, -1\}\). But we do the best we can and choose it as close to \(u\) as possible, hence:

  • \(s_i = +1\) if \(u_i > 0\)
  • \(s_i = -1\) if \(u_i < 0\)
  • either value of \(s_i\) is good if \(u_i = 0\)

Spectral modularity maximization

So we are led to the following algorithm (implemented from scratch in the sketch after the list):

  1. we calculate the eigenvector of the modularity matrix corresponding to the largest (most positive) eigenvalue
  2. then we assign vertices to communities according to the signs of the vector elements, positive signs in one group and negative signs in the other
  3. if all elements of the eigenvector have the same sign, the network has no underlying community structure: the best bisection is no bisection at all
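
Here is a minimal from-scratch sketch of a single bisection; igraph's cluster_leading_eigen, used in the Code section below, implements the full method with repeated bisection:

library(igraph)

g = make_graph("Zachary")

# modularity matrix B = A - k k^T / 2m
A = as_adjacency_matrix(g, sparse=FALSE)
k = degree(g)
m = ecount(g)
B = A - outer(k, k) / (2*m)

# eigenvector of the largest eigenvalue (eigen returns them in decreasing order)
u = eigen(B, symmetric=TRUE)$vectors[, 1]

# round to the nearest corner of the hypercube
s = ifelse(u >= 0, 1, -1)

# modularity of the bisection: Q = s^T B s / 4m
as.numeric(t(s) %*% B %*% s) / (4*m)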

Code

  • let’s work with the Zachary network, which represents the pattern of friendships between members of a karate club at a North American university
  • the network is determined by direct observation of the club’s members by the experimenter over a period of about two years
  • the network is interesting because during the period of observation a dispute arose among the members of the club over whether to raise the club’s fees
  • as a result the club eventually split into two parts, of 18 and 16 members respectively, the latter departing to form their own club

Code

library(igraph)
g = make_graph("Zachary")
coords = layout_with_fr(g)
plot(g, layout=coords, 
     vertex.label=NA, vertex.size=6, vertex.color = "blue")

# spectral community detection (comm avoids masking base R's c() function)
comm = cluster_leading_eigen(g)

# modularity measure
modularity(comm)
## [1] 0.3934089
# number of communities
length(comm)
## [1] 4
# size of communities
sizes(comm)
## Community sizes
##  1  2  3  4 
##  7 12  9  6
# memberships of nodes
membership(comm)
##  [1] 1 3 3 3 1 1 1 3 2 2 1 1 3 3 2 2 1 3 2 3 2 3 2 4 4 4 2 4 4 2 2 4 2 2
# inter-community crossing edges
crossing(comm, g)
##  [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
## [13]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## [25] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## [61]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
## [73] FALSE FALSE FALSE  TRUE  TRUE FALSE
# modularity matrix (older igraph versions require a membership argument
# here, but it does not affect the result)
B = modularity_matrix(g, membership(comm))
# first row
round(B[1,], 2)
##  [1] -1.64  0.08 -0.03  0.38  0.69  0.59  0.59  0.59  0.49 -0.21  0.69  0.90
## [13]  0.79  0.49 -0.21 -0.21 -0.21  0.79 -0.21  0.69 -0.21  0.79 -0.21 -0.51
## [25] -0.31 -0.31 -0.21 -0.41 -0.31 -0.41 -0.41  0.38 -1.23 -1.74
# rows of B sum to zero (up to floating point error)
sum(B[1,])
## [1] 4.440892e-16
# plot communities with shaded regions
plot(comm, g, layout=coords)

# plot communities without shaded regions
plot(g, vertex.color=membership(comm), layout=coords)

Greedy optimization of modularity

  • cluster_fast_greedy starts out with each vertex in our network in a one-vertex group of its own
  • then it successively amalgamates groups in pairs, choosing at each step the pair whose amalgamation gives the biggest increase in modularity, or the smallest decrease if no choice gives an increase
  • eventually all vertices are amalgamated into a single large community and the algorithm ends
  • then we go back over the states through which the network passed during the course of the algorithm and select the one with the highest value of the modularity
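
For instance, on the Zachary network (a minimal sketch; cut_at() cuts the merge dendrogram at a chosen number of groups instead of at the modularity-optimal point):

library(igraph)
g = make_graph("Zachary")

cfg = cluster_fast_greedy(g)
modularity(cfg)        # modularity of the best cut
plot_dendrogram(cfg)   # the full merge history
cut_at(cfg, no = 2)    # membership for a forced two-community cut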

Edge betweenness

  • cluster_edge_betweenness looks for the edges that lie between communities
  • if we can find and remove these edges, we will be left with just the isolated communities
  • to identify edges between communities, one common approach is to use edge betweenness centrality, which counts the number of geodesic (shortest) paths that run along each edge: edges between communities lie on many such paths
  • a less expensive alternative would be to look for edges that belong to an unusually small number of short loops
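
A short sketch; since edges between communities carry many shortest paths, they tend to top the edge betweenness ranking:

library(igraph)
g = make_graph("Zachary")

# edges with the highest betweenness are candidate inter-community edges
head(sort(edge_betweenness(g), decreasing = TRUE))

ceb = cluster_edge_betweenness(g)
modularity(ceb)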

Random walks

  • cluster_walktrap is an approach based on random walks
  • the general idea is that if you perform random walks on the graph, then the walks are more likely to stay within the same community because there are only a few edges that lead outside a given community
  • this method runs short random walks of 3 to 5 steps (depending on one of its parameters) and uses the results of these random walks to merge separate communities in a bottom-up manner
  • you can use the modularity score to select where to cut the dendrogram
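
A short sketch; the steps parameter is the walk length mentioned above (the default is 4):

library(igraph)
g = make_graph("Zachary")

cw = cluster_walktrap(g, steps = 4)
modularity(cw)   # the dendrogram is cut where modularity is highest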

Statistical mechanics

  • cluster_spinglass is an approach from statistical physics, based on the so-called Potts model
  • in this model, each particle (i.e. vertex) can be in one of k spin states
  • the interactions between the particles (i.e. the edges of the graph) specify which pairs of vertices would prefer to stay in the same spin state and which ones prefer to have different spin states
  • the model is then simulated for a given number of steps, and the spin states of the particles in the end define the communities
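
A short sketch; spins is the number \(k\) of available spin states, an upper bound on the number of communities (the default is 25):

library(igraph)
g = make_graph("Zachary")

# the simulation is stochastic, so results may vary between runs
cs = cluster_spinglass(g, spins = 25)
modularity(cs)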

Label propagation

  • cluster_label_prop is a simple approach in which every node is initially assigned one of k labels
  • the method then proceeds iteratively, re-assigning labels to nodes so that each node takes the most frequent label of its neighbors, in a synchronous manner
  • the method stops when the label of each node is one of the most frequent labels in its neighborhood
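
A minimal sketch; since the update order is randomized, repeated runs may return different partitions:

library(igraph)
g = make_graph("Zachary")

cl = cluster_label_prop(g)
membership(cl)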

Infomap community finding

  • cluster_infomap is based on information theoretic principles
  • it tries to build a grouping which provides the shortest description length for a random walk on the graph, where the description length is measured by the expected number of bits per vertex required to encode the path of a random walk
  • when we describe a network as a set of interconnected communities, we are highlighting certain regularities of the network’s structure while filtering out the relatively unimportant details
  • a modular description of a network can be viewed as a lossy compression of that network’s topology
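
A short sketch; code_len() reports the description length that the method minimizes:

library(igraph)
g = make_graph("Zachary")

ci = cluster_infomap(g)
code_len(ci)     # description length of a random walk under this partition
modularity(ci)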

Multi-level optimization of modularity

  • cluster_louvain is based on the modularity measure and a hierarchical approach
  • initially, each vertex is assigned to a community on its own
  • in every step, vertices are re-assigned to communities in a local, greedy way: each vertex is moved to the community with which it achieves the highest contribution to modularity
  • when no vertices can be reassigned, each community is considered a vertex on its own, and the process starts again with the merged communities
  • the process stops when there is only a single vertex left or when the modularity cannot be increased any more in a step
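
A short sketch; the resolution argument (available in recent igraph versions) tunes the granularity, with 1 recovering classical modularity:

library(igraph)
g = make_graph("Zachary")

# values above 1 favor more, smaller communities
clv = cluster_louvain(g, resolution = 1)
modularity(clv)
sizes(clv)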

Optimal community structure

  • cluster_optimal calculates the optimal community structure for a graph, in terms of maximal modularity score
  • the calculation is done by transforming the modularity maximization into an integer programming problem, and then calling the GLPK library to solve that
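
A short sketch; note that this is feasible only for small graphs such as the Zachary network, and requires an igraph build with GLPK support:

library(igraph)
g = make_graph("Zachary")

co = cluster_optimal(g)
modularity(co)   # by construction, no partition scores higher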

Play

Run all community detection algorithms on the Zachary network and find the best one according to modularity.

Solution

library(igraph)
g = make_graph("Zachary")

m = rep(0, 9)
m[1] = modularity(cluster_leading_eigen(g))
m[2] = modularity(cluster_fast_greedy(g))
m[3] = modularity(cluster_edge_betweenness(g))
m[4] = modularity(cluster_walktrap(g))
m[5] = modularity(cluster_spinglass(g))
m[6] = modularity(cluster_label_prop(g))
m[7] = modularity(cluster_infomap(g))
m[8] = modularity(cluster_louvain(g))
m[9] = modularity(cluster_optimal(g))
names(m) = c("leading_eigen", 
             "fast_greedy", 
             "edge_betweenness", 
             "walktrap", 
             "spinglass",
             "label_prop", 
             "infomap", 
             "louvain", 
             "optimal")

sort(m, decreasing = TRUE)
##          optimal        spinglass          louvain          infomap 
##        0.4197896        0.4197896        0.4188034        0.4020381 
## edge_betweenness    leading_eigen      fast_greedy       label_prop 
##        0.4012985        0.3934089        0.3806706        0.3717949 
##         walktrap 
##        0.3532216

Hierarchical clustering

  • the basic idea behind hierarchical clustering is to define a measure of similarity or connection strength between vertices, based on the network structure
  • then start by joining together those pairs of vertices with the highest similarities, forming a group or groups of size two
  • then we further join together the groups that are most similar to form larger groups, and so on, until a unique group is formed that contains all nodes
  • this produces a hierarchical decomposition of a network into a set of nested (non-overlapping) communities, that can be visualized in the form of a dendrogram

Hierarchical clustering

But what we actually have is a measure of similarity between individual vertices, so we need to combine these vertex similarities somehow to create similarities for the groups.

There are three common ways to do so:

  • single-linkage: the similarity between two groups is defined to be the maximum of the similarities of pairs of vertices belonging to different groups
  • complete-linkage: the similarity between two groups is defined to be the minimum of the similarities of pairs of vertices belonging to different groups
  • average-linkage: the similarity between two groups is defined to be the average of the similarities of pairs of vertices belonging to different groups
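
These three rules map directly onto the method argument of R's hclust(), which works with dissimilarities rather than similarities (so maximum similarity becomes minimum distance). A sketch, here using a Pearson dissimilarity for illustration (as in the Play exercise below):

library(igraph)
g = make_graph("Zachary")
A = as_adjacency_matrix(g, sparse=FALSE)
d = as.dist(1 - cor(A))   # Pearson dissimilarity between vertex neighborhoods

cc_single   = hclust(d, method = "single")
cc_complete = hclust(d, method = "complete")
cc_average  = hclust(d, method = "average")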

Hierarchical clustering

Given a similarity measure for individual nodes and a clustering method to create similarities for groups of nodes, the hierarchical clustering method is as follows:

  1. evaluate the similarity measures for all vertex pairs
  2. assign each vertex to a group of its own, consisting of just that one vertex
  3. find the pair of groups with the highest similarity and join them together into a single group
  4. calculate the similarity between the new composite group and all others
  5. repeat from step 3 until all vertices have been joined into a single group

Code

Function hclust computes hierarchical clustering:

library(igraph)

# make graph
g = make_graph("Zachary")

# hierarchical clustering
A = as_adjacency_matrix(g, sparse=FALSE)

# cosine similarity between columns of A
euclidean = function(x) {sqrt(x %*% x)}  # Euclidean norm of a vector
norms = apply(A, 2, euclidean)
N = diag(1/norms)
S = N %*% t(A) %*% A %*% N

# distance matrix and distance object
d = as.dist(1 - S)

# average-linkage clustering method
cc = hclust(d, method = "average")

# plot dendrogram
plot(cc)

# draw blue borders around clusters
#clusters.list = rect.hclust(cc, k = 2, border="blue")

# cut dendrogram at 2 clusters
clusters = cutree(cc, k = 2)

# plot graph with clusters
coords = layout_with_fr(g)
plot(g, vertex.color=clusters, layout=coords)

# cut dendrogram at several heights (one column of memberships per value of k)
clusters = cutree(cc, k = 2:(vcount(g)/2))

Play

Compute hierarchical clustering on Zachary graph using Pearson similarity and compare with cosine similarity.

Solution

library(igraph)

# make graph
g = make_graph("Zachary")

# hierarchical clustering
A = as_adjacency_matrix(g, sparse=FALSE)

# cosine similarity
euclidean = function(x) {sqrt(x %*% x)}  # Euclidean norm of a vector
norms = apply(A, 2, euclidean)
N = diag(1/norms)
S1 = N %*% t(A) %*% A %*% N

# distance object
d1 = as.dist(1-S1)

# average-linkage clustering method
cc1 = hclust(d1, method = "average")

# Pearson similarity
S2 = cor(A)

# distance matrix
d2 = as.dist(1-S2)

# average-linkage clustering method
cc2 = hclust(d2, method = "average")


# plot dendrogram
plot(cc1)

plot(cc2)

# compare partitions by counting label disagreements (caveat: identical
# partitions could in principle use different label numberings)
clusters1 = cutree(cc1, k = 2)
clusters2 = cutree(cc2, k = 2)
sum(clusters1 != clusters2)
## [1] 0
clusters1 = cutree(cc1, k = 3)
clusters2 = cutree(cc2, k = 3)
sum(clusters1 != clusters2)
## [1] 0
clusters1 = cutree(cc1, k = 4)
clusters2 = cutree(cc2, k = 4)
sum(clusters1 != clusters2)
## [1] 21