Hierarchical clustering

The basic idea behind hierarchical clustering is to define a measure of similarity or connection strength between vertices, based on the network structure, and then join together the closest or most similar vertices to form groups. Any of the measures of similarity, for instance, may be used. This method is implemented in R in the function hclust of package stats.

The basic strategy adopted by the hierarchical clustering method is to start by joining together those pairs of vertices with the highest similarities, forming a group or groups of size two. Then we further join together the groups that are most similar to form larger groups, and so on, until a unique group is formed that contains all nodes. This produces a hierarchical decomposition of a network into a set of nested communities, that can be visualized in the form of a dendrogram.

But what we actually have is a measure of similarity between individual vertices, so we need to combine these vertex similarities somehow to create similarities for the groups. There are three common ways to do so:

Given a similarity measure for individual nodes and a clustering method to create similarities for groups of nodes, the hierarchical clustering method is as follows:

  1. Evaluate the similarity measures for all vertex pairs.
  2. Assign each vertex to a group of its own, consisting of just that one vertex.
  3. Find the pair of groups with the highest similarity and join them together into a single group.
  4. Calculate the similarity between the new composite group and all others.
  5. Repeat from step 3 until all vertices have been joined into a single group.

Given two groups, the calculation of the similarity among them is relatively straightforward and can be performed in $O(1)$ for all the clustering methods described above. On each step of the algorithm we have to calculate similarities in this way for the composite group with every other group, of which there are $O(n)$. Hence the recalculation of similarities will take $O(n)$ time on each step. A naive search through the similarities to find the greatest one, on the other hand, takes time $O(n^2)$, since there are $O(n^2)$ pairs of groups to check, so this will be the most time-consuming step in the algorithm. We can speed things up, however, by storing the similarities in a binary heap, which allows us to add and remove entries in time $O(\log n)$ and find the greatest one in time $O(1)$. This slows the recalculation of the similarities to $O(n \log n)$ but speeds the search for the largest to $O(1)$. Then the whole process of joining groups has to be repeated $n - 1$ times until all vertices have been joined into a single group. Thus the total running time of the algorithm is $O(n^3)$ in the naive implementation or $O(n^2 \log n)$ if we use a heap. For the special case of single-linkage clustering, there is a slightly faster way to implement the algorithm that makes use of the union/find technique and runs in time $O(n^2)$.