Centrality: usual suspects

Overview

A large volume of research on networks has been devoted to the concept of centrality. This research addresses the question:

Which are the most important nodes or edges in a network?

Overview

what are the important Web pages about a certain topic to be delivered by a search engine?
which are the popular actors in a social networks?
what are the influential academic papers in a discipline?
what are the crucial proteins in an living being?
which are the vital species in an ecosystem?
which are the indispensable routers in a computer network?

Degree centrality

degree is a simple centrality measure that counts how many neighbors a node has
if the network is directed, we have two versions of the measure: in-degree is the number of in-coming links; out-degree is the number of out-going links
typically, we are interested in in-degree, since in-links are assigned by other nodes in the network, while out-links are determined by the node itself.

Degree centrality

The thesis of degree centrality reads as follows:

A node is important if it has many neighbors, or, in the directed case, if there are many other nodes that link to it.

Math

Let \(A = (a_{i,j})\) be the adjacency matrix of a directed graph and \(e\) is a vector with all components equal to unity.

The in-degree centrality \(x_{i}\) of node \(i\) is given by:

\[x_{i} = \sum_k a_{k,i}\]

or in matrix form:

\[x = e A\]

The out-degree centrality \(y_{i}\) of node \(i\) is given by:

\[y_{i} = \sum_k a_{i,k}\]

or in matrix form:

\[y = e A^T\]

Code

Function degree computes unweighted degree centrality
Function strength computes weighted degree centrality

library(igraph)
edges = c(1,3, 1,4, 2,5, 3,2, 4,4, 4,5, 5,3, 5,6, 5,4)
dg = graph(edges, directed=TRUE)
coords = layout_with_fr(dg)
plot(dg, layout=coords,
     vertex.size = 20, vertex.color = "lightblue", 
     edge.width = 3, edge.color = "black")

degree(dg, mode = "in")

## [1] 0 1 2 3 2 1

degree(dg, mode = "out")

## [1] 2 1 1 2 3 0

degree(dg, mode = "total")

## [1] 2 2 3 5 5 1

edges = c(1,2, 1,5, 2,3, 2,4, 3,4, 3,5, 3,6)
wg = graph(edges, directed=TRUE)
E(wg)$weight = c(1, 5, 2, 8, 4, 5, 6)
coords = layout_with_fr(wg)
plot(wg, layout=coords,
     vertex.size = 20, vertex.color = "lightblue", 
     edge.width = 1, edge.color = "grey", edge.label = E(wg)$weight)

strength(wg, mode = "in")

## [1]  0  1  2 12 10  6

strength(wg, mode = "out")

## [1]  6 10 15  0  0  0

strength(wg, mode = "total")

## [1]  6 11 17 12 10  6

Play

Write a function degreeCentrality that, given an adjacency matrix \(A\), computes the in, out and total degree of the corresponding graph (without using the degree function).

Solution

# Degree centrality
# INPUT
# g = graph
# mode = degree mode 
degreeCentrality = function(A, mode) {
  n = nrow(A)
  e = rep(1, n);
  if (mode == "in") x = e %*% A
  if (mode == "out") x = e %*% t(A) 
  if (mode == "total") x = e %*% (A + t(A)); 
  return(as.vector(x));
}  

(A = as_adjacency_matrix(dg, sparse=FALSE))

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    0    0    1    1    0    0
## [2,]    0    0    0    0    1    0
## [3,]    0    1    0    0    0    0
## [4,]    0    0    0    1    1    0
## [5,]    0    0    1    1    0    1
## [6,]    0    0    0    0    0    0

degreeCentrality(A, mode = "in")

## [1] 0 1 2 3 2 1

degreeCentrality(A, mode = "out")

## [1] 2 1 1 2 3 0

degreeCentrality(A, mode = "total")

## [1] 2 2 3 5 5 1

Play

The friendship paradox is a phenomenon observed by sociologist Scott L. Feld. It claims that, on average,

I have fewer friends than my friends have :-(

On a social network, it formally says that the mean node degree:

\[\mu_1 = \frac{\sum_i k_i}{n}\]

is less than or equal to the mean neighbors degree:

\[\mu_2 = \frac{\sum_{i,j} a_{i,j} k_j}{\sum_{i,j} a_{i,j}}\]

Prove that it always holds that \(\mu_1 \leq \mu_2\)
When is \(\mu_1 = \mu_2\)?

Hint

We have that:

\[\mu_2 = \frac{\sum_{i,j} a_{i,j} k_j}{\sum_{i,j} a_{i,j}} = \frac{\sum_{j} \sum_{i} a_{i,j} k_j}{\sum_{j} \sum_{i} a_{i,j}} = \frac{\sum_j k_j \sum_i a_{i,j}}{\sum_j k_j} = \frac{\sum_j k_j \cdot k_j}{\sum_i k_i} = \frac{\langle k^2 \rangle}{\langle k \rangle}\]

where \(\langle k^2 \rangle\) is the quadratic mean of degrees and \(\langle k \rangle\) is the mean of degrees.

Also recall that:

\[\mathrm{Var}(X) = \\ E[(X - E[X])^2] = \\ E[X^2 - 2XE[X] + E[X]^2] = \\ E[X^2] - 2E[X]E[X] + E[X]^2 = \\ E[X^2] - E[X]^2\] In other words, the variance of X is equal to the mean of the square of X minus the square of the mean of X.

Solution

\[\mu_2 - \mu_1 = \frac{\langle k^2 \rangle}{\langle k \rangle} - \langle k \rangle = \frac{\langle k^2 \rangle - \langle k \rangle^2}{\langle k \rangle} = \frac{\sigma^{2}}{\langle k \rangle} \geq 0\]

Since both the variance \(\sigma^2\) and the mean \(\langle k \rangle\) of the degree distribution are positive quantities, we have that the inequality holds.

In particular we have that \(\mu_1 = \mu_2\) if and only if \(\sigma^2 = 0\) if and only if all nodes have the same degree (the graph is regular). On the other hand, when the distribution of degrees is skewed (the typical case), the inequality is proper.

This because hub nodes, which are nodes with a high degree, are counted once in \(\mu_1\), but are counted many times in \(\mu_2\), because they are, by definition, friends on many others.

library(igraph)
# scale-free network
g = sample_pa(100, m=2)
d = degree(g)
sd(d)

## [1] 4.737173

(mu1 = mean(d))

## [1] 3.94

(mu2 = mean(d*d) / mean(d))

## [1] 9.57868

#random network
g = sample_gnm(100, 200)
d = degree(g)
sd(d)

## [1] 2.169578

(mu1 = mean(d))

## [1] 4

(mu2 = mean(d*d) / mean(d))

## [1] 5.165

Closeness centrality

Closeness centrality measures the mean distance from a vertex to other vertices.

Given nodes \(i\) and \(j\), the distance \(d_{i,j}\) between them is the length of a shortest path from \(i\) to \(j\).

The mean distance for vertex \(i\) to others is:

\[l_i = \frac{1}{n-1} \sum_{j\neq i} d_{i,j}\] and the closeness centrality for \(i\) is:

\[C_i = \frac{1}{l_i} = \frac{n-1}{\sum_{j\neq i} d_{i,j}}\]

Play

when does the mean distance take maximum value and what is this value?
when does the mean distance take minimum value and what is this value?

Solution

The maximum value for mean distance, hence the minimum for closeness, occurs for a root of a chain graph, a graph where the \(n\) nodes form a linear sequence or chain, where the roots are the two ends of the chain:

\[\frac{1}{n-1} \sum_{i=1}^{n-1} i = \frac{(n-1) n}{2 (n-1)} = \frac{n}{2}\]

The minimum value for mean distance, hence the maximum for closeness, occurs for the central node of a star graph, a network composed of a vertex attached to \(n-1\) other vertices, whose only connection is with the central node:

\[\frac{1}{n-1} \sum_{i=1}^{n-1} 1 = \frac{n-1}{n-1} = 1\]

Code

Function closeness computes closeness centrality (this function assigns a distance equal to the number of nodes of the graph to pairs of nodes that are not reachable).

g = make_lattice(c(5,5))
plot(g)

x = closeness(g)
names(x) = 1:vcount(g)
round(x, 3)

##     1     2     3     4     5     6     7     8     9    10    11    12    13 
## 0.010 0.012 0.013 0.012 0.010 0.012 0.014 0.015 0.014 0.012 0.013 0.015 0.017 
##    14    15    16    17    18    19    20    21    22    23    24    25 
## 0.015 0.013 0.012 0.014 0.015 0.014 0.012 0.010 0.012 0.013 0.012 0.010

Betweenness centrality

betweenness centrality measures the extent to which a vertex lies on paths between other vertices
vertices with high betweenness may have considerable influence within a network by virtue of their control over information passing between others
they are also the ones whose removal from the network will most disrupt communications between other vertices because they lie on the largest number of paths taken by messages

Betweenness centrality

Let \(n_{s,t}^{i}\) be the number of shortest paths from \(s\) to \(t\) that pass through \(i\) and let \(n_{s,t}\) be the total number of geodesic paths from \(s\) to \(t\).

Then the betweenness centrality of vertex \(i\) is:

\[b_i = \sum_{\begin{array}{l} s\neq t \\ s \neq i \\ t \neq i\end{array}} w_{s,t}^{i} = \sum_{\begin{array}{l} s\neq t \\ s \neq i \\ t \neq i\end{array}} \frac{n_{s,t}^{i}}{n_{s,t}}\]

where the formula counts undirected paths in only one direction and, by convention, the ratio \(w_{s,t}^{i} = 0\) if \(n_{s,t} = 0\).

Notice that each pair of vertex \(s, t\) contribute to the sum for \(i\) with a weight \(w_{s,t}^{i}\) between 0 and 1 expressing the betweenness of \(i\) with respect to the pair \(s, t\).

Play

when does the mean betweenness take maximum value and what is this value?
when does the mean betweenness take minimum value and what is this value?

Solution

The maximum possible value for betweenness occurs for the central node of a star graph. All paths go through the central vertex, hence its betweenness is

\[\sum_{i=1}^{n-2} i = (n-2) (n-1) / 2\]

At the other end of the scale, the smallest possible value for betweenness occurs for a leaf node (a node of degree 1). There are no shortest paths containing the leaf node, hence its betweenness is 0.

Code

Function betweenness computes betweenness centrality.

g = make_star(5, mode = "undirected")
plot(g)

x = betweenness(g)
names(x) = 1:vcount(g)
round(x, 1)

## 1 2 3 4 5 
## 6 0 0 0 0

g = make_lattice(c(5,5))
plot(g)

x = betweenness(g)
names(x) = 1:vcount(g)
round(x, 0)

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
##  3 17 22 17  3 17 45 55 45 17 22 55 66 55 22 17 45 55 45 17  3 17 22 17  3

Dig deeper

Typically, shortest paths are considered in the definition of both closeness and betweenness. There are two drawbacks of this approach:

all paths (even slightly) longer than the shortest ones are not considered at all
the actual number of shortest paths that lie among the two vertices is irrelevant

In many applications, however, it is reasonable to consider the abundance and the length of all paths of the network, since communication on the network is enhanced as soon as more routes are possible, in particular if these pathways are short.

Current flow centrality addresses these issues, you can read more in this article: Resistance distance, closeness, and betweenness.