Degree distribution

  • a property of the full-scale structure of a network that is typically investigated is the distribution of the network node degrees
  • for any integer \(k \geq 0\), the quantity \(p_k\) is the fraction of nodes having degree \(k\)
  • this is also the probability that a randomly chosen node in the network has degree \(k\)
  • the quantities \(p_k\), for \(k \geq 0\), represent the degree distribution of the network

Degree distribution

Consider for instance the following simple graph with 10 nodes where nodes are labelled with their degree:

The degree distribution is the following (notice that \(p_k = 0\) for all \(k \geq 5\)):

\[ p_0 = 0, p_1 = \frac{5}{10}, p_2 = \frac{3}{10}, p_3 = \frac{1}{10}, p_4 = \frac{1}{10}\]

Long tail distribution

  • in most real networks, the degree distribution is highly asymmetric (or skewed)
  • most of the nodes (the trivial many) have low degrees while a small but significant fraction of nodes (the vital few) have an extraordinarily high degree
  • a highly connected node, a node with remarkably high degree, is called hub
  • since the probability of hubs, although low, is significant, the degree distribution \(p_k\), when plotted as a function of the degree \(k\), shows a long tail, which is much fatter than the tail of a Gaussian or exponential model

Failures and attacks

  • this asymmetric shape of the degree distribution has important consequences for the processes taking place on networks
  • the highly connected nodes, the hubs of the network, are generally responsible for keeping the network connected. In other words, the network falls apart if the hubs are removed from the network.
  • since hubs are rare, a randomly chosen node is most likely not a hub, and hence the removal of random nodes from the network has a negligible effect on the network cohesion
  • substantially, networks with long tail degree distributions are resilient to failures – random removal of nodes – but vulnerable to attacks – removal of the the hub nodes

Information spread

  • hubs are also important for the spread of information or of any other quantity flowing on the network (like a desease)
  • in fact, hubs play a dual role in information diffusion over the network:
    1. on the one hand, since they are highly connected, they quickly harvest information
    2. on the other hand, and for the same reason, they effectively spread it
  • in a network with hub nodes, the probability that each node spreads the information to its neighbours need not be large for the information to reach the whole community

Quantitative measures of skewness

  • skewness measures the symmetry of a distribution
  • a distribution is symmetric if the values are equally distributed around its mean
  • it is right skewed if it contains many low values and a relatively few high values
  • it is left skewed if it comprises many high values and a relatively few low values
  • as a rule of thumb, when the mean is larger than the median the distribution is right skewed and when the median dominates the mean the distribution is left skewed

Quantitative measures of skewness

A numerical indicator of skewness of an empirical distribution \(X = (x_1, \ldots, x_n)\) is the third standardized central moment of the sample:

\[ {\frac {m_{3}}{s^{3}}}={\frac {{\tfrac {1}{n}}\sum _{i=1}^{n}(x_{i}-{\overline {x}})^{3}}{{\sqrt {{\tfrac {1}{n-1}}\sum _{i=1}^{n}(x_{i}-{\overline {x}})^{2}}}^{\,3}}}\]

  • this formula can be thought of as the average cubed deviation in the sample divided by the cubed sample standard deviation.
  • positive values for the skewness indicator correspond to right skewness (since there are outliers far above the mean)
  • negative values correspond to left skewness (since there are outliers far below the mean)
  • values close to 0 mean symmetry.

Play

  1. Write a function that computed the sample skewness of a given distribution
  2. Test the skewness of the degree distribution of Zachary network
  3. Compare mean and median of the distribution

Solution

library(igraph)
skewness = function(x) mean( (x - mean(x))^3 ) / sd(x)^3

g = make_graph("Zachary")
d = degree(g)
skewness(d)
## [1] 1.913312
summary(d)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.588   5.000  17.000

Power laws

To be sure, the most popular long tail probability distribution is the power law.

For a degree distribution, it states that the probability \(p_k\) of having a node with \(k\) neighbours is:

\[p_k = C k^{-\alpha}\]

where \(C\) is a normalization constant and \(\alpha\) is known as the exponent of the power law.

Values in the range \(2 \leq \alpha \leq 3\) are typical for degree distributions networks.

Notice that, taking the logarithm of both members of the power law definition, we have:

\[\log p_k = -\alpha \log k + log C\]

Hence, if \(p_k\) follows a power law and we plot \(p_k\) as a function of \(k\) on log-log scales (both axes are logarithmic), then we should notice a straight line of slope \(-\alpha\).

Power laws

An alternative (and more convenient) method to visualize and detect a power law behaviour is to plot the complementary cumulative distribution function (CCDF) on log-log scales.

The CCDF \(P_k\) is the fraction of vertices that have degree \(k\) or greater:

\[P_k = \sum_{x = k}^{\infty} p_{x}\]

Notice that, if \(p_k = C k^{-\alpha}\) and \(\alpha > 1\), then:

\[P_k = \sum_{x = k}^{\infty} p_{x} = C \sum_{x = k}^{\infty} x^{-\alpha} \simeq C \int_{k}^{\infty} x^{-\alpha} dx = \frac{C}{\alpha -1} k^{-(\alpha-1)}\]

It follows that if a distribution follows a power law, then so does the CCDF of the distribution, but with exponent one less than the original exponent.

Hence, when plotted on log-log scales, the CCDF of a power law should appear as a straight line.

Scale-free networks

  • in practice, few empirical phenomena obey power laws on the entire domain
  • more often the power law applies only for values greater than or equal to some minimum location
  • in such case, we say that the tail of the distribution follows a power law
  • networks that have a power law degree distribution are sometimes called scale-free networks
  • the name comes from the fact that, as opposed to random networks, scale-free networks possess no characteristic scale, meaning that there is no typical node in the network that represents the degree for the other nodes

Preferential attachment networks

  • the degree distribution of a graph generated with the preferential attachment model asymptotically follows a power law distribution
  • the exponent parameter of the power law is \(\alpha = 3\)

Random networks

As for the network models, consider a random graph with \(n\) nodes and edge probability \(p\).

The probability \(p_k\) that a node has \(k\) neighbours is the binomial

\[p_k = {n \choose k} p^k (1-p)^{n-1-k}\]

which can be approximated by the Poisson distribution \[p_k = \frac{\lambda^k e^{-\lambda}}{k!}\] for large values of \(n\) and small values of \(p\), where \(\lambda = p(n-1)\) is the mean degree of a node in the network.

  1. the Poisson distribution is peaked at the average degree \(\lambda\): the majority of nodes have the same typical degree and nodes deviating from the average are extremely rare
  2. in other words, a random network has a characteristic scale, which contrasts with the scale-free nature of real webs

Code - Preferential attachment networks

# Cumulative distribution
ccdf = function(d) {
  n = length(d)
  max = max(d)
  p = rep(0, max)
  for (i in 1:length(p)) {
    p[i] = length(d[d >= i]) / n
  } 
  return(p)
}

links = 3
g = sample_pa(n=1000, m=links, directed=FALSE)
d = degree(g)
hist(d, breaks = 30)

p = ccdf(d)
# nodes have at least links neighbours 
plot(links:max(d), p[links:length(p)], log="xy", type = "l", xlab="Degree", ylab="CCDF")

Code - Random networks

g = sample_gnp(n=1000, p = 10/1000, directed=FALSE)
d = degree(g)
hist(d)

p = ccdf(d)
plot(links:max(d), p[links:length(p)], 
     log="xy", type = "l", xlab="Degree", ylab="CCDF")

Play