Tidy network data?

  • network data and the tidy data idea are at odds: a network cannot be meaningfully encoded as a single tidy data frame
  • on the other hand, node data and edge data each fit the tidy concept well, since each node and each edge is, in a sense, a single observation
  • thus, a close approximation of tidiness for network data is a pair of tidy data frames, one describing the nodes and one describing the edges
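
As a minimal sketch of this two-table idea, a three-person network can be written as one tibble of nodes and one tibble of edges. The names toy_nodes and toy_edges and all values are invented for illustration.

```r
library(tibble)

# one row per node: each node is a single observation
toy_nodes <- tibble(
  name = c("Ann", "Bob", "Cid"),
  age  = c(16, 17, 16)
)

# one row per edge: from and to hold row positions in toy_nodes
toy_edges <- tibble(
  from = c(1, 1, 2),
  to   = c(2, 3, 3),
  year = c(1957, 1957, 1958)
)

toy_nodes
toy_edges
```

Neither table alone describes the network: the edge table only makes sense together with the node table it points into.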

tidygraph

  • tidygraph is an entry into the tidyverse that provides a tidy framework for network (graph) data
  • tidygraph provides an approach to manipulate node and edge data frames using the interface defined in the dplyr package
  • moreover, it provides tidy interfaces to many common graph algorithms, including those of the igraph network analysis toolkit
  • it is developed by Thomas Lin Pedersen
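
To illustrate what a tidy interface to an igraph algorithm looks like, the sketch below computes node degree twice on a small ring graph: once with the tidygraph wrapper inside mutate(), and once by calling igraph directly. The ring graph and the variable names are chosen only for this example.

```r
library(dplyr)
library(tidygraph)

# a ring of 5 nodes, built with a tidygraph helper
ring <- create_ring(5)

# tidygraph wrapper: called inside mutate(), the result becomes a node column
ring <- ring %>%
  mutate(degree = centrality_degree())

# the same metric computed by calling igraph directly on the graph object
igraph_degree <- as.numeric(igraph::degree(ring))

ring %>% activate(nodes) %>% as_tibble()
igraph_degree
```

The wrapper needs no graph argument because it picks up the graph being manipulated in the surrounding verb.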

ggraph

  • ggraph is an extension of ggplot2 that implements a visualization grammar for network data
  • it provides a wide variety of geoms for drawing nodes and edges, along with an assortment of layouts, making it possible to produce many types of network visualization
  • while tidygraph provides a manipulation and analysis grammar for network data (like dplyr for tabular data), ggraph offers a visualization grammar (like ggplot2 for tabular data)
  • it is developed by Thomas Lin Pedersen

A full example: dplyr, tidygraph and ggraph

We are going to use the highschool dataset (shipped with ggraph), which records friendships among high school boys as assessed by the question:

What fellows here in school do you go around with most often?

The question was posed twice, one year apart (1957 and 1958), so the data show how friendships evolved between the two time points.

library(dplyr)
library(ggraph)
library(tidygraph)

# setting the graph theme
set_graph_style()
# a graph of highschool friendships
head(highschool)
##   from to year
## 1    1 14 1957
## 2    1 15 1957
## 3    1 21 1957
## 4    1 54 1957
## 5    1 55 1957
## 6    2 21 1957
# create the graph and add popularity (in-degree) using dplyr verbs and tidygraph
graph <- as_tbl_graph(highschool) %>% 
    mutate(Popularity = centrality_degree(mode = "in"))

# print the graph
graph
## # A tbl_graph: 70 nodes and 506 edges
## #
## # A directed multigraph with 1 component
## #
## # A tibble: 70 × 2
##   name  Popularity
##   <chr>      <dbl>
## 1 1              2
## 2 2              0
## 3 3              0
## 4 4              4
## 5 5              5
## 6 6              2
## # ℹ 64 more rows
## #
## # A tibble: 506 × 3
##    from    to  year
##   <int> <int> <dbl>
## 1     1    13  1957
## 2     1    14  1957
## 3     1    20  1957
## # ℹ 503 more rows
# plot the graph (using ggraph)
ggraph(graph, layout = "kk") + 
    # after_stat() replaces the deprecated stat() in recent ggplot2 versions
    geom_edge_link(aes(alpha = after_stat(index)), show.legend = FALSE) + 
    geom_node_point(aes(size = Popularity)) + 
    facet_edges(~year) + 
    theme_graph(foreground = "steelblue", fg_text_colour = "white")

Read the graph with tidygraph

Let’s read a dolphin network:

  1. a set of nodes representing dolphins (dolphin_nodes.csv)
  2. a set of edges representing ties among dolphins (dolphin_edges.csv)

The tidygraph package represents a graph as a pair of data frames:

  • a data frame for nodes containing information about the nodes in the graph
  • a data frame for edges containing information about the edges in the graph. The two end nodes of each edge must be encoded either in columns named from and to or in the first two columns, as integers. These integers are row indices into the node data frame.

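
The sketch below, on invented toy data, shows the integer encoding described above. It also shows the character encoding that the directed-graph example later in these notes relies on, where from and to hold node names that tidygraph matches against the name column of the node data frame.

```r
library(dplyr)
library(tidygraph)

# toy node table (invented data)
toy_nodes <- tibble(name = c("Ann", "Bob", "Cid"))

# edges given as integer row positions in the node table
by_index <- tbl_graph(
  nodes = toy_nodes,
  edges = tibble(from = c(1, 1), to = c(2, 3)),
  directed = FALSE
)

# the same edges given as node names
by_name <- tbl_graph(
  nodes = toy_nodes,
  edges = tibble(from = c("Ann", "Ann"), to = c("Bob", "Cid")),
  directed = FALSE
)

# in both cases the graph stores the end nodes as integer indices
by_index %>% activate(edges) %>% as_tibble()
by_name %>% activate(edges) %>% as_tibble()
```
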
library(readr)

nodes = read_csv("dolphin_nodes.csv")
edges = read_csv("dolphin_edges.csv")

nodes
## # A tibble: 62 × 2
##    name       sex  
##    <chr>      <chr>
##  1 Beak       M    
##  2 Beescratch M    
##  3 Bumper     M    
##  4 CCL        F    
##  5 Cross      M    
##  6 DN16       F    
##  7 DN21       M    
##  8 DN63       M    
##  9 Double     F    
## 10 Feather    M    
## # ℹ 52 more rows
edges
## # A tibble: 159 × 2
##        x     y
##    <dbl> <dbl>
##  1     4     9
##  2     6    10
##  3     7    10
##  4     1    11
##  5     3    11
##  6     6    14
##  7     7    14
##  8    10    14
##  9     1    15
## 10     4    15
## # ℹ 149 more rows
# add edge type
edges = 
  edges %>% 
  mutate(type = sample(c("love", "friendship"), 
                       nrow(edges), 
                       replace = TRUE) )

# make a tidy graph
dolphin = tbl_graph(nodes = nodes, edges = edges, directed = FALSE)
dolphin
## # A tbl_graph: 62 nodes and 159 edges
## #
## # An undirected simple graph with 1 component
## #
## # A tibble: 62 × 2
##   name       sex  
##   <chr>      <chr>
## 1 Beak       M    
## 2 Beescratch M    
## 3 Bumper     M    
## 4 CCL        F    
## 5 Cross      M    
## 6 DN16       F    
## # ℹ 56 more rows
## #
## # A tibble: 159 × 3
##    from    to type 
##   <int> <int> <chr>
## 1     4     9 love 
## 2     6    10 love 
## 3     7    10 love 
## # ℹ 156 more rows
# extract node and edge data frames from the graph
as.list(dolphin)
# extract node data frame from the graph
as.list(dolphin)$nodes
# extract edge data frame from the graph
as.list(dolphin)$edges
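
An alternative way to pull out a single table, sketched here on a toy graph with invented data, is to activate the table of interest and convert it with as_tibble().

```r
library(dplyr)
library(tidygraph)

# toy stand-in for the dolphin graph (invented data)
toy <- tbl_graph(
  nodes = tibble(name = c("a", "b", "c")),
  edges = tibble(from = c(1, 2), to = c(2, 3)),
  directed = FALSE
)

# node data frame: activate the nodes, then convert
node_df <- toy %>% activate(nodes) %>% as_tibble()

# edge data frame: activate the edges, then convert
edge_df <- toy %>% activate(edges) %>% as_tibble()

node_df
edge_df
```

The result is an ordinary tibble, so any further dplyr or ggplot2 code can work on it without knowing it came from a graph.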

ggraph components

ggraph builds upon three core concepts that are quite easy to understand:

  • the layout defines how nodes are placed on the plot. ggraph has access to all layout functions available in igraph and many more
  • the nodes are the connected entities in the graph structure. These can be plotted using the geom_node_*() family of geoms
  • the edges are the connections between the entities in the graph structure. These can be visualized using the geom_edge_*() family of geoms
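
To make the layout concept concrete, the sketch below computes a layout explicitly with create_layout() and inspects it before plotting. The ring graph and the "circle" layout are chosen only for illustration.

```r
library(ggraph)
library(tidygraph)

# a small ring graph as a stand-in for real data
toy <- create_ring(6)

# the layout is a data frame with one row per node and x, y coordinates
lay <- create_layout(toy, layout = "circle")
head(lay)

# nodes and edges are then drawn on top of the precomputed layout
p <- ggraph(lay) + 
  geom_edge_link() + 
  geom_node_point()
```

Calling ggraph(graph, layout = "circle") directly is the usual shortcut; computing the layout first is mainly useful for inspecting or reusing the coordinates.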

ggraph basics

# basic plot
ggraph(dolphin) + 
  geom_edge_link() + 
  geom_node_point()

# plot edge type
ggraph(dolphin) + 
  geom_edge_link(aes(color = type)) + 
  geom_node_point()

# plot node sex
ggraph(dolphin) + 
  geom_edge_link(aes(color = type)) + 
  geom_node_point(aes(shape = sex))

# plot node name
ggraph(dolphin) + 
  geom_edge_link() + 
  geom_node_point() + 
  geom_node_text(aes(label = name), repel=TRUE)

Faceting

Faceting creates sub-plots according to the values of a qualitative attribute of the nodes or edges.

# facet edges by type
ggraph(dolphin) + 
  geom_edge_link(aes(color = type)) + 
  geom_node_point() +
  facet_edges(~type)

# facet nodes by sex
ggraph(dolphin) + 
  geom_edge_link() + 
  geom_node_point(aes(color = sex)) +
  facet_nodes(~sex)

# facet both nodes and edges
ggraph(dolphin) + 
  geom_edge_link() + 
  geom_node_point() +
  facet_graph(type~sex) + 
  th_foreground(border = TRUE)

Directed graphs

# directed graphs
# nodes are R packages (the plotting package is named ggplot2)
package = tibble(
  name = c("igraph", "ggraph", "dplyr", "ggplot2", "tidygraph")
)

# edges refer to nodes by name
tie = tibble(
  from = c("igraph", "ggplot2", "igraph", "dplyr", "ggraph"),
  to =   c("tidygraph", "ggraph", "tidygraph", "tidygraph", "tidygraph")
)

tidy = tbl_graph(nodes = package, edges = tie, directed = TRUE)


# use arrows for directions
ggraph(tidy, layout = "graphopt") + 
    geom_edge_link(aes(start_cap = label_rect(node1.name), 
                       end_cap = label_rect(node2.name)), 
                   arrow = arrow(type = "closed", 
                                 length = unit(3, "mm"))) + 
    geom_node_text(aes(label = name))

# use edge alpha to indicate direction: 
# each edge fades in from its source node (lighter) to its target node (darker)
# after_stat() replaces the deprecated stat() in recent ggplot2 versions
ggraph(tidy, layout = 'graphopt') + 
    geom_edge_link(aes(start_cap = label_rect(node1.name), 
                       end_cap = label_rect(node2.name), 
                       alpha = after_stat(index)), 
                   show.legend = FALSE) + 
    geom_node_text(aes(label = name))

Hierarchical layouts

# This dataset contains the graph that describes the class 
# hierarchy for the Flare visualization library
head(flare$vertices)
##                                           name size             shortName
## 1 flare.analytics.cluster.AgglomerativeCluster 3938  AgglomerativeCluster
## 2   flare.analytics.cluster.CommunityStructure 3812    CommunityStructure
## 3  flare.analytics.cluster.HierarchicalCluster 6714   HierarchicalCluster
## 4            flare.analytics.cluster.MergeEdge  743             MergeEdge
## 5  flare.analytics.graph.BetweennessCentrality 3534 BetweennessCentrality
## 6           flare.analytics.graph.LinkDistance 5731          LinkDistance
head(flare$edges)
##                      from                                           to
## 1 flare.analytics.cluster flare.analytics.cluster.AgglomerativeCluster
## 2 flare.analytics.cluster   flare.analytics.cluster.CommunityStructure
## 3 flare.analytics.cluster  flare.analytics.cluster.HierarchicalCluster
## 4 flare.analytics.cluster            flare.analytics.cluster.MergeEdge
## 5   flare.analytics.graph  flare.analytics.graph.BetweennessCentrality
## 6   flare.analytics.graph           flare.analytics.graph.LinkDistance
# flare class hierarchy
graph = tbl_graph(edges = flare$edges, nodes = flare$vertices)

# dendrogram
ggraph(graph, layout = "dendrogram") + 
  geom_edge_diagonal()

# circular dendrogram
# notice the "dynamic" variable leaf: it is computed by the layout, not stored in the node data
ggraph(graph, layout = "dendrogram", circular = TRUE) + 
  geom_edge_diagonal() + 
  geom_node_point(aes(filter = leaf)) + 
  coord_fixed()

# rectangular tree map
# notice the "dynamic" variable depth: it is computed by the layout, not stored in the node data
ggraph(graph, layout = "treemap", weight = size) + 
  geom_node_tile(aes(fill = depth), size = 0.25)

# circular tree map
ggraph(graph, layout = "circlepack", weight = size) + 
  geom_node_circle(aes(fill = depth), size = 0.25, n = 50) + 
  coord_fixed()

# icicle
ggraph(graph, layout = "partition") + 
  geom_node_tile(aes(y = -y, fill = depth))

# sunburst (circular icicle)
ggraph(graph, layout = "partition", circular = TRUE) +
  geom_node_arc_bar(aes(fill = depth)) +
  coord_fixed()

Network analysis with tidygraph

  • the data frame representation of a graph can be easily augmented with metrics computed on the graph
  • before computing a metric on nodes or edges, use the activate() function to select either the node or the edge data frame
  • use the dplyr verbs filter(), arrange() and mutate() to manipulate the active data frame
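
The steps above can be sketched on a toy graph (invented data) before applying them to a real network. Note that filtering out a node also removes the edges attached to it.

```r
library(dplyr)
library(tidygraph)

# toy graph: a path a - b - c plus an isolated node d
toy <- tbl_graph(
  nodes = tibble(name = c("a", "b", "c", "d")),
  edges = tibble(from = c(1, 2), to = c(2, 3)),
  directed = FALSE
)

toy <- toy %>% 
  activate(nodes) %>%                        # verbs below act on the node table
  mutate(degree = centrality_degree()) %>% 
  filter(degree > 0) %>%                     # drops the isolated node d
  arrange(desc(degree)) %>%                  # most connected node first
  activate(edges) %>%                        # verbs below act on the edge table
  mutate(weight = 1)

toy
```
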

Network analysis with tidygraph: example

dolphin = 
  dolphin %>% 
  activate(nodes) %>% 
  mutate(degree = centrality_degree()) %>% 
  filter(degree > 0) %>% 
  arrange(-degree) %>% 
  activate(edges) %>% 
  mutate(betweenness = centrality_edge_betweenness(), 
         # .N() gives access to the node data while the edges are active
         homo = (.N()$sex[from] == .N()$sex[to])) %>% 
  arrange(-betweenness)

dolphin
## # A tbl_graph: 62 nodes and 159 edges
## #
## # An undirected simple graph with 1 component
## #
## # A tibble: 159 × 5
##    from    to type       betweenness homo 
##   <int> <int> <chr>            <dbl> <lgl>
## 1    10    17 friendship        283. FALSE
## 2    13    29 friendship        219. FALSE
## 3     6    10 friendship        184. TRUE 
## 4     2    17 friendship        181. TRUE 
## 5    17    48 friendship        173. TRUE 
## 6     9    48 love              146. FALSE
## # ℹ 153 more rows
## #
## # A tibble: 62 × 3
##   name    sex   degree
##   <chr>   <chr>  <dbl>
## 1 Grin    F         12
## 2 SN4     F         11
## 3 Topless M         11
## # ℹ 59 more rows

Analyse and visualize network: centrality

The tidygraph and ggraph packages can be chained in a single pipeline to perform analysis and visualization in one go.

dolphin %>% 
  activate(nodes) %>%
  mutate(pagerank = centrality_pagerank()) %>%
  activate(edges) %>%
  mutate(betweenness = centrality_edge_betweenness()) %>%
  ggraph() +
  geom_edge_link(aes(alpha = betweenness)) +
  geom_node_point(aes(size = pagerank, colour = pagerank)) + 
  # discrete colour legend
  scale_color_gradient(guide = "legend")

Analyse and visualize network: communities

# visualize communities of nodes
dolphin %>% 
  activate(nodes) %>%
  mutate(community = as.factor(group_louvain())) %>% 
  ggraph() + 
  geom_edge_link() + 
  geom_node_point(aes(colour = community), size = 5)
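
Besides plotting, the community labels can be summarised as an ordinary table. The sketch below is self-contained, so it uses Zachary's karate club graph (built with create_notable()) as a stand-in for the dolphin network. The Louvain result may vary between runs or igraph versions, so no output is shown.

```r
library(dplyr)
library(tidygraph)

# stand-in network: Zachary's karate club (34 nodes)
karate <- create_notable("zachary") %>% 
  mutate(community = as.factor(group_louvain()))

# number of nodes in each community, largest first
community_sizes <- karate %>% 
  activate(nodes) %>% 
  as_tibble() %>% 
  count(community, sort = TRUE)

community_sizes
```
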