Tidygraph

There is a mismatch between relational data and the tidy data idea: relational data cannot be meaningfully encoded as a single tidy data frame. Node data and edge data taken separately, however, each fit the tidy concept well, since each node and each edge is, in a sense, a single observation. A close approximation of tidiness for relational data is therefore two tidy data frames, one describing the nodes and one describing the edges.
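As a minimal sketch of that two-table idea, the snippet below writes a tiny network as one node table and one edge table. The names, ages, and weights are made up for illustration, and the convention that `from`/`to` hold row positions in the node table is the one tidygraph prints in the outputs further down.

```r
library(tibble)

# Hypothetical three-node network, for illustration only
nodes <- tibble(name = c("Ann", "Bob", "Cat"), age = c(31, 45, 28))

# One row per edge; from/to are row positions in `nodes`
edges <- tibble(
  from   = c(1L, 1L, 2L),
  to     = c(2L, 3L, 3L),
  weight = c(0.5, 1.0, 0.2)
)

# Resolve the integer endpoints back to node names
data.frame(from = nodes$name[edges$from], to = nodes$name[edges$to])
```

Each table is tidy on its own: one row per node in the first, one row per edge in the second.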

Create a tidy graph

Under the hood of tidygraph lies the well-oiled machinery of igraph, which ensures efficient graph manipulation. Rather than keeping the node and edge data in a list and creating igraph objects on the fly when needed, tidygraph subclasses igraph with the tbl_graph class and simply exposes it in a tidy manner. This ensures that all your beloved algorithms that expect igraph objects still work with tbl_graph objects. Further, tidygraph is very careful not to override any of igraph's exports, so the two packages can coexist quite happily.
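A quick way to see the subclassing at work is to hand a tbl_graph straight to igraph functions. This is only a sketch; the object name `g_ring` is chosen here for illustration.

```r
library(tidygraph)

g_ring <- create_ring(10)

# The tbl_graph class sits on top of the igraph class
class(g_ring)
inherits(g_ring, "igraph")

# igraph functions accept the tbl_graph directly, no conversion needed
igraph::vcount(g_ring)
igraph::diameter(g_ring)
```

Calling the igraph functions with the `igraph::` prefix avoids attaching igraph at all, which is one way to keep the two namespaces apart.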

# graph analysis and visualization
library(tidygraph)
library(ggraph)
library(igraph)

# tidy data analysis and visualization
library(readr)
library(dplyr)

Create graphs

# ring graph
create_ring(10)

## # A tbl_graph: 10 nodes and 10 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 10 x 0 (active)
## #
## # Edge Data: 10 x 2
##    from    to
##   <int> <int>
## 1     1     2
## 2     2     3
## 3     3     4
## # ... with 7 more rows

# Erdos-Renyi graph
play_erdos_renyi(n = 100, p = 0.02)

## # A tbl_graph: 100 nodes and 193 edges
## #
## # A directed simple graph with 5 components
## #
## # Node Data: 100 x 0 (active)
## #
## # Edge Data: 193 x 2
##    from    to
##   <int> <int>
## 1    63     1
## 2    18     2
## 3    48     2
## # ... with 190 more rows

# Barabasi-Albert graph
play_barabasi_albert(n = 100, power = 2, growth = 2, directed = FALSE)

## # A tbl_graph: 100 nodes and 197 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 100 x 0 (active)
## #
## # Edge Data: 197 x 2
##    from    to
##   <int> <int>
## 1     1     2
## 2     1     3
## 3     2     3
## # ... with 194 more rows

# Graph from data frames
nodes = read_csv("dolphin_nodes.csv")
edges = read_csv("dolphin_edges.csv")
tbl_graph(nodes = nodes, edges = edges, directed = FALSE)

## # A tbl_graph: 62 nodes and 159 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 62 x 2 (active)
##   name       sex  
##   <chr>      <chr>
## 1 Beak       M    
## 2 Beescratch M    
## 3 Bumper     M    
## 4 CCL        F    
## 5 Cross      M    
## 6 DN16       F    
## # ... with 56 more rows
## #
## # Edge Data: 159 x 2
##    from    to
##   <int> <int>
## 1     4     9
## 2     6    10
## 3     7    10
## # ... with 156 more rows

Manipulating the graph with verbs

There are many ways a multitable setup could fit into the tidyverse. The approach used by tidygraph is to let the data object itself carry around a pointer to the active data frame that should be the target of manipulation. This pointer is changed using the activate() verb, which, on top of changing which part of the data is being worked on, also changes the print output to show the currently active data on top:

g = tbl_graph(nodes = nodes, edges = edges, directed = FALSE)

g <- g %>% 
  activate(nodes) %>% 
  mutate(degree = centrality_degree()) %>% 
  activate(edges) %>% 
  mutate(
    betweenness = centrality_edge_betweenness(),
    # TRUE when both endpoints of the edge have the same sex
    homo = .N()$sex[from] == .N()$sex[to]
  ) %>% 
  arrange(betweenness)

g

## # A tbl_graph: 62 nodes and 159 edges
## #
## # An undirected simple graph with 1 component
## #
## # Edge Data: 159 x 4 (active)
##    from    to betweenness homo 
##   <int> <int>       <dbl> <lgl>
## 1    10    14        2.87 TRUE 
## 2     7    14        3.64 TRUE 
## 3     7    10        5.52 TRUE 
## 4    19    46        5.64 TRUE 
## 5    19    25        6.60 TRUE 
## 6     6    10        6.61 FALSE
## # ... with 153 more rows
## #
## # Node Data: 62 x 3
##   name       sex   degree
##   <chr>      <chr>  <dbl>
## 1 Beak       M          6
## 2 Beescratch M          8
## 3 Bumper     M          4
## # ... with 59 more rows

In the code above, the .N() function gives access to the node data while the edge data is being manipulated. Similarly, .E() gives you the edge data and .G() gives you the tbl_graph object itself.
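The other two accessors can be used in the same way. The sketch below works from the node context on a small ring graph; the column names `n_edges` and `n_nodes` are invented for the example.

```r
library(tidygraph)
library(dplyr)

ring <- create_ring(5) %>%
  activate(nodes) %>%
  mutate(
    n_edges = nrow(.E()),           # edge table, read while nodes are active
    n_nodes = igraph::gorder(.G())  # whole graph, passed to an igraph function
  )

as_tibble(ring)
```

Both values are scalars, so mutate() recycles them across every node row.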

The currently active data can always be extracted as a tibble using as_tibble():

activate(g, nodes) %>% as_tibble()

## # A tibble: 62 x 3
##    name       sex   degree
##    <chr>      <chr>  <dbl>
##  1 Beak       M          6
##  2 Beescratch M          8
##  3 Bumper     M          4
##  4 CCL        F          3
##  5 Cross      M          1
##  6 DN16       F          4
##  7 DN21       M          6
##  8 DN63       M          5
##  9 Double     F          6
## 10 Feather    M          7
## # ... with 52 more rows

# or
as.list(g)$nodes

## # A tibble: 62 x 3
##    name       sex   degree
##    <chr>      <chr>  <dbl>
##  1 Beak       M          6
##  2 Beescratch M          8
##  3 Bumper     M          4
##  4 CCL        F          3
##  5 Cross      M          1
##  6 DN16       F          4
##  7 DN21       M          6
##  8 DN63       M          5
##  9 Double     F          6
## 10 Feather    M          7
## # ... with 52 more rows

activate(g, edges) %>% as_tibble()

## # A tibble: 159 x 4
##     from    to betweenness homo 
##    <int> <int>       <dbl> <lgl>
##  1    10    14        2.87 TRUE 
##  2     7    14        3.64 TRUE 
##  3     7    10        5.52 TRUE 
##  4    19    46        5.64 TRUE 
##  5    19    25        6.60 TRUE 
##  6     6    10        6.61 FALSE
##  7    22    46        6.74 TRUE 
##  8    17    39        6.96 TRUE 
##  9    17    34        7.32 TRUE 
## 10    19    22        7.89 TRUE 
## # ... with 149 more rows

# or
as.list(g)$edges

## # A tibble: 159 x 4
##     from    to betweenness homo 
##    <int> <int>       <dbl> <lgl>
##  1    10    14        2.87 TRUE 
##  2     7    14        3.64 TRUE 
##  3     7    10        5.52 TRUE 
##  4    19    46        5.64 TRUE 
##  5    19    25        6.60 TRUE 
##  6     6    10        6.61 FALSE
##  7    22    46        6.74 TRUE 
##  8    17    39        6.96 TRUE 
##  9    17    34        7.32 TRUE 
## 10    19    22        7.89 TRUE 
## # ... with 149 more rows

Expanding graphs

All joins from dplyr are supported. Nodes and edges are added and removed as required by the join. New edge data to be joined in must have a to and from column referencing valid nodes in the existing graph.

# a data frame with information about ages of nodes
age = tibble(name = nodes$name, age = sample.int(100, nrow(nodes)))
age

## # A tibble: 62 x 2
##    name         age
##    <chr>      <int>
##  1 Beak          11
##  2 Beescratch    28
##  3 Bumper        96
##  4 CCL           67
##  5 Cross         38
##  6 DN16         100
##  7 DN21          31
##  8 DN63           3
##  9 Double        87
## 10 Feather       72
## # ... with 52 more rows

# join graph nodes with node ages
g %>% 
  activate(nodes) %>% 
  left_join(age, by = "name")

## # A tbl_graph: 62 nodes and 159 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 62 x 4 (active)
##   name       sex   degree   age
##   <chr>      <chr>  <dbl> <int>
## 1 Beak       M          6    11
## 2 Beescratch M          8    28
## 3 Bumper     M          4    96
## 4 CCL        F          3    67
## 5 Cross      M          1    38
## 6 DN16       F          4   100
## # ... with 56 more rows
## #
## # Edge Data: 159 x 4
##    from    to betweenness homo 
##   <int> <int>       <dbl> <lgl>
## 1    10    14        2.87 TRUE 
## 2     7    14        3.64 TRUE 
## 3     7    10        5.52 TRUE 
## # ... with 156 more rows
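Joins work on the edge table as well. The following is a minimal sketch on a hypothetical three-node directed graph, with made-up weights, in which the new data is keyed by the same integer from/to indices as the existing edges.

```r
library(tidygraph)
library(dplyr)

# Hypothetical directed toy graph: a -> b -> c
toy <- tbl_graph(
  nodes = tibble(name = c("a", "b", "c")),
  edges = tibble(from = c(1L, 2L), to = c(2L, 3L))
)

# Edge attributes to attach (values are illustrative)
weights <- tibble(from = c(1L, 2L), to = c(2L, 3L), weight = c(0.3, 0.9))

toy_weighted <- toy %>%
  activate(edges) %>%
  left_join(weights, by = c("from", "to"))

as_tibble(toy_weighted)
```

A directed graph is used here so that the from/to order in the weight table matches the order stored in the graph without further thought.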

Analogous to bind_rows(), tidygraph provides three functions to expand your data. bind_nodes() and bind_edges() append nodes and edges to the graph, respectively. As with the join functions, the data passed to bind_edges() must contain valid from and to columns. bind_graphs() combines multiple graphs into a single graph structure, so that each original graph becomes a component of the returned graph.

# a new set of nodes to add
g2 = tibble(name = c("Mumi", "Juli"), sex = c("M", "F"))
g2

## # A tibble: 2 x 2
##   name  sex  
##   <chr> <chr>
## 1 Mumi  M    
## 2 Juli  F

# add new nodes
g %>% 
  activate(nodes) %>% 
  bind_nodes(g2)

## # A tbl_graph: 64 nodes and 159 edges
## #
## # An undirected simple graph with 3 components
## #
## # Node Data: 64 x 3 (active)
##   name       sex   degree
##   <chr>      <chr>  <dbl>
## 1 Beak       M          6
## 2 Beescratch M          8
## 3 Bumper     M          4
## 4 CCL        F          3
## 5 Cross      M          1
## 6 DN16       F          4
## # ... with 58 more rows
## #
## # Edge Data: 159 x 4
##    from    to betweenness homo 
##   <int> <int>       <dbl> <lgl>
## 1    10    14        2.87 TRUE 
## 2     7    14        3.64 TRUE 
## 3     7    10        5.52 TRUE 
## # ... with 156 more rows
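bind_edges() can be sketched in the same spirit. The example below is an illustration on a small path graph rather than on the dolphin data: appending a single edge between the two end nodes turns the path into a ring.

```r
library(tidygraph)

# A path 1 - 2 - 3 - 4 has three edges
path <- create_path(4)

# Appending the edge 4 - 1 closes the path into a ring
closed <- path %>%
  bind_edges(data.frame(from = 4, to = 1))

igraph::gsize(closed)   # number of edges after the bind
igraph::degree(closed)  # degree of every node
```

The from and to values must refer to nodes that already exist, which is why the indices stay within 1 to 4 here.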

# a notable graph (the bull graph) and a ring graph
g1 <- create_notable('bull') 
g2 <- create_ring(5) 

# bind and plot them
bind_graphs(g1, g2) %>% 
    ggraph(layout = 'kk') + 
    geom_edge_link() + 
    geom_node_point(size = 8, colour = 'steelblue') +
    theme_graph()

Network analysis and visualization

Being able to use the dplyr verbs on relational data is nice, but one of the reasons we deal with graph data in the first place is that we need graph-based algorithms to solve the problem at hand. If we had to break out of the tidy workflow every time such an algorithm was needed, we wouldn’t have gained much. For this reason, tidygraph wraps more or less all of igraph's algorithms in different ways, ensuring a consistent syntax as well as output that fits into the tidy workflow. In the following we’re going to take a look at these.

Node and edge types

create_tree(20, 3) %>% 
    mutate(leaf = node_is_leaf(), root = node_is_root()) %>% 
    ggraph(layout = 'tree') +
    geom_edge_diagonal() +
    geom_node_point(aes(filter = leaf), colour = 'forestgreen', size = 10) +
    geom_node_point(aes(filter = root), colour = 'firebrick', size = 10) +
    theme_graph()

Centrality

tidygraph provides a range of centrality measures, all prefixed with centrality_* for easy discoverability. Each of them returns a numeric vector matching the nodes (or the edges, in the case of edge centrality).

g %>% 
  activate(nodes) %>%
  mutate(pagerank = centrality_pagerank()) %>%
  activate(edges) %>%
  mutate(betweenness = centrality_edge_betweenness()) %>%
  ggraph(layout = 'kk') +
  geom_edge_link(aes(alpha = betweenness)) +
  geom_node_point(aes(size = pagerank, colour = pagerank)) + 
  scale_color_continuous(guide = 'legend') + 
  theme_graph()
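The returned vectors can also be inspected without plotting. As a small sketch, a star graph (chosen here only because its centralities are easy to reason about) gives the hub the highest score on both measures below.

```r
library(tidygraph)
library(dplyr)

# Star graph: node 1 is the hub, nodes 2 to 5 are leaves
star <- create_star(5) %>%
  mutate(
    degree      = centrality_degree(),
    betweenness = centrality_betweenness()
  )

# One value per node, hub first
star %>%
  as_tibble() %>%
  arrange(desc(degree))
```

Because the result is an ordinary column, the usual dplyr verbs such as arrange() or filter() apply directly.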

Node pairs measures

Some statistics measure a relation between two nodes, such as the distance or similarity between them. In a tidy context, one end must always be the node defined by the row, while the other can be any other node. All of the node pair functions are prefixed with node_* and end with _from/_to if the measure is not symmetric and with _with if it is; e.g. there is both a node_max_flow_to() and a node_max_flow_from() function, but only a single node_cocitation_with() function.

g %>% 
  activate(nodes) %>%
  mutate(similarity = node_similarity_with(which.max(centrality_pagerank()))) %>% 
  ggraph(layout = 'kk') + 
  geom_edge_link(colour = "grey") + 
  geom_node_point(aes(size = similarity), colour = 'steelblue') + 
  theme_graph()
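For a pair measure with easily checked values, the sketch below uses node_distance_to() on a five-node path. The graph is undirected here, so the direction of the measure does not matter; the column names are invented for the example.

```r
library(tidygraph)
library(dplyr)

hops <- create_path(5) %>%
  mutate(
    dist_to_first = node_distance_to(1),  # hops from each row's node to node 1
    dist_to_last  = node_distance_to(5)   # hops from each row's node to node 5
  ) %>%
  as_tibble()

hops
```

Each row's node is one end of the pair, and the argument fixes the other end.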

Communities

Another common operation is to group nodes based on the graph topology. All community detection algorithms from igraph are available in tidygraph under the group_* prefix. Each of these functions returns an integer vector in which nodes (or edges) sharing the same integer are grouped together.

g %>% 
  activate(nodes) %>%
  mutate(community = as.factor(group_louvain())) %>% 
  ggraph(layout = 'kk') + 
  geom_edge_link(aes(alpha = after_stat(index)), show.legend = FALSE) + 
  geom_node_point(aes(colour = community), size = 5) + 
  theme_graph()
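To look at the integer vector itself, the sketch below builds a toy graph with an obvious community structure: two complete graphs of four nodes joined by one bridging edge. The construction is an assumption made for illustration, not part of the dolphin analysis.

```r
library(tidygraph)
library(dplyr)

set.seed(1)  # Louvain visits nodes in random order

two_cliques <- bind_graphs(create_complete(4), create_complete(4)) %>%
  bind_edges(data.frame(from = 1, to = 5)) %>%  # single bridge between the cliques
  activate(nodes) %>%
  mutate(community = group_louvain())

# Nodes sharing an integer belong to the same community
two_cliques %>%
  as_tibble() %>%
  count(community)
```

Wrapping the result in as.factor(), as in the plot above, is only needed when the group label should be treated as a discrete aesthetic.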

Node searches

g %>% 
  activate(nodes) %>%
  mutate(order = bfs_rank(which.max(centrality_pagerank()))) %>%
  arrange(order)

## # A tbl_graph: 62 nodes and 159 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 62 x 4 (active)
##   name  sex   degree order
##   <chr> <chr>  <dbl> <int>
## 1 Grin  F         12     1
## 2 Beak  M          6     2
## 3 CCL   F          3     3
## 4 Hook  F          6     4
## 5 MN83  M          6     5
## 6 Scabs F         10     6
## # ... with 56 more rows
## #
## # Edge Data: 159 x 4
##    from    to betweenness homo 
##   <int> <int>       <dbl> <lgl>
## 1    54    55        2.87 TRUE 
## 2    53    54        3.64 TRUE 
## 3    53    55        5.52 TRUE 
## # ... with 156 more rows

g %>% 
  activate(nodes) %>%
  mutate(order = dfs_rank(which.max(centrality_pagerank()))) %>%
  arrange(order)

## # A tbl_graph: 62 nodes and 159 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 62 x 4 (active)
##   name   sex   degree order
##   <chr>  <chr>  <dbl> <int>
## 1 Grin   F         12     1
## 2 Beak   M          6     2
## 3 Fish   F          5     3
## 4 Bumper M          4     4
## 5 SN96   M          6     5
## 6 PL     M          5     6
## # ... with 56 more rows
## #
## # Edge Data: 159 x 4
##    from    to betweenness homo 
##   <int> <int>       <dbl> <lgl>
## 1    12    14        2.87 TRUE 
## 2    11    14        3.64 TRUE 
## 3    11    12        5.52 TRUE 
## # ... with 156 more rows