There’s a mismatch between relational data and the tidy data idea: relational data cannot, in any meaningful way, be encoded as a single tidy data frame. On the other hand, node data and edge data each fit very well within the tidy concept, as each node and each edge is, in a sense, a single observation. Thus, a close approximation of tidiness for relational data is two tidy data frames, one describing the node data and one describing the edge data.
Under the hood of tidygraph lies the well-oiled machinery of igraph, ensuring efficient graph manipulation. Rather than keeping the node and edge data in a list and creating igraph objects on the fly when needed, tidygraph subclasses igraph with the tbl_graph class and simply exposes it in a tidy manner. This ensures that all your beloved algorithms that expect igraph objects still work with tbl_graph objects. Further, tidygraph is very careful not to override any of igraph's exports, so the two packages can coexist quite happily.
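As a small illustration of that inheritance, the sketch below creates a tbl_graph and hands it straight to a few igraph functions. The object name `ring` and the particular functions called are chosen here for illustration only.

```r
library(tidygraph)
library(igraph)

# A tbl_graph is an igraph object with an extra class on top
ring <- create_ring(10)
class(ring)

# igraph functions accept it directly, no conversion needed
n_nodes <- vcount(ring)        # number of nodes
n_edges <- ecount(ring)        # number of edges
node_degrees <- degree(ring)   # degree of every node
is_connected(ring)
```

Because the class is layered on top of igraph rather than wrapped around it, nothing has to be unpacked before calling into igraph.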
# graph analysis and visualization
library(tidygraph)
library(ggraph)
library(igraph)
# tidy data analysis and visualization
library(readr)
library(dplyr)
# ring graph
create_ring(10)
## # A tbl_graph: 10 nodes and 10 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 10 x 0 (active)
## #
## # Edge Data: 10 x 2
## from to
## <int> <int>
## 1 1 2
## 2 2 3
## 3 3 4
## # ... with 7 more rows
# Erdos-Renyi graph
play_erdos_renyi(n = 100, p = 0.02)
## # A tbl_graph: 100 nodes and 193 edges
## #
## # A directed simple graph with 5 components
## #
## # Node Data: 100 x 0 (active)
## #
## # Edge Data: 193 x 2
## from to
## <int> <int>
## 1 63 1
## 2 18 2
## 3 48 2
## # ... with 190 more rows
# Barabasi-Albert graph
play_barabasi_albert(n = 100, power = 2, growth = 2, directed = FALSE)
## # A tbl_graph: 100 nodes and 197 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 100 x 0 (active)
## #
## # Edge Data: 197 x 2
## from to
## <int> <int>
## 1 1 2
## 2 1 3
## 3 2 3
## # ... with 194 more rows
# Graph from data frames
nodes = read_csv("dolphin_nodes.csv")
edges = read_csv("dolphin_edges.csv")
tbl_graph(nodes = nodes, edges = edges, directed = FALSE)
## # A tbl_graph: 62 nodes and 159 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 62 x 2 (active)
## name sex
## <chr> <chr>
## 1 Beak M
## 2 Beescratch M
## 3 Bumper M
## 4 CCL F
## 5 Cross M
## 6 DN16 F
## # ... with 56 more rows
## #
## # Edge Data: 159 x 2
## from to
## <int> <int>
## 1 4 9
## 2 6 10
## 3 7 10
## # ... with 156 more rows
There are many ways a multitable setup could fit into the tidyverse. The approach used by tidygraph is to let the data object itself carry around a pointer to the active data frame that should be the target of manipulation. This pointer is changed using the activate() verb, which, on top of changing which part of the data is being worked on, also changes the print output to show the currently active data on top:
g = tbl_graph(nodes = nodes, edges = edges, directed = FALSE)
g <- g %>%
activate(nodes) %>%
mutate(degree = centrality_degree()) %>%
activate(edges) %>%
mutate(betweenness = centrality_edge_betweenness(), homo = (.N()$sex[from] == .N()$sex[to])) %>%
arrange(betweenness)
g
## # A tbl_graph: 62 nodes and 159 edges
## #
## # An undirected simple graph with 1 component
## #
## # Edge Data: 159 x 4 (active)
## from to betweenness homo
## <int> <int> <dbl> <lgl>
## 1 10 14 2.87 TRUE
## 2 7 14 3.64 TRUE
## 3 7 10 5.52 TRUE
## 4 19 46 5.64 TRUE
## 5 19 25 6.60 TRUE
## 6 6 10 6.61 FALSE
## # ... with 153 more rows
## #
## # Node Data: 62 x 3
## name sex degree
## <chr> <chr> <dbl>
## 1 Beak M 6
## 2 Beescratch M 8
## 3 Bumper M 4
## # ... with 59 more rows
In the above, the .N() function is used to gain access to the node data while manipulating the edge data. Similarly, .E() will give you the edge data and .G() will give you the tbl_graph object itself.
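The three accessors can be tried on a toy graph. The following is a minimal sketch in which the graph, the `label` column and the derived column names are all made up for illustration.

```r
library(tidygraph)
library(dplyr)

# Toy graph: a path of four nodes with a label on each node (illustrative data)
toy <- create_path(4) %>%
  mutate(label = c("a", "b", "c", "d")) %>%
  activate(edges) %>%
  # .N() exposes the node data while the edges are active
  mutate(from_label = .N()$label[from],
         to_label   = .N()$label[to]) %>%
  activate(nodes) %>%
  # .E() exposes the edge data and .G() the whole graph while the nodes are active
  mutate(n_edges = nrow(.E()),
         n_nodes = igraph::vcount(.G()))

edge_data <- toy %>% activate(edges) %>% as_tibble()
node_data <- toy %>% activate(nodes) %>% as_tibble()
```

Indexing `.N()$label` with `from` and `to` works because the edge columns store node positions, not node names.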
The currently active data can always be extracted as a tibble using as_tibble():
activate(g, nodes) %>% as_tibble()
## # A tibble: 62 x 3
## name sex degree
## <chr> <chr> <dbl>
## 1 Beak M 6
## 2 Beescratch M 8
## 3 Bumper M 4
## 4 CCL F 3
## 5 Cross M 1
## 6 DN16 F 4
## 7 DN21 M 6
## 8 DN63 M 5
## 9 Double F 6
## 10 Feather M 7
## # ... with 52 more rows
# or
as.list(g)$nodes
## # A tibble: 62 x 3
## name sex degree
## <chr> <chr> <dbl>
## 1 Beak M 6
## 2 Beescratch M 8
## 3 Bumper M 4
## 4 CCL F 3
## 5 Cross M 1
## 6 DN16 F 4
## 7 DN21 M 6
## 8 DN63 M 5
## 9 Double F 6
## 10 Feather M 7
## # ... with 52 more rows
activate(g, edges) %>% as_tibble()
## # A tibble: 159 x 4
## from to betweenness homo
## <int> <int> <dbl> <lgl>
## 1 10 14 2.87 TRUE
## 2 7 14 3.64 TRUE
## 3 7 10 5.52 TRUE
## 4 19 46 5.64 TRUE
## 5 19 25 6.60 TRUE
## 6 6 10 6.61 FALSE
## 7 22 46 6.74 TRUE
## 8 17 39 6.96 TRUE
## 9 17 34 7.32 TRUE
## 10 19 22 7.89 TRUE
## # ... with 149 more rows
# or
as.list(g)$edges
## # A tibble: 159 x 4
## from to betweenness homo
## <int> <int> <dbl> <lgl>
## 1 10 14 2.87 TRUE
## 2 7 14 3.64 TRUE
## 3 7 10 5.52 TRUE
## 4 19 46 5.64 TRUE
## 5 19 25 6.60 TRUE
## 6 6 10 6.61 FALSE
## 7 22 46 6.74 TRUE
## 8 17 39 6.96 TRUE
## 9 17 34 7.32 TRUE
## 10 19 22 7.89 TRUE
## # ... with 149 more rows
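The relationship between the active pointer and as_tibble() can also be checked directly. This is a small sketch on a made-up ring graph; it assumes the `active()` helper and the `active` argument of the tbl_graph method of as_tibble() as they are documented in tidygraph.

```r
library(tidygraph)

# Toy ring with a simple node attribute (illustrative data)
toy <- create_ring(4) %>%
  mutate(id = 1:4)

# Nodes are active by default; active() reports the current target
active(toy)

# activate() moves the pointer, and as_tibble() follows it
toy_edges <- activate(toy, edges)
active(toy_edges)
edge_tbl <- as_tibble(toy_edges)

# The `active` argument extracts one part without moving the pointer
node_tbl <- as_tibble(toy_edges, active = "nodes")
```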
All joins from dplyr are supported. Nodes and edges are added and removed as required by the join. New edge data to be joined in must have from and to columns referencing valid nodes in the existing graph.
# a data frame with information about ages of nodes
age = tibble(name = nodes$name, age = sample.int(100, nrow(nodes)))
age
## # A tibble: 62 x 2
## name age
## <chr> <int>
## 1 Beak 11
## 2 Beescratch 28
## 3 Bumper 96
## 4 CCL 67
## 5 Cross 38
## 6 DN16 100
## 7 DN21 31
## 8 DN63 3
## 9 Double 87
## 10 Feather 72
## # ... with 52 more rows
# join graph nodes with node ages
g %>%
activate(nodes) %>%
left_join(age)
## # A tbl_graph: 62 nodes and 159 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 62 x 4 (active)
## name sex degree age
## <chr> <chr> <dbl> <int>
## 1 Beak M 6 11
## 2 Beescratch M 8 28
## 3 Bumper M 4 96
## 4 CCL F 3 67
## 5 Cross M 1 38
## 6 DN16 F 4 100
## # ... with 56 more rows
## #
## # Edge Data: 159 x 4
## from to betweenness homo
## <int> <int> <dbl> <lgl>
## 1 10 14 2.87 TRUE
## 2 7 14 3.64 TRUE
## 3 7 10 5.52 TRUE
## # ... with 156 more rows
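To see how a join can also remove nodes, the sketch below compares left_join() and inner_join() on a toy ring. The graph and the `info` table are made up for illustration.

```r
library(tidygraph)
library(dplyr)

# Toy ring of four named nodes (illustrative data)
toy <- create_ring(4) %>%
  mutate(name = c("a", "b", "c", "d"))

# Extra node information covering only three of the four nodes
info <- tibble(name = c("a", "b", "c"), weight = c(10, 20, 30))

# left_join keeps every node; the unmatched node gets NA
kept <- toy %>% left_join(info, by = "name")

# inner_join drops node "d" together with the edges attached to it
dropped <- toy %>% inner_join(info, by = "name")

igraph::vcount(kept)
igraph::vcount(dropped)
igraph::ecount(dropped)
```

Passing `by` explicitly avoids the message dplyr prints when it has to guess the join columns.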
Analogous to bind_rows(), tidygraph provides three functions to expand your data. bind_nodes() and bind_edges() append nodes and edges to the graph, respectively. As with the join functions, the data passed to bind_edges() must contain valid from and to columns. bind_graphs() lets you combine multiple graphs into a single graph structure, so that each original graph becomes a component of the returned graph.
# a new set of nodes to add
g2 = tibble(name = c("Mumi", "Juli"), sex = c("M", "F"))
g2
## # A tibble: 2 x 2
## name sex
## <chr> <chr>
## 1 Mumi M
## 2 Juli F
# add new nodes
g %>%
activate(nodes) %>%
bind_nodes(g2)
## # A tbl_graph: 64 nodes and 159 edges
## #
## # An undirected simple graph with 3 components
## #
## # Node Data: 64 x 3 (active)
## name sex degree
## <chr> <chr> <dbl>
## 1 Beak M 6
## 2 Beescratch M 8
## 3 Bumper M 4
## 4 CCL F 3
## 5 Cross M 1
## 6 DN16 F 4
## # ... with 58 more rows
## #
## # Edge Data: 159 x 4
## from to betweenness homo
## <int> <int> <dbl> <lgl>
## 1 10 14 2.87 TRUE
## 2 7 14 3.64 TRUE
## 3 7 10 5.52 TRUE
## # ... with 156 more rows
# two notable graphs
g1 <- create_notable('bull')
g2 <- create_ring(5)
# bind and plot them
bind_graphs(g1, g2) %>%
ggraph(layout = 'kk') +
geom_edge_link() +
geom_node_point(size = 8, colour = 'steelblue') +
theme_graph()
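The three bind functions can be compared side by side on a toy graph. The following is a minimal sketch; the node names, the `part` column and the object names are made up for illustration.

```r
library(tidygraph)
library(dplyr)

# Start from three named nodes and a single edge (illustrative data)
toy <- tbl_graph(
  nodes = tibble(name = c("a", "b", "c")),
  edges = tibble(from = 1L, to = 2L),
  directed = FALSE
)

# bind_nodes() appends rows to the node data; the new node starts isolated
with_node <- toy %>% bind_nodes(tibble(name = "d"))

# bind_edges() appends edges; from/to refer to existing node positions
with_edge <- with_node %>% bind_edges(tibble(from = 3L, to = 4L))

# bind_graphs() keeps each input graph as a separate component
both <- bind_graphs(
  create_ring(3) %>% mutate(part = "ring3"),
  create_ring(4) %>% mutate(part = "ring4")
)

igraph::vcount(with_node)
igraph::ecount(with_edge)
igraph::count_components(both)
```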
While being able to use the dplyr verbs on relational data is nice and all, one of the reasons we are dealing with graph data in the first place is that we need graph-based algorithms to solve the problem at hand. If we had to break out of the tidy workflow every time one was needed, we wouldn’t have gained much. Because of this, tidygraph wraps more or less all of igraph's algorithms in different ways, ensuring a consistent syntax as well as output that fits into the tidy workflow. In the following we’re going to take a look at these.
create_tree(20, 3) %>%
mutate(leaf = node_is_leaf(), root = node_is_root()) %>%
ggraph(layout = 'tree') +
geom_edge_diagonal() +
geom_node_point(aes(filter = leaf), colour = 'forestgreen', size = 10) +
geom_node_point(aes(filter = root), colour = 'firebrick', size = 10) +
theme_graph()
tidygraph has a range of centrality measures, all of which are prefixed with centrality_* for easy discoverability. All of them return a numeric vector matching the nodes (or edges in the case of edge centrality).
g %>%
activate(nodes) %>%
mutate(pagerank = centrality_pagerank()) %>%
activate(edges) %>%
mutate(betweenness = centrality_edge_betweenness()) %>%
ggraph(layout = 'kk') +
geom_edge_link(aes(alpha = betweenness)) +
geom_node_point(aes(size = pagerank, colour = pagerank)) +
scale_color_continuous(guide = 'legend') +
theme_graph()
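On a graph small enough to check by hand, the returned vectors are easy to verify. This is a sketch using a star graph chosen for illustration, where node 1 is the hub.

```r
library(tidygraph)
library(dplyr)

# A star with one hub (node 1) and four spokes
star <- create_star(5) %>%
  mutate(
    degree      = centrality_degree(),      # one value per node
    betweenness = centrality_betweenness()  # one value per node
  )

scores <- as_tibble(star)
scores
```

The hub has degree 4 and lies on the shortest path between all six pairs of spokes, while every spoke has degree 1 and betweenness 0.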
Some statistics measure a relation between two nodes, such as the distance or similarity between them. In a tidy context one of the ends must always be the node defined by the row, while the other can be any other node. All of the node pair functions are prefixed with node_* and end with _from/_to if the measure is not symmetric and _with if it is; e.g. there is both a node_max_flow_to() and a node_max_flow_from() function, but only a single node_cocitation_with() function.
g %>%
activate(nodes) %>%
mutate(similarity = node_similarity_with(which.max(centrality_pagerank()))) %>%
ggraph(layout = 'kk') +
geom_edge_link(colour = "grey") +
geom_node_point(aes(size = similarity), colour = 'steelblue') +
theme_graph()
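The difference between the _from and _to variants shows up clearly on a directed graph. The sketch below uses node_distance_from() and node_distance_to() on a made-up directed path; the column names are illustrative.

```r
library(tidygraph)
library(dplyr)

# Directed path 1 -> 2 -> 3 -> 4
path <- create_path(4, directed = TRUE) %>%
  mutate(
    steps_from_first = node_distance_from(1),  # hops needed to get here from node 1
    steps_to_last    = node_distance_to(4)     # hops needed to reach node 4 from here
  )

distances <- as_tibble(path)
distances
```

In both calls the row defines one end of the pair and the argument defines the other; only the direction of travel changes.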
Another common operation is to group nodes based on the graph topology. All community detection algorithms from igraph are available in tidygraph using the group_* prefix. All of these functions return an integer vector, with nodes (or edges) sharing the same integer being grouped together.
g %>%
activate(nodes) %>%
mutate(community = as.factor(group_louvain())) %>%
ggraph(layout = 'kk') +
geom_edge_link(aes(alpha = after_stat(index)), show.legend = FALSE) +
geom_node_point(aes(colour = community), size = 5) +
theme_graph()
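The integer labels are easiest to inspect on a graph whose grouping is known in advance. The following is a sketch on two disjoint cliques, with the `clique` column added only so the result can be compared against the known structure.

```r
library(tidygraph)
library(dplyr)

# Two disjoint cliques, so the expected grouping is unambiguous
cliques <- bind_graphs(
  create_complete(3) %>% mutate(clique = "small"),
  create_complete(4) %>% mutate(clique = "large")
) %>%
  mutate(
    component = group_components(),  # connected components
    community = group_louvain()      # Louvain community detection
  )

groups <- as_tibble(cliques)
count(groups, clique, component, community)
```

Wrapping the label in as.factor(), as in the plot above, makes ggraph treat it as a discrete colour scale instead of a continuous one.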
g %>%
activate(nodes) %>%
mutate(order = bfs_rank(which.max(centrality_pagerank()))) %>%
arrange(order)
## # A tbl_graph: 62 nodes and 159 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 62 x 4 (active)
## name sex degree order
## <chr> <chr> <dbl> <int>
## 1 Grin F 12 1
## 2 Beak M 6 2
## 3 CCL F 3 3
## 4 Hook F 6 4
## 5 MN83 M 6 5
## 6 Scabs F 10 6
## # ... with 56 more rows
## #
## # Edge Data: 159 x 4
## from to betweenness homo
## <int> <int> <dbl> <lgl>
## 1 54 55 2.87 TRUE
## 2 53 54 3.64 TRUE
## 3 53 55 5.52 TRUE
## # ... with 156 more rows
g %>%
activate(nodes) %>%
mutate(order = dfs_rank(which.max(centrality_pagerank()))) %>%
arrange(order)
## # A tbl_graph: 62 nodes and 159 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 62 x 4 (active)
## name sex degree order
## <chr> <chr> <dbl> <int>
## 1 Grin F 12 1
## 2 Beak M 6 2
## 3 Fish F 5 3
## 4 Bumper M 4 4
## 5 SN96 M 6 5
## 6 PL M 5 6
## # ... with 56 more rows
## #
## # Edge Data: 159 x 4
## from to betweenness homo
## <int> <int> <dbl> <lgl>
## 1 12 14 2.87 TRUE
## 2 11 14 3.64 TRUE
## 3 11 12 5.52 TRUE
## # ... with 156 more rows