Visualising and Analysing Community Network
Take-home Exercise 6 addresses bullet point 2 of Challenge 1 of the
VAST Challenge 2022: revealing the patterns of community interactions
in the city of Engagement. I also examine the relationships among all
of the participants.
In this exercise, I use the visNetwork, igraph and ggraph packages to display the complex relationships among the participants who live in Engagement.
The data files used for this exercise are Participants.csv and SocialNetwork.csv.
The first file contains information about the residents of Engagement, OH who have
agreed to participate in the study. The following are the definitions of
each column of data:
The second file contains information about participants’ evolving social relationships. The following are the definitions of each column of data:
This exercise uses eight packages: igraph, tidygraph, ggraph, visNetwork, lubridate, clock, graphlayouts and tidyverse. The following code chunk installs any that are missing and loads all of them into the R session.
packages <- c('igraph', 'tidygraph',
              'ggraph', 'visNetwork',
              'lubridate', 'clock',
              'tidyverse', 'graphlayouts')
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
The data were imported with read_csv() and read_rds(), both from the readr package. read_csv() reads a delimited file into a tibble, and read_rds() restores an R object that was previously saved as an RDS file.
GAStech_nodes <- read_csv("data/Participants.csv")
GAStech_edges <- read_rds("data/rds/SocialNetwork.rds")
When the edge endpoints are given as integers, tbl_graph() interprets them as row positions in the node table, and those positions are numbered from 1. An ID of 0 therefore causes an error. In the source dataset, participantId starts from 0, so I add one to every participant ID in the study.
GAStech_nodes <- GAStech_nodes %>%
mutate(participantId = participantId + 1)
GAStech_nodes
# A tibble: 1,011 x 7
participantId householdSize haveKids age educationLevel
<dbl> <dbl> <lgl> <dbl> <chr>
1 1 3 TRUE 36 HighSchoolOrCollege
2 2 3 TRUE 25 HighSchoolOrCollege
3 3 3 TRUE 35 HighSchoolOrCollege
4 4 3 TRUE 21 HighSchoolOrCollege
5 5 3 TRUE 43 Bachelors
6 6 3 TRUE 32 HighSchoolOrCollege
7 7 3 TRUE 26 HighSchoolOrCollege
8 8 3 TRUE 27 Bachelors
9 9 3 TRUE 20 Bachelors
10 10 3 TRUE 35 Bachelors
# ... with 1,001 more rows, and 2 more variables:
# interestGroup <chr>, joviality <dbl>
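The effect of the shift can be checked on a very small graph. The following is a minimal sketch on made-up data: the `toy_nodes` and `toy_edges` tables are hypothetical and only mimic the column names of the real files.

```r
library(tidygraph)

# Hypothetical toy data (not from the study): three participants with 0-based IDs.
toy_nodes <- data.frame(participantId = c(0, 1, 2))
toy_edges <- data.frame(participantIdFrom = c(0, 1),
                        participantIdTo   = c(1, 2))

# Shift every ID by one so that each edge endpoint equals the row position
# of the corresponding participant in toy_nodes.
toy_nodes$participantId     <- toy_nodes$participantId + 1
toy_edges$participantIdFrom <- toy_edges$participantIdFrom + 1
toy_edges$participantIdTo   <- toy_edges$participantIdTo + 1

# The first two columns of the edge table are used as the endpoints.
toy_graph <- tbl_graph(nodes = toy_nodes, edges = toy_edges, directed = TRUE)
toy_graph
```

After the shift, the edge from participant 1 to participant 2 points at rows 1 and 2 of the node table, so IDs and row positions agree.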
For the edges dataset, I also add one to both the participantIdFrom and participantIdTo columns. In addition, passing the entire social network to tbl_graph() takes a long time to process, so I keep only the social activities that happened in March 2022.
GAStech_edges <- GAStech_edges %>%
mutate(participantIdFrom = participantIdFrom + 1) %>%
mutate(participantIdTo = participantIdTo + 1) %>%
filter(year(timestamp) == 2022) %>%
filter(month(timestamp) == 3)
GAStech_edges
# A tibble: 171,796 x 3
timestamp participantIdFrom participantIdTo
<dttm> <dbl> <dbl>
1 2022-03-01 00:00:00 174 181
2 2022-03-01 00:00:00 179 184
3 2022-03-01 00:00:00 179 186
4 2022-03-01 00:00:00 181 174
5 2022-03-01 00:00:00 184 179
6 2022-03-01 00:00:00 184 186
7 2022-03-01 00:00:00 186 179
8 2022-03-01 00:00:00 186 184
9 2022-03-01 00:00:00 187 188
10 2022-03-01 00:00:00 187 205
# ... with 171,786 more rows
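The chunk that builds GAStech_graph is not shown in this write-up. The following is a minimal sketch of one way such a graph could be built, assuming the edges are first collapsed to one row per sender and receiver pair with an interaction count, which is the same pattern used later in this exercise. The `toy_*` objects and their values are hypothetical.

```r
library(dplyr)
library(tidygraph)

# Hypothetical toy data: four participants and six recorded interactions,
# including one self-loop (4 -> 4) that should be discarded.
toy_nodes <- tibble(participantId = 1:4)
toy_edges <- tibble(
  participantIdFrom = c(1, 1, 2, 3, 1, 4),
  participantIdTo   = c(2, 2, 1, 4, 2, 4)
)

# Collapse repeated interactions into one weighted edge per ordered pair.
toy_edges_aggregated <- toy_edges %>%
  group_by(participantIdFrom, participantIdTo) %>%
  summarise(Weight = n(), .groups = "drop") %>%
  filter(participantIdFrom != participantIdTo)

# Build the directed graph from the node list and the weighted edges.
toy_graph <- tbl_graph(nodes = toy_nodes,
                       edges = toy_edges_aggregated,
                       directed = TRUE)
toy_graph
```

Aggregating before calling tbl_graph() keeps the edge table small, because every repeated interaction between the same two participants becomes a single row with a Weight.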
set.seed(1234)
ggraph(GAStech_graph,
layout = "stress") +
geom_edge_link() +
geom_node_point() +
theme_graph()
Although the preprocessing already reduced the volume of social network data, more than 900 participants had social activities in March. The resulting network is so dense that the connections between nodes are hard to make out. I therefore applied a second filter and displayed only the connections with more than 25 social interactions in March.
In addition, the unconnected dots in the graph above are participants who had no social interaction with any other participant in March. Because tbl_graph() reads integer edge endpoints as row positions in the node table, removing these participants from the node list would leave the edge endpoints out of step with the remaining rows, and tbl_graph() can report an error. For that reason I keep them in the node list.
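One possible way to hide the isolated participants without touching the node list is to build the graph first and drop the isolated nodes afterwards. The following is a minimal sketch on hypothetical toy data, using tidygraph's node_is_isolated(). The threshold of 2 stands in for the 25 used in this exercise, and the `toy_*` names are made up.

```r
library(dplyr)
library(tidygraph)

# Hypothetical toy data: five participants and four weighted connections.
toy_nodes <- tibble(participantId = 1:5)
toy_edges_aggregated <- tibble(
  participantIdFrom = c(1, 2, 3, 4),
  participantIdTo   = c(2, 1, 4, 5),
  Weight            = c(3, 3, 1, 2)
)

# Keep only the connections above the threshold, but pass the full node list
# so that edge endpoints still match node row positions.
toy_graph <- tbl_graph(
  nodes = toy_nodes,
  edges = filter(toy_edges_aggregated, Weight > 2),
  directed = TRUE
)

# Drop the nodes that have no remaining edges after the graph is built.
toy_graph_connected <- toy_graph %>%
  activate(nodes) %>%
  filter(!node_is_isolated())
toy_graph_connected
```

In this toy case only participants 1 and 2 remain, because tidygraph re-indexes the edges itself when nodes are filtered out of an existing graph.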
set.seed(1234)
ggraph(GAStech_graph,
layout = "stress") +
geom_edge_link() +
geom_node_point() +
theme_graph()
The social network shows that most of the participants interacted and communicated quite frequently at the beginning of the study. Unlike in the previous diagram, however, an unconnected point in this diagram does not mean that the participant had no interaction with other participants. It only means that none of their connections reached the 25-interaction threshold.
To make the social network diagram clearer, I use the “nicely” layout to display the same social network again. It is shown as follows.
set.seed(1234)
ggraph(GAStech_graph,
layout = "nicely") +
geom_edge_link(aes()) +
geom_node_point(aes()) +
theme_graph()
The code in the chunk below keeps only the participants who interacted with at least 10 other participants in March. More than half of the participants had social activities with at least 10 participants in March, so the social network diagram is still too dense for me to observe the details in it.
FromId <- GAStech_edges_aggregated_1 %>%
count(participantIdFrom) %>%
filter(n >= 10)
GAStech_edges_aggregated_3 <- subset(GAStech_edges_aggregated_1, participantIdFrom %in% FromId$participantIdFrom)
GAStech_graph_1 <- tbl_graph(nodes = GAStech_nodes,
edges = GAStech_edges_aggregated_3,
directed = TRUE)
set.seed(1234)
ggraph(GAStech_graph_1,
layout = "nicely") +
geom_edge_link() +
geom_node_point() +
theme_graph()
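The partner-count filter works because the aggregated edge table holds one row per sender and receiver pair, so counting rows per sender gives the number of distinct partners. The following is a minimal sketch on hypothetical toy data, with a threshold of 3 standing in for the 10 used above.

```r
library(dplyr)

# Hypothetical toy data: one row per ordered pair of participants.
toy_edges_aggregated <- tibble(
  participantIdFrom = c(1, 1, 1, 2, 2, 3),
  participantIdTo   = c(2, 3, 4, 1, 3, 1),
  Weight            = c(30, 26, 40, 30, 27, 26)
)

# Count rows per sender: each row is a distinct partner of that sender.
active_senders <- toy_edges_aggregated %>%
  count(participantIdFrom) %>%
  filter(n >= 3)

# Keep only the edges that start from a sufficiently active sender.
toy_edges_filtered <- toy_edges_aggregated %>%
  filter(participantIdFrom %in% active_senders$participantIdFrom)
toy_edges_filtered
```

Note that this kind of filter acts on the sending side only: the receivers of the remaining edges stay in the diagram even if they have few partners themselves.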
With one more filter, I keep only the participants who had at least 20 interactions with each of at least 10 other participants in March. In the resulting social network diagram, we can still find 6 clusters. This means that the participants in these six clusters communicated and interacted very closely in the early stage of the study. They may be close friends, colleagues at work or neighbors.
GAStech_edges_aggregated_4 <- GAStech_edges %>%
group_by(participantIdFrom, participantIdTo) %>%
summarise(Weight = n()) %>%
filter(participantIdFrom!=participantIdTo) %>%
filter(Weight >= 20) %>%
ungroup()
FromId <- GAStech_edges_aggregated_4 %>%
count(participantIdFrom) %>%
filter(n >= 10)
GAStech_edges_aggregated_4 <- subset(GAStech_edges_aggregated_4, participantIdFrom %in% FromId$participantIdFrom)
GAStech_graph_2 <- tbl_graph(nodes = GAStech_nodes,
edges = GAStech_edges_aggregated_4,
directed = TRUE)
set.seed(1234)
ggraph(GAStech_graph_2,
layout = "stress") +
geom_edge_link() +
geom_node_point() +
theme_graph()
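The number of clusters read off the diagram could be cross-checked in code. The following is a minimal sketch with igraph on hypothetical toy data. It assumes that a "cluster" means a weakly connected component with more than one member, which may not match every grouping that is visible in the plot.

```r
library(igraph)

# Hypothetical toy data: seven participants, two groups and two isolates.
toy_nodes <- data.frame(name = 1:7)
toy_edges <- data.frame(
  from = c(1, 2, 4, 5),
  to   = c(2, 3, 5, 4)
)

toy_graph <- graph_from_data_frame(toy_edges, directed = TRUE,
                                   vertices = toy_nodes)

# Weakly connected components ignore the direction of the edges.
comp <- components(toy_graph, mode = "weak")

# Isolated participants form components of size 1, so count only larger ones.
n_clusters <- sum(comp$csize > 1)
n_clusters
```

A tbl_graph is also an igraph object, so the same components() call could in principle be applied to a graph built with tbl_graph().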