In this take home exercise I will study the social network of the town in the Vast Challenge 2022 dataset.
In this exercise I will use visual analytic techniques to address this question:
The required packages will be called with the following code chunk:
packages = c('igraph', 'tidygraph',
'ggraph', 'visNetwork',
'lubridate', 'clock',
'tidyverse', 'graphlayouts')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
The SocialNetwork and ParticipantNodes data will be imported using the code chunks below:
SocialNetwork <- read_csv("data/SocialNetwork.csv")
ParticipantNodes <- read_csv("data/Participants.csv")
Change householdsize to character to use it as a category.
ParticipantNodes$householdSize <- as.character(ParticipantNodes$householdSize)
Split age and joviality into quartiles. 1st quartile represents the lowest value and the 4th quartile represents the highest value.
ParticipantNodes <- ParticipantNodes %>%
mutate(ageQuartile = as.character(ntile(age, 4))) %>%
mutate(jovialityQuartile = as.character(ntile(joviality, 4)))
Drop age and joviality columns as they are no longer needed:
Next create a day attribute for SocialNetwork data:
SocialNetwork <- SocialNetwork %>%
mutate(day = weekdays(timestamp))
Aggregate data to tidygraph format:
Filter notes for those selected:
Create cgraph:
cgraph <- graph_from_data_frame(SocialNetworkAggregated, vertices =
ParticipantNodesSelected, directed = TRUE) %>%
as_tbl_graph()
Activate Edges:
cgraph %>%
activate(edges) %>%
arrange(desc(Weight))
# A tbl_graph: 648 nodes and 16658 edges
#
# A directed multigraph with 57 components
#
# Edge Data: 16,658 x 4 (active)
from to day Weight
<int> <int> <chr> <int>
1 5 30 Thursday 65
2 5 30 Wednesday 65
3 8 384 Thursday 65
4 12 521 Thursday 65
5 13 84 Thursday 65
6 15 130 Thursday 65
# ... with 16,652 more rows
#
# Node Data: 648 x 7
name householdSize haveKids educationLevel interestGroup
<chr> <chr> <lgl> <chr> <chr>
1 2 3 TRUE HighSchoolOrC~ A
2 4 3 TRUE Bachelors H
3 5 3 TRUE HighSchoolOrC~ D
# ... with 645 more rows, and 2 more variables: ageQuartile <chr>,
# jovialityQuartile <chr>
g <- ggraph(cgraph,
layout = "nicely") +
geom_edge_link(aes()) +
geom_node_point(aes())
g + theme_graph()
Observation:
It is observed from the general network plot that there are many micro networks surrounding the main network. Centrality filters will be used to clean up.
deg.dist <- degree_distribution(cgraph, cumulative=T, mode="all")
plot( x=0:max(degree(cgraph)), y=1-deg.dist, pch=19, cex=1.2, col="orange",
xlab="Degree", ylab="Cumulative Frequency")
Observation:
It is observed that the cumulative frequency graph for degree follows a sigmoidal shape, indicating that it follows a Gaussian distribution (Normal Distribution).
Eigenvector centrality measures a node’s importance while giving consideration to the importance of its neighbors. For example. a person with few connections could have a very high eigenvector centrality if those few connections were to be very well-connected others.
cgraph %>%
mutate(centrality = centrality_eigen()) %>%
filter(centrality > 0.01) %>%
ggraph(layout = 'kk') +
geom_edge_link(aes()) +
geom_node_point(aes(size = centrality, colour = educationLevel)) +
labs(title = 'Eigenvector Centrality by Education Level',
colour = "Education Level") +
theme_graph()
Observation:
It can be seen that the nodes with the highest eigenvector centrality tend to be either of graduate or bachelor education level. Those of low education level tend to have low eigenvector centrality.
cgraph %>%
mutate(centrality = centrality_eigen()) %>%
filter(centrality > 0.01) %>%
ggraph(layout = 'kk') +
geom_edge_link() +
geom_node_point(aes(size = centrality, colour = householdSize)) +
labs(title = 'Eigenvector Centrality by Household Size',
colour = "Household Size") +
theme_graph()
Observation:
Nodes with the highest eigenvector centrality tend to be either of household size 2 or 3. Single member households tend to have low eigenvector centrality, indicating that they are not well connected to influential nodes.
cgraph %>%
mutate(centrality = centrality_eigen()) %>%
filter(centrality > 0.01) %>%
ggraph(layout = 'kk') +
geom_edge_link() +
geom_node_point(aes(size = centrality, colour = jovialityQuartile)) +
labs(title = 'Eigenvector Centrality by Joviality',
colour = "Joviality Quartile") +
theme_graph()
Observation:
Those with the lowest joviality had the lowest eigenvector centrality value. This indicates that more joviality played a role in eigenvector centrality value.
cgraph %>%
mutate(centrality = centrality_eigen()) %>%
filter(centrality > 0.01) %>%
ggraph(layout = 'kk') +
geom_edge_link() +
geom_node_point(aes(size = centrality, colour = ageQuartile)) +
labs(title = 'Eigenvector Centrality by Age',
colour = "Age Quartile") +
theme_graph()
Observation:
The nodes with the highest eigenvector centrality tend to be those of the 2nd and 3rd age quartile. This indicates that the being too young or too old influences the connections a person has.
Code chunk below is used to create a new SocialNetworkAggregatedCopy
SocialNetworkAggregatedCopy <- NULL
SocialNetworkAggregatedCopy <- SocialNetwork %>%
left_join(ParticipantNodesSelected, by = c("participantIdFrom" = "participantId")) %>%
rename(from = participantIdFrom) %>%
left_join(ParticipantNodesSelected, by = c("participantIdTo" = "participantId")) %>%
rename(to = participantIdTo) %>%
group_by(from, to) %>%
summarise(weight = n()) %>%
filter(from!=to) %>%
filter(weight > 150) %>%
ungroup()
visNetwork(ParticipantsCopy,
SocialNetworkAggregatedCopy,
height = "700px", width = "100%") %>%
visIgraphLayout(layout = "layout_with_fr") %>%
visOptions(highlightNearest = TRUE,
selectedBy = "group") %>%
visLegend()
visNetwork(ParticipantsCopy,
SocialNetworkAggregatedCopy,
height = "700px", width = "100%") %>%
visIgraphLayout(layout = "layout_with_fr") %>%
visOptions(highlightNearest = TRUE,
selectedBy = "group") %>%
visLegend()
Observation:
Participants of interest group E were the least connected to each other, indicating that this may not be a group activity.