Overview

In this take-home exercise, appropriate static statistical graphics methods are used to reveal the demographic of the city of engagement, Ohio USA. The data will be processed by using appropriate tidyverse family of packages and the statistical graphics will be prepared using ggplot2 and its extensions.

Getting Started

The required packages will be called with the following code chunk:

packages = c("tidyverse")

for (p in packages) {
  if(!require(p, character.only = T)) {
    install.packages(p)
  }
  library(p, character.only = T)
}

Importing Data

The code chunk below will import Participants.csv from the data folder into R by using read_csv() of readr package and save it as a tibble data frame called participants.

participants <- read_csv("data/Participants.csv")

The following code chunk will import Participants_Metadata.csv from the data folder in to R by using read_csv() of readr package and save it as a tibble data frame called metadata.

metadata <- read_csv("data/Participants_Metadata.csv")

Data Preperation

The following code chunk will be used to change the data type of the variables in participants using the as.character() function.

participants$householdSize <- as.character(participants$householdSize)

The following code chunk will order the education levels in participants using the factor() function. Order used will be from the least advanced to the most advanced qualification.

participants$educationLevel <- factor(participants$educationLevel,
                                      levels =  c("Low", "HighSchoolOrCollege", 
                                                  "Bachelors", "Graduate"))

Metadata

The knitr::kable function is used in the following code chunk to output the table for participants metadata.

knitr::kable(metadata,
caption = "Participants Metadata")

Table 1: Participants Metadata
Attribute	Data Type	Description
participantId	Numeric	Unique ID assigned to each participant
householdSize	Character	The number of people in the participant’s household
haveKids	Logical	Whether there are children living in the participant’s household
age	Numeric	Participant’s age in years at the start of the study
educationLevel	Factor	The participant’s education level
interestGroup	Numeric	Participant’s stated primary interest group
joviality	Numeric	Value ranging from [0,1] indicating the participant’s overall happiness level at the start of the study

Analysis

Household Size versus Kids

The code chunk below plots a bar chart by using geom_col() of ggplot2. It aims to analyze the relationship between the size of a household and study if that is influenced by whether the households have kids. group_by and summarise functions are also used to enable the output of count values on the charts via the geom_text function.

participants %>%
  group_by(householdSize, haveKids) %>%
  summarise(n = n()) %>%
  ggplot(aes(fill=haveKids, x=householdSize, y=n)) + 
  geom_col() +
  geom_text(aes(label = n), size = 3, position = position_stack(vjust = 0.5)) +
  labs(x="Household Size", y="Count",
       title = "Household Size versus Kids", fill = "Have Kids?") +
  theme_minimal()

From the plot above it can be seen that household size of 2 was the most adundant, followed by 1 and 3. It is also observed that all 3 member households had kids, while all 1 and 2 member households did not have kids.

Education Level of Participants

The code chunk below plots a count plot using geom_col() of ggplot2 for participants education level. The count function is used to enable the output of count values on the charts via the geom_text function.

participants %>% count(educationLevel) %>% 
  ggplot(aes(x=educationLevel, y=n)) + 
  geom_col(fill="#00BFC4") +
  geom_text(aes(label = n), size = 3, position = position_stack(vjust = 0.5)) +
  labs(x="Education Level", y="Count", title="Education Level Count Plot") +
  theme_minimal()

The bar chart above shows that high school or college graduates make up the largest proportion of the participants, while people of low education level make up the smallest proportion of participants. With the exception of low education level, there is a decreasing trend observed in terms of more senior educations levels.

Education Level versus Age

The following chunk code plots box plots using geom_boxplot() of ggplot2 to access whether age differs by education levels of participants. Notches are added to the box plot to help visually assess whether the medians of distributions differ. Mean of each group is also added to the plot using stat_summary, and is represented by a red dot.

ggplot(data=participants, aes(x=educationLevel, y=age)) +
  geom_boxplot(notch = TRUE) +
  stat_summary(geom = "point",
               fun="mean",
               colour="red",
               size=4) +
  labs(y="Age", x="Education Level", title="Education Level versus Age") +
  theme_minimal()

From the plot it can be seen that there is no discernible relationship between education levels and age.

Education Level versus Kids

The following code chunk below plots a stacked bar chart of the frequency of education level split into those having kids and those who do not have kids using geom_col() of ggplot2. The group_by, summarise and mutate functions were used to manipulate the data and create a frequency column. Frequency values were shown on the charts via the geom_text function.

participants %>%
  group_by(educationLevel, haveKids) %>%
  summarise(n = n()) %>%
  mutate(freq = round(n / sum(n),3)) %>%
  ggplot(aes(fill=haveKids, x=educationLevel, y=freq)) + 
  geom_col() +
  geom_text(aes(label = freq), size = 3, position = position_stack(vjust = 0.5)) +
  labs(x="Education Level", y="Frequency",
       title = "Education Level versus Kids", fill = "Have Kids?") +
  theme_minimal()

From the bar chart it can be seen that the proportion of those who have kids reduces with higher education levels. Participants with Low education level have the highest proportion of kids. While those with bachelors have the lowest proportion of kids, this proportion is close to that of the graduate group.

Household Size versus Education for Participants without Kids

Th code chunk takes a subset of the participants data, using the subset() function, where haveKids is FALSE, to study the impact of education on household sizes. This is done as we have already established that all 3 person households have kids and the previous segment has studied the impact of education level on having kids. Here we aim to study if education levels has an impact on single person households. Stacked bar chart was plotted using geom_col() of ggplot2. The group_by, summarise and mutate functions were used to manipulate the data and create a frequency column. Frequency values were shown on the charts via the geom_text function.

subset(participants, participants$haveKids == FALSE) %>%
  group_by(educationLevel, householdSize) %>%
  summarise(n = n()) %>%
  mutate(freq = round(n / sum(n),3)) %>%
  ggplot(aes(fill=householdSize, x=educationLevel, y=freq)) + 
    geom_col() +
    geom_text(aes(label = freq), size = 3, position = position_stack(vjust = 0.5)) +
    labs(x="Education Level", y="Frequency",
         title = "Education Level versus Household Size (No Kids)", fill = "Household Size") +
    theme_minimal()

From the plot above it is evident that a majority of participants with low education level form single households, while the split is more even for the other higher education levels.

Education Level and Joviality

The following chunk code plots box plots using geom_boxplot() of ggplot2 to access whether education levels influence the joviality of participants. Notches are added to the box plot to help visually assess whether the medians of distributions differ. Mean of each group is also added to the plot using stats_summary, and is represented by a red dot.

#  Education level vs joviality
ggplot(data=participants, aes(x=educationLevel, y=joviality)) +
  geom_boxplot(notch=TRUE) +
  stat_summary(geom = "point",
               fun="mean",
               colour="red",
               size=4) +
  labs(y="Joviality", x="Education Level", title="Education Level versus Joviality") +
  theme_minimal()

From the plot it is observed that both median and mean of joviality of the participants increases with higher education levels. The largest difference between mean and median is seen in the low education level group, where the mean is much higher than the median, indicating that this group has outliers who have a higher joviality in comparison to the others in the same education level.

Interest Group Count

The following chunk code plots bar charts using geom_col() of ggplot2 sorted in ascending order of count for interest groups via the reorder() function. This will allow us to identify the least and most popular interest group among the general population. The line depicting mean interest group participant count is plotted in red using geom_hline. The count() function in conjunction with geom_text is used to show the count values on the plot.

participants %>% count(interestGroup) %>% 
  ggplot(aes(x=reorder(interestGroup, n, fun=sum), y=n)) + 
  geom_col(fill="#00BFC4") +
  geom_text(aes(label = n), size = 3, position = position_stack(vjust = 0.5)) +
  geom_hline(aes(yintercept = mean(n), color = "mean")) +
  labs(x="Interest Group", y="Count", title="Interest Group Count Plot", 
       color = "") +
  theme_minimal()

From the chart above it can be seen that the least popular interest group is E and the most popular is J.Interest groups B, D, E and I are also below mean participant count for all interest groups.

Interest Group Impact on Joviality

The following chunk code plots box plots using geom_boxplot() of ggplot2 to study how the joviality of participants differs by interest group. Notches are added to the box plot to help visually assess whether the medians of distributions differ. Mean of each group is also added to the plot using stats_summary, and is represented by a red dot.

ggplot(data=participants, aes(x= reorder(interestGroup, joviality, median), 
                              y=joviality)) +
  geom_boxplot(notch = TRUE) +
  stat_summary(geom = "point",
               fun="mean",
               colour="red",
               size=4) +
  xlab("Interest Group") +
  ylab("Joviality") +
  theme_minimal()

From the above plot, it is observed that participants in interest group H have the lowest median and mean joviality while those in interest group E had the highest median and mean joviality. Interest group G had the largest difference between median and mean, with the mean being much higher. This indicates that there were participants in interest group G that had significantly higher joviality compared to the rest of the participants in that group.

Influence of Kids on Interest Group

The code chunk below plots a bar chart using geom_col() of ggplot2 depicting the proportion of participants with and without kids in each interest group with the intend of studying which interest group applies to each group. The output graph is however not sorted by proportion of interest group due to certain limitations and is an area for future improvement. The group_by, summarise and mutate functions were used to manipulate the data and create a frequency column. Frequency values were shown on the charts via the geom_text function.

participants %>%
  group_by(interestGroup, haveKids) %>%
  summarise(n = n()) %>%
  mutate(freq = round(n / sum(n),3)) %>%
  ggplot(aes(fill=haveKids, x= interestGroup, y=freq)) + 
  geom_col() +
  geom_text(aes(label = freq), size = 3, position = position_stack(vjust = 0.5)) +
  labs(x="Interest Group", y="Frequency",
       title = "Interest Group versus Kids", fill = "Have Kids?") +
  theme_minimal()

From the above output it is observed that interest group D is most appealing to participants who have kids, while interest group A is most appealing to those without kids.

Influence of Age on Interest Group

The following chunk code plots box plots using geom_boxplot() of ggplot2 to study how the appeal of interest groups differs by age. Notches are added to the box plot to help visually assess whether the medians of distributions differ. Mean of each group is also added to the plot using stats_summary, and is represented by a red dot.

ggplot(data=participants, aes(x=reorder(interestGroup, age, median), y=age)) +
  geom_boxplot(notch = TRUE) +
  stat_summary(geom = "point",
               fun="mean",
               colour="red",
               size=4) +
  labs(x="Interest Group", y="Age",
       title = "Interest Group vs Age") +
  theme_minimal()

From the box plot it is seen that interest group E has lowest median age, indicating that it appeals more to the younger participants. Interest group D has the highest median age, indicating that it appeals more to the older participants. Visually it seems that interest group J has the widest appeal across age group, based on it’s wide IQR, while the inverse seems true for interest group H. Further study is needed to acertain this.

Influence of Education Level on Choice of Interest Group

The chunk code below plots a heat map using geom_tile of ggplot2 to show the frequency relationship between education levels and interest group choice. The group_by, summarise and mutate functions were used to manipulate the data and create a frequency column. Frequency values were shown on the charts via the geom_text function. scale_fill_continuous is used to change the colour range of the heat map.

participants %>%
  group_by(educationLevel, interestGroup) %>%
  summarise(n = n()) %>%
  mutate(freq = round(n / sum(n),2)) %>%
  ggplot(aes(y=interestGroup, x=educationLevel, fill=freq)) +
    geom_tile() +
    geom_text(aes(label = freq)) +
    scale_fill_continuous(low = "red", high = "aquamarine") +
    labs(y='Interest Group', x='Education Level', color ='Freq', 
         title = 'Education Level vs Interest Group') +
    theme_minimal()

The heatmap above allows us to see which interest group appeals to participants in different education levels. Participants of low education level had the highest preference for interest group C and G, while having the lowest preference for interest group D. Participants of high school or college education level preferred interest group F and A. Their preference did not differ much across the other interest groups. For participants with a bachelor education level, there was a strong preference for interest group H and J, while a very low preference was observed for interest group E. Graduate participants had a strong preference for interest group G, while showing the lowest preference for interest group A.

Influence of Kids on Joviality

The following chunk code plots box plots using geom_boxplot() of ggplot2 to study how the having kids influences joviality. Notches are added to the box plot to help visually assess whether the medians of distributions differ. Mean of each group is also added to the plot using stats_summary, and is represented by a red dot.

ggplot(data=participants, aes(x=haveKids, y=joviality)) +
  geom_boxplot(notch = TRUE) +
  stat_summary(geom = "point",
               fun="mean",
               colour="red",
               size=4) +
  labs(x="Have Kids", y="Joviality",
       title = "Kids vs Joviality") +
  theme_minimal()

From the output above it is observed that participants with kids have a higher joviality than those who do not.

Household Size Influence on Joviality

The following chunk code plots box plots using geom_boxplot() of ggplot2 to study how household size influences joviality. Notches are added to the box plot to help visually assess whether the medians of distributions differ. Mean of each group is also added to the plot using stats_summary, and is represented by a red dot.

ggplot(participants,aes(x=householdSize, y=joviality)) +
  geom_boxplot(notch = TRUE) +
  stat_summary(geom = "point",
               fun="mean",
               colour="red",
               size=4) +
  labs(x="Household Size", y="Joviality",
       title = "Household Size vs Joviality") +
  theme_minimal()

The above chart shows that participants in single households have the lowest mean and median joviality. Those of participants in household sizes of 2 and 3 are similar.

Conclusion

From the above analysis, it can be concluded that the joviality of participants is influenced by whether they have kids or not, their education levels, size of their households. Choice of interest group is influenced by the participants age, education level and whether they have a kid or not. Education levels also seem to influence the household size of participants and likelihood of them having kids.

Take-home Exercise 1