Reveal the demographic of the city of Engagement, Ohio USA

Take-home Exercise 1

Author

Affiliation

Published

April 22, 2022

DOI

1. Introduction

The main task of this take-home exercise is to reveal the demographic of Ohio in the USA. Through this exercise, I will learn how to use the tidyverse family of packages in R, and how to use the ggplot2 package to draw charts.

2. Data Discription

The data file used for this exercise is Participant.csv.

In this csv file, there are 7 columns of data. They are participantId, householdSize, haveKids, age, educationLevel, interestGroup and joviality. The following are the data type and description of each column of data:

participantId: Participant’s ID, integer. (Counting from 0)
householdSize: Participant’s family size, integer.
haveKids: Whether the participant has children, boolean.
age: Participant’s age, integer.
educationLevel: Participant’s education level, string.
interestGroup: The interest groups the participants are divided into, string (Each group is represented by a capital letter)
joviality: Participants’ happiness index, float. (Decimal between 0 and 1)

3. Data Preparation

Installing and launching R packages

For this exercise, only tidyverse package is required, because both readr sub-package and ggplot2 sub-package are under this package. The R code in the following code chunk is used to install tidyverse package and load it into RStudio environment.

packages <- c('tidyverse', 'patchwork', 'ggthemes')

for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

Importing the dataset

Data import was completed by using read_csv() which is a function in readr package. This function is useful for reading delimited files into a tibble. Here, str(data) is used to show the structure of the dataset.

data <- read_csv("data/Participants.csv")
# Showing the structure of the dataset
str(data)

spec_tbl_df [1,011 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ participantId : num [1:1011] 0 1 2 3 4 5 6 7 8 9 ...
 $ householdSize : num [1:1011] 3 3 3 3 3 3 3 3 3 3 ...
 $ haveKids      : logi [1:1011] TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ age           : num [1:1011] 36 25 35 21 43 32 26 27 20 35 ...
 $ educationLevel: chr [1:1011] "HighSchoolOrCollege" "HighSchoolOrCollege" "HighSchoolOrCollege" "HighSchoolOrCollege" ...
 $ interestGroup : chr [1:1011] "H" "B" "A" "I" ...
 $ joviality     : num [1:1011] 0.00163 0.32809 0.39347 0.13806 0.8574 ...
 - attr(*, "spec")=
  .. cols(
  ..   participantId = col_double(),
  ..   householdSize = col_double(),
  ..   haveKids = col_logical(),
  ..   age = col_double(),
  ..   educationLevel = col_character(),
  ..   interestGroup = col_character(),
  ..   joviality = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

Analyzing data and rebuilding data frame

Analysis of Participants’ age

# Minimum participant's age
youngest <- min(data["age"])
y <- paste0("The youngest participant is ", youngest, " years old.")
y

[1] "The youngest participant is 18 years old."

# Maximum participant's age
oldest <- max(data["age"])
o <- paste0("The oldest participant is ", oldest, " years old.")
o

[1] "The oldest participant is 60 years old."

Analysis of Participants’ family size

# Smallest participant's family size
smallest <- min(data["householdSize"])
s <- paste0("The smallest participant's family size is ", smallest, " .")
s

[1] "The smallest participant's family size is 1 ."

# Largest participant's family size
largest <- max(data["householdSize"])
l <- paste0("The largest participant's family size is ", largest, " .")
l

[1] "The largest participant's family size is 3 ."

Analysis of Participants’ education level

# Participants' education levels
education_levels <- unique(data["educationLevel"])
education_levels

# A tibble: 4 x 1
  educationLevel     
  <chr>              
1 HighSchoolOrCollege
2 Bachelors          
3 Graduate           
4 Low

Analysis of Participants’ interest group

# Participants' interest groups
interest_groups <- unique(data["interestGroup"])
interest_groups

# A tibble: 10 x 1
   interestGroup
   <chr>        
 1 H            
 2 B            
 3 A            
 4 I            
 5 D            
 6 G            
 7 F            
 8 J            
 9 E            
10 C

Rebuild a new data frame according to the results of above analysis.

The purpose of rebuilding the new data frame is to make the subsequent deeper analysis much easier.

The new data frame is built based on different age group. Since the youngest participant is 18 years old and the oldest participant is 60 years old, I divided all the participants into 9 groups, namely 18-20, 21-25, 26-30, 31-35, 36-40, 41-45, 46-50, 51-55 and 56-60.

Considering the generality of the analysis results, all the numerical results in the new table will be filled in with the average value. The new data frame is shown as followed.

dt_frame <- data.frame(age_group=c('18~20', '21~25', '26~30', '31~35', '36~40', '41~45', '46~50', '51~55', '56~60'),
                       cnt=c(c18_20, c21_25, c26_30, c31_35, c36_40, c41_45, c46_50, c51_55, c56_60),
                       avg_happiness=c(h18_20, h21_25, h26_30, h31_35, h36_40, h41_45, h46_50, h51_55, h56_60),
                       cnt_have_kids=c(child18_20, child21_25, child26_30, child31_35, child36_40, child41_45, child46_50, child51_55, child56_60),
                       avg_happiness_kids=c(hk18_20, hk21_25, hk26_30, hk31_35, hk36_40, hk41_45, hk46_50, hk51_55, hk56_60),
                       avg_happiness_college=c(e1_18_20$n, e1_21_25$n, e1_26_30$n, e1_31_35$n, e1_36_40$n, e1_41_45$n, e1_46_50$n, e1_51_55$n, e1_56_60$n),
                       avg_happiness_bachelors=c(e2_18_20$n, e2_21_25$n, e2_26_30$n, e2_31_35$n, e2_36_40$n, e2_41_45$n, e2_46_50$n, e2_51_55$n, e2_56_60$n),
                       avg_happiness_graduate=c(e3_18_20$n, e3_21_25$n, e3_26_30$n, e3_31_35$n, e3_36_40$n, e3_41_45$n, e3_46_50$n, e3_51_55$n, e3_56_60$n),
                       avg_happiness_low=c(e4_18_20$n, e4_21_25$n, e4_26_30$n, e4_31_35$n, e4_36_40$n, e4_41_45$n, e4_46_50$n, e4_51_55$n, e4_56_60$n))

str(dt_frame)

'data.frame':   9 obs. of  9 variables:
 $ age_group              : chr  "18~20" "21~25" "26~30" "31~35" ...
 $ cnt                    : int  72 112 118 125 107 119 126 118 114
 $ avg_happiness          : num  0.486 0.53 0.538 0.492 0.529 ...
 $ cnt_have_kids          : int  26 33 33 44 31 36 34 33 31
 $ avg_happiness_kids     : num  0.509 0.514 0.476 0.492 0.606 ...
 $ avg_happiness_college  : num  0.483 0.531 0.523 0.482 0.474 ...
 $ avg_happiness_bachelors: num  0.444 0.514 0.565 0.471 0.58 ...
 $ avg_happiness_graduate : num  0.583 0.564 0.566 0.54 0.59 ...
 $ avg_happiness_low      : num  0.466 0.523 0.503 0.496 0.648 ...

4. Plotting charts

agePlot <- ggplot(data=dt_frame, aes(x=age_group, y=cnt)) + 
  geom_col(aes(x=dt_frame$age_group, y=dt_frame$cnt), width=0.7, color="skyblue") + 
  geom_text(aes(label=dt_frame$cnt), check_overlap=TRUE, colour="blue", fontface="bold", position=position_stack(vjust = 1.04)) + 
  theme(axis.text.x = element_text(angle = 90)) + 
  ggtitle("Age Distribution")

agePlot

Through observing the histogram chart, it can be found that the age distribution of all the participants is very even, except that the number of people aged 18 to 20 is relatively small due to the age span. The number of participants in all the other age groups are about 115.

kids_plot<- ggplot(data=dt_frame, aes(x=age_group, y=cnt)) +
           geom_point(aes(x=dt_frame$age_group, y=dt_frame$cnt), colour=alpha('red', 1), size=5) + 
           geom_line(aes(x=dt_frame$age_group, y=dt_frame$cnt, group=1), colour=alpha('red', 1), size=2) + 
           geom_point(aes(x=dt_frame$age_group, y=dt_frame$cnt_have_kids), colour=alpha('blue', 1), size=5) +
           geom_line(aes(x=dt_frame$age_group, y=dt_frame$cnt_have_kids, group=1), colour=alpha('blue', 1), size=2) + 
           geom_text(aes(label=dt_frame$cnt),  check_overlap=TRUE, colour="darkgreen", fontface="bold", position=position_stack(vjust = 1.08)) + 
           geom_text(aes(label=paste0(dt_frame$cnt_have_kids, " (", round(dt_frame$cnt_have_kids/dt_frame$cnt, digits=2), ")")),  check_overlap=FALSE, colour="darkgreen", fontface="bold", position=position_stack(vjust = 0.45)) + 
           theme(axis.text.x = element_text(angle = 90)) +
           ggtitle("Kids statistic")
kids_plot

Through observing the line chart, it is not difficult to find that these participants living in Ohio, USA generally do not want children, because the proportion of participants with children in each age group is about 30%, and the distribution is very even. And there is no certain age group whose ratio is particularly high or particularly low.

happiness_plot<- ggplot(data=dt_frame, aes(x=age_group, y=avg_happiness)) +
                 geom_point(aes(x=dt_frame$age_group, y=dt_frame$avg_happiness), colour=alpha('red', 1), size=5) + 
                 geom_line(aes(x=dt_frame$age_group, y=dt_frame$avg_happiness, group=1), colour=alpha('red', 1), size=2) + 
                 geom_point(aes(x=dt_frame$age_group, y=dt_frame$avg_happiness_kids), colour=alpha('blue', 1), size=5) +
                 geom_line(aes(x=dt_frame$age_group, y=dt_frame$avg_happiness_kids, group=1), colour=alpha('blue', 1), size=2) + 
                 geom_text(aes(label=round(dt_frame$avg_happiness, 2)),  check_overlap=TRUE, colour="red", fontface="bold", position=position_stack(vjust = 0.85)) + 
                 geom_text(aes(label=round(dt_frame$avg_happiness_kids, 2)),  check_overlap=FALSE, colour="blue", fontface="bold", position=position_stack(vjust = 1.2)) + 
                 theme(axis.text.x = element_text(angle = 90)) +
                 ggtitle("Happiness - kids")
happiness_plot

Through observing the line graph, it can be found that the influence of children on participants’ happiness index is mainly divided into two stages: 20 to 35 years old and 36 to 55 years old. The average happiness index of participants aged 20 to 35 with children is lower than the average happiness index in the same age group, however, the happiness level of participants aged 36 to 55 with children is higher than the average level in the same age group.

edu_plot<- ggplot(data=dt_frame, aes(x=age_group, y=avg_happiness))+
           
           geom_point(aes(x=dt_frame$age_group, y=dt_frame$avg_happiness), colour=alpha('red', 1), size=5) + 
           geom_line(aes(x=dt_frame$age_group, y=dt_frame$avg_happiness, group=1), colour=alpha('red', 1), size=2) + 
           annotate("text", label = "Average Level", x = 2, y = 0.350, size = 5, colour = "red") +

           geom_point(aes(x=dt_frame$age_group, y=dt_frame$avg_happiness_graduate), colour=alpha('blue', 1), size=5) +
           geom_line(aes(x=dt_frame$age_group, y=dt_frame$avg_happiness_graduate, group=1), colour=alpha('blue', 1), size=2) + 
           annotate("text", label = "Graduate", x = 2, y = 0.325, size = 5, colour = "blue") + 
  
           geom_point(aes(x=dt_frame$age_group, y=dt_frame$avg_happiness_bachelors), colour=alpha('darkgreen', 1), size=5) +
           geom_line(aes(x=dt_frame$age_group, y=dt_frame$avg_happiness_bachelors, group=1), colour=alpha('darkgreen', 1), size=2) + 
           annotate("text", label = "Bachelors", x = 2, y = 0.300, size = 5, colour = "darkgreen") + 
           
           geom_point(aes(x=dt_frame$age_group, y=dt_frame$avg_happiness_college), colour=alpha('orange', 1), size=5) +
           geom_line(aes(x=dt_frame$age_group, y=dt_frame$avg_happiness_college, group=1), colour=alpha('orange', 1), size=2) +
           annotate("text", label = "College", x = 2, y = 0.275, size = 5, colour = "orange") + 

           geom_point(aes(x=dt_frame$age_group, y=dt_frame$avg_happiness_low), colour=alpha('cyan', 1), size=5) +
           geom_line(aes(x=dt_frame$age_group, y=dt_frame$avg_happiness_low, group=1), colour=alpha('cyan', 1), size=2) +
           annotate("text", label = "Low", x = 2, y = 0.250, size = 5, colour = "cyan") + 

           theme(axis.text.x = element_text(angle = 90)) +
           ggtitle("Happiness - education level")

edu_plot

By looking at the graph, it can be seen that there is no significant relationship between the participants’ happiness index and education level.

interest_plot<- ggplot(data=data, aes(x=interestGroup, colour=interestGroup)) + 
            geom_histogram(stat="count", bins=10, fill="skyblue", color="skyblue")

interest_plot

Although, so far, the grouping criteria of each interest group has not been clear yet, it is still easy for us to observe that the number of people in each interest group is still very even.

5. Conclusion

agePlot / happiness_plot |edu_plot

Summary

By analyzing the line graph of the happiness index, it is not difficult to find that children will bring more stress rather than joy to young couples under the age of 35; Having accumulated a certain amount of wealth and professional contacts, children may often bring them more happiness.
By analyzing the education level chart, we cannot currently draw a relationship between happiness index and educational level, because participants with low education like 36 to 40 years old have the highest average happiness index, while for Graduate students between the ages of 45 and 50 had the lowest average happiness scores.

Through above basic statistical analysis, it can be clearly found that the distribution of the data set is very even which means this is a very ideal dataset.