Take-home Exercise 1
The main task of this take-home exercise is to reveal the demographic of Ohio in the USA. Through this exercise, I will learn how to use the tidyverse family of packages in R, and how to use the ggplot2 package to draw charts.
The data file used for this exercise is Participant.csv.
In this csv file, there are 7 columns of data. They are participantId, householdSize, haveKids, age, educationLevel, interestGroup and joviality. The following are the data type and description of each column of data:
For this exercise, only tidyverse package is required, because both readr sub-package and ggplot2 sub-package are under this package. The R code in the following code chunk is used to install tidyverse package and load it into RStudio environment.
packages <- c('tidyverse', 'patchwork', 'ggthemes')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
Data import was completed by using read_csv() which is a function in
readr package. This function is useful for reading
delimited files into a tibble. Here, str(data) is used to
show the structure of the dataset.
data <- read_csv("data/Participants.csv")
# Showing the structure of the dataset
str(data)
spec_tbl_df [1,011 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ participantId : num [1:1011] 0 1 2 3 4 5 6 7 8 9 ...
$ householdSize : num [1:1011] 3 3 3 3 3 3 3 3 3 3 ...
$ haveKids : logi [1:1011] TRUE TRUE TRUE TRUE TRUE TRUE ...
$ age : num [1:1011] 36 25 35 21 43 32 26 27 20 35 ...
$ educationLevel: chr [1:1011] "HighSchoolOrCollege" "HighSchoolOrCollege" "HighSchoolOrCollege" "HighSchoolOrCollege" ...
$ interestGroup : chr [1:1011] "H" "B" "A" "I" ...
$ joviality : num [1:1011] 0.00163 0.32809 0.39347 0.13806 0.8574 ...
- attr(*, "spec")=
.. cols(
.. participantId = col_double(),
.. householdSize = col_double(),
.. haveKids = col_logical(),
.. age = col_double(),
.. educationLevel = col_character(),
.. interestGroup = col_character(),
.. joviality = col_double()
.. )
- attr(*, "problems")=<externalptr>
# Minimum participant's age
youngest <- min(data["age"])
y <- paste0("The youngest participant is ", youngest, " years old.")
y
[1] "The youngest participant is 18 years old."
# Maximum participant's age
oldest <- max(data["age"])
o <- paste0("The oldest participant is ", oldest, " years old.")
o
[1] "The oldest participant is 60 years old."
# Smallest participant's family size
smallest <- min(data["householdSize"])
s <- paste0("The smallest participant's family size is ", smallest, " .")
s
[1] "The smallest participant's family size is 1 ."
# Largest participant's family size
largest <- max(data["householdSize"])
l <- paste0("The largest participant's family size is ", largest, " .")
l
[1] "The largest participant's family size is 3 ."
# Participants' education levels
education_levels <- unique(data["educationLevel"])
education_levels
# A tibble: 4 x 1
educationLevel
<chr>
1 HighSchoolOrCollege
2 Bachelors
3 Graduate
4 Low
# Participants' interest groups
interest_groups <- unique(data["interestGroup"])
interest_groups
# A tibble: 10 x 1
interestGroup
<chr>
1 H
2 B
3 A
4 I
5 D
6 G
7 F
8 J
9 E
10 C
The purpose of rebuilding the new data frame is to make the subsequent deeper analysis much easier.
The new data frame is built based on different age group. Since the youngest participant is 18 years old and the oldest participant is 60 years old, I divided all the participants into 9 groups, namely 18-20, 21-25, 26-30, 31-35, 36-40, 41-45, 46-50, 51-55 and 56-60.
Considering the generality of the analysis results, all the numerical results in the new table will be filled in with the average value. The new data frame is shown as followed.
dt_frame <- data.frame(age_group=c('18~20', '21~25', '26~30', '31~35', '36~40', '41~45', '46~50', '51~55', '56~60'),
cnt=c(c18_20, c21_25, c26_30, c31_35, c36_40, c41_45, c46_50, c51_55, c56_60),
avg_happiness=c(h18_20, h21_25, h26_30, h31_35, h36_40, h41_45, h46_50, h51_55, h56_60),
cnt_have_kids=c(child18_20, child21_25, child26_30, child31_35, child36_40, child41_45, child46_50, child51_55, child56_60),
avg_happiness_kids=c(hk18_20, hk21_25, hk26_30, hk31_35, hk36_40, hk41_45, hk46_50, hk51_55, hk56_60),
avg_happiness_college=c(e1_18_20$n, e1_21_25$n, e1_26_30$n, e1_31_35$n, e1_36_40$n, e1_41_45$n, e1_46_50$n, e1_51_55$n, e1_56_60$n),
avg_happiness_bachelors=c(e2_18_20$n, e2_21_25$n, e2_26_30$n, e2_31_35$n, e2_36_40$n, e2_41_45$n, e2_46_50$n, e2_51_55$n, e2_56_60$n),
avg_happiness_graduate=c(e3_18_20$n, e3_21_25$n, e3_26_30$n, e3_31_35$n, e3_36_40$n, e3_41_45$n, e3_46_50$n, e3_51_55$n, e3_56_60$n),
avg_happiness_low=c(e4_18_20$n, e4_21_25$n, e4_26_30$n, e4_31_35$n, e4_36_40$n, e4_41_45$n, e4_46_50$n, e4_51_55$n, e4_56_60$n))
str(dt_frame)
'data.frame': 9 obs. of 9 variables:
$ age_group : chr "18~20" "21~25" "26~30" "31~35" ...
$ cnt : int 72 112 118 125 107 119 126 118 114
$ avg_happiness : num 0.486 0.53 0.538 0.492 0.529 ...
$ cnt_have_kids : int 26 33 33 44 31 36 34 33 31
$ avg_happiness_kids : num 0.509 0.514 0.476 0.492 0.606 ...
$ avg_happiness_college : num 0.483 0.531 0.523 0.482 0.474 ...
$ avg_happiness_bachelors: num 0.444 0.514 0.565 0.471 0.58 ...
$ avg_happiness_graduate : num 0.583 0.564 0.566 0.54 0.59 ...
$ avg_happiness_low : num 0.466 0.523 0.503 0.496 0.648 ...
agePlot <- ggplot(data=dt_frame, aes(x=age_group, y=cnt)) +
geom_col(aes(x=dt_frame$age_group, y=dt_frame$cnt), width=0.7, color="skyblue") +
geom_text(aes(label=dt_frame$cnt), check_overlap=TRUE, colour="blue", fontface="bold", position=position_stack(vjust = 1.04)) +
theme(axis.text.x = element_text(angle = 90)) +
ggtitle("Age Distribution")
agePlot
Through observing the histogram chart, it can be found that the age distribution of all the participants is very even, except that the number of people aged 18 to 20 is relatively small due to the age span. The number of participants in all the other age groups are about 115.
kids_plot<- ggplot(data=dt_frame, aes(x=age_group, y=cnt)) +
geom_point(aes(x=dt_frame$age_group, y=dt_frame$cnt), colour=alpha('red', 1), size=5) +
geom_line(aes(x=dt_frame$age_group, y=dt_frame$cnt, group=1), colour=alpha('red', 1), size=2) +
geom_point(aes(x=dt_frame$age_group, y=dt_frame$cnt_have_kids), colour=alpha('blue', 1), size=5) +
geom_line(aes(x=dt_frame$age_group, y=dt_frame$cnt_have_kids, group=1), colour=alpha('blue', 1), size=2) +
geom_text(aes(label=dt_frame$cnt), check_overlap=TRUE, colour="darkgreen", fontface="bold", position=position_stack(vjust = 1.08)) +
geom_text(aes(label=paste0(dt_frame$cnt_have_kids, " (", round(dt_frame$cnt_have_kids/dt_frame$cnt, digits=2), ")")), check_overlap=FALSE, colour="darkgreen", fontface="bold", position=position_stack(vjust = 0.45)) +
theme(axis.text.x = element_text(angle = 90)) +
ggtitle("Kids statistic")
kids_plot
Through observing the line chart, it is not difficult to find that these participants living in Ohio, USA generally do not want children, because the proportion of participants with children in each age group is about 30%, and the distribution is very even. And there is no certain age group whose ratio is particularly high or particularly low.
happiness_plot<- ggplot(data=dt_frame, aes(x=age_group, y=avg_happiness)) +
geom_point(aes(x=dt_frame$age_group, y=dt_frame$avg_happiness), colour=alpha('red', 1), size=5) +
geom_line(aes(x=dt_frame$age_group, y=dt_frame$avg_happiness, group=1), colour=alpha('red', 1), size=2) +
geom_point(aes(x=dt_frame$age_group, y=dt_frame$avg_happiness_kids), colour=alpha('blue', 1), size=5) +
geom_line(aes(x=dt_frame$age_group, y=dt_frame$avg_happiness_kids, group=1), colour=alpha('blue', 1), size=2) +
geom_text(aes(label=round(dt_frame$avg_happiness, 2)), check_overlap=TRUE, colour="red", fontface="bold", position=position_stack(vjust = 0.85)) +
geom_text(aes(label=round(dt_frame$avg_happiness_kids, 2)), check_overlap=FALSE, colour="blue", fontface="bold", position=position_stack(vjust = 1.2)) +
theme(axis.text.x = element_text(angle = 90)) +
ggtitle("Happiness - kids")
happiness_plot
Through observing the line graph, it can be found that the influence of children on participants’ happiness index is mainly divided into two stages: 20 to 35 years old and 36 to 55 years old. The average happiness index of participants aged 20 to 35 with children is lower than the average happiness index in the same age group, however, the happiness level of participants aged 36 to 55 with children is higher than the average level in the same age group.
edu_plot<- ggplot(data=dt_frame, aes(x=age_group, y=avg_happiness))+
geom_point(aes(x=dt_frame$age_group, y=dt_frame$avg_happiness), colour=alpha('red', 1), size=5) +
geom_line(aes(x=dt_frame$age_group, y=dt_frame$avg_happiness, group=1), colour=alpha('red', 1), size=2) +
annotate("text", label = "Average Level", x = 2, y = 0.350, size = 5, colour = "red") +
geom_point(aes(x=dt_frame$age_group, y=dt_frame$avg_happiness_graduate), colour=alpha('blue', 1), size=5) +
geom_line(aes(x=dt_frame$age_group, y=dt_frame$avg_happiness_graduate, group=1), colour=alpha('blue', 1), size=2) +
annotate("text", label = "Graduate", x = 2, y = 0.325, size = 5, colour = "blue") +
geom_point(aes(x=dt_frame$age_group, y=dt_frame$avg_happiness_bachelors), colour=alpha('darkgreen', 1), size=5) +
geom_line(aes(x=dt_frame$age_group, y=dt_frame$avg_happiness_bachelors, group=1), colour=alpha('darkgreen', 1), size=2) +
annotate("text", label = "Bachelors", x = 2, y = 0.300, size = 5, colour = "darkgreen") +
geom_point(aes(x=dt_frame$age_group, y=dt_frame$avg_happiness_college), colour=alpha('orange', 1), size=5) +
geom_line(aes(x=dt_frame$age_group, y=dt_frame$avg_happiness_college, group=1), colour=alpha('orange', 1), size=2) +
annotate("text", label = "College", x = 2, y = 0.275, size = 5, colour = "orange") +
geom_point(aes(x=dt_frame$age_group, y=dt_frame$avg_happiness_low), colour=alpha('cyan', 1), size=5) +
geom_line(aes(x=dt_frame$age_group, y=dt_frame$avg_happiness_low, group=1), colour=alpha('cyan', 1), size=2) +
annotate("text", label = "Low", x = 2, y = 0.250, size = 5, colour = "cyan") +
theme(axis.text.x = element_text(angle = 90)) +
ggtitle("Happiness - education level")
edu_plot
By looking at the graph, it can be seen that there is no significant relationship between the participants’ happiness index and education level.
interest_plot<- ggplot(data=data, aes(x=interestGroup, colour=interestGroup)) +
geom_histogram(stat="count", bins=10, fill="skyblue", color="skyblue")
interest_plot
Although, so far, the grouping criteria of each interest group has not been clear yet, it is still easy for us to observe that the number of people in each interest group is still very even.
agePlot / happiness_plot |edu_plot
Through above basic statistical analysis, it can be clearly found that the distribution of the data set is very even which means this is a very ideal dataset.