STA 175 DataFest Activity 6
Data
This week, we will be working again with the
checkpoints_eoc data and summaries of the
page_views data. The following code from last time will
load the checkpoints_eoc data, and will load the
page_views data and create a data frame called
engagement_summary which summarizes student engagement on
each chapter.
checkpoints_eoc <-read.csv("https://www.dropbox.com/scl/fi/3xda6hny48ojknoifupul/checkpoints_eoc.csv?rlkey=q6rksum9wc91pkte2qdg31s33&st=fr0rynut&dl=1")
page_views <- read.csv("https://www.dropbox.com/scl/fi/85mvej8xfr5z5fpr02sg4/page_views.csv?rlkey=lvnrp6956unq6jikgtfarixib&st=ptbqo2w4&dl=1")
engagement_summary <- page_views |>
group_by(student_id, chapter) |>
summarize( TotalEngagement = sum(engaged,na.rm = TRUE),
TotalClicks = sum(tried_again_clicks,na.rm = TRUE)) |>
filter(startsWith(chapter, "Chapter")) |>
mutate(chapter_num = as.numeric(str_extract(chapter, "(?<=Chapter )\\d+")))
rm(page_views)Activity Part I: Joining
In previous activities, you have explored the page_views
data and combined it with the checkpoints_eoc data. For
Activity 5, we provided the code used to combined these two datasets.
Combining datasets in this way is called joining. Here we will
explain how joining works, and then you will use it on the CourseKata
data.
Joining datasets: Example 1
Suppose you work for a local healthcare network. You have information
on the doctors in your network and the patients who see them. This
information is stored in two different tables, called
doctors and patients. Here are the first few
rows of each table:
## provider location
## 1 Dr. Foyle Greensboro
## 2 Dr. Renard Winston-Salem
## 3 Dr. Zhang Winston-Salem
## age insurance provider
## 1 23 Aetna Dr. Zhang
## 2 47 BCBS Dr. Foyle
## 3 38 Medicaid Dr. Zhang
We notice that these tables contain some overlapping information (the
provider), but also some different information (the
patients table contains the patient’s age and insurance,
whereas the doctors table contains the doctor’s
location).
Linking the tables
Now suppose we want to know where each patient goes to visit their
doctor. We can’t find this information in the patients
table, but we can find if we also use the doctors
table: we find each patient’s provider, then look up that provider’s
location in doctors.
The provider column therefore provides a key that
links these two tables together. We can use the
provider column to join the two
datasets:
## age insurance provider location
## 1 23 Aetna Dr. Zhang Winston-Salem
## 2 47 BCBS Dr. Foyle Greensboro
## 3 38 Medicaid Dr. Zhang Winston-Salem
A left join means that we are keeping all the rows
from the left-hand table (patients), and joining any
matching rows from the right-hand table (doctors). You will
notice that Dr. Renard does not appear in our joined dataset, because no
patients have Dr. Renard listed as a provider.
Joining datasets: Example 2
Now suppose we also have information about the insurance companies accepted by our network:
## age insurance provider
## 1 23 Aetna Dr. Zhang
## 2 47 BCBS Dr. Foyle
## 3 38 Medicaid Dr. Zhang
## 4 54 Humana Dr. Renard
## company phone
## 1 Anthem 800-676-2583
## 2 BCBS 877-258-3334
## 3 Kaiser 800-810-4766
## 4 Medicaid 877-201-3750
Here the information that links the two datasets is the insurance
company. Note, though, that the insurance company has a different name
in the two datasets. We can match these two names in the
join_by function:
## age insurance provider phone
## 1 23 Aetna Dr. Zhang <NA>
## 2 47 BCBS Dr. Foyle 877-258-3334
## 3 38 Medicaid Dr. Zhang 877-201-3750
## 4 54 Humana Dr. Renard <NA>
This time we have two NA values for the
phone column in the joined data. Why? Because two of the
insurance companies in the patients table don’t appear in
the insurance table. The left join keeps all the rows in
the patients table, and joins what it can from the
insurance table.
What if we only want to keep the rows that match in both data frames? We can use an inner join instead:
## age insurance provider phone
## 1 47 BCBS Dr. Foyle 877-258-3334
## 2 38 Medicaid Dr. Zhang 877-201-3750
Joining engagement and end-of-chapter performance
Now let’s apply this tool to combine our
engagement_summary and checkpoints_eoc
dataframes.
Question 1
Modify the example joining code above to join the
checkpoints_eoc and engagement_summary
dataframes, and save the joined data as a dataframe called
combined_data.
Adding pulse questions
One of the other datasets available in the CourseKata data release is
the checkpoints_pulse data. The
checkpoints_pulse dataset consists of 76848 rows and 8
columns, and the data records student responses to four “pulse”
questions which are asked at the start of each chapter (except chapter
1), and which are designed to measure how students feel about the
course, material, and textbook. Each question is scored on a scale of 1
(strongly disagree) to 6 (strongly agree). These questions are:
- Cost: I was unable to put in the time needed to do well in the previous chapter
- Expectancy: I am confident about what I learned in the previous chapter
- Intrinsic Value: I think this class is interesting
- Utility Value: I think what I have learned in the previous chapter is useful
The columns are:
book- The title of the specific book used by the student. CourseKata releases multiple books, generally each targeting a specific age or skill levelrelease- The release of the specific book used by the student. Think of this like a version number, as some versions of the textbook are newer than others.instituion_id- The institution at which the student was enrolled; this is as specific as Wake Forest, but recorded as a numberclass_id- The class the student was enrolled in; this is as specific as Dr. Dalzell’s STA 279 class in Spring 2025, but recorded as a number.student_id- The unique ID number of the student.chapter_num- The chapter of the bookconstruct- The question the student is responding to (Cost, Expectancy, Intrinsic Value, Utility Value)response- The student’s response (an integer between 1 and 6, orNAif missing)
The following code will read in the data and reshape it so it can be
combined with the checkpoints_eoc data and the
page_views data.
checkpoints_pulse <- read.csv("https://sta175.github.io/class_activities/data/checkpoints_pulse.csv")
checkpoints_pulse <- checkpoints_pulse |>
mutate(chapter_num = chapter_num - 1) |>
pivot_wider(id_cols = c(book, release, institution_id,
class_id, student_id, chapter_num),
names_from = construct,
values_from = response,
values_fn = function(x){mean(x, na.rm=T)})Question 2
Now join your the checkpoints_pulse data with the
combined_data that you created above. Each row will now
contain information about the end-of-chapter scores (from the original
checkpoints_eoc data), student engagement and clicks (from
the summary of the page_views data), and student feelings
about the chapter material (from the checkpoints_pulse
data).
Question 3
The client would like to know if students who see more intrinsic value in the chapter material have higher EOC scores. Using the merged data from question 2, make a plot to explore this question, and write a summary of your conclusions for the client. Note that the pulse questions, such as Intrinsic Value, only take a few discrete values, and so it may be helpful to code them as categorical.
Question 4
The client would like to know if students who were unable to put in
the necessary time (high Cost) have lower EOC scores. Using
the merged data from question 2, make a plot to explore this question,
and write a summary of your conclusions for the client. Note that the
pulse questions, such as Cost, only take a few discrete
values, and so it may be helpful to code them as categorical.
Activity Part II: Further visualizations
Another option is to consider visualizations at the chapter level. For example, the following code plots the mean EOC score for each chapter. Since there is a time ordering to the chapters – students complete the chapters in order – we can connect the points with a line to see a trend over time. This is called a line graph.
combined_data |>
group_by(chapter_num) |>
summarize(mean_eoc = mean(eoc)) |>
ggplot(aes(x = chapter_num,
y = mean_eoc)) +
geom_point() +
geom_line() +
labs(x = "Chapter", y = "Average EOC score") +
theme_bw()Question 5
Adapting the code above, create line graphs to visualize the trend in Total Engagement and Total Clicks over the chapters. What do you notice? Are there similar patterns?
Question 6
Now make line graphs to visualize the trends in the pulse
measurements of Intrinsic Value and Cost over
the chapters. EOC scores show a decreasing trend, suggesting students
may find the material more challenging in later chapters. Do you see
similar patterns for the pulse questions?
Activity Part III: Regression
In the previous activity, you explored linear regression to model end
of chapter scores with total engagement and clicks. Now let’s add the
checkpoints_pulse information to that model.
Question 7
Fit a regression model with response variable \(Y =\) EOC score, and \(X_1 =\) total engagement, \(X_2 =\) total tried again clicks, \(X_3 =\) Cost, and \(X_4 =\) Intrinsic Value,
controlling for chapter and textbook. Call this
model_all_combined.
Is model_all_combined a better fit to the data than the
model without Cost and Intrinsic Value? Explain.
Checking assumptions
Linear regression works best when some assumptions about the relationship between our predictor(s) and response are satisfied. In particular, we often make the following assumptions:
- Shape: the true relationship is approximately linear
- Constant variance: the variance of the residuals is the same for different values of the predictor(s)
- Normality: the residuals are approximately normal
As you may recall from STA 112, we can assess these assumptions with diagnostic plots, like residual plots and QQ plots.
Question 8
Create diagnostic plots for the model you chose in question 7. Do the assumptions look reasonable?
Question 9
Based on the final model you have chosen, describe the relationship between engagement and EOC scores.
Data citation
The data we are working with are provided by CourseKata, creators of an online interactive statistics textbook.
CourseKata provided the data for DataFest 2024, and we are now working with a publicly available version of these data downloaded from https://github.com/coursekata/data
- Citation for the Book: Son, J.Y. & Stigler, J.W. (2017-2024). Statistics and data science: A modeling approach. Los Angeles: CourseKata, https://coursekata.org/preview/default/program . Currently available in 7 versions.