STA 175 DataFest Activity 4
The Goal
In the previous activities, we have seen how to perform some of the key steps in exploratory data analysis, including some fundamental data wrangling tools: creating new columns (mutating), choosing subsets of rows and columns (filtering and selecting, respectively), and calculating summary statistics (grouping and summarizing).
In the process, we have mostly focused on working with data from a
single table (in our case, the checkpoints_eoc data). Most
years, however, the data for DataFest involves multiple different files.
The CourseKata data, for example, includes the following tables:
checkpoints_eoc: end-of-chapter assessment informationcheckpoints_pulse: check-in “pulse” questions to see how students feel about the book and materialresponses: student responses to individual questions throughout the chapterspage_views: information about student activity and engagement on each textbook page
When we have information spread across multiple tables, we often need
to combine datasets to effectively answer research questions.
In today’s activity, you will explore information in the
page_views dataset. You will begin by practicing some of
the data wrangling steps from last week on the new
page_views table, then you will combine the
page_views information with the
checkpoints_eoc information and answer questions about the
combined data.
Setup
R packages
For this activity you will need the dplyr,
ggplot2, and tidyr packages. Each of these
packages is part of the tidyverse. If you don’t have
tidyverse installed, go to your R console and install it
first with install.packages("tidyverse") (remember you only
need to do this once).
Once the tidyverse package is installed, load it into R
with
Data
Recall that you can download the checkpoints_eoc data
with the following:
checkpoints_eoc <-read.csv("https://www.dropbox.com/scl/fi/3xda6hny48ojknoifupul/checkpoints_eoc.csv?rlkey=q6rksum9wc91pkte2qdg31s33&st=fr0rynut&dl=1")You can download the page_views data with the
following:
page_views data
Each row in the page_views data represents a student’s
access to a particular page within a chapter, with the following
variables:
instituion_id- The institution at which the student was enrolled; this is as specific as Wake Forest, but recorded as a numberclass_id- The class the student was enrolled in; this is as specific as Dr. Dalzell’s STA 279 class in Spring 2025, but recorded as a number.student_id- The unique ID number of the student.chapter- The name of the chapter in the book. Note that some are main chapters in the book (“Chapter 1 - Welcome to Statistics: A Modeling Approach”, e.g.), while other “chapters” are really short introductory or transition sections (“Getting Started (Don’t Skip This Part)”, e.g.)page- The number of the page the student is accessingdt_accessed- The date and time the page was accessed (in UTC)was_complete- if set to TRUE, this indicates the page was already completed when the user accessed it. A “complete” page is one in which all of the items/questions have been answeredengaged- The amount of time (in milliseconds) that the student was engaged (i.e., doing something) on the pagetried_again_dt- On each page, the students are able to “try again”: reset their answers to the questions on the page and start again. If they try again, thentried_again_dtis the date and time of the latest attempt. If they did not try again, this column will be blanktried_again_clicks- how many times a student reset their answers on a particular page. Note thattried_again_dtandtried_again_clicksare reported at the page level and NOT the row level
There are also a few other variables (idle_brief,
idle_long, off_page_brief,
off_page_long, and trace) for which the
documentation is sparse. This often happens at DataFest! From the
variable names, we can guess that these variables have to do with the
amount of time the student spent on the page but not engaged (idle), and
with a different tab open on their browser (off page), but exactly how
they were measured is unclear.
Activity Part 1
Before combining the page_views and
checkpoints_eoc data, we need to explore them individually!
You’ve already performed EDA on the checkpoints_eoc data,
but we haven’t seen the page_views data yet. In the first
part of the activity, you will practice your data wrangling and
visualization skills from previous activities with the
page_views data.
Measuring total student engagement
Right now, we have a row for each page, for each student. Without knowing what each page in the book represents, however, we might not know how best to use information at the page level.
Instead, let’s begin by measuring engagement at the chapter level.
Question 1
Create a new dataframe from page_views called
engagement_summary, which summarizes the total engagement
and the total number of “tried again” clicks across the different
students and chapters. Each row in the summarized data should represent
a student-chapter combination (as opposed to a student-chapter-page
combination in the original page_views data).
Note: Many of the values in the
engagement column are missing (NA). You can
ignore these when you sum with the na.rm=T argument.
Question 2
Create a plot showing the distribution of the total number of “tried
again” clicks in the engagement_summary dataframe you
created in question 1.
Question 3
The client is curious whether student engagement changes over the course of the book.
Create a plot showing the distribution of total engagement for each chapter. Do you notice a difference for later chapters in the book?
Getting chapter numbers
In your plot in question 3, you might notice that the chapter names
are really long! Whereas the checkpoint_eoc dataset
contains a chapter_num column that records chapter number,
we have the full chapter names in the page_views data.
How do we get chapter numbers from the chapter names? We will do this
in two steps. First, we need to remove the short introductory/transition
chapters like “Getting Started (Don’t Skip This Part)”, keeping only the
chapters with names like “Chapter 1 - …”. In R, we can do this with the
startsWith function:
## [1] FALSE
## [1] TRUE
Then, we will extract the chapter number from the
chapter name with the str_extract function from the
stringr package:
## [1] "1"
The "(?<=Chapter )\\d+" in the code above means “find
a number (\\d+) which comes after the word
Chapter”.
Question 4
Fill in the following code to keep only the chapters of interest, and extract chapter numbers:
engagement_summary <- engagement_summary |>
...(startsWith(..., "Chapter")) |>
...(chapter_num = as.numeric(str_extract(..., "(?<=Chapter )\\d+")))Hints:
- Which function from last week allows us to choose rows?
- Which function from last week allows us to create a new column?
Question 5
Remake your plot from question 3, this time using the modified
engagement_summary data from question 4 and chapter numbers
instead of chapter names.
Activity Part II
Now that we’ve explored the page_views data, we would
like to combine it with the checkpoints_eoc data. Combining
these two dataframes will allow us to answer questions about variables
from each dataset – for example, we could look at the relationship
between student engagement and performance on the end-of-chapter
assessments!
Combining datasets in this way is called joining. Here we will explain how joining works, and then you will use it on the CourseKata data.
Joining datasets: Example 1
Suppose you work for a local healthcare network. You have information
on the doctors in your network and the patients who see them. This
information is stored in two different tables, called
doctors and patients. Here are the first few
rows of each table:
## provider location
## 1 Dr. Foyle Greensboro
## 2 Dr. Renard Winston-Salem
## 3 Dr. Zhang Winston-Salem
## age insurance provider
## 1 23 Aetna Dr. Zhang
## 2 47 BCBS Dr. Foyle
## 3 38 Medicaid Dr. Zhang
We notice that these tables contain some overlapping information (the
provider), but also some different information (the
patients table contains the patient’s age and insurance,
whereas the doctors table contains the doctor’s
location).
Linking the tables
Now suppose we want to know where each patient goes to visit their
doctor. We can’t find this information in the patients
table, but we can find if we also use the doctors
table: we find each patient’s provider, then look up that provider’s
location in doctors.
The provider column therefore provides a key that
links these two tables together. We can use the
provider column to join the two
datasets:
## age insurance provider location
## 1 23 Aetna Dr. Zhang Winston-Salem
## 2 47 BCBS Dr. Foyle Greensboro
## 3 38 Medicaid Dr. Zhang Winston-Salem
A left join means that we are keeping all the rows
from the left-hand table (patients), and joining any
matching rows from the right-hand table (doctors). You will
notice that Dr. Renard does not appear in our joined dataset, because no
patients have Dr. Renard listed as a provider.
Joining datasets: Example 2
Now suppose we also have information about the insurance companies accepted by our network:
## age insurance provider
## 1 23 Aetna Dr. Zhang
## 2 47 BCBS Dr. Foyle
## 3 38 Medicaid Dr. Zhang
## 4 54 Humana Dr. Renard
## company phone
## 1 Anthem 800-676-2583
## 2 BCBS 877-258-3334
## 3 Kaiser 800-810-4766
## 4 Medicaid 877-201-3750
Here the information that links the two datasets is the insurance
company. Note, though, that the insurance company has a different name
in the two datasets. We can match these two names in the
join_by function:
## age insurance provider phone
## 1 23 Aetna Dr. Zhang <NA>
## 2 47 BCBS Dr. Foyle 877-258-3334
## 3 38 Medicaid Dr. Zhang 877-201-3750
## 4 54 Humana Dr. Renard <NA>
This time we have two NA values for the
phone column in the joined data. Why? Because two of the
insurance companies in the patients table don’t appear in
the insurance table. The left join keeps all the rows in
the patients table, and joins what it can from the
insurance table.
What if we only want to keep the rows that match in both data frames? We can use an inner join instead:
## age insurance provider phone
## 1 47 BCBS Dr. Foyle 877-258-3334
## 2 38 Medicaid Dr. Zhang 877-201-3750
Joining engagement and end-of-chapter performance
Now let’s apply this tool to combine our
engagement_summary and checkpoints_eoc
dataframes.
Question 6
Modify the example joining code above to join the
checkpoints_eoc and engagement_summary
dataframes, and save the joined data as a dataframe called
combined_data.
And now that we have our combined data, we can examine the relationships between variables.
Question 7
The client would like to know if students who engage more with a
chapter do better on the end-of-chapter assessment. Using the
combined_data table that you made in question 6, make a
plot to explore this question, and write a summary of your conclusions
for the client.
Question 8
The client would also like to know if students who “try again” on
questions during the chapter do better on the end-of-chapter assessment.
Using the combined_data table that you made in question 6,
make a plot to explore this question, and write a summary of your
conclusions for the client.
Data citation
The data we are working with are provided by CourseKata, creators of an online interactive statistics textbook.
CourseKata provided the data for DataFest 2024, and we are now working with a publicly available version of these data downloaded from https://github.com/coursekata/data
- Citation for the Book: Son, J.Y. & Stigler, J.W. (2017-2024). Statistics and data science: A modeling approach. Los Angeles: CourseKata, https://coursekata.org/preview/default/program . Currently available in 7 versions.