STA 175 Activity 2

The goal

In the first activity last week, you began to use visualizations to explore the CourseKata data, focusing on the student responses to the end-of-chapter questions. Exploratory data analysis (EDA) is an important step in your workflow; it is crucial for understanding the variables that are available in the data before choosing a hypothesis test or model, and indeed sometimes EDA on its own will answer the research questions that you have.

In today’s activity, you will continue practicing EDA with the CourseKata data. This time, the questions will be a bit more open-ended, to give you more experience with the open-ended nature of DataFest. You will have to decide in your team the most appropriate visualizations to answer today’s research questions!

The data

Today we will use both the checkpoints_eoc dataset from last week, and another dataset from CourseKata called checkpoints_pulse.

Loading the data

Recall from last week that you can download the checkpoints_eoc data with the following:

checkpoints_eoc <-read.csv("https://www.dropbox.com/scl/fi/3xda6hny48ojknoifupul/checkpoints_eoc.csv?rlkey=q6rksum9wc91pkte2qdg31s33&st=fr0rynut&dl=1")

Today, we are going to add a second data set from the same provider. You can download this second data set, checkpoints_pulse, using:

checkpoints_pulse <- read.csv("https://sta175.github.io/class_activities/data/checkpoints_pulse.csv")

The first part of today’s activity will focus on the new checkpoints_pulse data. In the second part, you will combine the two datasets.

Data citation

The data we are working with are provided by CourseKata, creators of an online interactive statistics textbook.

CourseKata provided the data for DataFest 2024, and we are now working with a publicly available version of these data downloaded from https://github.com/coursekata/data

Citation for the Book: Son, J.Y. & Stigler, J.W. (2017-2024). Statistics and data science: A modeling approach. Los Angeles: CourseKata, https://coursekata.org/preview/default/program . Currently available in 7 versions.

New Data Description

The checkpoints_pulse dataset consists of 76848 rows and 8 columns, and the data records student responses to four “pulse” questions which are asked at the start of each chapter (except chapter 1), and which are designed to measure how students feel about the course, material, and textbook. Each question is scored on a scale of 1 (strongly disagree) to 6 (strongly agree). These questions are:

Cost: I was unable to put in the time needed to do well in the previous chapter
Expectancy: I am confident about what I learned in the previous chapter
Intrinsic Value: I think this class is interesting
Utility Value: I think what I have learned in the previous chapter is useful

The columns are:

book - The title of the specific book used by the student. CourseKata releases multiple books, generally each targeting a specific age or skill level
release - The release of the specific book used by the student. Think of this like a version number, as some versions of the textbook are newer than others.
instituion_id - The institution at which the student was enrolled; this is as specific as Wake Forest, but recorded as a number
class_id - The class the student was enrolled in; this is as specific as Dr. Dalzell’s STA 279 class in Spring 2025, but recorded as a number.
student_id - The unique ID number of the student.
chapter_num - The chapter of the book
construct - The question the student is responding to (Cost, Expectancy, Intrinsic Value, Utility Value)
response - The student’s response (an integer between 1 and 6, or NA if missing)

Client Request

Like last time, our client is interested in understanding how students experience and engage with the material in their textbooks. For today, they have provided us with several guiding research questions about the “pulse” questions that they want you to answer:

Do students tend to respond differently to the different “pulse” questions? (e.g., is there a difference between the utility value and the intrinsic utility that students see in the material?)
Does there seem to be a difference in responses across different chapters? The client is unsure whether students will become more or less invested in the material over the course of the semester.
Does the rate of missing responses change across chapters? The client suspects that more students will answer the questions in earlier chapters, and they want you to see if that is true.
What is the relationship between the different constructs? For example, do students who find the material interesting also tend to find it useful?
How are the “pulse” questions related to student performance in the end-of-chapter questions (in the checkpoints_eoc file)?

Research questions 1–3 can be explored using the data in its current state. Research question 4 will require us to reformat the data, and research question 5 will require us to combine the checkpoints_pulse and checkpoints_eoc datasets. We will therefore begin with research questions 1–3, before tackling the others.

Research questions 1–3

In DataFest, it will be up to you to choose appropriate visualizations to answer the research questions from your client. Typically, there is no one right answer, and there are multiple ways you could approach the question. Today, you are going to practice choosing and creating visualizations to answer the client’s first three research questions.

It can feel a little overwhelming when the research questions are fairly open, but you’ve got this! See the activity from last week, the data exploration cheat sheet, and the ggplot help sheet for possible plots you could make. And let us know if you get stuck!

Here are some things you may want to consider as you answer the research questions:

Some variables are treated by R as numeric, but may work better as categorical variables for certain visualizations. You can change variables to categorical in R with as.factor. For example, to make chapter_num categorical:

# Option 1: If you've used tidyverse before
library(tidyverse)

checkpoints_pulse <- checkpoints_pulse |>
  mutate(chapter_num = as.factor(chapter_num))

# Option 2: If you have not used tidyverse before
checkpoints_pulse$chapter_num <- as.factor(checkpoints_pulse$chapter_num)

You may wish to focus on certain chapters or constructs. For instance, you may only want to look at the rows in the data set that relate to Chapter 2. To do this, you want to filter to Chapter 2, which means choose only rows which belong to Chapter 2. You can do this using the following code:

library(tidyverse)

pulse_ch2 <- checkpoints_pulse |>
  filter(chapter_num == 2) # Choose only row relating to Chapter 2

You can identify missing values with is.na. For example, to find the total number of missing responses in the dataset:

sum(is.na(checkpoints_pulse$response))

To answer questions about different books, chapters, or constructs, you may also find it helpful to facet when making a plot, as introduced in last week’s activity

Dividing work between team members

When you have several different research questions you are trying to answer, you may wish to divide them between your different team members. Each team member can tackle a different question, and share their results with the group. If you find anything interesting or unusual about the data, make sure to tell your teammate – it may be important for them to know too!

Questions

Okay, now that we’ve talked about the tips, it’s time to answer the client’s research questions. Remember, there is more than one correct way to answer, so explore a bit.

Question 1

Create at least one figure or visualization to answer Research Question 1 from the client. Use that visualization to try to answer Research Question 1 for the client.

Research Question 1: Do students tend to respond differently to the different “pulse” questions? (e.g., is there a difference between the utility value and the intrinsic utility that students see in the material?)

Question 2

Create at least one figure or visualization to answer Research Question 2 from the client. Use that visualization to try to answer Research Question 2 for the client.

Research Question 2: Does there seem to be a difference in responses across different chapters? The client is unsure whether students will become more or less invested in the material over the course of the semester.

Question 3

Create at least one figure or visualization to answer Research Question 3 from the client. Use that visualization to try to answer Research Question 3 for the client.

Research Question 3: Does the rate of missing responses change across chapters? The client suspects that more students will answer the questions in earlier chapters, and they want you to see if that is true.

Research question 4

Reshaping data

The fourth question from the client asks us to explore the relationship between different constructs. This is usually easiest when we have a column for each construct (cost, expectancy, etc) in the data. Right now, however, we have a single column for construct, with different rows for the different constructs.

In order to answer the fourth research question, it may be helpful to change the shape of the data, by converting the single construct column into separate columns for each construct. In R, this is called pivoting, and the tidyr package provides two pivoting functions: pivot_wider and pivot_longer. We want to make “long” data “wider” by adding columns, so we will use the pivot_wider function.

You will see more data wrangling in a later activity. For now, here is the code that allows us to pivot the data in this way:

pulse_pivot <- checkpoints_pulse |>
  pivot_wider(id_cols = c(book, release, institution_id,
                          class_id, student_id, chapter_num),
              names_from = construct,
              values_from = response,
              values_fn = function(x){mean(x, na.rm=T)})

Questions

Question 4

Create at least one figure or visualization to answer Research Question 4 from the client.

Research Question 4: What is the relationship between the different constructs? For example, do students who find the material interesting also tend to find it useful?

Research question 5

Finally, we need to tackle the fifth research question. The fifth research question is more open-ended than the others, and will also require us to combine the checkpoints_eoc and checkpoints_pulse data.

Joining datasets

Combining multiple datasets is a common task in DataFest, in which you will often receive several different files, recording different but related pieces of information.

If we examine the checkpoints_eoc data, we see that there is one row for each combination of student and chapter. The checkpoints_pulse data, however, has one row for each combination of student, chapter, and construct. Instead, we will use the pulse_pivot dataframe we created above, and combine it with the checkpoints_eoc dataframe.

The two dataframes have different numbers of rows, and the rows are probably not in the same order. So how should we decide how to join them? The crucial point is that each row is identified uniquely by a set of variables: book, release, class, student, and chapter. We will match rows from pulse_pivot to rows from checkpoints_eoc using these variables.

There is also one more wrinkle: questions at the beginning of a chapter really refer to student’s feelings about the previous chapter. To match chapter experiences, we will therefore change the chapter numberings in the “pulse” questions; e.g., “pulse” questions answered in chapter 2 really refer to chapter 1.

Here is the code in R:

checkpoints_combined <- pulse_pivot |>
  mutate(chapter_num = as.numeric(chapter_num) - 1) |>
  inner_join(checkpoints_eoc, join_by(book, release, class_id, student_id, 
                                      chapter_num))

Now if you look at the new checkpoints_combined dataframe, you will see that there are the columns from the pulse_pivot dataframe, AND the columns from checkpoints_eoc.

Questions

The fifth research question (“How are the”pulse” questions related to student performance in the end-of-chapter questions?“) is quite open-ended. Before we try to answer it, we should create more specific questions to answer. For example, we could hypothesize that when students perform poorly on end-of-chapter questions, they will also report lower interest in the course material.

Question 5

In your teams, come up with three specific questions about the combined “pulse” and end-of-chapter questions that allow you to address the client’s question. For each, create at least one visualization.

Research Question 5: How are the “pulse” questions related to student performance in the end-of-chapter questions (in the checkpoints_eoc file)?

Submitting your assignment

For today, everyone is going to submit their own assignment. Knit your Markdown (or Quarto if you prefer) file and submit either a PDF or html on Canvas.

The data referenced in this lab were retrieved from the here on January 10, 2025.

This work was created by Ciaran Evans and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2025 January 21.