Module 13 – Grouping Data

Module 13 covers grouping data in R with dplyr, a core step in most tidyverse analyses. The video lecture shows how to analyze data by category using the group_by() and summarize() functions, and how to compute summary statistics such as the mean, total, and standard deviation for each group. These per-group statistics help reveal patterns in large data sets. The module closes with exercises that give you hands-on practice with these grouping and summarizing operations.

Video Lecture

Exercises

Import the sleep-data.csv file that was created in Module 12 into R and store it in a tibble named sleep_data.
Group sleep_data by the week_no column and calculate the total hours of sleep for each week. In which week did you sleep more?
Group sleep_data by the days column and compute the mean hours of sleep for each day across both weeks. On which day did you sleep the most?
Create a new column in sleep_data to classify each day as “Weekday” or “Weekend”. Group by this new column and calculate the average sleep hours for weekdays and weekends.

The solutions to these exercises can be found at the bottom of this page.

Data & R Code of This Lecture

You may download the data set used in this lecture here.

# install.packages("tidyverse")                   # Install tidyverse packages
library("tidyverse")                              # Load tidyverse packages
 
my_path <- "D:/Dropbox/Jock/Data Sets/dplyr Course/"  # Specify directory path
 
team_coffee <- read_csv(str_c(my_path,            # Import CSV file
                              "Team-Coffee-Data.csv"))
team_coffee                                       # Print tibble
 
team_coffee %>%                                   # Group & summarize by member
  group_by(member) %>%
  dplyr::summarize(total_cups = sum(cups))
 
team_coffee %>%                                   # Group & summarize by day
  group_by(day) %>%
  dplyr::summarize(mean_cups = mean(cups))
 
team_coffee %>%                                   # Keep order of day column
  mutate(day = fct_inorder(day)) %>%
  group_by(day) %>%
  dplyr::summarize(mean_cups = mean(cups))
 
team_coffee %>%                                   # Multiple summary statistics
  mutate(day = fct_inorder(day)) %>%
  group_by(day) %>%
  dplyr::summarize(mean_cups = mean(cups),
                   total_cups = sum(cups),
                   sd_cups = sd(cups))
 
team_coffee_summary <- team_coffee %>%            # Store output in data object
  mutate(day = fct_inorder(day)) %>%
  group_by(day) %>%
  dplyr::summarize(mean_cups = mean(cups),
                   total_cups = sum(cups),
                   sd_cups = sd(cups))
team_coffee_summary                               # Print summary tibble
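
The "Keep order of day column" step above matters because group_by() followed by summarize() returns character groups in alphabetical order, whereas factor groups follow their level order. The following is a minimal sketch of that difference. It uses a small made-up tibble named toy_coffee (a hypothetical stand-in, not the Team-Coffee-Data.csv file), so the values are illustrative only.

```r
library(tidyverse)

# Hypothetical stand-in for the lecture data; all values are made up
toy_coffee <- tibble(
  member = c("A", "A", "B", "B"),
  day    = c("Thursday", "Friday", "Thursday", "Friday"),
  cups   = c(2, 3, 1, 4)
)

# Character column: groups come back in alphabetical order
alphabetical <- toy_coffee %>%
  group_by(day) %>%
  dplyr::summarize(mean_cups = mean(cups))
alphabetical

# Factor column: groups keep the order in which the days first appear
in_order <- toy_coffee %>%
  mutate(day = fct_inorder(day)) %>%
  group_by(day) %>%
  dplyr::summarize(mean_cups = mean(cups))
in_order
```

In the first result, Friday is listed before Thursday. In the second, Thursday comes first because that is the order in which the days appear in the data.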

Exercise Solutions

Below are our solutions to the exercises of this module. Before starting, we install and load the tidyverse packages. The tidyverse includes dplyr, which provides the group_by() and summarize() functions used in the solutions.

install.packages("tidyverse")                                  # Install tidyverse packages
 
library(tidyverse)                                             # Load tidyverse packages

With the tidyverse packages loaded, we can now proceed to the solutions of the exercises.

Exercise 1: Import the sleep-data.csv file that was created in Module 12 into R and store it in a tibble named sleep_data.

my_path <- "path to sleep-data.csv"                            # Specify directory path
 
sleep_data <- read_csv(str_c(my_path, "sleep-data.csv"))       # Import CSV file
 
sleep_data                                                     # Print to console

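In the above solution, we first specified the path to the directory where sleep-data.csv is stored. Then we joined that path with the file name using the str_c() function inside the read_csv() function to read the CSV file. Because str_c() simply concatenates the two strings, the directory path has to end with a slash.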
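
As an alternative to str_c(), base R's file.path() inserts the path separator itself, so no trailing slash is needed. The sketch below is only an illustration: it writes a small made-up file into a temporary directory and reads it back. The names demo_dir, demo_file, and demo_data, as well as the file sleep-data-demo.csv, are hypothetical and not part of the course material.

```r
library(tidyverse)

# Hypothetical demo directory; replace with your own folder
demo_dir <- tempdir()

# file.path() joins directory and file name with a separator
demo_file <- file.path(demo_dir, "sleep-data-demo.csv")

# Write a tiny made-up data set so that the example is self-contained
write_csv(tibble(week_no = c(1, 2), sleep_hours = c(7, 8)), demo_file)

# Import the CSV file as a tibble
demo_data <- read_csv(demo_file, show_col_types = FALSE)
demo_data
```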

Exercise 2: Group sleep_data by the week_no column and calculate the total hours of sleep for each week. In which week did you sleep more?

sleep_data %>%                                                 # Group & summarize by week number
  group_by(week_no) %>% 
  dplyr::summarize(total_sleep_hours = sum(sleep_hours))

Here we grouped the data by the week_no column and used the summarize() function to create a new column total_sleep_hours, which is the sum of the sleep_hours column within each week.

That way, you can tell the week in which you had the most hours of sleep.
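
If you would rather have R pick out that week than read it off the output, one option is to append slice_max() to the pipeline. The sketch below assumes the column names from the solution above and uses a made-up tibble named toy_sleep, so the numbers do not come from the real sleep-data.csv file.

```r
library(tidyverse)

# Hypothetical sleep data; all values are made up
toy_sleep <- tibble(
  week_no     = c(1, 1, 2, 2),
  days        = c("Saturday", "Sunday", "Saturday", "Sunday"),
  sleep_hours = c(7, 8, 6, 7.5)
)

best_week <- toy_sleep %>%
  group_by(week_no) %>%
  dplyr::summarize(total_sleep_hours = sum(sleep_hours)) %>%
  slice_max(total_sleep_hours, n = 1)   # Keep the row with the largest total
best_week
```

With these made-up values, week 1 has a total of 15 hours and week 2 has 13.5 hours, so best_week contains the row for week 1.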

Exercise 3: Group sleep_data by the days column and compute the mean hours of sleep for each day across both weeks. On which day did you sleep the most?

sleep_data %>%                                                 # Group & summarize by days
  group_by(days) %>% 
  dplyr::summarize(mean_sleep_hours = mean(sleep_hours))

In the solution above, we used the group_by() function to group the data by the days column and used the summarize() function to create a new column mean_sleep_hours, which is the average sleep hours on each day of the week.
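
To see at a glance on which day you slept the most, the summary can be sorted in descending order with arrange() and desc(). This is a small sketch on a made-up tibble named toy_sleep. The column names follow the solution above, while the values are purely illustrative.

```r
library(tidyverse)

# Hypothetical sleep data for three days across two weeks; values are made up
toy_sleep <- tibble(
  week_no     = c(1, 1, 1, 2, 2, 2),
  days        = c("Monday", "Tuesday", "Wednesday",
                  "Monday", "Tuesday", "Wednesday"),
  sleep_hours = c(6, 7, 8, 7, 7.5, 6)
)

ranked_days <- toy_sleep %>%
  group_by(days) %>%
  dplyr::summarize(mean_sleep_hours = mean(sleep_hours)) %>%
  arrange(desc(mean_sleep_hours))   # Day with the highest mean comes first
ranked_days
```

The first row of ranked_days is then the day with the highest mean sleep hours, which is Tuesday for these made-up values.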

Exercise 4: Create a new column in sleep_data to classify each day as “Weekday” or “Weekend”. Group by this new column and calculate the average sleep hours for weekdays and weekends.

sleep_data %>%                                                 # Classify days as either weekend or weekday
  mutate(class = if_else(days == "Sunday" | days == "Saturday", "Weekend", "Weekday")) %>%
  group_by(class) %>%
  dplyr::summarize(average_sleep_hours = mean(sleep_hours))

Here we used the mutate() function to create a new column class which conditionally classifies each day of the week as either a Weekend or a Weekday. Then we grouped the data by the newly created column class and used the summarize() function to create a new column average_sleep_hours, which contains the average hours of sleep recorded on weekends and weekdays.
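
A slightly more compact way to write the same condition is the %in% operator, which checks each day against a vector of weekend names. The sketch below also adds n() to count the rows behind each average, which can be useful because weekdays outnumber weekend days. It runs on a made-up tibble named toy_sleep, and the column n_days is a hypothetical addition that is not part of the original solution.

```r
library(tidyverse)

# Hypothetical sleep data; all values are made up
toy_sleep <- tibble(
  days        = c("Friday", "Saturday", "Sunday", "Monday"),
  sleep_hours = c(6, 9, 8, 7)
)

toy_classes <- toy_sleep %>%
  mutate(class = if_else(days %in% c("Saturday", "Sunday"),
                         "Weekend", "Weekday")) %>%
  group_by(class) %>%
  dplyr::summarize(average_sleep_hours = mean(sleep_hours),
                   n_days = n())   # Number of rows in each group
toy_classes
```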

Solutions to these exercises were created in collaboration with Ifeanyi Idiaye and Cansu Kebabci. Thanks to them for their contribution!

Further Resources

You can access the course overview page, timetable, and table of contents by clicking here.