Module 13 – Grouping Data
Module 13 focuses on grouping data in R, a key technique for data analysis with dplyr and the wider tidyverse. This module’s video lecture shows how to analyze data by group categories using the group_by() and summarize() functions. It demonstrates how to compute summary statistics such as the mean, total, and standard deviation for different data groups, which is essential for revealing insights and patterns in large data sets. To reinforce these concepts, the module concludes with exercises that provide practical experience in applying these grouping and summarizing operations.
Video Lecture
Exercises
- Import the sleep-data.csv file that was created in Module 12 into R and store it in a tibble named sleep_data.
- Group sleep_data by the week_no column and calculate the total hours of sleep for each week. In which week did you sleep more?
- Group sleep_data by the day column and compute the mean hours of sleep for each day across both weeks. On which day did you sleep the most?
- Create a new column in sleep_data to classify each day as “Weekday” or “Weekend”. Group by this new column and calculate the average sleep hours for weekdays and weekends.
The solutions to these exercises can be found at the bottom of this page.
Data & R Code of This Lecture
You may download the data set used in this lecture here.
# install.packages("tidyverse")                       # Install tidyverse packages
library("tidyverse")                                  # Load tidyverse packages

my_path <- "D:/Dropbox/Jock/Data Sets/dplyr Course/"  # Specify directory path

team_coffee <- read_csv(str_c(my_path,                # Import CSV file
                              "Team-Coffee-Data.csv"))
team_coffee                                           # Print tibble

team_coffee %>%                                       # Group & summarize by member
  group_by(member) %>%
  dplyr::summarize(total_cups = sum(cups))

team_coffee %>%                                       # Group & summarize by day
  group_by(day) %>%
  dplyr::summarize(mean_cups = mean(cups))

team_coffee %>%                                       # Keep order of day column
  mutate(day = fct_inorder(day)) %>%
  group_by(day) %>%
  dplyr::summarize(mean_cups = mean(cups))

team_coffee %>%                                       # Multiple summary statistics
  mutate(day = fct_inorder(day)) %>%
  group_by(day) %>%
  dplyr::summarize(mean_cups = mean(cups),
                   total_cups = sum(cups),
                   sd_cups = sd(cups))

team_coffee_summary <- team_coffee %>%                # Store output in data object
  mutate(day = fct_inorder(day)) %>%
  group_by(day) %>%
  dplyr::summarize(mean_cups = mean(cups),
                   total_cups = sum(cups),
                   sd_cups = sd(cups))
team_coffee_summary                                   # Print summary tibble
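The lecture code depends on a CSV file stored at a local path. To try the same grouping pattern without that file, the following minimal sketch builds a small tibble by hand. The object names coffee_demo and coffee_demo_summary, the member names, and all values are made up for illustration and are not the lecture data.

```r
library(dplyr)    # Also attached by library("tidyverse")
library(forcats)  # Provides fct_inorder()

# Hypothetical stand-in for the lecture data (values are invented)
coffee_demo <- tibble(
  member = rep(c("Anna", "Ben"), each = 3),
  day    = rep(c("Monday", "Tuesday", "Friday"), times = 2),
  cups   = c(2, 3, 1, 4, 1, 3)
)

coffee_demo_summary <- coffee_demo %>%
  mutate(day = fct_inorder(day)) %>%   # Keep days in order of appearance
  group_by(day) %>%                    # One group per day
  summarize(mean_cups  = mean(cups),   # One output row per group
            total_cups = sum(cups),
            sd_cups    = sd(cups))

coffee_demo_summary                    # Print summary tibble
```

Without the fct_inorder() step, the day column stays a character vector and the groups are sorted alphabetically, so Friday would come before Monday in this sketch.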
Exercise Solutions
Below, you can find our solutions for the exercises of this module. Before beginning the exercises, we install and load the tidyverse packages. The tidyverse enables us to use the dplyr functions.
install.packages("tidyverse")  # Install tidyverse packages
library(tidyverse)             # Load tidyverse packages
With the tidyverse packages loaded, we can now proceed to the solutions of the exercises.
Exercise 1: Import the sleep-data.csv file that was created in Module 12 into R and store it in a tibble named sleep_data.
my_path <- "path to sleep-data.csv"                         # Specify directory path
sleep_data <- read_csv(str_c(my_path, "sleep-data.csv"))    # Import CSV file
sleep_data                                                  # Print to console
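In the above solution, we first specified the path to the directory where sleep-data.csv is stored. Then we used the str_c() function inside the read_csv() function to join that path with the file name and read the CSV file. Because str_c() joins the two strings without adding a separator, the directory path has to end with a slash.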
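The following sketch illustrates how the file path is put together. It writes a tiny CSV file to a temporary folder and reads it back, so it runs without the real sleep-data.csv. The file name sleep-data-demo.csv, the object names, and the values are hypothetical. The column names week_no, days, and sleep_hours are assumptions taken from the solution code of this module.

```r
library(readr)    # read_csv(), write_csv()
library(stringr)  # str_c()

demo_dir <- tempdir()                                   # Temporary folder for this sketch
demo_sleep <- data.frame(week_no     = c(1, 1, 2),      # Invented example rows
                         days        = c("Monday", "Tuesday", "Monday"),
                         sleep_hours = c(7, 6.5, 8))
write_csv(demo_sleep, file.path(demo_dir, "sleep-data-demo.csv"))

my_path_demo <- str_c(demo_dir, "/")                    # str_c() adds no separator itself
path_a <- str_c(my_path_demo, "sleep-data-demo.csv")    # Concatenation as in the solution
path_b <- file.path(demo_dir, "sleep-data-demo.csv")    # Base R alternative inserts "/"

sleep_data_demo <- read_csv(path_a, show_col_types = FALSE)
sleep_data_demo                                         # Print to console
```

Both path_a and path_b point to the same file. file.path() can be a convenient alternative when it is unclear whether the directory string already ends with a slash.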
Exercise 2: Group sleep_data by the week_no column and calculate the total hours of sleep for each week. In which week did you sleep more?
sleep_data %>%                                            # Group & summarize by week number
  group_by(week_no) %>%
  dplyr::summarize(total_sleep_hours = sum(sleep_hours))
Here we grouped the data by the week_no column and used the summarize() function to create a new column total_sleep_hours, which is the sum of the sleep_hours column within each week.
That way, you can tell the week in which you had the most hours of sleep.
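With only two weeks, the answer can be read directly from the printed output. As a possible extension, the row with the largest total could also be extracted in code using slice_max() from dplyr. The tibble sleep_demo below is a hypothetical example with invented values, and its column names are assumed to match those used in the solution code.

```r
library(dplyr)

# Hypothetical two weeks of sleep data (values are invented)
sleep_demo <- tibble(
  week_no     = rep(c(1, 2), each = 7),
  days        = rep(c("Monday", "Tuesday", "Wednesday", "Thursday",
                      "Friday", "Saturday", "Sunday"), times = 2),
  sleep_hours = c(7, 6, 6.5, 7, 6, 9, 8.5,
                  6, 6, 7, 6.5, 7, 8, 9)
)

weekly_totals <- sleep_demo %>%          # Total sleep hours per week
  group_by(week_no) %>%
  summarize(total_sleep_hours = sum(sleep_hours))

best_week <- weekly_totals %>%           # Keep the week with the highest total
  slice_max(total_sleep_hours, n = 1)

weekly_totals
best_week
```

In this made-up data, week 1 sums to 50 hours and week 2 to 49.5 hours, so best_week contains the row for week 1.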
Exercise 3: Group sleep_data by the day column and compute the mean hours of sleep for each day across both weeks. On which day did you sleep the most?
sleep_data %>%                                            # Group & summarize by days
  group_by(days) %>%
  dplyr::summarize(mean_sleep_hours = mean(sleep_hours))
In the solution above, we used the group_by() function to group the data by the days column and used the summarize() function to create a new column mean_sleep_hours, which is the average sleep hours on each day of the week.
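Two optional refinements may make this result easier to read: keeping the days in calendar order with fct_inorder(), as shown in the lecture, and sorting the summary so that the day with the most sleep appears first. The sketch below uses a hypothetical tibble sleep_demo with invented values, and it assumes the rows are already stored in Monday-to-Sunday order.

```r
library(dplyr)
library(forcats)  # Provides fct_inorder()

# Hypothetical two weeks of sleep data (values are invented)
sleep_demo <- tibble(
  week_no     = rep(c(1, 2), each = 7),
  days        = rep(c("Monday", "Tuesday", "Wednesday", "Thursday",
                      "Friday", "Saturday", "Sunday"), times = 2),
  sleep_hours = c(7, 6, 6.5, 7, 6, 9, 8.5,
                  6, 6, 7, 6.5, 7, 8, 9)
)

daily_means <- sleep_demo %>%
  mutate(days = fct_inorder(days)) %>%   # Keep Monday-to-Sunday order
  group_by(days) %>%
  summarize(mean_sleep_hours = mean(sleep_hours))

ranked_days <- daily_means %>%           # Highest mean first
  arrange(desc(mean_sleep_hours))

daily_means
ranked_days
```

The first row of ranked_days then answers the question of the exercise. In this invented data it is Sunday.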
Exercise 4: Create a new column in sleep_data to classify each day as “Weekday” or “Weekend”. Group by this new column and calculate the average sleep hours for weekdays and weekends.
sleep_data %>%                                            # Classify days as either weekend or weekday
  mutate(class = if_else(days == "Sunday" | days == "Saturday",
                         "Weekend",
                         "Weekday")) %>%
  group_by(class) %>%
  dplyr::summarize(average_sleep_hours = mean(sleep_hours))
Here we used the mutate() function to create a new column class, which conditionally classifies each day of the week as either Weekend or Weekday. Then we grouped the data by the newly created column class and used the summarize() function to create a new column average_sleep_hours, which contains the average hours of sleep recorded on weekends and weekdays.
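As a variation on this solution, the condition could be written with the %in% operator instead of two comparisons joined by |, and n() could be added to show how many days each average is based on. This is only a sketch using a hypothetical tibble sleep_demo with invented values. The column n_days is an addition for illustration and is not part of the exercise.

```r
library(dplyr)

# Hypothetical two weeks of sleep data (values are invented)
sleep_demo <- tibble(
  week_no     = rep(c(1, 2), each = 7),
  days        = rep(c("Monday", "Tuesday", "Wednesday", "Thursday",
                      "Friday", "Saturday", "Sunday"), times = 2),
  sleep_hours = c(7, 6, 6.5, 7, 6, 9, 8.5,
                  6, 6, 7, 6.5, 7, 8, 9)
)

class_summary <- sleep_demo %>%
  mutate(class = if_else(days %in% c("Saturday", "Sunday"),  # Same logic as == ... | == ...
                         "Weekend",
                         "Weekday")) %>%
  group_by(class) %>%
  summarize(average_sleep_hours = mean(sleep_hours),
            n_days = n())                                    # Number of rows per group

class_summary
```

The count makes it visible that the weekday average rests on ten observations while the weekend average rests on only four.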
Solutions to these exercises were created in collaboration with Ifeanyi Idiaye and Cansu Kebabci. Thanks to them for their contribution!
Further Resources
- Statistics Globe Article – Numbering Rows within Groups of Data Frame in R
- Statistics Globe Article – Group Data Frame by Multiple Columns in R
- Statistics Globe Article – Select First Row of Each Group in Data Frame in R
- Statistics Globe Article – R Error: `n()` must only be used inside dplyr verbs.
- Statistics Globe Article – R dplyr Message: `summarise()` has grouped output by ‘X’. You can override using the `.groups` argument.
You can access the course overview page, timetable, and table of contents by clicking here.