Module 14 – Joining Data
Module 14 delves into joining data sets in R, a fundamental concept for data analysts who often need to synthesize information from multiple sources for comprehensive insights. The video lecture in this module offers an in-depth guide on various join techniques (e.g., left_join()
and right_join()
). It focuses on how to effectively combine data sets by leveraging shared variables, which can involve different joining methods depending on the data structure and analysis needs. To ensure a practical understanding, the module concludes with exercises that allow you to apply these joining techniques.
Video Lecture
Exercises
- Create a tibble named
productivity_data
with a column containing the last 7 days, a week number column with the repeated character string “week_2”, and a productivity score column ranging from 1 to 10 for each day. - Perform a
full_join
betweensleep_data
andproductivity_data
on the day and week number columns. - Perform an
inner_join
betweensleep_data
andproductivity_data
on the day and week number columns. - Perform a
left_join
betweensleep_data
andproductivity_data
on the day and week number columns. - Perform a
right_join
betweensleep_data
andproductivity_data
on the day and week number columns. - Compare the outputs of the full, inner, left, and right joins. Identify if any of the join techniques result in the same data set and explain why this might happen.
- Choose one of the joined data sets that you find most informative. Export this data set to a new CSV file called
sleep-productivity-data.csv
for future analysis.
The solutions to these exercises can be found at the bottom of this page.
Data & R Code of This Lecture
You may download the data set used in this lecture here.
# install.packages("tidyverse") # Install tidyverse packages library("tidyverse") # Load tidyverse packages my_path <- "D:/Dropbox/Jock/Data Sets/dplyr Course/" # Specify directory path team_coffee <- read_csv(str_c(my_path, # Import CSV file "Team-Coffee-Data.csv")) team_coffee # Print tibble vec_days <- c("Monday", "Tuesday", # Create vector of weekdays "Wednesday", "Thursday", "Friday", "Saturday", "Sunday") vec_days # Print vector of weekdays team_fresh <- tibble(member = c(rep("Cansu", 7), # Create second tibble rep("Ifeanyi", 7), rep("Micha", 7)), day = rep(vec_days, 3), frlvl = c(75, 90, 80, 75, 65, 25, 10, 95, 70, 85, 65, 60, 40, 30, 85, 80, 85, 80, 85, 5, 5)) team_fresh %>% # Print second tibble print(n = nrow(team_fresh)) team_full <- full_join(team_coffee, # Full join team_fresh, by = c("member", "day")) View(team_full) # View merged data team_inner <- inner_join(team_coffee, # Inner join team_fresh, by = c("member", "day")) View(team_inner) # View merged data team_left <- left_join(team_coffee, # Left join team_fresh, by = c("member", "day")) View(team_left) # View merged data team_right <- right_join(team_coffee, # Right join team_fresh, by = c("member", "day")) View(team_right) # View merged data my_path <- "D:/Dropbox/Jock/Data Sets/dplyr Course/" # Specify directory path team_inner %>% # Export CSV file write_csv(str_c(my_path, "Team-Coffee-Freshness-Data.csv"))
Exercise Solutions
Below, you can find our solutions for the exercises of this module. Before beginning the exercises, we have to install and load the tidyverse
packages. The tidyverse
enables us to use the dplyr
functions.
install.packages("tidyverse") # Install tidyverse packages library(tidyverse) # Load tidyverse packages
We will also need to import the sleep
data set that we created in Module 12.
data_path <- "path to sleep-data.csv" # Specify directory path sleep_data <- read_csv(str_c(data_path, "sleep-data.csv")) # Import CSV file
With the tidyverse
packages loaded, and the sleep
data set imported, we can now proceed to the solutions of the exercises.
Exercise 1: Create a tibble named productivity_data
with a column containing the last 7 days, a week number column with the repeated character string “week_2”, and a productivity score column ranging from 1 to 10 for each day.
productivity_data <- tibble(days = c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"), # Create tibble week_no = "week_2", productivity_score = c(5,7,6,2,8,3,4)) productivity_data # Print to console
In the solution above, we created a tibble with the columns days
, week_no
, and productivity_score
.
Exercise 2: Perform a full_join
between sleep_data
and productivity_data
on the day and week number columns.
data_full_join <- full_join(sleep_data, # Full join productivity_data, by = c("days","week_no") ) View(data_full_join) # View merged data
Here we performed a full join of the two data sets using the days
and week_no
columns as join keys. In a full join, the rows of a join key that does not exist in any of the data sets are replaced with NA
in the merged data set.
Exercise 3: Perform an inner_join
between sleep_data
and productivity_data
on the day and week number columns.
data_inner_join <- inner_join(sleep_data, # Inner join productivity_data, by = c("days","week_no")) View(data_inner_join) # View merged data
In the above solution, we performed an inner join of the two data sets using the days
and week_no
columns as join keys. An inner join returns the rows of the merged data set where the values of the join keys match.
Exercise 4: Perform a left_join
between sleep_data
and productivity_data
on the day and week number columns.
data_left_join <- left_join(sleep_data, # Left join productivity_data, by = c("days","week_no")) View(data_left_join) # View merged data
Here we performed a left join of the two data sets using the days
and week_no
columns as join keys. A left join returns a merged data set where the rows of the second data set are replaced with NA
if its join key values do not match with the join key values of the first data set.
Exercise 5: Perform a right_join
between sleep_data
and productivity_data
on the day and week number columns.
data_right_join <- right_join(sleep_data, # Right join productivity_data, by = c("days","week_no")) View(data_right_join) # View merged data
Here we performed a right join of the two data sets using the days
and week_no
columns as join keys. A right join returns a merged data set where the rows of the first data set are replaced with NA
if its join key values do not match with the join key values of the second data set.
Exercise 6: Compare the outputs of the full, inner, left, and right joins. Identify if any of the join techniques result in the same data set and explain why this might happen.
The inner join and right join result in the same data set, whereas the left join and full join also result in the same output. This is because the first (left) data set sleep_data
has additional data from week 1, while the second (right) data set productivity_data
does not contain any data from week 1. Therefore, the full joined set including all join key values returns the same output as the left joined set which includes all join key values of the left data set. Similarly, the inner joined set showing only the matching join key values returns the same result as the right joined set which only includes the second data set’s join key values.
Exercise 7: Choose one of the joined data sets that you find most informative. Export this data set to a new CSV file called sleep-productivity-data.csv
for future analysis.
my_path <- "your current project directory path" # Specify directory path data_full_join %>% # Export CSV file
write_csv(str_c(my_path, “sleep-productivity-data.csv”))
We have chosen to export the data set merged with a full join since it contains all the rows of both the sleep
data set and the productivity
data set.
Solutions to these exercises were created in collaboration with Ifeanyi Idiaye and Cansu Kebabci. Thanks to them for their contribution!
Further Resources
- Statistics Globe Article – Join Data Frames with the R dplyr Package
- Statistics Globe Article – R Merge Multiple Data Frames in List | Base R vs. tidyverse
- Statistics Globe Article – Specify Column Names for X & Y when Joining with dplyr Package in R
- dplyr Documentation – Mutating joins
.
You can access the course overview page, timetable, and table of contents by clicking here.