Module 14 – Joining Data

Module 14 delves into joining data sets in R, a fundamental concept for data analysts who often need to synthesize information from multiple sources for comprehensive insights. The video lecture in this module offers an in-depth guide on various join techniques (e.g., left_join() and right_join()). It focuses on how to effectively combine data sets by leveraging shared variables, which can involve different joining methods depending on the data structure and analysis needs. To ensure a practical understanding, the module concludes with exercises that allow you to apply these joining techniques.

Video Lecture

Exercises

Create a tibble named productivity_data with a column containing the last 7 days, a week number column with the repeated character string “week_2”, and a productivity score column ranging from 1 to 10 for each day.
Perform a full_join between sleep_data and productivity_data on the day and week number columns.
Perform an inner_join between sleep_data and productivity_data on the day and week number columns.
Perform a left_join between sleep_data and productivity_data on the day and week number columns.
Perform a right_join between sleep_data and productivity_data on the day and week number columns.
Compare the outputs of the full, inner, left, and right joins. Identify if any of the join techniques result in the same data set and explain why this might happen.
Choose one of the joined data sets that you find most informative. Export this data set to a new CSV file called sleep-productivity-data.csv for future analysis.

The solutions to these exercises can be found at the bottom of this page.

Data & R Code of This Lecture

You may download the data set used in this lecture here.

# install.packages("tidyverse")                   # Install tidyverse packages
library("tidyverse")                              # Load tidyverse packages
 
my_path <- "D:/Dropbox/Jock/Data Sets/dplyr Course/"  # Specify directory path
 
team_coffee <- read_csv(str_c(my_path,            # Import CSV file
                              "Team-Coffee-Data.csv"))
team_coffee                                       # Print tibble
 
vec_days <- c("Monday", "Tuesday",                # Create vector of weekdays
              "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday")
vec_days                                          # Print vector of weekdays
 
team_fresh <- tibble(member = c(rep("Cansu", 7),  # Create second tibble
                              rep("Ifeanyi", 7),
                              rep("Micha", 7)),
                     day = rep(vec_days, 3),
                     frlvl = c(75, 90, 80, 75, 65, 25, 10,
                               95, 70, 85, 65, 60, 40, 30,
                               85, 80, 85, 80, 85, 5, 5))
team_fresh %>%                                    # Print second tibble
  print(n = nrow(team_fresh))
 
team_full <- full_join(team_coffee,               # Full join
                       team_fresh,
                       by = c("member", "day"))
View(team_full)                                   # View merged data
 
team_inner <- inner_join(team_coffee,             # Inner join
                         team_fresh,
                         by = c("member", "day"))
View(team_inner)                                  # View merged data
 
team_left <- left_join(team_coffee,               # Left join
                       team_fresh,
                       by = c("member", "day"))
View(team_left)                                   # View merged data
 
team_right <- right_join(team_coffee,             # Right join
                         team_fresh,
                         by = c("member", "day"))
View(team_right)                                  # View merged data
 
my_path <- "D:/Dropbox/Jock/Data Sets/dplyr Course/"  # Specify directory path
 
team_inner %>%                                    # Export CSV file
  write_csv(str_c(my_path, "Team-Coffee-Freshness-Data.csv"))

Exercise Solutions

Below, you can find our solutions for the exercises of this module. Before beginning the exercises, we have to install and load the tidyverse packages. The tidyverse enables us to use the dplyr functions.

install.packages("tidyverse")	                                      # Install tidyverse packages
 
library(tidyverse)	                                              # Load tidyverse packages

We will also need to import the sleep data set that we created in Module 12.

data_path <- "path to sleep-data.csv"	                              # Specify directory path
 
sleep_data <- read_csv(str_c(data_path, "sleep-data.csv"))    	      # Import CSV file

With the tidyverse packages loaded, and the sleep data set imported, we can now proceed to the solutions of the exercises.

Exercise 1: Create a tibble named productivity_data with a column containing the last 7 days, a week number column with the repeated character string “week_2”, and a productivity score column ranging from 1 to 10 for each day.

productivity_data <- tibble(days = c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"),  # Create tibble
                            week_no = "week_2",
                            productivity_score = c(5,7,6,2,8,3,4))
 
productivity_data	                                              # Print to console

In the solution above, we created a tibble with the columns days, week_no, and productivity_score.

Exercise 2: Perform a full_join between sleep_data and productivity_data on the day and week number columns.

data_full_join <- full_join(sleep_data,                               # Full join
                            productivity_data,
                            by = c("days","week_no")		
)
 
View(data_full_join)	                                              # View merged data

Here we performed a full join of the two data sets using the days and week_no columns as join keys. In a full join, the rows of a join key that does not exist in any of the data sets are replaced with NA in the merged data set.

Exercise 3: Perform an inner_join between sleep_data and productivity_data on the day and week number columns.

data_inner_join <- inner_join(sleep_data,                             # Inner join
                              productivity_data,
                              by = c("days","week_no"))
 
View(data_inner_join)		                                      # View merged data

In the above solution, we performed an inner join of the two data sets using the days and week_no columns as join keys. An inner join returns the rows of the merged data set where the values of the join keys match.

Exercise 4: Perform a left_join between sleep_data and productivity_data on the day and week number columns.

data_left_join <- left_join(sleep_data,                               # Left join
                            productivity_data,
                            by = c("days","week_no"))
 
View(data_left_join)		                                      # View merged data

Here we performed a left join of the two data sets using the days and week_no columns as join keys. A left join returns a merged data set where the rows of the second data set are replaced with NA if its join key values do not match with the join key values of the first data set.

Exercise 5: Perform a right_join between sleep_data and productivity_data on the day and week number columns.

data_right_join <- right_join(sleep_data,                             # Right join
                              productivity_data,
                              by = c("days","week_no"))
 
View(data_right_join)		                                      # View merged data

Here we performed a right join of the two data sets using the days and week_no columns as join keys. A right join returns a merged data set where the rows of the first data set are replaced with NA if its join key values do not match with the join key values of the second data set.

Exercise 6: Compare the outputs of the full, inner, left, and right joins. Identify if any of the join techniques result in the same data set and explain why this might happen.

The inner join and right join result in the same data set, whereas the left join and full join also result in the same output. This is because the first (left) data set sleep_data has additional data from week 1, while the second (right) data set productivity_data does not contain any data from week 1. Therefore, the full joined set including all join key values returns the same output as the left joined set which includes all join key values of the left data set. Similarly, the inner joined set showing only the matching join key values returns the same result as the right joined set which only includes the second data set’s join key values.

Exercise 7: Choose one of the joined data sets that you find most informative. Export this data set to a new CSV file called sleep-productivity-data.csv for future analysis.

my_path <- "your current project directory path"                      # Specify directory path
 
data_full_join %>%                                                    # Export CSV file

write_csv(str_c(my_path, “sleep-productivity-data.csv”))

We have chosen to export the data set merged with a full join since it contains all the rows of both the sleep data set and the productivity data set.

Solutions to these exercises were created in collaboration with Ifeanyi Idiaye and Cansu Kebabci. Thanks to them for their contribution!

Further Resources

You can access the course overview page, timetable, and table of contents by clicking here.