Join Data Frames with Base R vs. dplyr (Example) | The Fastest Way to Merge
In this tutorial you’ll learn how to merge data frames using Base R vs. the dplyr package in R programming.
Table of contents:
Let’s start right away…
Different Types of Joins
Before we jump into the R codes, let’s have a look at the different types of joins that we will use in this tutorial.
The following figure illustrates two input data sets and their joined outputs when using inner, left, right, and full joins:
These four different types of joins can be summarized as follows:
- Inner join: Keep only IDs that are contained in both data sets.
- Left join: Keep only IDs that are contained in the first data set.
- Right join: Keep only IDs that are contained in the second data set.
- Full join: Keep all IDs.
So far, so good. Let’s apply these kinds of joins in R!
Creation of Example Data
The following data frames will be used as basement for this R tutorial:
data_A <- data.frame(ID = paste0("ID_", 1:6), # First example data x1 = 1:6, x2 = letters[1:6]) data_A
data_B <- data.frame(ID = paste0("ID_", 4:9), # Second example data y1 = 104:109, y2 = LETTERS[4:9]) data_B
The output of the previous R programming code is shown in Tables 1 and 2: We have created two data frames both containing three columns.
The first data frame contains the variables ID, x1, and x2; and the second data frame contains the columns ID, y1, and y2. The ID-column will be used to merge our data frames.
Example 1: Merging Data Using Base R
The syntax below explains how to join two data frames using the basic installation of the R programming language.
Have a look at the R code below:
data_base_inner <- merge(data_A, data_B, by = "ID") # Inner join data_base_left <- merge(data_A, data_B, by = "ID", all.x = TRUE) # Left join data_base_right <- merge(data_A, data_B, by = "ID", all.y = TRUE) # Right join data_base_full <- merge(data_A, data_B, by = "ID", all = TRUE) # Full join
After running the previous syntax, we have created four merged data frames corresponding to the different types of joins that were introduced at the beginning of this tutorial.
Example 2: Merging Data Using dplyr Package
Example 2 demonstrates how to merge data frames using the join functions of the dplyr package.
We first need to install and load the dplyr package, if we want to use the functions that are included in the package:
install.packages("dplyr") # Install dplyr package library("dplyr") # Load dplyr package
Next, we can apply the different join functions of the dplyr package:
data_dplyr_inner <- inner_join(data_A, data_B, by = "ID") # Inner join data_dplyr_left <- left_join(data_A, data_B, by = "ID") # Left join data_dplyr_right <- right_join(data_A, data_B, by = "ID") # Right join data_dplyr_full <- full_join(data_A, data_B, by = "ID") # Full join
The previous R syntax has created four new data frames that contain exactly the same merged versions of our input data frames that we have already created in Example 1.
However, this time we have used the functions of the dplyr package instead of Base R.
So, what’s actually the difference between those two ways to combine data sets?
One important difference – especially in case of large data files – is illustrated in the next section: The speed of the different ways to merge two data sets.
Example 3: Comparing Speed of Base R vs. dplyr Package
Example 3 compares the performance of Base R and dplyr merges in terms of speed.
For this comparison, we first have to create data frames with a much larger size:
set.seed(8642468) # Reproducibility N <- 10000000 # Sample size data_A_large <- data.frame(ID = 1:N, # First large data set x1 = rnorm(N), x2 = runif(N)) data_B_large <- data.frame(ID = 1:N, # Second large data set y1 = rnorm(N), y2 = runif(N))
In the next step, we can measure the execution time of a Base R inner join using the Sys.time function:
time_base_start <- Sys.time() # Time of Base R join data_base_large <- merge(data_A_large, data_B_large, by = "ID") time_base_finish <- Sys.time() time_base_finish - time_base_start # Time difference of 14.86277 secs
The previous output of the RStudio console shows that it took 14.86277 seconds to merge our data frames using Base R.
Let’s do the same for dplyr:
time_dplyr_start <- Sys.time() # Time of dplyr join data_dplyr_large <- inner_join(data_A_large, data_B_large, by = "ID") time_dplyr_finish <- Sys.time() time_dplyr_finish - time_dplyr_start # Time difference of 5.186138 secs
The dplyr package needed only 5.186138 seconds to merge the same two data frames.
In other words: dplyr was almost three times faster than Base R!
Video & Further Resources
Have a look at the following video on my YouTube channel. In the video, I show the topics of this article in RStudio.
Please accept YouTube cookies to play this video. By accepting you will be accessing content from YouTube, a service provided by an external third party.
If you accept this notice, your choice will be saved and the page will refresh.
In addition, you may have a look at the other RStudio tutorials on my website:
- Merge Data Frames by Row Names in R
- Merge Data Frames by Column Names in R
- Merge Multiple Data Frames in List in R
- Merge Two Unequal Data Frames & Replace NA with 0
- Merge Data Frames by Two ID Columns in R
- Merge data.tables in R
- R Programming Language
You have learned in this article how to join data frames using Base R vs. the dplyr package in the R programming language. In this tutorial, we have also tested which way to join data frames is quicker. Don’t hesitate to let me know in the comments section, if you have further questions.
2 Comments. Leave new
Leave a Reply Cancel reply
Statistics Globe Newsletter
Do you know how to merge data tables in the data table way?
I think this thread on Stack Overflow explains your question.