Join Data Frames with Base R vs. dplyr (Example) | The Fastest Way to Merge

data_A <- data.frame(ID = paste0("ID_", 1:6),                      # First example data
                     x1 = 1:6,
                     x2 = letters[1:6])
data_A

table 1 data frame fastest way merge data frames r

data_B <- data.frame(ID = paste0("ID_", 4:9),                      # Second example data
                     y1 = 104:109,
                     y2 = LETTERS[4:9])
data_B

table 2 data frame fastest way merge data frames r

The output of the previous R programming code is shown in Tables 1 and 2: We have created two data frames both containing three columns.

The first data frame contains the variables ID, x1, and x2; and the second data frame contains the columns ID, y1, and y2. The ID-column will be used to merge our data frames.

Example 1: Merging Data Using Base R

The syntax below explains how to join two data frames using the basic installation of the R programming language.

Have a look at the R code below:

data_base_inner <- merge(data_A, data_B, by = "ID")                # Inner join
data_base_left <- merge(data_A, data_B, by = "ID", all.x = TRUE)   # Left join
data_base_right <- merge(data_A, data_B, by = "ID", all.y = TRUE)  # Right join
data_base_full <- merge(data_A, data_B, by = "ID", all = TRUE)     # Full join

After running the previous syntax, we have created four merged data frames corresponding to the different types of joins that were introduced at the beginning of this tutorial.

Example 2: Merging Data Using dplyr Package

Example 2 demonstrates how to merge data frames using the join functions of the dplyr package.

We first need to install and load the dplyr package, if we want to use the functions that are included in the package:

install.packages("dplyr")                                          # Install dplyr package
library("dplyr")                                                   # Load dplyr package

Next, we can apply the different join functions of the dplyr package:

data_dplyr_inner <- inner_join(data_A, data_B, by = "ID")          # Inner join
data_dplyr_left <- left_join(data_A, data_B, by = "ID")            # Left join
data_dplyr_right <- right_join(data_A, data_B, by = "ID")          # Right join
data_dplyr_full <- full_join(data_A, data_B, by = "ID")            # Full join

The previous R syntax has created four new data frames that contain exactly the same merged versions of our input data frames that we have already created in Example 1.

However, this time we have used the functions of the dplyr package instead of Base R.

So, what’s actually the difference between those two ways to combine data sets?

One important difference – especially in case of large data files – is illustrated in the next section: The speed of the different ways to merge two data sets.

Example 3: Comparing Speed of Base R vs. dplyr Package

Example 3 compares the performance of Base R and dplyr merges in terms of speed.

For this comparison, we first have to create data frames with a much larger size:

set.seed(8642468)                                                  # Reproducibility
N <- 10000000                                                      # Sample size
 
data_A_large <- data.frame(ID = 1:N,                               # First large data set
                           x1 = rnorm(N),
                           x2 = runif(N))
 
data_B_large <- data.frame(ID = 1:N,                               # Second large data set
                           y1 = rnorm(N),
                           y2 = runif(N))

In the next step, we can measure the execution time of a Base R inner join using the Sys.time function:

time_base_start <- Sys.time()                                      # Time of Base R join
data_base_large <- merge(data_A_large, data_B_large, by = "ID")
time_base_finish <- Sys.time()
time_base_finish - time_base_start
# Time difference of 14.86277 secs

The previous output of the RStudio console shows that it took 14.86277 seconds to merge our data frames using Base R.

Let’s do the same for dplyr:

time_dplyr_start <- Sys.time()                                     # Time of dplyr join
data_dplyr_large <- inner_join(data_A_large, data_B_large, by = "ID")
time_dplyr_finish <- Sys.time()
time_dplyr_finish - time_dplyr_start
# Time difference of 5.186138 secs

The dplyr package needed only 5.186138 seconds to merge the same two data frames.

In other words: dplyr was almost three times faster than Base R!

Video & Further Resources

Have a look at the following video on my YouTube channel. In the video, I show the topics of this article in RStudio.

In addition, you may have a look at the other RStudio tutorials on my website:

You have learned in this article how to join data frames using Base R vs. the dplyr package in the R programming language. In this tutorial, we have also tested which way to join data frames is quicker. Don’t hesitate to let me know in the comments section, if you have further questions.

2 Comments. Leave new

Biomagician
December 19, 2021 3:31 pm

Hello,
Do you know how to merge data tables in the data table way?
Thanks!

Reply
- Joachim
  December 20, 2021 7:07 am
  
  Hey,
  
  I think this thread on Stack Overflow explains your question.
  
  Regards,
  Joachim
  
  Reply