Join Data Frames with Base R vs. dplyr (Example) | The Fastest Way to Merge

 

In this tutorial you’ll learn how to merge data frames using Base R vs. the dplyr package in R programming.

Table of contents:

Let’s start right away…

 

Different Types of Joins

Before we jump into the R codes, let’s have a look at the different types of joins that we will use in this tutorial.

The following figure illustrates two input data sets and their joined outputs when using inner, left, right, and full joins:

 

merge data frames r theory

 

These four different types of joins can be summarized as follows:

  1. Inner join: Keep only IDs that are contained in both data sets.
  2. Left join: Keep only IDs that are contained in the first data set.
  3. Right join: Keep only IDs that are contained in the second data set.
  4. Full join: Keep all IDs.

So far, so good. Let’s apply these kinds of joins in R!

 

Creation of Example Data

The following data frames will be used as basement for this R tutorial:

data_A <- data.frame(ID = paste0("ID_", 1:6),                      # First example data
                     x1 = 1:6,
                     x2 = letters[1:6])
data_A

 

table 1 data frame fastest way merge data frames r

 

data_B <- data.frame(ID = paste0("ID_", 4:9),                      # Second example data
                     y1 = 104:109,
                     y2 = LETTERS[4:9])
data_B

 

table 2 data frame fastest way merge data frames r

 

The output of the previous R programming code is shown in Tables 1 and 2: We have created two data frames both containing three columns.

The first data frame contains the variables ID, x1, and x2; and the second data frame contains the columns ID, y1, and y2. The ID-column will be used to merge our data frames.

 

Example 1: Merging Data Using Base R

The syntax below explains how to join two data frames using the basic installation of the R programming language.

Have a look at the R code below:

data_base_inner <- merge(data_A, data_B, by = "ID")                # Inner join
data_base_left <- merge(data_A, data_B, by = "ID", all.x = TRUE)   # Left join
data_base_right <- merge(data_A, data_B, by = "ID", all.y = TRUE)  # Right join
data_base_full <- merge(data_A, data_B, by = "ID", all = TRUE)     # Full join

After running the previous syntax, we have created four merged data frames corresponding to the different types of joins that were introduced at the beginning of this tutorial.

 

Example 2: Merging Data Using dplyr Package

Example 2 demonstrates how to merge data frames using the join functions of the dplyr package.

We first need to install and load the dplyr package, if we want to use the functions that are included in the package:

install.packages("dplyr")                                          # Install dplyr package
library("dplyr")                                                   # Load dplyr package

Next, we can apply the different join functions of the dplyr package:

data_dplyr_inner <- inner_join(data_A, data_B, by = "ID")          # Inner join
data_dplyr_left <- left_join(data_A, data_B, by = "ID")            # Left join
data_dplyr_right <- right_join(data_A, data_B, by = "ID")          # Right join
data_dplyr_full <- full_join(data_A, data_B, by = "ID")            # Full join

The previous R syntax has created four new data frames that contain exactly the same merged versions of our input data frames that we have already created in Example 1.

However, this time we have used the functions of the dplyr package instead of Base R.

So, what’s actually the difference between those two ways to combine data sets?

One important difference – especially in case of large data files – is illustrated in the next section: The speed of the different ways to merge two data sets.

 

Example 3: Comparing Speed of Base R vs. dplyr Package

Example 3 compares the performance of Base R and dplyr merges in terms of speed.

For this comparison, we first have to create data frames with a much larger size:

set.seed(8642468)                                                  # Reproducibility
N <- 10000000                                                      # Sample size
 
data_A_large <- data.frame(ID = 1:N,                               # First large data set
                           x1 = rnorm(N),
                           x2 = runif(N))
 
data_B_large <- data.frame(ID = 1:N,                               # Second large data set
                           y1 = rnorm(N),
                           y2 = runif(N))

In the next step, we can measure the execution time of a Base R inner join using the Sys.time function:

time_base_start <- Sys.time()                                      # Time of Base R join
data_base_large <- merge(data_A_large, data_B_large, by = "ID")
time_base_finish <- Sys.time()
time_base_finish - time_base_start
# Time difference of 14.86277 secs

The previous output of the RStudio console shows that it took 14.86277 seconds to merge our data frames using Base R.

Let’s do the same for dplyr:

time_dplyr_start <- Sys.time()                                     # Time of dplyr join
data_dplyr_large <- inner_join(data_A_large, data_B_large, by = "ID")
time_dplyr_finish <- Sys.time()
time_dplyr_finish - time_dplyr_start
# Time difference of 5.186138 secs

The dplyr package needed only 5.186138 seconds to merge the same two data frames.

In other words: dplyr was almost three times faster than Base R!

 

Video & Further Resources

Have a look at the following video on my YouTube channel. In the video, I show the topics of this article in RStudio.

 

Please accept YouTube cookies to play this video. By accepting you will be accessing content from YouTube, a service provided by an external third party.

YouTube Content Consent Button Thumbnail

YouTube privacy policy

If you accept this notice, your choice will be saved and the page will refresh.

 

In addition, you may have a look at the other RStudio tutorials on my website:

 

You have learned in this article how to join data frames using Base R vs. the dplyr package in the R programming language. In this tutorial, we have also tested which way to join data frames is quicker. Don’t hesitate to let me know in the comments section, if you have further questions.

 

Subscribe to the Statistics Globe Newsletter

Get regular updates on the latest tutorials, offers & news at Statistics Globe.
I hate spam & you may opt out anytime: Privacy Policy.


2 Comments. Leave new

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.

Menu
Top