Merge Data Frames by Two ID Columns in R (2 Examples)

 

In this article you’ll learn how to combine multiple data frames based on more than one ID column in R.

The article looks as follows:

Let’s take a look at some R codes in action!

 

Creation of Example Data

First, I’ll have to create some data that we can use in the following examples:

data1 <- data.frame(ID1 = 1:5,                                 # Create first data frame
                    ID2 = letters[1:5],
                    x1 = c(4, 1, 6, 7, 8),
                    x2 = 9)
data1                                                          # Print first data frame
#   ID1 ID2 x1 x2
# 1   1   a  4  9
# 2   2   b  1  9
# 3   3   c  6  9
# 4   4   d  7  9
# 5   5   e  8  9

As you can see based on the previously shown output of the RStudio console, our first example data frame consists of five rows and four columns. The variables ID1 and ID2 will be used for the combination of our data frames.

Let’s create a second example data frame:

data2 <- data.frame(ID1 = 3:7,                                 # Create second data frame
                    ID2 = letters[3:7],
                    y1 = c(4, 4, 5, 1, 1),
                    y2 = 5)
data2                                                          # Print second data frame
#   ID1 ID2 y1 y2
# 1   3   c  4  5
# 2   4   d  4  5
# 3   5   e  5  5
# 4   6   f  1  5
# 5   7   g  1  5

The second data frame also contains five rows and four columns, including the two ID columns ID1 and ID2.

 

Example 1: Combine Data by Two ID Columns Using merge() Function

In Example 1, I’ll illustrate how to apply the merge function to combine data frames based on multiple ID columns. For this, we have to specify the by argument of the merge function to be equal to a vector of ID column names (i.e. by = c(“ID1”, “ID2”)).

data_merge1 <- merge(data1, data2, by = c("ID1", "ID2"))       # Applying merge() function
data_merge1                                                    # Print merged data
#   ID1 ID2 x1 x2 y1 y2
# 1   3   c  6  9  4  5
# 2   4   d  7  9  4  5
# 3   5   e  8  9  5  5

Have a look at the previous output of the RStudio console. We have created a merged data frame based on two ID columns.

 

Example 2: Combine Data by Two ID Columns Using inner_join() Function of dplyr Package

This Example illustrates how to use the dplyr package to merge data by two ID columns.

First, we need to install and load the dplyr package:

install.packages("dplyr")                                      # Install dplyr package
library("dplyr")                                               # Load dplyr package

Now, we can apply the inner_join function to create exactly the same output as in Example 1:

data_merge2 <- inner_join(data1, data2, by = c("ID1", "ID2"))  # Applying inner_join() function
data_merge2                                                    # Print merged data
#   ID1 ID2 x1 x2 y1 y2
# 1   3   c  6  9  4  5
# 2   4   d  7  9  4  5
# 3   5   e  8  9  5  5

Note that the previous examples performed an inner join. However, it is also possible to apply other types of data joins such as left joins, right joins, outer joins, and so on.

 

Video, Further Resources & Summary

Do you want to learn more about merging data? Then I can recommend having a look at the following video of my YouTube channel. I explain the content of this tutorial in the video.

 

Please accept YouTube cookies to play this video. By accepting you will be accessing content from YouTube, a service provided by an external third party.

YouTube Content Consent Button Thumbnail

YouTube privacy policy

If you accept this notice, your choice will be saved and the page will refresh.

 

Besides the video, you may want to read the related tutorials of this homepage:

 

You learned in this tutorial how to join several data frames based on two ID variables in the R programming language. In case you have further comments or questions, let me know in the comments.

 

Subscribe to the Statistics Globe Newsletter

Get regular updates on the latest tutorials, offers & news at Statistics Globe.
I hate spam & you may opt out anytime: Privacy Policy.


2 Comments. Leave new

  • Hello, this is a very good tutorial, very well laid out and explained. I was also wondering if it is possible to specify with “by=..” (or any other means) a way to use the combination of ID1 and ID2 in any order. Meaning that ID1 and ID2 of both data frames may be sorted differently. In my case, both data sets have the first two columns as pairwise comparisons, however even though they overlap in the majority of pairs, the orders in which they are stored is different for each data set. For example:

    Dataset 1 may have:
    ID1 ID2 dista
    BA BY 12
    DX BY 523
    G1 BY 57

    While dataset 2 may have:
    ID1 ID2 distb
    BA BY 0.111
    BY DX 0.035
    DA BY 0.99

    I want to get something like:
    ID1 ID2 dista distb
    BA BY 12 0.111
    DX BY 523 0.035
    G1 BY 57 NA
    DA BY NA 0.99

    I would like to merge two datasets, however I am not sure how to get the function to do this, regardless of the order of the values in ID1 and ID2. As long as both columns contain the same pair (even if the columns are switched) then I want that data to be merged. Could you help me?

    Reply
    • Hey Mark,

      Thank you very much for the kind words, glad you like the tutorial!

      Regarding your question:

      Would it be a solution to create a new ID column that contains the values of ID1 and ID2 alphabetically sorted?

      Have a look at the following example code:

      data1 <- data.frame(ID1 = c("BA", "DX", "G1"),
                          ID2 = "BY",
                          dista = c(12, 523, 57))
       
      data2 <- data.frame(ID1 = c("BA", "BY", "DA"),
                          ID2 = c("BY", "DX", "BY"),
                          distb = c(0.111, 0.035, 0.99))
       
      data1$ID_new <- apply(data1[ , c("ID1", "ID2")], 1, function(x) {paste(sort(x), collapse = " ")})
      data1 <- data1[ , !colnames(data1) %in% c("ID1", "ID2")]
       
      data2$ID_new <- apply(data2[ , c("ID1", "ID2")], 1, function(x) {paste(sort(x), collapse = " ")})
      data2 <- data2[ , !colnames(data2) %in% c("ID1", "ID2")]
       
      data_merge <- merge(data1, data2, by = "ID_new", all = TRUE)
      data_merge
      #   ID_new dista distb
      # 1  BA BY    12 0.111
      # 2  BY DA    NA 0.990
      # 3  BY DX   523 0.035
      # 4  BY G1    57    NA

      I hope that helps!

      Joachim

      Reply

Leave a Reply

Your email address will not be published.

Fill out this field
Fill out this field
Please enter a valid email address.

Menu
Top