Fuzzy Matching in R (Example) | Approximate String & Name Search

This tutorial provides several examples to help with fuzzy matching (also called fuzzy string searching or approximate string matching) in the R programming language.

Fuzzy matching can be incredibly useful when merging or joining multiple data sets where the identifying information has slight misspellings, inconsistent capitalization, or character differences due to language/locality differences.

This tutorial will contain the following sections:

1) Packages and Example Data

2) Overview

3) Base R Functions

4) stringdist Package

5) Applications

6) Video Tutorial & Further Resources



Note: This article was created in collaboration with Kirby White. Kirby is an organizational effectiveness consultant and researcher who is currently pursuing a Ph.D. at Seattle Pacific University.

Packages and Example Data

You’ll need the stringdist package for this tutorial, which you can install with install.packages("stringdist") and load with library(stringdist).

We’ll create and use two simple datasets to illustrate this functionality:

pres <- c("Bill Clinton", "Barack Obama")
 
pres_df <- data.frame(President = c("Joseph R. Biden, Jr", "Donald J. Trump", "Barack H. Obama", "George W. Bush", "William J. Clinton"),
                      Vice_President = c("Kamala D. Harris", "Michael R. Pence", "Joseph R. Biden", "Dick B. Cheney", "Albert A. Gore, Jr."))

Overview

Imagine that you need to match the two presidents in the first object, pres, to the presidents in the second object, pres_df, so that you can look up the vice president.

This is impossible with exact matching, such as the match or %in% functions, which won’t find any matches:

match(pres, pres_df$President)
# [1] NA NA
 
pres %in% pres_df$President
# [1] FALSE FALSE

Instead, approximate matching relies on a measure of how different two strings are. The most common one is the Levenshtein distance, which counts the minimum number of single-character edits (insertions, deletions, and substitutions) needed to make two words or phrases identical. A pair of words that requires fewer edits is more similar than a pair that needs many edits to become identical.
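To make the edit count concrete, here is a minimal sketch using the base R function adist(). The word pair "kitten"/"sitting" and the upper-case "OBAMA" are illustrative assumptions and not part of the tutorial's example data.

```r
# Levenshtein distance: the minimum number of single-character insertions,
# deletions, and substitutions needed to turn one string into another.
d_words <- adist("kitten", "sitting")
d_words[1, 1]
# [1] 3   (k -> s, e -> i, insert g)

# The same idea applied to one of the names used in this tutorial:
# substitute B -> W, then insert "iam" and " J."
d_names <- adist("Bill Clinton", "William J. Clinton")
d_names[1, 1]
# [1] 7

# Differences in capitalization count as edits unless ignore.case = TRUE.
d_case   <- adist("OBAMA", "obama")[1, 1]
d_nocase <- adist("OBAMA", "obama", ignore.case = TRUE)[1, 1]
c(d_case, d_nocase)
# [1] 5 0
```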

Base R Functions

Some of the functionality for approximate matching in R is included in the base packages in functions like agrep() and adist().

adist() returns a matrix of Levenshtein distances, with one row for each element of the first vector and one column for each element of the second:

adist(pres, pres_df$President)
#      [,1] [,2] [,3] [,4] [,5]
# [1,]   16   13   13   13    7
# [2,]   18   12    3   13   16

This lets us see the range of similarity between all elements in our two vectors. The lower the number, the more similar the elements are. The lowest values in each row are 7 (column 5) and 3 (column 3), meaning that those are the best matches.
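Reading the minimum off each row by eye does not scale, so the lookup can be automated. The following is a minimal base R sketch. The helper names president, d, best and best_match are assumptions introduced for illustration, and the data are repeated so the block runs on its own.

```r
pres <- c("Bill Clinton", "Barack Obama")
president <- c("Joseph R. Biden, Jr", "Donald J. Trump", "Barack H. Obama",
               "George W. Bush", "William J. Clinton")

# Distance matrix: rows follow pres, columns follow president
d <- adist(pres, president)

# Column position of the smallest distance in each row
best <- apply(d, 1, which.min)
best
# [1] 5 3

# Pair each input name with its closest candidate and the distance between them
best_match <- data.frame(pres     = pres,
                         match    = president[best],
                         distance = d[cbind(seq_along(pres), best)])
best_match
```

Note that which.min() returns only the first minimum, so ties are resolved in favor of the earlier column.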

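agrep() returns a vector of the elements that meet your criterion for a “good enough” match, which is set with the max.distance argument. Note that agrep() searches for the pattern approximately within each string, so its distances are not always the same as the whole-string distances reported by adist():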

agrep(pres[1], pres_df$President, max.distance = 10, value = TRUE)
# [1] "Joseph R. Biden, Jr" "Donald J. Trump"     "William J. Clinton"

Unfortunately, agrep() only compares one character string to a vector of strings (rather than vector to vector), which reduces its usability at scale. Another limitation of agrep() is that it doesn’t return the matches in order of their similarity, meaning it can still be difficult to identify the best match when there are several.

To see the positions of the matching elements rather than their values, set value = FALSE or omit the argument altogether, since FALSE is the default:

agrep(pres[1], pres_df$President, max.distance = 10)
# [1] 1 2 5
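Both limitations can be worked around in base R. The sketch below loops over the patterns with lapply() and then sorts each set of hits by its whole-string distance from adist(). Sorting by adist() is a design choice made here for illustration, and the names president and ranked are assumptions.

```r
pres <- c("Bill Clinton", "Barack Obama")
president <- c("Joseph R. Biden, Jr", "Donald J. Trump", "Barack H. Obama",
               "George W. Bush", "William J. Clinton")

# agrep() takes a single pattern, so apply it to each name in turn
ranked <- lapply(pres, function(p) {
  hits <- agrep(p, president, max.distance = 10)  # positions of the candidates
  hits[order(adist(p, president[hits]))]          # closest candidate first
})
names(ranked) <- pres
ranked
```

Given the outputs shown above, the hits for “Bill Clinton” (positions 1, 2 and 5, with distances 16, 13 and 7) should come back with position 5 first.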

stringdist Package

The stringdist package contains several functions related to fuzzy matching. Several distance algorithms are available through the method argument, so you can optimize your matching if the Levenshtein distance isn’t the most appropriate for your situation.
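As a rough illustration of how the choice of algorithm changes the result, the sketch below compares one pair of names under four of the methods offered by stringdist(). The selection of methods is an assumption made for the example. The "lv", "osa" and "dl" methods count edits, whereas "jw" (Jaro, or Jaro-Winkler when the p argument is above zero) returns a value between 0 and 1.

```r
library(stringdist)

a <- "Bill Clinton"
b <- "William J. Clinton"

# Same pair of strings, four different distance definitions
methods <- c("lv", "osa", "dl", "jw")
dists <- sapply(methods, function(m) stringdist(a, b, method = m))
dists

# Levenshtein ("lv") agrees with the adist() result from base R
dists[["lv"]]
# [1] 7
```

Because the scales differ, a threshold such as maxDist = 10 is only meaningful for the edit-count methods. A method like "jw" needs a threshold between 0 and 1.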

The amatch() function works similarly to agrep() and match() but is usually simpler to work with because it returns only the position of the closest match for each element, and it can compare vectors to vectors.

amatch(pres, pres_df$President, maxDist = 10)
# [1] 5 3

This means that the best match for our first name (“Bill Clinton”) is the 5th element of the second vector (“William J. Clinton”), and that our second name (“Barack Obama”) most closely matches the 3rd element (“Barack H. Obama”).

If the maxDist argument is too low for a given element, amatch() returns NA for that element to indicate that no match was found.
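The effect of the threshold can be seen by lowering it below the distance of the “Bill Clinton” match (7 edits). This is a minimal sketch in which the value 5 and the names strict and strict0 are assumptions chosen for illustration. It relies on the default method of amatch(), which is "osa", and on the nomatch argument, which replaces NA with a value of your choice.

```r
library(stringdist)

pres <- c("Bill Clinton", "Barack Obama")
president <- c("Joseph R. Biden, Jr", "Donald J. Trump", "Barack H. Obama",
               "George W. Bush", "William J. Clinton")

# "Bill Clinton" is 7 edits from its best candidate, so a limit of 5 rejects it;
# "Barack Obama" is 3 edits from its best candidate and is still matched
strict <- amatch(pres, president, maxDist = 5)
strict
# [1] NA  3

# nomatch replaces NA, as in base R's match()
strict0 <- amatch(pres, president, maxDist = 5, nomatch = 0)
strict0
# [1] 0 3
```

The default value of maxDist is 0.1, which is effectively exact matching for the edit-count methods, so the argument nearly always needs to be set explicitly.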

Applications

Fuzzy matching is typically used to locate similar identifiers across datasets (e.g. names or addresses), and you can apply these examples in a variety of ways in your work.

Here are two quick examples with our sample data.

First, let’s return the rows of pres_df where the President matches one of the names in our pres vector:

pres_df[amatch(pres, pres_df$President, maxDist = 10),]
#            President      Vice_President
# 5 William J. Clinton Albert A. Gore, Jr.
# 3    Barack H. Obama     Joseph R. Biden

Second, let’s combine the names from pres with the matching Vice Presidents from pres_df:

data.frame(pres = pres,
           vice = pres_df[amatch(pres, pres_df$President, maxDist = 10),2])
#           pres                vice
# 1 Bill Clinton Albert A. Gore, Jr.
# 2 Barack Obama     Joseph R. Biden
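In practice some names will have no acceptable match, and it helps to keep the matched name and its distance next to each result so that doubtful pairs can be reviewed. The sketch below assumes a third input, "FDR", that has no counterpart in pres_df. The names pres_new, idx and lookup are hypothetical helpers introduced here.

```r
library(stringdist)

pres_new <- c("Bill Clinton", "Barack Obama", "FDR")  # "FDR" is an assumed extra input
pres_df <- data.frame(
  President = c("Joseph R. Biden, Jr", "Donald J. Trump", "Barack H. Obama",
                "George W. Bush", "William J. Clinton"),
  Vice_President = c("Kamala D. Harris", "Michael R. Pence", "Joseph R. Biden",
                     "Dick B. Cheney", "Albert A. Gore, Jr."))

# Position of the closest president, or NA when nothing is within 10 edits
idx <- amatch(pres_new, pres_df$President, maxDist = 10)

# Indexing with NA yields NA, so unmatched names are kept with empty results
lookup <- data.frame(pres     = pres_new,
                     matched  = pres_df$President[idx],
                     vice     = pres_df$Vice_President[idx],
                     distance = stringdist(pres_new, pres_df$President[idx]))
lookup

# Names that still need manual attention
lookup$pres[is.na(lookup$matched)]
# [1] "FDR"
```

Keeping the unmatched rows instead of dropping them makes it easy to see how many names fell outside the threshold before deciding whether to raise maxDist.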