Extract Substring Before or After Pattern in R (2 Examples)

In this article, you’ll learn how to return the characters of a string before or after a certain pattern in the R programming language.

Let’s dive right in:

Creation of Example Data

Let’s first create a character string in R that we can use in the examples later on:

x <- "hello xxx other stuff"         # Example character string
x                                    # Print example string
# "hello xxx other stuff"

Our example string consists of the words “hello” and “other stuff” as well as of the pattern “xxx” in between.

Example 1: Extract Characters Before Pattern in R

Let’s assume that we want to extract all characters of our character string before the pattern “xxx”. Then, we can use the sub function as follows:

sub(" xxx.*", "", x)                 # Extract characters before pattern
# "hello"

As you can see based on the output of the RStudio console, the previous R code returned only the substring “hello”, i.e. the characters before the pattern “xxx”.

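Note that we had to specify the symbols “.*” after the pattern “xxx” within the sub function in order to get this result. In a regular expression, “.” matches any character and “*” means zero or more repetitions, so “ xxx.*” matches the space, the pattern, and everything after it. The sub function then replaces this match with an empty string.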
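
One detail worth keeping in mind: if the pattern does not occur in a string, sub has nothing to replace and returns the input unchanged. The following is a minimal sketch of a guard for that case. The vector y and the choice of NA as fallback value are assumptions made for illustration and are not part of the example above.

```r
y <- c("hello xxx other stuff", "no pattern here")   # Hypothetical example vector

sub(" xxx.*", "", y)                                 # Unmatched strings are returned unchanged
# "hello"  "no pattern here"

before <- ifelse(grepl(" xxx", y),                   # Keep the result only where the pattern occurs
                 sub(" xxx.*", "", y),
                 NA_character_)
before
# "hello"  NA
```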

Example 2: Extract Characters After Pattern in R

In this example, I’ll show you how to return the characters after a particular pattern. As in Example 1, we have to use the sub function and the symbols “.*”. However, this time we have to put these symbols in front of our pattern “xxx”:

sub(".*xxx ", "", x)                 # Extract characters after pattern
# "other stuff"

This time the sub function is extracting the words on the right side of our pattern, i.e. “other stuff”.
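
Since sub is vectorized, both calls can also be applied to a whole character vector at once, for instance a column of a data frame. The sketch below assumes a made-up vector x_vec in which every element contains the pattern.

```r
x_vec <- c("hello xxx other stuff",                  # Hypothetical example vector
           "foo xxx bar",
           "left xxx right side")

before <- sub(" xxx.*", "", x_vec)                   # Characters before pattern
before
# "hello" "foo"   "left"

after <- sub(".*xxx ", "", x_vec)                    # Characters after pattern
after
# "other stuff" "bar"         "right side"

data.frame(x_vec, before, after)                     # Compare input and both results
```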

Summary

This article illustrated how to extract the substring before or after a specified pattern in the R programming language. If you have any further comments and/or questions, don’t hesitate to let me know in the comments below.

36 Comments

Carolina
November 22, 2020 2:51 am

Hello, I found this information quite interesting. I was wondering if there is a way to apply this in a data frame context. I mean, I have a data frame and I need to get the values from various columns that come after a column with a specific value.

- Joachim
  November 23, 2020 6:31 am
  
  Hey Carolina,
  
  Thank you for the kind words!
  
  To clarify your question: You want to check for a certain value in each column of your data frame and then you want to extract all columns after the column containing this value?
  
  Regards,
  
  Joachim
  
Austin
November 27, 2020 12:33 am

Hi,

This was super helpful. Is there any way to extract just the word directly in front of a string? So if you had “hello i am”, could you extract just the ‘i’ in front of “am”?

- Joachim
  November 27, 2020 6:54 am
  Hey Austin,
  
  Thanks a lot for the nice feedback!
  
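  The following R code returns everything before the first space, i.e. the first word of the string:
  sub(" .*", "", x)
  Explanation: Please compare that code with Example 1. In Example 1, we were looking for the pattern “ xxx”. At this position, you can specify any pattern you want, so in this case we are using the pattern “ ”. Note that for “hello i am” this returns “hello” rather than the “i” directly in front of “am”.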
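  To return only the word directly in front of a pattern, a capture group is one option. The sketch below assumes that the words are separated by single spaces and that at least one other word precedes the word of interest. The object name y is made up for illustration.

  ```r
  y <- "hello i am"                                  # Hypothetical example string
  sub(".* (\\S+) am.*", "\\1", y)                    # Keep only the word before " am"
  # "i"
  ```

  The parentheses capture the run of non-space characters in front of “ am”, and “\\1” inserts that captured part as the replacement.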
  
  I hope that helps!
  
  Joachim
cmoreno
September 29, 2021 3:35 am

Hi Joachim,
Very good trick! I was wondering what you do when the pattern is repeated two or more times in the string and the extraction needs to start from the first occurrence. For example, with ‘I_love_R’, if the before pattern is used [e.g. sub("_.*", "", x)], we extract ‘I’, and if the after pattern is used [e.g. sub(".*_", "", x)], we extract ‘R’. What about if we need to extract ‘love_R’?
Thank you!

- Joachim
  September 29, 2021 5:54 am
  Hey,
  
  Thank you for the kind words, glad you liked it! 🙂
  
  Regarding your question, please try the following R code:
  sub(".*?_", "", "I_love_R")
  The ? makes the preceding .* lazy, i.e. it matches as few characters as possible, so the match stops at the first occurrence of _ instead of the last one.
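  A short side-by-side comparison may make the difference clearer. The object name y is made up for illustration, and the last line shows one alternative that avoids the lazy quantifier by using a negated character class:

  ```r
  y <- "I_love_R"                                    # Example string from the question
  sub(".*_", "", y)                                  # Greedy: removes everything up to the last "_"
  # "R"
  sub(".*?_", "", y)                                 # Lazy: removes everything up to the first "_"
  # "love_R"
  sub("^[^_]*_", "", y)                              # Same result without a lazy quantifier
  # "love_R"
  ```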
  
  Regards
  
  Joachim
Carly Andrews
October 18, 2021 4:10 pm

Hello!

Can I extract only a certain number of characters after the string of interest? I have a column with an ‘ecoscore’ formatted as follows:

“Ecosystem Classification: boreal forest; EcoScore: 3.5/5. Small glade like openings”

I want to pull out only the 3.5/5.

- Joachim
  October 25, 2021 9:19 am
  Hey Carly,
  
  You may combine the sub and substr functions for this as shown below. The last number in the second line defines the number of characters to be extracted (i.e. 6).
  x <- "Ecosystem Classification: boreal forest; EcoScore: 3.5/5. Small glade like openings"
  x_new <- substr(sub(".*EcoScore: ", "", x), 1, 6)
  x_new
  # "3.5/5."
  Regards,
  Joachim
  - Vera
    June 13, 2022 8:07 am
    
    This is amazing and has really helped me in reading PDFs, thanks!
    Is there any way to make it read all characters until a certain character (e.g. for the example above, make it start reading after “EcoScore: ”, but then rather than reading 6 characters, read until the word “Small”)?
    
    - Joachim
      June 13, 2022 9:43 am
      Hey Vera,
      
      Thanks a lot for the kind words, glad it was helpful!
      
      Regarding your question, you may use the sub function twice as shown below:
      
      x <- "Ecosystem Classification: boreal forest; EcoScore: 3.5/5. Small glade like openings"
      x_new <- sub(".*EcoScore: ", "", x)
      x_new <- sub("Small.*", "", x_new)
      x_new
      # [1] "3.5/5. "
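
      As a possible alternative, both steps can be combined into one call by capturing the text between the two markers. This is only a sketch: it assumes that both “EcoScore: ” and “Small” occur in the string, and it sets perl = TRUE so that the lazy quantifier is handled by the Perl-compatible regex engine. trimws then removes the trailing space.

      ```r
      x <- "Ecosystem Classification: boreal forest; EcoScore: 3.5/5. Small glade like openings"
      x_new <- sub(".*EcoScore: (.*?)Small.*", "\\1", x, perl = TRUE)   # Keep text between the markers
      trimws(x_new)                                                      # Drop the trailing space
      # [1] "3.5/5."
      ```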
      
      Regards,
      Joachim
Kevin
February 6, 2022 10:23 pm

How can I extract two words before a pattern in R? For example:

“We love R”. I want to get the first two words “We love”.

- Joachim
  February 7, 2022 12:46 pm
  Hey,
  
  I assume there must be better alternatives, but the following code should extract the characters before the 2nd occurrence of “ ”:
  x <- "We love R"
  substr(x, 1, gregexpr(" ", x)[[1]][2] - 1)
  # [1] "We love"
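  Two other ways to get the same result are sketched below, under the assumption that the words are separated by single spaces and the string contains at least two words:

  ```r
  x <- "We love R"
  sub("^(\\S+ \\S+).*", "\\1", x)                    # Keep the first two space-separated words
  # [1] "We love"
  paste(strsplit(x, " ")[[1]][1:2], collapse = " ")  # Same idea via splitting at spaces
  # [1] "We love"
  ```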
  Regards,
  Joachim
Linda
February 24, 2022 9:14 am

Hello,

Thank you for sharing this! I have a question: what is the best way to extract a substring from a column of website URLs?

For example: http://hamptoninn3.hilton.com/en/hotels/washington/hampton-inn-and-suites-portland-vancouver-PDXVEHX/index.html

I just want to extract “hamptoninn3.hilton” from the website URL.

Thank you!

- Joachim
  March 9, 2022 11:15 am
  
  Hey Linda,
  
  Apologies for the late response, I just got back from vacation. Do you still need help with this?
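  
  In the meantime, one possible approach is to apply the sub function twice, similar to Examples 1 and 2. This is only a sketch: it assumes that every URL starts with “http://” or “https://” and that the part of interest ends right before “.com”. The object names url and host are made up for illustration.

  ```r
  url <- "http://hamptoninn3.hilton.com/en/hotels/washington/hampton-inn-and-suites-portland-vancouver-PDXVEHX/index.html"
  host <- sub("^https?://", "", url)                 # Remove the protocol prefix
  sub("\\.com.*", "", host)                          # Remove ".com" and everything after it
  # [1] "hamptoninn3.hilton"
  ```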
  
  Regards,
  Joachim
  
Jonathan Williams
May 16, 2022 10:54 am

Hello Joachim,

I want to extract new terms from formulae for successive R models, in order to annotate some output.
So, suppose the formula for my first model is
y ~ x1 + x2
I now add two more terms to the model, in succession:-
y ~ x1 + x2 + x3
y ~ x1 + x2 + x3 + x4
So, the difference between the first two models is ‘+x3’ and between the second two models is ‘x4’
In reality I have many gamlss models that are rather more complex and I want to construct a table that shows how the addition of each term contributes to the fit of each model. Something like:-
Model new term BIC
m1 – 1000
m2 +x3 910
m3 +x4 833
gamlss provides the ‘mu.formula’ – “mu.fo” for each model, which has 3 elements:
as.character(m1$mu.fo)
[1] ‘y’
[2] ‘~’
[3] ‘x1+x2’

I tried
gsub(as.character(m1$mu.fo)[3], "", as.character(m2$mu.fo)[3])
but this doesn’t work
I would greatly appreciate your advice on how to ‘subtract’ the formulae
With many thanks, in anticipation of your reply
Jonathan

- Joachim
  May 16, 2022 11:33 am
  Hey Jonathan,
  
  Are you looking for something like this?
  all_models <- "y ~ x1"
  for(i in 2:10) {
    all_models[i] <- paste0(all_models[i - 1], " + x", i)
  }
  all_models
  # [1] "y ~ x1"
  # [2] "y ~ x1 + x2"
  # [3] "y ~ x1 + x2 + x3"
  # [4] "y ~ x1 + x2 + x3 + x4"
  # [5] "y ~ x1 + x2 + x3 + x4 + x5"
  # [6] "y ~ x1 + x2 + x3 + x4 + x5 + x6"
  # [7] "y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7"
  # [8] "y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8"
  # [9] "y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9"
  # [10] "y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10"
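  A note on the gsub attempt from the question: the + in ‘x1+x2’ is interpreted as a regular expression quantifier, so the pattern does not match the literal text. Setting fixed = TRUE treats the pattern as a plain string. The sketch below uses the strings from the question as stand-ins for as.character(m1$mu.fo)[3] and as.character(m2$mu.fo)[3]; the names f1 and f2 are hypothetical.

  ```r
  f1 <- "x1+x2"                                      # Stand-in for the right-hand side of model 1
  f2 <- "x1+x2+x3"                                   # Stand-in for the right-hand side of model 2
  sub(f1, "", f2)                                    # "+" acts as a quantifier: nothing is replaced
  # [1] "x1+x2+x3"
  sub(f1, "", f2, fixed = TRUE)                      # Literal matching returns the added term
  # [1] "+x3"
  ```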
  Regards,
  Joachim
Daniel
November 10, 2022 12:57 pm

Hey Jonathan, thank you for your effort. I have a little problem.

I’ve strings like “dchbkcnlds[Y]”, “cdban[C]” etc. and I want to extract the letter(s) between the brackets. Unfortunately, none of the previously mentioned examples work here because sub() throws an error if I type something like this: sub("[.*", "", ExampleString).

I hope you can help me out 🙂

Regards,
Daniel

- Daniel
  November 10, 2022 1:28 pm
  
  Furthermore, if I have something like this
  
  “A|[B]phosphorylation[C]|D|[E]phosphorylation[F]|G|[H]Unknown:I[J]”
  
  my goal would be to get “C” as well as “F”. My indicators for what I want are:
  
  1. It is always in between brackets.
  2. The substring prior to the desired brackets always begins with either “phos” or “Phos”.
  
  - Joachim
    November 14, 2022 12:54 pm
    
    Thank you for the further clarifications.
    
    Regards,
    Joachim
    
- Joachim
  November 14, 2022 12:54 pm
  
  Hi Daniel,
  
  I apologize for the delayed reply. I was on a long holiday, so unfortunately I wasn’t able to get back to you earlier. Do you still need help with your syntax?
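  
  Regarding the error: the [ opens a character class in a regular expression, so it has to be escaped as \\[ to be matched literally. The following sketch covers both cases from the comments above. It assumes that every string contains the brackets; the object names s1, s2, and hits are made up for illustration, and perl = TRUE is set for the second pattern.

  ```r
  s1 <- c("dchbkcnlds[Y]", "cdban[C]")               # Example strings from the question
  sub(".*\\[(.*)\\].*", "\\1", s1)                   # Keep the text between the brackets
  # [1] "Y" "C"

  s2 <- "A|[B]phosphorylation[C]|D|[E]phosphorylation[F]|G|[H]Unknown:I[J]"
  hits <- regmatches(s2, gregexpr("[Pp]hos[^\\[|]*\\[[^\\]]*\\]", s2, perl = TRUE))[[1]]
  hits                                               # Pieces starting with "phos" or "Phos"
  # [1] "phosphorylation[C]" "phosphorylation[F]"
  sub(".*\\[(.*)\\]", "\\1", hits)                   # Bracket content of each piece
  # [1] "C" "F"
  ```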
  
  Regards,
  Joachim
  
Eva
January 19, 2023 3:55 pm

Hello, this is very interesting!
I also have a question about data frame: my problem is that I have a column with sample names in a data frame, and I want to cut a part of the sample name which is not relevant. The other problem is that this part I want to remove is changing among samples… so ideally I am searching for a way to “remove everything before a common pattern”, in lines of a data frame, but I can’t find that for the moment…

If you know how to do it and have time to help me, that would be really kind. Thank you in advance!

Regards,
Eva

- Cansu (Statistics Globe)
  January 19, 2023 4:31 pm
  Hello Eva,
  
  First of all, thank you for your kind words! As far as I understood, you want to apply the method in the data frame context. If I am wrong, let me know. If so, you can implement it like in the sample code below.
  data <- data.frame(col = c("hshs xxx well you can",
                             "jsjsj xxx probably",
                             "xxx do it",
                             "ahdhsh jljkjl xxx like this"))   # Example dataframe
  data
  #                           col
  # 1       hshs xxx well you can
  # 2          jsjsj xxx probably
  # 3                   xxx do it
  # 4 ahdhsh jljkjl xxx like this
  data$rmvd_col <- sub(".*xxx ", "", data$col)
  data
  #                           col     rmvd_col
  # 1       hshs xxx well you can well you can
  # 2          jsjsj xxx probably     probably
  # 3                   xxx do it        do it
  # 4 ahdhsh jljkjl xxx like this    like this
  Regards,
  Cansu
  - Eva
    January 20, 2023 11:11 am
    
    Hello Cansu,
    Thanks for your quick reply!
    Unfortunately it doesn’t work, I don’t know why… But anyway thanks for your help.
    
    Best,
    
    Eva
    
    - Cansu (Statistics Globe)
      January 20, 2023 12:06 pm
      
      Hello Eva,
      
      Do you get any errors, or does nothing happen when you run the code? If it is the latter, could you provide a sample of your data?
      
      Regards,
      Cansu
      
jay
January 26, 2023 8:31 pm

Great tutorial!

Is there any way to search for 3 consecutive letters instead of xxx? I have a data frame with a long string in one of the columns. It is always messy, but also always starts with at least 3 letters.

myColumn
2022- PQ. ITEM DESC red
2022- #( RF. PURPLE CAR DESC purple

I would like a new column (along with the original) with:

ITEM DESC red
ITEM DESC purple

- Cansu (Statistics Globe)
  January 27, 2023 9:04 am
  
  Hello Jay,
  
  Do you want your new column to always start with ITEM DESC? Is it possible for you to show more samples from your original column and the column that you want to convert?
  
  Regards,
  Cansu
  
SabV
March 13, 2023 8:16 pm

I have a notes column with the sample text below. I would like to extract only the date from each entry and create a column named date. What is the best way to do this? Thanks
Notes Column
10/17/22 JHones: CASEA keep this record
11/17/22 HCamarones: CASEA keep this record
3/3/22 KGalvanonens: CASEA keep this record

- Cansu (Statistics Globe)
  March 14, 2023 4:53 pm
  
  Hello,
  
  Maybe this thread can help with what you want to implement.
  
  Regards,
  Cansu
  
SabV
March 15, 2023 2:25 am

This is what I needed. Thanks so much!

- Cansu (Statistics Globe)
  March 15, 2023 8:55 am
  
  Great, welcome!
  
Julia Rosa
April 3, 2024 12:32 pm

If I have a column vector in a data frame, and each field entry is a unique long character string with ages ‘hidden’ therein, how do I extract only the ages as numerical values and add them to an existing ‘age’ column vector, or a new one? An example field entry in the column ‘text’ may look like this: “FALL; injured his head on jan 12, 2023.; This is a spontaneous report from a contactable colleague reporting on their family member. A 90-year-old male patient received first dose of B”

My idea was to locate “-year-old” and to simply ‘report’ the two numbers that come before it, but alas, I am getting nowhere fast. Any help would be appreciated!

J

- Joachim (Statistics Globe)
  April 4, 2024 7:30 am
  
  Hey Julia,
  
  To achieve this, you may use gsub and a regular expression to match the age pattern. For instance, df$age <- as.numeric(gsub(".*?(\\d+)-year-old.*", "\\1", df$text)) will search for the pattern "X-year-old" where X is the age, capture the age part, and replace the entire string with just the age, converting it to numeric. This approach assumes your data is in a data frame called df.
  
  I hope this helps!
  
  Joachim
  
  - Julia Rosa
    April 4, 2024 9:14 am
    
    Very helpful! Here’s another level of complexity. I have more than one key expression associated with pulling out ages such as “-years-old” (notice the additional ‘s’) in the ‘text’ column, so how can I run this same code a number of times to ‘fill out’ the AGE column properly without replacing the previous sub-ins? I tried mutating new columns and thought to cbind them, but my code is not working.
    
    Thank you again!
    
    - Joachim (Statistics Globe)
      April 4, 2024 9:25 am
      Glad it is helpful! To handle multiple key expressions without overwriting previous substitutions, you can modify your approach to use ifelse alongside your gsub to conditionally update the age column only when a match is found. This way, if the age column already contains a valid age (anything other than the default value you set for non-matches), it won’t be overwritten by subsequent operations. Here’s how you could do it assuming your default non-match value is NA or some invalid age representation (e.g., -1):
      
      # Assuming df is your dataframe and 'age' column is initialized with NAs or an invalid age indicator like -1
      
      # First pattern: "-year-old"
      df$age <- ifelse(is.na(df$age) | df$age == -1, as.numeric(gsub(".*?(\\d+)-year-old.*", "\\1", df$text)), df$age)
      
      # Second pattern: "-years-old" (notice the additional 's')
      df$age <- ifelse(is.na(df$age) | df$age == -1, as.numeric(gsub(".*?(\\d+)-years-old.*", "\\1", df$text)), df$age)
      
      # You can add more patterns here following the same structure
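
      As a possible simplification, the two spellings can be covered by one pattern, because “s?” matches an optional “s”. The sketch below also checks with grepl whether the pattern occurs at all: where nothing matches, gsub and sub return the full text, which as.numeric would turn into NA with a warning. The vector text is a made-up stand-in for df$text, and perl = TRUE is an assumption made so that the lazy quantifier is handled by the Perl-compatible regex engine.

      ```r
      text <- c("A 90-year-old male patient",            # Hypothetical example entries
                "a 45-years-old woman",
                "no age mentioned")
      pattern <- ".*?(\\d+)-years?-old.*"                # "s?" covers both spellings
      age <- ifelse(grepl(pattern, text, perl = TRUE),   # Extract only where the pattern occurs
                    sub(pattern, "\\1", text, perl = TRUE),
                    NA)
      as.numeric(age)
      # [1] 90 45 NA
      ```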
      
      Regards,
      Joachim
      - Julia Rosa
        April 4, 2024 9:45 am
        
        wow. you’re good! it works perfectly. 🙂
        
        thanks so much for not only being a pro but being fast!
        
        J
      - Joachim (Statistics Globe)
        April 4, 2024 10:43 am
        
        You are very welcome Julia, and thank you so much for the very kind words! 🙂