Extract Substring Before or After Pattern in R (2 Examples)

 

In this article, you’ll learn how to return the characters of a string in front of or after a certain pattern in the R programming language.


Let’s dive right in:

 

Creation of Example Data

Let’s first create a character string in R that we can use in the examples later on:

x <- "hello xxx other stuff"         # Example character string
x                                    # Print example string
# "hello xxx other stuff"

Our example string consists of the words “hello” and “other stuff” as well as of the pattern “xxx” in between.

 

Example 1: Extract Characters Before Pattern in R

Let’s assume that we want to extract all characters of our character string before the pattern “xxx”. Then, we can use the sub function as follows:

sub(" xxx.*", "", x)                 # Extract characters before pattern
# "hello"

As you can see based on the output of the RStudio console, the previous R code returned only the substring “hello”, i.e. the characters before the pattern “xxx”.

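Note that we had to specify the symbols “.*” after the pattern “xxx” within the sub function in order to get this result: “.*” matches any sequence of characters, so the pattern and everything after it are replaced by an empty string. The blank in front of “xxx” is part of the pattern, so it is removed as well.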
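
One detail that is easy to overlook: if the pattern does not occur in a string, the sub function returns that string unchanged. The following minimal sketch illustrates this behavior and shows one possible way to return NA in such cases. The vector x_vec and the helper function before_pattern are hypothetical names introduced for this illustration only; they are not part of the example above.

```r
x_vec <- c("hello xxx other stuff", "no marker here")   # Hypothetical example vector

sub(" xxx.*", "", x_vec)                                 # Non-matching strings are returned unchanged
# "hello" "no marker here"

before_pattern <- function(x, pattern) {                 # Hypothetical helper: NA if pattern is missing
  ifelse(grepl(pattern, x), sub(paste0(pattern, ".*"), "", x), NA_character_)
}

res_before <- before_pattern(x_vec, " xxx")              # Apply helper to example vector
res_before
# "hello" NA
```

The helper treats its pattern argument as a regular expression, so special characters such as “.” or “(” would have to be escaped first.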

 

Example 2: Extract Characters After Pattern in R

In this example, I’ll show you how to return the characters after a particular pattern. As in Example 1, we have to use the sub function and the symbols “.*”. However, this time we have to put these symbols in front of our pattern “xxx”:

sub(".*xxx ", "", x)                 # Extract characters after pattern
# "other stuff"

This time the sub function is extracting the words on the right side of our pattern, i.e. “other stuff”.
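
If both parts are needed at once, splitting the string at the pattern is a possible alternative to two separate sub calls. The sketch below uses the base R function strsplit with fixed = TRUE, so that the pattern is treated as literal text rather than as a regular expression. The object name x_parts is an assumption made for this illustration.

```r
x <- "hello xxx other stuff"                          # Example character string from above

x_parts <- strsplit(x, " xxx ", fixed = TRUE)[[1]]    # Split string at the pattern
x_parts[1]                                            # Characters before pattern
# "hello"
x_parts[2]                                            # Characters after pattern
# "other stuff"
```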

 

Video, Further Resources & Summary

If you need further explanations of the R code in this post, I can recommend the accompanying video on my YouTube channel. In the video, I illustrate how to truncate and trim character strings at a certain character using the R code of this article.

 

 


 

Summary: This article illustrated how to get the substrings before or after a specified pattern in the R programming language. If you have any further comments and/or questions, don’t hesitate to let me know in the comments below.

 



30 Comments

  • Hello, I found this information quite interesting. I was wondering if there is a way to apply this in a data frame context. I mean, I have a data frame and I need to get the values from the various columns that come after a column with a specific value.

    • Hey Carolina,

      Thank you for the kind words!

      To clarify your question: You want to check for a certain value in each column of your data frame and then you want to extract all columns after the column containing this value?

      Regards,

      Joachim

  • Hi,

    This was super helpful. Is there any way to just extract the word directly in front of a string? So if you had “hello i am” to be able to extract just the ‘i’ in front of am?

    • Hey Austin,

      Thanks a lot for the nice feedback!

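      You can do that by using the following R code:

      sub(".* ([^ ]+) am.*", "\\1", "hello i am")
      # "i"

      Explanation: Please compare that code with Example 1. In Example 1, we were looking for the pattern “ xxx”. Here, the parentheses capture the word directly in front of “ am”, and “\\1” returns only this captured word. Note that sub(" .*", "", x) would return everything before the first blank instead, i.e. “hello”.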

      I hope that helps!

      Joachim

  • Hi Joachim,
    Very good trick! I was wondering what you do when the pattern is repeated two or more times in the string and the extraction needs to start from the first pattern. For example, for ‘I_love_R’, if the before pattern is used [e.g. sub("_.*", "", x)], we extract ‘I’, and if the after pattern is used [e.g. sub(".*_", "", x)], we extract ‘R’. What if we need to extract ‘love_R’?
    Thank you!

    • Hey,

      Thank you for the kind words, glad you liked it! 🙂

      Regarding your question, please try the following R code:

      sub(".*?_", "", "I_love_R")

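      The ? makes the preceding “.*” non-greedy (lazy), i.e. it matches as few characters as possible. Therefore, the match stops at the first occurrence of “_” instead of the last one.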

      Regards

      Joachim
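
The difference between greedy and lazy matching discussed in this thread can be sketched side by side. The object name s is an assumption made for this illustration; the last line additionally shows a capture group that keeps everything before the last “_”.

```r
s <- "I_love_R"                 # String from the question above

sub("_.*", "", s)               # Everything before the first "_"
# "I"
sub(".*_", "", s)               # Greedy: everything after the last "_"
# "R"
sub(".*?_", "", s)              # Lazy: everything after the first "_"
# "love_R"
sub("(.*)_.*", "\\1", s)        # Capture group: everything before the last "_"
# "I_love"
```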

  • Carly Andrews
    October 18, 2021 4:10 pm

    Hello!

    Can I extract only a certain number of characters after the string of interest? I have a column with an ‘ecoscore’ formatted as follows:

    “Ecosystem Classification: boreal forest; EcoScore: 3.5/5. Small glade like openings”

    I want to pull out only the 3.5/5.

    • Hey Carly,

      You may combine the sub and substr functions for this as shown below. The last number in the second line defines the number of characters to be extracted (i.e. 6).

      x <- "Ecosystem Classification: boreal forest; EcoScore: 3.5/5. Small glade like openings"
      x_new <- substr(sub(".*EcoScore: ", "", x), 1, 6)
      x_new
      # "3.5/5."

      Regards,
      Joachim

      • This is amazing and has really helped me in reading PDFs, thanks!
        Is there any way to make it read all characters until a certain character (e.g. for the example above, make it start reading after “EcoScore: ”, but then, rather than reading 6 characters, read until the word “Small”)?

        • Hey Vera,

          Thanks a lot for the kind words, glad it was helpful!

          Regarding your question, you may use the sub function twice as shown below:

          x <- "Ecosystem Classification: boreal forest; EcoScore: 3.5/5. Small glade like openings"
          x_new <- sub(".*EcoScore: ", "", x)
          x_new <- sub("Small.*", "", x_new)
          x_new
          # [1] "3.5/5. "

          Regards,
          Joachim
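
Note that the result of the two sub calls above still contains a trailing blank. As a minimal sketch, the blank could be removed with trimws, or the whole extraction could be done in a single call with a capture group. The object names x_trim and x_group are assumptions made for this illustration, and perl = TRUE is used in the second variant so that “\\s” and the lazy quantifier follow Perl-style regular expression rules.

```r
x <- "Ecosystem Classification: boreal forest; EcoScore: 3.5/5. Small glade like openings"

x_trim <- trimws(sub("Small.*", "", sub(".*EcoScore: ", "", x)))         # Two sub calls plus trimws
x_trim
# "3.5/5."

x_group <- sub(".*EcoScore: (.*?)\\s*Small.*", "\\1", x, perl = TRUE)    # One call with a capture group
x_group
# "3.5/5."
```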

  • How can I extract two words before a pattern in R. For example:

    “We love R”. I want to get the first two words “We love”

    • Hey,

      I assume there must be better alternatives, but the following code should extract the characters before the 2nd occurrence of “ ”:

      x <- "We love R"
      substr(x, 1, gregexpr(" ", x)[[1]][2] - 1)
      # [1] "We love"

      Regards,
      Joachim
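
Two possible alternatives to the position-based code above are sketched below: a capture group that keeps the first two blank-separated words, and a split-and-rejoin approach. Both assume that the words are separated by single blanks.

```r
x <- "We love R"

sub("^([^ ]+ [^ ]+).*", "\\1", x)                        # Keep the first two blank-separated words
# "We love"

paste(head(strsplit(x, " ")[[1]], 2), collapse = " ")    # Split into words and re-join the first two
# "We love"
```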

  • Hello,

    Thank you for sharing this! I have a question: If I want to extract a substring from a column of website URLs, what is the best way to do that?

    For example: http://hamptoninn3.hilton.com/en/hotels/washington/hampton-inn-and-suites-portland-vancouver-PDXVEHX/index.html

    I just want to extract “hamptoninn3.hilton” from the website url.

    Thank you!
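
One way this could be approached with the sub function from the article is sketched below. This is only an illustration under stated assumptions: the object names url_string and host are hypothetical, and the last step simply drops the final dot-separated part of the host name, which would not be sufficient for endings such as “.co.uk”.

```r
url_string <- "http://hamptoninn3.hilton.com/en/hotels/washington/hampton-inn-and-suites-portland-vancouver-PDXVEHX/index.html"

host <- sub("^[a-z]+://", "", url_string)    # Remove the scheme, e.g. "http://"
host <- sub("/.*", "", host)                 # Remove everything from the first "/" onwards
host
# "hamptoninn3.hilton.com"

host_short <- sub("\\.[^.]*$", "", host)     # Remove the last dot-separated part
host_short
# "hamptoninn3.hilton"
```

Because sub is vectorized, the same calls can be applied to an entire data frame column of URLs.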

  • Jonathan Williams
    May 16, 2022 10:54 am

    Hello Joachim,

    I want to extract new terms from formulae for successive R models, in order to annotate some output.
    So, suppose the formula for my first model is
    y ~ x1 + x2
    I now add two more terms to the model, in succession:-
    y ~ x1 + x2 + x3
    y ~ x1 + x2 + x3 + x4
    So, the difference between the first two models is ‘+x3’ and between the second two models is ‘+x4’
    In reality I have many gamlss models that are rather more complex and I want to construct a table that shows how the addition of each term contributes to the fit of each model. Something like:-
    Model   new term   BIC
    m1      –          1000
    m2      +x3        910
    m3      +x4        833
    gamlss provides the ‘mu.formula’ – “mu.fo” for each model, which has 3 elements:
    as.character(m1$mu.fo)
    [1] ‘y’
    [2] ‘~’
    [3] ‘x1+x2’

    I tried
    gsub(as.character(m1$mu.fo)[3], "", as.character(m2$mu.fo)[3])
    but this doesn’t work
    I would greatly appreciate your advice on how to ‘subtract’ the formulae
    With many thanks, in anticipation of your reply
    Jonathan

    • Hey Jonathan,

      Are you looking for something like this?

      all_models <- "y ~ x1"
      for(i in 2:10) {
        all_models[i] <- paste0(all_models[i - 1], " + x", i)
      }
      all_models
      # [1] "y ~ x1"                                              
      # [2] "y ~ x1 + x2"                                         
      # [3] "y ~ x1 + x2 + x3"                                    
      # [4] "y ~ x1 + x2 + x3 + x4"                               
      # [5] "y ~ x1 + x2 + x3 + x4 + x5"                          
      # [6] "y ~ x1 + x2 + x3 + x4 + x5 + x6"                     
      # [7] "y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7"                
      # [8] "y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8"           
      # [9] "y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9"      
      # [10] "y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10"

      Regards,
      Joachim
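
Regarding the gsub attempt in the question: a likely reason why it does not work is that “+” is a special character in regular expressions, so the right-hand side of the first formula is not matched literally. The sketch below uses plain formulas instead of gamlss models (the objects f1, f2, rhs1, rhs2, and new_terms are hypothetical) and shows two possible workarounds: literal matching with fixed = TRUE, and a comparison of the term labels returned by the terms function.

```r
f1 <- y ~ x1 + x2                              # Hypothetical model formulas
f2 <- y ~ x1 + x2 + x3

rhs1 <- as.character(f1)[3]                    # "x1 + x2"
rhs2 <- as.character(f2)[3]                    # "x1 + x2 + x3"

sub(rhs1, "", rhs2)                            # "+" acts as a quantifier, so nothing is matched
# "x1 + x2 + x3"
trimws(sub(rhs1, "", rhs2, fixed = TRUE))      # Literal matching removes the shared part
# "+ x3"

new_terms <- setdiff(attr(terms(f2), "term.labels"),
                     attr(terms(f1), "term.labels"))   # Compare term labels instead of strings
new_terms
# "x3"
```

The term-label comparison does not depend on the order or spacing of the terms, which may make it the more robust option for complex formulas.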

  • Hey Jonathan, thank you for your effort. I have a little problem.

    I have strings like “dchbkcnlds[Y]”, “cdban[C]” etc. and I want to extract the letter(s) between the brackets. Unfortunately, none of the previously mentioned examples work here, because sub() throws an error if I type something like this: sub("[.*", "", ExampleString).

    I hope you can help me out 🙂

    Regards,
    Daniel

    • Furthermore, if I have something like this

      “A|[B]phosphorylation[C]|D|[E]phosphorylation[F]|G|[H]Unknown:I[J]”

      my goal would be to get “C” as well as “F”. My indicators for what I want are:

      1. It is always in between brackets.
      2. The substring prior to the desired brackets always begins with either “phos” or “Phos”.

    • Hi Daniel,

      I apologize for the delayed reply. I was on a long holiday, so unfortunately I wasn’t able to get back to you earlier. Do you still need help with your syntax?

      Regards,
      Joachim
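
The error described in this thread occurs because “[” opens a character class in a regular expression, so it has to be escaped as “\\[” when it should be matched literally. The following sketch shows one possible approach for both questions. The object names s1, s2, and hits are assumptions made for this illustration, and the second pattern assumes that only letters appear between the brackets.

```r
s1 <- c("dchbkcnlds[Y]", "cdban[C]")
sub(".*\\[(.*)\\].*", "\\1", s1)               # Escape "[" and "]" and capture the text in between
# "Y" "C"

s2 <- "A|[B]phosphorylation[C]|D|[E]phosphorylation[F]|G|[H]Unknown:I[J]"
hits <- regmatches(s2, gregexpr("[Pp]hos[a-z]*\\[[A-Za-z]+\\]", s2))[[1]]   # All "phos...[X]" parts
hits
# "phosphorylation[C]" "phosphorylation[F]"
sub(".*\\[(.*)\\]", "\\1", hits)               # Keep only the letters in the brackets
# "C" "F"
```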

  • Hello, this is very interesting!
    I also have a question about data frames: I have a column with sample names in a data frame, and I want to cut off a part of each sample name that is not relevant. The other problem is that the part I want to remove changes among samples… so ideally I am searching for a way to “remove everything before a common pattern” in the rows of a data frame, but I can’t find that for the moment…

    If you know how to do it and have time to help me, that would be really kind. Thank you in advance!

    Regards,
    Eva

    • Hello Eva,

      First of all, thank you for your kind words! As far as I understood, you want to apply the method in the data frame context. If I am wrong, let me know. If so, you can implement it like in the sample code below.

      data <- data.frame(col = c("hshs xxx well you can",
                                 "jsjsj xxx probably",
                                 "xxx do it",
                                 "ahdhsh jljkjl xxx like this"))   # Example data frame
      data
      #                           col
      # 1       hshs xxx well you can
      # 2          jsjsj xxx probably
      # 3                   xxx do it
      # 4 ahdhsh jljkjl xxx like this

      data$rmvd_col <- sub(".*xxx ", "", data$col)                 # Remove everything up to and including the pattern
      data
      #                           col     rmvd_col
      # 1       hshs xxx well you can well you can
      # 2          jsjsj xxx probably     probably
      # 3                   xxx do it        do it
      # 4 ahdhsh jljkjl xxx like this    like this

      Regards,
      Cansu

  • Great tutorial!

    Is there any way to search for 3 consecutive letters instead of xxx? I have a data frame with a long string in one of the columns. It is always messy, but also always starts with at least 3 letters.

    myColumn
    2022- PQ. ITEM DESC red
    2022- #( RF. PURPLE CAR DESC purple

    I would like a new column (along with the original) with:

    ITEM DESC red
    ITEM DESC purple
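
A possible approach for this question is sketched below. It assumes that the part to keep starts at the first run of three consecutive letters; the object name my_column is taken from the question, while the pattern itself is only an illustration. The argument perl = TRUE is used so that the lazy quantifier “.*?” and the repetition “{3}” follow Perl-style regular expression rules.

```r
my_column <- c("2022- PQ. ITEM DESC red",
               "2022- #( RF. PURPLE CAR DESC purple")

new_column <- sub("^.*?([A-Za-z]{3}.*)$", "\\1", my_column, perl = TRUE)   # Keep from first 3-letter run
new_column
# "ITEM DESC red" "PURPLE CAR DESC purple"
```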

  • I have a notes column with sample text below. I would like to extract only the date from each of them and create a column named date. What is the best way to do this? Thanks
    Notes Column
    10/17/22 JHones: CASEA keep this record
    11/17/22 HCamarones: CASEA keep this record
    3/3/22 KGalvanonens: CASEA keep this record
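
As a minimal sketch for this question, the dates could be extracted with regexpr and regmatches and then converted with as.Date. The object names notes and date_chr are assumptions made for this illustration, and the format string assumes that the dates are given in month/day/year order.

```r
notes <- c("10/17/22 JHones: CASEA keep this record",
           "11/17/22 HCamarones: CASEA keep this record",
           "3/3/22 KGalvanonens: CASEA keep this record")

date_chr <- regmatches(notes, regexpr("[0-9]{1,2}/[0-9]{1,2}/[0-9]{2}", notes))   # Extract date strings
date_chr
# "10/17/22" "11/17/22" "3/3/22"

date <- as.Date(date_chr, format = "%m/%d/%y")   # Convert to Date class (assumes month/day/year)
date
# "2022-10-17" "2022-11-17" "2022-03-03"
```

Note that regmatches drops elements without a match, so this sketch only lines up with the original column if every note contains a date.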

  • This is what I needed. Thanks so much
