Extract Substring Before or After Pattern in R (2 Examples)

 

In this article, you’ll learn how to extract the characters of a string before or after a certain pattern in the R programming language.

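The content of the page is structured as follows:

1) Creation of Example Data
2) Example 1: Extract Characters Before Pattern in R
3) Example 2: Extract Characters After Pattern in R
4) Video, Further Resources & Summary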

Let’s dive right in:

 

Creation of Example Data

Let’s first create a character string in R that we can use in the examples later on:

x <- "hello xxx other stuff"         # Example character string
x                                    # Print example string
# "hello xxx other stuff"

Our example string consists of the words “hello” and “other stuff” as well as of the pattern “xxx” in between.

 

Example 1: Extract Characters Before Pattern in R

Let’s assume that we want to extract all characters of our character string before the pattern “xxx”. Then, we can use the sub function as follows:

sub(" xxx.*", "", x)                 # Extract characters before pattern
# "hello"

As you can see based on the output of the RStudio console, the previous R code returned only the substring “hello”, i.e. the characters before the pattern “xxx”.

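Note that we had to specify the symbols “.*” after the pattern “xxx” within the sub function in order to get this result. In a regular expression, “.*” matches any sequence of characters, so the pattern and everything after it are replaced by an empty string.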
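The same call also works on more than one string at a time, because sub is vectorized. The following is a minimal sketch, assuming a hypothetical character vector vec that is not part of the original example. Elements that do not contain the pattern are returned unchanged.

```r
# Hypothetical character vector (not part of the original example data)
vec <- c("hello xxx other stuff", "foo xxx bar", "no pattern here")

# sub() is vectorized: remove " xxx" and everything after it in each element
before <- sub(" xxx.*", "", vec)
before
# [1] "hello"           "foo"             "no pattern here"
```

The same approach should carry over to a data frame column, since a character column is just a character vector.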

 

Example 2: Extract Characters After Pattern in R

In this example, I’ll show you how to return the characters after a particular pattern. As in Example 1, we have to use the sub function and the symbols “.*”. However, this time we have to put these symbols in front of our pattern “xxx”:

sub(".*xxx ", "", x)                 # Extract characters after pattern
# "other stuff"

This time the sub function is extracting the words on the right side of our pattern, i.e. “other stuff”.
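One detail worth keeping in mind is what happens when the pattern occurs more than once. The following is a minimal sketch, assuming a hypothetical string y that contains “xxx” twice. By default “.*” is greedy and matches up to the last occurrence of the pattern, while “.*?” is lazy and stops at the first occurrence.

```r
# Hypothetical string containing the pattern twice (not from the original example)
y <- "a xxx b xxx c"

# Greedy ".*": everything up to the LAST "xxx " is removed
greedy <- sub(".*xxx ", "", y)
greedy
# [1] "c"

# Lazy ".*?": only the text up to the FIRST "xxx " is removed
lazy <- sub(".*?xxx ", "", y)
lazy
# [1] "b xxx c"
```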

 

Video, Further Resources & Summary

If you need further explanations of the R code in this post, I can recommend watching the corresponding video on my YouTube channel. In the video, I illustrate how to truncate and trim character strings from a certain character using the R code of this article.

 


 


 

Summary: This article illustrated how to extract the substring before or after a specified pattern in the R programming language. If you have any further comments and/or questions, don’t hesitate to let me know in the comments below.

 



20 Comments

  • Hello, I found this information quite interesting. I was wondering if there is a way to apply this in a data frame context. I mean, I have a data frame and I need to get the values from various columns that come after a column with a specific value.

    • Hey Carolina,

      Thank you for the kind words!

      To clarify your question: You want to check for a certain value in each column of your data frame and then you want to extract all columns after the column containing this value?

      Regards,

      Joachim

  • Hi,

    This was super helpful. Is there any way to just extract the word directly in front of a string? So if you had “hello i am” to be able to extract just the ‘i’ in front of am?

    • Hey Austin,

      Thanks a lot for the nice feedback!

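      You can do that by using the following R code:

      sub(".* ([^ ]+) am.*", "\\1", "hello i am")
      # "i"

      Explanation: The parentheses capture the word directly in front of “ am”, and “\\1” replaces the entire match with that captured word. Please compare that code with Example 1: There, we removed the pattern ” xxx” and everything after it. Using the pattern ” ” in the same way, i.e. sub(" .*", "", x), would remove everything from the first space onward and return the first word “hello” instead.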

      I hope that helps!

      Joachim

  • Hi Joachim,
    Very good trick! I was wondering what you do when the pattern is repeated two or more times in the string and the extraction needs to start from the first occurrence. For example, take ‘I_love_R’: if the before pattern is used [e.g. sub("_.*", "", x)], we extract ‘I’, and if the after pattern is used [e.g. sub(".*_", "", x)], we extract ‘R’. What if we need to extract ‘love_R’?
    Thank you!

    • Hey,

      Thank you for the kind words, glad you liked it! 🙂

      Regarding your question, please try the following R code:

      sub(".*?_", "", "I_love_R")

      The ? makes the preceding “.*” lazy (non-greedy), i.e. it matches as few characters as possible, so the match stops at the first “_” instead of the last one.

      Regards

      Joachim

  • Carly Andrews
    October 18, 2021 4:10 pm

    Hello!

    Can I extract only a certain number of characters after the string of interest? I have a column with an ‘ecoscore’ formatted as follows:

    “Ecosystem Classification: boreal forest; EcoScore: 3.5/5. Small glade like openings”

    I want to pull out only the 3.5/5.

    • Hey Carly,

      You may combine the sub and substr functions for this as shown below. The last number in the second line defines the number of characters to be extracted (i.e. 6).

      x <- "Ecosystem Classification: boreal forest; EcoScore: 3.5/5. Small glade like openings"
      x_new <- substr(sub(".*EcoScore: ", "", x), 1, 6)
      x_new
      # "3.5/5."

      Regards,
      Joachim

      • This is amazing and has really helped me in reading PDFs, thanks!
        Is there any way to make it read all characters until a certain character (e.g. for the example above, make it start reading after “EcoScore: ” but then, rather than reading 6 characters, read until the word “Small”)?

        • Hey Vera,

          Thanks a lot for the kind words, glad it was helpful!

          Regarding your question, you may use the sub function twice as shown below:

          x <- "Ecosystem Classification: boreal forest; EcoScore: 3.5/5. Small glade like openings"
          x_new <- sub(".*EcoScore: ", "", x)
          x_new <- sub("Small.*", "", x_new)
          x_new
          # [1] "3.5/5. "

          Regards,
          Joachim

  • How can I extract two words before a pattern in R? For example:

    “We love R”. I want to get the first two words “We love”

    • Hey,

      I assume there must be better alternatives, but the following code should extract the characters before the 2nd occurrence of ” “:

      x <- "We love R"
      substr(x, 1, gregexpr(" ", x)[[1]][2] - 1)
      # [1] "We love"

      Regards,
      Joachim

  • Hello,

    Thank you for sharing this! I have a question: What is the best way to extract a substring from a column of website URLs?

    For example: http://hamptoninn3.hilton.com/en/hotels/washington/hampton-inn-and-suites-portland-vancouver-PDXVEHX/index.html

    I just want to extract “hamptoninn3.hilton” from the website url.

    Thank you!

  • Jonathan Williams
    May 16, 2022 10:54 am

    Hello Joachim,

    I want to extract new terms from formulae for successive R models, in order to annotate some output.
    So, suppose the formula for my first model is
    y ~ x1 + x2
    I now add two more terms to the model, in succession:-
    y ~ x1 + x2 + x3
    y ~ x1 + x2 + x3 + x4
    So, the difference between the first two models is ‘+x3’ and between the second two models is ‘+x4’.
    In reality I have many gamlss models that are rather more complex and I want to construct a table that shows how the addition of each term contributes to the fit of each model. Something like:-
    Model   New term   BIC
    m1      –          1000
    m2      +x3        910
    m3      +x4        833
    gamlss provides the ‘mu.formula’ – “mu.fo” for each model, which has 3 elements:
    as.character(m1$mu.fo)
    [1] ‘y’
    [2] ‘~’
    [3] ‘x1+x2’

    I tried
    gsub(as.character(m1$mu.fo)[3], "", as.character(m2$mu.fo)[3])
    but this doesn’t work
    I would greatly appreciate your advice on how to ‘subtract’ the formulae
    With many thanks, in anticipation of your reply
    Jonathan

    • Hey Jonathan,

      Are you looking for something like this?

      all_models <- "y ~ x1"
      for(i in 2:10) {
        all_models[i] <- paste0(all_models[i - 1], " + x", i)
      }
      all_models
      # [1] "y ~ x1"                                              
      # [2] "y ~ x1 + x2"                                         
      # [3] "y ~ x1 + x2 + x3"                                    
      # [4] "y ~ x1 + x2 + x3 + x4"                               
      # [5] "y ~ x1 + x2 + x3 + x4 + x5"                          
      # [6] "y ~ x1 + x2 + x3 + x4 + x5 + x6"                     
      # [7] "y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7"                
      # [8] "y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8"           
      # [9] "y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9"      
      # [10] "y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10"

      Regards,
      Joachim

  • Hey Jonathan, thank you for your effort. I have a little problem.

    I have strings like “dchbkcnlds[Y]”, “cdban[C]” etc., and I want to extract the letter(s) between the brackets. Unfortunately, none of the previously mentioned examples work here because sub() throws an error if I type something like this: sub("[.*", "", ExampleString).

    I hope you can help me out 🙂

    Regards,
    Daniel

    • Furthermore if i have something like this

      “A|[B]phosphorylation[C]|D|[E]phosphorylation[F]|G|[H]Unknown:I[J]”

      my goal would be to get “C” as well as “F”. My indicators for what I want are:

      1. It is always in between brackets.
      2. The substring prior to the desired brackets always begins with either “phos” or “Phos”.

    • Hi Daniel,

      I apologize for the delayed reply. I was on a long holiday, so unfortunately I wasn’t able to get back to you earlier. Do you still need help with your syntax?

      Regards,
      Joachim

