Extract Text from PDF in R (Example)

 

Hi! This short tutorial will show you how to scrape text from a PDF file in the R programming language.

Let’s jump into the R code!

 

Install & Load pdftools R Library

First, we need to install and load the pdftools package.

In your preferred R code editor, run the lines of code below:

# install pdftools
install.packages("pdftools")
 
# load pdftools
library(pdftools)

Great! With pdftools now installed and loaded into our R programming environment, we can use it to scrape the text of a PDF document.
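If you run the script repeatedly, you may not want to reinstall the package every time. The lines below are a minimal sketch of one common pattern, which installs pdftools only when it is not yet available; it assumes a CRAN mirror is configured in your R session.

```r
# Install pdftools only if it is not already available on this machine
if (!requireNamespace("pdftools", quietly = TRUE)) {
  install.packages("pdftools")
}

# Load the package for the current session
library(pdftools)
```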
 

Extract Text from PDF Document

First, make sure that the working directory is set to the folder where your PDF file is located. Then, you can get the text of a sample PDF file using the R code below. In this tutorial, we will extract the text from a PDF file called eBird.pdf.

pdf_file <- "eBird.pdf"
 
# extract text from pdf file
text <- pdf_text(pdf = pdf_file)
 
cat(text)
 
# eBird Basic Dataset Metadata (v1.13)
# revised 30 Apr 2021
 
# 1.13 updates: Minor updates, including updated links and some clarifications. Two
# column headers are changed (BREEDING BIRD ATLAS CODE BREEDING CODE and
# BREEDING BIRD ATLAS CATEGORY BREEDING CATEGORY) and one new column is
# added (BEHAVIOR CODE).
# . . .

Here, we read the content of a PDF file named eBird.pdf using the pdf_text() function and stored it in text. The function returns a character vector with one element per page.

After extracting the text from the PDF, we then print the entire content of the file to the console using the cat() function.
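Before printing everything, it can help to check what pdf_text() returned. The following is a small sketch, assuming the same eBird.pdf file sits in the working directory; the file.exists() guard and the 200-character preview are illustrative choices, not part of the original example.

```r
library(pdftools)

pdf_file <- "eBird.pdf"  # assumed to be in the current working directory

if (file.exists(pdf_file)) {
  text <- pdf_text(pdf_file)

  # pdf_text() returns a character vector
  print(class(text))

  # One element per page, so this is the number of pages extracted
  print(length(text))

  # Page count as reported by the PDF itself, for comparison
  print(pdf_info(pdf_file)$pages)

  # Preview the first 200 characters of page 1 instead of the whole file
  cat(substr(text[[1]], 1, 200), "\n")
} else {
  message("File not found: ", pdf_file, " (check getwd())")
}
```

Previewing a short slice is handy for long documents, where cat(text) would flood the console.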
 

Print Text from Selected Pages

Now that we have scraped all the text in the PDF file and stored it in text, we can print the text of a single selected page like this:

# print text of page 10 in pdf file
cat(text[[10]])
 
# Stationary (P21) – Observations made over a known period of time but without any
# distance/area components are classified as a Stationary Count. This does not mean you
# must stand completely still as you record the birds, but you should remain in an area
# approximately 30 meters (30 yards) in diameter while you are recording birds. If you
# move much farther than that, you should consider entering your observations as a
# Traveling Count or an Exhaustive Area Count. Examples of Stationary Counts are: a hawk
# watch, lake watch, or sea watch, or even sitting in your backyard for a period of time
# identifying birds. Required Date/Effort fields: Date, Start Time, and Duration.
# . . .

To get the text on page 10 of the PDF file, we used the double square brackets to access the 10th element (page) of text and then printed the content of that page to the R console using cat().
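Each page is stored as one long string in which the lines are separated by newline characters. If you would rather work with individual lines, a sketch like the one below may help. The page object is a made-up stand-in so that the example runs without a PDF; with a real file you would use text[[10]] instead.

```r
# Stand-in for one page of extracted text (hypothetical content);
# with a real PDF, use: page <- text[[10]]
page <- "Stationary (P21)\n  Observations made over a known period of time\n\nRequired fields: Date\n"

# Split the page into lines at each newline character
page_lines <- strsplit(page, "\n", fixed = TRUE)[[1]]

# Remove leading/trailing whitespace and drop empty lines
page_lines <- trimws(page_lines)
page_lines <- page_lines[nzchar(page_lines)]

# Show the first few lines of the page
print(head(page_lines, 5))
```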

We can also extract the text of a range of pages using a for loop:

# print text on pages 9 to 16
for (page in 9:16) {
  cat(text[[page]])
}
 
# C--Courtship, Display or Copulation – Courtship or copulation observed, including
# displays and courtship feeding. Typically considered Probable.
 
# T--Territory held for 7+ days – Territorial behavior or singing male present at the same
# location 7+ days apart. Typically considered Probable.
 
# P--Pair in suitable habitat – Pair observed in suitable breeding habitat within breeding
# season. Typically considered Probable.
# . . .

Here, we programmatically accessed and printed to the console the text on pages 9 to 16 of the PDF document.
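Since text is an ordinary character vector, the same result can also be obtained without a loop by indexing with a range. The sketch below additionally drops page numbers that do not exist, because indexing past the end of the vector would otherwise return NA values. The text_demo vector is a hypothetical stand-in so that the code runs without a PDF; with a real file you would use text instead.

```r
# Stand-in for the vector returned by pdf_text(): one string per page
text_demo <- paste("Text of page", 1:12)

# Pages we would like to print
wanted <- 9:16

# Keep only page numbers that actually exist in the document
valid <- wanted[wanted >= 1 & wanted <= length(text_demo)]

# cat() accepts a whole vector, so no explicit loop is needed
cat(text_demo[valid], sep = "\n")
```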

 

Video, Further Resources & Summary

This post has shown how to extract text from a PDF file in R. The pdftools package can do more than that, such as converting PDF pages to images, compressing a PDF file, and extracting metadata from a PDF file.
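As a rough illustration of two of these tasks, the sketch below reads the metadata with pdf_info() and renders the first page as a PNG with pdf_convert(). It assumes eBird.pdf is in the working directory; the output file name eBird_page1.png and the resolution of 150 dpi are arbitrary choices made for this example.

```r
library(pdftools)

pdf_file <- "eBird.pdf"  # assumed to be in the current working directory

if (file.exists(pdf_file)) {
  # Read the document metadata (a list with entries such as pages and created)
  info <- pdf_info(pdf_file)
  print(info$pages)
  print(info$created)

  # Render page 1 as a PNG image; the file name below is a hypothetical choice
  pdf_convert(pdf_file, format = "png", pages = 1,
              filenames = "eBird_page1.png", dpi = 150)
} else {
  message("File not found: ", pdf_file)
}
```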

I hope you enjoyed reading this tutorial! In case you have further questions, you may leave a comment below.

 

R & Python Expert Ifeanyi Idiaye

This page was created in collaboration with Ifeanyi Idiaye. You might check out Ifeanyi’s personal author page to read more about his academic background and the other articles he has written for the Statistics Globe website.

 
