Run SQL Query in R (Example)

 

This tutorial provides three examples of executing a SQL query in R. The queries are identical so that you can see how the methods differ even when the output does not.

Table of contents:

As you can see in the table of contents, I’ll introduce three different methods in this article:

  • Method 1 is for those who already know SQL and are working in RMarkdown
  • Method 2 is for those who already know SQL but want to use an R function
  • Method 3 is for those who know how to use dplyr but not SQL

Let’s dive into the R code!

Kirby White Statistician Programmer

Note: This article was created in collaboration with Kirby White. Kirby is an organizational effectiveness consultant and researcher, who is currently pursuing a Ph.D. at the Seattle Pacific University. You can read more about Kirby here!

 

Install & Load Software Packages

Connecting to and working with SQL databases in R generally requires two packages, odbc and DBI.

The tidyverse suite of packages (dplyr, in particular) was designed to interface with and mimic SQL queries, which can be a very convenient way to work with a SQL database even if you are familiar with R but not SQL (yet)!

# install.packages("odbc")
# install.packages("DBI")
# install.packages("tidyverse")
 
library(odbc)
library(DBI)
library(tidyverse)

 

Example Data

This tutorial uses the sample SQL database created in this tutorial and stored in an object called con. Please follow those instructions to learn more, or execute this code:

#Load the sample data
data("population")
data("who")
 
#Create database
con <- dbConnect(drv = RSQLite::SQLite(),
                 dbname = ":memory:")
 
#store sample data in database
dbWriteTable(conn = con, 
             name = "population",
             value = population)
dbWriteTable(conn = con, 
             name = "who",
             value = who)
 
#remove the local data from the environment
rm(who, population)

 

Query

Our database has two tables: who and population. You can preview the data in the who table with this code:

tbl(src = con, #the source if the database connection profile
    "who") #the name of the table to preview
## # Source:   table [?? x 60]
## # Database: sqlite 3.36.0 [:memory:]
##    country     iso2  iso3   year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
##                                            
##  1 Afghanistan AF    AFG    1980          NA           NA           NA           NA
##  2 Afghanistan AF    AFG    1981          NA           NA           NA           NA
##  3 Afghanistan AF    AFG    1982          NA           NA           NA           NA
##  4 Afghanistan AF    AFG    1983          NA           NA           NA           NA
##  5 Afghanistan AF    AFG    1984          NA           NA           NA           NA
##  6 Afghanistan AF    AFG    1985          NA           NA           NA           NA
##  7 Afghanistan AF    AFG    1986          NA           NA           NA           NA
##  8 Afghanistan AF    AFG    1987          NA           NA           NA           NA
##  9 Afghanistan AF    AFG    1988          NA           NA           NA           NA
## 10 Afghanistan AF    AFG    1989          NA           NA           NA           NA
## # ... with more rows, and 52 more variables: new_sp_m4554 ,
## #   new_sp_m5564 , new_sp_m65 , new_sp_f014 ,
## #   new_sp_f1524 , new_sp_f2534 , new_sp_f3544 ,
## #   new_sp_f4554 , new_sp_f5564 , new_sp_f65 ,
## #   new_sn_m014 , new_sn_m1524 , new_sn_m2534 ,
## #   new_sn_m3544 , new_sn_m4554 , new_sn_m5564 ,
## #   new_sn_m65 , new_sn_f014 , new_sn_f1524 , ...

The two tables contain different information. population contains the number of people in each country for the years 1995-2013, and who contains the number of new tuberculosis cases across each country, year, age group, and diagnosis method. You can read more about the column meanings by typing ?who into the console.

What I want to do is merge and store data from both tables into a new R object. The new data should contain four columns:

  • country
  • year
  • new cases in 35-44 year old males
  • the country’s total population for that year

I also want to filter the data according to a few criteria:

  • Only cases in Brazil and Germany
  • Only cases between 2000 and 2010

Let’s get started!

 

Method 1: Write a SQL Statement & Execute it in RStudio

This method works best if you already know how to write SQL queries and are working in an RMarkdown file. If both those conditions aren’t true, feel free to skip this method.

Create a new RMarkdown chunk, but change its heading to sql, connection = con, output.var = “M1_results” instead of the default r

SELECT 
  who.country, who.year, who.new_sp_m3544, population.population
FROM 
  who
LEFT JOIN
  population ON population.country = who.country AND population.year = who.year
WHERE
  who.country IN ('Brazil', 'Germany') AND
  who.year >= 2000 AND
  who.year <= 2010

RStudio can natively read SQL script when it’s in a markdown chunk set to sql. output.var sets the name of the data frame to store the results in, which we’ve called M1_results. The text is the SQL script required to retrieve and filter the data from both tables in our database.

You can now work with this data like you would with any other data frame. Let’s take a look at the first five rows:

head(M1_results)
##   country year new_sp_m3544 population
## 1  Brazil 2000        11906  174504898
## 2  Brazil 2001         5184  176968205
## 3  Brazil 2002         6325  179393768
## 4  Brazil 2003         6034  181752951
## 5  Brazil 2004         6481  184010283
## 6  Brazil 2005         6128  186142403

 

Method 2: Write a SQL Statement in an R Function

You’re not always able to write a SQL statement in a dedicated chunk. For instance, you might be looping through some data and need to execute a SQL query based on an changing value. This means you’d need to programmatically change the SQL query, which is best done with the R script.

Though not required for this example, I’ll demonstrate how to build the SQL query in multiple pieces and then join them together as a single piece of text. You can modify this method to work with loops or other R objects. We’ll use the same query shown in the first method.

The dbGetQuery function will process and send the SQL statement to the database, and then returns data which we can store in a data frame called M2_results.

select <- "SELECT who.country, who.year, who.new_sp_m3544, population.population"
from <- "FROM who"
ljoin <- "LEFT JOIN population ON population.country = who.country AND population.year = who.year"
where = 2000 AND who.year <= 2010"
 
query <- paste(select, from, ljoin, where)
 
M2_results <- DBI::dbGetQuery(conn = con,
                              statement = query)
 
head(M2_results)
##   country year new_sp_m2534 population
## 1  Brazil 2000        11568  174504898
## 2  Brazil 2001         5536  176968205
## 3  Brazil 2002         5890  179393768
## 4  Brazil 2003         5709  181752951
## 5  Brazil 2004         6321  184010283
## 6  Brazil 2005         6119  186142403

 

Method 3: Use dplyr to Query the Dataset without any SQL

This method is the easiest for those who are already comfortable working with dplyr functions but not yet familiar with SQL scripting. The dplyr package was actually designed to replicate the most common SQL commands, so you may find it easy to learn the SQL-equivalent of your favorite dplyr functions!

You can read more about getting started in the helpful dplyr/database documentation, here, or more technical details in the dbdplyr package documentation, here.

M3_results <- 
  tbl(src = con, "who") %>% 
  filter(country %in% c("Brazil", "Germany"),
         year >= 2000,
         year <= 2010) %>% 
  dplyr::select(country, year, new_sp_m014) %>% 
  left_join(y = tbl(src = con, "population"),
            by = c("country", "year")) %>% 
  collect() #this tells dplyr to execute and store the query

At this point, you should have three nearly identical datasets: M1_results, M2_results, and M3_results.

 

Video Tutorial & Further Resources

Have a look at the following YouTube video to get more information on SQL file queries in the R programming language. In the video, the R code of this tutorial is explained in in some more detail.

 

 

Additionally, you might check out the related tutorials on Statistics Globe:

If you have further questions or comments on SQL queries and the access and management of SQL servers in R, please let us know in the comments section below. We are looking forward to reading your feedback!

 

Subscribe to the Statistics Globe Newsletter

Get regular updates on the latest tutorials, offers & news at Statistics Globe.
I hate spam & you may opt out anytime: Privacy Policy.


4 Comments. Leave new

  • jokoding.com
    April 8, 2023 4:37 am

    I have an error when running this code, It solved when I have installed RSQLite package first. Thaks for the code.

    Reply
  • Vanessa Villanova Kuhnen
    September 12, 2023 8:53 pm

    Dear White,
    My name is Vanessa, I am Brazilian, and I would like to thank you for the helpful tutorial.
    I successfully built the dataset in the memory, as your example.
    However, I tried to run the query using Method 1 you suggested and it did not work.

    “Error in .rs.runSqlForDataCapture(“SELECT \n who.country, who.year, who.new_sp_m3544, population.population\nFROM \n who\nLEFT JOIN\n population ON population.country = who.country AND population.year = who.year\nWHERE\n who.country IN (‘Brazil’, ‘Germany’) AND\n who.year >= 2000 AND\n who.year <= 2010", :
    The 'connection' option (DBI connection) is required for sql chunks.
    Failed to execute SQL chunk"

    Would you have any idea about what is going on?
    Thank you very much for your help!!!!

    Reply
    • Hello Vanessa,

      Thank you for your encouraging feedback. It looks like the error is saying that the required database connection is missing when you are trying to run the SQL query.

      Please make sure that you specify the correct path to your SQLite database. If you’re using another type of database, the method to connect will be different, and you’ll likely need another package (e.g., RMySQL for MySQL, RPostgres for PostgreSQL, etc.).

      If you are using RMarkdown in RStudio and want to run an SQL chunk, you need to specify the connection in the chunk options. The chunk might look like:
      “`{sql, connection=con}
      YOUR SQL QUERY HERE
      “`
      I hope one of these solutions helps. If you encounter any further issues, let me know.

      Best,
      Cansu

      Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.

Top