Run SQL Query in R (Example)
This tutorial provides three examples of executing a SQL query in R. The queries are identical so that you can see how the methods differ even when the output does not.
Table of contents:
As you can see in the table of contents, I’ll introduce three different methods in this article:
- Method 1 is for those who already know SQL and are working in RMarkdown
- Method 2 is for those who already know SQL but want to use an R function
- Method 3 is for those who know how to use dplyr but not SQL
Let’s dive into the R code!
Note: This article was created in collaboration with Kirby White. Kirby is an organizational effectiveness consultant and researcher, who is currently pursuing a Ph.D. at the Seattle Pacific University. You can read more about Kirby here!
Install & Load Software Packages
The tidyverse suite of packages (dplyr, in particular) was designed to interface with and mimic SQL queries, which can be a very convenient way to work with a SQL database even if you are familiar with R but not SQL (yet)!
# install.packages("odbc") # install.packages("DBI") # install.packages("tidyverse") library(odbc) library(DBI) library(tidyverse)
This tutorial uses the sample SQL database created in this tutorial and stored in an object called con. Please follow those instructions to learn more, or execute this code:
#Load the sample data data("population") data("who") #Create database con <- dbConnect(drv = RSQLite::SQLite(), dbname = ":memory:") #store sample data in database dbWriteTable(conn = con, name = "population", value = population) dbWriteTable(conn = con, name = "who", value = who) #remove the local data from the environment rm(who, population)
Our database has two tables: who and population. You can preview the data in the who table with this code:
tbl(src = con, #the source if the database connection profile "who") #the name of the table to preview ## # Source: table [?? x 60] ## # Database: sqlite 3.36.0 [:memory:] ## country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544 ## ## 1 Afghanistan AF AFG 1980 NA NA NA NA ## 2 Afghanistan AF AFG 1981 NA NA NA NA ## 3 Afghanistan AF AFG 1982 NA NA NA NA ## 4 Afghanistan AF AFG 1983 NA NA NA NA ## 5 Afghanistan AF AFG 1984 NA NA NA NA ## 6 Afghanistan AF AFG 1985 NA NA NA NA ## 7 Afghanistan AF AFG 1986 NA NA NA NA ## 8 Afghanistan AF AFG 1987 NA NA NA NA ## 9 Afghanistan AF AFG 1988 NA NA NA NA ## 10 Afghanistan AF AFG 1989 NA NA NA NA ## # ... with more rows, and 52 more variables: new_sp_m4554 , ## # new_sp_m5564 , new_sp_m65 , new_sp_f014 , ## # new_sp_f1524 , new_sp_f2534 , new_sp_f3544 , ## # new_sp_f4554 , new_sp_f5564 , new_sp_f65 , ## # new_sn_m014 , new_sn_m1524 , new_sn_m2534 , ## # new_sn_m3544 , new_sn_m4554 , new_sn_m5564 , ## # new_sn_m65 , new_sn_f014 , new_sn_f1524 , ...
The two tables contain different information. population contains the number of people in each country for the years 1995-2013, and who contains the number of new tuberculosis cases across each country, year, age group, and diagnosis method. You can read more about the column meanings by typing ?who into the console.
What I want to do is merge and store data from both tables into a new R object. The new data should contain four columns:
- new cases in 35-44 year old males
- the country’s total population for that year
I also want to filter the data according to a few criteria:
- Only cases in Brazil and Germany
- Only cases between 2000 and 2010
Let’s get started!
Method 1: Write a SQL Statement & Execute it in RStudio
This method works best if you already know how to write SQL queries and are working in an RMarkdown file. If both those conditions aren’t true, feel free to skip this method.
Create a new RMarkdown chunk, but change its heading to sql, connection = con, output.var = “M1_results” instead of the default r
SELECT who.country, who.year, who.new_sp_m3544, population.population FROM who LEFT JOIN population ON population.country = who.country AND population.year = who.year WHERE who.country IN ('Brazil', 'Germany') AND who.year >= 2000 AND who.year <= 2010
RStudio can natively read SQL script when it’s in a markdown chunk set to sql. output.var sets the name of the data frame to store the results in, which we’ve called M1_results. The text is the SQL script required to retrieve and filter the data from both tables in our database.
You can now work with this data like you would with any other data frame. Let’s take a look at the first five rows:
head(M1_results) ## country year new_sp_m3544 population ## 1 Brazil 2000 11906 174504898 ## 2 Brazil 2001 5184 176968205 ## 3 Brazil 2002 6325 179393768 ## 4 Brazil 2003 6034 181752951 ## 5 Brazil 2004 6481 184010283 ## 6 Brazil 2005 6128 186142403
Method 2: Write a SQL Statement in an R Function
You’re not always able to write a SQL statement in a dedicated chunk. For instance, you might be looping through some data and need to execute a SQL query based on an changing value. This means you’d need to programmatically change the SQL query, which is best done with the R script.
Though not required for this example, I’ll demonstrate how to build the SQL query in multiple pieces and then join them together as a single piece of text. You can modify this method to work with loops or other R objects. We’ll use the same query shown in the first method.
The dbGetQuery function will process and send the SQL statement to the database, and then returns data which we can store in a data frame called M2_results.
select <- "SELECT who.country, who.year, who.new_sp_m3544, population.population" from <- "FROM who" ljoin <- "LEFT JOIN population ON population.country = who.country AND population.year = who.year" where = 2000 AND who.year <= 2010" query <- paste(select, from, ljoin, where) M2_results <- DBI::dbGetQuery(conn = con, statement = query) head(M2_results) ## country year new_sp_m2534 population ## 1 Brazil 2000 11568 174504898 ## 2 Brazil 2001 5536 176968205 ## 3 Brazil 2002 5890 179393768 ## 4 Brazil 2003 5709 181752951 ## 5 Brazil 2004 6321 184010283 ## 6 Brazil 2005 6119 186142403
Method 3: Use dplyr to Query the Dataset without any SQL
This method is the easiest for those who are already comfortable working with dplyr functions but not yet familiar with SQL scripting. The dplyr package was actually designed to replicate the most common SQL commands, so you may find it easy to learn the SQL-equivalent of your favorite dplyr functions!
M3_results <- tbl(src = con, "who") %>% filter(country %in% c("Brazil", "Germany"), year >= 2000, year <= 2010) %>% dplyr::select(country, year, new_sp_m014) %>% left_join(y = tbl(src = con, "population"), by = c("country", "year")) %>% collect() #this tells dplyr to execute and store the query
At this point, you should have three nearly identical datasets: M1_results, M2_results, and M3_results.
Video Tutorial & Further Resources
Have a look at the following YouTube video to get more information on SQL file queries in the R programming language. In the video, the R code of this tutorial is explained in in some more detail.
Please accept YouTube cookies to play this video. By accepting you will be accessing content from YouTube, a service provided by an external third party.
If you accept this notice, your choice will be saved and the page will refresh.
Additionally, you might check out the related tutorials on Statistics Globe:
- Create & Connect to SQL Database in R
- Access & Collect Data with APIs in R
- Save & Load RData Workspace Files in R
- How to Use R to Download a File from the Internet
- Introduction to the dplyr Package in R
- All R Programming Tutorials on Statistics Globe
If you have further questions or comments on SQL queries and the access and management of SQL servers in R, please let us know in the comments section below. We are looking forward to reading your feedback!