Create & Connect to SQL Database in R (Example)
This tutorial is designed to introduce the subject of working with a SQL database in R.
Table of contents:
Note: This article was created in collaboration with Kirby White. Kirby is an organizational effectiveness consultant and researcher, who is currently pursuing a Ph.D. at the Seattle Pacific University. You can read more about Kirby here!
Overview
Relational databases (such as SQL server) are an excellent and efficient way to store large sets of data.
Structured Query Language (SQL) is a programming language for interacting with these databases to add or edit tables, add or edit the rows of data within those tables, and merge or retrieve the data across multiple tables. These commands are generally called “queries”.
Many organizations use relational databases to store the data they collect about their customers, sales, employees, vendors, inventory, and much more.
Many data analysts are used to analyzing data imported from an Excel or CSV file. Often, these files may be created by an IT professional who used SQL to retrieve the data and store it in a CSV file.
This works well in many situations, but has some clear limitations:
- the file may be very large and difficult to transfer
- some files are limited in the number of records they can store
- the file is immediately obsolete, as new information is not automatically added to it
- if an error in the file is discovered, it can be time-consuming to request and receive an updated file
- it can be difficult for multiple analysts to collaborate on the same file
- international data privacy laws may prohibit storing and sending data with certain file types
- most data files cannot encrypt or protect the data within them, possibly increasing the likelihood of a data breach
Directly connecting to a SQL database can alleviate many of these problems!
Thankfully, RStudio contains several features for connecting to a database, and many R Packages provide helpful functions for interacting with them.
You may find the RStudio documentation helpful for your situation.
Software Packages
The two main packages for connecting to and working with databases in R are ‘odbc’ and ‘DBI’. The odbc package provides a set of drivers for connecting to several kinds of databases, including SQL servers, Amazon Web Services Redshift, and Google Cloud databases.
# install.packages("odbc") # install.packages("DBI") library(odbc) #Contains drivers to connect to a database library(DBI) #Contains functions for interacting with the database
Connecting to a Database
The dbConnect function makes it easy to create a connection to the database. If you wish to connect to an existing database, you will need the settings from your configuration files or the database administrator at your company.
The function returns a database connection profile, which we can store in an R object call “con”. Once this is created, you can use “con” to access the database with a variety of other functions.
This code provides a template for connecting to an existing database:
con <- DBI::dbConnect(drv = odbc::odbc(), Driver = "driver_name", Server = "server_url", Database = "database_name", user = "user", #optional password = "password") #optional
Don’t worry if you don’t already have a database to connect to, we’ll create one in the next step.
Creating a Sample Database
Because it can be time-consuming and complex to create a live SQL database, this tutorial does not require you to have one. Instead, we’ll use some other packages to create a sample database we can interact with immediately.
For this task, we’ll have to install and load the following packages:
# install.packages("RSQL") # install.packages("RSQLite") # install.packages("tidyverse") library(RSQL) #Generate and Process 'SQL' Queries in R library(RSQLite) #Can create an in-memory SQL database library(tidyverse) #Provides helpful functions and sample data to use in R
We’re going to use two datasets in this tutorial: population and who. These are pre-built tables accompanying the tidyverse package.
Both datasets were created by the World Health Organization (WHO) as a part of their work tracking tuberculosis cases between 1980 and 2013.
The who table tracks new cases by country, year, age group, and the method of diagnosis. population is a simple table of the population in each country from 1995-2013.
You can load and preview the data with this code:
data("population") population ## # A tibble: 4,060 x 3 ## country year population ## ## 1 Afghanistan 1995 17586073 ## 2 Afghanistan 1996 18415307 ## 3 Afghanistan 1997 19021226 ## 4 Afghanistan 1998 19496836 ## 5 Afghanistan 1999 19987071 ## 6 Afghanistan 2000 20595360 ## 7 Afghanistan 2001 21347782 ## 8 Afghanistan 2002 22202806 ## 9 Afghanistan 2003 23116142 ## 10 Afghanistan 2004 24018682 ## # ... with 4,050 more rows
data("who") who ## # A tibble: 7,240 x 60 ## country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544 ## ## 1 Afghanistan AF AFG 1980 NA NA NA NA ## 2 Afghanistan AF AFG 1981 NA NA NA NA ## 3 Afghanistan AF AFG 1982 NA NA NA NA ## 4 Afghanistan AF AFG 1983 NA NA NA NA ## 5 Afghanistan AF AFG 1984 NA NA NA NA ## 6 Afghanistan AF AFG 1985 NA NA NA NA ## 7 Afghanistan AF AFG 1986 NA NA NA NA ## 8 Afghanistan AF AFG 1987 NA NA NA NA ## 9 Afghanistan AF AFG 1988 NA NA NA NA ## 10 Afghanistan AF AFG 1989 NA NA NA NA ## # ... with 7,230 more rows, and 52 more variables: new_sp_m4554 , ## # new_sp_m5564 , new_sp_m65 , new_sp_f014 , ## # new_sp_f1524 , new_sp_f2534 , new_sp_f3544 , ## # new_sp_f4554 , new_sp_f5564 , new_sp_f65 , ## # new_sn_m014 , new_sn_m1524 , new_sn_m2534 , ## # new_sn_m3544 , new_sn_m4554 , new_sn_m5564 , ## # new_sn_m65 , new_sn_f014 , new_sn_f1524 , ...
This next chunk of code creates the placeholder for a database and stores it in the ‘con’ object. The dbListTables function returns a list of tables stored within the database, which is currently empty.
#Build the placeholder con <- dbConnect(drv = RSQLite::SQLite(), dbname = ":memory:") dbListTables(con) ## character(0)
Now we can use the dbWriteTable function to store the who and population tables in the SQL database. Again, we’ll use the dbListTables function to verify that they were successfully loaded.
#Load the population table dbWriteTable(conn = con, name = "population", value = population) #Load the who table dbWriteTable(conn = con, name = "who", value = who) dbListTables(con) ## [1] "population" "who"
Voila! You are now connected to a SQL database containing two tables.
Querying Your Database
There are many functions useful for creating and processing SQL queries from R code, which will be covered in more detail in the SQL Queries in R tutorial.
Here is one example to demonstrate how SQL queries can bring data together from multiple tables in a database:
res <- DBI::dbGetQuery(conn = con, statement = " SELECT who.country, who.year, who.new_sp_m014, population.population FROM population, who WHERE who.country = 'Afghanistan' AND who.year > 1995 AND who.country = population.country AND who.year = population.year ") res ## country year new_sp_m014 population ## 1 Afghanistan 1996 NA 18415307 ## 2 Afghanistan 1997 0 19021226 ## 3 Afghanistan 1998 30 19496836 ## 4 Afghanistan 1999 8 19987071 ## 5 Afghanistan 2000 52 20595360 ## 6 Afghanistan 2001 129 21347782 ## 7 Afghanistan 2002 90 22202806 ## 8 Afghanistan 2003 127 23116142 ## 9 Afghanistan 2004 139 24018682 ## 10 Afghanistan 2005 151 24860855 ## 11 Afghanistan 2006 193 25631282 ## 12 Afghanistan 2007 186 26349243 ## 13 Afghanistan 2008 187 27032197 ## 14 Afghanistan 2009 200 27708187 ## 15 Afghanistan 2010 197 28397812 ## 16 Afghanistan 2011 204 29105480 ## 17 Afghanistan 2012 188 29824536 ## 18 Afghanistan 2013 NA 30551674
Video Tutorial & Further Resources
Do you need more explanations on how to work with a SQL database interface in R? Then you may have a look at the following YouTube video. In the video, Kirby White explains the R code of this tutorial in some more detail.
Please accept YouTube cookies to play this video. By accepting you will be accessing content from YouTube, a service provided by an external third party.
If you accept this notice, your choice will be saved and the page will refresh.
In addition, you might have a look at the related tutorials on Statistics Globe:
- Querying a SQL Database in R
- Access & Collect Data with APIs in R
- How to Use R to Download a File from the Internet
- Save & Load RData Workspace Files in R
- All R Programming Tutorials on Statistics Globe
If you have any further questions or comments on the management of SQL databases in R, please let us know in the comments section below. We are looking forward to reading your feedback!