Create & Connect to SQL Database in R (Example)

 

This tutorial is designed to introduce the subject of working with a SQL database in R.

Table of contents:

 

Kirby White Statistician Programmer

Note: This article was created in collaboration with Kirby White. Kirby is an organizational effectiveness consultant and researcher, who is currently pursuing a Ph.D. at the Seattle Pacific University. You can read more about Kirby here!

 

Overview

Relational databases (such as SQL server) are an excellent and efficient way to store large sets of data.

Structured Query Language (SQL) is a programming language for interacting with these databases to add or edit tables, add or edit the rows of data within those tables, and merge or retrieve the data across multiple tables. These commands are generally called “queries”.

Many organizations use relational databases to store the data they collect about their customers, sales, employees, vendors, inventory, and much more.

Many data analysts are used to analyzing data imported from an Excel or CSV file. Often, these files may be created by an IT professional who used SQL to retrieve the data and store it in a CSV file.

This works well in many situations, but has some clear limitations:

  • the file may be very large and difficult to transfer
  • some files are limited in the number of records they can store
  • the file is immediately obsolete, as new information is not automatically added to it
  • if an error in the file is discovered, it can be time-consuming to request and receive an updated file
  • it can be difficult for multiple analysts to collaborate on the same file
  • international data privacy laws may prohibit storing and sending data with certain file types
  • most data files cannot encrypt or protect the data within them, possibly increasing the likelihood of a data breach

Directly connecting to a SQL database can alleviate many of these problems!

Thankfully, RStudio contains several features for connecting to a database, and many R Packages provide helpful functions for interacting with them.

You may find the RStudio documentation helpful for your situation.

 

Software Packages

The two main packages for connecting to and working with databases in R are ‘odbc’ and ‘DBI’. The odbc package provides a set of drivers for connecting to several kinds of databases, including SQL servers, Amazon Web Services Redshift, and Google Cloud databases.

# install.packages("odbc")
# install.packages("DBI")
 
library(odbc) #Contains drivers to connect to a database
library(DBI) #Contains functions for interacting with the database

 

Connecting to a Database

The dbConnect function makes it easy to create a connection to the database. If you wish to connect to an existing database, you will need the settings from your configuration files or the database administrator at your company.

The function returns a database connection profile, which we can store in an R object call “con”. Once this is created, you can use “con” to access the database with a variety of other functions.

This code provides a template for connecting to an existing database:

con <- DBI::dbConnect(drv = odbc::odbc(),
                      Driver = "driver_name",
                      Server = "server_url",
                      Database = "database_name",
                      user = "user", #optional
                      password = "password") #optional

Don’t worry if you don’t already have a database to connect to, we’ll create one in the next step.

 

Creating a Sample Database

Because it can be time-consuming and complex to create a live SQL database, this tutorial does not require you to have one. Instead, we’ll use some other packages to create a sample database we can interact with immediately.

For this task, we’ll have to install and load the following packages:

# install.packages("RSQL")
# install.packages("RSQLite")
# install.packages("tidyverse")
 
library(RSQL) #Generate and Process 'SQL' Queries in R
library(RSQLite) #Can create an in-memory SQL database
library(tidyverse) #Provides helpful functions and sample data to use in R

We’re going to use two datasets in this tutorial: population and who. These are pre-built tables accompanying the tidyverse package.

Both datasets were created by the World Health Organization (WHO) as a part of their work tracking tuberculosis cases between 1980 and 2013.

The who table tracks new cases by country, year, age group, and the method of diagnosis. population is a simple table of the population in each country from 1995-2013.

You can load and preview the data with this code:

data("population")
population
## # A tibble: 4,060 x 3
##    country      year population
##                 
##  1 Afghanistan  1995   17586073
##  2 Afghanistan  1996   18415307
##  3 Afghanistan  1997   19021226
##  4 Afghanistan  1998   19496836
##  5 Afghanistan  1999   19987071
##  6 Afghanistan  2000   20595360
##  7 Afghanistan  2001   21347782
##  8 Afghanistan  2002   22202806
##  9 Afghanistan  2003   23116142
## 10 Afghanistan  2004   24018682
## # ... with 4,050 more rows
data("who")
who
## # A tibble: 7,240 x 60
##    country     iso2  iso3   year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
##                                            
##  1 Afghanistan AF    AFG    1980          NA           NA           NA           NA
##  2 Afghanistan AF    AFG    1981          NA           NA           NA           NA
##  3 Afghanistan AF    AFG    1982          NA           NA           NA           NA
##  4 Afghanistan AF    AFG    1983          NA           NA           NA           NA
##  5 Afghanistan AF    AFG    1984          NA           NA           NA           NA
##  6 Afghanistan AF    AFG    1985          NA           NA           NA           NA
##  7 Afghanistan AF    AFG    1986          NA           NA           NA           NA
##  8 Afghanistan AF    AFG    1987          NA           NA           NA           NA
##  9 Afghanistan AF    AFG    1988          NA           NA           NA           NA
## 10 Afghanistan AF    AFG    1989          NA           NA           NA           NA
## # ... with 7,230 more rows, and 52 more variables: new_sp_m4554 ,
## #   new_sp_m5564 , new_sp_m65 , new_sp_f014 ,
## #   new_sp_f1524 , new_sp_f2534 , new_sp_f3544 ,
## #   new_sp_f4554 , new_sp_f5564 , new_sp_f65 ,
## #   new_sn_m014 , new_sn_m1524 , new_sn_m2534 ,
## #   new_sn_m3544 , new_sn_m4554 , new_sn_m5564 ,
## #   new_sn_m65 , new_sn_f014 , new_sn_f1524 , ...

This next chunk of code creates the placeholder for a database and stores it in the ‘con’ object. The dbListTables function returns a list of tables stored within the database, which is currently empty.

#Build the placeholder
con <- dbConnect(drv = RSQLite::SQLite(),
                 dbname = ":memory:")
 
dbListTables(con)
## character(0)

Now we can use the dbWriteTable function to store the who and population tables in the SQL database. Again, we’ll use the dbListTables function to verify that they were successfully loaded.

#Load the population table
dbWriteTable(conn = con, 
             name = "population",
             value = population)
 
#Load the who table
dbWriteTable(conn = con, 
             name = "who",
             value = who)
 
dbListTables(con)
## [1] "population" "who"

Voila! You are now connected to a SQL database containing two tables.

 

Querying Your Database

There are many functions useful for creating and processing SQL queries from R code, which will be covered in more detail in the SQL Queries in R tutorial.

Here is one example to demonstrate how SQL queries can bring data together from multiple tables in a database:

res <- DBI::dbGetQuery(conn = con,
                       statement = "
                SELECT who.country, who.year, who.new_sp_m014, population.population 
                FROM population, who 
                WHERE who.country = 'Afghanistan' AND 
                who.year > 1995 AND 
                who.country = population.country AND 
                who.year = population.year
                ")
res
##        country year new_sp_m014 population
## 1  Afghanistan 1996          NA   18415307
## 2  Afghanistan 1997           0   19021226
## 3  Afghanistan 1998          30   19496836
## 4  Afghanistan 1999           8   19987071
## 5  Afghanistan 2000          52   20595360
## 6  Afghanistan 2001         129   21347782
## 7  Afghanistan 2002          90   22202806
## 8  Afghanistan 2003         127   23116142
## 9  Afghanistan 2004         139   24018682
## 10 Afghanistan 2005         151   24860855
## 11 Afghanistan 2006         193   25631282
## 12 Afghanistan 2007         186   26349243
## 13 Afghanistan 2008         187   27032197
## 14 Afghanistan 2009         200   27708187
## 15 Afghanistan 2010         197   28397812
## 16 Afghanistan 2011         204   29105480
## 17 Afghanistan 2012         188   29824536
## 18 Afghanistan 2013          NA   30551674

 

Video Tutorial & Further Resources

Do you need more explanations on how to work with a SQL database interface in R? Then you may have a look at the following YouTube video. In the video, Kirby White explains the R code of this tutorial in some more detail.

 

 

In addition, you might have a look at the related tutorials on Statistics Globe:

If you have any further questions or comments on the management of SQL databases in R, please let us know in the comments section below. We are looking forward to reading your feedback!

 

Subscribe to the Statistics Globe Newsletter

Get regular updates on the latest tutorials, offers & news at Statistics Globe.
I hate spam & you may opt out anytime: Privacy Policy.


Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.

Top