Access & Collect Data with APIs in R (Example)

In this tutorial, I’ll demonstrate how to use an API in the R programming language.

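Here are the topics we’ll cover:

1) Packages
2) What is an API?
3) Components of a URL
4) Calling an API
5) Converting JSON Results to a Data Frame
6) Looping Multiple API Calls
7) Video Tutorial & Further Resources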

Note: This article was written in collaboration with Kirby White. Kirby is an organizational effectiveness consultant and researcher who is currently pursuing a Ph.D. at Seattle Pacific University.

 

Packages

For this tutorial, you’ll need these two packages:

install.packages(c("httr","jsonlite"))
library(httr)
library(jsonlite)

What is an API?

An API (Application Programming Interface) is an intermediary between a dataset (usually a very large one) and the rest of the world (like us!). APIs provide an accessible way to request a dataset, which is referred to as making a “call” to the API. A call is sent to the API by opening a web address.

In this tutorial, we’re going to request data from the API at COVID Act Now.

 

Components of a URL

This particular call would request time series COVID data for a single county in the United States (identified by its FIPS code 06037):

https://api.covidactnow.org/v2/county/06037.timeseries.json?apiKey=xyxyxy

There are several pieces to this API call. The first is the base URL:

https://api.covidactnow.org/v2/

This part of the URL will be the same for all our calls to this API.

The county/ portion of the URL indicates that we only want COVID data for a single county. By looking at the COVID Act Now API documentation, I can see that states is an alternative option for this part of the URL.

06037 is the unique identifier for a single county. If I want to get the same data but for a different county, I just have to change this number.

.timeseries provides the API with more information about the data I’m requesting, and .json tells the API to format the data as JSON (which we’ll convert to a data frame).

Everything after ‘apiKey=’ is my authorization token, which tells the COVID Act Now servers that I’m allowed to ask for this data. ‘xyxyxy’ is not a real token; you can register for your own at https://apidocs.covidactnow.org/#register.
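To make this anatomy concrete, the sketch below takes the example URL apart again with base R string functions. The variable names are purely illustrative, and the splitting rules are assumptions that only hold for URLs shaped like the one above.

```r
# Example URL from above; 'xyxyxy' is a placeholder, not a real token
example_url <- "https://api.covidactnow.org/v2/county/06037.timeseries.json?apiKey=xyxyxy"

# Everything before the "?" is the path, everything after it is the query string
url_parts <- strsplit(example_url, "?", fixed = TRUE)[[1]]
path  <- url_parts[1]
query <- url_parts[2]

# Split the path at every slash; the last piece bundles county code, data type, and format
path_segments <- strsplit(path, "/", fixed = TRUE)[[1]]
n <- length(path_segments)
segment_parts <- strsplit(path_segments[n], ".", fixed = TRUE)[[1]]

components <- list(
  base_url = paste0(paste(path_segments[1:(n - 2)], collapse = "/"), "/"),
  level    = path_segments[n - 1],
  fips     = segment_parts[1],
  data     = segment_parts[2],
  format   = segment_parts[3],
  api_key  = sub("^apiKey=", "", query)
)

str(components)
```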

Now that we’ve dissected the anatomy of an API call, you can see how easy it is to build one! Basically, anybody who has an internet connection and an authorization token and who knows the grammar of the API can access it. Most APIs are published with extensive documentation to help you understand the available options and parameters.
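One detail of this grammar is easy to trip over: FIPS codes are text, not numbers, so a code stored as the number 6037 loses its leading zero. The helper below is a minimal sketch of a URL builder that restores the padding. build_county_url() is a hypothetical name chosen for this example; it is not part of httr or of the API itself.

```r
build_county_url <- function(fips, api_key,
                             base = "https://api.covidactnow.org/v2/county/") {
  # Pad numeric input back to five digits; character input is left untouched
  if (is.numeric(fips)) {
    fips <- sprintf("%05d", as.integer(fips))
  }
  # Assumption for this sketch: a county FIPS code is exactly five digits
  if (!all(grepl("^[0-9]{5}$", fips))) {
    stop("County FIPS codes must have exactly five digits.")
  }
  paste0(base, fips, ".timeseries.json?apiKey=", api_key)
}

# A numeric code gets its leading zero back
build_county_url(6037, "xyxyxy")

# The function is vectorized, so several counties can be handled at once
build_county_url(c("01001", "01003"), "xyxyxy")
```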

 

Calling an API

It’s easiest to build an API URL by joining multiple text strings together. For this example, I want to get a time series of COVID data for a few counties. Let’s build the URL for one county, and later on we’ll see how to loop through multiple counties.

base <- 'https://api.covidactnow.org/v2/county/'
county <- '06037'
info_key <- '.timeseries.json?apiKey=xyxyxy'
 
API_URL <- paste0(base, county, info_key)

Now we have the entire URL stored in a simple R object called API_URL.
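Hard-coding the token in a script makes it easy to share by accident. A common alternative, sketched below, is to read it from an environment variable. The variable name COVID_ACT_NOW_KEY and the helper get_api_key() are assumptions made for this example; you would set the variable yourself, for instance in your .Renviron file.

```r
# Read the token from an environment variable (hypothetical name, chosen for this example)
get_api_key <- function(var = "COVID_ACT_NOW_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set.")
  }
  key
}

# For demonstration only: set the placeholder token for the current session
Sys.setenv(COVID_ACT_NOW_KEY = "xyxyxy")

base <- "https://api.covidactnow.org/v2/county/"
county <- "06037"

# Same URL as before, but the token no longer appears in the script itself
API_URL <- paste0(base, county, ".timeseries.json?apiKey=", get_api_key())
API_URL
```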

We can now use the URL to call the API, and we’ll store the returned data in an object called raw_data:

raw_data <- GET(API_URL)

You can type View(raw_data) to examine what the API sent back, which isn’t in a usable format yet. You’ll notice a “status_code” element in the list. By convention, a status code of 200 means that the API call was successful, and other codes are used to indicate errors. You can troubleshoot those error codes using the API documentation.
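The sketch below shows one way to turn a status code into a readable message before going any further. explain_status() is a hypothetical helper, and its messages follow general HTTP conventions rather than anything specific to COVID Act Now, so the API documentation remains the authority on what each code means there.

```r
explain_status <- function(code) {
  if (code >= 200 && code < 300) {
    "Success: the request was fulfilled."
  } else if (code == 401 || code == 403) {
    "Authorization problem: check your API key."
  } else if (code == 404) {
    "Not found: check the URL, for example the county code."
  } else if (code == 429) {
    "Too many requests: slow down and try again later."
  } else if (code >= 500) {
    "Server error: the problem is on the API side."
  } else {
    paste("Unexpected status code:", code)
  }
}

explain_status(200)
explain_status(404)

# With a live response from httr, you would pass in its status code:
# explain_status(status_code(raw_data))
```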

 

Converting JSON Results to a Data Frame

We received the data in a format that isn’t very easy to work with yet. Thankfully, we can store it in a data frame with just a few steps.

First, we’ll convert the raw data into an R list:

COVID_list <- fromJSON(rawToChar(raw_data$content), flatten = TRUE)

Now that it’s in a list format, you can see that it actually contains several data frames!
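To see what this conversion does without calling the API, here is a small sketch that runs fromJSON() on a hand-written JSON string. The string only imitates the general shape of a response (a top-level field plus an array of daily records); it is not real output from COVID Act Now.

```r
library(jsonlite)

# Hand-written sample that imitates the structure of a response (not real API output)
sample_json <- '{
  "fips": "06037",
  "actualsTimeseries": [
    {"date": "2020-01-26", "cases": 1, "deaths": 0},
    {"date": "2020-01-27", "cases": 1, "deaths": 0}
  ]
}'

# httr stores the response body as raw bytes, so we imitate that here
sample_raw <- charToRaw(sample_json)

# Same conversion as above: raw bytes -> text -> R list
sample_list <- fromJSON(rawToChar(sample_raw), flatten = TRUE)

# The top level is a list; the array of records became a data frame
names(sample_list)
class(sample_list$actualsTimeseries)
sample_list$actualsTimeseries
```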

You can use this data right away if you are already familiar with lists in R, or you can extract the data frames into separate objects, like this:

df <- COVID_list$actualsTimeseries
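If the call failed, for example because of an invalid API key, the parsed list will not contain actualsTimeseries, and df silently ends up as NULL. The hypothetical helper below is a minimal sketch of a check that fails loudly instead. The two test lists are stand-ins typed in by hand, not real API output.

```r
extract_timeseries <- function(parsed) {
  # [[ ]] matches the name exactly and returns NULL if the element is missing
  ts <- parsed[["actualsTimeseries"]]
  if (is.null(ts)) {
    stop("No 'actualsTimeseries' element found. Elements returned: ",
         paste(names(parsed), collapse = ", "))
  }
  ts
}

# A stand-in for a successful result
good <- list(fips = "06037",
             actualsTimeseries = data.frame(date = "2020-01-26", cases = 1))
extract_timeseries(good)

# A stand-in for a failed call (the element name 'error' is made up for this example)
bad <- list(error = "invalid API key")
result <- tryCatch(extract_timeseries(bad),
                   error = function(e) conditionMessage(e))
result
```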

The data frame that we have just created contains many different variables and a lot of information. Below, you can see the first six rows of a selection of some interesting variables in our data:

head(df[ , c("cases", "deaths", "newCases", "newDeaths", "date")])
#   cases deaths newCases newDeaths       date
# 1    NA     NA       NA        NA 2020-01-22
# 2    NA     NA       NA        NA 2020-01-23
# 3    NA     NA       NA        NA 2020-01-24
# 4    NA     NA       NA        NA 2020-01-25
# 5     1      0       NA        NA 2020-01-26
# 6     1      0        0         0 2020-01-27
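The date column comes back from fromJSON() as text, and the earliest rows are mostly NA, so a little cleaning is usually the first step. The sketch below rebuilds the six rows shown above by hand, so that it runs without an API call, then converts the dates and drops the rows without a case count. The object names df_head and df_reported are chosen for this example only.

```r
# The six rows printed above, typed in by hand
df_head <- data.frame(
  cases     = c(NA, NA, NA, NA, 1, 1),
  deaths    = c(NA, NA, NA, NA, 0, 0),
  newCases  = c(NA, NA, NA, NA, NA, 0),
  newDeaths = c(NA, NA, NA, NA, NA, 0),
  date      = c("2020-01-22", "2020-01-23", "2020-01-24",
                "2020-01-25", "2020-01-26", "2020-01-27"),
  stringsAsFactors = FALSE
)

# Convert the text dates to R's Date class so they sort and plot correctly
df_head$date <- as.Date(df_head$date)

# Keep only the days on which a case count was reported
df_reported <- df_head[!is.na(df_head$cases), ]
df_reported
```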

 

Looping Multiple API Calls

Now that we’ve seen how to make an API call for one county, let’s create a simple for loop to make several calls in a row.

First, we’ll create a vector with the ID code for each county we want to get data for:

counties <- c('01001', '01003', '01005')

Then, we’ll loop through each element of the vector and adjust our API_URL accordingly:

base <- 'https://api.covidactnow.org/v2/county/'
info_key <- '.timeseries.json?apiKey=xyxyxy'
 
for(i in seq_along(counties)) {
 
  # Build the API URL with the new county code
  API_URL <- paste0(base, counties[i], info_key)
 
  # Store the raw and processed API results in temporary objects
  temp_raw <- GET(API_URL)
  temp_list <- fromJSON(rawToChar(temp_raw$content), flatten = TRUE)
 
  # Append the new county's rows to df, which already holds the 06037 data from above
  df <- rbind(df, temp_list$actualsTimeseries)
}
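The loop above relies on df already existing from the previous section. If you would rather start from scratch, one common pattern is to collect each county's result in a list and bind everything once at the end. The sketch below wraps this in two hypothetical functions, fetch_county() and collect_counties(). The fips column is an addition made here so that the rows of different counties can be told apart, and a stub with made-up rows replaces the real download so the example runs offline.

```r
# Download and parse one county (not called below; needs httr, jsonlite, and a valid key)
fetch_county <- function(fips, api_key) {
  url <- paste0("https://api.covidactnow.org/v2/county/", fips,
                ".timeseries.json?apiKey=", api_key)
  resp <- httr::GET(url)
  httr::stop_for_status(resp)  # turn HTTP errors into R errors
  parsed <- jsonlite::fromJSON(rawToChar(resp$content), flatten = TRUE)
  parsed$actualsTimeseries
}

# Apply a fetch function to every county and combine the results
collect_counties <- function(counties, fetch_one) {
  results <- lapply(counties, function(fips) {
    county_df <- fetch_one(fips)
    county_df$fips <- fips  # remember which county each row belongs to
    county_df
  })
  do.call(rbind, results)
}

# Offline stand-in that returns two made-up rows per county
fake_fetch <- function(fips) {
  data.frame(date = c("2020-01-26", "2020-01-27"), cases = c(1, 1))
}

all_counties <- collect_counties(c("01001", "01003", "01005"), fake_fetch)
all_counties

# With a real key, the call could look like this:
# collect_counties(counties, function(f) fetch_county(f, "your-key"))
```

Because nothing is bound until the end, no data frame has to exist before the loop starts.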

Working with APIs is challenging at first (and even once you have the hang of it!), but they can provide a scalable and customizable way to gather data directly in R.

 

Video Tutorial & Further Resources

Do you need more explanations on how to use APIs from within R? Then you might have a look at the following YouTube video on the Statistics Globe YouTube channel.

In the video, Kirby White shows another example on how to collect data using APIs. Furthermore, he shows a Shiny app that he has created based on the API he is introducing in the video.

 

 

In case you have any further questions or comments, please let us know in the comments section below. We are happy to read your feedback!

 



20 Comments

  • SCOTT A PROST-DOMASKY
    October 14, 2021 3:09 pm

    This line does not work
    COVID_list <- fromJSON(rawToChar(raw_data$content), flatten = TRUE)

    COVID_list is a list of length 1 that contains an "invalid API key" error, even though there is
    data in raw_data$content.

    Thus the next line
    df <- COVID_list$actualsTimeseries

    results in df=NULL

  • SCOTT A PROST-DOMASKY
    October 14, 2021 3:39 pm

    when I go here directly https://api.covidactnow.org/v2/county/

    I get an error message that I have to register for an API key at
    https://apidocs.covidactnow.org/#register

    and tell them what I want to use the data for, as if that’s any of their business?

  • Herbert Holeman
    October 15, 2021 4:46 pm

    Thanks for a useful tutorial. Everything went smoothly.
    Herb

  • That works well! Great to get more on how to work with bigger databases like this!
    Many thanks.

  • Peter Versteegen
    November 3, 2021 3:12 am

    I was able to read the data. The dataframe only contains 20 names, while the json file contains several hundred. How do I drill down further into a time series set?

    I appreciate your comments.

    Thanks

    • Hey Peter,

      Thank you for the comment! I’ve forwarded your question to Kirby.

      Regards,
      Joachim

    • Hi Peter,
      Sometimes a list (or data frame) contains other dataframes, one for each row. You can access these “sub” data frames by referring to the individual elements. Sometimes it can be easiest to access these by using the View() function and then using the interface to find the right code.

      I hope that helps!
      -Kirby

  • Okay, so this is weird. I have installed and included the packages in my script. However, I get the following error:
    could not find function “GET”
    Please help!

  • Great tutorial!

    Is the final code considered a function?
    The reason I’m asking this is because I would like to create an R package that retrieves information from an API website.

    I am not sure if the code provided above is already a function or if I will have to do an extra step to bundle the code into a function.

    Thanks,

  • Hey!

    Super useful tutorial, I’m having a bit of trouble looping a different API, it seems to be binding only the last SiteCode rather than the three that I have in the list, any ideas on why this might be happening? I’ll put the code below!

    Thanks in advance

    SiteCodes_all <- c('CLDP0002', 'CLDP0003', 'CLDP0004')

    for(i in 1:length(SiteCodes_all)) {

    allsites <- paste0(Base,Node,SiteCodes_all[i],'/',Pollutant,StartTime,EndTime,Averaging,Key)

    temp_raw <- GET(allsites)
    temp_list <- fromJSON(rawToChar(temp_raw$content))
    df <- rbind(RoyalLondon_List, temp_list)

    }

    • Hey Eleri,

      Thank you for the kind words regarding the tutorial, glad you find it helpful!

      Regarding your question, it seems like you are always overwriting the results of the previous iteration at the end of your loop (i.e. df is overwritten by new results).

      Does the following code work for you?

      SiteCodes_all <- c('CLDP0002', 'CLDP0003', 'CLDP0004')
       
      RoyalLondon_List_new <- RoyalLondon_List
       
      for(i in 1:length(SiteCodes_all)) {
       
        allsites <- paste0(Base,Node,SiteCodes_all[i],'/',Pollutant,StartTime,EndTime,Averaging,Key)
       
        temp_raw <- GET(allsites)
        temp_list <- fromJSON(rawToChar(temp_raw$content))
        RoyalLondon_List_new <- rbind(RoyalLondon_List_new, temp_list)
       
      }

      RoyalLondon_List_new should contain all your data.

      Regards,
      Joachim

  • Thank you for this really wonderful tutorial!

    If I was to not have an existing dataframe but wanted to run the loop to end up with a dataframe of several combined calls…how would I go about tweaking the last line of code?

    Referring to this line (if I have not declared df outside the loop):
    df <- rbind(df, temp_list$actualsTimeseries)

