Analyze & Visualize Country Data in R Using dplyr & ggplot2 (Example)
I recently launched the first-ever Statistics Globe online course, “Data Manipulation in R Using dplyr & the tidyverse”, and this course has 103 participants (Hooray! 🙂).
In this tutorial, I will use the country information from these participants to show how to analyze and visualize country data in the R programming language.
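The content of the post is structured as follows:
- Creating Example Data
- Manipulate & Analyze Country Data
- Draw Barplot of Country Counts
- Do Everything in One Line of Code
- Video & Further Resources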
Let’s start right away.
Creating Example Data
The first step is to create some data that we can use in the tutorial later on. Since we want to analyze the country information of our participants, we first have to store these data in a vector object:
x <- c("United Kingdom",                         # Create vector of countries
       "United Kingdom", "Australia", "United States", "United States", "United Kingdom",
       "Netherlands", "Austria", "United States", "United States", "Ireland",
       "United States", "United States", "United States", "Japan", "United States",
       "United States", "Bangladesh", "Congo", "Spain", "Spain",
       "Netherlands", "United States", "United States", "United States", "Chile",
       "United States", "Canada", "United States", "Spain", "United Kingdom",
       "Ireland", "United Kingdom", "Mexico", "Namibia", "United States",
       "India", "United States", "Romania", "Mexico", "Canada",
       "Tanzania", "Netherlands", "Portugal", "Germany", "United Kingdom",
       "United States", "United States", "Australia", "Sweden", "Japan",
       "Canada", "United States", "Italy", "France", "Germany",
       "Germany", "Sweden", "Mexico", "New Zealand", "Mexico",
       "South Korea", "United States", "United States", "United States", "South Africa",
       "United States", "United States", "Australia", "United States", "United States",
       "United States", "Iceland", "United States", "United States", "United Kingdom",
       "United Kingdom", "Ireland", "Germany", "United States", "United States",
       "Singapore", "United Kingdom", "United States", "Mexico", "United Kingdom",
       "Norway", "Brazil", "United States", "United States", "Canada",
       "Netherlands", "Canada", "Sweden", "United States", "United States",
       "United Kingdom", "Germany", "United States", "United States", "United States",
       "United States", "Trinidad and Tobago")
The previous R code has created a vector object called x, which contains one country name for each participant in the course.
Let’s work with these data!
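Before analyzing such a vector, it can be worth running a quick sanity check on it. The following is only a minimal sketch: x_small is a hypothetical six-element stand-in that I use here for illustration, not the full vector x from above.

```r
# Hypothetical stand-in for the full vector x (six entries only)
x_small <- c("United Kingdom", "United Kingdom", "Australia",
             "United States", "United States", "United Kingdom")

length(x_small)                                  # Number of entries (i.e. participants)
unique(x_small)                                  # Distinct country names
anyNA(x_small)                                   # Check for missing values
```

Applied to the full vector x, length(x) should return 103, i.e. one entry per participant.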
Manipulate & Analyze Country Data
This section shows how to calculate summary statistics for our country data using the packages of the tidyverse (in particular, dplyr).
First, we have to install and load the tidyverse packages by running the code below:
install.packages("tidyverse")                    # Install tidyverse package
library("tidyverse")                             # Load tidyverse
We can use the group_by() and summarize() functions if we want to calculate summary statistics by group using the tidyverse.
In this specific example, I’m interested in the country counts. To calculate this, we can use the n() function within the summarize() function.
Furthermore, I’d like to order the output tibble from the most represented countries to the least represented countries. We can do this using the arrange() and desc() functions.
Take a look at the syntax and its output below:
my_tib_grouped <- tibble(country = x) %>%        # Convert vector to tibble
  group_by(country) %>%                          # Group tibble
  summarize(country_count = n()) %>%             # Calculate country count
  arrange(desc(country_count))                   # Arrange tibble descendingly
my_tib_grouped                                   # Print country data
# # A tibble: 30 × 2
#    country        country_count
#    <chr>                  <int>
#  1 United States             40
#  2 United Kingdom            11
#  3 Canada                     5
#  4 Germany                    5
#  5 Mexico                     5
#  6 Netherlands                4
#  7 Australia                  3
#  8 Ireland                    3
#  9 Spain                      3
# 10 Sweden                     3
# # ℹ 20 more rows
# # ℹ Use `print(n = ...)` to see more rows
As you can see, we have created a summary tibble with two columns and one row per country: The first column shows the different countries and the second column shows the corresponding count. For instance, there are 40 participants from the United States and 11 participants from the United Kingdom.
The header of this output also shows the dimensions of the tibble, i.e. 30 rows and 2 columns. Since each row corresponds to one country, this tells us that our group of participants contains 30 different countries.
Great – that’s very international!
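As a side note, dplyr also provides the count() function, which combines the grouping, counting, and sorting steps in a single call. The sketch below is only an illustration of this alternative: x_small and my_tib_small are hypothetical names, and the six-element vector is a stand-in for the full vector x.

```r
library("dplyr")                                 # Load dplyr (part of the tidyverse)

# Hypothetical stand-in for the full vector x (six entries only)
x_small <- c("United Kingdom", "United Kingdom", "Australia",
             "United States", "United States", "United Kingdom")

my_tib_small <- tibble(country = x_small) %>%    # Convert vector to tibble
  count(country,                                 # Count rows per country
        sort = TRUE,                             # Largest counts first
        name = "country_count")                  # Name of the count column

my_tib_small                                     # Print country counts
nrow(my_tib_small)                               # Number of different countries
```

With the full vector x, this approach should lead to the same counts as the group_by(), summarize(), and arrange() pipeline shown above.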
Draw Barplot of Country Counts
Now that we know the country counts of the course participants, I would also like to visualize these results using the ggplot2 package.
In the code below, I specify that I want to draw an ordered barplot with vertical x-axis labels, an x-axis title called “Country”, a y-axis title called “Count”, a main title “dplyr Course Participants by Country”, and a little text message inside the plot.
Let’s do this:
my_ggp <- my_tib_grouped %>%                     # Create ggplot2 plot
  ggplot(aes(x = reorder(country, - country_count),
             y = country_count)) +
  geom_col() +                                   # Specify to draw a barplot
  theme(axis.text.x = element_text(angle = 90,   # Vertical x-axis labels
                                   hjust = 1,
                                   vjust = 0.3)) +
  xlab("Country") +                              # Change x-axis label
  ylab("Count") +                                # Change y-axis label
  ggtitle("dplyr Course Participants by Country") + # Change main title
  annotate("text",                               # Add text element to plot
           x = 15,
           y = 25,
           label = "Thank You !!",
           size = 15,
           color = "red")
my_ggp                                           # Draw ggplot2 plot
The graphic above visualizes the country counts of our participants in an ordered barplot. Looks great!
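The ordering of the bars comes from the reorder() function, which sorts the factor levels of the first argument by the values of the second argument in increasing order. Since we pass the negated counts, the largest count comes first. The small sketch below illustrates this with three of the counts from our output; the object name country_ordered is just a name I have chosen for this illustration.

```r
country <- c("Canada", "United States", "United Kingdom")  # Three example countries
country_count <- c(5, 40, 11)                    # Corresponding counts

country_ordered <- reorder(country,              # Order factor levels by negated counts
                           - country_count)

levels(country_ordered)                          # United States, United Kingdom, Canada
```

ggplot2 draws the bars of a discrete x-axis in the order of these factor levels, which is why the bar of the United States appears on the left side of the plot.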
Do Everything in One Line of Code
The elegance of the dplyr pipe lies in its ability to chain nearly all of these steps into a single expression, without storing any intermediate data objects. Check this out:
tibble(country = c("United Kingdom",             # Create tibble with country data
                   "United Kingdom", "Australia", "United States", "United States", "United Kingdom",
                   "Netherlands", "Austria", "United States", "United States", "Ireland",
                   "United States", "United States", "United States", "Japan", "United States",
                   "United States", "Bangladesh", "Congo", "Spain", "Spain",
                   "Netherlands", "United States", "United States", "United States", "Chile",
                   "United States", "Canada", "United States", "Spain", "United Kingdom",
                   "Ireland", "United Kingdom", "Mexico", "Namibia", "United States",
                   "India", "United States", "Romania", "Mexico", "Canada",
                   "Tanzania", "Netherlands", "Portugal", "Germany", "United Kingdom",
                   "United States", "United States", "Australia", "Sweden", "Japan",
                   "Canada", "United States", "Italy", "France", "Germany",
                   "Germany", "Sweden", "Mexico", "New Zealand", "Mexico",
                   "South Korea", "United States", "United States", "United States", "South Africa",
                   "United States", "United States", "Australia", "United States", "United States",
                   "United States", "Iceland", "United States", "United States", "United Kingdom",
                   "United Kingdom", "Ireland", "Germany", "United States", "United States",
                   "Singapore", "United Kingdom", "United States", "Mexico", "United Kingdom",
                   "Norway", "Brazil", "United States", "United States", "Canada",
                   "Netherlands", "Canada", "Sweden", "United States", "United States",
                   "United Kingdom", "Germany", "United States", "United States", "United States",
                   "United States", "Trinidad and Tobago")) %>%
  group_by(country) %>%                          # Group tibble
  summarize(country_count = n()) %>%             # Calculate country count
  ggplot(aes(x = reorder(country, - country_count),
             y = country_count)) +
  geom_col() +                                   # Specify to draw a barplot
  theme(axis.text.x = element_text(angle = 90,   # Vertical x-axis labels
                                   hjust = 1,
                                   vjust = 0.3)) +
  xlab("Country") +                              # Change x-axis label
  ylab("Count") +                                # Change y-axis label
  ggtitle("dplyr Course Participants by Country") + # Change main title
  annotate("text",                               # Add text element to plot
           x = 15,
           y = 25,
           label = "Thank You !!",
           size = 15,
           color = "red")
Great, isn’t it? 🙂
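One detail of such chained code is easy to overlook: the dplyr steps are connected with the %>% operator, whereas the ggplot2 layers are added with the + operator. The following is a stripped-down sketch of this pattern. It assumes a hypothetical six-element stand-in vector x_small and uses count() as a shortcut for the group_by() and summarize() steps; p_small is just a name chosen for this illustration.

```r
library("tidyverse")                             # Load tidyverse

# Hypothetical stand-in for the full country vector (six entries only)
x_small <- c("United Kingdom", "United Kingdom", "Australia",
             "United States", "United States", "United Kingdom")

p_small <- tibble(country = x_small) %>%         # dplyr steps are chained with %>%
  count(country, name = "country_count") %>%     # Calculate country count
  ggplot(aes(x = reorder(country, - country_count),
             y = country_count)) +               # ggplot2 layers are added with +
  geom_col()                                     # Specify to draw a barplot

p_small                                          # Draw ggplot2 plot
```

The pipe passes the summarized tibble as the first argument to ggplot(), which is why no data object has to be named inside the ggplot() call.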
Video & Further Resources
Do you want to learn more about the handling of country data using the dplyr & ggplot2 packages? Then I can recommend taking a look at the following video on my YouTube channel. In the video, I explain the contents of this post in more detail.
In addition, you could read the related articles on this website. A selection of related tutorials can be found below.
- Mean by Group (dplyr Package vs. Base R)
- Error: group_by & summarize Functions don’t Work
- Draw Grouped ggplot2 Barplot with Text Labels
- ggplot2 Barplot with Axis Break & Zoom
- Change Y-Axis to Percentage Points in ggplot2 Barplot
- R Programming Overview
At this point, you should have learned how to analyze and visualize country data using the packages of the tidyverse in the R programming language. Don’t hesitate to let me know in the comments if you have any further questions.
Furthermore, make sure to visit the course description page and join the waiting list, in case you’d like to take part in such a course in the future as well.