Violin Plot [Course Preview]

This module introduces violin plots, a visualization that combines aspects of boxplots and density plots to display the distribution and variability of data across categories. You’ll learn how to create and customize violin plots in ggplot2 to compare distributions and reveal patterns in your data.
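To make the "boxplot plus density plot" idea concrete, the following base R sketch draws a single violin by hand: a kernel density estimate mirrored around a vertical axis, with the quartiles drawn on top. The simulated values and all object names are hypothetical and serve only as an illustration; geom_violin() takes care of these steps internally, and its defaults may differ in detail.

```r
set.seed(1)                                               # Hypothetical example data
x <- rnorm(200, mean = 200, sd = 10)

d <- density(x)                                           # Kernel density estimate
q <- quantile(x, probs = c(0.25, 0.5, 0.75))              # Quartiles (boxplot part)

half_width <- d$y / max(d$y) / 2                          # Scale half-width to max 0.5

plot(NA,                                                  # Empty plotting canvas
     xlim = c(-0.6, 0.6),
     ylim = range(d$x),
     xaxt = "n",
     xlab = "",
     ylab = "Value")
polygon(x = c(-half_width, rev(half_width)),              # Mirrored density = violin
        y = c(d$x, rev(d$x)),
        col = "gray80")
segments(x0 = -0.1, y0 = q, x1 = 0.1, y1 = q)             # Quartile lines
```

The width of the shape at a given height reflects how many observations lie near that value, which is the property the plots in this module rely on.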

R Code

Please find the R code of this lecture below. In this lecture, we are using the Palmer Penguins data set [see data attribution below]. You can download it by clicking here.

# install.packages("gghalves")                            # Install & load gghalves
library(gghalves)
 
# install.packages("ggplot2")                             # Install & load ggplot2
library(ggplot2)
 
# install.packages("ggstatsplot")                         # Install & load ggstatsplot
library(ggstatsplot)
 
my_path <- "C:/Users/Joachim Schork/Dropbox/Jock/Data Sets/ggplot2 Course/" # Path (adjust to your own folder)
 
my_penguins <- read.csv(paste0(my_path,                   # Import CSV file
                               "palmerpenguins_original.csv"))
my_penguins <- na.omit(my_penguins)                       # Remove NA values
 
ggplot(data = my_penguins,                                # Multiple violin plots
       aes(x = species,
           y = flipper_length_mm)) +
  geom_violin()
 
ggplot(data = my_penguins,                                # Add colors & legend
       aes(x = species,
           y = flipper_length_mm,
           fill = species)) +
  geom_violin()
 
ggplot(data = my_penguins,                                # Half for female & male
       aes(x = species,
           y = flipper_length_mm,
           fill = sex)) +
  geom_half_violin(side = "l",                            # Left side for females
                   data = my_penguins[my_penguins$sex == "female", ]) +
  geom_half_violin(side = "r",                            # Right side for males
                   data = my_penguins[my_penguins$sex == "male", ])
 
ggplot(data = my_penguins,                                # Add boxplots
       aes(x = species,
           y = flipper_length_mm,
           fill = species)) +
  geom_violin() +
  geom_boxplot(width = 0.3,
               fill = "white")
 
set.seed(44294)                                           # Seed for reproducibility
 
ggplot(data = my_penguins,                                # Add jittered points
       aes(x = species,
           y = flipper_length_mm,
           fill = species)) +
  geom_violin() +
  geom_boxplot(width = 0.3,
               fill = "white") +
  geom_jitter(width = 0.1,
              color = "#636363",
              alpha = 0.5)
 
ggplot(data = my_penguins,                                # Add mean values
       aes(x = species,
           y = flipper_length_mm,
           fill = species)) +
  geom_violin() +
  geom_boxplot(width = 0.3,
               fill = "white") +
  geom_jitter(width = 0.1,
              color = "#636363",
              alpha = 0.5) +
  stat_summary(fun = mean, 
               geom = "point", 
               color = "#f31f61", 
               size = 5, 
               shape = 18)
 
ggplot(data = my_penguins,                                # Customize design
       aes(x = species,
           y = flipper_length_mm,
           fill = species)) +
  geom_violin() +
  geom_boxplot(width = 0.3,
               fill = "white") +
  geom_jitter(width = 0.1,
              color = "#636363",
              alpha = 0.5) +
  stat_summary(fun = mean, 
               geom = "point", 
               color = "#f31f61", 
               size = 5, 
               shape = 18) +
  theme_minimal(base_size = 15) +                         # Apply theme
  labs(title = "Flipper Length Distribution by Penguin Species", # Change labels
       subtitle = "Violin Plot with Boxplot Overlay and Mean Points",
       x = "Penguin Species",
       y = "Flipper Length (mm)",
       fill = "Species") +
  theme(plot.title = element_text(hjust = 0.5,            # Center & bold title
                                  face = "bold",
                                  size = 18),
        plot.subtitle = element_text(hjust = 0.5,         # Center subtitle
                                     size = 14),
        axis.title.x = element_text(margin = margin(t = 10)), # Add margin to titles
        axis.title.y = element_text(margin = margin(r = 10)),
        axis.text.x = element_text(size = 12,             # Style axis labels
                                   face = "italic"),
        axis.text.y = element_text(size = 12),
        legend.position = "top",                          # Move legend to top
        legend.title = element_text(face = "bold"))       # Bold legend title
 
ggbetweenstats(data = my_penguins,                        # Violin plot with stats
               x = species,
               y = flipper_length_mm)

Exercises

In these exercises, we will work with the midwest data set, which is built into the ggplot2 package. It contains demographic and geographic information for counties in the midwestern United States, including variables such as total population (poptotal), county (county), and state (state).

Violin plots are well suited to visualizing the distribution of a numeric variable, such as population size, across categorical groups, such as states. Here, we will explore how county population sizes are distributed within each state.

To get started, install and load the dplyr and ggplot2 packages, and load the midwest data set as shown below:

# install.packages("dplyr")                               # Install & load dplyr
library(dplyr)
 
# install.packages("ggplot2")                             # Install & load ggplot2
library(ggplot2)
 
data(midwest)                                             # Load example data

Our current data set includes counties of various sizes. Let’s use a violin plot to visualize the distribution of these population sizes:

ggplot(data = midwest,                                    # Violin plots for entire data
       aes(x = state,
           y = poptotal)) +
  geom_violin()

As the previous plot shows, a few very large counties stretch the y-axis and compress the violins, which makes the bulk of each distribution hard to read. Since this analysis focuses on small and medium-sized counties, we will create a subset of the data to examine these distributions more closely:

midwest_below_200k <- midwest %>%                         # Create subset
  filter(poptotal < 200000)
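Before continuing, it can be useful to check how much data the filter removes. The sketch below is one possible way to do this; the object names n_removed and state_counts are hypothetical and not part of the lecture code.

```r
library(dplyr)
library(ggplot2)

data(midwest)                                             # Load example data

midwest_below_200k <- midwest %>%                         # Same subset as above
  filter(poptotal < 200000)

n_removed <- nrow(midwest) - nrow(midwest_below_200k)     # Counties dropped overall
n_removed

state_counts <- midwest %>%                               # Counties kept per state
  group_by(state) %>%
  summarise(n_total = n(),
            n_below_200k = sum(poptotal < 200000))
state_counts
```

If a state loses a large share of its counties, keep in mind that its violin describes only the remaining counties.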

Now, let’s move on to the exercises for this module:

  1. Create a violin plot using the midwest_below_200k data set to visualize the distribution of total population (poptotal) across different states.
  2. Enhance the violin plot by adding colors to represent each state using the fill aesthetic.
  3. Add boxplots inside the violin plots to display summary statistics for the population distribution in each state. Use geom_boxplot() with a reduced width of 0.1 and a white fill color.
  4. Further enhance the visualization by adding jittered points to represent individual observations. Set the jitter width to 0.1, use gray30 as the point color, and set the transparency to alpha = 0.5 for better clarity.
  5. Analyze the plots created in the previous exercises. Identify which state has the highest median county population size and which state has the largest concentration of small counties, based on the width of the violins at low population values.

Exercise Solutions

Below are the solutions to the exercises for this module.

Exercise 1) Create a violin plot using the midwest_below_200k data set to visualize the distribution of total population (poptotal) across different states.

ggplot(data = midwest_below_200k,                         # Violin plots for subset
       aes(x = state,
           y = poptotal)) +
  geom_violin()

This code generates violin plots that display the population distribution of counties in each state.

Exercise 2) Enhance the violin plot by adding colors to represent each state using the fill aesthetic.

ggplot(data = midwest_below_200k,                         # Add colors & legend
       aes(x = state,
           y = poptotal,
           fill = state)) +
  geom_violin()

This code applies a unique color to each state, making it easier to distinguish the distributions.

Exercise 3) Add boxplots inside the violin plots to display summary statistics for the population distribution in each state. Use geom_boxplot() with a reduced width of 0.1 and a white fill color.

ggplot(data = midwest_below_200k,                         # Add boxplots
       aes(x = state,
           y = poptotal,
           fill = state)) +
  geom_violin() +
  geom_boxplot(width = 0.1,
               fill = "white")

This code overlays boxplots within the violin plots, highlighting the medians and interquartile ranges for each state.
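The values that the boxplots display can also be computed directly. The following sketch, which assumes the same subset as above, calculates the quartiles and interquartile range per state; the object name state_quartiles and the column names are hypothetical.

```r
library(dplyr)
library(ggplot2)

data(midwest)                                             # Load example data

midwest_below_200k <- midwest %>%                         # Same subset as above
  filter(poptotal < 200000)

state_quartiles <- midwest_below_200k %>%                 # Boxplot statistics by state
  group_by(state) %>%
  summarise(q1 = quantile(poptotal, 0.25, names = FALSE),
            median_pop = median(poptotal),
            q3 = quantile(poptotal, 0.75, names = FALSE),
            iqr = IQR(poptotal))
state_quartiles
```

The columns q1 and q3 correspond to the lower and upper edges of each box, and median_pop to the line inside it.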

Exercise 4) Further enhance the visualization by adding jittered points to represent individual observations. Set the jitter width to 0.1, use gray30 as the point color, and set the transparency to alpha = 0.5 for better clarity.

set.seed(84374)                                           # Seed for reproducibility
 
ggplot(data = midwest_below_200k,                         # Add jittered points
       aes(x = state,
           y = poptotal,
           fill = state)) +
  geom_violin() +
  geom_boxplot(width = 0.1,
               fill = "white") +
  geom_jitter(width = 0.1,
              color = "gray30",
              alpha = 0.5)

This code adds jittered points to show the distribution of individual county populations, complementing the violin and boxplots.
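Note that geom_jitter() also adds a small amount of vertical noise by default when only width is specified. The following sketch shows one possible variant, not the lecture's code: it uses geom_point() with position_jitter(), sets height = 0 so that the population values are not shifted, and passes the seed to the position function instead of calling set.seed(). The object name p_jitter is hypothetical.

```r
library(dplyr)
library(ggplot2)

data(midwest)                                             # Load example data

midwest_below_200k <- midwest %>%                         # Same subset as above
  filter(poptotal < 200000)

p_jitter <- ggplot(data = midwest_below_200k,             # Horizontal jitter only
                   aes(x = state,
                       y = poptotal,
                       fill = state)) +
  geom_violin() +
  geom_boxplot(width = 0.1,
               fill = "white") +
  geom_point(position = position_jitter(width = 0.1,
                                        height = 0,       # No vertical noise
                                        seed = 84374),    # Seed for this layer only
             color = "gray30",
             alpha = 0.5)
p_jitter
```

For population counts the default vertical noise is negligible, so this variant matters mainly for variables measured on a coarse scale.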

Exercise 5) Analyze the plots created in the previous exercises. Identify which state has the highest median county population size and which state has the largest concentration of small counties, based on the width of the violins at low population values.

Based on the plots, Ohio (OH) has the highest median county population size, as indicated by the position of the median line of the boxplot inside its violin. Illinois (IL) has the widest violin at the lower end of the population range, indicating that it has the largest concentration of small counties.
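A visual reading like this can be cross-checked numerically. The sketch below ranks the states by median county population and counts the counties below a cutoff; the cutoff of 25,000 is a hypothetical choice for "small" that does not come from the lecture, and the object names are likewise illustrative.

```r
library(dplyr)
library(ggplot2)

data(midwest)                                             # Load example data

midwest_below_200k <- midwest %>%                         # Same subset as above
  filter(poptotal < 200000)

small_cutoff <- 25000                                     # Hypothetical "small" cutoff

state_check <- midwest_below_200k %>%                     # Median & small counties
  group_by(state) %>%
  summarise(median_pop = median(poptotal),
            n_small = sum(poptotal < small_cutoff)) %>%
  arrange(desc(median_pop))                               # Highest median first
state_check
```

The first row shows the state with the highest median, and the n_small column depends on the chosen cutoff, so it is worth trying a few different values.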

Data Attribution

This module uses data obtained from kaggle.com. We acknowledge the contributors as the primary source of this data set, which significantly enhances the educational value of our course. For more detailed information and additional resources, please visit this page on the Kaggle website.
