Report Missing Values in Data Frame in R (2 Examples)

In this R tutorial you’ll learn how to count and visualize the missing values of a data frame.

Table of contents:

1) Creating Example Data

2) Example 1: Count Missing Values in Columns

3) Example 2: Visualize Missing Values Using VIM Package

4) Video & Further Resources

Here’s the step-by-step process!

Creating Example Data

First, we need to construct some data that we can use in the following examples:

set.seed(873264)                     # Create example data
data <- round(data.frame(x1 = rnorm(100),
                         x2 = runif(100),
                         x3 = rpois(100, 1)), 2)
data$x1[rbinom(100, 1, 0.2) == 1] <- NA
data$x2[rbinom(100, 1, 0.4) == 1] <- NA
data$x3[rbinom(100, 1, 0.6) == 1] <- NA
head(data)                           # First rows of example data
#      x1   x2 x3
# 1 -0.35   NA NA
# 2    NA 0.98 NA
# 3  1.69 0.87  2
# 4 -0.99 0.00 NA
# 5    NA   NA NA
# 6    NA 0.03 NA

The previous output of the RStudio console shows the structure of our example data: It’s a data frame containing three numeric columns. Each of the columns has a non-negligible number of NA values.

Example 1: Count Missing Values in Columns

When inspecting the missing data structure of a data frame, the first step should be to count the missing values in each variable. This example therefore illustrates how to get the number of NAs in each column. For this task, we can combine the is.na function, which flags each missing value as TRUE, with the colSums function, which adds up those flags by column:

colSums(is.na(data))                 # Count missing values by column
# x1 x2 x3 
# 20 44 58

The previous output of the RStudio console shows that our example data contains 20 missing values in the variable x1, 44 missing values in the variable x2, and 58 missing values in the variable x3.
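To see why colSums(is.na(...)) returns counts, it can help to look at the intermediate logical matrix. The following is a minimal sketch on a small hypothetical data frame called toy (it is not the example data of this tutorial, and the names toy, na_flags, and na_counts are made up for illustration):

```r
# Hypothetical toy data, used only to illustrate the mechanics
toy <- data.frame(a = c(1, NA, 3, NA),
                  b = c(NA, NA, 2, 5),
                  c = c(1, 2, 3, 4))

na_flags <- is.na(toy)               # Logical matrix: TRUE where a value is missing
na_flags

na_counts <- colSums(na_flags)       # TRUE is counted as 1, FALSE as 0
na_counts
# a b c 
# 2 2 0
```

Because TRUE is treated as 1 when summing, the column sums of the logical matrix are exactly the numbers of missing values per column.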

These totals are hard to interpret without taking the number of rows of our data frame into account. The following R code therefore computes the proportion of missing values in each column:

colSums(is.na(data)) / nrow(data)    # Proportion of missing values by column
#   x1   x2   x3 
# 0.20 0.44 0.58

In other words, 20% of the values in x1 are missing, 44% of the values in x2 are missing, and 58% of the values in x3 are missing. Missing rates this high would definitely be alarming in practice!
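As an alternative, the same proportions can be obtained with colMeans, because the mean of a logical vector is the share of TRUE values. The sketch below uses a small hypothetical data frame (toy, na_share, and na_percent are illustrative names, not part of the tutorial's data) and converts the proportions to percentages:

```r
# Hypothetical toy data, used only to illustrate the calculation
toy <- data.frame(a = c(1, NA, 3, NA),
                  b = c(NA, 2, 3, 4),
                  c = c(1, 2, 3, 4))

na_share <- colMeans(is.na(toy))     # Proportion of missing values by column
na_percent <- round(100 * na_share, 1)  # Same information as percentages
na_percent
#  a  b  c 
# 50 25  0
```

Both approaches give the same result; colMeans simply saves the division by nrow().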

Example 2: Visualize Missing Values Using VIM Package

Beyond the counts per column, it is also important to inspect the missing data structure, i.e. which variables tend to be missing together. Hence, this example explains how to show the structure of missing values in a graphic using the VIM add-on package. If we want to use the functions of the VIM package, we first have to install and load VIM:

install.packages("VIM")              # Install VIM package
library("VIM")                       # Load VIM

Now, we can use the aggr() function of the VIM package to create an aggregation plot of our missing data:

aggr(data)                           # Create aggregation plot

Figure 1: Aggregation plot of the example data created with aggr().

Figure 1 shows what the aggregation plot of our data looks like. The left panel shows the proportion of missing values in each column, and the right panel shows how often each combination of missing and observed variables occurs, i.e. how often multiple variables are missing simultaneously.
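The combinations of missing values that such a plot displays can also be tabulated by hand. The following base R sketch is not how VIM computes its plot; it is only one possible way to count missingness patterns, shown on a small hypothetical data frame (toy, pattern, and pattern_counts are illustrative names):

```r
# Hypothetical toy data, used only to illustrate missingness patterns
toy <- data.frame(x1 = c(1, NA, 3, 4, NA),
                  x2 = c(NA, NA, 3, NA, 5),
                  x3 = c(NA, NA, 3, NA, NA))

# One string per row in column order: "0" = observed, "1" = missing
pattern <- apply(is.na(toy), 1,
                 function(r) paste(as.integer(r), collapse = ":"))

# Frequency of each pattern, most common pattern first
pattern_counts <- sort(table(pattern), decreasing = TRUE)
pattern_counts
```

In this toy data the pattern "0:1:1" (x1 observed, x2 and x3 missing) occurs twice, while every other pattern occurs once. This is the kind of information that can be read off the combinations part of an aggregation plot.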

Video & Further Resources

Do you need more info on the content of this page? Then you might want to watch the following video on my YouTube channel. In the video, I explain the topics of this tutorial.

The YouTube video will be added soon.

Furthermore, you could have a look at the related tutorials on my website. Note that this page showed only a small part of the possible analysis methods for missing values. Make sure to analyze your missing data as thoroughly as possible and to treat the missing values properly via imputation methods or other missing data approaches.

To summarize: In this R tutorial you learned how to visualize and count missing values. If you have further questions, please let me know in the comments section below.

6 Comments

saima
November 8, 2022 4:28 am

I like your style. It’s easy, short, and informative.
I have a question about the aggregation plot. I can’t understand the right graph of the aggregation plot. Can you please explain it?

- Joachim
  November 14, 2022 12:54 pm
  
  Hi Saima,
  
  Thanks again for the very kind feedback! Do you still need help with your syntax?
  
  Regards,
  Joachim
  
Saima Jahan
November 15, 2022 2:25 am

I am waiting for your response.

- Joachim
  November 15, 2022 10:42 am
  
  Hey Saima,
  
  Based on your comment I have noticed that I had embedded the wrong image in this tutorial. So first of all, thanks for making me aware of this.
  
  Regarding your question, the right side of the graph shows how often each combination of missing values occurs. For example, there are more rows in our data set where the variables x2 and x3 both contain a missing value than rows where the variables x1 and x3 both contain a missing value.
  
  I hope that clarifies this image for you!
  
  Regards,
  Joachim
  
  - Saima
    November 16, 2022 4:12 am
    
    Thank you so much. I appreciate your help.
    
    - Joachim
      November 16, 2022 10:19 am
      
      Thank you very much Saima, glad it was helpful!
      
      Regards,
      Joachim
      