R Find Missing Values (6 Examples for Data Frame, Column & Vector)
Let’s face it:
Missing values are an issue in almost every raw data set!
If we don’t handle our missing data in an appropriate way, our estimates are likely to be biased.
However, before we can deal with missingness, we need to identify in which rows and columns the missing values occur.
In the following, I will show you several examples of how to find missing values in R.
Example 1: One of the most common ways in R to find missing values in a vector
expl_vec1 <- c(4, 8, 12, NA, 99, -20, NA)   # Create your own example vector with NA's

is.na(expl_vec1)                            # The is.na() function returns a logical vector. The vector is TRUE in case
                                            # of a missing value and FALSE in case of an observed value

which(is.na(expl_vec1))                     # The which() function returns the positions with missing values in your vector.
                                            # In our case there are NA's at positions 4 & 7
### [1] 4 7
You can find a more detailed explanation for this example in the following video:
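If you only need a quick yes/no answer before looking at positions, base R's anyNA() can help. The following is a minimal sketch based on the vector of Example 1; the object names has_na, na_positions, and observed are my own choices for illustration.

```r
# Same example vector as in Example 1
expl_vec1 <- c(4, 8, 12, NA, 99, -20, NA)

has_na <- anyNA(expl_vec1)                 # TRUE if at least one value is missing
na_positions <- which(is.na(expl_vec1))    # Positions of the missing values
observed <- expl_vec1[!is.na(expl_vec1)]   # Keep only the observed values

has_na                                     # TRUE
na_positions                               # 4 7
observed                                   # 4 8 12 99 -20
```

The logical vector returned by is.na() can be negated with ! and used directly for subsetting, so there is no need to store the positions first.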
Example 2: Find missing values in a column of a data frame
expl_data1 <- data.frame(x1 = c(NA, 7, 8, 9, 3),                               # Numeric variable with one missing value
                         x2 = c(4, 1, NA, NA, 4),                              # Numeric variable with two missing values
                         x3 = c(1, 4, 2, 9, 6),                                # Numeric variable without any missing values
                         x4 = c("Hello", "I am not NA", NA, "I love R", NA),   # Factor variable with two missing values
                         stringsAsFactors = TRUE)                              # Since R 4.0.0, strings are only stored
                                                                               # as factors if this argument is TRUE

expl_data1                                                                     # This is what our data with missing values looks like
Table 1: Example Data Frame with Missing Values
which(is.na(expl_data1$x1))   # Same procedure as in Example 1, but this time with the column of a data frame;
                              # Missing value in x1 at position 1

which(is.na(expl_data1$x2))   # Variable x2 has missing values at positions 3 and 4

which(is.na(expl_data1$x3))   # The variable x3 in column 3 has no missing values

which(is.na(expl_data1$x4))   # Our factor variable x4 in column 4 has missing values at positions 3 and 5;
                              # The same procedure can be applied to factors
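Often you do not only want the positions, but the complete rows in which a certain column is missing. The following sketch rebuilds the example data (x4 is kept as a plain character column here) and extracts those rows for x2; the name rows_x2_missing is just an illustrative choice.

```r
# Example data as in Example 2
expl_data1 <- data.frame(x1 = c(NA, 7, 8, 9, 3),
                         x2 = c(4, 1, NA, NA, 4),
                         x3 = c(1, 4, 2, 9, 6),
                         x4 = c("Hello", "I am not NA", NA, "I love R", NA))

# Use the logical vector of is.na() as row index
rows_x2_missing <- expl_data1[is.na(expl_data1$x2), ]

rows_x2_missing   # Rows 3 and 4, i.e. the rows where x2 is NA
```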
Example 3: Identify missing values in an R data frame
# As in Example 1, you can create an object with logical TRUE and FALSE values,
# indicating missing (TRUE) and observed (FALSE) values;
# For a data frame, is.na() returns a logical matrix
is.na(expl_data1)

apply(is.na(expl_data1), 2, which)   # In order to get the positions of each column in your data set,
                                     # you can use the apply() function
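As an alternative to apply(), the which() function has an arr.ind argument that returns row and column indices at once. Here is a small sketch under the assumption that the example data looks as in Example 2; na_index and na_columns are illustrative names.

```r
# Example data as in Example 2
expl_data1 <- data.frame(x1 = c(NA, 7, 8, 9, 3),
                         x2 = c(4, 1, NA, NA, 4),
                         x3 = c(1, 4, 2, 9, 6),
                         x4 = c("Hello", "I am not NA", NA, "I love R", NA))

# One row per missing value, with the columns "row" and "col"
na_index <- which(is.na(expl_data1), arr.ind = TRUE)
na_index

# Translate the column numbers into variable names
na_columns <- colnames(expl_data1)[na_index[ , "col"]]
na_columns   # "x1" "x2" "x2" "x4" "x4"
```

This format is handy when you want to loop over the missing cells or join the positions with other information.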
Example 4: Detect missing values in a column of an R matrix
# Create matrix on the basis of the first three columns of our example data of Example 2
expl_matrix1 <- as.matrix(expl_data1[ , 1:3])
expl_matrix1

which(is.na(expl_matrix1[ , 1]))   # The $ operator is invalid for columns of matrices.
                                   # Therefore we have to select our matrix columns by square brackets

which(is.na(expl_matrix1[ , 2]))   # Besides the change from the $ operator to square brackets,
                                   # we can apply the same functions as in the other examples

which(is.na(expl_matrix1[ , 3]))   # Again, no missing values in x3
Example 5: Identify NA values in a matrix
# We can check the missing values of the whole matrix with the same procedure as in Example 3
apply(is.na(expl_matrix1), 2, which)
Example 6: Find missing values in R with the complete.cases() function
# An alternative to the is.na() function is the function complete.cases(),
# which searches for observed values instead of missing values
which(complete.cases(expl_vec1))            # Identify observed values (opposite result as in Example 1)

which(complete.cases(expl_vec1) == FALSE)   # Reproduce result of Example 1 by adding == FALSE

complete.cases(expl_data1)                  # If a data frame or matrix is checked by complete.cases(),
                                            # the function returns a logical vector indicating whether a row is complete
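Since complete.cases() works row by row on a data frame, it can also be used to list the incomplete rows. The following sketch assumes the example data of Example 2 and uses the ! operator instead of == FALSE; incomplete_rows is an illustrative name.

```r
# Example data as in Example 2
expl_data1 <- data.frame(x1 = c(NA, 7, 8, 9, 3),
                         x2 = c(4, 1, NA, NA, 4),
                         x3 = c(1, 4, 2, 9, 6),
                         x4 = c("Hello", "I am not NA", NA, "I love R", NA))

incomplete_rows <- which(!complete.cases(expl_data1))   # Rows with at least one NA
incomplete_rows                                         # 1 3 4 5

expl_data1[incomplete_rows, ]                           # Show the incomplete rows
```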
Video Example – Detect Missing Values in a Real Data Set
The following video of my YouTube channel shows in a live example how to find NA, how to count NA, how to omit NA, and how to remove missing values.
Have a look at minute 1:05.
I’m showing here the same approach that I have explained in Example 1.
R – Count Missing Values per Row and Column
Besides the positions of your missing data, you might also want to count the missing values per row, per column, or in a single vector. Let’s check how to do this based on our example data above:
# With the sum() and the is.na() functions you can find the number of missing values in your data
sum(is.na(expl_vec1))      # Two missings in our vector

sum(is.na(expl_data1))     # The same method works for the whole data frame; Five missings overall

sum(is.na(expl_matrix1))   # The procedure also works for matrices; The NA count is three in our case
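The sum() calls return one overall count. To get the counts per row and per column, one possible approach is to combine is.na() with the base R functions rowSums() and colSums(). This is a minimal sketch based on the example data of Example 2; na_per_row and na_per_col are illustrative names.

```r
# Example data as in Example 2
expl_data1 <- data.frame(x1 = c(NA, 7, 8, 9, 3),
                         x2 = c(4, 1, NA, NA, 4),
                         x3 = c(1, 4, 2, 9, 6),
                         x4 = c("Hello", "I am not NA", NA, "I love R", NA))

na_per_col <- colSums(is.na(expl_data1))   # Number of NA's in each column
na_per_col                                 # x1 = 1, x2 = 2, x3 = 0, x4 = 2

na_per_row <- rowSums(is.na(expl_data1))   # Number of NA's in each row
na_per_row                                 # 1 0 2 1 1
```

Both functions work because TRUE is counted as 1 and FALSE as 0 when a logical matrix is summed.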
How to Handle Missing Data in R?
Once we have found the missing values and their index positions in our data, the question arises how we should treat these not available values. Most data analyses in R need complete case data!
The default method of many R functions (for example lm(), whose default na.action is na.omit) is listwise deletion, which deletes all rows with missing values in one or more columns.
Basic data manipulations can be done with the na.omit command or with the is.na R function.
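As a small illustration of listwise deletion, the following sketch applies na.omit() to the example data of Example 2 and compares the result with subsetting by complete.cases(). The object names expl_complete and expl_complete2 are my own choices.

```r
# Example data as in Example 2
expl_data1 <- data.frame(x1 = c(NA, 7, 8, 9, 3),
                         x2 = c(4, 1, NA, NA, 4),
                         x3 = c(1, 4, 2, 9, 6),
                         x4 = c("Hello", "I am not NA", NA, "I love R", NA))

expl_complete <- na.omit(expl_data1)                         # Remove all rows with at least one NA
expl_complete                                                # Only row 2 remains

expl_complete2 <- expl_data1[complete.cases(expl_data1), ]   # Same rows via complete.cases()
nrow(expl_complete) == nrow(expl_complete2)                  # TRUE
```

Note how much data gets lost: four of the five rows are dropped, even though only five of the twenty cells are missing.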
A more sophisticated approach – which is usually preferable to a complete case analysis – is the imputation of missing values.
Very simple imputation approaches would be mean imputation (mode imputation in case of categorical variables) or the replacement of NA’s with 0.
However, in order to create a more reasonable complete data set, missing data imputation usually replaces missing values with estimates that are based on statistical models (e.g. via regression imputation or predictive mean matching).
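To make the simplest of these approaches concrete, here is a hedged sketch of mean imputation for the numeric columns of the example data. The helper impute_mean() is a hypothetical function written for this illustration, not part of base R or any package, and the sketch does not cover the model-based methods mentioned above.

```r
# Numeric columns of the example data of Example 2
expl_data1 <- data.frame(x1 = c(NA, 7, 8, 9, 3),
                         x2 = c(4, 1, NA, NA, 4),
                         x3 = c(1, 4, 2, 9, 6))

# Hypothetical helper: replace the NA's of a numeric vector by the mean of its observed values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

expl_imputed <- expl_data1
expl_imputed$x1 <- impute_mean(expl_imputed$x1)   # NA in x1 becomes 6.75
expl_imputed$x2 <- impute_mean(expl_imputed$x2)   # NA's in x2 become 3

expl_imputed
sum(is.na(expl_imputed))                          # 0
```

Keep in mind that mean imputation shrinks the variance of the imputed variables, which is one reason why model-based imputation is usually the more reasonable choice.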
Now It’s Your Turn
So that is how I’m checking for missing values in my data sets.
Now I’d like to hear about your thoughts: What’s your favorite approach?
Are you going to use the is.na function of Example 1? Or will you find NA’s by searching for complete cases?
Let me know by leaving a comment below. I will respond to every question!
Appendix
How to create the graphic of the header of this page
The header graphic shows a simple scatterplot created with the R package ggplot2.
The dark blue values indicate observed values; the light blue values indicate missingness.
Since the missing values appear more often in the upper right part of the plot, they cannot be considered Missing Completely At Random.
library(ggplot2)                                                             # Load ggplot2, which provides ggplot()

set.seed(8765)                                                               # Reproducibility
var1 <- rnorm(2000, 10, 3)                                                   # Normal distribution
var2 <- var1 + rnorm(2000)                                                   # Correlated normal distribution
range01 <- function(x){(x - min(x)) / (max(x) - min(x))}                     # Rescale probabilities of missingness to values between 0 and 1
var2_miss <- rbinom(2000, 1, range01(var1^3)) == 1                           # Flag missing values for var2 in dependence of var1
data_ggplot_missings <- data.frame(var1, var2)                               # Store var1 and var2 in a data frame
colours <- rep(1, 2000)                                                      # Set colours
colours[var2_miss] <- 2
ggplot_missings <- ggplot(data_ggplot_missings, aes(x = var1, y = var2)) +   # Create ggplot
  geom_point(aes(col = colours, size = 1.1)) +
  theme(legend.position = "none")
ggplot_missings                                                              # Draw the plot