# R Find Missing Values (6 Examples for Data Frame, Column & Vector)

Let’s face it:

Missing values are an issue in almost every raw data set!

If we don’t handle our missing data in an appropriate way, our estimates are likely to be biased.

However, before we can deal with missingness, we need to identify in which rows and columns the missing values occur.

In the following, I will show you several examples of how to find missing values in R.

**Example 1: One of the most common ways in R to find missing values in a vector**

    expl_vec1 <- c(4, 8, 12, NA, 99, -20, NA)  # Create your own example vector with NA's
    is.na(expl_vec1)                           # The is.na() function returns a logical vector:
                                               # TRUE for a missing value, FALSE for an observed value
    which(is.na(expl_vec1))                    # The which() function returns the positions of the
                                               # missing values; here the NA's are at positions 4 & 7
    ### [1] 4 7
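If you only need to know whether a vector contains any missing value at all, a yes/no check may be enough before looking up positions. The following is a minimal sketch using base R's `anyNA()`; the object names `has_na` and `has_na_alt` are only illustrative.

```r
expl_vec1 <- c(4, 8, 12, NA, 99, -20, NA)  # Same example vector as above

has_na <- anyNA(expl_vec1)                 # TRUE if at least one value is missing
has_na

has_na_alt <- any(is.na(expl_vec1))        # Equivalent check built from is.na()
has_na_alt
```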

You can find a more detailed explanation for this example in the following video:


**Example 2: Find missing values in a column of a data frame**

    expl_data1 <- data.frame(
      x1 = c(NA, 7, 8, 9, 3),                             # Numeric variable with one missing value
      x2 = c(4, 1, NA, NA, 4),                            # Numeric variable with two missing values
      x3 = c(1, 4, 2, 9, 6),                              # Numeric variable without any missing values
      x4 = c("Hello", "I am not NA", NA, "I love R", NA)  # Character variable with two missing values
    )                                                     # (x4 is a factor in R versions before 4.0.0)
    expl_data1                                            # This is what our data with missing values looks like

**Table 1: Example Data Frame with Missing Values**

    which(is.na(expl_data1$x1))  # Same procedure as in Example 1, but with the column of a data frame;
                                 # missing value in x1 at position 1
    which(is.na(expl_data1$x2))  # Variable x2 has missing values at positions 3 and 4
    which(is.na(expl_data1$x3))  # The variable x3 in column 3 has no missing values
    which(is.na(expl_data1$x4))  # Variable x4 in column 4 has missing values at positions 3 and 5;
                                 # the same procedure works for character and factor variables

**Example 3: Identify missing values in an R data frame**

    is.na(expl_data1)                   # As in Example 1, is.na() returns logical TRUE and FALSE values,
                                        # here as a matrix indicating missing and observed values
    apply(is.na(expl_data1), 2, which)  # To get the positions of the missing values in each column
                                        # of your data set, you can use the apply() function
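If you would rather have one table of row and column indices than a separate result per column, the `arr.ind` argument of `which()` may help. The following is a minimal sketch; it re-creates the example data of Example 2 so that it runs on its own, and the name `na_positions` is only illustrative.

```r
expl_data1 <- data.frame(x1 = c(NA, 7, 8, 9, 3),
                         x2 = c(4, 1, NA, NA, 4),
                         x3 = c(1, 4, 2, 9, 6),
                         x4 = c("Hello", "I am not NA", NA, "I love R", NA))

# One row per missing value; the columns "row" and "col" hold the index positions
na_positions <- which(is.na(expl_data1), arr.ind = TRUE)
na_positions
```

Each row of `na_positions` describes one NA, so the number of rows equals the total number of missing values in the data frame.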

**Example 4: Detect missing values in a column of an R matrix**

    # Create a matrix based on the first three columns of our example data from Example 2
    expl_matrix1 <- as.matrix(expl_data1[ , 1:3])
    expl_matrix1
    which(is.na(expl_matrix1[ , 1]))  # The $ operator is invalid for columns of matrices;
                                      # therefore we select our matrix columns with square brackets
    which(is.na(expl_matrix1[ , 2]))  # Besides the change from the $ operator to square brackets,
                                      # we can apply the same functions as in the other examples
    which(is.na(expl_matrix1[ , 3]))  # Again, no missing values in x3

**Example 5: Identify NA values in a matrix**

    # We can check the missing values of the whole matrix with the same procedure as in Example 3
    apply(is.na(expl_matrix1), 2, which)

**Example 6: Find missing values in R with the complete.cases() function**

    # An alternative to the is.na() function is the function complete.cases(),
    # which searches for observed values instead of missing values
    which(complete.cases(expl_vec1))           # Identify observed values (opposite of the result in Example 1)
    which(complete.cases(expl_vec1) == FALSE)  # Reproduce the result of Example 1 by adding == FALSE
    complete.cases(expl_data1)                 # If a data frame or matrix is checked by complete.cases(),
                                               # the function returns a logical vector indicating
                                               # whether each row is complete
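The row-wise result of `complete.cases()` can also be used to look at the incomplete rows directly. The following is a minimal sketch, again re-creating the example data of Example 2; the names `incomplete_rows` and `data_incomplete` are only illustrative.

```r
expl_data1 <- data.frame(x1 = c(NA, 7, 8, 9, 3),
                         x2 = c(4, 1, NA, NA, 4),
                         x3 = c(1, 4, 2, 9, 6),
                         x4 = c("Hello", "I am not NA", NA, "I love R", NA))

incomplete_rows <- which(!complete.cases(expl_data1))     # Row numbers with at least one NA
incomplete_rows

data_incomplete <- expl_data1[!complete.cases(expl_data1), ]  # Keep only the incomplete rows
data_incomplete
```

Negating with `!` has the same effect as comparing with `== FALSE` and is the more common idiom in R.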

## Video Example – Detect Missing Values in a Real Data Set

The following video from my YouTube channel shows in a live example how to find NA, how to count NA, how to omit NA, and how to remove missing values.

Have a look at minute 1:05.

There, I show the same approach that I explained in Example 1.


## R – Count Missing Values per Row and Column

Besides the positions of your missing data, the question might arise of how to count missing values per row, by column, or in a single vector. Let’s check how to do this based on our example data above:

    # With the sum() and the is.na() functions you can find the number of missing values in your data
    sum(is.na(expl_vec1))     # Two missings in our vector
    sum(is.na(expl_data1))    # The same method works for the whole data frame; five missings overall
    sum(is.na(expl_matrix1))  # The procedure works also for matrices; the NA count is three in our case
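The calls above return one overall count. To get the counts per column and per row, one option is to combine `is.na()` with the base R functions `colSums()` and `rowSums()`. The following is a minimal sketch that re-creates the example data of Example 2; the names `na_per_column` and `na_per_row` are only illustrative.

```r
expl_data1 <- data.frame(x1 = c(NA, 7, 8, 9, 3),
                         x2 = c(4, 1, NA, NA, 4),
                         x3 = c(1, 4, 2, 9, 6),
                         x4 = c("Hello", "I am not NA", NA, "I love R", NA))

na_per_column <- colSums(is.na(expl_data1))  # Number of NA's in each column
na_per_column

na_per_row <- rowSums(is.na(expl_data1))     # Number of NA's in each row
na_per_row
```

This works because `is.na()` returns TRUE and FALSE values, which are treated as 1 and 0 when summed.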

## How to Handle Missing Data in R?

Once we have found missing values and their index positions in our data, the question arises of how we should treat these not available values. Complete case data is needed for most data analyses in R!

Many R modelling functions, such as lm(), apply listwise deletion by default (na.action = na.omit), which deletes all rows with missing values in one or more of the variables used.

Basic data manipulations can be done with the na.omit command or with the is.na R function.
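As a rough illustration of these two tools, the following minimal sketch removes incomplete rows with `na.omit()` and drops missing elements from a vector with `is.na()`. It re-creates the example objects used earlier; the names `data_complete` and `vec_observed` are only illustrative.

```r
expl_vec1 <- c(4, 8, 12, NA, 99, -20, NA)
expl_data1 <- data.frame(x1 = c(NA, 7, 8, 9, 3),
                         x2 = c(4, 1, NA, NA, 4),
                         x3 = c(1, 4, 2, 9, 6),
                         x4 = c("Hello", "I am not NA", NA, "I love R", NA))

data_complete <- na.omit(expl_data1)         # Listwise deletion: drop every row that contains an NA
data_complete

vec_observed <- expl_vec1[!is.na(expl_vec1)] # Keep only the observed values of the vector
vec_observed
```

In this small example only the second row is complete, which shows how quickly listwise deletion can shrink a data set.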

A more sophisticated approach – which is usually preferable to a complete case analysis – is the imputation of missing values.

Very simple imputation approaches would be mean imputation (mode imputation in case of categorical variables) or the replacement of NA’s with 0.
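To make the simple approaches concrete, here is a minimal sketch of mean imputation and zero replacement for a single numeric variable. It uses the values of column x1 from Example 2; the names `x1_mean_imp` and `x1_zero_imp` are only illustrative, and the sketch is not meant as a recommendation of either method.

```r
x1 <- c(NA, 7, 8, 9, 3)                                    # Values of x1 from the example data

x1_mean_imp <- x1
x1_mean_imp[is.na(x1_mean_imp)] <- mean(x1, na.rm = TRUE)  # Replace NA's with the mean of the observed values
x1_mean_imp

x1_zero_imp <- x1
x1_zero_imp[is.na(x1_zero_imp)] <- 0                       # Replace NA's with 0
x1_zero_imp
```

Note that `na.rm = TRUE` is needed when computing the mean, because `mean()` returns NA as soon as one value is missing.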

However, in order to create a more reasonable complete data set, missing data imputation usually replaces missing values with estimates that are based on statistical models (e.g. via regression imputation or predictive mean matching).

## Now It’s Your Turn

So that is how I’m checking for missing values in my data sets.

Now I’d like to hear about your thoughts: What’s your favorite approach?

Are you going to use the is.na function of Example 1? Or will you find NA’s by searching for complete cases?

Let me know by leaving a comment below. I will respond to every question!

## Appendix

**How to create the graphic of the header of this page**

The header graphic shows a simple dotplot created with the R package ggplot2.

The dark blue values indicate observed values; the light blue values indicate missingness.

Since the missing values appear more often in the upper right part of the plot, they cannot be considered Missing Completely At Random anymore.

    library(ggplot2)                                          # Load ggplot2 (needed for the plot)
    set.seed(8765)                                            # Reproducibility
    var1 <- rnorm(2000, 10, 3)                                # Normal distribution
    var2 <- var1 + rnorm(2000)                                # Correlated normal distribution
    range01 <- function(x){(x - min(x)) / (max(x) - min(x))}  # Scale probabilities of missingness between 0 and 1
    var2_miss <- rbinom(2000, 1, range01(var1^3)) == 1        # Flag missing values for var2 in dependence of var1
    data_ggplot_missings <- data.frame(var1, var2)            # Store var1 and var2 in a data frame
    colours <- rep(1, 2000)                                   # Set colours
    colours[var2_miss] <- 2
    ggplot_missings <- ggplot(data_ggplot_missings, aes(x = var1, y = var2)) +  # Create ggplot
      geom_point(aes(col = colours, size = 1.1)) +
      theme(legend.position = "none")
