R NA – What are <Not Available> Values?

Does your data contain NA, <NA>, or NaN values? That’s not the end of the world, but your alarm bells should start ringing!

In R (or RStudio), NA stands for Not Available. Each cell of your data that displays NA is a missing value.

Not available values are sometimes enclosed by < and >, i.e. <NA>. That happens when the vector or column that contains the NA is a factor.

In R, NA needs to be distinguished from NaN. NaN stands for Not a Number and represents an undefined or unrepresentable value. It appears, for instance, when you try to divide zero by zero (0 / 0). Dividing a nonzero number by zero returns Inf or -Inf instead.
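The difference between NA, NaN, and Inf can be checked directly in the console. The following lines are a minimal sketch in base R; the input values are chosen for illustration only.

```r
# NaN results from undefined operations; a nonzero number divided by zero gives Inf
nan_value <- 0 / 0   # NaN
inf_value <- 1 / 0   # Inf

# is.na() is TRUE for both NA and NaN; is.nan() is TRUE only for NaN
v <- c(NA, NaN, 1)
is.na(v)    # TRUE  TRUE FALSE
is.nan(v)   # FALSE TRUE FALSE
```

Because is.na also flags NaN, most of the NA functions shown on this page treat NaN values as missing as well.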

Consider the following example in R:

# Create some example variables
x1 <- c(7, 9, NA, 2, 5)
x2 <- as.factor(c(NA, 2, NA, 1, 1))
x3 <- c(4, NaN, 0, 9, 8)
x4 <- c(6, 1, 5, 5, 7)
 
# Create data.frame
data <- data.frame(x1, x2, x3, x4)

data
#   x1   x2  x3 x4
# 1  7 <NA>   4  6
# 2  9    2 NaN  1
# 3 NA <NA>   0  5
# 4  2    1   9  5
# 5  5    1   8  7

Table 1: R Example Data with NA, <NA> & NaN

 

The column x1 of our R example data has one missing value in the third row. The missing value is displayed as NA, since the column is numeric.

Column x2 has two missing values, in the first and third rows. The missing values are represented by <NA>, since the second column is a factor.

The third column x3 is of class numeric (the same as x1). The second entry of the column is not a number and is therefore displayed as NaN.

The fourth column x4 is complete and therefore does not contain any NAs or NaNs.
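The column classes and the presence of missing values described above can also be verified programmatically. This is a small sketch that rebuilds the example data so it runs on its own; the object names col_classes and col_has_na are made up for this illustration.

```r
# Rebuild the example data
x1 <- c(7, 9, NA, 2, 5)
x2 <- as.factor(c(NA, 2, NA, 1, 1))
x3 <- c(4, NaN, 0, 9, 8)
x4 <- c(6, 1, 5, 5, 7)
data <- data.frame(x1, x2, x3, x4)

col_classes <- sapply(data, class)   # Class of each column
col_has_na  <- sapply(data, anyNA)   # TRUE if a column contains NA or NaN

col_classes
col_has_na
```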

 

Important Functions for Dealing with NAs

In the following, I’ll show you some of the most important approaches and functions of the R programming language for handling missing data. I’ll use the example data frame that we created above.

 

na.omit

The na.omit function excludes all rows of a data set that contain one or more missing values.

na.omit(data)
#   x1 x2 x3 x4
# 4  2  1  9  5
# 5  5  1  8  7

na.omit can also be used to delete NAs in a vector…

na.omit(data$x1)
# [1] 7 9 2 5

…or in a list.

# Create some data frames and matrices
data_1 <- data[ , 1:2]
data_2 <- data[1:3, 3:4]
data_3 <- matrix(c(0, NA, -4, 3, 2, 1), ncol = 2)
 
# Store data frames and matrix in list
data_list <- list(data_1, data_2, data_3)
 
# Create empty list
data_list_na.omit <- list()
 
# For loop for removal of rows with NAs in whole list
for(i in seq_along(data_list)) {
  data_list_na.omit[[i]] <- na.omit(data_list[[i]])
}

Note: Such a for loop can apply any function to each element of a list, not only na.omit.
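The same result can be obtained without an explicit loop. The sketch below uses lapply from base R, which applies a function to every list element and returns a list; it rebuilds the example objects so that it runs on its own.

```r
# Rebuild the example data and the list of data frames / matrix
data <- data.frame(x1 = c(7, 9, NA, 2, 5),
                   x2 = as.factor(c(NA, 2, NA, 1, 1)),
                   x3 = c(4, NaN, 0, 9, 8),
                   x4 = c(6, 1, 5, 5, 7))
data_list <- list(data[ , 1:2],
                  data[1:3, 3:4],
                  matrix(c(0, NA, -4, 3, 2, 1), ncol = 2))

# Apply na.omit to each list element
data_list_na.omit <- lapply(data_list, na.omit)

# Number of remaining rows per list element
rows_left <- sapply(data_list_na.omit, nrow)
rows_left
# [1] 3 2 2
```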

 

na.rm

na.rm is an argument of many R functions. Setting na.rm = TRUE removes NAs from the input before the function is evaluated. For instance, na.rm can be used in combination with the functions mean…

mean(data$x1, na.rm = TRUE)
# [1] 5.75

…and max.

max(data$x1, na.rm = TRUE)
# [1] 9
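To see why na.rm matters, it helps to compare the results with and without the argument. A minimal sketch using the values of the column x1:

```r
x1 <- c(7, 9, NA, 2, 5)

mean_default <- mean(x1)                 # NA, because one value is missing
mean_na_rm   <- mean(x1, na.rm = TRUE)   # 5.75
sum_na_rm    <- sum(x1, na.rm = TRUE)    # 23

mean_default
mean_na_rm
sum_na_rm
```

By default, a single NA makes the whole result NA, so forgetting na.rm = TRUE is a frequent cause of unexpected NA output.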

 

use

A common source of confusion: the function cor handles missing values through its use argument rather than na.rm.

cor(data$x1, data$x3, use = "complete.obs")
# [1] -0.9011271
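The following sketch compares the default behaviour of cor with two settings of the use argument. For two vectors, "complete.obs" and "pairwise.complete.obs" give the same result; they can differ when cor is applied to a matrix or data frame with several columns.

```r
x1 <- c(7, 9, NA, 2, 5)
x3 <- c(4, NaN, 0, 9, 8)

r_default  <- cor(x1, x3)                                 # NA (missing values propagate)
r_complete <- cor(x1, x3, use = "complete.obs")           # Only rows complete in both vectors
r_pairwise <- cor(x1, x3, use = "pairwise.complete.obs")  # Same result for two vectors

r_default
r_complete
r_pairwise
```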

 

complete.cases

The complete.cases function creates a logical vector that marks each complete row of our data frame with TRUE.

complete.cases(data)
# [1] FALSE FALSE FALSE  TRUE  TRUE

The function can also be used for casewise deletion (same as na.omit).

data[complete.cases(data), ]
#   x1 x2 x3 x4
# 4  2  1  9  5
# 5  5  1  8  7
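complete.cases can also be restricted to selected columns, which keeps rows that are only incomplete in variables you do not need. The sketch below rebuilds the example data and checks only x1 and x4; the choice of these two columns is arbitrary.

```r
data <- data.frame(x1 = c(7, 9, NA, 2, 5),
                   x2 = as.factor(c(NA, 2, NA, 1, 1)),
                   x3 = c(4, NaN, 0, 9, 8),
                   x4 = c(6, 1, 5, 5, 7))

# Keep all rows that are complete in x1 and x4 (x2 and x3 are ignored)
data_sub <- data[complete.cases(data[ , c("x1", "x4")]), ]
data_sub
```

Only the third row is dropped here, because it is the only row with a missing value in x1 or x4.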

 

is.na

is.na also identifies missing values via TRUE and FALSE (TRUE indicates NA or NaN). In contrast to the function complete.cases, is.na retains the dimensions of our data frame.

is.na(data)
# x1    x2    x3    x4
# FALSE TRUE  FALSE FALSE
# FALSE FALSE TRUE  FALSE
# TRUE  TRUE  FALSE FALSE
# FALSE FALSE FALSE FALSE
# FALSE FALSE FALSE FALSE

 

!is.na

!is.na (with a ! in front) does the opposite of is.na.

!is.na(data)
# x1    x2    x3    x4
# TRUE  FALSE TRUE  TRUE
# TRUE  TRUE  FALSE TRUE
# FALSE FALSE TRUE  TRUE
# TRUE  TRUE  TRUE  TRUE
# TRUE  TRUE  TRUE  TRUE

 

which

Combined with the function which, logical vectors can be used to find the positions of missing values.

which(is.na(data$x1))
# [1] 3
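For a whole data frame, which can return the row and column position of every missing cell when arr.ind = TRUE is set. A short sketch with the example data, rebuilt here so the code runs on its own:

```r
data <- data.frame(x1 = c(7, 9, NA, 2, 5),
                   x2 = as.factor(c(NA, 2, NA, 1, 1)),
                   x3 = c(4, NaN, 0, 9, 8),
                   x4 = c(6, 1, 5, 5, 7))

# One line per missing cell: row index and column index
na_pos <- which(is.na(data), arr.ind = TRUE)
na_pos
```

The result is a matrix with the columns row and col, listing the cells (3, 1), (1, 2), (3, 2), and (2, 3).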

 

sum

Another benefit of logical vectors is the possibility to count the number of missing values. The function sum can be used together with is.na to count NA values in R.

sum(is.na(data$x1))
# [1] 1
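The same idea extends to all columns at once. The sketch below uses colSums and colMeans from base R to get the count and the share of missing values per column; the object names na_count and na_share are chosen for this illustration.

```r
data <- data.frame(x1 = c(7, 9, NA, 2, 5),
                   x2 = as.factor(c(NA, 2, NA, 1, 1)),
                   x3 = c(4, NaN, 0, 9, 8),
                   x4 = c(6, 1, 5, 5, 7))

na_count <- colSums(is.na(data))    # Number of missing values per column
na_share <- colMeans(is.na(data))   # Proportion of missing values per column

na_count
# x1 x2 x3 x4 
#  1  2  1  0 
na_share
```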

 

summary

The summary function provides another way to count NA values in a data table, column, array, or vector.

summary(data)


Table 2: Summary Function in R Counts NAs in Each Column

In the bottom cell of each column of Table 2 that contains missing values, the number of NAs is displayed.

 

Merge Complete Data via rbind and na.omit

The functions rbind and na.omit can be combined in order to merge (i.e. row bind) only complete rows.

# Create 2 data sets; NA in data_merge_2
data_merge_1 <- data.frame(x1 = c(5, 9, 8),
                           x2 = c(1, 2, 3))
data_merge_2 <- data.frame(x1 = c(2, NA, 8),
                           x2 = c(6, 9, 3))
 
# Merge data sets and keep only complete rows
data_merge <- na.omit(rbind(data_merge_1, data_merge_2))
data_merge # Display merged data

 

R Remove NA, NaN, and Inf

It is also possible to exclude all rows with NA, NaN, and/or Inf values.

# Create data with NA, NaN, and Inf
data_inf <- data
data_inf[5, 4] <- Inf
 
# Remove NA, NaN, and Inf
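num_cols <- sapply(data_inf, is.numeric)            # Numeric columns only
all_finite <- apply(sapply(data_inf[num_cols],      # Rows without Inf or -Inf
                           is.finite), 1, all)
data_no_na_nan_inf <- data_inf[
  complete.cases(data_inf) & all_finite, ]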
data_no_na_nan_inf # Display complete subset
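For a single numeric vector, the base R function is.finite offers a compact way to drop NA, NaN, Inf, and -Inf in one step. A minimal sketch with made-up values:

```r
x <- c(1, NA, NaN, Inf, -Inf, 3)

finite_flag <- is.finite(x)   # TRUE only for ordinary numbers
x_clean <- x[finite_flag]     # Keep finite values only

finite_flag
# [1]  TRUE FALSE FALSE FALSE FALSE  TRUE
x_clean
# [1] 1 3
```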

 

Recode Values to NA

Sometimes existing values have to be recoded to NA. If you want to replace a certain value with NA, you can do it as follows.

data_NA <- data              # Replicate data
data_NA[data_NA == 1] <- NA  # Recode the value 1 to NA

If you want to recode a specific cell of your data matrix to NA, you can do it as follows.

data_NA2 <- data             # Replicate data
data_NA2[1, 3] <- NA         # Recode row 1, column 3 to NA
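Data sets often use several placeholder codes for missing values. The sketch below assumes, purely for illustration, that -99 and 999 are such codes and recodes both to NA with %in%.

```r
x <- c(3, -99, 5, 999, 8)     # Hypothetical vector with placeholder codes

x[x %in% c(-99, 999)] <- NA   # Recode both placeholder codes to NA
x
# [1]  3 NA  5 NA  8
```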

 

Replace NAs

Logical vectors can also be used to replace NA with other values, e.g. 0.

vect_example <- data$x1
vect_example[is.na(vect_example)] <- 0
vect_example
# [1] 7 9 0 2 5
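The same replacement works for several columns at once, because is.na returns a logical matrix for a data frame. The following sketch is restricted to the numeric columns x1 and x3 of the example data; note that the NaN in x3 is replaced as well.

```r
data_num <- data.frame(x1 = c(7, 9, NA, 2, 5),
                       x3 = c(4, NaN, 0, 9, 8))

data_num[is.na(data_num)] <- 0   # Replace NA and NaN in all columns with 0
data_num
```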

 

Missing Value Imputation

Missing data imputation replaces missing values with new values. Compared to deleting rows or columns with NAs, imputation keeps the full sample size and often introduces less bias.

In the following example, we use the predictive mean matching imputation method. However, many other imputation methods, such as regression imputation or mean imputation, are available.

install.packages("mice")   # Install mice package in R
library("mice")            # Load mice package
 
imp <- mice(data,          # Impute data
            m = 1, 
            seed = 123)
data_imp <- complete(imp)  # Store imputed data set
 
data_imp                   # Display imputed data
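For comparison, mean imputation (mentioned above as an alternative method) can be sketched in base R without any package. This is only an illustration of the idea, not a recommendation: replacing all missing values with the same mean reduces the variance of the variable.

```r
x1 <- c(7, 9, NA, 2, 5)

x1_imp <- x1                                       # Copy the vector
x1_imp[is.na(x1_imp)] <- mean(x1, na.rm = TRUE)    # Fill NAs with the observed mean
x1_imp
# [1] 7.00 9.00 5.75 2.00 5.00
```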

 

Video Example – How to Handle NA Values

Need more help with your NA values in R? Then you should definitely have a look at the following video of my Statistical Programming YouTube channel.

In this video, I explain how to deal with incomplete data. I show easy-to-understand live examples and explain how to apply different functions such as is.na, na.omit, and na.rm.


 

I Would Like to Hear From You

I’ve shown you my favourite ways to handle NA values in R.

Now, I would like to hear about your experiences.

Which of these methods is your favourite? Do you use any other methods that I missed above?

Let me know in the comments!

 

Appendix

The header graphic of this page shows a correlation plot of two variables. Missing cases are illustrated via NA.

With the following code, the plot is created in R.

N <- 50000                  # Sample size
x <- rnorm(N)               # X variable
y <- rnorm(N)               # Y variable
 
par(bg = "#353436")         # Set background color
par(mar = c(0, 0, 0, 0))    # Remove space around plot
 
plot(x, y,                  # Plot observed values
     col = "#1b98e0")
points(x[1:15], y[1:15],    # Plot missing values
       pch = 16, cex = 5, 
       col = "#353436")
text(x[1:15], y[1:15],      # Write NA into each missing value
     "NA",
     col = "red")

 



6 Comments. Leave new

  • KNarasimhan
    May 1, 2019 8:46 am

    Hi Joachim,
    My data has a specific column named “treatment” where the contents are 1) empty cells 2) drug 3) diet 4) unknown and 5) None.
    I want to create a parallel column named “treatment_n” with drug replaced as 1 and all other content as 0.
    Can you please help with this.
    Thank you
    Nara

    • Hey Nara,

      that’s a great question. I have created an example, which simulates your problem. You can copy/paste the following code to your RStudio and run it yourself:

      # Example data
      data <- data.frame(treatment = c("drug", NA, "drug", "diet", "diet", "unknown", " "))
       
      # Create new column treatment_n
      data$treatment_n <- 0
       
      # Replace drug with 1
      data$treatment_n[data$treatment == "drug"] <- 1
       
      # Evaluate final data frame
      data
       
      # treatment treatment_n
      #      drug           1
      #      <NA>           0
      #      drug           1
      #      diet           0
      #      diet           0
      #   unknown           0
      #                     0

      I hope that helps!

      Regards,

      Joachim

  • Adekola Owoyemi
    November 24, 2020 6:27 pm

    hi Joachim,

    my data has ‘NA’ as real values (standard ISO 2 code for Namibia). How do I prevent R from seeing it as <NA>?

    • Hi Adekola,

      You can specify “NA” as a character string or factor level. R differentiates between “NA” and NA.

      For example:

      countries <- c("NA", NA, "DE", "GB", NA)

      The first element is considered as country code and the second and last elements are considered as missing data.

      Greetings from Germany to Namibia!

      Joachim

  • That was great, productive and beneficial.

    How about using Maximum Likelihood or Expectation-Maximization Techniques to handle the missing data?
