# R NA – What are <Not Available> Values?

Your data contains NA, <NA>, or NaN values? That’s not the end of the world, but your alarm bells should start ringing!

In R (or RStudio), NA stands for **Not Available**. Each cell of your data that displays NA is a missing value.

Not available values are sometimes enclosed by < and >, i.e. **<NA>**. That happens when the vector or column that contains the NA is a factor (or a character column of a data frame).

In R, NA needs to be distinguished from NaN. NaN stands for **Not a Number** and represents an undefined or unrepresentable value. It appears, for instance, when you divide zero by zero (0 / 0). Dividing a nonzero number by zero returns Inf or -Inf instead.
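The difference between NA, NaN, and Inf can be checked directly in the console. The following is a minimal sketch; the vector `vals` is an assumed example that is not part of the data used later.

```r
# Undefined and infinite results of division
0 / 0                             # NaN: zero divided by zero is undefined
1 / 0                             # Inf: nonzero number divided by zero
-1 / 0                            # -Inf

# Assumed example vector containing all special values
vals <- c(1, NA, NaN, Inf)

na_flags  <- is.na(vals)          # TRUE for NA and for NaN
nan_flags <- is.nan(vals)         # TRUE only for NaN
inf_flags <- is.infinite(vals)    # TRUE only for Inf and -Inf
```

Note that is.na flags NaN as well, which is why functions such as na.omit also drop rows that contain NaN.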

Consider the following **example in R**:

```r
# Create some example variables
x1 <- c(7, 9, NA, 2, 5)
x2 <- as.factor(c(NA, 2, NA, 1, 1))
x3 <- c(4, NaN, 0, 9, 8)
x4 <- c(6, 1, 5, 5, 7)

# Create data.frame
data <- data.frame(x1, x2, x3, x4)
```

**Table 1: R Example Data with NA, <NA> & NaN**

The column x1 of our R example data has one missing value in the third row. The missing value is displayed as NA, since **the column is numeric**.

Column x2 has two missing values, in the first and third rows. The missing values are represented by <NA>, since **the second column is a factor**.

The third column x3 is of class numeric (the same as x1). The second entry of the column is **not a number** and is therefore displayed as NaN.

The fourth column x4 is **complete** and therefore does not contain any NAs or NaNs.
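The column descriptions above can be verified with a few lines of base R. This is a minimal sketch that recreates the example data so that it runs on its own; the object names `col_classes`, `nan_in_x3`, and `x4_has_na` are chosen here for illustration.

```r
# Recreate the example data
x1 <- c(7, 9, NA, 2, 5)
x2 <- as.factor(c(NA, 2, NA, 1, 1))
x3 <- c(4, NaN, 0, 9, 8)
x4 <- c(6, 1, 5, 5, 7)
data <- data.frame(x1, x2, x3, x4)

col_classes <- sapply(data, class)    # Class of each column
col_classes
#        x1        x2        x3        x4
# "numeric"  "factor" "numeric" "numeric"

nan_in_x3 <- is.nan(data$x3)          # Only the second entry of x3 is NaN
x4_has_na <- anyNA(data$x4)           # FALSE, since x4 is complete
```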

## Important Functions for Dealing with NAs

In the following, I’ll show you some of the most important approaches and functions of the R programming language for the **handling of missing data**. I’ll use the example data frame that we created above.

### na.omit

The na.omit function is used to exclude rows of a data set with one or more missing values.

```r
na.omit(data)
# x1 x2 x3 x4
#  2  1  9  5
#  5  1  8  7
```

na.omit can also be used to delete NAs in a vector…

```r
na.omit(data$x1)
# [1] 7 9 2 5
```

…or in a list.

```r
# Create some data frames and matrices
data_1 <- data[ , 1:2]
data_2 <- data[1:3, 3:4]
data_3 <- matrix(ncol = 2, c(0, NA, -4, 3, 2, 1))

# Store data frames and matrix in list
data_list <- list(data_1, data_2, data_3)

# Create empty list
data_list_na.omit <- list()

# For loop for removal of rows with NAs in whole list
for(i in seq_along(data_list)) {
  data_list_na.omit[[i]] <- na.omit(data_list[[i]])
}
```

Note: With such a for loop, any function can be applied to each element of a list (not only na.omit).
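As an alternative to the for loop, the same result can be obtained with lapply. The following minimal sketch recreates the example objects so that it runs on its own; `rows_left` is a name introduced here for illustration.

```r
# Recreate the example data and list
x1 <- c(7, 9, NA, 2, 5)
x2 <- as.factor(c(NA, 2, NA, 1, 1))
x3 <- c(4, NaN, 0, 9, 8)
x4 <- c(6, 1, 5, 5, 7)
data <- data.frame(x1, x2, x3, x4)

data_1 <- data[ , 1:2]
data_2 <- data[1:3, 3:4]
data_3 <- matrix(ncol = 2, c(0, NA, -4, 3, 2, 1))
data_list <- list(data_1, data_2, data_3)

# Apply na.omit to every list element without an explicit loop
data_list_na.omit <- lapply(data_list, na.omit)

# Number of rows remaining in each element
rows_left <- sapply(data_list_na.omit, nrow)
rows_left
# [1] 3 2 2
```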

### na.rm

na.rm is an argument of many functions. Setting na.rm = TRUE removes the NAs before the function computes its result. For instance, na.rm can be used in combination with the functions mean…

```r
mean(data$x1, na.rm = TRUE)
# [1] 5.75
```

…and max.

```r
max(data$x1, na.rm = TRUE)
# [1] 9
```
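To see why na.rm matters, it helps to compare the results with and without the argument. A minimal sketch, using a standalone copy of the x1 values:

```r
x1 <- c(7, 9, NA, 2, 5)                  # Same values as data$x1

mean_default <- mean(x1)                 # NA, since one value is missing
mean_na_rm   <- mean(x1, na.rm = TRUE)   # 5.75
sum_na_rm    <- sum(x1, na.rm = TRUE)    # 23
```

Without na.rm = TRUE, a single NA is enough to make the result NA.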

### use

A common source of confusion: the cor function has no na.rm argument. Missing values are handled via its use argument instead.

```r
cor(data$x1, data$x3, use = "complete.obs")
# [1] -0.9011271
```
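The effect of the use argument becomes visible when it is compared with the default behaviour. A minimal sketch with standalone copies of x1 and x3:

```r
x1 <- c(7, 9, NA, 2, 5)
x3 <- c(4, NaN, 0, 9, 8)

# Default (use = "everything"): missing values propagate to the result
r_default <- cor(x1, x3)                                  # NA

# Only pairs where both values are observed are used
r_complete <- cor(x1, x3, use = "complete.obs")           # -0.9011271
r_pairwise <- cor(x1, x3, use = "pairwise.complete.obs")  # Same for two vectors
```

For two vectors, "complete.obs" and "pairwise.complete.obs" give the same result; they can differ when a correlation matrix of several columns is computed.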

### complete.cases

The complete.cases function creates a logical vector that marks the complete rows of our data frame with TRUE.

```r
complete.cases(data)
# [1] FALSE FALSE FALSE  TRUE  TRUE
```

The function can also be used for casewise deletion (same as na.omit).

```r
data[complete.cases(data), ]
# x1 x2 x3 x4
#  2  1  9  5
#  5  1  8  7
```
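complete.cases can also be restricted to a subset of columns, which keeps rows that are only incomplete in variables you do not need. A minimal sketch, assuming that only x1 and x4 are relevant for the analysis:

```r
# Recreate the example data
x1 <- c(7, 9, NA, 2, 5)
x2 <- as.factor(c(NA, 2, NA, 1, 1))
x3 <- c(4, NaN, 0, 9, 8)
x4 <- c(6, 1, 5, 5, 7)
data <- data.frame(x1, x2, x3, x4)

# Check completeness only for the columns x1 and x4
keep <- complete.cases(data[ , c("x1", "x4")])
keep
# [1]  TRUE  TRUE FALSE  TRUE  TRUE

data_subset <- data[keep, ]       # Only row 3 is removed
```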

### is.na

is.na is also used to identify missing values via TRUE and FALSE (TRUE indicates NA). In contrast to the function complete.cases, is.na retains the dimensions of our data frame.

```r
is.na(data)
#    x1    x2    x3    x4
# FALSE  TRUE FALSE FALSE
# FALSE FALSE  TRUE FALSE
#  TRUE  TRUE FALSE FALSE
# FALSE FALSE FALSE FALSE
# FALSE FALSE FALSE FALSE
```

### !is.na

!is.na (with a ! in front) does the opposite of is.na.

```r
!is.na(data)
#    x1    x2    x3    x4
#  TRUE FALSE  TRUE  TRUE
#  TRUE  TRUE FALSE  TRUE
# FALSE FALSE  TRUE  TRUE
#  TRUE  TRUE  TRUE  TRUE
#  TRUE  TRUE  TRUE  TRUE
```

### which

Combined with the function which, logical vectors can be used to find the positions of missing values.

```r
which(is.na(data$x1))
# [1] 3
```
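For a whole data frame, which can return row and column positions when arr.ind = TRUE is set. A minimal sketch that recreates the example data; `na_positions` is a name introduced here for illustration.

```r
# Recreate the example data
x1 <- c(7, 9, NA, 2, 5)
x2 <- as.factor(c(NA, 2, NA, 1, 1))
x3 <- c(4, NaN, 0, 9, 8)
x4 <- c(6, 1, 5, 5, 7)
data <- data.frame(x1, x2, x3, x4)

# Row and column index of every missing cell (NaN is counted as missing)
na_positions <- which(is.na(data), arr.ind = TRUE)

na_rows <- as.vector(na_positions[ , "row"])   # 3 1 3 2
na_cols <- as.vector(na_positions[ , "col"])   # 1 2 2 3
```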

### sum

Another benefit of logical vectors is that they make it easy to count the number of missing values. The function sum can be used together with is.na to count NA values in R.

```r
sum(is.na(data$x1))
# [1] 1
```
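The same idea extends to all columns at once via colSums, and mean returns the share of missing values instead of the count. A minimal sketch that recreates the example data:

```r
# Recreate the example data
x1 <- c(7, 9, NA, 2, 5)
x2 <- as.factor(c(NA, 2, NA, 1, 1))
x3 <- c(4, NaN, 0, 9, 8)
x4 <- c(6, 1, 5, 5, 7)
data <- data.frame(x1, x2, x3, x4)

na_per_column <- colSums(is.na(data))   # Count of missing values per column
na_per_column
# x1 x2 x3 x4
#  1  2  1  0

na_share_x1 <- mean(is.na(data$x1))     # Proportion of missing values in x1
na_share_x1
# [1] 0.2
```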

### summary

The summary function provides another way to count NA values in a data table, column, array, or vector.

```r
summary(data)
```

**Table 2: Summary Function in R Counts NAs in Each Column**

In the bottom cell of each column of Table 2, the number of NAs is displayed.

### Merge Complete Data via rbind and na.omit

The functions rbind and na.omit can be combined in order to merge (i.e. row bind) only complete rows.

```r
# Create 2 data sets; NA in data_merge_2
data_merge_1 <- data.frame(x1 = c(5, 9, 8),
                           x2 = c(1, 2, 3))
data_merge_2 <- data.frame(x1 = c(2, NA, 8),
                           x2 = c(6, 9, 3))

# Merge data sets and keep only complete rows
data_merge <- na.omit(rbind(data_merge_1, data_merge_2))
data_merge                        # Display merged data
```

### R Remove NA, NaN, and Inf

It is also possible to exclude all rows with NA, NaN, and/or Inf values.

```r
# Create data with NA, NaN, and Inf
data_inf <- data
data_inf[5, 4] <- Inf

# Remove NA, NaN, and Inf
data_no_na_nan_inf <- data_inf[complete.cases(data_inf) &
                               apply(data_inf, 1, max) != "Inf", ]
data_no_na_nan_inf                # Display complete subset
```
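The comparison with the string "Inf" above relies on the rows being converted to character values and does not catch -Inf. A possible alternative, sketched here under the assumption that only numeric columns can contain infinite values, uses is.finite, which is FALSE for NA, NaN, Inf, and -Inf. The names `num_cols`, `finite_rows`, and `data_finite` are introduced for illustration.

```r
# Recreate the example data with an Inf value
x1 <- c(7, 9, NA, 2, 5)
x2 <- as.factor(c(NA, 2, NA, 1, 1))
x3 <- c(4, NaN, 0, 9, 8)
x4 <- c(6, 1, 5, 5, 7)
data <- data.frame(x1, x2, x3, x4)

data_inf <- data
data_inf[5, 4] <- Inf

# Identify numeric columns
num_cols <- vapply(data_inf, is.numeric, logical(1))

# TRUE for rows in which all numeric values are finite
finite_rows <- Reduce(`&`, lapply(data_inf[num_cols], is.finite))

# Keep rows that are complete and finite
data_finite <- data_inf[complete.cases(data_inf) & finite_rows, ]
data_finite                       # Only row 4 remains
```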

### Recode Values to NA

Sometimes existing values have to be recoded to NA. If you want to replace a certain value with NA, you can do it as follows.

```r
data_NA <- data                   # Replicate data
data_NA[data_NA == 1] <- NA       # Recode the value 1 to NA
```

If you want to recode a specific cell of your data matrix to NA, you can do it as follows.

```r
data_NA2 <- data                  # Replicate data
data_NA2[1, 3] <- NA              # Recode row 1, column 3 to NA
```
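Several values can be recoded in one step with %in%, which is handy when a data set uses special codes for missing answers. A minimal sketch; the vector `scores` and the codes -99 and 999 are made up for illustration.

```r
# Hypothetical vector in which -99 and 999 stand for missing answers
scores <- c(4, -99, 7, 999, 2)

# Recode both codes to NA at once
scores[scores %in% c(-99, 999)] <- NA
scores
# [1]  4 NA  7 NA  2
```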

### Replace NAs

Logical vectors can also be used to replace NA with other values, e.g. 0.

```r
vect_example <- data$x1
vect_example[is.na(vect_example)] <- 0
vect_example
# [1] 7 9 0 2 5
```
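The same indexing works with other replacement values and across several columns. The following minimal sketch replaces the NA in x1 with the mean of the observed values, and sets all missing entries of the numeric columns to 0. The factor column x2 is left untouched on purpose, since 0 is not one of its levels. The names `x1_mean`, `data_0`, and `num_cols` are introduced for illustration.

```r
# Recreate the example data
x1 <- c(7, 9, NA, 2, 5)
x2 <- as.factor(c(NA, 2, NA, 1, 1))
x3 <- c(4, NaN, 0, 9, 8)
x4 <- c(6, 1, 5, 5, 7)
data <- data.frame(x1, x2, x3, x4)

# Replace NA in x1 with the mean of the observed values
x1_mean <- data$x1
x1_mean[is.na(x1_mean)] <- mean(x1_mean, na.rm = TRUE)
x1_mean
# [1] 7.00 9.00 5.75 2.00 5.00

# Replace NA and NaN with 0 in all numeric columns
data_0 <- data
num_cols <- vapply(data_0, is.numeric, logical(1))
data_0[num_cols][is.na(data_0[num_cols])] <- 0
```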

### Missing Value Imputation

Missing data imputation replaces missing values by new values. Data imputation has many advantages compared to the deletion of rows/columns with NAs.

In the following example, we use the predictive mean matching imputation method. However, there are many other imputation methods such as regression imputation or mean imputation available.

```r
install.packages("mice")          # Install mice package in R
library("mice")                   # Load mice package

imp <- mice(data,                 # Impute data
            m = 1,
            seed = 123)
data_imp <- complete(imp)         # Store imputed data set
data_imp                          # Display imputed data
```

## Video Example – How to Handle NA Values

Need more help with your NA values in R? Then you should definitely have a look at the following video of my Statistical Programming YouTube channel.

In this video, I explain how to deal with incomplete data. I show easy-to-understand live examples and explain how to apply different functions such as is.na, na.omit, and na.rm.


## I Would Like to Hear From You

I’ve shown you my favourite ways to handle NA values in R.

Now, I would like to hear about **your experiences**.

Which of these methods is your favourite? Do you use any other methods that I missed above?

Let me know in the comments!

## Appendix

The header graphic of this page shows a correlation plot of two variables. Missing cases are illustrated via NA.

With the following code, the plot is created in R.

```r
N <- 50000                        # Sample size
x <- rnorm(N)                     # X variable
y <- rnorm(N)                     # Y variable

par(bg = "#353436")               # Set background color
par(mar = c(0, 0, 0, 0))          # Remove space around plot
plot(x, y,                        # Plot observed values
     col = "#1b98e0")
points(x[1:15], y[1:15],          # Plot missing values
       pch = 16, cex = 5, col = "#353436")
text(x[1:15], y[1:15],            # Write NA into each missing value
     "NA", col = "red")
```

## Comments

Hi Joachim,

My data has a specific column named “treatment” where the contents are 1) empty cells 2) drug 3) diet 4) unknown and 5) None.

I want to create a parallel column named “treatment_n” with drug replaced as 1 and all other content as 0.

Can you please help with this.

Thank you

Nara

Hey Nara,

that’s a great question. I have created an example, which simulates your problem. You can copy/paste the following code to your RStudio and run it yourself:
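A minimal sketch of such an example, assuming a hypothetical data frame `df` whose `treatment` column holds the values described in the question (the data below are made up for illustration):

```r
# Hypothetical example data with empty cells, NA, and several categories
df <- data.frame(treatment = c("", "drug", "diet", "unknown", "None", NA, "drug"),
                 stringsAsFactors = FALSE)

# New column: 1 for "drug", 0 for everything else (including empty cells and NA)
df$treatment_n <- as.integer(df$treatment %in% "drug")
df$treatment_n
# [1] 0 1 0 0 0 0 1
```

%in% returns FALSE for NA, so missing entries are coded as 0 as well.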

I hope that helps!

Regards,

Joachim

hi Joachim,

my data has ‘NA’ as real values (standard ISO 2 code for Namibia). How do I prevent R from seeing it as <NA>?

Hi Adekola,

You can specify “NA” as a character string or factor level. R differentiates between “NA” and NA.

For example:
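A minimal sketch, assuming a hypothetical three-element character vector `codes`:

```r
# "NA" in quotes is an ordinary string; NA without quotes is a missing value
codes <- c("NA", NA, NA)

is.na(codes)
# [1] FALSE  TRUE  TRUE
```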

The first element is considered as country code and the second and last elements are considered as missing data.
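When such codes are read from a file, the na.strings argument of read.csv controls which strings are converted to NA. The following sketch uses made-up inline CSV text (`csv_text` and its contents are assumptions for illustration) and treats only empty fields as missing, so that the country code "NA" is kept as text.

```r
# Hypothetical CSV content: Namibia ("NA"), Germany ("DE"), and one empty field
csv_text <- "country,value\nNA,1\nDE,2\n,3"

# Treat only empty fields as missing, not the string "NA"
countries <- read.csv(text = csv_text,
                      na.strings = "",
                      stringsAsFactors = FALSE)

is.na(countries$country)
# [1] FALSE FALSE  TRUE
```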

Greetings from Germany to Namibia!

Joachim

That was great, productive and beneficial.

How about using Maximum Likelihood or Expectation-Maximization Techniques to handle the missing data?

Hey Umar,

Thank you for the kind words!

I have never done this myself, but the mlmi package seems to provide functions for Maximum Likelihood Multiple Imputation in R. Have a look here.

Regards,

Joachim