Replace Missing Values by Column Mean in R (3 Examples)

In this R tutorial you’ll learn how to replace NA values with the mean of a data frame variable.

The content of the post is structured as follows:

1) Creation of Example Data

2) Example 1: Replacing Missing Data in One Specific Variable Using is.na() & mean() Functions

3) Example 2: Replacing Missing Data in All Variables Using for-Loop

4) Example 3: Replacing Missing Data in All Variables Using na.aggregate() Function of zoo Package

5) Video, Further Resources & Summary

Let’s get started…

Creation of Example Data

As a first step, we’ll have to create some example data:

data <- data.frame(x1 = c(NA, 2:10),                       # Create data frame
                   x2 = c(rep(5, 8), NA, NA),
                   x3 = c(4, NA, 1, 5, 6, 7, NA, 5, 9, 0))
data                                                       # Print data frame
#    x1 x2 x3
# 1  NA  5  4
# 2   2  5 NA
# 3   3  5  1
# 4   4  5  5
# 5   5  5  6
# 6   6  5  7
# 7   7  5 NA
# 8   8  5  5
# 9   9 NA  9
# 10 10 NA  0

As you can see based on the previous output of the RStudio console, our example data has ten rows and three numeric columns. Each of the variables contains at least one missing value (i.e. NA).
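
If you want to verify the number of missing values per column, the following minimal sketch may help. It recreates the example data and counts the NA values with colSums and is.na; the object name na_counts is only chosen for illustration:

```r
data <- data.frame(x1 = c(NA, 2:10),                       # Recreate example data
                   x2 = c(rep(5, 8), NA, NA),
                   x3 = c(4, NA, 1, 5, 6, 7, NA, 5, 9, 0))

na_counts <- colSums(is.na(data))                          # Count NA values per column
na_counts                                                  # Print NA counts
# x1 x2 x3 
#  1  2  2 
```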

Example 1: Replacing Missing Data in One Specific Variable Using is.na() & mean() Functions

In this example, I’ll show how to replace the NA values in only one particular data frame column with the mean of that column. For this, we can use the is.na and mean functions as shown below:

data1 <- data                                              # Duplicate data frame
data1$x1[is.na(data1$x1)] <- mean(data1$x1, na.rm = TRUE)  # Replace NA in one column
data1                                                      # Print updated data frame
#    x1 x2 x3
# 1   6  5  4
# 2   2  5 NA
# 3   3  5  1
# 4   4  5  5
# 5   5  5  6
# 6   6  5  7
# 7   7  5 NA
# 8   8  5  5
# 9   9 NA  9
# 10 10 NA  0

Have a look at the previous output of the RStudio console: As you can see, the first cell in the variable x1 was replaced by the mean of the variable x1 (i.e. 6).
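
To make the one-line replacement of this example easier to follow, here is a sketch that splits it into its individual steps. The object names na_rows, x1_mean, and x1_filled are hypothetical helpers that do not appear in the code above:

```r
data <- data.frame(x1 = c(NA, 2:10),                       # Recreate example data
                   x2 = c(rep(5, 8), NA, NA),
                   x3 = c(4, NA, 1, 5, 6, 7, NA, 5, 9, 0))

na_rows <- is.na(data$x1)                                  # TRUE where x1 is missing
which(na_rows)                                             # Position of the missing value
# [1] 1

x1_mean <- mean(data$x1, na.rm = TRUE)                     # Mean of the observed values 2 to 10
x1_mean
# [1] 6

x1_filled <- data$x1                                       # Copy the column
x1_filled[na_rows] <- x1_mean                              # Overwrite only the missing positions
```

The logical vector returned by is.na selects exactly the missing cells, so all observed values stay untouched.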

Example 2: Replacing Missing Data in All Variables Using for-Loop

This example illustrates how to replace the missing values in all columns of your data with a for-loop. Note that this approach assumes that all columns are numeric.

Have a look at the following R code:

data2 <- data                                              # Duplicate data frame
for(i in seq_len(ncol(data2))) {                           # Replace NA in all columns
  data2[ , i][is.na(data2[ , i])] <- mean(data2[ , i], na.rm = TRUE)
}
data2                                                      # Print updated data frame
#    x1 x2    x3
# 1   6  5 4.000
# 2   2  5 4.625
# 3   3  5 1.000
# 4   4  5 5.000
# 5   5  5 6.000
# 6   6  5 7.000
# 7   7  5 4.625
# 8   8  5 5.000
# 9   9  5 9.000
# 10 10  5 0.000

All NA values of our data frame were replaced by the mean of the corresponding column.
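
As a possible alternative to the for-loop, the same replacement can be sketched without an explicit loop by using lapply. This is only one of several ways to write it; the object name data2_alt is an assumption made for this illustration, and the code again assumes that all columns are numeric:

```r
data <- data.frame(x1 = c(NA, 2:10),                       # Recreate example data
                   x2 = c(rep(5, 8), NA, NA),
                   x3 = c(4, NA, 1, 5, 6, 7, NA, 5, 9, 0))

data2_alt <- data                                          # Duplicate data frame
data2_alt[] <- lapply(data2_alt, function(col) {           # Apply replacement to each column
  col[is.na(col)] <- mean(col, na.rm = TRUE)               # Replace NA by column mean
  col                                                      # Return updated column
})
data2_alt                                                  # Print updated data frame
```

Assigning to data2_alt[] keeps the data frame structure, whereas a plain assignment would turn the object into a list.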

Example 3: Replacing Missing Data in All Variables Using na.aggregate() Function of zoo Package

You might say that the R syntax of Example 2 was relatively complicated. Fortunately, the zoo package provides a very simple alternative if we want to replace all missing values by column means.

If we want to use the functions and commands of the zoo package, we first have to install and load zoo:

install.packages("zoo")                                    # Install & load zoo package
library("zoo")

Now, we can use the na.aggregate function to replace all missing data:

data3 <- na.aggregate(data)                                # Replace NA in all columns
data3                                                      # Print updated data frame
#    x1 x2    x3
# 1   6  5 4.000
# 2   2  5 4.625
# 3   3  5 1.000
# 4   4  5 5.000
# 5   5  5 6.000
# 6   6  5 7.000
# 7   7  5 4.625
# 8   8  5 5.000
# 9   9  5 9.000
# 10 10  5 0.000

The output is exactly the same as in Example 2.
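
The na.aggregate function also has a FUN argument, which is set to mean by default. The following sketch, which is not part of the original examples, shows how this argument could be used to impute by the column median instead. The object name data3_median is hypothetical, and the code only runs if the zoo package is installed:

```r
data <- data.frame(x1 = c(NA, 2:10),                       # Recreate example data
                   x2 = c(rep(5, 8), NA, NA),
                   x3 = c(4, NA, 1, 5, 6, 7, NA, 5, 9, 0))

if(requireNamespace("zoo", quietly = TRUE)) {              # Run only if zoo is installed
  data3_median <- zoo::na.aggregate(data, FUN = median)    # Replace NA by column median
  print(data3_median)                                      # Print updated data frame
}
```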

Video, Further Resources & Summary

Do you want to know more about missing data? Then you may want to have a look at the following video, which I have published on my YouTube channel. In the video, I explain the R code of this article.

Also, you could have a look at the related articles that I have published on my homepage.

Summary: In this R tutorial you learned how to replace missing values with column means in one or multiple variables. Let me know in the comments below if you have further questions.

12 Comments

Giorgos
October 14, 2021 11:07 am

Hello! thank you very much for this super useful tutorial.
However, I cannot make it work with my data frame. I think I understand the reason why, but I cannot see how to fix it. In Example 2 of your code above, you have used mean(data2[ , i], na.rm = TRUE), which, as I understand it, applies mean() to the data frame. However, in another tutorial you mention that “the mean() function cannot handle a data frame as an input”, and that is where I understood why I cannot make it work.
But how does it work in your code above? Is there something about the [ , i] that I am failing to understand?
I would be so grateful for some help on this 🙂

- Joachim
  October 14, 2021 2:56 pm
  
  Hey Giorgos,
  
  Thank you very much for the very kind feedback, glad you find the tutorial useful!
  
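  Regarding your question: The i is the running index of the for-loop, so data2[ , i] extracts the i-th column as a vector. The mean function is therefore applied to one column at a time (not to the entire data frame), and the missing values of this column are replaced by its mean.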
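  
  The difference between a column vector and a data frame can be illustrated with a short sketch. The object names col_vec and col_df are hypothetical, and the example assumes an ordinary data frame created with data.frame():

  ```r
  data <- data.frame(x1 = c(NA, 2:10),                     # Small example data
                     x2 = c(rep(5, 8), NA, NA))

  col_vec <- data[ , 1]                                    # Column extracted as vector
  col_df <- data[1]                                        # Column kept as data frame

  is.data.frame(col_vec)                                   # FALSE
  is.data.frame(col_df)                                    # TRUE

  mean(col_vec, na.rm = TRUE)                              # Works and returns 6
  suppressWarnings(mean(col_df, na.rm = TRUE))             # Returns NA (with a warning)
  ```

  Without suppressWarnings, the last line returns the warning "argument is not numeric or logical: returning NA".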
  
  Could you share your code and the error message that is returned?
  
  Regards
  
  Joachim
  
Giorgos
October 14, 2021 11:09 am

I made the comment above, but I didn’t add my email.

- Joachim
  October 14, 2021 2:56 pm
  
  No problem, I have just responded to your other comment 🙂
  
Jen
October 27, 2021 6:08 am

Hi Joachim,

If I upload a file with missing data to R and then use code to substitute the missing data with the mean, how can I download the updated file afterwards? RCloud has a function to export the file directly. How about the R console?

Can you please help me with that?

- Joachim
  October 28, 2021 7:10 am
  
  Hi Jen,
  
  I recommend exporting your file to your computer using the write.csv function: https://www.rdocumentation.org/packages/AlphaPart/versions/0.8.1/topics/write.csv
  
  Afterwards, you can upload it to any cloud service you want.
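  
  A minimal sketch of such an export might look as follows. The data frame data_export, the file name data_imputed.csv, and the temporary directory are only assumptions for this illustration; replace them with your own data and path:

  ```r
  data_export <- data.frame(x1 = c(6, 2, 3),               # Toy data after imputation
                            x2 = c(5, 5, 5))

  out_file <- file.path(tempdir(), "data_imputed.csv")     # Hypothetical output path
  write.csv(data_export, out_file, row.names = FALSE)      # Export data frame as CSV

  data_check <- read.csv(out_file)                         # Read file back in as a check
  data_check                                               # Print imported data
  ```

  Setting row.names = FALSE avoids an additional column with row numbers in the exported file.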
  
  Please note that it is usually better to substitute missing values with other techniques than mean imputation. See here for more details: https://statisticsglobe.com/missing-data/
  
  Regards,
  Joachim
  
Giorgos
October 27, 2021 12:17 pm

Thank you very much for your reply Joachim.
Above, you have simulated the data and the process is easy. In my case, I have used some PISA data, where I have made a smaller data set with the responses of the Greek students. So far so good.
So, let’s say that my data set is called Greece, I copy this data set in another one called Greece2 and I start from there.

for(i in 1:ncol(Greece2)) {
Greece2[, i][is.na(Greece2[, i])] <- mean(Greece2[, i], na.rm = TRUE)
}
Greece2

Results:
1) the NA values will not be replaced
2) I get the error "argument is not numeric or logical: returning NA"

My first question would be whether you know how to focus on only the columns of interest. The columns (variables) that I am working with are the last five, so maybe the loop would work if I instructed it to iterate through these columns only?
Another assumption of mine is that some of my NAs may not be NAs at all, but other unrecognised values such as NaN, "?", or whatever. Therefore, the for-loop cannot work on them.

Any ideas ?

- Joachim
  October 28, 2021 7:16 am
  
  Hey Giorgos,
  
  First, you would have to make sure that the classes of your variables are formatted properly (i.e. numeric columns are set to the numeric class).
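  
  As a rough sketch of such a class conversion, consider a character column in which missing values were coded as "?". The data frame raw and its column score are made up for this illustration and are not taken from the PISA data:

  ```r
  raw <- data.frame(score = c("4", "?", "7", NA),          # Hypothetical character column
                    stringsAsFactors = FALSE)

  raw$score[raw$score %in% "?"] <- NA                      # Recode placeholder to NA
  raw$score <- as.numeric(raw$score)                       # Convert character to numeric
  class(raw$score)                                         # Check class
  # [1] "numeric"

  raw$score[is.na(raw$score)] <- mean(raw$score, na.rm = TRUE)  # Replace NA by mean
  raw$score                                                # Print updated column
  # [1] 4.0 5.5 7.0 5.5
  ```

  If the column is a factor, it should first be converted with as.character, because as.numeric applied directly to a factor returns the internal level codes.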
  
  Then, you may select only the numeric columns of your data frame as explained here: https://statisticsglobe.com/select-only-numeric-columns-from-data-frame-in-r
  
  Alternatively, I recommend having a look at the mice package. This package imputes each column differently based on its class, and it uses more sophisticated imputation methods than mean imputation.
  
  I hope that helps!
  
  Joachim
  
Giorgos
October 28, 2021 10:10 am

Dear Joachim,
Thanks for the immediate reply! I think I understood where the problem stemmed from, and I realised this by reading the link that you provided above, where selecting numeric columns is explained.
I used the str() function as explained in the tutorial and saw that some of my columns were indeed not numeric. Therefore, the for-loop and subsequently the “replace by mean” code were “crashing”.
So, after reading the information in the link you submitted, I decided to go the simple way:
1) I just subsetted the columns of interest by doing this:
Greece2 <- dplyr::select(Greece, Basic_Edu_Mother:possessions)
Greece2
(Here, I subsetted the 8 columns of interest, which I know are all numeric. But to be absolutely sure, I ran the function str(Greece2), and the output read: num num num num etc., showing that all the columns are numeric.)

2) I ran the loop again like you do:

for(i in 1:ncol(Greece2)) {
Greece2[, i][is.na(Greece2[, i])] <- mean(Greece2[, i], na.rm = TRUE)
}
Greece2

This time, I was absolutely confident that the loop would work, since all the columns in this subset are numeric.
And yet, the same error as before: "argument is not numeric or logical: returning NA"

So, the problem must be somewhere else and I am not seeing it, right?
(In any case, I am inspecting the mice package as you suggested and I will need some time to understand it but it seems better)

Again, thanks for the support, and I wish you a nice day

- Joachim
  October 28, 2021 11:59 am
  Hey Giorgos,
  
  In this case, I assume something went wrong when you were changing the data classes. You can check the classes of all columns using the following R code:
  sapply(your_data, class)
  I definitely recommend having a closer look at the mice package, it’s great 🙂
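  
  Another possible cause, which is only an assumption since the thread does not show the structure of the data, is that the object is a data frame variant such as a tibble, where [ , i] returns a one-column data frame instead of a vector. The sketch below uses [[i]], which always returns the column itself, and skips non-numeric columns. The toy data frame your_data is hypothetical:

  ```r
  your_data <- data.frame(id = c("a", "b", "c", "d"),      # Hypothetical mixed data
                          score = c(1, NA, 3, 5),
                          stringsAsFactors = FALSE)

  sapply(your_data, class)                                 # Check classes of all columns

  for(i in seq_along(your_data)) {                         # Loop over all columns
    if(is.numeric(your_data[[i]])) {                       # Skip non-numeric columns
      your_data[[i]][is.na(your_data[[i]])] <- mean(your_data[[i]], na.rm = TRUE)
    }
  }
  your_data                                                # Print updated data frame
  ```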
  
  A nice day to you too!
  
  Joachim
Komal
December 9, 2021 2:48 pm

I have many consecutive missing values in a single series, e.g. 10 days or more. Is it right to replace them with the mean of that series?

- Joachim
  December 13, 2021 8:23 am
  
  Hey Komal,
  
  If those values occur due to a specific reason such as non-response, you should usually not replace them by the mean. Have a look here for more details: https://statisticsglobe.com/missing-data/
  
  Regards,
  Joachim
  