Replace Missing Values by Column Mean in R (3 Examples)
In this R tutorial you’ll learn how to substitute NA values by the mean of a data frame variable.
Let’s get started…
Creation of Example Data
As a first step, we’ll have to create some example data:
data <- data.frame(x1 = c(NA, 2:10),              # Create data frame
                   x2 = c(rep(5, 8), NA, NA),
                   x3 = c(4, NA, 1, 5, 6, 7, NA, 5, 9, 0))
data                                              # Print data frame
#    x1 x2 x3
# 1  NA  5  4
# 2   2  5 NA
# 3   3  5  1
# 4   4  5  5
# 5   5  5  6
# 6   6  5  7
# 7   7  5 NA
# 8   8  5  5
# 9   9 NA  9
# 10 10 NA  0
As you can see based on the previous output of the RStudio console, our example data has ten rows and three numeric columns. Each of the variables contains at least one missing value (i.e. NA).
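If you want to confirm this in code, you can count the missing values per column. The following is a minimal sketch using the base R functions is.na() and colSums(); it re-creates the example data so that it runs on its own, and the name na_counts is only chosen for illustration.

```r
# Re-create the example data of this tutorial
data <- data.frame(x1 = c(NA, 2:10),
                   x2 = c(rep(5, 8), NA, NA),
                   x3 = c(4, NA, 1, 5, 6, 7, NA, 5, 9, 0))

# is.na() returns a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(data))
na_counts
# x1 x2 x3 
#  1  2  2 
```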
Example 1: Replacing Missing Data in One Specific Variable Using is.na() & mean() Functions
In this example, I’ll show how to substitute the NA values in only one particular data frame column by its average. For this, we can use the is.na and mean functions as shown below:
data1 <- data                                     # Duplicate data frame
data1$x1[is.na(data1$x1)] <- mean(data1$x1, na.rm = TRUE)  # Replace NA in one column
data1                                             # Print updated data frame
#    x1 x2 x3
# 1   6  5  4
# 2   2  5 NA
# 3   3  5  1
# 4   4  5  5
# 5   5  5  6
# 6   6  5  7
# 7   7  5 NA
# 8   8  5  5
# 9   9 NA  9
# 10 10 NA  0
Have a look at the previous output of the RStudio console: As you can see, the first cell in the variable x1 was replaced by the mean of the variable x1 (i.e. 6).
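If you need this replacement for several vectors, you might wrap the same logic in a small function. The sketch below is only an illustration: impute_mean is a hypothetical helper name that is not part of the original code or of any package.

```r
# Hypothetical helper (not part of the original tutorial):
# replace the NA values of a numeric vector by the mean of its observed values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

x1 <- c(NA, 2:10)          # Same values as the column x1
x1_imputed <- impute_mean(x1)
x1_imputed
# [1]  6  2  3  4  5  6  7  8  9 10
```

The function could then be applied to a single column, e.g. data1$x1 <- impute_mean(data1$x1).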
Example 2: Replacing Missing Data in All Variables Using for-Loop
This example illustrates how to replace the missing values in all columns of your data by the corresponding column means using a for-loop.
Have a look at the following R code:
data2 <- data                                     # Duplicate data frame
for(i in 1:ncol(data2)) {                         # Replace NA in all columns
  data2[ , i][is.na(data2[ , i])] <- mean(data2[ , i], na.rm = TRUE)
}
data2                                             # Print updated data frame
#    x1 x2    x3
# 1   6  5 4.000
# 2   2  5 4.625
# 3   3  5 1.000
# 4   4  5 5.000
# 5   5  5 6.000
# 6   6  5 7.000
# 7   7  5 4.625
# 8   8  5 5.000
# 9   9  5 9.000
# 10 10  5 0.000
All NA values of our data frame were replaced by the mean of the corresponding column.
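As a side note, the same result can also be produced without an explicit loop. The following is a minimal sketch of one possible alternative based on lapply(); the object name data2_alt is only chosen for illustration, and the example data is re-created so that the code runs on its own.

```r
# Re-create the example data of this tutorial
data <- data.frame(x1 = c(NA, 2:10),
                   x2 = c(rep(5, 8), NA, NA),
                   x3 = c(4, NA, 1, 5, 6, 7, NA, 5, 9, 0))

data2_alt <- data                                 # Duplicate data frame

# Apply the replacement to every column; the empty brackets on the
# left-hand side keep data2_alt a data frame instead of a list
data2_alt[] <- lapply(data2_alt, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
})

data2_alt                                         # Print updated data frame
```

Like the for-loop, this sketch assumes that all columns are numeric.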
Example 3: Replacing Missing Data in All Variables Using na.aggregate() Function of zoo Package
You might say that the R syntax of Example 2 was relatively complicated. Fortunately, the zoo package provides a very simple alternative if we want to replace all missing values by column means.
If we want to use the functions and commands of the zoo package, we first have to install and load zoo:
install.packages("zoo")                           # Install & load zoo package
library("zoo")
Now, we can use the na.aggregate function to replace all missing data:
data3 <- na.aggregate(data)                       # Replace NA in all columns
data3                                             # Print updated data frame
#    x1 x2    x3
# 1   6  5 4.000
# 2   2  5 4.625
# 3   3  5 1.000
# 4   4  5 5.000
# 5   5  5 6.000
# 6   6  5 7.000
# 7   7  5 4.625
# 8   8  5 5.000
# 9   9  5 9.000
# 10 10  5 0.000
The output is exactly the same as in Example 2.
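The na.aggregate function uses the mean by default, but it also has a FUN argument for other summary functions. The following sketch shows how the column median could be used instead; it is an illustration that goes slightly beyond the examples above, and it only runs the zoo call if the package is installed. The object name data3_median is chosen for illustration.

```r
# Re-create the example data of this tutorial
data <- data.frame(x1 = c(NA, 2:10),
                   x2 = c(rep(5, 8), NA, NA),
                   x3 = c(4, NA, 1, 5, 6, 7, NA, 5, 9, 0))

# Only run the example if the zoo package is available
if (requireNamespace("zoo", quietly = TRUE)) {
  # Replace NA in all columns by the column median instead of the mean
  data3_median <- zoo::na.aggregate(data, FUN = median)
  print(data3_median)
}
```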
Video, Further Resources & Summary
Do you want to know more about missing data? Then you may want to have a look at the following video which I have published on my YouTube channel. In the video, I’m explaining the R codes of this article.
Also, you could have a look at the related articles that I have published on my homepage.
- Replace NA with 0 (10 Examples for Data Frame, Vector & Column)
- Get Sum of Data Frame Column Values
- Find Missing Values (6 Examples for Data Frame, Column & Vector)
- The R Programming Language
Summary: In this R tutorial you learned how to replace missing values by column means in one or multiple variables. Let me know in the comments below if you have further questions.
12 Comments
Hello! thank you very much for this super useful tutorial.
However, I cannot make it work with my data frame. I think I understand the reason why, but I cannot really see how to fix it. In your code above, in Example 2, you have used mean(data2[ , i], na.rm = TRUE) which, as I understand it, applies mean() to the data frame. However, in another tutorial you also mention that “the mean() function cannot handle a data frame as an input”, and that is where I understood why I cannot make it work.
But how is it working in your code above? Is there something about the [, i] that I am failing to understand?
I would be so grateful for some help with this 🙂
Hey Giorgos,
Thank you very much for the very kind feedback, glad you find the tutorial useful!
Regarding your question: The i is the running index of the for-loop, so it uses the mean of each column for the missing values of this column.
Could you share your code and the error message that is returned?
Regards
Joachim
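To illustrate the [ , i] part of the question: for a standard data frame, data2[ , i] extracts a single column as a plain vector, so mean() receives a vector and not a data frame. The following minimal sketch compares both cases; the object names col_vec, col_df, and mean_df are only chosen for illustration.

```r
# Re-create the example data of this tutorial
data2 <- data.frame(x1 = c(NA, 2:10),
                    x2 = c(rep(5, 8), NA, NA),
                    x3 = c(4, NA, 1, 5, 6, 7, NA, 5, 9, 0))

col_vec <- data2[ , 1]                  # Single column as plain vector
col_df  <- data2[ , 1, drop = FALSE]    # Single column kept as data frame

is.numeric(col_vec)                     # TRUE
is.data.frame(col_df)                   # TRUE

mean(col_vec, na.rm = TRUE)             # Works on the vector
# [1] 6

# On a data frame, mean() returns NA and warns:
# "argument is not numeric or logical: returning NA"
mean_df <- suppressWarnings(mean(col_df, na.rm = TRUE))
mean_df
# [1] NA
```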
I made the comment above, but I didn’t add my email.
No problem, I have just responded to your other comment 🙂
Hi Joachim,
If I load a file with missing data into R and then use code to substitute the missing data with the mean, how can I download the updated file afterwards? RCloud has a function to export the file directly. How about the R console?
Can you please help me with that?
Hi Jen,
I recommend exporting your file to your computer using the write.csv function: https://www.rdocumentation.org/packages/AlphaPart/versions/0.8.1/topics/write.csv
Afterwards, you can upload it to any cloud service you want.
Please note that it is usually better to substitute missing values with other techniques than mean imputation. See here for more details: https://statisticsglobe.com/missing-data/
Regards,
Joachim
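A minimal sketch of such an export is shown below. The data frame data_imputed and the file name are hypothetical placeholders, and the file is written to a temporary folder so that the example runs on any computer; in practice you would specify your own path.

```r
# Hypothetical data frame after mean imputation
data_imputed <- data.frame(x1 = c(6, 2:10),
                           x2 = 5)

# Hypothetical output file in a temporary folder
out_file <- file.path(tempdir(), "data_imputed.csv")

# Export the data frame as CSV file (without row names)
write.csv(data_imputed, file = out_file, row.names = FALSE)

# Read the file again to check that the export has worked
data_check <- read.csv(out_file)
head(data_check)
```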
Thank you very much for your reply Joachim.
Above, you have simulated the data and the process is easy. In my case, I have used some PISA data, where I have made a smaller data set with the responses of the Greek students. So far so good.
So, let’s say that my data set is called Greece, I copy this data set in another one called Greece2 and I start from there.
for(i in 1:ncol(Greece2)) {
Greece2[, i][is.na(Greece2[, i])] <- mean(Greece2[, i], na.rm = TRUE)
}
Greece2
Results:
1) the NA values will not be replaced
2) I get the error "argument is not numeric or logical: returning NA"
My first question would be whether you know how to focus on the columns of interest only. The columns (i.e. variables) that I am working with are the last five ones, so maybe the loop would work if I instructed it to iterate through these columns only?
Another assumption of mine is that some of my NAs may not be NAs at all, but other unrecognisable values such as NaN, or "?", or whatever. Therefore, the for-loop cannot work on them.
Any ideas ?
Hey Giorgos,
First, you would have to make sure that the classes of your variables are formatted properly (i.e. numeric columns are set to the numeric class).
Then, you may select only the numeric columns of your data frame as explained here: https://statisticsglobe.com/select-only-numeric-columns-from-data-frame-in-r
Alternatively, I recommend having a look at the mice package. This package imputes each column differently based on its class, and it uses more sophisticated imputation methods than mean imputation.
I hope that helps!
Joachim
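The first two steps could be sketched as follows. The data frame df_mixed and its column names are hypothetical and only serve as a stand-in for a data set that contains numeric as well as non-numeric columns.

```r
# Hypothetical data with one character column and two numeric columns
df_mixed <- data.frame(id    = c("a", "b", "c", "d"),
                       score = c(1, NA, 3, 5),
                       age   = c(20, 30, NA, 40))

# Identify the names of all numeric columns
num_cols <- names(df_mixed)[sapply(df_mixed, is.numeric)]
num_cols
# [1] "score" "age"

# Replace NA by the column mean, but only in the numeric columns
for (col in num_cols) {
  df_mixed[[col]][is.na(df_mixed[[col]])] <- mean(df_mixed[[col]], na.rm = TRUE)
}

df_mixed
```

Note that is.na() also returns TRUE for NaN, so NaN values in numeric columns are replaced as well. A value such as "?" would turn the entire column into a character column, which is why checking the classes first matters.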
Dear Joachim,
Thanks for the immediate reply! I think I understood where the problem stemmed from and I realised this by reading the link that you provided above where, how to select numeric columns is explained.
I used the str() function as explained in the tutorial and I saw that indeed some of my columns were not numeric, therefore the for-loop and subsequently the “replace by mean” code were “crashing”.
Therefore, after reading the information in the link you submitted, I decided to go on the simple way:
1) I just subsetted the columns of interest by doing this:
Greece2 <- dplyr::select(Greece, Basic_Edu_Mother:possessions)
Greece2
(Here, I subsetted the 8 columns of interest, which I know are all numeric. But to be absolutely sure, I ran the function str(Greece2) and the output read: num num num num etc., showing that all the columns are numeric.)
2) I ran the loop again like you do:
for(i in 1:ncol(Greece2)) {
Greece2[, i][is.na(Greece2[, i])] <- mean(Greece2[, i], na.rm = TRUE)
}
Greece2
This time, I was absolutely confident that the loop would work, since all the columns in this subset are numeric.
And yet, the same error as before: "argument is not numeric or logical: returning NA"
So, the problem must be somewhere else and I am not seeing it, right?
(In any case, I am inspecting the mice package as you suggested and I will need some time to understand it but it seems better)
Again, thanks for the support, I wish you a nice rest of the day
Hey Giorgos,
In this case, I assume something went wrong when you were changing the data classes. You can check the classes of all columns using the following R code:
I definitely recommend having a closer look at the mice package, it’s great 🙂
Have a nice day, too!
Joachim
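One thing that might be worth ruling out here (this is an assumption, the thread does not confirm it): if the data is stored as a tibble, as returned by many tidyverse import functions and by dplyr::select() on a tibble, then Greece2[, i] returns a one-column tibble instead of a vector, and mean() gives exactly the message "argument is not numeric or logical: returning NA" even though str() shows numeric columns. The sketch below checks the column classes with sapply() and uses [[i]], which returns a plain vector for data frames and tibbles alike. The object Greece2_demo is a hypothetical stand-in for the real data.

```r
# Hypothetical stand-in for the real data set
Greece2_demo <- data.frame(a = c(1, NA, 3),
                           b = c(NA, 5, 7))

# Check the class of the object and of each column
class(Greece2_demo)
sapply(Greece2_demo, class)
#         a         b 
# "numeric" "numeric" 

# Loop over the columns using [[i]], which always returns a vector
for (i in seq_along(Greece2_demo)) {
  col <- Greece2_demo[[i]]
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  Greece2_demo[[i]] <- col
}

Greece2_demo
#   a b
# 1 1 6
# 2 2 5
# 3 3 7
```

If class(Greece2) returns something like "tbl_df" in addition to "data.frame", converting the data with as.data.frame() before running the original loop would be another option.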
I have many consecutive missing values in a single series (e.g. 10 days or more). Is it right to replace them with the mean of that series?
Hey Komal,
If those values occur due to a specific reason such as non-response, you should usually not replace them by the mean. Have a look here for more details: https://statisticsglobe.com/missing-data/
Regards,
Joachim