Replace Missing Values by Column Mean in R (3 Examples)

 

In this R tutorial you’ll learn how to substitute NA values by the mean of a data frame variable.

Let’s get started…

 

Creation of Example Data

As a first step, we’ll have to create some example data:

data <- data.frame(x1 = c(NA, 2:10),                       # Create data frame
                   x2 = c(rep(5, 8), NA, NA),
                   x3 = c(4, NA, 1, 5, 6, 7, NA, 5, 9, 0))
data                                                       # Print data frame
#    x1 x2 x3
# 1  NA  5  4
# 2   2  5 NA
# 3   3  5  1
# 4   4  5  5
# 5   5  5  6
# 6   6  5  7
# 7   7  5 NA
# 8   8  5  5
# 9   9 NA  9
# 10 10 NA  0

As you can see in the previous output of the RStudio console, our example data has ten rows and three numeric columns. Each of the variables contains at least one missing value (i.e. NA).
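Before replacing anything, it can be useful to check how many values are missing in each column and what the column means are. The following lines are a small sketch that is not part of the original example code; it only relies on the base R functions colSums, is.na, and colMeans:

```r
data <- data.frame(x1 = c(NA, 2:10),                       # Same example data as above
                   x2 = c(rep(5, 8), NA, NA),
                   x3 = c(4, NA, 1, 5, 6, 7, NA, 5, 9, 0))
colSums(is.na(data))                                       # Count NA values by column
# x1 x2 x3 
#  1  2  2 
colMeans(data, na.rm = TRUE)                               # Column means without NA values
#    x1    x2    x3 
# 6.000 5.000 4.625 
```

These means are the values that are used as replacements in the examples of this tutorial.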

 

Example 1: Replacing Missing Data in One Specific Variable Using is.na() & mean() Functions

In this example, I’ll show how to substitute the NA values in only one particular data frame column by its average. For this, we can use the is.na and mean functions as shown below:

data1 <- data                                              # Duplicate data frame
data1$x1[is.na(data1$x1)] <- mean(data1$x1, na.rm = TRUE)  # Replace NA in one column
data1                                                      # Print updated data frame
#    x1 x2 x3
# 1   6  5  4
# 2   2  5 NA
# 3   3  5  1
# 4   4  5  5
# 5   5  5  6
# 6   6  5  7
# 7   7  5 NA
# 8   8  5  5
# 9   9 NA  9
# 10 10 NA  0

Have a look at the previous output of the RStudio console: As you can see, the first cell in the variable x1 was replaced by the mean of the variable x1 (i.e. 6).
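If the same replacement is needed for several vectors, the logic of Example 1 could be wrapped in a small function. Note that impute_mean is a hypothetical helper name chosen for this sketch; it is not a base R function and does not appear in the original code:

```r
impute_mean <- function(x) {                               # Hypothetical helper function
  x[is.na(x)] <- mean(x, na.rm = TRUE)                     # Replace NA by mean of observed values
  x                                                        # Return updated vector
}

x1 <- c(NA, 2:10)                                          # Same values as data$x1
impute_mean(x1)                                            # Apply helper function
# [1]  6  2  3  4  5  6  7  8  9 10
```

Keep in mind that a vector consisting only of NA values would be filled with NaN, because the mean of zero observed values is not defined.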

 

Example 2: Replacing Missing Data in All Variables Using for-Loop

This example illustrates how to replace the NA values in all columns of your data by the corresponding column means using a for-loop.

Have a look at the following R code:

data2 <- data                                              # Duplicate data frame
for(i in seq_len(ncol(data2))) {                           # Replace NA in all columns
  data2[ , i][is.na(data2[ , i])] <- mean(data2[ , i], na.rm = TRUE)
}
data2                                                      # Print updated data frame
#    x1 x2    x3
# 1   6  5 4.000
# 2   2  5 4.625
# 3   3  5 1.000
# 4   4  5 5.000
# 5   5  5 6.000
# 6   6  5 7.000
# 7   7  5 4.625
# 8   8  5 5.000
# 9   9  5 9.000
# 10 10  5 0.000

All NA values of our data frame were replaced by the mean of the corresponding column.
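The loop above assumes that every column is numeric. For data that also contains other column classes, a possible variation is to skip the non-numeric columns. The following code is only a sketch under this assumption: the data frame data_mixed and its column grp are made up for illustration. It also uses [[i]] instead of [ , i], because [[i]] always returns a vector, whereas [ , i] returns a one-column tibble when the data is stored as a tibble, and mean() cannot handle that input:

```r
data_mixed <- data.frame(x1 = c(NA, 2:10),                 # Hypothetical data with character column
                         x2 = c(rep(5, 8), NA, NA),
                         grp = c(rep("a", 9), NA))

for(i in seq_along(data_mixed)) {                          # Loop over all columns
  if(is.numeric(data_mixed[[i]])) {                        # Skip non-numeric columns
    data_mixed[[i]][is.na(data_mixed[[i]])] <- mean(data_mixed[[i]], na.rm = TRUE)
  }
}
colSums(is.na(data_mixed))                                 # Remaining NA values by column
#  x1  x2 grp 
#   0   0   1 
```

The NA values of the numeric columns have been replaced, while the character column grp was left unchanged.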

 

Example 3: Replacing Missing Data in All Variables Using na.aggregate() Function of zoo Package

You might say that the R syntax of Example 2 was relatively complicated. Fortunately, the zoo package provides a very simple alternative if we want to replace all missing values by column means.

If we want to use the functions and commands of the zoo package, we first have to install and load zoo:

install.packages("zoo")                                    # Install & load zoo package
library("zoo")

Now, we can use the na.aggregate function to replace all missing data:

data3 <- na.aggregate(data)                                # Replace NA in all columns
data3                                                      # Print updated data frame
#    x1 x2    x3
# 1   6  5 4.000
# 2   2  5 4.625
# 3   3  5 1.000
# 4   4  5 5.000
# 5   5  5 6.000
# 6   6  5 7.000
# 7   7  5 4.625
# 8   8  5 5.000
# 9   9  5 9.000
# 10 10  5 0.000

The output is exactly the same as in Example 2.
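The na.aggregate function also provides a FUN argument (the mean is the default), so other summary statistics can be used as replacement values. As a minimal sketch, the following code replaces the NA values by the column medians instead. The object name data3_median is chosen for this illustration, and the code is only executed if the zoo package is installed:

```r
data <- data.frame(x1 = c(NA, 2:10),                       # Same example data as above
                   x2 = c(rep(5, 8), NA, NA),
                   x3 = c(4, NA, 1, 5, 6, 7, NA, 5, 9, 0))

if(requireNamespace("zoo", quietly = TRUE)) {              # Run only if zoo is installed
  data3_median <- zoo::na.aggregate(data, FUN = median)    # Replace NA by column medians
  print(data3_median)                                      # Print updated data frame
}
```

Like the for-loop of Example 2, this approach is intended for data that contains numeric columns only.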

 

Video, Further Resources & Summary

Do you want to know more about missing data? Then you may want to have a look at the following video, which I have published on my YouTube channel. In the video, I explain the R code of this article.

 

 

Also, you could have a look at the related articles that I have published on my homepage.

 

Summary: In this R tutorial you learned how to replace missing values by column means in one or multiple variables. Let me know in the comments below if you have further questions.

 



12 Comments

  • Hello! Thank you very much for this super useful tutorial.
    I cannot make it work with my data frame, however. I think I understand the reason why, but I cannot really see how to fix it. In your code above, in Example 2, you have used mean(data2[ , i], na.rm = TRUE), which, as I understand it, applies mean() to the data frame. However, in another tutorial you also mention that “the mean() function cannot handle a data frame as an input”, and that is where I understood why I cannot make it work.
    But how is it working in your code above? Is there something about the [, i] that I am failing to understand?
    I would be so grateful for some help with this 🙂

    • Hey Giorgos,

      Thank you very much for the very kind feedback, glad you find the tutorial useful!

      Regarding your question: The i is the running index of the for-loop, so it uses the mean of each column for the missing values of this column.

      Could you share your code and the error message that is returned?

      Regards

      Joachim

  • I made the comment above, but I didn’t add my email.

  • Hi Joachim,

    If I upload a file with missing data to R and then use code to substitute the missing data with the mean, how can I download the updated file afterwards? RCloud has a function to export the file directly. How about the R console?

    Can you please help me with that?

  • Thank you very much for your reply Joachim.
    Above, you have simulated the data and the process is easy. In my case, I have used some PISA data, where I have made a smaller data set with the responses of the Greek students. So far so good.
    So, let’s say that my data set is called Greece, I copy this data set in another one called Greece2 and I start from there.

    for(i in 1:ncol(Greece2)) {
    Greece2[, i][is.na(Greece2[, i])] <- mean(Greece2[, i], na.rm = TRUE)
    }
    Greece2

    Results:
    1) the NA values will not be replaced
    2) I get the error "argument is not numeric or logical: returning NA"

    My first question would be whether you know how to focus on the columns of interest only. The columns (i.e. variables) that I am working with are the last five, so maybe the loop would work if I instructed it to iterate through these columns only?
    Another assumption of mine is that some of my NAs may not be NAs at all, but other unrecognisable values such as NaN, or "?", or whatever. Therefore, the for loop cannot work on them.

    Any ideas ?

  • Dear Joachim,
    Thanks for the immediate reply! I think I understood where the problem stemmed from. I realised this by reading the link that you provided above, which explains how to select numeric columns.
    I used the str() function as explained in the tutorial and saw that some of my columns were indeed not numeric. Therefore, the for loop and subsequently the “replace by mean” code were “crashing”.
    So, after reading the information in the link you submitted, I decided to go the simple way:
    1) I just subsetted the columns of interest by doing this:
    Greece2 <- dplyr::select(Greece, Basic_Edu_Mother:possessions)
    Greece2
    (Here, I subsetted the 8 columns of interest, which I know are all numeric. But to be absolutely sure, I ran the function str(Greece2) and the output read: num num num num etc., showing that all the columns are numeric.)

    2) I ran the loop again like you do:

    for(i in 1:ncol(Greece2)) {
    Greece2[, i][is.na(Greece2[, i])] <- mean(Greece2[, i], na.rm = TRUE)
    }
    Greece2

    This time, I was absolutely confident that the loop would work, since all the columns in this subset are numeric.
    Yet, I got the same error as before: "argument is not numeric or logical: returning NA"

    So, the problem must be somewhere else and I am not seeing it, right?
    (In any case, I am inspecting the mice package as you suggested and I will need some time to understand it but it seems better)

    Again, thanks for the support, and I wish you a nice day

    • Hey Giorgos,

      In this case, I assume something went wrong when you were changing the data classes. You can check the classes of all columns using the following R code:

      sapply(your_data, class)

      I definitely recommend having a closer look at the mice package, it’s great 🙂

      A nice day to you too!

      Joachim

  • I have many consecutive missing values in a single series, e.g. 10 days or more. Is it right to replace them with the mean of that series?
