# Mode Imputation (How to Impute Categorical Variables Using R)

Mode imputation is easy to apply – but using it the wrong way might screw up the quality of your data.

In the following article, I’m going to show you how and when to use mode imputation.

Before we can start, a short definition:

Definition:
Mode imputation (or mode substitution) replaces missing values of a categorical variable by the mode of non-missing cases of that variable.

## Impute with Mode in R (Programming Example)

Imputing missing data by mode is quite easy. For this example, I’m using the statistical programming language R (RStudio). However, mode imputation can be conducted in essentially all software packages such as Python, SAS, Stata, SPSS and so on…

Consider the following example variable (i.e. vector in R):

```r
set.seed(951)                                # Set seed
N <- 1000                                    # Number of observations
vec <- round(runif(N, 0, 5))                 # Create vector without missings
vec_miss <- vec                              # Replicate vector
vec_miss[rbinom(N, 1, 0.1) == 1] <- NA       # Insert missing values

table(vec_miss)                              # Count of each category
#   0   1   2   3   4   5
#  86 183 207 170 174  90

sum(is.na(vec_miss))                         # Count of NA values
# 90
```

Our example vector consists of 1000 observations – 90 of them are NA (i.e. missing values).

Now let's substitute these missing values via mode imputation. First, we need to determine the mode of our data vector:

```r
val <- unique(vec_miss[!is.na(vec_miss)])              # Values in vec_miss
mode <- val[which.max(tabulate(match(vec_miss, val)))] # Mode of vec_miss
```

The mode of our variable is 2. With the following code, all missing values are replaced by 2 (i.e. the mode):

```r
vec_imp <- vec_miss                          # Replicate vec_miss
vec_imp[is.na(vec_imp)] <- mode              # Impute by mode
```

That’s it. Imputation finished.
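If you need this more than once, the two steps above can be wrapped in a small function. The following is only a minimal sketch: the name `impute_mode` is my own choice for illustration (it is not part of any package), and in case of ties it simply takes the tied category that appears first in the data.

```r
# Hypothetical helper (illustration only): impute NA values by the mode
impute_mode <- function(x) {
  val <- unique(x[!is.na(x)])                              # Observed categories
  mode_value <- val[which.max(tabulate(match(x, val)))]    # Most frequent category
  x[is.na(x)] <- mode_value                                # Replace NA by the mode
  x
}

colors <- c("red", "blue", NA, "red", "green", NA)         # Example character vector
impute_mode(colors)
# "red" "blue" "red" "red" "green" "red"
```

The function also works for factors, because the imputed value is always one of the levels that already occur in the data.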

But do the imputed values introduce bias to our data? I’m going to check this in the following…

## Did we Screw it up?

Did the imputation run down the quality of our data? The following graphic answers this question:

```r
library("ggplot2")                                                 # Load ggplot2 package

missingness <- c(rep("No Missings", 6), rep("Post Imputation", 6)) # Pre/post imputation
Category <- as.factor(rep(names(table(vec)), 2))                   # Categories
Count <- c(as.numeric(table(vec)), as.numeric(table(vec_imp)))     # Count of categories

data_barplot <- data.frame(missingness, Category, Count)           # Combine data for plot

ggplot(data_barplot, aes(Category, Count, fill = missingness)) +   # Create plot
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = "Set2") +
  theme(legend.title = element_blank())
```

Graphic 1: Complete Example Vector (Before Insertion of Missings) vs. Imputed Vector

Graphic 1 reveals the issue of mode imputation:

The green bars reflect how our example vector was distributed before we inserted missing values. A perfect imputation method would reproduce the green bars.

However, after the application of mode imputation, the imputed vector (orange bars) differs a lot. While category 2 is highly overrepresented, all other categories are underrepresented.

In other words: The distribution of our imputed data is highly biased!
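The same comparison can also be made with numbers instead of bars. The following sketch recreates the example data from above (so that it runs on its own) and compares the relative frequency of each category before the insertion of missings and after mode imputation. The object names `mode_value` and `comparison` are my own choices for this illustration.

```r
set.seed(951)                                          # Same example data as above
N <- 1000
vec <- round(runif(N, 0, 5))
vec_miss <- vec
vec_miss[rbinom(N, 1, 0.1) == 1] <- NA

mode_value <- as.numeric(names(which.max(table(vec_miss))))  # Mode of observed values
vec_imp <- vec_miss
vec_imp[is.na(vec_imp)] <- mode_value                  # Impute by mode

prop_before <- prop.table(table(factor(vec, levels = 0:5)))      # Shares before
prop_after <- prop.table(table(factor(vec_imp, levels = 0:5)))   # Shares after

comparison <- data.frame(Category = names(prop_before),
                         Before = as.numeric(prop_before),
                         After = as.numeric(prop_after))
comparison$Difference <- comparison$After - comparison$Before    # Shift per category
comparison
```

By construction, the `Difference` column can only be positive for the mode category and zero or negative for all other categories, which is exactly the distortion shown in Graphic 1.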

## Are There Better Alternatives?

As you have seen, mode imputation is usually not a good idea. The method should only be used if you have strong theoretical arguments (similar to mean imputation in case of continuous variables).

You might say: OK, got it! But what should I do instead?!

Recent research literature advises two imputation methods for categorical variables:

1. Multinomial logistic regression imputation

   Multinomial logistic regression imputation is the method of choice for categorical target variables – whenever it is computationally feasible. However, if you want to impute a variable with too many categories, it might be impossible to use the method (due to computational reasons). In this case, predictive mean matching imputation can help.

2. Predictive mean matching imputation

   Predictive mean matching was originally designed for numerical variables. However, recent literature has shown that predictive mean matching also works well for categorical variables – especially when the categories are ordered (van Buuren & Groothuis-Oudshoorn, 2011). Even though predictive mean matching has to be used with care for categorical variables, it can be a good solution for computationally problematic imputations.
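To give you an idea of how the two alternatives could look in practice, here is a rough sketch based on the mice package (van Buuren & Groothuis-Oudshoorn, 2011). Please note the assumptions: the data are simulated by me for illustration only, the variable names `x` and `group` are made up, and for predictive mean matching I store the ordered categories as integer codes. In mice, `"polyreg"` requests multinomial logistic regression and `"pmm"` requests predictive mean matching; the empty string tells mice not to impute the complete variable `x`.

```r
# install.packages("mice")                             # If not yet installed
library("mice")

set.seed(123)                                          # Simulated example data
n <- 300
x <- rnorm(n)                                          # Fully observed predictor
group <- cut(x + rnorm(n),                             # Categories related to x
             breaks = c(-Inf, -1, 0, 1, Inf),
             labels = c("a", "b", "c", "d"))
group[rbinom(n, 1, 0.1) == 1] <- NA                    # Insert missing values
data <- data.frame(x, group)

# 1) Multinomial logistic regression imputation
imp_polyreg <- mice(data, m = 5, method = c("", "polyreg"),
                    seed = 1, printFlag = FALSE)
data_polyreg <- mice::complete(imp_polyreg, 1)         # First imputed data set

# 2) Predictive mean matching on integer-coded categories
data_num <- data.frame(x, group = as.integer(data$group))
imp_pmm <- mice(data_num, m = 5, method = c("", "pmm"),
                seed = 1, printFlag = FALSE)
data_pmm <- mice::complete(imp_pmm, 1)
data_pmm$group <- factor(data_pmm$group, levels = 1:4, # Back to category labels
                         labels = levels(data$group))

table(data_polyreg$group)                              # Counts after imputation
table(data_pmm$group)
```

Both calls create five imputed data sets (`m = 5`); the sketch only extracts the first one. For a real analysis you would typically analyze all imputed data sets and pool the results.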

## I would like to hear your opinion!

I’ve shown you how mode imputation works, why it is usually not the best method for imputing your data, and what alternatives you could use.

Now, I’d love to hear about your experiences!

Have you already imputed via mode yourself? Would you do it again?

Leave me a comment below and let me know about your thoughts (questions are very welcome)!

## References

van Buuren, S., and Groothuis-Oudshoorn, C. G. (2011). MICE: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3).

## Appendix

How to create the header graphic? There you go:

```r
par(bg = "#1b98e0")                                    # Background color
par(mar = c(0, 0, 0, 0))                               # Remove space around plot

N <- 5000                                              # Sample size
x <- round(runif(N, 1, 100))                           # Uniform distribution

x <- c(x, rep(60, 35))                                 # Add some values equal to 60

hist_save <- hist(x, breaks = 100)                     # Save histogram
col <- cut(hist_save$breaks, c(-Inf, 58, 59, Inf))     # Colors of histogram

plot(hist_save,                                        # Plot histogram
     col = c("#353436", "red", "#353436")[col],
     ylim = c(0, 110),
     main = "",
     xaxs = "i",
     yaxs = "i")
```


• Ugyen Norbu
November 9, 2019 4:08 am

Mean and mode imputation may be used when there is strong theoretical justification. What can those justifications be? Can you please provide some examples.

Thank you very much for your well written blog on statistical concepts that are pre-digested down to suit students and those of us who are not statistician.

Cheers.

• Joachim
November 18, 2019 8:27 am

Hi Ugyen,

Thank you for your question and the nice compliment!

In practice, mean/mode imputation are almost never the best option. If you don’t know by design that the missing values are always equal to the mean/mode, you shouldn’t use it.

You may also have a look at this thread on Cross Validated to get more information on the topic.

Regards,

Joachim

• sara
May 12, 2020 4:02 am

Hi, thanks for your article. Can you provide any other published article for causing bias with replacing the mode in categorical missing values? Thanks

• Joachim
May 12, 2020 5:52 am

Hi Sara,

Thank you for the comment! For instance, have a look at Zhang 2016: “Imputations with mean, median and mode are simple but, like complete case analysis, can introduce bias on mean and deviation.”

Regards,

Joachim