# Mode Imputation (How to Impute Categorical Variables Using R)

**Mode imputation** is easy to apply, but using it the wrong way might **screw up the quality** of your data.

In the following article, I’m going to show you **how and when** to use mode imputation.

Before we can start, a short definition:

**Definition:**

Mode imputation (or mode substitution) replaces missing values of a categorical variable by the mode of non-missing cases of that variable.


## Impute with Mode in R (Programming Example)

Imputing missing data by mode is quite easy. For this example, I’m using the **statistical programming language R** (RStudio). However, mode imputation can be conducted in essentially all software packages such as Python, SAS, Stata, SPSS and so on…

Consider the following example variable (i.e. vector in R):

    set.seed(951)                                # Set seed
    N <- 1000                                    # Number of observations
    vec <- round(runif(N, 0, 5))                 # Create vector without missings
    vec_miss <- vec                              # Replicate vector
    vec_miss[rbinom(N, 1, 0.1) == 1] <- NA       # Insert missing values

    table(vec_miss)                              # Count of each category
    #   0   1   2   3   4   5
    #  86 183 207 170 174  90

    sum(is.na(vec_miss))                         # Count of NA values
    # 90

Our example vector consists of 1000 observations – 90 of them are NA (i.e. missing values).

Now let’s substitute these missing values via mode imputation. First, we need to determine the mode of our data vector:

    val <- unique(vec_miss[!is.na(vec_miss)])                  # Values in vec_miss
    mode <- val[which.max(tabulate(match(vec_miss, val)))]     # Mode of vec_miss

The mode of our variable is 2. With the following code, all missing values are replaced by 2 (i.e. the mode):

    vec_imp <- vec_miss                          # Replicate vec_miss
    vec_imp[is.na(vec_imp)] <- mode              # Impute by mode
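The two steps above can also be wrapped into a small reusable function. The sketch below is just one possible way to do this: `get_mode()` and `impute_mode()` are hypothetical helper names of my own (they are not part of base R), and if several categories are tied, the one that appears first in the data is used.

```r
# Hypothetical helper: most frequent non-missing value of x
get_mode <- function(x) {
  val <- unique(x[!is.na(x)])                 # Observed values
  val[which.max(tabulate(match(x, val)))]     # First most frequent value
}

# Hypothetical helper: replace NA values by the mode of the observed values
impute_mode <- function(x) {
  x[is.na(x)] <- get_mode(x)
  x
}

x <- c("a", "b", "b", NA, "c", "b", NA)       # Small example vector
impute_mode(x)
# "a" "b" "b" "b" "c" "b" "b"
```

Keeping the mode calculation in its own function also makes it easy to inspect which value will be imputed before overwriting anything.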

That’s it. Imputation finished.

But do the imputed values introduce bias to our data? I’m going to check this in the following…

## Did we Screw it up?

Did the imputation degrade the quality of our data? The following graphic answers this question:

    library("ggplot2")                                                   # Load ggplot2 (required for the plot)

    missingness <- c(rep("No Missings", 6), rep("Post Imputation", 6))  # Pre/post imputation
    Category <- as.factor(rep(names(table(vec)), 2))                    # Categories
    Count <- c(as.numeric(table(vec)), as.numeric(table(vec_imp)))      # Count of categories
    data_barplot <- data.frame(missingness, Category, Count)            # Combine data for plot

    ggplot(data_barplot, aes(Category, Count, fill = missingness)) +    # Create plot
      geom_bar(stat = "identity", position = "dodge") +
      scale_fill_brewer(palette = "Set2") +
      theme(legend.title = element_blank())

**Graphic 1: Complete Example Vector (Before Insertion of Missings) vs. Imputed Vector**

Graphic 1 reveals the **issue of mode imputation**:

The green bars reflect how our example vector was distributed before we inserted missing values. A perfect imputation method would reproduce the green bars.

However, after the application of mode imputation, the imputed vector (orange bars) differs a lot. While **category 2 is highly over-represented**, all other categories are under-represented.

In other words: The distribution of our **imputed data is highly biased**!
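To put a number on what the graphic shows, the category shares of the complete and the imputed vector can be compared directly. The following is a minimal sketch that rebuilds the example data from above so that it runs on its own. The object names `mode_val`, `share_complete`, `share_imputed` and `tvd` are my own additions for illustration, and the total variation distance is just one of several possible summary measures.

```r
set.seed(951)
N <- 1000
vec <- round(runif(N, 0, 5))                         # Complete vector
vec_miss <- vec
vec_miss[rbinom(N, 1, 0.1) == 1] <- NA               # Insert missing values

val <- unique(vec_miss[!is.na(vec_miss)])            # Observed values
mode_val <- val[which.max(tabulate(match(vec_miss, val)))]
vec_imp <- vec_miss
vec_imp[is.na(vec_imp)] <- mode_val                  # Impute by mode

lev <- sort(unique(vec))                             # All categories
share_complete <- prop.table(table(factor(vec, levels = lev)))
share_imputed  <- prop.table(table(factor(vec_imp, levels = lev)))

round(rbind(share_complete, share_imputed), 3)       # Shares side by side

# Total variation distance: half the sum of absolute share differences
tvd <- sum(abs(share_imputed - share_complete)) / 2
tvd
```

A value of 0 would mean that the imputed vector reproduces the original distribution exactly; the larger the value, the more probability mass has been shifted towards the mode.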

## Are There Better Alternatives?

As you have seen, mode imputation is usually not a good idea. The method should only be used if you have strong theoretical arguments for it (similar to mean imputation in the case of continuous variables).

You might say: *OK, got it! But what should I do instead?!*

Recent research literature advises two imputation methods for categorical variables:

- **Multinomial logistic regression imputation**
- **Predictive mean matching imputation**

Multinomial logistic regression imputation is the method of choice for categorical target variables – whenever it is computationally feasible. However, if you want to impute a variable with **too many categories**, it might be impossible to use the method (due to computational reasons). In this case, predictive mean matching imputation can help:

Predictive mean matching was originally **designed for numerical variables**. However, recent literature has shown that predictive mean matching also works well for categorical variables, especially when the categories are ordered (van Buuren & Groothuis-Oudshoorn, 2011). Even though predictive mean matching has to be **used with care** for categorical variables, it can be a good solution for computationally problematic imputations.
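For readers who want to try both alternatives, here is a rough sketch using the `mice` package described in the cited paper. It assumes that `mice` is installed. The toy data frame `dat` and its variables `y`, `x1` and `x2` are made up purely for illustration, and the settings (`m = 5`, the seeds) are arbitrary choices rather than recommendations.

```r
library(mice)                                    # Assumes the mice package is installed

set.seed(951)
n  <- 300
x1 <- rnorm(n)                                   # Illustrative predictors
x2 <- rnorm(n)
y  <- cut(x1 + rnorm(n),                         # Illustrative categorical target
          breaks = c(-Inf, -1, 0, 1, Inf),
          labels = c("a", "b", "c", "d"))
dat <- data.frame(y, x1, x2)
dat$y[rbinom(n, 1, 0.1) == 1] <- NA              # Insert missing values in y

meth <- make.method(dat)                         # Default method per variable

meth["y"] <- "polyreg"                           # Multinomial logistic regression
imp_polyreg <- mice(dat, m = 5, method = meth, seed = 1, printFlag = FALSE)

meth["y"] <- "pmm"                               # Predictive mean matching
imp_pmm <- mice(dat, m = 5, method = meth, seed = 1, printFlag = FALSE)

dat_polyreg <- complete(imp_polyreg, 1)          # First imputed data set
dat_pmm     <- complete(imp_pmm, 1)

table(dat_polyreg$y)                             # Category counts after imputation
table(dat_pmm$y)
```

The sketch extracts only the first of the `m` imputed data sets to keep things short. In a real analysis you would typically analyse all imputed data sets and pool the results.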

## I would like to hear your opinion!

I’ve shown you how mode imputation works, why it is usually not the best method for imputing your data, and what alternatives you could use.

Now, I’d love to hear about your experiences!

Have you already imputed via mode yourself? Would you do it again?

Leave me a comment below and let me know your thoughts (questions are very welcome)!

## References

van Buuren, S., and Groothuis-Oudshoorn, C. G. (2011). MICE: Multivariate Imputation by Chained Equations in R. *Journal of Statistical Software*, 45(3).

## Appendix

How to create the header graphic? There you go:

    par(bg = "#1b98e0")                                  # Background color
    par(mar = c(0, 0, 0, 0))                             # Remove space around plot
    N <- 5000                                            # Sample size
    x <- round(runif(N, 1, 100))                         # Uniform distribution
    x <- c(x, rep(60, 35))                               # Add some values equal to 60
    hist_save <- hist(x, breaks = 100)                   # Save histogram
    col <- cut(hist_save$breaks, c(- Inf, 58, 59, Inf))  # Colors of histogram
    plot(hist_save,                                      # Plot histogram
         col = c("#353436", "red", "#353436")[col],
         ylim = c(0, 110),
         main = "",
         xaxs = "i", yaxs = "i")

## 2 Comments

Mean and mode imputation may be used when there is strong theoretical justification. What can those justifications be? Can you please provide some examples?

Thank you very much for your well-written blog on statistical concepts, which are pre-digested to suit students and those of us who are not statisticians.

Cheers.

Hi Ugyen,

Thank you for your question and the nice compliment!

In practice, mean/mode imputation is almost never the best option. If you don’t know by design that the missing values are always equal to the mean/mode, you shouldn’t use it.

You may also have a look at this thread on Cross Validated to get more information on the topic.

Regards,

Joachim