# Mode Imputation (How to Impute Categorical Variables Using R)

Mode imputation is easy to apply – but using it the wrong way can ruin the quality of your data.

In the following article, I’m going to show you how and when to use mode imputation.

Before we can start, a short definition:

Definition:
Mode imputation (or mode substitution) replaces missing values of a categorical variable by the mode of non-missing cases of that variable.

## Impute with Mode in R (Programming Example)

Imputing missing data by the mode is quite easy. For this example, I’m using the statistical programming language R (RStudio). However, mode imputation can be conducted in essentially all statistical software packages, such as Python, SAS, Stata, SPSS, and so on.

Consider the following example variable (i.e. vector in R):

```r
set.seed(951)                           # Set seed
N <- 1000                               # Number of observations
vec <- round(runif(N, 0, 5))            # Create vector without missings
vec_miss <- vec                         # Replicate vector
vec_miss[rbinom(N, 1, 0.1) == 1] <- NA  # Insert missing values

table(vec_miss)                         # Count of each category
#  0   1   2   3   4   5
# 86 183 207 170 174  90

sum(is.na(vec_miss))                    # Count of NA values
# 90
```

Our example vector consists of 1000 observations – 90 of them are NA (i.e. missing values).

Now let’s substitute these missing values via mode imputation. First, we need to determine the mode of our data vector:

```r
val <- unique(vec_miss[!is.na(vec_miss)])                   # Values in vec_miss
my_mode <- val[which.max(tabulate(match(vec_miss, val)))]   # Mode of vec_miss
```

The mode of our variable is 2. With the following code, all missing values are replaced by 2 (i.e. the mode):

```r
vec_imp <- vec_miss                                    # Replicate vec_miss
vec_imp[is.na(vec_imp)] <- my_mode                     # Impute by mode
```

That’s it. Imputation finished.
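The two steps above can also be wrapped into a small helper function. This is just a sketch; the function name `impute_mode` is my own, not a base R function:

```r
impute_mode <- function(x) {
  val <- unique(x[!is.na(x)])                          # Observed values
  mode_val <- val[which.max(tabulate(match(x, val)))]  # Most frequent observed value
  x[is.na(x)] <- mode_val                              # Replace NA by the mode
  x
}

x <- c(1, 2, 2, 3, NA, 2, NA)  # Small example with two missings
impute_mode(x)                 # NA values become 2
```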

But do the imputed values introduce bias to our data? I’m going to check this in the following…

## Did we Screw it up?

Did the imputation degrade the quality of our data? The following graphic answers this question:

```r
library("ggplot2")

missingness <- c(rep("No Missings", 6), rep("Post Imputation", 6))  # Pre/post imputation
Category <- as.factor(rep(names(table(vec)), 2))                    # Categories
Count <- c(as.numeric(table(vec)), as.numeric(table(vec_imp)))      # Count of categories

data_barplot <- data.frame(missingness, Category, Count)            # Combine data for plot

ggplot(data_barplot, aes(Category, Count, fill = missingness)) +    # Create plot
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = "Set2") +
  theme(legend.title = element_blank())
```

Graphic 1: Complete Example Vector (Before Insertion of Missings) vs. Imputed Vector

Graphic 1 reveals the issue of mode imputation:

The green bars reflect how our example vector was distributed before we inserted missing values. A perfect imputation method would reproduce the green bars.

However, after mode imputation, the imputed vector (orange bars) looks very different: category 2 is strongly over-represented, while all other categories are under-represented.

In other words: The distribution of our imputed data is highly biased!

## Are There Better Alternatives?

As you have seen, mode imputation is usually not a good idea. The method should only be used if you have strong theoretical arguments (similar to mean imputation in the case of continuous variables).

You might say: OK, got it! But what should I do instead?!

Recent research literature advises two imputation methods for categorical variables:

1. Multinomial logistic regression imputation
2. Predictive mean matching imputation

Multinomial logistic regression imputation is the method of choice for categorical target variables – whenever it is computationally feasible. However, if you want to impute a variable with many categories, the method might be impossible to apply for computational reasons. In this case, predictive mean matching imputation can help.

Predictive mean matching was originally designed for numerical variables. However, recent literature has shown that predictive mean matching also works well for categorical variables – especially when the categories are ordered (van Buuren & Groothuis-Oudshoorn, 2011). Even though predictive mean matching has to be used with care for categorical variables, it can be a good solution for computationally problematic imputations.
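Both alternatives are implemented in the mice package. The following is a minimal sketch of a polyreg-based imputation; the variable names and the simulated data are my own, for illustration only (mice also requires the nnet package for polyreg):

```r
library("mice")

set.seed(159)                                    # Simulated example data
data <- data.frame(cat = factor(sample(letters[1:3], 100, replace = TRUE)),
                   num = rnorm(100))
data$cat[rbinom(100, 1, 0.1) == 1] <- NA         # Insert missing values

imp <- mice(data,                                # polyreg for cat, pmm for num
            method = c("polyreg", "pmm"),
            m = 5, printFlag = FALSE)

data_comp <- complete(imp)                       # First completed data set
sum(is.na(data_comp))                            # No missing values left
```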

## I would like to hear your opinion!

I’ve shown you how mode imputation works, why it is usually not the best method for imputing your data, and what alternatives you could use.

Now, I’d love to hear from your experiences!

Have you already imputed via mode yourself? Would you do it again?

Leave me a comment below and let me know about your thoughts (questions are very welcome)!

## References

van Buuren, S., & Groothuis-Oudshoorn, C. G. M. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1–67.

## Appendix

How to create the header graphic? There you go:

```r
par(bg = "#1b98e0")                         # Background color
par(mar = c(0, 0, 0, 0))                    # Remove space around plot

N <- 5000                                   # Sample size
x <- round(runif(N, 1, 100))                # Uniform distribution

x <- c(x, rep(60, 35))                      # Add some values equal to 60

hist_save <- hist(x, breaks = 100)          # Save histogram
col <- cut(hist_save$breaks, c(-Inf, 58, 59, Inf))  # Colors of histogram

plot(hist_save,                             # Plot histogram
     col = c("#353436", "red", "#353436")[col],
     ylim = c(0, 110),
     main = "",
     xaxs = "i",
     yaxs = "i")
```


• Ugyen Norbu
November 9, 2019 4:08 am

Mean and mode imputation may be used when there is strong theoretical justification. What can those justifications be? Can you please provide some examples.

Thank you very much for your well-written blog on statistical concepts that are pre-digested to suit students and those of us who are not statisticians.

Cheers.

• November 18, 2019 8:27 am

Hi Ugyen,

Thank you for your question and the nice compliment!

In practice, mean/mode imputation are almost never the best option. If you don’t know by design that the missing values are always equal to the mean/mode, you shouldn’t use it.

You may also have a look at this thread on Cross Validated to get more information on the topic.

Regards,

Joachim

• sara
May 12, 2020 4:02 am

Hi, thanks for your article. Can you provide any other published article for causing bias with replacing the mode in categorical missing values? Thanks

• May 12, 2020 5:52 am

Hi Sara,

Thank you for the comment! For instance, have a look at Zhang 2016: “Imputations with mean, median and mode are simple but, like complete case analysis, can introduce bias on mean and deviation.”

Regards,

Joachim

• Ismail
October 28, 2020 1:51 am

Hi Joachim. What do you think about random sample imputation for categorical variables? Not randomly drawing from any old uniform or normal distribution, but drawing from the specific distribution of the categories in the variable itself.

As a simple example, consider the Gender variable with 100 observations. Male has 64 instances, Female has 16 instances and there are 20 missing instances. Before imputation, 80% of non-missing data are Male (64/80) and 20% of non-missing data are Female (16/80). After variable-specific random sample imputation (so drawing from the 80% Male 20% Female distribution), we could have maybe 80 Male instances and 20 Female instances.

Imputing this way by randomly sampling from the specific distribution of non-missing data results in very similar distributions before and after imputation. If mode imputation was used instead, there would be 84 Male and 16 Female instances. More biased towards the mode instead of preserving the original distribution.

My question is: is this a valid way of imputing categorical variables? What are its strengths and limitations?
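For concreteness, the variable-specific random draw described in this comment can be sketched as follows (simulated data matching the example; see the reply below for the drawbacks of this approach):

```r
set.seed(2020)
gender <- c(rep("Male", 64), rep("Female", 16), rep(NA, 20))   # 80/20 observed split

obs <- gender[!is.na(gender)]                  # Observed cases
n_miss <- sum(is.na(gender))                   # Number of missing values

gender[is.na(gender)] <- sample(obs, n_miss,   # Draw imputations from the
                                replace = TRUE)  # observed category distribution

prop.table(table(gender))                      # Roughly preserves the 80/20 split
```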

• October 28, 2020 7:54 am

Hi Ismail,

Thank you for you comment! The advantage of random sample imputation vs. mode imputation is (as you mentioned) that it preserves the univariate distribution of the imputed variable. However, there are two major drawbacks:

1) You are not accounting for systematic missingness. Assume that females are more likely to respond to your questionnaire. This would lead to a biased distribution of males/females (i.e. too many females). This is already a problem in your observed data. By imputing the missing values based on this biased distribution you are introducing even more bias. Have a look at the “response mechanisms” MCAR, MAR, and MNAR.

2) You are introducing bias to the multivariate distributions. For instance, assume that you have a data set with sports data and in the observed cases males are faster runners than females. If you are imputing the gender variable randomly, the correlation between gender and running speed in your imputed data will be zero and hence the overall correlation will be estimated too low.

For those reasons, I recommend to consider polytomous logistic regression. Have a look at the mice package of the R programming language and the mice() function. Within this function, you’d have to specify the method argument to be equal to “polyreg”.

I hope that helps!

Joachim

• inzamam ul haq
February 1, 2021 4:13 pm

My mode function predicted values belonging to one class for classification data… what should I do now?

• February 1, 2021 5:14 pm

Hi Inzamam,

Could you tell me some more details about your data? What does the input data look like?

Regards,

Joachim

• Alassane
October 11, 2021 10:27 am

Thank you Joachim for this article. In my experience, I have imputed by mode and mean, and I made the same observation as described here.

• October 11, 2021 4:33 pm

Hey Alassane,

Thank you for your comment! 🙂

In most cases, I’d recommend using other methods such as predictive mean matching imputation and hot deck imputation. However, this depends strongly on your specific data.

Regards

Joachim

• Sam
December 3, 2021 12:54 pm

I use the multiple imputation approach (via the mice package), using the “polyreg” option. I will take a keen interest in evaluating the distribution of the imputed values.

• December 3, 2021 1:21 pm

Hey Sam,

Your approach is usually a good choice, and basically always better than a simple mode imputation.

Regards,
Joachim

• Harry
January 23, 2022 5:48 pm

Thank you for the illustration. Is there an efficient way for mode imputation while grouping by several other categorical variables in a dataframe? I’ve been trying to do so with dplyr but haven’t been able to make it work.

• January 24, 2022 11:53 am

Hey Harry,

Please have a look at the example code below. This code can be improved a lot in terms of efficiency. However, it should give you a good basis:

```r
set.seed(32786)                                # Create example data
data <- data.frame(values = sample(1:3, 24, replace = TRUE),
                   group1 = LETTERS[1:3],
                   group2 = letters[1:2])

library("dplyr")

data_new <- data %>%                           # Create ID column
  mutate(ID = group_indices_(data, .dots = c("group1", "group2")))

library("DescTools")

data_new2 <- data_new                          # Add mode by group
data_new2$Mode <- NA

data_new2$Mode[data_new$ID == 1] <- Mode(data_new$values[data_new$ID == 1])
data_new2$Mode[data_new$ID == 2] <- Mode(data_new$values[data_new$ID == 2])
data_new2$Mode[data_new$ID == 3] <- Mode(data_new$values[data_new$ID == 3])
data_new2$Mode[data_new$ID == 4] <- Mode(data_new$values[data_new$ID == 4])
data_new2$Mode[data_new$ID == 5] <- Mode(data_new$values[data_new$ID == 5])
data_new2$Mode[data_new$ID == 6] <- Mode(data_new$values[data_new$ID == 6])

data_new2                                      # Print final output
#    values group1 group2 ID Mode
# 1       2      A      a  1    2
# 2       3      B      b  4    3
# 3       3      C      a  5    2
# 4       1      A      b  2    1
# 5       2      B      a  3    2
# 6       3      C      b  6    3
# 7       3      A      a  1    2
# 8       3      B      b  4    3
# 9       2      C      a  5    3
# 10      1      A      b  2    1
# 11      2      B      a  3    3
# 12      3      C      b  6    3
# 13      2      A      a  1    2
# 14      3      B      b  4    3
# 15      3      C      a  5    2
# 16      1      A      b  2    1
# 17      3      B      a  3    2
# 18      3      C      b  6    3
# 19      1      A      a  1    2
# 20      3      B      b  4    3
# 21      2      C      a  5    3
# 22      2      A      b  2    1
# 23      3      B      a  3    3
# 24      2      C      b  6    3
```
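As mentioned above, this code can be made more concise. A sketch of the same group-wise mode using a grouped mutate (still assuming the Mode function of the DescTools package; the `[1]` guards against ties, where Mode returns more than one value):

```r
library("dplyr")
library("DescTools")

set.seed(32786)                                # Same example data as above
data <- data.frame(values = sample(1:3, 24, replace = TRUE),
                   group1 = LETTERS[1:3],
                   group2 = letters[1:2])

data_new <- data %>%                           # Mode within each group combination
  group_by(group1, group2) %>%
  mutate(Mode = Mode(values)[1]) %>%           # [1] keeps a single value in case of ties
  ungroup()
```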

I hope that helps!

Joachim

• Lola
March 16, 2022 1:32 am

Hi Joachim,
How do you impute the mode of the categorical column containing missing cells already replaced by NA e.g. In a column of red and blue with the former being the most frequent, how do you impute red in this case to fill in the NA cells? Will be grateful if you could shed some light on this. Thank you.

• March 16, 2022 8:19 am

Hey Lola,

Could you provide some example data or illustrate the structure of your data in some more detail? I’m afraid I don’t understand your question properly.

Regards,
Joachim

• Mahmoud
March 27, 2022 5:37 pm

Hallo Joachim,

Thank you so much for the explanation; I am now a big fan of your YouTube channel and Statistics Globe. In fact, I have learned a lot here, and I am not exaggerating when I say that I have learned more here, especially in practice, than in some statistics seminars. Thank you so much 🙂 I have a small question: in my dataset I have some binary variables like sex (male, female) and social background (migrant or native). The observations are already coded as numbers, like 1 for male and 2 for female, and the same for social background. I checked the type of the variable with the R typeof command and it says integer. Should I then recode them as factors and change the values to Male/Female before using multinomial logistic regression imputation or PMM imputation? In general, I am using PMM imputation for all variables in my dataset.

Thank you so much

• March 28, 2022 7:31 am

Hey Mahmoud,

First of all: Thank you so much for the very kind feedback! It’s really great to hear that my tutorials are helpful to you! 🙂

I’ll partly copy/paste my response to your other question on YouTube, because I think it’s relevant to this question as well:

Usually, I would use predictive mean matching only for numeric variables and polytomous logistic regression (method “polyreg” in the mice package) for categorical variables. If your categorical variables are ordered, you may also use predictive mean matching for them. However, it is an ongoing discussion whether predictive mean matching is preferable to polyreg. For unordered categorical variables or for binary variables, I would not use predictive mean matching, but other methods such as polyreg or hot deck imputation.

Your binary variables should be coded as factors instead of integers (it doesn’t matter if it’s 1/0 or male/female). If this is done correctly, the mice package should automatically choose a different method (i.e. “logreg”) than pmm for these variables.

I have created a reproducible example on how such an imputation could look like:

```r
set.seed(34943475)
data <- data.frame(x1 = factor(c(1, 2, 2, 1, NA, 2, 2, NA, 1)),
                   x2 = c(rnorm(8), NA),
                   x3 = c(NA, rnorm(8)))
data
#     x1          x2         x3
# 1    1 -1.42907958         NA
# 2    2 -0.41545391 -0.8883570
# 3    2 -0.95932054 -0.1557562
# 4    1 -0.04283036  0.3339744
# 5 <NA>  3.34753672 -0.3376120
# 6    2  1.14095821 -1.4277056
# 7    2  0.52324218 -0.1441450
# 8 <NA> -1.19379113  1.5405276
# 9    1          NA -0.7328813

library("mice")

data_imp <- mice(data)
data_imp
# Class: mids
# Number of multiple imputations:  5
# Imputation methods:
#   x1       x2       x3
# "logreg"    "pmm"    "pmm"
# PredictorMatrix:
#   x1 x2 x3
# x1  0  1  1
# x2  1  0  1
# x3  1  1  0

data_comp <- complete(data_imp, action = "broad")
data_comp
#   x1.1        x2.1       x3.1 x1.2        x2.2       x3.2 x1.3        x2.3       x3.3 x1.4        x2.4       x3.4 x1.5        x2.5       x3.5
# 1    1 -1.42907958 -0.8883570    1 -1.42907958 -0.3376120    1 -1.42907958 -0.8883570    1 -1.42907958 -0.8883570    1 -1.42907958 -0.1557562
# 2    2 -0.41545391 -0.8883570    2 -0.41545391 -0.8883570    2 -0.41545391 -0.8883570    2 -0.41545391 -0.8883570    2 -0.41545391 -0.8883570
# 3    2 -0.95932054 -0.1557562    2 -0.95932054 -0.1557562    2 -0.95932054 -0.1557562    2 -0.95932054 -0.1557562    2 -0.95932054 -0.1557562
# 4    1 -0.04283036  0.3339744    1 -0.04283036  0.3339744    1 -0.04283036  0.3339744    1 -0.04283036  0.3339744    1 -0.04283036  0.3339744
# 5    1  3.34753672 -0.3376120    1  3.34753672 -0.3376120    2  3.34753672 -0.3376120    1  3.34753672 -0.3376120    1  3.34753672 -0.3376120
# 6    2  1.14095821 -1.4277056    2  1.14095821 -1.4277056    2  1.14095821 -1.4277056    2  1.14095821 -1.4277056    2  1.14095821 -1.4277056
# 7    2  0.52324218 -0.1441450    2  0.52324218 -0.1441450    2  0.52324218 -0.1441450    2  0.52324218 -0.1441450    2  0.52324218 -0.1441450
# 8    2 -1.19379113  1.5405276    1 -1.19379113  1.5405276    2 -1.19379113  1.5405276    2 -1.19379113  1.5405276    1 -1.19379113  1.5405276
# 9    1  1.14095821 -0.7328813    1 -1.42907958 -0.7328813    1 -1.42907958 -0.7328813    1 -0.04283036 -0.7328813    1 -0.41545391 -0.7328813
```

Above, you can see a multiply imputed data set. The variable x1 was imputed by logreg, and the variables x2 and x3 by pmm.

Regards,
Joachim

• Ann_96
August 31, 2022 8:19 am

Very good article! Thank you!!
I wanted to ask: I am struggling with a health-care-related dataset with 14 numeric variables, about 30% of which are missing. I want to fill in the missing data in order to use the variables in a clustering technique.
I have tried mice, but the furthest I can get is the first step of creating the imputed datasets. After that, I cannot pool my results, since I do not have a response variable. I have read about maximum likelihood imputation and I don’t know whether this could be an option. Thank you in advance!

• September 2, 2022 8:36 am

Hi Ann,