Mode Imputation (How to Impute Categorical Variables Using R)

 

Mode imputation is easy to apply – but using it the wrong way might screw up the quality of your data.

In the following article, I’m going to show you how and when to use mode imputation.

Before we can start, a short definition:

Definition:
Mode imputation (or mode substitution) replaces missing values of a categorical variable with the mode (i.e. the most frequent category) of the non-missing cases of that variable.

 

Impute with Mode in R (Programming Example)

Imputing missing data by mode is quite easy. For this example, I’m using the statistical programming language R (RStudio). However, mode imputation can be conducted in essentially all statistical software packages, such as Python, SAS, Stata, SPSS and so on…

Consider the following example variable (i.e. vector in R):

set.seed(951)                           # Set seed
N <- 1000                               # Number of observations
vec <- round(runif(N, 0, 5))            # Create vector without missings
vec_miss <- vec                         # Replicate vector
vec_miss[rbinom(N, 1, 0.1) == 1] <- NA  # Insert missing values
 
table(vec_miss)                         # Count of each category
#  0   1   2   3   4   5 
# 86 183 207 170 174  90
 
sum(is.na(vec_miss))                    # Count of NA values
# 90

Our example vector consists of 1000 observations – 90 of them are NA (i.e. missing values).

Now let’s substitute these missing values via mode imputation. First, we need to determine the mode of our data vector:

val <- unique(vec_miss[!is.na(vec_miss)])              # Values in vec_miss
mode <- val[which.max(tabulate(match(vec_miss, val)))] # Mode of vec_miss

The mode of our variable is 2. With the following code, all missing values are replaced by 2 (i.e. the mode):

vec_imp <- vec_miss                                    # Replicate vec_miss
vec_imp[is.na(vec_imp)] <- mode                        # Impute by mode

That’s it. Imputation finished.
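For repeated use, the two steps above can be wrapped in a small helper function (a sketch; the function name impute_mode is my own, not a base R function):

```r
impute_mode <- function(x) {                      # Impute a vector by its mode
  val <- unique(x[!is.na(x)])                     # Observed values
  mode <- val[which.max(tabulate(match(x, val)))] # Most frequent value
  x[is.na(x)] <- mode                             # Replace missings by the mode
  x
}
 
vec_imp2 <- impute_mode(vec_miss)                 # Same result as the manual steps
```

The function works for numeric as well as character or factor-coded categories, since match() and tabulate() only rely on value identity, not on the vector type.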

But do the imputed values introduce bias to our data? I’m going to check this in the following…

 

Did we Screw it up?

Did the imputation degrade the quality of our data? The following graphic answers this question:

library("ggplot2")                                                 # Load ggplot2 package
 
missingness <- c(rep("No Missings", 6), rep("Post Imputation", 6)) # Pre/post imputation
Category <- as.factor(rep(names(table(vec)), 2))                   # Categories
Count <- c(as.numeric(table(vec)), as.numeric(table(vec_imp)))     # Count of categories
 
data_barplot <- data.frame(missingness, Category, Count)           # Combine data for plot
 
ggplot(data_barplot, aes(Category, Count, fill = missingness)) +   # Create plot
  geom_bar(stat = "identity", position = "dodge") + 
  scale_fill_brewer(palette = "Set2") +
  theme(legend.title = element_blank())

 


Graphic 1: Complete Example Vector (Before Insertion of Missings) vs. Imputed Vector

 

Graphic 1 reveals the issue of mode imputation:

The green bars reflect how our example vector was distributed before we inserted missing values. A perfect imputation method would reproduce the green bars.

However, after the application of mode imputation, the imputed vector (orange bars) differs a lot: category 2 is heavily over-represented, while all other categories are under-represented.

In other words: The distribution of our imputed data is highly biased!
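The bias is also easy to quantify without a plot. Comparing the category proportions before and after imputation shows the shift directly (a small check reusing the vectors vec and vec_imp created above):

```r
# vec and vec_imp are the vectors created in the code above
prop_true <- prop.table(table(vec))     # True category shares (complete data)
prop_imp  <- prop.table(table(vec_imp)) # Shares after mode imputation
 
round(prop_imp - prop_true, 3)          # Difference per category; positive
                                        # only for the mode category
```

Since every missing value is replaced by the same category, the mode gains roughly the share of the former missings, and all other categories shrink accordingly.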

 

Are There Better Alternatives?

As you have seen, mode imputation is usually not a good idea. The method should only be used if you have strong theoretical arguments (similar to mean imputation in case of continuous variables).

You might say: OK, got it! But what should I do instead?!

Recent research literature advises two imputation methods for categorical variables:

  1. Multinomial logistic regression imputation
  2. Predictive mean matching imputation

Multinomial logistic regression imputation is the method of choice for categorical target variables – whenever it is computationally feasible. However, if you want to impute a variable with too many categories, it might be impossible to use the method (for computational reasons). In this case, predictive mean matching imputation can help:

Predictive mean matching was originally designed for numerical variables. However, recent literature has shown that predictive mean matching also works well for categorical variables – especially when the categories are ordered (van Buuren & Groothuis-Oudshoorn, 2011). Even though predictive mean matching has to be used with care for categorical variables, it can be a good solution for computationally problematic imputations.
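Both methods are implemented in the R package mice, which uses the chained-equations approach of van Buuren & Groothuis-Oudshoorn (2011). A minimal sketch, assuming mice is installed; the data frame data_inc and the numeric predictor num_var are made up for illustration and reuse the vectors created above:

```r
# install.packages("mice")                 # Install once, if needed
library("mice")
 
data_inc <- data.frame(                    # vec, vec_miss from the code above
  cat_var = factor(vec_miss),              # Categorical target with missings
  num_var = vec + rnorm(length(vec)))      # Correlated numeric predictor
 
imp <- mice(data_inc,                      # Run multiple imputation
            method = c("polyreg", "pmm"),  # polyreg = multinomial logistic regression
            m = 5,                         # Number of imputed data sets
            seed = 951,
            printFlag = FALSE)
 
data_comp <- complete(imp)                 # Extract first completed data set
sum(is.na(data_comp$cat_var))              # No missings left
```

If cat_var had too many categories for polyreg, switching its method to "pmm" (treating the ordered categories as numeric) would be the computational fallback discussed above.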

 

I would like to hear your opinion!

I’ve shown you how mode imputation works, why it is usually not the best method for imputing your data, and what alternatives you could use.

Now, I’d love to hear about your experiences!

Have you already imputed via mode yourself? Would you do it again?

Leave me a comment below and let me know about your thoughts (questions are very welcome)!

 



 

References

van Buuren, S., and Groothuis-Oudshoorn, C. G. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3).

Appendix

How to create the header graphic? There you go:

par(bg = "#1b98e0")                                 # Background color
par(mar = c(0, 0, 0, 0))                            # Remove space around plot
 
N <- 5000                                           # Sample size
x <- round(runif(N, 1, 100))                        # Uniform distribution
 
x <- c(x, rep(60, 35))                              # Add some values equal to 60
 
hist_save <- hist(x, breaks = 100)                  # Save histogram
col <- cut(hist_save$breaks, c(- Inf, 58, 59, Inf)) # Colors of histogram
 
plot(hist_save,                                     # Plot histogram
     col = c("#353436", 
             "red", 
             "#353436")[col], 
     ylim = c(0, 110), 
     main = "", 
     xaxs = "i", 
     yaxs = "i")

 
