Mode Imputation (How to Impute Categorical Variables Using R)
Mode imputation is easy to apply, but using it the wrong way might screw up the quality of your data.
In the following article, I’m going to show you how and when to use mode imputation.
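Before we start, a short definition: mode imputation replaces each missing value of a variable with the mode, i.e. the most frequent observed value of that variable.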
Impute with Mode in R (Programming Example)
Imputing missing data by mode is quite easy. For this example, I’m using the statistical programming language R (RStudio). However, mode imputation can be conducted in essentially all software packages such as Python, SAS, Stata, SPSS and so on…
Consider the following example variable (i.e. vector in R):
set.seed(951)                            # Set seed
N <- 1000                                # Number of observations
vec <- round(runif(N, 0, 5))             # Create vector without missings
vec_miss <- vec                          # Replicate vector
vec_miss[rbinom(N, 1, 0.1) == 1] <- NA   # Insert missing values

table(vec_miss)                          # Count of each category
#   0   1   2   3   4   5
#  86 183 207 170 174  90

sum(is.na(vec_miss))                     # Count of NA values
# 90
Now let's substitute these missing values via mode imputation. First, we need to determine the mode of our data vector:
val <- unique(vec_miss[!is.na(vec_miss)])                   # Values in vec_miss
my_mode <- val[which.max(tabulate(match(vec_miss, val)))]   # Mode of vec_miss
The mode of our variable is 2. With the following code, all missing values are replaced by 2 (i.e. the mode):
vec_imp <- vec_miss                  # Replicate vec_miss
vec_imp[is.na(vec_imp)] <- my_mode   # Impute by mode
That’s it. Imputation finished.
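If you need this more than once, the two steps above can be wrapped into small functions. The following is only a sketch: get_mode() and impute_mode() are hypothetical helper names (not part of base R or any package), and the toy vector colour is made up for illustration. Ties are resolved in favour of the value that appears first in the data.

```r
# Hypothetical helper: most frequent non-missing value of x.
# If several values are equally frequent, the one that occurs first in x wins.
get_mode <- function(x) {
  val <- unique(x[!is.na(x)])               # Observed values
  val[which.max(tabulate(match(x, val)))]   # Value with the highest count
}

# Hypothetical helper: replace every NA in x by the mode of x.
impute_mode <- function(x) {
  x[is.na(x)] <- get_mode(x)
  x
}

colour <- c("red", "blue", NA, "blue", "green", NA)   # Toy character vector
impute_mode(colour)
# "red" "blue" "blue" "blue" "green" "blue"
```

The same functions also work for factors and numeric vectors, since they rely only on unique(), match() and is.na().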
But do the imputed values introduce bias to our data? I’m going to check this in the following…
Did we Screw it up?
Did the imputation degrade the quality of our data? The following graphic answers this question:
library(ggplot2)                                                     # Load ggplot2 for plotting

missingness <- c(rep("No Missings", 6), rep("Post Imputation", 6))   # Pre/post imputation
Category <- as.factor(rep(names(table(vec)), 2))                     # Categories
Count <- c(as.numeric(table(vec)), as.numeric(table(vec_imp)))       # Count of categories
data_barplot <- data.frame(missingness, Category, Count)             # Combine data for plot

ggplot(data_barplot, aes(Category, Count, fill = missingness)) +     # Create plot
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = "Set2") +
  theme(legend.title = element_blank())
Graphic 1: Complete Example Vector (Before Insertion of Missings) vs. Imputed Vector
Graphic 1 reveals the issue of mode imputation:
The green bars reflect how our example vector was distributed before we inserted missing values. A perfect imputation method would reproduce the green bars.
However, after the application of mode imputation, the imputed vector (orange bars) differs a lot. While category 2 is highly overrepresented, all other categories are underrepresented.
In other words: The distribution of our imputed data is highly biased!
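The same comparison can also be made numerically. The sketch below recreates the example vectors from above and puts the relative frequencies before the insertion of missings next to those after mode imputation; the object names lev, prop_true, prop_imp and comparison are my own choices for illustration.

```r
set.seed(951)                            # Same seed as in the example above
N <- 1000
vec <- round(runif(N, 0, 5))             # Complete vector
vec_miss <- vec
vec_miss[rbinom(N, 1, 0.1) == 1] <- NA   # Insert missing values

val <- unique(vec_miss[!is.na(vec_miss)])
my_mode <- val[which.max(tabulate(match(vec_miss, val)))]
vec_imp <- vec_miss
vec_imp[is.na(vec_imp)] <- my_mode       # Mode imputation

lev <- 0:5                               # All possible categories
prop_true <- prop.table(table(factor(vec, levels = lev)))       # Shares before missings
prop_imp  <- prop.table(table(factor(vec_imp, levels = lev)))   # Shares after imputation

comparison <- data.frame(
  Category   = lev,
  Complete   = as.numeric(prop_true),
  Imputed    = as.numeric(prop_imp),
  Difference = as.numeric(prop_imp - prop_true)
)
print(round(comparison, 3))
```

By construction, only the mode category can have a positive difference: every missing value is assigned to it, while no other category can gain observations.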
Are There Better Alternatives?
You might say: OK, got it! But what should I do instead?!
The recent research literature recommends two imputation methods for categorical variables:
- Multinomial logistic regression imputation
- Predictive mean matching imputation
Multinomial logistic regression imputation is the method of choice for categorical target variables – whenever it is computationally feasible. However, if you want to impute a variable with too many categories, it might be impossible to use the method (due to computational reasons). In this case, predictive mean matching imputation can help:
Predictive mean matching was originally designed for numerical variables. However, recent literature has shown that predictive mean matching also works well for categorical variables – especially when the categories are ordered (van Buuren & Groothuis-Oudshoorn, 2011). Even though predictive mean matching has to be used with care for categorical variables, it can be a good solution for computationally problematic imputations.
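Both methods are available in the mice package cited above. The following is a minimal sketch, assuming mice is installed; the toy data frame dat and its variables group, rating, x1 and x2 are made up for illustration. The unordered variable is imputed with method "polyreg" (multinomial logistic regression), the ordered one with "pmm" (predictive mean matching), and the complete predictors get an empty method.

```r
library(mice)   # Assumption: installed via install.packages("mice")

set.seed(123)
n  <- 300
x1 <- rnorm(n)   # Complete numeric predictors
x2 <- rnorm(n)

# Hypothetical toy data: one unordered and one ordered categorical variable
group  <- cut(x1 + rnorm(n), breaks = c(-Inf, -0.5, 0.5, Inf),
              labels = c("a", "b", "c"))
rating <- cut(x2 + rnorm(n), breaks = c(-Inf, -1, 0, 1, Inf),
              labels = c("low", "medium", "high", "very high"),
              ordered_result = TRUE)

group[rbinom(n, 1, 0.2) == 1]  <- NA   # Insert missing values
rating[rbinom(n, 1, 0.2) == 1] <- NA

dat <- data.frame(group, rating, x1, x2)

# One method per column: polyreg for group, pmm for rating, none for x1 and x2
imp <- mice(dat, m = 5, method = c("polyreg", "pmm", "", ""),
            seed = 1, printFlag = FALSE)

dat_imp <- mice::complete(imp, 1)   # First of the five imputed data sets
colSums(is.na(dat_imp))             # Remaining NA values per column
```

Note that mice creates m = 5 imputed data sets here. For a proper analysis you would usually fit your model on all of them (with()) and combine the results (pool()) instead of using only the first one.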
I would like to hear your opinion!
I’ve shown you how mode imputation works, why it is usually not the best method for imputing your data, and what alternatives you could use.
Now, I’d love to hear about your experiences!
Have you already imputed via mode yourself? Would you do it again?
Leave me a comment below and let me know your thoughts (questions are very welcome)!
van Buuren, S., and Groothuis-Oudshoorn, C. G. (2011). MICE: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3).
How to create the header graphic? There you go:
par(bg = "#1b98e0")                                  # Background color
par(mar = c(0, 0, 0, 0))                             # Remove space around plot
N <- 5000                                            # Sample size
x <- round(runif(N, 1, 100))                         # Uniform distribution
x <- c(x, rep(60, 35))                               # Add some values equal to 60
hist_save <- hist(x, breaks = 100)                   # Save histogram
col <- cut(hist_save$breaks, c(-Inf, 58, 59, Inf))   # Colors of histogram
plot(hist_save,                                      # Plot histogram
     col = c("#353436", "red", "#353436")[col],
     ylim = c(0, 110),
     main = "",
     xaxs = "i",
     yaxs = "i")