Imputation Methods (Top 5 Popularity Ranking)

 

Which technique should I use to handle my missing values? It is a question that almost every data user has probably asked at some point…

Typical answer: You have to use missing data imputation, otherwise your results might be biased!

OK, so far so good. But which of the imputation methods should I use? There are a bunch of approaches out there, and sometimes it seems like everybody is using a different methodology.

To bring some clarity to the field of missing data treatment, this article investigates which imputation methods other statisticians and data scientists actually use.

More precisely, I’m going to investigate the popularity of the following five methods: listwise deletion, mean imputation, regression imputation, predictive mean matching, and hot deck imputation.

Note: Listwise deletion is technically not an imputation method. However, since it is used quite often in practice, I included it in this comparison.

Furthermore, I assume that you already know how these five missing data methods work. If not, have a look at the detailed tutorials for each method first.

So, let’s move on to the driving question of this article…

 

Which is the most popular imputation method among other researchers and data users?

To investigate this question, I analyzed Google Scholar search results. For each of the five methods, I checked how many search results appeared for each year since 2000. For instance, I restricted the search for “mean imputation” OR “mean substitution” to 2018, then to 2017, then to 2016, and so on…

The result is shown in Graphic 1, and I can tell you, it surprised me:

 


Graphic 1: Comparison of the Popularity of Different Imputation Methods since the Year 2000.


As you can see, listwise deletion is by far the most frequently mentioned missing data technique in the literature indexed by Google Scholar. Second place goes to mean imputation. The popularity of both methods has increased heavily over the last two decades.

Gosh! That’s not what I was hoping to see!

Why? Listwise deletion and mean imputation are the two methods that are widely known to introduce bias in most of their applications (Have a look at these two tutorials for more details: listwise deletion; mean imputation).
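To make the bias concrete, here is a minimal simulated sketch. It is an illustration only and not part of the Google Scholar analysis. It assumes a made-up data set in which `y` depends on `x` and the chance that `y` is missing grows with `x`; the function name `simulate_bias` and all parameter values are hypothetical choices for this example.

```python
import numpy as np


def simulate_bias(n=10_000, seed=0):
    """Compare listwise deletion and mean imputation on simulated data."""
    rng = np.random.default_rng(seed)

    # Fully observed "truth": y depends on x, so the true mean of y is about 0.
    x = rng.normal(0.0, 1.0, n)
    y = 2.0 * x + rng.normal(0.0, 1.0, n)

    # Assumed missingness mechanism: y is more likely to be missing
    # when x is large (the missingness depends on an observed variable).
    p_missing = 1.0 / (1.0 + np.exp(-2.0 * x))
    missing = rng.random(n) < p_missing

    # Listwise deletion: keep only the cases where y is observed.
    y_observed = y[~missing]

    # Mean imputation: replace every missing y with the observed mean.
    y_mean_imputed = np.where(missing, y_observed.mean(), y)

    return {
        "true_mean": y.mean(),
        "listwise_mean": y_observed.mean(),
        "true_var": y.var(),
        "mean_imputed_var": y_mean_imputed.var(),
    }


if __name__ == "__main__":
    for name, value in simulate_bias().items():
        print(f"{name}: {value:.3f}")
```

In this setup, the mean after listwise deletion ends up clearly below the true mean, because the cases with large `x` (and therefore large `y`) are the ones that go missing. Mean imputation inherits the same biased mean and, on top of that, shrinks the variance, since many values are replaced by one constant.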

So, what about the other three imputation methods? In the missing data research literature, these three methods are highly respected for their ability to improve data quality (Learn more: regression imputation; predictive mean matching; hot deck imputation).

Regression imputation and hot deck imputation seem to have gained popularity until 2013. Afterwards, however, both methods converge at approximately 500 Google Scholar search results per year. In contrast, the popularity of predictive mean matching is pretty low until 2010 (no surprise, the method is quite new), but it increases quickly afterwards.

What does this tell us? Among the more respected methods, predictive mean matching seems to outperform the others in terms of popularity, and this is actually something I was hoping to see! In recent years, it has been shown more and more often that predictive mean matching has advantages over other imputation methods (e.g. here or here).

That predictive mean matching is getting more popular is good news!
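As a quick reminder of the idea behind predictive mean matching, here is a simplified sketch with a single predictor. It is an assumption-laden illustration, not a reference implementation: the function name `pmm_impute` and the parameter `k` are hypothetical, and the sketch skips the random draw of regression parameters that full implementations typically add.

```python
import numpy as np


def pmm_impute(x, y, k=5, seed=0):
    """Simplified predictive mean matching for one predictor x.

    Missing entries of y must be coded as NaN. Each missing value is
    replaced by the observed y of a donor chosen at random among the k
    observed cases whose predicted values are closest.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)
    observed = ~missing

    # Fit a linear regression of y on x using the observed cases only.
    design = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(design[observed], y[observed], rcond=None)
    predicted = design @ beta

    donor_predictions = predicted[observed]
    donor_values = y[observed]

    y_imputed = y.copy()
    for i in np.flatnonzero(missing):
        # Find the k donors whose predicted values are closest to case i.
        distances = np.abs(donor_predictions - predicted[i])
        nearest = np.argsort(distances)[:k]
        # Impute the actually observed value of one randomly chosen donor.
        y_imputed[i] = donor_values[rng.choice(nearest)]
    return y_imputed
```

Because the imputed values are borrowed from real cases, they always lie within the range of the observed data, which is one reason the method is often considered more robust than plain regression imputation.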

 

What about Single vs. Multiple Imputation?

 


Graphic 2: The Increasing Popularity of Multiple Imputation.
