Regression Imputation (Stochastic vs. Deterministic & R Example)

 

Be careful: Flawed imputations can heavily reduce the quality of your data! Are you aware that a poor missing value imputation might destroy the correlations between your variables?

If it’s done right, regression imputation can be a good solution for this problem. I’ll show you all the tricks you need to know in the following article.

But first of all, what is regression imputation?

 

Definition:
Regression imputation fits a statistical model on a variable with missing values. Predictions of this regression model are used to substitute the missing values in this variable.

 
 

What are the advantages and drawbacks of regression imputation? How can I apply regression imputation? Are there even better alternatives?


 

Deterministic vs. Stochastic Regression Imputation

Regression imputation consists of two sequential steps:

  1. A linear regression model is estimated on the basis of observed values in the target variable Y and some explanatory variables X.
  2. The model is used to predict values for the missing cases in Y. Missing values of Y are then replaced on the basis of these predictions.

Relationships of X and Y (i.e. correlations, regression coefficients etc.) are preserved, since imputed values are based on regression models. This is a big advantage over simpler imputation methods such as mean imputation or zero substitution.
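To make these two steps concrete, here is a minimal sketch in base R. It assumes a data frame called data with a target variable y (containing NAs) and fully observed predictors x1, x2, and x3, just like the example data we will create below:

# Minimal sketch of the two steps in base R
 
fit <- lm(y ~ x1 + x2 + x3, data = data) # Step 1: Model fitted on rows with observed y (lm drops NA rows)
pred <- predict(fit, newdata = data) # Step 2: Predictions for all rows
data_imp <- data # Copy of the data
data_imp$y[is.na(data$y)] <- pred[is.na(data$y)] # Replace missing y by its predictions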

Regression imputation is classified into two different versions: deterministic and stochastic regression imputation.

Deterministic regression imputation replaces missing values with the exact prediction of the regression model. Random variation (i.e. an error term) around the regression slope is not considered. Imputed values are therefore often too precise and lead to an overestimation of the correlation between X and Y.

Stochastic regression imputation was developed in order to solve this issue of deterministic regression imputation. Stochastic regression imputation adds a random error term to the predicted value and is therefore able to reproduce the correlation of X and Y more appropriately.
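Relative to the sketch above, the only change is one extra ingredient: a random draw from the estimated error distribution is added to each prediction. A minimal base R sketch under the same assumptions (mice implements this idea as method "norm.nob"):

# Minimal sketch of stochastic regression imputation in base R
 
fit <- lm(y ~ x1 + x2 + x3, data = data) # Model fitted on rows with observed y
pred <- predict(fit, newdata = data) # Deterministic predictions
noise <- rnorm(nrow(data), 0, sigma(fit)) # Random error term (residual standard deviation)
data_imp <- data
data_imp$y[is.na(data$y)] <- pred[is.na(data$y)] + noise[is.na(data$y)]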

In the following example, I’ll show you the differences between the two approaches of deterministic and stochastic regression imputation in more detail.

 

Regression Imputation in R (Example)

Before we can start with our regression imputation example, we need some data with missing values. So let’s create some synthetic example data with R:

# Example data
 
set.seed(9090909) # Create reproducible data
N <- 2000 # Sample size
 
y <- round(rnorm(N, 20, 10))            # Dependent variable
x1 <- round(0.2 * y + rnorm(N, 5), 2)   # Predictor 1
x2 <- round(y * rpois(N, 5))            # Predictor 2
x3 <- round(0.01 * y + runif(N, 0, 10)) # Predictor 3
 
data <- data.frame(y, x1, x2, x3)
 
data$y[rbinom(N, 1, 0.2) == 1] <- NA # Approximately 20% missings in y
head(data) # First 6 rows of our example data

 

We have created a target variable Y and three auxiliary variables X1, X2, and X3. The variable Y has some missing values, displayed as NA in rows 5 and 6. The first rows of our data set look as follows.

 

Table 1: First 6 Rows of Our Synthetic Example Data in R

 

A highly recommended R package for regression imputation (and for many other imputation methods) is the mice package. Install and load the package in R:

# Install and load the R package mice
 
install.packages("mice") # Needs to be done only once
library("mice") # Load package

 

Now, let’s apply a deterministic regression imputation to our example data. The function mice() is used to impute the data; method = "norm.predict" specifies deterministic regression imputation; and m = 1 sets the number of imputed data sets (in our case, single imputation).

# Deterministic regression imputation
 
imp <- mice(data, method = "norm.predict", m = 1) # Impute data
data_det <- complete(imp) # Store data

 

We can use almost the same code for stochastic regression imputation. We only have to change method = "norm.predict" to method = "norm.nob".

# Stochastic regression imputation
 
imp <- mice(data, method = "norm.nob", m = 1) # Impute data
data_sto <- complete(imp) # Store data
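
Before comparing the two imputed data sets, a quick sanity check doesn't hurt: after imputation, no NA values should remain in either data set.

# Sanity check: no missing values should remain
 
colSums(is.na(data_det)) # Should be 0 for every column
colSums(is.na(data_sto)) # Should be 0 for every column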

 

Let’s graphically check how well our missing data imputations worked:

# Graphical comparison of deterministic and stochastic regression imputation
 
par(mfrow = c(1, 2)) # Both plots in one graphic
 
# Deterministic regression imputation
plot(x1[!is.na(data$y)], data_det$y[!is.na(data$y)],     # Plot of observed values
     xlim = c(0, 20), ylim = c(- 15, 60),
     main = "Deterministic Regression Imputation",
     xlab = "X1", ylab = "Y")
points(x1[is.na(data$y)], data_det$y[is.na(data$y)],     # Plot of missing values
       col = "red")
abline(lm(y ~ x1, data_det), col = "#1b98e0", lwd = 1.5) # Regression slope
legend("topleft",                                        # Legend
       c("Observed Values", "Imputed Values", "Regression Y ~ X1"),
       pch = c(1, 1, NA),
       lty = c(NA, NA, 1),
       col = c("black", "red", "#1b98e0"))
 
# Stochastic regression imputation
plot(x1[!is.na(data$y)], data_sto$y[!is.na(data$y)],     # Plot of observed values
     xlim = c(0, 20), ylim = c(- 15, 60),
     main = "Stochastic Regression Imputation",
     xlab = "X1", ylab = "Y")
points(x1[is.na(data$y)], data_sto$y[is.na(data$y)],     # Plot of missing values
       col = "red")
abline(lm(y ~ x1, data_sto), col = "#1b98e0", lwd = 1.5) # Regression slope
legend("topleft",                                        # Legend
       c("Observed Values", "Imputed Values", "Regression Y ~ X1"),
       pch = c(1, 1, NA),
       lty = c(NA, NA, 1),
       col = c("black", "red", "#1b98e0"))

 

Graphic 1: Imputed Values of Deterministic & Stochastic Regression Imputation (Correlation Plots of X1 & Y)

 

Graphic 1 visualizes the main drawback of deterministic regression imputation: The imputed values (red bubbles) are way too close to the regression slope (blue line)!

In contrast, the imputation by stochastic regression worked much better. The distribution of the imputed values is similar to that of the observed values and, hence, much more realistic.
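
If you want to verify this impression beyond the scatterplot, you can compare the distributions of observed and imputed values directly, e.g. with density curves. A small sketch for the stochastic imputation:

# Density of observed vs. imputed values (stochastic regression imputation)
 
plot(density(data_sto$y[!is.na(data$y)]), # Observed values
     main = "Observed vs. Imputed Values of Y", xlab = "Y")
lines(density(data_sto$y[is.na(data$y)]), # Imputed values
      col = "red")
legend("topright", c("Observed", "Imputed"),
       col = c("black", "red"), lty = 1)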

But what does that actually mean for the analysis of our data? Have we introduced bias with the deterministic regression imputation? Let’s check some numbers:

# Correlation between X1 and Y
 
round(cor(y, x1), 3) # True correlation
## [1] 0.897
 
round(cor(data_det$y, data_det$x1), 3) # Correlation after deterministic regression imputation
## [1] 0.912
 
round(cor(data_sto$y, data_sto$x1), 3) # Correlation after stochastic regression imputation
## [1] 0.894

The correlation coefficients between X1 and Y for our data without missings, after deterministic regression imputation, and after stochastic regression imputation confirm what we already saw graphically: While stochastic regression imputation reproduces the true correlation almost perfectly (true = 0.897; stochastic = 0.894), deterministic regression imputation overestimates the correlation (true = 0.897; deterministic = 0.912).
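
By the way, the correlation is not the only quantity affected. Because deterministic imputations lie exactly on the regression line, they also tend to shrink the variance of Y. A quick check (recall that the vector y was never overwritten and still contains the complete values):

# Variance of Y: true vs. after imputation
 
round(var(y), 3) # True variance
round(var(data_det$y), 3) # Typically too small after deterministic imputation
round(var(data_sto$y), 3) # Typically close to the true variance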

You might say now: OK, but does such a small difference really matter?

Answer: Yes, it does. First of all, you should always avoid bias, no matter how small it is. Second, you don’t know in advance how much bias you introduce when applying a flawed statistical method such as deterministic regression imputation. In your specific data set, the bias might be much larger!

 

Video: Deterministic vs. Stochastic Imputation

Do you still have trouble understanding the difference between deterministic and stochastic regression imputation? Don’t worry, just watch the following video from Iris Eekhout’s YouTube channel.

The video is only 17 seconds long. However, it perfectly illustrates the difference between deterministic regression imputation (red dots at the beginning of the video) and stochastic regression imputation (red dots at the end of the video).

 

 

Are There Even Better Alternatives? (The Drawbacks of Stochastic Regression Imputation)

At this point, we have learned that stochastic regression imputation outperforms imputation by deterministic regression. However, the field of statistics is constantly evolving, and one imputation method in particular has gathered a lot of attention in the recent research literature.

The method I’m talking about is called predictive mean matching (click to learn more).

But how can stochastic regression imputation be improved? Are there any drawbacks?! You guessed it, yes there are!

 

Drawback 1: Stochastic regression imputation may lead to implausible values. Variables are often restricted to a certain range of values (e.g. income should always be positive), but regression imputation is not able to enforce such restrictions.

Consider the following example of implausible imputed values:

# Income data
 
set.seed(91919) # Set seed
N <- 1000 # Sample size
 
income <- round(rnorm(N, 0, 500)) # Create some synthetic income data
income[income < 0] <- income[income < 0] * (- 1)
 
x1 <- income + rnorm(N, 1000, 1500) # Auxiliary variables
x2 <- income + rnorm(N, - 5000, 2000)
 
income[rbinom(N, 1, 0.1) == 1] <- NA # Create 10% missingness in income
 
data_inc_miss <- data.frame(income, x1, x2)
 
imp_inc <- mice(data_inc_miss, method = "norm.nob", m = 1) # Impute data
data_inc <- complete(imp_inc)
 
data_inc$income[data_inc$income < 0] # 10 values below 0 (implausible)
 
## [1]  -57.8859452  -58.9457388 -104.1099599 -147.9968026  -70.9192935 -280.2621111
## [7]   -0.5694554  -91.3508690  -76.8628876 -468.5756297

After imputing our income variable, we end up with 10 implausible values. Clearly a result we wouldn’t want to see in practice.
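
If you nonetheless want to stick with stochastic regression imputation for such a variable, one possible workaround is the post-processing mechanism of the mice package, which squeezes imputed values into a plausible range after each draw. Here is a sketch (note that simple truncation can itself distort the distribution, so this is a patch rather than a cure; the object names imp_pos and data_inc_pos are just illustrative):

# Possible workaround: post-process imputed values into a plausible range
 
ini <- mice(data_inc_miss, maxit = 0) # Dry run to extract the default settings
post <- ini$post # Post-processing rules (one per variable)
post["income"] <- "imp[[j]][, i] <- squeeze(imp[[j]][, i], c(0, Inf))"
imp_pos <- mice(data_inc_miss, method = "norm.nob", m = 1, post = post)
data_inc_pos <- complete(imp_pos)
sum(data_inc_pos$income < 0) # Should now be 0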

 

Drawback 2: Stochastic regression imputation leads to poor results when the data is heteroscedastic. The method assumes homoscedasticity, i.e. that the random error has, on average, the same size across the whole distribution, which often results in too small or too large random error terms for the imputed values.

Consider the following example of heteroscedastic data:

# Heteroscedastic data
 
set.seed(654654) # Set seed
N <- 1:5000 # Index vector (sample size = 5000)
 
a <- 0
b <- 1
sigma2 <- N^2
eps <- rnorm(N, mean = 0, sd = sqrt(sigma2))
 
y <- a + b * N + eps # Heteroscedastic variable
x <- 30 * N + rnorm(N[length(N)], 1000, 200) # Correlated variable
 
y[rbinom(N[length(N)], 1, 0.3) == 1] <- NA # 30% missings
 
data_het_miss <- data.frame(y, x)
 
imp_het <- mice(data_het_miss, method = "norm.nob", m = 1) # Impute data
data_het <- complete(imp_het)
 
plot(x[!is.na(y)], data_het$y[!is.na(y)], # Plot of observed values
     main = "Stochastic Regression Imputation (Heteroscedastic Data)",
     xlab = "X", ylab = "Y")
points(x[is.na(y)], data_het$y[is.na(y)], # Plot of missing values
       col = "red")
abline(lm(y ~ x, data_het), col = "#1b98e0", lwd = 1.5)
legend("topleft", # Legend
       c("Observed Values", "Imputed Values", "Regression Y ~ X"),
       pch = c(1, 1, NA),
       lty = c(NA, NA, 1),
       col = c("black", "red", "#1b98e0"))

 

Graphic 2: Stochastic Regression Imputation of Heteroscedastic Data

 

Graphic 2 reveals the second major drawback of stochastic regression imputation. The observed values (black bubbles) are very heteroscedastic, i.e. the random error is much smaller for small values of X and Y and much larger for large values.

This heteroscedasticity, however, is not reflected by the imputation (red bubbles). The imputed values are equally spread around the regression slope (blue line), no matter how large X and Y are. Another result we wouldn’t want to see in practice.
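
As a quick preview of the alternative discussed below: predictive mean matching (method = "pmm", the default of the mice package) borrows each imputed value from an observed donor case. Imputed values therefore automatically stay within the observed range and tend to follow the observed spread. A sketch on the same heteroscedastic data:

# Preview: predictive mean matching on the heteroscedastic data
 
imp_pmm <- mice(data_het_miss, method = "pmm", m = 1) # Impute data
data_pmm <- complete(imp_pmm) # Store data
 
range(y, na.rm = TRUE) # Range of the observed values
range(data_pmm$y[is.na(y)]) # Imputed values stay within this range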

Find out here whether predictive mean matching is capable of solving the issues of stochastic regression imputation.

 

It’s Your Turn!

I showed you all the important things I know about regression imputation.

Now, I’d love to hear about your experiences! Have you already used regression imputation in practice? Have you experienced any of the drawbacks I mentioned above? Are you thinking about switching to newer methods such as predictive mean matching?

Let me know about your questions and experiences in the comments!

 

Appendix

The header graphic of this page shows a heteroscedastic relation between two variables.

Do you want to know how to create such a plot? Here’s the R code.

# Create some data
n <- 1:3000
a <- 0
b <- 1
sigma2 <- n^1.9
eps <- rnorm(n, mean = 0, sd = sqrt(sigma2))
y <- - 1 * (a + b * n + eps)
 
# Set colors
palette <- colorRampPalette(colors = c("#1b98e0", "#FFFFFF"))
cols <- palette(3000)
par(bg = "#353436")
 
# Produce plot
plot(n, y, pch = 18, col = cols)
 
# Estimate and draw regression line
mod <- lm(y ~ n)
abline(coef(mod), col = "red", lwd = 3)

 



22 Comments

  • Fajrul Ramdhan
    October 1, 2018 2:00 am

    Hi. Your post about imputation is very good, and more importantly, helpful. However, I would like to implement it in MATLAB. All resources I found online didn’t help much (maybe because I didn’t look hard enough). Is there a way I can do this?

    Reply
    • Hi Fajrul, thank you for your kind words! Unfortunately, I am not a MATLAB user, so I cannot tell you whether regression imputation is available in MATLAB. However, in R the implementation is quite easy (and free). You might consider imputing your data in R and analysing it afterwards in MATLAB.

      Reply
  • Very interesting to read this article. I would like to thank you for the efforts you made in writing this awesome article. This article inspired me to read more. Keep it up.

    Reply
  • Hi, very good article. Below are a few details that are worth mentioning:

    1) The missing-data pattern that you created follows the so-called MCAR (= missing completely at random) mechanism, e.g. the line:
    data$y[rbinom(N, 1, 0.2) == 1] <- NA # Approximately 20% missings in y
    This is a quite unrealistic mechanism in practical data sets, which are more often of the MAR (= missing at random) or NMAR type. For the types see https://en.wikipedia.org/wiki/Missing_data#Types
    Regression imputations are valid (provided the model structure is ok) under the MAR mechanism (as well as MCAR). Knowing / analyzing these mechanisms (with a real data set having missingness) is very important if one wants to be able to do an unbiased analysis of the underlying population.
    2) Single imputation provides estimates (e.g. sample mean of imputed data set) which underestimate imputation variance. This can be fixed by using multiple imputation, e.g. see https://en.wikipedia.org/wiki/Imputation_(statistics)#Multiple_imputation
    3) Sampling weight adjustment (often used with incomplete survey data) can be used also to deal with missing data (= no need for imputation), e.g. see https://en.wikipedia.org/wiki/Horvitz%E2%80%93Thompson_estimator

    Reply
  • Thanks a lot for this!

    However, I still have many missing values left after following all steps – what could be the reason?

    Reply
    • Hey Lisa,

      Thank you for the nice comment!

      Could you provide the code you have used? Usually, all missing values should be gone after applying missing data imputation.

      Regards

      Joachim

      Reply
  • Hi Sir:
    How can I find the range for the intercept, slope, and R-squared?

    Reply
  • This is very interesting and productive. However, how do I handle such missing values using other techniques, such as Maximum Likelihood and Expectation-Maximization, in R?

    Reply
    • Hey Umar,

      Thank you for the kind comment!

      Please have a look at the help documentation of the mice package. The package provides numerous different imputation techniques.

      Regards,
      Joachim

      Reply
  • Hi Joachim,

    Your instruction on MICE is really helpful! Cheers:)

    However, I am working on a longitudinal dataset. The attrition is an unavoidable issue for it. If we just simply use MICE, some TRUE missing values (e.g., death) can be incorrectly imputed. Do you have any suggestions in this case?

    Thanks in advance,

    Shunqi

    Reply
    • Hey Shunqi,

      Thanks a lot for the kind words! 🙂

      I just came back from holidays and couldn’t reply earlier. Do you still need help with this question?

      Regards,
      Joachim

      Reply
      • Hi Joachim,

        No worries! I solved this problem. For those who want to deal with missing data in longitudinal data sets, there are many methods that may be overlooked by users who haven’t carefully read the technical documentation of MICE (as I hadn’t!).

        I attached the potential methods as follows:

        method       type of variable   usage
        2l.norm      numeric            Level-1 normal, heteroscedastic
        2l.lmer      numeric            Level-1 normal, homoscedastic (lmer)
        2l.pan       numeric            Level-1 normal, homoscedastic (pan)
        2l.bin       binary             Level-1 logistic (glmer)
        2lonly.mean  numeric            Level-2 class mean
        2lonly.norm  numeric            Level-2 class normal
        2lonly.pmm   any                Level-2 class predictive mean matching

        Hope it helps someone!

        I will definitely follow your amazing website! Keep in touch:)

        All the best,

        Shunqi

        Reply
        • That’s great to hear, glad you found a solution!

          Thanks a lot for sharing the additional info, and for the very kind words! 🙂

          Regards,
          Joachim

          Reply
  • Shawn Garbett
    August 4, 2022 8:12 pm

    Do you have a good published reference on this topic?

    Reply
  • Thanks a lot for this article! Very helpful and easy to understand. I was wondering, what is the method to perform stochastic regression imputation in SPSS?

    Reply
  • Good afternoon, I am enjoying the content of this site. Would it be possible to solve this problem, since it deals with missing numbers, please? Here is the question: An ant is at the origin of a Cartesian plane. It always moves from a point (X, Y) to (X+1, Y+1) or to (X+1, Y-1). In how many ways can it start moving and reach (9,3), given that it passed through all the points (n^2, n), with n belonging to the set of integers? Thanks in advance.

    Reply
