Fixed Effects in Linear Regression (Example in R) | Cross Sectional, Time & Two-Way

 

This blog post will cover the use of fixed effects to control for unobservable confounding in linear regressions. Fixed effects (FE) are binary indicators of group membership that are used as covariates in linear regression.

When entered as covariates in a linear regression, FE computationally remove mean differences between observations in the indicator group and all other observations. This demeaning process adjusts regression coefficient estimates on other modeled covariates for all confounders related to these group differences, a powerful method for combatting bias in linear regression estimates.

We will focus on three categories of FE models: those with cross-sectional FE, time FE, and two-way FE (TWFE).

This post assumes a baseline understanding of the use of linear (Ordinary Least Squares) regression to measure the linear relationship between an outcome and a continuous covariate (explanatory/independent variable).

It also assumes familiarity with the problem of confounding, and the potential bias that can occur when both an outcome and covariate are correlated with unmeasured “lurking” variables.

Finally, readers will benefit from intuition for the use of multiple regression to add additional covariates to a regression equation, thus “controlling for” alternate explanations for the relationship between the outcome and the explanatory variable of interest.

 

Philip Gigliotti Statistician Programmer

Note: This article is a guest post written by Philip Gigliotti. Philip is a Senior Healthcare Informatics Analyst at DataGen Healthcare Analytics, and the author of the “causal inference philosophy” tutorial series on LinkedIn. You can read more about Philip here!

 

The Basic Model

The post will describe the implementation of FE regression in R, using the cutting-edge felm() function from the “lfe” package. Readers will benefit from prior experience with R’s classical regression function lm().

Our toy model for exposition and implementation will be the relationship between premature death rate (outcome) and income (explanatory variable) in a sample of 3,000 USA counties, nested in 50 USA states. At times, we may consider this sample when observed over two periods, yielding 6,000 observations (3,000 counties over 2 periods).

The data set can be downloaded here.

After downloading the data set, we can import it into R as shown below.

df <- read.csv("gigliotti_county.csv")

Let’s have a look at the first six rows of our data.

head(df)
#   FIPS      state_county      prem income period
# 1 1001 AL Autauga County  9409.295  54487   2016
# 2 1001 AL Autauga County  8128.591  59338   2018
# 3 1003 AL Baldwin County  7467.597  56460   2016
# 4 1003 AL Baldwin County  7354.123  57588   2018
# 5 1005 AL Barbour County  8929.475  32884   2016
# 6 1005 AL Barbour County 10253.573  34382   2018

To implement the previously described model in R, use the following syntax.

reg <- lm(prem ~ income, df)
summary(reg)
# Call:
# lm(formula = prem ~ income, data = df)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
#  -6089  -1293   -142   1028  34667 
# 
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  1.446e+04  1.070e+02  135.08   <2e-16 ***
# income      -1.196e-01  2.024e-03  -59.08   <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 2153 on 6160 degrees of freedom
# Multiple R-squared:  0.3617,	Adjusted R-squared:  0.3616 
# F-statistic:  3491 on 1 and 6160 DF,  p-value: < 2.2e-16

Finally, this post adopts an opinionated assumption, consistent with the econometric literature, that heteroskedasticity is omnipresent in observational data. Heteroskedasticity is a violation of the classical regression assumptions in which the variance of the residuals (error terms) is not constant across observations.

Thus, all regression models will be corrected with heteroskedasticity-robust standard errors to prevent false positives in statistical significance tests (or, less commonly, false negatives) resulting from this violated assumption. Readers may recall this implementation of robust standard errors using the “lmtest” and “sandwich” packages:

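# Install once if needed: install.packages(c("lmtest", "sandwich"))
library(lmtest)    # provides coeftest()
library(sandwich)  # provides vcovHC()
reg <- lm(prem ~ income, df)
# Re-test the coefficients using the HC1 heteroskedasticity-robust covariance matrix
coeftest(reg, vcov = vcovHC(reg, type = "HC1"))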
# t test of coefficients:
# 
#                Estimate  Std. Error t value  Pr(>|t|)    
# (Intercept)  1.4460e+04  1.3406e+02 107.863 < 2.2e-16 ***
# income      -1.1956e-01  2.4259e-03 -49.286 < 2.2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
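To make the correction less of a black box, here is a minimal base-R sketch of the HC1 “sandwich” formula. It runs on simulated data, not the county data set; the variable names (x, y), the seed, and the error structure are assumptions made purely for illustration.

```r
# Simulated data with heteroskedastic errors (illustrative, not the county data)
set.seed(1)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = x)    # error spread grows with x

fit <- lm(y ~ x)
X   <- model.matrix(fit)               # n x k design matrix
u   <- residuals(fit)
k   <- ncol(X)

# Sandwich: (X'X)^-1  X' diag(u^2) X  (X'X)^-1, scaled by n / (n - k) for HC1
bread    <- solve(crossprod(X))
meat     <- crossprod(X * u)           # equals X' diag(u^2) X
vcov_hc1 <- (n / (n - k)) * bread %*% meat %*% bread

se_classic <- sqrt(diag(vcov(fit)))
se_hc1     <- sqrt(diag(vcov_hc1))
print(rbind(classic = se_classic, HC1 = se_hc1))
```

The point estimates are untouched by this correction; only the standard errors, and therefore the t values and p-values, change.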

 

Theory of Fixed Effects

The intuition behind multiple regression is often illustrated as using a second continuous covariate to “control for” an alternate explanation for the relationship between an outcome and an explanatory variable. A classic example is using a measure of temperature to control away the spurious correlation between homicides and ice cream sales.

But “control” variables can also be categorical. In this case they often take the form of a binary indicator (or dummy) variable which identifies mutually exclusive membership or non-membership in a given group. When entered in the regression model, this covariate computationally eliminates the difference in means between group members and the rest of the sample, along with any confounding factors associated with this mean difference.

This is a powerful adjustment! A classic example of this is using a binary indicator of sex assigned at birth as a covariate. For example, in a regression of the relationship between wages (outcome) and education (explanatory), we likely want to control for this “sex at birth” dummy to (partially) remove confounding mean differences associated with socially constructed norms related to gender, education, and career.

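Note well that in practice, we would enter a single “dummy” variable indicating either male or female sex assigned at birth. The excluded category is called the “reference” category. Entering indicators for both categories alongside the intercept would make them perfectly collinear, so one must be excluded. The coefficient on the included dummy then estimates the mean difference between the two groups, holding the other covariates constant.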
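The link between a dummy coefficient and a difference in group means can be checked directly. The sketch below uses simulated wages for two hypothetical groups labelled "A" and "B" (invented values, with no other covariates in the model) and compares the lm() coefficients to the raw group means.

```r
# Simulated wages for two groups (illustrative values only)
set.seed(42)
group <- rep(c("A", "B"), each = 100)
wage  <- ifelse(group == "A", 20, 25) + rnorm(200, sd = 3)

# lm() turns the character column into one dummy; "A" becomes the reference
fit <- lm(wage ~ group)

mean_a <- mean(wage[group == "A"])
mean_b <- mean(wage[group == "B"])

# Intercept = mean of reference group A; groupB = mean of B minus mean of A
print(c(intercept = coef(fit)[["(Intercept)"]], mean_a = mean_a))
print(c(groupB = coef(fit)[["groupB"]], difference = mean_b - mean_a))
```

Once other covariates are added, the dummy coefficient becomes the mean difference after adjusting for those covariates rather than the raw difference.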

 

Cross Sectional Fixed Effects

We can use “dummy” variables to control for categorical measures of membership across multiple groups using a similar procedure. Now consider our sample of 3,000 USA counties nested in 50 US states. Average levels of premature death rate and income may plausibly differ across the 50 states, since counties within the same state share many confounding factors related to health, such as socioeconomic factors, political culture, health infrastructure, and environmental conditions.

To remove these differences between states, we apply the same theory of fixed effects (FE) by entering a “dummy” variable indicating membership in each state. Just as before we exclude a single state as the “reference” category against which mean differences in each state will be calculated. Any state can be chosen as reference.

In our case, regression software will choose the reference category for us automatically. We are left with a regression of premature death rate on income that includes a vector of 49 state indicator variables, which we refer to as state fixed effects (FE). Since state membership is a characteristic that varies across groupings of observational units in the sample, state FE can be more generally referred to as “cross-sectional” FE. Cross-sectional FE remove from the estimates all confounding variation that is shared within the cross-sectional group.
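The “demeaning” interpretation of state FE can be verified numerically. The following sketch simulates counties nested in states with an unobserved state-level confounder (all names and numbers are assumptions for illustration, with a true income slope of -0.5) and compares three estimates: state dummies, within-state demeaning, and pooled OLS without FE.

```r
# Simulated counties nested in states (illustrative, not the county data)
set.seed(7)
n_states  <- 10
per_state <- 30
n_obs     <- n_states * per_state
state <- factor(rep(seq_len(n_states), each = per_state))
state_effect <- rnorm(n_states, sd = 5)[as.integer(state)]  # unobserved confounder
income <- 50 + 2 * state_effect + rnorm(n_obs, sd = 4)
prem   <- 100 - 0.5 * income + 3 * state_effect + rnorm(n_obs)

# (a) State FE entered as dummy variables
b_dummy <- coef(lm(prem ~ income + state))[["income"]]

# (b) Same estimate from demeaning outcome and covariate within each state
prem_dm   <- prem   - ave(prem, state)      # ave() returns the state mean per row
income_dm <- income - ave(income, state)
b_within  <- coef(lm(prem_dm ~ income_dm))[["income_dm"]]

# (c) Pooled OLS without state FE, for comparison
b_pooled <- coef(lm(prem ~ income))[["income"]]

print(c(dummy = b_dummy, within = b_within, pooled = b_pooled))
```

Estimates (a) and (b) coincide, while (c) is pulled away from the true slope by the state-level confounder.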

 

Time Fixed Effects

Aside from cross-sectional groupings such as location (state), time period is another salient grouping which may introduce bias in regression models. Consider our model of 3,000 US counties nested in 50 US states. Now imagine if we are to observe the sample over 2 periods, for example once in 2016 and once in 2018. Observations in the sample, even within the same county, may be very different in the two periods.

Importantly, the average levels of premature death rate and income may shift between these periods due to changes in factors like macroeconomic trends, technological advances, or political regime, which impact the entire sample. We want to hold these factors constant between periods, so we can estimate the aggregate relationship across the entire sample with reduced bias.

Just as we used dummy variable indicators to remove mean differences between states in our example of cross-sectional fixed effects (FE), we can add a dummy variable indicating whether a particular observation was recorded in 2018 (included category) or 2016 (reference category). With more than 2 periods we would include n-1 dummy variables, where n is the number of periods, each indicating membership in a single period. These indicators are called time FE, and remove all confounding factors trending over time that are shared across the entire sample.
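A small simulation can illustrate what a time FE absorbs. In the sketch below (invented data; the shock sizes and the true slope of -0.5 are assumptions), every unit experiences the same shift in income and in the outcome in 2018. Pooled OLS attributes that shared shift to income, while adding the period dummy does not.

```r
# Simulated two-period sample with a shock common to all units in 2018
set.seed(11)
n_units <- 500
period <- factor(rep(c(2016, 2018), each = n_units))
shock  <- ifelse(period == "2018", 1, 0)

# In 2018 income rises by 10 and the outcome falls by 20 for unrelated reasons
income <- 50 + 10 * shock + rnorm(2 * n_units, sd = 5)
prem   <- 100 - 0.5 * income - 20 * shock + rnorm(2 * n_units)

b_pooled <- coef(lm(prem ~ income))[["income"]]
fit_fe   <- lm(prem ~ income + period)    # 2016 is the reference category
b_fe     <- coef(fit_fe)[["income"]]
b_2018   <- coef(fit_fe)[["period2018"]]  # absorbs the common 2018 shift

print(c(pooled = b_pooled, time_fe = b_fe, period2018 = b_2018))
```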

Two-Way Fixed Effects

The key insight of fixed effects (FE) is that whenever we have a group of two or more observations in our data, we can use a dummy variable indicator to remove the mean difference between the group and remaining sample, eliminating with it all shared confounding variation.

Consider once more our sample of 3,000 USA counties observed over 2 periods (2016, 2018). The presence of two periods of data for each county opens a new opportunity for us. Each county now constitutes a unique grouping in the data, with two observations each (2016 & 2018). This data structure (called panel data) facilitates the introduction of observational unit FE.

By adding a single dummy variable indicator for each county (excluding one as reference) we can now remove mean differences between counties, and all associated confounding, from regression estimates. This is a powerful method to “control” for county-level confounders such as differences in average wealth, education, or health status, which constitute the most obvious threats of bias.

The introduction of county FE removes all time-invariant variation from the sample, in other words all factors that do not change over time within a county. Thus, the only variation left to inform the estimates comes from factors that change over time within a county. This changes interpretation of the regression coefficient. While a cross-sectional regression measures the relationship between levels of an outcome (premature death) and a covariate (income), the county FE model measures the relationship between changes in premature death and changes in income over time.

For added robustness, don’t forget to include time period fixed effects in your observational unit fixed effects model. This removes problematic time trends shared across the sample, which is especially important if using an extended data set, for example which covers 10, 20, or more periods. The robust model including both unit and time FE is called a two-way fixed effects model, and has traditionally been the gold standard for observational causal inference in the quantitative social sciences.
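The “changes on changes” interpretation is exact when a balanced panel has two periods: the TWFE coefficient equals the coefficient from regressing the change in the outcome on the change in the covariate. The sketch below checks this on a simulated panel (hypothetical units and effect sizes, not the county data).

```r
# Simulated balanced panel: 200 units observed in 2016 and 2018 (illustrative)
set.seed(3)
n_units <- 200
unit   <- factor(rep(seq_len(n_units), times = 2))
period <- factor(rep(c(2016, 2018), each = n_units))
is_2018 <- as.numeric(period == "2018")
unit_effect <- rnorm(n_units, sd = 5)[as.integer(unit)]  # time-invariant confounder
income <- 50 + 2 * unit_effect + 3 * is_2018 + rnorm(2 * n_units, sd = 4)
prem   <- 100 - 0.5 * income + 3 * unit_effect - 5 * is_2018 + rnorm(2 * n_units)

# Two-way FE via unit and period dummies
b_twfe <- coef(lm(prem ~ income + unit + period))[["income"]]

# First differences; rows are ordered by unit within each period, so they align
first <- period == "2016"
last  <- period == "2018"
d_prem   <- prem[last]   - prem[first]
d_income <- income[last] - income[first]
# The intercept of the differenced model plays the role of the period FE
b_fd <- coef(lm(d_prem ~ d_income))[["d_income"]]

print(c(twfe = b_twfe, first_difference = b_fd))
```

With more than two periods the two estimators generally differ, so this equivalence should be read as intuition for the two-period case only.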

 

Cluster-Robust Standard Errors

A final note on the two-way fixed effect (TWFE) model: Recall that we used robust standard errors (SE) to correct significance tests in our cross-sectional regression model for unequal error variance across observations (heteroskedasticity).

In the TWFE model, a second threat to valid significance tests arises due to the fact that there are two or more observations per cross-sectional unit (two years per county in the context of our model). Since both data points within a given county are likely to share similar variation, their error terms are also likely to be correlated with each other (this is called autocorrelation).

The solution to this dynamic is called clustering, which is a statistical adjustment to SE calculation which allows arbitrary correlation within each “cluster” or group. In TWFE models, it is standard practice to cluster SE on the observational unit. In our model of 3,000 USA counties over 2 periods with county and period fixed effects, we would cluster the SE by county.
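As a rough illustration of what clustering does, the sketch below hand-computes a cluster-robust sandwich variance in base R. It is a simplified version without any small-sample correction, so packaged implementations will generally report slightly different numbers. The helper name cluster_vcov and the simulated data are assumptions for this example only.

```r
# Cluster-robust sandwich variance (no small-sample correction)
cluster_vcov <- function(fit, cluster) {
  X <- model.matrix(fit)
  u <- residuals(fit)
  bread <- solve(crossprod(X))
  # Sum the scores X_i * u_i within each cluster, then take their cross product
  scores <- rowsum(X * u, group = cluster)
  meat   <- crossprod(scores)
  bread %*% meat %*% bread
}

# Simulated panel: 100 units x 2 periods, errors correlated within unit
set.seed(5)
n_units <- 100
unit   <- rep(seq_len(n_units), times = 2)
x_unit <- rnorm(n_units)[unit]     # covariate component shared within unit
e_unit <- rnorm(n_units)[unit]     # error component shared within unit
x <- x_unit + rnorm(2 * n_units)
y <- 1 + 0.5 * x + e_unit + rnorm(2 * n_units)
fit <- lm(y ~ x)

se_iid     <- sqrt(diag(vcov(fit)))
se_cluster <- sqrt(diag(cluster_vcov(fit, unit)))
print(rbind(iid = se_iid, clustered = se_cluster))
```

If every observation is placed in its own cluster, the formula collapses to the unscaled heteroskedasticity-robust (HC0) variance, which shows that clustering generalizes the robust SE used earlier.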

Implementation in R

Implementation of the two-way fixed effects (TWFE) estimator in R is quite simple using the cutting edge felm() function from the “lfe” package. While R users have traditionally estimated panel data models with the plm() function, this is now considered antiquated amongst most working applied econometricians using R.

felm() uses cutting edge computational methods that are more effective and efficient in contexts with complex FE structure. While this may not be strictly necessary in the case of the TWFE model, it should be considered best practice to use the felm() function.

See implementation below:

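# Install once if needed: install.packages("lfe")
library(lfe)
# Formula parts: outcome ~ covariates | fixed effects | instruments (0 = none) | cluster variable
twfe <- felm(prem ~ income | state_county + period | 0 | state_county, df)
summary(twfe)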
# Call:
#    felm(formula = prem ~ income | state_county + period | 0 | state_county,      data = df) 
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -18503.1   -317.5      0.0    317.5  18503.1 
# 
# Coefficients:
#         Estimate Cluster s.e. t value Pr(>|t|)
# income -0.012293     0.009082  -1.354    0.176
# 
# Residual standard error: 1204 on 3032 degrees of freedom
# Multiple R-squared(full model): 0.9017   Adjusted R-squared: 0.8003 
# Multiple R-squared(proj model): 0.0004803   Adjusted R-squared: -1.031 
# F-statistic(full model, *iid*):8.891 on 3129 and 3032 DF, p-value: < 2.2e-16 
# F-statistic(proj model): 1.832 on 1 and 3032 DF, p-value: 0.1759

Just like the lm() function, the first argument of felm() is the regression equation. The space after the first “|” character takes the column names for the FE dimensions. In this case we use two: county and period (year). The space after the second “|” character is where you can specify instrumental variables (which is not relevant for this blog post).

For now, filling the space with “0” leaves the argument unused. Following the last “|” character is where you specify your standard error (SE) clustering dimension. Finally, as in the lm() function we specify our data frame. Once we create the felm() object, calling summary() dispatches to the summary.felm method provided by the “lfe” package. This method reports cluster-robust SE when a clustering dimension is specified. Without a clustering dimension, it reports conventional SE unless heteroskedasticity-robust SE are requested with the robust = TRUE argument.
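Rather than building one dummy column per county, felm() removes the FE by projecting them out of the outcome and the covariates. For a balanced panel, that projection has a simple closed form: subtract unit means and period means, then add back the grand mean. The sketch below applies it to simulated data (the helper two_way_demean and all values are assumptions for illustration) and compares the result to the dummy-variable regression. It is meant as intuition, not as a description of the internals of “lfe”, and the closed form does not carry over to unbalanced panels.

```r
# Simulated balanced panel: 150 units x 4 periods (illustrative)
set.seed(9)
n_units   <- 150
n_periods <- 4
n_obs     <- n_units * n_periods
unit   <- factor(rep(seq_len(n_units), times = n_periods))
period <- factor(rep(seq_len(n_periods), each = n_units))
x <- rnorm(n_units)[as.integer(unit)] +
  rnorm(n_periods)[as.integer(period)] + rnorm(n_obs)
y <- 2 * rnorm(n_units)[as.integer(unit)] + 0.3 * x + rnorm(n_obs)

# Remove unit means and period means, then add back the grand mean
two_way_demean <- function(v, unit, period) {
  v - ave(v, unit) - ave(v, period) + mean(v)
}

y_dm <- two_way_demean(y, unit, period)
x_dm <- two_way_demean(x, unit, period)

b_within <- coef(lm(y_dm ~ 0 + x_dm))[["x_dm"]]       # no dummies needed
b_dummy  <- coef(lm(y ~ x + unit + period))[["x"]]    # 150 + 4 dummy levels

print(c(within = b_within, dummies = b_dummy))
```

The two point estimates agree, but the standard errors from the demeaned lm() fit are not directly usable because they ignore the degrees of freedom consumed by the FE.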

In case you want to see an alternative example on how to estimate a two ways effect in a fixed effect model using the R programming language, you may have a look at the following video of Miklesh Yadav’s YouTube channel:

 

Conclusion

Perhaps the largest threat to validity in regression analysis is confounding by lurking variables that are not accounted for in the regression. While many lurking variables can be measured and “controlled for” via multiple regression, there will always be “unobservable” confounders that cannot be directly measured (or possibly even imagined).

One of the best weapons we have against unobservable confounders is the use of fixed effects to remove mean differences between groups of data points, along with all confounding “unobservable” factors associated with those groupings. In the two-way fixed effects model, we are able to control for all unobservable characteristics of observational units that are fixed over time, and all shared unobservable factors changing over time in the sample.

This is a formidable weapon in our arsenal when seeking to make causal inferences from observational data via regression models.

 
