Fixed Effects in Linear Regression (Example in R) | Cross Sectional, Time & Two-Way
This blog post will cover the use of fixed effects to control for unobservable confounding in linear regressions. Fixed effects (FE) are binary indicators of group membership that are used as covariates in linear regression.
When entered as covariates in a linear regression, FE computationally remove mean differences between observations in the indicator group and all other observations. This demeaning process adjusts the coefficient estimates on other modeled covariates for all confounders related to these group differences, making FE a powerful method for combating bias in linear regression estimates.
We will focus on three categories of FE models: those with cross-sectional FE, time FE, and two-way FE (TWFE).
This post assumes a baseline understanding of the use of linear (Ordinary Least Squares) regression to measure the linear relationship between an outcome and a continuous covariate (explanatory/independent variable).
It also assumes familiarity with the problem of confounding, and the potential bias that can occur when both an outcome and covariate are correlated with unmeasured “lurking” variables.
Finally, readers will benefit from intuition for the use of multiple regression to add additional covariates to a regression equation, thus “controlling for” alternate explanations for the relationship between the outcome and the explanatory variable of interest.
Note: This article is a guest post written by Philip Gigliotti. Philip is a Senior Healthcare Informatics Analyst at DataGen Healthcare Analytics, and the author of the “causal inference philosophy” tutorial series on LinkedIn. You can read more about Philip here!
The Basic Model
The post will describe the implementation of FE regression in R, using the cutting-edge felm() function from the "lfe" package. Readers will benefit from prior experience with R's classic regression function lm().
Our toy model for exposition and implementation will be the relationship between premature death rate (outcome) and income (explanatory variable) in a sample of 3,000 USA counties, nested in 50 USA states. At times, we will consider this sample observed over two periods, yielding 6,000 observations (3,000 counties over 2 periods).
The data set can be downloaded here.
After downloading the data set, we can import it into R as shown below.
df <- read.csv("gigliotti_county.csv")
Let’s have a look at the first six rows of our data.
head(df)
#   FIPS      state_county      prem income period
# 1 1001 AL Autauga County  9409.295  54487   2016
# 2 1001 AL Autauga County  8128.591  59338   2018
# 3 1003 AL Baldwin County  7467.597  56460   2016
# 4 1003 AL Baldwin County  7354.123  57588   2018
# 5 1005 AL Barbour County  8929.475  32884   2016
# 6 1005 AL Barbour County 10253.573  34382   2018
To implement the previously described model in R, use the following syntax.
reg <- lm(prem ~ income, df)
summary(reg)
# Call:
# lm(formula = prem ~ income, data = df)
#
# Residuals:
#    Min     1Q Median     3Q    Max
#  -6089  -1293   -142   1028  34667
#
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)
# (Intercept)  1.446e+04  1.070e+02  135.08   <2e-16 ***
# income      -1.196e-01  2.024e-03  -59.08   <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 2153 on 6160 degrees of freedom
# Multiple R-squared:  0.3617, Adjusted R-squared:  0.3616
# F-statistic:  3491 on 1 and 6160 DF,  p-value: < 2.2e-16
Finally, this tutorial adopts an opinionated assumption, consistent with the econometric literature, that heteroskedasticity (a violation of the OLS assumption that residuals, or error terms, have constant variance across observations) is omnipresent in observational data.
Thus, all regression models will be corrected with heteroskedasticity-robust standard errors to prevent false positives (or, less commonly, false negatives) in statistical significance tests resulting from this violated assumption. Readers may recall this implementation of robust standard errors using the "lmtest" and "sandwich" packages:
reg <- lm(prem ~ income, df)

install.packages("lmtest")
library(lmtest)
install.packages("sandwich")
library(sandwich)

coeftest(reg, vcov = vcovHC(reg, type = "HC1"))
# t test of coefficients:
#
#                Estimate  Std. Error t value  Pr(>|t|)
# (Intercept)  1.4460e+04  1.3406e+02 107.863 < 2.2e-16 ***
# income      -1.1956e-01  2.4259e-03 -49.286 < 2.2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Theory of Fixed Effects
The intuition behind multiple regression is often illustrated as using a second continuous covariate to “control for” an alternate explanation between an outcome and an explanatory variable. A classic example is using a measure of temperature to control away the spurious correlation between homicides and ice cream sales.
But “control” variables can also be categorical. In this case they often take the form of a binary indicator (or dummy) variable which identifies mutually exclusive membership or non-membership in a given group. When entered in the regression model, this covariate computationally eliminates the difference in means between group members and the rest of the sample, along with any confounding factors associated with this mean difference.
This is a powerful adjustment! A classic example of this is using a binary indicator of sex assigned at birth as a covariate. For example, in a regression of the relationship between wages (outcome) and education (explanatory), we likely want to control for this “sex at birth” dummy to (partially) remove confounding mean differences associated with socially constructed norms related to gender, education, and career.
Note well that in practice, we would enter a single “dummy” variable indicating either male or female sex assigned at birth. The excluded category is called the “reference” category. By excluding it, we computationally allow the mean difference between groups to be parametrically estimated.
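As a minimal sketch (assuming a hypothetical data frame wages_df with columns wage, education, and sex, none of which are part of our county data set), such a dummy enters an R regression like this:

# Hypothetical data frame wages_df with columns wage, education, sex
wages_df$sex <- factor(wages_df$sex)   # e.g., levels "F" and "M"
reg_dummy <- lm(wage ~ education + sex, wages_df)
summary(reg_dummy)
# R drops the first factor level ("F") as the reference category; the
# coefficient on sexM estimates the mean difference vs. that reference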
Cross Sectional Fixed Effects
We can use "dummy" variables to control for membership in multiple groups using a similar procedure. Now consider our sample of 3,000 US counties nested in 50 US states. The relationship between premature death rate and income in these counties may plausibly be confounded at the state level, since counties within the same state share many confounding factors related to health, such as socioeconomic conditions, political culture, health infrastructure, and environmental conditions.
To remove these differences between states, we apply the same theory of fixed effects (FE) by entering a "dummy" variable indicating membership in each state. Just as before, we exclude a single state as the "reference" category against which the mean difference for each state will be calculated. Any state can be chosen as the reference.
In our case, regression software will choose the reference category for us automatically. We are left with a regression of premature death rate on income that includes a vector of 49 state indicator variables, which we refer to as state fixed effects (FE). Since state membership is a characteristic that varies across groupings of observational units in the sample, state FE can be more generally referred to as "cross-sectional" FE. Cross-sectional FE remove all confounding variation that is shared within each cross-sectional group from the estimates.
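As a sketch with lm(), assuming the state abbreviation can be extracted from the first two characters of state_county (as the head(df) output suggests):

# Derive a state column from state_county (assumes "AL Autauga County" format)
df$state <- substr(df$state_county, 1, 2)
state_fe <- lm(prem ~ income + factor(state), df)
summary(state_fe)
# factor(state) expands into 49 state dummies; lm() automatically drops
# one level as the reference category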
Time Fixed Effects
Aside from cross-sectional groupings such as location (state), time period is another salient grouping which may introduce bias in regression models. Consider our model of 3,000 US counties nested in 50 US states. Now imagine we observe the sample over 2 periods, for example once in 2016 and once in 2018. Observations in the sample, even within the same county, may be very different in the two periods.
Importantly, premature death rates and incomes across the entire sample may shift between these periods due to changes in factors like macroeconomic trends, technological advances, or the political regime, which impact the entire sample. We want to hold these period-wide factors constant, so we can estimate the aggregate relationship across the entire sample with reduced bias.
Just as we used dummy variable indicators to remove mean differences between states in our example of cross-sectional fixed effects (FE), we can add a dummy variable indicating whether a particular observation was recorded in 2018 (included category) or 2016 (reference category). With n periods we would include n-1 dummy variables, one for each non-reference year. These indicators are called time FE, and they remove all confounding factors trending over time that are shared across the entire sample.
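In our data, period is already a column, so a sketch of the time FE model with lm() is simply:

# Time FE: factor(period) adds a dummy for 2018, with 2016 as the reference
time_fe <- lm(prem ~ income + factor(period), df)
summary(time_fe)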
Two-Way Fixed Effects
The key insight of fixed effects (FE) is that whenever we have a group of two or more observations in our data, we can use a dummy variable indicator to remove the mean difference between the group and remaining sample, eliminating with it all shared confounding variation.
Consider once more our sample of 3,000 USA counties observed over 2 periods (2016, 2018). The presence of two periods of data for each county opens a new opportunity for us. Each county now constitutes a unique grouping in the data, with two observations each (2016 & 2018). This data structure (called panel data) facilitates the introduction of observational unit FE.
By adding a single dummy variable indicator for each county (excluding one as reference) we can now remove mean differences between counties, and all associated confounding, from regression estimates. This is a powerful method to “control” for county-level confounders such as differences in average wealth, education, or health status, which constitute the most obvious threats of bias.
The introduction of county FE removes all time-invariant variation from the sample, in other words all factors that do not change over time within a county. Thus, the only variation remaining in the estimates comes from factors that change over time within a county. This changes the interpretation of the regression coefficient: while a cross-sectional regression measures the relationship between levels of an outcome (premature death) and a covariate (income), the county FE model measures the relationship between changes in premature death and changes in income over time.
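This demeaning logic can be verified by hand. As a sketch, subtracting county means from both variables and regressing the demeaned series reproduces the slope of a county-FE regression:

# Within-transformation by hand: subtract county means from both variables
df$prem_dm   <- df$prem   - ave(df$prem,   df$state_county)
df$income_dm <- df$income - ave(df$income, df$state_county)
within_reg <- lm(prem_dm ~ income_dm, df)
coef(within_reg)["income_dm"]  # matches the income slope from a county-FE model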
For added robustness, don't forget to include time period fixed effects in your observational unit fixed effects model. This removes problematic time trends shared across the sample, which is especially important when using an extended data set, for example one covering 10, 20, or more periods. The robust model including both unit and time FE is called a two-way fixed effects model, and it has traditionally been the gold standard for observational causal inference in the quantitative social sciences.
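Conceptually, the TWFE model is just a regression with both sets of dummies. A sketch with lm() (correct, but slow with thousands of county dummies, which is why we turn to felm() below):

# TWFE via explicit dummies: one set for counties, one for periods
twfe_lm <- lm(prem ~ income + factor(state_county) + factor(period), df)
coef(twfe_lm)["income"]  # same point estimate as the felm() model below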
Cluster-Robust Standard Errors
A final note on the two-way fixed effects (TWFE) model: Recall that we used robust standard errors (SE) to correct significance tests in our cross-sectional regression model for unequal error variance across observations (heteroskedasticity).
In the TWFE model, a second threat to valid significance tests arises because there is more than one observation per cross-sectional unit (two years per county in the context of our model). Since the data points within a given county are likely to share similar variation, their error terms are also likely to be correlated with each other (a form of within-unit autocorrelation).
The solution is called clustering, a statistical adjustment to the SE calculation that allows arbitrary error correlation within each "cluster" or group. In TWFE models, it is standard practice to cluster SE on the observational unit. In our model of 3,000 USA counties over 2 periods with county and period fixed effects, we would cluster the SE by county.
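For intuition, the same adjustment can be applied to a plain lm() fit via the "sandwich" package. A minimal sketch, clustering on the county identifier from our data set:

# Cluster-robust SEs for the simple regression, clustered by county
library(lmtest)
library(sandwich)
coeftest(reg, vcov = vcovCL(reg, cluster = df$state_county))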
Implementation in R
Implementation of the two-way fixed effects (TWFE) estimator in R is quite simple using the cutting-edge felm() function from the "lfe" package. While R users have traditionally estimated panel data models with the plm() function, that approach is now considered antiquated among most working applied econometricians using R.
felm() uses modern computational methods (sweeping out the FE by demeaning rather than estimating thousands of dummy coefficients) that are more efficient in contexts with complex FE structure. While this may not be strictly necessary for our TWFE model, using felm() should be considered best practice.
See implementation below:
install.packages("lfe") library(lfe) twfe <- felm(prem ~ income | state_county + period | 0 | state_county, df) summary(twfe) # Call: # felm(formula = prem ~ income | state_county + period | 0 | state_county, data = df) # # Residuals: # Min 1Q Median 3Q Max # -18503.1 -317.5 0.0 317.5 18503.1 # # Coefficients: # Estimate Cluster s.e. t value Pr(>|t|) # income -0.012293 0.009082 -1.354 0.176 # # Residual standard error: 1204 on 3032 degrees of freedom # Multiple R-squared(full model): 0.9017 Adjusted R-squared: 0.8003 # Multiple R-squared(proj model): 0.0004803 Adjusted R-squared: -1.031 # F-statistic(full model, *iid*):8.891 on 3129 and 3032 DF, p-value: < 2.2e-16 # F-statistic(proj model): 1.832 on 1 and 3032 DF, p-value: 0.1759
Just like with the lm() function, the first argument of felm() is the regression equation. The part after the first "|" character takes the column names of the FE dimensions; in this case we use two, county (state_county) and period. The part after the second "|" character is where you can specify instrumental variables (not relevant for this blog post).
For now, filling that part with "0" negates the argument. After the last "|" character you specify the standard error (SE) clustering dimension. Finally, as in lm(), we specify our data frame. Once the felm() object is created, summary() dispatches to summary.felm() from the "lfe" package, which reports cluster-robust SE whenever a clustering dimension is specified.
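To make the four formula parts concrete, here are a couple of variations (a sketch; trailing parts that are not needed can simply be omitted):

# County FE only, no IV, no clustering (the trailing parts default to 0)
fe_only <- felm(prem ~ income | state_county, df)

# County and period FE without clustered SEs; summary.felm() accepts a
# robust argument for heteroskedasticity-robust SEs
twfe_noclust <- felm(prem ~ income | state_county + period, df)
summary(twfe_noclust, robust = TRUE)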
In case you want to see an alternative example of how to estimate a two-way fixed effects model using the R programming language, you may have a look at the corresponding video on Miklesh Yadav's YouTube channel.
Conclusion
Perhaps the largest threat to validity in regression analysis is confounding by lurking variables that are not accounted for in the regression. While many lurking variables can be measured and “controlled for” via multiple regression, there will always be “unobservable” confounders that cannot be directly measured (or possibly even imagined).
One of the best weapons we have against unobservable confounders is the use of fixed effects to remove mean differences between groups of data points, along with all confounding “unobservable” factors associated with those groupings. In the two-way fixed effects model, we are able to control for all unobservable characteristics of observational units that are fixed over time, and all shared unobservable factors changing over time in the sample.
This is a formidable weapon in our arsenal when seeking to make causal inferences from observational data via regression models.