Write Model Formula with Many Variables of Data Frame in R (5 Examples)

 

This article illustrates how to write formulas containing many variables in a less time-consuming way in the R programming language.

Here’s how to do it!

 

Creating Example Data

As a first step, we have to define some data that we can use in the examples below:

set.seed(26537948)                              # Create example data
x1 <- round(rnorm(100), 2)
x2 <- round(rnorm(100) + x1, 2)
x3 <- round(rnorm(100) + 0.1 * x1 + 0.3 * x2, 2)
x4 <- round(rnorm(100) + 0.3 * x1 - 0.5 * x2 - 0.05 * x3, 2)
x5 <- round(rnorm(100) + x4, 2)
y <- round(rnorm(100) + 0.2 * x2 - 0.1 * x4 + 0.3 * x5, 2)
data <- data.frame(x1, x2, x3, x4, x5, y)
head(data)                                      # Print head of example data

 

[Table 1: head of the example data frame]

 

Table 1 shows the structure of our example data: It has six numeric columns. The variables x1-x5 will be used as predictors for the target variable y.

 

Example 1: Specify Predictors of Linear Model Manually

This example illustrates how to estimate a linear regression model by specifying each predictor variable manually (i.e. the “traditional” way).

For this, we can specify each predictor variable separated by a + sign within the lm() function:

mod1 <- lm(y ~ x1 + x2 + x3 + x4 + x5, data)    # Specify predictors manually
summary(mod1)                                   # Summary output
# 
# Call:
# lm(formula = y ~ x1 + x2 + x3 + x4 + x5, data = data)
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -2.76719 -0.70849  0.02008  0.60865  2.48361 
# 
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  0.029946   0.099582   0.301   0.7643    
# x1           0.277766   0.134792   2.061   0.0421 *  
# x2           0.142349   0.122809   1.159   0.2493    
# x3           0.031034   0.099008   0.313   0.7546    
# x4          -0.006211   0.142101  -0.044   0.9652    
# x5           0.392585   0.095768   4.099 8.79e-05 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.9866 on 94 degrees of freedom
# Multiple R-squared:  0.3281,	Adjusted R-squared:  0.2923 
# F-statistic: 9.178 on 5 and 94 DF,  p-value: 3.87e-07
#

The previous output of the RStudio console shows the summary statistics we have estimated with our linear regression model.

As you can see, we have used all variables as predictors (i.e. x1-x5). However, typing out every predictor becomes tedious and error-prone as the number of columns in our data frame grows.

In the following examples, I’ll therefore explain how to specify a model formula more efficiently!

 

Example 2: Specify All Variables of Data Frame as Predictors

In this example, I’ll explain how to use all variables as predictors with a simple dot (i.e. .). Have a look at the following R code and its output:

mod2 <- lm(y ~ ., data)                         # All variables as predictors
summary(mod2)                                   # Summary output
# 
# Call:
# lm(formula = y ~ ., data = data)
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -2.76719 -0.70849  0.02008  0.60865  2.48361 
# 
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  0.029946   0.099582   0.301   0.7643    
# x1           0.277766   0.134792   2.061   0.0421 *  
# x2           0.142349   0.122809   1.159   0.2493    
# x3           0.031034   0.099008   0.313   0.7546    
# x4          -0.006211   0.142101  -0.044   0.9652    
# x5           0.392585   0.095768   4.099 8.79e-05 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.9866 on 94 degrees of freedom
# Multiple R-squared:  0.3281,	Adjusted R-squared:  0.2923 
# F-statistic: 9.178 on 5 and 94 DF,  p-value: 3.87e-07
#

The previous output is exactly the same as in Example 1. However, this time we have used a dot instead of defining all variables manually.

This works because R interprets a dot on the right-hand side of a formula as all columns of the data frame that do not already appear in the formula (here: every column except y).
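To check what the dot stands for, one option is to let the terms() function expand the formula against a data frame. The following is only a minimal sketch: the small data frame df_demo is a made-up stand-in and not the example data of this tutorial.

```r
# Hypothetical toy data (not the article's data): three predictors and a response y
df_demo <- data.frame(x1 = c(1, 2, 3, 4),
                      x2 = c(2, 1, 4, 3),
                      x3 = c(0, 1, 0, 1),
                      y  = c(1.5, 2.5, 3.0, 4.5))

# terms() expands the dot to every column that is not on the left-hand side
expanded <- terms(y ~ ., data = df_demo)
dot_predictors <- attr(expanded, "term.labels")
dot_predictors                                  # Print expanded predictors
# [1] "x1" "x2" "x3"
```

The response y is not part of the expanded predictors, which is why lm(y ~ ., data) does not regress y on itself.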

 

Example 3: Exclude Certain Variables from Model

This example shows how to remove particular variables from our regression model using a dot and a - sign.

Consider the following R syntax. As you can see, we are using a dot (as in the previous example) and in addition we specify - x2.

Let’s execute this code and let’s have a look at its output:

mod3 <- lm(y ~ . - x2, data)                    # Exclude variables
summary(mod3)                                   # Summary output
# 
# Call:
# lm(formula = y ~ . - x2, data = data)
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -2.64872 -0.61466  0.05135  0.61115  2.51644 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  0.03742    0.09955   0.376 0.707845    
# x1           0.38096    0.10139   3.757 0.000296 ***
# x3           0.07395    0.09199   0.804 0.423500    
# x4          -0.02943    0.14094  -0.209 0.835063    
# x5           0.37945    0.09527   3.983 0.000133 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.9884 on 95 degrees of freedom
# Multiple R-squared:  0.3185,	Adjusted R-squared:  0.2898 
# F-statistic:  11.1 on 4 and 95 DF,  p-value: 1.989e-07
#

As you can see, the - x2 has removed the variable x2 from our list of independent variables.
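The same idea extends to several variables by chaining further - terms. The sketch below assumes a made-up data frame df_demo (not the example data of this tutorial) and drops two columns at once.

```r
# Hypothetical toy data (not the article's data)
df_demo <- data.frame(x1 = c(1, 2, 3, 4, 5, 6),
                      x2 = c(2, 1, 4, 3, 6, 5),
                      x3 = c(0, 1, 0, 1, 1, 0),
                      x4 = c(3, 1, 2, 6, 4, 5),
                      y  = c(1.2, 2.1, 2.9, 4.3, 5.2, 5.8))

# Use all columns as predictors, except x2 and x4
mod_demo <- lm(y ~ . - x2 - x4, data = df_demo)
kept_terms <- names(coef(mod_demo))
kept_terms                                      # Print remaining model terms
# [1] "(Intercept)" "x1"          "x3"
```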

 

Example 4: Specify Formula of Linear Model Using as.formula() Function

In Example 4, I’ll explain how to apply the as.formula and paste functions to specify a formula outside the lm function.

Have a look at the following R code:

my_formula1 <- as.formula(                      # Create formula
  paste("y ~ ", paste(paste0("x", 2:5), collapse = " + ")))
my_formula1                                     # Print formula
# y ~ x2 + x3 + x4 + x5

The previous syntax created a formula object specifying that we want to use all variables but x1 as predictors.

Now, we can use this formula object within the lm function as shown below:

mod4 <- lm(my_formula1, data)                   # Estimate model based on formula
summary(mod4)                                   # Summary output
# 
# Call:
# lm(formula = my_formula1, data = data)
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -2.87467 -0.70598  0.03668  0.60646  2.49772 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  0.01776    0.10109   0.176  0.86092    
# x2           0.30950    0.09377   3.300  0.00136 ** 
# x3           0.01658    0.10043   0.165  0.86925    
# x4           0.01673    0.14406   0.116  0.90779    
# x5           0.40384    0.09723   4.153 7.16e-05 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 1.003 on 95 degrees of freedom
# Multiple R-squared:  0.2977,	Adjusted R-squared:  0.2681 
# F-statistic: 10.07 on 4 and 95 DF,  p-value: 7.762e-07
#

Easy!
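As a possible alternative to combining as.formula and paste, base R also provides the reformulate function, which builds a formula from a character vector of term labels. The sketch below is not the approach used above; it only illustrates that both routes lead to the same formula for the predictors x2 to x5.

```r
predictors <- paste0("x", 2:5)                  # Character vector of predictor names

# Approach of this example: paste the pieces together and convert to a formula
f_paste <- as.formula(paste("y ~", paste(predictors, collapse = " + ")))

# Alternative: let reformulate() assemble the formula
f_reform <- reformulate(predictors, response = "y")
f_reform                                        # Print formula
# y ~ x2 + x3 + x4 + x5

identical(deparse(f_paste), deparse(f_reform))  # Compare both formulas as text
# [1] TRUE
```

Which of the two to use is mostly a matter of taste; reformulate avoids assembling the string by hand.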

 

Example 5: Specify Formula of Linear Model Using Index Positions of Columns

The following R code shows how to apply the as.formula function to create a formula based on the index positions of the columns in our data frame.

Let’s assume that we want to create a formula containing the second, fourth, and fifth variables of our data frame as predictors:

my_formula2 <- as.formula(                      # Create formula
  paste("y ~ ", paste(colnames(data)[c(2, 4, 5)], collapse = " + ")))
my_formula2                                     # Print formula
# y ~ x2 + x4 + x5

As you can see, the previous R code has created another formula object.

Let’s use this formula to estimate another linear regression model:

mod5 <- lm(my_formula2, data)                   # Estimate model based on formula
summary(mod5)                                   # Summary output
# 
# Call:
# lm(formula = my_formula2, data = data)
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -2.87855 -0.71755  0.04527  0.61073  2.49457 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  0.01815    0.10055   0.180 0.857170    
# x2           0.31626    0.08393   3.768 0.000284 ***
# x4           0.01509    0.14299   0.105 0.916199    
# x5           0.40381    0.09674   4.174 6.57e-05 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.9982 on 96 degrees of freedom
# Multiple R-squared:  0.2975,	Adjusted R-squared:  0.2755 
# F-statistic: 13.55 on 3 and 96 DF,  p-value: 1.916e-07
#

Looks good!
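Building formulas from index positions is mainly useful inside loops or user-defined functions. The following sketch shows one way this could look. The helper formula_from_index and the data frame df_demo are hypothetical names introduced only for this illustration.

```r
# Hypothetical helper (not from the article): build a formula from column positions
formula_from_index <- function(df, response, idx) {
  predictors <- setdiff(colnames(df)[idx], response)  # Never use the response as predictor
  as.formula(paste(response, "~", paste(predictors, collapse = " + ")))
}

# Hypothetical toy data (not the article's data)
df_demo <- data.frame(x1 = c(1, 2, 3, 4, 5, 6),
                      x2 = c(2, 1, 4, 3, 6, 5),
                      x3 = c(0, 1, 0, 1, 1, 0),
                      y  = c(1.2, 2.1, 2.9, 4.3, 5.2, 5.8))

index_sets <- list(c(1, 2), c(2, 3), 1:3)       # Predictor sets to try

fitted_formulas <- character(length(index_sets))
for (i in seq_along(index_sets)) {              # Fit one model per predictor set
  f <- formula_from_index(df_demo, "y", index_sets[[i]])
  fit <- lm(f, data = df_demo)
  fitted_formulas[i] <- deparse(formula(fit))
}
fitted_formulas                                 # Print formulas of fitted models
# [1] "y ~ x1 + x2"      "y ~ x2 + x3"      "y ~ x1 + x2 + x3"
```

The setdiff call is a small safeguard: if an index accidentally points at the response column, it is dropped instead of ending up on both sides of the formula.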

 

Video & Further Resources

Do you want to learn more about regression models in R? Then you could watch the following video on my YouTube channel. In the video, I’m explaining the R programming code of this article.

 

The YouTube video will be added soon.

 

 

Summary: At this point of the article you should know how to write a model formula succinctly in R programming, either with the dot shorthand or by building the formula from a character vector of column names. The approaches shown in the present tutorial ranged from very basic specifications to more automated ones that can be used in settings such as for-loops or user-defined functions. Let me know in the comments in case you have further questions.

 


2 Comments

  • I want to run a model where my response variable is a combination of 3 factors, e.g.
    y = a * X1 * b * X2 * c * X3 with intercept = 0.
    How do I code this in R?
    So far I thought it is lm(y ~ 0+x1+x2+x3),
    but since the formula has “+” signs, it seems to be additive, not multiplicative. Thanks for your help.

    • Hi Vera,

      I have never seen a model where all components are multiplied. Are you sure that this is correct? The model you have specified, i.e. lm(y ~ 0+x1+x2+x3), seems to be correct. But this also depends on what you want to do exactly.

      Regards,
      Joachim
