I worked with several students over the last few months who were fitting many linear models, all with the same basic structure but different response variables. They were struggling to find an efficient way to do this in R while still taking the time to check model assumptions.

A first step when working towards a more automated process for fitting many models is to learn how to build model formulas with paste() and as.formula(). Once we learn how to build model formulas we can create functions to streamline the model fitting process.

I will be making residuals plots with ggplot2 today so will load it here.

library(ggplot2) # v.3.2.0

Building a formula with paste() and as.formula()

Model formula of the form y ~ x can be built based on variable names passed as character strings. A character string means the variable name will have quotes around it.

The function paste() is great for the job of concatenating strings together. (Also see glue::glue().)

Here is an example, using mpg as the response variable and am as the explanatory variable. These need to be separated by a tilde, ~. In this example I paste the character string "mpg" to the character string "~ am".

The output of this is the formula as a string. You can see it is in quotes in the output.

paste("mpg", "~ am")
# [1] "mpg ~ am"

We turn this into a formula ready to be passed to a model function using as.formula().

as.formula( paste("mpg", "~ am") )
# mpg ~ am

Using a constructed formula in lm()

Once we’ve built the formula we can put it in as the first argument of a model fitting function like lm() in order to fit the model. I’ll be using the mtcars dataset throughout the model fitting examples.

Since am is a 0/1 variable, this particular analysis is a two-sample t-test with mpg as the response variable.

lm( as.formula( paste("mpg", "~ am") ), data = mtcars)
# 
# Call:
# lm(formula = as.formula(paste("mpg", "~ am")), data = mtcars)
# 
# Coefficients:
# (Intercept)           am  
#      17.147        7.245

Making a function for model fitting

This paste() / as.formula() combination is essential for making user-defined model fitting functions.

For example, say I wanted to do the same t-test with am for many response variables. I could create a function that takes the response variable as an argument and build the model formula within the function with paste() and as.formula().

The response variable name is passed to the response argument as a character string.

lm_fun = function(response) {
  lm( as.formula( paste(response, "~ am") ), data = mtcars)
}

Here are two examples of this function in action, using mpg and then wt as the response variables.

lm_fun(response = "mpg")
# 
# Call:
# lm(formula = as.formula(paste(response, "~ am")), data = mtcars)
# 
# Coefficients:
# (Intercept)           am  
#      17.147        7.245
lm_fun(response = "wt")
# 
# Call:
# lm(formula = as.formula(paste(response, "~ am")), data = mtcars)
# 
# Coefficients:
# (Intercept)           am  
#       3.769       -1.358

Using bare names instead of strings (i.e., non-standard evaluation)

As you can see, this approach to building formula relies on character strings. This is going to be great once we start looping through variable names, but if making a function for interactive use it can be nice for the user to pass bare column names.

We can use some deparse()/substitute() magic in the function for this. Those two functions will turn bare names into strings within the function rather than having the user pass strings directly.

lm_fun2 = function(response) {
  resp = deparse( substitute( response) )
  lm( as.formula( paste(resp, "~ am") ), data = mtcars)
}

Here’s an example of this function in action. Note the use of the bare column name for the response variable.

lm_fun2(response = mpg)
# 
# Call:
# lm(formula = as.formula(paste(resp, "~ am")), data = mtcars)
# 
# Coefficients:
# (Intercept)           am  
#      17.147        7.245

You can see that one thing that happens when using as.formula() like this is that the formula in the model output shows the formula-building code instead of the actual variables used in the model.

Call:  
lm(formula = as.formula(paste(response, "~ am")), data = mtcars)  

While this often won’t matter in practice, there are ways to force the model to show the variables used in the model fitting. See this blog post for some discussion as well as code for how to do this.

Building a formula with varying explanatory variables

The formula building approach can also be used for fitting models where the explanatory variables vary. The explanatory variables should have plus signs between them on the right-hand side of the formula. To get plus signs between a bunch of variables we can use the collapse argument in paste().

Here’s an example with a vector of two explanatory variables. Passing a vector of variable names to paste() and using collapse = "+" puts plus signs between the variables.

expl = c("am", "disp")
paste(expl, collapse = "+")
# [1] "am+disp"

This additional pasting step can then be included within as.formula().

as.formula( paste("mpg ~", paste(expl, collapse = "+") ) )
# mpg ~ am + disp

Let’s go through an example of using this in a function that can fit a model with different explanatory variables.

In this function I demonstrate building the formula as a separate step and then passing it to lm(). This can be easier to read compared to building the formula within lm() as a single step like I did earlier.

lm_fun_expl = function(expl) {
  form = as.formula( paste("mpg ~ ", paste(expl, collapse = "+") ) )
  lm(form, data = mtcars)
}

To use the function we pass a vector of variable names as strings to the expl argument.

lm_fun_expl(expl = c("am", "disp") )
# 
# Call:
# lm(formula = form, data = mtcars)
# 
# Coefficients:
# (Intercept)           am         disp  
#    27.84808      1.83346     -0.03685

The dots for passing many variables to a function

Using dots (…) instead of named arguments can allow the user to list the explanatory variables separately instead of in a vector.

I’ll demonstrate a function using dots to indicate some undefined number of additional arguments for putting as many explanatory variables as desired into the model. I wrap the dots in c() within the function in order to collapse variables together with +.

Note that if you want non-standard evaluation you could wrap the ... in rlang::exprs() within the function and then pass bare variable names.

lm_fun_expl2 = function(...) {
  form = as.formula( paste("mpg ~ ", paste( c(...), collapse = "+") ) )
  lm(form, data = mtcars)
}

Now variables are passed individually as strings separated by commas instead of as a vector.

lm_fun_expl2("am", "disp")
# 
# Call:
# lm(formula = form, data = mtcars)
# 
# Coefficients:
# (Intercept)           am         disp  
#    27.84808      1.83346     -0.03685

Example function that returns residuals plots and model output

One of the reasons to make a function is to increase efficiency when fitting many models. For example, it might be useful to make a function that returns residual plots and any desired statistical results simultaneously.

Here’s an example of such a function, using some of the tools covered above. The function takes the response variable as a bare name, fits a model with am hard-coded as the explanatory variable and the mtcars dataset, and then makes two residual plots.

The function outputs a list that contains the two residuals plots as well as the overall \(F\) tests from the model.

lm_modfit = function(response) {
  resp = deparse( substitute( response) )
  mod = lm( as.formula( paste(resp, "~ am") ), data = mtcars)
  resvfit = qplot(x = mod$fit, y = mod$res) + theme_bw()
  resdist = qplot(x = "Residual", mod$res, geom = "boxplot") + theme_bw()
  list(resvfit, resdist, anova(mod) )
}

mpgfit = lm_modfit(mpg)

Individual parts of the output list can be extracted as needed. To check model assumptions prior to looking at any results we’d pull out the two plots, which are the first two elements of the output list.

mpgfit[1:2]
# [[1]]

# 
# [[2]]

If we deem the model fit acceptable we can extract the overall \(F\) tests from the third element of the output.

mpgfit[[3]]
# Analysis of Variance Table
# 
# Response: mpg
#           Df Sum Sq Mean Sq F value   Pr(>F)    
# am         1 405.15  405.15   16.86 0.000285 ***
# Residuals 30 720.90   24.03                     
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Next step: looping

This post focused on using paste() and as.formula() for building model formula and then making user-defined functions for interactive use. When working with many models we’d likely want to automate the process more by using some sort of looping. I’ll write a follow-up post on looping through variables and fitting models with the map family of functions from package purrr.

Just the code, please

Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code from here.

library(ggplot2) # v.3.2.0

paste("mpg", "~ am")

as.formula( paste("mpg", "~ am") )

lm( as.formula( paste("mpg", "~ am") ), data = mtcars)

lm_fun = function(response) {
  lm( as.formula( paste(response, "~ am") ), data = mtcars)
}

lm_fun(response = "mpg")
lm_fun(response = "wt")

lm_fun2 = function(response) {
  resp = deparse( substitute( response) )
  lm( as.formula( paste(resp, "~ am") ), data = mtcars)
}

lm_fun2(response = mpg)

expl = c("am", "disp")
paste(expl, collapse = "+")

as.formula( paste("mpg ~", paste(expl, collapse = "+") ) )
lm_fun_expl = function(expl) {
  form = as.formula( paste("mpg ~ ", paste(expl, collapse = "+") ) )
  lm(form, data = mtcars)
}

lm_fun_expl(expl = c("am", "disp") )

lm_fun_expl2 = function(...) {
  form = as.formula( paste("mpg ~ ", paste( c(...), collapse = "+") ) )
  lm(form, data = mtcars)
}

lm_fun_expl2("am", "disp")

lm_modfit = function(response) {
  resp = deparse( substitute( response) )
  mod = lm( as.formula( paste(resp, "~ am") ), data = mtcars)
  resvfit = qplot(x = mod$fit, y = mod$res) + theme_bw()
  resdist = qplot(x = "Residual", mod$res, geom = "boxplot") + theme_bw()
  list(resvfit, resdist, anova(mod) )
}

mpgfit = lm_modfit(mpg)
mpgfit[1:2]
mpgfit[[3]]