Confession: I love simulations.

In simulations you get to define everything about a model and then see how that model behaves over the long run. It’s like getting the luxury of taking many samples instead of only the one real one we have resources for in an actual study.

I find simulations incredibly useful in understanding statistical theory and assumptions of linear models. When someone tells me with great certainty “I don’t need to meet that assumption because [fill in the blank]” or asks “Does it matter that [something complicated]?”, I often turn to simulations instead of textbooks to check.

I like simulations for the same reasons I like building Bayesian models and using resampling methods (i.e., Monte Carlo) for inference. Building the simulation increases my understanding of the problem and makes all the assumptions clearer to me because I must use them explicitly. Plus it’s fun to put the code together and explore the results. 🙂

Simulate, simulate, dance to the music

Simulations have been so helpful in my own understanding of statistical models that I find myself wishing I knew how to use them more in teaching and consulting. Being able to build a simulation could really help folks understand the strengths and weaknesses of their analysis. I haven’t managed to fit it in so far, but it’s always on my mind. Hence, this post.

Today I’m going to go over an example of simulating data from a two-group linear model. I’ll work work up to linear mixed models and generalized linear mixed models (the fun stuff! 😆) in subsequent posts.

The statistical model

Warning: Here there be equations.

If you’re like me and your brain says “I think this section must not pertain to me” when your eyes hit mathematical notation, you can jump right down to the R code in the next section. But if you can power through, these equations are actually pretty useful when setting up a simulation (honest).

A simulation for a linear model is based on the statistical model. The statistical model is an equation that describes the processes we believe gave rise to the observed response variable. It includes parameters to describe the assumed effect of explanatory variables on the response variable as well as a description of any distributions associated with processes we assume are random variation. (There is more in-depth coverage of the statistical model in Stroup’s 2013 Generalized Linear Mixed Models book if you are interested and have access to it.)

So the statistical model is where we write down the exact assumptions we are making when we fit a linear model to a set of data.

Here is an example of a linear model for two groups. I wrote the statistical model to match the form of the default summary output from a model fit with lm() in R.

\[y_t = \beta_0 + \beta_1*I_{(group_t=\textit{''group''})} + \epsilon_t\]

  • \(y_t\) is the observed values for the quantitative response variable; \(t\) goes from 1 to the number of observations in the dataset
  • \(\beta_0\) is the mean response variable when the group is “group1”
  • \(\beta_1\) is the difference in mean response between the groups, “group2” minus “group1”.
  • The indicator variable, \(I_{(group_t=\textit{''group2''})}\), is 1 when the group is “group2” and 0 otherwise, as \(\beta_1\) only affects the response variable for observations in “group2”
  • \(\epsilon_t\) is the random variation present for each observation that is not explained by the group variable. These are assumed to come from an iid normal distribution with a mean of 0 and some shared variance, \(\sigma^2\): \(\epsilon_t \thicksim N(0, \sigma^2)\)

A single simulation from a two-group model

I use the statistical model to build a simulation. In this case I’ll call my response variable “growth”, and the two groups “group1” and “group2”. I’ll have 10 observations per group (it’s possible to simulate unbalanced groups but balanced groups is a good place to start).

I’ll set my seed so these particular results can be reproduced.

set.seed(16)

I start out by defining what the “truth” is in the simulation by setting all the parameters in the statistical model to a value of my choosing. Here’s what I’ll do today.

  • The true group mean (\(\beta_0\)) for “group1” will be 5
  • The mean of “group2” will be 2 less than “group1” (\(\beta_1\))
  • The shared variance will be set at 4 (\(\sigma^2\)), so the standard deviation (\(\sigma\)) is 2.

I’ll define the number of groups and number of replicates per group while I’m at it. The total number of observations is the number of groups times the number of replicates per group, which is 2*10 = 20.

ngroup = 2
nrep = 10
b0 = 5
b1 = -2
sd = 2

I need to create the variable I’ll call “group” to use as the explanatory variable when fitting a model in R. I use rep() a lot when doing simulations in order to repeat values of variables to appropriately match the scenario I’m working in. Here I’ll repeat each level of group 10 times (nrep).

( group = rep( c("group1", "group2"), each = nrep) )
#  [1] "group1" "group1" "group1" "group1" "group1" "group1" "group1"
#  [8] "group1" "group1" "group1" "group2" "group2" "group2" "group2"
# [15] "group2" "group2" "group2" "group2" "group2" "group2"

Next I’ll simulate the random errors. Remember I defined these above as \(\epsilon_t \thicksim N(0, \sigma^2)\). To simulate these I’ll take random draws from a normal distribution with a mean of 0 and standard deviation of 2 (note that rnorm() in takes the standard deviation as input, not the variance). Every observation has a random error, so I need to make 20 draws total (ngroup*nrep).

( eps = rnorm(ngroup*nrep, 0, sd) )
#  [1]  0.9528268 -0.2507600  2.1924324 -2.8884581  2.2956586 -0.9368241
#  [7] -2.0119012  0.1271254  2.0499452  1.1462840  3.6943642  0.2238667
# [13] -1.4920746  3.3164273  1.4434411 -3.3261610  1.1518191  0.9455202
# [19] -1.0854633  2.2553741

Now I have the fixed estimates of parameters, the variable group on which to base the indicator variable, and the simulated errors drawn from the defined distribution. That’s all the pieces I need to calculate my response variable.

The statistical model

\[y_t = \beta_0 + \beta_1*I_{(group_t=\textit{''group''})} + \epsilon_t\]

is my guide for how to combine these pieces to create the simulated response variable, \(y_t\). Notice I create the indicator variable in R with group == "group2" and call the simulated response variable growth.

( growth = b0 + b1*(group == "group2") + eps )
#  [1]  5.952827  4.749240  7.192432  2.111542  7.295659  4.063176  2.988099
#  [8]  5.127125  7.049945  6.146284  6.694364  3.223867  1.507925  6.316427
# [15]  4.443441 -0.326161  4.151819  3.945520  1.914537  5.255374

It’s not necessary for this simple case, but I often store the variables I will use in fitting the model in a dataset to help keep things organized. This becomes more important when working with more variables. I’ll skip this step today.

Once the response and explanatory variables have been created, it’s time for model fitting. I can fit the two group linear model with lm().

growthfit = lm(growth ~ group)
summary(growthfit)
# 
# Call:
# lm(formula = growth ~ group)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
# -4.039 -1.353  0.336  1.603  2.982 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   5.2676     0.6351   8.294 1.46e-07 ***
# groupgroup2  -1.5549     0.8982  -1.731    0.101    
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 2.008 on 18 degrees of freedom
# Multiple R-squared:  0.1427,  Adjusted R-squared:  0.0951 
# F-statistic: 2.997 on 1 and 18 DF,  p-value: 0.1005

Make a function for the simulation

A single simulation can help us understand the statistical model, but it doesn’t help us see how the model behaves over the long run. To repeat this simulation many times in R we’ll want to “functionize” the data simulating and model fitting process.

In my function I’m going to set all the arguments to the parameter values I’ve defined above. I allow some flexibility, though, so the argument values can be changed if I want to explore the simulation with different coefficient values, a different number of replications, or a different amount of random variation.

This function returns a linear model fit with lm.

twogroup_fun = function(nrep = 10, b0 = 5, b1 = -2, sigma = 2) {
     ngroup = 2
     group = rep( c("group1", "group2"), each = nrep)
     eps = rnorm(ngroup*nrep, 0, sigma)
     growth = b0 + b1*(group == "group2") + eps
     growthfit = lm(growth ~ group)
     growthfit
}

I test the function, using the same seed, to make things are working as expected and that I get the same results as above.

set.seed(16)
twogroup_fun()
# 
# Call:
# lm(formula = growth ~ group)
# 
# Coefficients:
# (Intercept)  groupgroup2  
#       5.268       -1.555

If I want to change some element of the simulation, I can. Here’s a simulation from the same model but with a smaller standard deviation.

twogroup_fun(sigma = 1)
# 
# Call:
# lm(formula = growth ~ group)
# 
# Coefficients:
# (Intercept)  groupgroup2  
#       5.313       -2.476

Repeat the simulation many times

Now that I have a working function to simulate data and fit the model, it’s time to do the simulation many times. The model from each individual simulation is saved to allow exploration of long run model performance.

This is a task I’ve commonly used replicate() for. The rerun() function from package purrr is equivalent to replicate() with simplify = FALSE, and I’ll use it here for convenience.

library(purrr)

I’ll run this simulation 1000 times, resulting in a list of fitted two-group linear models based on the simulation parameters I’ve set.

sims = rerun(1000, twogroup_fun() )

Extracting results from the linear model

There are many elements of our model that we might be interested in exploring, including estimated coefficients, estimated standard deviations/variances, and the statistical results (test statistics/p-values).

To get the coefficients and statistical tests of coefficients we can use tidy() from package broom.

library(broom)

This returns the information on coefficients and tests of those coefficients in a tidy format that is easy to work with.

tidy(growthfit)
#          term  estimate std.error statistic      p.value
# 1 (Intercept)  5.267633 0.6351230  8.293878 1.460563e-07
# 2 groupgroup2 -1.554922 0.8981996 -1.731154 1.005290e-01

I have often been interested in understanding how the variances/standard deviations behave over the long run, in particular in mixed models. For a linear model we can extract an estimate of the residual standard deviation from the summary() output. This can be squared to get the variance as needed.

summary(growthfit)$sigma
# [1] 2.008435

Simulation results

Now for the fun part! Given we know the truth, how do the parameters behave over many samples?

To extract any results of interest I loop through the list of models, which I’ve stored in sims, and pull out the element of interest. Functions from package purrr are useful here for looping through the list of models. I’ll use functions from dplyr for any data manipulation and plot distributions via ggplot2.

suppressMessages( library(dplyr) )
library(ggplot2)

Estimated differences in mean response

As this is a linear model about differences among groups, the estimate of \(\beta_1\) is one of the statistics of primary interest. What does the distribution of differences in mean growth between groups look like? Here’s a density plot.

sims %>%
     map_df(tidy) %>%
     filter(term == "groupgroup2") %>%
     ggplot( aes(estimate) ) +
          geom_density(fill = "blue", alpha = .5) +
          geom_vline( xintercept = -2)

It’s a simulation result like this one, from a scenario involving relatively few samples of a noisy measurement that I think can be so compelling. Sure, “on average” we get the correct result, as the peak is right around the true value of -2. However, there is quite a range in the estimated coefficient across simulations, with some samples leading to overestimation and some to underestimation of the parameter. Some models even get the sign of the coefficient wrong. See Gelman and Carlin’s 2014 paper, Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors if you are interested in further discussion.

Estimates of the standard deviation

I can do a similar plot exploration of the residual standard deviation, extracting sigma from the model object and plotting it as a density plot.

sims %>%
     map_dbl(~summary(.x)$sigma) %>%
     data.frame(sigma = .) %>%
     ggplot( aes(sigma) ) +
          geom_density(fill = "blue", alpha = .5) +
          geom_vline(xintercept = 2)

The estimated variation ranges between 1 to just over 3, and the distribution is roughly centered on the true value of 2. Like with the coefficient above, the model performs pretty well on average but any single model can have a biased estimate of the standard deviation.

The standard deviation is underestimated a bit more than 50% of the time.

sims %>%
     map_dbl(~summary(.x)$sigma) %>%
     {. < 2} %>%
     mean()
# [1] 0.539

Statistical results

If the goal of a simulation is to get an idea of the statistical power of a test we could look at the proportion of times the null hypothesis was rejected given a fixed alpha level (often 0.05, but of course it can be something else).

Here the proportion of models that correctly rejected the null hypothesis, given that we know it’s not true, is just over 56%. That’s an estimate of statistical power.

sims %>%
     map_df(tidy) %>%
     filter(term == "groupgroup2") %>%
     pull(p.value) %>%
     {. <  0.05} %>%
     mean()
# [1] 0.563

Those are the basics that I generally pull out of models, but any output from the model is fair game. For linear models you could pull out \(R^2\) or the overall F-test, etc.

Where to go from here?

I’ll do future posts about simulating from more complicated linear models, likely starting with linear mixed models. In particular I will explore interesting issues that crop up when estimating variances.