Why use simulations for model checking?

One of the difficult things about working with generalized linear models (GLM) and generalized linear mixed models (GLMM) is figuring out how to interpret residual plots. We don’t really expect residual plots from a GLMM to look like one from a linear model, sure, but how do we tell when something looks “bad”?

This is the situation I was in several years ago, working on an analysis involving counts from a fairly complicated study design. I was using a negative binomial generalized linear mixed models, and the residual vs fitted values plot looked, well, “funny”. But was something wrong or was this just how the residuals from a complicated model like this look sometimes?

Here is an example of a plot I wasn’t feeling too good about but also wasn’t certain if what I was seeing indicated a lack of fit.

To try to figure out if what I was seeing was a problem, I fit models to response data simulated from my model. The beauty of such simulations is that I know that the model definitely does fit the simulated response data; the model is what created the data! I compared residuals plots from simulated data models to my real plot to help decide if what I was seeing was unusual. I looked at a fair number of simulated residual plots and decided that, yes, something was wrong with my model. I ended up moving on to a different model that worked better.

Here is an example of one of the simulated residual plots. There was definitely variation in plots from models fit to the simulated data, but this is a good example of what they generally looked like.

The DHARMa package

I found my “brute force” simulation approach useful, but I spent a lot of time visually comparing the simulated plots to my real plot. I didn’t have a metric to help me decide if my actual residual plot seemed unusual compared to residual plots from my “true” models. This left me unable to recommend this as a general approach to folks I consult with.

Since then, the author of the DHARMa package has come up with a clever way to use a simulation-based approach for residuals checks of GLMM’s. If you are interested in trying the package out, it has a very nice vignette to get you started.

These days I’m been happily recommending the DHARMa packages to students I work with for residual checks of GLMM’s. However, students aren’t always working with models that DHARMa currently supports. Luckily, DHARMa can simulate residuals for any model as long as the user can provide simulated values to the createDHARMa function. Below I show how to do this for a couple of different situations.

How to use createDHARMa()

In order to use createDHARMa, the user needs to provide three pieces of information.

  • Simulated response vectors
  • Observed data
  • The predicted response from the model

I thinking making the simulated response vectors is the biggest bottleneck for folks I’ve worked with, and that is what I’m focusing on today.

Simulations for models that have a simulate() function

When a function has a simulate function, getting the simulations needed to use createDHARMa can be comparatively straightforward.

Example using glmmTMB()

The glmmTMB function from package glmmTMB is one of those models that DHARMa doesn’t currently support. (2018-04-05 update: the development version of DHARMA now supports glmmTMB objects for glmmTMB 0.2.1. I believe the example below is still useful for showing how to work with DHARMa-unsupported model types that have a simulate() function.)

The glmmTMB function does have a simulate function.

library(DHARMa) # version 0.1.5
library(glmmTMB) # version 0.1.1

I’m going to use one of the models from the glmmTMB documentation to demonstrate how to make the simulations and then use them in createDHARMa.

Below I fit zero-inflated negative binomial model with glmmTMB.

fit_nbz = glmmTMB(count ~ spp + mined + (1|site), 
                  zi = ~spp + mined, 
                  family = nbinom2, data = Salamanders)

If I try to calculate the scaled residuals via DHARMa functions, I get a warning and then an error. DHARMa attempts to make predictions from the model to simulate with, but it fails.

simulateResiduals(fittedModel = fit_nbz, n = 250)
# Object of Class DHARMa with simulated residuals based on 250 simulations with refit = FALSE . See ?DHARMa::simulateResiduals for help. 
#  
# Scaled residual values: 0.12 0.492 0.424 0.672 0.644 0.568 0.456 0.672 0.836 0.556 0.392 0.004 0.452 0.116 0.084 0.296 0.184 0.54 0.576 0.56 ...

This is an indication I’ll need to use createDHARMa to make the residuals instead.

I can simulate from my model via the simulate function (see the documentation for simulate.glmmTMB for details). I usually do at least 1000 simulations if it isn’t too time-consuming. I’ll do only 10 here as an example.

sim_nbz = simulate(fit_nbz, nsim = 10)

The result is a list of simulated response vectors.

str(sim_nbz)
# 'data.frame': 644 obs. of  10 variables:
#  $ sim_1 : num  0 0 0 0 6 2 0 1 1 0 ...
#  $ sim_2 : num  0 1 0 0 0 0 0 0 0 2 ...
#  $ sim_3 : num  0 0 0 1 4 0 3 0 0 1 ...
#  $ sim_4 : num  0 0 0 1 0 7 0 0 0 4 ...
#  $ sim_5 : num  0 0 0 1 0 0 0 1 1 4 ...
#  $ sim_6 : num  0 0 0 0 0 1 1 0 10 0 ...
#  $ sim_7 : num  0 0 1 2 4 1 7 1 2 0 ...
#  $ sim_8 : num  0 0 0 0 10 13 1 2 0 9 ...
#  $ sim_9 : num  0 0 0 0 1 6 0 4 3 0 ...
#  $ sim_10: num  2 0 0 0 5 0 0 0 3 1 ...

I need these in a matrix, not a list, where each column contains a simulated response vector and each row is an observation. I’ll collapse the list into a matrix by using cbind in do.call.

sim_nbz = do.call(cbind, sim_nbz)
head(sim_nbz)
#      sim_1 sim_2 sim_3 sim_4 sim_5 sim_6 sim_7 sim_8 sim_9 sim_10
# [1,]     0     0     0     0     0     0     0     0     0      2
# [2,]     0     1     0     0     0     0     0     0     0      0
# [3,]     0     0     0     0     0     0     1     0     0      0
# [4,]     0     0     1     1     1     0     2     0     0      0
# [5,]     6     0     4     0     0     0     4    10     1      5
# [6,]     2     0     0     7     0     1     1    13     6      0

Now I can pass these to createDHARMa along with observed values and model predictions. I set integerResponse to TRUE, as well, as I’m working with counts.

sim_res_nbz = createDHARMa(simulatedResponse = sim_nbz, 
                           observedResponse = Salamanders$count,
                           fittedPredictedResponse = predict(fit_nbz),
                           integerResponse = TRUE)

This function creates the scaled residuals that can be used to make residual plots via DHARMa’s plotSimulatedResiduals.

Remember I’ve only done 10 simulations here, so this particular set of plots don’t look very nice.

plotSimulatedResiduals(sim_res_nbz)
# plotSimulatedResiduals is deprecated, switch your code to using the plot function

Simulations for models without a simulate() function

The simulate function did most of the heavy lifting for me in the glmmTMB example. Not all models have a simulate function, though. This doesn’t mean I can’t use DHARMa, but it does mean I have to put more effort in up front.

I will do the following simulations“by hand” in R, more-or-less following the method shown in this answer on CrossValidated.

Example using zeroinfl()

The zeroinfl function from package pscl is an example of a model that doesn’t have a simulate function and is unsupported by DHARMa. I will use this in my next example.

I will also load package VGAM, which has a function for making random draws from a zero-inflated negative binomial distribution.

library(pscl) # version 1.5.2
library(VGAM) # version 1.0-4

I will use a zeroinfl documentation example in this section. The zero-inflated negative binomial model is below.

fit_zinb = zeroinfl(art ~ . | 1, 
                    data = bioChemists, 
                    dist = "negbin")

In order to make my own simulations, I’ll need both the model-predicted count and the model-predicted probability of a 0 for each observation in the dataset. I’ll also need an estimate of the negative binomial dispersion parameter, \(\theta\).

The predict function for zeroinfl models lets the user define the kind of prediction desired. I use predict twice, once to extra the predicted counts and once to extract the predicted probability of 0.

# Predicted probabilities
p = predict(fit_zinb, type = "zero")
# Predicted counts
mus = predict(fit_zinb, type = "count")

I can pull \(\theta\) directly out of the model output.

fit_zinb$theta
# [1] 2.264391

Now that I have these, I can make random draws from a zero-inflated negative distribution using rzinegbin from package VGAM.

It took me awhile to figure out which arguments I needed to use in rzinegbin. I need to provide the predicted counts to the "munb" argument and the predicted probabilities to the "pstr0" argument. The "size" argument is the estimate of the dispersion parameter. And the "n" arguments indicates the number of simulated values needed, in this case the same number as the rows in the original dataset.

sim1 = rzinegbin(n = nrow(bioChemists),
                 size = fit_zinb$theta,
                 pstr0 = p,
                 munb = mus)

I use replicate to draw more than one simulated vector. The first argument of replicate, "n", indicates the number of times to evaluate the given expression. Here I make 10 simulated response vectors.

sim_zinb = replicate(10, rzinegbin(n = nrow(bioChemists),
                                   size = fit_zinb$theta,
                                   pstr0 = p,
                                   munb = mus) )

The output of replicate in this case is a matrix, with one simulated response vector in every column and an observation in every row. This is ready for use in createDHARMa.

head(sim_zinb)
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,]    2    0    0    0    3    1    3    2    0     1
# [2,]    1    2    1    0    1    0    1    0    4     0
# [3,]    0    1    2    0    3    2    2    1    0     0
# [4,]    0    0    2    0    1    1    0    1    1     2
# [5,]    4    3    1    3    1    0    3    5    3     0
# [6,]    0    0    0    0    0    0    1    3    3     1

Making the simulations was the hard part. Now that I have them, createDHARMa works exactly the same way as in the glmmTMB example.

sim_res_zinb = createDHARMa(simulatedResponse = sim_zinb,
                            observedResponse = bioChemists$art,
                            fittedPredictedResponse = predict(fit_zinb, type = "response"),
                            integerResponse = TRUE)
plotSimulatedResiduals(sim_res_zinb)
# plotSimulatedResiduals is deprecated, switch your code to using the plot function