<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Very statisticious on Very statisticious</title>
    <link>https://aosmith.rbind.io/</link>
    <description>Recent content in Very statisticious on Very statisticious</description>
    <generator>Hugo -- gohugo.io</generator>
    <lastBuildDate>Mon, 31 Aug 2020 00:00:00 +0000</lastBuildDate>
    <atom:link href="/" rel="self" type="application/rss+xml" />
    
    <item>
      <title>Handling errors using purrr&#39;s possibly() and safely()</title>
      <link>https://aosmith.rbind.io/2020/08/31/handling-errors/</link>
      <pubDate>Mon, 31 Aug 2020 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2020/08/31/handling-errors/</guid>
      <description>


&lt;p&gt;One topic I haven’t discussed in my previous posts about automating tasks with loops or doing simulations is how to deal with errors. If we have unanticipated errors a &lt;code&gt;map()&lt;/code&gt; or &lt;code&gt;lapply()&lt;/code&gt; loop will come to a screeching halt with no output to show for the time spent. When your task is time-consuming, this can feel pretty frustrating, since the whole process has to be restarted.&lt;/p&gt;
&lt;p&gt;How to deal with errors? Using functions &lt;code&gt;try()&lt;/code&gt; or &lt;code&gt;tryCatch()&lt;/code&gt; when building a function is the traditional way to catch and address potential errors. In the past I’ve struggled to remember how to use these, though, and functions &lt;code&gt;possibly()&lt;/code&gt; and &lt;code&gt;safely()&lt;/code&gt; from package &lt;strong&gt;purrr&lt;/strong&gt; are convenient alternatives that I find a little easier to use.&lt;/p&gt;
&lt;p&gt;In this post I’ll show examples on how to use these two functions for handling errors. I’ll also demonstrate the use of the related function &lt;code&gt;quietly()&lt;/code&gt; to capture other types of output, such as warnings and messages.&lt;/p&gt;
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#r-packages&#34;&gt;R packages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-possibly-to-return-values-instead-of-errors&#34;&gt;Using possibly() to return values instead of errors&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#wrapping-a-function-with-possibly&#34;&gt;Wrapping a function with possibly()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#finding-the-groups-with-errors&#34;&gt;Finding the groups with errors&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-compact-to-remove-empty-elements&#34;&gt;Using compact() to remove empty elements&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-safely-to-capture-results-and-errors&#34;&gt;Using safely() to capture results and errors&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#exploring-the-errors&#34;&gt;Exploring the errors&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#extracting-results&#34;&gt;Extracting results&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-quietly-to-capture-messages&#34;&gt;Using quietly() to capture messages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#just-the-code-please&#34;&gt;Just the code, please&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;r-packages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;R packages&lt;/h1&gt;
&lt;p&gt;The functions I’m highlighting today are from package &lt;strong&gt;purrr&lt;/strong&gt;. I’ll also use &lt;strong&gt;lme4&lt;/strong&gt; for fitting simulated data to linear mixed models.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(purrr) # v. 0.3.4
library(lme4) # v. 1.1-23&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;using-possibly-to-return-values-instead-of-errors&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using possibly() to return values instead of errors&lt;/h1&gt;
&lt;p&gt;When doing a repetitive task like fitting many models with a &lt;code&gt;map()&lt;/code&gt; loop, an error in one of the models will shut down the whole process. We can anticipate this issue and bypass it by defining a value to return if a model errors out via &lt;code&gt;possibly()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I created the very small dataset below to demonstrate the issue. The goal is to fit a linear model of &lt;code&gt;y&lt;/code&gt; vs &lt;code&gt;x&lt;/code&gt; for each &lt;code&gt;group&lt;/code&gt;. I made exactly two groups here, &lt;em&gt;a&lt;/em&gt; and &lt;em&gt;b&lt;/em&gt;, to make it easy to see what goes wrong and why. Usually we have many more groups and potential problems can be harder to spot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat = structure(list(group = c(&amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;b&amp;quot;), 
                     x = c(&amp;quot;A&amp;quot;, &amp;quot;A&amp;quot;, &amp;quot;A&amp;quot;, &amp;quot;B&amp;quot;, &amp;quot;B&amp;quot;, &amp;quot;B&amp;quot;, &amp;quot;A&amp;quot;, &amp;quot;A&amp;quot;, &amp;quot;A&amp;quot;), 
                     y = c(10.9, 11.1, 10.5, 9.7, 10.5, 10.9, 13, 9.9, 10.3)), 
                class = &amp;quot;data.frame&amp;quot;, 
                row.names = c(NA, -9L))
dat&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#   group x    y
# 1     a A 10.9
# 2     a A 11.1
# 3     a A 10.5
# 4     a B  9.7
# 5     a B 10.5
# 6     a B 10.9
# 7     b A 13.0
# 8     b A  9.9
# 9     b A 10.3&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I’ll first split the dataset by &lt;code&gt;group&lt;/code&gt; to get a list of data.frames to loop through.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat_split = split(dat, dat$group)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I’ll loop through each dataset in the list with &lt;code&gt;map()&lt;/code&gt; and fit a linear model with &lt;code&gt;lm()&lt;/code&gt;. Instead of getting output, though, I get an error.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;map(dat_split, ~lm(y ~ x, data = .x) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Error in `contrasts&amp;lt;-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts
can be applied only to factors with 2 or more levels&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;What’s going on? If you look at the dataset again, you’ll see that &lt;code&gt;x&lt;/code&gt; from group &lt;em&gt;b&lt;/em&gt; contains only a single value. Once you know that you can see the error actually is telling us what the problem is: we can’t use a factor with only one level.&lt;/p&gt;
&lt;p&gt;Model &lt;em&gt;a&lt;/em&gt; fits fine, since &lt;code&gt;x&lt;/code&gt; has two values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lm(y ~ x, data = dat, subset = group == &amp;quot;a&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# 
# Call:
# lm(formula = y ~ x, data = dat, subset = group == &amp;quot;a&amp;quot;)
# 
# Coefficients:
# (Intercept)           xB  
#     10.8333      -0.4667&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It is the &lt;em&gt;b&lt;/em&gt; model that fails.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lm(y ~ x, data = dat, subset = group == &amp;quot;b&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Error in `contrasts&amp;lt;-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts
can be applied only to factors with 2 or more levels&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can imagine that the problem of having only a single value for the factor in some groups could be easy to miss if you working with a large number groups. This is where &lt;code&gt;possibly()&lt;/code&gt; can help, allowing us to keep going through all groups regardless of errors. We can then find and explore problem groups.&lt;/p&gt;
&lt;div id=&#34;wrapping-a-function-with-possibly&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Wrapping a function with possibly()&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;possibly()&lt;/code&gt; function is a &lt;em&gt;wrapper&lt;/em&gt; function. It wraps around an existing function. Other than defining the function to wrap, the main argument of interest is &lt;code&gt;otherwise&lt;/code&gt;. In &lt;code&gt;otherwise&lt;/code&gt; we define what value to return if we get an error from the function we are wrapping.&lt;/p&gt;
&lt;p&gt;I make a new wrapped function called &lt;code&gt;posslm1()&lt;/code&gt;, which wraps &lt;code&gt;lm()&lt;/code&gt; and returns “Error” if an error occurs when fitting the model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;posslm1 = possibly(.f = lm, otherwise = &amp;quot;Error&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When I use &lt;code&gt;posslm1()&lt;/code&gt; in my model fitting loop, you can see that loop now finishes. Model &lt;em&gt;b&lt;/em&gt; contains the string “Error” instead of a model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;map(dat_split, ~posslm1(y ~ x, data = .x) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $a
# 
# Call:
# .f(formula = ..1, data = ..2)
# 
# Coefficients:
# (Intercept)           xB  
#     10.8333      -0.4667  
# 
# 
# $b
# [1] &amp;quot;Error&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here’s another example of &lt;code&gt;possibly()&lt;/code&gt; wrapped around &lt;code&gt;lm()&lt;/code&gt;, this time using &lt;code&gt;otherwise = NULL&lt;/code&gt;. Depending on what we plan to do with the output, using &lt;code&gt;NULL&lt;/code&gt; or &lt;code&gt;NA&lt;/code&gt; as the return value can be useful when using &lt;code&gt;possibly()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Now group &lt;em&gt;b&lt;/em&gt; is &lt;code&gt;NULL&lt;/code&gt; in the output.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;posslm2 = possibly(.f = lm, otherwise = NULL)
( mods = map(dat_split, ~posslm2(y ~ x, data = .x) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $a
# 
# Call:
# .f(formula = ..1, data = ..2)
# 
# Coefficients:
# (Intercept)           xB  
#     10.8333      -0.4667  
# 
# 
# $b
# NULL&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;finding-the-groups-with-errors&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Finding the groups with errors&lt;/h2&gt;
&lt;p&gt;Once the loop is done, we can examine the groups that had errors when fitting models. For example, I can use &lt;code&gt;purrr::keep()&lt;/code&gt; to keep only the results that are &lt;code&gt;NULL&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mods %&amp;gt;%
     keep(~is.null(.x) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $b
# NULL&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This allows me to pull out the names for the groups that had errors. Getting the names in this way is one reason I like that &lt;code&gt;split()&lt;/code&gt; returns named lists.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;group_errs = mods %&amp;gt;%
     keep(~is.null(.x) ) %&amp;gt;%
     names()
group_errs&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] &amp;quot;b&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once I have the names of the groups with errors, I can pull any problematic groups out of the original dataset or the split list to examine them more closely. (I use &lt;code&gt;%in%&lt;/code&gt; here in case &lt;code&gt;group_errs&lt;/code&gt; is a vector.)&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat[dat$group %in% group_errs, ]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#   group x    y
# 7     b A 13.0
# 8     b A  9.9
# 9     b A 10.3&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat_split[group_errs]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $b
#   group x    y
# 7     b A 13.0
# 8     b A  9.9
# 9     b A 10.3&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;using-compact-to-remove-empty-elements&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Using compact() to remove empty elements&lt;/h2&gt;
&lt;p&gt;You may come to a point where you’ve looked at the problem groups and decide that the models with errors shouldn’t be used in further analysis. In that case, if all the groups with errors are &lt;code&gt;NULL&lt;/code&gt;, you can use &lt;code&gt;purrr::compact()&lt;/code&gt; to remove the empty elements from the list. This can make subsequent loops to get output more straightforward in some cases.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;compact(mods)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $a
# 
# Call:
# .f(formula = ..1, data = ..2)
# 
# Coefficients:
# (Intercept)           xB  
#     10.8333      -0.4667&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;using-safely-to-capture-results-and-errors&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using safely() to capture results and errors&lt;/h1&gt;
&lt;p&gt;Rather than replacing the errors with values, &lt;code&gt;safely()&lt;/code&gt; returns both the results and the errors in a list. This function is also a wrapper function. It defaults to using &lt;code&gt;otherwise = NULL&lt;/code&gt;, and I generally haven’t had reason to change away from that default.&lt;/p&gt;
&lt;p&gt;Here’s an example, wrapping &lt;code&gt;lm()&lt;/code&gt; in &lt;code&gt;safely()&lt;/code&gt; and then using the wrapped function &lt;code&gt;safelm()&lt;/code&gt; to fit the models.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;safelm = safely(.f = lm)
mods2 = map(dat_split, ~safelm(y ~ x, data = .x) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The output for each group is now a list with two elements, one for results (if there was no error) and the other for the error (if there was an error).&lt;/p&gt;
&lt;p&gt;Here’s what this looks like for model &lt;em&gt;a&lt;/em&gt;, which doesn’t have an error. The output contains a result but no error.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mods2[[1]]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $result
# 
# Call:
# .f(formula = ..1, data = ..2)
# 
# Coefficients:
# (Intercept)           xB  
#     10.8333      -0.4667  
# 
# 
# $error
# NULL&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Model &lt;em&gt;b&lt;/em&gt; didn’t work, of course, so the results are &lt;code&gt;NULL&lt;/code&gt; but the error was captured in &lt;code&gt;error&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mods2[[2]]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $result
# NULL
# 
# $error
# &amp;lt;simpleError in `contrasts&amp;lt;-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;exploring-the-errors&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Exploring the errors&lt;/h2&gt;
&lt;p&gt;One reason to save the errors using &lt;code&gt;safely()&lt;/code&gt; is so we can take a look at what the errors were for each group. This is most useful with informative errors like the one in my example.&lt;/p&gt;
&lt;p&gt;Errors can be extracted with a &lt;code&gt;map()&lt;/code&gt; loop, pulling out the “error” element from each group.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;map(mods2, &amp;quot;error&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $a
# NULL
# 
# $b
# &amp;lt;simpleError in `contrasts&amp;lt;-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;extracting-results&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Extracting results&lt;/h2&gt;
&lt;p&gt;Results can be extracted similarly, and, if relevant, &lt;code&gt;NULL&lt;/code&gt; results can be removed via &lt;code&gt;compact()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mods2 %&amp;gt;%
     map(&amp;quot;result&amp;quot;) %&amp;gt;%
     compact()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $a
# 
# Call:
# .f(formula = ..1, data = ..2)
# 
# Coefficients:
# (Intercept)           xB  
#     10.8333      -0.4667&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;using-quietly-to-capture-messages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using quietly() to capture messages&lt;/h1&gt;
&lt;p&gt;The &lt;code&gt;quietly()&lt;/code&gt; function doesn’t handle errors, but instead captures other types of output such as warnings and messages along with any results. This is useful for exploring what kinds of warnings come up when doing simulations, for example.&lt;/p&gt;
&lt;p&gt;A few years ago I wrote a post showing a simulation for a linear mixed model. I use the following function, pulled from &lt;a href=&#34;https://aosmith.rbind.io/2018/04/23/simulate-simulate-part-2/&#34;&gt;that earlier post&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;twolevel_fun = function(nstand = 5, nplot = 4, mu = 10, sigma_s = 1, sigma = 1) {
     standeff = rep( rnorm(nstand, 0, sigma_s), each = nplot)
     stand = rep(LETTERS[1:nstand], each = nplot)
     ploteff = rnorm(nstand*nplot, 0, sigma)
     resp = mu + standeff + ploteff
     dat = data.frame(stand, resp)
     lmer(resp ~ 1 + (1|stand), data = dat)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One thing I skipped discussing in that post were the messages returned for some simulations. However, I can certainly picture scenarios where it would be interesting and important to capture warnings and messages to see, e.g., how often they occur even when we know the data comes from the model.&lt;/p&gt;
&lt;p&gt;Here I’ll set the seed so the results are reproducible and then run the function 10 times. You see I get two messages, indicating that two of the ten models returned a message. In this case, the message indicates that the random effect variance is estimated to be exactly 0 in the model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(16)
sims = replicate(10, twolevel_fun(), simplify = FALSE )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# boundary (singular) fit: see ?isSingular
# boundary (singular) fit: see ?isSingular&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It turns out that the second model in the output list is one with a message. You can see at the bottom of the model output below that there is 1 lme4 warning.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sims[[2]]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Linear mixed model fit by REML [&amp;#39;lmerMod&amp;#39;]
# Formula: resp ~ 1 + (1 | stand)
#    Data: dat
# REML criterion at convergence: 45.8277
# Random effects:
#  Groups   Name        Std.Dev.
#  stand    (Intercept) 0.0000  
#  Residual             0.7469  
# Number of obs: 20, groups:  stand, 5
# Fixed Effects:
# (Intercept)  
#       10.92  
# convergence code 0; 0 optimizer warnings; 1 lme4 warnings&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;strong&gt;lme4&lt;/strong&gt; package stores warnings and messages in the model object, so I can pull the message out of the model object.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sims[[2]]@optinfo$conv$lme4&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $messages
# [1] &amp;quot;boundary (singular) fit: see ?isSingular&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But I think &lt;code&gt;quietly()&lt;/code&gt; is more convenient for this task. This is another wrapper function, and I’m going to wrap it around &lt;code&gt;lmer()&lt;/code&gt;. I do this because I’m focusing specifically on messages that happen when I fit the model. However, I could have wrapped &lt;code&gt;twolevel_fun()&lt;/code&gt; and captured any messages from the entire simulation process.&lt;/p&gt;
&lt;p&gt;I use my new function &lt;code&gt;qlmer()&lt;/code&gt; inside my simulation function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;qlmer = quietly(.f = lmer)
qtwolevel_fun = function(nstand = 5, nplot = 4, mu = 10, sigma_s = 1, sigma = 1) {
     standeff = rep( rnorm(nstand, 0, sigma_s), each = nplot)
     stand = rep(LETTERS[1:nstand], each = nplot)
     ploteff = rnorm(nstand*nplot, 0, sigma)
     resp = mu + standeff + ploteff
     dat = data.frame(stand, resp)
     qlmer(resp ~ 1 + (1|stand), data = dat)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I set the seed back to 16 so I get the same models and then run the function using &lt;code&gt;qlmer()&lt;/code&gt; 10 times. Note this is considered &lt;em&gt;quiet&lt;/em&gt; because the messages are now captured in the output by &lt;code&gt;quietly()&lt;/code&gt; instead of printed.&lt;/p&gt;
&lt;p&gt;The wrapped function returns a list with 4 elements, including the results, any printed output, warnings, and messages. You can see this for the second model here.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(16)
sims2 = replicate(10, qtwolevel_fun(), simplify = FALSE)
sims2[[2]]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $result
# Linear mixed model fit by REML [&amp;#39;lmerMod&amp;#39;]
# Formula: resp ~ 1 + (1 | stand)
#    Data: ..2
# REML criterion at convergence: 45.8277
# Random effects:
#  Groups   Name        Std.Dev.
#  stand    (Intercept) 0.0000  
#  Residual             0.7469  
# Number of obs: 20, groups:  stand, 5
# Fixed Effects:
# (Intercept)  
#       10.92  
# convergence code 0; 0 optimizer warnings; 1 lme4 warnings 
# 
# $output
# [1] &amp;quot;&amp;quot;
# 
# $warnings
# character(0)
# 
# $messages
# [1] &amp;quot;boundary (singular) fit: see ?isSingular\n&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In a simulation setting, I think seeing how many times different messages and warnings come up could be pretty interesting. It might inform how problematic a message is. If a message is common in simulation we may feel more confident that such a message from a model fit to our real data is not a big issue.&lt;/p&gt;
&lt;p&gt;For example, I could pull out all the &lt;code&gt;messages&lt;/code&gt; and then put the results into a vector with &lt;code&gt;unlist()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sims2 %&amp;gt;%
     map(&amp;quot;messages&amp;quot;) %&amp;gt;% 
     unlist()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] &amp;quot;boundary (singular) fit: see ?isSingular\n&amp;quot;
# [2] &amp;quot;boundary (singular) fit: see ?isSingular\n&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If I wanted to extract multiple parts of the output, such as keeping both messages and warnings, I can use the extract brackets in &lt;code&gt;map()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;These results don’t look much different compared to the output above since there are no warnings in my example. However, note the result is now in a named vector so I could potentially keep track of which are &lt;code&gt;messages&lt;/code&gt; and which are &lt;code&gt;warnings&lt;/code&gt; if I needed to.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sims2 %&amp;gt;%
     map(`[`, c(&amp;quot;messages&amp;quot;, &amp;quot;warnings&amp;quot;) ) %&amp;gt;%
     unlist()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#                                     messages 
# &amp;quot;boundary (singular) fit: see ?isSingular\n&amp;quot; 
#                                     messages 
# &amp;quot;boundary (singular) fit: see ?isSingular\n&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I showed only fairly simple way to use these three functions. However, you certainly may find yourself using them for more complex tasks. For example, I’ve been in situations in the past where I wanted to keep only models that didn’t have errors when building parametric bootstrap confidence intervals. If they had existed at the time, I could have used &lt;code&gt;possibly()&lt;/code&gt; or &lt;code&gt;safely()&lt;/code&gt; in a &lt;code&gt;while()&lt;/code&gt; loop, where the bootstrap data would be redrawn until a model fit without error. Very useful! 😉&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;just-the-code-please&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Just the code, please&lt;/h1&gt;
&lt;p&gt;Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code &lt;a href=&#34;https://aosmith.rbind.io/script/2020-08-31-handling-errors.R&#34;&gt;from here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(purrr) # v. 0.3.4
library(lme4) # v. 1.1-23

dat = structure(list(group = c(&amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;b&amp;quot;), 
                     x = c(&amp;quot;A&amp;quot;, &amp;quot;A&amp;quot;, &amp;quot;A&amp;quot;, &amp;quot;B&amp;quot;, &amp;quot;B&amp;quot;, &amp;quot;B&amp;quot;, &amp;quot;A&amp;quot;, &amp;quot;A&amp;quot;, &amp;quot;A&amp;quot;), 
                     y = c(10.9, 11.1, 10.5, 9.7, 10.5, 10.9, 13, 9.9, 10.3)), 
                class = &amp;quot;data.frame&amp;quot;, 
                row.names = c(NA, -9L))
dat

dat_split = split(dat, dat$group)
map(dat_split, ~lm(y ~ x, data = .x) )

lm(y ~ x, data = dat, subset = group == &amp;quot;a&amp;quot;)
lm(y ~ x, data = dat, subset = group == &amp;quot;b&amp;quot;)

posslm1 = possibly(.f = lm, otherwise = &amp;quot;Error&amp;quot;)
map(dat_split, ~posslm1(y ~ x, data = .x) )

posslm2 = possibly(.f = lm, otherwise = NULL)
( mods = map(dat_split, ~posslm2(y ~ x, data = .x) ) )

mods %&amp;gt;%
     keep(~is.null(.x) )

group_errs = mods %&amp;gt;%
     keep(~is.null(.x) ) %&amp;gt;%
     names()
group_errs

dat[dat$group %in% group_errs, ]
dat_split[group_errs]

compact(mods)

safelm = safely(.f = lm)
mods2 = map(dat_split, ~safelm(y ~ x, data = .x) )
mods2[[1]]
mods2[[2]]

map(mods2, &amp;quot;error&amp;quot;)

mods2 %&amp;gt;%
     map(&amp;quot;result&amp;quot;) %&amp;gt;%
     compact()

twolevel_fun = function(nstand = 5, nplot = 4, mu = 10, sigma_s = 1, sigma = 1) {
     standeff = rep( rnorm(nstand, 0, sigma_s), each = nplot)
     stand = rep(LETTERS[1:nstand], each = nplot)
     ploteff = rnorm(nstand*nplot, 0, sigma)
     resp = mu + standeff + ploteff
     dat = data.frame(stand, resp)
     lmer(resp ~ 1 + (1|stand), data = dat)
}

set.seed(16)
sims = replicate(10, twolevel_fun(), simplify = FALSE )
sims[[2]]
sims[[2]]@optinfo$conv$lme4

qlmer = quietly(.f = lmer)
qtwolevel_fun = function(nstand = 5, nplot = 4, mu = 10, sigma_s = 1, sigma = 1) {
     standeff = rep( rnorm(nstand, 0, sigma_s), each = nplot)
     stand = rep(LETTERS[1:nstand], each = nplot)
     ploteff = rnorm(nstand*nplot, 0, sigma)
     resp = mu + standeff + ploteff
     dat = data.frame(stand, resp)
     qlmer(resp ~ 1 + (1|stand), data = dat)
}

set.seed(16)
sims2 = replicate(10, qtwolevel_fun(), simplify = FALSE)
sims2[[2]]

sims2 %&amp;gt;%
     map(&amp;quot;messages&amp;quot;) %&amp;gt;% 
     unlist()

sims2 %&amp;gt;%
     map(`[`, c(&amp;quot;messages&amp;quot;, &amp;quot;warnings&amp;quot;) ) %&amp;gt;%
     unlist()&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Simulate! Simulate! - Part 4: A binomial generalized linear mixed model</title>
      <link>https://aosmith.rbind.io/2020/08/20/simulate-binomial-glmm/</link>
      <pubDate>Thu, 20 Aug 2020 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2020/08/20/simulate-binomial-glmm/</guid>
      <description>


&lt;p&gt;A post about simulating data from a generalized linear &lt;em&gt;mixed&lt;/em&gt; model (GLMM), the fourth post in my simulations series involving linear models, is long overdue. I settled on a binomial example based on a binomial GLMM with a logit link.&lt;/p&gt;
&lt;p&gt;I find binomial models the most difficult to grok, primarily because the model is on the scale of log odds, inference is based on odds, but the response variable is a &lt;em&gt;counted proportion&lt;/em&gt;. I use the term counted proportion to indicate that the proportions are based on discrete counts, the total number of “successes” divided by the total number of trials. A different distribution (possibly beta) would be needed for continuous proportions like, e.g., total leaf area with lesions.&lt;/p&gt;
&lt;p&gt;Models based on single parameter distributions like the binomial can be overdispersed or underdispersed, where the variance in the data is bigger or smaller, respectively, than the variance defined by the binomial distribution. Given this, I thought exploring estimates of dispersion based on simulated data that we know comes from a binomial distribution would be interesting.&lt;/p&gt;
&lt;p&gt;I will be simulating data “manually”. However, also see the &lt;a href=&#34;https://www.rdocumentation.org/packages/lme4/versions/1.1-23/topics/simulate.merMod&#34;&gt;&lt;code&gt;simulate()&lt;/code&gt;&lt;/a&gt; function from package &lt;strong&gt;lme4&lt;/strong&gt;. I find this function particularly useful if I want to simulate data based on a fitted model, but it can also be used in situations where you don’t already have a model.&lt;/p&gt;
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#r-packages&#34;&gt;R packages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-statistical-model&#34;&gt;The statistical model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#a-single-simulation-for-a-binomial-glmm&#34;&gt;A single simulation for a binomial GLMM&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#defining-the-difference-in-treatments&#34;&gt;Defining the difference in treatments&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#creating-the-study-design-variables&#34;&gt;Creating the study design variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#simulate-the-random-effect&#34;&gt;Simulate the random effect&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#calculate-log-odds&#34;&gt;Calculate log odds&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#convert-log-odds-to-proportions&#34;&gt;Convert log odds to proportions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#generate-the-response-variable&#34;&gt;Generate the response variable&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#fit-a-model&#34;&gt;Fit a model&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#make-a-function-for-the-simulation&#34;&gt;Make a function for the simulation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#repeat-the-simulation-many-times&#34;&gt;Repeat the simulation many times&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#extract-results-from-the-binomial-glmm&#34;&gt;Extract results from the binomial GLMM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#explore-estimated-dispersion&#34;&gt;Explore estimated dispersion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#just-the-code-please&#34;&gt;Just the code, please&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;r-packages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;R packages&lt;/h1&gt;
&lt;p&gt;I’ll be fitting binomial GLMM with &lt;strong&gt;lme4&lt;/strong&gt;. I use &lt;strong&gt;purrr&lt;/strong&gt; for looping and &lt;strong&gt;ggplot2&lt;/strong&gt; for plotting results.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(lme4) # v. 1.1-23
library(purrr) # v. 0.3.4
library(ggplot2) # v. 3.3.2&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;the-statistical-model&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The statistical model&lt;/h1&gt;
&lt;p&gt;As usual, I’ll start by writing out the statistical model using mathematical equations. If these aren’t helpful to you, &lt;a href=&#34;#a-single-simulation-for-a-binomial-glmm&#34;&gt;jump down to the code&lt;/a&gt;. You may find that writing the code first and coming back to look at the statistical model later is helpful.&lt;/p&gt;
&lt;p&gt;The imaginary study design that is the basis of my model has two different sizes of study units. This is a field experiment scenario, where multiple sites within a region are selected and then two plots within each site are randomly placed and a treatment assigned (“treatment” or “control”). You can think of “sites” as a blocking variable. The number of surviving plants from some known total number planted at an earlier time point is measured in each plot.&lt;/p&gt;
&lt;p&gt;I first define a response variable that comes from the binomial distribution. (If you haven’t seen this style of statistical model before, &lt;a href=&#34;https://aosmith.rbind.io/2018/07/18/simulate-poisson-edition/#the-statistical-model&#34;&gt;my Poisson GLM post&lt;/a&gt; goes into slightly more detail.)&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[y_t \thicksim Binomial(p_t, m_t)\]&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class=&#34;math inline&#34;&gt;\(y_t\)&lt;/span&gt; is the observed number of surviving plants from the total &lt;span class=&#34;math inline&#34;&gt;\(m_t\)&lt;/span&gt; planted for the &lt;span class=&#34;math inline&#34;&gt;\(t\)&lt;/span&gt;th plot.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;math inline&#34;&gt;\(p_t\)&lt;/span&gt; is the unobserved true mean (proportion) of the binomial distribution for the &lt;span class=&#34;math inline&#34;&gt;\(t\)&lt;/span&gt;th plot.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;math inline&#34;&gt;\(m_t\)&lt;/span&gt; is the total number of plants originally planted, also known as the total number of trials or the &lt;em&gt;binomial sample size&lt;/em&gt;. The binomial sample size can be the same for all plots (likely for experimental data) or vary among plots (more common for observational data).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We assume that the relationship between the &lt;em&gt;mean&lt;/em&gt; of the response and the explanatory variables is linear on the logit scale, so I use a logit link function when writing out the linear predictor. The logit is the same as the log odds; i.e., &lt;span class=&#34;math inline&#34;&gt;\(logit(p)\)&lt;/span&gt; is the same as &lt;span class=&#34;math inline&#34;&gt;\(log(\frac{p}{1-p})\)&lt;/span&gt;.&lt;/p&gt;
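&lt;p&gt;As a quick side check (not part of the simulation itself), R’s built-in &lt;code&gt;qlogis()&lt;/code&gt; computes the logit directly, so we can verify this equivalence numerically.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# The logit is the log odds: both lines return the same value
p = 0.85
log( p/(1 - p) ) # log odds by hand: 1.734601
qlogis(p)        # built-in logit function: 1.734601&lt;/code&gt;&lt;/pre&gt;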
&lt;p&gt;The model I define here has a categorical fixed effect with only two levels.&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[logit(p_t) = \beta_0 + \beta_1*I_{(treatment_t=\textit{treatment})} + (b_s)_t\]&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class=&#34;math inline&#34;&gt;\(\beta_0\)&lt;/span&gt; is the log odds of survival when the treatment is &lt;em&gt;control&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;math inline&#34;&gt;\(\beta_1\)&lt;/span&gt; is the difference in log odds of survival between the two treatments, &lt;em&gt;treatment&lt;/em&gt; minus &lt;em&gt;control&lt;/em&gt;.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;The indicator variable, &lt;span class=&#34;math inline&#34;&gt;\(I_{(treatment_t=\textit{treatment})}\)&lt;/span&gt;, is 1 when the treatment is &lt;em&gt;treatment&lt;/em&gt; and 0 otherwise.&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;math inline&#34;&gt;\(b_s\)&lt;/span&gt; is the (random) effect of the &lt;span class=&#34;math inline&#34;&gt;\(s\)&lt;/span&gt;th site on the log odds of survival. &lt;span class=&#34;math inline&#34;&gt;\(s\)&lt;/span&gt; goes from 1 to the total number of sites sampled. The site-level random effects are assumed to come from an iid normal distribution with a mean of 0 and some shared, site-level variance, &lt;span class=&#34;math inline&#34;&gt;\(\sigma^2_s\)&lt;/span&gt;: &lt;span class=&#34;math inline&#34;&gt;\(b_s \thicksim N(0, \sigma^2_s)\)&lt;/span&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you are newer to generalized linear mixed models you might want to take a moment to note the absence of epsilon in the linear predictor. Unlike a linear mixed model, there is no separate observation-level error term; the variation of the response around its mean comes from the binomial distribution itself.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;a-single-simulation-for-a-binomial-glmm&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;A single simulation for a binomial GLMM&lt;/h1&gt;
&lt;p&gt;Below is what the dataset I will create via simulation looks like. I have a variable to represent the sites (&lt;code&gt;site&lt;/code&gt;) and plots (&lt;code&gt;plot&lt;/code&gt;) as well as one for the treatments the plot was assigned to (&lt;code&gt;treatment&lt;/code&gt;). In addition, &lt;code&gt;y&lt;/code&gt; is the total number of surviving plants, and &lt;code&gt;num_samp&lt;/code&gt; is the total number originally planted (50 for all plots in this case).&lt;/p&gt;
&lt;p&gt;Note that &lt;code&gt;y/num_samp&lt;/code&gt; is the proportion of plants that survived, which is what we are interested in. In binomial models in R you often use the number of successes and the number of failures (total trials minus the number of successes) as the response variable instead of the actual observed proportion.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# # A tibble: 20 x 5
#    site  plot  treatment num_samp     y
#    &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt;        &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;
#  1 A     A.1   treatment       50    40
#  2 A     A.2   control         50    26
#  3 B     B.1   treatment       50    42
#  4 B     B.2   control         50    23
#  5 C     C.1   treatment       50    48
#  6 C     C.2   control         50    33
#  7 D     D.1   treatment       50    28
#  8 D     D.2   control         50    19
#  9 E     E.1   treatment       50    45
# 10 E     E.2   control         50    35
# 11 F     F.1   treatment       50    45
# 12 F     F.2   control         50    25
# 13 G     G.1   treatment       50    35
# 14 G     G.2   control         50    21
# 15 H     H.1   treatment       50    42
# 16 H     H.2   control         50    26
# 17 I     I.1   treatment       50    47
# 18 I     I.2   control         50    30
# 19 J     J.1   treatment       50    42
# 20 J     J.2   control         50    31&lt;/code&gt;&lt;/pre&gt;
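&lt;p&gt;As an aside, &lt;code&gt;glmer()&lt;/code&gt; (like &lt;code&gt;glm()&lt;/code&gt;) accepts a binomial response in two equivalent forms. Here is a sketch of both, assuming a dataset like the one above (&lt;em&gt;code not run&lt;/em&gt;); the first is what I use in this post, and the second passes the observed proportion with the binomial sample size given to &lt;code&gt;weights&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# 1. Successes and failures as a two-column response
glmer(cbind(y, num_samp - y) ~ treatment + (1|site), data = dat, family = binomial)
# 2. Observed proportion with binomial sample sizes as weights
glmer(y/num_samp ~ treatment + (1|site), data = dat, weights = num_samp, family = binomial)&lt;/code&gt;&lt;/pre&gt;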
&lt;p&gt;I’ll start the simulation by setting the seed so the results can be exactly reproduced. I always do this for testing my methodology prior to performing many simulations.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(16)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;defining-the-difference-in-treatments&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Defining the difference in treatments&lt;/h2&gt;
&lt;p&gt;I need to define the “truth” in the simulation by setting all the parameters in the statistical model to values of my choosing. I found it a little hard to figure out what the difference between treatments would be on the scale of the log odds, so I thought it worthwhile to discuss my process here.&lt;/p&gt;
&lt;p&gt;I realized it was easier for me to think about the results in this case in terms of proportions for each treatment and then use those to convert differences between treatments to log odds. I started out by thinking about what I would expect the surviving proportion of plants to be in the control group. I decided I’d expect half to survive, 0.5. The treatment, if effective, needs to improve survival substantially to be cost effective. I decided the treatment group should have at least 85% survival (0.85).&lt;/p&gt;
&lt;p&gt;The estimated difference from the model will be expressed as odds, so I calculated the odds for each group and then the difference in odds as an odds ratio based on my chosen proportions per group.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;codds = .5/(1 - .5)
todds = .85/(1 - .85)
todds/codds&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 5.666667&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since the model is linear on the scale of log odds I took the log of the odds ratio above to figure out the additive difference between treatments on the model scale.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;log(todds/codds)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 1.734601&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I also need the log odds for the control group, since that is what the intercept, &lt;span class=&#34;math inline&#34;&gt;\(\beta_0\)&lt;/span&gt;, represents in my statistical model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;log(codds)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here are the values I’ll use for the “truth” today:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The true log odds for the control group, &lt;span class=&#34;math inline&#34;&gt;\(\beta_0\)&lt;/span&gt;, will be 0.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;The difference in log odds of the treatment compared to the control, &lt;span class=&#34;math inline&#34;&gt;\(\beta_1\)&lt;/span&gt;, will be 1.735.&lt;/li&gt;
&lt;li&gt;The site-level variance (&lt;span class=&#34;math inline&#34;&gt;\(\sigma^2_s\)&lt;/span&gt;) will be set at 0.5.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I’ll set the number of sites to 10 while I’m at it. Since I’m working with only 2 treatments, there will be 2 plots per site. The total number of plots (and so observations) is the number of sites times the number of plots per site: &lt;code&gt;10*2 = 20&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;b0 = 0
b1 = 1.735
site_var = 0.5
n_sites = 10&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;creating-the-study-design-variables&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Creating the study design variables&lt;/h2&gt;
&lt;p&gt;Without discussing the code, since I’ve gone over code like this in detail in earlier posts, I will create variables based on the study design, &lt;code&gt;site&lt;/code&gt;, &lt;code&gt;plot&lt;/code&gt;, and &lt;code&gt;treatment&lt;/code&gt;, using &lt;code&gt;rep()&lt;/code&gt;. I’m careful to line things up so there are two unique plots in each site, one for each treatment. I don’t technically need the &lt;code&gt;plot&lt;/code&gt; variable for the analysis I’m going to do, but I create it to keep myself organized (and to mimic a real dataset 😁).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;site = rep(LETTERS[1:n_sites], each = 2)
plot = paste(site, rep(1:2, times = n_sites), sep = &amp;quot;.&amp;quot; )
treatment = rep( c(&amp;quot;treatment&amp;quot;, &amp;quot;control&amp;quot;), times = n_sites)
dat = data.frame(site, plot, treatment)
dat&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#    site plot treatment
# 1     A  A.1 treatment
# 2     A  A.2   control
# 3     B  B.1 treatment
# 4     B  B.2   control
# 5     C  C.1 treatment
# 6     C  C.2   control
# 7     D  D.1 treatment
# 8     D  D.2   control
# 9     E  E.1 treatment
# 10    E  E.2   control
# 11    F  F.1 treatment
# 12    F  F.2   control
# 13    G  G.1 treatment
# 14    G  G.2   control
# 15    H  H.1 treatment
# 16    H  H.2   control
# 17    I  I.1 treatment
# 18    I  I.2   control
# 19    J  J.1 treatment
# 20    J  J.2   control&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;simulate-the-random-effect&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Simulate the random effect&lt;/h2&gt;
&lt;p&gt;Next I will simulate the site-level random effects. I defined these as &lt;span class=&#34;math inline&#34;&gt;\(b_s \thicksim N(0, \sigma^2_s)\)&lt;/span&gt;, so will randomly draw from a normal distribution with a mean of 0 and a variance of 0.5. Remember that &lt;code&gt;rnorm()&lt;/code&gt; in R uses standard deviation, not variance, so I use the square root of &lt;code&gt;site_var&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Since I have 10 sites I draw 10 values, with each value repeated once for each plot within the site.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;( site_eff = rep( rnorm(n = n_sites, 
                        mean = 0, 
                        sd = sqrt(site_var) ), 
                  each = 2) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  [1]  0.33687514  0.33687514 -0.08865705 -0.08865705  0.77514191  0.77514191
#  [7] -1.02122414 -1.02122414  0.81163788  0.81163788 -0.33121733 -0.33121733
# [13] -0.71131449 -0.71131449  0.04494560  0.04494560  0.72476507  0.72476507
# [19]  0.40527261  0.40527261&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;calculate-log-odds&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Calculate log odds&lt;/h2&gt;
&lt;p&gt;I now have fixed values for all parameters, the variable &lt;code&gt;treatment&lt;/code&gt; to create the indicator variable, and the simulated effects of sites drawn from the defined distribution. That’s all the pieces I need to calculate the true log odds.&lt;/p&gt;
&lt;p&gt;The statistical model&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[logit(p_t) = \beta_0 + \beta_1*I_{(treatment_t=\textit{treatment})} + (b_s)_t\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;is my guide for how to combine these pieces to calculate the log odds, &lt;span class=&#34;math inline&#34;&gt;\(logit(p_t)\)&lt;/span&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;( log_odds = with(dat, b0 + b1*(treatment == &amp;quot;treatment&amp;quot;) + site_eff ) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  [1]  2.07187514  0.33687514  1.64634295 -0.08865705  2.51014191  0.77514191
#  [7]  0.71377586 -1.02122414  2.54663788  0.81163788  1.40378267 -0.33121733
# [13]  1.02368551 -0.71131449  1.77994560  0.04494560  2.45976507  0.72476507
# [19]  2.14027261  0.40527261&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;convert-log-odds-to-proportions&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Convert log odds to proportions&lt;/h2&gt;
&lt;p&gt;I’m getting close to pulling values from the binomial distribution to get my response variable. Remember that I defined &lt;span class=&#34;math inline&#34;&gt;\(y_t\)&lt;/span&gt; as:
&lt;span class=&#34;math display&#34;&gt;\[y_t \thicksim Binomial(p_t, m_t)\]&lt;/span&gt;
Right now I’ve gotten to the point where I have &lt;span class=&#34;math inline&#34;&gt;\(logit(p_t)\)&lt;/span&gt;. To get the true proportions, &lt;span class=&#34;math inline&#34;&gt;\(p_t\)&lt;/span&gt;, I need to inverse-logit the log odds. In R, function &lt;code&gt;plogis()&lt;/code&gt; performs the inverse logit.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;( prop = plogis(log_odds) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  [1] 0.8881394 0.5834313 0.8383962 0.4778502 0.9248498 0.6846321 0.6712349
#  [8] 0.2647890 0.9273473 0.6924584 0.8027835 0.4179445 0.7356899 0.3293085
# [15] 0.8556901 0.5112345 0.9212726 0.6736555 0.8947563 0.5999538&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I can’t forget about &lt;span class=&#34;math inline&#34;&gt;\(m_t\)&lt;/span&gt;. I need to know the binomial sample size for each plot before I can draw the number of successes from the binomial distribution. Since my imaginary study is an experiment I will set this to 50 for every plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat$num_samp = 50&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I’ve been in situations where I wanted the binomial sample size to vary per observation. In that case, you may find &lt;code&gt;sample()&lt;/code&gt; useful, using the range of binomial sample sizes you are interested in as the first argument.&lt;/p&gt;
&lt;p&gt;Here’s an example of what that code could look like, allowing the binomial sample size to vary between 40 and 50 among plots. (&lt;em&gt;Code not run.&lt;/em&gt;)&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;num_samp = sample(40:50, size = 20, replace = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;generate-the-response-variable&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Generate the response variable&lt;/h2&gt;
&lt;p&gt;Now that I have a vector of proportions and have set the binomial sample size per plot, I can calculate the number of successes for each true proportion and binomial sample size based on the binomial distribution. I do this via &lt;code&gt;rbinom()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This is the step where we add the “binomial errors” to the proportions to generate the response variable. The variation for each simulated &lt;code&gt;y&lt;/code&gt; value is based on the binomial variance.&lt;/p&gt;
&lt;p&gt;The next bit of code is directly based on the distribution defined in the statistical model: &lt;span class=&#34;math inline&#34;&gt;\(y_t \thicksim Binomial(p_t, m_t)\)&lt;/span&gt;. I randomly draw 20 values from the binomial distribution, one for each of the 20 proportions stored in &lt;code&gt;prop&lt;/code&gt;. I define the binomial sample size in the &lt;code&gt;size&lt;/code&gt; argument.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;( dat$y = rbinom(n = n_sites*2, size = dat$num_samp, prob = prop) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  [1] 40 26 42 23 48 33 28 19 45 35 45 25 35 21 42 26 47 30 42 31&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;fit-a-model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Fit a model&lt;/h2&gt;
&lt;p&gt;It’s time for model fitting! I can now fit a binomial generalized linear mixed model with a logit link using, e.g., the &lt;code&gt;glmer()&lt;/code&gt; function from package &lt;strong&gt;lme4&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mod = glmer(cbind(y, num_samp - y) ~ treatment + (1|site), 
            data = dat,
            family = binomial(link = &amp;quot;logit&amp;quot;) )
mod&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Generalized linear mixed model fit by maximum likelihood (Laplace
#   Approximation) [glmerMod]
#  Family: binomial  ( logit )
# Formula: cbind(y, num_samp - y) ~ treatment + (1 | site)
#    Data: dat
#      AIC      BIC   logLik deviance df.resid 
# 122.6154 125.6025 -58.3077 116.6154       17 
# Random effects:
#  Groups Name        Std.Dev.
#  site   (Intercept) 0.4719  
# Number of obs: 20, groups:  site, 10
# Fixed Effects:
#        (Intercept)  treatmenttreatment  
#             0.1576              1.4859&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;make-a-function-for-the-simulation&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Make a function for the simulation&lt;/h1&gt;
&lt;p&gt;A single simulation can help us understand the statistical model, but usually the goal of a simulation is to see how the model behaves over the long run. To that end I’ll make my simulation process into a function.&lt;/p&gt;
&lt;p&gt;In my function I’m going to set all the arguments to the parameter values as I defined them earlier. I allow some flexibility, though, so the argument values can be changed if I want to explore the simulation with, say, a different number of replications or different parameter values. I do not allow the number of plots to vary in this particular function, since I’m hard-coding in two treatments.&lt;/p&gt;
&lt;p&gt;This function returns a generalized linear mixed model fit with &lt;code&gt;glmer()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;bin_glmm_fun = function(n_sites = 10,
                        b0 = 0,
                        b1 = 1.735,
                        num_samp = 50,
                        site_var = 0.5) {
     site = rep(LETTERS[1:n_sites], each = 2)
     plot = paste(site, rep(1:2, times = n_sites), sep = &amp;quot;.&amp;quot; )
     treatment = rep( c(&amp;quot;treatment&amp;quot;, &amp;quot;control&amp;quot;), times = n_sites)
     dat = data.frame(site, plot, treatment)           
     
     site_eff = rep( rnorm(n = n_sites, mean = 0, sd = sqrt(site_var) ), each = 2)
     
     log_odds = with(dat, b0 + b1*(treatment == &amp;quot;treatment&amp;quot;) + site_eff)
     prop = plogis(log_odds)
     dat$num_samp = num_samp
     dat$y = rbinom(n = n_sites*2, size = num_samp, prob = prop)
     
     glmer(cbind(y, num_samp - y) ~ treatment + (1|site),
           data = dat,
           family = binomial(link = &amp;quot;logit&amp;quot;) )
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I test the function, using the same &lt;code&gt;seed&lt;/code&gt;, to make sure things are working as expected and that I get the same results as above. I do, and everything looks good.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(16)
bin_glmm_fun()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Generalized linear mixed model fit by maximum likelihood (Laplace
#   Approximation) [glmerMod]
#  Family: binomial  ( logit )
# Formula: cbind(y, num_samp - y) ~ treatment + (1 | site)
#    Data: dat
#      AIC      BIC   logLik deviance df.resid 
# 122.6154 125.6025 -58.3077 116.6154       17 
# Random effects:
#  Groups Name        Std.Dev.
#  site   (Intercept) 0.4719  
# Number of obs: 20, groups:  site, 10
# Fixed Effects:
#        (Intercept)  treatmenttreatment  
#             0.1576              1.4859&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;repeat-the-simulation-many-times&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Repeat the simulation many times&lt;/h1&gt;
&lt;p&gt;Now that I have a working function to simulate data and fit the model it’s time to do the simulation many times. The model from each individual simulation is saved to allow exploration of long run model performance.&lt;/p&gt;
&lt;p&gt;This is a task for &lt;code&gt;replicate()&lt;/code&gt;, which repeatedly calls a function and saves the output. When using &lt;code&gt;simplify = FALSE&lt;/code&gt; the output is a list, which is convenient for going through to extract elements from the models later. I’ll re-run the simulation 1000 times. This could take a while to run for complex models with many terms.&lt;/p&gt;
&lt;p&gt;I print the output of the 100th list element so you can see the list is filled with models.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sims = replicate(1000, bin_glmm_fun(), simplify = FALSE )
sims[[100]]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Generalized linear mixed model fit by maximum likelihood (Laplace
#   Approximation) [glmerMod]
#  Family: binomial  ( logit )
# Formula: cbind(y, num_samp - y) ~ treatment + (1 | site)
#    Data: dat
#      AIC      BIC   logLik deviance df.resid 
# 122.1738 125.1610 -58.0869 116.1738       17 
# Random effects:
#  Groups Name        Std.Dev.
#  site   (Intercept) 0.6059  
# Number of obs: 20, groups:  site, 10
# Fixed Effects:
#        (Intercept)  treatmenttreatment  
#           0.001914            1.824177&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;extract-results-from-the-binomial-glmm&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Extract results from the binomial GLMM&lt;/h1&gt;
&lt;p&gt;After running all the models we can extract whatever we are interested in to explore long run behavior. As I was planning this post, I started wondering what the estimate of dispersion would look like from a binomial GLMM that, by definition, is neither over- nor underdispersed.&lt;/p&gt;
&lt;p&gt;With some caveats, which you can read more about in the &lt;a href=&#34;https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#overdispersion&#34;&gt;GLMM FAQ&lt;/a&gt;, the sum of the squared Pearson residuals divided by the residual degrees of freedom is an estimate of over/underdispersion. This seems OK to use in the scenario I’ve set up here since my binomial sample sizes are fairly large and my proportions are not too close to the distribution limits.&lt;/p&gt;
&lt;p&gt;I made a function to calculate this.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;overdisp_fun = function(model) {
     sum( residuals(model, type = &amp;quot;pearson&amp;quot;)^2)/df.residual(model)
}
overdisp_fun(mod)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 0.7169212&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;explore-estimated-dispersion&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Explore estimated dispersion&lt;/h1&gt;
&lt;p&gt;I want to look at the distribution of dispersion estimates from the 1000 models. This involves looping through the models and using &lt;code&gt;overdisp_fun()&lt;/code&gt; to extract the estimated dispersion from each one. I put the result in a data.frame since I’ll be plotting the result with &lt;strong&gt;ggplot2&lt;/strong&gt;. I use &lt;strong&gt;purrr&lt;/strong&gt; helper function &lt;code&gt;map_dfr()&lt;/code&gt; for the looping.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;alldisp = map_dfr(sims, ~data.frame(disp = overdisp_fun(.x) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here’s the plot of the resulting distribution. I put a vertical line at 1, since values above 1 indicate overdispersion and below 1 indicate underdispersion.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(alldisp, aes(x = disp) ) +
     geom_histogram(fill = &amp;quot;blue&amp;quot;, 
                    alpha = .25, 
                    bins = 100) +
     geom_vline(xintercept = 1) +
     scale_x_continuous(breaks = seq(0, 2, by = 0.2) ) +
     theme_bw(base_size = 14) +
     labs(x = &amp;quot;Dispersion&amp;quot;,
          y = &amp;quot;Count&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2020-08-20-simulate-binomial-glmm_files/figure-html/unnamed-chunk-20-1.png&#34; width=&#34;672&#34; /&gt;
I’m not sure what to think of this yet, but I am pretty fascinated by the result. Only ~7% of models show any overdispersion.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mean(alldisp$disp &amp;gt; 1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 0.069&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And hardly any (&amp;lt;0.5%) estimate a dispersion greater than 1.5, which is a high enough value that we would likely be concerned that our results were anti-conservative if this were an analysis of a real dataset.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mean(alldisp$disp &amp;gt; 1.5)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 0.004&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For this scenario, at least, I learned that it is rare to observe substantial overdispersion when the model isn’t overdispersed. That seems useful.&lt;/p&gt;
&lt;p&gt;I don’t know why so many models show substantial underdispersion, though. Maybe the method for calculating overdispersion doesn’t work well for underdispersion? I’m not sure.&lt;/p&gt;
&lt;p&gt;When checking a real model we’d be using additional tools beyond the estimated dispersion to check model fit and decide if a model looks problematic. I highly recommend package &lt;strong&gt;DHARMa&lt;/strong&gt; for checking model fit for GLMMs (although I’m not necessarily a fan of all the p-values 😜).&lt;/p&gt;
&lt;p&gt;Happy simulating!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;just-the-code-please&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Just the code, please&lt;/h1&gt;
&lt;p&gt;Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code &lt;a href=&#34;https://aosmith.rbind.io/script/2020-08-20-simulate-binomial-glmm.R&#34;&gt;from here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(lme4) # v. 1.1-23
library(purrr) # v. 0.3.4
library(ggplot2) # v. 3.3.2

set.seed(16)
codds = .5/(1 - .5)
todds = .85/(1 - .85)
todds/codds
log(todds/codds)
log(codds)
b0 = 0
b1 = 1.735
site_var = 0.5
n_sites = 10

site = rep(LETTERS[1:n_sites], each = 2)
plot = paste(site, rep(1:2, times = n_sites), sep = &amp;quot;.&amp;quot; )
treatment = rep( c(&amp;quot;treatment&amp;quot;, &amp;quot;control&amp;quot;), times = n_sites)
dat = data.frame(site, plot, treatment)
dat

( site_eff = rep( rnorm(n = n_sites, 
                        mean = 0, 
                        sd = sqrt(site_var) ), 
                  each = 2) )

( log_odds = with(dat, b0 + b1*(treatment == &amp;quot;treatment&amp;quot;) + site_eff ) )
( prop = plogis(log_odds) )

dat$num_samp = 50
num_samp = sample(40:50, size = 20, replace = TRUE)
( dat$y = rbinom(n = n_sites*2, size = dat$num_samp, prob = prop) )

mod = glmer(cbind(y, num_samp - y) ~ treatment + (1|site), 
            data = dat,
            family = binomial(link = &amp;quot;logit&amp;quot;) )
mod

bin_glmm_fun = function(n_sites = 10,
                        b0 = 0,
                        b1 = 1.735,
                        num_samp = 50,
                        site_var = 0.5) {
     site = rep(LETTERS[1:n_sites], each = 2)
     plot = paste(site, rep(1:2, times = n_sites), sep = &amp;quot;.&amp;quot; )
     treatment = rep( c(&amp;quot;treatment&amp;quot;, &amp;quot;control&amp;quot;), times = n_sites)
     dat = data.frame(site, plot, treatment)           
     
     site_eff = rep( rnorm(n = n_sites, mean = 0, sd = sqrt(site_var) ), each = 2)
     
     log_odds = with(dat, b0 + b1*(treatment == &amp;quot;treatment&amp;quot;) + site_eff)
     prop = plogis(log_odds)
     dat$num_samp = num_samp
     dat$y = rbinom(n = n_sites*2, size = num_samp, prob = prop)
     
     glmer(cbind(y, num_samp - y) ~ treatment + (1|site),
           data = dat,
           family = binomial(link = &amp;quot;logit&amp;quot;) )
}


set.seed(16)
bin_glmm_fun()

sims = replicate(1000, bin_glmm_fun(), simplify = FALSE )
sims[[100]]

overdisp_fun = function(model) {
     sum( residuals(model, type = &amp;quot;pearson&amp;quot;)^2)/df.residual(model)
}
overdisp_fun(mod)

alldisp = map_dfr(sims, ~data.frame(disp = overdisp_fun(.x) ) )

ggplot(alldisp, aes(x = disp) ) +
     geom_histogram(fill = &amp;quot;blue&amp;quot;, 
                    alpha = .25, 
                    bins = 100) +
     geom_vline(xintercept = 1) +
     scale_x_continuous(breaks = seq(0, 2, by = 0.2) ) +
     theme_bw(base_size = 14) +
     labs(x = &amp;quot;Dispersion&amp;quot;,
          y = &amp;quot;Count&amp;quot;)

mean(alldisp$disp &amp;gt; 1)
mean(alldisp$disp &amp;gt; 1.5)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Controlling legend appearance in ggplot2 with override.aes</title>
      <link>https://aosmith.rbind.io/2020/07/09/ggplot2-override-aes/</link>
      <pubDate>Thu, 09 Jul 2020 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2020/07/09/ggplot2-override-aes/</guid>
      <description>


&lt;p&gt;In &lt;strong&gt;ggplot2&lt;/strong&gt;, aesthetics and their &lt;code&gt;scale_*()&lt;/code&gt; functions change both the plot appearance and the plot legend appearance simultaneously. The &lt;code&gt;override.aes&lt;/code&gt; argument in &lt;code&gt;guide_legend()&lt;/code&gt; allows the user to change only the legend appearance without affecting the rest of the plot. This is useful for making the legend more readable or for creating certain types of combined legends.&lt;/p&gt;
&lt;p&gt;In this post I’ll first introduce &lt;code&gt;override.aes&lt;/code&gt; with a basic example and then go through three additional plotting scenarios to show other instances where &lt;code&gt;override.aes&lt;/code&gt; comes in handy.&lt;/p&gt;
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#r-packages&#34;&gt;R packages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#introducing-override.aes&#34;&gt;Introducing override.aes&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#adding-a-guides-layer&#34;&gt;Adding a guides() layer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-the-guide-argument-in-scale_&#34;&gt;Using the guide argument in scale_*()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#changing-multiple-aesthetic-parameters&#34;&gt;Changing multiple aesthetic parameters&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#suppress-aesthetics-from-part-of-the-legend&#34;&gt;Suppress aesthetics from part of the legend&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#combining-legends-from-two-layers&#34;&gt;Combining legends from two layers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#controlling-the-appearance-of-multiple-legends&#34;&gt;Controlling the appearance of multiple legends&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#just-the-code-please&#34;&gt;Just the code, please&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;r-packages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;R packages&lt;/h1&gt;
&lt;p&gt;The only package I’ll use in this post is &lt;strong&gt;ggplot2&lt;/strong&gt; for plotting.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2) # v. 3.3.2&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;introducing-override.aes&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introducing override.aes&lt;/h1&gt;
&lt;p&gt;A basic reason to change the legend appearance without changing the plot is to make the legend more readable.&lt;/p&gt;
&lt;p&gt;For example, I’ll start with a scatterplot using the &lt;code&gt;diamonds&lt;/code&gt; dataset. This is a large dataset, so after mapping &lt;code&gt;color&lt;/code&gt; to the &lt;code&gt;cut&lt;/code&gt; variable I set &lt;code&gt;alpha&lt;/code&gt; to increase the transparency and &lt;code&gt;size&lt;/code&gt; to reduce the size of points in the plot.&lt;/p&gt;
&lt;p&gt;You can see using &lt;code&gt;alpha&lt;/code&gt; and &lt;code&gt;size&lt;/code&gt; changes the way the points are shown in both the plot and the legend.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = diamonds, aes(x = carat, y = price, color = cut) ) +
     geom_point(alpha = .25, size = 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2020-07-09-ggplot2-override-aes_files/figure-html/firstplot-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;adding-a-guides-layer&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Adding a guides() layer&lt;/h2&gt;
&lt;p&gt;Making the points small and transparent may be desirable when plotting many points, but it also makes the legend more difficult to read. This is a case where I’d want to make the legend more readable by increasing the point size and/or reducing the point transparency.&lt;/p&gt;
&lt;p&gt;One way to do this is by adding a &lt;code&gt;guides()&lt;/code&gt; layer. The &lt;code&gt;guides()&lt;/code&gt; function uses scale name-guide pairs. I am going to change the legend for the &lt;code&gt;color&lt;/code&gt; scale, so I’ll use &lt;code&gt;color = guide_legend()&lt;/code&gt; as the scale name-guide pair.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;override.aes&lt;/code&gt; is an argument within &lt;code&gt;guide_legend()&lt;/code&gt;, so if you’re looking for more background you can start at &lt;code&gt;?guide_legend&lt;/code&gt;. The &lt;code&gt;override.aes&lt;/code&gt; argument takes a list of aesthetic parameters that will &lt;em&gt;override&lt;/em&gt; the default legend appearance.&lt;/p&gt;
&lt;p&gt;To increase the &lt;code&gt;size&lt;/code&gt; of the points in the &lt;code&gt;color&lt;/code&gt; legend of my plot, the layer I’ll add will look like:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;guides(color = guide_legend(override.aes = list(size = 3) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Adding this layer to the initial plot, you can see how the points in the legend get larger while the points in the plot remain unchanged.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = diamonds, aes(x = carat, y = price, color = cut) ) +
     geom_point(alpha = .25, size = 1) +
     guides(color = guide_legend(override.aes = list(size = 3) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2020-07-09-ggplot2-override-aes_files/figure-html/firstplot2-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;using-the-guide-argument-in-scale_&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Using the guide argument in scale_*()&lt;/h2&gt;
&lt;p&gt;If I am going to change my default colors with a &lt;code&gt;scale_color_*()&lt;/code&gt; function in addition to overriding the legend appearance, I can use the &lt;code&gt;guide&lt;/code&gt; argument there instead of adding a separate &lt;code&gt;guides()&lt;/code&gt; layer. The &lt;code&gt;guide&lt;/code&gt; argument is part of all scale functions.&lt;/p&gt;
&lt;p&gt;For example, say I am already using &lt;code&gt;scale_color_viridis_d()&lt;/code&gt; to change the default color palette of the whole plot (i.e., plot and legend). I can use the same &lt;code&gt;guide_legend()&lt;/code&gt; code from above for the &lt;code&gt;guide&lt;/code&gt; argument to change the size of the points in the legend.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = diamonds, aes(x = carat, y = price, color = cut) ) +
     geom_point(alpha = .25, size = 1) +
     scale_color_viridis_d(option = &amp;quot;magma&amp;quot;,
                           guide = guide_legend(override.aes = list(size = 3) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2020-07-09-ggplot2-override-aes_files/figure-html/firstplot3-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;changing-multiple-aesthetic-parameters&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Changing multiple aesthetic parameters&lt;/h2&gt;
&lt;p&gt;You can control multiple aesthetic parameters at once by adding them to the list passed to &lt;code&gt;override.aes&lt;/code&gt;. If I want to increase the point size as well as remove the point transparency in the legend, I can change both &lt;code&gt;size&lt;/code&gt; and &lt;code&gt;alpha&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = diamonds, aes(x = carat, y = price, color = cut) ) +
     geom_point(alpha = .25, size = 1) +
     scale_color_viridis_d(option = &amp;quot;magma&amp;quot;,
                           guide = guide_legend(override.aes = list(size = 3,
                                                                    alpha = 1) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2020-07-09-ggplot2-override-aes_files/figure-html/firstplot4-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;suppress-aesthetics-from-part-of-the-legend&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Suppress aesthetics from part of the legend&lt;/h1&gt;
&lt;p&gt;Removing an aesthetic from only some parts of the legend is another use for &lt;code&gt;override.aes&lt;/code&gt;. For example, this can be useful when different layers are based on a different number of levels of the same grouping factor.&lt;/p&gt;
&lt;p&gt;The following example is based on &lt;a href=&#34;https://stackoverflow.com/questions/59548358/r-ggplot2-in-the-legend-how-do-i-hide-unused-colors-from-one-geom-while-show&#34;&gt;this Stack Overflow question&lt;/a&gt;. The &lt;code&gt;points&lt;/code&gt; data has information from all three groups of the &lt;code&gt;id&lt;/code&gt; variable but the rectangle, based on the &lt;code&gt;box&lt;/code&gt; dataset, is for only a single group.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;points = structure(list(x = c(5L, 10L, 7L, 9L, 86L, 46L, 22L, 94L, 21L, 
6L, 24L, 3L), y = c(51L, 54L, 50L, 60L, 97L, 74L, 59L, 68L, 45L, 
56L, 25L, 70L), id = c(&amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;b&amp;quot;, 
&amp;quot;c&amp;quot;, &amp;quot;c&amp;quot;, &amp;quot;c&amp;quot;, &amp;quot;c&amp;quot;)), row.names = c(NA, -12L), class = &amp;quot;data.frame&amp;quot;)

head(points)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#    x  y id
# 1  5 51  a
# 2 10 54  a
# 3  7 50  a
# 4  9 60  a
# 5 86 97  b
# 6 46 74  b&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;box = data.frame(left = 1, right = 10, bottom = 50, top = 60, id = &amp;quot;a&amp;quot;)
box&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#   left right bottom top id
# 1    1    10     50  60  a&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here’s what the initial plot looks like, mapping &lt;code&gt;color&lt;/code&gt; to the &lt;code&gt;id&lt;/code&gt; variable. Note that the colored outlines, representing the rectangle layer, are present for every group in the legend even though there is a rectangle for only one of the groups present in the plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = points, aes(color = id) ) +
     geom_point(aes(x = x, y = y), size = 4) +
     geom_rect(data = box, aes(xmin = left,
                               xmax = right,
                               ymin = 50,
                               ymax = top),
               fill = NA, size = 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2020-07-09-ggplot2-override-aes_files/figure-html/secondplot-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;In this case, I want to remove the outlines for the second and third legend key boxes so the legend matches what is in the plot. The legend outlines are based on the &lt;code&gt;linetype&lt;/code&gt; aesthetic. Suppressing these lines can be done with &lt;code&gt;override.aes&lt;/code&gt;, setting the line types to &lt;code&gt;0&lt;/code&gt; in order to remove them.&lt;/p&gt;
&lt;p&gt;Note that I have to list the line type for every group, not just the groups I want to remove. I keep the line for the first group solid via &lt;code&gt;1&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = points, aes(color = id) ) +
     geom_point(aes(x = x, y = y), size = 4) +
     geom_rect(data = box, aes(xmin = left,
                               xmax = right,
                               ymin = 50,
                               ymax = top),
               fill = NA, size = 1) +
     guides(color = guide_legend(override.aes = list(linetype = c(1, 0, 0) ) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2020-07-09-ggplot2-override-aes_files/figure-html/secondplot2-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;combining-legends-from-two-layers&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Combining legends from two layers&lt;/h1&gt;
&lt;p&gt;In the next example I’ll show how &lt;code&gt;override.aes&lt;/code&gt; can be useful when creating a legend based on multiple layers and you want distinct symbols in each legend key box.&lt;/p&gt;
&lt;p&gt;There are situations where we want to add a legend to identify different elements of the plot, such as indicating the plotted line is a fitted line or that points are means. This can be done by mapping aesthetics to constants to make a manual legend and then manipulating the symbols shown in the legend via &lt;code&gt;override.aes&lt;/code&gt;. I wrote about making manual legends in &lt;a href=&#34;https://aosmith.rbind.io/2018/07/19/manual-legends-ggplot2/&#34;&gt;an earlier blog post&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The plot below shows observed values and a fitted line per group based on &lt;code&gt;color&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = mtcars, aes(x = mpg, y = wt, color = factor(am) ) ) +
     geom_point(size = 3) +
     geom_smooth(method = &amp;quot;lm&amp;quot;, se = FALSE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# `geom_smooth()` using formula &amp;#39;y ~ x&amp;#39;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2020-07-09-ggplot2-override-aes_files/figure-html/thirdplot-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;I’m going to leave the &lt;code&gt;color&lt;/code&gt; legend alone but I want to add a second legend to indicate that the points are observed values and the lines are fitted lines. I’ll use the &lt;code&gt;alpha&lt;/code&gt; aesthetic for this. Using an aesthetic that you haven’t already used and that affects both layers is a trick that often comes in handy when adding an extra legend like I’m doing here.&lt;/p&gt;
&lt;p&gt;I don’t actually want &lt;code&gt;alpha&lt;/code&gt; to affect the plot appearance, so I also add &lt;code&gt;scale_alpha_manual()&lt;/code&gt; to make sure both layers stay opaque by setting the &lt;code&gt;values&lt;/code&gt; for both groups to 1. I also remove the legend name and set the order of the &lt;code&gt;breaks&lt;/code&gt; so the &lt;code&gt;Observed&lt;/code&gt; group is listed first in the new legend.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = mtcars, aes(x = mpg, y = wt, color = factor(am) ) ) +
     geom_point(aes(alpha = &amp;quot;Observed&amp;quot;), size = 3) +
     geom_smooth(method = &amp;quot;lm&amp;quot;, se = FALSE, aes(alpha = &amp;quot;Fitted&amp;quot;) ) +
     scale_alpha_manual(name = NULL,
                        values = c(1, 1),
                        breaks = c(&amp;quot;Observed&amp;quot;, &amp;quot;Fitted&amp;quot;) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2020-07-09-ggplot2-override-aes_files/figure-html/thirdplot2-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now I have a new legend to work with. However, the legend has both the point and the line symbol in all legend key boxes. I need to override the current legend so the &lt;code&gt;Observed&lt;/code&gt; legend key box contains only a point symbol and the &lt;code&gt;Fitted&lt;/code&gt; legend key box has only a line symbol. This is where &lt;code&gt;override.aes&lt;/code&gt; comes in.&lt;/p&gt;
&lt;p&gt;Here’s what I’ll do: I’ll change the &lt;code&gt;linetype&lt;/code&gt; to &lt;code&gt;0&lt;/code&gt; for the first key box but leave it as &lt;code&gt;1&lt;/code&gt; for the second. I’ll use shape &lt;code&gt;16&lt;/code&gt; (a solid circle) as the &lt;code&gt;shape&lt;/code&gt; for the first key box but remove the point altogether in the second key box with &lt;code&gt;NA&lt;/code&gt;. I’m also going to make sure all elements are black via &lt;code&gt;color&lt;/code&gt;. (&lt;em&gt;If you need to know codes for shapes and line types see &lt;a href=&#34;http://www.cookbook-r.com/Graphs/Shapes_and_line_types/&#34;&gt;here&lt;/a&gt;.&lt;/em&gt;)&lt;/p&gt;
&lt;p&gt;I use &lt;code&gt;linetype&lt;/code&gt;, &lt;code&gt;shape&lt;/code&gt;, and &lt;code&gt;color&lt;/code&gt; in the &lt;code&gt;override.aes&lt;/code&gt; list within &lt;code&gt;scale_alpha_manual()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = mtcars, aes(x = mpg, y = wt, color = factor(am) ) ) +
     geom_point(aes(alpha = &amp;quot;Observed&amp;quot;), size = 3) +
     geom_smooth(method = &amp;quot;lm&amp;quot;, se = FALSE, aes(alpha = &amp;quot;Fitted&amp;quot;) ) +
     scale_alpha_manual(name = NULL,
                        values = c(1, 1),
                        breaks = c(&amp;quot;Observed&amp;quot;, &amp;quot;Fitted&amp;quot;),
                        guide = guide_legend(override.aes = list(linetype = c(0, 1),
                                                                  shape = c(16, NA),
                                                                  color = &amp;quot;black&amp;quot;) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2020-07-09-ggplot2-override-aes_files/figure-html/thirdplot3-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;controlling-the-appearance-of-multiple-legends&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Controlling the appearance of multiple legends&lt;/h1&gt;
&lt;p&gt;The final example for &lt;code&gt;override.aes&lt;/code&gt; may seem a little esoteric, but it has come up for me in the past. Say I want to make a scatterplot with the &lt;code&gt;fill&lt;/code&gt; and &lt;code&gt;shape&lt;/code&gt; aesthetics mapped to two different factors. I use &lt;code&gt;fill&lt;/code&gt; instead of &lt;code&gt;color&lt;/code&gt; so the points have an outline. Having an outline around the points can matter if, e.g., you have a white plot background and want the points to be black and white as &lt;a href=&#34;https://stackoverflow.com/questions/44765946/r-ggplot2-plot-geom-point-with-black-and-white-without-using-shape&#34;&gt;in this question&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This is the dataset I’ll use in this plot example, where the two factors are named &lt;code&gt;g1&lt;/code&gt; and &lt;code&gt;g2&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat = structure(list(g1 = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), class = &amp;quot;factor&amp;quot;, .Label = c(&amp;quot;High&amp;quot;, 
&amp;quot;Low&amp;quot;)), g2 = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 
1L, 2L, 2L, 1L, 1L, 2L, 2L), class = &amp;quot;factor&amp;quot;, .Label = c(&amp;quot;Control&amp;quot;, 
&amp;quot;Treatment&amp;quot;)), x = c(0.42, 0.39, 0.56, 0.59, 0.17, 0.95, 0.85, 
0.25, 0.31, 0.75, 0.58, 0.9, 0.6, 0.86, 0.61, 0.61), y = c(-1.4, 
3.6, 1.1, -0.1, 0.5, 0, -1.8, 0.8, -1.1, -0.6, 0.2, 0.3, 1.1, 
1.6, 0.9, -0.6)), class = &amp;quot;data.frame&amp;quot;, row.names = c(NA, -16L
))

head(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#     g1        g2    x    y
# 1 High   Control 0.42 -1.4
# 2  Low   Control 0.39  3.6
# 3 High Treatment 0.56  1.1
# 4  Low Treatment 0.59 -0.1
# 5 High   Control 0.17  0.5
# 6  Low   Control 0.95  0.0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I set the colors for the &lt;code&gt;fill&lt;/code&gt; in &lt;code&gt;scale_fill_manual()&lt;/code&gt; and choose &lt;em&gt;fillable&lt;/em&gt; shapes in &lt;code&gt;scale_shape_manual()&lt;/code&gt;. Fillable shapes are shapes 21 through 25.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = dat, aes(x = x, y = y, fill = g1, shape = g2) ) +
     geom_point(size = 5) +
     scale_fill_manual(values = c(&amp;quot;#002F70&amp;quot;, &amp;quot;#EDB4B5&amp;quot;) ) +
     scale_shape_manual(values = c(21, 24) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2020-07-09-ggplot2-override-aes_files/figure-html/fourthplot-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The plot itself shows the fill colors and shapes, but you can see some issues in the legends. The fill colors don’t show up in the &lt;code&gt;g1&lt;/code&gt; legend at all. This is because the default shape in the legend isn’t a fillable shape. In addition, the &lt;code&gt;g2&lt;/code&gt; legend shows unfilled points and I think it would look better if the points were filled.&lt;/p&gt;
&lt;p&gt;I can address both these issues via &lt;code&gt;override.aes&lt;/code&gt;. I’ll change the point shape in the &lt;code&gt;fill&lt;/code&gt; legend to shape 21 and the fill color in the &lt;code&gt;shape&lt;/code&gt; legend to black within a &lt;code&gt;guides()&lt;/code&gt; layer. This code gives you a chance to see how you can use multiple scale name-guide pairs within the same &lt;code&gt;guides()&lt;/code&gt; layer.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = dat, aes(x = x, y = y, fill = g1, shape = g2) ) +
     geom_point(size = 5) +
     scale_fill_manual(values = c(&amp;quot;#002F70&amp;quot;, &amp;quot;#EDB4B5&amp;quot;) ) +
     scale_shape_manual(values = c(21, 24) ) +
     guides(fill = guide_legend(override.aes = list(shape = 21) ),
            shape = guide_legend(override.aes = list(fill = &amp;quot;black&amp;quot;) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2020-07-09-ggplot2-override-aes_files/figure-html/fourthplot2-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;While I’m sure you can come up with additional scenarios, that should give you a taste for when overriding the aesthetics in the legend is useful. 😄&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;just-the-code-please&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Just the code, please&lt;/h1&gt;
&lt;p&gt;Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code &lt;a href=&#34;https://aosmith.rbind.io/script/2020-07-09-ggplot2-override-aes.R&#34;&gt;from here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2) # v. 3.3.2

ggplot(data = diamonds, aes(x = carat, y = price, color = cut) ) +
     geom_point(alpha = .25, size = 1)
     
ggplot(data = diamonds, aes(x = carat, y = price, color = cut) ) +
     geom_point(alpha = .25, size = 1) +
     guides(color = guide_legend(override.aes = list(size = 3) ) )
            
ggplot(data = diamonds, aes(x = carat, y = price, color = cut) ) +
     geom_point(alpha = .25, size = 1) +
     scale_color_viridis_d(option = &amp;quot;magma&amp;quot;,
                           guide = guide_legend(override.aes = list(size = 3) ) )

ggplot(data = diamonds, aes(x = carat, y = price, color = cut) ) +
     geom_point(alpha = .25, size = 1) +
     scale_color_viridis_d(option = &amp;quot;magma&amp;quot;,
                           guide = guide_legend(override.aes = list(size = 3,
                                                                    alpha = 1) ) )

points = structure(list(x = c(5L, 10L, 7L, 9L, 86L, 46L, 22L, 94L, 21L, 
6L, 24L, 3L), y = c(51L, 54L, 50L, 60L, 97L, 74L, 59L, 68L, 45L, 
56L, 25L, 70L), id = c(&amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;b&amp;quot;, 
&amp;quot;c&amp;quot;, &amp;quot;c&amp;quot;, &amp;quot;c&amp;quot;, &amp;quot;c&amp;quot;)), row.names = c(NA, -12L), class = &amp;quot;data.frame&amp;quot;)

head(points)

box = data.frame(left = 1, right = 10, bottom = 50, top = 60, id = &amp;quot;a&amp;quot;)
box
ggplot(data = points, aes(color = id) ) +
     geom_point(aes(x = x, y = y), size = 4) +
     geom_rect(data = box, aes(xmin = left,
                               xmax = right,
                               ymin = 50,
                               ymax = top),
               fill = NA, size = 1)

ggplot(data = points, aes(color = id) ) +
     geom_point(aes(x = x, y = y), size = 4) +
     geom_rect(data = box, aes(xmin = left,
                               xmax = right,
                               ymin = 50,
                               ymax = top),
               fill = NA, size = 1) +
     guides(color = guide_legend(override.aes = list(linetype = c(1, 0, 0) ) ) )

ggplot(data = mtcars, aes(x = mpg, y = wt, color = factor(am) ) ) +
     geom_point(size = 3) +
     geom_smooth(method = &amp;quot;lm&amp;quot;, se = FALSE)
       
ggplot(data = mtcars, aes(x = mpg, y = wt, color = factor(am) ) ) +
     geom_point(aes(alpha = &amp;quot;Observed&amp;quot;), size = 3) +
     geom_smooth(method = &amp;quot;lm&amp;quot;, se = FALSE, aes(alpha = &amp;quot;Fitted&amp;quot;) ) +
     scale_alpha_manual(name = NULL,
                        values = c(1, 1),
                        breaks = c(&amp;quot;Observed&amp;quot;, &amp;quot;Fitted&amp;quot;) )
                        

ggplot(data = mtcars, aes(x = mpg, y = wt, color = factor(am) ) ) +
     geom_point(aes(alpha = &amp;quot;Observed&amp;quot;), size = 3) +
     geom_smooth(method = &amp;quot;lm&amp;quot;, se = FALSE, aes(alpha = &amp;quot;Fitted&amp;quot;) ) +
     scale_alpha_manual(name = NULL,
                        values = c(1, 1),
                        breaks = c(&amp;quot;Observed&amp;quot;, &amp;quot;Fitted&amp;quot;),
                        guide = guide_legend(override.aes = list(linetype = c(0, 1),
                                                                  shape = c(16, NA),
                                                                  color = &amp;quot;black&amp;quot;) ) )

dat = structure(list(g1 = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), class = &amp;quot;factor&amp;quot;, .Label = c(&amp;quot;High&amp;quot;, 
&amp;quot;Low&amp;quot;)), g2 = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 
1L, 2L, 2L, 1L, 1L, 2L, 2L), class = &amp;quot;factor&amp;quot;, .Label = c(&amp;quot;Control&amp;quot;, 
&amp;quot;Treatment&amp;quot;)), x = c(0.42, 0.39, 0.56, 0.59, 0.17, 0.95, 0.85, 
0.25, 0.31, 0.75, 0.58, 0.9, 0.6, 0.86, 0.61, 0.61), y = c(-1.4, 
3.6, 1.1, -0.1, 0.5, 0, -1.8, 0.8, -1.1, -0.6, 0.2, 0.3, 1.1, 
1.6, 0.9, -0.6)), class = &amp;quot;data.frame&amp;quot;, row.names = c(NA, -16L
))

head(dat)

ggplot(data = dat, aes(x = x, y = y, fill = g1, shape = g2) ) +
     geom_point(size = 5) +
     scale_fill_manual(values = c(&amp;quot;#002F70&amp;quot;, &amp;quot;#EDB4B5&amp;quot;) ) +
     scale_shape_manual(values = c(21, 24) )

ggplot(data = dat, aes(x = x, y = y, fill = g1, shape = g2) ) +
     geom_point(size = 5) +
     scale_fill_manual(values = c(&amp;quot;#002F70&amp;quot;, &amp;quot;#EDB4B5&amp;quot;) ) +
     scale_shape_manual(values = c(21, 24) ) +
     guides(fill = guide_legend(override.aes = list(shape = 21) ),
            shape = guide_legend(override.aes = list(fill = &amp;quot;black&amp;quot;) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Analysis essentials: Using the help page for a function in R</title>
      <link>https://aosmith.rbind.io/2020/04/28/r-documentation/</link>
      <pubDate>Tue, 28 Apr 2020 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2020/04/28/r-documentation/</guid>
      <description>


&lt;p&gt;Since I tend to work with relatively new R users, I think a lot about what folks need to know when they are getting started. Learning how to get help tops my list of essential skills. Some of this involves learning about useful help forums like &lt;a href=&#34;https://stackoverflow.com/questions/tagged/r#&#34;&gt;Stack Overflow&lt;/a&gt; and the &lt;a href=&#34;https://community.rstudio.com/&#34;&gt;RStudio Community&lt;/a&gt;. Some of this is about learning good search terms (this is a hard one!). And some of this is learning how to use the R documentation help pages.&lt;/p&gt;
&lt;p&gt;While there are still exceptions, most often the help pages in R contain a bunch of useful information. Here I talk a little about what is generally in a help page for a function and what I focus on in each section.&lt;/p&gt;
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#r-help-pages&#34;&gt;R help pages&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#help-page-structure&#34;&gt;Help page structure&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#usage&#34;&gt;Usage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#arguments&#34;&gt;Arguments&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#examples&#34;&gt;Examples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#other-sections&#34;&gt;Other sections&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-argument-order-instead-of-labels&#34;&gt;Using argument order instead of labels&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;r-help-pages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;R help pages&lt;/h1&gt;
&lt;p&gt;Every time I use a function for a first time or reuse a function after some time has passed (like, 5 minutes in some cases 😜), I spend time looking at the R help page for that function. You can get to a help page in R by typing &lt;code&gt;?functionname&lt;/code&gt; into your Console and pressing Enter, where &lt;code&gt;functionname&lt;/code&gt; is some R function you are using.&lt;/p&gt;
&lt;p&gt;For example, if I wanted to take an average of some numbers with the &lt;code&gt;mean()&lt;/code&gt; function, I would type &lt;code&gt;?mean&lt;/code&gt; at the &lt;code&gt;&amp;gt;&lt;/code&gt; in the R Console and then press Enter. The help page opens up; if using RStudio this will default to open in the Help pane.&lt;/p&gt;
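&lt;p&gt;As a quick aside, &lt;code&gt;help()&lt;/code&gt; is the function form of the &lt;code&gt;?&lt;/code&gt; shortcut, which can be useful if you prefer spelling things out. Both of these lines open the same page:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;?mean       # shortcut form
help(mean)  # function form, equivalent to ?mean&lt;/code&gt;&lt;/pre&gt;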
&lt;div id=&#34;help-page-structure&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Help page structure&lt;/h2&gt;
&lt;p&gt;A help page for an R function always has the same basic set-up. Here’s what the first half of the help page for &lt;code&gt;mean()&lt;/code&gt; looks like.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://raw.githubusercontent.com/aosmith16/r-basics-workshop/master/images/mean_help.png&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
&lt;p&gt;At the very top you’ll see the function name, followed by the package the function is in surrounded by curly braces. You can see that &lt;code&gt;mean()&lt;/code&gt; is part of the &lt;strong&gt;base&lt;/strong&gt; package.&lt;/p&gt;
&lt;p&gt;This is followed by a function title and basic &lt;strong&gt;Description&lt;/strong&gt; of the function. Sometimes this description can be fairly in depth and useful, but often, like here, it’s not and I quickly skim over it.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://raw.githubusercontent.com/aosmith16/r-basics-workshop/master/images/help_section1.png&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;usage&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Usage&lt;/h2&gt;
&lt;p&gt;The &lt;strong&gt;Usage&lt;/strong&gt; section is usually my first stop in a help page. This is where I can see the arguments available in the function along with any default values. The function &lt;em&gt;arguments&lt;/em&gt; are labels for the inputs you can give to a function. A &lt;em&gt;default value&lt;/em&gt; means that is the value the function will use if you don’t input something else.&lt;/p&gt;
&lt;p&gt;For example, for &lt;code&gt;mean()&lt;/code&gt; you can see that the first argument is &lt;code&gt;x&lt;/code&gt; (no default value), followed by &lt;code&gt;trim&lt;/code&gt; that defaults to a value of &lt;code&gt;0&lt;/code&gt;, and then &lt;code&gt;na.rm&lt;/code&gt; with a default of &lt;code&gt;FALSE&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://raw.githubusercontent.com/aosmith16/r-basics-workshop/master/images/help_section2.png&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
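&lt;p&gt;If you just want a quick reminder of the arguments without opening the full help page, &lt;code&gt;args()&lt;/code&gt; prints a &lt;strong&gt;Usage&lt;/strong&gt;-style signature in the console. Note that &lt;code&gt;mean()&lt;/code&gt; is a generic function, so to see the full argument list you need to ask for the default method:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;args(mean.default)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# function (x, trim = 0, na.rm = FALSE, ...) 
# NULL&lt;/code&gt;&lt;/pre&gt;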
&lt;/div&gt;
&lt;div id=&#34;arguments&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Arguments&lt;/h2&gt;
&lt;p&gt;The arguments the function takes, along with a description of each argument, are given in the &lt;strong&gt;Arguments&lt;/strong&gt; section. This is a section I often spend a lot of time in and go back to regularly, figuring out what arguments do and the options available for each argument.&lt;/p&gt;
&lt;p&gt;In the &lt;code&gt;mean()&lt;/code&gt; example I’m using, this section tells me that the &lt;code&gt;trim&lt;/code&gt; argument can take numeric values between 0 and 0.5 in order to &lt;em&gt;trim&lt;/em&gt; the dataset prior to calculating the mean. I know from &lt;strong&gt;Usage&lt;/strong&gt; it defaults to &lt;code&gt;0&lt;/code&gt; but note in this case the default is not explicitly listed in the argument description.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;na.rm&lt;/code&gt; argument takes a logical value (i.e., &lt;code&gt;TRUE&lt;/code&gt; or &lt;code&gt;FALSE&lt;/code&gt;) and controls whether or not &lt;code&gt;NA&lt;/code&gt; values are stripped before the function calculates the mean. Since it defaults to &lt;code&gt;FALSE&lt;/code&gt;, the &lt;code&gt;NA&lt;/code&gt; values are not stripped prior to calculation unless I change this.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://raw.githubusercontent.com/aosmith16/r-basics-workshop/master/images/help_section3.png&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
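&lt;p&gt;To see what these two arguments do in practice, here’s a small example (the vector is just made up for illustration). With &lt;code&gt;trim = 0.25&lt;/code&gt;, 25% of the values are dropped from each end of the sorted data before averaging, which pulls the result away from the large outlier.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;x = c(0:10, 50)          # 12 values with one large outlier
mean(x)                  # 8.75
mean(x, trim = 0.25)     # 5.5; drops 3 values from each end

mean( c(1, 2, NA) )                # NA, since na.rm defaults to FALSE
mean( c(1, 2, NA), na.rm = TRUE )  # 1.5&lt;/code&gt;&lt;/pre&gt;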
&lt;/div&gt;
&lt;div id=&#34;examples&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Examples&lt;/h2&gt;
&lt;p&gt;If you scroll to the very bottom of a help page you will find the &lt;strong&gt;Examples&lt;/strong&gt; section. This gives examples of how the function works. You can practice using the function by copying and pasting the example and running the code. In RStudio you can also highlight the code and run it directly from the Help pane with &lt;code&gt;Ctrl+Enter&lt;/code&gt; (MacOS &lt;code&gt;Cmd+Enter&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;After looking at &lt;strong&gt;Usage&lt;/strong&gt; and &lt;strong&gt;Arguments&lt;/strong&gt; I often scroll right down to the &lt;strong&gt;Examples&lt;/strong&gt; section to see an example of the code in use. The &lt;strong&gt;Examples&lt;/strong&gt; section for &lt;code&gt;mean()&lt;/code&gt; is pretty sparse, but you’ll find that this section is quite extensive for some functions.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://raw.githubusercontent.com/aosmith16/r-basics-workshop/master/images/help_section4.png&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;other-sections&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Other sections&lt;/h2&gt;
&lt;p&gt;Depending on the function, there can be a variety of different and important information after &lt;strong&gt;Arguments&lt;/strong&gt; and before &lt;strong&gt;Examples&lt;/strong&gt;. You may see mathematical notation that shows what the function does (in &lt;strong&gt;Details&lt;/strong&gt;), a description of what the function returns (in &lt;strong&gt;Value&lt;/strong&gt;), references in support of what the function does (in &lt;strong&gt;References&lt;/strong&gt;), etc. This can be extremely valuable information, but I often don’t read it until I run into trouble using the function or need more information to understand exactly what the function does.&lt;/p&gt;
&lt;p&gt;I have a couple examples of useful information I’ve found in these other sections for various functions.&lt;/p&gt;
&lt;p&gt;First up is &lt;code&gt;rbind()&lt;/code&gt; for stacking datasets. It turns out that &lt;code&gt;rbind()&lt;/code&gt; stacks columns based on matching column names and not column positions. This is mentioned in the function documentation, but you have to dive deep into the very long &lt;strong&gt;Details&lt;/strong&gt; section of the help file at &lt;code&gt;?rbind&lt;/code&gt; to find the information.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://raw.githubusercontent.com/aosmith16/r-basics-workshop/master/images/rbind.png&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
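Here’s a minimal sketch of that behavior, using two made-up data.frames:

```r
# rbind() matches data.frame columns by name, not position
df1 = data.frame(a = 1:2, b = 3:4)
df2 = data.frame(b = 5:6, a = 7:8)  # same columns, reversed order
rbind(df1, df2)$a  # 1 2 7 8, not 1 2 5 6
```
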
&lt;p&gt;Second, functions for distributions will give information about the density function used in the &lt;strong&gt;Details&lt;/strong&gt; section. Since the help pages for distributions almost always describe multiple functions at once, you can see what each of the functions return in &lt;strong&gt;Value&lt;/strong&gt;. Here’s an example from &lt;code&gt;?rnorm&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://raw.githubusercontent.com/aosmith16/aosmith/master/static/img/rnorm_help.PNG&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
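For example, the four functions documented together at &lt;code&gt;?rnorm&lt;/code&gt; each return a different quantity:

```r
dnorm(0)    # density of the standard normal at 0
pnorm(0)    # cumulative probability up to 0, i.e., 0.5
qnorm(0.5)  # quantile for probability 0.5, i.e., 0
rnorm(2)    # two random draws
```
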
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;using-argument-order-instead-of-labels&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using argument order instead of labels&lt;/h1&gt;
&lt;p&gt;You will see plenty of examples in R where the argument labels are not written out explicitly. This is because we can take advantage of the &lt;em&gt;argument order&lt;/em&gt; when writing code in R.&lt;/p&gt;
&lt;p&gt;You can see this in the &lt;code&gt;mean()&lt;/code&gt; &lt;strong&gt;Examples&lt;/strong&gt; section, for example. You can pass a vector to the first argument of &lt;code&gt;mean()&lt;/code&gt; without explicitly writing the &lt;code&gt;x&lt;/code&gt; argument label.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;vals = c(0:10, 50)
mean(vals)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 8.75&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In fact, you can pass in values to all the arguments without labels as long as you input them in the order the arguments come into the function. This relies heavily on you remembering the order of the arguments, as listed in &lt;strong&gt;Usage&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;vals = c(0:10, 50, NA)
mean(vals, 0.1, TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 5.5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You will definitely catch me leaving argument labels off in my own code. These days, though, I try to be more careful and primarily leave off only the first label. One reason for this is that my future self needs the argument labels written out to better understand the code. I’m much more likely to figure out what the &lt;code&gt;mean()&lt;/code&gt; code is doing if I put the argument labels back in. I think the code above, without the labels for &lt;code&gt;trim&lt;/code&gt; and &lt;code&gt;na.rm&lt;/code&gt;, is hard to understand.&lt;/p&gt;
&lt;p&gt;Here’s the same code, this time with the argument labels written out. Note the argument order doesn’t matter if the argument labels are used.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;vals = c(0:10, 50, NA)
mean(vals, na.rm = TRUE, trim = 0.1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 5.5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Another reason I try to use argument labels is that new R users can get stung leaving off argument labels when they don’t realize how/why it works. 🐝 I worked with an R newbie recently who was getting weird results from a GLM with an offset. It turns out they weren’t using argument labels and so had passed the offset to &lt;code&gt;weights&lt;/code&gt; instead of &lt;code&gt;offset&lt;/code&gt;. Whoops! Luckily they saw something was weird and I could help get them on the right path. And now they know more about why it can be useful to write out argument labels. 😄&lt;/p&gt;
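Here’s a hedged sketch of how that mistake can happen. The data are made up, but the argument order comes straight from &lt;code&gt;?glm&lt;/code&gt;, where &lt;code&gt;weights&lt;/code&gt; comes before &lt;code&gt;offset&lt;/code&gt;:

```r
# Made-up count data with an exposure variable intended as an offset
set.seed(1)
dat = data.frame(y = rpois(20, 5), x = rnorm(20), exposure = runif(20, 1, 2))

# Intended model: the offset labeled explicitly
fit_good = glm(y ~ x, family = poisson, data = dat, offset = log(exposure))

# Positional version: log(exposure) silently lands in `weights`, not `offset`
fit_bad = glm(y ~ x, poisson, dat, log(dat$exposure))
```

The two fits give different coefficients, and nothing about the positional call warns you that the offset went to the wrong argument.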
&lt;p&gt;I talk about this issue here because I don’t often see explicit discussion of why and when argument labels can be left off, even though there are a lot of code examples out there that do this. This reminds me of when I was a new beekeeper and I made the mistake of going into a hive in the evening. (Do not try this at home, folks!) It turns out “everyone” who is an expert beekeeper knows what happens if you do this, but it wasn’t mentioned in any of my beginner books and classes. I don’t think beginners should have to learn this sort of thing the hard way.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span id=&#34;fig:unnamed-chunk-5&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://raw.githubusercontent.com/aosmith16/aosmith/master/static/img/bees.jpg&#34; alt=&#34;No worries, this is a daytime hive inspection.&#34;  /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1: No worries, this is a daytime hive inspection.
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>An example of base::split() for looping through groups</title>
      <link>https://aosmith.rbind.io/2019/11/27/split-example/</link>
      <pubDate>Wed, 27 Nov 2019 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2019/11/27/split-example/</guid>
      <description>


&lt;p&gt;I recently had a question from a client about the simplest way to subset a data.frame and apply a function to each subset. “Simplest” could mean many things, of course, since what is simple for one person could appear very difficult to another. In this specific case I suggested using &lt;code&gt;base::split()&lt;/code&gt; as a possible option since it is one I find fairly approachable.&lt;/p&gt;
&lt;p&gt;It turns out I don’t have a go-to example for how to get started with a &lt;code&gt;split()&lt;/code&gt; approach. So here’s a quick blog post about it! 😄&lt;/p&gt;
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#load-r-packages&#34;&gt;Load R packages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#a-dataset-with-groups&#34;&gt;A dataset with groups&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#create-separate-data.frames-per-group&#34;&gt;Create separate data.frames per group&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#looping-through-the-list&#34;&gt;Looping through the list&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#splitting-by-multiple-groups&#34;&gt;Splitting by multiple groups&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#other-thoughts-on-split&#34;&gt;Other thoughts on split()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#just-the-code-please&#34;&gt;Just the code, please&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;load-r-packages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Load R packages&lt;/h1&gt;
&lt;p&gt;I’ll load &lt;strong&gt;purrr&lt;/strong&gt; for looping through lists.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(purrr) # 0.3.3&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;a-dataset-with-groups&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;A dataset with groups&lt;/h1&gt;
&lt;p&gt;I made a small dataset to use with &lt;code&gt;split()&lt;/code&gt;. The &lt;code&gt;id&lt;/code&gt; variable contains the group information. There are three groups, a, b, and c, with 10 observations per group. There are also two numeric variables, &lt;code&gt;var1&lt;/code&gt; and &lt;code&gt;var2&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat = structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L), .Label = c(&amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;c&amp;quot;), class = &amp;quot;factor&amp;quot;), 
    var1 = c(4, 2.7, 3.4, 2.7, 4.6, 2.9, 2.2, 4.5, 4.6, 2.4, 
    3, 3.8, 2.5, 4, 3.6, 2.7, 4.5, 4.1, 4.2, 2.2, 4.9, 4.4, 3.6, 
    3.3, 2.7, 3.9, 4.9, 4.9, 4.3, 3.4), var2 = c(6, 22.3, 19.4, 
    22.8, 18.6, 14.2, 10.9, 22.7, 22.4, 11.7, 6, 13.3, 12.5, 
    6.3, 13.6, 20.5, 23.6, 10.9, 8.9, 20.9, 23.7, 15.9, 22.1, 
    11.6, 22, 17.7, 21, 20.8, 16.7, 21.4)), class = &amp;quot;data.frame&amp;quot;, row.names = c(NA, 
-30L))

head(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#   id var1 var2
# 1  a  4.0  6.0
# 2  a  2.7 22.3
# 3  a  3.4 19.4
# 4  a  2.7 22.8
# 5  a  4.6 18.6
# 6  a  2.9 14.2&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;create-separate-data.frames-per-group&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Create separate data.frames per group&lt;/h1&gt;
&lt;p&gt;If the goal is to apply a function to each dataset in each group, we need to pull out a dataset for each &lt;code&gt;id&lt;/code&gt;. One approach to do this is to make a subset for each group and then apply the function of interest to the subset. A classic approach would be to do the subsetting within a &lt;code&gt;for()&lt;/code&gt; loop.&lt;/p&gt;
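For comparison, here’s a minimal sketch of that classic &lt;code&gt;for()&lt;/code&gt; loop approach, using a small made-up two-group data.frame:

```r
# Manually subset each group inside a for() loop
ex = data.frame(id = rep(c("a", "b"), each = 3), y = 1:6)
groups = unique(ex$id)
results = vector("list", length(groups))
names(results) = groups
for (g in groups) {
     results[[g]] = ex[ex$id == g, ]  # pull out the rows for this group
}
results$a  # the subset for group "a"
```
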
&lt;p&gt;This is a situation where I find &lt;code&gt;split()&lt;/code&gt; to be really convenient. It splits the data by a defined group variable so we don’t have to subset things manually.&lt;/p&gt;
&lt;p&gt;The output from &lt;code&gt;split()&lt;/code&gt; is a list. If I split a dataset by groups, each element of the list will be a data.frame for one of the groups. Note the group values are used as the names of the list elements. I find the list-naming aspect of &lt;code&gt;split()&lt;/code&gt; handy for keeping track of groups in subsequent steps.&lt;/p&gt;
&lt;p&gt;Here’s an example, where I split &lt;code&gt;dat&lt;/code&gt; by the &lt;code&gt;id&lt;/code&gt; variable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat_list = split(dat, dat$id)
dat_list&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $a
#    id var1 var2
# 1   a  4.0  6.0
# 2   a  2.7 22.3
# 3   a  3.4 19.4
# 4   a  2.7 22.8
# 5   a  4.6 18.6
# 6   a  2.9 14.2
# 7   a  2.2 10.9
# 8   a  4.5 22.7
# 9   a  4.6 22.4
# 10  a  2.4 11.7
# 
# $b
#    id var1 var2
# 11  b  3.0  6.0
# 12  b  3.8 13.3
# 13  b  2.5 12.5
# 14  b  4.0  6.3
# 15  b  3.6 13.6
# 16  b  2.7 20.5
# 17  b  4.5 23.6
# 18  b  4.1 10.9
# 19  b  4.2  8.9
# 20  b  2.2 20.9
# 
# $c
#    id var1 var2
# 21  c  4.9 23.7
# 22  c  4.4 15.9
# 23  c  3.6 22.1
# 24  c  3.3 11.6
# 25  c  2.7 22.0
# 26  c  3.9 17.7
# 27  c  4.9 21.0
# 28  c  4.9 20.8
# 29  c  4.3 16.7
# 30  c  3.4 21.4&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;looping-through-the-list&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Looping through the list&lt;/h1&gt;
&lt;p&gt;Once the data are split into separate data.frames per group, we can loop through the list and apply a function to each one using whatever looping approach we prefer.&lt;/p&gt;
&lt;p&gt;For example, if I want to fit a linear model of &lt;code&gt;var1&lt;/code&gt; vs &lt;code&gt;var2&lt;/code&gt; for each group I might do the looping with &lt;code&gt;purrr::map()&lt;/code&gt; or &lt;code&gt;lapply()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Each element of the new list still has the grouping information attached via the list names.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;map(dat_list, ~lm(var1 ~ var2, data = .x) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $a
# 
# Call:
# lm(formula = var1 ~ var2, data = .x)
# 
# Coefficients:
# (Intercept)         var2  
#     2.64826      0.04396  
# 
# 
# $b
# 
# Call:
# lm(formula = var1 ~ var2, data = .x)
# 
# Coefficients:
# (Intercept)         var2  
#     3.80822     -0.02551  
# 
# 
# $c
# 
# Call:
# lm(formula = var1 ~ var2, data = .x)
# 
# Coefficients:
# (Intercept)         var2  
#     3.35241      0.03513&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I could also create a function that fits a model and then returns model output. For example, maybe what I really wanted to do is fit a linear model and extract &lt;span class=&#34;math inline&#34;&gt;\(R^2\)&lt;/span&gt; from each group’s model fit.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;r2 = function(data) {
     fit = lm(var1 ~ var2, data = data)
     
     broom::glance(fit)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The output of my &lt;code&gt;r2&lt;/code&gt; function, which uses &lt;code&gt;broom::glance()&lt;/code&gt;, is a data.frame.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;r2(data = dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# # A tibble: 1 x 11
#   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
#       &amp;lt;dbl&amp;gt;         &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
# 1    0.0292      -0.00550 0.867     0.841   0.367     2  -37.3  80.5  84.7
# # ... with 2 more variables: deviance &amp;lt;dbl&amp;gt;, df.residual &amp;lt;int&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since the function output is a data.frame, I can use &lt;code&gt;purrr::map_dfr()&lt;/code&gt; to combine the output per group into a single data.frame. The &lt;code&gt;.id&lt;/code&gt; argument creates a new variable to store the list names in the output.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;map_dfr(dat_list, r2, .id = &amp;quot;id&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# # A tibble: 3 x 12
#   id    r.squared adj.r.squared sigma statistic p.value    df logLik   AIC
#   &amp;lt;chr&amp;gt;     &amp;lt;dbl&amp;gt;         &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
# 1 a        0.0775       -0.0378 0.968     0.672   0.436     2  -12.7  31.5
# 2 b        0.0387       -0.0815 0.832     0.322   0.586     2  -11.2  28.5
# 3 c        0.0285       -0.0930 0.808     0.235   0.641     2  -10.9  27.9
# # ... with 3 more variables: BIC &amp;lt;dbl&amp;gt;, deviance &amp;lt;dbl&amp;gt;, df.residual &amp;lt;int&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;splitting-by-multiple-groups&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Splitting by multiple groups&lt;/h1&gt;
&lt;p&gt;It is possible to split data by multiple grouping variables in the &lt;code&gt;split()&lt;/code&gt; function. The grouping variables must be passed as a list.&lt;/p&gt;
&lt;p&gt;Here’s an example, using the built-in &lt;code&gt;mtcars&lt;/code&gt; dataset. I show only the first two list elements to demonstrate that the list names are now based on a combination of the values for the two groups. By default these values are separated by a &lt;code&gt;.&lt;/code&gt; (but see the &lt;code&gt;sep&lt;/code&gt; argument to control this).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mtcars_cylam = split(mtcars, list(mtcars$cyl, mtcars$am) )
mtcars_cylam[1:2]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $`4.0`
#                mpg cyl  disp hp drat    wt  qsec vs am gear carb
# Merc 240D     24.4   4 146.7 62 3.69 3.190 20.00  1  0    4    2
# Merc 230      22.8   4 140.8 95 3.92 3.150 22.90  1  0    4    2
# Toyota Corona 21.5   4 120.1 97 3.70 2.465 20.01  1  0    3    1
# 
# $`6.0`
#                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
# Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
# Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
# Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If not all combinations of groups are present, the &lt;code&gt;drop&lt;/code&gt; argument in &lt;code&gt;split()&lt;/code&gt; allows us to drop the missing combinations. By default, combinations that aren’t present are kept as 0-row data.frames.&lt;/p&gt;
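A quick sketch of the &lt;code&gt;drop&lt;/code&gt; behavior, removing the 4-cylinder automatic cars from &lt;code&gt;mtcars&lt;/code&gt; so one combination is absent:

```r
# Remove the cyl == 4, am == 0 combination
mtcars_sub = mtcars[mtcars$cyl != 4 | mtcars$am != 0, ]

# Default keeps a 0-row element for the absent combination
length(split(mtcars_sub, list(mtcars_sub$cyl, mtcars_sub$am)))

# drop = TRUE removes it
length(split(mtcars_sub, list(mtcars_sub$cyl, mtcars_sub$am), drop = TRUE))
```
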
&lt;/div&gt;
&lt;div id=&#34;other-thoughts-on-split&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Other thoughts on split()&lt;/h1&gt;
&lt;p&gt;I feel like &lt;code&gt;split()&lt;/code&gt; was a gateway function for me to get started working with lists and associated convenience functions like &lt;code&gt;lapply()&lt;/code&gt; and &lt;code&gt;purrr::map()&lt;/code&gt; for looping through lists. I think learning to work with lists and “list loops” also made the learning curve for &lt;a href=&#34;https://r4ds.had.co.nz/many-models.html#list-columns-1&#34;&gt;list-columns&lt;/a&gt; in data.frames and the &lt;code&gt;nest()&lt;/code&gt;/&lt;code&gt;unnest()&lt;/code&gt; approach of analysis-by-groups a little less steep for me.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;just-the-code-please&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Just the code, please&lt;/h1&gt;
&lt;p&gt;Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code &lt;a href=&#34;https://aosmith.rbind.io/script/2019-11-26-split-example.R&#34;&gt;from here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(purrr) # 0.3.3

dat = structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L), .Label = c(&amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;c&amp;quot;), class = &amp;quot;factor&amp;quot;), 
    var1 = c(4, 2.7, 3.4, 2.7, 4.6, 2.9, 2.2, 4.5, 4.6, 2.4, 
    3, 3.8, 2.5, 4, 3.6, 2.7, 4.5, 4.1, 4.2, 2.2, 4.9, 4.4, 3.6, 
    3.3, 2.7, 3.9, 4.9, 4.9, 4.3, 3.4), var2 = c(6, 22.3, 19.4, 
    22.8, 18.6, 14.2, 10.9, 22.7, 22.4, 11.7, 6, 13.3, 12.5, 
    6.3, 13.6, 20.5, 23.6, 10.9, 8.9, 20.9, 23.7, 15.9, 22.1, 
    11.6, 22, 17.7, 21, 20.8, 16.7, 21.4)), class = &amp;quot;data.frame&amp;quot;, row.names = c(NA, 
-30L))

head(dat)

dat_list = split(dat, dat$id)
dat_list

map(dat_list, ~lm(var1 ~ var2, data = .x) )

r2 = function(data) {
     fit = lm(var1 ~ var2, data = data)
     
     broom::glance(fit)
}
r2(data = dat)

map_dfr(dat_list, r2, .id = &amp;quot;id&amp;quot;)

mtcars_cylam = split(mtcars, list(mtcars$cyl, mtcars$am) )
mtcars_cylam[1:2]&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Making a background color gradient in ggplot2</title>
      <link>https://aosmith.rbind.io/2019/10/14/background-color_gradient/</link>
      <pubDate>Mon, 14 Oct 2019 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2019/10/14/background-color_gradient/</guid>
      <description>


&lt;p&gt;I was recently making some arrangements for the 2020 eclipse in South America, which of course got me thinking of the day we were lucky enough to have a path of totality come to us.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/img/dog_eclipse.png&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We have a weather station that records local temperature every 5 minutes, so after the eclipse I was able to plot the temperature change over the eclipse as we experienced it at our house. Here is an example of a basic plot I made at the time.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/img/eclipse_temp.png&#34; width=&#34;400px&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Looking at this now with new eyes, I see it might be nice to replace the gray rectangle with one that goes from light to dark to light as the eclipse progresses to totality and then back. I’ll show how I tackled making a gradient color background in this post.&lt;/p&gt;
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#load-r-packages&#34;&gt;Load R packages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-dataset&#34;&gt;The dataset&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#initial-plot&#34;&gt;Initial plot&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#adding-color-gradient-using-geom_segment&#34;&gt;Adding color gradient using geom_segment()&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#make-a-variable-for-the-color-mapping&#34;&gt;Make a variable for the color mapping&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#adding-one-geom_segment-per-second&#34;&gt;Adding one geom_segment() per second&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#switching-to-a-gray-scale&#34;&gt;Switching to a gray scale&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-segments-to-make-a-gradient-rectangle&#34;&gt;Using segments to make a gradient rectangle&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#bonus-annotations-with-curved-arrows&#34;&gt;Bonus: annotations with curved arrows&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#other-ways-to-make-a-gradient-color-background&#34;&gt;Other ways to make a gradient color background&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#eclipses&#34;&gt;Eclipses!&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#just-the-code-please&#34;&gt;Just the code, please&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;load-r-packages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Load R packages&lt;/h1&gt;
&lt;p&gt;I’ll load &lt;strong&gt;ggplot2&lt;/strong&gt; for plotting and &lt;strong&gt;dplyr&lt;/strong&gt; for data manipulation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2) # 3.2.1
library(dplyr) # 0.8.3&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;the-dataset&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The dataset&lt;/h1&gt;
&lt;p&gt;My weather station records the temperature in °Fahrenheit every 5 minutes. I downloaded the data from 6 AM to 12 PM local time and cleaned it up a bit. The date-times and temperature are in a dataset I named &lt;code&gt;temp&lt;/code&gt;. You can download this below if you’d like to play around with these data.&lt;/p&gt;
&lt;hr /&gt;
&lt;svg style=&#34;height:0.8em;top:.04em;position:relative;fill:#ee5863;&#34; viewBox=&#34;0 0 512 512&#34;&gt;
&lt;path d=&#34;M216 0h80c13.3 0 24 10.7 24 24v168h87.7c17.8 0 26.7 21.5 14.1 34.1L269.7 378.3c-7.5 7.5-19.8 7.5-27.3 0L90.1 226.1c-12.6-12.6-3.7-34.1 14.1-34.1H192V24c0-13.3 10.7-24 24-24zm296 376v112c0 13.3-10.7 24-24 24H24c-13.3 0-24-10.7-24-24V376c0-13.3 10.7-24 24-24h146.7l49 49c20.1 20.1 52.5 20.1 72.6 0l49-49H488c13.3 0 24 10.7 24 24zm-124 88c0-11-9-20-20-20s-20 9-20 20 9 20 20 20 20-9 20-20zm64 0c0-11-9-20-20-20s-20 9-20 20 9 20 20 20 20-9 20-20z&#34;/&gt;
&lt;/svg&gt;
&lt;a href=&#34;data:text/csv;base64,ImRhdGV0aW1lIiwidGVtcGYiDQoyMDE3LTA4LTIxIDA2OjAwOjAwLDU0LjkNCjIwMTctMDgtMjEgMDY6MDU6MDAsNTQuOQ0KMjAxNy0wOC0yMSAwNjoxMDowMCw1NC45DQoyMDE3LTA4LTIxIDA2OjE1OjAwLDU0LjkNCjIwMTctMDgtMjEgMDY6MjA6MDAsNTQuOQ0KMjAxNy0wOC0yMSAwNjoyNTowMCw1NC44DQoyMDE3LTA4LTIxIDA2OjMwOjAwLDU0LjcNCjIwMTctMDgtMjEgMDY6MzU6MDAsNTQuNg0KMjAxNy0wOC0yMSAwNjo0MDowMCw1NC42DQoyMDE3LTA4LTIxIDA2OjQ1OjAwLDU0LjUNCjIwMTctMDgtMjEgMDY6NTA6MDAsNTQuNg0KMjAxNy0wOC0yMSAwNjo1NTowMCw1NC42DQoyMDE3LTA4LTIxIDA3OjAwOjAwLDU0LjkNCjIwMTctMDgtMjEgMDc6MDU6MDAsNTQuOQ0KMjAxNy0wOC0yMSAwNzoxMDowMCw1NQ0KMjAxNy0wOC0yMSAwNzoxNTowMCw1NS4xDQoyMDE3LTA4LTIxIDA3OjIwOjAwLDU1LjQNCjIwMTctMDgtMjEgMDc6MjU6MDAsNTUuOQ0KMjAxNy0wOC0yMSAwNzozMDowMCw1Ni40DQoyMDE3LTA4LTIxIDA3OjM1OjAwLDU3DQoyMDE3LTA4LTIxIDA3OjQwOjAwLDU3LjcNCjIwMTctMDgtMjEgMDc6NDU6MDAsNTguMw0KMjAxNy0wOC0yMSAwNzo1MDowMCw1OS4xDQoyMDE3LTA4LTIxIDA3OjU1OjAwLDU5LjcNCjIwMTctMDgtMjEgMDg6MDA6MDAsNjAuNg0KMjAxNy0wOC0yMSAwODowNTowMCw2MS41DQoyMDE3LTA4LTIxIDA4OjEwOjAwLDYyLjQNCjIwMTctMDgtMjEgMDg6MTU6MDAsNjMuNA0KMjAxNy0wOC0yMSAwODoyMDowMCw2NC41DQoyMDE3LTA4LTIxIDA4OjI1OjAwLDY1LjUNCjIwMTctMDgtMjEgMDg6MzA6MDAsNjYuNQ0KMjAxNy0wOC0yMSAwODozNTowMCw2Ny4yDQoyMDE3LTA4LTIxIDA4OjQwOjAwLDY4DQoyMDE3LTA4LTIxIDA4OjQ1OjAwLDY4LjYNCjIwMTctMDgtMjEgMDg6NTA6MDAsNjkuNA0KMjAxNy0wOC0yMSAwODo1NTowMCw2OS45DQoyMDE3LTA4LTIxIDA5OjAwOjAwLDcwLjQNCjIwMTctMDgtMjEgMDk6MDU6MDAsNzAuOA0KMjAxNy0wOC0yMSAwOToxMDowMCw3MS4xDQoyMDE3LTA4LTIxIDA5OjE1OjAwLDcxLjMNCjIwMTctMDgtMjEgMDk6MjA6MDAsNzEuNA0KMjAxNy0wOC0yMSAwOToyNTowMCw3MS40DQoyMDE3LTA4LTIxIDA5OjMwOjAwLDcxLjMNCjIwMTctMDgtMjEgMDk6MzU6MDAsNzEuNA0KMjAxNy0wOC0yMSAwOTo0MDowMCw3MS4zDQoyMDE3LTA4LTIxIDA5OjQ1OjAwLDcxLjENCjIwMTctMDgtMjEgMDk6NTA6MDAsNzAuOQ0KMjAxNy0wOC0yMSAwOTo1NTowMCw3MC41DQoyMDE3LTA4LTIxIDEwOjAwOjAwLDY5LjkNCjIwMTctMDgtMjEgMTA6MDU6MDAsNjkuNQ0KMjAxNy0wOC0yMSAxMDoxMDowMCw2OC45DQoyMDE3LTA4LTIxIDEwOjE1OjAwLDY4LjMNCjIwMTctMDgtMjEgMTA6MjA6MDAsNjcuOA0KMjAxNy0wOC0yMSAxMDoyNTowMCw2Nw0KMjAxNy0wOC0yMSAxMDozMDowMCw2Ni4zDQoyMDE3LTA4LTIxIDEwOjM1OjAwLDY2DQoyMDE3LTA
4LTIxIDEwOjQwOjAwLDY2DQoyMDE3LTA4LTIxIDEwOjQ1OjAwLDY2LjINCjIwMTctMDgtMjEgMTA6NTA6MDAsNjYuOA0KMjAxNy0wOC0yMSAxMDo1NTowMCw2Ny4zDQoyMDE3LTA4LTIxIDExOjAwOjAwLDY4DQoyMDE3LTA4LTIxIDExOjA1OjAwLDY4LjUNCjIwMTctMDgtMjEgMTE6MTA6MDAsNjkuMg0KMjAxNy0wOC0yMSAxMToxNTowMCw3MA0KMjAxNy0wOC0yMSAxMToyMDowMCw3MC44DQoyMDE3LTA4LTIxIDExOjI1OjAwLDcxLjcNCjIwMTctMDgtMjEgMTE6MzA6MDAsNzIuNA0KMjAxNy0wOC0yMSAxMTozNTowMCw3Mi45DQoyMDE3LTA4LTIxIDExOjQwOjAwLDczLjUNCjIwMTctMDgtMjEgMTE6NDU6MDAsNzMuOQ0KMjAxNy0wOC0yMSAxMTo1MDowMCw3NC4yDQoyMDE3LTA4LTIxIDExOjU1OjAwLDc0LjQNCjIwMTctMDgtMjEgMTI6MDA6MDAsNzQuNg0K&#34; download=&#34;eclipse_temp.csv&#34;&gt;Download eclipse_temp.csv&lt;/a&gt;
&lt;hr /&gt;
&lt;p&gt;Here are the first six lines of this temperature dataset.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;head(temp)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# # A tibble: 6 x 2
#   datetime            tempf
#   &amp;lt;dttm&amp;gt;              &amp;lt;dbl&amp;gt;
# 1 2017-08-21 06:00:00  54.9
# 2 2017-08-21 06:05:00  54.9
# 3 2017-08-21 06:10:00  54.9
# 4 2017-08-21 06:15:00  54.9
# 5 2017-08-21 06:20:00  54.9
# 6 2017-08-21 06:25:00  54.8&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I also stored the start and end times of the eclipse and totality in data.frames, which I pulled for my location from &lt;a href=&#34;http://xjubier.free.fr/en/site_pages/solar_eclipses/TSE_2017_GoogleMapFull.html&#34;&gt;this website&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If following along at home, make sure your time zones match for all the date-time variables or, from personal experience 🤣, you’ll run into problems.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;eclipse = data.frame(start = as.POSIXct(&amp;quot;2017-08-21 09:05:10&amp;quot;),
                     end = as.POSIXct(&amp;quot;2017-08-21 11:37:19&amp;quot;) )

totality = data.frame(start = as.POSIXct(&amp;quot;2017-08-21 10:16:55&amp;quot;),
                      end = as.POSIXct(&amp;quot;2017-08-21 10:18:52&amp;quot;) )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;initial-plot&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Initial plot&lt;/h1&gt;
&lt;p&gt;I decided to make a plot of the temperature change during the eclipse only.&lt;/p&gt;
&lt;p&gt;To keep the temperature line looking continuous, even though it’s measured every 5 minutes, I subset the data to times close to but just outside the start and end of the eclipse.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plottemp = filter(temp, between(datetime, 
                                as.POSIXct(&amp;quot;2017-08-21 09:00:00&amp;quot;),
                                as.POSIXct(&amp;quot;2017-08-21 12:00:00&amp;quot;) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I then zoomed the plot to only include times encompassed by the eclipse with &lt;code&gt;coord_cartesian()&lt;/code&gt;. I removed the x axis expansion in &lt;code&gt;scale_x_datetime()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Since the plot covers only about two and a half hours, I made breaks on the x axis every 15 minutes.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(plottemp) +
     geom_line( aes(datetime, tempf), size = 1 ) +
     scale_x_datetime( date_breaks = &amp;quot;15 min&amp;quot;,
                       date_labels = &amp;quot;%H:%M&amp;quot;,
                       expand = c(0, 0) ) +
     coord_cartesian(xlim = c(eclipse$start, eclipse$end) ) +
     labs(y = expression( Temperature~(degree*F) ),
          x = NULL,
          title = &amp;quot;Temperature during 2017-08-21 solar eclipse&amp;quot;,
          subtitle = expression(italic(&amp;quot;Sapsucker Farm, 09:05:10 - 11:37:19 PDT&amp;quot;) ),
          caption = &amp;quot;Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds&amp;quot;
     ) +
     scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9 , 
                                            name =  expression( Temperature~(degree*C)),
                                            breaks = seq(16, 24, by = 1)) ) +
     theme_bw(base_size = 14) +
     theme(panel.grid = element_blank() ) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-10-14-color-gradient-background_files/figure-html/unnamed-chunk-5-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;adding-color-gradient-using-geom_segment&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Adding color gradient using geom_segment()&lt;/h1&gt;
&lt;p&gt;I wanted the background of the plot to go from light to dark back to light through time. This means a color gradient should go from left to right across the plot.&lt;/p&gt;
&lt;p&gt;Since the gradient will be based on time, I figured I could add a vertical line with &lt;code&gt;geom_segment()&lt;/code&gt; for every second of the eclipse and color each segment based on how far that time was from totality.&lt;/p&gt;
&lt;div id=&#34;make-a-variable-for-the-color-mapping&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Make a variable for the color mapping&lt;/h2&gt;
&lt;p&gt;The first step I took was to make a variable with a row for every second of the eclipse, since I wanted a segment drawn for each second. I used &lt;code&gt;seq.POSIXt&lt;/code&gt; for this.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;color_dat = data.frame(time = seq(eclipse$start, eclipse$end, by = &amp;quot;1 sec&amp;quot;) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then came some hard thinking. How would I make a continuous variable to map to color? 🤔&lt;/p&gt;
&lt;p&gt;While I don’t have an actual measurement of light throughout the eclipse, I can show the general idea of a light change with color by using a linear change in color from the start of the eclipse to totality and then another linear change in color from totality to the end of the eclipse.&lt;/p&gt;
&lt;p&gt;My first idea for creating a variable was to use information on the current time vs totality start/end. After subtracting the times before totality from totality start and subtracting totality end from times after totality, I realized that the amount of time before totality wasn’t actually the same as the amount of time after totality. Back to the drawing board.&lt;/p&gt;
&lt;p&gt;Since I was making a linear change in color, I realized I could make a sequence of values before totality and after totality that covered the same range but had a different total number of values. This would account for the difference in the length of time before and after totality. I ended up making a sequence going from 100 to 0 for times before totality and a sequence from 0 to 100 after totality. Times during totality were assigned a 0.&lt;/p&gt;
&lt;p&gt;Here’s one way to get these sequences, using &lt;code&gt;base::replace()&lt;/code&gt;. My dataset is in order by time, which is key to this working correctly.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;color_dat = mutate(color_dat,
                   color = 0,
                   color = replace(color, 
                                   time &amp;lt; totality$start, 
                                   seq(100, 0, length.out = sum(time &amp;lt; totality$start) ) ),
                   color = replace(color, 
                                   time &amp;gt; totality$end, 
                                   seq(0, 100, length.out = sum(time &amp;gt; totality$end) ) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;adding-one-geom_segment-per-second&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Adding one geom_segment() per second&lt;/h2&gt;
&lt;p&gt;Once I had my color variable I was ready to plot the segments along the x axis. Since the segments needed to go across the full height of the plot, I set &lt;code&gt;y&lt;/code&gt; and &lt;code&gt;yend&lt;/code&gt; to &lt;code&gt;-Inf&lt;/code&gt; and &lt;code&gt;Inf&lt;/code&gt;, respectively.&lt;/p&gt;
&lt;p&gt;I put this layer first to use it as a background that the temperature line was plotted on top of.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;g1 = ggplot(plottemp) +
     geom_segment(data = color_dat,
                  aes(x = time, xend = time,
                      y = -Inf, yend = Inf, color = color),
                  show.legend = FALSE) +
     geom_line( aes(datetime, tempf), size = 1 ) +
     scale_x_datetime( date_breaks = &amp;quot;15 min&amp;quot;,
                       date_labels = &amp;quot;%H:%M&amp;quot;,
                       expand = c(0, 0) ) +
     coord_cartesian(xlim = c(eclipse$start, eclipse$end) ) +
     labs(y = expression( Temperature~(degree*F) ),
          x = NULL,
          title = &amp;quot;Temperature during 2017-08-21 solar eclipse&amp;quot;,
          subtitle = expression(italic(&amp;quot;Sapsucker Farm, 09:05:10 - 11:37:19 PDT&amp;quot;) ),
          caption = &amp;quot;Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds&amp;quot;
     ) +
     scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9 , 
                                            name =  expression( Temperature~(degree*C)),
                                            breaks = seq(16, 24, by = 1)) ) +
     theme_bw(base_size = 14) +
     theme(panel.grid = element_blank() ) 

g1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-10-14-color-gradient-background_files/figure-html/unnamed-chunk-8-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;switching-to-a-gray-scale&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Switching to a gray scale&lt;/h2&gt;
&lt;p&gt;The default blue color scheme for the segments actually works OK, but I was picturing going from white to dark. I picked gray colors with &lt;code&gt;grDevices::gray.colors()&lt;/code&gt; in &lt;code&gt;scale_color_gradient()&lt;/code&gt;. In &lt;code&gt;gray.colors()&lt;/code&gt;, &lt;code&gt;0&lt;/code&gt; is black and &lt;code&gt;1&lt;/code&gt; is white. I didn’t want the colors to go all the way to black, since that would make the temperature line impossible to see during totality. And, of course, it’s not actually pitch black during totality. 😁&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;g1 + scale_color_gradient(low = gray.colors(1, 0.25),
                          high = gray.colors(1, 1) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-10-14-color-gradient-background_files/figure-html/unnamed-chunk-9-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;using-segments-to-make-a-gradient-rectangle&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using segments to make a gradient rectangle&lt;/h1&gt;
&lt;p&gt;I can use this same approach on only a portion of the x axis to give the appearance of a rectangle with gradient fill. Here’s an example using times outside the eclipse.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;g2 = ggplot(temp) +
     geom_segment(data = color_dat,
                  aes(x = time, xend = time,
                      y = -Inf, yend = Inf, color = color),
                  show.legend = FALSE) +
     geom_line( aes(datetime, tempf), size = 1 ) +
     scale_x_datetime( date_breaks = &amp;quot;1 hour&amp;quot;,
                       date_labels = &amp;quot;%H:%M&amp;quot;,
                       expand = c(0, 0) ) +
     labs(y = expression( Temperature~(degree*F) ),
          x = NULL,
          title = &amp;quot;Temperature during 2017-08-21 solar eclipse&amp;quot;,
          subtitle = expression(italic(&amp;quot;Sapsucker Farm, Dallas, OR, USA&amp;quot;) ),
          caption = &amp;quot;Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds&amp;quot;
     ) +
     scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9 , 
                                            name =  expression( Temperature~(degree*C)),
                                            breaks = seq(12, 24, by = 2)) ) +
     scale_color_gradient(low = gray.colors(1, .25),
                          high = gray.colors(1, 1) ) +
     theme_bw(base_size = 14) +
     theme(panel.grid.major.x = element_blank(),
           panel.grid.minor = element_blank() ) 

g2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-10-14-color-gradient-background_files/figure-html/unnamed-chunk-10-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;bonus-annotations-with-curved-arrows&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Bonus: annotations with curved arrows&lt;/h2&gt;
&lt;p&gt;This second plot gives me a chance to try out Cédric Scherer’s &lt;a href=&#34;https://cedricscherer.netlify.com/2019/05/17/the-evolution-of-a-ggplot-ep.-1/#text&#34;&gt;very cool curved annotation arrow idea&lt;/a&gt; for the first time 🎉.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;g2 = g2 + 
     annotate(&amp;quot;text&amp;quot;, x = as.POSIXct(&amp;quot;2017-08-21 08:00&amp;quot;),
              y = 74, 
              label = &amp;quot;Partial eclipse begins\n09:05:10 PDT&amp;quot;,
              color = &amp;quot;grey24&amp;quot;) +
     annotate(&amp;quot;text&amp;quot;, x = as.POSIXct(&amp;quot;2017-08-21 09:00&amp;quot;),
              y = 57, 
              label = &amp;quot;Totality begins\n10:16:55 PDT&amp;quot;,
              color = &amp;quot;grey24&amp;quot;)
g2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-10-14-color-gradient-background_files/figure-html/unnamed-chunk-11-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;I’ll make a data.frame for the arrow start/end positions. I’m skipping all the work it took to get the positions where I wanted them, which is always iterative for me.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;arrows = data.frame(x1 = as.POSIXct( c(&amp;quot;2017-08-21 08:35&amp;quot;,
                                      &amp;quot;2017-08-21 09:34&amp;quot;) ),
                    x2 = c(eclipse$start, totality$start),
                    y1 = c(74, 57.5),
                    y2 = c(72.5, 60) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I add arrows with &lt;code&gt;geom_curve()&lt;/code&gt;. I changed the size of the arrowhead and made it closed in &lt;code&gt;arrow()&lt;/code&gt;. I also thought the arrows looked better with a little less curvature.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;g2 +
     geom_curve(data = arrows,
                aes(x = x1, xend = x2,
                    y = y1, yend = y2),
                arrow = arrow(length = unit(0.075, &amp;quot;inches&amp;quot;),
                              type = &amp;quot;closed&amp;quot;),
                curvature = 0.25)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-10-14-color-gradient-background_files/figure-html/unnamed-chunk-13-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;other-ways-to-make-a-gradient-color-background&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Other ways to make a gradient color background&lt;/h1&gt;
&lt;p&gt;Based on a bunch of internet searches, it looks like a gradient background in &lt;strong&gt;ggplot2&lt;/strong&gt; generally takes some work. There are some nice examples out there on how to use &lt;code&gt;rasterGrob()&lt;/code&gt; and &lt;code&gt;annotation_custom()&lt;/code&gt; to add background gradients, such as &lt;a href=&#34;https://stackoverflow.com/questions/48596582/change-orientation-of-grob-background-gradient&#34;&gt;in this Stack Overflow question&lt;/a&gt;. I haven’t researched how to make this go from light to dark and back to light for the uneven time scale like in my example.&lt;/p&gt;
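&lt;p&gt;As a rough sketch of that grob-based approach (a hypothetical example, not from my analysis above), a vector of colors can be stretched across the whole panel with &lt;code&gt;grid::rasterGrob()&lt;/code&gt; plus &lt;code&gt;annotation_custom()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(grid) # for rasterGrob()

# A vector of colors is treated as a one-column raster,
# so interpolating it while filling the panel gives a vertical gradient
gradient = rasterGrob(gray.colors(100, start = 1, end = 0.25),
                      width = unit(1, &amp;quot;npc&amp;quot;),
                      height = unit(1, &amp;quot;npc&amp;quot;),
                      interpolate = TRUE)

ggplot(plottemp) +
     annotation_custom(gradient, xmin = -Inf, xmax = Inf,
                       ymin = -Inf, ymax = Inf) +
     geom_line( aes(datetime, tempf), size = 1 )&lt;/code&gt;&lt;/pre&gt;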
&lt;p&gt;I’ve also seen approaches involving dataset expansion and drawing many filled rectangles or using rasters, which is like what I did with &lt;code&gt;geom_segment()&lt;/code&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;eclipses&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Eclipses!&lt;/h1&gt;
&lt;p&gt;Before actually experiencing totality, it seemed to me like the difference between a 99% and a 100% eclipse wasn’t a big deal. I mean, those numbers &lt;em&gt;are&lt;/em&gt; pretty darn close.&lt;/p&gt;
&lt;p&gt;I was very wrong. 😜&lt;/p&gt;
&lt;p&gt;If you are ever lucky enough to be near a path of totality, definitely try to get there, even if it’s a little more trouble than the 99.9% partial eclipse. It’s an amazing experience. 😻&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/img/eclipse.png&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;just-the-code-please&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Just the code, please&lt;/h1&gt;
&lt;p&gt;Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code &lt;a href=&#34;https://aosmith.rbind.io/script/2019-10-14-color-gradient-background.R&#34;&gt;from here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2) # 3.2.1
library(dplyr) # 0.8.3

head(temp)
eclipse = data.frame(start = as.POSIXct(&amp;quot;2017-08-21 09:05:10&amp;quot;),
                     end = as.POSIXct(&amp;quot;2017-08-21 11:37:19&amp;quot;) )

totality = data.frame(start = as.POSIXct(&amp;quot;2017-08-21 10:16:55&amp;quot;),
                      end = as.POSIXct(&amp;quot;2017-08-21 10:18:52&amp;quot;) )

plottemp = filter(temp, between(datetime, 
                                as.POSIXct(&amp;quot;2017-08-21 09:00:00&amp;quot;),
                                as.POSIXct(&amp;quot;2017-08-21 12:00:00&amp;quot;) ) )
ggplot(plottemp) +
     geom_line( aes(datetime, tempf), size = 1 ) +
     scale_x_datetime( date_breaks = &amp;quot;15 min&amp;quot;,
                       date_labels = &amp;quot;%H:%M&amp;quot;,
                       expand = c(0, 0) ) +
     coord_cartesian(xlim = c(eclipse$start, eclipse$end) ) +
     labs(y = expression( Temperature~(degree*F) ),
          x = NULL,
          title = &amp;quot;Temperature during 2017-08-21 solar eclipse&amp;quot;,
          subtitle = expression(italic(&amp;quot;Sapsucker Farm, 09:05:10 - 11:37:19 PDT&amp;quot;) ),
          caption = &amp;quot;Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds&amp;quot;
     ) +
     scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9 , 
                                            name =  expression( Temperature~(degree*C)),
                                            breaks = seq(16, 24, by = 1)) ) +
     theme_bw(base_size = 14) +
     theme(panel.grid = element_blank() ) 
color_dat = data.frame(time = seq(eclipse$start, eclipse$end, by = &amp;quot;1 sec&amp;quot;) )
color_dat = mutate(color_dat,
                   color = 0,
                   color = replace(color, 
                                   time &amp;lt; totality$start, 
                                   seq(100, 0, length.out = sum(time &amp;lt; totality$start) ) ),
                   color = replace(color, 
                                   time &amp;gt; totality$end, 
                                   seq(0, 100, length.out = sum(time &amp;gt; totality$end) ) ) )
g1 = ggplot(plottemp) +
     geom_segment(data = color_dat,
                  aes(x = time, xend = time,
                      y = -Inf, yend = Inf, color = color),
                  show.legend = FALSE) +
     geom_line( aes(datetime, tempf), size = 1 ) +
     scale_x_datetime( date_breaks = &amp;quot;15 min&amp;quot;,
                       date_labels = &amp;quot;%H:%M&amp;quot;,
                       expand = c(0, 0) ) +
     coord_cartesian(xlim = c(eclipse$start, eclipse$end) ) +
     labs(y = expression( Temperature~(degree*F) ),
          x = NULL,
          title = &amp;quot;Temperature during 2017-08-21 solar eclipse&amp;quot;,
          subtitle = expression(italic(&amp;quot;Sapsucker Farm, 09:05:10 - 11:37:19 PDT&amp;quot;) ),
          caption = &amp;quot;Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds&amp;quot;
     ) +
     scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9 , 
                                            name =  expression( Temperature~(degree*C)),
                                            breaks = seq(16, 24, by = 1)) ) +
     theme_bw(base_size = 14) +
     theme(panel.grid = element_blank() ) 

g1

g1 + scale_color_gradient(low = gray.colors(1, 0.25),
                          high = gray.colors(1, 1) )
g2 = ggplot(temp) +
     geom_segment(data = color_dat,
                  aes(x = time, xend = time,
                      y = -Inf, yend = Inf, color = color),
                  show.legend = FALSE) +
     geom_line( aes(datetime, tempf), size = 1 ) +
     scale_x_datetime( date_breaks = &amp;quot;1 hour&amp;quot;,
                       date_labels = &amp;quot;%H:%M&amp;quot;,
                       expand = c(0, 0) ) +
     labs(y = expression( Temperature~(degree*F) ),
          x = NULL,
          title = &amp;quot;Temperature during 2017-08-21 solar eclipse&amp;quot;,
          subtitle = expression(italic(&amp;quot;Sapsucker Farm, Dallas, OR, USA&amp;quot;) ),
          caption = &amp;quot;Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds&amp;quot;
     ) +
     scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9 , 
                                            name =  expression( Temperature~(degree*C)),
                                            breaks = seq(12, 24, by = 2)) ) +
     scale_color_gradient(low = gray.colors(1, .25),
                          high = gray.colors(1, 1) ) +
     theme_bw(base_size = 14) +
     theme(panel.grid.major.x = element_blank(),
           panel.grid.minor = element_blank() ) 

g2
g2 = g2 + 
     annotate(&amp;quot;text&amp;quot;, x = as.POSIXct(&amp;quot;2017-08-21 08:00&amp;quot;),
              y = 74, 
              label = &amp;quot;Partial eclipse begins\n09:05:10 PDT&amp;quot;,
              color = &amp;quot;grey24&amp;quot;) +
     annotate(&amp;quot;text&amp;quot;, x = as.POSIXct(&amp;quot;2017-08-21 09:00&amp;quot;),
              y = 57, 
              label = &amp;quot;Totality begins\n10:16:55 PDT&amp;quot;,
              color = &amp;quot;grey24&amp;quot;)
g2

arrows = data.frame(x1 = as.POSIXct( c(&amp;quot;2017-08-21 08:35&amp;quot;,
                                      &amp;quot;2017-08-21 09:34&amp;quot;) ),
                    x2 = c(eclipse$start, totality$start),
                    y1 = c(74, 57.5),
                    y2 = c(72.5, 60) )
g2 +
     geom_curve(data = arrows,
                aes(x = x1, xend = x2,
                    y = y1, yend = y2),
                arrow = arrow(length = unit(0.075, &amp;quot;inches&amp;quot;),
                              type = &amp;quot;closed&amp;quot;),
                curvature = 0.25)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Expanding binomial counts to binary 0/1 with purrr::pmap()</title>
      <link>https://aosmith.rbind.io/2019/10/04/expanding-binomial-to-binary/</link>
      <pubDate>Fri, 04 Oct 2019 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2019/10/04/expanding-binomial-to-binary/</guid>
      <description>


&lt;p&gt;Data on successes and failures can be summarized and analyzed as counted proportions via the binomial distribution or as long format 0/1 binary data. I most often see summarized data when there are multiple trials done within a study unit; for example, when tallying up the number of dead trees out of the total number of trees in a plot.&lt;/p&gt;
&lt;p&gt;If these within-plot trials are all independent, analyzing data in a binary format instead of summarized binomial counts doesn’t change the statistical results. If trials are not independent, though, neither approach works correctly and we would see overdispersion/underdispersion in a binomial model. The confusing piece in this is that binary data &lt;a href=&#34;https://stat.ethz.ch/pipermail/r-sig-mixed-models/2011q1/015534.html&#34;&gt;by definition can’t be overdispersed&lt;/a&gt; and so the lack of fit from non-independence can’t be diagnosed with standard overdispersion checks when working with binary data.&lt;/p&gt;
&lt;p&gt;In a future post I’ll talk more about simulating data to explore binomial overdispersion and how lack of fit can be diagnosed in binomial vs binary datasets. Today, however, my goal is to show how to take binomial count data and expand it into binary data.&lt;/p&gt;
&lt;p&gt;In the past I’ve done the data expansion with &lt;code&gt;rowwise()&lt;/code&gt; and &lt;code&gt;do()&lt;/code&gt; from package &lt;strong&gt;dplyr&lt;/strong&gt;, but these days I’m using &lt;code&gt;purrr::pmap_dfr()&lt;/code&gt;. I’ll demonstrate the &lt;code&gt;pmap_dfr()&lt;/code&gt; approach as well as a &lt;code&gt;nest()&lt;/code&gt;/&lt;code&gt;unnest()&lt;/code&gt; approach using functions from &lt;strong&gt;tidyr&lt;/strong&gt;.&lt;/p&gt;
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#load-r-packages&#34;&gt;Load R packages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-dataset&#34;&gt;The dataset&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#expanding-binomial-to-binary-with-pmap_dfr&#34;&gt;Expanding binomial to binary with pmap_dfr()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#aside-pmap-functions-with-more-columns&#34;&gt;Aside: pmap functions with more columns&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#comparing-analysis-results&#34;&gt;Comparing analysis results&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#binomial-model&#34;&gt;Binomial model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#binary-model&#34;&gt;Binary model&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#expanding-binomial-to-binary-via-nesting&#34;&gt;Expanding binomial to binary via nesting&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#just-the-code-please&#34;&gt;Just the code, please&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;load-r-packages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Load R packages&lt;/h1&gt;
&lt;p&gt;I’m using &lt;strong&gt;purrr&lt;/strong&gt; for looping through rows with &lt;code&gt;pmap_dfr()&lt;/code&gt;. I also load &lt;strong&gt;dplyr&lt;/strong&gt; and &lt;strong&gt;tidyr&lt;/strong&gt; for a &lt;code&gt;nest()&lt;/code&gt;/&lt;code&gt;unnest()&lt;/code&gt; approach.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(purrr) # 0.3.2
library(tidyr) # 1.0.0
library(dplyr) # 0.8.3&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;the-dataset&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The dataset&lt;/h1&gt;
&lt;p&gt;I created a dataset with a total of 8 plots, 4 plots in each of two groups. The data has been summarized up to the plot level. The number of trials (&lt;code&gt;total&lt;/code&gt;) per plot varied. The number of successes observed is in &lt;code&gt;num_dead&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat = structure(list(plot = structure(1:8, .Label = c(&amp;quot;plot1&amp;quot;, &amp;quot;plot2&amp;quot;, 
&amp;quot;plot3&amp;quot;, &amp;quot;plot4&amp;quot;, &amp;quot;plot5&amp;quot;, &amp;quot;plot6&amp;quot;, &amp;quot;plot7&amp;quot;, &amp;quot;plot8&amp;quot;), class = &amp;quot;factor&amp;quot;), 
    group = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c(&amp;quot;g1&amp;quot;, 
    &amp;quot;g2&amp;quot;), class = &amp;quot;factor&amp;quot;), num_dead = c(4L, 6L, 6L, 5L, 1L, 4L, 
    3L, 2L), total = c(5L, 7L, 9L, 7L, 8L, 10L, 10L, 7L)), class = &amp;quot;data.frame&amp;quot;, row.names = c(NA, 
-8L))

dat&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#    plot group num_dead total
# 1 plot1    g1        4     5
# 2 plot2    g1        6     7
# 3 plot3    g1        6     9
# 4 plot4    g1        5     7
# 5 plot5    g2        1     8
# 6 plot6    g2        4    10
# 7 plot7    g2        3    10
# 8 plot8    g2        2     7&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;expanding-binomial-to-binary-with-pmap_dfr&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Expanding binomial to binary with pmap_dfr()&lt;/h1&gt;
&lt;p&gt;To make the binomial data into binary data, I need to make a vector with a &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt; for every “success” listed in &lt;code&gt;num_dead&lt;/code&gt; and a &lt;span class=&#34;math inline&#34;&gt;\(0\)&lt;/span&gt; for every “failure” (the total number of trials minus the number of successes). Since I want to do a &lt;em&gt;rowwise&lt;/em&gt; operation I’ll use one of the &lt;code&gt;pmap&lt;/code&gt; functions. I want my output to be a data.frame so I use &lt;code&gt;pmap_dfr()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I use an anonymous function within &lt;code&gt;pmap_dfr()&lt;/code&gt; for creating the output I want from each row. I purposely make the names of the function arguments match the column names. You can either match on position or on names in &lt;code&gt;pmap&lt;/code&gt; functions, and I tend to go for name matching. You can use formula coding with the tilde in &lt;code&gt;pmap&lt;/code&gt; variants, but I find the code more difficult to understand when I have more than three or so columns.&lt;/p&gt;
&lt;p&gt;Within the function I make a column for the response variable, repeating &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt; &lt;code&gt;num_dead&lt;/code&gt; times and &lt;span class=&#34;math inline&#34;&gt;\(0\)&lt;/span&gt; &lt;code&gt;total - num_dead&lt;/code&gt; times for each row of the original data. I’m taking advantage of &lt;a href=&#34;http://www.hep.by/gnu/r-patched/r-lang/R-lang_41.html&#34;&gt;recycling&lt;/a&gt; in &lt;code&gt;data.frame()&lt;/code&gt; to keep the &lt;code&gt;plot&lt;/code&gt; and &lt;code&gt;group&lt;/code&gt; columns in the output, as well.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;binary_dat = pmap_dfr(dat, 
                      function(group, plot, num_dead, total) {
                           data.frame(plot = plot,
                                      group = group,
                                      dead = c( rep(1, num_dead),
                                                rep(0, total - num_dead) ) )
                      }
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here are the first 6 rows of this new dataset. You can see for the first plot, &lt;code&gt;plot1&lt;/code&gt;, there are five rows of &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt; and one row of &lt;span class=&#34;math inline&#34;&gt;\(0\)&lt;/span&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;head(binary_dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#    plot group dead
# 1 plot1    g1    1
# 2 plot1    g1    1
# 3 plot1    g1    1
# 4 plot1    g1    1
# 5 plot1    g1    0
# 6 plot2    g1    1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This matches the information in the first row of the original dataset.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat[1, ]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#    plot group num_dead total
# 1 plot1    g1        4     5&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;aside-pmap-functions-with-more-columns&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Aside: pmap functions with more columns&lt;/h1&gt;
&lt;p&gt;My anonymous function in &lt;code&gt;pmap_dfr()&lt;/code&gt; works fine in its current form as long as every column is included as a function argument. If I had extra columns that I didn’t want to remove and wasn’t using in the function, however, I would get an error.&lt;/p&gt;
&lt;p&gt;To bypass this problem you can add dots, &lt;code&gt;...&lt;/code&gt;, to the anonymous function to refer to all other columns not being used.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;function(group, plot, num_dead, total, ...)&lt;/code&gt;&lt;/pre&gt;
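&lt;p&gt;For example, if I add a hypothetical extra column to the dataset, the dots absorb it and the expansion runs without error:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat_extra = dplyr::mutate(dat, site = &amp;quot;siteA&amp;quot;) # extra, unused column

binary_dat_extra = pmap_dfr(dat_extra, 
                            function(group, plot, num_dead, total, ...) {
                                 data.frame(plot = plot,
                                            group = group,
                                            dead = c( rep(1, num_dead),
                                                      rep(0, total - num_dead) ) )
                            }
)&lt;/code&gt;&lt;/pre&gt;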
&lt;/div&gt;
&lt;div id=&#34;comparing-analysis-results&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Comparing analysis results&lt;/h1&gt;
&lt;p&gt;While I definitely learned that binomial data analyzed in binary format gives identical results when fit as a GLM, for some reason I often have to re-convince myself this is true. 😜 This is clear when I do an analysis with each dataset and compare results.&lt;/p&gt;
&lt;div id=&#34;binomial-model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Binomial model&lt;/h2&gt;
&lt;p&gt;Here’s results from comparing the two groups for the binomial model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit = glm( cbind(num_dead, total - num_dead) ~ group, 
           data = dat,
           family = binomial)
summary(fit)$coefficients&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#              Estimate Std. Error   z value     Pr(&amp;gt;|z|)
# (Intercept)  1.098612  0.4364358  2.517237 0.0118279240
# groupg2     -2.014903  0.5748706 -3.504968 0.0004566621&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;binary-model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Binary model&lt;/h2&gt;
&lt;p&gt;The binary model gives identical results for estimates and statistical tests.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit_binary = glm( dead ~ group, 
                  data = binary_dat,
                  family = binomial)
summary(fit_binary)$coefficients&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#              Estimate Std. Error   z value     Pr(&amp;gt;|z|)
# (Intercept)  1.098612  0.4364354  2.517239 0.0118278514
# groupg2     -2.014903  0.5748701 -3.504971 0.0004566575&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;expanding-binomial-to-binary-via-nesting&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Expanding binomial to binary via nesting&lt;/h1&gt;
&lt;p&gt;Doing the expansion with nesting plus &lt;code&gt;purrr::map()&lt;/code&gt; inside &lt;code&gt;mutate()&lt;/code&gt; is another option, although this seems less straightforward to me for this particular case. It does keep the other variables in the dataset, though, without having to manually include them in the output data.frame like I did above.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;binary_dat2 = dat %&amp;gt;%
     nest(data = c(num_dead, total) ) %&amp;gt;%
     mutate(dead = map(data, ~c( rep(1, .x$num_dead),
                                 rep(0, .x$total - .x$num_dead) ) ) ) %&amp;gt;%
     select(-data) %&amp;gt;%
     unnest(dead)
head(binary_dat2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# # A tibble: 6 x 3
#   plot  group  dead
#   &amp;lt;fct&amp;gt; &amp;lt;fct&amp;gt; &amp;lt;dbl&amp;gt;
# 1 plot1 g1        1
# 2 plot1 g1        1
# 3 plot1 g1        1
# 4 plot1 g1        1
# 5 plot1 g1        0
# 6 plot2 g1        1&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;just-the-code-please&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Just the code, please&lt;/h1&gt;
&lt;p&gt;Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code &lt;a href=&#34;https://aosmith.rbind.io/script/2019-10-04-expanding-binomial-to-binary.R&#34;&gt;from here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(purrr) # 0.3.2
library(tidyr) # 1.0.0
library(dplyr) # 0.8.3

dat = structure(list(plot = structure(1:8, .Label = c(&amp;quot;plot1&amp;quot;, &amp;quot;plot2&amp;quot;, 
&amp;quot;plot3&amp;quot;, &amp;quot;plot4&amp;quot;, &amp;quot;plot5&amp;quot;, &amp;quot;plot6&amp;quot;, &amp;quot;plot7&amp;quot;, &amp;quot;plot8&amp;quot;), class = &amp;quot;factor&amp;quot;), 
    group = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c(&amp;quot;g1&amp;quot;, 
    &amp;quot;g2&amp;quot;), class = &amp;quot;factor&amp;quot;), num_dead = c(4L, 6L, 6L, 5L, 1L, 4L, 
    3L, 2L), total = c(5L, 7L, 9L, 7L, 8L, 10L, 10L, 7L)), class = &amp;quot;data.frame&amp;quot;, row.names = c(NA, 
-8L))

dat

binary_dat = pmap_dfr(dat, 
                      function(group, plot, num_dead, total) {
                           data.frame(plot = plot,
                                      group = group,
                                      dead = c( rep(1, num_dead),
                                                rep(0, total - num_dead) ) )
                      }
)

head(binary_dat)
dat[1, ]

function(group, plot, num_dead, total, ...)
     
fit = glm( cbind(num_dead, total - num_dead) ~ group, 
           data = dat,
           family = binomial)
summary(fit)$coefficients

fit_binary = glm( dead ~ group, 
                  data = binary_dat,
                  family = binomial)
summary(fit_binary)$coefficients

binary_dat2 = dat %&amp;gt;%
     nest(data = c(num_dead, total) ) %&amp;gt;%
     mutate(dead = map(data, ~c( rep(1, .x$num_dead),
                                 rep(0, .x$total - .x$num_dead) ) ) ) %&amp;gt;%
     select(-data) %&amp;gt;%
     unnest(dead)
head(binary_dat2)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>More exploratory plots with ggplot2 and purrr: Adding conditional elements</title>
      <link>https://aosmith.rbind.io/2019/09/27/more-exploratory-plots/</link>
      <pubDate>Fri, 27 Sep 2019 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2019/09/27/more-exploratory-plots/</guid>
      <description>


&lt;p&gt;This summer I was asked to collaborate on an analysis project with many response variables. As usual, I planned on automating my initial graphical data exploration through the use of functions and &lt;code&gt;purrr::map()&lt;/code&gt; as &lt;a href=&#34;https://aosmith.rbind.io/2018/08/20/automating-exploratory-plots/&#34;&gt;I’ve written about previously&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;However, this particular project was a follow-up to a previous analysis. In the original analysis, different variables were analyzed on different scales. I wanted to put the new plots on whatever scale each variable had been analyzed in originally. If I was going to automate the plotting, which I definitely wanted to do with so many variables 😄, I needed to add conditional elements.&lt;/p&gt;
&lt;p&gt;This post demonstrates how I used &lt;code&gt;if()&lt;/code&gt; statements within my plotting function to use different plotting elements depending on which variable I was plotting.&lt;/p&gt;
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#r-packages&#34;&gt;R packages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-data&#34;&gt;The data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#initial-plotting-function&#34;&gt;Initial plotting function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#adding-a-conditional-axis-scale&#34;&gt;Adding a conditional axis scale&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#adding-a-conditional-caption&#34;&gt;Adding a conditional caption&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#looping-through-the-variables&#34;&gt;Looping through the variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#just-the-code-please&#34;&gt;Just the code, please&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;r-packages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;R packages&lt;/h1&gt;
&lt;p&gt;I will use &lt;strong&gt;ggplot2&lt;/strong&gt; for plotting and &lt;strong&gt;purrr&lt;/strong&gt; for looping through variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2) # v. 3.2.1
library(purrr) # v. 0.3.2&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;the-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The data&lt;/h1&gt;
&lt;p&gt;My simplified example dataset, &lt;code&gt;dat&lt;/code&gt;, contains three response variables, &lt;code&gt;cov_plant&lt;/code&gt;, &lt;code&gt;cov_oth&lt;/code&gt;, and &lt;code&gt;gap&lt;/code&gt;. I created two categorical explanatory variables, &lt;code&gt;year&lt;/code&gt; with 3 levels and &lt;code&gt;trt&lt;/code&gt; with two levels.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat = structure(list(cov_plant = c(3.7, 1.8, 7.5, 0.4, 7.9, 1.2, 0.7, 
2.3, 6.9, 4.1, 17.7, 2.4, 0.9, 14.3, 4.9, 0, 4.1, 3.6, 1.1, 7.7, 
0, 1.5, 1.7, 11.5, 0.8, 12.3, 7.1, 6.9, 5.6, 2.7, 1, 2.5, 2, 
0.7, 0.7, 2.9, 4, 2.5, 2.9, 1.5, 0.5, 22.8, 2.8, 1.4, 1, 2.9, 
2.4, 4.1, 4.1, 1.9, 2.8, 5, 5.7, 5.6, 0, 4.6, 8.1, 0.5, 88.9, 
1), cov_oth = c(11.5, 63.2, 34, 65.5, 28.8, 8.6, 7.1, 65.5, 12.1, 
3, 23.6, 3.8, 24.9, 55.9, 24.2, 78.2, 81.1, 10.7, 30.7, 23.5, 
10.1, 4.6, 45.7, 37.6, 81.3, 39.1, 50.8, 75.8, 78.2, 23.9, 53, 
51.1, 2.5, 40.2, 15.9, 91.3, 44, 72.9, 82.7, 42.4, 94.1, 23, 
86.2, 50.1, 88.9, 80.5, 34.2, 68.7, 45, 13.9, 44.2, 85, 79.6, 
1, 45.3, 69.5, 89.6, 22.2, 1.3, 88), gap = c(2.8, 11.8, 0.3, 
17.2, 18.3, 1.4, 19.6, 19.4, 2.6, 66.3, 97.1, 17, 381.5, 15.7, 
8.3, 2.4, 3.8, 3.8, 246.6, 43.2, 16.7, 6.6, 3.1, 2.4, 3.2, 4.3, 
0.3, 2.1, 41.7, 68.9, 5.1, 5.7, 0.4, 35.5, 1.1, 10.8, 5, 11.8, 
75.5, 5.4, 12.6, 5.2, 11.4, 6.8, 5.3, 1.1, 3.2, 2.9, 5.2, 0.2, 
1.5, 0.6, 7.4, 18.6, 11.7, 1.6, 13.7, 7.1, 19.9, 16.8), year = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c(&amp;quot;Year 1&amp;quot;, 
&amp;quot;Year 2&amp;quot;, &amp;quot;Year 3&amp;quot;), class = &amp;quot;factor&amp;quot;), trt = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c(&amp;quot;a&amp;quot;, 
&amp;quot;b&amp;quot;), class = &amp;quot;factor&amp;quot;)), row.names = c(NA, -60L), class = &amp;quot;data.frame&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here are the first six rows of this dataset.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;head(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#   cov_plant cov_oth  gap   year trt
# 1       3.7    11.5  2.8 Year 1   a
# 2       1.8    63.2 11.8 Year 1   a
# 3       7.5    34.0  0.3 Year 1   a
# 4       0.4    65.5 17.2 Year 1   a
# 5       7.9    28.8 18.3 Year 1   a
# 6       1.2     8.6  1.4 Year 1   a&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I spent a fair amount of time sleuthing out which variables were used and which transformations were done in the original analysis. It turned out many (but not all) of the variables in that analysis were log transformed. Variables that contained 0 values were shifted by a fixed constant prior to the log transformation. A different constant was used for different variables.&lt;/p&gt;
&lt;p&gt;I made a dataset of variable metadata to help me keep all this information organized. This dataset contains a row for each variable along with a description of what that variable was, the units the variable was measured in, the transformation used for analysis, and the constant used to shift the variable. I’ll call this dataset &lt;code&gt;resp_dat&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This metadata dataset was key in adding conditional elements to my plotting function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;resp_dat = structure(list(variable = structure(c(2L, 1L, 3L), .Label = c(&amp;quot;cov_oth&amp;quot;, 
&amp;quot;cov_plant&amp;quot;, &amp;quot;gap&amp;quot;), class = &amp;quot;factor&amp;quot;), description = structure(3:1, .Label = c(&amp;quot;Gap size&amp;quot;, 
&amp;quot;Other cover&amp;quot;, &amp;quot;Plant cover&amp;quot;), class = &amp;quot;factor&amp;quot;), units = structure(c(1L, 
1L, 2L), .Label = c(&amp;quot;%&amp;quot;, &amp;quot;m&amp;quot;), class = &amp;quot;factor&amp;quot;), transformation = structure(c(2L, 
1L, 2L), .Label = c(&amp;quot;identity&amp;quot;, &amp;quot;log&amp;quot;), class = &amp;quot;factor&amp;quot;), constant = c(0.3, 
0, 0)), class = &amp;quot;data.frame&amp;quot;, row.names = c(NA, -3L))

resp_dat&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#    variable description units transformation constant
# 1 cov_plant Plant cover     %            log      0.3
# 2   cov_oth Other cover     %       identity      0.0
# 3       gap    Gap size     m            log      0.0&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;initial-plotting-function&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Initial plotting function&lt;/h1&gt;
&lt;p&gt;I set up my initial plotting function to make a scatterplot of the raw data per &lt;code&gt;year&lt;/code&gt; for each &lt;code&gt;trt&lt;/code&gt;. Different &lt;code&gt;trt&lt;/code&gt; were indicated by shapes and colors, and I added group means as larger symbols connected by lines.&lt;/p&gt;
&lt;p&gt;You can see that my plotting code ended up fairly complicated. I’m skipping the (many!) steps it took to get to this point. While I don’t show the process here, you can rest assured that I did a &lt;strong&gt;lot&lt;/strong&gt; of testing to work out the plot structure prior to making the plotting function below. 😉&lt;/p&gt;
&lt;p&gt;In addition to the plotting code, my &lt;code&gt;plot_fun()&lt;/code&gt; function includes a line where I subset the &lt;code&gt;resp_dat&lt;/code&gt; dataset to only the row of metadata for the response variable used in the plot. I use this information to add the constant to &lt;code&gt;y&lt;/code&gt; and make a plot title with a description of the variable plus the units.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_fun = function(data = dat, respdata = resp_dat, response) {
     
     respvar = subset(respdata, variable == response)

     ggplot(data = data, aes(x = year, 
                             y = .data[[response]] + respvar$constant,
                             shape = trt, 
                             color = trt,
                             group = trt) ) +
          geom_point(position = position_dodge(width = 0.5),
                   alpha = 0.25,
                   size = 2, show.legend = FALSE) +
          stat_summary(fun.y = mean, geom = &amp;quot;point&amp;quot;,
                       position = position_dodge(width = 0.5),
                       size = 4, show.legend = FALSE) +
          stat_summary(fun.y = mean, geom = &amp;quot;line&amp;quot;,
                       position = position_dodge(width = 0.5),
                       size = 1, key_glyph = &amp;quot;rect&amp;quot;) +
          theme_bw(base_size = 14) +
          theme(legend.position = &amp;quot;bottom&amp;quot;,
                legend.direction = &amp;quot;horizontal&amp;quot;,
                legend.box.spacing = unit(0, &amp;quot;cm&amp;quot;),
                legend.text = element_text(margin = margin(l = -.2, unit = &amp;quot;cm&amp;quot;) ),
                panel.grid.minor.y = element_blank() ) +
          scale_color_grey(name = &amp;quot;&amp;quot;,
                           labels = c(&amp;quot;A Treatment&amp;quot;, &amp;quot;B Treatment&amp;quot;),
                           start = 0, end = 0.5) +
          labs(x = &amp;quot;Year since treatment&amp;quot;,
               title = paste0(respvar$description, &amp;quot; (&amp;quot;, respvar$units, &amp;quot;)&amp;quot;),
               y = NULL)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here is what the plot looks like for &lt;code&gt;cov_plant&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_fun(response = &amp;quot;cov_plant&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-09-27-more-exploratory-plots_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;adding-a-conditional-axis-scale&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Adding a conditional axis scale&lt;/h1&gt;
&lt;p&gt;I wanted to put variables that were log transformed in the original analysis on the log scale. Since not all variables were transformed, I wanted to use &lt;code&gt;scale_y_log10()&lt;/code&gt; for log transformed variables and the standard scale for everything else.&lt;/p&gt;
&lt;p&gt;To achieve this, I will assign my base plot a name within the function so I can add on to it conditionally. I name it &lt;code&gt;g1&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I use the &lt;code&gt;transformation&lt;/code&gt; column in the variable metadata to check whether a log transformation was done, using &lt;code&gt;grepl()&lt;/code&gt;. If it was, I add &lt;code&gt;scale_y_log10()&lt;/code&gt; to the existing plot. Otherwise I return the plot on the original scale.&lt;/p&gt;
&lt;p&gt;Here is the code I’ll add to the end of my function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;if( grepl(&amp;quot;log&amp;quot;, respvar$transformation) ) {
          g1 + scale_y_log10()
     } else {
          g1
     }&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here I used the metadata dataset I created in the &lt;code&gt;if()&lt;/code&gt; statement, but you could do something similar if the transformation were part of each variable’s name.&lt;/p&gt;
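&lt;p&gt;For example, with hypothetical variable names like &lt;code&gt;log_cov_plant&lt;/code&gt; that encode the transformation as a prefix, the check could be based on the name itself. This is just a sketch with a stand-in base plot, since the real &lt;code&gt;g1&lt;/code&gt; is built inside the plotting function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2)

# Hypothetical: the transformation is encoded in the variable name
# instead of in a metadata dataset
response = &amp;quot;log_cov_plant&amp;quot;

# A stand-in base plot to add on to conditionally
g1 = ggplot(mtcars, aes(x = wt, y = mpg) ) + geom_point()

# Check the name itself instead of a metadata column
if( grepl(&amp;quot;^log_&amp;quot;, response) ) {
     g1 + scale_y_log10()
} else {
     g1
}&lt;/code&gt;&lt;/pre&gt;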
&lt;p&gt;Here is what the plotting function looks like now.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_fun2 = function(data = dat, respdata = resp_dat, response) {
     
     respvar = subset(respdata, variable == response)

     g1 = ggplot(data = data, aes(x = year, 
                                  y = .data[[response]] + respvar$constant,
                                  shape = trt, 
                                  color = trt,
                                  group = trt) ) +
          geom_point(position = position_dodge(width = 0.5),
                     alpha = 0.25,
                     size = 2, show.legend = FALSE) +
          stat_summary(fun.y = mean, geom = &amp;quot;point&amp;quot;,
                       position = position_dodge(width = 0.5),
                       size = 4, show.legend = FALSE) +
          stat_summary(fun.y = mean, geom = &amp;quot;line&amp;quot;,
                       position = position_dodge(width = 0.5),
                       size = 1, key_glyph = &amp;quot;rect&amp;quot;) +
          theme_bw(base_size = 14) +
          theme(legend.position = &amp;quot;bottom&amp;quot;,
                legend.direction = &amp;quot;horizontal&amp;quot;,
                legend.box.spacing = unit(0, &amp;quot;cm&amp;quot;),
                legend.text = element_text(margin = margin(l = -.2, unit = &amp;quot;cm&amp;quot;) ),
                panel.grid.minor.y = element_blank() ) +
          scale_color_grey(name = &amp;quot;&amp;quot;,
                           labels = c(&amp;quot;A Treatment&amp;quot;, &amp;quot;B Treatment&amp;quot;),
                           start = 0, end = 0.5) +
          labs(x = &amp;quot;Year since treatment&amp;quot;,
               title = paste0(respvar$description, &amp;quot; (&amp;quot;, respvar$units, &amp;quot;)&amp;quot;),
               y = NULL)
     
     if( grepl(&amp;quot;log&amp;quot;, respvar$transformation) ) {
          g1 + scale_y_log10()
     } else {
          g1
     }
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now when I use the function on a log transformed variable like &lt;code&gt;cov_plant&lt;/code&gt;, the y axis is on the log scale.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_fun2(response = &amp;quot;cov_plant&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-09-27-more-exploratory-plots_files/figure-html/unnamed-chunk-9-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;But the y axis for &lt;code&gt;cov_oth&lt;/code&gt;, which was analyzed on the original scale, is not.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_fun2(response = &amp;quot;cov_oth&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-09-27-more-exploratory-plots_files/figure-html/unnamed-chunk-10-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;adding-a-conditional-caption&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Adding a conditional caption&lt;/h1&gt;
&lt;p&gt;After I changed the y axis scale for some variables, I decided I should make sure that the scale of the axis is clear to the reader. In addition, I wanted to highlight the fact that some variables were shifted prior to transformation. I decided to create a conditional caption with this information, which can then be added to the plot in &lt;code&gt;labs()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Since there were three conditions, a log transformation with an added constant, a log transformation alone, or no transformation, I used &lt;code&gt;if()&lt;/code&gt;-&lt;code&gt;else if()&lt;/code&gt;-&lt;code&gt;else&lt;/code&gt; to do this.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;caption_text = {
     if(respvar$constant != 0 ) {
          paste0(&amp;quot;Y axis on log scale &amp;quot;,
                 &amp;quot;(added constant &amp;quot;,
                 respvar$constant, &amp;quot;)&amp;quot;)
     } else if(!grepl(&amp;quot;log&amp;quot;, respvar$transformation) ) {
          &amp;quot;Y axis on original scale&amp;quot;
     } else {
          &amp;quot;Y axis on log scale&amp;quot;
     }
}&lt;/code&gt;&lt;/pre&gt;
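&lt;p&gt;To preview what this logic produces, it can be wrapped in a small helper function (a hypothetical helper for checking, not part of the final plotting function) and applied to each variable’s constant and transformation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical helper reproducing the caption logic for one metadata row
get_caption = function(constant, transformation) {
     if(constant != 0 ) {
          paste0(&amp;quot;Y axis on log scale &amp;quot;,
                 &amp;quot;(added constant &amp;quot;, constant, &amp;quot;)&amp;quot;)
     } else if(!grepl(&amp;quot;log&amp;quot;, transformation) ) {
          &amp;quot;Y axis on original scale&amp;quot;
     } else {
          &amp;quot;Y axis on log scale&amp;quot;
     }
}

get_caption(0.3, &amp;quot;log&amp;quot;)    # &amp;quot;Y axis on log scale (added constant 0.3)&amp;quot;
get_caption(0, &amp;quot;identity&amp;quot;) # &amp;quot;Y axis on original scale&amp;quot;
get_caption(0, &amp;quot;log&amp;quot;)      # &amp;quot;Y axis on log scale&amp;quot;&lt;/code&gt;&lt;/pre&gt;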
&lt;p&gt;The function is getting pretty long and complicated now.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_fun3 = function(data = dat, respdata = resp_dat, response) {
     
     respvar = subset(respdata, variable == response)

     caption_text = {
          if(respvar$constant != 0 ) {
               paste0(&amp;quot;Y axis on log scale &amp;quot;,
                      &amp;quot;(added constant &amp;quot;,
                      respvar$constant, &amp;quot;)&amp;quot;)
          } else if(!grepl(&amp;quot;log&amp;quot;, respvar$transformation) ) {
               &amp;quot;Y axis on original scale&amp;quot;
          } else {
               &amp;quot;Y axis on log scale&amp;quot;
          }
     }
     
     g1 = ggplot(data = data, aes(x = year, 
                                  y = .data[[response]] + respvar$constant,
                                  shape = trt, 
                                  color = trt,
                                  group = trt) ) +
          geom_point(position = position_dodge(width = 0.5),
                     alpha = 0.25,
                     size = 2, show.legend = FALSE) +
          stat_summary(fun.y = mean, geom = &amp;quot;point&amp;quot;,
                       position = position_dodge(width = 0.5),
                       size = 4, show.legend = FALSE) +
          stat_summary(fun.y = mean, geom = &amp;quot;line&amp;quot;,
                       position = position_dodge(width = 0.5),
                       size = 1, key_glyph = &amp;quot;rect&amp;quot;) +
          theme_bw(base_size = 14) +
          theme(legend.position = &amp;quot;bottom&amp;quot;,
                legend.direction = &amp;quot;horizontal&amp;quot;,
                legend.box.spacing = unit(0, &amp;quot;cm&amp;quot;),
                legend.text = element_text(margin = margin(l = -.2, unit = &amp;quot;cm&amp;quot;) ),
                panel.grid.minor.y = element_blank() ) +
          scale_color_grey(name = &amp;quot;&amp;quot;,
                           labels = c(&amp;quot;A Treatment&amp;quot;, &amp;quot;B Treatment&amp;quot;),
                           start = 0, end = 0.5) +
          labs(x = &amp;quot;Year since treatment&amp;quot;,
               title = paste0(respvar$description, &amp;quot; (&amp;quot;, respvar$units, &amp;quot;)&amp;quot;),
               y = NULL,
               caption = caption_text)
     
     if( grepl(&amp;quot;log&amp;quot;, respvar$transformation) ) {
          g1 +
               scale_y_log10()
     } else {
          g1
     }
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But it does what I want. The plots now have informative captions at the bottom in addition to the conditional y axis scale.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_fun3(response = &amp;quot;cov_plant&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-09-27-more-exploratory-plots_files/figure-html/unnamed-chunk-13-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;looping-through-the-variables&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Looping through the variables&lt;/h1&gt;
&lt;p&gt;Once I have worked out the details of the function, I can loop through all the variables and make plots with &lt;code&gt;purrr::map()&lt;/code&gt;. I’ve set this up to loop through the vector of variable names, stored in &lt;code&gt;vars&lt;/code&gt; as strings.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;vars = names(dat)[1:3]
vars&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] &amp;quot;cov_plant&amp;quot; &amp;quot;cov_oth&amp;quot;   &amp;quot;gap&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_plots = map(vars, ~plot_fun3(response = .x) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here are the plots, with captions showing that two plots are on the log scale (one of those with an added constant) and one is on the original scale.&lt;/p&gt;
&lt;p&gt;I’m showing the plots all together here, but I actually saved them in a PDF with one plot per page so collaborators could easily page through them.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cowplot::plot_grid(plotlist = all_plots)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-09-27-more-exploratory-plots_files/figure-html/unnamed-chunk-16-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
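&lt;p&gt;Saving one plot per page, as I mention above, can be done by opening a &lt;code&gt;pdf()&lt;/code&gt; device and printing each plot in turn with &lt;code&gt;purrr::walk()&lt;/code&gt;. Here is a sketch with stand-in plots and a made-up file name.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2)
library(purrr)

# Stand-in plots; in this post the list would come from all_plots
example_plots = list(
     ggplot(mtcars, aes(x = wt, y = mpg) ) + geom_point(),
     ggplot(mtcars, aes(x = hp, y = mpg) ) + geom_point()
)

# Each printed plot becomes its own page of the PDF
pdf(&amp;quot;all_plots.pdf&amp;quot;, width = 7, height = 5) # hypothetical file name
walk(example_plots, print)
dev.off()&lt;/code&gt;&lt;/pre&gt;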
&lt;/div&gt;
&lt;div id=&#34;just-the-code-please&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Just the code, please&lt;/h1&gt;
&lt;p&gt;Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code &lt;a href=&#34;https://aosmith.rbind.io/script/2019-09-27-more-exploratory-plots.R&#34;&gt;from here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2) # v. 3.2.1
library(purrr) # v. 0.3.2
dat = structure(list(cov_plant = c(3.7, 1.8, 7.5, 0.4, 7.9, 1.2, 0.7, 
2.3, 6.9, 4.1, 17.7, 2.4, 0.9, 14.3, 4.9, 0, 4.1, 3.6, 1.1, 7.7, 
0, 1.5, 1.7, 11.5, 0.8, 12.3, 7.1, 6.9, 5.6, 2.7, 1, 2.5, 2, 
0.7, 0.7, 2.9, 4, 2.5, 2.9, 1.5, 0.5, 22.8, 2.8, 1.4, 1, 2.9, 
2.4, 4.1, 4.1, 1.9, 2.8, 5, 5.7, 5.6, 0, 4.6, 8.1, 0.5, 88.9, 
1), cov_oth = c(11.5, 63.2, 34, 65.5, 28.8, 8.6, 7.1, 65.5, 12.1, 
3, 23.6, 3.8, 24.9, 55.9, 24.2, 78.2, 81.1, 10.7, 30.7, 23.5, 
10.1, 4.6, 45.7, 37.6, 81.3, 39.1, 50.8, 75.8, 78.2, 23.9, 53, 
51.1, 2.5, 40.2, 15.9, 91.3, 44, 72.9, 82.7, 42.4, 94.1, 23, 
86.2, 50.1, 88.9, 80.5, 34.2, 68.7, 45, 13.9, 44.2, 85, 79.6, 
1, 45.3, 69.5, 89.6, 22.2, 1.3, 88), gap = c(2.8, 11.8, 0.3, 
17.2, 18.3, 1.4, 19.6, 19.4, 2.6, 66.3, 97.1, 17, 381.5, 15.7, 
8.3, 2.4, 3.8, 3.8, 246.6, 43.2, 16.7, 6.6, 3.1, 2.4, 3.2, 4.3, 
0.3, 2.1, 41.7, 68.9, 5.1, 5.7, 0.4, 35.5, 1.1, 10.8, 5, 11.8, 
75.5, 5.4, 12.6, 5.2, 11.4, 6.8, 5.3, 1.1, 3.2, 2.9, 5.2, 0.2, 
1.5, 0.6, 7.4, 18.6, 11.7, 1.6, 13.7, 7.1, 19.9, 16.8), year = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c(&amp;quot;Year 1&amp;quot;, 
&amp;quot;Year 2&amp;quot;, &amp;quot;Year 3&amp;quot;), class = &amp;quot;factor&amp;quot;), trt = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c(&amp;quot;a&amp;quot;, 
&amp;quot;b&amp;quot;), class = &amp;quot;factor&amp;quot;)), row.names = c(NA, -60L), class = &amp;quot;data.frame&amp;quot;)
head(dat)

resp_dat = structure(list(variable = structure(c(2L, 1L, 3L), .Label = c(&amp;quot;cov_oth&amp;quot;, 
&amp;quot;cov_plant&amp;quot;, &amp;quot;gap&amp;quot;), class = &amp;quot;factor&amp;quot;), description = structure(3:1, .Label = c(&amp;quot;Gap size&amp;quot;, 
&amp;quot;Other cover&amp;quot;, &amp;quot;Plant cover&amp;quot;), class = &amp;quot;factor&amp;quot;), units = structure(c(1L, 
1L, 2L), .Label = c(&amp;quot;%&amp;quot;, &amp;quot;m&amp;quot;), class = &amp;quot;factor&amp;quot;), transformation = structure(c(2L, 
1L, 2L), .Label = c(&amp;quot;identity&amp;quot;, &amp;quot;log&amp;quot;), class = &amp;quot;factor&amp;quot;), constant = c(0.3, 
0, 0)), class = &amp;quot;data.frame&amp;quot;, row.names = c(NA, -3L))

resp_dat

plot_fun = function(data = dat, respdata = resp_dat, response) {
     
     respvar = subset(respdata, variable == response)

     ggplot(data = data, aes(x = year, 
                             y = .data[[response]] + respvar$constant,
                             shape = trt, 
                             color = trt,
                             group = trt) ) +
          geom_point(position = position_dodge(width = 0.5),
                   alpha = 0.25,
                   size = 2, show.legend = FALSE) +
          stat_summary(fun.y = mean, geom = &amp;quot;point&amp;quot;,
                       position = position_dodge(width = 0.5),
                       size = 4, show.legend = FALSE) +
          stat_summary(fun.y = mean, geom = &amp;quot;line&amp;quot;,
                       position = position_dodge(width = 0.5),
                       size = 1, key_glyph = &amp;quot;rect&amp;quot;) +
          theme_bw(base_size = 14) +
          theme(legend.position = &amp;quot;bottom&amp;quot;,
                legend.direction = &amp;quot;horizontal&amp;quot;,
                legend.box.spacing = unit(0, &amp;quot;cm&amp;quot;),
                legend.text = element_text(margin = margin(l = -.2, unit = &amp;quot;cm&amp;quot;) ),
                panel.grid.minor.y = element_blank() ) +
          scale_color_grey(name = &amp;quot;&amp;quot;,
                           labels = c(&amp;quot;A Treatment&amp;quot;, &amp;quot;B Treatment&amp;quot;),
                           start = 0, end = 0.5) +
          labs(x = &amp;quot;Year since treatment&amp;quot;,
               title = paste0(respvar$description, &amp;quot; (&amp;quot;, respvar$units, &amp;quot;)&amp;quot;),
               y = NULL)
}
plot_fun(response = &amp;quot;cov_plant&amp;quot;)

if( grepl(&amp;quot;log&amp;quot;, respvar$transformation) ) {
          g1 + scale_y_log10()
     } else {
          g1
     }

plot_fun2 = function(data = dat, respdata = resp_dat, response) {
     
     respvar = subset(respdata, variable == response)

     g1 = ggplot(data = data, aes(x = year, 
                                  y = .data[[response]] + respvar$constant,
                                  shape = trt, 
                                  color = trt,
                                  group = trt) ) +
          geom_point(position = position_dodge(width = 0.5),
                     alpha = 0.25,
                     size = 2, show.legend = FALSE) +
          stat_summary(fun.y = mean, geom = &amp;quot;point&amp;quot;,
                       position = position_dodge(width = 0.5),
                       size = 4, show.legend = FALSE) +
          stat_summary(fun.y = mean, geom = &amp;quot;line&amp;quot;,
                       position = position_dodge(width = 0.5),
                       size = 1, key_glyph = &amp;quot;rect&amp;quot;) +
          theme_bw(base_size = 14) +
          theme(legend.position = &amp;quot;bottom&amp;quot;,
                legend.direction = &amp;quot;horizontal&amp;quot;,
                legend.box.spacing = unit(0, &amp;quot;cm&amp;quot;),
                legend.text = element_text(margin = margin(l = -.2, unit = &amp;quot;cm&amp;quot;) ),
                panel.grid.minor.y = element_blank() ) +
          scale_color_grey(name = &amp;quot;&amp;quot;,
                           labels = c(&amp;quot;A Treatment&amp;quot;, &amp;quot;B Treatment&amp;quot;),
                           start = 0, end = 0.5) +
          labs(x = &amp;quot;Year since treatment&amp;quot;,
               title = paste0(respvar$description, &amp;quot; (&amp;quot;, respvar$units, &amp;quot;)&amp;quot;),
               y = NULL)
     
     if( grepl(&amp;quot;log&amp;quot;, respvar$transformation) ) {
          g1 + scale_y_log10()
     } else {
          g1
     }
}
plot_fun2(response = &amp;quot;cov_plant&amp;quot;)
plot_fun2(response = &amp;quot;cov_oth&amp;quot;)

caption_text = {
     if(respvar$constant != 0 ) {
          paste0(&amp;quot;Y axis on log scale &amp;quot;,
                 &amp;quot;(added constant &amp;quot;,
                 respvar$constant, &amp;quot;)&amp;quot;)
     } else if(!grepl(&amp;quot;log&amp;quot;, respvar$transformation) ) {
          &amp;quot;Y axis on original scale&amp;quot;
     } else {
          &amp;quot;Y axis on log scale&amp;quot;
     }
}

plot_fun3 = function(data = dat, respdata = resp_dat, response) {
     
     respvar = subset(respdata, variable == response)

     caption_text = {
          if(respvar$constant != 0 ) {
               paste0(&amp;quot;Y axis on log scale &amp;quot;,
                      &amp;quot;(added constant &amp;quot;,
                      respvar$constant, &amp;quot;)&amp;quot;)
          } else if(!grepl(&amp;quot;log&amp;quot;, respvar$transformation) ) {
               &amp;quot;Y axis on original scale&amp;quot;
          } else {
               &amp;quot;Y axis on log scale&amp;quot;
          }
     }
     
     g1 = ggplot(data = data, aes(x = year, 
                                  y = .data[[response]] + respvar$constant,
                                  shape = trt, 
                                  color = trt,
                                  group = trt) ) +
          geom_point(position = position_dodge(width = 0.5),
                     alpha = 0.25,
                     size = 2, show.legend = FALSE) +
          stat_summary(fun.y = mean, geom = &amp;quot;point&amp;quot;,
                       position = position_dodge(width = 0.5),
                       size = 4, show.legend = FALSE) +
          stat_summary(fun.y = mean, geom = &amp;quot;line&amp;quot;,
                       position = position_dodge(width = 0.5),
                       size = 1, key_glyph = &amp;quot;rect&amp;quot;) +
          theme_bw(base_size = 14) +
          theme(legend.position = &amp;quot;bottom&amp;quot;,
                legend.direction = &amp;quot;horizontal&amp;quot;,
                legend.box.spacing = unit(0, &amp;quot;cm&amp;quot;),
                legend.text = element_text(margin = margin(l = -.2, unit = &amp;quot;cm&amp;quot;) ),
                panel.grid.minor.y = element_blank() ) +
          scale_color_grey(name = &amp;quot;&amp;quot;,
                           labels = c(&amp;quot;A Treatment&amp;quot;, &amp;quot;B Treatment&amp;quot;),
                           start = 0, end = 0.5) +
          labs(x = &amp;quot;Year since treatment&amp;quot;,
               title = paste0(respvar$description, &amp;quot; (&amp;quot;, respvar$units, &amp;quot;)&amp;quot;),
               y = NULL,
               caption = caption_text)
     
     if( grepl(&amp;quot;log&amp;quot;, respvar$transformation) ) {
          g1 +
               scale_y_log10()
     } else {
          g1
     }
}
plot_fun3(response = &amp;quot;cov_plant&amp;quot;)

vars = names(dat)[1:3]
vars

all_plots = map(vars, ~plot_fun3(response = .x) )

cowplot::plot_grid(plotlist = all_plots)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Many similar models - Part 2: Automate model fitting with purrr::map() loops</title>
      <link>https://aosmith.rbind.io/2019/07/22/automate-model-fitting-with-loops/</link>
      <pubDate>Mon, 22 Jul 2019 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2019/07/22/automate-model-fitting-with-loops/</guid>
      <description>
&lt;script src=&#34;https://aosmith.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;&lt;em&gt;This post was last updated on 2022-01-05.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;When we have many similar models to fit, automating at least some portions of the task can be a real time saver. In &lt;a href=&#34;https://aosmith.rbind.io/2019/06/24/function-for-model-fitting/&#34;&gt;my last post&lt;/a&gt; I demonstrated how to make a function for model fitting. Once you have made such a function it’s possible to loop through variable names and fit a model for each one.&lt;/p&gt;
&lt;p&gt;In this post I am specifically focusing on having many response variables with the same explanatory variables, using &lt;code&gt;purrr::map()&lt;/code&gt; and friends for the looping. However, this same approach can be used for models with varying explanatory variables, etc.&lt;/p&gt;
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#r-packages&#34;&gt;R packages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-dataset&#34;&gt;The dataset&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#a-function-for-model-fitting&#34;&gt;A function for model fitting&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#looping-through-the-response-variables&#34;&gt;Looping through the response variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#create-residual-plots-for-each-model&#34;&gt;Create residual plots for each model&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#examining-the-plots&#34;&gt;Examining the plots&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#re-fitting-a-model&#34;&gt;Re-fitting a model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#getting-model-results&#34;&gt;Getting model results&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#alternative-approach-to-fitting-many-models&#34;&gt;Alternative approach to fitting many models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#just-the-code-please&#34;&gt;Just the code, please&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;r-packages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;R packages&lt;/h1&gt;
&lt;p&gt;I’ll be using &lt;strong&gt;purrr&lt;/strong&gt; for looping and will make residual plots with &lt;strong&gt;ggplot2&lt;/strong&gt; and &lt;strong&gt;patchwork&lt;/strong&gt;. I’ll use &lt;strong&gt;broom&lt;/strong&gt; to extract tidy results from models.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(purrr) # v. 0.3.4
library(ggplot2) # v. 3.3.5
library(patchwork) # v. 1.1.1
library(broom) # v. 0.7.10&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;the-dataset&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The dataset&lt;/h1&gt;
&lt;p&gt;I made a dataset with three response variables, &lt;code&gt;resp&lt;/code&gt;, &lt;code&gt;slp&lt;/code&gt;, and &lt;code&gt;grad&lt;/code&gt;, along with a two-level explanatory variable &lt;code&gt;group&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat = structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c(&amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;), class = &amp;quot;factor&amp;quot;), 
    resp = c(10.48, 9.87, 11.1, 8.56, 11.15, 9.53, 8.99, 10.06, 
    11.02, 10.57, 11.85, 10.11, 9.25, 11.66, 10.72, 8.34, 10.58, 
    10.47, 9.46, 11.13, 8.35, 9.69, 9.82, 11.47, 9.13, 11.53, 
    11.05, 11.03, 10.84, 10.22), slp = c(38.27, 46.33, 44.29, 
    35.57, 34.78, 47.81, 50.45, 46.31, 47.82, 42.07, 31.75, 65.65, 
    47.42, 41.51, 38.69, 47.84, 46.22, 50.66, 50.69, 44.09, 47.3, 
    52.53, 53.63, 53.38, 27.34, 51.83, 56.63, 32.99, 77.5, 38.24
    ), grad = c(0.3, 0.66, 0.57, 0.23, 0.31, 0.48, 0.5, 0.49, 
    2.41, 0.6, 0.27, 0.89, 2.43, 1.02, 2.17, 1.38, 0.17, 0.47, 
    1.1, 3.28, 6.14, 3.8, 4.35, 0.85, 1.13, 1.11, 2.93, 1.13, 
    4.52, 0.13)), class = &amp;quot;data.frame&amp;quot;, row.names = c(NA, -30L) )
head(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#   group  resp   slp grad
# 1     a 10.48 38.27 0.30
# 2     a  9.87 46.33 0.66
# 3     a 11.10 44.29 0.57
# 4     a  8.56 35.57 0.23
# 5     a 11.15 34.78 0.31
# 6     a  9.53 47.81 0.48&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;a-function-for-model-fitting&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;A function for model fitting&lt;/h1&gt;
&lt;p&gt;The analysis in the example I’m using today amounts to a two-sample t-test. I will fit this as a linear model with &lt;code&gt;lm()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Since the response variable needs to vary among models but the dataset and explanatory variable do not, my function will have a single argument for setting the response variable. Building the model formula in my function &lt;code&gt;ttest_fun()&lt;/code&gt; relies on &lt;code&gt;reformulate()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ttest_fun = function(response) {
  form = reformulate(&amp;quot;group&amp;quot;, response = response)
  lm(form, data = dat)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This function takes the response variable as a string and returns a model object.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ttest_fun(response = &amp;quot;resp&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# 
# Call:
# lm(formula = form, data = dat)
# 
# Coefficients:
# (Intercept)       groupb  
#     10.3280      -0.1207&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;looping-through-the-response-variables&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Looping through the response variables&lt;/h1&gt;
&lt;p&gt;I’ll make a vector of the response variable names as strings so I can loop through them and fit a model for each one. I pull my response variable names out of the dataset with &lt;code&gt;names()&lt;/code&gt;. This step may take more work for you if you have many response variables that aren’t neatly listed all in a row like mine are. 😜&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;vars = names(dat)[2:4]
vars&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] &amp;quot;resp&amp;quot; &amp;quot;slp&amp;quot;  &amp;quot;grad&amp;quot;&lt;/code&gt;&lt;/pre&gt;
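&lt;p&gt;For this dataset, an equivalent way to get the response variable names that doesn’t rely on column position is to drop the explanatory variable by name. This is just one option, shown as a quick sketch.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# All columns other than the explanatory variable are responses here
setdiff( names(dat), &amp;quot;group&amp;quot; )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] &amp;quot;resp&amp;quot; &amp;quot;slp&amp;quot;  &amp;quot;grad&amp;quot;&lt;/code&gt;&lt;/pre&gt;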
&lt;p&gt;I want to keep track of which variable goes with which model. This can be accomplished by naming the vector I’m going to loop through. I name the vector of strings with itself using &lt;code&gt;purrr::set_names()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;vars = set_names(vars)
vars&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#   resp    slp   grad 
# &amp;quot;resp&amp;quot;  &amp;quot;slp&amp;quot; &amp;quot;grad&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now I’m ready to loop through the variables and fit a model for each one with &lt;code&gt;purrr::map()&lt;/code&gt;. Since my function takes a single argument, the response variable, I can list the function by name within &lt;code&gt;map()&lt;/code&gt; without using a formula (&lt;code&gt;~&lt;/code&gt;) or an anonymous function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;models = vars %&amp;gt;%
     map(ttest_fun)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The output is a list containing three models, one for each response variable. Notice that the output list is a &lt;em&gt;named&lt;/em&gt; list, where the name of each list element is the response variable used in that model. This is the reason I took the time to name the response variable vector.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;models&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $resp
# 
# Call:
# lm(formula = form, data = dat)
# 
# Coefficients:
# (Intercept)       groupb  
#     10.3280      -0.1207  
# 
# 
# $slp
# 
# Call:
# lm(formula = form, data = dat)
# 
# Coefficients:
# (Intercept)       groupb  
#       43.91         4.81  
# 
# 
# $grad
# 
# Call:
# lm(formula = form, data = dat)
# 
# Coefficients:
# (Intercept)       groupb  
#      0.8887       1.2773&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note I could have done the &lt;code&gt;set_names()&lt;/code&gt; step within the pipe chain rather than as a separate step.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;vars %&amp;gt;%
     set_names() %&amp;gt;%
     map(ttest_fun)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;create-residual-plots-for-each-model&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Create residual plots for each model&lt;/h1&gt;
&lt;p&gt;I’m working with a simple model fitting function, where the output only contains the fitted model. To extract other output I can loop through the list of models in a separate step. An alternative is to create all the output within the modeling function and then pull whatever you want out of the list of results.&lt;/p&gt;
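&lt;p&gt;As a minimal sketch of that alternative (the element names here are my own invention), the function could return the fitted model along with any results of interest in a named list:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ttest_all = function(response) {
     fit = ttest_fun(response)
     list(model = fit,
          results = tidy(fit, conf.int = TRUE) )
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Looping such a function over &lt;code&gt;vars&lt;/code&gt; would give a list of lists to pull results from.&lt;/p&gt;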
&lt;p&gt;In this case, my next step is to loop through the models and make residual plots. I want to look at a residuals vs fitted values plot as well as a plot to look at residual normality (like a boxplot, a histogram, or a quantile-quantile normal plot). In more complicated models I might also make plots of residuals vs explanatory variables.&lt;/p&gt;
&lt;p&gt;I’ll make a function to build the two residuals plots. My function takes a model and the model name as arguments. I extract residuals and fitted values via &lt;code&gt;broom::augment()&lt;/code&gt; and make the two plots with &lt;strong&gt;ggplot2&lt;/strong&gt; functions. I combine the plots via &lt;strong&gt;patchwork&lt;/strong&gt;. I add a title to the combined plot with the name of the response variable from each model to help me keep track of things.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;resid_plots = function(model, modelname) {
     output = augment(model)
     
     res.v.fit = ggplot(output, aes(x = .fitted, y = .resid) ) +
          geom_point() +
          theme_bw(base_size = 16)
     
     res.box = ggplot(output, aes(x = &amp;quot;&amp;quot;, y = .resid) ) +
          geom_boxplot() +
          theme_bw(base_size = 16) +
          labs(x = NULL)
     
     res.v.fit + res.box +
          plot_annotation(title = paste(&amp;quot;Residuals plots for&amp;quot;, modelname) )
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The output of this function is a combined plot of the residuals. Here is an example for one model (printed at 8&#34; wide by 4&#34; tall).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;resid_plots(model = models[[1]], modelname = names(models)[1])&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-07-22-automate-model-fitting-with-loops_files/figure-html/unnamed-chunk-11-1.png&#34; width=&#34;768&#34; /&gt;&lt;/p&gt;
&lt;p&gt;I can use &lt;code&gt;purrr::imap()&lt;/code&gt; to loop through all models and the model names simultaneously to make the plots with the title for each variable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;residplots = imap(models, resid_plots)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;examining-the-plots&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Examining the plots&lt;/h2&gt;
&lt;p&gt;In a situation where I have many response variables, I like to save my plots out into a PDF so I can easily page through them outside of R. You can see some approaches for saving plots &lt;a href=&#34;https://aosmith.rbind.io/2018/08/20/automating-exploratory-plots/#saving-the-plots&#34;&gt;in a previous post&lt;/a&gt;.&lt;/p&gt;
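&lt;p&gt;As a minimal sketch of that approach, the whole list of plots can be printed into a single PDF with &lt;code&gt;pdf()&lt;/code&gt; and &lt;code&gt;purrr::walk()&lt;/code&gt;. The file name here is made up.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# One page per residual plot; width/height match the sizes shown above
pdf(&amp;quot;residual_plots.pdf&amp;quot;, width = 8, height = 4)
walk(residplots, print)
dev.off()&lt;/code&gt;&lt;/pre&gt;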
&lt;p&gt;Since I have only a few plots I can print them in R. The last plot, shown below, looks potentially problematic. I see the variance increasing with the mean and right skew in the residuals.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;residplots[[3]]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-07-22-automate-model-fitting-with-loops_files/figure-html/unnamed-chunk-13-1.png&#34; width=&#34;768&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;re-fitting-a-model&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Re-fitting a model&lt;/h1&gt;
&lt;p&gt;If you find a problematic model fit you’ll need to spend some time working with that variable to find a more appropriate model.&lt;/p&gt;
&lt;p&gt;Once you have a model you’re happy with, you can manually add the new model to the list (if needed). In my example, let’s say the &lt;code&gt;grad&lt;/code&gt; model needed a log transformation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gradmod = ttest_fun(&amp;quot;log(grad)&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If I’m happy with the fit of the new model I add it to the list with the other models to automate extracting results.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;models$log_grad = gradmod&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I remove the original model by setting it to &lt;code&gt;NULL&lt;/code&gt;. I don’t want any results from that model and if I leave it in I know I’ll ultimately get confused about which model is the final model. 😕&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;models$grad = NULL&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now the output list again has three models, with the new &lt;code&gt;log_grad&lt;/code&gt; model added and the old &lt;code&gt;grad&lt;/code&gt; model removed.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;models&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $resp
# 
# Call:
# lm(formula = form, data = dat)
# 
# Coefficients:
# (Intercept)       groupb  
#     10.3280      -0.1207  
# 
# 
# $slp
# 
# Call:
# lm(formula = form, data = dat)
# 
# Coefficients:
# (Intercept)       groupb  
#       43.91         4.81  
# 
# 
# $log_grad
# 
# Call:
# lm(formula = form, data = dat)
# 
# Coefficients:
# (Intercept)       groupb  
#     -0.4225       0.7177&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I could have removed models from the list via subsetting by name. Here’s an example, showing what the list looks like if I remove the &lt;code&gt;slp&lt;/code&gt; model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;models[!names(models) %in% &amp;quot;slp&amp;quot;]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $resp
# 
# Call:
# lm(formula = form, data = dat)
# 
# Coefficients:
# (Intercept)       groupb  
#     10.3280      -0.1207  
# 
# 
# $log_grad
# 
# Call:
# lm(formula = form, data = dat)
# 
# Coefficients:
# (Intercept)       groupb  
#     -0.4225       0.7177&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-model-results&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Getting model results&lt;/h1&gt;
&lt;p&gt;Once you are happy with the fit of all the models it’s time to extract any output of interest. For a t-test we would commonly want the estimated difference between the two groups, which is in the &lt;code&gt;summary()&lt;/code&gt; output. I’ll pull this information from each model as a data.frame with &lt;code&gt;broom::tidy()&lt;/code&gt;. This returns the estimated coefficients, statistical tests, and (optionally) confidence intervals for the coefficients.&lt;/p&gt;
&lt;p&gt;I switch to &lt;code&gt;map_dfr()&lt;/code&gt; for looping to get the output combined into a single data.frame. I use the &lt;code&gt;.id&lt;/code&gt; argument to add the response variable name to the output dataset.&lt;/p&gt;
&lt;p&gt;Since some of the response variables are log-transformed, it would make sense to back-transform coefficients in this step. I don’t show this here, but would likely approach this using an &lt;code&gt;if()&lt;/code&gt; statement based on log-transformed variables containing &lt;code&gt;&#34;log&#34;&lt;/code&gt; in their names.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;res_anova = map_dfr(models, tidy, conf.int = TRUE, .id = &amp;quot;variable&amp;quot;)
res_anova&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# # A tibble: 6 x 8
#   variable term        estimate std.error statistic  p.value conf.low conf.high
#   &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;
# 1 resp     (Intercept)   10.3       0.260    39.7   3.60e-26   9.80     10.9   
# 2 resp     groupb        -0.121     0.368    -0.328 7.45e- 1  -0.874     0.632 
# 3 slp      (Intercept)   43.9       2.56     17.2   2.18e-16  38.7      49.2   
# 4 slp      groupb         4.81      3.62      1.33  1.95e- 1  -2.61     12.2   
# 5 log_grad (Intercept)   -0.423     0.255    -1.66  1.09e- 1  -0.945     0.0997
# 6 log_grad groupb         0.718     0.361     1.99  5.64e- 2  -0.0208    1.46&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The primary interest in this output would be in the &lt;code&gt;groupb&lt;/code&gt; row for each variable. Since the output is a data frame (thanks &lt;code&gt;broom::tidy()&lt;/code&gt;!) you can use standard data manipulation tools to pull out only rows and columns of interest.&lt;/p&gt;
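&lt;p&gt;For example, here’s a sketch of pulling out only the &lt;code&gt;groupb&lt;/code&gt; rows and back-transforming estimates from the log-scale models, assuming (as here) that log-transformed variables have &lt;code&gt;&#34;log&#34;&lt;/code&gt; in their names.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;diffs = res_anova[res_anova$term == &amp;quot;groupb&amp;quot;, ]
# Back-transform only the estimates from log-scale models
diffs$ratio = ifelse( grepl(&amp;quot;log&amp;quot;, diffs$variable),
                      exp(diffs$estimate),
                      NA )&lt;/code&gt;&lt;/pre&gt;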
&lt;p&gt;Other output, such as AIC or estimated marginal means for more complicated models, can be extracted and saved in a similar way. Check out &lt;code&gt;broom::glance()&lt;/code&gt; for extracting AIC and other overall model results.&lt;/p&gt;
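&lt;p&gt;The same looping pattern works for &lt;code&gt;glance()&lt;/code&gt;, giving one row of overall results per model. A quick sketch:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;map_dfr(models, glance, .id = &amp;quot;variable&amp;quot;)&lt;/code&gt;&lt;/pre&gt;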
&lt;/div&gt;
&lt;div id=&#34;alternative-approach-to-fitting-many-models&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Alternative approach to fitting many models&lt;/h1&gt;
&lt;p&gt;When I am working with many response variables with widely varying ranges, it feels most natural to me to keep the different variables in different columns and loop through them as I have shown above. However, a reasonable alternative is to &lt;em&gt;reshape&lt;/em&gt; your dataset so all the values of all variables are in a single column. A second, categorical column will contain the variable names so we know which variable each row is associated with. Such reshaping is an example of making a &lt;em&gt;wide&lt;/em&gt; dataset into a &lt;em&gt;long&lt;/em&gt; dataset.&lt;/p&gt;
&lt;p&gt;Once your data are in a long format, you can use a list-columns approach for the analysis. You can see an example of this in &lt;a href=&#34;https://r4ds.had.co.nz/many-models.html#introduction-17&#34;&gt;Chapter 25: Many models&lt;/a&gt; of Grolemund and Wickham’s &lt;a href=&#34;https://r4ds.had.co.nz/&#34;&gt;R for Data Science book&lt;/a&gt;.&lt;/p&gt;
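&lt;p&gt;As a brief sketch of that reshaping step for this dataset, using &lt;code&gt;tidyr::pivot_longer()&lt;/code&gt; (package &lt;strong&gt;tidyr&lt;/strong&gt; isn’t loaded above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyr)
dat_long = pivot_longer(dat,
                        cols = c(&amp;quot;resp&amp;quot;, &amp;quot;slp&amp;quot;, &amp;quot;grad&amp;quot;),
                        names_to = &amp;quot;variable&amp;quot;,
                        values_to = &amp;quot;value&amp;quot;)
head(dat_long)&lt;/code&gt;&lt;/pre&gt;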
&lt;/div&gt;
&lt;div id=&#34;just-the-code-please&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Just the code, please&lt;/h1&gt;
&lt;p&gt;Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code &lt;a href=&#34;https://aosmith.rbind.io/script/2019-07-22-automate-model-fitting-with-loops.R&#34;&gt;from here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(purrr) # v. 0.3.4
library(ggplot2) # v. 3.3.5
library(patchwork) # v. 1.1.1
library(broom) # v. 0.7.10

dat = structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c(&amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;), class = &amp;quot;factor&amp;quot;), 
    resp = c(10.48, 9.87, 11.1, 8.56, 11.15, 9.53, 8.99, 10.06, 
    11.02, 10.57, 11.85, 10.11, 9.25, 11.66, 10.72, 8.34, 10.58, 
    10.47, 9.46, 11.13, 8.35, 9.69, 9.82, 11.47, 9.13, 11.53, 
    11.05, 11.03, 10.84, 10.22), slp = c(38.27, 46.33, 44.29, 
    35.57, 34.78, 47.81, 50.45, 46.31, 47.82, 42.07, 31.75, 65.65, 
    47.42, 41.51, 38.69, 47.84, 46.22, 50.66, 50.69, 44.09, 47.3, 
    52.53, 53.63, 53.38, 27.34, 51.83, 56.63, 32.99, 77.5, 38.24
    ), grad = c(0.3, 0.66, 0.57, 0.23, 0.31, 0.48, 0.5, 0.49, 
    2.41, 0.6, 0.27, 0.89, 2.43, 1.02, 2.17, 1.38, 0.17, 0.47, 
    1.1, 3.28, 6.14, 3.8, 4.35, 0.85, 1.13, 1.11, 2.93, 1.13, 
    4.52, 0.13)), class = &amp;quot;data.frame&amp;quot;, row.names = c(NA, -30L) )
head(dat)

ttest_fun = function(response) {
  form = reformulate(&amp;quot;group&amp;quot;, response = response)
  lm(form, data = dat)
}
ttest_fun(response = &amp;quot;resp&amp;quot;)

vars = names(dat)[2:4]
vars

vars = set_names(vars)
vars

models = vars %&amp;gt;%
     map(ttest_fun)
models

vars %&amp;gt;%
     set_names() %&amp;gt;%
     map(ttest_fun)

resid_plots = function(model, modelname) {
     output = augment(model)
     
     res.v.fit = ggplot(output, aes(x = .fitted, y = .resid) ) +
          geom_point() +
          theme_bw(base_size = 16)
     
     res.box = ggplot(output, aes(x = &amp;quot;&amp;quot;, y = .resid) ) +
          geom_boxplot() +
          theme_bw(base_size = 16) +
          labs(x = NULL)
     
     res.v.fit + res.box +
          plot_annotation(title = paste(&amp;quot;Residuals plots for&amp;quot;, modelname) )
}
resid_plots(model = models[[1]], modelname = names(models)[1])

residplots = imap(models, resid_plots)
residplots[[3]]

gradmod = ttest_fun(&amp;quot;log(grad)&amp;quot;)

models$log_grad = gradmod
models$grad = NULL
models

models[!names(models) %in% &amp;quot;slp&amp;quot;]

res_anova = map_dfr(models, tidy, conf.int = TRUE, .id = &amp;quot;variable&amp;quot;)
res_anova&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Many similar models - Part 1: How to make a function for model fitting</title>
      <link>https://aosmith.rbind.io/2019/06/24/function-for-model-fitting/</link>
      <pubDate>Mon, 24 Jun 2019 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2019/06/24/function-for-model-fitting/</guid>
      <description>
&lt;script src=&#34;https://aosmith.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;&lt;em&gt;This post was last updated on 2022-01-05.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I worked with several students over the last few months who were fitting many linear models, all with the same basic structure but different response variables. They were struggling to find an efficient way to do this in R while still taking the time to check model assumptions.&lt;/p&gt;
&lt;p&gt;A first step when working towards a more automated process for fitting many models is to learn how to build model formulas using &lt;code&gt;reformulate()&lt;/code&gt; or with &lt;code&gt;paste()&lt;/code&gt; and &lt;code&gt;as.formula()&lt;/code&gt;. Once we learn how to build model formulas we can create functions to streamline the model fitting process.&lt;/p&gt;
&lt;p&gt;I will be making residuals plots with &lt;strong&gt;ggplot2&lt;/strong&gt; today so will load it here.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2) # v.3.2.0&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#building-a-formula-with-reformulate&#34;&gt;Building a formula with reformulate()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-a-constructed-formula-in-lm&#34;&gt;Using a constructed formula in lm()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#making-a-function-for-model-fitting&#34;&gt;Making a function for model fitting&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-bare-names-instead-of-strings-i.e.-non-standard-evaluation&#34;&gt;Using bare names instead of strings (i.e., non-standard evaluation)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#building-a-formula-with-varying-explanatory-variables&#34;&gt;Building a formula with varying explanatory variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-dots-for-passing-many-variables-to-a-function&#34;&gt;The dots for passing many variables to a function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#example-function-that-returns-residuals-plots-and-model-output&#34;&gt;Example function that returns residuals plots and model output&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#next-step-looping&#34;&gt;Next step: looping&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#just-the-code-please&#34;&gt;Just the code, please&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;building-a-formula-with-reformulate&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Building a formula with reformulate()&lt;/h1&gt;
&lt;p&gt;Model formulas of the form &lt;code&gt;y ~ x&lt;/code&gt; can be built from variable names passed as &lt;em&gt;character strings&lt;/em&gt;, i.e., variable names with quotes around them.&lt;/p&gt;
&lt;p&gt;The function &lt;code&gt;reformulate()&lt;/code&gt; allows us to pass response and explanatory variables as character strings and returns them as a formula.&lt;/p&gt;
&lt;p&gt;Here is an example, using &lt;code&gt;mpg&lt;/code&gt; as the response variable and &lt;code&gt;am&lt;/code&gt; as the explanatory variable. Note the explanatory variable is passed to the first argument, &lt;code&gt;termlabels&lt;/code&gt;, and the response variable to &lt;code&gt;response&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;reformulate(termlabels = &amp;quot;am&amp;quot;, response = &amp;quot;mpg&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# mpg ~ am&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A common alternative to &lt;code&gt;reformulate()&lt;/code&gt; is to use &lt;code&gt;paste()&lt;/code&gt; with &lt;code&gt;as.formula()&lt;/code&gt;. I show this option below, but won’t discuss it more in this post.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;as.formula( paste(&amp;quot;mpg&amp;quot;, &amp;quot;~ am&amp;quot;) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# mpg ~ am&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;using-a-constructed-formula-in-lm&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using a constructed formula in lm()&lt;/h1&gt;
&lt;p&gt;Once we’ve built the formula we can put it in as the first argument of a model fitting function like &lt;code&gt;lm()&lt;/code&gt; in order to fit the model. I’ll be using the &lt;code&gt;mtcars&lt;/code&gt; dataset throughout the model fitting examples.&lt;/p&gt;
&lt;p&gt;Since &lt;code&gt;am&lt;/code&gt; is a 0/1 variable, this particular analysis is a two-sample t-test with &lt;code&gt;mpg&lt;/code&gt; as the response variable. I skipped writing out the name of the first argument to &lt;code&gt;reformulate()&lt;/code&gt; to save space, since the first argument is always the explanatory variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lm( reformulate(&amp;quot;am&amp;quot;, response = &amp;quot;mpg&amp;quot;), data = mtcars)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# 
# Call:
# lm(formula = reformulate(&amp;quot;am&amp;quot;, response = &amp;quot;mpg&amp;quot;), data = mtcars)
# 
# Coefficients:
# (Intercept)           am  
#      17.147        7.245&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;making-a-function-for-model-fitting&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Making a function for model fitting&lt;/h1&gt;
&lt;p&gt;Being able to build a formula is essential for making user-defined model fitting functions.&lt;/p&gt;
&lt;p&gt;For example, say I wanted to do the same t-test with &lt;code&gt;am&lt;/code&gt; for many response variables. I could create a function that takes the response variable as an argument and build the model formula within the function with &lt;code&gt;reformulate()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The response variable name is passed to the &lt;code&gt;response&lt;/code&gt; argument as a character string.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lm_fun = function(response) {
  lm( reformulate(&amp;quot;am&amp;quot;, response = response), data = mtcars)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here are two examples of this function in action, using &lt;code&gt;mpg&lt;/code&gt; and then &lt;code&gt;wt&lt;/code&gt; as the response variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lm_fun(response = &amp;quot;mpg&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# 
# Call:
# lm(formula = reformulate(&amp;quot;am&amp;quot;, response = response), data = mtcars)
# 
# Coefficients:
# (Intercept)           am  
#      17.147        7.245&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lm_fun(response = &amp;quot;wt&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# 
# Call:
# lm(formula = reformulate(&amp;quot;am&amp;quot;, response = response), data = mtcars)
# 
# Coefficients:
# (Intercept)           am  
#       3.769       -1.358&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;using-bare-names-instead-of-strings-i.e.-non-standard-evaluation&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using bare names instead of strings (i.e., non-standard evaluation)&lt;/h1&gt;
&lt;p&gt;As you can see, this approach to building formulas relies on character strings. This is going to be great once we start looping through variable names, but when making a function for interactive use it can be nice to let the user pass bare column names.&lt;/p&gt;
&lt;p&gt;We can use some &lt;code&gt;deparse()&lt;/code&gt;/&lt;code&gt;substitute()&lt;/code&gt; magic in the function for this. Those two functions will turn bare names into strings within the function rather than having the user pass strings directly.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lm_fun2 = function(response) {
  resp = deparse( substitute( response) )
  lm( reformulate(&amp;quot;am&amp;quot;, response = resp), data = mtcars)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here’s an example of this function in action. Note the use of the bare column name for the response variable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lm_fun2(response = mpg)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# 
# Call:
# lm(formula = reformulate(&amp;quot;am&amp;quot;, response = resp), data = mtcars)
# 
# Coefficients:
# (Intercept)           am  
#      17.147        7.245&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One side effect of using &lt;code&gt;reformulate()&lt;/code&gt; like this is that the formula in the model output shows the formula-building code instead of the actual variables used in the model.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Call:  
lm(formula = reformulate(&amp;quot;am&amp;quot;, response = resp), data = mtcars) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;While this often won’t matter in practice, there are ways to force the model to show the variables used in the model fitting. See &lt;a href=&#34;http://www.win-vector.com/blog/2018/09/r-tip-how-to-pass-a-formula-to-lm/&#34;&gt;this blog post&lt;/a&gt; for some discussion as well as code for how to do this.&lt;/p&gt;
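&lt;p&gt;One such approach, sketched here, is to build the call with &lt;code&gt;do.call()&lt;/code&gt; so the expanded formula is stored in the fitted model. Quoting the dataset name keeps the full data from being deparsed into the call.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;form = reformulate(&amp;quot;am&amp;quot;, response = &amp;quot;mpg&amp;quot;)
fit = do.call(&amp;quot;lm&amp;quot;, list(formula = form, data = quote(mtcars) ) )
fit$call&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# lm(formula = mpg ~ am, data = mtcars)&lt;/code&gt;&lt;/pre&gt;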
&lt;/div&gt;
&lt;div id=&#34;building-a-formula-with-varying-explanatory-variables&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Building a formula with varying explanatory variables&lt;/h1&gt;
&lt;p&gt;The formula building approach can also be used for fitting models where the explanatory variables vary. The explanatory variables should have plus signs between them on the right-hand side of the formula, which we can achieve by passing a vector of character strings to the first argument of &lt;code&gt;reformulate()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;expl = c(&amp;quot;am&amp;quot;, &amp;quot;disp&amp;quot;)
reformulate(expl, response = &amp;quot;mpg&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# mpg ~ am + disp&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s go through an example of using this in a function that can fit a model with different explanatory variables.&lt;/p&gt;
&lt;p&gt;In this function I demonstrate building the formula as a separate step and then passing it to &lt;code&gt;lm()&lt;/code&gt;. Some find this easier to read compared to building the formula within &lt;code&gt;lm()&lt;/code&gt; as a single step like I did earlier.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lm_fun_expl = function(expl) {
  form = reformulate(expl, response = &amp;quot;mpg&amp;quot;)
  lm(form, data = mtcars)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To use the function we pass a vector of variable names as strings to the &lt;code&gt;expl&lt;/code&gt; argument.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lm_fun_expl(expl = c(&amp;quot;am&amp;quot;, &amp;quot;disp&amp;quot;) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# 
# Call:
# lm(formula = form, data = mtcars)
# 
# Coefficients:
# (Intercept)           am         disp  
#    27.84808      1.83346     -0.03685&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;the-dots-for-passing-many-variables-to-a-function&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The dots for passing many variables to a function&lt;/h1&gt;
&lt;p&gt;Using dots (…) instead of named arguments can allow the user to list the explanatory variables separately instead of in a vector.&lt;/p&gt;
&lt;p&gt;I’ll demonstrate a function that uses dots to indicate an undefined number of additional arguments, so the user can put as many explanatory variables as desired into the model. I wrap the dots in &lt;code&gt;c()&lt;/code&gt; within the function to collect the variables into a single vector, which &lt;code&gt;reformulate()&lt;/code&gt; then joins with &lt;code&gt;+&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lm_fun_expl2 = function(...) {
  form = reformulate(c(...), response = &amp;quot;mpg&amp;quot;)
  lm(form, data = mtcars)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now variables are passed individually as strings separated by commas instead of as a vector.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lm_fun_expl2(&amp;quot;am&amp;quot;, &amp;quot;disp&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# 
# Call:
# lm(formula = form, data = mtcars)
# 
# Coefficients:
# (Intercept)           am         disp  
#    27.84808      1.83346     -0.03685&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;example-function-that-returns-residuals-plots-and-model-output&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Example function that returns residuals plots and model output&lt;/h1&gt;
&lt;p&gt;One of the reasons to make a function is to increase efficiency when fitting many models. For example, it might be useful to make a function that returns residual plots and any desired statistical results simultaneously.&lt;/p&gt;
&lt;p&gt;Here’s an example of such a function, using some of the tools covered above. The function takes the response variable as a bare name, fits a model with &lt;code&gt;am&lt;/code&gt; hard-coded as the explanatory variable and the &lt;code&gt;mtcars&lt;/code&gt; dataset, and then makes two residual plots.&lt;/p&gt;
&lt;p&gt;The function outputs a list that contains the two residuals plots as well as the overall &lt;span class=&#34;math inline&#34;&gt;\(F\)&lt;/span&gt; tests from the model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lm_modfit = function(response) {
  resp = deparse( substitute( response) )
  mod = lm( reformulate(&amp;quot;am&amp;quot;, response = resp), data = mtcars)
  resvfit = qplot(x = mod$fit, y = mod$res) + theme_bw()
  resdist = qplot(x = &amp;quot;Residual&amp;quot;, mod$res, geom = &amp;quot;boxplot&amp;quot;) + theme_bw()
  list(resvfit, resdist, anova(mod) )
}

mpgfit = lm_modfit(mpg)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Individual parts of the output list can be extracted as needed. To check model assumptions prior to looking at any results we’d pull out the two plots, which are the first two elements of the output list.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mpgfit[1:2]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [[1]]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-06-24-how-to-make-a-function-for-model-fitting_files/figure-html/fullfunout1-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# 
# [[2]]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-06-24-how-to-make-a-function-for-model-fitting_files/figure-html/fullfunout1-2.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;If we deem the model fit acceptable we can extract the overall &lt;span class=&#34;math inline&#34;&gt;\(F\)&lt;/span&gt; tests from the third element of the output.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mpgfit[[3]]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Analysis of Variance Table
# 
# Response: mpg
#           Df Sum Sq Mean Sq F value   Pr(&amp;gt;F)    
# am         1 405.15  405.15   16.86 0.000285 ***
# Residuals 30 720.90   24.03                     
# ---
# Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1&lt;/code&gt;&lt;/pre&gt;
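&lt;p&gt;Since the &lt;code&gt;anova()&lt;/code&gt; output stored in the list is a data frame, single values can be pulled out by row and column name. Here’s a quick sketch of that idea (the indexing below is illustrative and not part of the original function):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;aovtab = mpgfit[[3]]
aovtab[&amp;quot;am&amp;quot;, &amp;quot;F value&amp;quot;] # F statistic for am
aovtab[&amp;quot;am&amp;quot;, &amp;quot;Pr(&amp;gt;F)&amp;quot;]  # corresponding p-value&lt;/code&gt;&lt;/pre&gt;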
&lt;/div&gt;
&lt;div id=&#34;next-step-looping&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Next step: looping&lt;/h1&gt;
&lt;p&gt;This post focused on using &lt;code&gt;reformulate()&lt;/code&gt; for building model formulas and then making user-defined functions for interactive use. When working with many models we’d likely want to automate the process more by using some sort of looping. I wrote a follow-up post on looping through variables and fitting models with the &lt;code&gt;map&lt;/code&gt; family of functions from package &lt;strong&gt;purrr&lt;/strong&gt;, which you can see &lt;a href=&#34;https://aosmith.rbind.io/2019/07/22/automate-model-fitting-with-loops/&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
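&lt;p&gt;As a small preview of that looping approach, here’s a minimal sketch (the response variables and anonymous helper function are illustrative) that fits the same model for several responses with &lt;code&gt;map()&lt;/code&gt; from &lt;strong&gt;purrr&lt;/strong&gt;. Note it passes the responses as strings, so it uses &lt;code&gt;reformulate()&lt;/code&gt; directly rather than the bare-name version of the function:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(purrr) # for map()

responses = c(&amp;quot;mpg&amp;quot;, &amp;quot;wt&amp;quot;, &amp;quot;hp&amp;quot;)
allfits = map(responses, function(resp) {
     lm( reformulate(&amp;quot;am&amp;quot;, response = resp), data = mtcars)
})
names(allfits) = responses&lt;/code&gt;&lt;/pre&gt;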
&lt;/div&gt;
&lt;div id=&#34;just-the-code-please&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Just the code, please&lt;/h1&gt;
&lt;p&gt;Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code &lt;a href=&#34;https://aosmith.rbind.io/script/2019-06-24-how-to-make-a-function-for-model-fitting.R&#34;&gt;from here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2) # v.3.2.0

reformulate(termlabels = &amp;quot;am&amp;quot;, response = &amp;quot;mpg&amp;quot;)

as.formula( paste(&amp;quot;mpg&amp;quot;, &amp;quot;~ am&amp;quot;) )

lm( reformulate(&amp;quot;am&amp;quot;, response = &amp;quot;mpg&amp;quot;), data = mtcars)

lm_fun = function(response) {
  lm( reformulate(&amp;quot;am&amp;quot;, response = response), data = mtcars)
}

lm_fun(response = &amp;quot;mpg&amp;quot;)
lm_fun(response = &amp;quot;wt&amp;quot;)

lm_fun2 = function(response) {
  resp = deparse( substitute( response) )
  lm( reformulate(&amp;quot;am&amp;quot;, response = resp), data = mtcars)
}

lm_fun2(response = mpg)

expl = c(&amp;quot;am&amp;quot;, &amp;quot;disp&amp;quot;)
reformulate(expl, response = &amp;quot;mpg&amp;quot;)

lm_fun_expl = function(expl) {
  form = reformulate(expl, response = &amp;quot;mpg&amp;quot;)
  lm(form, data = mtcars)
}

lm_fun_expl(expl = c(&amp;quot;am&amp;quot;, &amp;quot;disp&amp;quot;) )

lm_fun_expl2 = function(...) {
  form = reformulate(c(...), response = &amp;quot;mpg&amp;quot;)
  lm(form, data = mtcars)
}

lm_fun_expl2(&amp;quot;am&amp;quot;, &amp;quot;disp&amp;quot;)

lm_modfit = function(response) {
  resp = deparse( substitute( response) )
  mod = lm( reformulate(&amp;quot;am&amp;quot;, response = resp), data = mtcars)
  resvfit = qplot(x = mod$fit, y = mod$res) + theme_bw()
  resdist = qplot(x = &amp;quot;Residual&amp;quot;, mod$res, geom = &amp;quot;boxplot&amp;quot;) + theme_bw()
  list(resvfit, resdist, anova(mod) )
}

mpgfit = lm_modfit(mpg)
mpgfit[1:2]
mpgfit[[3]]&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>The small multiples plot: how to combine ggplot2 plots with one shared axis</title>
      <link>https://aosmith.rbind.io/2019/05/13/small-multiples-plot/</link>
      <pubDate>Mon, 13 May 2019 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2019/05/13/small-multiples-plot/</guid>
      <description>


&lt;p&gt;There are a variety of ways to combine &lt;strong&gt;ggplot2&lt;/strong&gt; plots with a single shared axis. However, things can get tricky if you want a lot of control over all plot elements.&lt;/p&gt;
&lt;p&gt;I demonstrate four different approaches for this:&lt;br /&gt;
1. Using facets, which is built in to &lt;strong&gt;ggplot2&lt;/strong&gt; but doesn’t allow much control over the non-shared axes.&lt;br /&gt;
2. Using package &lt;strong&gt;cowplot&lt;/strong&gt;, which has a lot of nice features but the plot spacing doesn’t play well with a single shared axis.&lt;br /&gt;
3. Using package &lt;strong&gt;egg&lt;/strong&gt;.&lt;br /&gt;
4. Using package &lt;strong&gt;patchwork&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The last two packages allow nice spacing for plots with a shared axis.&lt;/p&gt;
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#load-r-packages&#34;&gt;Load R packages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-set-up&#34;&gt;The set-up&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-facets-for-small-multiples&#34;&gt;Using facets for small multiples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-cowplot-to-combine-plots&#34;&gt;Using &lt;strong&gt;cowplot&lt;/strong&gt; to combine plots&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-egg-to-combine-plots&#34;&gt;Using &lt;strong&gt;egg&lt;/strong&gt; to combine plots&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#adding-plot-labels-with-tag_facet&#34;&gt;Adding plot labels with tag_facet()&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-patchwork-to-combine-plots&#34;&gt;Using &lt;strong&gt;patchwork&lt;/strong&gt; to combine plots&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#adding-plots-labels-with-plot_annotation&#34;&gt;Adding plot labels with plot_annotation()&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#just-the-code-please&#34;&gt;Just the code, please&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;load-r-packages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Load R packages&lt;/h1&gt;
&lt;p&gt;I’ll be plotting with &lt;strong&gt;ggplot2&lt;/strong&gt;, reshaping with &lt;strong&gt;tidyr&lt;/strong&gt;, and combining plots with packages &lt;strong&gt;egg&lt;/strong&gt; and &lt;strong&gt;patchwork&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;I’ll also be using package &lt;strong&gt;cowplot&lt;/strong&gt; version 0.9.4 to combine individual plots into one, but will use the package functions via &lt;code&gt;cowplot::&lt;/code&gt; instead of loading the package. (I believe the next version of &lt;strong&gt;cowplot&lt;/strong&gt; will not be so opinionated about the theme.)&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2) # v. 3.1.1
library(tidyr) # v. 0.8.3
library(egg) # v. 0.4.2
library(patchwork) # v. 1.0.0&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;the-set-up&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The set-up&lt;/h1&gt;
&lt;p&gt;Here’s the scenario: we have one response variable (&lt;code&gt;resp&lt;/code&gt;) that we want to plot against three other variables and combine them into a single “small multiples” plot.&lt;/p&gt;
&lt;p&gt;I’ll call the three variables &lt;code&gt;elev&lt;/code&gt;, &lt;code&gt;grad&lt;/code&gt;, and &lt;code&gt;slp&lt;/code&gt;. You’ll note that I created these variables to have very different scales.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(16)
dat = data.frame(elev = round( runif(20, 100, 500), 1),
                 resp = round( runif(20, 0, 10), 1),
                 grad = round( runif(20, 0, 1), 2),
                 slp = round( runif(20, 0, 35),1) )
head(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#    elev resp grad  slp
# 1 373.2  9.7 0.05  8.8
# 2 197.6  8.1 0.42 33.3
# 3 280.0  5.4 0.38 19.3
# 4 191.8  4.3 0.07 29.6
# 5 445.4  2.3 0.43 16.5
# 6 224.5  6.5 0.78  4.1&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;using-facets-for-small-multiples&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using facets for small multiples&lt;/h1&gt;
&lt;p&gt;One good option when we want to make a similar plot for different groups (in this case, different variables) is to use &lt;em&gt;faceting&lt;/em&gt; to make different panels within the same plot.&lt;/p&gt;
&lt;p&gt;Since the three variables are currently in separate columns we’ll need to &lt;em&gt;reshape&lt;/em&gt; the dataset prior to plotting. I’ll use &lt;code&gt;gather()&lt;/code&gt; from &lt;strong&gt;tidyr&lt;/strong&gt; for this.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;datlong = gather(dat, key = &amp;quot;variable&amp;quot;, value = &amp;quot;value&amp;quot;, -resp)
head(datlong)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#   resp variable value
# 1  9.7     elev 373.2
# 2  8.1     elev 197.6
# 3  5.4     elev 280.0
# 4  4.3     elev 191.8
# 5  2.3     elev 445.4
# 6  6.5     elev 224.5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now I can use &lt;code&gt;facet_wrap()&lt;/code&gt; to make a separate scatterplot of &lt;code&gt;resp&lt;/code&gt; vs each variable. The argument &lt;code&gt;scales = &#34;free_x&#34;&lt;/code&gt; allows the x axis scales to differ for each variable but leaves a single y axis.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(datlong, aes(x = value, y = resp) ) +
     geom_point() +
     theme_bw() +
     facet_wrap(~variable, scales = &amp;quot;free_x&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-05-13-small-multiples-plot_files/figure-html/unnamed-chunk-4-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;I can use the facet strips to give the appearance of axis labels, as shown in &lt;a href=&#34;https://stackoverflow.com/a/37574221/2461552&#34;&gt;this Stack Overflow answer&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(datlong, aes(x = value, y = resp) ) +
     geom_point() +
     theme_bw() +
     facet_wrap(~variable, scales = &amp;quot;free_x&amp;quot;, strip.position = &amp;quot;bottom&amp;quot;) +
     theme(strip.background = element_blank(),
           strip.placement = &amp;quot;outside&amp;quot;) +
     labs(x = NULL)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-05-13-small-multiples-plot_files/figure-html/unnamed-chunk-5-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;That’s a pretty nice plot to start with. However, controlling the axis breaks in the individual panels &lt;a href=&#34;https://stackoverflow.com/questions/51735481/ggplot2-change-axis-limits-for-each-individual-facet-panel&#34;&gt;can be complicated&lt;/a&gt;, which is something we’d commonly want to do.&lt;/p&gt;
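&lt;p&gt;One partial workaround, sketched here as an aside (it isn’t from the original code), is to pass a function to &lt;code&gt;breaks&lt;/code&gt;. With free scales the function is applied to each panel’s limits separately, so every panel gets a similar number of breaks without hard-coding them:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(datlong, aes(x = value, y = resp) ) +
     geom_point() +
     theme_bw() +
     facet_wrap(~variable, scales = &amp;quot;free_x&amp;quot;) +
     scale_x_continuous(breaks = function(limits) pretty(limits, n = 4) )&lt;/code&gt;&lt;/pre&gt;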
&lt;p&gt;In that case, it may make more sense to create separate plots and then combine them into a small multiples plot with an add-on package.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;using-cowplot-to-combine-plots&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using &lt;strong&gt;cowplot&lt;/strong&gt; to combine plots&lt;/h1&gt;
&lt;p&gt;Package &lt;strong&gt;cowplot&lt;/strong&gt; is a really nice package for combining plots, and has lots of bells and whistles along with some pretty thorough &lt;a href=&#34;https://cran.r-project.org/web/packages/cowplot/index.html&#34;&gt;vignettes&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The first step is to make each of the three plots separately. If doing lots of these we’d want to use some sort of loop to make a list of plots &lt;a href=&#34;https://aosmith.rbind.io/2018/08/20/automating-exploratory-plots/&#34;&gt;as I’ve demonstrated previously&lt;/a&gt;. Today I’m going to make the three plots manually.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;elevplot = ggplot(dat, aes(x = elev, y = resp) ) +
     geom_point() +
     theme_bw()

gradplot = ggplot(dat, aes(x = grad, y = resp) ) +
     geom_point() +
     theme_bw() +
     scale_x_continuous(breaks = seq(0, 1, by = 0.2) )

slpplot = ggplot(dat, aes(x = slp, y = resp) ) +
     geom_point() +
     theme_bw() +
     scale_x_continuous(breaks = seq(0, 35, by = 5) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The function &lt;code&gt;plot_grid()&lt;/code&gt; in &lt;strong&gt;cowplot&lt;/strong&gt; is for combining plots. To make a single row of plots I use &lt;code&gt;nrow = 1&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;labels&lt;/code&gt; argument puts separate labels on each panel for captioning.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cowplot::plot_grid(elevplot, 
                   gradplot, 
                   slpplot,
                   nrow = 1,
                   labels = &amp;quot;auto&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-05-13-small-multiples-plot_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;But we want a single shared y axis, not a separate y axis on each plot. I’ll remake the combined plot, this time removing the y axis elements from all but the first plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cowplot::plot_grid(elevplot, 
                   gradplot + theme(axis.text.y = element_blank(),
                                    axis.ticks.y = element_blank(),
                                    axis.title.y = element_blank() ), 
                   slpplot + theme(axis.text.y = element_blank(),
                                    axis.ticks.y = element_blank(),
                                    axis.title.y = element_blank() ),
                   nrow = 1,
                   labels = &amp;quot;auto&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-05-13-small-multiples-plot_files/figure-html/unnamed-chunk-8-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;This makes the panels different sizes, though, which isn’t ideal. To have all the plots the same width I need to align them vertically with &lt;code&gt;align = &#34;v&#34;&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cowplot::plot_grid(elevplot, 
                   gradplot + 
                        theme(axis.text.y = element_blank(),
                              axis.ticks.y = element_blank(),
                              axis.title.y = element_blank() ), 
                   slpplot + 
                        theme(axis.text.y = element_blank(),
                              axis.ticks.y = element_blank(),
                              axis.title.y = element_blank() ),
                   nrow = 1,
                   labels = &amp;quot;auto&amp;quot;,
                   align = &amp;quot;v&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-05-13-small-multiples-plot_files/figure-html/unnamed-chunk-9-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;But, unfortunately, this puts the axis space back between the plots to make them all the same width. It turns out that &lt;strong&gt;cowplot&lt;/strong&gt; isn’t really made for plots with a single shared axis. The &lt;strong&gt;cowplot&lt;/strong&gt; package author points us to package &lt;strong&gt;egg&lt;/strong&gt; for this &lt;a href=&#34;https://stackoverflow.com/a/47615304/2461552&#34;&gt;in this Stack Overflow answer&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;using-egg-to-combine-plots&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using &lt;strong&gt;egg&lt;/strong&gt; to combine plots&lt;/h1&gt;
&lt;p&gt;Package &lt;strong&gt;egg&lt;/strong&gt; is another nice alternative for combining plots into a small multiples plot. The function in this package for combining plots is called &lt;code&gt;ggarrange()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Here are the three plots again.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;elevplot = ggplot(dat, aes(x = elev, y = resp) ) +
     geom_point() +
     theme_bw()

gradplot = ggplot(dat, aes(x = grad, y = resp) ) +
     geom_point() +
     theme_bw() +
     scale_x_continuous(breaks = seq(0, 1, by = 0.2) )

slpplot = ggplot(dat, aes(x = slp, y = resp) ) +
     geom_point() +
     theme_bw() +
     scale_x_continuous(breaks = seq(0, 35, by = 5) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;ggarrange()&lt;/code&gt; function has an &lt;code&gt;nrow&lt;/code&gt; argument so I can keep the plots in a single row.&lt;/p&gt;
&lt;p&gt;The panel spacing is automagically the same here after I remove the y axis elements, and things look pretty nice right out of the box.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggarrange(elevplot, 
          gradplot + 
               theme(axis.text.y = element_blank(),
                     axis.ticks.y = element_blank(),
                     axis.title.y = element_blank() ), 
          slpplot + 
               theme(axis.text.y = element_blank(),
                     axis.ticks.y = element_blank(),
                     axis.title.y = element_blank() ),
          nrow = 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-05-13-small-multiples-plot_files/figure-html/unnamed-chunk-11-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can bring the panels closer together by removing some of the space around the plot margins with the &lt;code&gt;plot.margin&lt;/code&gt; argument in &lt;code&gt;theme()&lt;/code&gt;. I’ll set the spacing for the right margin of the first plot, both the left and right margins of the second, and the left margin of the third.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggarrange(elevplot +
               theme(axis.ticks.y = element_blank(),
                     plot.margin = margin(r = 1) ), 
          gradplot + 
               theme(axis.text.y = element_blank(),
                     axis.ticks.y = element_blank(),
                     axis.title.y = element_blank(),
                     plot.margin = margin(r = 1, l = 1) ), 
          slpplot + 
               theme(axis.text.y = element_blank(),
                     axis.ticks.y = element_blank(),
                     axis.title.y = element_blank(),
                     plot.margin = margin(l = 1)  ),
          nrow = 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-05-13-small-multiples-plot_files/figure-html/unnamed-chunk-12-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;adding-plot-labels-with-tag_facet&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Adding plot labels with tag_facet()&lt;/h2&gt;
&lt;p&gt;You’ll see there is a &lt;code&gt;labels&lt;/code&gt; argument in the &lt;code&gt;ggarrange()&lt;/code&gt; documentation, but it didn’t work well for me out of the box when only one plot has a y axis. However, we can get tricky with &lt;code&gt;egg::tag_facet()&lt;/code&gt; if we add a facet strip to each of the individual plots.&lt;/p&gt;
&lt;p&gt;It’d make sense to build these plots outside of &lt;code&gt;ggarrange()&lt;/code&gt; and then add the tags and combine them instead of nesting everything like I did here, since the code is now a little hard to follow.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggarrange(tag_facet(elevplot +
                          theme(axis.ticks.y = element_blank(),
                                plot.margin = margin(r = 1) ) +
                          facet_wrap(~&amp;quot;elev&amp;quot;),
                     tag_pool = &amp;quot;a&amp;quot;), 
          tag_facet(gradplot + 
                          theme(axis.text.y = element_blank(),
                                axis.ticks.y = element_blank(),
                                axis.title.y = element_blank(),
                                plot.margin = margin(r = 1, l = 1) ) +
                          facet_wrap(~&amp;quot;grad&amp;quot;), 
                    tag_pool = &amp;quot;b&amp;quot; ), 
          tag_facet(slpplot + 
                          theme(axis.text.y = element_blank(),
                                axis.ticks.y = element_blank(),
                                axis.title.y = element_blank(),
                                plot.margin = margin(l = 1)  ) +
                          facet_wrap(~&amp;quot;slp&amp;quot;),
                     tag_pool = &amp;quot;c&amp;quot;),
          nrow = 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-05-13-small-multiples-plot_files/figure-html/unnamed-chunk-13-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We might want to add a right y axis to the right-most plot. In that case we’d want to set the axis tick length to 0 via &lt;code&gt;theme()&lt;/code&gt; elements. This can be done &lt;a href=&#34;https://github.com/tidyverse/ggplot2/pull/2934&#34;&gt;separately per axis in the development version of &lt;strong&gt;ggplot2&lt;/strong&gt;&lt;/a&gt;, and will be included in version 3.2.0.&lt;/p&gt;
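&lt;p&gt;Here’s a rough sketch of that idea (not code from the original post): add a duplicated right y axis with &lt;code&gt;dup_axis()&lt;/code&gt;, suppress its title and labels, and then zero out the right-side tick length with the per-axis &lt;code&gt;theme()&lt;/code&gt; element available as of &lt;strong&gt;ggplot2&lt;/strong&gt; 3.2.0:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;slpplot +
     scale_y_continuous(sec.axis = dup_axis(name = NULL, labels = NULL) ) +
     theme(axis.ticks.length.y.right = grid::unit(0, &amp;quot;pt&amp;quot;) )&lt;/code&gt;&lt;/pre&gt;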
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;using-patchwork-to-combine-plots&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using &lt;strong&gt;patchwork&lt;/strong&gt; to combine plots&lt;/h1&gt;
&lt;p&gt;Package &lt;strong&gt;patchwork&lt;/strong&gt; is another great option for combining plots, and is now on CRAN (as of December 2019) 🎉. It has nice vignettes &lt;a href=&#34;https://patchwork.data-imaginist.com/articles/&#34;&gt;here&lt;/a&gt; to help you get started.&lt;/p&gt;
&lt;p&gt;In &lt;strong&gt;patchwork&lt;/strong&gt; the &lt;code&gt;+&lt;/code&gt; operator is used to add plots together. Here’s an example, combining my three original plots.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;elevplot + gradplot + slpplot&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-05-13-small-multiples-plot_files/figure-html/unnamed-chunk-14-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;If we give the resulting combined plot a name, we can remove the titles from the last two subplots using double-bracket indexing. (Of course, I also could have built the plots how I wanted them in the first place. 😜)&lt;/p&gt;
&lt;p&gt;The result has nice spacing for a single, shared y axis. Margins can be controlled the same way as in the &lt;strong&gt;egg&lt;/strong&gt; example above.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;patchwork = elevplot + gradplot + slpplot

# Remove title from second subplot
patchwork[[2]] = patchwork[[2]] + theme(axis.text.y = element_blank(),
                                        axis.ticks.y = element_blank(),
                                        axis.title.y = element_blank() )

# Remove title from third subplot
patchwork[[3]] = patchwork[[3]] + theme(axis.text.y = element_blank(),
                                        axis.ticks.y = element_blank(),
                                        axis.title.y = element_blank() )

patchwork&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-05-13-small-multiples-plot_files/figure-html/unnamed-chunk-15-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;adding-plots-labels-with-plot_annotation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Adding plot labels with plot_annotation()&lt;/h2&gt;
&lt;p&gt;There are many annotation options in &lt;strong&gt;patchwork&lt;/strong&gt;. I’ll focus on adding tags, but see the &lt;a href=&#34;https://patchwork.data-imaginist.com/articles/guides/annotation.html&#34;&gt;annotation vignette&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Tags can be added with the &lt;code&gt;tag_levels&lt;/code&gt; argument in &lt;code&gt;plot_annotation()&lt;/code&gt;. I want lowercase Latin letters, so I use &lt;code&gt;&#34;a&#34;&lt;/code&gt; as my &lt;code&gt;tag_levels&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Plot tags go outside the plot by default. You can control the position at least somewhat with the &lt;code&gt;theme&lt;/code&gt; option &lt;code&gt;plot.tag.position&lt;/code&gt;, which works on the individual subplots and not the entire combined plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;patchwork + plot_annotation(tag_levels = &amp;quot;a&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-05-13-small-multiples-plot_files/figure-html/unnamed-chunk-16-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
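&lt;p&gt;For example, here’s a hedged sketch (the coordinates are illustrative) that uses the &lt;strong&gt;patchwork&lt;/strong&gt; &lt;code&gt;&amp;amp;&lt;/code&gt; operator to set &lt;code&gt;plot.tag.position&lt;/code&gt; on every subplot at once, nudging the tags inside the panels:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;patchwork2 = patchwork &amp;amp; theme(plot.tag.position = c(0.2, 0.95) )
patchwork2 + plot_annotation(tag_levels = &amp;quot;a&amp;quot;)&lt;/code&gt;&lt;/pre&gt;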
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;just-the-code-please&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Just the code, please&lt;/h1&gt;
&lt;p&gt;Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code &lt;a href=&#34;https://aosmith.rbind.io/script/2019-05-13-small-multiples-plot.R&#34;&gt;from here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2) # v. 3.1.1
library(tidyr) # v. 0.8.3
library(egg) # v. 0.4.2
library(patchwork) # v. 1.0.0

set.seed(16)
dat = data.frame(elev = round( runif(20, 100, 500), 1),
                 resp = round( runif(20, 0, 10), 1),
                 grad = round( runif(20, 0, 1), 2),
                 slp = round( runif(20, 0, 35),1) )
head(dat)

datlong = gather(dat, key = &amp;quot;variable&amp;quot;, value = &amp;quot;value&amp;quot;, -resp)
head(datlong)

ggplot(datlong, aes(x = value, y = resp) ) +
     geom_point() +
     theme_bw() +
     facet_wrap(~variable, scales = &amp;quot;free_x&amp;quot;)

ggplot(datlong, aes(x = value, y = resp) ) +
     geom_point() +
     theme_bw() +
     facet_wrap(~variable, scales = &amp;quot;free_x&amp;quot;, strip.position = &amp;quot;bottom&amp;quot;) +
     theme(strip.background = element_blank(),
           strip.placement = &amp;quot;outside&amp;quot;) +
     labs(x = NULL)

elevplot = ggplot(dat, aes(x = elev, y = resp) ) +
     geom_point() +
     theme_bw()

gradplot = ggplot(dat, aes(x = grad, y = resp) ) +
     geom_point() +
     theme_bw() +
     scale_x_continuous(breaks = seq(0, 1, by = 0.2) )

slpplot = ggplot(dat, aes(x = slp, y = resp) ) +
     geom_point() +
     theme_bw() +
     scale_x_continuous(breaks = seq(0, 35, by = 5) )

cowplot::plot_grid(elevplot, 
                   gradplot, 
                   slpplot,
                   nrow = 1,
                   labels = &amp;quot;auto&amp;quot;)

cowplot::plot_grid(elevplot, 
                   gradplot + theme(axis.text.y = element_blank(),
                                    axis.ticks.y = element_blank(),
                                    axis.title.y = element_blank() ), 
                   slpplot + theme(axis.text.y = element_blank(),
                                    axis.ticks.y = element_blank(),
                                    axis.title.y = element_blank() ),
                   nrow = 1,
                   labels = &amp;quot;auto&amp;quot;)

cowplot::plot_grid(elevplot, 
                   gradplot + 
                        theme(axis.text.y = element_blank(),
                              axis.ticks.y = element_blank(),
                              axis.title.y = element_blank() ), 
                   slpplot + 
                        theme(axis.text.y = element_blank(),
                              axis.ticks.y = element_blank(),
                              axis.title.y = element_blank() ),
                   nrow = 1,
                   labels = &amp;quot;auto&amp;quot;,
                   align = &amp;quot;v&amp;quot;)

elevplot = ggplot(dat, aes(x = elev, y = resp) ) +
     geom_point() +
     theme_bw()

gradplot = ggplot(dat, aes(x = grad, y = resp) ) +
     geom_point() +
     theme_bw() +
     scale_x_continuous(breaks = seq(0, 1, by = 0.2) )

slpplot = ggplot(dat, aes(x = slp, y = resp) ) +
     geom_point() +
     theme_bw() +
     scale_x_continuous(breaks = seq(0, 35, by = 5) )

ggarrange(elevplot, 
          gradplot + 
               theme(axis.text.y = element_blank(),
                     axis.ticks.y = element_blank(),
                     axis.title.y = element_blank() ), 
          slpplot + 
               theme(axis.text.y = element_blank(),
                     axis.ticks.y = element_blank(),
                     axis.title.y = element_blank() ),
          nrow = 1)

ggarrange(elevplot +
               theme(axis.ticks.y = element_blank(),
                     plot.margin = margin(r = 1) ), 
          gradplot + 
               theme(axis.text.y = element_blank(),
                     axis.ticks.y = element_blank(),
                     axis.title.y = element_blank(),
                     plot.margin = margin(r = 1, l = 1) ), 
          slpplot + 
               theme(axis.text.y = element_blank(),
                     axis.ticks.y = element_blank(),
                     axis.title.y = element_blank(),
                     plot.margin = margin(l = 1)  ),
          nrow = 1)

ggarrange(tag_facet(elevplot +
                          theme(axis.ticks.y = element_blank(),
                                plot.margin = margin(r = 1) ) +
                          facet_wrap(~&amp;quot;elev&amp;quot;),
                     tag_pool = &amp;quot;a&amp;quot;), 
          tag_facet(gradplot + 
                          theme(axis.text.y = element_blank(),
                                axis.ticks.y = element_blank(),
                                axis.title.y = element_blank(),
                                plot.margin = margin(r = 1, l = 1) ) +
                          facet_wrap(~&amp;quot;grad&amp;quot;), 
                    tag_pool = &amp;quot;b&amp;quot; ), 
          tag_facet(slpplot + 
                          theme(axis.text.y = element_blank(),
                                axis.ticks.y = element_blank(),
                                axis.title.y = element_blank(),
                                plot.margin = margin(l = 1)  ) +
                          facet_wrap(~&amp;quot;slp&amp;quot;),
                     tag_pool = &amp;quot;c&amp;quot;),
          nrow = 1)

elevplot + gradplot + slpplot

patchwork = elevplot + gradplot + slpplot

# Remove title from second subplot
patchwork[[2]] = patchwork[[2]] + theme(axis.text.y = element_blank(),
                                        axis.ticks.y = element_blank(),
                                        axis.title.y = element_blank() )

# Remove title from third subplot
patchwork[[3]] = patchwork[[3]] + theme(axis.text.y = element_blank(),
                                        axis.ticks.y = element_blank(),
                                        axis.title.y = element_blank() )

patchwork

patchwork + plot_annotation(tag_levels = &amp;quot;a&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Embedding subplots in ggplot2 graphics</title>
      <link>https://aosmith.rbind.io/2019/04/22/embedding-subplots/</link>
      <pubDate>Mon, 22 Apr 2019 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2019/04/22/embedding-subplots/</guid>
      <description>


&lt;p&gt;The idea of embedded plots for visualizing a large dataset that has an overplotting problem recently came up in some discussions with students. I first learned about embedded graphics from package &lt;strong&gt;ggsubplot&lt;/strong&gt;. You can still see &lt;a href=&#34;https://blog.revolutionanalytics.com/2012/09/visualize-complex-data-with-subplots.html&#34;&gt;an old post&lt;/a&gt; about that package and about embedded graphics in general, with examples. However, &lt;strong&gt;ggsubplot&lt;/strong&gt; is no longer maintained and doesn’t work with current versions of &lt;strong&gt;ggplot2&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;I poked around a bit, and found that &lt;code&gt;annotation_custom()&lt;/code&gt; is the go-to function for embedding plots in a &lt;strong&gt;ggplot2&lt;/strong&gt; graphic. I found a couple of recent examples for how to tackle making such plots on Stack Overflow &lt;a href=&#34;https://stackoverflow.com/a/44125392/2461552&#34;&gt;here&lt;/a&gt; and &lt;a href=&#34;https://stackoverflow.com/a/45417727/2461552&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I’m going to work through an example of embedding subplots using the same kind of looping approach outlined in those answers.&lt;/p&gt;
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#r-packages&#34;&gt;R packages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#cutting-continuous-variables-into-evenly-spaced-categories&#34;&gt;Cutting continuous variables into evenly-spaced categories&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#categorizing-the-axis-variables&#34;&gt;Categorizing the axis variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#extracting-the-coordinates-for-each-subplot&#34;&gt;Extracting the coordinates for each subplot&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#bar-plot-subplots&#34;&gt;Bar plot subplots&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#creating-all-the-bar-plot-subplots&#34;&gt;Creating all the bar plot subplots&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#putting-the-subplots-into-annotation_custom&#34;&gt;Putting the subplots into annotation_custom()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#making-the-large-plot&#34;&gt;Making the large plot&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#embedding-the-bar-plot-subplots&#34;&gt;Embedding the bar plot subplots&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#histogram-subplots&#34;&gt;Histogram subplots&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#get-the-histograms-ready-to-embed&#34;&gt;Get the histograms ready to embed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#embed-the-histogram-subplots&#34;&gt;Embed the histogram subplots&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#a-density-subplot-example&#34;&gt;A density subplot example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#filled-density-plots&#34;&gt;Filled density plots&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#just-the-code-please&#34;&gt;Just the code, please&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;r-packages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;R packages&lt;/h1&gt;
&lt;p&gt;First I’ll load the R packages I’m using today. All plotting is done via &lt;strong&gt;ggplot2&lt;/strong&gt;, I do data manipulation with &lt;strong&gt;dplyr&lt;/strong&gt; and &lt;strong&gt;tidyr&lt;/strong&gt;, and &lt;strong&gt;purrr&lt;/strong&gt; is for looping to make the subplots and then for getting the subplots into &lt;code&gt;annotation_custom()&lt;/code&gt; layers.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2) # 3.1.1
suppressPackageStartupMessages( library(dplyr) ) # 0.8.0.1
library(tidyr) # 0.8.3
library(purrr) # 0.3.2&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;cutting-continuous-variables-into-evenly-spaced-categories&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Cutting continuous variables into evenly-spaced categories&lt;/h1&gt;
&lt;p&gt;The first step in this process is to bin the continuous variables that will be on the axes of the larger plot, creating a separate dataset for each subplot.&lt;/p&gt;
&lt;p&gt;I thought it made the most sense to make all the subplots the same size in the final plot and so I wanted to make evenly sized &lt;em&gt;bins&lt;/em&gt; or groups. The range of values in each bin can then be based on the total range of the variable of interest and the desired number of groups.&lt;/p&gt;
&lt;p&gt;Binning into even-length groups is a job for &lt;code&gt;cut()&lt;/code&gt;. I’m going to need the minimum and maximum value of each group to place the subplots along the axes of the larger plot, so rather than using &lt;code&gt;cut()&lt;/code&gt; directly I made a function built around it. While information on the range of values encompassed by a group can be pulled from the default &lt;code&gt;cut()&lt;/code&gt; bin labels, I didn’t like how &lt;code&gt;cut()&lt;/code&gt; rounded those values.&lt;/p&gt;
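&lt;p&gt;For comparison, here is a quick look at the default &lt;code&gt;cut()&lt;/code&gt; labels, which round the break values and pad the outer breaks slightly. (This is only an illustration of the rounding issue, not part of the workflow.)&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Default cut() labels round the break values,
# which is why cuteven() builds its own labels
levels( cut(iris$Sepal.Length, breaks = 3) )&lt;/code&gt;&lt;/pre&gt;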
&lt;p&gt;My function, &lt;code&gt;cuteven()&lt;/code&gt;, takes a continuous variable and returns a variable cut into &lt;code&gt;ngroups&lt;/code&gt; bins. The labels for the new groups are the unrounded minimum and maximum value within each group, with the values separated by commas.&lt;/p&gt;
&lt;p&gt;I use &lt;code&gt;include.lowest = TRUE&lt;/code&gt; in &lt;code&gt;cut()&lt;/code&gt; to make sure the minimum value in the dataset is included in the first group.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cuteven = function(variable, ngroups) {
     seq_all = seq(min(variable), max(variable), length.out = ngroups + 1)
     cut(variable,
         breaks = seq_all,
         labels = paste(seq_all[-(ngroups + 1)], seq_all[-1], sep = &amp;quot;,&amp;quot;),
         include.lowest = TRUE)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I’ll test the function by cutting &lt;code&gt;Sepal.Length&lt;/code&gt; from the &lt;code&gt;iris&lt;/code&gt; dataset into 3 groups. The resulting categorical variable has three levels, each labeled with the minimum and maximum value of its group.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;with(iris, cuteven(Sepal.Length, ngroups = 3) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#   [1] 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5
#   [9] 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 5.5,6.7 5.5,6.7
#  [17] 4.3,5.5 4.3,5.5 5.5,6.7 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5
#  [25] 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5
#  [33] 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5
#  [41] 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5
#  [49] 4.3,5.5 4.3,5.5 6.7,7.9 5.5,6.7 6.7,7.9 4.3,5.5 5.5,6.7 5.5,6.7
#  [57] 5.5,6.7 4.3,5.5 5.5,6.7 4.3,5.5 4.3,5.5 5.5,6.7 5.5,6.7 5.5,6.7
#  [65] 5.5,6.7 5.5,6.7 5.5,6.7 5.5,6.7 5.5,6.7 5.5,6.7 5.5,6.7 5.5,6.7
#  [73] 5.5,6.7 5.5,6.7 5.5,6.7 5.5,6.7 6.7,7.9 5.5,6.7 5.5,6.7 5.5,6.7
#  [81] 4.3,5.5 4.3,5.5 5.5,6.7 5.5,6.7 4.3,5.5 5.5,6.7 5.5,6.7 5.5,6.7
#  [89] 5.5,6.7 4.3,5.5 4.3,5.5 5.5,6.7 5.5,6.7 4.3,5.5 5.5,6.7 5.5,6.7
#  [97] 5.5,6.7 5.5,6.7 4.3,5.5 5.5,6.7 5.5,6.7 5.5,6.7 6.7,7.9 5.5,6.7
# [105] 5.5,6.7 6.7,7.9 4.3,5.5 6.7,7.9 5.5,6.7 6.7,7.9 5.5,6.7 5.5,6.7
# [113] 6.7,7.9 5.5,6.7 5.5,6.7 5.5,6.7 5.5,6.7 6.7,7.9 6.7,7.9 5.5,6.7
# [121] 6.7,7.9 5.5,6.7 6.7,7.9 5.5,6.7 5.5,6.7 6.7,7.9 5.5,6.7 5.5,6.7
# [129] 5.5,6.7 6.7,7.9 6.7,7.9 6.7,7.9 5.5,6.7 5.5,6.7 5.5,6.7 6.7,7.9
# [137] 5.5,6.7 5.5,6.7 5.5,6.7 6.7,7.9 5.5,6.7 6.7,7.9 5.5,6.7 6.7,7.9
# [145] 5.5,6.7 5.5,6.7 5.5,6.7 5.5,6.7 5.5,6.7 5.5,6.7
# Levels: 4.3,5.5 5.5,6.7 6.7,7.9&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;categorizing-the-axis-variables&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Categorizing the axis variables&lt;/h1&gt;
&lt;p&gt;While these embedded plots can be useful for large datasets, I’m going to demonstrate the process on a relatively small dataset. Here I will embed subplots on a larger plot based on the &lt;code&gt;iris&lt;/code&gt; data. The variable &lt;code&gt;Sepal.Length&lt;/code&gt; will be on the x axis and &lt;code&gt;Petal.Length&lt;/code&gt; on the y axis.&lt;/p&gt;
&lt;p&gt;My first step is to categorize those variables with &lt;code&gt;cuteven()&lt;/code&gt;. I’m going to make three groups for &lt;code&gt;Sepal.Length&lt;/code&gt; and four groups for &lt;code&gt;Petal.Length&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I cut both variables within &lt;code&gt;mutate()&lt;/code&gt; and add them to &lt;code&gt;iris&lt;/code&gt;. I give the new variables generic names that indicate which variable is on the &lt;code&gt;x&lt;/code&gt; axis and which on the &lt;code&gt;y&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;iris = mutate(iris,
                group_x = cuteven(Sepal.Length, 3),
                group_y = cuteven(Petal.Length, 4) )

glimpse(iris)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Observations: 150
# Variables: 7
# $ Sepal.Length &amp;lt;dbl&amp;gt; 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,...
# $ Sepal.Width  &amp;lt;dbl&amp;gt; 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,...
# $ Petal.Length &amp;lt;dbl&amp;gt; 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,...
# $ Petal.Width  &amp;lt;dbl&amp;gt; 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,...
# $ Species      &amp;lt;fct&amp;gt; setosa, setosa, setosa, setosa, setosa, setosa, s...
# $ group_x      &amp;lt;fct&amp;gt; &amp;quot;4.3,5.5&amp;quot;, &amp;quot;4.3,5.5&amp;quot;, &amp;quot;4.3,5.5&amp;quot;, &amp;quot;4.3,5.5&amp;quot;, &amp;quot;4.3,...
# $ group_y      &amp;lt;fct&amp;gt; &amp;quot;1,2.475&amp;quot;, &amp;quot;1,2.475&amp;quot;, &amp;quot;1,2.475&amp;quot;, &amp;quot;1,2.475&amp;quot;, &amp;quot;1,2....&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;extracting-the-coordinates-for-each-subplot&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Extracting the coordinates for each subplot&lt;/h1&gt;
&lt;p&gt;I need the minimum and maximum value per group for each axis variable in order to place the subplot in the larger plot with &lt;code&gt;annotation_custom()&lt;/code&gt;. Since the labels of the new variables contain this info separated by a comma, I can use &lt;code&gt;separate()&lt;/code&gt; to extract the coordinate information from the labels into separate columns.&lt;/p&gt;
&lt;p&gt;Since I have two group variables, one for each axis, I end up using &lt;code&gt;separate()&lt;/code&gt; twice. I again make the names of the new columns in &lt;code&gt;into&lt;/code&gt; based on which axis I’ll be plotting that variable on.&lt;/p&gt;
&lt;p&gt;When this step is complete I’ll have coordinates to indicate where each corner of a subplot will be placed within the larger plot. Unique combinations of the four coordinate variables define each group I want to make a subplot for.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;iris = iris %&amp;gt;%
     separate(group_x, into = c(&amp;quot;min_x&amp;quot;, &amp;quot;max_x&amp;quot;), 
              sep = &amp;quot;,&amp;quot;, convert = TRUE) %&amp;gt;%
     separate(group_y, into = c(&amp;quot;min_y&amp;quot;, &amp;quot;max_y&amp;quot;), 
              sep = &amp;quot;,&amp;quot;, convert = TRUE)

glimpse(iris)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Observations: 150
# Variables: 9
# $ Sepal.Length &amp;lt;dbl&amp;gt; 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,...
# $ Sepal.Width  &amp;lt;dbl&amp;gt; 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,...
# $ Petal.Length &amp;lt;dbl&amp;gt; 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,...
# $ Petal.Width  &amp;lt;dbl&amp;gt; 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,...
# $ Species      &amp;lt;fct&amp;gt; setosa, setosa, setosa, setosa, setosa, setosa, s...
# $ min_x        &amp;lt;dbl&amp;gt; 4.3, 4.3, 4.3, 4.3, 4.3, 4.3, 4.3, 4.3, 4.3, 4.3,...
# $ max_x        &amp;lt;dbl&amp;gt; 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5,...
# $ min_y        &amp;lt;dbl&amp;gt; 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
# $ max_y        &amp;lt;dbl&amp;gt; 2.475, 2.475, 2.475, 2.475, 2.475, 2.475, 2.475, ...&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;bar-plot-subplots&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Bar plot subplots&lt;/h1&gt;
&lt;p&gt;Next I’m going to figure out what I want my subplots to look like. I’m going to start with bar plots to count up the number of each species in each group.&lt;/p&gt;
&lt;p&gt;Since I will be making many similar plots I’ll create a function to use for the plotting. I always work out what I want the plot to look like on a single subset of the data before making the function.&lt;/p&gt;
&lt;p&gt;In this case, I want all the plots to have the same x and y axes. The y axis of my bar plot is based on counts, so I need to calculate the maximum count of any single species within a group so I can set the upper y axis limit for all plots to that value.&lt;/p&gt;
&lt;p&gt;The maximum count is &lt;code&gt;47&lt;/code&gt;, so that will be my upper axis limit. Bar plots start at &lt;code&gt;0&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;iris %&amp;gt;%
     group_by(min_x, max_x, min_y, max_y, Species) %&amp;gt;%
     count() %&amp;gt;%
     ungroup() %&amp;gt;%
     filter(n == max(n) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# # A tibble: 1 x 6
#   min_x max_x min_y max_y Species     n
#   &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;fct&amp;gt;   &amp;lt;int&amp;gt;
# 1   4.3   5.5     1  2.48 setosa     47&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In case a species is missing from one of the subplot groups I’ll define the &lt;code&gt;limits&lt;/code&gt; for the x axis in &lt;code&gt;scale_x_discrete()&lt;/code&gt;. This forces each plot to have the same x axis breaks.&lt;/p&gt;
&lt;p&gt;I’ll be removing all axis labels, etc., via &lt;code&gt;theme_void()&lt;/code&gt; so that the subplots fit nicely into the larger plot. I will add an outline around the plot, though.&lt;/p&gt;
&lt;p&gt;I use &lt;code&gt;fill&lt;/code&gt; to color the bars by species since there will be no x axis labels on the subplots. I suppress the legend, as well, and will add a legend to the large plot instead.&lt;/p&gt;
&lt;p&gt;I set explicit colors in &lt;code&gt;scale_fill_manual()&lt;/code&gt; (colors taken &lt;a href=&#34;http://colorspace.r-forge.r-project.org/articles/hcl_palettes.html#qualitative-palettes&#34;&gt;from here&lt;/a&gt;) so all the subplots will have the same color scheme.&lt;/p&gt;
&lt;p&gt;Here is my test plot for one group. This particular group only has one species in it.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475), 
       aes(x = Species, fill = Species) ) +
     geom_bar() +
     theme_void() +
     scale_x_discrete(limits = c(&amp;quot;setosa&amp;quot;, &amp;quot;versicolor&amp;quot;, &amp;quot;virginica&amp;quot;) ) +
     scale_fill_manual(values = c(&amp;quot;setosa&amp;quot; = &amp;quot;#ED90A4&amp;quot;, 
                                  &amp;quot;versicolor&amp;quot; = &amp;quot;#ABB150&amp;quot;,
                                  &amp;quot;virginica&amp;quot; = &amp;quot;#00C1B2&amp;quot;),
                       guide  = &amp;quot;none&amp;quot;) +
     theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                       fill = &amp;quot;transparent&amp;quot;) ) +
     ylim(0, 47)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-04-22-embedding-subplots_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Once I have the plot worked out for one group I put the code into a function, &lt;code&gt;barfun&lt;/code&gt;. In this case the function takes only a dataset, since I’m hard-coding in all the axis variables, limits, etc.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;barfun = function(data) {
     ggplot(data = data, 
            aes(x = Species, fill = Species) ) +
          geom_bar() +
          theme_void() +
          scale_x_discrete(limits = c(&amp;quot;setosa&amp;quot;, &amp;quot;versicolor&amp;quot;, &amp;quot;virginica&amp;quot;) ) +
          scale_fill_manual(values = c(&amp;quot;setosa&amp;quot; = &amp;quot;#ED90A4&amp;quot;, 
                                       &amp;quot;versicolor&amp;quot; = &amp;quot;#ABB150&amp;quot;,
                                       &amp;quot;virginica&amp;quot; = &amp;quot;#00C1B2&amp;quot;),
                            guide  = &amp;quot;none&amp;quot;) +
          theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                            fill = &amp;quot;transparent&amp;quot;) ) +
          ylim(0, 47) 
  
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Does this function make the same plot I made manually? Yep. 👍&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;barfun(data = filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-04-22-embedding-subplots_files/figure-html/unnamed-chunk-9-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;creating-all-the-bar-plot-subplots&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Creating all the bar plot subplots&lt;/h1&gt;
&lt;p&gt;I’m ready to make the subplots!&lt;/p&gt;
&lt;p&gt;I’ll loop through each subset of data and plot it with my function.&lt;/p&gt;
&lt;p&gt;Since I’m going to need those coordinates for subplot placement later I decided that the most straightforward way to do this is to group by the unique combinations of coordinates and then &lt;em&gt;nest&lt;/em&gt; the dataset. When nesting, the data to be plotted for each group is placed in a column called &lt;code&gt;data&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I loop through each dataset in &lt;code&gt;data&lt;/code&gt; via &lt;code&gt;map()&lt;/code&gt; within &lt;code&gt;mutate()&lt;/code&gt;. The new column containing the plots is named &lt;code&gt;subplots&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;allplots = iris %&amp;gt;%
     group_by_at( vars( matches(&amp;quot;min|max&amp;quot;) ) ) %&amp;gt;%
     group_nest() %&amp;gt;%
     mutate(subplots = map(data, barfun) )

allplots&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# # A tibble: 9 x 6
#   min_x max_x min_y max_y data              subplots
#   &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;list&amp;gt;            &amp;lt;list&amp;gt;  
# 1   4.3   5.5  1     2.48 &amp;lt;tibble [47 x 5]&amp;gt; &amp;lt;S3: gg&amp;gt;
# 2   4.3   5.5  2.48  3.95 &amp;lt;tibble [7 x 5]&amp;gt;  &amp;lt;S3: gg&amp;gt;
# 3   4.3   5.5  3.95  5.42 &amp;lt;tibble [5 x 5]&amp;gt;  &amp;lt;S3: gg&amp;gt;
# 4   5.5   6.7  1     2.48 &amp;lt;tibble [3 x 5]&amp;gt;  &amp;lt;S3: gg&amp;gt;
# 5   5.5   6.7  2.48  3.95 &amp;lt;tibble [4 x 5]&amp;gt;  &amp;lt;S3: gg&amp;gt;
# 6   5.5   6.7  3.95  5.42 &amp;lt;tibble [51 x 5]&amp;gt; &amp;lt;S3: gg&amp;gt;
# 7   5.5   6.7  5.42  6.9  &amp;lt;tibble [13 x 5]&amp;gt; &amp;lt;S3: gg&amp;gt;
# 8   6.7   7.9  3.95  5.42 &amp;lt;tibble [5 x 5]&amp;gt;  &amp;lt;S3: gg&amp;gt;
# 9   6.7   7.9  5.42  6.9  &amp;lt;tibble [15 x 5]&amp;gt; &amp;lt;S3: gg&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here’s a couple of the plots. The first one should look familiar.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;allplots$subplots[[1]]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-04-22-embedding-subplots_files/figure-html/unnamed-chunk-11-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;allplots$subplots[[6]]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-04-22-embedding-subplots_files/figure-html/unnamed-chunk-11-2.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;putting-the-subplots-into-annotation_custom&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Putting the subplots into annotation_custom()&lt;/h1&gt;
&lt;p&gt;Next I need to make each of these plots a &lt;em&gt;grob&lt;/em&gt; (graphical object) and pass it into &lt;code&gt;annotation_custom()&lt;/code&gt;. The coordinates for each subplot will be passed to the &lt;code&gt;xmin&lt;/code&gt;, &lt;code&gt;xmax&lt;/code&gt;, &lt;code&gt;ymin&lt;/code&gt;, &lt;code&gt;ymax&lt;/code&gt; arguments in &lt;code&gt;annotation_custom()&lt;/code&gt;, which indicate where each subplot should be placed in the larger plot.&lt;/p&gt;
&lt;p&gt;Since I want to pass the &lt;code&gt;subplots&lt;/code&gt; column as well as the four coordinates columns to &lt;code&gt;annotation_custom()&lt;/code&gt; I will loop through &lt;code&gt;allplots&lt;/code&gt; &lt;em&gt;row-wise&lt;/em&gt;. I can use the &lt;code&gt;pmap()&lt;/code&gt; function for looping through the rows of a dataset.&lt;/p&gt;
&lt;p&gt;Since I’ll be working with five columns simultaneously in this step I decided to make a function prior to looping, which I name &lt;code&gt;grobfun&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Notice I put my arguments of the function in the same order as they appear in the dataset and &lt;em&gt;they have the same names as the columns of the dataset&lt;/em&gt;. I did this on purpose for ease of working with &lt;code&gt;pmap()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;grobfun = function(min_x, max_x, min_y, max_y, subplots) {
     annotation_custom(ggplotGrob(subplots),
                       xmin = min_x, ymin = min_y,
                       xmax = max_x, ymax = max_y)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I no longer need the &lt;code&gt;data&lt;/code&gt; column, so I remove it to end up with only the columns I need to pass to the &lt;code&gt;grobfun()&lt;/code&gt; function. This isn’t strictly necessary, but I find it makes working with &lt;code&gt;pmap()&lt;/code&gt; easier.&lt;/p&gt;
&lt;p&gt;The dot, &lt;code&gt;.&lt;/code&gt;, indicates I’m passing the entire dataset to &lt;code&gt;pmap()&lt;/code&gt; and so looping through it row-wise.&lt;/p&gt;
&lt;p&gt;I find this can take a little time to run when doing many subplots.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;( allgrobs = allplots %&amp;gt;%
     select(-data) %&amp;gt;%
     mutate(grobs = pmap(., grobfun) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# # A tibble: 9 x 6
#   min_x max_x min_y max_y subplots grobs              
#   &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;list&amp;gt;   &amp;lt;list&amp;gt;             
# 1   4.3   5.5  1     2.48 &amp;lt;S3: gg&amp;gt; &amp;lt;S3: LayerInstance&amp;gt;
# 2   4.3   5.5  2.48  3.95 &amp;lt;S3: gg&amp;gt; &amp;lt;S3: LayerInstance&amp;gt;
# 3   4.3   5.5  3.95  5.42 &amp;lt;S3: gg&amp;gt; &amp;lt;S3: LayerInstance&amp;gt;
# 4   5.5   6.7  1     2.48 &amp;lt;S3: gg&amp;gt; &amp;lt;S3: LayerInstance&amp;gt;
# 5   5.5   6.7  2.48  3.95 &amp;lt;S3: gg&amp;gt; &amp;lt;S3: LayerInstance&amp;gt;
# 6   5.5   6.7  3.95  5.42 &amp;lt;S3: gg&amp;gt; &amp;lt;S3: LayerInstance&amp;gt;
# 7   5.5   6.7  5.42  6.9  &amp;lt;S3: gg&amp;gt; &amp;lt;S3: LayerInstance&amp;gt;
# 8   6.7   7.9  3.95  5.42 &amp;lt;S3: gg&amp;gt; &amp;lt;S3: LayerInstance&amp;gt;
# 9   6.7   7.9  5.42  6.9  &amp;lt;S3: gg&amp;gt; &amp;lt;S3: LayerInstance&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;making-the-large-plot&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Making the large plot&lt;/h1&gt;
&lt;p&gt;I haven’t actually made the larger plot I’m going to embed subplots into yet. This will be a blank plot with &lt;code&gt;Sepal.Length&lt;/code&gt; on the x axis and &lt;code&gt;Petal.Length&lt;/code&gt; on the y axis with a legend for &lt;code&gt;fill&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Adding on the overall legend to a blank plot involves a little trick with &lt;code&gt;geom_col()&lt;/code&gt;, which was demonstrated in those Stack Overflow posts. (Thank goodness for SO, since I would have never figured it out otherwise 😜.)&lt;/p&gt;
&lt;p&gt;Here is the plot I will embed the subplots into.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;( largeplot = ggplot(iris, aes(x = Sepal.Length, 
                               y = Petal.Length, 
                               fill = Species) ) +
       geom_blank() +
       geom_col( aes(Inf, Inf) ) +
       scale_fill_manual(values = c(&amp;quot;setosa&amp;quot; = &amp;quot;#ED90A4&amp;quot;, 
                                    &amp;quot;versicolor&amp;quot; = &amp;quot;#ABB150&amp;quot;,
                                    &amp;quot;virginica&amp;quot; = &amp;quot;#00C1B2&amp;quot;) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Warning: Removed 149 rows containing missing values (geom_col).&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-04-22-embedding-subplots_files/figure-html/unnamed-chunk-14-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;embedding-the-bar-plot-subplots&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Embedding the bar plot subplots&lt;/h1&gt;
&lt;p&gt;Last step! I can now add the list of subplots in &lt;code&gt;annotation_custom()&lt;/code&gt; to the larger plot. 🎉&lt;/p&gt;
&lt;p&gt;There was a little extra space on the y axis that I removed by setting the axis limits.&lt;/p&gt;
&lt;p&gt;I think this looks nice with evenly spaced subplots but there may be times that having unevenly spaced subplots is desirable. In that case you could cut the variables into uneven groups.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;largeplot +
     allgrobs$grobs +
     ylim(1, NA)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Warning: Removed 150 rows containing missing values (geom_col).&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-04-22-embedding-subplots_files/figure-html/unnamed-chunk-15-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;After polishing this plot up a bit as desired, the final plot can be saved with &lt;code&gt;ggsave()&lt;/code&gt;.&lt;/p&gt;
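&lt;p&gt;For example, with an arbitrary file name and size (adjust these to taste):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# The file name and dimensions here are placeholders
final_plot = largeplot +
     allgrobs$grobs +
     ylim(1, NA)

ggsave(&amp;quot;embedded_subplots.png&amp;quot;, final_plot, width = 7, height = 5, dpi = 300)&lt;/code&gt;&lt;/pre&gt;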
&lt;/div&gt;
&lt;div id=&#34;histogram-subplots&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Histogram subplots&lt;/h1&gt;
&lt;p&gt;I’ve seen quite a few examples of embedded plots with bar plot or pie chart subplots to show patterns in the distribution of some categorical variable across the axis variables. But there’s no reason we can’t show the distribution of a third continuous variable.&lt;/p&gt;
&lt;p&gt;I decided to make a histogram of the variable &lt;code&gt;Petal.Width&lt;/code&gt; for the same subplot groups I used above. I found a lot of little details to work through when making subplots showing the distribution of a continuous variable.&lt;/p&gt;
&lt;p&gt;I want the x axis of each plot to encompass the entire range of &lt;code&gt;Petal.Width&lt;/code&gt;, so my first step was to pull out that information. These values will be the x axis limits (with some extra added to make sure all the histogram bars fit) and the limits for the continuous legend.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;range(iris$Petal.Width)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 0.1 2.5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I found the y axis to be a little trickier. I decided to show the bars as proportions of the maximum count with &lt;code&gt;ncount&lt;/code&gt; instead of as a raw count. Since the height of bars is then a proportion of maximum instead of a count I added the sample size per group to the plot as text. I ended up putting this in a facet strip rather than within the plot. I went back and forth a bunch and am still not sure the final result is exactly what I want.&lt;/p&gt;
&lt;p&gt;I’ll base the color of the bars on &lt;code&gt;Petal.Width&lt;/code&gt;. This can be done with &lt;code&gt;fill = stat(x)&lt;/code&gt;, which refers to the bin midpoints calculated by &lt;code&gt;geom_histogram()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;To make sure each plot has the same number of bars I set the &lt;code&gt;binwidth&lt;/code&gt;. I chose to make each bar 0.2 units wide, since the entire range of &lt;code&gt;Petal.Width&lt;/code&gt; (2.4 units) is evenly divisible by that number. I also use &lt;code&gt;center&lt;/code&gt; so the first bar is centered at 0.1, the minimum value in the dataset.&lt;/p&gt;
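&lt;p&gt;A quick check of that arithmetic:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# 2.4 units of range / 0.2 units per bar = 12 bars
diff( range(iris$Petal.Width) ) / 0.2&lt;/code&gt;&lt;/pre&gt;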
&lt;p&gt;I pad the x axis limits with half the &lt;code&gt;binwidth&lt;/code&gt; value to make sure all the bars will show in every plot. Getting the limits correct can be hard; see, e.g., &lt;a href=&#34;https://github.com/tidyverse/ggplot2/issues/3265&#34;&gt;this GitHub issue&lt;/a&gt; if you are seeing warnings that you don’t think are correct.&lt;/p&gt;
&lt;p&gt;Here is my example plot for the first group. Since I was careful with my plot limits the warning message is spurious.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475),
       aes(x = Petal.Width, y = stat(ncount), fill = stat(x) ) ) +
     geom_histogram(binwidth = .2, center = .1) +
     theme_void(base_size = 14) +
     scale_x_continuous(limits = c(0.1 - .1, 2.5 + .1) ) +
     scale_fill_continuous(type = &amp;quot;viridis&amp;quot;,
                           guide  = &amp;quot;none&amp;quot;,
                           limits = c(.1, 2.5) ) +
     facet_wrap(~paste0(&amp;quot;n = &amp;quot;, nrow(filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475) ) ) ) +
     theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                       fill = &amp;quot;transparent&amp;quot;) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-04-22-embedding-subplots_files/figure-html/unnamed-chunk-17-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;And here’s the function to make histograms for each subplot dataset, which I name &lt;code&gt;histfun&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;histfun = function(data) {
     ggplot(data = data,
            aes(x = Petal.Width, y = stat(ncount), fill = stat(x) ) ) +
          geom_histogram(binwidth = .2, center = .1) +
          theme_void(base_size = 14) +
          scale_x_continuous(limits = c(0.1 - .1, 2.5 + .1) ) +
          scale_fill_continuous(type = &amp;quot;viridis&amp;quot;,
                                guide  = &amp;quot;none&amp;quot;,
                                limits = c(.1, 2.5) ) +
          facet_wrap(~paste0(&amp;quot;n = &amp;quot;, nrow(data) ) ) +
          theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                            fill = &amp;quot;transparent&amp;quot;) )
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;get-the-histograms-ready-to-embed&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Get the histograms ready to embed&lt;/h1&gt;
&lt;p&gt;This time I’ll make the subplots with &lt;code&gt;histfun&lt;/code&gt; and then put them into &lt;code&gt;annotation_custom()&lt;/code&gt; with &lt;code&gt;grobfun&lt;/code&gt; in one pipe chain.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;allgrobs_hist = iris %&amp;gt;%
     group_by_at( vars( matches(&amp;quot;min|max&amp;quot;) ) ) %&amp;gt;%
     group_nest() %&amp;gt;%
     mutate(subplots = map(data, histfun) ) %&amp;gt;%
     select(-data) %&amp;gt;%
     mutate(grobs = pmap(., grobfun) )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;embed-the-histogram-subplots&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Embed the histogram subplots&lt;/h1&gt;
&lt;p&gt;This time the large plot needs a continuous legend. I set the &lt;code&gt;breaks&lt;/code&gt; so the minimum and maximum value are included on the legend. How well this works will depend on the range of your dataset.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;( largeplot2 = ggplot(iris, aes(x = Sepal.Length, 
                                y = Petal.Length, 
                                fill = Petal.Width) ) +
       geom_blank() +
       geom_col( aes(Inf, Inf) ) +
       scale_fill_continuous(type = &amp;quot;viridis&amp;quot;,
                             limits = c(.1, 2.5),
                             breaks = seq(.1, 2.5, by = .8) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Warning: Removed 149 rows containing missing values (geom_col).&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-04-22-embedding-subplots_files/figure-html/unnamed-chunk-20-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;And, finally, here’s the plot embedded with the &lt;code&gt;Petal.Width&lt;/code&gt; distribution plots. 😃&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;largeplot2 +
     allgrobs_hist$grobs +
     ylim(1, NA)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Warning: Removed 150 rows containing missing values (geom_col).&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-04-22-embedding-subplots_files/figure-html/unnamed-chunk-21-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;a-density-subplot-example&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;A density subplot example&lt;/h1&gt;
&lt;p&gt;I thought the histogram subplots ended up being pretty tricky, what with having to figure out the plot limits and the bar widths and centers.&lt;/p&gt;
&lt;p&gt;Density plots are another possibility for showing the distribution of continuous data, and the color of the line can be allowed to vary.&lt;/p&gt;
&lt;p&gt;Here’s an example of what a density plot could look like. In some scenarios using &lt;code&gt;trim = TRUE&lt;/code&gt; may be useful in &lt;code&gt;stat_density()&lt;/code&gt;. You’ll notice I set the x axis limits to the minimum and maximum value in the dataset with no padding for this plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475),
       aes(x = Petal.Width, y = stat(ndensity), color = stat(x) ) ) +
     stat_density(geom = &amp;quot;line&amp;quot;, size = 1.25) +
     theme_void(base_size = 14) +
     scale_x_continuous(limits = c(0.1, 2.5),
                        expand = c(0, 0) ) +
     scale_color_viridis_c(guide  = &amp;quot;none&amp;quot;,
                           limits = c(.1, 2.5) ) +
     facet_wrap(~paste0(&amp;quot;n = &amp;quot;, nrow(filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475) ) ) ) +
     theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                       fill = &amp;quot;transparent&amp;quot;) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-04-22-embedding-subplots_files/figure-html/unnamed-chunk-22-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Turning the plot code into a function for looping through groups.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;densfun = function(data) {
     ggplot(data = data,
            aes(x = Petal.Width, y = stat(ndensity), color = stat(x) ) ) +
          stat_density(geom = &amp;quot;line&amp;quot;, size = 1.25) +
          theme_void(base_size = 14) +
          scale_x_continuous(limits = c(0.1, 2.5),
                             expand = c(0, 0) ) +
          scale_color_viridis_c(guide  = &amp;quot;none&amp;quot;,
                                limits = c(.1, 2.5) ) +
          facet_wrap(~paste0(&amp;quot;n = &amp;quot;, nrow(data) ) ) +
          theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                            fill = &amp;quot;transparent&amp;quot;) )
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Loop through the groups to create the subplots and get them ready to add to the plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;allgrobs_dens = iris %&amp;gt;%
     group_by_at( vars( matches(&amp;quot;min|max&amp;quot;) ) ) %&amp;gt;%
     group_nest() %&amp;gt;%
     mutate(subplots = map(data, densfun) ) %&amp;gt;%
     select(-data) %&amp;gt;%
     mutate(grobs = pmap(., grobfun) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And here are the density subplots embedded in the large plot. Not too bad! I’m guessing the density plot approach would be most useful for larger sample sizes. 😺&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;largeplot2 +
     allgrobs_dens$grobs +
     ylim(1, NA)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Warning: Removed 150 rows containing missing values (geom_col).&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-04-22-embedding-subplots_files/figure-html/unnamed-chunk-25-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;filled-density-plots&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Filled density plots&lt;/h1&gt;
&lt;p&gt;I belatedly realized that we can use &lt;code&gt;geom_density_ridges_gradient()&lt;/code&gt; from package &lt;strong&gt;ggridges&lt;/strong&gt; to make density plots with continuous fill.&lt;/p&gt;
&lt;p&gt;Since this package is really for &lt;a href=&#34;https://cran.r-project.org/web/packages/ggridges/vignettes/introduction.html&#34;&gt;ridge plots&lt;/a&gt;, I use &lt;code&gt;y = 1&lt;/code&gt; to get a single density plot. This geom uses a relative scale by default so &lt;code&gt;stat(ndensity)&lt;/code&gt; isn’t needed.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggridges) # v 0.5.1
ggplot(data = filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475),
       aes(x = Petal.Width, y = 1, fill = stat(x) ) ) +
     geom_density_ridges_gradient() +
     theme_void(base_size = 14) +
     scale_x_continuous(limits = c(0.1, 2.5),
                        expand = c(0, 0) ) +
     scale_fill_viridis_c(guide  = &amp;quot;none&amp;quot;,
                           limits = c(.1, 2.5) ) +
     facet_wrap(~paste0(&amp;quot;n = &amp;quot;, nrow(filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475) ) ) ) +
     theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                       fill = &amp;quot;transparent&amp;quot;) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-04-22-embedding-subplots_files/figure-html/unnamed-chunk-26-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Let’s see how that looks as embedded plots. I’ll do the whole process in one code chunk, taking some extra time to move the legend around in the final plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;densfun2 = function(data) {
  ggplot(data = data,
         aes(x = Petal.Width, y = 1, fill = stat(x) ) ) +
    geom_density_ridges_gradient() +
    theme_void(base_size = 14) +
    scale_x_continuous(limits = c(0.1, 2.5),
                       expand = c(0, 0) ) +
    scale_fill_viridis_c(guide  = &amp;quot;none&amp;quot;,
                         limits = c(.1, 2.5) ) +
    facet_wrap(~paste0(&amp;quot;n = &amp;quot;, nrow(data) ) ) +
    theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                      fill = &amp;quot;transparent&amp;quot;) )
}
allgrobs_dens2 = iris %&amp;gt;%
    group_by_at( vars( matches(&amp;quot;min|max&amp;quot;) ) ) %&amp;gt;%
    group_nest() %&amp;gt;%
    mutate(subplots = map(data, densfun2) ) %&amp;gt;%
    select(-data) %&amp;gt;%
    mutate(grobs = pmap(., grobfun) )

largeplot2 +
    allgrobs_dens2$grobs +
    ylim(1, NA) +
    theme_bw() +
    theme(legend.direction = &amp;quot;horizontal&amp;quot;,
          legend.position = c(.8, .25),
          legend.background = element_blank() ) +
    guides(fill = guide_colorbar(title.position = &amp;quot;top&amp;quot;) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Warning: Removed 150 rows containing missing values (geom_col).&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-04-22-embedding-subplots_files/figure-html/unnamed-chunk-27-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;just-the-code-please&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Just the code, please&lt;/h1&gt;
&lt;p&gt;Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code &lt;a href=&#34;https://aosmith.rbind.io/script/2019-04-22-embedding-subplots.R&#34;&gt;from here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2) # 3.1.1
suppressPackageStartupMessages( library(dplyr) ) # 0.8.0.1
library(tidyr) # 0.8.3
library(purrr) # 0.3.2

cuteven = function(variable, ngroups) {
     seq_all = seq(min(variable), max(variable), length.out = ngroups + 1)
     cut(variable,
         breaks = seq_all,
         labels = paste(seq_all[-(ngroups + 1)], seq_all[-1], sep = &amp;quot;,&amp;quot;),
         include.lowest = TRUE)
}

with(iris, cuteven(Sepal.Length, ngroups = 3) )

iris = mutate(iris,
                group_x = cuteven(Sepal.Length, 3),
                group_y = cuteven(Petal.Length, 4) )

glimpse(iris)

iris = iris %&amp;gt;%
     separate(group_x, into = c(&amp;quot;min_x&amp;quot;, &amp;quot;max_x&amp;quot;), 
              sep = &amp;quot;,&amp;quot;, convert = TRUE) %&amp;gt;%
     separate(group_y, into = c(&amp;quot;min_y&amp;quot;, &amp;quot;max_y&amp;quot;), 
              sep = &amp;quot;,&amp;quot;, convert = TRUE)

glimpse(iris)

iris %&amp;gt;%
     group_by(min_x, max_x, min_y, max_y, Species) %&amp;gt;%
     count() %&amp;gt;%
     ungroup() %&amp;gt;%
     filter(n == max(n) )

ggplot(data = filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475), 
       aes(x = Species, fill = Species) ) +
     geom_bar() +
     theme_void() +
     scale_x_discrete(limits = c(&amp;quot;setosa&amp;quot;, &amp;quot;versicolor&amp;quot;, &amp;quot;virginica&amp;quot;) ) +
     scale_fill_manual(values = c(&amp;quot;setosa&amp;quot; = &amp;quot;#ED90A4&amp;quot;, 
                                  &amp;quot;versicolor&amp;quot; = &amp;quot;#ABB150&amp;quot;,
                                  &amp;quot;virginica&amp;quot; = &amp;quot;#00C1B2&amp;quot;),
                       guide  = &amp;quot;none&amp;quot;) +
     theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                       fill = &amp;quot;transparent&amp;quot;) ) +
     ylim(0, 47)

barfun = function(data) {
     ggplot(data = data, 
            aes(x = Species, fill = Species) ) +
          geom_bar() +
          theme_void() +
          scale_x_discrete(limits = c(&amp;quot;setosa&amp;quot;, &amp;quot;versicolor&amp;quot;, &amp;quot;virginica&amp;quot;) ) +
          scale_fill_manual(values = c(&amp;quot;setosa&amp;quot; = &amp;quot;#ED90A4&amp;quot;, 
                                       &amp;quot;versicolor&amp;quot; = &amp;quot;#ABB150&amp;quot;,
                                       &amp;quot;virginica&amp;quot; = &amp;quot;#00C1B2&amp;quot;),
                            guide  = &amp;quot;none&amp;quot;) +
          theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                            fill = &amp;quot;transparent&amp;quot;) ) +
          ylim(0, 47) 
  
}
barfun(data = filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475) )

allplots = iris %&amp;gt;%
     group_by_at( vars( matches(&amp;quot;min|max&amp;quot;) ) ) %&amp;gt;%
     group_nest() %&amp;gt;%
     mutate(subplots = map(data, barfun) )

allplots
allplots$subplots[[1]]
allplots$subplots[[6]]
grobfun = function(min_x, max_x, min_y, max_y, subplots) {
     annotation_custom(ggplotGrob(subplots),
                       xmin = min_x, ymin = min_y,
                       xmax = max_x, ymax = max_y)
}

( allgrobs = allplots %&amp;gt;%
     select(-data) %&amp;gt;%
     mutate(grobs = pmap(., grobfun) ) )

( largeplot = ggplot(iris, aes(x = Sepal.Length, 
                               y = Petal.Length, 
                               fill = Species) ) +
       geom_blank() +
       geom_col( aes(Inf, Inf) ) +
       scale_fill_manual(values = c(&amp;quot;setosa&amp;quot; = &amp;quot;#ED90A4&amp;quot;, 
                                    &amp;quot;versicolor&amp;quot; = &amp;quot;#ABB150&amp;quot;,
                                    &amp;quot;virginica&amp;quot; = &amp;quot;#00C1B2&amp;quot;) ) )

largeplot +
     allgrobs$grobs +
     ylim(1, NA)

range(iris$Petal.Width)

ggplot(data = filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475),
       aes(x = Petal.Width, y = stat(ncount), fill = stat(x) ) ) +
     geom_histogram(binwidth = .2, center = .1) +
     theme_void(base_size = 14) +
     scale_x_continuous(limits = c(0.1 - .1, 2.5 + .1) ) +
     scale_fill_continuous(type = &amp;quot;viridis&amp;quot;,
                           guide  = &amp;quot;none&amp;quot;,
                           limits = c(.1, 2.5) ) +
     facet_wrap(~paste0(&amp;quot;n = &amp;quot;, nrow(filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475) ) ) ) +
     theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                       fill = &amp;quot;transparent&amp;quot;) )

histfun = function(data) {
     ggplot(data = data,
            aes(x = Petal.Width, y = stat(ncount), fill = stat(x) ) ) +
          geom_histogram(binwidth = .2, center = .1) +
          theme_void(base_size = 14) +
          scale_x_continuous(limits = c(0.1 - .1, 2.5 + .1) ) +
          scale_fill_continuous(type = &amp;quot;viridis&amp;quot;,
                                guide  = &amp;quot;none&amp;quot;,
                                limits = c(.1, 2.5) ) +
          facet_wrap(~paste0(&amp;quot;n = &amp;quot;, nrow(data) ) ) +
          theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                            fill = &amp;quot;transparent&amp;quot;) )
}

allgrobs_hist = iris %&amp;gt;%
     group_by_at( vars( matches(&amp;quot;min|max&amp;quot;) ) ) %&amp;gt;%
     group_nest() %&amp;gt;%
     mutate(subplots = map(data, histfun) ) %&amp;gt;%
     select(-data) %&amp;gt;%
     mutate(grobs = pmap(., grobfun) )

( largeplot2 = ggplot(iris, aes(x = Sepal.Length, 
                                y = Petal.Length, 
                                fill = Petal.Width) ) +
       geom_blank() +
       geom_col( aes(Inf, Inf) ) +
       scale_fill_continuous(type = &amp;quot;viridis&amp;quot;,
                             limits = c(.1, 2.5),
                             breaks = seq(.1, 2.5, by = .8) ) )

largeplot2 +
     allgrobs_hist$grobs +
     ylim(1, NA)

ggplot(data = filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475),
       aes(x = Petal.Width, y = stat(ndensity), color = stat(x) ) ) +
     stat_density(geom = &amp;quot;line&amp;quot;, size = 1.25) +
     theme_void(base_size = 14) +
     scale_x_continuous(limits = c(0.1, 2.5),
                        expand = c(0, 0) ) +
     scale_color_viridis_c(guide  = &amp;quot;none&amp;quot;,
                           limits = c(.1, 2.5) ) +
     facet_wrap(~paste0(&amp;quot;n = &amp;quot;, nrow(filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475) ) ) ) +
     theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                       fill = &amp;quot;transparent&amp;quot;) )

densfun = function(data) {
     ggplot(data = data,
            aes(x = Petal.Width, y = stat(ndensity), color = stat(x) ) ) +
          stat_density(geom = &amp;quot;line&amp;quot;, size = 1.25) +
          theme_void(base_size = 14) +
          scale_x_continuous(limits = c(0.1, 2.5),
                             expand = c(0, 0) ) +
          scale_color_viridis_c(guide  = &amp;quot;none&amp;quot;,
                                limits = c(.1, 2.5) ) +
          facet_wrap(~paste0(&amp;quot;n = &amp;quot;, nrow(data) ) ) +
          theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                            fill = &amp;quot;transparent&amp;quot;) )
}

allgrobs_dens = iris %&amp;gt;%
     group_by_at( vars( matches(&amp;quot;min|max&amp;quot;) ) ) %&amp;gt;%
     group_nest() %&amp;gt;%
     mutate(subplots = map(data, densfun) ) %&amp;gt;%
     select(-data) %&amp;gt;%
     mutate(grobs = pmap(., grobfun) )
largeplot2 +
     allgrobs_dens$grobs +
     ylim(1, NA)

library(ggridges) # v 0.5.1
ggplot(data = filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475),
       aes(x = Petal.Width, y = 1, fill = stat(x) ) ) +
     geom_density_ridges_gradient() +
     theme_void(base_size = 14) +
     scale_x_continuous(limits = c(0.1, 2.5),
                        expand = c(0, 0) ) +
     scale_fill_viridis_c(guide  = &amp;quot;none&amp;quot;,
                           limits = c(.1, 2.5) ) +
     facet_wrap(~paste0(&amp;quot;n = &amp;quot;, nrow(filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475) ) ) ) +
     theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                       fill = &amp;quot;transparent&amp;quot;) )

densfun2 = function(data) {
  ggplot(data = data,
         aes(x = Petal.Width, y = 1, fill = stat(x) ) ) +
    geom_density_ridges_gradient() +
    theme_void(base_size = 14) +
    scale_x_continuous(limits = c(0.1, 2.5),
                       expand = c(0, 0) ) +
    scale_fill_viridis_c(guide  = &amp;quot;none&amp;quot;,
                         limits = c(.1, 2.5) ) +
    facet_wrap(~paste0(&amp;quot;n = &amp;quot;, nrow(data) ) ) +
    theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                      fill = &amp;quot;transparent&amp;quot;) )
}
allgrobs_dens2 = iris %&amp;gt;%
    group_by_at( vars( matches(&amp;quot;min|max&amp;quot;) ) ) %&amp;gt;%
    group_nest() %&amp;gt;%
    mutate(subplots = map(data, densfun2) ) %&amp;gt;%
    select(-data) %&amp;gt;%
    mutate(grobs = pmap(., grobfun) )

largeplot2 +
    allgrobs_dens2$grobs +
    ylim(1, NA) +
    theme_bw() +
    theme(legend.direction = &amp;quot;horizontal&amp;quot;,
          legend.position = c(.8, .25),
          legend.background = element_blank() ) +
    guides(fill = guide_colorbar(title.position = &amp;quot;top&amp;quot;) )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Custom contrasts in emmeans</title>
      <link>https://aosmith.rbind.io/2019/04/15/custom-contrasts-emmeans/</link>
      <pubDate>Mon, 15 Apr 2019 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2019/04/15/custom-contrasts-emmeans/</guid>
      <description>


&lt;p&gt;Following up on a &lt;a href=&#34;https://aosmith.rbind.io/2019/03/25/getting-started-with-emmeans/&#34;&gt;previous post&lt;/a&gt;, where I demonstrated the basic usage of package &lt;strong&gt;emmeans&lt;/strong&gt; for doing post hoc comparisons, here I’ll demonstrate how to make custom comparisons (aka &lt;em&gt;contrasts&lt;/em&gt;). These are comparisons that aren’t encompassed by the built-in functions in the package.&lt;/p&gt;
&lt;p&gt;Remember that you can explore the available built-in &lt;strong&gt;emmeans&lt;/strong&gt; functions for doing comparisons via &lt;code&gt;?&#34;contrast-methods&#34;&lt;/code&gt;.&lt;/p&gt;
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#reasons-for-custom-comparisons&#34;&gt;Reasons for custom comparisons&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#r-packages&#34;&gt;R packages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-dataset-and-model&#34;&gt;The dataset and model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#treatment-vs-control-comparisons&#34;&gt;Treatment vs control comparisons&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#building-custom-contrasts&#34;&gt;Building custom contrasts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-contrast-function-for-custom-comparisons&#34;&gt;The contrast() function for custom comparisons&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-named-lists-for-better-output&#34;&gt;Using named lists for better output&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-at-for-simple-comparisons&#34;&gt;Using “at” for simple comparisons&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#multiple-custom-contrasts-at-once&#34;&gt;Multiple custom contrasts at once&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#more-complicated-custom-contrasts&#34;&gt;More complicated custom contrasts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#just-the-code-please&#34;&gt;Just the code, please&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;reasons-for-custom-comparisons&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Reasons for custom comparisons&lt;/h1&gt;
&lt;p&gt;There are a variety of reasons you might need custom comparisons instead of some of the standard, built-in ones. One common scenario that I see a lot is when we have a single control group for multiple factors, so the factors aren’t perfectly crossed. This comes up, e.g., when doing experiments that involve applying different substances (like fertilizers) at varying rates. One factor is the different substances applied and the other is different application rates. However, the control is applying nothing or water or something like that. There aren’t different rates of the control to apply, so there is a single control group for both factors.&lt;/p&gt;
&lt;p&gt;Rather than trying to fit a model with multiple factors, focusing on main effects and the interaction, such data can be analyzed with a &lt;em&gt;simple effects&lt;/em&gt; model. This is where the two (or more) factors of interest have been combined into a single factor for analysis. Such an analysis focuses on the effect of the two factors combined. We can use post hoc comparisons to estimate the overall effects of individual factors.&lt;/p&gt;
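&lt;p&gt;As a concrete sketch, two hypothetical columns &lt;code&gt;substance&lt;/code&gt; and &lt;code&gt;rate&lt;/code&gt; (the column names here are made up purely for illustration) could be combined into a single factor like this, with the control kept as its own single level since it has no rate:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;d = data.frame(substance = c(&amp;quot;A&amp;quot;, &amp;quot;A&amp;quot;, &amp;quot;B&amp;quot;, &amp;quot;B&amp;quot;, &amp;quot;control&amp;quot;),
               rate = c(&amp;quot;1&amp;quot;, &amp;quot;2&amp;quot;, &amp;quot;1&amp;quot;, &amp;quot;2&amp;quot;, NA) )

# Paste substance and rate together; the control keeps a single level
d$sub.rate = factor( ifelse(d$substance == &amp;quot;control&amp;quot;,
                            &amp;quot;control&amp;quot;,
                            paste(d$substance, d$rate, sep = &amp;quot;.&amp;quot;) ) )
levels(d$sub.rate)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This produces the five levels &lt;code&gt;A.1&lt;/code&gt;, &lt;code&gt;A.2&lt;/code&gt;, &lt;code&gt;B.1&lt;/code&gt;, &lt;code&gt;B.2&lt;/code&gt;, and &lt;code&gt;control&lt;/code&gt;, matching the structure of the example dataset below.&lt;/p&gt;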
&lt;/div&gt;
&lt;div id=&#34;r-packages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;R packages&lt;/h1&gt;
&lt;p&gt;I will load &lt;strong&gt;magrittr&lt;/strong&gt; for the pipe in addition to &lt;strong&gt;emmeans&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(emmeans) # v. 1.3.3
library(magrittr) # v. 1.5&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;the-dataset-and-model&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The dataset and model&lt;/h1&gt;
&lt;p&gt;I’ve made a small dataset to use as an example. The response variable is &lt;code&gt;resp&lt;/code&gt; and the two factors of interest have been combined into a single factor &lt;code&gt;sub.rate&lt;/code&gt; that has 5 levels: &lt;code&gt;A.1&lt;/code&gt;, &lt;code&gt;A.2&lt;/code&gt;, &lt;code&gt;B.1&lt;/code&gt;, &lt;code&gt;B.2&lt;/code&gt;, and &lt;code&gt;control&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;One factor, which I’m thinking of as the &lt;em&gt;substance&lt;/em&gt; factor, is represented by &lt;code&gt;A&lt;/code&gt; and &lt;code&gt;B&lt;/code&gt; (and the control). The second, the &lt;em&gt;rate&lt;/em&gt; factor, is represented by &lt;code&gt;1&lt;/code&gt; and &lt;code&gt;2&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat = structure(list(sub.rate = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 
5L, 5L), .Label = c(&amp;quot;A.1&amp;quot;, &amp;quot;A.2&amp;quot;, &amp;quot;B.1&amp;quot;, &amp;quot;B.2&amp;quot;, &amp;quot;control&amp;quot;), class = &amp;quot;factor&amp;quot;), 
    resp = c(5.5, 4.9, 6.1, 3.6, 6.1, 3.5, 3, 4.1, 5, 4.6, 7.3, 
    5.6, 4.8, 7.2, 6.2, 4.3, 6.6, 6.5, 5.5, 7.1, 5.4, 6.7, 6.8, 
    8.5, 6.1)), row.names = c(NA, -25L), class = &amp;quot;data.frame&amp;quot;)

str(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# &amp;#39;data.frame&amp;#39;: 25 obs. of  2 variables:
#  $ sub.rate: Factor w/ 5 levels &amp;quot;A.1&amp;quot;,&amp;quot;A.2&amp;quot;,&amp;quot;B.1&amp;quot;,..: 1 1 1 1 1 2 2 2 2 2 ...
#  $ resp    : num  5.5 4.9 6.1 3.6 6.1 3.5 3 4.1 5 4.6 ...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I will use a simple, linear model for analysis.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit1 = lm(resp ~ sub.rate, data = dat)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;treatment-vs-control-comparisons&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Treatment vs control comparisons&lt;/h1&gt;
&lt;p&gt;The simple effects model makes it easy to get comparisons for each factor combination vs the control group with &lt;code&gt;emmeans()&lt;/code&gt;. I’ll use &lt;code&gt;trt.vs.ctrlk&lt;/code&gt; to do this since the control is the last level of the factor.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emmeans(fit1, specs = trt.vs.ctrlk ~ sub.rate)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $emmeans
#  sub.rate emmean    SE df lower.CL upper.CL
#  A.1        5.24 0.466 20     4.27     6.21
#  A.2        4.04 0.466 20     3.07     5.01
#  B.1        6.22 0.466 20     5.25     7.19
#  B.2        6.00 0.466 20     5.03     6.97
#  control    6.70 0.466 20     5.73     7.67
# 
# Confidence level used: 0.95 
# 
# $contrasts
#  contrast      estimate   SE df t.ratio p.value
#  A.1 - control    -1.46 0.66 20 -2.214  0.1230 
#  A.2 - control    -2.66 0.66 20 -4.033  0.0024 
#  B.1 - control    -0.48 0.66 20 -0.728  0.8403 
#  B.2 - control    -0.70 0.66 20 -1.061  0.6548 
# 
# P value adjustment: dunnettx method for 4 tests&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We may also be interested in some other comparisons, though. In particular, we might want to do some overall comparisons across the two factors. We will need custom contrasts for this.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;building-custom-contrasts&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Building custom contrasts&lt;/h1&gt;
&lt;p&gt;Custom contrasts are based on the estimated marginal means output from &lt;code&gt;emmeans()&lt;/code&gt;. The first step to building custom contrasts is to calculate the estimated marginal means so we have them to work with. I will name this output &lt;code&gt;emm1&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm1 = emmeans(fit1, specs = ~ sub.rate)
emm1&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  sub.rate emmean    SE df lower.CL upper.CL
#  A.1        5.24 0.466 20     4.27     6.21
#  A.2        4.04 0.466 20     3.07     5.01
#  B.1        6.22 0.466 20     5.25     7.19
#  B.2        6.00 0.466 20     5.03     6.97
#  control    6.70 0.466 20     5.73     7.67
# 
# Confidence level used: 0.95&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I’m going to start with a relatively simple example. I will compare mean &lt;code&gt;resp&lt;/code&gt; of the &lt;code&gt;A.2&lt;/code&gt; group to the &lt;code&gt;B.2&lt;/code&gt; group via custom contrasts.&lt;/p&gt;
&lt;p&gt;Building a custom contrast involves pulling out specific group means of interest from the &lt;code&gt;emmeans()&lt;/code&gt; output. We &lt;em&gt;pull out&lt;/em&gt; a group mean by making a vector to represent the specific mean of interest. In this vector we assign a &lt;code&gt;1&lt;/code&gt; to the mean of the group of interest and a &lt;code&gt;0&lt;/code&gt; to the other groups.&lt;/p&gt;
&lt;p&gt;For example, to pull out the mean of &lt;code&gt;A.2&lt;/code&gt; from &lt;code&gt;emm1&lt;/code&gt; we will make a vector with 5 values in it, one for each row of the output. The second value will be a &lt;code&gt;1&lt;/code&gt;, since the mean of &lt;code&gt;A.2&lt;/code&gt; is on the second row of &lt;code&gt;emm1&lt;/code&gt;. All the other values in the vector will be &lt;code&gt;0&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Below is the vector that represents the &lt;code&gt;A.2&lt;/code&gt; mean.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;A2 = c(0, 1, 0, 0, 0)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Similarly, to pull out the mean of &lt;code&gt;B.2&lt;/code&gt; we’ll have a vector of 5 values with a &lt;code&gt;1&lt;/code&gt; as the fourth value. The &lt;code&gt;B.2&lt;/code&gt; group is on the fourth row in &lt;code&gt;emm1&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;B2 = c(0, 0, 0, 1, 0)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When building custom contrasts via vectors like this, the vectors will always be the same length as the number of rows in the &lt;code&gt;emmeans()&lt;/code&gt; output. I always calculate and print the estimated marginal means prior to building the vectors so I am certain of the number of rows and the order of the groups in the output.&lt;/p&gt;
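&lt;p&gt;If you’re worried about miscounting rows, the indicator vectors can also be built programmatically instead of typed by hand. Here’s a minimal sketch, assuming the group names are stored in the same order they appear in the &lt;code&gt;emmeans()&lt;/code&gt; output:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Group names in the order they appear in the emmeans() output
groups = c(&amp;quot;A.1&amp;quot;, &amp;quot;A.2&amp;quot;, &amp;quot;B.1&amp;quot;, &amp;quot;B.2&amp;quot;, &amp;quot;control&amp;quot;)

# A 1 marks the group of interest, 0 marks all other groups
A2 = as.numeric(groups == &amp;quot;A.2&amp;quot;)
A2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 0 1 0 0 0&lt;/code&gt;&lt;/pre&gt;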
&lt;/div&gt;
&lt;div id=&#34;the-contrast-function-for-custom-comparisons&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The contrast() function for custom comparisons&lt;/h1&gt;
&lt;p&gt;Once we have the vectors that represent the means we are interested in comparing, we actually do the comparisons via the &lt;code&gt;contrast()&lt;/code&gt; function. Since we are interested in a &lt;em&gt;difference&lt;/em&gt; in mean response, we take the difference between the vectors that represent the means.&lt;/p&gt;
&lt;p&gt;Taking the difference between vectors can be done inside or outside &lt;code&gt;contrast()&lt;/code&gt;. In this example I’m doing it inside.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;contrast()&lt;/code&gt; function takes an &lt;code&gt;emmGrid&lt;/code&gt; object (i.e., output from &lt;code&gt;emmeans()&lt;/code&gt;) as the first argument. We give the comparison we want to do via a list passed to the &lt;code&gt;method&lt;/code&gt; argument.&lt;/p&gt;
&lt;p&gt;Here I want to calculate the difference in mean &lt;code&gt;resp&lt;/code&gt; of &lt;code&gt;A.2&lt;/code&gt; and &lt;code&gt;B.2&lt;/code&gt;. I subtract the &lt;code&gt;B2&lt;/code&gt; vector from the &lt;code&gt;A2&lt;/code&gt; vector. The output is the difference in mean &lt;code&gt;resp&lt;/code&gt; between these groups.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;contrast(emm1, method = list(A2 - B2) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  contrast          estimate   SE df t.ratio p.value
#  c(0, 1, 0, -1, 0)    -1.96 0.66 20 -2.972  0.0075&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;using-named-lists-for-better-output&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using named lists for better output&lt;/h1&gt;
&lt;p&gt;Unfortunately you can’t tell which comparison was done in the output above 🤔. We can use a named list in &lt;code&gt;method&lt;/code&gt; to make the output more understandable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;contrast(emm1, method = list(&amp;quot;A2 - B2&amp;quot; = A2 - B2) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  contrast estimate   SE df t.ratio p.value
#  A2 - B2     -1.96 0.66 20 -2.972  0.0075&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;using-at-for-simple-comparisons&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using “at” for simple comparisons&lt;/h1&gt;
&lt;p&gt;Note that I didn’t need to do a custom contrast to do this particular comparison. I could have gotten the comparison I wanted by using the &lt;code&gt;at&lt;/code&gt; argument with &lt;code&gt;pairwise&lt;/code&gt; in &lt;code&gt;emmeans()&lt;/code&gt; and choosing just the two groups I was interested in.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emmeans(fit1, specs = pairwise ~ sub.rate, 
         at = list(sub.rate = c(&amp;quot;A.2&amp;quot;, &amp;quot;B.2&amp;quot;) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $emmeans
#  sub.rate emmean    SE df lower.CL upper.CL
#  A.2        4.04 0.466 20     3.07     5.01
#  B.2        6.00 0.466 20     5.03     6.97
# 
# Confidence level used: 0.95 
# 
# $contrasts
#  contrast  estimate   SE df t.ratio p.value
#  A.2 - B.2    -1.96 0.66 20 -2.972  0.0075&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;multiple-custom-contrasts-at-once&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Multiple custom contrasts at once&lt;/h1&gt;
&lt;p&gt;Multiple custom contrasts can be done simultaneously in &lt;code&gt;contrast()&lt;/code&gt; by adding more comparisons to the &lt;code&gt;method&lt;/code&gt; list. I’ll demonstrate this by doing the simple example comparison twice, changing only which group mean is subtracted from the other.&lt;/p&gt;
&lt;p&gt;I name both elements of the list for ease of interpretation. I find naming the list of comparisons to be a key part of doing these custom contrasts.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;contrast(emm1, method = list(&amp;quot;A2 - B2&amp;quot; = A2 - B2,
                             &amp;quot;B2 - A2&amp;quot; = B2 - A2) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  contrast estimate   SE df t.ratio p.value
#  A2 - B2     -1.96 0.66 20 -2.972  0.0075 
#  B2 - A2      1.96 0.66 20  2.972  0.0075&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Doing multiple comparisons at once means a multiple comparisons adjustment can be done as needed. In addition, we can use the &lt;code&gt;confint()&lt;/code&gt; function to get confidence intervals for the comparisons.&lt;/p&gt;
&lt;p&gt;I’ll add a multivariate-&lt;span class=&#34;math inline&#34;&gt;\(t\)&lt;/span&gt; adjustment via &lt;code&gt;adjust = &#34;mvt&#34;&lt;/code&gt; and then get confidence intervals for the comparisons. Remember we can get both confidence intervals and tests for comparisons via &lt;code&gt;summary()&lt;/code&gt; with &lt;code&gt;infer = TRUE&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;twocomp = contrast(emm1, method = list(&amp;quot;A2 minus B2&amp;quot; = A2 - B2,
                             &amp;quot;B2 minus A2&amp;quot; = B2 - A2),
         adjust = &amp;quot;mvt&amp;quot;) %&amp;gt;%
     confint()
twocomp&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  contrast    estimate   SE df lower.CL upper.CL
#  A2 minus B2    -1.96 0.66 20   -3.336   -0.584
#  B2 minus A2     1.96 0.66 20    0.584    3.336
# 
# Confidence level used: 0.95 
# Conf-level adjustment: mvt method for 2 estimates&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;more-complicated-custom-contrasts&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;More complicated custom contrasts&lt;/h1&gt;
&lt;p&gt;Now that we’ve seen a simple case, let’s do something slightly more complicated (and realistic). What if we want to compare the &lt;code&gt;A&lt;/code&gt; group to the &lt;code&gt;B&lt;/code&gt; group overall, regardless of the application rate?&lt;/p&gt;
&lt;p&gt;This is a &lt;em&gt;main effect&lt;/em&gt; comparison, so I need to average over the effect of the rate factor in order to estimate the overall effect of the levels of the substance factor.&lt;/p&gt;
&lt;p&gt;To do this comparison I need the means for all four non-control factor levels. I’ll print &lt;code&gt;emm1&lt;/code&gt; again here so I remember the order of the output before starting to write out the vectors that represent the group means.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm1&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  sub.rate emmean    SE df lower.CL upper.CL
#  A.1        5.24 0.466 20     4.27     6.21
#  A.2        4.04 0.466 20     3.07     5.01
#  B.1        6.22 0.466 20     5.25     7.19
#  B.2        6.00 0.466 20     5.03     6.97
#  control    6.70 0.466 20     5.73     7.67
# 
# Confidence level used: 0.95&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I’ll need all means that involve &lt;code&gt;A&lt;/code&gt; or &lt;code&gt;B&lt;/code&gt;, which are the first four group means in &lt;code&gt;emm1&lt;/code&gt;. I’ll make a vector to represent each of these group means.&lt;/p&gt;
&lt;p&gt;Typing these vectors out isn’t too hard here, since they contain only 5 values. When I have many groups, and so really long vectors, I sometimes use &lt;code&gt;rep()&lt;/code&gt; to repeat the 0 values instead of typing each one out.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;A1 = c(1, 0, 0, 0, 0)
A2 = c(0, 1, 0, 0, 0)
B1 = c(0, 0, 1, 0, 0)
B2 = c(0, 0, 0, 1, 0)&lt;/code&gt;&lt;/pre&gt;
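&lt;p&gt;For instance, a sketch of the same &lt;code&gt;A2&lt;/code&gt; vector built with &lt;code&gt;rep()&lt;/code&gt; rather than typing each 0:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Identical to c(0, 1, 0, 0, 0)
A2 = c(0, 1, rep(0, 3) )&lt;/code&gt;&lt;/pre&gt;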
&lt;p&gt;The vectors I made represent means for combinations of substance and rate. I want to compare the &lt;em&gt;overall&lt;/em&gt; substance group means, though. This can be done by &lt;em&gt;averaging over&lt;/em&gt; the two rates. This involves literally taking the average of, e.g., the &lt;code&gt;A1&lt;/code&gt; and &lt;code&gt;A2&lt;/code&gt; vectors to get a vector that represents the overall &lt;code&gt;A&lt;/code&gt; mean.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Aoverall = (A1 + A2)/2
Boverall = (B1 + B2)/2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now that we have vectors to represent the overall means, we can compare the mean &lt;code&gt;resp&lt;/code&gt; of the &lt;code&gt;A&lt;/code&gt; group vs the &lt;code&gt;B&lt;/code&gt; group overall in &lt;code&gt;contrast()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;contrast(emm1, method = list(&amp;quot;A - B&amp;quot; = Aoverall - Boverall) ) &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  contrast estimate    SE df t.ratio p.value
#  A - B       -1.47 0.466 20 -3.152  0.0050&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Custom contrasts are all built in this same basic way. You can also build your own contrast function if there is some contrast you do all the time that is not part of &lt;strong&gt;emmeans&lt;/strong&gt;. See the &lt;a href=&#34;https://cran.r-project.org/web/packages/emmeans/vignettes/comparisons.html#linfcns&#34;&gt;custom contrasts section&lt;/a&gt; of the &lt;strong&gt;emmeans&lt;/strong&gt; vignette for more info.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;just-the-code-please&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Just the code, please&lt;/h1&gt;
&lt;p&gt;Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code &lt;a href=&#34;https://aosmith.rbind.io/script/2019-04-15-custom-contrasts-in-emmeans.R&#34;&gt;from here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(emmeans) # v. 1.3.3
library(magrittr) # v. 1.5

dat = structure(list(sub.rate = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 
5L, 5L), .Label = c(&amp;quot;A.1&amp;quot;, &amp;quot;A.2&amp;quot;, &amp;quot;B.1&amp;quot;, &amp;quot;B.2&amp;quot;, &amp;quot;control&amp;quot;), class = &amp;quot;factor&amp;quot;), 
    resp = c(5.5, 4.9, 6.1, 3.6, 6.1, 3.5, 3, 4.1, 5, 4.6, 7.3, 
    5.6, 4.8, 7.2, 6.2, 4.3, 6.6, 6.5, 5.5, 7.1, 5.4, 6.7, 6.8, 
    8.5, 6.1)), row.names = c(NA, -25L), class = &amp;quot;data.frame&amp;quot;)

str(dat)

fit1 = lm(resp ~ sub.rate, data = dat)

emmeans(fit1, specs = trt.vs.ctrlk ~ sub.rate)

emm1 = emmeans(fit1, specs = ~ sub.rate)
emm1

A2 = c(0, 1, 0, 0, 0)
B2 = c(0, 0, 0, 1, 0)
contrast(emm1, method = list(A2 - B2) )

contrast(emm1, method = list(&amp;quot;A2 - B2&amp;quot; = A2 - B2) )

emmeans(fit1, specs = pairwise ~ sub.rate, 
         at = list(sub.rate = c(&amp;quot;A.2&amp;quot;, &amp;quot;B.2&amp;quot;) ) )

contrast(emm1, method = list(&amp;quot;A2 - B2&amp;quot; = A2 - B2,
                             &amp;quot;B2 - A2&amp;quot; = B2 - A2) )

twocomp = contrast(emm1, method = list(&amp;quot;A2 minus B2&amp;quot; = A2 - B2,
                             &amp;quot;B2 minus A2&amp;quot; = B2 - A2),
         adjust = &amp;quot;mvt&amp;quot;) %&amp;gt;%
     confint()
twocomp

emm1
A1 = c(1, 0, 0, 0, 0)
A2 = c(0, 1, 0, 0, 0)
B1 = c(0, 0, 1, 0, 0)
B2 = c(0, 0, 0, 1, 0)

Aoverall = (A1 + A2)/2
Boverall = (B1 + B2)/2

contrast(emm1, method = list(&amp;quot;A - B&amp;quot; = Aoverall - Boverall) ) &lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Getting started with emmeans</title>
      <link>https://aosmith.rbind.io/2019/03/25/getting-started-with-emmeans/</link>
      <pubDate>Mon, 25 Mar 2019 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2019/03/25/getting-started-with-emmeans/</guid>
      <description>
&lt;script src=&#34;https://aosmith.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;&lt;em&gt;This post was last updated on 2021-11-04.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Package &lt;strong&gt;emmeans&lt;/strong&gt; (formerly known as &lt;strong&gt;lsmeans&lt;/strong&gt;) is enormously useful for folks wanting to do post hoc comparisons among groups after fitting a model. It has a very thorough set of vignettes (see the vignette topics &lt;a href=&#34;https://cran.r-project.org/web/packages/emmeans/vignettes/basics.html#contents&#34;&gt;here&lt;/a&gt;), is very flexible with a ton of options, and works out of the box with a lot of different model objects (and can be extended to others 👍).&lt;/p&gt;
&lt;p&gt;I’ve been consistently recommending &lt;strong&gt;emmeans&lt;/strong&gt; to students fitting models in R. However, students often struggle a bit to get started using the package, possibly due to the sheer amount of flexibility and information in the vignettes.&lt;/p&gt;
&lt;p&gt;I’ve put together some basic examples for using &lt;strong&gt;emmeans&lt;/strong&gt;, meant to be a complement to the vignettes. Specifically this post will demonstrate a few of the built-in options for some standard post hoc comparisons; I will write a separate post about custom comparisons in &lt;strong&gt;emmeans&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Disclaimer: This post is about using a package in R and so unfortunately does not focus on appropriate statistical practice for model fitting and post hoc comparisons.&lt;/em&gt;&lt;/p&gt;
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#r-packages&#34;&gt;R packages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-dataset-and-model&#34;&gt;The dataset and model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#built-in-comparisons-with-emmeans&#34;&gt;Built in comparisons with emmeans()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#all-pairwise-comparisons&#34;&gt;All pairwise comparisons&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#back-transforming-results&#34;&gt;Back-transforming results&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#changing-the-multiple-comparisons-adjustment&#34;&gt;Changing the multiple comparisons adjustment&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#confidence-intervals-for-comparisons&#34;&gt;Confidence intervals for comparisons&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#putting-results-in-a-data.frame&#34;&gt;Putting results in a data.frame&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#within-group-comparisons&#34;&gt;Within group comparisons&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#main-effects-comparisons&#34;&gt;Main effects comparisons&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#treatment-vs-control-example&#34;&gt;Treatment vs control example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#alternative-code-for-comparisons&#34;&gt;Alternative code for comparisons&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#just-the-code-please&#34;&gt;Just the code, please&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;r-packages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;R packages&lt;/h1&gt;
&lt;p&gt;I will load &lt;strong&gt;magrittr&lt;/strong&gt; for the pipe in addition to &lt;strong&gt;emmeans&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(emmeans) # v. 1.7.0
library(magrittr) # v. 2.0.1&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;the-dataset-and-model&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The dataset and model&lt;/h1&gt;
&lt;p&gt;I’ve made a small dataset to use in this example.&lt;/p&gt;
&lt;p&gt;The response variable is &lt;code&gt;resp&lt;/code&gt;, which comes from the log-normal distribution, and the two crossed factors of interest are &lt;code&gt;f1&lt;/code&gt; and &lt;code&gt;f2&lt;/code&gt;. Each factor has two levels: a control called &lt;code&gt;c&lt;/code&gt; as well as a second non-control level.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat = data.frame(resp = c(1.6,0.3,3,0.1,3.2,0.2,0.4,0.4,2.8,
                          0.7,3.8,3,0.3,14.3,1.2,0.5,1.1,4.4,0.4,8.4),
                 f1 = factor(c(&amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,
                               &amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;,
                               &amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;)),
                 f2 = factor(c(&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,
                               &amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,
                               &amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;)))

str(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# &amp;#39;data.frame&amp;#39;: 20 obs. of  3 variables:
#  $ resp: num  1.6 0.3 3 0.1 3.2 0.2 0.4 0.4 2.8 0.7 ...
#  $ f1  : Factor w/ 2 levels &amp;quot;a&amp;quot;,&amp;quot;c&amp;quot;: 1 1 1 1 1 1 1 1 1 1 ...
#  $ f2  : Factor w/ 2 levels &amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;: 1 2 1 2 1 2 1 2 1 2 ...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The model I will use is a linear model with a log-transformed response variable and the two factors and their interaction as explanatory variables. This is the “true” model since I created these data so I’m skipping all model checks (which I would not do in a real analysis).&lt;/p&gt;
&lt;p&gt;Note I use &lt;code&gt;log(resp)&lt;/code&gt; in the model rather than creating a new log-transformed variable. This will allow me to demonstrate one of the convenient options available in &lt;code&gt;emmeans()&lt;/code&gt; later.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit1 = lm(log(resp) ~ f1 + f2 + f1:f2, data = dat)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;built-in-comparisons-with-emmeans&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Built in comparisons with emmeans()&lt;/h1&gt;
&lt;p&gt;The &lt;strong&gt;emmeans&lt;/strong&gt; package has helper functions for commonly used post hoc comparisons (aka &lt;em&gt;contrasts&lt;/em&gt;). For example, we can do pairwise comparisons via &lt;code&gt;pairwise&lt;/code&gt; or &lt;code&gt;revpairwise&lt;/code&gt;, treatment vs control comparisons via &lt;code&gt;trt.vs.ctrl&lt;/code&gt; or &lt;code&gt;trt.vs.ctrlk&lt;/code&gt;, and even consecutive comparisons via &lt;code&gt;consec&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The available built-in functions for doing comparisons are listed in the documentation for &lt;code&gt;?&#34;contrast-methods&#34;&lt;/code&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;all-pairwise-comparisons&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;All pairwise comparisons&lt;/h1&gt;
&lt;p&gt;One way to use &lt;code&gt;emmeans()&lt;/code&gt; is via formula coding for the comparisons. The formula is defined in the &lt;code&gt;specs&lt;/code&gt; argument.&lt;/p&gt;
&lt;p&gt;In my first example I do all pairwise comparisons for all combinations of &lt;code&gt;f1&lt;/code&gt; and &lt;code&gt;f2&lt;/code&gt;. The built-in function &lt;code&gt;pairwise&lt;/code&gt; is put on the left-hand side of the formula of the &lt;code&gt;specs&lt;/code&gt; argument. The factors with levels to compare among are on the right-hand side. Since I’m doing all pairwise comparisons, the combination of &lt;code&gt;f1&lt;/code&gt; and &lt;code&gt;f2&lt;/code&gt; is put in the formula.&lt;/p&gt;
&lt;p&gt;The model object is passed to the first argument in &lt;code&gt;emmeans()&lt;/code&gt;, &lt;code&gt;object&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm1 = emmeans(fit1, specs = pairwise ~ f1:f2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using the formula in this way returns an object with two parts. The first part, called &lt;code&gt;emmeans&lt;/code&gt;, is the estimated marginal means along with the standard errors and confidence intervals. We can pull these out with dollar sign notation, which I demonstrate below.&lt;/p&gt;
&lt;p&gt;These results are all on the &lt;em&gt;model&lt;/em&gt; scale, so in this case these are the estimated mean log responses for each &lt;code&gt;f1&lt;/code&gt; and &lt;code&gt;f2&lt;/code&gt; combination. Note the message in the output that &lt;code&gt;emmeans()&lt;/code&gt; gives us about results being on the log scale. It knows the model is on the log scale because I used &lt;code&gt;log(resp)&lt;/code&gt; as the response variable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm1$emmeans&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  f1 f2 emmean    SE df lower.CL upper.CL
#  a  1   0.569 0.445 16   -0.374    1.512
#  c  1  -0.102 0.445 16   -1.045    0.842
#  a  c  -1.278 0.445 16   -2.221   -0.334
#  c  c   1.335 0.445 16    0.392    2.279
# 
# Results are given on the log (not the response) scale. 
# Confidence level used: 0.95&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The second part of the output, called &lt;code&gt;contrasts&lt;/code&gt;, contains the comparisons of interest. It is this section that we are generally most interested in when answering a question about differences among groups. You can see which comparison is which via the &lt;code&gt;contrast&lt;/code&gt; column.&lt;/p&gt;
&lt;p&gt;These results are also on the model scale (and we get the same message in this section), and &lt;a href=&#34;#back-transforming-results&#34;&gt;we’ll want to put them on the original scale&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The comparisons are accompanied by statistical tests of the null hypothesis of “no difference”, but lack confidence interval (CI) limits by default. &lt;a href=&#34;#confidence-intervals-for-comparisons&#34;&gt;We’ll need to get these&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;emmeans&lt;/strong&gt; package automatically adjusts for multiple comparisons. Since we did all pairwise comparisons, the package used a Tukey adjustment. &lt;a href=&#34;#changing-the-multiple-comparisons-adjustment&#34;&gt;The type of adjustment can be changed&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm1$contrasts&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  contrast  estimate    SE df t.ratio p.value
#  a 1 - c 1    0.671 0.629 16   1.065  0.7146
#  a 1 - a c    1.847 0.629 16   2.934  0.0434
#  a 1 - c c   -0.766 0.629 16  -1.217  0.6253
#  c 1 - a c    1.176 0.629 16   1.869  0.2795
#  c 1 - c c   -1.437 0.629 16  -2.283  0.1438
#  a c - c c   -2.613 0.629 16  -4.152  0.0038
# 
# Results are given on the log (not the response) scale. 
# P value adjustment: tukey method for comparing a family of 4 estimates&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;back-transforming-results&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Back-transforming results&lt;/h1&gt;
&lt;p&gt;Since I used a log transformation I can express the results as multiplicative differences in medians on the original (data) scale.&lt;/p&gt;
&lt;p&gt;We can always back-transform estimates and CI limits by hand, but in &lt;code&gt;emmeans()&lt;/code&gt; we can use the &lt;code&gt;type&lt;/code&gt; argument for this. Using &lt;code&gt;type = &#34;response&#34;&lt;/code&gt; will return results on the original scale. This works when the transformation is explicit in the model (e.g., &lt;code&gt;log(resp)&lt;/code&gt;) and works similarly for link functions in generalized linear models.&lt;/p&gt;
&lt;p&gt;You’ll see the message changes in the output once I do this, indicating things were back-transformed from the model scale. We also are reminded that the tests were done on the model scale.&lt;/p&gt;
&lt;p&gt;In the &lt;code&gt;contrast&lt;/code&gt; column in the &lt;code&gt;contrasts&lt;/code&gt; section we can see the expression of the comparisons has changed from additive comparisons (via subtraction) shown above to multiplicative comparisons (via division).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emmeans(fit1, specs = pairwise ~ f1:f2, type = &amp;quot;response&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $emmeans
#  f1 f2 response    SE df lower.CL upper.CL
#  a  1     1.767 0.786 16    0.688    4.538
#  c  1     0.903 0.402 16    0.352    2.321
#  a  c     0.279 0.124 16    0.108    0.716
#  c  c     3.800 1.691 16    1.479    9.763
# 
# Confidence level used: 0.95 
# Intervals are back-transformed from the log scale 
# 
# $contrasts
#  contrast   ratio     SE df null t.ratio p.value
#  a 1 / c 1 1.9553 1.2306 16    1   1.065  0.7146
#  a 1 / a c 6.3396 3.9900 16    1   2.934  0.0434
#  a 1 / c c 0.4648 0.2926 16    1  -1.217  0.6253
#  c 1 / a c 3.2422 2.0406 16    1   1.869  0.2795
#  c 1 / c c 0.2377 0.1496 16    1  -2.283  0.1438
#  a c / c c 0.0733 0.0461 16    1  -4.152  0.0038
# 
# P value adjustment: tukey method for comparing a family of 4 estimates 
# Tests are performed on the log scale&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;changing-the-multiple-comparisons-adjustment&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Changing the multiple comparisons adjustment&lt;/h1&gt;
&lt;p&gt;The &lt;code&gt;adjust&lt;/code&gt; argument can be used to change the type of multiple comparisons adjustment. All available options are listed and described in the documentation for &lt;code&gt;summary.emmGrid&lt;/code&gt; under the section &lt;em&gt;P-value adjustments&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;One option is to skip multiple comparisons adjustments altogether, using &lt;code&gt;adjust = &#34;none&#34;&lt;/code&gt;. If we use this, the message about multiple comparisons disappears (since we didn’t use one).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm1.1 = emmeans(fit1, specs = pairwise ~ f1:f2, type = &amp;quot;response&amp;quot;, adjust = &amp;quot;none&amp;quot;)
emm1.1&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $emmeans
#  f1 f2 response    SE df lower.CL upper.CL
#  a  1     1.767 0.786 16    0.688    4.538
#  c  1     0.903 0.402 16    0.352    2.321
#  a  c     0.279 0.124 16    0.108    0.716
#  c  c     3.800 1.691 16    1.479    9.763
# 
# Confidence level used: 0.95 
# Intervals are back-transformed from the log scale 
# 
# $contrasts
#  contrast   ratio     SE df null t.ratio p.value
#  a 1 / c 1 1.9553 1.2306 16    1   1.065  0.3025
#  a 1 / a c 6.3396 3.9900 16    1   2.934  0.0097
#  a 1 / c c 0.4648 0.2926 16    1  -1.217  0.2412
#  c 1 / a c 3.2422 2.0406 16    1   1.869  0.0801
#  c 1 / c c 0.2377 0.1496 16    1  -2.283  0.0365
#  a c / c c 0.0733 0.0461 16    1  -4.152  0.0008
# 
# Tests are performed on the log scale&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;confidence-intervals-for-comparisons&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Confidence intervals for comparisons&lt;/h1&gt;
&lt;p&gt;We will almost invariably want to report confidence intervals for any comparisons of interest. We need a separate function to get these. Here is an example using the &lt;code&gt;confint()&lt;/code&gt; function with the default 95% CI (the confidence level can be changed, see &lt;code&gt;?confint.emmGrid&lt;/code&gt;). I use the pipe to pass the &lt;code&gt;contrasts&lt;/code&gt; into the &lt;code&gt;confint()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm1.1$contrasts %&amp;gt;%
     confint()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  contrast   ratio     SE df lower.CL upper.CL
#  a 1 / c 1 1.9553 1.2306 16   0.5150    7.424
#  a 1 / a c 6.3396 3.9900 16   1.6696   24.072
#  a 1 / c c 0.4648 0.2926 16   0.1224    1.765
#  c 1 / a c 3.2422 2.0406 16   0.8539   12.311
#  c 1 / c c 0.2377 0.1496 16   0.0626    0.903
#  a c / c c 0.0733 0.0461 16   0.0193    0.278
# 
# Confidence level used: 0.95 
# Intervals are back-transformed from the log scale&lt;/code&gt;&lt;/pre&gt;
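&lt;p&gt;The &lt;code&gt;level&lt;/code&gt; argument of &lt;code&gt;confint()&lt;/code&gt; controls the confidence level. For example, 90% intervals could be requested like this (output not shown):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm1.1$contrasts %&amp;gt;%
     confint(level = 0.90)&lt;/code&gt;&lt;/pre&gt;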
&lt;p&gt;The &lt;code&gt;confint()&lt;/code&gt; function returns confidence intervals but gets rid of the statistical tests. Some people will want to also report the test statistics and p-values. In this case, we can use &lt;code&gt;summary()&lt;/code&gt; instead of &lt;code&gt;confint()&lt;/code&gt;, with &lt;code&gt;infer = TRUE&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm1.1$contrasts %&amp;gt;%
     summary(infer = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  contrast   ratio     SE df lower.CL upper.CL null t.ratio p.value
#  a 1 / c 1 1.9553 1.2306 16   0.5150    7.424    1   1.065  0.3025
#  a 1 / a c 6.3396 3.9900 16   1.6696   24.072    1   2.934  0.0097
#  a 1 / c c 0.4648 0.2926 16   0.1224    1.765    1  -1.217  0.2412
#  c 1 / a c 3.2422 2.0406 16   0.8539   12.311    1   1.869  0.0801
#  c 1 / c c 0.2377 0.1496 16   0.0626    0.903    1  -2.283  0.0365
#  a c / c c 0.0733 0.0461 16   0.0193    0.278    1  -4.152  0.0008
# 
# Confidence level used: 0.95 
# Intervals are back-transformed from the log scale 
# Tests are performed on the log scale&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;putting-results-in-a-data.frame&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Putting results in a data.frame&lt;/h1&gt;
&lt;p&gt;One of the really nice things about &lt;code&gt;emmeans()&lt;/code&gt; is that it makes it easy to get the results into a nice format for making tables or graphics of results. This is because the results are converted to a data.frame with &lt;code&gt;confint()&lt;/code&gt; or &lt;code&gt;summary()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If needed, the estimated marginal means can also be put into a data.frame. In this case we can use &lt;code&gt;as.data.frame()&lt;/code&gt; to convert the &lt;code&gt;emmeans&lt;/code&gt; to a data.frame for plotting or putting into a table of results. We can also use &lt;code&gt;as.data.frame()&lt;/code&gt; directly on the contrasts above if we don’t need &lt;code&gt;confint()&lt;/code&gt; or &lt;code&gt;summary()&lt;/code&gt; (not shown).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm1.1$emmeans %&amp;gt;%
     as.data.frame()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#   f1 f2  response        SE df  lower.CL upper.CL
# 1  a  1 1.7665334 0.7861763 16 0.6876870 4.537879
# 2  c  1 0.9034576 0.4020739 16 0.3517035 2.320806
# 3  a  c 0.2786518 0.1240109 16 0.1084753 0.715802
# 4  c  c 3.8004222 1.6913362 16 1.4794517 9.762542&lt;/code&gt;&lt;/pre&gt;
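&lt;p&gt;As a quick sketch, the contrasts can be converted in the same way (output not shown):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm1.1$contrasts %&amp;gt;%
     as.data.frame()&lt;/code&gt;&lt;/pre&gt;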
&lt;/div&gt;
&lt;div id=&#34;within-group-comparisons&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Within group comparisons&lt;/h1&gt;
&lt;p&gt;While we &lt;em&gt;can&lt;/em&gt; do all pairwise comparisons, there are certainly plenty of situations where the research question dictates that we only want a specific set of comparisons. A common example of this is when we want to compare the levels of one factor within the levels of another. Here I’ll show comparisons among levels of &lt;code&gt;f1&lt;/code&gt; for each level of &lt;code&gt;f2&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The only thing that changes is the right-hand side of the &lt;code&gt;specs&lt;/code&gt; formula. The code &lt;code&gt;f1|f2&lt;/code&gt; translates to “compare levels of &lt;code&gt;f1&lt;/code&gt; within each level of &lt;code&gt;f2&lt;/code&gt;”.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm2 = emmeans(fit1, specs = pairwise ~ f1|f2, type = &amp;quot;response&amp;quot;)
emm2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $emmeans
# f2 = 1:
#  f1 response    SE df lower.CL upper.CL
#  a     1.767 0.786 16    0.688    4.538
#  c     0.903 0.402 16    0.352    2.321
# 
# f2 = c:
#  f1 response    SE df lower.CL upper.CL
#  a     0.279 0.124 16    0.108    0.716
#  c     3.800 1.691 16    1.479    9.763
# 
# Confidence level used: 0.95 
# Intervals are back-transformed from the log scale 
# 
# $contrasts
# f2 = 1:
#  contrast  ratio     SE df null t.ratio p.value
#  a / c    1.9553 1.2306 16    1   1.065  0.3025
# 
# f2 = c:
#  contrast  ratio     SE df null t.ratio p.value
#  a / c    0.0733 0.0461 16    1  -4.152  0.0008
# 
# Tests are performed on the log scale&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can see there is no message about a multiple comparisons adjustment in the above set of comparisons. This is because the package default is to correct for the number of comparisons &lt;em&gt;within&lt;/em&gt; each group instead of across groups. In this case there is only a single comparison in each group.&lt;/p&gt;
&lt;p&gt;If we consider the family of comparisons to be all comparisons regardless of group and want to correct for multiple comparisons, we can do so via &lt;code&gt;rbind.emmGrid&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Here is an example of passing &lt;code&gt;contrasts&lt;/code&gt; to &lt;code&gt;rbind()&lt;/code&gt; to correct for multiple comparisons. The default adjustment is Bonferroni, which can be much too conservative when the number of comparisons is large. You can control the multiple comparisons procedure via &lt;code&gt;adjust&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The results of &lt;code&gt;rbind()&lt;/code&gt; can also conveniently be used with &lt;code&gt;summary()&lt;/code&gt;, &lt;code&gt;confint()&lt;/code&gt;, and/or &lt;code&gt;as.data.frame()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm2$contrasts %&amp;gt;%
     rbind() &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  f2 contrast  ratio     SE df null t.ratio p.value
#  1  a / c    1.9553 1.2306 16    1   1.065  0.6050
#  c  a / c    0.0733 0.0461 16    1  -4.152  0.0015
# 
# P value adjustment: bonferroni method for 2 tests 
# Tests are performed on the log scale&lt;/code&gt;&lt;/pre&gt;
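&lt;p&gt;For example, a Holm adjustment could be requested in place of the Bonferroni default (output not shown):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm2$contrasts %&amp;gt;%
     rbind(adjust = &amp;quot;holm&amp;quot;)&lt;/code&gt;&lt;/pre&gt;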
&lt;/div&gt;
&lt;div id=&#34;main-effects-comparisons&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Main effects comparisons&lt;/h1&gt;
&lt;p&gt;Even if we have multiple factors in the model, complete with an interaction term, we can still do “overall” comparisons among groups if our research question indicated that main effects were important to estimate.&lt;/p&gt;
&lt;p&gt;Doing main effects in the presence of an interaction means we &lt;em&gt;average over&lt;/em&gt; the levels of the other factor(s). The &lt;code&gt;emmeans()&lt;/code&gt; function gives both a warning about the interaction and a message indicating which factor was averaged over to remind us of this.&lt;/p&gt;
&lt;p&gt;Here is the estimated main effect of &lt;code&gt;f1&lt;/code&gt;. Since we are only interested in overall comparisons of that factor it is the only factor given on the right-hand side of the &lt;code&gt;specs&lt;/code&gt; formula.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emmeans(fit1, specs = pairwise ~ f1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# NOTE: Results may be misleading due to involvement in interactions&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $emmeans
#  f1 emmean    SE df lower.CL upper.CL
#  a  -0.354 0.315 16  -1.0215    0.313
#  c   0.617 0.315 16  -0.0503    1.284
# 
# Results are averaged over the levels of: f2 
# Results are given on the log (not the response) scale. 
# Confidence level used: 0.95 
# 
# $contrasts
#  contrast estimate    SE df t.ratio p.value
#  a - c      -0.971 0.445 16  -2.182  0.0443
# 
# Results are averaged over the levels of: f2 
# Results are given on the log (not the response) scale.&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;treatment-vs-control-example&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Treatment vs control example&lt;/h1&gt;
&lt;p&gt;The &lt;strong&gt;emmeans&lt;/strong&gt; package has built-in helper functions for comparing each group mean to the control mean. If the control group is in the first row of the &lt;code&gt;emmeans&lt;/code&gt; section of the output, this set of comparisons can be requested via &lt;code&gt;trt.vs.ctrl&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Note the default multiple comparisons adjustment is a Dunnett adjustment.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emmeans(fit1, specs = trt.vs.ctrl ~ f1:f2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $emmeans
#  f1 f2 emmean    SE df lower.CL upper.CL
#  a  1   0.569 0.445 16   -0.374    1.512
#  c  1  -0.102 0.445 16   -1.045    0.842
#  a  c  -1.278 0.445 16   -2.221   -0.334
#  c  c   1.335 0.445 16    0.392    2.279
# 
# Results are given on the log (not the response) scale. 
# Confidence level used: 0.95 
# 
# $contrasts
#  contrast  estimate    SE df t.ratio p.value
#  c 1 - a 1   -0.671 0.629 16  -1.065  0.5857
#  a c - a 1   -1.847 0.629 16  -2.934  0.0262
#  c c - a 1    0.766 0.629 16   1.217  0.4947
# 
# Results are given on the log (not the response) scale. 
# P value adjustment: dunnettx method for 3 tests&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using &lt;code&gt;trt.vs.ctrl&lt;/code&gt; means we ended up comparing each group mean to the “a 1” group since it is in the first row. In the example I’m using, the control group, “c c”, is actually the &lt;em&gt;last&lt;/em&gt; group listed in the &lt;code&gt;emmeans&lt;/code&gt; section. When the control group is the last group in &lt;code&gt;emmeans&lt;/code&gt;, we can use &lt;code&gt;trt.vs.ctrlk&lt;/code&gt; to get the correct set of comparisons.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emmeans(fit1, specs = trt.vs.ctrlk ~ f1:f2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $emmeans
#  f1 f2 emmean    SE df lower.CL upper.CL
#  a  1   0.569 0.445 16   -0.374    1.512
#  c  1  -0.102 0.445 16   -1.045    0.842
#  a  c  -1.278 0.445 16   -2.221   -0.334
#  c  c   1.335 0.445 16    0.392    2.279
# 
# Results are given on the log (not the response) scale. 
# Confidence level used: 0.95 
# 
# $contrasts
#  contrast  estimate    SE df t.ratio p.value
#  a 1 - c c   -0.766 0.629 16  -1.217  0.4947
#  c 1 - c c   -1.437 0.629 16  -2.283  0.0935
#  a c - c c   -2.613 0.629 16  -4.152  0.0021
# 
# Results are given on the log (not the response) scale. 
# P value adjustment: dunnettx method for 3 tests&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That gives us what we want in this case. However, if the control group was some other group, like “c 1”, we could use &lt;code&gt;trt.vs.ctrlk&lt;/code&gt; with the &lt;code&gt;ref&lt;/code&gt; argument to define which row in the &lt;code&gt;emmeans&lt;/code&gt; section represents the control group.&lt;/p&gt;
&lt;p&gt;The “c 1” group is in the second row of the &lt;code&gt;emmeans&lt;/code&gt; section, so we can use &lt;code&gt;ref = 2&lt;/code&gt; to define this group as the control group.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emmeans(fit1, specs = trt.vs.ctrlk ~ f1:f2, ref = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $emmeans
#  f1 f2 emmean    SE df lower.CL upper.CL
#  a  1   0.569 0.445 16   -0.374    1.512
#  c  1  -0.102 0.445 16   -1.045    0.842
#  a  c  -1.278 0.445 16   -2.221   -0.334
#  c  c   1.335 0.445 16    0.392    2.279
# 
# Results are given on the log (not the response) scale. 
# Confidence level used: 0.95 
# 
# $contrasts
#  contrast  estimate    SE df t.ratio p.value
#  a 1 - c 1    0.671 0.629 16   1.065  0.5857
#  a c - c 1   -1.176 0.629 16  -1.869  0.1937
#  c c - c 1    1.437 0.629 16   2.283  0.0935
# 
# Results are given on the log (not the response) scale. 
# P value adjustment: dunnettx method for 3 tests&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, if we want to reverse the order of subtraction in the treatment vs control comparisons we can use the &lt;code&gt;reverse&lt;/code&gt; argument.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emmeans(fit1, specs = trt.vs.ctrlk ~ f1:f2, ref = 2, reverse = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $emmeans
#  f1 f2 emmean    SE df lower.CL upper.CL
#  a  1   0.569 0.445 16   -0.374    1.512
#  c  1  -0.102 0.445 16   -1.045    0.842
#  a  c  -1.278 0.445 16   -2.221   -0.334
#  c  c   1.335 0.445 16    0.392    2.279
# 
# Results are given on the log (not the response) scale. 
# Confidence level used: 0.95 
# 
# $contrasts
#  contrast  estimate    SE df t.ratio p.value
#  c 1 - a 1   -0.671 0.629 16  -1.065  0.5857
#  c 1 - a c    1.176 0.629 16   1.869  0.1937
#  c 1 - c c   -1.437 0.629 16  -2.283  0.0935
# 
# Results are given on the log (not the response) scale. 
# P value adjustment: dunnettx method for 3 tests&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;alternative-code-for-comparisons&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Alternative code for comparisons&lt;/h1&gt;
&lt;p&gt;The &lt;strong&gt;emmeans&lt;/strong&gt; package offers the option to do comparisons in two steps instead of the single step I have been using so far. I personally find this alternative most useful when doing custom comparisons, and introducing it now should make it look familiar later. This alternative keeps the estimated marginal means and the comparisons of interest in separate objects, which can be attractive in some situations.&lt;/p&gt;
&lt;p&gt;The first step is to use &lt;code&gt;emmeans()&lt;/code&gt; to calculate the marginal means of interest. We still use the formula in &lt;code&gt;specs&lt;/code&gt; with the factor(s) of interest on the right-hand side but no longer put anything on the left-hand side of the tilde.&lt;/p&gt;
&lt;p&gt;We can still use &lt;code&gt;type&lt;/code&gt; in &lt;code&gt;emmeans()&lt;/code&gt; but cannot use &lt;code&gt;adjust&lt;/code&gt; (since we don’t adjust for multiple comparisons until we’ve actually done comparisons 😉).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm3 = emmeans(fit1, specs = ~ f1:f2, type = &amp;quot;response&amp;quot;)
emm3&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  f1 f2 response    SE df lower.CL upper.CL
#  a  1     1.767 0.786 16    0.688    4.538
#  c  1     0.903 0.402 16    0.352    2.321
#  a  c     0.279 0.124 16    0.108    0.716
#  c  c     3.800 1.691 16    1.479    9.763
# 
# Confidence level used: 0.95 
# Intervals are back-transformed from the log scale&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We then get the comparisons we want in a second step using the &lt;code&gt;contrast()&lt;/code&gt; function. We request the comparisons we want via &lt;code&gt;method&lt;/code&gt;. When using built-in comparisons like I am here, we give the comparison function name as a string (meaning in quotes). Also see the &lt;code&gt;pairs()&lt;/code&gt; function, which is for the special case of all pairwise comparisons.&lt;/p&gt;
&lt;p&gt;We can use &lt;code&gt;adjust&lt;/code&gt; in &lt;code&gt;contrast()&lt;/code&gt; to change the multiple comparisons adjustment.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;contrast(emm3, method = &amp;quot;pairwise&amp;quot;, adjust = &amp;quot;none&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  contrast   ratio     SE df null t.ratio p.value
#  a 1 / c 1 1.9553 1.2306 16    1   1.065  0.3025
#  a 1 / a c 6.3396 3.9900 16    1   2.934  0.0097
#  a 1 / c c 0.4648 0.2926 16    1  -1.217  0.2412
#  c 1 / a c 3.2422 2.0406 16    1   1.869  0.0801
#  c 1 / c c 0.2377 0.1496 16    1  -2.283  0.0365
#  a c / c c 0.0733 0.0461 16    1  -4.152  0.0008
# 
# Tests are performed on the log scale&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can follow the &lt;code&gt;contrast()&lt;/code&gt; call with &lt;code&gt;summary()&lt;/code&gt; or &lt;code&gt;confint()&lt;/code&gt; to get the output we want and put it into a data.frame for plotting/saving. Again, I think the real strength of &lt;code&gt;contrast()&lt;/code&gt; comes when we want custom comparisons, and I’ll demonstrate these in my &lt;a href=&#34;https://aosmith.rbind.io/2019/04/15/custom-contrasts-emmeans/&#34;&gt;next post on custom contrasts&lt;/a&gt;.&lt;/p&gt;
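Putting the two steps together with a data.frame conversion might look like the following. This is a sketch assuming the emmeans package is installed; it rebuilds the model from this post's data so it runs on its own, and it uses nested calls rather than pipes.

```r
library(emmeans)

# Same data and model as in the post
dat = data.frame(resp = c(1.6, 0.3, 3, 0.1, 3.2, 0.2, 0.4, 0.4, 2.8, 0.7,
                          3.8, 3, 0.3, 14.3, 1.2, 0.5, 1.1, 4.4, 0.4, 8.4),
                 f1 = factor(rep(c("a", "c"), each = 10)),
                 f2 = factor(rep(c("1", "c"), times = 10)))
fit1 = lm(log(resp) ~ f1 + f2 + f1:f2, data = dat)

# Step 1: estimated marginal means; step 2: comparisons,
# then confint() and as.data.frame() for plotting/saving
emm3 = emmeans(fit1, specs = ~ f1:f2, type = "response")
comps = as.data.frame(confint(contrast(emm3, method = "pairwise", adjust = "none")))
comps # one row per pairwise comparison
```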
&lt;/div&gt;
&lt;div id=&#34;just-the-code-please&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Just the code, please&lt;/h1&gt;
&lt;p&gt;Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code &lt;a href=&#34;https://aosmith.rbind.io/script/2019-03-25-getting-started-with-emmeans.R&#34;&gt;from here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(emmeans) # v. 1.7.0
library(magrittr) # v. 2.0.1

dat = data.frame(resp = c(1.6,0.3,3,0.1,3.2,0.2,0.4,0.4,2.8,
                          0.7,3.8,3,0.3,14.3,1.2,0.5,1.1,4.4,0.4,8.4),
                 f1 = factor(c(&amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,
                               &amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;,
                               &amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;)),
                 f2 = factor(c(&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,
                               &amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,
                               &amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;)))

str(dat)

fit1 = lm(log(resp) ~ f1 + f2 + f1:f2, data = dat)

emm1 = emmeans(fit1, specs = pairwise ~ f1:f2)

emm1$emmeans
emm1$contrasts

emmeans(fit1, specs = pairwise ~ f1:f2, type = &amp;quot;response&amp;quot;)

emm1.1 = emmeans(fit1, specs = pairwise ~ f1:f2, type = &amp;quot;response&amp;quot;, adjust = &amp;quot;none&amp;quot;)
emm1.1

emm1.1$contrasts %&amp;gt;%
     confint()

emm1.1$contrasts %&amp;gt;%
     summary(infer = TRUE)

emm1.1$emmeans %&amp;gt;%
     as.data.frame()

emm2 = emmeans(fit1, specs = pairwise ~ f1|f2, type = &amp;quot;response&amp;quot;)
emm2

emm2$contrasts %&amp;gt;%
     rbind() 

emmeans(fit1, specs = pairwise ~ f1)

emmeans(fit1, specs = trt.vs.ctrl ~ f1:f2)

emmeans(fit1, specs = trt.vs.ctrlk ~ f1:f2)

emmeans(fit1, specs = trt.vs.ctrlk ~ f1:f2, ref = 2)

emmeans(fit1, specs = trt.vs.ctrlk ~ f1:f2, ref = 2, reverse = TRUE)

emm3 = emmeans(fit1, specs = ~ f1:f2, type = &amp;quot;response&amp;quot;)
emm3

contrast(emm3, method = &amp;quot;pairwise&amp;quot;, adjust = &amp;quot;none&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Lots of zeros or too many zeros?: Thinking about zero inflation in count data</title>
      <link>https://aosmith.rbind.io/2019/03/06/lots-of-zeros/</link>
      <pubDate>Wed, 06 Mar 2019 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2019/03/06/lots-of-zeros/</guid>
      <description>


&lt;p&gt;In a recent lecture I gave a basic overview of zero-inflation in count distributions. My main take-home message to the students that I thought worth posting about here is that having a lot of zero values does not necessarily mean you have zero inflation.&lt;/p&gt;
&lt;p&gt;Zero inflation is when there are more 0 values in the data than the distribution allows for. But some distributions can have a lot of zeros!&lt;/p&gt;
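As a quick sketch of this point, the density functions show how different the chance of a zero can be for two distributions with the same mean. The mean of 10 and theta of 0.05 here match the simulation later in this post.

```r
# Two distributions, both with mean 10, but very different chances of a 0
dpois(0, lambda = 10)            # Poisson: about 0.000045, zeros essentially never
dnbinom(0, mu = 10, size = 0.05) # negative binomial, theta = 0.05: about 0.77
```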
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#load-packages-and-dataset&#34;&gt;Load packages and dataset&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#negative-binomial-with-many-zeros&#34;&gt;Negative binomial with many zeros&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#generalized-poisson-with-many-zeros&#34;&gt;Generalized Poisson with many zeros&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#lots-of-zeros-or-excess-zeros&#34;&gt;Lots of zeros or excess zeros?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#simulate-negative-binomial-data&#34;&gt;Simulate negative binomial data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#checking-for-excess-zeros&#34;&gt;Checking for excess zeros&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#an-example-with-excess-zeros&#34;&gt;An example with excess zeros&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#just-the-code-please&#34;&gt;Just the code, please&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;load-packages-and-dataset&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Load packages and dataset&lt;/h1&gt;
&lt;p&gt;I’m going to be simulating counts from different distributions to demonstrate this. First I’ll load the packages I’m using today.&lt;/p&gt;
&lt;p&gt;Package &lt;strong&gt;HMMpa&lt;/strong&gt; is for a function to draw random samples from the generalized Poisson distribution.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2) # v. 3.1.0
library(HMMpa) # v. 1.0.1
library(MASS) # v. 7.3-51.1&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;negative-binomial-with-many-zeros&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Negative binomial with many zeros&lt;/h1&gt;
&lt;p&gt;First I’ll draw 200 counts from a negative binomial with a mean (&lt;span class=&#34;math inline&#34;&gt;\(\lambda\)&lt;/span&gt;) of &lt;span class=&#34;math inline&#34;&gt;\(10\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\theta = 0.05\)&lt;/span&gt;.&lt;br /&gt;
R uses the parameterization of the negative binomial where the variance of the distribution is &lt;span class=&#34;math inline&#34;&gt;\(\lambda + (\lambda^2/\theta)\)&lt;/span&gt;. In this parameterization, as &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; gets small the variance gets big. Using a very small value of theta, as I am here, generally means the distribution of counts will have many zeros as well as a few large counts.&lt;/p&gt;
&lt;p&gt;I pull a random sample of size 200 from this distribution using &lt;code&gt;rnbinom()&lt;/code&gt;. The &lt;code&gt;mu&lt;/code&gt; argument is the mean and the &lt;code&gt;size&lt;/code&gt; argument is theta.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(16)
dat = data.frame(Y = rnbinom(200, mu = 10, size = .05) )&lt;/code&gt;&lt;/pre&gt;
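The variance parameterization above can also be checked directly by simulation. This is an illustrative sketch; the sample size of one million is an arbitrary choice to make the empirical moments reasonably stable.

```r
# Empirical check of Var = mu + mu^2/theta for rnbinom()
set.seed(16)
sim = rnbinom(1e6, mu = 10, size = 0.05)
mean(sim) # near the true mean of 10
var(sim)  # near 10 + 10^2/0.05 = 2010
```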
&lt;p&gt;Below is a histogram of these data. I’ve annotated the plot with the proportion of the 200 values that are 0 as well as the maximum observed count in the dataset. There are lots of zeros! But these data are not zero-inflated because we expect to have many 0 values under this particular distribution.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(dat, aes(x = Y) ) +
    geom_histogram(binwidth = 5)  +
    theme_bw(base_size = 18) +
    labs(y = &amp;quot;Frequency&amp;quot;,
         title = &amp;quot;Negative binomial&amp;quot;,
         subtitle = &amp;quot;mean = 10, theta = 0.05&amp;quot; ) +
    annotate(geom = &amp;quot;text&amp;quot;,
            label = paste(&amp;quot;Proportion 0:&amp;quot;, mean(dat$Y == 0), 
                        &amp;quot;\nMax Count:&amp;quot;, max(dat$Y) ),
                        x = 150, y = 100, size = 8)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-03-06-lots-of-zeros_files/figure-html/unnamed-chunk-3-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;generalized-poisson-with-many-zeros&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Generalized Poisson with many zeros&lt;/h1&gt;
&lt;p&gt;I don’t know the generalized Poisson distribution well, although it appears to be regularly used in some fields. For whatever reason, the negative binomial seems much more common in ecology. 🤷&lt;/p&gt;
&lt;p&gt;From my understanding, the generalized Poisson distribution can have heavier tails than the negative binomial. This would mean that it can have more extreme maximum counts as well as lots of zeros.&lt;/p&gt;
&lt;p&gt;See the documentation for &lt;code&gt;rgenpois()&lt;/code&gt; for the formula for the density of the generalized Poisson and definitions of mean and variance. Note that when &lt;code&gt;lambda2&lt;/code&gt; is 0, the generalized Poisson reduces to the Poisson.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(16)
dat = data.frame(Y = rgenpois(200, lambda1 = 0.5, lambda2 = 0.95) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Below is a histogram of these data. Just over 50% of the values are zeros but the maximum count is over 1000! 💥&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(dat, aes(x = Y) ) +
    geom_histogram(binwidth = 5)  +
    theme_bw(base_size = 18) +
    labs(y = &amp;quot;Frequency&amp;quot;,
         title = &amp;quot;Generalized Poisson&amp;quot;,
         subtitle = &amp;quot;lambda1 = 0.5, lambda2 = 0.95&amp;quot;) +
    annotate(geom = &amp;quot;text&amp;quot;,
            label = paste(&amp;quot;Proportion 0:&amp;quot;, mean(dat$Y == 0), 
                        &amp;quot;\nMax Count:&amp;quot;, max(dat$Y) ),
                        x = 600, y = 100, size = 8)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-03-06-lots-of-zeros_files/figure-html/unnamed-chunk-5-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;lots-of-zeros-or-excess-zeros&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Lots of zeros or excess zeros?&lt;/h1&gt;
&lt;p&gt;All the simulations above show is that some distributions &lt;em&gt;can&lt;/em&gt; have a lot of zeros. In any given scenario, though, how do we check if we have &lt;em&gt;excess&lt;/em&gt; zeros? Having excess zeros means there are more zeros than expected by the distribution we are using for modeling. If we have excess zeros, then we may need either a different distribution to model the data or a model that specifically addresses zero inflation.&lt;/p&gt;
&lt;p&gt;The key to checking for excess zeros is to estimate the number of zeros you would expect to see if the fitted model were truly the model that created your data and compare that to the number of zeros in the actual data. If there are many more zeros in the data than the model allows for then you have zero inflation compared to whatever distribution you are using.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;simulate-negative-binomial-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Simulate negative binomial data&lt;/h1&gt;
&lt;p&gt;I’ll now simulate data based on a negative binomial model with a single, continuous explanatory variable. I’ll use a model fit to these data to show how to check for excess zeros.&lt;/p&gt;
&lt;p&gt;Since this is a generalized linear model, I first calculate the means based on the linear predictor. The exponentiation is due to using the natural log link to &lt;em&gt;link&lt;/em&gt; the mean to the linear predictor.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(16)
x = runif(200, 5, 10) # simulate explanatory variable
b0 = 1 # set value of intercept
b1 = 0.25 # set value of slope
means = exp(b0 + b1*x) # calculate true means
theta = 0.25 # true theta&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I can use these true means along with my chosen value of &lt;code&gt;theta&lt;/code&gt; to simulate data from the negative binomial distribution.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;y = rnbinom(200, mu = means, size = theta)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now that I’ve made some data I can fit a model. Since I’m using a negative binomial GLM with &lt;code&gt;x&lt;/code&gt; as the explanatory variable, which is how I created the data, this model should work well. The &lt;code&gt;glm.nb()&lt;/code&gt; function is from package &lt;strong&gt;MASS&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit1 = glm.nb(y ~ x)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this exercise I’m going to go directly to checking for excess zeros. This means I’m skipping other important checks of model fit, such as checks for overdispersion and examining residual plots. Don’t skip these in a real analysis; having excess zeros certainly isn’t the only problem we can run into with count data.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;checking-for-excess-zeros&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Checking for excess zeros&lt;/h1&gt;
&lt;p&gt;The observed data has 76 zeros (out of 200).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sum(y == 0)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 76&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;How many zeros are expected given the model? I need the model estimated means and theta to answer this question. I can get the means via &lt;code&gt;predict()&lt;/code&gt; and I can pull &lt;code&gt;theta&lt;/code&gt; out of the model &lt;code&gt;summary()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;preds = predict(fit1, type = &amp;quot;response&amp;quot;) # estimated means
esttheta = summary(fit1)$theta # estimated theta&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For discrete distributions like the negative binomial, the &lt;em&gt;density&lt;/em&gt; distribution functions (which start with the letter “d”) return the probability that the observation is equal to a given value. This means I can use &lt;code&gt;dnbinom()&lt;/code&gt; to calculate the probability of an observation being 0 for every row in the dataset. To do this I need to provide values for the parameters of the distribution of each observation.&lt;/p&gt;
&lt;p&gt;Based on the model, the distribution of each observation is negative binomial with the mean estimated from the model and the overall estimated theta.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;prop0 = dnbinom(x = 0, mu = preds, size = esttheta )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The sum of these probabilities is an estimate of the number of zero values expected by the model (see &lt;a href=&#34;https://data.library.virginia.edu/getting-started-with-hurdle-models/&#34;&gt;here&lt;/a&gt; for another example). I’ll round this to the nearest integer.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;round( sum(prop0) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 72&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The expected number of 0 values is ~72, very close to the 76 observed in the data. This is no big surprise, since I fit the same model that I used to create the data.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;an-example-with-excess-zeros&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;An example with excess zeros&lt;/h1&gt;
&lt;p&gt;The example above demonstrates a model without excess zeros. Let me finish by fitting a model to data that has more zeros than expected by the distribution. This can be done by fitting a Poisson GLM instead of a negative binomial GLM to my simulated data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit2 = glm(y ~ x, family = poisson)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Remember the data contain 76 zeros.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sum(y == 0)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 76&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using &lt;code&gt;dpois()&lt;/code&gt;, the number of zeros given by the Poisson model is 0. 😮 These data are zero-inflated compared to the Poisson distribution, and I clearly need a different approach for modeling these data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;round( sum( dpois(x = 0,
           lambda = predict(fit2, type = &amp;quot;response&amp;quot;) ) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This brings me back to my earlier point about checking model fit. If I had done other standard checks of model fit for &lt;code&gt;fit2&lt;/code&gt; I would have seen additional problems that would indicate the Poisson distribution did not fit these data (such as severe overdispersion).&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;just-the-code-please&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Just the code, please&lt;/h1&gt;
&lt;p&gt;Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code &lt;a href=&#34;https://aosmith.rbind.io/script/2019-03-06-lots-of-zeros.R&#34;&gt;from here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2) # v. 3.1.0
library(HMMpa) # v. 1.0.1
library(MASS) # v. 7.3-51.1

set.seed(16)
dat = data.frame(Y = rnbinom(200, mu = 10, size = .05) )

ggplot(dat, aes(x = Y) ) +
    geom_histogram(binwidth = 5)  +
    theme_bw(base_size = 18) +
    labs(y = &amp;quot;Frequency&amp;quot;,
         title = &amp;quot;Negative binomial&amp;quot;,
         subtitle = &amp;quot;mean = 10, theta = 0.05&amp;quot; ) +
    annotate(geom = &amp;quot;text&amp;quot;,
            label = paste(&amp;quot;Proportion 0:&amp;quot;, mean(dat$Y == 0), 
                        &amp;quot;\nMax Count:&amp;quot;, max(dat$Y) ),
                        x = 150, y = 100, size = 8)

set.seed(16)
dat = data.frame(Y = rgenpois(200, lambda1 = 0.5, lambda2 = 0.95) )

ggplot(dat, aes(x = Y) ) +
    geom_histogram(binwidth = 5)  +
    theme_bw(base_size = 18) +
    labs(y = &amp;quot;Frequency&amp;quot;,
         title = &amp;quot;Generalized Poisson&amp;quot;,
         subtitle = &amp;quot;lambda1 = 0.5, lambda2 = 0.95&amp;quot;) +
    annotate(geom = &amp;quot;text&amp;quot;,
            label = paste(&amp;quot;Proportion 0:&amp;quot;, mean(dat$Y == 0), 
                        &amp;quot;\nMax Count:&amp;quot;, max(dat$Y) ),
                        x = 600, y = 100, size = 8)

set.seed(16)
x = runif(200, 5, 10) # simulate explanatory variable
b0 = 1 # set value of intercept
b1 = 0.25 # set value of slope
means = exp(b0 + b1*x) # calculate true means
theta = 0.25 # true theta
y = rnbinom(200, mu = means, size = theta)

fit1 = glm.nb(y ~ x)

sum(y == 0)

preds = predict(fit1, type = &amp;quot;response&amp;quot;) # estimated means
esttheta = summary(fit1)$theta # estimated theta

prop0 = dnbinom(x = 0, mu = preds, size = esttheta )
round( sum(prop0) )

fit2 = glm(y ~ x, family = poisson)
sum(y == 0)

round( sum( dpois(x = 0,
           lambda = predict(fit2, type = &amp;quot;response&amp;quot;) ) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>
