<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Very statisticious on Very statisticious</title>
    <link>https://aosmith.rbind.io/</link>
    <description>Recent content in Very statisticious on Very statisticious</description>
    <generator>Hugo -- gohugo.io</generator>
    <lastBuildDate>Mon, 31 Aug 2020 00:00:00 +0000</lastBuildDate>
    <atom:link href="/" rel="self" type="application/rss+xml" />
    
    <item>
      <title>Handling errors using purrr&#39;s possibly() and safely()</title>
      <link>https://aosmith.rbind.io/2020/08/31/handling-errors/</link>
      <pubDate>Mon, 31 Aug 2020 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2020/08/31/handling-errors/</guid>
      <description>


&lt;p&gt;One topic I haven’t discussed in my previous posts about automating tasks with loops or doing simulations is how to deal with errors. If we have unanticipated errors a &lt;code&gt;map()&lt;/code&gt; or &lt;code&gt;lapply()&lt;/code&gt; loop will come to a screeching halt with no output to show for the time spent. When your task is time-consuming, this can feel pretty frustrating, since the whole process has to be restarted.&lt;/p&gt;
&lt;p&gt;How to deal with errors? Using functions &lt;code&gt;try()&lt;/code&gt; or &lt;code&gt;tryCatch()&lt;/code&gt; when building a function is the traditional way to catch and address potential errors. In the past I’ve struggled to remember how to use these, though, and functions &lt;code&gt;possibly()&lt;/code&gt; and &lt;code&gt;safely()&lt;/code&gt; from package &lt;strong&gt;purrr&lt;/strong&gt; are convenient alternatives that I find a little easier to use.&lt;/p&gt;
&lt;p&gt;In this post I’ll show examples on how to use these two functions for handling errors. I’ll also demonstrate the use of the related function &lt;code&gt;quietly()&lt;/code&gt; to capture other types of output, such as warnings and messages.&lt;/p&gt;
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#r-packages&#34;&gt;R packages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-possibly-to-return-values-instead-of-errors&#34;&gt;Using possibly() to return values instead of errors&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#wrapping-a-function-with-possibly&#34;&gt;Wrapping a function with possibly()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#finding-the-groups-with-errors&#34;&gt;Finding the groups with errors&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-compact-to-remove-empty-elements&#34;&gt;Using compact() to remove empty elements&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-safely-to-capture-results-and-errors&#34;&gt;Using safely() to capture results and errors&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#exploring-the-errors&#34;&gt;Exploring the errors&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#extracting-results&#34;&gt;Extracting results&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-quietly-to-capture-messages&#34;&gt;Using quietly() to capture messages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#just-the-code-please&#34;&gt;Just the code, please&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;r-packages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;R packages&lt;/h1&gt;
&lt;p&gt;The functions I’m highlighting today are from package &lt;strong&gt;purrr&lt;/strong&gt;. I’ll also use &lt;strong&gt;lme4&lt;/strong&gt; for fitting simulated data to linear mixed models.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(purrr) # v. 0.3.4
library(lme4) # v. 1.1-23&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;using-possibly-to-return-values-instead-of-errors&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using possibly() to return values instead of errors&lt;/h1&gt;
&lt;p&gt;When doing a repetitive task like fitting many models with a &lt;code&gt;map()&lt;/code&gt; loop, an error in one of the models will shut down the whole process. We can anticipate this issue and bypass it by defining a value to return if a model errors out via &lt;code&gt;possibly()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I created the very small dataset below to demonstrate the issue. The goal is to fit a linear model of &lt;code&gt;y&lt;/code&gt; vs &lt;code&gt;x&lt;/code&gt; for each &lt;code&gt;group&lt;/code&gt;. I made exactly two groups here, &lt;em&gt;a&lt;/em&gt; and &lt;em&gt;b&lt;/em&gt;, to make it easy to see what goes wrong and why. Usually we have many more groups and potential problems can be harder to spot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat = structure(list(group = c(&amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;b&amp;quot;), 
                     x = c(&amp;quot;A&amp;quot;, &amp;quot;A&amp;quot;, &amp;quot;A&amp;quot;, &amp;quot;B&amp;quot;, &amp;quot;B&amp;quot;, &amp;quot;B&amp;quot;, &amp;quot;A&amp;quot;, &amp;quot;A&amp;quot;, &amp;quot;A&amp;quot;), 
                     y = c(10.9, 11.1, 10.5, 9.7, 10.5, 10.9, 13, 9.9, 10.3)), 
                class = &amp;quot;data.frame&amp;quot;, 
                row.names = c(NA, -9L))
dat&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#   group x    y
# 1     a A 10.9
# 2     a A 11.1
# 3     a A 10.5
# 4     a B  9.7
# 5     a B 10.5
# 6     a B 10.9
# 7     b A 13.0
# 8     b A  9.9
# 9     b A 10.3&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I’ll first split the dataset by &lt;code&gt;group&lt;/code&gt; to get a list of data.frames to loop through.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat_split = split(dat, dat$group)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I’ll loop through each dataset in the list with &lt;code&gt;map()&lt;/code&gt; and fit a linear model with &lt;code&gt;lm()&lt;/code&gt;. Instead of getting output, though, I get an error.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;map(dat_split, ~lm(y ~ x, data = .x) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Error in `contrasts&amp;lt;-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts
can be applied only to factors with 2 or more levels&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;What’s going on? If you look at the dataset again, you’ll see that &lt;code&gt;x&lt;/code&gt; from group &lt;em&gt;b&lt;/em&gt; contains only a single value. Once you know that you can see the error actually is telling us what the problem is: we can’t use a factor with only one level.&lt;/p&gt;
&lt;p&gt;Model &lt;em&gt;a&lt;/em&gt; fits fine, since &lt;code&gt;x&lt;/code&gt; has two values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lm(y ~ x, data = dat, subset = group == &amp;quot;a&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# 
# Call:
# lm(formula = y ~ x, data = dat, subset = group == &amp;quot;a&amp;quot;)
# 
# Coefficients:
# (Intercept)           xB  
#     10.8333      -0.4667&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It is the &lt;em&gt;b&lt;/em&gt; model that fails.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lm(y ~ x, data = dat, subset = group == &amp;quot;b&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Error in `contrasts&amp;lt;-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts
can be applied only to factors with 2 or more levels&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can imagine that the problem of having only a single value for the factor in some groups could be easy to miss if you working with a large number groups. This is where &lt;code&gt;possibly()&lt;/code&gt; can help, allowing us to keep going through all groups regardless of errors. We can then find and explore problem groups.&lt;/p&gt;
&lt;div id=&#34;wrapping-a-function-with-possibly&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Wrapping a function with possibly()&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;possibly()&lt;/code&gt; function is a &lt;em&gt;wrapper&lt;/em&gt; function. It wraps around an existing function. Other than defining the function to wrap, the main argument of interest is &lt;code&gt;otherwise&lt;/code&gt;. In &lt;code&gt;otherwise&lt;/code&gt; we define what value to return if we get an error from the function we are wrapping.&lt;/p&gt;
&lt;p&gt;I make a new wrapped function called &lt;code&gt;posslm1()&lt;/code&gt;, which wraps &lt;code&gt;lm()&lt;/code&gt; and returns “Error” if an error occurs when fitting the model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;posslm1 = possibly(.f = lm, otherwise = &amp;quot;Error&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When I use &lt;code&gt;posslm1()&lt;/code&gt; in my model fitting loop, you can see that loop now finishes. Model &lt;em&gt;b&lt;/em&gt; contains the string “Error” instead of a model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;map(dat_split, ~posslm1(y ~ x, data = .x) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $a
# 
# Call:
# .f(formula = ..1, data = ..2)
# 
# Coefficients:
# (Intercept)           xB  
#     10.8333      -0.4667  
# 
# 
# $b
# [1] &amp;quot;Error&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here’s another example of &lt;code&gt;possibly()&lt;/code&gt; wrapped around &lt;code&gt;lm()&lt;/code&gt;, this time using &lt;code&gt;otherwise = NULL&lt;/code&gt;. Depending on what we plan to do with the output, using &lt;code&gt;NULL&lt;/code&gt; or &lt;code&gt;NA&lt;/code&gt; as the return value can be useful when using &lt;code&gt;possibly()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Now group &lt;em&gt;b&lt;/em&gt; is &lt;code&gt;NULL&lt;/code&gt; in the output.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;posslm2 = possibly(.f = lm, otherwise = NULL)
( mods = map(dat_split, ~posslm2(y ~ x, data = .x) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $a
# 
# Call:
# .f(formula = ..1, data = ..2)
# 
# Coefficients:
# (Intercept)           xB  
#     10.8333      -0.4667  
# 
# 
# $b
# NULL&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;finding-the-groups-with-errors&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Finding the groups with errors&lt;/h2&gt;
&lt;p&gt;Once the loop is done, we can examine the groups that had errors when fitting models. For example, I can use &lt;code&gt;purrr::keep()&lt;/code&gt; to keep only the results that are &lt;code&gt;NULL&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mods %&amp;gt;%
     keep(~is.null(.x) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $b
# NULL&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This allows me to pull out the names for the groups that had errors. Getting the names in this way is one reason I like that &lt;code&gt;split()&lt;/code&gt; returns named lists.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;group_errs = mods %&amp;gt;%
     keep(~is.null(.x) ) %&amp;gt;%
     names()
group_errs&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] &amp;quot;b&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once I have the names of the groups with errors, I can pull any problematic groups out of the original dataset or the split list to examine them more closely. (I use &lt;code&gt;%in%&lt;/code&gt; here in case &lt;code&gt;group_errs&lt;/code&gt; is a vector.)&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat[dat$group %in% group_errs, ]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#   group x    y
# 7     b A 13.0
# 8     b A  9.9
# 9     b A 10.3&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat_split[group_errs]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $b
#   group x    y
# 7     b A 13.0
# 8     b A  9.9
# 9     b A 10.3&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;using-compact-to-remove-empty-elements&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Using compact() to remove empty elements&lt;/h2&gt;
&lt;p&gt;You may come to a point where you’ve looked at the problem groups and decide that the models with errors shouldn’t be used in further analysis. In that case, if all the groups with errors are &lt;code&gt;NULL&lt;/code&gt;, you can use &lt;code&gt;purrr::compact()&lt;/code&gt; to remove the empty elements from the list. This can make subsequent loops to get output more straightforward in some cases.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;compact(mods)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $a
# 
# Call:
# .f(formula = ..1, data = ..2)
# 
# Coefficients:
# (Intercept)           xB  
#     10.8333      -0.4667&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;using-safely-to-capture-results-and-errors&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using safely() to capture results and errors&lt;/h1&gt;
&lt;p&gt;Rather than replacing the errors with values, &lt;code&gt;safely()&lt;/code&gt; returns both the results and the errors in a list. This function is also a wrapper function. It defaults to using &lt;code&gt;otherwise = NULL&lt;/code&gt;, and I generally haven’t had reason to change away from that default.&lt;/p&gt;
&lt;p&gt;Here’s an example, wrapping &lt;code&gt;lm()&lt;/code&gt; in &lt;code&gt;safely()&lt;/code&gt; and then using the wrapped function &lt;code&gt;safelm()&lt;/code&gt; to fit the models.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;safelm = safely(.f = lm)
mods2 = map(dat_split, ~safelm(y ~ x, data = .x) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The output for each group is now a list with two elements, one for results (if there was no error) and the other for the error (if there was an error).&lt;/p&gt;
&lt;p&gt;Here’s what this looks like for model &lt;em&gt;a&lt;/em&gt;, which doesn’t have an error. The output contains a result but no error.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mods2[[1]]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $result
# 
# Call:
# .f(formula = ..1, data = ..2)
# 
# Coefficients:
# (Intercept)           xB  
#     10.8333      -0.4667  
# 
# 
# $error
# NULL&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Model &lt;em&gt;b&lt;/em&gt; didn’t work, of course, so the results are &lt;code&gt;NULL&lt;/code&gt; but the error was captured in &lt;code&gt;error&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mods2[[2]]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $result
# NULL
# 
# $error
# &amp;lt;simpleError in `contrasts&amp;lt;-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;exploring-the-errors&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Exploring the errors&lt;/h2&gt;
&lt;p&gt;One reason to save the errors using &lt;code&gt;safely()&lt;/code&gt; is so we can take a look at what the errors were for each group. This is most useful with informative errors like the one in my example.&lt;/p&gt;
&lt;p&gt;Errors can be extracted with a &lt;code&gt;map()&lt;/code&gt; loop, pulling out the “error” element from each group.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;map(mods2, &amp;quot;error&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $a
# NULL
# 
# $b
# &amp;lt;simpleError in `contrasts&amp;lt;-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;extracting-results&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Extracting results&lt;/h2&gt;
&lt;p&gt;Results can be extracted similarly, and, if relevant, &lt;code&gt;NULL&lt;/code&gt; results can be removed via &lt;code&gt;compact()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mods2 %&amp;gt;%
     map(&amp;quot;result&amp;quot;) %&amp;gt;%
     compact()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $a
# 
# Call:
# .f(formula = ..1, data = ..2)
# 
# Coefficients:
# (Intercept)           xB  
#     10.8333      -0.4667&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;using-quietly-to-capture-messages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using quietly() to capture messages&lt;/h1&gt;
&lt;p&gt;The &lt;code&gt;quietly()&lt;/code&gt; function doesn’t handle errors, but instead captures other types of output such as warnings and messages along with any results. This is useful for exploring what kinds of warnings come up when doing simulations, for example.&lt;/p&gt;
&lt;p&gt;A few years ago I wrote a post showing a simulation for a linear mixed model. I use the following function, pulled from &lt;a href=&#34;https://aosmith.rbind.io/2018/04/23/simulate-simulate-part-2/&#34;&gt;that earlier post&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;twolevel_fun = function(nstand = 5, nplot = 4, mu = 10, sigma_s = 1, sigma = 1) {
     standeff = rep( rnorm(nstand, 0, sigma_s), each = nplot)
     stand = rep(LETTERS[1:nstand], each = nplot)
     ploteff = rnorm(nstand*nplot, 0, sigma)
     resp = mu + standeff + ploteff
     dat = data.frame(stand, resp)
     lmer(resp ~ 1 + (1|stand), data = dat)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One thing I skipped discussing in that post were the messages returned for some simulations. However, I can certainly picture scenarios where it would be interesting and important to capture warnings and messages to see, e.g., how often they occur even when we know the data comes from the model.&lt;/p&gt;
&lt;p&gt;Here I’ll set the seed so the results are reproducible and then run the function 10 times. You see I get two messages, indicating that two of the ten models returned a message. In this case, the message indicates that the random effect variance is estimated to be exactly 0 in the model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(16)
sims = replicate(10, twolevel_fun(), simplify = FALSE )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# boundary (singular) fit: see ?isSingular
# boundary (singular) fit: see ?isSingular&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It turns out that the second model in the output list is one with a message. You can see at the bottom of the model output below that there is 1 lme4 warning.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sims[[2]]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Linear mixed model fit by REML [&amp;#39;lmerMod&amp;#39;]
# Formula: resp ~ 1 + (1 | stand)
#    Data: dat
# REML criterion at convergence: 45.8277
# Random effects:
#  Groups   Name        Std.Dev.
#  stand    (Intercept) 0.0000  
#  Residual             0.7469  
# Number of obs: 20, groups:  stand, 5
# Fixed Effects:
# (Intercept)  
#       10.92  
# convergence code 0; 0 optimizer warnings; 1 lme4 warnings&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;strong&gt;lme4&lt;/strong&gt; package stores warnings and messages in the model object, so I can pull the message out of the model object.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sims[[2]]@optinfo$conv$lme4&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $messages
# [1] &amp;quot;boundary (singular) fit: see ?isSingular&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But I think &lt;code&gt;quietly()&lt;/code&gt; is more convenient for this task. This is another wrapper function, and I’m going to wrap it around &lt;code&gt;lmer()&lt;/code&gt;. I do this because I’m focusing specifically on messages that happen when I fit the model. However, I could have wrapped &lt;code&gt;twolevel_fun()&lt;/code&gt; and captured any messages from the entire simulation process.&lt;/p&gt;
&lt;p&gt;I use my new function &lt;code&gt;qlmer()&lt;/code&gt; inside my simulation function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;qlmer = quietly(.f = lmer)
qtwolevel_fun = function(nstand = 5, nplot = 4, mu = 10, sigma_s = 1, sigma = 1) {
     standeff = rep( rnorm(nstand, 0, sigma_s), each = nplot)
     stand = rep(LETTERS[1:nstand], each = nplot)
     ploteff = rnorm(nstand*nplot, 0, sigma)
     resp = mu + standeff + ploteff
     dat = data.frame(stand, resp)
     qlmer(resp ~ 1 + (1|stand), data = dat)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I set the seed back to 16 so I get the same models and then run the function using &lt;code&gt;qlmer()&lt;/code&gt; 10 times. Note this is considered &lt;em&gt;quiet&lt;/em&gt; because the messages are now captured in the output by &lt;code&gt;quietly()&lt;/code&gt; instead of printed.&lt;/p&gt;
&lt;p&gt;The wrapped function returns a list with 4 elements, including the results, any printed output, warnings, and messages. You can see this for the second model here.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(16)
sims2 = replicate(10, qtwolevel_fun(), simplify = FALSE)
sims2[[2]]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $result
# Linear mixed model fit by REML [&amp;#39;lmerMod&amp;#39;]
# Formula: resp ~ 1 + (1 | stand)
#    Data: ..2
# REML criterion at convergence: 45.8277
# Random effects:
#  Groups   Name        Std.Dev.
#  stand    (Intercept) 0.0000  
#  Residual             0.7469  
# Number of obs: 20, groups:  stand, 5
# Fixed Effects:
# (Intercept)  
#       10.92  
# convergence code 0; 0 optimizer warnings; 1 lme4 warnings 
# 
# $output
# [1] &amp;quot;&amp;quot;
# 
# $warnings
# character(0)
# 
# $messages
# [1] &amp;quot;boundary (singular) fit: see ?isSingular\n&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In a simulation setting, I think seeing how many times different messages and warnings come up could be pretty interesting. It might inform how problematic a message is. If a message is common in simulation we may feel more confident that such a message from a model fit to our real data is not a big issue.&lt;/p&gt;
&lt;p&gt;For example, I could pull out all the &lt;code&gt;messages&lt;/code&gt; and then put the results into a vector with &lt;code&gt;unlist()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sims2 %&amp;gt;%
     map(&amp;quot;messages&amp;quot;) %&amp;gt;% 
     unlist()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] &amp;quot;boundary (singular) fit: see ?isSingular\n&amp;quot;
# [2] &amp;quot;boundary (singular) fit: see ?isSingular\n&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If I wanted to extract multiple parts of the output, such as keeping both messages and warnings, I can use the extract brackets in &lt;code&gt;map()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;These results don’t look much different compared to the output above since there are no warnings in my example. However, note the result is now in a named vector so I could potentially keep track of which are &lt;code&gt;messages&lt;/code&gt; and which are &lt;code&gt;warnings&lt;/code&gt; if I needed to.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sims2 %&amp;gt;%
     map(`[`, c(&amp;quot;messages&amp;quot;, &amp;quot;warnings&amp;quot;) ) %&amp;gt;%
     unlist()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#                                     messages 
# &amp;quot;boundary (singular) fit: see ?isSingular\n&amp;quot; 
#                                     messages 
# &amp;quot;boundary (singular) fit: see ?isSingular\n&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I showed only fairly simple way to use these three functions. However, you certainly may find yourself using them for more complex tasks. For example, I’ve been in situations in the past where I wanted to keep only models that didn’t have errors when building parametric bootstrap confidence intervals. If they had existed at the time, I could have used &lt;code&gt;possibly()&lt;/code&gt; or &lt;code&gt;safely()&lt;/code&gt; in a &lt;code&gt;while()&lt;/code&gt; loop, where the bootstrap data would be redrawn until a model fit without error. Very useful! 😉&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;just-the-code-please&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Just the code, please&lt;/h1&gt;
&lt;p&gt;Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code &lt;a href=&#34;https://aosmith.rbind.io/script/2020-08-31-handling-errors.R&#34;&gt;from here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(purrr) # v. 0.3.4
library(lme4) # v. 1.1-23

dat = structure(list(group = c(&amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;b&amp;quot;), 
                     x = c(&amp;quot;A&amp;quot;, &amp;quot;A&amp;quot;, &amp;quot;A&amp;quot;, &amp;quot;B&amp;quot;, &amp;quot;B&amp;quot;, &amp;quot;B&amp;quot;, &amp;quot;A&amp;quot;, &amp;quot;A&amp;quot;, &amp;quot;A&amp;quot;), 
                     y = c(10.9, 11.1, 10.5, 9.7, 10.5, 10.9, 13, 9.9, 10.3)), 
                class = &amp;quot;data.frame&amp;quot;, 
                row.names = c(NA, -9L))
dat

dat_split = split(dat, dat$group)
map(dat_split, ~lm(y ~ x, data = .x) )

lm(y ~ x, data = dat, subset = group == &amp;quot;a&amp;quot;)
lm(y ~ x, data = dat, subset = group == &amp;quot;b&amp;quot;)

posslm1 = possibly(.f = lm, otherwise = &amp;quot;Error&amp;quot;)
map(dat_split, ~posslm1(y ~ x, data = .x) )

posslm2 = possibly(.f = lm, otherwise = NULL)
( mods = map(dat_split, ~posslm2(y ~ x, data = .x) ) )

mods %&amp;gt;%
     keep(~is.null(.x) )

group_errs = mods %&amp;gt;%
     keep(~is.null(.x) ) %&amp;gt;%
     names()
group_errs

dat[dat$group %in% group_errs, ]
dat_split[group_errs]

compact(mods)

safelm = safely(.f = lm)
mods2 = map(dat_split, ~safelm(y ~ x, data = .x) )
mods2[[1]]
mods2[[2]]

map(mods2, &amp;quot;error&amp;quot;)

mods2 %&amp;gt;%
     map(&amp;quot;result&amp;quot;) %&amp;gt;%
     compact()

twolevel_fun = function(nstand = 5, nplot = 4, mu = 10, sigma_s = 1, sigma = 1) {
     standeff = rep( rnorm(nstand, 0, sigma_s), each = nplot)
     stand = rep(LETTERS[1:nstand], each = nplot)
     ploteff = rnorm(nstand*nplot, 0, sigma)
     resp = mu + standeff + ploteff
     dat = data.frame(stand, resp)
     lmer(resp ~ 1 + (1|stand), data = dat)
}

set.seed(16)
sims = replicate(10, twolevel_fun(), simplify = FALSE )
sims[[2]]
sims[[2]]@optinfo$conv$lme4

qlmer = quietly(.f = lmer)
qtwolevel_fun = function(nstand = 5, nplot = 4, mu = 10, sigma_s = 1, sigma = 1) {
     standeff = rep( rnorm(nstand, 0, sigma_s), each = nplot)
     stand = rep(LETTERS[1:nstand], each = nplot)
     ploteff = rnorm(nstand*nplot, 0, sigma)
     resp = mu + standeff + ploteff
     dat = data.frame(stand, resp)
     qlmer(resp ~ 1 + (1|stand), data = dat)
}

set.seed(16)
sims2 = replicate(10, qtwolevel_fun(), simplify = FALSE)
sims2[[2]]

sims2 %&amp;gt;%
     map(&amp;quot;messages&amp;quot;) %&amp;gt;% 
     unlist()

sims2 %&amp;gt;%
     map(`[`, c(&amp;quot;messages&amp;quot;, &amp;quot;warnings&amp;quot;) ) %&amp;gt;%
     unlist()&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Simulate! Simulate! - Part 4: A binomial generalized linear mixed model</title>
      <link>https://aosmith.rbind.io/2020/08/20/simulate-binomial-glmm/</link>
      <pubDate>Thu, 20 Aug 2020 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2020/08/20/simulate-binomial-glmm/</guid>
      <description>


&lt;p&gt;A post about simulating data from a generalized linear &lt;em&gt;mixed&lt;/em&gt; model (GLMM), the fourth post in my simulations series involving linear models, is long overdue. I settled on a binomial example based on a binomial GLMM with a logit link.&lt;/p&gt;
&lt;p&gt;I find binomial models the most difficult to grok, primarily because the model is on the scale of log odds, inference is based on odds, but the response variable is a &lt;em&gt;counted proportion&lt;/em&gt;. I use the term counted proportion to indicate that the proportions are based on discrete counts, the total number of “successes” divided by the total number of trials. A different distribution (possibly beta) would be needed for continuous proportions like, e.g., total leaf area with lesions.&lt;/p&gt;
&lt;p&gt;Models based on single parameter distributions like the binomial can be overdispersed or underdispersed, where the variance in the data is bigger or smaller, respectively, than the variance defined by the binomial distribution. Given this, I thought exploring estimates of dispersion based on simulated data that we know comes from a binomial distribution would be interesting.&lt;/p&gt;
&lt;p&gt;I will be simulating data “manually”. However, also see the &lt;a href=&#34;https://www.rdocumentation.org/packages/lme4/versions/1.1-23/topics/simulate.merMod&#34;&gt;&lt;code&gt;simulate()&lt;/code&gt;&lt;/a&gt; function from package &lt;strong&gt;lme4&lt;/strong&gt;. I find this function particularly useful if I want to simulate data based on a fitted model, but it can also be used in situations where you don’t already have a model.&lt;/p&gt;
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#r-packages&#34;&gt;R packages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-statistical-model&#34;&gt;The statistical model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#a-single-simulation-for-a-binomial-glmm&#34;&gt;A single simulation for a binomial GLMM&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#defining-the-difference-in-treatments&#34;&gt;Defining the difference in treatments&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#creating-the-study-design-variables&#34;&gt;Creating the study design variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#simulate-the-random-effect&#34;&gt;Simulate the random effect&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#calculate-log-odds&#34;&gt;Calculate log odds&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#convert-log-odds-to-proportions&#34;&gt;Convert log odds to proportions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#generate-the-response-variable&#34;&gt;Generate the response variable&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#fit-a-model&#34;&gt;Fit a model&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#make-a-function-for-the-simulation&#34;&gt;Make a function for the simulation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#repeat-the-simulation-many-times&#34;&gt;Repeat the simulation many times&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#extract-results-from-the-binomial-glmm&#34;&gt;Extract results from the binomial GLMM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#explore-estimated-dispersion&#34;&gt;Explore estimated dispersion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#just-the-code-please&#34;&gt;Just the code, please&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;r-packages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;R packages&lt;/h1&gt;
&lt;p&gt;I’ll be fitting binomial GLMM with &lt;strong&gt;lme4&lt;/strong&gt;. I use &lt;strong&gt;purrr&lt;/strong&gt; for looping and &lt;strong&gt;ggplot2&lt;/strong&gt; for plotting results.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(lme4) # v. 1.1-23
library(purrr) # v. 0.3.4
library(ggplot2) # v. 3.3.2&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;the-statistical-model&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The statistical model&lt;/h1&gt;
&lt;p&gt;As usual, I’ll start by writing out the statistical model using mathematical equations. If these aren’t helpful to you, &lt;a href=&#34;#a-single-simulation-for-a-binomial-glmm&#34;&gt;jump down to the code&lt;/a&gt;. You may find that writing the code first and coming back to look at the statistical model later is helpful.&lt;/p&gt;
&lt;p&gt;The imaginary study design that is the basis of my model has two different sizes of study units. This is a field experiment scenario, where multiple sites within a region are selected and then two plots within each site are randomly placed and a treatment assigned (“treatment” or “control”). You can think of “sites” as a blocking variable. The number of surviving plants from some known total number planted at an earlier time point is measured in each plot.&lt;/p&gt;
&lt;p&gt;I first define a response variable that comes from the binomial distribution. (If you haven’t seen this style of statistical model before, &lt;a href=&#34;https://aosmith.rbind.io/2018/07/18/simulate-poisson-edition/#the-statistical-model&#34;&gt;my Poisson GLM post&lt;/a&gt; goes into slightly more detail.)&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[y_t \thicksim Binomial(p_t, m_t)\]&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class=&#34;math inline&#34;&gt;\(y_t\)&lt;/span&gt; is the observed number of surviving plants from the total &lt;span class=&#34;math inline&#34;&gt;\(m_t\)&lt;/span&gt; planted for the &lt;span class=&#34;math inline&#34;&gt;\(t\)&lt;/span&gt;th plot.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;math inline&#34;&gt;\(p_t\)&lt;/span&gt; is the unobserved true mean (proportion) of the binomial distribution for the &lt;span class=&#34;math inline&#34;&gt;\(t\)&lt;/span&gt;th plot.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;math inline&#34;&gt;\(m_t\)&lt;/span&gt; is the total number of plants originally planted, also known as the total number of trials or the &lt;em&gt;binomial sample size&lt;/em&gt;. The binomial sample size can be the same for all plots (likely for experimental data) or vary among plots (more common for observational data).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We assume that the relationship between the &lt;em&gt;mean&lt;/em&gt; of the response and the explanatory variables is linear on the logit scale, so I use a logit link function when writing out the linear predictor. The logit is the same as the log odds; i.e., &lt;span class=&#34;math inline&#34;&gt;\(logit(p)\)&lt;/span&gt; is the same as &lt;span class=&#34;math inline&#34;&gt;\(log(\frac{p}{1-p})\)&lt;/span&gt;.&lt;/p&gt;
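&lt;p&gt;As a quick side check (not part of the simulation itself), R’s built-in &lt;code&gt;qlogis()&lt;/code&gt; computes the logit directly, so we can verify this equivalence numerically.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# The logit is the log odds: both lines return the same value
p = 0.85
log( p/(1 - p) ) # log odds by hand: 1.734601
qlogis(p)        # built-in logit function: 1.734601&lt;/code&gt;&lt;/pre&gt;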
&lt;p&gt;The model I define here has a categorical fixed effect with only two levels.&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[logit(p_t) = \beta_0 + \beta_1*I_{(treatment_t=\textit{treatment})} + (b_s)_t\]&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class=&#34;math inline&#34;&gt;\(\beta_0\)&lt;/span&gt; is the log odds of survival when the treatment is &lt;em&gt;control&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;math inline&#34;&gt;\(\beta_1\)&lt;/span&gt; is the difference in log odds of survival between the two treatments, &lt;em&gt;treatment&lt;/em&gt; minus &lt;em&gt;control&lt;/em&gt;.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;The indicator variable, &lt;span class=&#34;math inline&#34;&gt;\(I_{(treatment_t=\textit{treatment})}\)&lt;/span&gt;, is 1 when the treatment is &lt;em&gt;treatment&lt;/em&gt; and 0 otherwise.&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;math inline&#34;&gt;\(b_s\)&lt;/span&gt; is the (random) effect of the &lt;span class=&#34;math inline&#34;&gt;\(s\)&lt;/span&gt;th site on the log odds of survival. &lt;span class=&#34;math inline&#34;&gt;\(s\)&lt;/span&gt; goes from 1 to the total number of sites sampled. The site-level random effects are assumed to come from an iid normal distribution with a mean of 0 and some shared, site-level variance, &lt;span class=&#34;math inline&#34;&gt;\(\sigma^2_s\)&lt;/span&gt;: &lt;span class=&#34;math inline&#34;&gt;\(b_s \thicksim N(0, \sigma^2_s)\)&lt;/span&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you are newer to generalized linear mixed models you might want to take a moment to note the absence of epsilon in the linear predictor. Unlike a linear mixed model, there is no separate observation-level error term; the variation of the response around its mean comes from the binomial distribution itself.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;a-single-simulation-for-a-binomial-glmm&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;A single simulation for a binomial GLMM&lt;/h1&gt;
&lt;p&gt;Below is what the dataset I will create via simulation looks like. I have a variable to represent the sites (&lt;code&gt;site&lt;/code&gt;) and plots (&lt;code&gt;plot&lt;/code&gt;) as well as one for the treatments the plot was assigned to (&lt;code&gt;treatment&lt;/code&gt;). In addition, &lt;code&gt;y&lt;/code&gt; is the total number of surviving plants, and &lt;code&gt;num_samp&lt;/code&gt; is the total number originally planted (50 for all plots in this case).&lt;/p&gt;
&lt;p&gt;Note that &lt;code&gt;y/num_samp&lt;/code&gt; is the proportion of plants that survived, which is what we are interested in. In binomial models in R you often use the number of successes and the number of failures (total trials minus the number of successes) as the response variable instead of the actual observed proportion.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# # A tibble: 20 x 5
#    site  plot  treatment num_samp     y
#    &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt;        &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;
#  1 A     A.1   treatment       50    40
#  2 A     A.2   control         50    26
#  3 B     B.1   treatment       50    42
#  4 B     B.2   control         50    23
#  5 C     C.1   treatment       50    48
#  6 C     C.2   control         50    33
#  7 D     D.1   treatment       50    28
#  8 D     D.2   control         50    19
#  9 E     E.1   treatment       50    45
# 10 E     E.2   control         50    35
# 11 F     F.1   treatment       50    45
# 12 F     F.2   control         50    25
# 13 G     G.1   treatment       50    35
# 14 G     G.2   control         50    21
# 15 H     H.1   treatment       50    42
# 16 H     H.2   control         50    26
# 17 I     I.1   treatment       50    47
# 18 I     I.2   control         50    30
# 19 J     J.1   treatment       50    42
# 20 J     J.2   control         50    31&lt;/code&gt;&lt;/pre&gt;
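&lt;p&gt;As an aside, &lt;code&gt;glmer()&lt;/code&gt; (like &lt;code&gt;glm()&lt;/code&gt;) accepts a binomial response in two equivalent forms. Here is a sketch of both, assuming a dataset like the one above (&lt;em&gt;code not run&lt;/em&gt;); the first is what I use in this post, and the second passes the observed proportion with the binomial sample size given to &lt;code&gt;weights&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# 1. Successes and failures as a two-column response
glmer(cbind(y, num_samp - y) ~ treatment + (1|site), data = dat, family = binomial)
# 2. Observed proportion with binomial sample sizes as weights
glmer(y/num_samp ~ treatment + (1|site), data = dat, weights = num_samp, family = binomial)&lt;/code&gt;&lt;/pre&gt;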
&lt;p&gt;I’ll start the simulation by setting the seed so the results can be exactly reproduced. I always do this for testing my methodology prior to performing many simulations.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(16)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;defining-the-difference-in-treatments&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Defining the difference in treatments&lt;/h2&gt;
&lt;p&gt;I need to define the “truth” in the simulation by setting all the parameters in the statistical model to values of my choosing. I found it a little hard to figure out what the difference between treatments would be on the scale of the log odds, so I thought it worthwhile to discuss my process here.&lt;/p&gt;
&lt;p&gt;I realized it was easier for me to think about the results in this case in terms of proportions for each treatment and then use those to convert differences between treatments to log odds. I started out by thinking about what I would expect the surviving proportion of plants to be in the control group. I decided I’d expect half to survive, 0.5. The treatment, if effective, needs to improve survival substantially to be cost effective. I decided the treatment group should have at least 85% survival (0.85).&lt;/p&gt;
&lt;p&gt;The estimated difference from the model will be expressed as odds, so I calculated the odds for each group and then the difference in odds as an odds ratio based on my chosen proportions per group.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;codds = .5/(1 - .5)
todds = .85/(1 - .85)
todds/codds&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 5.666667&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since the model is linear on the scale of log odds I took the log of the odds ratio above to figure out the additive difference between treatments on the model scale.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;log(todds/codds)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 1.734601&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I also need the log odds for the control group, since that is what the intercept, &lt;span class=&#34;math inline&#34;&gt;\(\beta_0\)&lt;/span&gt;, represents in my statistical model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;log(codds)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here are the values I’ll use for the “truth” today:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The true log odds for the control group, &lt;span class=&#34;math inline&#34;&gt;\(\beta_0\)&lt;/span&gt;, will be 0.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;The difference in log odds of the treatment compared to the control, &lt;span class=&#34;math inline&#34;&gt;\(\beta_1\)&lt;/span&gt;, will be 1.735.&lt;/li&gt;
&lt;li&gt;The site-level variance (&lt;span class=&#34;math inline&#34;&gt;\(\sigma^2_s\)&lt;/span&gt;) will be set at 0.5.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I’ll set the number of sites to 10 while I’m at it. Since I’m working with only 2 treatments, there will be 2 plots per site. The total number of plots (and so observations) is the number of sites times the number of plots per site: &lt;code&gt;10*2 = 20&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;b0 = 0
b1 = 1.735
site_var = 0.5
n_sites = 10&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;creating-the-study-design-variables&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Creating the study design variables&lt;/h2&gt;
&lt;p&gt;Without discussing the code, since I’ve gone over code like this in detail in earlier posts, I will create variables based on the study design, &lt;code&gt;site&lt;/code&gt;, &lt;code&gt;plot&lt;/code&gt;, and &lt;code&gt;treatment&lt;/code&gt;, using &lt;code&gt;rep()&lt;/code&gt;. I’m careful to line things up so there are two unique plots in each site, one for each treatment. I don’t technically need the &lt;code&gt;plot&lt;/code&gt; variable for the analysis I’m going to do, but I create it to keep myself organized (and to mimic a real dataset 😁).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;site = rep(LETTERS[1:n_sites], each = 2)
plot = paste(site, rep(1:2, times = n_sites), sep = &amp;quot;.&amp;quot; )
treatment = rep( c(&amp;quot;treatment&amp;quot;, &amp;quot;control&amp;quot;), times = n_sites)
dat = data.frame(site, plot, treatment)
dat&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#    site plot treatment
# 1     A  A.1 treatment
# 2     A  A.2   control
# 3     B  B.1 treatment
# 4     B  B.2   control
# 5     C  C.1 treatment
# 6     C  C.2   control
# 7     D  D.1 treatment
# 8     D  D.2   control
# 9     E  E.1 treatment
# 10    E  E.2   control
# 11    F  F.1 treatment
# 12    F  F.2   control
# 13    G  G.1 treatment
# 14    G  G.2   control
# 15    H  H.1 treatment
# 16    H  H.2   control
# 17    I  I.1 treatment
# 18    I  I.2   control
# 19    J  J.1 treatment
# 20    J  J.2   control&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;simulate-the-random-effect&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Simulate the random effect&lt;/h2&gt;
&lt;p&gt;Next I will simulate the site-level random effects. I defined these as &lt;span class=&#34;math inline&#34;&gt;\(b_s \thicksim N(0, \sigma^2_s)\)&lt;/span&gt;, so will randomly draw from a normal distribution with a mean of 0 and a variance of 0.5. Remember that &lt;code&gt;rnorm()&lt;/code&gt; in R uses standard deviation, not variance, so I use the square root of &lt;code&gt;site_var&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Since I have 10 sites I draw 10 values, with each value repeated once for each plot within the site.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;( site_eff = rep( rnorm(n = n_sites, 
                        mean = 0, 
                        sd = sqrt(site_var) ), 
                  each = 2) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  [1]  0.33687514  0.33687514 -0.08865705 -0.08865705  0.77514191  0.77514191
#  [7] -1.02122414 -1.02122414  0.81163788  0.81163788 -0.33121733 -0.33121733
# [13] -0.71131449 -0.71131449  0.04494560  0.04494560  0.72476507  0.72476507
# [19]  0.40527261  0.40527261&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;calculate-log-odds&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Calculate log odds&lt;/h2&gt;
&lt;p&gt;I now have fixed values for all parameters, the variable &lt;code&gt;treatment&lt;/code&gt; to create the indicator variable, and the simulated effects of sites drawn from the defined distribution. That’s all the pieces I need to calculate the true log odds.&lt;/p&gt;
&lt;p&gt;The statistical model&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[logit(p_t) = \beta_0 + \beta_1*I_{(treatment_t=\textit{treatment})} + (b_s)_t\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;is my guide for how to combine these pieces to calculate the log odds, &lt;span class=&#34;math inline&#34;&gt;\(logit(p_t)\)&lt;/span&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;( log_odds = with(dat, b0 + b1*(treatment == &amp;quot;treatment&amp;quot;) + site_eff ) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  [1]  2.07187514  0.33687514  1.64634295 -0.08865705  2.51014191  0.77514191
#  [7]  0.71377586 -1.02122414  2.54663788  0.81163788  1.40378267 -0.33121733
# [13]  1.02368551 -0.71131449  1.77994560  0.04494560  2.45976507  0.72476507
# [19]  2.14027261  0.40527261&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;convert-log-odds-to-proportions&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Convert log odds to proportions&lt;/h2&gt;
&lt;p&gt;I’m getting close to pulling values from the binomial distribution to get my response variable. Remember that I defined &lt;span class=&#34;math inline&#34;&gt;\(y_t\)&lt;/span&gt; as:
&lt;span class=&#34;math display&#34;&gt;\[y_t \thicksim Binomial(p_t, m_t)\]&lt;/span&gt;
Right now I’ve gotten to the point where I have &lt;span class=&#34;math inline&#34;&gt;\(logit(p_t)\)&lt;/span&gt;. To get the true proportions, &lt;span class=&#34;math inline&#34;&gt;\(p_t\)&lt;/span&gt;, I need to inverse-logit the log odds. In R, function &lt;code&gt;plogis()&lt;/code&gt; performs the inverse logit.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;( prop = plogis(log_odds) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  [1] 0.8881394 0.5834313 0.8383962 0.4778502 0.9248498 0.6846321 0.6712349
#  [8] 0.2647890 0.9273473 0.6924584 0.8027835 0.4179445 0.7356899 0.3293085
# [15] 0.8556901 0.5112345 0.9212726 0.6736555 0.8947563 0.5999538&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I can’t forget about &lt;span class=&#34;math inline&#34;&gt;\(m_t\)&lt;/span&gt;. I need to know the binomial sample size for each plot before I can draw the number of successes from the binomial distribution. Since my imaginary study is an experiment I will set this to 50 for every plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat$num_samp = 50&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I’ve been in situations where I wanted the binomial sample size to vary per observation. In that case, you may find &lt;code&gt;sample()&lt;/code&gt; useful, using the range of binomial sample sizes you are interested in as the first argument.&lt;/p&gt;
&lt;p&gt;Here’s an example of what that code could look like, allowing the binomial sample size to vary between 40 and 50 among plots. (&lt;em&gt;Code not run.&lt;/em&gt;)&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;num_samp = sample(40:50, size = 20, replace = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;generate-the-response-variable&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Generate the response variable&lt;/h2&gt;
&lt;p&gt;Now that I have a vector of proportions and have set the binomial sample size per plot, I can calculate the number of successes for each true proportion and binomial sample size based on the binomial distribution. I do this via &lt;code&gt;rbinom()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This is the step where we add the “binomial errors” to the proportions to generate the response variable. The variation for each simulated &lt;code&gt;y&lt;/code&gt; value is based on the binomial variance.&lt;/p&gt;
&lt;p&gt;The next bit of code is directly based on the distribution defined in the statistical model: &lt;span class=&#34;math inline&#34;&gt;\(y_t \thicksim Binomial(p_t, m_t)\)&lt;/span&gt;. I randomly draw 20 values from the binomial distribution, one for each of the 20 proportions stored in &lt;code&gt;prop&lt;/code&gt;. I define the binomial sample size in the &lt;code&gt;size&lt;/code&gt; argument.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;( dat$y = rbinom(n = n_sites*2, size = dat$num_samp, prob = prop) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  [1] 40 26 42 23 48 33 28 19 45 35 45 25 35 21 42 26 47 30 42 31&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;fit-a-model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Fit a model&lt;/h2&gt;
&lt;p&gt;It’s time for model fitting! I can now fit a binomial generalized linear mixed model with a logit link using, e.g., the &lt;code&gt;glmer()&lt;/code&gt; function from package &lt;strong&gt;lme4&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mod = glmer(cbind(y, num_samp - y) ~ treatment + (1|site), 
            data = dat,
            family = binomial(link = &amp;quot;logit&amp;quot;) )
mod&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Generalized linear mixed model fit by maximum likelihood (Laplace
#   Approximation) [glmerMod]
#  Family: binomial  ( logit )
# Formula: cbind(y, num_samp - y) ~ treatment + (1 | site)
#    Data: dat
#      AIC      BIC   logLik deviance df.resid 
# 122.6154 125.6025 -58.3077 116.6154       17 
# Random effects:
#  Groups Name        Std.Dev.
#  site   (Intercept) 0.4719  
# Number of obs: 20, groups:  site, 10
# Fixed Effects:
#        (Intercept)  treatmenttreatment  
#             0.1576              1.4859&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;make-a-function-for-the-simulation&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Make a function for the simulation&lt;/h1&gt;
&lt;p&gt;A single simulation can help us understand the statistical model, but usually the goal of a simulation is to see how the model behaves over the long run. To that end I’ll make my simulation process into a function.&lt;/p&gt;
&lt;p&gt;In my function I’m going to set all the arguments to the parameter values as I defined them earlier. I allow some flexibility, though, so the argument values can be changed if I want to explore the simulation with, say, a different number of replications or different parameter values. I do not allow the number of plots to vary in this particular function, since I’m hard-coding in two treatments.&lt;/p&gt;
&lt;p&gt;This function returns a generalized linear mixed model fit with &lt;code&gt;glmer()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;bin_glmm_fun = function(n_sites = 10,
                        b0 = 0,
                        b1 = 1.735,
                        num_samp = 50,
                        site_var = 0.5) {
     site = rep(LETTERS[1:n_sites], each = 2)
     plot = paste(site, rep(1:2, times = n_sites), sep = &amp;quot;.&amp;quot; )
     treatment = rep( c(&amp;quot;treatment&amp;quot;, &amp;quot;control&amp;quot;), times = n_sites)
     dat = data.frame(site, plot, treatment)           
     
     site_eff = rep( rnorm(n = n_sites, mean = 0, sd = sqrt(site_var) ), each = 2)
     
     log_odds = with(dat, b0 + b1*(treatment == &amp;quot;treatment&amp;quot;) + site_eff)
     prop = plogis(log_odds)
     dat$num_samp = num_samp
     dat$y = rbinom(n = n_sites*2, size = num_samp, prob = prop)
     
     glmer(cbind(y, num_samp - y) ~ treatment + (1|site),
           data = dat,
           family = binomial(link = &amp;quot;logit&amp;quot;) )
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I test the function, using the same &lt;code&gt;seed&lt;/code&gt;, to make sure things are working as expected and that I get the same results as above. I do, and everything looks good.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(16)
bin_glmm_fun()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Generalized linear mixed model fit by maximum likelihood (Laplace
#   Approximation) [glmerMod]
#  Family: binomial  ( logit )
# Formula: cbind(y, num_samp - y) ~ treatment + (1 | site)
#    Data: dat
#      AIC      BIC   logLik deviance df.resid 
# 122.6154 125.6025 -58.3077 116.6154       17 
# Random effects:
#  Groups Name        Std.Dev.
#  site   (Intercept) 0.4719  
# Number of obs: 20, groups:  site, 10
# Fixed Effects:
#        (Intercept)  treatmenttreatment  
#             0.1576              1.4859&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;repeat-the-simulation-many-times&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Repeat the simulation many times&lt;/h1&gt;
&lt;p&gt;Now that I have a working function to simulate data and fit the model it’s time to do the simulation many times. The model from each individual simulation is saved to allow exploration of long run model performance.&lt;/p&gt;
&lt;p&gt;This is a task for &lt;code&gt;replicate()&lt;/code&gt;, which repeatedly calls a function and saves the output. When using &lt;code&gt;simplify = FALSE&lt;/code&gt; the output is a list, which is convenient for going through to extract elements from the models later. I’ll re-run the simulation 1000 times. This could take a while to run for complex models with many terms.&lt;/p&gt;
&lt;p&gt;I print the output of the 100th list element so you can see the list is filled with models.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sims = replicate(1000, bin_glmm_fun(), simplify = FALSE )
sims[[100]]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Generalized linear mixed model fit by maximum likelihood (Laplace
#   Approximation) [glmerMod]
#  Family: binomial  ( logit )
# Formula: cbind(y, num_samp - y) ~ treatment + (1 | site)
#    Data: dat
#      AIC      BIC   logLik deviance df.resid 
# 122.1738 125.1610 -58.0869 116.1738       17 
# Random effects:
#  Groups Name        Std.Dev.
#  site   (Intercept) 0.6059  
# Number of obs: 20, groups:  site, 10
# Fixed Effects:
#        (Intercept)  treatmenttreatment  
#           0.001914            1.824177&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;extract-results-from-the-binomial-glmm&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Extract results from the binomial GLMM&lt;/h1&gt;
&lt;p&gt;After running all the models we can extract whatever we are interested in to explore long run behavior. As I was planning this post, I started wondering what the estimate of dispersion would look like from a binomial GLMM that, by definition, is neither over- nor underdispersed.&lt;/p&gt;
&lt;p&gt;With some caveats, which you can read more about in the &lt;a href=&#34;https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#overdispersion&#34;&gt;GLMM FAQ&lt;/a&gt;, the sum of the squared Pearson residuals divided by the residual degrees of freedom is an estimate of over/underdispersion. This seems OK to use in the scenario I’ve set up here since my binomial sample sizes are fairly large and my proportions are not too close to the distribution limits.&lt;/p&gt;
&lt;p&gt;I made a function to calculate this.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;overdisp_fun = function(model) {
     sum( residuals(model, type = &amp;quot;pearson&amp;quot;)^2)/df.residual(model)
}
overdisp_fun(mod)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 0.7169212&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;explore-estimated-dispersion&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Explore estimated dispersion&lt;/h1&gt;
&lt;p&gt;I want to look at the distribution of dispersion estimates from the 1000 models. This involves looping through the models and using &lt;code&gt;overdisp_fun()&lt;/code&gt; to extract the estimated dispersion from each one. I put the result in a data.frame since I’ll be plotting the result with &lt;strong&gt;ggplot2&lt;/strong&gt;. I use &lt;strong&gt;purrr&lt;/strong&gt; helper function &lt;code&gt;map_dfr()&lt;/code&gt; for the looping.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;alldisp = map_dfr(sims, ~data.frame(disp = overdisp_fun(.x) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here’s the plot of the resulting distribution. I put a vertical line at 1, since values above 1 indicate overdispersion and below 1 indicate underdispersion.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(alldisp, aes(x = disp) ) +
     geom_histogram(fill = &amp;quot;blue&amp;quot;, 
                    alpha = .25, 
                    bins = 100) +
     geom_vline(xintercept = 1) +
     scale_x_continuous(breaks = seq(0, 2, by = 0.2) ) +
     theme_bw(base_size = 14) +
     labs(x = &amp;quot;Dispersion&amp;quot;,
          y = &amp;quot;Count&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2020-08-20-simulate-binomial-glmm_files/figure-html/unnamed-chunk-20-1.png&#34; width=&#34;672&#34; /&gt;
I’m not sure what to think of this yet, but I am pretty fascinated by the result. Only ~7% of models show any overdispersion.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mean(alldisp$disp &amp;gt; 1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 0.069&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And hardly any (&amp;lt;0.5%) estimate a dispersion greater than 1.5, which is a high enough value that we would likely be concerned that our results were anti-conservative if this were an analysis of a real dataset.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mean(alldisp$disp &amp;gt; 1.5)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 0.004&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For this scenario, at least, I learned that it is rare to observe substantial overdispersion when the model isn’t overdispersed. That seems useful.&lt;/p&gt;
&lt;p&gt;I don’t know why so many models show substantial underdispersion, though. Maybe the method for calculating overdispersion doesn’t work well for underdispersion? I’m not sure.&lt;/p&gt;
&lt;p&gt;When checking a real model we’d be using additional tools beyond the estimated dispersion to check model fit and decide if a model looks problematic. I highly recommend package &lt;strong&gt;DHARMa&lt;/strong&gt; for checking model fit for GLMMs (although I’m not necessarily a fan of all the p-values 😜).&lt;/p&gt;
&lt;p&gt;Happy simulating!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;just-the-code-please&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Just the code, please&lt;/h1&gt;
&lt;p&gt;Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code &lt;a href=&#34;https://aosmith.rbind.io/script/2020-08-20-simulate-binomial-glmm.R&#34;&gt;from here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(lme4) # v. 1.1-23
library(purrr) # v. 0.3.4
library(ggplot2) # v. 3.3.2

set.seed(16)
codds = .5/(1 - .5)
todds = .85/(1 - .85)
todds/codds
log(todds/codds)
log(codds)
b0 = 0
b1 = 1.735
site_var = 0.5
n_sites = 10

site = rep(LETTERS[1:n_sites], each = 2)
plot = paste(site, rep(1:2, times = n_sites), sep = &amp;quot;.&amp;quot; )
treatment = rep( c(&amp;quot;treatment&amp;quot;, &amp;quot;control&amp;quot;), times = n_sites)
dat = data.frame(site, plot, treatment)
dat

( site_eff = rep( rnorm(n = n_sites, 
                        mean = 0, 
                        sd = sqrt(site_var) ), 
                  each = 2) )

( log_odds = with(dat, b0 + b1*(treatment == &amp;quot;treatment&amp;quot;) + site_eff ) )
( prop = plogis(log_odds) )

dat$num_samp = 50
num_samp = sample(40:50, size = 20, replace = TRUE)
( dat$y = rbinom(n = n_sites*2, size = dat$num_samp, prob = prop) )

mod = glmer(cbind(y, num_samp - y) ~ treatment + (1|site), 
            data = dat,
            family = binomial(link = &amp;quot;logit&amp;quot;) )
mod

bin_glmm_fun = function(n_sites = 10,
                        b0 = 0,
                        b1 = 1.735,
                        num_samp = 50,
                        site_var = 0.5) {
     site = rep(LETTERS[1:n_sites], each = 2)
     plot = paste(site, rep(1:2, times = n_sites), sep = &amp;quot;.&amp;quot; )
     treatment = rep( c(&amp;quot;treatment&amp;quot;, &amp;quot;control&amp;quot;), times = n_sites)
     dat = data.frame(site, plot, treatment)           
     
     site_eff = rep( rnorm(n = n_sites, mean = 0, sd = sqrt(site_var) ), each = 2)
     
     log_odds = with(dat, b0 + b1*(treatment == &amp;quot;treatment&amp;quot;) + site_eff)
     prop = plogis(log_odds)
     dat$num_samp = num_samp
     dat$y = rbinom(n = n_sites*2, size = num_samp, prob = prop)
     
     glmer(cbind(y, num_samp - y) ~ treatment + (1|site),
           data = dat,
           family = binomial(link = &amp;quot;logit&amp;quot;) )
}


set.seed(16)
bin_glmm_fun()

sims = replicate(1000, bin_glmm_fun(), simplify = FALSE )
sims[[100]]

overdisp_fun = function(model) {
     sum( residuals(model, type = &amp;quot;pearson&amp;quot;)^2)/df.residual(model)
}
overdisp_fun(mod)

alldisp = map_dfr(sims, ~data.frame(disp = overdisp_fun(.x) ) )

ggplot(alldisp, aes(x = disp) ) +
     geom_histogram(fill = &amp;quot;blue&amp;quot;, 
                    alpha = .25, 
                    bins = 100) +
     geom_vline(xintercept = 1) +
     scale_x_continuous(breaks = seq(0, 2, by = 0.2) ) +
     theme_bw(base_size = 14) +
     labs(x = &amp;quot;Dispersion&amp;quot;,
          y = &amp;quot;Count&amp;quot;)

mean(alldisp$disp &amp;gt; 1)
mean(alldisp$disp &amp;gt; 1.5)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Controlling legend appearance in ggplot2 with override.aes</title>
      <link>https://aosmith.rbind.io/2020/07/09/ggplot2-override-aes/</link>
      <pubDate>Thu, 09 Jul 2020 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2020/07/09/ggplot2-override-aes/</guid>
      <description>


&lt;p&gt;In &lt;strong&gt;ggplot2&lt;/strong&gt;, aesthetics and their &lt;code&gt;scale_*()&lt;/code&gt; functions change both the plot appearance and the plot legend appearance simultaneously. The &lt;code&gt;override.aes&lt;/code&gt; argument in &lt;code&gt;guide_legend()&lt;/code&gt; allows the user to change only the legend appearance without affecting the rest of the plot. This is useful for making the legend more readable or for creating certain types of combined legends.&lt;/p&gt;
&lt;p&gt;In this post I’ll first introduce &lt;code&gt;override.aes&lt;/code&gt; with a basic example and then go through three additional plotting scenarios to show other instances where &lt;code&gt;override.aes&lt;/code&gt; comes in handy.&lt;/p&gt;
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#r-packages&#34;&gt;R packages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#introducing-override.aes&#34;&gt;Introducing override.aes&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#adding-a-guides-layer&#34;&gt;Adding a guides() layer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-the-guide-argument-in-scale_&#34;&gt;Using the guide argument in scale_*()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#changing-multiple-aesthetic-parameters&#34;&gt;Changing multiple aesthetic parameters&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#suppress-aesthetics-from-part-of-the-legend&#34;&gt;Suppress aesthetics from part of the legend&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#combining-legends-from-two-layers&#34;&gt;Combining legends from two layers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#controlling-the-appearance-of-multiple-legends&#34;&gt;Controlling the appearance of multiple legends&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#just-the-code-please&#34;&gt;Just the code, please&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;r-packages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;R packages&lt;/h1&gt;
&lt;p&gt;The only package I’ll use in this post is &lt;strong&gt;ggplot2&lt;/strong&gt; for plotting.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2) # v. 3.3.2&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;introducing-override.aes&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introducing override.aes&lt;/h1&gt;
&lt;p&gt;A basic reason to change the legend appearance without changing the plot is to make the legend more readable.&lt;/p&gt;
&lt;p&gt;For example, I’ll start with a scatterplot using the &lt;code&gt;diamonds&lt;/code&gt; dataset. This is a large dataset, so after mapping &lt;code&gt;color&lt;/code&gt; to the &lt;code&gt;cut&lt;/code&gt; variable I set &lt;code&gt;alpha&lt;/code&gt; to increase the transparency and &lt;code&gt;size&lt;/code&gt; to reduce the size of points in the plot.&lt;/p&gt;
&lt;p&gt;You can see using &lt;code&gt;alpha&lt;/code&gt; and &lt;code&gt;size&lt;/code&gt; changes the way the points are shown in both the plot and the legend.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = diamonds, aes(x = carat, y = price, color = cut) ) +
     geom_point(alpha = .25, size = 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2020-07-09-ggplot2-override-aes_files/figure-html/firstplot-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;adding-a-guides-layer&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Adding a guides() layer&lt;/h2&gt;
&lt;p&gt;Making the points small and transparent may be desirable when plotting many points, but it also makes the legend more difficult to read. This is a case where I’d want to make the legend more readable by increasing the point size and/or reducing the point transparency.&lt;/p&gt;
&lt;p&gt;One way to do this is by adding a &lt;code&gt;guides()&lt;/code&gt; layer. The &lt;code&gt;guides()&lt;/code&gt; function uses scale name-guide pairs. I am going to change the legend for the &lt;code&gt;color&lt;/code&gt; scale, so I’ll use &lt;code&gt;color = guide_legend()&lt;/code&gt; as the scale name-guide pair.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;override.aes&lt;/code&gt; is an argument within &lt;code&gt;guide_legend()&lt;/code&gt;, so if you’re looking for more background you can start at &lt;code&gt;?guide_legend&lt;/code&gt;. The &lt;code&gt;override.aes&lt;/code&gt; argument takes a list of aesthetic parameters that will &lt;em&gt;override&lt;/em&gt; the default legend appearance.&lt;/p&gt;
&lt;p&gt;To increase the &lt;code&gt;size&lt;/code&gt; of the points in the &lt;code&gt;color&lt;/code&gt; legend of my plot, the layer I’ll add will look like:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;guides(color = guide_legend(override.aes = list(size = 3) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Adding this layer to the initial plot, you can see how the points in the legend get larger while the points in the plot remain unchanged.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = diamonds, aes(x = carat, y = price, color = cut) ) +
     geom_point(alpha = .25, size = 1) +
     guides(color = guide_legend(override.aes = list(size = 3) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2020-07-09-ggplot2-override-aes_files/figure-html/firstplot2-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;using-the-guide-argument-in-scale_&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Using the guide argument in scale_*()&lt;/h2&gt;
&lt;p&gt;If I am going to change my default colors with a &lt;code&gt;scale_color_*()&lt;/code&gt; function in addition to overriding the legend appearance, I can use the &lt;code&gt;guide&lt;/code&gt; argument there instead of adding a separate &lt;code&gt;guides()&lt;/code&gt; layer. The &lt;code&gt;guide&lt;/code&gt; argument is part of all scale functions.&lt;/p&gt;
&lt;p&gt;For example, say I am already using &lt;code&gt;scale_color_viridis_d()&lt;/code&gt; to change the default color palette of the whole plot (i.e., plot and legend). I can use the same &lt;code&gt;guide_legend()&lt;/code&gt; code from above for the &lt;code&gt;guide&lt;/code&gt; argument to change the size of the points in the legend.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = diamonds, aes(x = carat, y = price, color = cut) ) +
     geom_point(alpha = .25, size = 1) +
     scale_color_viridis_d(option = &amp;quot;magma&amp;quot;,
                           guide = guide_legend(override.aes = list(size = 3) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2020-07-09-ggplot2-override-aes_files/figure-html/firstplot3-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;changing-multiple-aesthetic-parameters&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Changing multiple aesthetic parameters&lt;/h2&gt;
&lt;p&gt;You can control multiple aesthetic parameters at once by adding them to the list passed to &lt;code&gt;override.aes&lt;/code&gt;. If I want to increase the point size as well as remove the point transparency in the legend, I can change both &lt;code&gt;size&lt;/code&gt; and &lt;code&gt;alpha&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = diamonds, aes(x = carat, y = price, color = cut) ) +
     geom_point(alpha = .25, size = 1) +
     scale_color_viridis_d(option = &amp;quot;magma&amp;quot;,
                           guide = guide_legend(override.aes = list(size = 3,
                                                                    alpha = 1) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2020-07-09-ggplot2-override-aes_files/figure-html/firstplot4-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;suppress-aesthetics-from-part-of-the-legend&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Suppress aesthetics from part of the legend&lt;/h1&gt;
&lt;p&gt;Removing an aesthetic from only some parts of the legend is another use for &lt;code&gt;override.aes&lt;/code&gt;. For example, this can be useful when different layers are based on a different number of levels of the same grouping factor.&lt;/p&gt;
&lt;p&gt;The following example is based on &lt;a href=&#34;https://stackoverflow.com/questions/59548358/r-ggplot2-in-the-legend-how-do-i-hide-unused-colors-from-one-geom-while-show&#34;&gt;this Stack Overflow question&lt;/a&gt;. The &lt;code&gt;points&lt;/code&gt; data has information from all three groups of the &lt;code&gt;id&lt;/code&gt; variable but the rectangle, based on the &lt;code&gt;box&lt;/code&gt; dataset, is for only a single group.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;points = structure(list(x = c(5L, 10L, 7L, 9L, 86L, 46L, 22L, 94L, 21L, 
6L, 24L, 3L), y = c(51L, 54L, 50L, 60L, 97L, 74L, 59L, 68L, 45L, 
56L, 25L, 70L), id = c(&amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;b&amp;quot;, 
&amp;quot;c&amp;quot;, &amp;quot;c&amp;quot;, &amp;quot;c&amp;quot;, &amp;quot;c&amp;quot;)), row.names = c(NA, -12L), class = &amp;quot;data.frame&amp;quot;)

head(points)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#    x  y id
# 1  5 51  a
# 2 10 54  a
# 3  7 50  a
# 4  9 60  a
# 5 86 97  b
# 6 46 74  b&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;box = data.frame(left = 1, right = 10, bottom = 50, top = 60, id = &amp;quot;a&amp;quot;)
box&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#   left right bottom top id
# 1    1    10     50  60  a&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here’s what the initial plot looks like, mapping &lt;code&gt;color&lt;/code&gt; to the &lt;code&gt;id&lt;/code&gt; variable. Note that the colored outlines, representing the rectangle layer, are present for every group in the legend even though there is a rectangle for only one of the groups present in the plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = points, aes(color = id) ) +
     geom_point(aes(x = x, y = y), size = 4) +
     geom_rect(data = box, aes(xmin = left,
                               xmax = right,
                               ymin = 50,
                               ymax = top),
               fill = NA, size = 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2020-07-09-ggplot2-override-aes_files/figure-html/secondplot-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;In this case, I want to remove the outlines for the second and third legend key boxes so the legend matches what is in the plot. The legend outlines are based on the &lt;code&gt;linetype&lt;/code&gt; aesthetic. Suppressing these lines can be done with &lt;code&gt;override.aes&lt;/code&gt;, setting the line types to &lt;code&gt;0&lt;/code&gt; in order to remove them.&lt;/p&gt;
&lt;p&gt;Note that I have to list the line type for every group, not just the groups I want to remove. I keep the line for the first group solid via &lt;code&gt;1&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = points, aes(color = id) ) +
     geom_point(aes(x = x, y = y), size = 4) +
     geom_rect(data = box, aes(xmin = left,
                               xmax = right,
                               ymin = 50,
                               ymax = top),
               fill = NA, size = 1) +
     guides(color = guide_legend(override.aes = list(linetype = c(1, 0, 0) ) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2020-07-09-ggplot2-override-aes_files/figure-html/secondplot2-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;combining-legends-from-two-layers&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Combining legends from two layers&lt;/h1&gt;
&lt;p&gt;In the next example I’ll show how &lt;code&gt;override.aes&lt;/code&gt; can be useful when creating a legend based on multiple layers and you want distinct symbols in each legend key box.&lt;/p&gt;
&lt;p&gt;There are situations where we want to add a legend to identify different elements of the plot, such as indicating the plotted line is a fitted line or that points are means. This can be done by mapping aesthetics to constants to make a manual legend and then manipulating the symbols shown in the legend via &lt;code&gt;override.aes&lt;/code&gt;. I wrote about making manual legends in &lt;a href=&#34;https://aosmith.rbind.io/2018/07/19/manual-legends-ggplot2/&#34;&gt;an earlier blog post&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The plot below shows observed values and a fitted line per group based on &lt;code&gt;color&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = mtcars, aes(x = mpg, y = wt, color = factor(am) ) ) +
     geom_point(size = 3) +
     geom_smooth(method = &amp;quot;lm&amp;quot;, se = FALSE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# `geom_smooth()` using formula &amp;#39;y ~ x&amp;#39;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2020-07-09-ggplot2-override-aes_files/figure-html/thirdplot-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;I’m going to leave the &lt;code&gt;color&lt;/code&gt; legend alone but I want to add a second legend to indicate that the points are observed values and the lines are fitted lines. I’ll use the &lt;code&gt;alpha&lt;/code&gt; aesthetic for this. Using an aesthetic that you haven’t already used and that affects both layers is a trick that often comes in handy when adding an extra legend like I’m doing here.&lt;/p&gt;
&lt;p&gt;I don’t actually want &lt;code&gt;alpha&lt;/code&gt; to affect the plot appearance, so I also add &lt;code&gt;scale_alpha_manual()&lt;/code&gt; to make sure both layers stay opaque by setting the &lt;code&gt;values&lt;/code&gt; for both groups to 1. I also remove the legend name and set the order of the &lt;code&gt;breaks&lt;/code&gt; so the &lt;code&gt;Observed&lt;/code&gt; group is listed first in the new legend.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = mtcars, aes(x = mpg, y = wt, color = factor(am) ) ) +
     geom_point(aes(alpha = &amp;quot;Observed&amp;quot;), size = 3) +
     geom_smooth(method = &amp;quot;lm&amp;quot;, se = FALSE, aes(alpha = &amp;quot;Fitted&amp;quot;) ) +
     scale_alpha_manual(name = NULL,
                        values = c(1, 1),
                        breaks = c(&amp;quot;Observed&amp;quot;, &amp;quot;Fitted&amp;quot;) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2020-07-09-ggplot2-override-aes_files/figure-html/thirdplot2-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now I have a new legend to work with. However, the legend has both the point and the line symbol in all legend key boxes. I need to override the current legend so the &lt;code&gt;Observed&lt;/code&gt; legend key box contains only a point symbol and the &lt;code&gt;Fitted&lt;/code&gt; legend key box has only a line symbol. This is where &lt;code&gt;override.aes&lt;/code&gt; comes in.&lt;/p&gt;
&lt;p&gt;Here’s what I’ll do: I’ll change the &lt;code&gt;linetype&lt;/code&gt; to &lt;code&gt;0&lt;/code&gt; for the first key box but leave it as &lt;code&gt;1&lt;/code&gt; for the second. I’ll use shape &lt;code&gt;16&lt;/code&gt; (a solid circle) as the &lt;code&gt;shape&lt;/code&gt; for the first key box but remove the point altogether in the second key box with &lt;code&gt;NA&lt;/code&gt;. I’m also going to make sure all elements are black via &lt;code&gt;color&lt;/code&gt;. (&lt;em&gt;If you need to know codes for shapes and line types see &lt;a href=&#34;http://www.cookbook-r.com/Graphs/Shapes_and_line_types/&#34;&gt;here&lt;/a&gt;.&lt;/em&gt;)&lt;/p&gt;
&lt;p&gt;I use &lt;code&gt;linetype&lt;/code&gt;, &lt;code&gt;shape&lt;/code&gt;, and &lt;code&gt;color&lt;/code&gt; in the &lt;code&gt;override.aes&lt;/code&gt; list within &lt;code&gt;scale_alpha_manual()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = mtcars, aes(x = mpg, y = wt, color = factor(am) ) ) +
     geom_point(aes(alpha = &amp;quot;Observed&amp;quot;), size = 3) +
     geom_smooth(method = &amp;quot;lm&amp;quot;, se = FALSE, aes(alpha = &amp;quot;Fitted&amp;quot;) ) +
     scale_alpha_manual(name = NULL,
                        values = c(1, 1),
                        breaks = c(&amp;quot;Observed&amp;quot;, &amp;quot;Fitted&amp;quot;),
                        guide = guide_legend(override.aes = list(linetype = c(0, 1),
                                                                  shape = c(16, NA),
                                                                  color = &amp;quot;black&amp;quot;) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2020-07-09-ggplot2-override-aes_files/figure-html/thirdplot3-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;controlling-the-appearance-of-multiple-legends&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Controlling the appearance of multiple legends&lt;/h1&gt;
&lt;p&gt;The final example for &lt;code&gt;override.aes&lt;/code&gt; may seem a little esoteric, but it has come up for me in the past. Say I want to make a scatterplot with the &lt;code&gt;fill&lt;/code&gt; and &lt;code&gt;shape&lt;/code&gt; aesthetics mapped to two different factors. I use &lt;code&gt;fill&lt;/code&gt; instead of &lt;code&gt;color&lt;/code&gt; so the points have an outline. Having an outline around the points can matter if, e.g., you have a white plot background and want the points to be black and white as &lt;a href=&#34;https://stackoverflow.com/questions/44765946/r-ggplot2-plot-geom-point-with-black-and-white-without-using-shape&#34;&gt;in this question&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This is the dataset I’ll use in this plot example, where the two factors are named &lt;code&gt;g1&lt;/code&gt; and &lt;code&gt;g2&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat = structure(list(g1 = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), class = &amp;quot;factor&amp;quot;, .Label = c(&amp;quot;High&amp;quot;, 
&amp;quot;Low&amp;quot;)), g2 = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 
1L, 2L, 2L, 1L, 1L, 2L, 2L), class = &amp;quot;factor&amp;quot;, .Label = c(&amp;quot;Control&amp;quot;, 
&amp;quot;Treatment&amp;quot;)), x = c(0.42, 0.39, 0.56, 0.59, 0.17, 0.95, 0.85, 
0.25, 0.31, 0.75, 0.58, 0.9, 0.6, 0.86, 0.61, 0.61), y = c(-1.4, 
3.6, 1.1, -0.1, 0.5, 0, -1.8, 0.8, -1.1, -0.6, 0.2, 0.3, 1.1, 
1.6, 0.9, -0.6)), class = &amp;quot;data.frame&amp;quot;, row.names = c(NA, -16L
))

head(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#     g1        g2    x    y
# 1 High   Control 0.42 -1.4
# 2  Low   Control 0.39  3.6
# 3 High Treatment 0.56  1.1
# 4  Low Treatment 0.59 -0.1
# 5 High   Control 0.17  0.5
# 6  Low   Control 0.95  0.0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I set the colors for the &lt;code&gt;fill&lt;/code&gt; in &lt;code&gt;scale_fill_manual()&lt;/code&gt; and choose &lt;em&gt;fillable&lt;/em&gt; shapes in &lt;code&gt;scale_shape_manual()&lt;/code&gt;. Fillable shapes are shapes 21 through 25.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = dat, aes(x = x, y = y, fill = g1, shape = g2) ) +
     geom_point(size = 5) +
     scale_fill_manual(values = c(&amp;quot;#002F70&amp;quot;, &amp;quot;#EDB4B5&amp;quot;) ) +
     scale_shape_manual(values = c(21, 24) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2020-07-09-ggplot2-override-aes_files/figure-html/fourthplot-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The plot itself shows the fill colors and shapes, but you can see some issues in the legends. The fill colors don’t show up in the &lt;code&gt;g1&lt;/code&gt; legend at all. This is because the default shape in the legend isn’t a fillable shape. In addition, the &lt;code&gt;g2&lt;/code&gt; legend shows unfilled points and I think it would look better if the points were filled.&lt;/p&gt;
&lt;p&gt;I can address both these issues via &lt;code&gt;override.aes&lt;/code&gt;. I’ll change the point shape in the &lt;code&gt;fill&lt;/code&gt; legend to shape 21 and the fill color in the &lt;code&gt;shape&lt;/code&gt; legend to black within a &lt;code&gt;guides()&lt;/code&gt; layer. This code gives you a chance to see how you can use multiple scale name-guide pairs within the same &lt;code&gt;guides()&lt;/code&gt; layer.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = dat, aes(x = x, y = y, fill = g1, shape = g2) ) +
     geom_point(size = 5) +
     scale_fill_manual(values = c(&amp;quot;#002F70&amp;quot;, &amp;quot;#EDB4B5&amp;quot;) ) +
     scale_shape_manual(values = c(21, 24) ) +
     guides(fill = guide_legend(override.aes = list(shape = 21) ),
            shape = guide_legend(override.aes = list(fill = &amp;quot;black&amp;quot;) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2020-07-09-ggplot2-override-aes_files/figure-html/fourthplot2-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;While I’m sure you can come up with additional scenarios, that should give you a taste for when overriding the aesthetics in the legend is useful. 😄&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;just-the-code-please&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Just the code, please&lt;/h1&gt;
&lt;p&gt;Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code &lt;a href=&#34;https://aosmith.rbind.io/script/2020-07-09-ggplot2-override-aes.R&#34;&gt;from here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2) # v. 3.3.2

ggplot(data = diamonds, aes(x = carat, y = price, color = cut) ) +
     geom_point(alpha = .25, size = 1)
     
ggplot(data = diamonds, aes(x = carat, y = price, color = cut) ) +
     geom_point(alpha = .25, size = 1) +
     guides(color = guide_legend(override.aes = list(size = 3) ) )
            
ggplot(data = diamonds, aes(x = carat, y = price, color = cut) ) +
     geom_point(alpha = .25, size = 1) +
     scale_color_viridis_d(option = &amp;quot;magma&amp;quot;,
                           guide = guide_legend(override.aes = list(size = 3) ) )

ggplot(data = diamonds, aes(x = carat, y = price, color = cut) ) +
     geom_point(alpha = .25, size = 1) +
     scale_color_viridis_d(option = &amp;quot;magma&amp;quot;,
                           guide = guide_legend(override.aes = list(size = 3,
                                                                    alpha = 1) ) )

points = structure(list(x = c(5L, 10L, 7L, 9L, 86L, 46L, 22L, 94L, 21L, 
6L, 24L, 3L), y = c(51L, 54L, 50L, 60L, 97L, 74L, 59L, 68L, 45L, 
56L, 25L, 70L), id = c(&amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;b&amp;quot;, 
&amp;quot;c&amp;quot;, &amp;quot;c&amp;quot;, &amp;quot;c&amp;quot;, &amp;quot;c&amp;quot;)), row.names = c(NA, -12L), class = &amp;quot;data.frame&amp;quot;)

head(points)

box = data.frame(left = 1, right = 10, bottom = 50, top = 60, id = &amp;quot;a&amp;quot;)
box
ggplot(data = points, aes(color = id) ) +
     geom_point(aes(x = x, y = y), size = 4) +
     geom_rect(data = box, aes(xmin = left,
                               xmax = right,
                               ymin = 50,
                               ymax = top),
               fill = NA, size = 1)

ggplot(data = points, aes(color = id) ) +
     geom_point(aes(x = x, y = y), size = 4) +
     geom_rect(data = box, aes(xmin = left,
                               xmax = right,
                               ymin = 50,
                               ymax = top),
               fill = NA, size = 1) +
     guides(color = guide_legend(override.aes = list(linetype = c(1, 0, 0) ) ) )

ggplot(data = mtcars, aes(x = mpg, y = wt, color = factor(am) ) ) +
     geom_point(size = 3) +
     geom_smooth(method = &amp;quot;lm&amp;quot;, se = FALSE)
       
ggplot(data = mtcars, aes(x = mpg, y = wt, color = factor(am) ) ) +
     geom_point(aes(alpha = &amp;quot;Observed&amp;quot;), size = 3) +
     geom_smooth(method = &amp;quot;lm&amp;quot;, se = FALSE, aes(alpha = &amp;quot;Fitted&amp;quot;) ) +
     scale_alpha_manual(name = NULL,
                        values = c(1, 1),
                        breaks = c(&amp;quot;Observed&amp;quot;, &amp;quot;Fitted&amp;quot;) )
                        

ggplot(data = mtcars, aes(x = mpg, y = wt, color = factor(am) ) ) +
     geom_point(aes(alpha = &amp;quot;Observed&amp;quot;), size = 3) +
     geom_smooth(method = &amp;quot;lm&amp;quot;, se = FALSE, aes(alpha = &amp;quot;Fitted&amp;quot;) ) +
     scale_alpha_manual(name = NULL,
                        values = c(1, 1),
                        breaks = c(&amp;quot;Observed&amp;quot;, &amp;quot;Fitted&amp;quot;),
                        guide = guide_legend(override.aes = list(linetype = c(0, 1),
                                                                  shape = c(16, NA),
                                                                  color = &amp;quot;black&amp;quot;) ) )

dat = structure(list(g1 = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), class = &amp;quot;factor&amp;quot;, .Label = c(&amp;quot;High&amp;quot;, 
&amp;quot;Low&amp;quot;)), g2 = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 
1L, 2L, 2L, 1L, 1L, 2L, 2L), class = &amp;quot;factor&amp;quot;, .Label = c(&amp;quot;Control&amp;quot;, 
&amp;quot;Treatment&amp;quot;)), x = c(0.42, 0.39, 0.56, 0.59, 0.17, 0.95, 0.85, 
0.25, 0.31, 0.75, 0.58, 0.9, 0.6, 0.86, 0.61, 0.61), y = c(-1.4, 
3.6, 1.1, -0.1, 0.5, 0, -1.8, 0.8, -1.1, -0.6, 0.2, 0.3, 1.1, 
1.6, 0.9, -0.6)), class = &amp;quot;data.frame&amp;quot;, row.names = c(NA, -16L
))

head(dat)

ggplot(data = dat, aes(x = x, y = y, fill = g1, shape = g2) ) +
     geom_point(size = 5) +
     scale_fill_manual(values = c(&amp;quot;#002F70&amp;quot;, &amp;quot;#EDB4B5&amp;quot;) ) +
     scale_shape_manual(values = c(21, 24) )

ggplot(data = dat, aes(x = x, y = y, fill = g1, shape = g2) ) +
     geom_point(size = 5) +
     scale_fill_manual(values = c(&amp;quot;#002F70&amp;quot;, &amp;quot;#EDB4B5&amp;quot;) ) +
     scale_shape_manual(values = c(21, 24) ) +
     guides(fill = guide_legend(override.aes = list(shape = 21) ),
            shape = guide_legend(override.aes = list(fill = &amp;quot;black&amp;quot;) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Analysis essentials: Using the help page for a function in R</title>
      <link>https://aosmith.rbind.io/2020/04/28/r-documentation/</link>
      <pubDate>Tue, 28 Apr 2020 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2020/04/28/r-documentation/</guid>
      <description>


&lt;p&gt;Since I tend to work with relatively new R users, I think a lot about what folks need to know when they are getting started. Learning how to get help tops my list of essential skills. Some of this involves learning about useful help forums like &lt;a href=&#34;https://stackoverflow.com/questions/tagged/r#&#34;&gt;Stack Overflow&lt;/a&gt; and the &lt;a href=&#34;https://community.rstudio.com/&#34;&gt;RStudio Community&lt;/a&gt;. Some of this is about learning good search terms (this is a hard one!). And some of this is learning how to use the R documentation help pages.&lt;/p&gt;
&lt;p&gt;While there are still exceptions, most often the help pages in R contain a bunch of useful information. Here I talk a little about what is generally in a help page for a function and what I focus on in each section.&lt;/p&gt;
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#r-help-pages&#34;&gt;R help pages&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#help-page-structure&#34;&gt;Help page structure&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#usage&#34;&gt;Usage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#arguments&#34;&gt;Arguments&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#examples&#34;&gt;Examples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#other-sections&#34;&gt;Other sections&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-argument-order-instead-of-labels&#34;&gt;Using argument order instead of labels&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;r-help-pages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;R help pages&lt;/h1&gt;
&lt;p&gt;Every time I use a function for a first time or reuse a function after some time has passed (like, 5 minutes in some cases 😜), I spend time looking at the R help page for that function. You can get to a help page in R by typing &lt;code&gt;?functionname&lt;/code&gt; into your Console and pressing Enter, where &lt;code&gt;functionname&lt;/code&gt; is some R function you are using.&lt;/p&gt;
&lt;p&gt;For example, if I wanted to take an average of some numbers with the &lt;code&gt;mean()&lt;/code&gt; function, I would type &lt;code&gt;?mean&lt;/code&gt; at the &lt;code&gt;&amp;gt;&lt;/code&gt; in the R Console and then press Enter. The help page opens up; if using RStudio this will default to open in the Help pane.&lt;/p&gt;
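&lt;p&gt;As a quick aside, &lt;code&gt;help()&lt;/code&gt; is the function form of the &lt;code&gt;?&lt;/code&gt; shortcut, which can be useful if you prefer spelling things out. Both of these lines open the same page:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;?mean       # shortcut form
help(mean)  # function form, equivalent to ?mean&lt;/code&gt;&lt;/pre&gt;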
&lt;div id=&#34;help-page-structure&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Help page structure&lt;/h2&gt;
&lt;p&gt;A help page for an R function always has the same basic set-up. Here’s what the first half of the help page for &lt;code&gt;mean()&lt;/code&gt; looks like.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://raw.githubusercontent.com/aosmith16/r-basics-workshop/master/images/mean_help.png&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
&lt;p&gt;At the very top you’ll see the function name, followed by the package the function is in surrounded by curly braces. You can see that &lt;code&gt;mean()&lt;/code&gt; is part of the &lt;strong&gt;base&lt;/strong&gt; package.&lt;/p&gt;
&lt;p&gt;This is followed by a function title and basic &lt;strong&gt;Description&lt;/strong&gt; of the function. Sometimes this description can be fairly in depth and useful, but often, like here, it’s not and I quickly skim over it.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://raw.githubusercontent.com/aosmith16/r-basics-workshop/master/images/help_section1.png&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;usage&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Usage&lt;/h2&gt;
&lt;p&gt;The &lt;strong&gt;Usage&lt;/strong&gt; section is usually my first stop in a help page. This is where I can see the arguments available in the function along with any default values. The function &lt;em&gt;arguments&lt;/em&gt; are labels for the inputs you can give to a function. A &lt;em&gt;default value&lt;/em&gt; means that is the value the function will use if you don’t input something else.&lt;/p&gt;
&lt;p&gt;For example, for &lt;code&gt;mean()&lt;/code&gt; you can see that the first argument is &lt;code&gt;x&lt;/code&gt; (no default value), followed by &lt;code&gt;trim&lt;/code&gt; that defaults to a value of &lt;code&gt;0&lt;/code&gt;, and then &lt;code&gt;na.rm&lt;/code&gt; with a default of &lt;code&gt;FALSE&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://raw.githubusercontent.com/aosmith16/r-basics-workshop/master/images/help_section2.png&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
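&lt;p&gt;If you just want a quick reminder of the arguments without opening the full help page, &lt;code&gt;args()&lt;/code&gt; prints a &lt;strong&gt;Usage&lt;/strong&gt;-style signature in the console. Note that &lt;code&gt;mean()&lt;/code&gt; is a generic function, so to see the full argument list you need to ask for the default method:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;args(mean.default)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# function (x, trim = 0, na.rm = FALSE, ...) 
# NULL&lt;/code&gt;&lt;/pre&gt;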
&lt;/div&gt;
&lt;div id=&#34;arguments&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Arguments&lt;/h2&gt;
&lt;p&gt;The arguments the function takes, along with a description of each argument, are given in the &lt;strong&gt;Arguments&lt;/strong&gt; section. This is a section I often spend a lot of time in and go back to regularly, figuring out what arguments do and the options available for each argument.&lt;/p&gt;
&lt;p&gt;In the &lt;code&gt;mean()&lt;/code&gt; example I’m using, this section tells me that the &lt;code&gt;trim&lt;/code&gt; argument can take numeric values between 0 and 0.5 in order to &lt;em&gt;trim&lt;/em&gt; the dataset prior to calculating the mean. I know from &lt;strong&gt;Usage&lt;/strong&gt; it defaults to &lt;code&gt;0&lt;/code&gt; but note in this case the default is not explicitly listed in the argument description.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;na.rm&lt;/code&gt; argument takes a logical value (i.e., &lt;code&gt;TRUE&lt;/code&gt; or &lt;code&gt;FALSE&lt;/code&gt;) and controls whether or not &lt;code&gt;NA&lt;/code&gt; values are stripped before the function calculates the mean. Since it defaults to &lt;code&gt;FALSE&lt;/code&gt;, the &lt;code&gt;NA&lt;/code&gt; values are not stripped prior to calculation unless I change this.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://raw.githubusercontent.com/aosmith16/r-basics-workshop/master/images/help_section3.png&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
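&lt;p&gt;To see what these two arguments do in practice, here’s a small example (the vector is just made up for illustration). With &lt;code&gt;trim = 0.25&lt;/code&gt;, 25% of the values are dropped from each end of the sorted data before averaging, which pulls the result away from the large outlier.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;x = c(0:10, 50)          # 12 values with one large outlier
mean(x)                  # 8.75
mean(x, trim = 0.25)     # 5.5; drops 3 values from each end

mean( c(1, 2, NA) )                # NA, since na.rm defaults to FALSE
mean( c(1, 2, NA), na.rm = TRUE )  # 1.5&lt;/code&gt;&lt;/pre&gt;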
&lt;/div&gt;
&lt;div id=&#34;examples&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Examples&lt;/h2&gt;
&lt;p&gt;If you scroll to the very bottom of a help page you will find the &lt;strong&gt;Examples&lt;/strong&gt; section. This gives examples of how the function works. You can practice using the function by copying and pasting the example and running the code. In RStudio you can also highlight the code and run it directly from the Help pane with &lt;code&gt;Ctrl+Enter&lt;/code&gt; (MacOS &lt;code&gt;Cmd+Enter&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;After looking at &lt;strong&gt;Usage&lt;/strong&gt; and &lt;strong&gt;Arguments&lt;/strong&gt; I often scroll right down to the &lt;strong&gt;Examples&lt;/strong&gt; section to see an example of the code in use. The &lt;strong&gt;Examples&lt;/strong&gt; section for &lt;code&gt;mean()&lt;/code&gt; is pretty sparse, but you’ll find that this section is quite extensive for some functions.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://raw.githubusercontent.com/aosmith16/r-basics-workshop/master/images/help_section4.png&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;other-sections&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Other sections&lt;/h2&gt;
&lt;p&gt;Depending on the function, there can be a variety of different and important information after &lt;strong&gt;Arguments&lt;/strong&gt; and before &lt;strong&gt;Examples&lt;/strong&gt;. You may see mathematical notation that shows what the function does (in &lt;strong&gt;Details&lt;/strong&gt;), a description of what the function returns (in &lt;strong&gt;Value&lt;/strong&gt;), references in support of what the function does (in &lt;strong&gt;References&lt;/strong&gt;), etc. This can be extremely valuable information, but I often don’t read it until I run into trouble using the function or need more information to understand exactly what the function does.&lt;/p&gt;
&lt;p&gt;I have a couple examples of useful information I’ve found in these other sections for various functions.&lt;/p&gt;
&lt;p&gt;First up is &lt;code&gt;rbind()&lt;/code&gt; for stacking datasets. It turns out that &lt;code&gt;rbind()&lt;/code&gt; stacks columns based on matching column names and not column positions. This is mentioned in the function documentation, but you have to dive deep into the very long &lt;strong&gt;Details&lt;/strong&gt; section of the help file at &lt;code&gt;?rbind&lt;/code&gt; to find the information.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://raw.githubusercontent.com/aosmith16/r-basics-workshop/master/images/rbind.png&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
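Here’s a minimal sketch of that behavior, using two made-up data.frames:

```r
# rbind() matches data.frame columns by name, not position
df1 = data.frame(a = 1:2, b = 3:4)
df2 = data.frame(b = 5:6, a = 7:8)  # same columns, reversed order
rbind(df1, df2)$a  # 1 2 7 8, not 1 2 5 6
```
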
&lt;p&gt;Second, functions for distributions will give information about the density function used in the &lt;strong&gt;Details&lt;/strong&gt; section. Since the help pages for distributions almost always describe multiple functions at once, you can see what each of the functions return in &lt;strong&gt;Value&lt;/strong&gt;. Here’s an example from &lt;code&gt;?rnorm&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://raw.githubusercontent.com/aosmith16/aosmith/master/static/img/rnorm_help.PNG&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
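For example, the four functions documented together at &lt;code&gt;?rnorm&lt;/code&gt; each return a different quantity:

```r
dnorm(0)    # density of the standard normal at 0
pnorm(0)    # cumulative probability up to 0, i.e., 0.5
qnorm(0.5)  # quantile for probability 0.5, i.e., 0
rnorm(2)    # two random draws
```
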
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;using-argument-order-instead-of-labels&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using argument order instead of labels&lt;/h1&gt;
&lt;p&gt;You will see plenty of examples in R where the argument labels are not written out explicitly. This is because we can take advantage of the &lt;em&gt;argument order&lt;/em&gt; when writing code in R.&lt;/p&gt;
&lt;p&gt;You can see this in the &lt;code&gt;mean()&lt;/code&gt; &lt;strong&gt;Examples&lt;/strong&gt; section, for example. You can pass a vector to the first argument of &lt;code&gt;mean()&lt;/code&gt; without explicitly writing the &lt;code&gt;x&lt;/code&gt; argument label.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;vals = c(0:10, 50)
mean(vals)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 8.75&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In fact, you can pass in values to all the arguments without labels as long as you input them in the order the arguments come into the function. This relies heavily on you remembering the order of the arguments, as listed in &lt;strong&gt;Usage&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;vals = c(0:10, 50, NA)
mean(vals, 0.1, TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 5.5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You will definitely catch me leaving argument labels off in my own code. These days, though, I try to be more careful and primarily leave off only the first label. One reason for this is that my future self needs the argument labels written out to better understand the code. I’m much more likely to figure out what the &lt;code&gt;mean()&lt;/code&gt; code is doing if I put the argument labels back in. I think the code above, without the labels for &lt;code&gt;trim&lt;/code&gt; and &lt;code&gt;na.rm&lt;/code&gt;, is hard to understand.&lt;/p&gt;
&lt;p&gt;Here’s the same code, this time with the argument labels written out. Note the argument order doesn’t matter if the argument labels are used.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;vals = c(0:10, 50, NA)
mean(vals, na.rm = TRUE, trim = 0.1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 5.5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Another reason I try to use argument labels is that new R users can get stung leaving off argument labels when they don’t realize how/why it works. 🐝 I worked with an R newbie recently who was getting weird results from a GLM with an offset. It turns out they weren’t using argument labels and so had passed the offset to &lt;code&gt;weights&lt;/code&gt; instead of &lt;code&gt;offset&lt;/code&gt;. Whoops! Luckily they saw something was weird and I could help get them on the right path. And now they know more about why it can be useful to write out argument labels. 😄&lt;/p&gt;
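Here’s a hedged sketch of how that mistake can happen. The data are made up, but the argument order comes straight from &lt;code&gt;?glm&lt;/code&gt;, where &lt;code&gt;weights&lt;/code&gt; comes before &lt;code&gt;offset&lt;/code&gt;:

```r
# Made-up count data with an exposure variable intended as an offset
set.seed(1)
dat = data.frame(y = rpois(20, 5), x = rnorm(20), exposure = runif(20, 1, 2))

# Intended model: the offset labeled explicitly
fit_good = glm(y ~ x, family = poisson, data = dat, offset = log(exposure))

# Positional version: log(exposure) silently lands in `weights`, not `offset`
fit_bad = glm(y ~ x, poisson, dat, log(dat$exposure))
```

The two fits give different coefficients, and nothing about the positional call warns you that the offset went to the wrong argument.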
&lt;p&gt;I talk about this issue here because I don’t often see explicit discussion of why and when argument labels can be left off, even though there are a lot of code examples out there that do this. This reminds me of when I was a new beekeeper and I made the mistake of going into a hive in the evening. (Do not try this at home, folks!) It turns out “everyone” who is an expert beekeeper knows what happens if you do this, but it wasn’t mentioned in any of my beginner books and classes. I don’t think beginners should have to learn this sort of thing the hard way.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span id=&#34;fig:unnamed-chunk-5&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://raw.githubusercontent.com/aosmith16/aosmith/master/static/img/bees.jpg&#34; alt=&#34;No worries, this is a daytime hive inspection.&#34;  /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1: No worries, this is a daytime hive inspection.
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>An example of base::split() for looping through groups</title>
      <link>https://aosmith.rbind.io/2019/11/27/split-example/</link>
      <pubDate>Wed, 27 Nov 2019 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2019/11/27/split-example/</guid>
      <description>


&lt;p&gt;I recently had a question from a client about the simplest way to subset a data.frame and apply a function to each subset. “Simplest” could mean many things, of course, since what is simple for one person could appear very difficult to another. In this specific case I suggested using &lt;code&gt;base::split()&lt;/code&gt; as a possible option since it is one I find fairly approachable.&lt;/p&gt;
&lt;p&gt;It turns out I don’t have a go-to example for how to get started with a &lt;code&gt;split()&lt;/code&gt; approach. So here’s a quick blog post about it! 😄&lt;/p&gt;
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#load-r-packages&#34;&gt;Load R packages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#a-dataset-with-groups&#34;&gt;A dataset with groups&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#create-separate-data.frames-per-group&#34;&gt;Create separate data.frames per group&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#looping-through-the-list&#34;&gt;Looping through the list&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#splitting-by-multiple-groups&#34;&gt;Splitting by multiple groups&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#other-thoughts-on-split&#34;&gt;Other thoughts on split()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#just-the-code-please&#34;&gt;Just the code, please&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;load-r-packages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Load R packages&lt;/h1&gt;
&lt;p&gt;I’ll load &lt;strong&gt;purrr&lt;/strong&gt; for looping through lists.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(purrr) # 0.3.3&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;a-dataset-with-groups&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;A dataset with groups&lt;/h1&gt;
&lt;p&gt;I made a small dataset to use with &lt;code&gt;split()&lt;/code&gt;. The &lt;code&gt;id&lt;/code&gt; variable contains the group information. There are three groups, a, b, and c, with 10 observations per group. There are also two numeric variables, &lt;code&gt;var1&lt;/code&gt; and &lt;code&gt;var2&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat = structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L), .Label = c(&amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;c&amp;quot;), class = &amp;quot;factor&amp;quot;), 
    var1 = c(4, 2.7, 3.4, 2.7, 4.6, 2.9, 2.2, 4.5, 4.6, 2.4, 
    3, 3.8, 2.5, 4, 3.6, 2.7, 4.5, 4.1, 4.2, 2.2, 4.9, 4.4, 3.6, 
    3.3, 2.7, 3.9, 4.9, 4.9, 4.3, 3.4), var2 = c(6, 22.3, 19.4, 
    22.8, 18.6, 14.2, 10.9, 22.7, 22.4, 11.7, 6, 13.3, 12.5, 
    6.3, 13.6, 20.5, 23.6, 10.9, 8.9, 20.9, 23.7, 15.9, 22.1, 
    11.6, 22, 17.7, 21, 20.8, 16.7, 21.4)), class = &amp;quot;data.frame&amp;quot;, row.names = c(NA, 
-30L))

head(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#   id var1 var2
# 1  a  4.0  6.0
# 2  a  2.7 22.3
# 3  a  3.4 19.4
# 4  a  2.7 22.8
# 5  a  4.6 18.6
# 6  a  2.9 14.2&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;create-separate-data.frames-per-group&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Create separate data.frames per group&lt;/h1&gt;
&lt;p&gt;If the goal is to apply a function to each dataset in each group, we need to pull out a dataset for each &lt;code&gt;id&lt;/code&gt;. One approach to do this is to make a subset for each group and then apply the function of interest to the subset. A classic approach would be to do the subsetting within a &lt;code&gt;for()&lt;/code&gt; loop.&lt;/p&gt;
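For comparison, here’s a minimal sketch of that classic &lt;code&gt;for()&lt;/code&gt; loop approach, using a small made-up two-group data.frame:

```r
# Manually subset each group inside a for() loop
ex = data.frame(id = rep(c("a", "b"), each = 3), y = 1:6)
groups = unique(ex$id)
results = vector("list", length(groups))
names(results) = groups
for (g in groups) {
     results[[g]] = ex[ex$id == g, ]  # pull out the rows for this group
}
results$a  # the subset for group "a"
```
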
&lt;p&gt;This is a situation where I find &lt;code&gt;split()&lt;/code&gt; to be really convenient. It splits the data by a defined group variable so we don’t have to subset things manually.&lt;/p&gt;
&lt;p&gt;The output from &lt;code&gt;split()&lt;/code&gt; is a list. If I split a dataset by groups, each element of the list will be a data.frame for one of the groups. Note the group values are used as the names of the list elements. I find the list-naming aspect of &lt;code&gt;split()&lt;/code&gt; handy for keeping track of groups in subsequent steps.&lt;/p&gt;
&lt;p&gt;Here’s an example, where I split &lt;code&gt;dat&lt;/code&gt; by the &lt;code&gt;id&lt;/code&gt; variable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat_list = split(dat, dat$id)
dat_list&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $a
#    id var1 var2
# 1   a  4.0  6.0
# 2   a  2.7 22.3
# 3   a  3.4 19.4
# 4   a  2.7 22.8
# 5   a  4.6 18.6
# 6   a  2.9 14.2
# 7   a  2.2 10.9
# 8   a  4.5 22.7
# 9   a  4.6 22.4
# 10  a  2.4 11.7
# 
# $b
#    id var1 var2
# 11  b  3.0  6.0
# 12  b  3.8 13.3
# 13  b  2.5 12.5
# 14  b  4.0  6.3
# 15  b  3.6 13.6
# 16  b  2.7 20.5
# 17  b  4.5 23.6
# 18  b  4.1 10.9
# 19  b  4.2  8.9
# 20  b  2.2 20.9
# 
# $c
#    id var1 var2
# 21  c  4.9 23.7
# 22  c  4.4 15.9
# 23  c  3.6 22.1
# 24  c  3.3 11.6
# 25  c  2.7 22.0
# 26  c  3.9 17.7
# 27  c  4.9 21.0
# 28  c  4.9 20.8
# 29  c  4.3 16.7
# 30  c  3.4 21.4&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;looping-through-the-list&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Looping through the list&lt;/h1&gt;
&lt;p&gt;Once the data are split into separate data.frames per group, we can loop through the list and apply a function to each one using whatever looping approach we prefer.&lt;/p&gt;
&lt;p&gt;For example, if I want to fit a linear model of &lt;code&gt;var1&lt;/code&gt; vs &lt;code&gt;var2&lt;/code&gt; for each group I might do the looping with &lt;code&gt;purrr::map()&lt;/code&gt; or &lt;code&gt;lapply()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Each element of the new list still has the grouping information attached via the list names.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;map(dat_list, ~lm(var1 ~ var2, data = .x) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $a
# 
# Call:
# lm(formula = var1 ~ var2, data = .x)
# 
# Coefficients:
# (Intercept)         var2  
#     2.64826      0.04396  
# 
# 
# $b
# 
# Call:
# lm(formula = var1 ~ var2, data = .x)
# 
# Coefficients:
# (Intercept)         var2  
#     3.80822     -0.02551  
# 
# 
# $c
# 
# Call:
# lm(formula = var1 ~ var2, data = .x)
# 
# Coefficients:
# (Intercept)         var2  
#     3.35241      0.03513&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I could also create a function that fits a model and then returns model output. For example, maybe what I really wanted to do is fit a linear model and extract &lt;span class=&#34;math inline&#34;&gt;\(R^2\)&lt;/span&gt; from each group’s model fit.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;r2 = function(data) {
     fit = lm(var1 ~ var2, data = data)
     
     broom::glance(fit)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The output of my &lt;code&gt;r2&lt;/code&gt; function, which uses &lt;code&gt;broom::glance()&lt;/code&gt;, is a data.frame.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;r2(data = dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# # A tibble: 1 x 11
#   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
#       &amp;lt;dbl&amp;gt;         &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
# 1    0.0292      -0.00550 0.867     0.841   0.367     2  -37.3  80.5  84.7
# # ... with 2 more variables: deviance &amp;lt;dbl&amp;gt;, df.residual &amp;lt;int&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since the function output is a data.frame, I can use &lt;code&gt;purrr::map_dfr()&lt;/code&gt; to combine the output per group into a single data.frame. The &lt;code&gt;.id&lt;/code&gt; argument creates a new variable to store the list names in the output.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;map_dfr(dat_list, r2, .id = &amp;quot;id&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# # A tibble: 3 x 12
#   id    r.squared adj.r.squared sigma statistic p.value    df logLik   AIC
#   &amp;lt;chr&amp;gt;     &amp;lt;dbl&amp;gt;         &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
# 1 a        0.0775       -0.0378 0.968     0.672   0.436     2  -12.7  31.5
# 2 b        0.0387       -0.0815 0.832     0.322   0.586     2  -11.2  28.5
# 3 c        0.0285       -0.0930 0.808     0.235   0.641     2  -10.9  27.9
# # ... with 3 more variables: BIC &amp;lt;dbl&amp;gt;, deviance &amp;lt;dbl&amp;gt;, df.residual &amp;lt;int&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;splitting-by-multiple-groups&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Splitting by multiple groups&lt;/h1&gt;
&lt;p&gt;It is possible to split data by multiple grouping variables in the &lt;code&gt;split()&lt;/code&gt; function. The grouping variables must be passed as a list.&lt;/p&gt;
&lt;p&gt;Here’s an example, using the built-in &lt;code&gt;mtcars&lt;/code&gt; dataset. I show only the first two list elements to demonstrate that the list names are now based on a combination of the values for the two groups. By default these values are separated by a &lt;code&gt;.&lt;/code&gt; (but see the &lt;code&gt;sep&lt;/code&gt; argument to control this).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mtcars_cylam = split(mtcars, list(mtcars$cyl, mtcars$am) )
mtcars_cylam[1:2]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $`4.0`
#                mpg cyl  disp hp drat    wt  qsec vs am gear carb
# Merc 240D     24.4   4 146.7 62 3.69 3.190 20.00  1  0    4    2
# Merc 230      22.8   4 140.8 95 3.92 3.150 22.90  1  0    4    2
# Toyota Corona 21.5   4 120.1 97 3.70 2.465 20.01  1  0    3    1
# 
# $`6.0`
#                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
# Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
# Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
# Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If not all combinations of groups are present, the &lt;code&gt;drop&lt;/code&gt; argument in &lt;code&gt;split()&lt;/code&gt; allows us to drop the missing combinations. By default, combinations that aren’t present are kept as 0-row data.frames.&lt;/p&gt;
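A quick sketch of the &lt;code&gt;drop&lt;/code&gt; behavior, removing the 4-cylinder automatic cars from &lt;code&gt;mtcars&lt;/code&gt; so one combination is absent:

```r
# Remove the cyl == 4, am == 0 combination
mtcars_sub = mtcars[mtcars$cyl != 4 | mtcars$am != 0, ]

# Default keeps a 0-row element for the absent combination
length(split(mtcars_sub, list(mtcars_sub$cyl, mtcars_sub$am)))

# drop = TRUE removes it
length(split(mtcars_sub, list(mtcars_sub$cyl, mtcars_sub$am), drop = TRUE))
```
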
&lt;/div&gt;
&lt;div id=&#34;other-thoughts-on-split&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Other thoughts on split()&lt;/h1&gt;
&lt;p&gt;I feel like &lt;code&gt;split()&lt;/code&gt; was a gateway function for me to get started working with lists and associated convenience functions like &lt;code&gt;lapply()&lt;/code&gt; and &lt;code&gt;purrr::map()&lt;/code&gt; for looping through lists. I think learning to work with lists and “list loops” also made the learning curve for &lt;a href=&#34;https://r4ds.had.co.nz/many-models.html#list-columns-1&#34;&gt;list-columns&lt;/a&gt; in data.frames and the &lt;code&gt;nest()&lt;/code&gt;/&lt;code&gt;unnest()&lt;/code&gt; approach of analysis-by-groups a little less steep for me.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;just-the-code-please&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Just the code, please&lt;/h1&gt;
&lt;p&gt;Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code &lt;a href=&#34;https://aosmith.rbind.io/script/2019-11-26-split-example.R&#34;&gt;from here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(purrr) # 0.3.3

dat = structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L), .Label = c(&amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;c&amp;quot;), class = &amp;quot;factor&amp;quot;), 
    var1 = c(4, 2.7, 3.4, 2.7, 4.6, 2.9, 2.2, 4.5, 4.6, 2.4, 
    3, 3.8, 2.5, 4, 3.6, 2.7, 4.5, 4.1, 4.2, 2.2, 4.9, 4.4, 3.6, 
    3.3, 2.7, 3.9, 4.9, 4.9, 4.3, 3.4), var2 = c(6, 22.3, 19.4, 
    22.8, 18.6, 14.2, 10.9, 22.7, 22.4, 11.7, 6, 13.3, 12.5, 
    6.3, 13.6, 20.5, 23.6, 10.9, 8.9, 20.9, 23.7, 15.9, 22.1, 
    11.6, 22, 17.7, 21, 20.8, 16.7, 21.4)), class = &amp;quot;data.frame&amp;quot;, row.names = c(NA, 
-30L))

head(dat)

dat_list = split(dat, dat$id)
dat_list

map(dat_list, ~lm(var1 ~ var2, data = .x) )

r2 = function(data) {
     fit = lm(var1 ~ var2, data = data)
     
     broom::glance(fit)
}
r2(data = dat)

map_dfr(dat_list, r2, .id = &amp;quot;id&amp;quot;)

mtcars_cylam = split(mtcars, list(mtcars$cyl, mtcars$am) )
mtcars_cylam[1:2]&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Making a background color gradient in ggplot2</title>
      <link>https://aosmith.rbind.io/2019/10/14/background-color_gradient/</link>
      <pubDate>Mon, 14 Oct 2019 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2019/10/14/background-color_gradient/</guid>
      <description>


&lt;p&gt;I was recently making some arrangements for the 2020 eclipse in South America, which of course got me thinking of the day we were lucky enough to have a path of totality come to us.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/img/dog_eclipse.png&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We have a weather station that records local temperature every 5 minutes, so after the eclipse I was able to plot the temperature change over the eclipse as we experienced it at our house. Here is an example of a basic plot I made at the time.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/img/eclipse_temp.png&#34; width=&#34;400px&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Looking at this now with new eyes, I see it might be nice to replace the gray rectangle with one that goes from light to dark to light as the eclipse progresses to totality and then back. I’ll show how I tackled making a gradient color background in this post.&lt;/p&gt;
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#load-r-packages&#34;&gt;Load R packages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-dataset&#34;&gt;The dataset&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#initial-plot&#34;&gt;Initial plot&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#adding-color-gradient-using-geom_segment&#34;&gt;Adding color gradient using geom_segment()&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#make-a-variable-for-the-color-mapping&#34;&gt;Make a variable for the color mapping&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#adding-one-geom_segment-per-second&#34;&gt;Adding one geom_segment() per second&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#switching-to-a-gray-scale&#34;&gt;Switching to a gray scale&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-segments-to-make-a-gradient-rectangle&#34;&gt;Using segments to make a gradient rectangle&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#bonus-annotations-with-curved-arrows&#34;&gt;Bonus: annotations with curved arrows&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#other-ways-to-make-a-gradient-color-background&#34;&gt;Other ways to make a gradient color background&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#eclipses&#34;&gt;Eclipses!&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#just-the-code-please&#34;&gt;Just the code, please&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;load-r-packages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Load R packages&lt;/h1&gt;
&lt;p&gt;I’ll load &lt;strong&gt;ggplot2&lt;/strong&gt; for plotting and &lt;strong&gt;dplyr&lt;/strong&gt; for data manipulation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2) # 3.2.1
library(dplyr) # 0.8.3&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;the-dataset&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The dataset&lt;/h1&gt;
&lt;p&gt;My weather station records the temperature in °Fahrenheit every 5 minutes. I downloaded the data from 6 AM to 12 PM local time and cleaned it up a bit. The date-times and temperature are in a dataset I named &lt;code&gt;temp&lt;/code&gt;. You can download this below if you’d like to play around with these data.&lt;/p&gt;
&lt;hr /&gt;
&lt;svg style=&#34;height:0.8em;top:.04em;position:relative;fill:#ee5863;&#34; viewBox=&#34;0 0 512 512&#34;&gt;
&lt;path d=&#34;M216 0h80c13.3 0 24 10.7 24 24v168h87.7c17.8 0 26.7 21.5 14.1 34.1L269.7 378.3c-7.5 7.5-19.8 7.5-27.3 0L90.1 226.1c-12.6-12.6-3.7-34.1 14.1-34.1H192V24c0-13.3 10.7-24 24-24zm296 376v112c0 13.3-10.7 24-24 24H24c-13.3 0-24-10.7-24-24V376c0-13.3 10.7-24 24-24h146.7l49 49c20.1 20.1 52.5 20.1 72.6 0l49-49H488c13.3 0 24 10.7 24 24zm-124 88c0-11-9-20-20-20s-20 9-20 20 9 20 20 20 20-9 20-20zm64 0c0-11-9-20-20-20s-20 9-20 20 9 20 20 20 20-9 20-20z&#34;/&gt;
&lt;/svg&gt;
&lt;a href=&#34;data:text/csv;base64,ImRhdGV0aW1lIiwidGVtcGYiDQoyMDE3LTA4LTIxIDA2OjAwOjAwLDU0LjkNCjIwMTctMDgtMjEgMDY6MDU6MDAsNTQuOQ0KMjAxNy0wOC0yMSAwNjoxMDowMCw1NC45DQoyMDE3LTA4LTIxIDA2OjE1OjAwLDU0LjkNCjIwMTctMDgtMjEgMDY6MjA6MDAsNTQuOQ0KMjAxNy0wOC0yMSAwNjoyNTowMCw1NC44DQoyMDE3LTA4LTIxIDA2OjMwOjAwLDU0LjcNCjIwMTctMDgtMjEgMDY6MzU6MDAsNTQuNg0KMjAxNy0wOC0yMSAwNjo0MDowMCw1NC42DQoyMDE3LTA4LTIxIDA2OjQ1OjAwLDU0LjUNCjIwMTctMDgtMjEgMDY6NTA6MDAsNTQuNg0KMjAxNy0wOC0yMSAwNjo1NTowMCw1NC42DQoyMDE3LTA4LTIxIDA3OjAwOjAwLDU0LjkNCjIwMTctMDgtMjEgMDc6MDU6MDAsNTQuOQ0KMjAxNy0wOC0yMSAwNzoxMDowMCw1NQ0KMjAxNy0wOC0yMSAwNzoxNTowMCw1NS4xDQoyMDE3LTA4LTIxIDA3OjIwOjAwLDU1LjQNCjIwMTctMDgtMjEgMDc6MjU6MDAsNTUuOQ0KMjAxNy0wOC0yMSAwNzozMDowMCw1Ni40DQoyMDE3LTA4LTIxIDA3OjM1OjAwLDU3DQoyMDE3LTA4LTIxIDA3OjQwOjAwLDU3LjcNCjIwMTctMDgtMjEgMDc6NDU6MDAsNTguMw0KMjAxNy0wOC0yMSAwNzo1MDowMCw1OS4xDQoyMDE3LTA4LTIxIDA3OjU1OjAwLDU5LjcNCjIwMTctMDgtMjEgMDg6MDA6MDAsNjAuNg0KMjAxNy0wOC0yMSAwODowNTowMCw2MS41DQoyMDE3LTA4LTIxIDA4OjEwOjAwLDYyLjQNCjIwMTctMDgtMjEgMDg6MTU6MDAsNjMuNA0KMjAxNy0wOC0yMSAwODoyMDowMCw2NC41DQoyMDE3LTA4LTIxIDA4OjI1OjAwLDY1LjUNCjIwMTctMDgtMjEgMDg6MzA6MDAsNjYuNQ0KMjAxNy0wOC0yMSAwODozNTowMCw2Ny4yDQoyMDE3LTA4LTIxIDA4OjQwOjAwLDY4DQoyMDE3LTA4LTIxIDA4OjQ1OjAwLDY4LjYNCjIwMTctMDgtMjEgMDg6NTA6MDAsNjkuNA0KMjAxNy0wOC0yMSAwODo1NTowMCw2OS45DQoyMDE3LTA4LTIxIDA5OjAwOjAwLDcwLjQNCjIwMTctMDgtMjEgMDk6MDU6MDAsNzAuOA0KMjAxNy0wOC0yMSAwOToxMDowMCw3MS4xDQoyMDE3LTA4LTIxIDA5OjE1OjAwLDcxLjMNCjIwMTctMDgtMjEgMDk6MjA6MDAsNzEuNA0KMjAxNy0wOC0yMSAwOToyNTowMCw3MS40DQoyMDE3LTA4LTIxIDA5OjMwOjAwLDcxLjMNCjIwMTctMDgtMjEgMDk6MzU6MDAsNzEuNA0KMjAxNy0wOC0yMSAwOTo0MDowMCw3MS4zDQoyMDE3LTA4LTIxIDA5OjQ1OjAwLDcxLjENCjIwMTctMDgtMjEgMDk6NTA6MDAsNzAuOQ0KMjAxNy0wOC0yMSAwOTo1NTowMCw3MC41DQoyMDE3LTA4LTIxIDEwOjAwOjAwLDY5LjkNCjIwMTctMDgtMjEgMTA6MDU6MDAsNjkuNQ0KMjAxNy0wOC0yMSAxMDoxMDowMCw2OC45DQoyMDE3LTA4LTIxIDEwOjE1OjAwLDY4LjMNCjIwMTctMDgtMjEgMTA6MjA6MDAsNjcuOA0KMjAxNy0wOC0yMSAxMDoyNTowMCw2Nw0KMjAxNy0wOC0yMSAxMDozMDowMCw2Ni4zDQoyMDE3LTA4LTIxIDEwOjM1OjAwLDY2DQoyMDE3LTA
4LTIxIDEwOjQwOjAwLDY2DQoyMDE3LTA4LTIxIDEwOjQ1OjAwLDY2LjINCjIwMTctMDgtMjEgMTA6NTA6MDAsNjYuOA0KMjAxNy0wOC0yMSAxMDo1NTowMCw2Ny4zDQoyMDE3LTA4LTIxIDExOjAwOjAwLDY4DQoyMDE3LTA4LTIxIDExOjA1OjAwLDY4LjUNCjIwMTctMDgtMjEgMTE6MTA6MDAsNjkuMg0KMjAxNy0wOC0yMSAxMToxNTowMCw3MA0KMjAxNy0wOC0yMSAxMToyMDowMCw3MC44DQoyMDE3LTA4LTIxIDExOjI1OjAwLDcxLjcNCjIwMTctMDgtMjEgMTE6MzA6MDAsNzIuNA0KMjAxNy0wOC0yMSAxMTozNTowMCw3Mi45DQoyMDE3LTA4LTIxIDExOjQwOjAwLDczLjUNCjIwMTctMDgtMjEgMTE6NDU6MDAsNzMuOQ0KMjAxNy0wOC0yMSAxMTo1MDowMCw3NC4yDQoyMDE3LTA4LTIxIDExOjU1OjAwLDc0LjQNCjIwMTctMDgtMjEgMTI6MDA6MDAsNzQuNg0K&#34; download=&#34;eclipse_temp.csv&#34;&gt;Download eclipse_temp.csv&lt;/a&gt;
&lt;hr /&gt;
&lt;p&gt;Here are the first six lines of this temperature dataset.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;head(temp)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# # A tibble: 6 x 2
#   datetime            tempf
#   &amp;lt;dttm&amp;gt;              &amp;lt;dbl&amp;gt;
# 1 2017-08-21 06:00:00  54.9
# 2 2017-08-21 06:05:00  54.9
# 3 2017-08-21 06:10:00  54.9
# 4 2017-08-21 06:15:00  54.9
# 5 2017-08-21 06:20:00  54.9
# 6 2017-08-21 06:25:00  54.8&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I also stored the start and end times of the eclipse and totality in data.frames, which I pulled for my location from &lt;a href=&#34;http://xjubier.free.fr/en/site_pages/solar_eclipses/TSE_2017_GoogleMapFull.html&#34;&gt;this website&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If following along at home, make sure your time zones match for all the date-time variables or, from personal experience 🤣, you’ll run into problems.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;eclipse = data.frame(start = as.POSIXct(&amp;quot;2017-08-21 09:05:10&amp;quot;),
                     end = as.POSIXct(&amp;quot;2017-08-21 11:37:19&amp;quot;) )

totality = data.frame(start = as.POSIXct(&amp;quot;2017-08-21 10:16:55&amp;quot;),
                      end = as.POSIXct(&amp;quot;2017-08-21 10:18:52&amp;quot;) )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;initial-plot&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Initial plot&lt;/h1&gt;
&lt;p&gt;I decided to make a plot of the temperature change during the eclipse only.&lt;/p&gt;
&lt;p&gt;To keep the temperature line looking continuous, even though it’s measured every 5 minutes, I subset the data to times close to but just outside the start and end of the eclipse.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plottemp = filter(temp, between(datetime, 
                                as.POSIXct(&amp;quot;2017-08-21 09:00:00&amp;quot;),
                                as.POSIXct(&amp;quot;2017-08-21 12:00:00&amp;quot;) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I then zoomed the plot to only include times encompassed by the eclipse with &lt;code&gt;coord_cartesian()&lt;/code&gt;. I removed the x axis expansion in &lt;code&gt;scale_x_datetime()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Since the plot covers only about two and a half hours, I made breaks on the x axis every 15 minutes.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(plottemp) +
     geom_line( aes(datetime, tempf), size = 1 ) +
     scale_x_datetime( date_breaks = &amp;quot;15 min&amp;quot;,
                       date_labels = &amp;quot;%H:%M&amp;quot;,
                       expand = c(0, 0) ) +
     coord_cartesian(xlim = c(eclipse$start, eclipse$end) ) +
     labs(y = expression( Temperature~(degree*F) ),
          x = NULL,
          title = &amp;quot;Temperature during 2017-08-21 solar eclipse&amp;quot;,
          subtitle = expression(italic(&amp;quot;Sapsucker Farm, 09:05:10 - 11:37:19 PDT&amp;quot;) ),
          caption = &amp;quot;Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds&amp;quot;
     ) +
     scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9 , 
                                            name =  expression( Temperature~(degree*C)),
                                            breaks = seq(16, 24, by = 1)) ) +
     theme_bw(base_size = 14) +
     theme(panel.grid = element_blank() ) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-10-14-color-gradient-background_files/figure-html/unnamed-chunk-5-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;adding-color-gradient-using-geom_segment&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Adding color gradient using geom_segment()&lt;/h1&gt;
&lt;p&gt;I wanted the background of the plot to go from light to dark back to light through time. This means a color gradient should go from left to right across the plot.&lt;/p&gt;
&lt;p&gt;Since the gradient will be based on time, I figured I could add a vertical line with &lt;code&gt;geom_segment()&lt;/code&gt; for every second of the eclipse and color each segment based on how far that time was from totality.&lt;/p&gt;
&lt;div id=&#34;make-a-variable-for-the-color-mapping&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Make a variable for the color mapping&lt;/h2&gt;
&lt;p&gt;The first step I took was to make a variable with a row for every second of the eclipse, since I wanted a segment drawn for each second. I used &lt;code&gt;seq.POSIXt&lt;/code&gt; for this.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;color_dat = data.frame(time = seq(eclipse$start, eclipse$end, by = &amp;quot;1 sec&amp;quot;) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then came some hard thinking. How would I make a continuous variable to map to color? 🤔&lt;/p&gt;
&lt;p&gt;While I don’t have an actual measurement of light throughout the eclipse, I can show the general idea of a light change with color by using a linear change in color from the start of the eclipse to totality and then another linear change in color from totality to the end of the eclipse.&lt;/p&gt;
&lt;p&gt;My first idea for creating a variable was to use information on the current time vs totality start/end. After subtracting the times before totality from totality start and subtracting totality end from times after totality, I realized that the amount of time before totality wasn’t actually the same as the amount of time after totality. Back to the drawing board.&lt;/p&gt;
&lt;p&gt;Since I was making a linear change in color, I realized I could make a sequence of values before totality and after totality that covered the same range but had a different total number of values. This would account for the difference in the length of time before and after totality. I ended up making a sequence going from 100 to 0 for times before totality and a sequence from 0 to 100 after totality. Times during totality were assigned a 0.&lt;/p&gt;
&lt;p&gt;Here’s one way to get these sequences, using &lt;code&gt;base::replace()&lt;/code&gt;. My dataset is in order by time, which is key to this working correctly.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;color_dat = mutate(color_dat,
                   color = 0,
                   color = replace(color, 
                                   time &amp;lt; totality$start, 
                                   seq(100, 0, length.out = sum(time &amp;lt; totality$start) ) ),
                   color = replace(color, 
                                   time &amp;gt; totality$end, 
                                   seq(0, 100, length.out = sum(time &amp;gt; totality$end) ) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;adding-one-geom_segment-per-second&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Adding one geom_segment() per second&lt;/h2&gt;
&lt;p&gt;Once I had my color variable I was ready to plot the segments along the x axis. Since the segments needed to go across the full height of the plot, I set &lt;code&gt;y&lt;/code&gt; and &lt;code&gt;yend&lt;/code&gt; to &lt;code&gt;-Inf&lt;/code&gt; and &lt;code&gt;Inf&lt;/code&gt;, respectively.&lt;/p&gt;
&lt;p&gt;I put this layer first to use it as a background that the temperature line was plotted on top of.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;g1 = ggplot(plottemp) +
     geom_segment(data = color_dat,
                  aes(x = time, xend = time,
                      y = -Inf, yend = Inf, color = color),
                  show.legend = FALSE) +
     geom_line( aes(datetime, tempf), size = 1 ) +
     scale_x_datetime( date_breaks = &amp;quot;15 min&amp;quot;,
                       date_labels = &amp;quot;%H:%M&amp;quot;,
                       expand = c(0, 0) ) +
     coord_cartesian(xlim = c(eclipse$start, eclipse$end) ) +
     labs(y = expression( Temperature~(degree*F) ),
          x = NULL,
          title = &amp;quot;Temperature during 2017-08-21 solar eclipse&amp;quot;,
          subtitle = expression(italic(&amp;quot;Sapsucker Farm, 09:05:10 - 11:37:19 PDT&amp;quot;) ),
          caption = &amp;quot;Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds&amp;quot;
     ) +
     scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9 , 
                                            name =  expression( Temperature~(degree*C)),
                                            breaks = seq(16, 24, by = 1)) ) +
     theme_bw(base_size = 14) +
     theme(panel.grid = element_blank() ) 

g1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-10-14-color-gradient-background_files/figure-html/unnamed-chunk-8-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;switching-to-a-gray-scale&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Switching to a gray scale&lt;/h2&gt;
&lt;p&gt;The default blue color scheme for the segments actually works OK, but I was picturing going from white to dark. I picked gray colors with &lt;code&gt;grDevices::gray.colors()&lt;/code&gt; in &lt;code&gt;scale_color_gradient()&lt;/code&gt;. In &lt;code&gt;gray.colors()&lt;/code&gt;, &lt;code&gt;0&lt;/code&gt; is black and &lt;code&gt;1&lt;/code&gt; is white. I didn’t want the colors to go all the way to black, since that would make the temperature line impossible to see during totality. And, of course, it’s not actually pitch black during totality. 😁&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;g1 + scale_color_gradient(low = gray.colors(1, 0.25),
                          high = gray.colors(1, 1) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-10-14-color-gradient-background_files/figure-html/unnamed-chunk-9-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;using-segments-to-make-a-gradient-rectangle&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using segments to make a gradient rectangle&lt;/h1&gt;
&lt;p&gt;I can use this same approach on only a portion of the x axis to give the appearance of a rectangle with gradient fill. Here’s an example using times outside the eclipse.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;g2 = ggplot(temp) +
     geom_segment(data = color_dat,
                  aes(x = time, xend = time,
                      y = -Inf, yend = Inf, color = color),
                  show.legend = FALSE) +
     geom_line( aes(datetime, tempf), size = 1 ) +
     scale_x_datetime( date_breaks = &amp;quot;1 hour&amp;quot;,
                       date_labels = &amp;quot;%H:%M&amp;quot;,
                       expand = c(0, 0) ) +
     labs(y = expression( Temperature~(degree*F) ),
          x = NULL,
          title = &amp;quot;Temperature during 2017-08-21 solar eclipse&amp;quot;,
          subtitle = expression(italic(&amp;quot;Sapsucker Farm, Dallas, OR, USA&amp;quot;) ),
          caption = &amp;quot;Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds&amp;quot;
     ) +
     scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9 , 
                                            name =  expression( Temperature~(degree*C)),
                                            breaks = seq(12, 24, by = 2)) ) +
     scale_color_gradient(low = gray.colors(1, .25),
                          high = gray.colors(1, 1) ) +
     theme_bw(base_size = 14) +
     theme(panel.grid.major.x = element_blank(),
           panel.grid.minor = element_blank() ) 

g2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-10-14-color-gradient-background_files/figure-html/unnamed-chunk-10-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;bonus-annotations-with-curved-arrows&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Bonus: annotations with curved arrows&lt;/h2&gt;
&lt;p&gt;This second plot gives me a chance to try out Cédric Scherer’s &lt;a href=&#34;https://cedricscherer.netlify.com/2019/05/17/the-evolution-of-a-ggplot-ep.-1/#text&#34;&gt;very cool curved annotation arrow idea&lt;/a&gt; for the first time 🎉.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;g2 = g2 + 
     annotate(&amp;quot;text&amp;quot;, x = as.POSIXct(&amp;quot;2017-08-21 08:00&amp;quot;),
              y = 74, 
              label = &amp;quot;Partial eclipse begins\n09:05:10 PDT&amp;quot;,
              color = &amp;quot;grey24&amp;quot;) +
     annotate(&amp;quot;text&amp;quot;, x = as.POSIXct(&amp;quot;2017-08-21 09:00&amp;quot;),
              y = 57, 
              label = &amp;quot;Totality begins\n10:16:55 PDT&amp;quot;,
              color = &amp;quot;grey24&amp;quot;)
g2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-10-14-color-gradient-background_files/figure-html/unnamed-chunk-11-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;I’ll make a data.frame for the arrow start/end positions. I’m skipping all the work it took to get the positions where I wanted them, which is always iterative for me.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;arrows = data.frame(x1 = as.POSIXct( c(&amp;quot;2017-08-21 08:35&amp;quot;,
                                      &amp;quot;2017-08-21 09:34&amp;quot;) ),
                    x2 = c(eclipse$start, totality$start),
                    y1 = c(74, 57.5),
                    y2 = c(72.5, 60) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I add arrows with &lt;code&gt;geom_curve()&lt;/code&gt;. I changed the size of the arrowhead and made it closed in &lt;code&gt;arrow()&lt;/code&gt;. I also thought the arrows looked better with a little less curvature.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;g2 +
     geom_curve(data = arrows,
                aes(x = x1, xend = x2,
                    y = y1, yend = y2),
                arrow = arrow(length = unit(0.075, &amp;quot;inches&amp;quot;),
                              type = &amp;quot;closed&amp;quot;),
                curvature = 0.25)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-10-14-color-gradient-background_files/figure-html/unnamed-chunk-13-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;other-ways-to-make-a-gradient-color-background&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Other ways to make a gradient color background&lt;/h1&gt;
&lt;p&gt;Based on a bunch of internet searches, it looks like a gradient background in &lt;strong&gt;ggplot2&lt;/strong&gt; generally takes some work. There are some nice examples out there on how to use &lt;code&gt;rasterGrob()&lt;/code&gt; and &lt;code&gt;annotation_custom()&lt;/code&gt; to add background gradients, such as &lt;a href=&#34;https://stackoverflow.com/questions/48596582/change-orientation-of-grob-background-gradient&#34;&gt;in this Stack Overflow question&lt;/a&gt;. I haven’t researched how to make this go from light to dark and back to light for the uneven time scale like in my example.&lt;/p&gt;
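&lt;p&gt;As a rough sketch of that grob-based approach (a hypothetical example, not from my analysis above), a vector of colors can be stretched across the whole panel with &lt;code&gt;grid::rasterGrob()&lt;/code&gt; plus &lt;code&gt;annotation_custom()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(grid) # for rasterGrob()

# A vector of colors is treated as a one-column raster,
# so interpolating it while filling the panel gives a vertical gradient
gradient = rasterGrob(gray.colors(100, start = 1, end = 0.25),
                      width = unit(1, &amp;quot;npc&amp;quot;),
                      height = unit(1, &amp;quot;npc&amp;quot;),
                      interpolate = TRUE)

ggplot(plottemp) +
     annotation_custom(gradient, xmin = -Inf, xmax = Inf,
                       ymin = -Inf, ymax = Inf) +
     geom_line( aes(datetime, tempf), size = 1 )&lt;/code&gt;&lt;/pre&gt;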
&lt;p&gt;I’ve also seen approaches involving dataset expansion and drawing many filled rectangles or using rasters, which is like what I did with &lt;code&gt;geom_segment()&lt;/code&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;eclipses&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Eclipses!&lt;/h1&gt;
&lt;p&gt;Before actually experiencing totality, it seemed to me like the difference between a 99% and a 100% eclipse wasn’t a big deal. I mean, those numbers &lt;em&gt;are&lt;/em&gt; pretty darn close.&lt;/p&gt;
&lt;p&gt;I was very wrong. 😜&lt;/p&gt;
&lt;p&gt;If you are ever lucky enough to be near a path of totality, definitely try to get there, even if it’s a little more trouble than the 99.9% partial eclipse. It’s an amazing experience. 😻&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/img/eclipse.png&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;just-the-code-please&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Just the code, please&lt;/h1&gt;
&lt;p&gt;Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code &lt;a href=&#34;https://aosmith.rbind.io/script/2019-10-14-color-gradient-background.R&#34;&gt;from here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2) # 3.2.1
library(dplyr) # 0.8.3

head(temp)
eclipse = data.frame(start = as.POSIXct(&amp;quot;2017-08-21 09:05:10&amp;quot;),
                     end = as.POSIXct(&amp;quot;2017-08-21 11:37:19&amp;quot;) )

totality = data.frame(start = as.POSIXct(&amp;quot;2017-08-21 10:16:55&amp;quot;),
                      end = as.POSIXct(&amp;quot;2017-08-21 10:18:52&amp;quot;) )

plottemp = filter(temp, between(datetime, 
                                as.POSIXct(&amp;quot;2017-08-21 09:00:00&amp;quot;),
                                as.POSIXct(&amp;quot;2017-08-21 12:00:00&amp;quot;) ) )
ggplot(plottemp) +
     geom_line( aes(datetime, tempf), size = 1 ) +
     scale_x_datetime( date_breaks = &amp;quot;15 min&amp;quot;,
                       date_labels = &amp;quot;%H:%M&amp;quot;,
                       expand = c(0, 0) ) +
     coord_cartesian(xlim = c(eclipse$start, eclipse$end) ) +
     labs(y = expression( Temperature~(degree*F) ),
          x = NULL,
          title = &amp;quot;Temperature during 2017-08-21 solar eclipse&amp;quot;,
          subtitle = expression(italic(&amp;quot;Sapsucker Farm, 09:05:10 - 11:37:19 PDT&amp;quot;) ),
          caption = &amp;quot;Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds&amp;quot;
     ) +
     scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9 , 
                                            name =  expression( Temperature~(degree*C)),
                                            breaks = seq(16, 24, by = 1)) ) +
     theme_bw(base_size = 14) +
     theme(panel.grid = element_blank() ) 
color_dat = data.frame(time = seq(eclipse$start, eclipse$end, by = &amp;quot;1 sec&amp;quot;) )
color_dat = mutate(color_dat,
                   color = 0,
                   color = replace(color, 
                                   time &amp;lt; totality$start, 
                                   seq(100, 0, length.out = sum(time &amp;lt; totality$start) ) ),
                   color = replace(color, 
                                   time &amp;gt; totality$end, 
                                   seq(0, 100, length.out = sum(time &amp;gt; totality$end) ) ) )
g1 = ggplot(plottemp) +
     geom_segment(data = color_dat,
                  aes(x = time, xend = time,
                      y = -Inf, yend = Inf, color = color),
                  show.legend = FALSE) +
     geom_line( aes(datetime, tempf), size = 1 ) +
     scale_x_datetime( date_breaks = &amp;quot;15 min&amp;quot;,
                       date_labels = &amp;quot;%H:%M&amp;quot;,
                       expand = c(0, 0) ) +
     coord_cartesian(xlim = c(eclipse$start, eclipse$end) ) +
     labs(y = expression( Temperature~(degree*F) ),
          x = NULL,
          title = &amp;quot;Temperature during 2017-08-21 solar eclipse&amp;quot;,
          subtitle = expression(italic(&amp;quot;Sapsucker Farm, 09:05:10 - 11:37:19 PDT&amp;quot;) ),
          caption = &amp;quot;Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds&amp;quot;
     ) +
     scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9 , 
                                            name =  expression( Temperature~(degree*C)),
                                            breaks = seq(16, 24, by = 1)) ) +
     theme_bw(base_size = 14) +
     theme(panel.grid = element_blank() ) 

g1

g1 + scale_color_gradient(low = gray.colors(1, 0.25),
                          high = gray.colors(1, 1) )
g2 = ggplot(temp) +
     geom_segment(data = color_dat,
                  aes(x = time, xend = time,
                      y = -Inf, yend = Inf, color = color),
                  show.legend = FALSE) +
     geom_line( aes(datetime, tempf), size = 1 ) +
     scale_x_datetime( date_breaks = &amp;quot;1 hour&amp;quot;,
                       date_labels = &amp;quot;%H:%M&amp;quot;,
                       expand = c(0, 0) ) +
     labs(y = expression( Temperature~(degree*F) ),
          x = NULL,
          title = &amp;quot;Temperature during 2017-08-21 solar eclipse&amp;quot;,
          subtitle = expression(italic(&amp;quot;Sapsucker Farm, Dallas, OR, USA&amp;quot;) ),
          caption = &amp;quot;Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds&amp;quot;
     ) +
     scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9 , 
                                            name =  expression( Temperature~(degree*C)),
                                            breaks = seq(12, 24, by = 2)) ) +
     scale_color_gradient(low = gray.colors(1, .25),
                          high = gray.colors(1, 1) ) +
     theme_bw(base_size = 14) +
     theme(panel.grid.major.x = element_blank(),
           panel.grid.minor = element_blank() ) 

g2
g2 = g2 + 
     annotate(&amp;quot;text&amp;quot;, x = as.POSIXct(&amp;quot;2017-08-21 08:00&amp;quot;),
              y = 74, 
              label = &amp;quot;Partial eclipse begins\n09:05:10 PDT&amp;quot;,
              color = &amp;quot;grey24&amp;quot;) +
     annotate(&amp;quot;text&amp;quot;, x = as.POSIXct(&amp;quot;2017-08-21 09:00&amp;quot;),
              y = 57, 
              label = &amp;quot;Totality begins\n10:16:55 PDT&amp;quot;,
              color = &amp;quot;grey24&amp;quot;)
g2

arrows = data.frame(x1 = as.POSIXct( c(&amp;quot;2017-08-21 08:35&amp;quot;,
                                      &amp;quot;2017-08-21 09:34&amp;quot;) ),
                    x2 = c(eclipse$start, totality$start),
                    y1 = c(74, 57.5),
                    y2 = c(72.5, 60) )
g2 +
     geom_curve(data = arrows,
                aes(x = x1, xend = x2,
                    y = y1, yend = y2),
                arrow = arrow(length = unit(0.075, &amp;quot;inches&amp;quot;),
                              type = &amp;quot;closed&amp;quot;),
                curvature = 0.25)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Expanding binomial counts to binary 0/1 with purrr::pmap()</title>
      <link>https://aosmith.rbind.io/2019/10/04/expanding-binomial-to-binary/</link>
      <pubDate>Fri, 04 Oct 2019 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2019/10/04/expanding-binomial-to-binary/</guid>
      <description>


&lt;p&gt;Data on successes and failures can be summarized and analyzed as counted proportions via the binomial distribution or as long format 0/1 binary data. I most often see summarized data when there are multiple trials done within a study unit; for example, when tallying up the number of dead trees out of the total number of trees in a plot.&lt;/p&gt;
&lt;p&gt;If these within-plot trials are all independent, analyzing data in a binary format instead of summarized binomial counts doesn’t change the statistical results. If trials are not independent, though, neither approach works correctly and we would see overdispersion/underdispersion in a binomial model. The confusing piece in this is that binary data &lt;a href=&#34;https://stat.ethz.ch/pipermail/r-sig-mixed-models/2011q1/015534.html&#34;&gt;by definition can’t be overdispersed&lt;/a&gt; and so the lack of fit from non-independence can’t be diagnosed with standard overdispersion checks when working with binary data.&lt;/p&gt;
&lt;p&gt;In a future post I’ll talk more about simulating data to explore binomial overdispersion and how lack of fit can be diagnosed in binomial vs binary datasets. Today, however, my goal is to show how to take binomial count data and expand it into binary data.&lt;/p&gt;
&lt;p&gt;In the past I’ve done the data expansion with &lt;code&gt;rowwise()&lt;/code&gt; and &lt;code&gt;do()&lt;/code&gt; from package &lt;strong&gt;dplyr&lt;/strong&gt;, but these days I’m using &lt;code&gt;purrr::pmap_dfr()&lt;/code&gt;. I’ll demonstrate the &lt;code&gt;pmap_dfr()&lt;/code&gt; approach as well as a &lt;code&gt;nest()&lt;/code&gt;/&lt;code&gt;unnest()&lt;/code&gt; approach using functions from &lt;strong&gt;tidyr&lt;/strong&gt;.&lt;/p&gt;
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#load-r-packages&#34;&gt;Load R packages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-dataset&#34;&gt;The dataset&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#expanding-binomial-to-binary-with-pmap_dfr&#34;&gt;Expanding binomial to binary with pmap_dfr()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#aside-pmap-functions-with-more-columns&#34;&gt;Aside: pmap functions with more columns&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#comparing-analysis-results&#34;&gt;Comparing analysis results&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#binomial-model&#34;&gt;Binomial model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#binary-model&#34;&gt;Binary model&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#expanding-binomial-to-binary-via-nesting&#34;&gt;Expanding binomial to binary via nesting&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#just-the-code-please&#34;&gt;Just the code, please&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;load-r-packages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Load R packages&lt;/h1&gt;
&lt;p&gt;I’m using &lt;strong&gt;purrr&lt;/strong&gt; for looping through rows with &lt;code&gt;pmap_dfr()&lt;/code&gt;. I also load &lt;strong&gt;dplyr&lt;/strong&gt; and &lt;strong&gt;tidyr&lt;/strong&gt; for a &lt;code&gt;nest()&lt;/code&gt;/&lt;code&gt;unnest()&lt;/code&gt; approach.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(purrr) # 0.3.2
library(tidyr) # 1.0.0
library(dplyr) # 0.8.3&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;the-dataset&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The dataset&lt;/h1&gt;
&lt;p&gt;I created a dataset with a total of 8 plots, 4 plots in each of two groups. The data has been summarized up to the plot level. The number of trials (&lt;code&gt;total&lt;/code&gt;) per plot varied. The number of successes observed is in &lt;code&gt;num_dead&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat = structure(list(plot = structure(1:8, .Label = c(&amp;quot;plot1&amp;quot;, &amp;quot;plot2&amp;quot;, 
&amp;quot;plot3&amp;quot;, &amp;quot;plot4&amp;quot;, &amp;quot;plot5&amp;quot;, &amp;quot;plot6&amp;quot;, &amp;quot;plot7&amp;quot;, &amp;quot;plot8&amp;quot;), class = &amp;quot;factor&amp;quot;), 
    group = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c(&amp;quot;g1&amp;quot;, 
    &amp;quot;g2&amp;quot;), class = &amp;quot;factor&amp;quot;), num_dead = c(4L, 6L, 6L, 5L, 1L, 4L, 
    3L, 2L), total = c(5L, 7L, 9L, 7L, 8L, 10L, 10L, 7L)), class = &amp;quot;data.frame&amp;quot;, row.names = c(NA, 
-8L))

dat&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#    plot group num_dead total
# 1 plot1    g1        4     5
# 2 plot2    g1        6     7
# 3 plot3    g1        6     9
# 4 plot4    g1        5     7
# 5 plot5    g2        1     8
# 6 plot6    g2        4    10
# 7 plot7    g2        3    10
# 8 plot8    g2        2     7&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;expanding-binomial-to-binary-with-pmap_dfr&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Expanding binomial to binary with pmap_dfr()&lt;/h1&gt;
&lt;p&gt;To make the binomial data into binary data, I need to make a vector with a &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt; for every “success” listed in &lt;code&gt;num_dead&lt;/code&gt; and a &lt;span class=&#34;math inline&#34;&gt;\(0\)&lt;/span&gt; for every “failure” (the total number of trials minus the number of successes). Since I want to do a &lt;em&gt;rowwise&lt;/em&gt; operation I’ll use one of the &lt;code&gt;pmap&lt;/code&gt; functions. I want my output to be a data.frame so I use &lt;code&gt;pmap_dfr()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I use an anonymous function within &lt;code&gt;pmap_dfr()&lt;/code&gt; for creating the output I want from each row. I purposely make the names of the function arguments match the column names. You can either match on position or on names in &lt;code&gt;pmap&lt;/code&gt; functions, and I tend to go for name matching. You can use formula coding with the tilde in &lt;code&gt;pmap&lt;/code&gt; variants, but I find the code more difficult to understand when I have more than three or so columns.&lt;/p&gt;
&lt;p&gt;Within the function I make a column for the response variable, repeating &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt; &lt;code&gt;num_dead&lt;/code&gt; times and &lt;span class=&#34;math inline&#34;&gt;\(0\)&lt;/span&gt; &lt;code&gt;total - num_dead&lt;/code&gt; times for each row of the original data. I’m taking advantage of &lt;a href=&#34;http://www.hep.by/gnu/r-patched/r-lang/R-lang_41.html&#34;&gt;recycling&lt;/a&gt; in &lt;code&gt;data.frame()&lt;/code&gt; to keep the &lt;code&gt;plot&lt;/code&gt; and &lt;code&gt;group&lt;/code&gt; columns in the output, as well.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;binary_dat = pmap_dfr(dat, 
                      function(group, plot, num_dead, total) {
                           data.frame(plot = plot,
                                      group = group,
                                      dead = c( rep(1, num_dead),
                                                rep(0, total - num_dead) ) )
                      }
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here are the first 6 rows of this new dataset. You can see for the first plot, &lt;code&gt;plot1&lt;/code&gt;, there are five rows of &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt; and one row of &lt;span class=&#34;math inline&#34;&gt;\(0\)&lt;/span&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;head(binary_dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#    plot group dead
# 1 plot1    g1    1
# 2 plot1    g1    1
# 3 plot1    g1    1
# 4 plot1    g1    1
# 5 plot1    g1    0
# 6 plot2    g1    1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This matches the information in the first row of the original dataset.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat[1, ]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#    plot group num_dead total
# 1 plot1    g1        4     5&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;aside-pmap-functions-with-more-columns&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Aside: pmap functions with more columns&lt;/h1&gt;
&lt;p&gt;My anonymous function in &lt;code&gt;pmap_dfr()&lt;/code&gt; works fine in its current form as long as every column is included as a function argument. If I had extra columns that I didn’t want to remove and wasn’t using in the function, however, I would get an error.&lt;/p&gt;
&lt;p&gt;To bypass this problem you can add dots, &lt;code&gt;...&lt;/code&gt;, to the anonymous function to refer to all other columns not being used.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;function(group, plot, num_dead, total, ...)&lt;/code&gt;&lt;/pre&gt;
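&lt;p&gt;For example, if I add a hypothetical extra column to the dataset, the dots absorb it and the expansion runs without error:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat_extra = dplyr::mutate(dat, site = &amp;quot;siteA&amp;quot;) # extra, unused column

binary_dat_extra = pmap_dfr(dat_extra, 
                            function(group, plot, num_dead, total, ...) {
                                 data.frame(plot = plot,
                                            group = group,
                                            dead = c( rep(1, num_dead),
                                                      rep(0, total - num_dead) ) )
                            }
)&lt;/code&gt;&lt;/pre&gt;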
&lt;/div&gt;
&lt;div id=&#34;comparing-analysis-results&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Comparing analysis results&lt;/h1&gt;
&lt;p&gt;While I definitely learned that binomial data analyzed in binary format gives identical results when fit as a GLM, for some reason I often have to re-convince myself this is true. 😜 This is clear when I do an analysis with each dataset and compare results.&lt;/p&gt;
&lt;div id=&#34;binomial-model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Binomial model&lt;/h2&gt;
&lt;p&gt;Here’s results from comparing the two groups for the binomial model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit = glm( cbind(num_dead, total - num_dead) ~ group, 
           data = dat,
           family = binomial)
summary(fit)$coefficients&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#              Estimate Std. Error   z value     Pr(&amp;gt;|z|)
# (Intercept)  1.098612  0.4364358  2.517237 0.0118279240
# groupg2     -2.014903  0.5748706 -3.504968 0.0004566621&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;binary-model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Binary model&lt;/h2&gt;
&lt;p&gt;The binary model gives identical results for estimates and statistical tests.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit_binary = glm( dead ~ group, 
                  data = binary_dat,
                  family = binomial)
summary(fit_binary)$coefficients&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#              Estimate Std. Error   z value     Pr(&amp;gt;|z|)
# (Intercept)  1.098612  0.4364354  2.517239 0.0118278514
# groupg2     -2.014903  0.5748701 -3.504971 0.0004566575&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;expanding-binomial-to-binary-via-nesting&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Expanding binomial to binary via nesting&lt;/h1&gt;
&lt;p&gt;Doing the expansion with nesting plus &lt;code&gt;purrr::map()&lt;/code&gt; inside &lt;code&gt;mutate()&lt;/code&gt; is another option, although this seems less straightforward to me for this particular case. It does keep the other variables in the dataset, though, without having to manually include them in the output data.frame like I did above.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;binary_dat2 = dat %&amp;gt;%
     nest(data = c(num_dead, total) ) %&amp;gt;%
     mutate(dead = map(data, ~c( rep(1, .x$num_dead),
                                 rep(0, .x$total - .x$num_dead) ) ) ) %&amp;gt;%
     select(-data) %&amp;gt;%
     unnest(dead)
head(binary_dat2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# # A tibble: 6 x 3
#   plot  group  dead
#   &amp;lt;fct&amp;gt; &amp;lt;fct&amp;gt; &amp;lt;dbl&amp;gt;
# 1 plot1 g1        1
# 2 plot1 g1        1
# 3 plot1 g1        1
# 4 plot1 g1        1
# 5 plot1 g1        0
# 6 plot2 g1        1&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;just-the-code-please&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Just the code, please&lt;/h1&gt;
&lt;p&gt;Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code &lt;a href=&#34;https://aosmith.rbind.io/script/2019-10-04-expanding-binomial-to-binary.R&#34;&gt;from here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(purrr) # 0.3.2
library(tidyr) # 1.0.0
library(dplyr) # 0.8.3

dat = structure(list(plot = structure(1:8, .Label = c(&amp;quot;plot1&amp;quot;, &amp;quot;plot2&amp;quot;, 
&amp;quot;plot3&amp;quot;, &amp;quot;plot4&amp;quot;, &amp;quot;plot5&amp;quot;, &amp;quot;plot6&amp;quot;, &amp;quot;plot7&amp;quot;, &amp;quot;plot8&amp;quot;), class = &amp;quot;factor&amp;quot;), 
    group = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c(&amp;quot;g1&amp;quot;, 
    &amp;quot;g2&amp;quot;), class = &amp;quot;factor&amp;quot;), num_dead = c(4L, 6L, 6L, 5L, 1L, 4L, 
    3L, 2L), total = c(5L, 7L, 9L, 7L, 8L, 10L, 10L, 7L)), class = &amp;quot;data.frame&amp;quot;, row.names = c(NA, 
-8L))

dat

binary_dat = pmap_dfr(dat, 
                      function(group, plot, num_dead, total) {
                           data.frame(plot = plot,
                                      group = group,
                                      dead = c( rep(1, num_dead),
                                                rep(0, total - num_dead) ) )
                      }
)

head(binary_dat)
dat[1, ]

function(group, plot, num_dead, total, ...)
     
fit = glm( cbind(num_dead, total - num_dead) ~ group, 
           data = dat,
           family = binomial)
summary(fit)$coefficients

fit_binary = glm( dead ~ group, 
                  data = binary_dat,
                  family = binomial)
summary(fit_binary)$coefficients

binary_dat2 = dat %&amp;gt;%
     nest(data = c(num_dead, total) ) %&amp;gt;%
     mutate(dead = map(data, ~c( rep(1, .x$num_dead),
                                 rep(0, .x$total - .x$num_dead) ) ) ) %&amp;gt;%
     select(-data) %&amp;gt;%
     unnest(dead)
head(binary_dat2)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>More exploratory plots with ggplot2 and purrr: Adding conditional elements</title>
      <link>https://aosmith.rbind.io/2019/09/27/more-exploratory-plots/</link>
      <pubDate>Fri, 27 Sep 2019 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2019/09/27/more-exploratory-plots/</guid>
      <description>


&lt;p&gt;This summer I was asked to collaborate on an analysis project with many response variables. As usual, I planned on automating my initial graphical data exploration through the use of functions and &lt;code&gt;purrr::map()&lt;/code&gt; as &lt;a href=&#34;https://aosmith.rbind.io/2018/08/20/automating-exploratory-plots/&#34;&gt;I’ve written about previously&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;However, this particular project was a follow-up to a previous analysis. In the original analysis, different variables were analyzed on different scales. I wanted to put the new plots on whatever scale each variable had been analyzed in originally. If I was going to automate the plotting, which I definitely wanted to do with so many variables 😄, I needed to add conditional elements.&lt;/p&gt;
&lt;p&gt;This post demonstrates how I used &lt;code&gt;if()&lt;/code&gt; statements within my plotting function to use different plotting elements depending on which variable I was plotting.&lt;/p&gt;
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#r-packages&#34;&gt;R packages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-data&#34;&gt;The data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#initial-plotting-function&#34;&gt;Initial plotting function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#adding-a-conditional-axis-scale&#34;&gt;Adding a conditional axis scale&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#adding-a-conditional-caption&#34;&gt;Adding a conditional caption&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#looping-through-the-variables&#34;&gt;Looping through the variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#just-the-code-please&#34;&gt;Just the code, please&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;r-packages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;R packages&lt;/h1&gt;
&lt;p&gt;I will use &lt;strong&gt;ggplot2&lt;/strong&gt; for plotting and &lt;strong&gt;purrr&lt;/strong&gt; for looping through variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2) # v. 3.2.1
library(purrr) # v. 0.3.2&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;the-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The data&lt;/h1&gt;
&lt;p&gt;My simplified example dataset, &lt;code&gt;dat&lt;/code&gt;, contains three response variables, &lt;code&gt;cov_plant&lt;/code&gt;, &lt;code&gt;cov_oth&lt;/code&gt;, and &lt;code&gt;gap&lt;/code&gt;. I created two categorical explanatory variables, &lt;code&gt;year&lt;/code&gt; with 3 levels and &lt;code&gt;trt&lt;/code&gt; with two levels.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat = structure(list(cov_plant = c(3.7, 1.8, 7.5, 0.4, 7.9, 1.2, 0.7, 
2.3, 6.9, 4.1, 17.7, 2.4, 0.9, 14.3, 4.9, 0, 4.1, 3.6, 1.1, 7.7, 
0, 1.5, 1.7, 11.5, 0.8, 12.3, 7.1, 6.9, 5.6, 2.7, 1, 2.5, 2, 
0.7, 0.7, 2.9, 4, 2.5, 2.9, 1.5, 0.5, 22.8, 2.8, 1.4, 1, 2.9, 
2.4, 4.1, 4.1, 1.9, 2.8, 5, 5.7, 5.6, 0, 4.6, 8.1, 0.5, 88.9, 
1), cov_oth = c(11.5, 63.2, 34, 65.5, 28.8, 8.6, 7.1, 65.5, 12.1, 
3, 23.6, 3.8, 24.9, 55.9, 24.2, 78.2, 81.1, 10.7, 30.7, 23.5, 
10.1, 4.6, 45.7, 37.6, 81.3, 39.1, 50.8, 75.8, 78.2, 23.9, 53, 
51.1, 2.5, 40.2, 15.9, 91.3, 44, 72.9, 82.7, 42.4, 94.1, 23, 
86.2, 50.1, 88.9, 80.5, 34.2, 68.7, 45, 13.9, 44.2, 85, 79.6, 
1, 45.3, 69.5, 89.6, 22.2, 1.3, 88), gap = c(2.8, 11.8, 0.3, 
17.2, 18.3, 1.4, 19.6, 19.4, 2.6, 66.3, 97.1, 17, 381.5, 15.7, 
8.3, 2.4, 3.8, 3.8, 246.6, 43.2, 16.7, 6.6, 3.1, 2.4, 3.2, 4.3, 
0.3, 2.1, 41.7, 68.9, 5.1, 5.7, 0.4, 35.5, 1.1, 10.8, 5, 11.8, 
75.5, 5.4, 12.6, 5.2, 11.4, 6.8, 5.3, 1.1, 3.2, 2.9, 5.2, 0.2, 
1.5, 0.6, 7.4, 18.6, 11.7, 1.6, 13.7, 7.1, 19.9, 16.8), year = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c(&amp;quot;Year 1&amp;quot;, 
&amp;quot;Year 2&amp;quot;, &amp;quot;Year 3&amp;quot;), class = &amp;quot;factor&amp;quot;), trt = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c(&amp;quot;a&amp;quot;, 
&amp;quot;b&amp;quot;), class = &amp;quot;factor&amp;quot;)), row.names = c(NA, -60L), class = &amp;quot;data.frame&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here are the first six rows of this dataset.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;head(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#   cov_plant cov_oth  gap   year trt
# 1       3.7    11.5  2.8 Year 1   a
# 2       1.8    63.2 11.8 Year 1   a
# 3       7.5    34.0  0.3 Year 1   a
# 4       0.4    65.5 17.2 Year 1   a
# 5       7.9    28.8 18.3 Year 1   a
# 6       1.2     8.6  1.4 Year 1   a&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I spent a fair amount of time sleuthing out which variables were used and which transformations were done in the original analysis. It turned out many (but not all) of the variables in that analysis were log transformed. Variables that contained 0 values were shifted by a fixed constant prior to the log transformation. A different constant was used for different variables.&lt;/p&gt;
&lt;p&gt;I made a dataset of variable metadata to help me keep all this information organized. This dataset contains a row for each variable along with a description of what that variable was, the units the variable was measured in, the transformation used for analysis, and the constant used to shift the variable. I’ll call this dataset &lt;code&gt;resp_dat&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This metadata dataset was key in adding conditional elements to my plotting function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;resp_dat = structure(list(variable = structure(c(2L, 1L, 3L), .Label = c(&amp;quot;cov_oth&amp;quot;, 
&amp;quot;cov_plant&amp;quot;, &amp;quot;gap&amp;quot;), class = &amp;quot;factor&amp;quot;), description = structure(3:1, .Label = c(&amp;quot;Gap size&amp;quot;, 
&amp;quot;Other cover&amp;quot;, &amp;quot;Plant cover&amp;quot;), class = &amp;quot;factor&amp;quot;), units = structure(c(1L, 
1L, 2L), .Label = c(&amp;quot;%&amp;quot;, &amp;quot;m&amp;quot;), class = &amp;quot;factor&amp;quot;), transformation = structure(c(2L, 
1L, 2L), .Label = c(&amp;quot;identity&amp;quot;, &amp;quot;log&amp;quot;), class = &amp;quot;factor&amp;quot;), constant = c(0.3, 
0, 0)), class = &amp;quot;data.frame&amp;quot;, row.names = c(NA, -3L))

resp_dat&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#    variable description units transformation constant
# 1 cov_plant Plant cover     %            log      0.3
# 2   cov_oth Other cover     %       identity      0.0
# 3       gap    Gap size     m            log      0.0&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;initial-plotting-function&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Initial plotting function&lt;/h1&gt;
&lt;p&gt;I set up my initial plotting function to make a scatterplot of the raw data per &lt;code&gt;year&lt;/code&gt; for each &lt;code&gt;trt&lt;/code&gt;. Different &lt;code&gt;trt&lt;/code&gt; were indicated by shapes and colors, and I added group means as larger symbols connected by lines.&lt;/p&gt;
&lt;p&gt;You can see that my plotting code ended up fairly complicated. I’m skipping the (many!) steps it took to get to this point. While I don’t show the process here, you can rest assured that I did a &lt;strong&gt;lot&lt;/strong&gt; of testing to work out the plot structure prior to making the plotting function below. 😉&lt;/p&gt;
&lt;p&gt;In addition to the plotting code, my &lt;code&gt;plot_fun()&lt;/code&gt; function includes a line where I subset the &lt;code&gt;resp_dat&lt;/code&gt; dataset to only the row of metadata for the response variable used in the plot. I use this information to add the constant to &lt;code&gt;y&lt;/code&gt; and make a plot title with a description of the variable plus the units.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_fun = function(data = dat, respdata = resp_dat, response) {
     
     respvar = subset(respdata, variable == response)

     ggplot(data = data, aes(x = year, 
                             y = .data[[response]] + respvar$constant,
                             shape = trt, 
                             color = trt,
                             group = trt) ) +
          geom_point(position = position_dodge(width = 0.5),
                   alpha = 0.25,
                   size = 2, show.legend = FALSE) +
          stat_summary(fun.y = mean, geom = &amp;quot;point&amp;quot;,
                       position = position_dodge(width = 0.5),
                       size = 4, show.legend = FALSE) +
          stat_summary(fun.y = mean, geom = &amp;quot;line&amp;quot;,
                       position = position_dodge(width = 0.5),
                       size = 1, key_glyph = &amp;quot;rect&amp;quot;) +
          theme_bw(base_size = 14) +
          theme(legend.position = &amp;quot;bottom&amp;quot;,
                legend.direction = &amp;quot;horizontal&amp;quot;,
                legend.box.spacing = unit(0, &amp;quot;cm&amp;quot;),
                legend.text = element_text(margin = margin(l = -.2, unit = &amp;quot;cm&amp;quot;) ),
                panel.grid.minor.y = element_blank() ) +
          scale_color_grey(name = &amp;quot;&amp;quot;,
                           labels = c(&amp;quot;A Treatment&amp;quot;, &amp;quot;B Treatment&amp;quot;),
                           start = 0, end = 0.5) +
          labs(x = &amp;quot;Year since treatment&amp;quot;,
               title = paste0(respvar$description, &amp;quot; (&amp;quot;, respvar$units, &amp;quot;)&amp;quot;),
               y = NULL)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here is what the plot looks like for &lt;code&gt;cov_plant&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_fun(response = &amp;quot;cov_plant&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-09-27-more-exploratory-plots_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;adding-a-conditional-axis-scale&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Adding a conditional axis scale&lt;/h1&gt;
&lt;p&gt;I wanted to put variables that were log transformed in the original analysis on the log scale. Since not all variables were transformed, I wanted to use &lt;code&gt;scale_y_log10()&lt;/code&gt; for log transformed variables and the standard scale for everything else.&lt;/p&gt;
&lt;p&gt;To achieve this, I will assign my base plot a name within the function so I can add on to it conditionally. I name it &lt;code&gt;g1&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I use the &lt;code&gt;transformation&lt;/code&gt; column in the variable metadata to check whether a log transformation was done, using &lt;code&gt;grepl()&lt;/code&gt;. If it was, I add &lt;code&gt;scale_y_log10()&lt;/code&gt; to the existing plot. Otherwise I return the plot on the original scale.&lt;/p&gt;
&lt;p&gt;Here is the code I’ll add to the end of my function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;if( grepl(&amp;quot;log&amp;quot;, respvar$transformation) ) {
          g1 + scale_y_log10()
     } else {
          g1
     }&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here I used the metadata dataset I created in the &lt;code&gt;if()&lt;/code&gt; statement, but you could do something similar if the transformation were part of each variable’s name.&lt;/p&gt;
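&lt;p&gt;For example, with hypothetical variable names like &lt;code&gt;log_cov_plant&lt;/code&gt; that encode the transformation as a prefix, the check could be based on the name itself. This is just a sketch with a stand-in base plot, since the real &lt;code&gt;g1&lt;/code&gt; is built inside the plotting function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2)

# Hypothetical: the transformation is encoded in the variable name
# instead of in a metadata dataset
response = &amp;quot;log_cov_plant&amp;quot;

# A stand-in base plot to add on to conditionally
g1 = ggplot(mtcars, aes(x = wt, y = mpg) ) + geom_point()

# Check the name itself instead of a metadata column
if( grepl(&amp;quot;^log_&amp;quot;, response) ) {
     g1 + scale_y_log10()
} else {
     g1
}&lt;/code&gt;&lt;/pre&gt;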
&lt;p&gt;Here is what the plotting function looks like now.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_fun2 = function(data = dat, respdata = resp_dat, response) {
     
     respvar = subset(respdata, variable == response)

     g1 = ggplot(data = data, aes(x = year, 
                                  y = .data[[response]] + respvar$constant,
                                  shape = trt, 
                                  color = trt,
                                  group = trt) ) +
          geom_point(position = position_dodge(width = 0.5),
                     alpha = 0.25,
                     size = 2, show.legend = FALSE) +
          stat_summary(fun.y = mean, geom = &amp;quot;point&amp;quot;,
                       position = position_dodge(width = 0.5),
                       size = 4, show.legend = FALSE) +
          stat_summary(fun.y = mean, geom = &amp;quot;line&amp;quot;,
                       position = position_dodge(width = 0.5),
                       size = 1, key_glyph = &amp;quot;rect&amp;quot;) +
          theme_bw(base_size = 14) +
          theme(legend.position = &amp;quot;bottom&amp;quot;,
                legend.direction = &amp;quot;horizontal&amp;quot;,
                legend.box.spacing = unit(0, &amp;quot;cm&amp;quot;),
                legend.text = element_text(margin = margin(l = -.2, unit = &amp;quot;cm&amp;quot;) ),
                panel.grid.minor.y = element_blank() ) +
          scale_color_grey(name = &amp;quot;&amp;quot;,
                           labels = c(&amp;quot;A Treatment&amp;quot;, &amp;quot;B Treatment&amp;quot;),
                           start = 0, end = 0.5) +
          labs(x = &amp;quot;Year since treatment&amp;quot;,
               title = paste0(respvar$description, &amp;quot; (&amp;quot;, respvar$units, &amp;quot;)&amp;quot;),
               y = NULL)
     
     if( grepl(&amp;quot;log&amp;quot;, respvar$transformation) ) {
          g1 + scale_y_log10()
     } else {
          g1
     }
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now when I use the function on a log transformed variable like &lt;code&gt;cov_plant&lt;/code&gt;, the y axis is on the log scale.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_fun2(response = &amp;quot;cov_plant&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-09-27-more-exploratory-plots_files/figure-html/unnamed-chunk-9-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;But the y axis for &lt;code&gt;cov_oth&lt;/code&gt;, which was analyzed on the original scale, is not.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_fun2(response = &amp;quot;cov_oth&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-09-27-more-exploratory-plots_files/figure-html/unnamed-chunk-10-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;adding-a-conditional-caption&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Adding a conditional caption&lt;/h1&gt;
&lt;p&gt;After I changed the y axis scale for some variables, I decided I should make sure that the scale of the axis is clear to the reader. In addition, I wanted to highlight the fact that some variables were shifted prior to transformation. I decided to create a conditional caption with this information, which can then be added to the plot in &lt;code&gt;labs()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Since there were three conditions, a log transformation with an added constant, a log transformation alone, or no transformation, I used &lt;code&gt;if()&lt;/code&gt;-&lt;code&gt;else if()&lt;/code&gt;-&lt;code&gt;else&lt;/code&gt; to do this.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;caption_text = {
     if(respvar$constant != 0 ) {
          paste0(&amp;quot;Y axis on log scale &amp;quot;,
                 &amp;quot;(added constant &amp;quot;,
                 respvar$constant, &amp;quot;)&amp;quot;)
     } else if(!grepl(&amp;quot;log&amp;quot;, respvar$transformation) ) {
          &amp;quot;Y axis on original scale&amp;quot;
     } else {
          &amp;quot;Y axis on log scale&amp;quot;
     }
}&lt;/code&gt;&lt;/pre&gt;
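&lt;p&gt;To preview what this logic produces, it can be wrapped in a small helper function (a hypothetical helper for checking, not part of the final plotting function) and applied to each variable’s constant and transformation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical helper reproducing the caption logic for one metadata row
get_caption = function(constant, transformation) {
     if(constant != 0 ) {
          paste0(&amp;quot;Y axis on log scale &amp;quot;,
                 &amp;quot;(added constant &amp;quot;, constant, &amp;quot;)&amp;quot;)
     } else if(!grepl(&amp;quot;log&amp;quot;, transformation) ) {
          &amp;quot;Y axis on original scale&amp;quot;
     } else {
          &amp;quot;Y axis on log scale&amp;quot;
     }
}

get_caption(0.3, &amp;quot;log&amp;quot;)    # &amp;quot;Y axis on log scale (added constant 0.3)&amp;quot;
get_caption(0, &amp;quot;identity&amp;quot;) # &amp;quot;Y axis on original scale&amp;quot;
get_caption(0, &amp;quot;log&amp;quot;)      # &amp;quot;Y axis on log scale&amp;quot;&lt;/code&gt;&lt;/pre&gt;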
&lt;p&gt;The function is getting pretty long and complicated now.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_fun3 = function(data = dat, respdata = resp_dat, response) {
     
     respvar = subset(respdata, variable == response)

     caption_text = {
          if(respvar$constant != 0 ) {
               paste0(&amp;quot;Y axis on log scale &amp;quot;,
                      &amp;quot;(added constant &amp;quot;,
                      respvar$constant, &amp;quot;)&amp;quot;)
          } else if(!grepl(&amp;quot;log&amp;quot;, respvar$transformation) ) {
               &amp;quot;Y axis on original scale&amp;quot;
          } else {
               &amp;quot;Y axis on log scale&amp;quot;
          }
     }
     
     g1 = ggplot(data = data, aes(x = year, 
                                  y = .data[[response]] + respvar$constant,
                                  shape = trt, 
                                  color = trt,
                                  group = trt) ) +
          geom_point(position = position_dodge(width = 0.5),
                     alpha = 0.25,
                     size = 2, show.legend = FALSE) +
          stat_summary(fun.y = mean, geom = &amp;quot;point&amp;quot;,
                       position = position_dodge(width = 0.5),
                       size = 4, show.legend = FALSE) +
          stat_summary(fun.y = mean, geom = &amp;quot;line&amp;quot;,
                       position = position_dodge(width = 0.5),
                       size = 1, key_glyph = &amp;quot;rect&amp;quot;) +
          theme_bw(base_size = 14) +
          theme(legend.position = &amp;quot;bottom&amp;quot;,
                legend.direction = &amp;quot;horizontal&amp;quot;,
                legend.box.spacing = unit(0, &amp;quot;cm&amp;quot;),
                legend.text = element_text(margin = margin(l = -.2, unit = &amp;quot;cm&amp;quot;) ),
                panel.grid.minor.y = element_blank() ) +
          scale_color_grey(name = &amp;quot;&amp;quot;,
                           labels = c(&amp;quot;A Treatment&amp;quot;, &amp;quot;B Treatment&amp;quot;),
                           start = 0, end = 0.5) +
          labs(x = &amp;quot;Year since treatment&amp;quot;,
               title = paste0(respvar$description, &amp;quot; (&amp;quot;, respvar$units, &amp;quot;)&amp;quot;),
               y = NULL,
               caption = caption_text)
     
     if( grepl(&amp;quot;log&amp;quot;, respvar$transformation) ) {
          g1 +
               scale_y_log10()
     } else {
          g1
     }
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But it does what I want. The plots now have informative captions at the bottom in addition to the conditional y axis scale.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_fun3(response = &amp;quot;cov_plant&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-09-27-more-exploratory-plots_files/figure-html/unnamed-chunk-13-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;looping-through-the-variables&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Looping through the variables&lt;/h1&gt;
&lt;p&gt;Once I have worked out the details of the function, I can loop through all the variables and make plots with &lt;code&gt;purrr::map()&lt;/code&gt;. I’ve set this up to loop through the vector of variable names, stored in &lt;code&gt;vars&lt;/code&gt; as strings.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;vars = names(dat)[1:3]
vars&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] &amp;quot;cov_plant&amp;quot; &amp;quot;cov_oth&amp;quot;   &amp;quot;gap&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_plots = map(vars, ~plot_fun3(response = .x) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here are the plots, with captions showing that two plots are on the log scale (one of those with an added constant) and one is on the original scale.&lt;/p&gt;
&lt;p&gt;I’m showing the plots all together here, but I actually saved them in a PDF with one plot per page so collaborators could easily page through them.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cowplot::plot_grid(plotlist = all_plots)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-09-27-more-exploratory-plots_files/figure-html/unnamed-chunk-16-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
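&lt;p&gt;Saving one plot per page, as I mention above, can be done by opening a &lt;code&gt;pdf()&lt;/code&gt; device and printing each plot in turn with &lt;code&gt;purrr::walk()&lt;/code&gt;. Here is a sketch with stand-in plots and a made-up file name.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2)
library(purrr)

# Stand-in plots; in this post the list would come from all_plots
example_plots = list(
     ggplot(mtcars, aes(x = wt, y = mpg) ) + geom_point(),
     ggplot(mtcars, aes(x = hp, y = mpg) ) + geom_point()
)

# Each printed plot becomes its own page of the PDF
pdf(&amp;quot;all_plots.pdf&amp;quot;, width = 7, height = 5) # hypothetical file name
walk(example_plots, print)
dev.off()&lt;/code&gt;&lt;/pre&gt;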
&lt;/div&gt;
&lt;div id=&#34;just-the-code-please&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Just the code, please&lt;/h1&gt;
&lt;p&gt;Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code &lt;a href=&#34;https://aosmith.rbind.io/script/2019-09-27-more-exploratory-plots.R&#34;&gt;from here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2) # v. 3.2.1
library(purrr) # v. 0.3.2
dat = structure(list(cov_plant = c(3.7, 1.8, 7.5, 0.4, 7.9, 1.2, 0.7, 
2.3, 6.9, 4.1, 17.7, 2.4, 0.9, 14.3, 4.9, 0, 4.1, 3.6, 1.1, 7.7, 
0, 1.5, 1.7, 11.5, 0.8, 12.3, 7.1, 6.9, 5.6, 2.7, 1, 2.5, 2, 
0.7, 0.7, 2.9, 4, 2.5, 2.9, 1.5, 0.5, 22.8, 2.8, 1.4, 1, 2.9, 
2.4, 4.1, 4.1, 1.9, 2.8, 5, 5.7, 5.6, 0, 4.6, 8.1, 0.5, 88.9, 
1), cov_oth = c(11.5, 63.2, 34, 65.5, 28.8, 8.6, 7.1, 65.5, 12.1, 
3, 23.6, 3.8, 24.9, 55.9, 24.2, 78.2, 81.1, 10.7, 30.7, 23.5, 
10.1, 4.6, 45.7, 37.6, 81.3, 39.1, 50.8, 75.8, 78.2, 23.9, 53, 
51.1, 2.5, 40.2, 15.9, 91.3, 44, 72.9, 82.7, 42.4, 94.1, 23, 
86.2, 50.1, 88.9, 80.5, 34.2, 68.7, 45, 13.9, 44.2, 85, 79.6, 
1, 45.3, 69.5, 89.6, 22.2, 1.3, 88), gap = c(2.8, 11.8, 0.3, 
17.2, 18.3, 1.4, 19.6, 19.4, 2.6, 66.3, 97.1, 17, 381.5, 15.7, 
8.3, 2.4, 3.8, 3.8, 246.6, 43.2, 16.7, 6.6, 3.1, 2.4, 3.2, 4.3, 
0.3, 2.1, 41.7, 68.9, 5.1, 5.7, 0.4, 35.5, 1.1, 10.8, 5, 11.8, 
75.5, 5.4, 12.6, 5.2, 11.4, 6.8, 5.3, 1.1, 3.2, 2.9, 5.2, 0.2, 
1.5, 0.6, 7.4, 18.6, 11.7, 1.6, 13.7, 7.1, 19.9, 16.8), year = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c(&amp;quot;Year 1&amp;quot;, 
&amp;quot;Year 2&amp;quot;, &amp;quot;Year 3&amp;quot;), class = &amp;quot;factor&amp;quot;), trt = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c(&amp;quot;a&amp;quot;, 
&amp;quot;b&amp;quot;), class = &amp;quot;factor&amp;quot;)), row.names = c(NA, -60L), class = &amp;quot;data.frame&amp;quot;)
head(dat)

resp_dat = structure(list(variable = structure(c(2L, 1L, 3L), .Label = c(&amp;quot;cov_oth&amp;quot;, 
&amp;quot;cov_plant&amp;quot;, &amp;quot;gap&amp;quot;), class = &amp;quot;factor&amp;quot;), description = structure(3:1, .Label = c(&amp;quot;Gap size&amp;quot;, 
&amp;quot;Other cover&amp;quot;, &amp;quot;Plant cover&amp;quot;), class = &amp;quot;factor&amp;quot;), units = structure(c(1L, 
1L, 2L), .Label = c(&amp;quot;%&amp;quot;, &amp;quot;m&amp;quot;), class = &amp;quot;factor&amp;quot;), transformation = structure(c(2L, 
1L, 2L), .Label = c(&amp;quot;identity&amp;quot;, &amp;quot;log&amp;quot;), class = &amp;quot;factor&amp;quot;), constant = c(0.3, 
0, 0)), class = &amp;quot;data.frame&amp;quot;, row.names = c(NA, -3L))

resp_dat

plot_fun = function(data = dat, respdata = resp_dat, response) {
     
     respvar = subset(respdata, variable == response)

     ggplot(data = data, aes(x = year, 
                             y = .data[[response]] + respvar$constant,
                             shape = trt, 
                             color = trt,
                             group = trt) ) +
          geom_point(position = position_dodge(width = 0.5),
                   alpha = 0.25,
                   size = 2, show.legend = FALSE) +
          stat_summary(fun.y = mean, geom = &amp;quot;point&amp;quot;,
                       position = position_dodge(width = 0.5),
                       size = 4, show.legend = FALSE) +
          stat_summary(fun.y = mean, geom = &amp;quot;line&amp;quot;,
                       position = position_dodge(width = 0.5),
                       size = 1, key_glyph = &amp;quot;rect&amp;quot;) +
          theme_bw(base_size = 14) +
          theme(legend.position = &amp;quot;bottom&amp;quot;,
                legend.direction = &amp;quot;horizontal&amp;quot;,
                legend.box.spacing = unit(0, &amp;quot;cm&amp;quot;),
                legend.text = element_text(margin = margin(l = -.2, unit = &amp;quot;cm&amp;quot;) ),
                panel.grid.minor.y = element_blank() ) +
          scale_color_grey(name = &amp;quot;&amp;quot;,
                           labels = c(&amp;quot;A Treatment&amp;quot;, &amp;quot;B Treatment&amp;quot;),
                           start = 0, end = 0.5) +
          labs(x = &amp;quot;Year since treatment&amp;quot;,
               title = paste0(respvar$description, &amp;quot; (&amp;quot;, respvar$units, &amp;quot;)&amp;quot;),
               y = NULL)
}
plot_fun(response = &amp;quot;cov_plant&amp;quot;)

if( grepl(&amp;quot;log&amp;quot;, respvar$transformation) ) {
          g1 + scale_y_log10()
     } else {
          g1
     }

plot_fun2 = function(data = dat, respdata = resp_dat, response) {
     
     respvar = subset(respdata, variable == response)

     g1 = ggplot(data = data, aes(x = year, 
                                  y = .data[[response]] + respvar$constant,
                                  shape = trt, 
                                  color = trt,
                                  group = trt) ) +
          geom_point(position = position_dodge(width = 0.5),
                     alpha = 0.25,
                     size = 2, show.legend = FALSE) +
          stat_summary(fun.y = mean, geom = &amp;quot;point&amp;quot;,
                       position = position_dodge(width = 0.5),
                       size = 4, show.legend = FALSE) +
          stat_summary(fun.y = mean, geom = &amp;quot;line&amp;quot;,
                       position = position_dodge(width = 0.5),
                       size = 1, key_glyph = &amp;quot;rect&amp;quot;) +
          theme_bw(base_size = 14) +
          theme(legend.position = &amp;quot;bottom&amp;quot;,
                legend.direction = &amp;quot;horizontal&amp;quot;,
                legend.box.spacing = unit(0, &amp;quot;cm&amp;quot;),
                legend.text = element_text(margin = margin(l = -.2, unit = &amp;quot;cm&amp;quot;) ),
                panel.grid.minor.y = element_blank() ) +
          scale_color_grey(name = &amp;quot;&amp;quot;,
                           labels = c(&amp;quot;A Treatment&amp;quot;, &amp;quot;B Treatment&amp;quot;),
                           start = 0, end = 0.5) +
          labs(x = &amp;quot;Year since treatment&amp;quot;,
               title = paste0(respvar$description, &amp;quot; (&amp;quot;, respvar$units, &amp;quot;)&amp;quot;),
               y = NULL)
     
     if( grepl(&amp;quot;log&amp;quot;, respvar$transformation) ) {
          g1 + scale_y_log10()
     } else {
          g1
     }
}
plot_fun2(response = &amp;quot;cov_plant&amp;quot;)
plot_fun2(response = &amp;quot;cov_oth&amp;quot;)

caption_text = {
     if(respvar$constant != 0 ) {
          paste0(&amp;quot;Y axis on log scale &amp;quot;,
                 &amp;quot;(added constant &amp;quot;,
                 respvar$constant, &amp;quot;)&amp;quot;)
     } else if(!grepl(&amp;quot;log&amp;quot;, respvar$transformation) ) {
          &amp;quot;Y axis on original scale&amp;quot;
     } else {
          &amp;quot;Y axis on log scale&amp;quot;
     }
}

plot_fun3 = function(data = dat, respdata = resp_dat, response) {
     
     respvar = subset(respdata, variable == response)

     caption_text = {
          if(respvar$constant != 0 ) {
               paste0(&amp;quot;Y axis on log scale &amp;quot;,
                      &amp;quot;(added constant &amp;quot;,
                      respvar$constant, &amp;quot;)&amp;quot;)
          } else if(!grepl(&amp;quot;log&amp;quot;, respvar$transformation) ) {
               &amp;quot;Y axis on original scale&amp;quot;
          } else {
               &amp;quot;Y axis on log scale&amp;quot;
          }
     }
     
     g1 = ggplot(data = data, aes(x = year, 
                                  y = .data[[response]] + respvar$constant,
                                  shape = trt, 
                                  color = trt,
                                  group = trt) ) +
          geom_point(position = position_dodge(width = 0.5),
                     alpha = 0.25,
                     size = 2, show.legend = FALSE) +
          stat_summary(fun.y = mean, geom = &amp;quot;point&amp;quot;,
                       position = position_dodge(width = 0.5),
                       size = 4, show.legend = FALSE) +
          stat_summary(fun.y = mean, geom = &amp;quot;line&amp;quot;,
                       position = position_dodge(width = 0.5),
                       size = 1, key_glyph = &amp;quot;rect&amp;quot;) +
          theme_bw(base_size = 14) +
          theme(legend.position = &amp;quot;bottom&amp;quot;,
                legend.direction = &amp;quot;horizontal&amp;quot;,
                legend.box.spacing = unit(0, &amp;quot;cm&amp;quot;),
                legend.text = element_text(margin = margin(l = -.2, unit = &amp;quot;cm&amp;quot;) ),
                panel.grid.minor.y = element_blank() ) +
          scale_color_grey(name = &amp;quot;&amp;quot;,
                           labels = c(&amp;quot;A Treatment&amp;quot;, &amp;quot;B Treatment&amp;quot;),
                           start = 0, end = 0.5) +
          labs(x = &amp;quot;Year since treatment&amp;quot;,
               title = paste0(respvar$description, &amp;quot; (&amp;quot;, respvar$units, &amp;quot;)&amp;quot;),
               y = NULL,
               caption = caption_text)
     
     if( grepl(&amp;quot;log&amp;quot;, respvar$transformation) ) {
          g1 +
               scale_y_log10()
     } else {
          g1
     }
}
plot_fun3(response = &amp;quot;cov_plant&amp;quot;)

vars = names(dat)[1:3]
vars

all_plots = map(vars, ~plot_fun3(response = .x) )

cowplot::plot_grid(plotlist = all_plots)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Many similar models - Part 2: Automate model fitting with purrr::map() loops</title>
      <link>https://aosmith.rbind.io/2019/07/22/automate-model-fitting-with-loops/</link>
      <pubDate>Mon, 22 Jul 2019 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2019/07/22/automate-model-fitting-with-loops/</guid>
      <description>
&lt;script src=&#34;https://aosmith.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;&lt;em&gt;This post was last updated on 2022-01-05.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;When we have many similar models to fit, automating at least some portions of the task can be a real time saver. In &lt;a href=&#34;https://aosmith.rbind.io/2019/06/24/function-for-model-fitting/&#34;&gt;my last post&lt;/a&gt; I demonstrated how to make a function for model fitting. Once you have made such a function it’s possible to loop through variable names and fit a model for each one.&lt;/p&gt;
&lt;p&gt;In this post I am specifically focusing on having many response variables with the same explanatory variables, using &lt;code&gt;purrr::map()&lt;/code&gt; and friends for the looping. However, this same approach can be used for models with varying explanatory variables, etc.&lt;/p&gt;
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#r-packages&#34;&gt;R packages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-dataset&#34;&gt;The dataset&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#a-function-for-model-fitting&#34;&gt;A function for model fitting&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#looping-through-the-response-variables&#34;&gt;Looping through the response variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#create-residual-plots-for-each-model&#34;&gt;Create residual plots for each model&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#examining-the-plots&#34;&gt;Examining the plots&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#re-fitting-a-model&#34;&gt;Re-fitting a model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#getting-model-results&#34;&gt;Getting model results&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#alternative-approach-to-fitting-many-models&#34;&gt;Alternative approach to fitting many models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#just-the-code-please&#34;&gt;Just the code, please&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;r-packages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;R packages&lt;/h1&gt;
&lt;p&gt;I’ll be using &lt;strong&gt;purrr&lt;/strong&gt; for looping and will make residual plots with &lt;strong&gt;ggplot2&lt;/strong&gt; and &lt;strong&gt;patchwork&lt;/strong&gt;. I’ll use &lt;strong&gt;broom&lt;/strong&gt; to extract tidy results from models.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(purrr) # v. 0.3.4
library(ggplot2) # v. 3.3.5
library(patchwork) # v. 1.1.1
library(broom) # v. 0.7.10&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;the-dataset&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The dataset&lt;/h1&gt;
&lt;p&gt;I made a dataset with three response variables, &lt;code&gt;resp&lt;/code&gt;, &lt;code&gt;slp&lt;/code&gt;, and &lt;code&gt;grad&lt;/code&gt;, along with a two-level explanatory variable &lt;code&gt;group&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat = structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c(&amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;), class = &amp;quot;factor&amp;quot;), 
    resp = c(10.48, 9.87, 11.1, 8.56, 11.15, 9.53, 8.99, 10.06, 
    11.02, 10.57, 11.85, 10.11, 9.25, 11.66, 10.72, 8.34, 10.58, 
    10.47, 9.46, 11.13, 8.35, 9.69, 9.82, 11.47, 9.13, 11.53, 
    11.05, 11.03, 10.84, 10.22), slp = c(38.27, 46.33, 44.29, 
    35.57, 34.78, 47.81, 50.45, 46.31, 47.82, 42.07, 31.75, 65.65, 
    47.42, 41.51, 38.69, 47.84, 46.22, 50.66, 50.69, 44.09, 47.3, 
    52.53, 53.63, 53.38, 27.34, 51.83, 56.63, 32.99, 77.5, 38.24
    ), grad = c(0.3, 0.66, 0.57, 0.23, 0.31, 0.48, 0.5, 0.49, 
    2.41, 0.6, 0.27, 0.89, 2.43, 1.02, 2.17, 1.38, 0.17, 0.47, 
    1.1, 3.28, 6.14, 3.8, 4.35, 0.85, 1.13, 1.11, 2.93, 1.13, 
    4.52, 0.13)), class = &amp;quot;data.frame&amp;quot;, row.names = c(NA, -30L) )
head(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#   group  resp   slp grad
# 1     a 10.48 38.27 0.30
# 2     a  9.87 46.33 0.66
# 3     a 11.10 44.29 0.57
# 4     a  8.56 35.57 0.23
# 5     a 11.15 34.78 0.31
# 6     a  9.53 47.81 0.48&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;a-function-for-model-fitting&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;A function for model fitting&lt;/h1&gt;
&lt;p&gt;The analysis in the example I’m using today amounts to a two-sample t-test. I will fit this as a linear model with &lt;code&gt;lm()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Since the response variable needs to vary among models but the dataset and explanatory variable do not, my function will have a single argument for setting the response variable. Building the model formula in my function &lt;code&gt;ttest_fun()&lt;/code&gt; relies on &lt;code&gt;reformulate()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ttest_fun = function(response) {
  form = reformulate(&amp;quot;group&amp;quot;, response = response)
  lm(form, data = dat)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This function takes the response variable as a string and returns a model object.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ttest_fun(response = &amp;quot;resp&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# 
# Call:
# lm(formula = form, data = dat)
# 
# Coefficients:
# (Intercept)       groupb  
#     10.3280      -0.1207&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;looping-through-the-response-variables&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Looping through the response variables&lt;/h1&gt;
&lt;p&gt;I’ll make a vector of the response variable names as strings so I can loop through them and fit a model for each one. I pull my response variable names out of the dataset with &lt;code&gt;names()&lt;/code&gt;. This step may take more work for you if you have many response variables that aren’t neatly listed all in a row like mine are. 😜&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;vars = names(dat)[2:4]
vars&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] &amp;quot;resp&amp;quot; &amp;quot;slp&amp;quot;  &amp;quot;grad&amp;quot;&lt;/code&gt;&lt;/pre&gt;
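&lt;p&gt;For this dataset, an equivalent way to get the response variable names that doesn’t rely on column position is to drop the explanatory variable by name. This is just one option, shown as a quick sketch.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# All columns other than the explanatory variable are responses here
setdiff( names(dat), &amp;quot;group&amp;quot; )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] &amp;quot;resp&amp;quot; &amp;quot;slp&amp;quot;  &amp;quot;grad&amp;quot;&lt;/code&gt;&lt;/pre&gt;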
&lt;p&gt;I want to keep track of which variable goes with which model. This can be accomplished by naming the vector I’m going to loop through. I name the vector of strings with itself using &lt;code&gt;purrr::set_names()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;vars = set_names(vars)
vars&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#   resp    slp   grad 
# &amp;quot;resp&amp;quot;  &amp;quot;slp&amp;quot; &amp;quot;grad&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now I’m ready to loop through the variables and fit a model for each one with &lt;code&gt;purrr::map()&lt;/code&gt;. Since my function takes a single argument, the response variable, I can list the function by name within &lt;code&gt;map()&lt;/code&gt; without using a formula (&lt;code&gt;~&lt;/code&gt;) or an anonymous function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;models = vars %&amp;gt;%
     map(ttest_fun)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The output is a list containing three models, one for each response variable. Notice that the output list is a &lt;em&gt;named&lt;/em&gt; list, where the name of each list element is the response variable used in that model. This is the reason I took the time to name the response variable vector.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;models&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $resp
# 
# Call:
# lm(formula = form, data = dat)
# 
# Coefficients:
# (Intercept)       groupb  
#     10.3280      -0.1207  
# 
# 
# $slp
# 
# Call:
# lm(formula = form, data = dat)
# 
# Coefficients:
# (Intercept)       groupb  
#       43.91         4.81  
# 
# 
# $grad
# 
# Call:
# lm(formula = form, data = dat)
# 
# Coefficients:
# (Intercept)       groupb  
#      0.8887       1.2773&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note I could have done the &lt;code&gt;set_names()&lt;/code&gt; step within the pipe chain rather than as a separate step.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;vars %&amp;gt;%
     set_names() %&amp;gt;%
     map(ttest_fun)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;create-residual-plots-for-each-model&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Create residual plots for each model&lt;/h1&gt;
&lt;p&gt;I’m working with a simple model fitting function, where the output only contains the fitted model. To extract other output I can loop through the list of models in a separate step. An alternative is to create all the output within the modeling function and then pull whatever you want out of the list of results.&lt;/p&gt;
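&lt;p&gt;As a minimal sketch of that alternative (the element names here are my own invention), the function could return the fitted model along with any results of interest in a named list:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ttest_all = function(response) {
     fit = ttest_fun(response)
     list(model = fit,
          results = tidy(fit, conf.int = TRUE) )
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Looping such a function over &lt;code&gt;vars&lt;/code&gt; would give a list of lists to pull results from.&lt;/p&gt;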
&lt;p&gt;In this case, my next step is to loop through the models and make residual plots. I want to look at a residuals vs fitted values plot as well as a plot to look at residual normality (like a boxplot, a histogram, or a quantile-quantile normal plot). In more complicated models I might also make plots of residuals vs explanatory variables.&lt;/p&gt;
&lt;p&gt;I’ll make a function to build the two residuals plots. My function takes a model and the model name as arguments. I extract residuals and fitted values via &lt;code&gt;broom::augment()&lt;/code&gt; and make the two plots with &lt;strong&gt;ggplot2&lt;/strong&gt; functions. I combine the plots via &lt;strong&gt;patchwork&lt;/strong&gt;. I add a title to the combined plot with the name of the response variable from each model to help me keep track of things.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;resid_plots = function(model, modelname) {
     output = augment(model)
     
     res.v.fit = ggplot(output, aes(x = .fitted, y = .resid) ) +
          geom_point() +
          theme_bw(base_size = 16)
     
     res.box = ggplot(output, aes(x = &amp;quot;&amp;quot;, y = .resid) ) +
          geom_boxplot() +
          theme_bw(base_size = 16) +
          labs(x = NULL)
     
     res.v.fit + res.box +
          plot_annotation(title = paste(&amp;quot;Residuals plots for&amp;quot;, modelname) )
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The output of this function is a combined plot of the residuals. Here is an example for one model (printed at 8&#34; wide by 4&#34; tall).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;resid_plots(model = models[[1]], modelname = names(models)[1])&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-07-22-automate-model-fitting-with-loops_files/figure-html/unnamed-chunk-11-1.png&#34; width=&#34;768&#34; /&gt;&lt;/p&gt;
&lt;p&gt;I can use &lt;code&gt;purrr::imap()&lt;/code&gt; to loop through all models and the model names simultaneously to make the plots with the title for each variable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;residplots = imap(models, resid_plots)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;examining-the-plots&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Examining the plots&lt;/h2&gt;
&lt;p&gt;In a situation where I have many response variables, I like to save my plots out into a PDF so I can easily page through them outside of R. You can see some approaches for saving plots &lt;a href=&#34;https://aosmith.rbind.io/2018/08/20/automating-exploratory-plots/#saving-the-plots&#34;&gt;in a previous post&lt;/a&gt;.&lt;/p&gt;
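&lt;p&gt;As a minimal sketch of that approach, the whole list of plots can be printed into a single PDF with &lt;code&gt;pdf()&lt;/code&gt; and &lt;code&gt;purrr::walk()&lt;/code&gt;. The file name here is made up.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# One page per residual plot; width/height match the sizes shown above
pdf(&amp;quot;residual_plots.pdf&amp;quot;, width = 8, height = 4)
walk(residplots, print)
dev.off()&lt;/code&gt;&lt;/pre&gt;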
&lt;p&gt;Since I have only a few plots I can print them in R. The last plot, shown below, looks potentially problematic. I see the variance increasing with the mean and right skew in the residuals.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;residplots[[3]]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-07-22-automate-model-fitting-with-loops_files/figure-html/unnamed-chunk-13-1.png&#34; width=&#34;768&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;re-fitting-a-model&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Re-fitting a model&lt;/h1&gt;
&lt;p&gt;If you find a problematic model fit you’ll need to spend some time working with that variable to find a more appropriate model.&lt;/p&gt;
&lt;p&gt;Once you have a model you’re happy with, you can manually add the new model to the list (if needed). In my example, let’s say the &lt;code&gt;grad&lt;/code&gt; model needed a log transformation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gradmod = ttest_fun(&amp;quot;log(grad)&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If I’m happy with the fit of the new model I add it to the list with the other models to automate extracting results.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;models$log_grad = gradmod&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I remove the original model by setting it to &lt;code&gt;NULL&lt;/code&gt;. I don’t want any results from that model and if I leave it in I know I’ll ultimately get confused about which model is the final model. 😕&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;models$grad = NULL&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now the output list again has three models, with the new &lt;code&gt;log_grad&lt;/code&gt; model added and the old &lt;code&gt;grad&lt;/code&gt; model removed.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;models&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $resp
# 
# Call:
# lm(formula = form, data = dat)
# 
# Coefficients:
# (Intercept)       groupb  
#     10.3280      -0.1207  
# 
# 
# $slp
# 
# Call:
# lm(formula = form, data = dat)
# 
# Coefficients:
# (Intercept)       groupb  
#       43.91         4.81  
# 
# 
# $log_grad
# 
# Call:
# lm(formula = form, data = dat)
# 
# Coefficients:
# (Intercept)       groupb  
#     -0.4225       0.7177&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I could have removed models from the list via subsetting by name. Here’s an example, showing what the list looks like if I remove the &lt;code&gt;slp&lt;/code&gt; model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;models[!names(models) %in% &amp;quot;slp&amp;quot;]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $resp
# 
# Call:
# lm(formula = form, data = dat)
# 
# Coefficients:
# (Intercept)       groupb  
#     10.3280      -0.1207  
# 
# 
# $log_grad
# 
# Call:
# lm(formula = form, data = dat)
# 
# Coefficients:
# (Intercept)       groupb  
#     -0.4225       0.7177&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-model-results&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Getting model results&lt;/h1&gt;
&lt;p&gt;Once you are happy with the fit of all the models it’s time to extract any output of interest. For a t-test we would commonly want the estimated difference between the two groups, which is in the &lt;code&gt;summary()&lt;/code&gt; output. I’ll pull this information from each model as a data.frame with &lt;code&gt;broom::tidy()&lt;/code&gt;. This returns the estimated coefficients, statistical tests, and (optionally) confidence intervals for the coefficients.&lt;/p&gt;
&lt;p&gt;I switch to &lt;code&gt;map_dfr()&lt;/code&gt; for looping to get the output combined into a single data.frame. I use the &lt;code&gt;.id&lt;/code&gt; argument to add the response variable name to the output dataset.&lt;/p&gt;
&lt;p&gt;Since some of the response variables are log-transformed, it would make sense to back-transform coefficients in this step. I don’t show this here, but would likely approach this using an &lt;code&gt;if()&lt;/code&gt; statement based on log-transformed variables containing &lt;code&gt;&#34;log&#34;&lt;/code&gt; in their names.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;res_anova = map_dfr(models, tidy, conf.int = TRUE, .id = &amp;quot;variable&amp;quot;)
res_anova&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# # A tibble: 6 x 8
#   variable term        estimate std.error statistic  p.value conf.low conf.high
#   &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;
# 1 resp     (Intercept)   10.3       0.260    39.7   3.60e-26   9.80     10.9   
# 2 resp     groupb        -0.121     0.368    -0.328 7.45e- 1  -0.874     0.632 
# 3 slp      (Intercept)   43.9       2.56     17.2   2.18e-16  38.7      49.2   
# 4 slp      groupb         4.81      3.62      1.33  1.95e- 1  -2.61     12.2   
# 5 log_grad (Intercept)   -0.423     0.255    -1.66  1.09e- 1  -0.945     0.0997
# 6 log_grad groupb         0.718     0.361     1.99  5.64e- 2  -0.0208    1.46&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The primary interest in this output would be in the &lt;code&gt;groupb&lt;/code&gt; row for each variable. Since the output is a data frame (thanks &lt;code&gt;broom::tidy()&lt;/code&gt;!) you can use standard data manipulation tools to pull out only rows and columns of interest.&lt;/p&gt;
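&lt;p&gt;For example, here’s a sketch of pulling out only the &lt;code&gt;groupb&lt;/code&gt; rows and back-transforming estimates from the log-scale models, assuming (as here) that log-transformed variables have &lt;code&gt;&#34;log&#34;&lt;/code&gt; in their names.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;diffs = res_anova[res_anova$term == &amp;quot;groupb&amp;quot;, ]
# Back-transform only the estimates from log-scale models
diffs$ratio = ifelse( grepl(&amp;quot;log&amp;quot;, diffs$variable),
                      exp(diffs$estimate),
                      NA )&lt;/code&gt;&lt;/pre&gt;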
&lt;p&gt;Other output, such as AIC or estimated marginal means for more complicated models, can be extracted and saved in a similar way. Check out &lt;code&gt;broom::glance()&lt;/code&gt; for extracting AIC and other overall model results.&lt;/p&gt;
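&lt;p&gt;The same looping pattern works for &lt;code&gt;glance()&lt;/code&gt;, giving one row of overall results per model. A quick sketch:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;map_dfr(models, glance, .id = &amp;quot;variable&amp;quot;)&lt;/code&gt;&lt;/pre&gt;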
&lt;/div&gt;
&lt;div id=&#34;alternative-approach-to-fitting-many-models&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Alternative approach to fitting many models&lt;/h1&gt;
&lt;p&gt;When I am working with many response variables with widely varying ranges, it feels most natural to me to keep the different variables in different columns and loop through them as I have shown above. However, a reasonable alternative is to &lt;em&gt;reshape&lt;/em&gt; your dataset so all the values of all variables are in a single column. A second, categorical column will contain the variable names so we know which variable each row is associated with. Such reshaping is an example of making a &lt;em&gt;wide&lt;/em&gt; dataset into a &lt;em&gt;long&lt;/em&gt; dataset.&lt;/p&gt;
&lt;p&gt;Once your data are in a long format, you can use a list-columns approach for the analysis. You can see an example of this in &lt;a href=&#34;https://r4ds.had.co.nz/many-models.html#introduction-17&#34;&gt;Chapter 25: Many models&lt;/a&gt; of Grolemund and Wickham’s &lt;a href=&#34;https://r4ds.had.co.nz/&#34;&gt;R for Data Science book&lt;/a&gt;.&lt;/p&gt;
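&lt;p&gt;As a brief sketch of that reshaping step for this dataset, using &lt;code&gt;tidyr::pivot_longer()&lt;/code&gt; (package &lt;strong&gt;tidyr&lt;/strong&gt; isn’t loaded above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyr)
dat_long = pivot_longer(dat,
                        cols = c(&amp;quot;resp&amp;quot;, &amp;quot;slp&amp;quot;, &amp;quot;grad&amp;quot;),
                        names_to = &amp;quot;variable&amp;quot;,
                        values_to = &amp;quot;value&amp;quot;)
head(dat_long)&lt;/code&gt;&lt;/pre&gt;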
&lt;/div&gt;
&lt;div id=&#34;just-the-code-please&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Just the code, please&lt;/h1&gt;
&lt;p&gt;Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code &lt;a href=&#34;https://aosmith.rbind.io/script/2019-07-22-automate-model-fitting-with-loops.R&#34;&gt;from here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(purrr) # v. 0.3.4
library(ggplot2) # v. 3.3.5
library(patchwork) # v. 1.1.1
library(broom) # v. 0.7.10

dat = structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c(&amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;), class = &amp;quot;factor&amp;quot;), 
    resp = c(10.48, 9.87, 11.1, 8.56, 11.15, 9.53, 8.99, 10.06, 
    11.02, 10.57, 11.85, 10.11, 9.25, 11.66, 10.72, 8.34, 10.58, 
    10.47, 9.46, 11.13, 8.35, 9.69, 9.82, 11.47, 9.13, 11.53, 
    11.05, 11.03, 10.84, 10.22), slp = c(38.27, 46.33, 44.29, 
    35.57, 34.78, 47.81, 50.45, 46.31, 47.82, 42.07, 31.75, 65.65, 
    47.42, 41.51, 38.69, 47.84, 46.22, 50.66, 50.69, 44.09, 47.3, 
    52.53, 53.63, 53.38, 27.34, 51.83, 56.63, 32.99, 77.5, 38.24
    ), grad = c(0.3, 0.66, 0.57, 0.23, 0.31, 0.48, 0.5, 0.49, 
    2.41, 0.6, 0.27, 0.89, 2.43, 1.02, 2.17, 1.38, 0.17, 0.47, 
    1.1, 3.28, 6.14, 3.8, 4.35, 0.85, 1.13, 1.11, 2.93, 1.13, 
    4.52, 0.13)), class = &amp;quot;data.frame&amp;quot;, row.names = c(NA, -30L) )
head(dat)

ttest_fun = function(response) {
  form = reformulate(&amp;quot;group&amp;quot;, response = response)
  lm(form, data = dat)
}
ttest_fun(response = &amp;quot;resp&amp;quot;)

vars = names(dat)[2:4]
vars

vars = set_names(vars)
vars

models = vars %&amp;gt;%
     map(ttest_fun)
models

vars %&amp;gt;%
     set_names() %&amp;gt;%
     map(ttest_fun)

resid_plots = function(model, modelname) {
     output = augment(model)
     
     res.v.fit = ggplot(output, aes(x = .fitted, y = .resid) ) +
          geom_point() +
          theme_bw(base_size = 16)
     
     res.box = ggplot(output, aes(x = &amp;quot;&amp;quot;, y = .resid) ) +
          geom_boxplot() +
          theme_bw(base_size = 16) +
          labs(x = NULL)
     
     res.v.fit + res.box +
          plot_annotation(title = paste(&amp;quot;Residuals plots for&amp;quot;, modelname) )
}
resid_plots(model = models[[1]], modelname = names(models)[1])

residplots = imap(models, resid_plots)
residplots[[3]]

gradmod = ttest_fun(&amp;quot;log(grad)&amp;quot;)

models$log_grad = gradmod
models$grad = NULL
models

models[!names(models) %in% &amp;quot;slp&amp;quot;]

res_anova = map_dfr(models, tidy, conf.int = TRUE, .id = &amp;quot;variable&amp;quot;)
res_anova&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Many similar models - Part 1: How to make a function for model fitting</title>
      <link>https://aosmith.rbind.io/2019/06/24/function-for-model-fitting/</link>
      <pubDate>Mon, 24 Jun 2019 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2019/06/24/function-for-model-fitting/</guid>
      <description>
&lt;script src=&#34;https://aosmith.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;&lt;em&gt;This post was last updated on 2022-01-05.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I worked with several students over the last few months who were fitting many linear models, all with the same basic structure but different response variables. They were struggling to find an efficient way to do this in R while still taking the time to check model assumptions.&lt;/p&gt;
&lt;p&gt;A first step when working towards a more automated process for fitting many models is to learn how to build model formulas using &lt;code&gt;reformulate()&lt;/code&gt; or with &lt;code&gt;paste()&lt;/code&gt; and &lt;code&gt;as.formula()&lt;/code&gt;. Once we learn how to build model formulas we can create functions to streamline the model fitting process.&lt;/p&gt;
&lt;p&gt;I will be making residuals plots with &lt;strong&gt;ggplot2&lt;/strong&gt; today so will load it here.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2) # v.3.2.0&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#building-a-formula-with-reformulate&#34;&gt;Building a formula with reformulate()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-a-constructed-formula-in-lm&#34;&gt;Using a constructed formula in lm()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#making-a-function-for-model-fitting&#34;&gt;Making a function for model fitting&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-bare-names-instead-of-strings-i.e.-non-standard-evaluation&#34;&gt;Using bare names instead of strings (i.e., non-standard evaluation)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#building-a-formula-with-varying-explanatory-variables&#34;&gt;Building a formula with varying explanatory variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-dots-for-passing-many-variables-to-a-function&#34;&gt;The dots for passing many variables to a function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#example-function-that-returns-residuals-plots-and-model-output&#34;&gt;Example function that returns residuals plots and model output&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#next-step-looping&#34;&gt;Next step: looping&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#just-the-code-please&#34;&gt;Just the code, please&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;building-a-formula-with-reformulate&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Building a formula with reformulate()&lt;/h1&gt;
&lt;p&gt;Model formulas of the form &lt;code&gt;y ~ x&lt;/code&gt; can be built from variable names passed as &lt;em&gt;character strings&lt;/em&gt;, i.e., variable names with quotes around them.&lt;/p&gt;
&lt;p&gt;The function &lt;code&gt;reformulate()&lt;/code&gt; allows us to pass response and explanatory variables as character strings and returns them as a formula.&lt;/p&gt;
&lt;p&gt;Here is an example, using &lt;code&gt;mpg&lt;/code&gt; as the response variable and &lt;code&gt;am&lt;/code&gt; as the explanatory variable. Note the explanatory variable is passed to the first argument, &lt;code&gt;termlabels&lt;/code&gt;, and the response variable to &lt;code&gt;response&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;reformulate(termlabels = &amp;quot;am&amp;quot;, response = &amp;quot;mpg&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# mpg ~ am&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A common alternative to &lt;code&gt;reformulate()&lt;/code&gt; is to use &lt;code&gt;paste()&lt;/code&gt; with &lt;code&gt;as.formula()&lt;/code&gt;. I show this option below, but won’t discuss it more in this post.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;as.formula( paste(&amp;quot;mpg&amp;quot;, &amp;quot;~ am&amp;quot;) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# mpg ~ am&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;using-a-constructed-formula-in-lm&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using a constructed formula in lm()&lt;/h1&gt;
&lt;p&gt;Once we’ve built the formula we can put it in as the first argument of a model fitting function like &lt;code&gt;lm()&lt;/code&gt; in order to fit the model. I’ll be using the &lt;code&gt;mtcars&lt;/code&gt; dataset throughout the model fitting examples.&lt;/p&gt;
&lt;p&gt;Since &lt;code&gt;am&lt;/code&gt; is a 0/1 variable, this particular analysis is a two-sample t-test with &lt;code&gt;mpg&lt;/code&gt; as the response variable. I skipped writing out the name of the first argument to &lt;code&gt;reformulate()&lt;/code&gt; to save space, since the first argument is always the explanatory variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lm( reformulate(&amp;quot;am&amp;quot;, response = &amp;quot;mpg&amp;quot;), data = mtcars)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# 
# Call:
# lm(formula = reformulate(&amp;quot;am&amp;quot;, response = &amp;quot;mpg&amp;quot;), data = mtcars)
# 
# Coefficients:
# (Intercept)           am  
#      17.147        7.245&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;making-a-function-for-model-fitting&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Making a function for model fitting&lt;/h1&gt;
&lt;p&gt;Being able to build a formula is essential for making user-defined model fitting functions.&lt;/p&gt;
&lt;p&gt;For example, say I wanted to do the same t-test with &lt;code&gt;am&lt;/code&gt; for many response variables. I could create a function that takes the response variable as an argument and build the model formula within the function with &lt;code&gt;reformulate()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The response variable name is passed to the &lt;code&gt;response&lt;/code&gt; argument as a character string.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lm_fun = function(response) {
  lm( reformulate(&amp;quot;am&amp;quot;, response = response), data = mtcars)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here are two examples of this function in action, using &lt;code&gt;mpg&lt;/code&gt; and then &lt;code&gt;wt&lt;/code&gt; as the response variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lm_fun(response = &amp;quot;mpg&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# 
# Call:
# lm(formula = reformulate(&amp;quot;am&amp;quot;, response = response), data = mtcars)
# 
# Coefficients:
# (Intercept)           am  
#      17.147        7.245&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lm_fun(response = &amp;quot;wt&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# 
# Call:
# lm(formula = reformulate(&amp;quot;am&amp;quot;, response = response), data = mtcars)
# 
# Coefficients:
# (Intercept)           am  
#       3.769       -1.358&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;using-bare-names-instead-of-strings-i.e.-non-standard-evaluation&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using bare names instead of strings (i.e., non-standard evaluation)&lt;/h1&gt;
&lt;p&gt;As you can see, this approach to building formulas relies on character strings. This is going to be great once we start looping through variable names, but when making a function for interactive use it can be nice to let the user pass bare column names.&lt;/p&gt;
&lt;p&gt;We can use some &lt;code&gt;deparse()&lt;/code&gt;/&lt;code&gt;substitute()&lt;/code&gt; magic in the function for this. Those two functions will turn bare names into strings within the function rather than having the user pass strings directly.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lm_fun2 = function(response) {
  resp = deparse( substitute( response) )
  lm( reformulate(&amp;quot;am&amp;quot;, response = resp), data = mtcars)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here’s an example of this function in action. Note the use of the bare column name for the response variable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lm_fun2(response = mpg)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# 
# Call:
# lm(formula = reformulate(&amp;quot;am&amp;quot;, response = resp), data = mtcars)
# 
# Coefficients:
# (Intercept)           am  
#      17.147        7.245&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One side effect of using &lt;code&gt;reformulate()&lt;/code&gt; like this is that the formula in the model output shows the formula-building code instead of the actual variables used in the model.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Call:  
lm(formula = reformulate(&amp;quot;am&amp;quot;, response = resp), data = mtcars) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;While this often won’t matter in practice, there are ways to force the model to show the variables used in the model fitting. See &lt;a href=&#34;http://www.win-vector.com/blog/2018/09/r-tip-how-to-pass-a-formula-to-lm/&#34;&gt;this blog post&lt;/a&gt; for some discussion as well as code for how to do this.&lt;/p&gt;
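&lt;p&gt;One such approach, sketched here, is to build the call with &lt;code&gt;do.call()&lt;/code&gt; so the expanded formula is stored in the fitted model. Quoting the dataset name keeps the full data from being deparsed into the call.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;form = reformulate(&amp;quot;am&amp;quot;, response = &amp;quot;mpg&amp;quot;)
fit = do.call(&amp;quot;lm&amp;quot;, list(formula = form, data = quote(mtcars) ) )
fit$call&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# lm(formula = mpg ~ am, data = mtcars)&lt;/code&gt;&lt;/pre&gt;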
&lt;/div&gt;
&lt;div id=&#34;building-a-formula-with-varying-explanatory-variables&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Building a formula with varying explanatory variables&lt;/h1&gt;
&lt;p&gt;The formula building approach can also be used for fitting models where the explanatory variables vary. The explanatory variables should have plus signs between them on the right-hand side of the formula, which we can achieve by passing a vector of character strings to the first argument of &lt;code&gt;reformulate()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;expl = c(&amp;quot;am&amp;quot;, &amp;quot;disp&amp;quot;)
reformulate(expl, response = &amp;quot;mpg&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# mpg ~ am + disp&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s go through an example of using this in a function that can fit a model with different explanatory variables.&lt;/p&gt;
&lt;p&gt;In this function I demonstrate building the formula as a separate step and then passing it to &lt;code&gt;lm()&lt;/code&gt;. Some find this easier to read compared to building the formula within &lt;code&gt;lm()&lt;/code&gt; as a single step like I did earlier.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lm_fun_expl = function(expl) {
  form = reformulate(expl, response = &amp;quot;mpg&amp;quot;)
  lm(form, data = mtcars)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To use the function we pass a vector of variable names as strings to the &lt;code&gt;expl&lt;/code&gt; argument.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lm_fun_expl(expl = c(&amp;quot;am&amp;quot;, &amp;quot;disp&amp;quot;) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# 
# Call:
# lm(formula = form, data = mtcars)
# 
# Coefficients:
# (Intercept)           am         disp  
#    27.84808      1.83346     -0.03685&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;the-dots-for-passing-many-variables-to-a-function&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The dots for passing many variables to a function&lt;/h1&gt;
&lt;p&gt;Using dots (…) instead of named arguments can allow the user to list the explanatory variables separately instead of in a vector.&lt;/p&gt;
&lt;p&gt;I’ll demonstrate a function that uses dots to indicate an undefined number of additional arguments, so the user can put as many explanatory variables as desired into the model. I wrap the dots in &lt;code&gt;c()&lt;/code&gt; within the function to collect the variables into a single vector, which &lt;code&gt;reformulate()&lt;/code&gt; then joins with &lt;code&gt;+&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lm_fun_expl2 = function(...) {
  form = reformulate(c(...), response = &amp;quot;mpg&amp;quot;)
  lm(form, data = mtcars)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now variables are passed individually as strings separated by commas instead of as a vector.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lm_fun_expl2(&amp;quot;am&amp;quot;, &amp;quot;disp&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# 
# Call:
# lm(formula = form, data = mtcars)
# 
# Coefficients:
# (Intercept)           am         disp  
#    27.84808      1.83346     -0.03685&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;example-function-that-returns-residuals-plots-and-model-output&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Example function that returns residuals plots and model output&lt;/h1&gt;
&lt;p&gt;One of the reasons to make a function is to increase efficiency when fitting many models. For example, it might be useful to make a function that returns residual plots and any desired statistical results simultaneously.&lt;/p&gt;
&lt;p&gt;Here’s an example of such a function, using some of the tools covered above. The function takes the response variable as a bare name, fits a model with &lt;code&gt;am&lt;/code&gt; hard-coded as the explanatory variable and the &lt;code&gt;mtcars&lt;/code&gt; dataset, and then makes two residual plots.&lt;/p&gt;
&lt;p&gt;The function outputs a list that contains the two residuals plots as well as the overall &lt;span class=&#34;math inline&#34;&gt;\(F\)&lt;/span&gt; tests from the model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lm_modfit = function(response) {
  resp = deparse( substitute( response) )
  mod = lm( reformulate(&amp;quot;am&amp;quot;, response = resp), data = mtcars)
  resvfit = qplot(x = mod$fit, y = mod$res) + theme_bw()
  resdist = qplot(x = &amp;quot;Residual&amp;quot;, mod$res, geom = &amp;quot;boxplot&amp;quot;) + theme_bw()
  list(resvfit, resdist, anova(mod) )
}

mpgfit = lm_modfit(mpg)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Individual parts of the output list can be extracted as needed. To check model assumptions prior to looking at any results we’d pull out the two plots, which are the first two elements of the output list.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mpgfit[1:2]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [[1]]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-06-24-how-to-make-a-function-for-model-fitting_files/figure-html/fullfunout1-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# 
# [[2]]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-06-24-how-to-make-a-function-for-model-fitting_files/figure-html/fullfunout1-2.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;If we deem the model fit acceptable we can extract the overall &lt;span class=&#34;math inline&#34;&gt;\(F\)&lt;/span&gt; tests from the third element of the output.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mpgfit[[3]]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Analysis of Variance Table
# 
# Response: mpg
#           Df Sum Sq Mean Sq F value   Pr(&amp;gt;F)    
# am         1 405.15  405.15   16.86 0.000285 ***
# Residuals 30 720.90   24.03                     
# ---
# Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1&lt;/code&gt;&lt;/pre&gt;
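&lt;p&gt;Since the &lt;code&gt;anova()&lt;/code&gt; output stored in the list is a data frame, single values can be pulled out by row and column name. Here’s a quick sketch of that idea (the indexing below is illustrative and not part of the original function):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;aovtab = mpgfit[[3]]
aovtab[&amp;quot;am&amp;quot;, &amp;quot;F value&amp;quot;] # F statistic for am
aovtab[&amp;quot;am&amp;quot;, &amp;quot;Pr(&amp;gt;F)&amp;quot;]  # corresponding p-value&lt;/code&gt;&lt;/pre&gt;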
&lt;/div&gt;
&lt;div id=&#34;next-step-looping&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Next step: looping&lt;/h1&gt;
&lt;p&gt;This post focused on using &lt;code&gt;reformulate()&lt;/code&gt; for building model formulas and then making user-defined functions for interactive use. When working with many models we’d likely want to automate the process more by using some sort of looping. I wrote a follow-up post on looping through variables and fitting models with the &lt;code&gt;map&lt;/code&gt; family of functions from package &lt;strong&gt;purrr&lt;/strong&gt;, which you can see &lt;a href=&#34;https://aosmith.rbind.io/2019/07/22/automate-model-fitting-with-loops/&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
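&lt;p&gt;As a small preview of that looping approach, here’s a minimal sketch (the response variables and anonymous helper function are illustrative) that fits the same model for several responses with &lt;code&gt;map()&lt;/code&gt; from &lt;strong&gt;purrr&lt;/strong&gt;. Note it passes the responses as strings, so it uses &lt;code&gt;reformulate()&lt;/code&gt; directly rather than the bare-name version of the function:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(purrr) # for map()

responses = c(&amp;quot;mpg&amp;quot;, &amp;quot;wt&amp;quot;, &amp;quot;hp&amp;quot;)
allfits = map(responses, function(resp) {
     lm( reformulate(&amp;quot;am&amp;quot;, response = resp), data = mtcars)
})
names(allfits) = responses&lt;/code&gt;&lt;/pre&gt;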
&lt;/div&gt;
&lt;div id=&#34;just-the-code-please&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Just the code, please&lt;/h1&gt;
&lt;p&gt;Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code &lt;a href=&#34;https://aosmith.rbind.io/script/2019-06-24-how-to-make-a-function-for-model-fitting.R&#34;&gt;from here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2) # v.3.2.0

reformulate(termlabels = &amp;quot;am&amp;quot;, response = &amp;quot;mpg&amp;quot;)

as.formula( paste(&amp;quot;mpg&amp;quot;, &amp;quot;~ am&amp;quot;) )

lm( reformulate(&amp;quot;am&amp;quot;, response = &amp;quot;mpg&amp;quot;), data = mtcars)

lm_fun = function(response) {
  lm( reformulate(&amp;quot;am&amp;quot;, response = response), data = mtcars)
}

lm_fun(response = &amp;quot;mpg&amp;quot;)
lm_fun(response = &amp;quot;wt&amp;quot;)

lm_fun2 = function(response) {
  resp = deparse( substitute( response) )
  lm( reformulate(&amp;quot;am&amp;quot;, response = resp), data = mtcars)
}

lm_fun2(response = mpg)

expl = c(&amp;quot;am&amp;quot;, &amp;quot;disp&amp;quot;)
reformulate(expl, response = &amp;quot;mpg&amp;quot;)

lm_fun_expl = function(expl) {
  form = reformulate(expl, response = &amp;quot;mpg&amp;quot;)
  lm(form, data = mtcars)
}

lm_fun_expl(expl = c(&amp;quot;am&amp;quot;, &amp;quot;disp&amp;quot;) )

lm_fun_expl2 = function(...) {
  form = reformulate(c(...), response = &amp;quot;mpg&amp;quot;)
  lm(form, data = mtcars)
}

lm_fun_expl2(&amp;quot;am&amp;quot;, &amp;quot;disp&amp;quot;)

lm_modfit = function(response) {
  resp = deparse( substitute( response) )
  mod = lm( reformulate(&amp;quot;am&amp;quot;, response = resp), data = mtcars)
  resvfit = qplot(x = mod$fit, y = mod$res) + theme_bw()
  resdist = qplot(x = &amp;quot;Residual&amp;quot;, mod$res, geom = &amp;quot;boxplot&amp;quot;) + theme_bw()
  list(resvfit, resdist, anova(mod) )
}

mpgfit = lm_modfit(mpg)
mpgfit[1:2]
mpgfit[[3]]&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>The small multiples plot: how to combine ggplot2 plots with one shared axis</title>
      <link>https://aosmith.rbind.io/2019/05/13/small-multiples-plot/</link>
      <pubDate>Mon, 13 May 2019 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2019/05/13/small-multiples-plot/</guid>
      <description>


&lt;p&gt;There are a variety of ways to combine &lt;strong&gt;ggplot2&lt;/strong&gt; plots with a single shared axis. However, things can get tricky if you want a lot of control over all plot elements.&lt;/p&gt;
&lt;p&gt;I demonstrate four different approaches for this:&lt;br /&gt;
1. Using facets, which is built in to &lt;strong&gt;ggplot2&lt;/strong&gt; but doesn’t allow much control over the non-shared axes.&lt;br /&gt;
2. Using package &lt;strong&gt;cowplot&lt;/strong&gt;, which has a lot of nice features but the plot spacing doesn’t play well with a single shared axis.&lt;br /&gt;
3. Using package &lt;strong&gt;egg&lt;/strong&gt;.&lt;br /&gt;
4. Using package &lt;strong&gt;patchwork&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The last two packages allow nice spacing for plots with a shared axis.&lt;/p&gt;
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#load-r-packages&#34;&gt;Load R packages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-set-up&#34;&gt;The set-up&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-facets-for-small-multiples&#34;&gt;Using facets for small multiples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-cowplot-to-combine-plots&#34;&gt;Using &lt;strong&gt;cowplot&lt;/strong&gt; to combine plots&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-egg-to-combine-plots&#34;&gt;Using &lt;strong&gt;egg&lt;/strong&gt; to combine plots&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#adding-plot-labels-with-tag_facet&#34;&gt;Adding plot labels with tag_facet()&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-patchwork-to-combine-plots&#34;&gt;Using &lt;strong&gt;patchwork&lt;/strong&gt; to combine plots&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#adding-plots-labels-with-plot_annotation&#34;&gt;Adding plot labels with plot_annotation()&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#just-the-code-please&#34;&gt;Just the code, please&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;load-r-packages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Load R packages&lt;/h1&gt;
&lt;p&gt;I’ll be plotting with &lt;strong&gt;ggplot2&lt;/strong&gt;, reshaping with &lt;strong&gt;tidyr&lt;/strong&gt;, and combining plots with packages &lt;strong&gt;egg&lt;/strong&gt; and &lt;strong&gt;patchwork&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;I’ll also be using package &lt;strong&gt;cowplot&lt;/strong&gt; version 0.9.4 to combine individual plots into one, but will use the package functions via &lt;code&gt;cowplot::&lt;/code&gt; instead of loading the package. (I believe the next version of &lt;strong&gt;cowplot&lt;/strong&gt; will not be so opinionated about the theme.)&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2) # v. 3.1.1
library(tidyr) # v. 0.8.3
library(egg) # v. 0.4.2
library(patchwork) # v. 1.0.0&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;the-set-up&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The set-up&lt;/h1&gt;
&lt;p&gt;Here’s the scenario: we have one response variable (&lt;code&gt;resp&lt;/code&gt;) that we want to plot against three other variables and combine them into a single “small multiples” plot.&lt;/p&gt;
&lt;p&gt;I’ll call the three variables &lt;code&gt;elev&lt;/code&gt;, &lt;code&gt;grad&lt;/code&gt;, and &lt;code&gt;slp&lt;/code&gt;. You’ll note that I created these variables to have very different scales.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(16)
dat = data.frame(elev = round( runif(20, 100, 500), 1),
                 resp = round( runif(20, 0, 10), 1),
                 grad = round( runif(20, 0, 1), 2),
                 slp = round( runif(20, 0, 35),1) )
head(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#    elev resp grad  slp
# 1 373.2  9.7 0.05  8.8
# 2 197.6  8.1 0.42 33.3
# 3 280.0  5.4 0.38 19.3
# 4 191.8  4.3 0.07 29.6
# 5 445.4  2.3 0.43 16.5
# 6 224.5  6.5 0.78  4.1&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;using-facets-for-small-multiples&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using facets for small multiples&lt;/h1&gt;
&lt;p&gt;One good option when we want to make a similar plot for different groups (in this case, different variables) is to use &lt;em&gt;faceting&lt;/em&gt; to make different panels within the same plot.&lt;/p&gt;
&lt;p&gt;Since the three variables are currently in separate columns we’ll need to &lt;em&gt;reshape&lt;/em&gt; the dataset prior to plotting. I’ll use &lt;code&gt;gather()&lt;/code&gt; from &lt;strong&gt;tidyr&lt;/strong&gt; for this.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;datlong = gather(dat, key = &amp;quot;variable&amp;quot;, value = &amp;quot;value&amp;quot;, -resp)
head(datlong)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#   resp variable value
# 1  9.7     elev 373.2
# 2  8.1     elev 197.6
# 3  5.4     elev 280.0
# 4  4.3     elev 191.8
# 5  2.3     elev 445.4
# 6  6.5     elev 224.5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now I can use &lt;code&gt;facet_wrap()&lt;/code&gt; to make a separate scatterplot of &lt;code&gt;resp&lt;/code&gt; vs each variable. The argument &lt;code&gt;scales = &#34;free_x&#34;&lt;/code&gt; allows the x axis scales to differ for each variable but leaves a single y axis.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(datlong, aes(x = value, y = resp) ) +
     geom_point() +
     theme_bw() +
     facet_wrap(~variable, scales = &amp;quot;free_x&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-05-13-small-multiples-plot_files/figure-html/unnamed-chunk-4-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;I can use the facet strips to give the appearance of axis labels, as shown in &lt;a href=&#34;https://stackoverflow.com/a/37574221/2461552&#34;&gt;this Stack Overflow answer&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(datlong, aes(x = value, y = resp) ) +
     geom_point() +
     theme_bw() +
     facet_wrap(~variable, scales = &amp;quot;free_x&amp;quot;, strip.position = &amp;quot;bottom&amp;quot;) +
     theme(strip.background = element_blank(),
           strip.placement = &amp;quot;outside&amp;quot;) +
     labs(x = NULL)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-05-13-small-multiples-plot_files/figure-html/unnamed-chunk-5-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;That’s a pretty nice plot to start with. However, controlling the axis breaks in the individual panels &lt;a href=&#34;https://stackoverflow.com/questions/51735481/ggplot2-change-axis-limits-for-each-individual-facet-panel&#34;&gt;can be complicated&lt;/a&gt;, which is something we’d commonly want to do.&lt;/p&gt;
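&lt;p&gt;One partial workaround, sketched here as an aside (it isn’t from the original code), is to pass a function to &lt;code&gt;breaks&lt;/code&gt;. With free scales the function is applied to each panel’s limits separately, so every panel gets a similar number of breaks without hard-coding them:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(datlong, aes(x = value, y = resp) ) +
     geom_point() +
     theme_bw() +
     facet_wrap(~variable, scales = &amp;quot;free_x&amp;quot;) +
     scale_x_continuous(breaks = function(limits) pretty(limits, n = 4) )&lt;/code&gt;&lt;/pre&gt;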
&lt;p&gt;In that case, it may make more sense to create separate plots and then combine them into a small multiples plot with an add-on package.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;using-cowplot-to-combine-plots&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using &lt;strong&gt;cowplot&lt;/strong&gt; to combine plots&lt;/h1&gt;
&lt;p&gt;Package &lt;strong&gt;cowplot&lt;/strong&gt; is a really nice package for combining plots, and has lots of bells and whistles along with some pretty thorough &lt;a href=&#34;https://cran.r-project.org/web/packages/cowplot/index.html&#34;&gt;vignettes&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The first step is to make each of the three plots separately. If doing lots of these we’d want to use some sort of loop to make a list of plots &lt;a href=&#34;https://aosmith.rbind.io/2018/08/20/automating-exploratory-plots/&#34;&gt;as I’ve demonstrated previously&lt;/a&gt;. Today I’m going to make the three plots manually.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;elevplot = ggplot(dat, aes(x = elev, y = resp) ) +
     geom_point() +
     theme_bw()

gradplot = ggplot(dat, aes(x = grad, y = resp) ) +
     geom_point() +
     theme_bw() +
     scale_x_continuous(breaks = seq(0, 1, by = 0.2) )

slpplot = ggplot(dat, aes(x = slp, y = resp) ) +
     geom_point() +
     theme_bw() +
     scale_x_continuous(breaks = seq(0, 35, by = 5) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The function &lt;code&gt;plot_grid()&lt;/code&gt; in &lt;strong&gt;cowplot&lt;/strong&gt; is for combining plots. To make a single row of plots I use &lt;code&gt;nrow = 1&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;labels&lt;/code&gt; argument puts separate labels on each panel for captioning.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cowplot::plot_grid(elevplot, 
                   gradplot, 
                   slpplot,
                   nrow = 1,
                   labels = &amp;quot;auto&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-05-13-small-multiples-plot_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;But we want a single shared y axis, not a separate y axis on each plot. I’ll remake the combined plot, this time removing the y axis elements from all but the first plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cowplot::plot_grid(elevplot, 
                   gradplot + theme(axis.text.y = element_blank(),
                                    axis.ticks.y = element_blank(),
                                    axis.title.y = element_blank() ), 
                   slpplot + theme(axis.text.y = element_blank(),
                                    axis.ticks.y = element_blank(),
                                    axis.title.y = element_blank() ),
                   nrow = 1,
                   labels = &amp;quot;auto&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-05-13-small-multiples-plot_files/figure-html/unnamed-chunk-8-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;This makes the panels different sizes, though, which isn’t ideal. To have all the plots the same width I need to align them vertically with &lt;code&gt;align = &#34;v&#34;&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cowplot::plot_grid(elevplot, 
                   gradplot + 
                        theme(axis.text.y = element_blank(),
                              axis.ticks.y = element_blank(),
                              axis.title.y = element_blank() ), 
                   slpplot + 
                        theme(axis.text.y = element_blank(),
                              axis.ticks.y = element_blank(),
                              axis.title.y = element_blank() ),
                   nrow = 1,
                   labels = &amp;quot;auto&amp;quot;,
                   align = &amp;quot;v&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-05-13-small-multiples-plot_files/figure-html/unnamed-chunk-9-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;But, unfortunately, this puts the axis space back between the plots to make them all the same width. It turns out that &lt;strong&gt;cowplot&lt;/strong&gt; isn’t really made for plots with a single shared axis. The &lt;strong&gt;cowplot&lt;/strong&gt; package author points us to package &lt;strong&gt;egg&lt;/strong&gt; for this &lt;a href=&#34;https://stackoverflow.com/a/47615304/2461552&#34;&gt;in this Stack Overflow answer&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;using-egg-to-combine-plots&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using &lt;strong&gt;egg&lt;/strong&gt; to combine plots&lt;/h1&gt;
&lt;p&gt;Package &lt;strong&gt;egg&lt;/strong&gt; is another nice alternative for combining plots into a small multiples plot. The function in this package for combining plots is called &lt;code&gt;ggarrange()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Here are the three plots again.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;elevplot = ggplot(dat, aes(x = elev, y = resp) ) +
     geom_point() +
     theme_bw()

gradplot = ggplot(dat, aes(x = grad, y = resp) ) +
     geom_point() +
     theme_bw() +
     scale_x_continuous(breaks = seq(0, 1, by = 0.2) )

slpplot = ggplot(dat, aes(x = slp, y = resp) ) +
     geom_point() +
     theme_bw() +
     scale_x_continuous(breaks = seq(0, 35, by = 5) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;ggarrange()&lt;/code&gt; function has an &lt;code&gt;nrow&lt;/code&gt; argument so I can keep the plots in a single row.&lt;/p&gt;
&lt;p&gt;The panel spacing is automagically the same here after I remove the y axis elements, and things look pretty nice right out of the box.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggarrange(elevplot, 
          gradplot + 
               theme(axis.text.y = element_blank(),
                     axis.ticks.y = element_blank(),
                     axis.title.y = element_blank() ), 
          slpplot + 
               theme(axis.text.y = element_blank(),
                     axis.ticks.y = element_blank(),
                     axis.title.y = element_blank() ),
          nrow = 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-05-13-small-multiples-plot_files/figure-html/unnamed-chunk-11-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can bring the panels closer together by removing some of the space around the plot margins with the &lt;code&gt;plot.margin&lt;/code&gt; argument in &lt;code&gt;theme()&lt;/code&gt;. I’ll set the spacing for the right margin of the first plot, both the left and right margins of the second, and the left margin of the third.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggarrange(elevplot +
               theme(axis.ticks.y = element_blank(),
                     plot.margin = margin(r = 1) ), 
          gradplot + 
               theme(axis.text.y = element_blank(),
                     axis.ticks.y = element_blank(),
                     axis.title.y = element_blank(),
                     plot.margin = margin(r = 1, l = 1) ), 
          slpplot + 
               theme(axis.text.y = element_blank(),
                     axis.ticks.y = element_blank(),
                     axis.title.y = element_blank(),
                     plot.margin = margin(l = 1)  ),
          nrow = 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-05-13-small-multiples-plot_files/figure-html/unnamed-chunk-12-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;adding-plot-labels-with-tag_facet&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Adding plot labels with tag_facet()&lt;/h2&gt;
&lt;p&gt;You’ll see there is a &lt;code&gt;labels&lt;/code&gt; argument in the &lt;code&gt;ggarrange()&lt;/code&gt; documentation, but it didn’t work well for me out of the box when only one plot has a y axis. However, we can get tricky with &lt;code&gt;egg::tag_facet()&lt;/code&gt; if we add a facet strip to each of the individual plots.&lt;/p&gt;
&lt;p&gt;It’d make sense to build these plots outside of &lt;code&gt;ggarrange()&lt;/code&gt; and then add the tags and combine them instead of nesting everything like I did here, since the code is now a little hard to follow.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggarrange(tag_facet(elevplot +
                          theme(axis.ticks.y = element_blank(),
                                plot.margin = margin(r = 1) ) +
                          facet_wrap(~&amp;quot;elev&amp;quot;),
                     tag_pool = &amp;quot;a&amp;quot;), 
          tag_facet(gradplot + 
                          theme(axis.text.y = element_blank(),
                                axis.ticks.y = element_blank(),
                                axis.title.y = element_blank(),
                                plot.margin = margin(r = 1, l = 1) ) +
                          facet_wrap(~&amp;quot;grad&amp;quot;), 
                    tag_pool = &amp;quot;b&amp;quot; ), 
          tag_facet(slpplot + 
                          theme(axis.text.y = element_blank(),
                                axis.ticks.y = element_blank(),
                                axis.title.y = element_blank(),
                                plot.margin = margin(l = 1)  ) +
                          facet_wrap(~&amp;quot;slp&amp;quot;),
                     tag_pool = &amp;quot;c&amp;quot;),
          nrow = 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-05-13-small-multiples-plot_files/figure-html/unnamed-chunk-13-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We might want to add a right y axis to the right-most plot. In that case we’d want to set the axis tick length to 0 via &lt;code&gt;theme()&lt;/code&gt; elements. This can be done &lt;a href=&#34;https://github.com/tidyverse/ggplot2/pull/2934&#34;&gt;separately per axis in the development version of &lt;strong&gt;ggplot2&lt;/strong&gt;&lt;/a&gt;, and will be included in version 3.2.0.&lt;/p&gt;
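&lt;p&gt;Here’s a rough sketch of that idea (not code from the original post): add a duplicated right y axis with &lt;code&gt;dup_axis()&lt;/code&gt;, suppress its title and labels, and then zero out the right-side tick length with the per-axis &lt;code&gt;theme()&lt;/code&gt; element available as of &lt;strong&gt;ggplot2&lt;/strong&gt; 3.2.0:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;slpplot +
     scale_y_continuous(sec.axis = dup_axis(name = NULL, labels = NULL) ) +
     theme(axis.ticks.length.y.right = grid::unit(0, &amp;quot;pt&amp;quot;) )&lt;/code&gt;&lt;/pre&gt;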
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;using-patchwork-to-combine-plots&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using &lt;strong&gt;patchwork&lt;/strong&gt; to combine plots&lt;/h1&gt;
&lt;p&gt;Package &lt;strong&gt;patchwork&lt;/strong&gt; is another great option for combining plots, and is now on CRAN (as of December 2019) 🎉. It has nice vignettes &lt;a href=&#34;https://patchwork.data-imaginist.com/articles/&#34;&gt;here&lt;/a&gt; to help you get started.&lt;/p&gt;
&lt;p&gt;In &lt;strong&gt;patchwork&lt;/strong&gt; the &lt;code&gt;+&lt;/code&gt; operator is used to add plots together. Here’s an example, combining my three original plots.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;elevplot + gradplot + slpplot&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-05-13-small-multiples-plot_files/figure-html/unnamed-chunk-14-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;If we give the resulting combined plot a name, we can remove the titles from the last two subplots using double-bracket indexing. (Of course, I also could have built the plots how I wanted them in the first place. 😜)&lt;/p&gt;
&lt;p&gt;The result has nice spacing for a single, shared y axis. Margins can be controlled the same way as in the &lt;strong&gt;egg&lt;/strong&gt; example above.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;patchwork = elevplot + gradplot + slpplot

# Remove title from second subplot
patchwork[[2]] = patchwork[[2]] + theme(axis.text.y = element_blank(),
                                        axis.ticks.y = element_blank(),
                                        axis.title.y = element_blank() )

# Remove title from third subplot
patchwork[[3]] = patchwork[[3]] + theme(axis.text.y = element_blank(),
                                        axis.ticks.y = element_blank(),
                                        axis.title.y = element_blank() )

patchwork&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-05-13-small-multiples-plot_files/figure-html/unnamed-chunk-15-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;adding-plots-labels-with-plot_annotation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Adding plot labels with plot_annotation()&lt;/h2&gt;
&lt;p&gt;There are many annotation options in &lt;strong&gt;patchwork&lt;/strong&gt;. I’ll focus on adding tags, but see the &lt;a href=&#34;https://patchwork.data-imaginist.com/articles/guides/annotation.html&#34;&gt;annotation vignette&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Tags can be added with the &lt;code&gt;tag_levels&lt;/code&gt; argument in &lt;code&gt;plot_annotation()&lt;/code&gt;. I want lowercase Latin letters, so I use &lt;code&gt;&#34;a&#34;&lt;/code&gt; as my &lt;code&gt;tag_levels&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Plot tags go outside the plot by default. You can control the position at least somewhat with the &lt;code&gt;theme&lt;/code&gt; option &lt;code&gt;plot.tag.position&lt;/code&gt;, which works on the individual subplots and not the entire combined plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;patchwork + plot_annotation(tag_levels = &amp;quot;a&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-05-13-small-multiples-plot_files/figure-html/unnamed-chunk-16-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
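&lt;p&gt;For example, here’s a hedged sketch (the coordinates are illustrative) that uses the &lt;strong&gt;patchwork&lt;/strong&gt; &lt;code&gt;&amp;amp;&lt;/code&gt; operator to set &lt;code&gt;plot.tag.position&lt;/code&gt; on every subplot at once, nudging the tags inside the panels:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;patchwork2 = patchwork &amp;amp; theme(plot.tag.position = c(0.2, 0.95) )
patchwork2 + plot_annotation(tag_levels = &amp;quot;a&amp;quot;)&lt;/code&gt;&lt;/pre&gt;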
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;just-the-code-please&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Just the code, please&lt;/h1&gt;
&lt;p&gt;Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code &lt;a href=&#34;https://aosmith.rbind.io/script/2019-05-13-small-multiples-plot.R&#34;&gt;from here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2) # v. 3.1.1
library(tidyr) # v. 0.8.3
library(egg) # v. 0.4.2
library(patchwork) # v. 1.0.0

set.seed(16)
dat = data.frame(elev = round( runif(20, 100, 500), 1),
                 resp = round( runif(20, 0, 10), 1),
                 grad = round( runif(20, 0, 1), 2),
                 slp = round( runif(20, 0, 35),1) )
head(dat)

datlong = gather(dat, key = &amp;quot;variable&amp;quot;, value = &amp;quot;value&amp;quot;, -resp)
head(datlong)

ggplot(datlong, aes(x = value, y = resp) ) +
     geom_point() +
     theme_bw() +
     facet_wrap(~variable, scales = &amp;quot;free_x&amp;quot;)

ggplot(datlong, aes(x = value, y = resp) ) +
     geom_point() +
     theme_bw() +
     facet_wrap(~variable, scales = &amp;quot;free_x&amp;quot;, strip.position = &amp;quot;bottom&amp;quot;) +
     theme(strip.background = element_blank(),
           strip.placement = &amp;quot;outside&amp;quot;) +
     labs(x = NULL)

elevplot = ggplot(dat, aes(x = elev, y = resp) ) +
     geom_point() +
     theme_bw()

gradplot = ggplot(dat, aes(x = grad, y = resp) ) +
     geom_point() +
     theme_bw() +
     scale_x_continuous(breaks = seq(0, 1, by = 0.2) )

slpplot = ggplot(dat, aes(x = slp, y = resp) ) +
     geom_point() +
     theme_bw() +
     scale_x_continuous(breaks = seq(0, 35, by = 5) )

cowplot::plot_grid(elevplot, 
                   gradplot, 
                   slpplot,
                   nrow = 1,
                   labels = &amp;quot;auto&amp;quot;)

cowplot::plot_grid(elevplot, 
                   gradplot + theme(axis.text.y = element_blank(),
                                    axis.ticks.y = element_blank(),
                                    axis.title.y = element_blank() ), 
                   slpplot + theme(axis.text.y = element_blank(),
                                    axis.ticks.y = element_blank(),
                                    axis.title.y = element_blank() ),
                   nrow = 1,
                   labels = &amp;quot;auto&amp;quot;)

cowplot::plot_grid(elevplot, 
                   gradplot + 
                        theme(axis.text.y = element_blank(),
                              axis.ticks.y = element_blank(),
                              axis.title.y = element_blank() ), 
                   slpplot + 
                        theme(axis.text.y = element_blank(),
                              axis.ticks.y = element_blank(),
                              axis.title.y = element_blank() ),
                   nrow = 1,
                   labels = &amp;quot;auto&amp;quot;,
                   align = &amp;quot;v&amp;quot;)

elevplot = ggplot(dat, aes(x = elev, y = resp) ) +
     geom_point() +
     theme_bw()

gradplot = ggplot(dat, aes(x = grad, y = resp) ) +
     geom_point() +
     theme_bw() +
     scale_x_continuous(breaks = seq(0, 1, by = 0.2) )

slpplot = ggplot(dat, aes(x = slp, y = resp) ) +
     geom_point() +
     theme_bw() +
     scale_x_continuous(breaks = seq(0, 35, by = 5) )

ggarrange(elevplot, 
          gradplot + 
               theme(axis.text.y = element_blank(),
                     axis.ticks.y = element_blank(),
                     axis.title.y = element_blank() ), 
          slpplot + 
               theme(axis.text.y = element_blank(),
                     axis.ticks.y = element_blank(),
                     axis.title.y = element_blank() ),
          nrow = 1)

ggarrange(elevplot +
               theme(axis.ticks.y = element_blank(),
                     plot.margin = margin(r = 1) ), 
          gradplot + 
               theme(axis.text.y = element_blank(),
                     axis.ticks.y = element_blank(),
                     axis.title.y = element_blank(),
                     plot.margin = margin(r = 1, l = 1) ), 
          slpplot + 
               theme(axis.text.y = element_blank(),
                     axis.ticks.y = element_blank(),
                     axis.title.y = element_blank(),
                     plot.margin = margin(l = 1)  ),
          nrow = 1)

ggarrange(tag_facet(elevplot +
                          theme(axis.ticks.y = element_blank(),
                                plot.margin = margin(r = 1) ) +
                          facet_wrap(~&amp;quot;elev&amp;quot;),
                     tag_pool = &amp;quot;a&amp;quot;), 
          tag_facet(gradplot + 
                          theme(axis.text.y = element_blank(),
                                axis.ticks.y = element_blank(),
                                axis.title.y = element_blank(),
                                plot.margin = margin(r = 1, l = 1) ) +
                          facet_wrap(~&amp;quot;grad&amp;quot;), 
                    tag_pool = &amp;quot;b&amp;quot; ), 
          tag_facet(slpplot + 
                          theme(axis.text.y = element_blank(),
                                axis.ticks.y = element_blank(),
                                axis.title.y = element_blank(),
                                plot.margin = margin(l = 1)  ) +
                          facet_wrap(~&amp;quot;slp&amp;quot;),
                     tag_pool = &amp;quot;c&amp;quot;),
          nrow = 1)

elevplot + gradplot + slpplot

patchwork = elevplot + gradplot + slpplot

# Remove title from second subplot
patchwork[[2]] = patchwork[[2]] + theme(axis.text.y = element_blank(),
                                        axis.ticks.y = element_blank(),
                                        axis.title.y = element_blank() )

# Remove title from third subplot
patchwork[[3]] = patchwork[[3]] + theme(axis.text.y = element_blank(),
                                        axis.ticks.y = element_blank(),
                                        axis.title.y = element_blank() )

patchwork

patchwork + plot_annotation(tag_levels = &amp;quot;a&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Embedding subplots in ggplot2 graphics</title>
      <link>https://aosmith.rbind.io/2019/04/22/embedding-subplots/</link>
      <pubDate>Mon, 22 Apr 2019 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2019/04/22/embedding-subplots/</guid>
      <description>


&lt;p&gt;The idea of embedded plots for visualizing a large dataset that has an overplotting problem recently came up in some discussions with students. I first learned about embedded graphics from package &lt;strong&gt;ggsubplot&lt;/strong&gt;. You can still see &lt;a href=&#34;https://blog.revolutionanalytics.com/2012/09/visualize-complex-data-with-subplots.html&#34;&gt;an old post&lt;/a&gt; about that package and about embedded graphics in general, with examples. However, &lt;strong&gt;ggsubplot&lt;/strong&gt; is no longer maintained and doesn’t work with current versions of &lt;strong&gt;ggplot2&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;I poked around a bit, and found that &lt;code&gt;annotation_custom()&lt;/code&gt; is the go-to function for embedding plots in a &lt;strong&gt;ggplot2&lt;/strong&gt; graphic. I found a couple of recent examples for how to tackle making such plots on Stack Overflow &lt;a href=&#34;https://stackoverflow.com/a/44125392/2461552&#34;&gt;here&lt;/a&gt; and &lt;a href=&#34;https://stackoverflow.com/a/45417727/2461552&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I’m going to work through an example of embedding subplots using the same kind of looping approach outlined in those answers.&lt;/p&gt;
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#r-packages&#34;&gt;R packages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#cutting-continuous-variables-into-evenly-spaced-categories&#34;&gt;Cutting continuous variables into evenly-spaced categories&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#categorizing-the-axis-variables&#34;&gt;Categorizing the axis variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#extracting-the-coordinates-for-each-subplot&#34;&gt;Extracting the coordinates for each subplot&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#bar-plot-subplots&#34;&gt;Bar plot subplots&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#creating-all-the-bar-plot-subplots&#34;&gt;Creating all the bar plot subplots&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#putting-the-subplots-into-annotation_custom&#34;&gt;Putting the subplots into annotation_custom()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#making-the-large-plot&#34;&gt;Making the large plot&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#embedding-the-bar-plot-subplots&#34;&gt;Embedding the bar plot subplots&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#histogram-subplots&#34;&gt;Histogram subplots&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#get-the-histograms-ready-to-embed&#34;&gt;Get the histograms ready to embed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#embed-the-histogram-subplots&#34;&gt;Embed the histogram subplots&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#a-density-subplot-example&#34;&gt;A density subplot example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#filled-density-plots&#34;&gt;Filled density plots&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#just-the-code-please&#34;&gt;Just the code, please&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;r-packages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;R packages&lt;/h1&gt;
&lt;p&gt;First I’ll load the R packages I’m using today. All plotting is done via &lt;strong&gt;ggplot2&lt;/strong&gt;, I do data manipulation with &lt;strong&gt;dplyr&lt;/strong&gt; and &lt;strong&gt;tidyr&lt;/strong&gt;, and &lt;strong&gt;purrr&lt;/strong&gt; is for looping to make the subplots and then for getting the subplots into &lt;code&gt;annotation_custom()&lt;/code&gt; layers.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2) # 3.1.1
suppressPackageStartupMessages( library(dplyr) ) # 0.8.0.1
library(tidyr) # 0.8.3
library(purrr) # 0.3.2&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;cutting-continuous-variables-into-evenly-spaced-categories&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Cutting continuous variables into evenly-spaced categories&lt;/h1&gt;
&lt;p&gt;The first step in this process is to bin the continuous variables that will be on the axes of the larger plot, creating a separate dataset for each subplot.&lt;/p&gt;
&lt;p&gt;I thought it made the most sense to make all the subplots the same size in the final plot and so I wanted to make evenly sized &lt;em&gt;bins&lt;/em&gt; or groups. The range of values in each bin can then be based on the total range of the variable of interest and the desired number of groups.&lt;/p&gt;
&lt;p&gt;Binning into even-length groups is a job for &lt;code&gt;cut()&lt;/code&gt;. I’m going to need the minimum and maximum value of each group to place the subplots along the axes of the larger plot, so rather than using &lt;code&gt;cut()&lt;/code&gt; directly I made a function built around it. While information on the range of values encompassed by a group can be pulled from the default &lt;code&gt;cut()&lt;/code&gt; bin labels, I didn’t like how &lt;code&gt;cut()&lt;/code&gt; rounded those values.&lt;/p&gt;
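&lt;p&gt;For comparison, here is a quick look at the default &lt;code&gt;cut()&lt;/code&gt; labels, which round the break values and pad the outer breaks slightly. (This is only an illustration of the rounding issue, not part of the workflow.)&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Default cut() labels round the break values,
# which is why cuteven() builds its own labels
levels( cut(iris$Sepal.Length, breaks = 3) )&lt;/code&gt;&lt;/pre&gt;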
&lt;p&gt;My function, &lt;code&gt;cuteven()&lt;/code&gt;, takes a continuous variable and returns a variable cut into &lt;code&gt;ngroups&lt;/code&gt; bins. The labels for the new groups are the unrounded minimum and maximum value within each group, with the values separated by commas.&lt;/p&gt;
&lt;p&gt;I use &lt;code&gt;include.lowest = TRUE&lt;/code&gt; in &lt;code&gt;cut()&lt;/code&gt; to make sure the minimum value in the dataset is included in the first group.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cuteven = function(variable, ngroups) {
     seq_all = seq(min(variable), max(variable), length.out = ngroups + 1)
     cut(variable,
         breaks = seq_all,
         labels = paste(seq_all[-(ngroups + 1)], seq_all[-1], sep = &amp;quot;,&amp;quot;),
         include.lowest = TRUE)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I’ll test the function by cutting &lt;code&gt;Sepal.Length&lt;/code&gt; from the &lt;code&gt;iris&lt;/code&gt; dataset into 3 groups. The resulting categorical variable has three levels, each labeled with the minimum and maximum value of its group.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;with(iris, cuteven(Sepal.Length, ngroups = 3) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#   [1] 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5
#   [9] 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 5.5,6.7 5.5,6.7
#  [17] 4.3,5.5 4.3,5.5 5.5,6.7 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5
#  [25] 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5
#  [33] 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5
#  [41] 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5 4.3,5.5
#  [49] 4.3,5.5 4.3,5.5 6.7,7.9 5.5,6.7 6.7,7.9 4.3,5.5 5.5,6.7 5.5,6.7
#  [57] 5.5,6.7 4.3,5.5 5.5,6.7 4.3,5.5 4.3,5.5 5.5,6.7 5.5,6.7 5.5,6.7
#  [65] 5.5,6.7 5.5,6.7 5.5,6.7 5.5,6.7 5.5,6.7 5.5,6.7 5.5,6.7 5.5,6.7
#  [73] 5.5,6.7 5.5,6.7 5.5,6.7 5.5,6.7 6.7,7.9 5.5,6.7 5.5,6.7 5.5,6.7
#  [81] 4.3,5.5 4.3,5.5 5.5,6.7 5.5,6.7 4.3,5.5 5.5,6.7 5.5,6.7 5.5,6.7
#  [89] 5.5,6.7 4.3,5.5 4.3,5.5 5.5,6.7 5.5,6.7 4.3,5.5 5.5,6.7 5.5,6.7
#  [97] 5.5,6.7 5.5,6.7 4.3,5.5 5.5,6.7 5.5,6.7 5.5,6.7 6.7,7.9 5.5,6.7
# [105] 5.5,6.7 6.7,7.9 4.3,5.5 6.7,7.9 5.5,6.7 6.7,7.9 5.5,6.7 5.5,6.7
# [113] 6.7,7.9 5.5,6.7 5.5,6.7 5.5,6.7 5.5,6.7 6.7,7.9 6.7,7.9 5.5,6.7
# [121] 6.7,7.9 5.5,6.7 6.7,7.9 5.5,6.7 5.5,6.7 6.7,7.9 5.5,6.7 5.5,6.7
# [129] 5.5,6.7 6.7,7.9 6.7,7.9 6.7,7.9 5.5,6.7 5.5,6.7 5.5,6.7 6.7,7.9
# [137] 5.5,6.7 5.5,6.7 5.5,6.7 6.7,7.9 5.5,6.7 6.7,7.9 5.5,6.7 6.7,7.9
# [145] 5.5,6.7 5.5,6.7 5.5,6.7 5.5,6.7 5.5,6.7 5.5,6.7
# Levels: 4.3,5.5 5.5,6.7 6.7,7.9&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;categorizing-the-axis-variables&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Categorizing the axis variables&lt;/h1&gt;
&lt;p&gt;While these embedded plots can be useful for large datasets, I’m going to demonstrate the process on a relatively small dataset. Here I will embed subplots on a larger plot based on the &lt;code&gt;iris&lt;/code&gt; data. The variable &lt;code&gt;Sepal.Length&lt;/code&gt; will be on the x axis and &lt;code&gt;Petal.Length&lt;/code&gt; on the y axis.&lt;/p&gt;
&lt;p&gt;My first step is to categorize those variables with &lt;code&gt;cuteven()&lt;/code&gt;. I’m going to make three groups for &lt;code&gt;Sepal.Length&lt;/code&gt; and four groups for &lt;code&gt;Petal.Length&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I cut both variables within &lt;code&gt;mutate()&lt;/code&gt; and add them to &lt;code&gt;iris&lt;/code&gt;. I give the new variables generic names that indicate which variable is on the &lt;code&gt;x&lt;/code&gt; axis and which on the &lt;code&gt;y&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;iris = mutate(iris,
                group_x = cuteven(Sepal.Length, 3),
                group_y = cuteven(Petal.Length, 4) )

glimpse(iris)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Observations: 150
# Variables: 7
# $ Sepal.Length &amp;lt;dbl&amp;gt; 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,...
# $ Sepal.Width  &amp;lt;dbl&amp;gt; 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,...
# $ Petal.Length &amp;lt;dbl&amp;gt; 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,...
# $ Petal.Width  &amp;lt;dbl&amp;gt; 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,...
# $ Species      &amp;lt;fct&amp;gt; setosa, setosa, setosa, setosa, setosa, setosa, s...
# $ group_x      &amp;lt;fct&amp;gt; &amp;quot;4.3,5.5&amp;quot;, &amp;quot;4.3,5.5&amp;quot;, &amp;quot;4.3,5.5&amp;quot;, &amp;quot;4.3,5.5&amp;quot;, &amp;quot;4.3,...
# $ group_y      &amp;lt;fct&amp;gt; &amp;quot;1,2.475&amp;quot;, &amp;quot;1,2.475&amp;quot;, &amp;quot;1,2.475&amp;quot;, &amp;quot;1,2.475&amp;quot;, &amp;quot;1,2....&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;extracting-the-coordinates-for-each-subplot&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Extracting the coordinates for each subplot&lt;/h1&gt;
&lt;p&gt;I need the minimum and maximum value per group for each axis variable in order to place the subplot in the larger plot with &lt;code&gt;annotation_custom()&lt;/code&gt;. Since the labels of the new variables contain this info separated by a comma, I can use &lt;code&gt;separate()&lt;/code&gt; to extract the coordinate information from the labels into separate columns.&lt;/p&gt;
&lt;p&gt;Since I have two group variables, one for each axis, I end up using &lt;code&gt;separate()&lt;/code&gt; twice. I again make the names of the new columns in &lt;code&gt;into&lt;/code&gt; based on which axis I’ll be plotting that variable on.&lt;/p&gt;
&lt;p&gt;When this step is complete I’ll have coordinates to indicate where each corner of a subplot will be placed within the larger plot. Unique combinations of the four coordinate variables define each group I want to make a subplot for.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;iris = iris %&amp;gt;%
     separate(group_x, into = c(&amp;quot;min_x&amp;quot;, &amp;quot;max_x&amp;quot;), 
              sep = &amp;quot;,&amp;quot;, convert = TRUE) %&amp;gt;%
     separate(group_y, into = c(&amp;quot;min_y&amp;quot;, &amp;quot;max_y&amp;quot;), 
              sep = &amp;quot;,&amp;quot;, convert = TRUE)

glimpse(iris)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Observations: 150
# Variables: 9
# $ Sepal.Length &amp;lt;dbl&amp;gt; 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,...
# $ Sepal.Width  &amp;lt;dbl&amp;gt; 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,...
# $ Petal.Length &amp;lt;dbl&amp;gt; 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,...
# $ Petal.Width  &amp;lt;dbl&amp;gt; 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,...
# $ Species      &amp;lt;fct&amp;gt; setosa, setosa, setosa, setosa, setosa, setosa, s...
# $ min_x        &amp;lt;dbl&amp;gt; 4.3, 4.3, 4.3, 4.3, 4.3, 4.3, 4.3, 4.3, 4.3, 4.3,...
# $ max_x        &amp;lt;dbl&amp;gt; 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5,...
# $ min_y        &amp;lt;dbl&amp;gt; 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
# $ max_y        &amp;lt;dbl&amp;gt; 2.475, 2.475, 2.475, 2.475, 2.475, 2.475, 2.475, ...&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;bar-plot-subplots&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Bar plot subplots&lt;/h1&gt;
&lt;p&gt;Next I’m going to figure out what I want my subplots to look like. I’m going to start with bar plots to count up the number of each species in each group.&lt;/p&gt;
&lt;p&gt;Since I will be making many similar plots I’ll create a function to use for the plotting. I always work out what I want the plot to look like on a single subset of the data before making the function.&lt;/p&gt;
&lt;p&gt;In this case, I want all the plots to have the same x and y axes. The y axis of my bar plot is based on counts, so I need to calculate the maximum count of any single species within a group so I can set the upper y axis limit for all plots to that value.&lt;/p&gt;
&lt;p&gt;The maximum count is &lt;code&gt;47&lt;/code&gt;, so that will be my upper axis limit. Bar plots start at &lt;code&gt;0&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;iris %&amp;gt;%
     group_by(min_x, max_x, min_y, max_y, Species) %&amp;gt;%
     count() %&amp;gt;%
     ungroup() %&amp;gt;%
     filter(n == max(n) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# # A tibble: 1 x 6
#   min_x max_x min_y max_y Species     n
#   &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;fct&amp;gt;   &amp;lt;int&amp;gt;
# 1   4.3   5.5     1  2.48 setosa     47&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In case a species is missing from one of the subplot groups I’ll define the &lt;code&gt;limits&lt;/code&gt; for the x axis in &lt;code&gt;scale_x_discrete()&lt;/code&gt;. This forces each plot to have the same x axis breaks.&lt;/p&gt;
&lt;p&gt;I’ll be removing all axis labels, etc., via &lt;code&gt;theme_void()&lt;/code&gt; so that the subplots fit nicely into the larger plot. I will add an outline around the plot, though.&lt;/p&gt;
&lt;p&gt;I use &lt;code&gt;fill&lt;/code&gt; to color the bars by species since there will be no x axis labels on the subplots. I suppress the legend, as well, and will add a legend to the large plot instead.&lt;/p&gt;
&lt;p&gt;I set explicit colors in &lt;code&gt;scale_fill_manual()&lt;/code&gt; (colors taken &lt;a href=&#34;http://colorspace.r-forge.r-project.org/articles/hcl_palettes.html#qualitative-palettes&#34;&gt;from here&lt;/a&gt;) so all the subplots will have the same color scheme.&lt;/p&gt;
&lt;p&gt;Here is my test plot for one group. This particular group only has one species in it.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475), 
       aes(x = Species, fill = Species) ) +
     geom_bar() +
     theme_void() +
     scale_x_discrete(limits = c(&amp;quot;setosa&amp;quot;, &amp;quot;versicolor&amp;quot;, &amp;quot;virginica&amp;quot;) ) +
     scale_fill_manual(values = c(&amp;quot;setosa&amp;quot; = &amp;quot;#ED90A4&amp;quot;, 
                                  &amp;quot;versicolor&amp;quot; = &amp;quot;#ABB150&amp;quot;,
                                  &amp;quot;virginica&amp;quot; = &amp;quot;#00C1B2&amp;quot;),
                       guide  = &amp;quot;none&amp;quot;) +
     theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                       fill = &amp;quot;transparent&amp;quot;) ) +
     ylim(0, 47)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-04-22-embedding-subplots_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Once I have the plot worked out for one group I put the code into a function, &lt;code&gt;barfun&lt;/code&gt;. In this case the function takes only a dataset, since I’m hard-coding in all the axis variables, limits, etc.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;barfun = function(data) {
     ggplot(data = data, 
            aes(x = Species, fill = Species) ) +
          geom_bar() +
          theme_void() +
          scale_x_discrete(limits = c(&amp;quot;setosa&amp;quot;, &amp;quot;versicolor&amp;quot;, &amp;quot;virginica&amp;quot;) ) +
          scale_fill_manual(values = c(&amp;quot;setosa&amp;quot; = &amp;quot;#ED90A4&amp;quot;, 
                                       &amp;quot;versicolor&amp;quot; = &amp;quot;#ABB150&amp;quot;,
                                       &amp;quot;virginica&amp;quot; = &amp;quot;#00C1B2&amp;quot;),
                            guide  = &amp;quot;none&amp;quot;) +
          theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                            fill = &amp;quot;transparent&amp;quot;) ) +
          ylim(0, 47) 
  
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Does this function make the same plot I made manually? Yep. 👍&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;barfun(data = filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-04-22-embedding-subplots_files/figure-html/unnamed-chunk-9-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;creating-all-the-bar-plot-subplots&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Creating all the bar plot subplots&lt;/h1&gt;
&lt;p&gt;I’m ready to make the subplots!&lt;/p&gt;
&lt;p&gt;I’ll loop through each subset of data and plot it with my function.&lt;/p&gt;
&lt;p&gt;Since I’m going to need those coordinates for subplot placement later I decided that the most straightforward way to do this is to group by the unique combinations of coordinates and then &lt;em&gt;nest&lt;/em&gt; the dataset. When nesting, the data to be plotted for each group is placed in a column called &lt;code&gt;data&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I loop through each dataset in &lt;code&gt;data&lt;/code&gt; via &lt;code&gt;map()&lt;/code&gt; within &lt;code&gt;mutate()&lt;/code&gt;. The new column containing the plots is named &lt;code&gt;subplots&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;allplots = iris %&amp;gt;%
     group_by_at( vars( matches(&amp;quot;min|max&amp;quot;) ) ) %&amp;gt;%
     group_nest() %&amp;gt;%
     mutate(subplots = map(data, barfun) )

allplots&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# # A tibble: 9 x 6
#   min_x max_x min_y max_y data              subplots
#   &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;list&amp;gt;            &amp;lt;list&amp;gt;  
# 1   4.3   5.5  1     2.48 &amp;lt;tibble [47 x 5]&amp;gt; &amp;lt;S3: gg&amp;gt;
# 2   4.3   5.5  2.48  3.95 &amp;lt;tibble [7 x 5]&amp;gt;  &amp;lt;S3: gg&amp;gt;
# 3   4.3   5.5  3.95  5.42 &amp;lt;tibble [5 x 5]&amp;gt;  &amp;lt;S3: gg&amp;gt;
# 4   5.5   6.7  1     2.48 &amp;lt;tibble [3 x 5]&amp;gt;  &amp;lt;S3: gg&amp;gt;
# 5   5.5   6.7  2.48  3.95 &amp;lt;tibble [4 x 5]&amp;gt;  &amp;lt;S3: gg&amp;gt;
# 6   5.5   6.7  3.95  5.42 &amp;lt;tibble [51 x 5]&amp;gt; &amp;lt;S3: gg&amp;gt;
# 7   5.5   6.7  5.42  6.9  &amp;lt;tibble [13 x 5]&amp;gt; &amp;lt;S3: gg&amp;gt;
# 8   6.7   7.9  3.95  5.42 &amp;lt;tibble [5 x 5]&amp;gt;  &amp;lt;S3: gg&amp;gt;
# 9   6.7   7.9  5.42  6.9  &amp;lt;tibble [15 x 5]&amp;gt; &amp;lt;S3: gg&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here’s a couple of the plots. The first one should look familiar.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;allplots$subplots[[1]]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-04-22-embedding-subplots_files/figure-html/unnamed-chunk-11-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;allplots$subplots[[6]]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-04-22-embedding-subplots_files/figure-html/unnamed-chunk-11-2.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;putting-the-subplots-into-annotation_custom&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Putting the subplots into annotation_custom()&lt;/h1&gt;
&lt;p&gt;Next I need to make each of these plots a &lt;em&gt;grob&lt;/em&gt; (graphical object) and pass it into &lt;code&gt;annotation_custom()&lt;/code&gt;. The coordinates for each subplot will be passed to the &lt;code&gt;xmin&lt;/code&gt;, &lt;code&gt;xmax&lt;/code&gt;, &lt;code&gt;ymin&lt;/code&gt;, &lt;code&gt;ymax&lt;/code&gt; arguments in &lt;code&gt;annotation_custom()&lt;/code&gt;, which indicate where each subplot should be placed in the larger plot.&lt;/p&gt;
&lt;p&gt;Since I want to pass the &lt;code&gt;subplots&lt;/code&gt; column as well as the four coordinates columns to &lt;code&gt;annotation_custom()&lt;/code&gt; I will loop through &lt;code&gt;allplots&lt;/code&gt; &lt;em&gt;row-wise&lt;/em&gt;. I can use the &lt;code&gt;pmap()&lt;/code&gt; function for looping through the rows of a dataset.&lt;/p&gt;
&lt;p&gt;Since I’ll be working with five columns simultaneously in this step I decided to make a function prior to looping, which I name &lt;code&gt;grobfun&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Notice I put my arguments of the function in the same order as they appear in the dataset and &lt;em&gt;they have the same names as the columns of the dataset&lt;/em&gt;. I did this on purpose for ease of working with &lt;code&gt;pmap()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;grobfun = function(min_x, max_x, min_y, max_y, subplots) {
     annotation_custom(ggplotGrob(subplots),
                       xmin = min_x, ymin = min_y,
                       xmax = max_x, ymax = max_y)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I no longer need the &lt;code&gt;data&lt;/code&gt; column, so I remove it to end up with only the columns I need to pass to the &lt;code&gt;grobfun()&lt;/code&gt; function. This isn’t strictly necessary, but I find it makes working with &lt;code&gt;pmap()&lt;/code&gt; easier.&lt;/p&gt;
&lt;p&gt;The dot, &lt;code&gt;.&lt;/code&gt;, indicates I’m passing the entire dataset to &lt;code&gt;pmap()&lt;/code&gt; and so looping through it row-wise.&lt;/p&gt;
&lt;p&gt;I find this can take a little time to run when doing many subplots.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;( allgrobs = allplots %&amp;gt;%
     select(-data) %&amp;gt;%
     mutate(grobs = pmap(., grobfun) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# # A tibble: 9 x 6
#   min_x max_x min_y max_y subplots grobs              
#   &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;list&amp;gt;   &amp;lt;list&amp;gt;             
# 1   4.3   5.5  1     2.48 &amp;lt;S3: gg&amp;gt; &amp;lt;S3: LayerInstance&amp;gt;
# 2   4.3   5.5  2.48  3.95 &amp;lt;S3: gg&amp;gt; &amp;lt;S3: LayerInstance&amp;gt;
# 3   4.3   5.5  3.95  5.42 &amp;lt;S3: gg&amp;gt; &amp;lt;S3: LayerInstance&amp;gt;
# 4   5.5   6.7  1     2.48 &amp;lt;S3: gg&amp;gt; &amp;lt;S3: LayerInstance&amp;gt;
# 5   5.5   6.7  2.48  3.95 &amp;lt;S3: gg&amp;gt; &amp;lt;S3: LayerInstance&amp;gt;
# 6   5.5   6.7  3.95  5.42 &amp;lt;S3: gg&amp;gt; &amp;lt;S3: LayerInstance&amp;gt;
# 7   5.5   6.7  5.42  6.9  &amp;lt;S3: gg&amp;gt; &amp;lt;S3: LayerInstance&amp;gt;
# 8   6.7   7.9  3.95  5.42 &amp;lt;S3: gg&amp;gt; &amp;lt;S3: LayerInstance&amp;gt;
# 9   6.7   7.9  5.42  6.9  &amp;lt;S3: gg&amp;gt; &amp;lt;S3: LayerInstance&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;making-the-large-plot&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Making the large plot&lt;/h1&gt;
&lt;p&gt;I haven’t actually made the larger plot I’m going to embed subplots into yet. This will be a blank plot with &lt;code&gt;Sepal.Length&lt;/code&gt; on the x axis and &lt;code&gt;Petal.Length&lt;/code&gt; on the y axis with a legend for &lt;code&gt;fill&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Adding on the overall legend to a blank plot involves a little trick with &lt;code&gt;geom_col()&lt;/code&gt;, which was demonstrated in those Stack Overflow posts. (Thank goodness for SO, since I would have never figured it out otherwise 😜.)&lt;/p&gt;
&lt;p&gt;Here is the plot I will embed the subplots into.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;( largeplot = ggplot(iris, aes(x = Sepal.Length, 
                               y = Petal.Length, 
                               fill = Species) ) +
       geom_blank() +
       geom_col( aes(Inf, Inf) ) +
       scale_fill_manual(values = c(&amp;quot;setosa&amp;quot; = &amp;quot;#ED90A4&amp;quot;, 
                                    &amp;quot;versicolor&amp;quot; = &amp;quot;#ABB150&amp;quot;,
                                    &amp;quot;virginica&amp;quot; = &amp;quot;#00C1B2&amp;quot;) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Warning: Removed 149 rows containing missing values (geom_col).&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-04-22-embedding-subplots_files/figure-html/unnamed-chunk-14-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;embedding-the-bar-plot-subplots&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Embedding the bar plot subplots&lt;/h1&gt;
&lt;p&gt;Last step! I can now add the list of subplots in &lt;code&gt;annotation_custom()&lt;/code&gt; to the larger plot. 🎉&lt;/p&gt;
&lt;p&gt;There was a little extra space on the y axis that I removed by setting the axis limits.&lt;/p&gt;
&lt;p&gt;I think this looks nice with evenly spaced subplots but there may be times that having unevenly spaced subplots is desirable. In that case you could cut the variables into uneven groups.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;largeplot +
     allgrobs$grobs +
     ylim(1, NA)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Warning: Removed 150 rows containing missing values (geom_col).&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-04-22-embedding-subplots_files/figure-html/unnamed-chunk-15-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;After polishing this plot up a bit as desired, the final plot can be saved with &lt;code&gt;ggsave()&lt;/code&gt;.&lt;/p&gt;
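&lt;p&gt;For example, with an arbitrary file name and size (adjust these to taste):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# The file name and dimensions here are placeholders
final_plot = largeplot +
     allgrobs$grobs +
     ylim(1, NA)

ggsave(&amp;quot;embedded_subplots.png&amp;quot;, final_plot, width = 7, height = 5, dpi = 300)&lt;/code&gt;&lt;/pre&gt;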
&lt;/div&gt;
&lt;div id=&#34;histogram-subplots&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Histogram subplots&lt;/h1&gt;
&lt;p&gt;I’ve seen quite a few examples of embedded plots with bar plot or pie chart subplots to show patterns in the distribution of some categorical variable across the axis variables. But there’s no reason we can’t show the distribution of a third continuous variable.&lt;/p&gt;
&lt;p&gt;I decided to make a histogram of the variable &lt;code&gt;Petal.Width&lt;/code&gt; for the same subplot groups I used above. I found a lot of little details to work through when making subplots showing the distribution of a continuous variable.&lt;/p&gt;
&lt;p&gt;I want the x axis of each plot to encompass the entire range of &lt;code&gt;Petal.Width&lt;/code&gt;, so my first step was to pull out that information. These values will be the x axis limits (with some extra added to make sure all the histogram bars fit) and the limits for the continuous legend.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;range(iris$Petal.Width)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 0.1 2.5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I found the y axis to be a little trickier. I decided to show the bars as proportions of the maximum count with &lt;code&gt;ncount&lt;/code&gt; instead of as a raw count. Since the height of bars is then a proportion of maximum instead of a count I added the sample size per group to the plot as text. I ended up putting this in a facet strip rather than within the plot. I went back and forth a bunch and am still not sure the final result is exactly what I want.&lt;/p&gt;
&lt;p&gt;I’ll base the color of the bars on &lt;code&gt;Petal.Width&lt;/code&gt;. This can be done with &lt;code&gt;fill = stat(x)&lt;/code&gt;, which refers to the bin midpoints calculated by &lt;code&gt;geom_histogram()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;To make sure each plot has the same number of bars I set the &lt;code&gt;binwidth&lt;/code&gt;. I chose to make each bar 0.2 units wide, since the entire range of &lt;code&gt;Petal.Width&lt;/code&gt; (2.4 units) is evenly divisible by that number. I also use &lt;code&gt;center&lt;/code&gt; so the first bar is centered at 0.1, the minimum value in the dataset.&lt;/p&gt;
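&lt;p&gt;A quick check of that arithmetic:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# 2.4 units of range / 0.2 units per bar = 12 bars
diff( range(iris$Petal.Width) ) / 0.2&lt;/code&gt;&lt;/pre&gt;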
&lt;p&gt;I pad the x axis limits with half the &lt;code&gt;binwidth&lt;/code&gt; value to make sure all the bars will show in every plot. Getting the limits correct can be hard; see, e.g., &lt;a href=&#34;https://github.com/tidyverse/ggplot2/issues/3265&#34;&gt;this GitHub issue&lt;/a&gt; if you are seeing warnings that you don’t think are correct.&lt;/p&gt;
&lt;p&gt;Here is my example plot for the first group. Since I was careful with my plot limits the warning message is spurious.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475),
       aes(x = Petal.Width, y = stat(ncount), fill = stat(x) ) ) +
     geom_histogram(binwidth = .2, center = .1) +
     theme_void(base_size = 14) +
     scale_x_continuous(limits = c(0.1 - .1, 2.5 + .1) ) +
     scale_fill_continuous(type = &amp;quot;viridis&amp;quot;,
                           guide  = &amp;quot;none&amp;quot;,
                           limits = c(.1, 2.5) ) +
     facet_wrap(~paste0(&amp;quot;n = &amp;quot;, nrow(filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475) ) ) ) +
     theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                       fill = &amp;quot;transparent&amp;quot;) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-04-22-embedding-subplots_files/figure-html/unnamed-chunk-17-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;And here’s the function to make histograms for each subplot dataset, which I name &lt;code&gt;histfun&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;histfun = function(data) {
     ggplot(data = data,
            aes(x = Petal.Width, y = stat(ncount), fill = stat(x) ) ) +
          geom_histogram(binwidth = .2, center = .1) +
          theme_void(base_size = 14) +
          scale_x_continuous(limits = c(0.1 - .1, 2.5 + .1) ) +
          scale_fill_continuous(type = &amp;quot;viridis&amp;quot;,
                                guide  = &amp;quot;none&amp;quot;,
                                limits = c(.1, 2.5) ) +
          facet_wrap(~paste0(&amp;quot;n = &amp;quot;, nrow(data) ) ) +
          theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                            fill = &amp;quot;transparent&amp;quot;) )
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;get-the-histograms-ready-to-embed&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Get the histograms ready to embed&lt;/h1&gt;
&lt;p&gt;This time I’ll make the subplots with &lt;code&gt;histfun&lt;/code&gt; and then put them into &lt;code&gt;annotation_custom()&lt;/code&gt; with &lt;code&gt;grobfun&lt;/code&gt; in one pipe chain.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;allgrobs_hist = iris %&amp;gt;%
     group_by_at( vars( matches(&amp;quot;min|max&amp;quot;) ) ) %&amp;gt;%
     group_nest() %&amp;gt;%
     mutate(subplots = map(data, histfun) ) %&amp;gt;%
     select(-data) %&amp;gt;%
     mutate(grobs = pmap(., grobfun) )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;embed-the-histogram-subplots&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Embed the histogram subplots&lt;/h1&gt;
&lt;p&gt;This time the large plot needs a continuous legend. I set the &lt;code&gt;breaks&lt;/code&gt; so the minimum and maximum value are included on the legend. How well this works will depend on the range of your dataset.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;( largeplot2 = ggplot(iris, aes(x = Sepal.Length, 
                                y = Petal.Length, 
                                fill = Petal.Width) ) +
       geom_blank() +
       geom_col( aes(Inf, Inf) ) +
       scale_fill_continuous(type = &amp;quot;viridis&amp;quot;,
                             limits = c(.1, 2.5),
                             breaks = seq(.1, 2.5, by = .8) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Warning: Removed 149 rows containing missing values (geom_col).&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-04-22-embedding-subplots_files/figure-html/unnamed-chunk-20-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;And, finally, here’s the plot embedded with the &lt;code&gt;Petal.Width&lt;/code&gt; distribution plots. 😃&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;largeplot2 +
     allgrobs_hist$grobs +
     ylim(1, NA)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Warning: Removed 150 rows containing missing values (geom_col).&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-04-22-embedding-subplots_files/figure-html/unnamed-chunk-21-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;a-density-subplot-example&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;A density subplot example&lt;/h1&gt;
&lt;p&gt;I thought the histogram subplots ended up being pretty tricky, what with having to figure out the plot limits and the bar widths and centers.&lt;/p&gt;
&lt;p&gt;Density plots are another possibility for showing the distribution of continuous data, and the color of the line can be allowed to vary.&lt;/p&gt;
&lt;p&gt;Here’s an example of what a density plot could look like. In some scenarios using &lt;code&gt;trim = TRUE&lt;/code&gt; may be useful in &lt;code&gt;stat_density()&lt;/code&gt;. You’ll notice I set the x axis limits to the minimum and maximum value in the dataset with no padding for this plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475),
       aes(x = Petal.Width, y = stat(ndensity), color = stat(x) ) ) +
     stat_density(geom = &amp;quot;line&amp;quot;, size = 1.25) +
     theme_void(base_size = 14) +
     scale_x_continuous(limits = c(0.1, 2.5),
                        expand = c(0, 0) ) +
     scale_color_viridis_c(guide  = &amp;quot;none&amp;quot;,
                           limits = c(.1, 2.5) ) +
     facet_wrap(~paste0(&amp;quot;n = &amp;quot;, nrow(filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475) ) ) ) +
     theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                       fill = &amp;quot;transparent&amp;quot;) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-04-22-embedding-subplots_files/figure-html/unnamed-chunk-22-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Turning the plot code into a function for looping through groups.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;densfun = function(data) {
     ggplot(data = data,
            aes(x = Petal.Width, y = stat(ndensity), color = stat(x) ) ) +
          stat_density(geom = &amp;quot;line&amp;quot;, size = 1.25) +
          theme_void(base_size = 14) +
          scale_x_continuous(limits = c(0.1, 2.5),
                             expand = c(0, 0) ) +
          scale_color_viridis_c(guide  = &amp;quot;none&amp;quot;,
                                limits = c(.1, 2.5) ) +
          facet_wrap(~paste0(&amp;quot;n = &amp;quot;, nrow(data) ) ) +
          theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                            fill = &amp;quot;transparent&amp;quot;) )
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Loop through the groups to create the subplots and get them ready to add to the plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;allgrobs_dens = iris %&amp;gt;%
     group_by_at( vars( matches(&amp;quot;min|max&amp;quot;) ) ) %&amp;gt;%
     group_nest() %&amp;gt;%
     mutate(subplots = map(data, densfun) ) %&amp;gt;%
     select(-data) %&amp;gt;%
     mutate(grobs = pmap(., grobfun) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And here are the density subplots embedded in the large plot. Not too bad! I’m guessing the density plot approach would be most useful for larger sample sizes. 😺&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;largeplot2 +
     allgrobs_dens$grobs +
     ylim(1, NA)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Warning: Removed 150 rows containing missing values (geom_col).&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-04-22-embedding-subplots_files/figure-html/unnamed-chunk-25-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;filled-density-plots&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Filled density plots&lt;/h1&gt;
&lt;p&gt;I belatedly realized that we can use &lt;code&gt;geom_density_ridges_gradient()&lt;/code&gt; from package &lt;strong&gt;ggridges&lt;/strong&gt; to make density plots with continuous fill.&lt;/p&gt;
&lt;p&gt;Since this package is really for &lt;a href=&#34;https://cran.r-project.org/web/packages/ggridges/vignettes/introduction.html&#34;&gt;ridge plots&lt;/a&gt;, I use &lt;code&gt;y = 1&lt;/code&gt; to get a single density plot. This geom uses a relative scale by default so &lt;code&gt;stat(ndensity)&lt;/code&gt; isn’t needed.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggridges) # v 0.5.1
ggplot(data = filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475),
       aes(x = Petal.Width, y = 1, fill = stat(x) ) ) +
     geom_density_ridges_gradient() +
     theme_void(base_size = 14) +
     scale_x_continuous(limits = c(0.1, 2.5),
                        expand = c(0, 0) ) +
     scale_fill_viridis_c(guide  = &amp;quot;none&amp;quot;,
                           limits = c(.1, 2.5) ) +
     facet_wrap(~paste0(&amp;quot;n = &amp;quot;, nrow(filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475) ) ) ) +
     theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                       fill = &amp;quot;transparent&amp;quot;) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-04-22-embedding-subplots_files/figure-html/unnamed-chunk-26-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Let’s see how that looks as embedded plots. I’ll do the whole process in one code chunk, taking some extra time to move the legend around in the final plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;densfun2 = function(data) {
  ggplot(data = data,
         aes(x = Petal.Width, y = 1, fill = stat(x) ) ) +
    geom_density_ridges_gradient() +
    theme_void(base_size = 14) +
    scale_x_continuous(limits = c(0.1, 2.5),
                       expand = c(0, 0) ) +
    scale_fill_viridis_c(guide  = &amp;quot;none&amp;quot;,
                         limits = c(.1, 2.5) ) +
    facet_wrap(~paste0(&amp;quot;n = &amp;quot;, nrow(data) ) ) +
    theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                      fill = &amp;quot;transparent&amp;quot;) )
}
allgrobs_dens2 = iris %&amp;gt;%
    group_by_at( vars( matches(&amp;quot;min|max&amp;quot;) ) ) %&amp;gt;%
    group_nest() %&amp;gt;%
    mutate(subplots = map(data, densfun2) ) %&amp;gt;%
    select(-data) %&amp;gt;%
    mutate(grobs = pmap(., grobfun) )

largeplot2 +
    allgrobs_dens2$grobs +
    ylim(1, NA) +
    theme_bw() +
    theme(legend.direction = &amp;quot;horizontal&amp;quot;,
          legend.position = c(.8, .25),
          legend.background = element_blank() ) +
    guides(fill = guide_colorbar(title.position = &amp;quot;top&amp;quot;) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Warning: Removed 150 rows containing missing values (geom_col).&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-04-22-embedding-subplots_files/figure-html/unnamed-chunk-27-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;just-the-code-please&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Just the code, please&lt;/h1&gt;
&lt;p&gt;Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code &lt;a href=&#34;https://aosmith.rbind.io/script/2019-04-22-embedding-subplots.R&#34;&gt;from here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2) # 3.1.1
suppressPackageStartupMessages( library(dplyr) ) # 0.8.0.1
library(tidyr) # 0.8.3
library(purrr) # 0.3.2

cuteven = function(variable, ngroups) {
     seq_all = seq(min(variable), max(variable), length.out = ngroups + 1)
     cut(variable,
         breaks = seq_all,
         labels = paste(seq_all[-(ngroups + 1)], seq_all[-1], sep = &amp;quot;,&amp;quot;),
         include.lowest = TRUE)
}

with(iris, cuteven(Sepal.Length, ngroups = 3) )

iris = mutate(iris,
                group_x = cuteven(Sepal.Length, 3),
                group_y = cuteven(Petal.Length, 4) )

glimpse(iris)

iris = iris %&amp;gt;%
     separate(group_x, into = c(&amp;quot;min_x&amp;quot;, &amp;quot;max_x&amp;quot;), 
              sep = &amp;quot;,&amp;quot;, convert = TRUE) %&amp;gt;%
     separate(group_y, into = c(&amp;quot;min_y&amp;quot;, &amp;quot;max_y&amp;quot;), 
              sep = &amp;quot;,&amp;quot;, convert = TRUE)

glimpse(iris)

iris %&amp;gt;%
     group_by(min_x, max_x, min_y, max_y, Species) %&amp;gt;%
     count() %&amp;gt;%
     ungroup() %&amp;gt;%
     filter(n == max(n) )

ggplot(data = filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475), 
       aes(x = Species, fill = Species) ) +
     geom_bar() +
     theme_void() +
     scale_x_discrete(limits = c(&amp;quot;setosa&amp;quot;, &amp;quot;versicolor&amp;quot;, &amp;quot;virginica&amp;quot;) ) +
     scale_fill_manual(values = c(&amp;quot;setosa&amp;quot; = &amp;quot;#ED90A4&amp;quot;, 
                                  &amp;quot;versicolor&amp;quot; = &amp;quot;#ABB150&amp;quot;,
                                  &amp;quot;virginica&amp;quot; = &amp;quot;#00C1B2&amp;quot;),
                       guide  = &amp;quot;none&amp;quot;) +
     theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                       fill = &amp;quot;transparent&amp;quot;) ) +
     ylim(0, 47)

barfun = function(data) {
     ggplot(data = data, 
            aes(x = Species, fill = Species) ) +
          geom_bar() +
          theme_void() +
          scale_x_discrete(limits = c(&amp;quot;setosa&amp;quot;, &amp;quot;versicolor&amp;quot;, &amp;quot;virginica&amp;quot;) ) +
          scale_fill_manual(values = c(&amp;quot;setosa&amp;quot; = &amp;quot;#ED90A4&amp;quot;, 
                                       &amp;quot;versicolor&amp;quot; = &amp;quot;#ABB150&amp;quot;,
                                       &amp;quot;virginica&amp;quot; = &amp;quot;#00C1B2&amp;quot;),
                            guide  = &amp;quot;none&amp;quot;) +
          theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                            fill = &amp;quot;transparent&amp;quot;) ) +
          ylim(0, 47) 
  
}
barfun(data = filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475) )

allplots = iris %&amp;gt;%
     group_by_at( vars( matches(&amp;quot;min|max&amp;quot;) ) ) %&amp;gt;%
     group_nest() %&amp;gt;%
     mutate(subplots = map(data, barfun) )

allplots
allplots$subplots[[1]]
allplots$subplots[[6]]
grobfun = function(min_x, max_x, min_y, max_y, subplots) {
     annotation_custom(ggplotGrob(subplots),
                       xmin = min_x, ymin = min_y,
                       xmax = max_x, ymax = max_y)
}

( allgrobs = allplots %&amp;gt;%
     select(-data) %&amp;gt;%
     mutate(grobs = pmap(., grobfun) ) )

( largeplot = ggplot(iris, aes(x = Sepal.Length, 
                               y = Petal.Length, 
                               fill = Species) ) +
       geom_blank() +
       geom_col( aes(Inf, Inf) ) +
       scale_fill_manual(values = c(&amp;quot;setosa&amp;quot; = &amp;quot;#ED90A4&amp;quot;, 
                                    &amp;quot;versicolor&amp;quot; = &amp;quot;#ABB150&amp;quot;,
                                    &amp;quot;virginica&amp;quot; = &amp;quot;#00C1B2&amp;quot;) ) )

largeplot +
     allgrobs$grobs +
     ylim(1, NA)

range(iris$Petal.Width)

ggplot(data = filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475),
       aes(x = Petal.Width, y = stat(ncount), fill = stat(x) ) ) +
     geom_histogram(binwidth = .2, center = .1) +
     theme_void(base_size = 14) +
     scale_x_continuous(limits = c(0.1 - .1, 2.5 + .1) ) +
     scale_fill_continuous(type = &amp;quot;viridis&amp;quot;,
                           guide  = &amp;quot;none&amp;quot;,
                           limits = c(.1, 2.5) ) +
     facet_wrap(~paste0(&amp;quot;n = &amp;quot;, nrow(filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475) ) ) ) +
     theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                       fill = &amp;quot;transparent&amp;quot;) )

histfun = function(data) {
     ggplot(data = data,
            aes(x = Petal.Width, y = stat(ncount), fill = stat(x) ) ) +
          geom_histogram(binwidth = .2, center = .1) +
          theme_void(base_size = 14) +
          scale_x_continuous(limits = c(0.1 - .1, 2.5 + .1) ) +
          scale_fill_continuous(type = &amp;quot;viridis&amp;quot;,
                                guide  = &amp;quot;none&amp;quot;,
                                limits = c(.1, 2.5) ) +
          facet_wrap(~paste0(&amp;quot;n = &amp;quot;, nrow(data) ) ) +
          theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                            fill = &amp;quot;transparent&amp;quot;) )
}

allgrobs_hist = iris %&amp;gt;%
     group_by_at( vars( matches(&amp;quot;min|max&amp;quot;) ) ) %&amp;gt;%
     group_nest() %&amp;gt;%
     mutate(subplots = map(data, histfun) ) %&amp;gt;%
     select(-data) %&amp;gt;%
     mutate(grobs = pmap(., grobfun) )

( largeplot2 = ggplot(iris, aes(x = Sepal.Length, 
                                y = Petal.Length, 
                                fill = Petal.Width) ) +
       geom_blank() +
       geom_col( aes(Inf, Inf) ) +
       scale_fill_continuous(type = &amp;quot;viridis&amp;quot;,
                             limits = c(.1, 2.5),
                             breaks = seq(.1, 2.5, by = .8) ) )

largeplot2 +
     allgrobs_hist$grobs +
     ylim(1, NA)

ggplot(data = filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475),
       aes(x = Petal.Width, y = stat(ndensity), color = stat(x) ) ) +
     stat_density(geom = &amp;quot;line&amp;quot;, size = 1.25) +
     theme_void(base_size = 14) +
     scale_x_continuous(limits = c(0.1, 2.5),
                        expand = c(0, 0) ) +
     scale_color_viridis_c(guide  = &amp;quot;none&amp;quot;,
                           limits = c(.1, 2.5) ) +
     facet_wrap(~paste0(&amp;quot;n = &amp;quot;, nrow(filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475) ) ) ) +
     theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                       fill = &amp;quot;transparent&amp;quot;) )

densfun = function(data) {
     ggplot(data = data,
            aes(x = Petal.Width, y = stat(ndensity), color = stat(x) ) ) +
          stat_density(geom = &amp;quot;line&amp;quot;, size = 1.25) +
          theme_void(base_size = 14) +
          scale_x_continuous(limits = c(0.1, 2.5),
                             expand = c(0, 0) ) +
          scale_color_viridis_c(guide  = &amp;quot;none&amp;quot;,
                                limits = c(.1, 2.5) ) +
          facet_wrap(~paste0(&amp;quot;n = &amp;quot;, nrow(data) ) ) +
          theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                            fill = &amp;quot;transparent&amp;quot;) )
}

allgrobs_dens = iris %&amp;gt;%
     group_by_at( vars( matches(&amp;quot;min|max&amp;quot;) ) ) %&amp;gt;%
     group_nest() %&amp;gt;%
     mutate(subplots = map(data, densfun) ) %&amp;gt;%
     select(-data) %&amp;gt;%
     mutate(grobs = pmap(., grobfun) )
largeplot2 +
     allgrobs_dens$grobs +
     ylim(1, NA)

library(ggridges) # v 0.5.1
ggplot(data = filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475),
       aes(x = Petal.Width, y = 1, fill = stat(x) ) ) +
     geom_density_ridges_gradient() +
     theme_void(base_size = 14) +
     scale_x_continuous(limits = c(0.1, 2.5),
                        expand = c(0, 0) ) +
     scale_fill_viridis_c(guide  = &amp;quot;none&amp;quot;,
                           limits = c(.1, 2.5) ) +
     facet_wrap(~paste0(&amp;quot;n = &amp;quot;, nrow(filter(iris, max_x &amp;lt;= 5.5, max_y &amp;lt;= 2.475) ) ) ) +
     theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                       fill = &amp;quot;transparent&amp;quot;) )

densfun2 = function(data) {
  ggplot(data = data,
         aes(x = Petal.Width, y = 1, fill = stat(x) ) ) +
    geom_density_ridges_gradient() +
    theme_void(base_size = 14) +
    scale_x_continuous(limits = c(0.1, 2.5),
                       expand = c(0, 0) ) +
    scale_fill_viridis_c(guide  = &amp;quot;none&amp;quot;,
                         limits = c(.1, 2.5) ) +
    facet_wrap(~paste0(&amp;quot;n = &amp;quot;, nrow(data) ) ) +
    theme(panel.border = element_rect(color = &amp;quot;grey&amp;quot;,
                                      fill = &amp;quot;transparent&amp;quot;) )
}
allgrobs_dens2 = iris %&amp;gt;%
    group_by_at( vars( matches(&amp;quot;min|max&amp;quot;) ) ) %&amp;gt;%
    group_nest() %&amp;gt;%
    mutate(subplots = map(data, densfun2) ) %&amp;gt;%
    select(-data) %&amp;gt;%
    mutate(grobs = pmap(., grobfun) )

largeplot2 +
    allgrobs_dens2$grobs +
    ylim(1, NA) +
    theme_bw() +
    theme(legend.direction = &amp;quot;horizontal&amp;quot;,
          legend.position = c(.8, .25),
          legend.background = element_blank() ) +
    guides(fill = guide_colorbar(title.position = &amp;quot;top&amp;quot;) )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Custom contrasts in emmeans</title>
      <link>https://aosmith.rbind.io/2019/04/15/custom-contrasts-emmeans/</link>
      <pubDate>Mon, 15 Apr 2019 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2019/04/15/custom-contrasts-emmeans/</guid>
      <description>


&lt;p&gt;Following up on a &lt;a href=&#34;https://aosmith.rbind.io/2019/03/25/getting-started-with-emmeans/&#34;&gt;previous post&lt;/a&gt;, where I demonstrated the basic usage of package &lt;strong&gt;emmeans&lt;/strong&gt; for doing post hoc comparisons, here I’ll demonstrate how to make custom comparisons (aka &lt;em&gt;contrasts&lt;/em&gt;). These are comparisons that aren’t encompassed by the built-in functions in the package.&lt;/p&gt;
&lt;p&gt;Remember that you can explore the available built-in &lt;strong&gt;emmeans&lt;/strong&gt; functions for doing comparisons via &lt;code&gt;?&#34;contrast-methods&#34;&lt;/code&gt;.&lt;/p&gt;
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#reasons-for-custom-comparisons&#34;&gt;Reasons for custom comparisons&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#r-packages&#34;&gt;R packages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-dataset-and-model&#34;&gt;The dataset and model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#treatment-vs-control-comparisons&#34;&gt;Treatment vs control comparisons&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#building-custom-contrasts&#34;&gt;Building custom contrasts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-contrast-function-for-custom-comparisons&#34;&gt;The contrast() function for custom comparisons&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-named-lists-for-better-output&#34;&gt;Using named lists for better output&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#using-at-for-simple-comparisons&#34;&gt;Using “at” for simple comparisons&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#multiple-custom-contrasts-at-once&#34;&gt;Multiple custom contrasts at once&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#more-complicated-custom-contrasts&#34;&gt;More complicated custom contrasts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#just-the-code-please&#34;&gt;Just the code, please&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;reasons-for-custom-comparisons&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Reasons for custom comparisons&lt;/h1&gt;
&lt;p&gt;There are a variety of reasons you might need custom comparisons instead of some of the standard, built-in ones. One common scenario that I see a lot is when we have a single control group for multiple factors, so the factors aren’t perfectly crossed. This comes up, e.g., when doing experiments that involve applying different substances (like fertilizers) at varying rates. One factor is the different substances applied and the other is different application rates. However, the control is applying nothing or water or something like that. There aren’t different rates of the control to apply, so there is a single control group for both factors.&lt;/p&gt;
&lt;p&gt;Rather than trying to fit a model with multiple factors, focusing on main effects and the interaction, such data can be analyzed with a &lt;em&gt;simple effects&lt;/em&gt; model. This is where the two (or more) factors of interest have been combined into a single factor for analysis. Such an analysis focuses on the effect of the two factors combined. We can use post hoc comparisons to estimate the overall effects of individual factors.&lt;/p&gt;
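&lt;p&gt;As a concrete sketch, two hypothetical columns &lt;code&gt;substance&lt;/code&gt; and &lt;code&gt;rate&lt;/code&gt; (the column names here are made up purely for illustration) could be combined into a single factor like this, with the control kept as its own single level since it has no rate:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;d = data.frame(substance = c(&amp;quot;A&amp;quot;, &amp;quot;A&amp;quot;, &amp;quot;B&amp;quot;, &amp;quot;B&amp;quot;, &amp;quot;control&amp;quot;),
               rate = c(&amp;quot;1&amp;quot;, &amp;quot;2&amp;quot;, &amp;quot;1&amp;quot;, &amp;quot;2&amp;quot;, NA) )

# Paste substance and rate together; the control keeps a single level
d$sub.rate = factor( ifelse(d$substance == &amp;quot;control&amp;quot;,
                            &amp;quot;control&amp;quot;,
                            paste(d$substance, d$rate, sep = &amp;quot;.&amp;quot;) ) )
levels(d$sub.rate)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This produces the five levels &lt;code&gt;A.1&lt;/code&gt;, &lt;code&gt;A.2&lt;/code&gt;, &lt;code&gt;B.1&lt;/code&gt;, &lt;code&gt;B.2&lt;/code&gt;, and &lt;code&gt;control&lt;/code&gt;, matching the structure of the example dataset below.&lt;/p&gt;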
&lt;/div&gt;
&lt;div id=&#34;r-packages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;R packages&lt;/h1&gt;
&lt;p&gt;I will load &lt;strong&gt;magrittr&lt;/strong&gt; for the pipe in addition to &lt;strong&gt;emmeans&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(emmeans) # v. 1.3.3
library(magrittr) # v. 1.5&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;the-dataset-and-model&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The dataset and model&lt;/h1&gt;
&lt;p&gt;I’ve made a small dataset to use as an example. The response variable is &lt;code&gt;resp&lt;/code&gt; and the two factors of interest have been combined into a single factor &lt;code&gt;sub.rate&lt;/code&gt; that has 5 levels: &lt;code&gt;A.1&lt;/code&gt;, &lt;code&gt;A.2&lt;/code&gt;, &lt;code&gt;B.1&lt;/code&gt;, &lt;code&gt;B.2&lt;/code&gt;, and &lt;code&gt;control&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;One factor, which I’m thinking of as the &lt;em&gt;substance&lt;/em&gt; factor, is represented by &lt;code&gt;A&lt;/code&gt; and &lt;code&gt;B&lt;/code&gt; (and the control). The second, the &lt;em&gt;rate&lt;/em&gt; factor, is represented by &lt;code&gt;1&lt;/code&gt; and &lt;code&gt;2&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat = structure(list(sub.rate = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 
5L, 5L), .Label = c(&amp;quot;A.1&amp;quot;, &amp;quot;A.2&amp;quot;, &amp;quot;B.1&amp;quot;, &amp;quot;B.2&amp;quot;, &amp;quot;control&amp;quot;), class = &amp;quot;factor&amp;quot;), 
    resp = c(5.5, 4.9, 6.1, 3.6, 6.1, 3.5, 3, 4.1, 5, 4.6, 7.3, 
    5.6, 4.8, 7.2, 6.2, 4.3, 6.6, 6.5, 5.5, 7.1, 5.4, 6.7, 6.8, 
    8.5, 6.1)), row.names = c(NA, -25L), class = &amp;quot;data.frame&amp;quot;)

str(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# &amp;#39;data.frame&amp;#39;: 25 obs. of  2 variables:
#  $ sub.rate: Factor w/ 5 levels &amp;quot;A.1&amp;quot;,&amp;quot;A.2&amp;quot;,&amp;quot;B.1&amp;quot;,..: 1 1 1 1 1 2 2 2 2 2 ...
#  $ resp    : num  5.5 4.9 6.1 3.6 6.1 3.5 3 4.1 5 4.6 ...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I will use a simple, linear model for analysis.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit1 = lm(resp ~ sub.rate, data = dat)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;treatment-vs-control-comparisons&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Treatment vs control comparisons&lt;/h1&gt;
&lt;p&gt;The simple effects model makes it easy to get comparisons for each factor combination vs the control group with &lt;code&gt;emmeans()&lt;/code&gt;. I’ll use &lt;code&gt;trt.vs.ctrlk&lt;/code&gt; to do this since the control is the last level of the factor.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emmeans(fit1, specs = trt.vs.ctrlk ~ sub.rate)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $emmeans
#  sub.rate emmean    SE df lower.CL upper.CL
#  A.1        5.24 0.466 20     4.27     6.21
#  A.2        4.04 0.466 20     3.07     5.01
#  B.1        6.22 0.466 20     5.25     7.19
#  B.2        6.00 0.466 20     5.03     6.97
#  control    6.70 0.466 20     5.73     7.67
# 
# Confidence level used: 0.95 
# 
# $contrasts
#  contrast      estimate   SE df t.ratio p.value
#  A.1 - control    -1.46 0.66 20 -2.214  0.1230 
#  A.2 - control    -2.66 0.66 20 -4.033  0.0024 
#  B.1 - control    -0.48 0.66 20 -0.728  0.8403 
#  B.2 - control    -0.70 0.66 20 -1.061  0.6548 
# 
# P value adjustment: dunnettx method for 4 tests&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We may also be interested in some other comparisons, though. In particular, we might want to do some overall comparisons across the two factors. We will need custom contrasts for this.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;building-custom-contrasts&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Building custom contrasts&lt;/h1&gt;
&lt;p&gt;Custom contrasts are based on the estimated marginal means output from &lt;code&gt;emmeans()&lt;/code&gt;. The first step to building custom contrasts is to calculate the estimated marginal means so we have them to work with. I will name this output &lt;code&gt;emm1&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm1 = emmeans(fit1, specs = ~ sub.rate)
emm1&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  sub.rate emmean    SE df lower.CL upper.CL
#  A.1        5.24 0.466 20     4.27     6.21
#  A.2        4.04 0.466 20     3.07     5.01
#  B.1        6.22 0.466 20     5.25     7.19
#  B.2        6.00 0.466 20     5.03     6.97
#  control    6.70 0.466 20     5.73     7.67
# 
# Confidence level used: 0.95&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I’m going to start with a relatively simple example. I will compare mean &lt;code&gt;resp&lt;/code&gt; of the &lt;code&gt;A.2&lt;/code&gt; group to the &lt;code&gt;B.2&lt;/code&gt; group via custom contrasts.&lt;/p&gt;
&lt;p&gt;Building a custom contrast involves pulling out specific group means of interest from the &lt;code&gt;emmeans()&lt;/code&gt; output. We &lt;em&gt;pull out&lt;/em&gt; a group mean by making a vector to represent the specific mean of interest. In this vector we assign a &lt;code&gt;1&lt;/code&gt; to the mean of the group of interest and a &lt;code&gt;0&lt;/code&gt; to the other groups.&lt;/p&gt;
&lt;p&gt;For example, to pull out the mean of &lt;code&gt;A.2&lt;/code&gt; from &lt;code&gt;emm1&lt;/code&gt; we will make a vector with 5 values in it, one for each row of the output. The second value will be a &lt;code&gt;1&lt;/code&gt;, since the mean of &lt;code&gt;A.2&lt;/code&gt; is on the second row of &lt;code&gt;emm1&lt;/code&gt;. All the other values in the vector will be &lt;code&gt;0&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Below is the vector that represents the &lt;code&gt;A.2&lt;/code&gt; mean.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;A2 = c(0, 1, 0, 0, 0)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Similarly, to pull out the mean of &lt;code&gt;B.2&lt;/code&gt; we’ll have a vector of 5 values with a &lt;code&gt;1&lt;/code&gt; as the fourth value. The &lt;code&gt;B.2&lt;/code&gt; group is on the fourth row in &lt;code&gt;emm1&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;B2 = c(0, 0, 0, 1, 0)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When building custom contrasts via vectors like this, the vectors will always be the same length as the number of rows in the &lt;code&gt;emmeans()&lt;/code&gt; output. I always calculate and print the estimated marginal means prior to building the vectors so I am certain of the number of rows and the order of the groups in the output.&lt;/p&gt;
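&lt;p&gt;If you’re worried about miscounting rows, the indicator vectors can also be built programmatically instead of typed by hand. Here’s a minimal sketch, assuming the group names are stored in the same order they appear in the &lt;code&gt;emmeans()&lt;/code&gt; output:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Group names in the order they appear in the emmeans() output
groups = c(&amp;quot;A.1&amp;quot;, &amp;quot;A.2&amp;quot;, &amp;quot;B.1&amp;quot;, &amp;quot;B.2&amp;quot;, &amp;quot;control&amp;quot;)

# A 1 marks the group of interest, 0 marks all other groups
A2 = as.numeric(groups == &amp;quot;A.2&amp;quot;)
A2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 0 1 0 0 0&lt;/code&gt;&lt;/pre&gt;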
&lt;/div&gt;
&lt;div id=&#34;the-contrast-function-for-custom-comparisons&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The contrast() function for custom comparisons&lt;/h1&gt;
&lt;p&gt;Once we have the vectors that represent the means we are interested in comparing, we actually do the comparisons via the &lt;code&gt;contrast()&lt;/code&gt; function. Since we are interested in a &lt;em&gt;difference&lt;/em&gt; in mean response, we take the difference between the vectors that represent the means.&lt;/p&gt;
&lt;p&gt;Taking the difference between vectors can be done inside or outside &lt;code&gt;contrast()&lt;/code&gt;. In this example I’m doing it inside.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;contrast()&lt;/code&gt; function takes an &lt;code&gt;emmGrid&lt;/code&gt; object (i.e., output from &lt;code&gt;emmeans()&lt;/code&gt;) as the first argument. We give the comparison we want to do via a list passed to the &lt;code&gt;method&lt;/code&gt; argument.&lt;/p&gt;
&lt;p&gt;Here I want to calculate the difference in mean &lt;code&gt;resp&lt;/code&gt; of &lt;code&gt;A.2&lt;/code&gt; and &lt;code&gt;B.2&lt;/code&gt;. I subtract the &lt;code&gt;B2&lt;/code&gt; vector from the &lt;code&gt;A2&lt;/code&gt; vector. The output is the difference in mean &lt;code&gt;resp&lt;/code&gt; between these groups.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;contrast(emm1, method = list(A2 - B2) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  contrast          estimate   SE df t.ratio p.value
#  c(0, 1, 0, -1, 0)    -1.96 0.66 20 -2.972  0.0075&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;using-named-lists-for-better-output&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using named lists for better output&lt;/h1&gt;
&lt;p&gt;Unfortunately you can’t tell which comparison was done in the output above 🤔. We can use a named list in &lt;code&gt;method&lt;/code&gt; to make the output more understandable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;contrast(emm1, method = list(&amp;quot;A2 - B2&amp;quot; = A2 - B2) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  contrast estimate   SE df t.ratio p.value
#  A2 - B2     -1.96 0.66 20 -2.972  0.0075&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;using-at-for-simple-comparisons&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using “at” for simple comparisons&lt;/h1&gt;
&lt;p&gt;Note that I didn’t need to do a custom contrast to do this particular comparison. I could have gotten the comparison I wanted by using the &lt;code&gt;at&lt;/code&gt; argument with &lt;code&gt;pairwise&lt;/code&gt; in &lt;code&gt;emmeans()&lt;/code&gt; and choosing just the two groups I was interested in.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emmeans(fit1, specs = pairwise ~ sub.rate, 
         at = list(sub.rate = c(&amp;quot;A.2&amp;quot;, &amp;quot;B.2&amp;quot;) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $emmeans
#  sub.rate emmean    SE df lower.CL upper.CL
#  A.2        4.04 0.466 20     3.07     5.01
#  B.2        6.00 0.466 20     5.03     6.97
# 
# Confidence level used: 0.95 
# 
# $contrasts
#  contrast  estimate   SE df t.ratio p.value
#  A.2 - B.2    -1.96 0.66 20 -2.972  0.0075&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;multiple-custom-contrasts-at-once&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Multiple custom contrasts at once&lt;/h1&gt;
&lt;p&gt;Multiple custom contrasts can be done simultaneously in &lt;code&gt;contrast()&lt;/code&gt; by adding more comparisons to the &lt;code&gt;method&lt;/code&gt; list. I’ll demonstrate this by doing the simple example comparison twice, changing only which group mean is subtracted from the other.&lt;/p&gt;
&lt;p&gt;I name both elements of the list for ease of interpretation. I find naming the list of comparisons to be a key part of doing these custom contrasts.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;contrast(emm1, method = list(&amp;quot;A2 - B2&amp;quot; = A2 - B2,
                             &amp;quot;B2 - A2&amp;quot; = B2 - A2) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  contrast estimate   SE df t.ratio p.value
#  A2 - B2     -1.96 0.66 20 -2.972  0.0075 
#  B2 - A2      1.96 0.66 20  2.972  0.0075&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Doing multiple comparisons at once means a multiple comparisons adjustment can be done as needed. In addition, we can use the &lt;code&gt;confint()&lt;/code&gt; function to get confidence intervals for the comparisons.&lt;/p&gt;
&lt;p&gt;I’ll add a multivariate-&lt;span class=&#34;math inline&#34;&gt;\(t\)&lt;/span&gt; adjustment via &lt;code&gt;adjust = &#34;mvt&#34;&lt;/code&gt; and then get confidence intervals for the comparisons. Remember we can get both confidence intervals and tests for comparisons via &lt;code&gt;summary()&lt;/code&gt; with &lt;code&gt;infer = TRUE&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;twocomp = contrast(emm1, method = list(&amp;quot;A2 minus B2&amp;quot; = A2 - B2,
                             &amp;quot;B2 minus A2&amp;quot; = B2 - A2),
         adjust = &amp;quot;mvt&amp;quot;) %&amp;gt;%
     confint()
twocomp&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  contrast    estimate   SE df lower.CL upper.CL
#  A2 minus B2    -1.96 0.66 20   -3.336   -0.584
#  B2 minus A2     1.96 0.66 20    0.584    3.336
# 
# Confidence level used: 0.95 
# Conf-level adjustment: mvt method for 2 estimates&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;more-complicated-custom-contrasts&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;More complicated custom contrasts&lt;/h1&gt;
&lt;p&gt;Now that we’ve seen a simple case, let’s do something slightly more complicated (and realistic). What if we want to compare the &lt;code&gt;A&lt;/code&gt; group to the &lt;code&gt;B&lt;/code&gt; group overall, regardless of the application rate?&lt;/p&gt;
&lt;p&gt;This is a &lt;em&gt;main effect&lt;/em&gt; comparison, so I need to average over the effect of the rate factor in order to estimate the overall effect of the levels of the substance factor.&lt;/p&gt;
&lt;p&gt;To do this comparison I need the means for all four non-control factor levels. I’ll print &lt;code&gt;emm1&lt;/code&gt; again here so I remember the order of the output before starting to write out the vectors that represent the group means.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm1&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  sub.rate emmean    SE df lower.CL upper.CL
#  A.1        5.24 0.466 20     4.27     6.21
#  A.2        4.04 0.466 20     3.07     5.01
#  B.1        6.22 0.466 20     5.25     7.19
#  B.2        6.00 0.466 20     5.03     6.97
#  control    6.70 0.466 20     5.73     7.67
# 
# Confidence level used: 0.95&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I’ll need all means that involve &lt;code&gt;A&lt;/code&gt; or &lt;code&gt;B&lt;/code&gt;, which are the first four group means in &lt;code&gt;emm1&lt;/code&gt;. I’ll make a vector to represent each of these group means.&lt;/p&gt;
&lt;p&gt;Typing these vectors out isn’t too hard here, since they contain only 5 values. When I have many groups, and so really long vectors, I sometimes use &lt;code&gt;rep()&lt;/code&gt; to repeat the 0 values instead of typing each one out.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;A1 = c(1, 0, 0, 0, 0)
A2 = c(0, 1, 0, 0, 0)
B1 = c(0, 0, 1, 0, 0)
B2 = c(0, 0, 0, 1, 0)&lt;/code&gt;&lt;/pre&gt;
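&lt;p&gt;For instance, a sketch of the same &lt;code&gt;A2&lt;/code&gt; vector built with &lt;code&gt;rep()&lt;/code&gt; rather than typing each 0:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Identical to c(0, 1, 0, 0, 0)
A2 = c(0, 1, rep(0, 3) )&lt;/code&gt;&lt;/pre&gt;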
&lt;p&gt;The vectors I made represent means for combinations of substance and rate. I want to compare the &lt;em&gt;overall&lt;/em&gt; substance group means, though. This can be done by &lt;em&gt;averaging over&lt;/em&gt; the two rates. This involves literally taking the average of, e.g., the &lt;code&gt;A1&lt;/code&gt; and &lt;code&gt;A2&lt;/code&gt; vectors to get a vector that represents the overall &lt;code&gt;A&lt;/code&gt; mean.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Aoverall = (A1 + A2)/2
Boverall = (B1 + B2)/2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now that we have vectors to represent the overall means, we can compare the mean &lt;code&gt;resp&lt;/code&gt; of the &lt;code&gt;A&lt;/code&gt; group vs the &lt;code&gt;B&lt;/code&gt; group overall in &lt;code&gt;contrast()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;contrast(emm1, method = list(&amp;quot;A - B&amp;quot; = Aoverall - Boverall) ) &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  contrast estimate    SE df t.ratio p.value
#  A - B       -1.47 0.466 20 -3.152  0.0050&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Custom contrasts are all built in this same basic way. You can also build your own contrast function if there is some contrast you do all the time that is not part of &lt;strong&gt;emmeans&lt;/strong&gt;. See the &lt;a href=&#34;https://cran.r-project.org/web/packages/emmeans/vignettes/comparisons.html#linfcns&#34;&gt;custom contrasts section&lt;/a&gt; of the &lt;strong&gt;emmeans&lt;/strong&gt; vignette for more info.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;just-the-code-please&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Just the code, please&lt;/h1&gt;
&lt;p&gt;Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code &lt;a href=&#34;https://aosmith.rbind.io/script/2019-04-15-custom-contrasts-in-emmeans.R&#34;&gt;from here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(emmeans) # v. 1.3.3
library(magrittr) # v. 1.5

dat = structure(list(sub.rate = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 
5L, 5L), .Label = c(&amp;quot;A.1&amp;quot;, &amp;quot;A.2&amp;quot;, &amp;quot;B.1&amp;quot;, &amp;quot;B.2&amp;quot;, &amp;quot;control&amp;quot;), class = &amp;quot;factor&amp;quot;), 
    resp = c(5.5, 4.9, 6.1, 3.6, 6.1, 3.5, 3, 4.1, 5, 4.6, 7.3, 
    5.6, 4.8, 7.2, 6.2, 4.3, 6.6, 6.5, 5.5, 7.1, 5.4, 6.7, 6.8, 
    8.5, 6.1)), row.names = c(NA, -25L), class = &amp;quot;data.frame&amp;quot;)

str(dat)

fit1 = lm(resp ~ sub.rate, data = dat)

emmeans(fit1, specs = trt.vs.ctrlk ~ sub.rate)

emm1 = emmeans(fit1, specs = ~ sub.rate)
emm1

A2 = c(0, 1, 0, 0, 0)
B2 = c(0, 0, 0, 1, 0)
contrast(emm1, method = list(A2 - B2) )

contrast(emm1, method = list(&amp;quot;A2 - B2&amp;quot; = A2 - B2) )

emmeans(fit1, specs = pairwise ~ sub.rate, 
         at = list(sub.rate = c(&amp;quot;A.2&amp;quot;, &amp;quot;B.2&amp;quot;) ) )

contrast(emm1, method = list(&amp;quot;A2 - B2&amp;quot; = A2 - B2,
                             &amp;quot;B2 - A2&amp;quot; = B2 - A2) )

twocomp = contrast(emm1, method = list(&amp;quot;A2 minus B2&amp;quot; = A2 - B2,
                             &amp;quot;B2 minus A2&amp;quot; = B2 - A2),
         adjust = &amp;quot;mvt&amp;quot;) %&amp;gt;%
     confint()
twocomp

emm1
A1 = c(1, 0, 0, 0, 0)
A2 = c(0, 1, 0, 0, 0)
B1 = c(0, 0, 1, 0, 0)
B2 = c(0, 0, 0, 1, 0)

Aoverall = (A1 + A2)/2
Boverall = (B1 + B2)/2

contrast(emm1, method = list(&amp;quot;A - B&amp;quot; = Aoverall - Boverall) ) &lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Getting started with emmeans</title>
      <link>https://aosmith.rbind.io/2019/03/25/getting-started-with-emmeans/</link>
      <pubDate>Mon, 25 Mar 2019 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2019/03/25/getting-started-with-emmeans/</guid>
      <description>
&lt;script src=&#34;https://aosmith.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;&lt;em&gt;This post was last updated on 2021-11-04.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Package &lt;strong&gt;emmeans&lt;/strong&gt; (formerly known as &lt;strong&gt;lsmeans&lt;/strong&gt;) is enormously useful for folks wanting to do post hoc comparisons among groups after fitting a model. It has a very thorough set of vignettes (see the vignette topics &lt;a href=&#34;https://cran.r-project.org/web/packages/emmeans/vignettes/basics.html#contents&#34;&gt;here&lt;/a&gt;), is very flexible with a ton of options, and works out of the box with a lot of different model objects (and can be extended to others 👍).&lt;/p&gt;
&lt;p&gt;I’ve been consistently recommending &lt;strong&gt;emmeans&lt;/strong&gt; to students fitting models in R. However, students often struggle a bit to get started using the package, possibly due to the sheer amount of flexibility and information in the vignettes.&lt;/p&gt;
&lt;p&gt;I’ve put together some basic examples for using &lt;strong&gt;emmeans&lt;/strong&gt;, meant to be a complement to the vignettes. Specifically this post will demonstrate a few of the built-in options for some standard post hoc comparisons; I will write a separate post about custom comparisons in &lt;strong&gt;emmeans&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Disclaimer: This post is about using a package in R and so unfortunately does not focus on appropriate statistical practice for model fitting and post hoc comparisons.&lt;/em&gt;&lt;/p&gt;
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#r-packages&#34;&gt;R packages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-dataset-and-model&#34;&gt;The dataset and model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#built-in-comparisons-with-emmeans&#34;&gt;Built in comparisons with emmeans()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#all-pairwise-comparisons&#34;&gt;All pairwise comparisons&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#back-transforming-results&#34;&gt;Back-transforming results&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#changing-the-multiple-comparisons-adjustment&#34;&gt;Changing the multiple comparisons adjustment&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#confidence-intervals-for-comparisons&#34;&gt;Confidence intervals for comparisons&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#putting-results-in-a-data.frame&#34;&gt;Putting results in a data.frame&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#within-group-comparisons&#34;&gt;Within group comparisons&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#main-effects-comparisons&#34;&gt;Main effects comparisons&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#treatment-vs-control-example&#34;&gt;Treatment vs control example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#alternative-code-for-comparisons&#34;&gt;Alternative code for comparisons&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#just-the-code-please&#34;&gt;Just the code, please&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;r-packages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;R packages&lt;/h1&gt;
&lt;p&gt;I will load &lt;strong&gt;magrittr&lt;/strong&gt; for the pipe in addition to &lt;strong&gt;emmeans&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(emmeans) # v. 1.7.0
library(magrittr) # v. 2.0.1&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;the-dataset-and-model&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The dataset and model&lt;/h1&gt;
&lt;p&gt;I’ve made a small dataset to use in this example.&lt;/p&gt;
&lt;p&gt;The response variable is &lt;code&gt;resp&lt;/code&gt;, which comes from the log-normal distribution, and the two crossed factors of interest are &lt;code&gt;f1&lt;/code&gt; and &lt;code&gt;f2&lt;/code&gt;. Each factor has two levels: a control called &lt;code&gt;c&lt;/code&gt; as well as a second non-control level.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat = data.frame(resp = c(1.6,0.3,3,0.1,3.2,0.2,0.4,0.4,2.8,
                          0.7,3.8,3,0.3,14.3,1.2,0.5,1.1,4.4,0.4,8.4),
                 f1 = factor(c(&amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,
                               &amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;,
                               &amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;)),
                 f2 = factor(c(&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,
                               &amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,
                               &amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;)))

str(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# &amp;#39;data.frame&amp;#39;: 20 obs. of  3 variables:
#  $ resp: num  1.6 0.3 3 0.1 3.2 0.2 0.4 0.4 2.8 0.7 ...
#  $ f1  : Factor w/ 2 levels &amp;quot;a&amp;quot;,&amp;quot;c&amp;quot;: 1 1 1 1 1 1 1 1 1 1 ...
#  $ f2  : Factor w/ 2 levels &amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;: 1 2 1 2 1 2 1 2 1 2 ...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The model I will use is a linear model with a log-transformed response variable and the two factors and their interaction as explanatory variables. This is the “true” model since I created these data so I’m skipping all model checks (which I would not do in a real analysis).&lt;/p&gt;
&lt;p&gt;Note I use &lt;code&gt;log(resp)&lt;/code&gt; in the model rather than creating a new log-transformed variable. This will allow me to demonstrate one of the convenient options available in &lt;code&gt;emmeans()&lt;/code&gt; later.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit1 = lm(log(resp) ~ f1 + f2 + f1:f2, data = dat)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;built-in-comparisons-with-emmeans&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Built in comparisons with emmeans()&lt;/h1&gt;
&lt;p&gt;The &lt;strong&gt;emmeans&lt;/strong&gt; package has helper functions for commonly used post hoc comparisons (aka &lt;em&gt;contrasts&lt;/em&gt;). For example, we can do pairwise comparisons via &lt;code&gt;pairwise&lt;/code&gt; or &lt;code&gt;revpairwise&lt;/code&gt;, treatment vs control comparisons via &lt;code&gt;trt.vs.ctrl&lt;/code&gt; or &lt;code&gt;trt.vs.ctrlk&lt;/code&gt;, and even consecutive comparisons via &lt;code&gt;consec&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The available built-in functions for doing comparisons are listed in the documentation for &lt;code&gt;?&#34;contrast-methods&#34;&lt;/code&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;all-pairwise-comparisons&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;All pairwise comparisons&lt;/h1&gt;
&lt;p&gt;One way to use &lt;code&gt;emmeans()&lt;/code&gt; is via formula coding for the comparisons. The formula is defined in the &lt;code&gt;specs&lt;/code&gt; argument.&lt;/p&gt;
&lt;p&gt;In my first example I do all pairwise comparisons for all combinations of &lt;code&gt;f1&lt;/code&gt; and &lt;code&gt;f2&lt;/code&gt;. The built-in function &lt;code&gt;pairwise&lt;/code&gt; is put on the left-hand side of the formula of the &lt;code&gt;specs&lt;/code&gt; argument. The factors with levels to compare among are on the right-hand side. Since I’m doing all pairwise comparisons, the combination of &lt;code&gt;f1&lt;/code&gt; and &lt;code&gt;f2&lt;/code&gt; is put in the formula.&lt;/p&gt;
&lt;p&gt;The model object is passed to the first argument in &lt;code&gt;emmeans()&lt;/code&gt;, &lt;code&gt;object&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm1 = emmeans(fit1, specs = pairwise ~ f1:f2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using the formula in this way returns an object with two parts. The first part, called &lt;code&gt;emmeans&lt;/code&gt;, is the estimated marginal means along with the standard errors and confidence intervals. We can pull these out with dollar sign notation, which I demonstrate below.&lt;/p&gt;
&lt;p&gt;These results are all on the &lt;em&gt;model&lt;/em&gt; scale, so in this case these are the estimated mean log responses for each &lt;code&gt;f1&lt;/code&gt; and &lt;code&gt;f2&lt;/code&gt; combination. Note the message in the output that &lt;code&gt;emmeans()&lt;/code&gt; gives us about results being on the log scale. It knows the model is on the log scale because I used &lt;code&gt;log(resp)&lt;/code&gt; as the response variable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm1$emmeans&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  f1 f2 emmean    SE df lower.CL upper.CL
#  a  1   0.569 0.445 16   -0.374    1.512
#  c  1  -0.102 0.445 16   -1.045    0.842
#  a  c  -1.278 0.445 16   -2.221   -0.334
#  c  c   1.335 0.445 16    0.392    2.279
# 
# Results are given on the log (not the response) scale. 
# Confidence level used: 0.95&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The second part of the output, called &lt;code&gt;contrasts&lt;/code&gt;, contains the comparisons of interest. It is this section that we are generally most interested in when answering a question about differences among groups. You can see which comparison is which via the &lt;code&gt;contrast&lt;/code&gt; column.&lt;/p&gt;
&lt;p&gt;These results are also on the model scale (and we get the same message in this section), and &lt;a href=&#34;#back-transforming-results&#34;&gt;we’ll want to put them on the original scale&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The comparisons are accompanied by statistical tests of the null hypothesis of “no difference”, but lack confidence interval (CI) limits by default. &lt;a href=&#34;#confidence-intervals-for-comparisons&#34;&gt;We’ll need to get these&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;emmeans&lt;/strong&gt; package automatically adjusts for multiple comparisons. Since we did all pairwise comparisons, the package used a Tukey adjustment. &lt;a href=&#34;#changing-the-multiple-comparisons-adjustment&#34;&gt;The type of adjustment can be changed&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm1$contrasts&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  contrast  estimate    SE df t.ratio p.value
#  a 1 - c 1    0.671 0.629 16   1.065  0.7146
#  a 1 - a c    1.847 0.629 16   2.934  0.0434
#  a 1 - c c   -0.766 0.629 16  -1.217  0.6253
#  c 1 - a c    1.176 0.629 16   1.869  0.2795
#  c 1 - c c   -1.437 0.629 16  -2.283  0.1438
#  a c - c c   -2.613 0.629 16  -4.152  0.0038
# 
# Results are given on the log (not the response) scale. 
# P value adjustment: tukey method for comparing a family of 4 estimates&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;back-transforming-results&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Back-transforming results&lt;/h1&gt;
&lt;p&gt;Since I used a log transformation I can express the results as multiplicative differences in medians on the original (data) scale.&lt;/p&gt;
&lt;p&gt;We can always back-transform estimates and CI limits by hand, but in &lt;code&gt;emmeans()&lt;/code&gt; we can use the &lt;code&gt;type&lt;/code&gt; argument for this. Using &lt;code&gt;type = &#34;response&#34;&lt;/code&gt; will return results on the original scale. This works when the transformation is explicit in the model (e.g., &lt;code&gt;log(resp)&lt;/code&gt;) and works similarly for link functions in generalized linear models.&lt;/p&gt;
&lt;p&gt;You’ll see the message changes in the output once I do this, indicating things were back-transformed from the model scale. We also are reminded that the tests were done on the model scale.&lt;/p&gt;
&lt;p&gt;In the &lt;code&gt;contrast&lt;/code&gt; column in the &lt;code&gt;contrasts&lt;/code&gt; section we can see the expression of the comparisons has changed from additive comparisons (via subtraction) shown above to multiplicative comparisons (via division).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emmeans(fit1, specs = pairwise ~ f1:f2, type = &amp;quot;response&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $emmeans
#  f1 f2 response    SE df lower.CL upper.CL
#  a  1     1.767 0.786 16    0.688    4.538
#  c  1     0.903 0.402 16    0.352    2.321
#  a  c     0.279 0.124 16    0.108    0.716
#  c  c     3.800 1.691 16    1.479    9.763
# 
# Confidence level used: 0.95 
# Intervals are back-transformed from the log scale 
# 
# $contrasts
#  contrast   ratio     SE df null t.ratio p.value
#  a 1 / c 1 1.9553 1.2306 16    1   1.065  0.7146
#  a 1 / a c 6.3396 3.9900 16    1   2.934  0.0434
#  a 1 / c c 0.4648 0.2926 16    1  -1.217  0.6253
#  c 1 / a c 3.2422 2.0406 16    1   1.869  0.2795
#  c 1 / c c 0.2377 0.1496 16    1  -2.283  0.1438
#  a c / c c 0.0733 0.0461 16    1  -4.152  0.0038
# 
# P value adjustment: tukey method for comparing a family of 4 estimates 
# Tests are performed on the log scale&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;changing-the-multiple-comparisons-adjustment&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Changing the multiple comparisons adjustment&lt;/h1&gt;
&lt;p&gt;The &lt;code&gt;adjust&lt;/code&gt; argument can be used to change the type of multiple comparisons adjustment. All available options are listed and described in the documentation for &lt;code&gt;summary.emmGrid&lt;/code&gt; under the section &lt;em&gt;P-value adjustments&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;One option is to skip multiple comparisons adjustments altogether, using &lt;code&gt;adjust = &#34;none&#34;&lt;/code&gt;. If we use this, the message about multiple comparisons disappears (since we didn’t use one).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm1.1 = emmeans(fit1, specs = pairwise ~ f1:f2, type = &amp;quot;response&amp;quot;, adjust = &amp;quot;none&amp;quot;)
emm1.1&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $emmeans
#  f1 f2 response    SE df lower.CL upper.CL
#  a  1     1.767 0.786 16    0.688    4.538
#  c  1     0.903 0.402 16    0.352    2.321
#  a  c     0.279 0.124 16    0.108    0.716
#  c  c     3.800 1.691 16    1.479    9.763
# 
# Confidence level used: 0.95 
# Intervals are back-transformed from the log scale 
# 
# $contrasts
#  contrast   ratio     SE df null t.ratio p.value
#  a 1 / c 1 1.9553 1.2306 16    1   1.065  0.3025
#  a 1 / a c 6.3396 3.9900 16    1   2.934  0.0097
#  a 1 / c c 0.4648 0.2926 16    1  -1.217  0.2412
#  c 1 / a c 3.2422 2.0406 16    1   1.869  0.0801
#  c 1 / c c 0.2377 0.1496 16    1  -2.283  0.0365
#  a c / c c 0.0733 0.0461 16    1  -4.152  0.0008
# 
# Tests are performed on the log scale&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;confidence-intervals-for-comparisons&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Confidence intervals for comparisons&lt;/h1&gt;
&lt;p&gt;We will almost invariably want to report confidence intervals for any comparisons of interest. We need a separate function to get these. Here is an example using the &lt;code&gt;confint()&lt;/code&gt; function with the default 95% CI (the confidence level can be changed, see &lt;code&gt;?confint.emmGrid&lt;/code&gt;). I use the pipe to pass the &lt;code&gt;contrasts&lt;/code&gt; into the &lt;code&gt;confint()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm1.1$contrasts %&amp;gt;%
     confint()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  contrast   ratio     SE df lower.CL upper.CL
#  a 1 / c 1 1.9553 1.2306 16   0.5150    7.424
#  a 1 / a c 6.3396 3.9900 16   1.6696   24.072
#  a 1 / c c 0.4648 0.2926 16   0.1224    1.765
#  c 1 / a c 3.2422 2.0406 16   0.8539   12.311
#  c 1 / c c 0.2377 0.1496 16   0.0626    0.903
#  a c / c c 0.0733 0.0461 16   0.0193    0.278
# 
# Confidence level used: 0.95 
# Intervals are back-transformed from the log scale&lt;/code&gt;&lt;/pre&gt;
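&lt;p&gt;The &lt;code&gt;level&lt;/code&gt; argument of &lt;code&gt;confint()&lt;/code&gt; controls the confidence level. For example, 90% intervals could be requested like this (output not shown):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm1.1$contrasts %&amp;gt;%
     confint(level = 0.90)&lt;/code&gt;&lt;/pre&gt;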
&lt;p&gt;The &lt;code&gt;confint()&lt;/code&gt; function returns confidence intervals but gets rid of the statistical tests. Some people will want to also report the test statistics and p-values. In this case, we can use &lt;code&gt;summary()&lt;/code&gt; instead of &lt;code&gt;confint()&lt;/code&gt;, with &lt;code&gt;infer = TRUE&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm1.1$contrasts %&amp;gt;%
     summary(infer = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  contrast   ratio     SE df lower.CL upper.CL null t.ratio p.value
#  a 1 / c 1 1.9553 1.2306 16   0.5150    7.424    1   1.065  0.3025
#  a 1 / a c 6.3396 3.9900 16   1.6696   24.072    1   2.934  0.0097
#  a 1 / c c 0.4648 0.2926 16   0.1224    1.765    1  -1.217  0.2412
#  c 1 / a c 3.2422 2.0406 16   0.8539   12.311    1   1.869  0.0801
#  c 1 / c c 0.2377 0.1496 16   0.0626    0.903    1  -2.283  0.0365
#  a c / c c 0.0733 0.0461 16   0.0193    0.278    1  -4.152  0.0008
# 
# Confidence level used: 0.95 
# Intervals are back-transformed from the log scale 
# Tests are performed on the log scale&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;putting-results-in-a-data.frame&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Putting results in a data.frame&lt;/h1&gt;
&lt;p&gt;One of the really nice things about &lt;code&gt;emmeans()&lt;/code&gt; is that it makes it easy to get the results into a nice format for making tables or graphics of results. This is because the results are converted to a data.frame with &lt;code&gt;confint()&lt;/code&gt; or &lt;code&gt;summary()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If needed, the estimated marginal means can also be put into a data.frame. In this case we can use &lt;code&gt;as.data.frame()&lt;/code&gt; to convert the &lt;code&gt;emmeans&lt;/code&gt; to a data.frame for plotting or putting into a table of results. We can also use &lt;code&gt;as.data.frame()&lt;/code&gt; directly on the contrasts above if we don’t need &lt;code&gt;confint()&lt;/code&gt; or &lt;code&gt;summary()&lt;/code&gt; (not shown).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm1.1$emmeans %&amp;gt;%
     as.data.frame()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#   f1 f2  response        SE df  lower.CL upper.CL
# 1  a  1 1.7665334 0.7861763 16 0.6876870 4.537879
# 2  c  1 0.9034576 0.4020739 16 0.3517035 2.320806
# 3  a  c 0.2786518 0.1240109 16 0.1084753 0.715802
# 4  c  c 3.8004222 1.6913362 16 1.4794517 9.762542&lt;/code&gt;&lt;/pre&gt;
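&lt;p&gt;As a quick sketch, the contrasts can be converted in the same way (output not shown):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm1.1$contrasts %&amp;gt;%
     as.data.frame()&lt;/code&gt;&lt;/pre&gt;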
&lt;/div&gt;
&lt;div id=&#34;within-group-comparisons&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Within group comparisons&lt;/h1&gt;
&lt;p&gt;While we &lt;em&gt;can&lt;/em&gt; do all pairwise comparisons, there are certainly plenty of situations where the research question dictates that we only want a specific set of comparisons. A common example of this is when we want to compare the levels of one factor within the levels of another. Here I’ll show comparisons among levels of &lt;code&gt;f1&lt;/code&gt; for each level of &lt;code&gt;f2&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The only thing that changes is the right-hand side of the &lt;code&gt;specs&lt;/code&gt; formula. The code &lt;code&gt;f1|f2&lt;/code&gt; translates to “compare levels of &lt;code&gt;f1&lt;/code&gt; within each level of &lt;code&gt;f2&lt;/code&gt;”.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm2 = emmeans(fit1, specs = pairwise ~ f1|f2, type = &amp;quot;response&amp;quot;)
emm2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $emmeans
# f2 = 1:
#  f1 response    SE df lower.CL upper.CL
#  a     1.767 0.786 16    0.688    4.538
#  c     0.903 0.402 16    0.352    2.321
# 
# f2 = c:
#  f1 response    SE df lower.CL upper.CL
#  a     0.279 0.124 16    0.108    0.716
#  c     3.800 1.691 16    1.479    9.763
# 
# Confidence level used: 0.95 
# Intervals are back-transformed from the log scale 
# 
# $contrasts
# f2 = 1:
#  contrast  ratio     SE df null t.ratio p.value
#  a / c    1.9553 1.2306 16    1   1.065  0.3025
# 
# f2 = c:
#  contrast  ratio     SE df null t.ratio p.value
#  a / c    0.0733 0.0461 16    1  -4.152  0.0008
# 
# Tests are performed on the log scale&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can see there is no message about a multiple comparisons adjustment in the above set of comparisons. This is because the package default is to correct for the number of comparisons &lt;em&gt;within&lt;/em&gt; each group instead of across groups. In this case there is only a single comparison in each group.&lt;/p&gt;
&lt;p&gt;If we consider the family of comparisons to be all comparisons regardless of group and want to correct for multiple comparisons, we can do so via &lt;code&gt;rbind.emmGrid&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Here is an example of passing &lt;code&gt;contrasts&lt;/code&gt; to &lt;code&gt;rbind()&lt;/code&gt; to correct for multiple comparisons. The default adjustment is Bonferroni, which can be much too conservative when the number of comparisons is large. You can control the multiple comparisons procedure via &lt;code&gt;adjust&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The results of &lt;code&gt;rbind()&lt;/code&gt; can also conveniently be used with &lt;code&gt;summary()&lt;/code&gt;, &lt;code&gt;confint()&lt;/code&gt;, and/or &lt;code&gt;as.data.frame()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm2$contrasts %&amp;gt;%
     rbind() &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  f2 contrast  ratio     SE df null t.ratio p.value
#  1  a / c    1.9553 1.2306 16    1   1.065  0.6050
#  c  a / c    0.0733 0.0461 16    1  -4.152  0.0015
# 
# P value adjustment: bonferroni method for 2 tests 
# Tests are performed on the log scale&lt;/code&gt;&lt;/pre&gt;
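&lt;p&gt;For example, a Holm adjustment could be requested in place of the Bonferroni default (output not shown):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm2$contrasts %&amp;gt;%
     rbind(adjust = &amp;quot;holm&amp;quot;)&lt;/code&gt;&lt;/pre&gt;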
&lt;/div&gt;
&lt;div id=&#34;main-effects-comparisons&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Main effects comparisons&lt;/h1&gt;
&lt;p&gt;Even if we have multiple factors in the model, complete with an interaction term, we can still do “overall” comparisons among groups if our research question indicated that main effects were important to estimate.&lt;/p&gt;
&lt;p&gt;Doing main effects in the presence of an interaction means we &lt;em&gt;average over&lt;/em&gt; the levels of the other factor(s). The &lt;code&gt;emmeans()&lt;/code&gt; function gives both a warning about the interaction and a message indicating which factor was averaged over to remind us of this.&lt;/p&gt;
&lt;p&gt;Here is the estimated main effect of &lt;code&gt;f1&lt;/code&gt;. Since we are only interested in overall comparisons of that factor it is the only factor given on the right-hand side of the &lt;code&gt;specs&lt;/code&gt; formula.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emmeans(fit1, specs = pairwise ~ f1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# NOTE: Results may be misleading due to involvement in interactions&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $emmeans
#  f1 emmean    SE df lower.CL upper.CL
#  a  -0.354 0.315 16  -1.0215    0.313
#  c   0.617 0.315 16  -0.0503    1.284
# 
# Results are averaged over the levels of: f2 
# Results are given on the log (not the response) scale. 
# Confidence level used: 0.95 
# 
# $contrasts
#  contrast estimate    SE df t.ratio p.value
#  a - c      -0.971 0.445 16  -2.182  0.0443
# 
# Results are averaged over the levels of: f2 
# Results are given on the log (not the response) scale.&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;treatment-vs-control-example&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Treatment vs control example&lt;/h1&gt;
&lt;p&gt;The &lt;strong&gt;emmeans&lt;/strong&gt; package has built-in helper functions for comparing each group mean to the control mean. If the control group is in the first row of the &lt;code&gt;emmeans&lt;/code&gt; section of the output, this set of comparisons can be requested via &lt;code&gt;trt.vs.ctrl&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Note the default multiple comparisons adjustment is a Dunnett adjustment.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emmeans(fit1, specs = trt.vs.ctrl ~ f1:f2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $emmeans
#  f1 f2 emmean    SE df lower.CL upper.CL
#  a  1   0.569 0.445 16   -0.374    1.512
#  c  1  -0.102 0.445 16   -1.045    0.842
#  a  c  -1.278 0.445 16   -2.221   -0.334
#  c  c   1.335 0.445 16    0.392    2.279
# 
# Results are given on the log (not the response) scale. 
# Confidence level used: 0.95 
# 
# $contrasts
#  contrast  estimate    SE df t.ratio p.value
#  c 1 - a 1   -0.671 0.629 16  -1.065  0.5857
#  a c - a 1   -1.847 0.629 16  -2.934  0.0262
#  c c - a 1    0.766 0.629 16   1.217  0.4947
# 
# Results are given on the log (not the response) scale. 
# P value adjustment: dunnettx method for 3 tests&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using &lt;code&gt;trt.vs.ctrl&lt;/code&gt; means we ended up comparing each group mean to the “a 1” group since it is in the first row. In the example I’m using, the control group, “c c”, is actually the &lt;em&gt;last&lt;/em&gt; group listed in the &lt;code&gt;emmeans&lt;/code&gt; section. When the control group is the last group in &lt;code&gt;emmeans&lt;/code&gt;, we can use &lt;code&gt;trt.vs.ctrlk&lt;/code&gt; to get the correct set of comparisons.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emmeans(fit1, specs = trt.vs.ctrlk ~ f1:f2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $emmeans
#  f1 f2 emmean    SE df lower.CL upper.CL
#  a  1   0.569 0.445 16   -0.374    1.512
#  c  1  -0.102 0.445 16   -1.045    0.842
#  a  c  -1.278 0.445 16   -2.221   -0.334
#  c  c   1.335 0.445 16    0.392    2.279
# 
# Results are given on the log (not the response) scale. 
# Confidence level used: 0.95 
# 
# $contrasts
#  contrast  estimate    SE df t.ratio p.value
#  a 1 - c c   -0.766 0.629 16  -1.217  0.4947
#  c 1 - c c   -1.437 0.629 16  -2.283  0.0935
#  a c - c c   -2.613 0.629 16  -4.152  0.0021
# 
# Results are given on the log (not the response) scale. 
# P value adjustment: dunnettx method for 3 tests&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That gives us what we want in this case. However, if the control group was some other group, like “c 1”, we could use &lt;code&gt;trt.vs.ctrlk&lt;/code&gt; with the &lt;code&gt;ref&lt;/code&gt; argument to define which row in the &lt;code&gt;emmeans&lt;/code&gt; section represents the control group.&lt;/p&gt;
&lt;p&gt;The “c 1” group is in the second row of the &lt;code&gt;emmeans&lt;/code&gt; section, so we can use &lt;code&gt;ref = 2&lt;/code&gt; to define this group as the control group.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emmeans(fit1, specs = trt.vs.ctrlk ~ f1:f2, ref = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $emmeans
#  f1 f2 emmean    SE df lower.CL upper.CL
#  a  1   0.569 0.445 16   -0.374    1.512
#  c  1  -0.102 0.445 16   -1.045    0.842
#  a  c  -1.278 0.445 16   -2.221   -0.334
#  c  c   1.335 0.445 16    0.392    2.279
# 
# Results are given on the log (not the response) scale. 
# Confidence level used: 0.95 
# 
# $contrasts
#  contrast  estimate    SE df t.ratio p.value
#  a 1 - c 1    0.671 0.629 16   1.065  0.5857
#  a c - c 1   -1.176 0.629 16  -1.869  0.1937
#  c c - c 1    1.437 0.629 16   2.283  0.0935
# 
# Results are given on the log (not the response) scale. 
# P value adjustment: dunnettx method for 3 tests&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, if we want to reverse the order of subtraction in the treatment vs control comparisons we can use the &lt;code&gt;reverse&lt;/code&gt; argument.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emmeans(fit1, specs = trt.vs.ctrlk ~ f1:f2, ref = 2, reverse = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# $emmeans
#  f1 f2 emmean    SE df lower.CL upper.CL
#  a  1   0.569 0.445 16   -0.374    1.512
#  c  1  -0.102 0.445 16   -1.045    0.842
#  a  c  -1.278 0.445 16   -2.221   -0.334
#  c  c   1.335 0.445 16    0.392    2.279
# 
# Results are given on the log (not the response) scale. 
# Confidence level used: 0.95 
# 
# $contrasts
#  contrast  estimate    SE df t.ratio p.value
#  c 1 - a 1   -0.671 0.629 16  -1.065  0.5857
#  c 1 - a c    1.176 0.629 16   1.869  0.1937
#  c 1 - c c   -1.437 0.629 16  -2.283  0.0935
# 
# Results are given on the log (not the response) scale. 
# P value adjustment: dunnettx method for 3 tests&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;alternative-code-for-comparisons&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Alternative code for comparisons&lt;/h1&gt;
&lt;p&gt;The &lt;strong&gt;emmeans&lt;/strong&gt; package offers the option to do comparisons in two steps instead of the single step I have been using so far. I personally find this alternative most useful when doing custom comparisons, and introducing it now should make it look familiar later. This alternative keeps the estimated marginal means and the comparisons of interest in separate objects, which can be attractive in some situations.&lt;/p&gt;
&lt;p&gt;The first step is to use &lt;code&gt;emmeans()&lt;/code&gt; to calculate the marginal means of interest. We still use the formula in &lt;code&gt;specs&lt;/code&gt; with the factor(s) of interest on the right-hand side but no longer put anything on the left-hand side of the tilde.&lt;/p&gt;
&lt;p&gt;We can still use &lt;code&gt;type&lt;/code&gt; in &lt;code&gt;emmeans()&lt;/code&gt; but cannot use &lt;code&gt;adjust&lt;/code&gt; (since we don’t adjust for multiple comparisons until we’ve actually done comparisons 😉).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;emm3 = emmeans(fit1, specs = ~ f1:f2, type = &amp;quot;response&amp;quot;)
emm3&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  f1 f2 response    SE df lower.CL upper.CL
#  a  1     1.767 0.786 16    0.688    4.538
#  c  1     0.903 0.402 16    0.352    2.321
#  a  c     0.279 0.124 16    0.108    0.716
#  c  c     3.800 1.691 16    1.479    9.763
# 
# Confidence level used: 0.95 
# Intervals are back-transformed from the log scale&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We then get the comparisons we want in a second step using the &lt;code&gt;contrast()&lt;/code&gt; function. We request the comparisons we want via &lt;code&gt;method&lt;/code&gt;. When using built-in comparisons like I am here, we give the comparison function name as a string (meaning in quotes). Also see the &lt;code&gt;pairs()&lt;/code&gt; function, which is for the special case of all pairwise comparisons.&lt;/p&gt;
&lt;p&gt;We can use &lt;code&gt;adjust&lt;/code&gt; in &lt;code&gt;contrast()&lt;/code&gt; to change the multiple comparisons adjustment.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;contrast(emm3, method = &amp;quot;pairwise&amp;quot;, adjust = &amp;quot;none&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;#  contrast   ratio     SE df null t.ratio p.value
#  a 1 / c 1 1.9553 1.2306 16    1   1.065  0.3025
#  a 1 / a c 6.3396 3.9900 16    1   2.934  0.0097
#  a 1 / c c 0.4648 0.2926 16    1  -1.217  0.2412
#  c 1 / a c 3.2422 2.0406 16    1   1.869  0.0801
#  c 1 / c c 0.2377 0.1496 16    1  -2.283  0.0365
#  a c / c c 0.0733 0.0461 16    1  -4.152  0.0008
# 
# Tests are performed on the log scale&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can follow the &lt;code&gt;contrast()&lt;/code&gt; call with &lt;code&gt;summary()&lt;/code&gt; or &lt;code&gt;confint()&lt;/code&gt; to get the output we want and put it into a data.frame for plotting/saving. Again, I think the real strength of &lt;code&gt;contrast()&lt;/code&gt; comes when we want custom comparisons, and I’ll demonstrate these in my &lt;a href=&#34;https://aosmith.rbind.io/2019/04/15/custom-contrasts-emmeans/&#34;&gt;next post on custom contrasts&lt;/a&gt;.&lt;/p&gt;
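Putting the two steps together with a data.frame conversion might look like the following. This is a sketch assuming the emmeans package is installed; it rebuilds the model from this post's data so it runs on its own, and it uses nested calls rather than pipes.

```r
library(emmeans)

# Same data and model as in the post
dat = data.frame(resp = c(1.6, 0.3, 3, 0.1, 3.2, 0.2, 0.4, 0.4, 2.8, 0.7,
                          3.8, 3, 0.3, 14.3, 1.2, 0.5, 1.1, 4.4, 0.4, 8.4),
                 f1 = factor(rep(c("a", "c"), each = 10)),
                 f2 = factor(rep(c("1", "c"), times = 10)))
fit1 = lm(log(resp) ~ f1 + f2 + f1:f2, data = dat)

# Step 1: estimated marginal means; step 2: comparisons,
# then confint() and as.data.frame() for plotting/saving
emm3 = emmeans(fit1, specs = ~ f1:f2, type = "response")
comps = as.data.frame(confint(contrast(emm3, method = "pairwise", adjust = "none")))
comps # one row per pairwise comparison
```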
&lt;/div&gt;
&lt;div id=&#34;just-the-code-please&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Just the code, please&lt;/h1&gt;
&lt;p&gt;Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code &lt;a href=&#34;https://aosmith.rbind.io/script/2019-03-25-getting-started-with-emmeans.R&#34;&gt;from here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(emmeans) # v. 1.7.0
library(magrittr) # v. 2.0.1

dat = data.frame(resp = c(1.6,0.3,3,0.1,3.2,0.2,0.4,0.4,2.8,
                          0.7,3.8,3,0.3,14.3,1.2,0.5,1.1,4.4,0.4,8.4),
                 f1 = factor(c(&amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,
                               &amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,&amp;quot;a&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;,
                               &amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;c&amp;quot;)),
                 f2 = factor(c(&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,
                               &amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,
                               &amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;,&amp;quot;1&amp;quot;,&amp;quot;c&amp;quot;)))

str(dat)

fit1 = lm(log(resp) ~ f1 + f2 + f1:f2, data = dat)

emm1 = emmeans(fit1, specs = pairwise ~ f1:f2)

emm1$emmeans
emm1$contrasts

emmeans(fit1, specs = pairwise ~ f1:f2, type = &amp;quot;response&amp;quot;)

emm1.1 = emmeans(fit1, specs = pairwise ~ f1:f2, type = &amp;quot;response&amp;quot;, adjust = &amp;quot;none&amp;quot;)
emm1.1

emm1.1$contrasts %&amp;gt;%
     confint()

emm1.1$contrasts %&amp;gt;%
     summary(infer = TRUE)

emm1.1$emmeans %&amp;gt;%
     as.data.frame()

emm2 = emmeans(fit1, specs = pairwise ~ f1|f2, type = &amp;quot;response&amp;quot;)
emm2

emm2$contrasts %&amp;gt;%
     rbind() 

emmeans(fit1, specs = pairwise ~ f1)

emmeans(fit1, specs = trt.vs.ctrl ~ f1:f2)

emmeans(fit1, specs = trt.vs.ctrlk ~ f1:f2)

emmeans(fit1, specs = trt.vs.ctrlk ~ f1:f2, ref = 2)

emmeans(fit1, specs = trt.vs.ctrlk ~ f1:f2, ref = 2, reverse = TRUE)

emm3 = emmeans(fit1, specs = ~ f1:f2, type = &amp;quot;response&amp;quot;)
emm3

contrast(emm3, method = &amp;quot;pairwise&amp;quot;, adjust = &amp;quot;none&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Lots of zeros or too many zeros?: Thinking about zero inflation in count data</title>
      <link>https://aosmith.rbind.io/2019/03/06/lots-of-zeros/</link>
      <pubDate>Wed, 06 Mar 2019 00:00:00 +0000</pubDate>
      
      <guid>https://aosmith.rbind.io/2019/03/06/lots-of-zeros/</guid>
      <description>


&lt;p&gt;In a recent lecture I gave a basic overview of zero-inflation in count distributions. My main take-home message to the students that I thought worth posting about here is that having a lot of zero values does not necessarily mean you have zero inflation.&lt;/p&gt;
&lt;p&gt;Zero inflation is when there are more 0 values in the data than the distribution allows for. But some distributions can have a lot of zeros!&lt;/p&gt;
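As a quick sketch of this point, the density functions show how different the chance of a zero can be for two distributions with the same mean. The mean of 10 and theta of 0.05 here match the simulation later in this post.

```r
# Two distributions, both with mean 10, but very different chances of a 0
dpois(0, lambda = 10)            # Poisson: about 0.000045, zeros essentially never
dnbinom(0, mu = 10, size = 0.05) # negative binomial, theta = 0.05: about 0.77
```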
&lt;div id=&#34;table-of-contents&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#load-packages-and-dataset&#34;&gt;Load packages and dataset&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#negative-binomial-with-many-zeros&#34;&gt;Negative binomial with many zeros&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#generalized-poisson-with-many-zeros&#34;&gt;Generalized Poisson with many zeros&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#lots-of-zeros-or-excess-zeros&#34;&gt;Lots of zeros or excess zeros?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#simulate-negative-binomial-data&#34;&gt;Simulate negative binomial data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#checking-for-excess-zeros&#34;&gt;Checking for excess zeros&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#an-example-with-excess-zeros&#34;&gt;An example with excess zeros&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#just-the-code-please&#34;&gt;Just the code, please&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;load-packages-and-dataset&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Load packages and dataset&lt;/h1&gt;
&lt;p&gt;I’m going to be simulating counts from different distributions to demonstrate this. First I’ll load the packages I’m using today.&lt;/p&gt;
&lt;p&gt;Package &lt;strong&gt;HMMpa&lt;/strong&gt; is for a function to draw random samples from the generalized Poisson distribution.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2) # v. 3.1.0
library(HMMpa) # v. 1.0.1
library(MASS) # v. 7.3-51.1&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;negative-binomial-with-many-zeros&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Negative binomial with many zeros&lt;/h1&gt;
&lt;p&gt;First I’ll draw 200 counts from a negative binomial with a mean (&lt;span class=&#34;math inline&#34;&gt;\(\lambda\)&lt;/span&gt;) of &lt;span class=&#34;math inline&#34;&gt;\(10\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\theta = 0.05\)&lt;/span&gt;.&lt;br /&gt;
R uses the parameterization of the negative binomial where the variance of the distribution is &lt;span class=&#34;math inline&#34;&gt;\(\lambda + (\lambda^2/\theta)\)&lt;/span&gt;. In this parameterization, as &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; gets small the variance gets big. Using a very small value of theta, as I am here, generally means the distribution of counts will have many zeros as well as a few large counts.&lt;/p&gt;
&lt;p&gt;I pull a random sample of size 200 from this distribution using &lt;code&gt;rnbinom()&lt;/code&gt;. The &lt;code&gt;mu&lt;/code&gt; argument is the mean and the &lt;code&gt;size&lt;/code&gt; argument is theta.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(16)
dat = data.frame(Y = rnbinom(200, mu = 10, size = .05) )&lt;/code&gt;&lt;/pre&gt;
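The variance parameterization above can also be checked directly by simulation. This is an illustrative sketch; the sample size of one million is an arbitrary choice to make the empirical moments reasonably stable.

```r
# Empirical check of Var = mu + mu^2/theta for rnbinom()
set.seed(16)
sim = rnbinom(1e6, mu = 10, size = 0.05)
mean(sim) # near the true mean of 10
var(sim)  # near 10 + 10^2/0.05 = 2010
```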
&lt;p&gt;Below is a histogram of these data. I’ve annotated the plot with the proportion of the 200 values that are 0 as well as the maximum observed count in the dataset. There are lots of zeros! But these data are not zero-inflated because we expect to have many 0 values under this particular distribution.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(dat, aes(x = Y) ) +
    geom_histogram(binwidth = 5)  +
    theme_bw(base_size = 18) +
    labs(y = &amp;quot;Frequency&amp;quot;,
         title = &amp;quot;Negative binomial&amp;quot;,
         subtitle = &amp;quot;mean = 10, theta = 0.05&amp;quot; ) +
    annotate(geom = &amp;quot;text&amp;quot;,
            label = paste(&amp;quot;Proportion 0:&amp;quot;, mean(dat$Y == 0), 
                        &amp;quot;\nMax Count:&amp;quot;, max(dat$Y) ),
                        x = 150, y = 100, size = 8)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-03-06-lots-of-zeros_files/figure-html/unnamed-chunk-3-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;generalized-poisson-with-many-zeros&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Generalized Poisson with many zeros&lt;/h1&gt;
&lt;p&gt;I don’t know the generalized Poisson distribution well, although it appears to be regularly used in some fields. For whatever reason, the negative binomial seems much more common in ecology. 🤷&lt;/p&gt;
&lt;p&gt;From my understanding, the generalized Poisson distribution can have heavier tails than the negative binomial. This would mean that it can have more extreme maximum counts as well as lots of zeros.&lt;/p&gt;
&lt;p&gt;See the documentation for &lt;code&gt;rgenpois()&lt;/code&gt; for the formula for the density of the generalized Poisson and definitions of mean and variance. Note that when &lt;code&gt;lambda2&lt;/code&gt; is 0, the generalized Poisson reduces to the Poisson.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(16)
dat = data.frame(Y = rgenpois(200, lambda1 = 0.5, lambda2 = 0.95) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Below is a histogram of these data. Just over 50% of the values are zeros but the maximum count is over 1000! 💥&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(dat, aes(x = Y) ) +
    geom_histogram(binwidth = 5)  +
    theme_bw(base_size = 18) +
    labs(y = &amp;quot;Frequency&amp;quot;,
         title = &amp;quot;Generalized Poisson&amp;quot;,
         subtitle = &amp;quot;lambda1 = 0.5, lambda2 = 0.95&amp;quot;) +
    annotate(geom = &amp;quot;text&amp;quot;,
            label = paste(&amp;quot;Proportion 0:&amp;quot;, mean(dat$Y == 0), 
                        &amp;quot;\nMax Count:&amp;quot;, max(dat$Y) ),
                        x = 600, y = 100, size = 8)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://aosmith.rbind.io/post/2019-03-06-lots-of-zeros_files/figure-html/unnamed-chunk-5-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;lots-of-zeros-or-excess-zeros&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Lots of zeros or excess zeros?&lt;/h1&gt;
&lt;p&gt;All the simulations above show is that some distributions &lt;em&gt;can&lt;/em&gt; have a lot of zeros. In any given scenario, though, how do we check if we have &lt;em&gt;excess&lt;/em&gt; zeros? Having excess zeros means there are more zeros than expected by the distribution we are using for modeling. If we have excess zeros, then we may need either a different distribution to model the data or a model that specifically addresses zero inflation.&lt;/p&gt;
&lt;p&gt;The key to checking for excess zeros is to estimate the number of zeros you would expect to see if the fitted model were truly the model that created your data and compare that to the number of zeros in the actual data. If there are many more zeros in the data than the model allows for then you have zero inflation compared to whatever distribution you are using.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;simulate-negative-binomial-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Simulate negative binomial data&lt;/h1&gt;
&lt;p&gt;I’ll now simulate data based on a negative binomial model with a single, continuous explanatory variable. I’ll use a model fit to these data to show how to check for excess zeros.&lt;/p&gt;
&lt;p&gt;Since this is a generalized linear model, I first calculate the means based on the linear predictor. The exponentiation is due to using the natural log link to &lt;em&gt;link&lt;/em&gt; the mean to the linear predictor.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(16)
x = runif(200, 5, 10) # simulate explanatory variable
b0 = 1 # set value of intercept
b1 = 0.25 # set value of slope
means = exp(b0 + b1*x) # calculate true means
theta = 0.25 # true theta&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I can use these true means along with my chosen value of &lt;code&gt;theta&lt;/code&gt; to simulate data from the negative binomial distribution.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;y = rnbinom(200, mu = means, size = theta)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now that I’ve made some data I can fit a model. Since I’m using a negative binomial GLM with &lt;code&gt;x&lt;/code&gt; as the explanatory variable, which is how I created the data, this model should work well. The &lt;code&gt;glm.nb()&lt;/code&gt; function is from package &lt;strong&gt;MASS&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit1 = glm.nb(y ~ x)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this exercise I’m going to go directly to checking for excess zeros. This means I’m skipping other important checks of model fit, such as checks for overdispersion and examining residual plots. Don’t skip these in a real analysis; having excess zeros certainly isn’t the only problem we can run into with count data.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;checking-for-excess-zeros&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Checking for excess zeros&lt;/h1&gt;
&lt;p&gt;The observed data has 76 zeros (out of 200).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sum(y == 0)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 76&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;How many zeros are expected given the model? I need the model estimated means and theta to answer this question. I can get the means via &lt;code&gt;predict()&lt;/code&gt; and I can pull &lt;code&gt;theta&lt;/code&gt; out of the model &lt;code&gt;summary()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;preds = predict(fit1, type = &amp;quot;response&amp;quot;) # estimated means
esttheta = summary(fit1)$theta # estimated theta&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For discrete distributions like the negative binomial, the &lt;em&gt;density&lt;/em&gt; distribution functions (which start with the letter “d”) return the probability that the observation is equal to a given value. This means I can use &lt;code&gt;dnbinom()&lt;/code&gt; to calculate the probability of an observation being 0 for every row in the dataset. To do this I need to provide values for the parameters of the distribution of each observation.&lt;/p&gt;
&lt;p&gt;Based on the model, the distribution of each observation is negative binomial with the mean estimated from the model and the overall estimated theta.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;prop0 = dnbinom(x = 0, mu = preds, size = esttheta )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The sum of these probabilities is an estimate of the number of zero values expected by the model (see &lt;a href=&#34;https://data.library.virginia.edu/getting-started-with-hurdle-models/&#34;&gt;here&lt;/a&gt; for another example). I’ll round this to the nearest integer.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;round( sum(prop0) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 72&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The expected number of 0 values is ~72, very close to the 76 observed in the data. This is no big surprise, since I fit the same model that I used to create the data.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;an-example-with-excess-zeros&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;An example with excess zeros&lt;/h1&gt;
&lt;p&gt;The example above demonstrates a model without excess zeros. Let me finish by fitting a model to data that has more zeros than expected by the distribution. This can be done by fitting a Poisson GLM instead of a negative binomial GLM to my simulated data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit2 = glm(y ~ x, family = poisson)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Remember the data contain 76 zeros.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sum(y == 0)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 76&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using &lt;code&gt;dpois()&lt;/code&gt;, the number of zeros given by the Poisson model is 0. 😮 These data are zero-inflated compared to the Poisson distribution, and I clearly need a different approach for modeling these data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;round( sum( dpois(x = 0,
           lambda = predict(fit2, type = &amp;quot;response&amp;quot;) ) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# [1] 0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This brings me back to my earlier point about checking model fit. If I had done other standard checks of model fit for &lt;code&gt;fit2&lt;/code&gt; I would have seen additional problems that would indicate the Poisson distribution did not fit these data (such as severe overdispersion).&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;just-the-code-please&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Just the code, please&lt;/h1&gt;
&lt;p&gt;Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code &lt;a href=&#34;https://aosmith.rbind.io/script/2019-03-06-lots-of-zeros.R&#34;&gt;from here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2) # v. 3.1.0
library(HMMpa) # v. 1.0.1
library(MASS) # v. 7.3-51.1

set.seed(16)
dat = data.frame(Y = rnbinom(200, mu = 10, size = .05) )

ggplot(dat, aes(x = Y) ) +
    geom_histogram(binwidth = 5)  +
    theme_bw(base_size = 18) +
    labs(y = &amp;quot;Frequency&amp;quot;,
         title = &amp;quot;Negative binomial&amp;quot;,
         subtitle = &amp;quot;mean = 10, theta = 0.05&amp;quot; ) +
    annotate(geom = &amp;quot;text&amp;quot;,
            label = paste(&amp;quot;Proportion 0:&amp;quot;, mean(dat$Y == 0), 
                        &amp;quot;\nMax Count:&amp;quot;, max(dat$Y) ),
                        x = 150, y = 100, size = 8)

set.seed(16)
dat = data.frame(Y = rgenpois(200, lambda1 = 0.5, lambda2 = 0.95) )

ggplot(dat, aes(x = Y) ) +
    geom_histogram(binwidth = 5)  +
    theme_bw(base_size = 18) +
    labs(y = &amp;quot;Frequency&amp;quot;,
         title = &amp;quot;Generalized Poisson&amp;quot;,
         subtitle = &amp;quot;lambda1 = 0.5, lambda2 = 0.95&amp;quot;) +
    annotate(geom = &amp;quot;text&amp;quot;,
            label = paste(&amp;quot;Proportion 0:&amp;quot;, mean(dat$Y == 0), 
                        &amp;quot;\nMax Count:&amp;quot;, max(dat$Y) ),
                        x = 600, y = 100, size = 8)

set.seed(16)
x = runif(200, 5, 10) # simulate explanatory variable
b0 = 1 # set value of intercept
b1 = 0.25 # set value of slope
means = exp(b0 + b1*x) # calculate true means
theta = 0.25 # true theta
y = rnbinom(200, mu = means, size = theta)

fit1 = glm.nb(y ~ x)

sum(y == 0)

preds = predict(fit1, type = &amp;quot;response&amp;quot;) # estimated means
esttheta = summary(fit1)$theta # estimated theta

prop0 = dnbinom(x = 0, mu = preds, size = esttheta )
round( sum(prop0) )

fit2 = glm(y ~ x, family = poisson)
sum(y == 0)

round( sum( dpois(x = 0,
           lambda = predict(fit2, type = &amp;quot;response&amp;quot;) ) ) )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>
