Very statisticious
https://aosmith.rbind.io/index.xml
Recent content on Very statisticious
Wed, 19 Sep 2018 00:00:00 +0000
The log-0 problem: analysis strategies and options for choosing c in log(y + c)
https://aosmith.rbind.io/2018/09/19/the-log-0-problem/
Wed, 19 Sep 2018 00:00:00 +0000
<p>I periodically find myself having long conversations with consultees about 0’s. Why? Well, the basic suite of statistical tools many of us learn involves the normal distribution (for the errors) and the log transformation tends to feature prominently for working with right-skewed data. Since <code>log(0)</code> returns <code>-Inf</code>, a common first reaction I see is to use <code>log(y + c)</code> as the response in place of <code>log(y)</code>, where <code>c</code> is some constant added to the y variable to get rid of the 0 values.</p>
<p>This isn’t necessarily an incorrect thing to do. However, I think it is important to step back and think about the data and the 0 values more before forging ahead with adding a constant to the data.</p>
<p>Some of the resources I’ve used over the years to hone my thinking on this topic are <a href="https://stat.ethz.ch/pipermail/r-sig-ecology/2009-June/000676.html">this thread</a> on the R-sig-eco mailing list, <a href="https://stats.stackexchange.com/questions/1444/how-should-i-transform-non-negative-data-including-zeros">this Cross Validated question and answers</a>, and <a href="https://robjhyndman.com/hyndsight/transformations/">this blogpost on the Hyndsight blog</a>.</p>
<div id="thinking-about-0-values" class="section level1">
<h1><a href="#thinking-about-0-values">Thinking about 0 values</a></h1>
<p>Without getting into too much detail, below are some of the things I consider when I have 0 as well as positive values in a response variable.</p>
<div id="discrete-data" class="section level2">
<h2>Discrete data</h2>
<p>We have specific tools for working with discrete data. These distributions allow 0 values, so we can potentially avoid the issue altogether.</p>
<p>Questions to ask yourself:</p>
<div id="are-the-data-discrete-counts" class="section level3">
<h3>Are the data discrete counts?</h3>
<p>If so, start out considering a discrete distribution like the negative binomial or Poisson instead of the normal distribution. If the count is for different unit areas or sampling times, you can use that “effort” variable as an offset.</p>
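<p>As a rough sketch of what an offset looks like in practice, here is a Poisson model fit with <code>glm()</code> to made-up counts. The variable names (<code>effort</code>, <code>count</code>) and the parameter values are assumptions chosen for illustration, not values from a real dataset.</p>
<pre class="r"><code>set.seed(16)
n = 30
x = runif(n, min = 0, max = 1)
# Hypothetical "effort": e.g., the area sampled for each unit
effort = runif(n, min = 1, max = 5)
# Counts whose mean scales with effort
count = rpois(n, lambda = effort*exp(0.5 + 1*x))
# log(effort) enters as an offset, i.e., a term with its coefficient fixed at 1
fit_pois = glm(count ~ x + offset(log(effort)), family = poisson)
coef(fit_pois)</code></pre>
<p>With the offset in the model, the coefficients describe the count per unit of effort rather than the raw count.</p>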
</div>
<div id="are-the-data-proportions" class="section level3">
<h3>Are the data proportions?</h3>
<p>If the data are counted proportions, made by dividing a count by some total count, start with binomial-based models.</p>
<p>Models for discrete data can be extended as needed for a variety of situations (excessive 0 values, overdispersion, etc.). Sometimes things get too complicated and we may go back to normal-based tools but I would say that is the exception, not the rule.</p>
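<p>Here is a minimal sketch of a binomial-based model for counted proportions, using simulated data. The names (<code>total</code>, <code>successes</code>) and the parameter values are made up for illustration. The response is given to <code>glm()</code> as a two-column matrix of successes and failures instead of as the proportion itself.</p>
<pre class="r"><code>set.seed(16)
n = 30
x = runif(n, min = 0, max = 1)
# Hypothetical total count per unit
total = rep(20, times = n)
# Number of "successes" out of each total
successes = rbinom(n, size = total, prob = plogis(-1 + 2*x))
# Proportions of exactly 0 or 1 cause no trouble here
fit_binom = glm(cbind(successes, total - successes) ~ x, family = binomial)
coef(fit_binom)</code></pre>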
</div>
</div>
<div id="continuous-data" class="section level2">
<h2>Continuous data</h2>
<p>Positive, continuous data and 0 values are where I find things start to get sticky. The standard distributions that we have available to model positive, right-skewed data (log-normal, Gamma) can’t contain 0. (Note that when we do the transformation log(y) and then use normal-based statistical tools we are working with the <em>log-normal</em> distribution.)</p>
<div id="are-the-0-values-true-0s" class="section level3">
<h3>Are the 0 values “true” 0’s?</h3>
<p>Were the values unquestionably 0 (no ifs, ands, or buts!) or do you consider the 0’s to really represent some very small value?</p>
<p>This is important to consider, and the answer may affect how you proceed with the analysis and whether or not you think adding something and transforming is reasonable. There is a nice discussion of different kinds of 0’s in section 11.3.1 in <a href="http://highstat.com/index.php/mixed-effects-models-and-extensions-in-ecology-with-r">Zuur et al. 2009</a>.</p>
</div>
<div id="what-proportion-of-the-data-are-0" class="section level3">
<h3>What proportion of the data are 0?</h3>
<p>If you have relatively few 0 values they won’t have a large influence on your inference and it may be easier to justify adding a constant to remove them.</p>
</div>
<div id="are-the-0-values-caused-by-censoring" class="section level3">
<h3>Are the 0 values caused by censoring?</h3>
<p>Censoring can occur when our measuring tool has a lower limit of detection and every measurement lower than that limit is assigned a value of 0. This is not uncommon when measuring stream chemistry, for example. There are specific models for censored data, like Tobit models.</p>
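<p>One way to fit a model for left-censored data in R is <code>survreg()</code> from package <strong>survival</strong>. The sketch below is only an illustration under assumptions: the detection limit <code>lod</code> and the simulated data are made up, and I record censored values at the limit rather than at 0.</p>
<pre class="r"><code>library(survival) # a recommended package that ships with R
set.seed(16)
n = 50
x = runif(n, min = -7, max = 0)
true_y = exp(0 + 0.75*x + rnorm(n, mean = 0, sd = 2))
lod = 0.05 # hypothetical lower limit of detection
detected = true_y >= lod
# Anything below the limit is recorded at the limit and flagged as censored
y_cens = pmax(true_y, lod)
# type = "left" marks observations with detected == FALSE as left-censored
fit_cens = survreg(Surv(y_cens, detected, type = "left") ~ x,
    dist = "lognormal")
coef(fit_cens)</code></pre>
<p>The censored observations contribute the probability of being below the limit to the likelihood instead of being treated as known values.</p>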
</div>
<div id="are-the-data-continuous-proportions" class="section level3">
<h3>Are the data continuous proportions?</h3>
<p>The beta distribution can be used to model continuous proportions. However, the support of the beta distribution contains neither 0’s nor 1’s. Sheesh! You would need to either work with a zero-inflated/one-inflated beta or remap the variable to get rid of 0’s and 1’s.</p>
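<p>As an example of remapping, one commonly cited option (often attributed to Smithson and Verkuilen 2006) squeezes the proportions slightly toward 0.5 based on the sample size. This is a sketch with made-up proportions, not a recommendation for any particular dataset.</p>
<pre class="r"><code># Made-up proportions that include an exact 0 and an exact 1
p = c(0, 0.2, 0.5, 1)
n = length(p)
# Shrink toward 0.5 so no value is exactly 0 or 1
p_remap = (p*(n - 1) + 0.5)/n
p_remap
range(p_remap)</code></pre>
<p>The amount of shrinkage gets smaller as the sample size grows, so with a realistic number of observations the remapped values stay very close to the originals.</p>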
</div>
<div id="do-you-have-a-point-mass-at-0-along-with-positive-continuous-values" class="section level3">
<h3>Do you have a point mass at 0 along with positive, continuous values?</h3>
<p>The <a href="https://en.wikipedia.org/wiki/Tweedie_distribution">Tweedie distribution</a> is one relatively new option. I’ve seen this work well for % plant cover measurements, which can often be a bear to work with since the measurements can go from 0 to above 100% and often contain many 0 values.</p>
<p>Another option I’ve used for “point mass plus positive” data is to think about these as two separate problems that answer different questions. One analysis can answer a question about the probability of presence and a second can be used to model the positive data only. This is a type of <em>hurdle</em> model, which I’ve seen more generally referred to as a mixture model.</p>
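<p>The two-part idea can be sketched as two separate model fits. Everything here is simulated, and the parameter values are arbitrary assumptions: one model for presence/absence and one log-normal model for the positive values only.</p>
<pre class="r"><code>set.seed(16)
n = 100
x = runif(n, min = 0, max = 1)
# Part 1 of the simulation: is the value positive at all?
present = rbinom(n, size = 1, prob = plogis(-0.5 + 2*x))
# Part 2: positive, right-skewed values where present, 0 otherwise
y = present*exp(rnorm(n, mean = 1 + 0.5*x, sd = 1))
dat = data.frame(x, y, present = as.numeric(y > 0))
# Question 1: what affects the probability of presence?
fit_presence = glm(present ~ x, family = binomial, data = dat)
# Question 2: what affects the size of y, given that it is positive?
fit_positive = lm(log(y) ~ x, data = dat[dat$y > 0, ])
coef(fit_presence)
coef(fit_positive)</code></pre>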
</div>
</div>
</div>
<div id="common-choices-of-c" class="section level1">
<h1><a href="#common-choices-of-c">Common choices of c</a></h1>
<p>It may be that after all that hard thinking you end up on the option of adding a constant to shift your distribution away from 0 so you can proceed with a log transformation. I find this is most likely for cases where I consider the 0 values to be some minimal value (not true 0’s) and I have relatively few of them.</p>
<p>So what should this constant, <code>c</code>, be?</p>
<p>This choice isn’t minor, as the value you choose can change your results when your goal is estimation. The more 0 values you have the more likely the choice of <code>c</code> matters.</p>
<p>Some options:</p>
<ul>
<li><p>Add 1. I think folks find this attractive because log(1) = 0. However, whether or not 1 is reasonable can depend on the distribution of your data.</p></li>
<li><p>Add half the minimum non-0 value. This is what I was taught to do in my statistics program, and I’ve heard of others with the same experience. As far as I know this is an unpublished recommendation.</p></li>
<li><p>Add the square of the first quartile divided by the third quartile. This recommendation reportedly comes from <a href="https://stat.ethz.ch/~stahel/stat-dat-ana/">Stahel 2008</a>; I have never verified this since I unfortunately can’t read German. This approach clearly is relevant only if the first quartile is greater than 0, so you must have fewer than 25% 0 values to use this option.</p></li>
</ul>
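<p>To make the three options concrete, here is how each constant could be calculated for a small made-up vector. The data values are invented for illustration; the object names are my own.</p>
<pre class="r"><code># Made-up response with 2 of 12 values equal to 0
y = c(0, 0, 0.01, 0.02, 0.05, 0.1, 0.4, 1.2, 3, 10, 50, 120)
# Option 1: add 1
c_one = 1
# Option 2: half the minimum non-0 value
c_halfmin = min(y[y > 0])/2
# Option 3: squared first quartile divided by the third quartile
c_quart = unname(quantile(y, .25)^2/quantile(y, .75))
c(c_one, c_halfmin, c_quart)</code></pre>
<p>Note how different the constants are from each other when most of the data are well below 1.</p>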
</div>
<div id="load-r-packages" class="section level1">
<h1>Load R packages</h1>
<p>Let’s explore these different options for one particular scenario via simulation. Here are the R packages I’m using today.</p>
<pre class="r"><code>library(purrr) # v. 0.2.5
library(dplyr) # v. 0.7.6
library(broom) # v. 0.5.0
library(ggplot2) # v. 3.0.0
library(ggridges) # v. 0.5.0</code></pre>
</div>
<div id="generate-log-normal-data-with-0-values" class="section level1">
<h1><a href="#generate-log-normal-data-with-0-values">Generate log-normal data with 0 values</a></h1>
<p>Since the log-normal distribution by definition has no 0 values, I found it wasn’t easy to simulate such data. When I was working on this several years ago I decided that one way to force 0 values was through rounding. This is not the most elegant solution, I’m sure, but I think it works to show how the value of <code>c</code> can influence estimates.</p>
<p>I managed to do this by generating primarily negative data on the log scale and then exponentiating to the data scale. I make my <code>x</code> variable negative, and positively related to <code>y</code>.</p>
<p>The range of the <code>y</code> variable ends up with many small values but can have some fairly large values, as well.</p>
<p>I’ll set the seed, set the values of the parameters (intercept and slope), and generate my <code>x</code>. I’m using a sample size of 50.</p>
<pre class="r"><code>set.seed(16)
beta0 = 0 # intercept
beta1 = .75 # slope
# Sample size
n = 50
# Explanatory variable
x = runif(n, min = -7, max = 0)</code></pre>
<p>The response variable is calculated from the parameters and the normally distributed random errors, after exponentiating things to the scale of the data. These data can be used in a log-normal model via a log transformation.</p>
<p>You can see the vast majority of the data are below 1, but none are exactly 0.</p>
<pre class="r"><code>true_y = exp(beta0 + beta1*x + rnorm(n, mean = 0, sd = 2))
summary(true_y)</code></pre>
<pre><code># Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.0013 0.0146 0.0862 8.6561 0.4180 377.3488</code></pre>
<p>I force 0 values into the dataset by rounding the real data, <code>true_y</code>, to two decimal places.</p>
<pre class="r"><code>y = round(true_y, digits = 2)
summary(y)</code></pre>
<pre><code># Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.000 0.010 0.085 8.656 0.420 377.350</code></pre>
<p>I fooled around a lot with the parameter values, <code>x</code> variables, and the residual errors to get 0’s in the rounded <code>y</code> most of the time. I wanted some 0 values but not too many, since the quartile method can only be used when fewer than 25% of the values are 0.</p>
<p>I checked for this by repeating the process above a number of times during testing. Here’s the number of 0 values for 100 iterations.</p>
<pre class="r"><code>replicate(100, sum( round( exp(beta0 + beta1*runif(n, min = -7, max = 0) +
rnorm(n, mean = 0, sd = 2)), 2) == 0) )</code></pre>
<pre><code># [1] 6 8 8 4 2 4 7 6 4 7 6 10 9 8 7 4 6 10 8 5 8 6 4
# [24] 12 7 10 3 7 7 9 3 6 9 7 4 12 5 11 11 9 4 8 5 6 16 6
# [47] 7 7 7 7 12 10 8 9 7 10 13 6 5 12 9 7 7 6 6 6 7 7 15
# [70] 4 8 10 5 7 9 6 9 9 4 4 8 0 7 4 9 9 8 6 9 6 10 5
# [93] 7 3 9 9 7 9 3 13</code></pre>
</div>
<div id="the-four-models-to-fit" class="section level1">
<h1>The four models to fit</h1>
<p>You can see there is variation in the estimated slope, depending on what value I use for <code>c</code>, in the four different models I fit below.</p>
<p>Here’s the “true” model, fit to the data prior to rounding.</p>
<pre class="r"><code>( true = lm(log(true_y) ~ x) )</code></pre>
<pre><code>#
# Call:
# lm(formula = log(true_y) ~ x)
#
# Coefficients:
# (Intercept) x
# 0.6751 0.9289</code></pre>
<p>Here’s a log(y + 1) model, fit to the rounded data.</p>
<pre class="r"><code>( fit1 = lm(log(y + 1) ~ x) )</code></pre>
<pre><code>#
# Call:
# lm(formula = log(y + 1) ~ x)
#
# Coefficients:
# (Intercept) x
# 1.2269 0.2324</code></pre>
<p>Here’s the colloquial method of adding half the minimum non-0 value.</p>
<pre class="r"><code>( fitc = lm(log(y + min(y[y>0])/2) ~ x) )</code></pre>
<pre><code>#
# Call:
# lm(formula = log(y + min(y[y > 0])/2) ~ x)
#
# Coefficients:
# (Intercept) x
# 0.6079 0.8348</code></pre>
<p>And the quartile method per Stahel 2008.</p>
<pre class="r"><code>( fitq = lm(log(y + quantile(y, .25)^2/quantile(y, .75) ) ~ x) )</code></pre>
<pre><code>#
# Call:
# lm(formula = log(y + quantile(y, 0.25)^2/quantile(y, 0.75)) ~
# x)
#
# Coefficients:
# (Intercept) x
# 0.8641 1.0558</code></pre>
</div>
<div id="a-function-for-fitting-the-models" class="section level1">
<h1>A function for fitting the models</h1>
<p>I decided I want to fit these four models to the same data. To do this I’m going to fit all four models within each call to the function.</p>
<p>The <code>beta0</code> and <code>beta1</code> arguments of the function can technically be changed, but I’m pretty tied into the values I chose for them since I had a difficult time getting enough (but not too many!) 0 values.</p>
<p>I wanted to make sure that I always had at least one 0 in the rounded <code>y</code> data but that no more than 25% of the values were 0, so I included a <code>while()</code> loop that redraws the response until both conditions are met. This is key for using the quartile method, which I wanted to do.</p>
<p>The function returns a list of models. I give each model in the output list a name to help with organization when I start looking at results of many iterations of this.</p>
<pre class="r"><code>logy_0 = function(beta0 = 0, beta1 = .75, n) {
    x = runif(n, -7, 0) # create expl var between -7 and 0
    true_y = exp(beta0 + beta1*x + rnorm(n, 0, 2))
    y = round(true_y, 2)
    # Redraw until there is at least one 0 but no more than 25% 0 values
    while( sum(y == 0 ) == 0 | sum(y == 0) > n/4) {
        true_y = exp(beta0 + beta1*x + rnorm(n, 0, 2))
        y = round(true_y, 2)
    }
    true = lm(log(true_y) ~ x)
    fit1 = lm(log(y + 1) ~ x)
    fitc = lm(log(y + min(y[y>0])/2) ~ x)
    fitq = lm(log(y + quantile(y, .25)^2/quantile(y, .75) ) ~ x)
    setNames(list(true, fit1, fitc, fitq),
        c("True model", "Add 1", "Add 1/2 minimum > 0", "Quartile method") )
}</code></pre>
<p>Do I get the same values back as my manual work if I reset the seed? Yes! 🙌</p>
<pre class="r"><code>set.seed(16)
logy_0(n = 50)</code></pre>
<pre><code># $`True model`
#
# Call:
# lm(formula = log(true_y) ~ x)
#
# Coefficients:
# (Intercept) x
# 0.6751 0.9289
#
#
# $`Add 1`
#
# Call:
# lm(formula = log(y + 1) ~ x)
#
# Coefficients:
# (Intercept) x
# 1.2269 0.2324
#
#
# $`Add 1/2 minimum > 0`
#
# Call:
# lm(formula = log(y + min(y[y > 0])/2) ~ x)
#
# Coefficients:
# (Intercept) x
# 0.6079 0.8348
#
#
# $`Quartile method`
#
# Call:
# lm(formula = log(y + quantile(y, 0.25)^2/quantile(y, 0.75)) ~
# x)
#
# Coefficients:
# (Intercept) x
# 0.8641 1.0558</code></pre>
<p>Now I can simulate data and fit these models many times. I decided to do 1000 iterations today, resulting in a list of lists which I store in an object called <code>models</code>.</p>
<pre class="r"><code>models = replicate(1000, logy_0(n = 50), simplify = FALSE)</code></pre>
</div>
<div id="extract-the-results" class="section level1">
<h1>Extract the results</h1>
<p>I can loop through the list of models and extract the output of interest. Today I’m interested in the estimated coefficients, with confidence intervals. I can extract this information using <code>broom::tidy()</code> for <code>lm</code> objects.</p>
<p>I use <code>flatten()</code> to turn <code>models</code> into a single big list instead of a list of lists. I loop via <code>map_dfr()</code>, so my result is a data.frame that contains the coefficients plus the <code>c</code> option used for the model based on the names I set in the function.</p>
<pre class="r"><code>results = map_dfr(flatten(models),
~tidy(.x, conf.int = TRUE),
.id = "model")
head(results)</code></pre>
<pre><code># # A tibble: 6 x 8
# model term estimate std.error statistic p.value conf.low conf.high
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 True mod~ (Inte~ -0.912 0.543 -1.68 9.94e-2 -2.00e+0 0.179
# 2 True mod~ x 0.494 0.141 3.50 1.02e-3 2.10e-1 0.778
# 3 Add 1 (Inte~ 0.543 0.154 3.52 9.63e-4 2.33e-1 0.853
# 4 Add 1 x 0.0801 0.0402 1.99 5.18e-2 -6.46e-4 0.161
# 5 Add 1/2 ~ (Inte~ -0.908 0.461 -1.97 5.48e-2 -1.84e+0 0.0193
# 6 Add 1/2 ~ x 0.427 0.120 3.55 8.63e-4 1.85e-1 0.668</code></pre>
<p>I’m going to focus on only the slopes, so pull those out explicitly.</p>
<pre class="r"><code>results_sl = filter(results, term == "x")</code></pre>
</div>
<div id="compare-the-options-for-c" class="section level1">
<h1>Compare the options for c</h1>
<p>So how does changing the values of c change the results? Does it matter what we add?</p>
<div id="summary-stastics" class="section level2">
<h2>Summary statistics</h2>
<p>First I’ll calculate a few summary statistics. Here is the median estimate of the slope for each <code>c</code> option and the confidence interval coverage (i.e., the proportion of times the confidence interval contained the true value of the slope; for a 95% confidence interval this should be 0.95).</p>
<pre class="r"><code>results_sl %>%
group_by(model) %>%
summarise(med_estimate = median(estimate),
CI_coverage = mean(conf.low < .75 & .75 < conf.high) )</code></pre>
<pre><code># # A tibble: 4 x 3
# model med_estimate CI_coverage
# <chr> <dbl> <dbl>
# 1 Add 1 0.131 0
# 2 Add 1/2 minimum > 0 0.632 0.834
# 3 Quartile method 0.791 0.89
# 4 True model 0.750 0.941</code></pre>
<p>You can see right away that, as we’d expect, the actual model fit to the data prior to rounding does a good job. The confidence interval coverage is close to 0.95 and the median estimate is right at 0.75 (the true value).</p>
<p>The log(y + 1) models performed extremely poorly. The estimated slope is biased extremely low, on average. None of the 1000 models ever had the true slope in the confidence interval. 😮</p>
<p>The other two options performed better. The “Quartile method” gives estimates that are, on average, too high. The “Add 1/2 minimum > 0” option underestimates the slope, on average. The confidence interval coverage is not great, but I suspect this is at least partially due to the artificial reduction in variance due to rounding.</p>
</div>
<div id="graph-the-results" class="section level2">
<h2>Graph the results</h2>
<p>Here is a graph, using package <strong>ggridges</strong> to make ridge plots. This makes it easy to compare the distribution of slope estimates for each <code>c</code> option to the true model (shown at the top in yellow).</p>
<pre class="r"><code>ggplot(results_sl, aes(x = estimate, y = model, fill = model) ) +
geom_density_ridges2(show.legend = FALSE, rel_min_height = 0.005) +
geom_vline(xintercept = .75, size = 1) +
scale_fill_viridis_d() +
theme_bw() +
scale_x_continuous(name = "Estimated slope",
expand = c(0.01, 0),
breaks = seq(0, 1.5, by = .25) ) +
scale_y_discrete(name = NULL, expand = expand_scale(mult = c(.01, .3) ) )</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-09-18-the-log-0-problem_files/figure-html/unnamed-chunk-16-1.png" width="672" /></p>
<p>Again, the estimates from the “Add 1” models are clearly not useful here. The slope estimates are all biased very low.</p>
<p>The “Quartile method” has the widest distribution. It is long-tailed to the right compared to the distribution of slopes from the true model, which is why it can overestimate the slope.</p>
<p>And the “Add 1/2 minimum > 0” option tends to underestimate the slope. I’m curious that the distribution is also relatively narrow, and wonder if that has something to do with the way I simulated the 0 values via rounding.</p>
</div>
</div>
<div id="what-option-is-best" class="section level1">
<h1>What option is best?</h1>
<p>First, one of the main things I discovered is that my method for simulating positive, continuous data plus 0 values is a little kludgy. 😄 For the scenario I did simulate data for, though, adding 1 is clearly a very bad option if you want to get a reasonable estimate of the slope (which is generally my goal when I’m doing an analysis 😉).</p>
<p>Why did adding 1 perform so badly? Remember that when we do a log transformation we increase the space between values less than 1 on the original scale and decrease the spacing between values above 1. These data were primarily below 1, so when we added 1 to the data we totally changed the spacing of the data after log transformation. Adding 1 prior to a log transformation is going to make the most sense if our minimum non-0 value is 1 or higher (at least for estimation).</p>
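<p>A tiny made-up example shows the change in spacing. The values below are evenly spaced on the log scale (each is 10 times the previous one), which is no longer true once 1 is added.</p>
<pre class="r"><code># Made-up values, each 10 times the previous one
y = c(0.01, 0.1, 1, 10, 100)
# Gaps between consecutive values after log(y): all equal to log(10)
diff(log(y))
# Gaps after log(y + 1): the values below 1 are squashed together
diff(log(y + 1))</code></pre>
<p>The gaps among the small values shrink dramatically after adding 1, while the gaps among the large values barely change.</p>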
<p>The other two options both performed OK in this specific scenario. Another avenue I think might be important to explore is how the number of 0 values in the distribution has an impact on the results. The choice between options could depend on that. Also, remember, the “Quartile method” only works at all if fewer than 25% of the data are 0’s.</p>
<p>You may have noticed that statisticians generally dislike anti-conservative methods, so my guess is many (most?) statisticians would choose the conservative option (“Add 1/2 minimum > 0”) to be on the safe side. However, even if this pattern of over- and under-estimation holds more generally in other scenarios, the choice between the options should likely be based on the gravity of missing a relationship due to underestimation compared to the gravity of overstating a relationship rather than on purely statistical concerns.</p>
</div>
Getting started simulating data in R: some helpful functions and how to use them
https://aosmith.rbind.io/2018/08/29/getting-started-simulating-data/
Wed, 29 Aug 2018 00:00:00 +0000
<p>I’ve been trying to participate a little more in the R community outside of my narrow professional world, so when the co-organizer of the <a href="https://www.meetup.com/meetup-group-cwPiAlnB/">Eugene R Users Group</a> invited me to come talk at one of their meet-ups I agreed (even though it involved public speaking! 😱).</p>
<p>I started out thinking I’d talk about doing simulations. But could I do that in 45 minutes? Maybe not. After much pondering I ended up settling on the topic of how we start a simulation: by making data in R.</p>
<p>Possibly out of habit from making written versions of labs and workshops as resources for students, I created a written version of the talk. You can see the <a href="https://github.com/aosmith16/simulation-helper-functions/blob/master/functions_for_simulations_detailed.pdf">full PDF version</a> at <a href="https://github.com/aosmith16/simulation-helper-functions">the GitHub repository</a> I made for the talk materials.</p>
<p>I’ve copied the R markdown code below, as well, for a “blog” version, although this ends up being quite long for a blog post. 😄</p>
<div id="overview" class="section level1">
<h1>Overview</h1>
<p>There are many reasons we might want to simulate data in R, and I find being able to simulate data to be incredibly useful in my day-to-day work. But how does someone get started simulating data?</p>
<p>Today I’m going to take a closer look at some of the R functions that are useful to get to know when simulating data. These functions are all from base R packages, not add-on packages, so some of them may already be familiar to you.</p>
<p>Here’s what we’ll do today:</p>
<ol style="list-style-type: decimal">
<li>Simulate quantitative variables via random number generation with <code>rnorm()</code>, <code>runif()</code> and <code>rpois()</code>.</li>
<li>Generate character variables that represent groups via <code>rep()</code>. We’ll explore how to create character vectors with different repeating patterns.</li>
<li>Create data with both quantitative and categorical variables, making use of functions from the first two steps above.</li>
<li>Learn to use <code>replicate()</code> to repeat the data simulation process many times.</li>
</ol>
</div>
<div id="generating-random-numbers" class="section level1">
<h1>Generating random numbers</h1>
<p>An easy way to generate numeric data is to pull random numbers from some distribution. This can be done via the functions for generating random deviates. These functions always start with <code>r</code> (for “random”).</p>
<p>The basic distributions that I use the most for generating random numbers are the normal (<code>rnorm()</code>) and uniform (<code>runif()</code>) distributions. We’ll look at those today, plus the Poisson (<code>rpois()</code>) distribution for generating discrete counts.</p>
<p>There are many other distributions available as part of the <strong>stats</strong> package (e.g., binomial, F, log normal, beta, exponential, Gamma) and, as you can imagine, even more available in add-on packages. I recently needed to generate data from the Tweedie distribution to test a modeling tool, which I could do via package <strong>tweedie</strong>.</p>
<p>The <code>r</code> functions for a chosen distribution all work basically the same way. We define how many random numbers we want to generate in the first argument (<code>n</code>) and then define the parameters for the distribution we want to draw from. This is easier to see with practice, so let’s get started.</p>
<div id="rnorm-to-generate-random-numbers-from-the-normal-distribution" class="section level2">
<h2>rnorm() to generate random numbers from the normal distribution</h2>
<p>I use <code>rnorm()</code> a lot, sometimes with good reason and other times when I need some numbers and I really don’t care too much about what they are. 😜</p>
<p>There are three arguments to <code>rnorm()</code>. From the <code>Usage</code> section of the documentation:</p>
<blockquote>
<p>rnorm(n, mean = 0, sd = 1)</p>
</blockquote>
<p>The <code>n</code> argument is the number of observations we want to generate. The <code>mean</code> and <code>sd</code> arguments show what the default values of the parameters are (note that <code>sd</code> is the <em>standard deviation</em>, not the variance). Not all <code>r</code> functions have defaults for the parameter arguments like this.</p>
<p>To get 5 random numbers from a <span class="math inline">\(Normal(0, 1)\)</span> (aka the <em>standard</em> normal) distribution we can write code like:</p>
<pre class="r"><code>rnorm(5)</code></pre>
<pre><code>[1] -1.6067071 -1.5068818 0.2098584 -0.9313993 -1.0283762</code></pre>
<p>There are a couple things about this code and the output to discuss.</p>
<p>First, the code did get me 5 numbers, which is what I wanted. However, the code itself isn’t particularly clear. What I might refer to as lazy coding on my part can look pretty mysterious to someone reading my code (or to my future self reading my code). Since I used the default values for <code>mean</code> and <code>sd</code>, it’s not clear exactly what distribution I drew the numbers from.</p>
<div id="writing-out-arguments-in-for-clearer-code" class="section level3">
<h3>Writing out arguments for clearer code</h3>
<p>Here’s clearer code to do the same thing, where I write out the mean and standard deviation arguments explicitly even though I’m using the default values. It is certainly not necessary to always be this careful, but I don’t think I’ve run into a situation where it was bad to have clear code.</p>
<pre class="r"><code>rnorm(n = 5, mean = 0, sd = 1)</code></pre>
<pre><code>[1] -0.1728609 -0.2055838 -0.1123403 0.5365955 1.1206571</code></pre>
</div>
<div id="setting-the-random-seed-for-reproducible-random-numbers" class="section level3">
<h3>Setting the random seed for reproducible random numbers</h3>
<p>Second, if we run this code again we’ll get different numbers. To get reproducible random numbers we need to <em>set the seed</em> via <code>set.seed()</code>.</p>
<p>Making sure someone else will be able to exactly reproduce your results when running the same code can be desirable in teaching. It is also useful when making an example dataset to demonstrate a coding issue, like if you were asking a code question on Stack Overflow.</p>
<p>You’ll also see me set the seed when I’m making a function for a simulation and I want to make sure it works correctly. Otherwise in most simulations we don’t actually want or need to set the seed.</p>
<p>If we set the seed prior to running <code>rnorm()</code>, we can reproduce the values we generate.</p>
<pre class="r"><code>set.seed(16)
rnorm(n = 5, mean = 0, sd = 1)</code></pre>
<pre><code>[1] 0.4764134 -0.1253800 1.0962162 -1.4442290 1.1478293</code></pre>
<p>If we set the seed back to the same number and run the code again, we get the same values.</p>
<pre class="r"><code>set.seed(16)
rnorm(n = 5, mean = 0, sd = 1)</code></pre>
<pre><code>[1] 0.4764134 -0.1253800 1.0962162 -1.4442290 1.1478293</code></pre>
</div>
<div id="change-parameters-in-rnorm" class="section level3">
<h3>Change parameters in rnorm()</h3>
<p>For getting a quick set of numbers it’s easy to use the default parameter values in <code>rnorm()</code> but we can certainly change the values to something else. For example, when I’m exploring long-run behavior of variance estimated from linear models I will want to vary the standard deviation values.</p>
<p>The <code>sd</code> argument shows the <em>standard deviation</em> of the normal distribution. So drawing from a <span class="math inline">\(Normal(0, 4)\)</span> can be done by setting <code>sd</code> to 2. (I repeat this info because I find it confusing sometimes. 😁)</p>
<pre class="r"><code>rnorm(n = 5, mean = 0, sd = 2)</code></pre>
<pre><code>[1] -0.9368241 -2.0119012 0.1271254 2.0499452 1.1462840</code></pre>
<p>I’ve seen others change the mean and standard deviation to create a variable that is within some specific range, as well. For example, if the mean is large and the standard deviation small in relation to the mean we can generate numbers that are almost always positive (a normal distribution never completely rules out negative values). (I usually use <code>runif()</code> for this, which we’ll see below.)</p>
<pre class="r"><code>rnorm(n = 5, mean = 50, sd = 20)</code></pre>
<pre><code>[1] 86.94364 52.23867 35.07925 83.16427 64.43441</code></pre>
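<p>A normal distribution always puts some probability below 0, so it can be worth checking how much. As a quick sketch, <code>pnorm()</code> gives the probability of a draw below 0 for the mean and standard deviation used above.</p>
<pre class="r"><code># Probability that a single draw from Normal(mean = 50, sd = 20) is negative
p_neg = pnorm(0, mean = 50, sd = 20)
p_neg
# Probability of at least one negative value in a sample of 1000 draws
1 - (1 - p_neg)^1000</code></pre>
<p>The chance for any single draw is well under 1%, but in a large enough sample a negative value becomes likely.</p>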
</div>
<div id="using-vectors-of-values-for-the-parameter-arguments" class="section level3">
<h3>Using vectors of values for the parameter arguments</h3>
<p>We can pull random numbers from multiple different normal distributions simultaneously if we use a vector for the parameter arguments. This could be useful, for example, for simulating data with different group means but the same variance. We might want to use something like this if we were making data that we would analyze using an ANOVA.</p>
<p>I’ll keep the standard deviation at 1 but will draw data from three distributions centered at three different locations: one at 0, one at 5, and one at 20. I request 10 total draws by changing <code>n</code> to 10.</p>
<p>Note the repeating pattern: the function iteratively draws one value from each distribution until the total number requested is reached. This can lead to imbalance in the sample size per distribution.</p>
<pre class="r"><code>rnorm(n = 10, mean = c(0, 5, 20), sd = 1)</code></pre>
<pre><code> [1] -1.6630805 5.5759095 20.4727601 -0.5427317 6.1276871 18.3522024
[7] -0.3141739 4.8173184 21.4704785 -0.8658988</code></pre>
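<p>The imbalance comes from how the vector of means is recycled to the requested length. Here is a small sketch of that recycling using <code>rep_len()</code>; the object names are my own.</p>
<pre class="r"><code>means = c(0, 5, 20)
# The mean used for each of the 10 draws, in order
used = rep_len(means, length.out = 10)
used
# Number of draws per distribution: 4 for the first, 3 for each of the others
table(used)</code></pre>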
<p>A vector can also be passed to <code>sd</code>. Here both the means and standard deviations vary among the three distributions used to generate values.</p>
<pre class="r"><code>rnorm(n = 10, mean = c(0, 5, 20), sd = c(1, 5, 20) )</code></pre>
<pre><code> [1] 1.5274670 10.2708903 40.6014202 0.8401609 6.0848235 6.5494885
[7] 0.1325985 4.6453633 1.1460906 -1.0220310</code></pre>
<p>Things are different for the <code>n</code> argument. If a vector is passed to <code>n</code>, the <em>length</em> of that vector is taken to be the number required (see <code>Arguments</code> section of documentation for details).</p>
<p>Here’s an example. Since the vector for <code>n</code> is length 3, we only get 3 values. This has caught me before, as I would expect this code to give me different numbers per group instead of ignoring the information in the vector.</p>
<pre class="r"><code>rnorm(n = c(2, 10, 10), mean = c(0, 5, 20), sd = c(1, 5, 20) )</code></pre>
<pre><code>[1] 0.2805551 7.7239168 22.6173950</code></pre>
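<p>If the goal really is a different number of draws per group, one workaround is to repeat each parameter value the desired number of times with <code>rep()</code> and request the total number of draws. This is a sketch of one approach, with object names I made up.</p>
<pre class="r"><code>sizes = c(2, 10, 10)  # desired number of draws per group
grp_means = c(0, 5, 20)
grp_sds = c(1, 5, 20)
# Each mean and sd is repeated once per draw in its group
vals = rnorm(n = sum(sizes),
    mean = rep(grp_means, times = sizes),
    sd = rep(grp_sds, times = sizes) )
length(vals)</code></pre>
<p>The first 2 values come from the first distribution, the next 10 from the second, and the last 10 from the third.</p>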
</div>
</div>
<div id="example-of-using-the-simulated-numbers-from-rnorm" class="section level2">
<h2>Example of using the simulated numbers from rnorm()</h2>
<p>Up to this point we’ve printed the results of each simulation. In reality we’d want to save our vectors as objects in R to use them for some further task.</p>
<p>For example, maybe we want to simulate two unrelated variables and then look to see how correlated they appear to be. This can be a fun exercise to demonstrate how variables can appear to be related by chance even when we know they are not, especially at small sample sizes.</p>
<p>Let’s generate two quantitative vectors of length 10, which I’ll name <code>x</code> and <code>y</code>, and plot the results. I’m using the defaults for <code>mean</code> and <code>sd</code>.</p>
<p>This particular example doesn’t show much of a pattern, but if you take this code and run it many times you’ll see some pretty surprising relationships.</p>
<pre class="r"><code>x = rnorm(n = 10, mean = 0, sd = 1)
y = rnorm(n = 10, mean = 0, sd = 1)
plot(y ~ x)</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-08-29-getting-started-simulating-data_files/figure-html/unnamed-chunk-10-1.png" width="672" /></p>
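<p>To get a rough feel for how often chance relationships show up, the same two draws can be repeated many times while keeping only the correlation from each run. This is a sketch rather than part of the original example; it borrows <code>replicate()</code>, which is introduced later in this post, and the name <code>cors</code> is just for illustration.</p>
<pre class="r"><code>set.seed(16)
# Correlation between two unrelated vectors of length 10, repeated 1000 times
cors = replicate(n = 1000,
                 expr = cor(rnorm(n = 10, mean = 0, sd = 1),
                            rnorm(n = 10, mean = 0, sd = 1) ) )
summary(cors)
hist(cors)</code></pre>
<p>Even though the variables are unrelated by construction, the histogram should show that correlations far from 0 are not unusual at this sample size.</p>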
</div>
<div id="runif-pulls-from-the-uniform-distribution" class="section level2">
<h2>runif() pulls from the uniform distribution</h2>
<p>Pulling random numbers from other distributions is extremely similar to using <code>rnorm()</code>, so we’ll go through them more quickly.</p>
<p>I’ve started using <code>runif()</code> pretty regularly, especially when I want to easily generate numbers that are strictly positive but continuous. The uniform distribution is a continuous distribution, with numbers uniformly distributed between some minimum and maximum.</p>
<p>From <code>Usage</code> we can see that by default we pull random numbers between 0 and 1. The first argument, as with all of these <code>r</code> functions, is the number of deviates we want to randomly generate:</p>
<blockquote>
<p>runif(n, min = 0, max = 1)</p>
</blockquote>
<p>Let’s generate 5 numbers between 0 and 1.</p>
<pre class="r"><code>runif(n = 5, min = 0, max = 1)</code></pre>
<pre><code>[1] 0.9994220 0.9432766 0.2496042 0.6482484 0.1125788</code></pre>
<p>What if we want to generate 5 numbers between 50 and 100? We change the values for the parameter arguments.</p>
<pre class="r"><code>runif(n = 5, min = 50, max = 100)</code></pre>
<pre><code>[1] 81.56680 66.91604 82.70027 64.29244 54.17023</code></pre>
</div>
<div id="example-of-using-the-simulated-numbers-from-runif" class="section level2">
<h2>Example of using the simulated numbers from runif()</h2>
<p>I like using <code>runif()</code> for making explanatory variables that have realistic ranges. In multiple regression, having explanatory variables with different magnitudes affects interpretation of regression coefficients.</p>
<p>Let’s generate some data with the response variable (<code>y</code>) pulled from a standard normal distribution and then an explanatory variable with values between 1 and 2. The two variables are unrelated.</p>
<p>You see I’m still writing out my argument names for clarity, but you may be getting a sense of how easy it would be to start cutting corners to avoid the extra typing.</p>
<pre class="r"><code>set.seed(16)
y = rnorm(n = 100, mean = 0, sd = 1)
x1 = runif(n = 100, min = 1, max = 2)
head(x1)</code></pre>
<pre><code>[1] 1.957004 1.082791 1.710816 1.326998 1.995723 1.449522</code></pre>
<p>Now let’s simulate a second explanatory variable with values between 200 and 300. This variable is also unrelated to the other two.</p>
<pre class="r"><code>x2 = runif(n = 100, min = 200, max = 300)
head(x2)</code></pre>
<pre><code>[1] 220.0617 263.4875 209.6036 245.3125 265.1869 257.4817</code></pre>
<p>We can fit a multiple regression linear model via <code>lm()</code>. In this case, the coefficient for the second variable, whose values are larger in magnitude, is smaller in absolute value than the coefficient for the first (although neither is actually related to <code>y</code>). The change in <code>y</code> for a “1-unit increase” in <code>x</code> depends on the units of <code>x</code>.</p>
<pre class="r"><code>lm(y ~ x1 + x2)</code></pre>
<pre><code>
Call:
lm(formula = y ~ x1 + x2)
Coefficients:
(Intercept) x1 x2
0.380887 0.104941 -0.001908 </code></pre>
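<p>One way to see the dependence on units is to refit the model after rescaling <code>x2</code>. The sketch below re-creates the three variables so it can run on its own; the rescaled variable <code>x2_hundreds</code> and the model names are hypothetical additions for illustration.</p>
<pre class="r"><code>set.seed(16)
y = rnorm(n = 100, mean = 0, sd = 1)
x1 = runif(n = 100, min = 1, max = 2)
x2 = runif(n = 100, min = 200, max = 300)
x2_hundreds = x2 / 100 # Same variable, now in units of hundreds

fit_orig = lm(y ~ x1 + x2)
fit_rescaled = lm(y ~ x1 + x2_hundreds)

# The rescaled coefficient is 100 times the original one
coef(fit_orig)[["x2"]] * 100
coef(fit_rescaled)[["x2_hundreds"]]</code></pre>
<p>The fitted relationship is the same in both models; only the size of a “1-unit increase” has changed.</p>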
</div>
<div id="discrete-counts-with-rpois" class="section level2">
<h2>Discrete counts with rpois()</h2>
<p>Let’s look at one last function for generating random numbers, this time for generating discrete integers (including 0) from a Poisson distribution with <code>rpois()</code>.</p>
<p>I use <code>rpois()</code> for generating counts for creating data to be fit with generalized linear models. This function has also helped me gain a better understanding of the shape of Poisson distributions with different means.</p>
<p>The Poisson distribution is a single parameter distribution. The function looks like:</p>
<blockquote>
<p>rpois(n, lambda)</p>
</blockquote>
<p>The single parameter argument, <code>lambda</code>, is the mean. It has no default setting so must always be defined by the user.</p>
<p>Let’s generate five values from a Poisson distribution with a mean of 2.5. Note that the <em>mean</em> of the Poisson distribution can be any non-negative value (i.e., it doesn’t have to be an integer) even though the observed values will be discrete integers.</p>
<pre class="r"><code>rpois(n = 5, lambda = 2.5)</code></pre>
<pre><code>[1] 2 1 4 1 2</code></pre>
</div>
<div id="example-of-using-the-simulated-numbers-from-rpois" class="section level2">
<h2>Example of using the simulated numbers from rpois()</h2>
<p>Let’s explore the Poisson distribution a little more, seeing how the distribution changes when the mean of the distribution changes. Being able to look at how the Poisson distribution changes with the mean via simulation helped me understand the distribution better, including why it so often does a poor job modeling ecological count data.</p>
<p>We’ll draw 100 values from a Poisson distribution with a mean of 5. We’ll name this vector <code>y</code> and take a look at a summary of those values.</p>
<pre class="r"><code>y = rpois(100, lambda = 5)</code></pre>
<p>The values we simulated here fall between 1 and 11.</p>
<pre class="r"><code>summary(y)</code></pre>
<pre><code> Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 3.00 5.00 4.83 6.00 11.00 </code></pre>
<p>There is mild right-skew when we draw a histogram of the values.</p>
<pre class="r"><code>hist(y)</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-08-29-getting-started-simulating-data_files/figure-html/unnamed-chunk-19-1.png" width="672" /></p>
<p>Let’s do the same thing for a Poisson distribution with a mean of 100. The range of values is pretty narrow; there are no values even remotely close to 0.</p>
<pre class="r"><code>y = rpois(100, lambda = 100)
summary(y)</code></pre>
<pre><code> Min. 1st Qu. Median Mean 3rd Qu. Max.
76.00 94.75 102.00 101.31 108.00 124.00 </code></pre>
<p>And the distribution is now pretty symmetric.</p>
<pre class="r"><code>hist(y)</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-08-29-getting-started-simulating-data_files/figure-html/unnamed-chunk-21-1.png" width="672" /></p>
<p>An alternative to the Poisson distribution for discrete integers is the negative binomial distribution. Package <strong>MASS</strong> has a function called <code>rnegbin()</code> for random number generation from the negative binomial distribution.</p>
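<p>A quick simulation can show one reason the two distributions behave differently: the Poisson variance equals its mean, while the negative binomial allows extra variance. As a sketch, I’m swapping in <code>rnbinom()</code> from base R instead of <code>MASS::rnegbin()</code> so no extra package is needed; the <code>size</code> value of 2 and the object names are arbitrary choices for illustration.</p>
<pre class="r"><code>set.seed(16)
y_pois = rpois(n = 10000, lambda = 5)
y_nb = rnbinom(n = 10000, size = 2, mu = 5) # Same mean, smaller size means more spread

# Compare the sample mean and variance of each set of counts
c(mean = mean(y_pois), variance = var(y_pois) )
c(mean = mean(y_nb), variance = var(y_nb) )</code></pre>
<p>Both sets of counts should have a mean near 5, but the negative binomial counts should be noticeably more variable.</p>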
</div>
</div>
<div id="generate-character-vectors-with-rep" class="section level1">
<h1>Generate character vectors with rep()</h1>
<p>Quantitative variables are great, but in simulations we’re often going to need categorical variables, as well.</p>
<p>In my own work these are usually some sort of “grouping” or “treatment” variable, with multiple individuals/observations per group/treatment. This means I need to have repetitions of each character value. The <code>rep()</code> function is one way to avoid having to write out an entire vector manually. It’s for <em>replicating elements of vectors and lists</em>.</p>
<div id="using-letters-and-letters" class="section level2">
<h2>Using letters and LETTERS</h2>
<p>The first argument of <code>rep()</code> is the vector to be repeated. One option is to write out the character vector you want to repeat. You can also get a simple character vector through the use of <code>letters</code> or <code>LETTERS</code>. These are <em>built in constants</em> in R. <code>letters</code> is the 26 lowercase letters of the Roman alphabet and <code>LETTERS</code> is the 26 uppercase letters.</p>
<p>Letters can be pulled out via the extract brackets (<code>[</code>). I use these built-in constants for pure convenience when I need to make a basic categorical vector and it doesn’t matter what form those categories take. I find it more straightforward to type out the word and brackets than a vector of characters (complete with all those pesky quotes and such 😆).</p>
<p>Here’s the first two <code>letters</code>.</p>
<pre class="r"><code>letters[1:2]</code></pre>
<pre><code>[1] "a" "b"</code></pre>
<p>And the last 17 <code>LETTERS</code>.</p>
<pre class="r"><code>LETTERS[10:26]</code></pre>
<pre><code> [1] "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"</code></pre>
</div>
<div id="repeat-each-element-of-a-vector-with-each" class="section level2">
<h2>Repeat each element of a vector with each</h2>
<p>There are three arguments that help us repeat the values in the vector in <code>rep()</code> with different patterns: <code>each</code>, <code>times</code>, and <code>length.out</code>. These can be used individually or in combination.</p>
<p>With <code>each</code> we repeat each unique character in the vector the defined number of times. The replication is done “elementwise”, so the repeats of each unique character are all in a row.</p>
<p>Let’s repeat two characters three times each. The resulting vector is 6 observations long.</p>
<p>This is an example of how I might make a grouping variable for simulating data to be used in a two-sample analysis.</p>
<pre class="r"><code>rep(letters[1:2], each = 3)</code></pre>
<pre><code>[1] "a" "a" "a" "b" "b" "b"</code></pre>
</div>
<div id="repeat-a-whole-vector-with-the-times-argument" class="section level2">
<h2>Repeat a whole vector with the times argument</h2>
<p>The <code>times</code> argument can be used when we want to repeat the whole vector rather than repeating it elementwise.</p>
<p>We’ll make a two-group variable again, but this time we’ll change the repeating pattern of the values in the variable.</p>
<pre class="r"><code>rep(letters[1:2], times = 3)</code></pre>
<pre><code>[1] "a" "b" "a" "b" "a" "b"</code></pre>
</div>
<div id="set-the-output-vector-length-with-the-length.out-argument" class="section level2">
<h2>Set the output vector length with the length.out argument</h2>
<p>The <code>length.out</code> argument has <code>rep()</code> repeat the whole vector. However, it repeats the vector only until the defined length is reached. Using <code>length.out</code> is another way to get unbalanced groups.</p>
<p>Rather than defining the number of repeats like we did with <code>each</code> and <code>times</code> we define the length of the output vector.</p>
<p>Here we’ll make a two-group variable of length 5. This means the second group will have one less value than the first.</p>
<pre class="r"><code>rep(letters[1:2], length.out = 5)</code></pre>
<pre><code>[1] "a" "b" "a" "b" "a"</code></pre>
</div>
<div id="repeat-each-element-a-different-number-of-times" class="section level2">
<h2>Repeat each element a different number of times</h2>
<p>Unlike <code>each</code> and <code>length.out</code>, we can use <code>times</code> with a vector of values. This allows us to repeat each element of the character vector a different number of times. This is one way to simulate unbalanced groups.</p>
<p>Using <code>times</code> with a vector repeats each element like <code>each</code> does. I found this a little confusing as it makes it harder to remember which argument repeats “elementwise” and which “vectorwise”. But <code>length.out</code> always repeats “vectorwise”, so that’s something.</p>
<p>Let’s repeat the first element twice and the second four times.</p>
<pre class="r"><code>rep(letters[1:2], times = c(2, 4) )</code></pre>
<pre><code>[1] "a" "a" "b" "b" "b" "b"</code></pre>
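<p>When group sizes differ, a quick tally helps confirm the vector came out as intended. This is a small sketch using <code>table()</code> from base R; the name <code>group</code> is just for illustration.</p>
<pre class="r"><code>group = rep(letters[1:2], times = c(2, 4) )
table(group) # Count the number of values in each group</code></pre>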
</div>
<div id="combining-each-with-times" class="section level2">
<h2>Combining each with times</h2>
<p>As your simulation situation gets more complicated, like if you are simulating data from a blocked design or with multiple sizes of experimental units, you may need more complicated patterns for your categorical variable. The <code>each</code> argument can be combined with <code>times</code> to first repeat each value elementwise (via <code>each</code>) and then repeat that whole pattern (via <code>times</code>).</p>
<p>When using <code>times</code> this way it will only take a single value and not a vector.</p>
<p>Let’s repeat each value twice, 3 times.</p>
<pre class="r"><code>rep(letters[1:2], each = 2, times = 3)</code></pre>
<pre><code> [1] "a" "a" "b" "b" "a" "a" "b" "b" "a" "a" "b" "b"</code></pre>
</div>
<div id="combining-each-with-length.out" class="section level2">
<h2>Combining each with length.out</h2>
<p>Similarly we can use <code>each</code> with <code>length.out</code>. This can lead to some imbalance.</p>
<p>Here we’ll repeat the two values twice each and then repeat that pattern until we hit a total final vector length of 7.</p>
<pre class="r"><code>rep(letters[1:2], each = 2, length.out = 7)</code></pre>
<pre><code>[1] "a" "a" "b" "b" "a" "a" "b"</code></pre>
<p>Note you can’t use <code>length.out</code> and <code>times</code> together (if you try, <code>length.out</code> will be given priority and <code>times</code> ignored).</p>
</div>
</div>
<div id="creating-datasets-with-quantiative-and-categorical-variables" class="section level1">
<h1>Creating datasets with quantitative and categorical variables</h1>
<p>We now have some tools for creating quantitative data as well as categorical. Which means it’s time to make some datasets! We’ll create several simple ones to get the general idea.</p>
<div id="simulate-data-with-no-differences-among-two-groups" class="section level2">
<h2>Simulate data with no differences among two groups</h2>
<p>Let’s start by simulating data that we would use in a simple two-sample analysis with no difference between groups. We’ll make a total of 6 observations, three in each group.</p>
<p>We’ll be using the tools we reviewed above but will now name the output and combine them into a data.frame. This last step isn’t always necessary, but can help you keep things organized.</p>
<p>First we’ll make separate vectors for the continuous and categorical data and then bind them together via <code>data.frame()</code>.</p>
<p>Notice there is no need to use <code>cbind()</code> here, which is commonly done by R beginners (I know I did!). Instead we can use <code>data.frame()</code> directly.</p>
<pre class="r"><code>group = rep(letters[1:2], each = 3)
response = rnorm(n = 6, mean = 0, sd = 1)
data.frame(group,
response)</code></pre>
<pre><code> group response
1 a 0.4933614
2 a 0.5234101
3 a 1.2365975
4 b 0.3563153
5 b 0.5748968
6 b -0.4222890</code></pre>
<p>When I make a data.frame like this I prefer to make my vectors and the data.frame simultaneously to avoid having a lot of variables cluttering up my R Environment.</p>
<p>I often teach/blog with all the steps clearly delineated as I think it’s easier when you are starting out, so (as always) use the method that works for you.</p>
<pre class="r"><code>data.frame(group = rep(letters[1:2], each = 3),
response = rnorm(n = 6, mean = 0, sd = 1) )</code></pre>
<pre><code> group response
1 a 0.4024228
2 a 0.9585800
3 a -1.8763844
4 b -0.2115171
5 b 1.4374372
6 b 0.3855285</code></pre>
<p>Now let’s add another categorical variable to this dataset.</p>
<p>Say we are in a situation involving two factors, not one. We have a single observation for every combination of the two factors (i.e., the two factors are <em>crossed</em>).</p>
<p>The second factor, which we’ll call <code>factor</code>, will take on the values “C”, “D”, and “E”.</p>
<pre class="r"><code>LETTERS[3:5]</code></pre>
<pre><code>[1] "C" "D" "E"</code></pre>
<p>We need to repeat the values in a way that every combination of <code>group</code> and <code>factor</code> is present in the dataset exactly one time.</p>
<p>Remember the <code>group</code> factor is repeated elementwise.</p>
<pre class="r"><code>rep(letters[1:2], each = 3)</code></pre>
<pre><code>[1] "a" "a" "a" "b" "b" "b"</code></pre>
<p>We need to repeat the three values twice. But what argument do we use in <code>rep()</code> to do so?</p>
<pre class="r"><code>rep(LETTERS[3:5], ?)</code></pre>
<p>Does <code>each</code> work?</p>
<pre class="r"><code>rep(LETTERS[3:5], each = 2)</code></pre>
<pre><code>[1] "C" "C" "D" "D" "E" "E"</code></pre>
<p>No, if we use <code>each</code> then each element is repeated twice and some of the combinations of <code>group</code> and the new variable will be missing.</p>
<p>This is a job for the <code>times</code> or <code>length.out</code> arguments, so the whole vector is repeated. We can repeat the whole vector twice using <code>times</code>, or via <code>length.out = 6</code>. I decided to do the former.</p>
<p>In the result below we can see every combination of the two factors is present once.</p>
<pre class="r"><code>data.frame(group = rep(letters[1:2], each = 3),
factor = rep(LETTERS[3:5], times = 2),
response = rnorm(n = 6, mean = 0, sd = 1) )</code></pre>
<pre><code> group factor response
1 a C 0.4255576
2 a D 0.2903586
3 a E -0.3638877
4 b C 1.9778117
5 b D 1.0869069
6 b E -0.5869200</code></pre>
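<p>As an aside, base R’s <code>expand.grid()</code> can build every combination of two factors without working out the <code>rep()</code> pattern by hand. This is an alternative sketch rather than the approach used above, and the object name <code>dat</code> is made up for illustration. The first variable listed varies fastest, so <code>factor</code> goes first to get the same row order.</p>
<pre class="r"><code>dat = expand.grid(factor = LETTERS[3:5],
                  group = letters[1:2])
dat$response = rnorm(n = nrow(dat), mean = 0, sd = 1)
dat

# Check that each combination of the two factors appears exactly once
table(dat$group, dat$factor)</code></pre>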
</div>
<div id="simulate-data-with-a-difference-among-groups" class="section level2">
<h2>Simulate data with a difference among groups</h2>
<p>The dataset above is one with “no difference” among groups. What if we want data where the means are different between groups? Let’s make two groups of three observations where the mean of one group is 5 and the other is 10. The two groups have a shared variance (and so standard deviation) of 1.</p>
<p>Remembering how <code>rnorm()</code> works with a vector of means is key here. The function draws iteratively from each distribution.</p>
<pre class="r"><code>response = rnorm(n = 6, mean = c(5, 10), sd = 1)
response</code></pre>
<pre><code>[1] 4.413753 12.484499 4.740506 10.273258 5.369074 10.024199</code></pre>
<p>How do we get the <code>group</code> pattern correct?</p>
<pre class="r"><code>group = rep(letters[1:2], ?)</code></pre>
<p>We need to repeat the whole vector three times instead of elementwise.</p>
<p>To get the groups in the correct order we need to use <code>times</code> or <code>length.out</code> in <code>rep()</code>. With <code>length.out</code> we define the output length of the vector, which is 6. Alternatively we could use <code>times = 3</code> to repeat the whole vector 3 times.</p>
<pre class="r"><code>group = rep(letters[1:2], length.out = 6)
group</code></pre>
<pre><code>[1] "a" "b" "a" "b" "a" "b"</code></pre>
<p>These can then be combined into a data.frame. Working out this process is another reason why sometimes we build each vector separately prior to combining them into a data.frame.</p>
<pre class="r"><code>data.frame(group,
response)</code></pre>
<pre><code> group response
1 a 4.413753
2 b 12.484499
3 a 4.740506
4 b 10.273258
5 a 5.369074
6 b 10.024199</code></pre>
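<p>If you’d rather keep each group’s rows together, another option is to repeat the means elementwise so they line up with a <code>group</code> vector built with <code>each</code>. This is a sketch of that alternative; the names <code>group2</code> and <code>response2</code> are made up so they don’t overwrite the objects above.</p>
<pre class="r"><code>group2 = rep(letters[1:2], each = 3)
# Repeat each mean 3 times so the means match the group order
response2 = rnorm(n = 6, mean = rep(c(5, 10), each = 3), sd = 1)
data.frame(group = group2,
           response = response2)</code></pre>
<p>Either way works as long as the pattern of the means matches the pattern of the grouping variable.</p>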
</div>
<div id="multiple-quantitative-variables-with-groups" class="section level2">
<h2>Multiple quantitative variables with groups</h2>
<p>For our last dataset we’ll have two groups, with 10 observations per group.</p>
<pre class="r"><code>rep(LETTERS[3:4], each = 10)</code></pre>
<pre><code> [1] "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "D" "D" "D" "D" "D" "D" "D"
[18] "D" "D" "D"</code></pre>
<p>Let’s make a dataset that has two quantitative variables, unrelated to both each other and the groups. One variable ranges from 10 to 15 and one from 100 to 150.</p>
<p>How many observations should we draw from each uniform distribution?</p>
<pre class="r"><code>runif(n = ?, min = 10, max = 15)</code></pre>
<p>We had 2 groups with 10 observations each and 2*10 = 20. So we need to use <code>n = 20</code> in <code>runif()</code>.</p>
<p>Here is the dataset made in a single step.</p>
<pre class="r"><code>data.frame(group = rep(LETTERS[3:4], each = 10),
x = runif(n = 20, min = 10, max = 15),
y = runif(n = 20, min = 100, max = 150))</code></pre>
<pre><code> group x y
1 C 13.20331 126.7004
2 C 13.91440 137.0772
3 C 12.72031 134.8689
4 C 14.27637 122.7582
5 C 11.72933 118.0189
6 C 14.59640 108.4544
7 C 12.81629 142.1792
8 C 13.45864 104.0355
9 C 13.94198 107.0199
10 C 12.15106 145.4622
11 D 12.01762 117.3381
12 D 13.66006 121.2900
13 D 14.78914 145.2424
14 D 11.70157 120.1363
15 D 13.25180 139.8288
16 D 10.76914 106.6282
17 D 13.97263 147.6383
18 D 14.91621 112.6612
19 D 13.53270 104.8965
20 D 14.40677 119.5231</code></pre>
<p>What happens if we get the number wrong? If we’re lucky we get an error.</p>
<pre class="r"><code>data.frame(group = rep(LETTERS[3:4], each = 10),
x = runif(n = 15, min = 10, max = 15),
y = runif(n = 15, min = 100, max = 150))</code></pre>
<p><code>Error in data.frame(group = rep(LETTERS[3:4], each = 10), x = runif(n = 15, : arguments imply differing number of rows: 20, 15</code></p>
<p>But if we get things wrong and the number we use happens to go into the number we need evenly, R will <em>recycle</em> the vector to the end of the <code>data.frame()</code>.</p>
<p>This is a hard mistake to catch. If you look carefully through the output below you can see that the continuous variables start to repeat on line 11 because I used <code>n = 10</code> instead of <code>n = 20</code>.</p>
<pre class="r"><code>data.frame(group = rep(LETTERS[3:4], each = 10),
x = runif(n = 10, min = 10, max = 15),
y = runif(n = 10, min = 100, max = 150))</code></pre>
<pre><code> group x y
1 C 12.28493 108.3455
2 C 13.84490 114.8247
3 C 12.42386 105.1358
4 C 10.08725 125.1979
5 C 10.83277 129.6310
6 C 10.96766 129.4584
7 C 11.51180 149.2819
8 C 13.48253 139.2530
9 C 11.64337 119.6663
10 C 12.88603 119.8368
11 D 12.28493 108.3455
12 D 13.84490 114.8247
13 D 12.42386 105.1358
14 D 10.08725 125.1979
15 D 10.83277 129.6310
16 D 10.96766 129.4584
17 D 11.51180 149.2819
18 D 13.48253 139.2530
19 D 11.64337 119.6663
20 D 12.88603 119.8368</code></pre>
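<p>One way to lower the risk of silent recycling is to define the sizes once and calculate the total from them, so every vector is built from the same numbers. This is only a sketch; the names <code>n_group</code>, <code>n_per</code>, <code>n_total</code>, and <code>dat</code> are made up for illustration.</p>
<pre class="r"><code>n_group = 2 # Number of groups
n_per = 10 # Observations per group
n_total = n_group * n_per

dat = data.frame(group = rep(LETTERS[3:4], each = n_per),
                 x = runif(n = n_total, min = 10, max = 15),
                 y = runif(n = n_total, min = 100, max = 150))

nrow(dat)
anyDuplicated(dat$x) # 0 means no repeated values in x</code></pre>
<p>Repeated values in a continuous variable are a hint that recycling happened, so a check like <code>anyDuplicated()</code> can be a useful safeguard.</p>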
</div>
</div>
<div id="repeatedly-simulate-data-with-replicate" class="section level1">
<h1>Repeatedly simulate data with replicate()</h1>
<p>The <code>replicate()</code> function is a real workhorse when making repeated simulations. It is a member of the <em>apply</em> family in R, and is specifically made (per the documentation) for the <em>repeated evaluation of an expression (which will usually involve random number generation)</em>.</p>
<p>We want to repeatedly simulate data that involves random number generation, so that sounds like a useful tool.</p>
<p>The <code>replicate()</code> function takes three arguments:</p>
<ul>
<li><code>n</code>, which is the number of replications to perform. This is to set the number of repeated runs we want.<br />
</li>
<li><code>expr</code>, the expression that should be run repeatedly. This is often a function.<br />
</li>
<li><code>simplify</code>, which controls the type of output the results of <code>expr</code> are saved into. Use <code>simplify = FALSE</code> to get output saved into a list instead of in an array.</li>
</ul>
<div id="simple-example-of-replicate" class="section level2">
<h2>Simple example of replicate()</h2>
<p>Let’s say we want to simulate some values from a normal distribution, which we can do using the <code>rnorm()</code> function as above. But now instead of drawing some number of values from a distribution one time we want to do it many times. This could be something we’d do when demonstrating the central limit theorem, for example.</p>
<p>Doing the random number generation many times is where <code>replicate()</code> comes in. It allows us to run the function in <code>expr</code> exactly <code>n</code> times.</p>
<p>Here I’ll generate 5 values from a standard normal distribution three times. Notice the addition of <code>simplify = FALSE</code> to get a list as output.</p>
<p>The output below is a list of three vectors. Each vector is from a unique run of the function, so contains five random numbers drawn from the normal distribution with a mean of 0 and standard deviation of 1.</p>
<pre class="r"><code>set.seed(16)
replicate(n = 3,
expr = rnorm(n = 5, mean = 0, sd = 1),
simplify = FALSE )</code></pre>
<pre><code>[[1]]
[1] 0.4764134 -0.1253800 1.0962162 -1.4442290 1.1478293
[[2]]
[1] -0.46841204 -1.00595059 0.06356268 1.02497260 0.57314202
[[3]]
[1] 1.8471821 0.1119334 -0.7460373 1.6582137 0.7217206</code></pre>
<p>Note if I don’t use <code>simplify = FALSE</code> I will get a matrix of values instead of a list. Each column in the matrix is the output from one run of the function.</p>
<p>In this case there will be three columns in the output, one for each run, and 5 rows. This can be a useful output type for some simulations. I focus on list output throughout the rest of this post only because that’s what I have been using recently for simulations.</p>
<pre class="r"><code>set.seed(16)
replicate(n = 3,
expr = rnorm(n = 5, mean = 0, sd = 1) )</code></pre>
<pre><code> [,1] [,2] [,3]
[1,] 0.4764134 -0.46841204 1.8471821
[2,] -0.1253800 -1.00595059 0.1119334
[3,] 1.0962162 0.06356268 -0.7460373
[4,] -1.4442290 1.02497260 1.6582137
[5,] 1.1478293 0.57314202 0.7217206</code></pre>
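<p>One thing the matrix output makes easy is summarizing each run by column. As a small sketch (the name <code>sims</code> is made up), here is the mean of each of the three runs.</p>
<pre class="r"><code>set.seed(16)
sims = replicate(n = 3,
                 expr = rnorm(n = 5, mean = 0, sd = 1) )
dim(sims) # Rows are draws, columns are runs
colMeans(sims) # One mean per run</code></pre>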
</div>
<div id="an-equivalent-for-loop-example" class="section level2">
<h2>An equivalent for() loop example</h2>
<p>A <code>for()</code> loop can be used in place of <code>replicate()</code> for simulations. With time and practice I’ve found <code>replicate()</code> to be much more convenient in terms of writing the code. However, in my experience some folks find <code>for()</code> loops intuitive when they are starting out in R. I think it’s because <code>for()</code> loops are more explicit on the looping process: the user can see the values that <code>i</code> takes and the output for each <code>i</code> iteration is saved into the output object because the code is written out explicitly.</p>
<p>In my example I’ll save the output of each iteration of the loop into a list called <code>list1</code>. I initialize this as an empty list prior to starting the loop. To match what I did with <code>replicate()</code> I do three iterations of the loop (<code>i in 1:3</code>), drawing 5 values via <code>rnorm()</code> each time.</p>
<p>The result is identical to my <code>replicate()</code> code above. It took a little more code to do it but the process is very clear since it is explicitly written out.</p>
<pre class="r"><code>set.seed(16)
list1 = list() # Make an empty list to save output in
for (i in 1:3) { # Indicate number of iterations with "i"
list1[[i]] = rnorm(n = 5, mean = 0, sd = 1) # Save output in list for each iteration
}
list1</code></pre>
<pre><code>[[1]]
[1] 0.4764134 -0.1253800 1.0962162 -1.4442290 1.1478293
[[2]]
[1] -0.46841204 -1.00595059 0.06356268 1.02497260 0.57314202
[[3]]
[1] 1.8471821 0.1119334 -0.7460373 1.6582137 0.7217206</code></pre>
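<p>A common variation on this loop is to create the list at its final length before the loop starts instead of growing it on each iteration. The sketch below does that and then checks the result against <code>replicate()</code> using the same seed; the names <code>n_iter</code>, <code>list2</code>, and <code>list3</code> are made up for illustration.</p>
<pre class="r"><code>set.seed(16)
n_iter = 3
list2 = vector(mode = "list", length = n_iter) # List of the final length, filled with NULL
for (i in seq_len(n_iter) ) {
    list2[[i]] = rnorm(n = 5, mean = 0, sd = 1) # Fill in element i
}

set.seed(16)
list3 = replicate(n = n_iter,
                  expr = rnorm(n = 5, mean = 0, sd = 1),
                  simplify = FALSE)
all.equal(list2, list3)</code></pre>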
</div>
<div id="using-replicate-to-repeatedly-make-a-dataset" class="section level2">
<h2>Using replicate() to repeatedly make a dataset</h2>
<p>Earlier we were making datasets with random numbers and some grouping variables. Our code looked like:</p>
<pre class="r"><code>data.frame(group = rep(letters[1:2], each = 3),
response = rnorm(n = 6, mean = 0, sd = 1) )</code></pre>
<pre><code> group response
1 a -1.6630805
2 a 0.5759095
3 a 0.4727601
4 b -0.5427317
5 b 1.1276871
6 b -1.6477976</code></pre>
<p>We could put this process as the <code>expr</code> argument in <code>replicate()</code> to get many simulated datasets. I would do something like this if I wanted to compare the long-run performance of two different statistical tools using the exact same random datasets.</p>
<p>I’ll replicate things 3 times again to easily see the output. I still use <code>simplify = FALSE</code> to get things into a list.</p>
<pre class="r"><code>simlist = replicate(n = 3,
expr = data.frame(group = rep(letters[1:2], each = 3),
response = rnorm(n = 6, mean = 0, sd = 1) ),
simplify = FALSE)</code></pre>
<p>We can see this result is a list of three data.frames.</p>
<pre class="r"><code>str(simlist)</code></pre>
<pre><code>List of 3
$ :'data.frame': 6 obs. of 2 variables:
..$ group : Factor w/ 2 levels "a","b": 1 1 1 2 2 2
..$ response: num [1:6] -0.314 -0.183 1.47 -0.866 1.527 ...
$ :'data.frame': 6 obs. of 2 variables:
..$ group : Factor w/ 2 levels "a","b": 1 1 1 2 2 2
..$ response: num [1:6] 1.03 0.84 0.217 -0.673 0.133 ...
$ :'data.frame': 6 obs. of 2 variables:
..$ group : Factor w/ 2 levels "a","b": 1 1 1 2 2 2
..$ response: num [1:6] -0.943 -1.022 0.281 0.545 0.131 ...</code></pre>
<p>Here is the first one.</p>
<pre class="r"><code>simlist[[1]]</code></pre>
<pre><code> group response
1 a -0.3141739
2 a -0.1826816
3 a 1.4704785
4 b -0.8658988
5 b 1.5274670
6 b 1.0541781</code></pre>
</div>
</div>
<div id="whats-the-next-step" class="section level1">
<h1>What’s the next step?</h1>
<p>I’m ending here, but there’s still more to learn about simulations. For a simulation to explore long-run behavior, some process is going to be repeated many times. We did this via <code>replicate()</code>. The next step would be to extract whatever results are of interest. This latter process is often going to involve some sort of looping.</p>
<p>By saving our generated variables or data.frames into a list we’ve made it so we can loop via list looping functions like <code>lapply()</code> or <code>purrr::map()</code>. The family of <em>map</em> functions is newer and has a lot of convenient output types that make them pretty useful. If you want to see how that might look for a simulation, you can see a few examples in my blog post <a href="https://aosmith.rbind.io/2018/06/05/a-closer-look-at-replicate-and-purrr/">A closer look at replicate() and purrr::map() for simulations</a>.</p>
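<p>As a rough sketch of what that extraction step might look like with base R, the code below fits a two-group linear model to each simulated dataset with <code>lapply()</code> and then pulls out the estimated difference between groups. The choice of <code>lm()</code> and the names <code>fits</code> and <code>diffs</code> are assumptions made for illustration.</p>
<pre class="r"><code>set.seed(16)
simlist = replicate(n = 3,
                    expr = data.frame(group = rep(letters[1:2], each = 3),
                                      response = rnorm(n = 6, mean = 0, sd = 1) ),
                    simplify = FALSE)

# Fit the same model to every dataset in the list
fits = lapply(simlist, function(dat) lm(response ~ group, data = dat) )

# Extract the estimated difference between group b and group a from each model
diffs = sapply(fits, function(fit) coef(fit)[["groupb"]] )
diffs</code></pre>
<p>With many more replications, a vector like <code>diffs</code> gives a picture of how much the estimate varies from sample to sample.</p>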
<p>Happy simulating!</p>
</div>
Automating exploratory plots with ggplot2 and purrr
https://aosmith.rbind.io/2018/08/20/automating-exploratory-plots/
Mon, 20 Aug 2018 00:00:00 +0000https://aosmith.rbind.io/2018/08/20/automating-exploratory-plots/<p>When you have a lot of variables and need to make a lot of exploratory plots it’s usually worthwhile to automate the process in R instead of manually copying and pasting code for every plot. However, the coding approach needed to automate plots can look pretty daunting to a beginner R user. It can look so daunting, in fact, that it can appear easier to manually make the plots (like in Excel) rather than using R at all.</p>
<p>Unfortunately making plots manually can backfire. The efficiency of using a software program you already know is quickly out-weighed by being unable to easily reproduce the plots when needed. I know I invariably have to re-make even exploratory plots, and it’d be a bummer if I had to remake them all manually rather than re-running some code.</p>
<p>So while I often assure students working under time constraints that it is perfectly OK to use software they already know rather than spending the time to learn how to do something in R, making many plots is a special case. To get them started I will provide students who need to automate plotting in R some example code (with explanation).</p>
<p>This post is based on an example I was working on recently, which involves plotting bivariate relationships between many continuous variables.</p>
<div id="load-r-packages" class="section level1">
<h1>Load R packages</h1>
<p>I’ll be plotting with <strong>ggplot2</strong> and looping with <strong>purrr</strong>. I’ll also be using package <strong>cowplot</strong> later to combine individual plots into one, but will use the package functions via <code>cowplot::</code> instead of loading the package.</p>
<pre class="r"><code>library(ggplot2) # v. 3.0.0
library(purrr) # v. 0.2.5</code></pre>
</div>
<div id="the-set-up" class="section level1">
<h1>The set-up</h1>
<p>Today I’m going to make an example dataset with 3 response (<code>y</code>) variables and 4 explanatory (<code>x</code>) variables for plotting. (The real dataset had 9 response and 9 explanatory variables.)</p>
<pre class="r"><code>set.seed(16)
dat = data.frame(elev = round( runif(20, 100, 500), 1),
resp = round( runif(20, 0, 10), 1),
grad = round( runif(20, 0, 1), 2),
slp = round( runif(20, 0, 35),1),
lat = runif(20, 44.5, 45),
long = runif(20, 122.5, 123.1),
nt = rpois(20, lambda = 25) )
head(dat)</code></pre>
<pre><code># elev resp grad slp lat long nt
# 1 373.2 9.7 0.05 8.8 44.54626 122.8547 18
# 2 197.6 8.1 0.42 33.3 44.79495 122.5471 26
# 3 280.0 5.4 0.38 19.3 44.99027 122.9645 18
# 4 191.8 4.3 0.07 29.6 44.95022 122.7290 19
# 5 445.4 2.3 0.43 16.5 44.79784 122.9836 15
# 6 224.5 6.5 0.78 4.1 44.96576 122.9836 21</code></pre>
<p>The goal is to make scatterplots for every response variable vs every explanatory variable. I’ve deemed the first three variables in the dataset to be the response variables (<code>elev</code>, <code>resp</code>, <code>grad</code>).</p>
<p>The plan is to loop through the variables and make the desired plots. I’m going to use vectors of the variable names for this, one vector for the response variables and one for the explanatory variables.</p>
<p>If all of your response or explanatory variables share some unique pattern in the variable names there are some clever ways to pull out the names with some of the select helper functions in <code>dplyr::select()</code>. Alas, my variable names are all unique. My options are to either write the vectors out manually or pull the names out by index. I’ll do the latter since the different types of variables are grouped together.</p>
<pre class="r"><code>response = names(dat)[1:3]
expl = names(dat)[4:7]</code></pre>
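<p>For comparison, here is a minimal sketch of the pattern-based approach mentioned above. It assumes a hypothetical dataset, <code>dat2</code>, where the response names start with <code>y_</code> and the explanatory names start with <code>x_</code>; neither the dataset nor the prefixes come from the example in this post.</p>
<pre class="r"><code># Hypothetical dataset (not the post's dat) whose variable names share prefixes
dat2 = data.frame(y_elev = c(373.2, 197.6, 280.0),
                  y_resp = c(9.7, 8.1, 5.4),
                  x_slp = c(8.8, 33.3, 19.3),
                  x_lat = c(44.5, 44.8, 45.0) )

# Select the columns by prefix, then keep only their names
response2 = names( dplyr::select(dat2, dplyr::starts_with("y_") ) )
expl2 = names( dplyr::select(dat2, dplyr::starts_with("x_") ) )

response2
expl2</code></pre>
<p>Other select helpers such as <code>ends_with()</code> or <code>contains()</code> can be swapped in the same way, depending on where the shared pattern sits in the names.</p>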
<p>When I know I’m going to be looping through character vectors I like to use <em>named</em> vectors. This helps me keep track of things in the output.</p>
<p>The <code>set_names()</code> function in <strong>purrr</strong> is super handy for naming character vectors, since it can use the values of the vector as names (i.e., the vector will be named by itself). (I don’t recommend trying this with lists of data.frames like I have in the past, though, since it turns out that naming a data.frame with a data.frame isn’t so useful. 😆)</p>
<pre class="r"><code>response = set_names(response)
response</code></pre>
<pre><code># elev resp grad
# "elev" "resp" "grad"</code></pre>
<pre class="r"><code>expl = set_names(expl)
expl</code></pre>
<pre><code># slp lat long nt
# "slp" "lat" "long" "nt"</code></pre>
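<p>To see why the names matter, here is a small sketch comparing the output of <code>map()</code> on an unnamed and a named vector. The vector <code>vars</code> and the use of <code>toupper()</code> are stand-ins for illustration only.</p>
<pre class="r"><code>library(purrr)

# Stand-in character vector, not the post's expl
vars = c("slp", "lat")

# Unnamed input gives an unnamed list; elements can only be pulled out by position
names( map(vars, toupper) )

# Named input carries the names through to the output list
names( map(set_names(vars), toupper) )</code></pre>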
</div>
<div id="create-a-plotting-function" class="section level1">
<h1><a href="#create-a-plotting-function">Create a plotting function</a></h1>
<p>Since I’m going to make a bunch of plots that will all have the same basic form, I will make a plotting function. I am going to make a function where only the <code>x</code> and <code>y</code> variables can vary (so are arguments to the function).</p>
<p>Since I’m making a function to plot variables from a single dataset I’m going to hard-code the dataset into the function. If you have multiple datasets or you are making a function for use across projects you’ll probably want to add the dataset as a function argument.</p>
<p>My function’s inputs are based on the variable names, so I need to pass strings into the <strong>ggplot2</strong> functions. Strings cannot be used directly in <code>aes()</code>, but can be used directly in <code>aes_string()</code>.</p>
<p>I’m making pretty basic graphs since these are exploratory plots, not publication-ready plots. I will make a scatterplot and add locally weighted regression (loess) lines via <code>geom_smooth()</code>. I use such lines with great caution, as it can be easy to get too attached to any pattern the loess line shows.</p>
<pre class="r"><code>scatter_fun = function(x, y) {
ggplot(dat, aes_string(x = x, y = y) ) +
geom_point() +
geom_smooth(method = "loess", se = FALSE, color = "grey74") +
theme_bw()
}</code></pre>
<p>Here’s an example of the function output, passing in <code>x</code> and <code>y</code> as strings.</p>
<pre class="r"><code>scatter_fun("lat", "elev")</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-08-16-automating-exploratory-plots_files/figure-html/unnamed-chunk-7-1.png" width="672" /></p>
<p><em>Aside</em>: The <code>aes_string()</code> function has been soft-deprecated as of <strong>ggplot2</strong> 3.0.0 and <code>tidyeval</code> methods are now available. For basic functions like mine this new framework is pretty straightforward to use. However, right or wrong, I’ve been hesitant to send code that contains <code>tidyeval</code> code to the beginner R users I generally work with on these tasks. I think the code looks complicated and “scary” compared to using <code>aes_string()</code>; I may change my mind with time.</p>
<p>To be thorough, here is an example of the same function using <code>tidyeval</code> instead of <code>aes_string()</code>. I use <code>sym()</code> instead of <code>quo()</code> because the inputs are strings.</p>
<p>The output graphic is the same.</p>
<pre class="r"><code>scatter_fun2 = function(x, y) {
ggplot(dat, aes(x = !!sym(x), y = !!sym(y) ) ) +
geom_point() +
geom_smooth(method = "loess", se = FALSE, color = "grey74") +
theme_bw()
}
scatter_fun2("lat", "elev")</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-08-16-automating-exploratory-plots_files/figure-html/unnamed-chunk-8-1.png" width="672" /></p>
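<p>The functions above hard-code <code>dat</code>. If you do want the dataset as a function argument, one possible version is sketched below. The function name <code>scatter_fun_data()</code> and its <code>data</code> argument are made up for this sketch, and it uses the <code>.data</code> pronoun inside <code>aes()</code> as another way to pass strings, which assumes a version of <strong>ggplot2</strong> with <code>tidyeval</code> support (3.0.0 or later).</p>
<pre class="r"><code>library(ggplot2)

# Hypothetical variant of scatter_fun() that takes the dataset as its first argument
scatter_fun_data = function(data, x, y) {
     # .data[[x]] looks up the column named by the string stored in x
     ggplot(data, aes(x = .data[[x]], y = .data[[y]]) ) +
          geom_point() +
          geom_smooth(method = "loess", se = FALSE, color = "grey74") +
          theme_bw()
}

# Works with any data.frame, shown here with the built-in mtcars
p = scatter_fun_data(mtcars, "wt", "mpg")
p</code></pre>
<p>Putting the dataset first keeps the function usable across projects, at the cost of one more argument to fix when looping.</p>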
</div>
<div id="looping-through-one-vector-of-variables" class="section level1">
<h1><a href="#looping-through-one-vector-of-variables">Looping through one vector of variables</a></h1>
<p>One way to make all the plots I want is to loop through each explanatory variable for a fixed response variable. With this approach I would need a separate loop for each response variable.</p>
<p>I will use <code>map()</code> from package <strong>purrr</strong> for the looping.</p>
<p>I pass each explanatory variable to the first argument in <code>scatter_fun()</code> and I fix the second argument to <code>"elev"</code>. I use the formula coding in <code>map()</code> and so refer to the element of the explanatory vector via <code>.x</code> within <code>scatter_fun()</code>.</p>
<pre class="r"><code>elev_plots = map(expl, ~scatter_fun(.x, "elev") )</code></pre>
<p>The output is a list of 4 plots (since there are 4 explanatory variables). You’ll notice that each element of the list has the variable name associated with it. This is why I used <code>set_names()</code> earlier, since this is convenient for printing the plots and, you’ll see later, is convenient when saving the plots in files with understandable names.</p>
<pre class="r"><code>elev_plots</code></pre>
<pre><code># $slp</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-08-16-automating-exploratory-plots_files/figure-html/unnamed-chunk-10-1.png" width="672" /></p>
<pre><code>#
# $lat</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-08-16-automating-exploratory-plots_files/figure-html/unnamed-chunk-10-2.png" width="672" /></p>
<pre><code>#
# $long</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-08-16-automating-exploratory-plots_files/figure-html/unnamed-chunk-10-3.png" width="672" /></p>
<pre><code>#
# $nt</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-08-16-automating-exploratory-plots_files/figure-html/unnamed-chunk-10-4.png" width="672" /></p>
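<p>If the formula coding is unfamiliar, it may help to see that it is shorthand for an anonymous function. The sketch below swaps the plotting function for a simple stand-in, <code>pair_label()</code>, that returns a string instead of a plot so the two results are easy to compare; the stand-in is an assumption for illustration.</p>
<pre class="r"><code>library(purrr)

expl = set_names( c("slp", "lat", "long", "nt") )

# Stand-in for scatter_fun(): same arguments, but returns a label instead of a plot
pair_label = function(x, y) {
     paste(y, "vs", x)
}

# Formula coding, where .x is the current element of expl
via_formula = map(expl, ~pair_label(.x, "elev") )

# The same loop written with an explicit anonymous function
via_function = map(expl, function(v) pair_label(v, "elev") )

identical(via_formula, via_function)</code></pre>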
</div>
<div id="looping-through-both-vectors" class="section level1">
<h1><a href="#looping-through-both-vectors">Looping through both vectors</a></h1>
<p>For only a few response variables we could easily copy and paste the code above, changing the hard-coded response variable each time. This process can get burdensome if there are a lot of response variables, though.</p>
<p>Another option is to loop through both vectors of variables and make all the plots at once. Because we want a plot for each combination of variables, this is a job for a <em>nested</em> loop. This means one <code>map()</code> loop will be nested inside another. I will refer to the first <code>map()</code> loop as the <em>outer</em> loop and the second one as the <em>inner</em> loop.</p>
<p>I’m going to have the response variables in the outer loop and the explanatory variables in the inner loop. That way I can graph all of the explanatory variables for each response variable before moving on to the next response variable. This puts the output, a nested list, in a logical order.</p>
<p>A nested loop involves more complicated code, of course. For example, it took some effort for me to wrap my head around how to refer to the list element from the outer loop within the inner loop when using the <code>map()</code> formula coding. I found the answers/comments to <a href="https://stackoverflow.com/questions/48847613/purrr-map-equivalent-of-nested-for-loop">this question</a> on Stack Overflow to be helpful. Note that one approach is to avoid the formula coding altogether and use anonymous functions for either or both the inner and outer loops.</p>
<p>Since my scatterplot function is so simple I ended up using formula coding for the outer loop and the function as is in the inner loop. The inner list elements are fed to the first argument of <code>scatter_fun()</code> by default, which works out great since the first argument is the <code>x</code> variable and the inner loop loops through the explanatory variables. The <code>.x</code> then refers to the outer list elements (the response variable names), and is passed to the <code>y</code> argument of the function in the inner loop.</p>
<pre class="r"><code>all_plots = map(response,
~map(expl, scatter_fun, y = .x) )</code></pre>
<p>The output is a list of lists. Each sublist contains all the plots for a single response variable. Because I set the names for both vectors of variable names, the inner and outer lists both have names. These names can be used to pull out individual plots.</p>
<p>For example, if I want to see all the plots for the <code>grad</code> response variable I can print that sublist by name. (I’m going to display only two of four <code>grad</code> plots here to save space.)</p>
<pre class="r"><code>all_plots$grad</code></pre>
<pre><code># $slp</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-08-16-automating-exploratory-plots_files/figure-html/unnamed-chunk-13-1.png" width="672" /></p>
<pre><code>#
# $lat</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-08-16-automating-exploratory-plots_files/figure-html/unnamed-chunk-13-2.png" width="672" /></p>
<p>If I want to print a single plot, I can first extract one of the sublists using an outer list name and then extract the individual plot via an inner list name.</p>
<pre class="r"><code>all_plots$grad$long</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-08-16-automating-exploratory-plots_files/figure-html/unnamed-chunk-14-1.png" width="672" /></p>
<p>I find the names convenient, but you can also extract plots via position. Here’s the same graph, the third element of the third list.</p>
<pre class="r"><code>all_plots[[3]][[3]]</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-08-16-automating-exploratory-plots_files/figure-html/unnamed-chunk-15-1.png" width="672" /></p>
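<p>The nested loop in this section can also be written with anonymous functions in both loops, which makes it explicit which element comes from which loop. This sketch again uses a stand-in function, <code>pair_label()</code>, that returns a string rather than a plot; the stand-in and the argument names <code>resp</code> and <code>ex</code> are assumptions for illustration.</p>
<pre class="r"><code>library(purrr)

response = set_names( c("elev", "resp", "grad") )
expl = set_names( c("slp", "lat", "long", "nt") )

# Stand-in for scatter_fun(): same arguments, but returns a label instead of a plot
pair_label = function(x, y) {
     paste(y, "vs", x)
}

# Formula coding in the outer loop, function as is in the inner loop
nested_formula = map(response,
                     ~map(expl, pair_label, y = .x) )

# Anonymous functions in both loops
nested_anon = map(response, function(resp) {
     map(expl, function(ex) pair_label(x = ex, y = resp) )
})

identical(nested_formula, nested_anon)

# Extraction by name and by position picks the same element
nested_anon$grad$long
nested_anon[[3]][[3]]</code></pre>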
<div id="saving-the-plots" class="section level2">
<h2><a href="#saving-the-plots">Saving the plots</a></h2>
<p>Once all the graphs are made we can look at them in R by printing the list or parts of the list as above. But if you want to peruse them at your leisure later or send them to a collaborator you’ll want to save them outside of R.</p>
<p>This next section is dedicated to exploring some of the ways you can do this.</p>
</div>
<div id="saving-all-plots-to-one-pdf" class="section level2">
<h2><a href="#saving-all-plots-to-one-pdf">Saving all plots to one PDF</a></h2>
<p>If you want to save every plot as a separate page in a PDF, you can do so with the <code>pdf()</code> function. The code below shows an example of how this works. First, a graphics device to save the plots into is created and given a name via <code>pdf()</code>. Then all the plots are put into that device. Finally, the device is turned off with <code>dev.off()</code>. The last step is important, as you can’t open the file until the device is turned off.</p>
<p>This is a pretty coarse way to save everything, but it allows you to easily page through all the plots. I’ve used this method when I had many exploratory plots for a single response variable that I wanted to share with collaborators.</p>
<p>In this example code I save the file, which I name <code>all_scatterplots.pdf</code>, into the working directory.</p>
<pre class="r"><code>pdf("all_scatterplots.pdf")
all_plots
dev.off()</code></pre>
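<p>One thing to keep in mind is that typing an object name only prints it automatically at the top level of an R session. Inside a function, a loop, or a script run with <code>source()</code> the plots need an explicit <code>print()</code>. Here is a minimal sketch of that, using two stand-in plots of <code>mtcars</code> and a file written to a temporary directory; the object and file names are made up for the example.</p>
<pre class="r"><code>library(ggplot2)

# Two stand-in plots, not the post's all_plots
plots = list(ggplot(mtcars, aes(wt, mpg) ) + geom_point(),
             ggplot(mtcars, aes(hp, mpg) ) + geom_point() )

# Write to a temporary directory so the working directory is left alone
out_file = file.path(tempdir(), "example_scatterplots.pdf")

pdf(out_file)
invisible( lapply(plots, print) ) # one page per printed plot
dev.off()

file.exists(out_file)</code></pre>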
</div>
<div id="saving-groups-of-plots-together" class="section level2">
<h2><a href="#saving-groups-of-plots-together">Saving groups of plots together</a></h2>
<p>Another option is to save each group of plots in a separate document. This might make sense in a case like this where there is a set of plots for each response variable and we might want a separate file for each set.</p>
<p>To save each sublist separately we’ll need to loop through <code>all_plots</code> and save the plots for each response variable into a separate file. The list names can be used in the file names to keep the output organized.</p>
<p>The functions in <strong>purrr</strong> that start with <code>i</code> are special functions that loop through a list and the names of that list simultaneously. This is useful here where we want to use the list names to identify the output files while we save them.</p>
<p>The <code>walk()</code> function is part of the <code>map</code> family, to be used when you want a function for its side effect instead of for a return value. Saving plots is a classic example of when we want <code>walk()</code> instead of <code>map()</code>.</p>
<p>Combining the <code>i</code> and the <code>walk</code> gives us the <code>iwalk()</code> function. In the formula interface, <code>.x</code> refers to the list elements and <code>.y</code> refers to the names of the list. You can see I create the plot file names using the list name combined with “scatterplots.pdf”, using <code>_</code> as the separator.</p>
<p>The code below makes three files, one for each response variable, with four plots each. The files are named “elev_scatterplots.pdf”, “resp_scatterplots.pdf”, and “grad_scatterplots.pdf”.</p>
<pre class="r"><code>iwalk(all_plots, ~{
pdf(paste0(.y, "_scatterplots.pdf") )
print(.x)
dev.off()
})</code></pre>
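<p>To get a feel for what <code>.x</code> and <code>.y</code> hold in <code>iwalk()</code>, here is a small sketch on a toy list that prints messages instead of saving files. The list <code>toy</code> is a stand-in for illustration.</p>
<pre class="r"><code>library(purrr)

# Toy named list standing in for all_plots
toy = list(elev = 1:4, resp = 1:2)

# .y is the name of each element and .x is the element itself
iwalk(toy, ~cat(.y, "_scatterplots.pdf would hold ", length(.x), " items\n", sep = "") )

# walk functions return their input invisibly, so nothing new is created
out = iwalk(toy, ~NULL)
identical(out, toy)</code></pre>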
</div>
<div id="saving-all-plots-separately" class="section level2">
<h2><a href="#saving-all-plots-separately">Saving all plots separately</a></h2>
<p>All plots can be saved separately instead of combined in a single document. This might be necessary if you want to insert the plots into some larger document later.</p>
<p>We’ll want to use the names of both the outer and inner lists to appropriately identify each plot we save. I decided to do this by looping through the <code>all_plots</code> list and the names of the list via <code>imap()</code> to make the file names in a separate step. This time I’m going to save these as PNG files, so I use <code>.png</code> at the end of the file name.</p>
<p>The result is a list of lists, so I flatten this into a single list via <code>flatten()</code>. If I were to use <code>flatten()</code> earlier in the process I’d lose the names of the outer list. This process of combining names prior to flattening should be simplified once <a href="https://github.com/tidyverse/purrr/issues/525">the proposed <code>flatten_names()</code> function</a> is added to <strong>purrr</strong>.</p>
<pre class="r"><code>plotnames = imap(all_plots, ~paste0(.y, "_", names(.x), ".png")) %>%
flatten()
plotnames</code></pre>
<pre><code># [[1]]
# [1] "elev_slp.png"
#
# [[2]]
# [1] "elev_lat.png"
#
# [[3]]
# [1] "elev_long.png"
#
# [[4]]
# [1] "elev_nt.png"
#
# [[5]]
# [1] "resp_slp.png"
#
# [[6]]
# [1] "resp_lat.png"
#
# [[7]]
# [1] "resp_long.png"
#
# [[8]]
# [1] "resp_nt.png"
#
# [[9]]
# [1] "grad_slp.png"
#
# [[10]]
# [1] "grad_lat.png"
#
# [[11]]
# [1] "grad_long.png"
#
# [[12]]
# [1] "grad_nt.png"</code></pre>
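<p>As a possible shortcut, more recent versions of <strong>purrr</strong> have a <code>list_flatten()</code> function that can combine the outer and inner names while flattening. The sketch below assumes <strong>purrr</strong> 1.0.0 or later is installed and uses a small stand-in list in place of <code>all_plots</code>.</p>
<pre class="r"><code>library(purrr) # assumes v. 1.0.0 or later for list_flatten()

# Stand-in nested list with the same shape as all_plots (numbers instead of plots)
nested = list(elev = list(slp = 1, lat = 2),
              resp = list(slp = 3, lat = 4) )

# Flatten one level, building each new name from the outer and inner names
flat = list_flatten(nested, name_spec = "{outer}_{inner}")

# The combined names can then be turned into file names
plotnames2 = paste0(names(flat), ".png")
plotnames2</code></pre>
<p>Because the flattened list keeps the combined names, the file names and the flattened list can both come from a single object.</p>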
<p>Once the file names are created I can loop through all the file names and plots simultaneously with <code>walk2()</code> and save things via <code>ggsave()</code>. The height and width of each output file can be set as needed in <code>ggsave()</code>.</p>
<p>You can see I flattened the nested list of plots into a single list to use in <code>walk2()</code>.</p>
<pre class="r"><code>walk2(plotnames, flatten(all_plots), ~ggsave(filename = .x, plot = .y,
height = 7, width = 7))</code></pre>
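<p>If you would rather not fill the working directory with image files, the file names can include a folder. This is a minimal sketch of that idea using two stand-in plots and a folder inside the temporary directory; the folder name <code>scatterplots</code> and the plot names are made up for the example.</p>
<pre class="r"><code>library(ggplot2)
library(purrr)

# Two stand-in plots in a named list
plots = list(mpg_wt = ggplot(mtcars, aes(wt, mpg) ) + geom_point(),
             mpg_hp = ggplot(mtcars, aes(hp, mpg) ) + geom_point() )

# Create an output folder (here inside the temporary directory)
out_dir = file.path(tempdir(), "scatterplots")
dir.create(out_dir, showWarnings = FALSE)

# Build full paths from the list names, then save each plot to its path
paths = file.path(out_dir, paste0(names(plots), ".png") )
walk2(paths, plots, ~ggsave(filename = .x, plot = .y,
                            height = 7, width = 7) )

file.exists(paths)</code></pre>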
</div>
<div id="combining-plots" class="section level2">
<h2><a href="#combining-plots">Combining plots</a></h2>
<p>Another way to get a set of plots together is to combine them into one plot. How useful this is will depend on how many plots you have per set. This option is a lot like faceting, except we didn’t reshape our dataset to allow the use of faceting.</p>
<p>I like the <strong>cowplot</strong> function <code>plot_grid()</code> for combining multiple plots into one. A list of plots can be passed via the <code>plotlist</code> argument.</p>
<p>Here’s what that looks like for the first response variable, <code>elev</code>.</p>
<pre class="r"><code>cowplot::plot_grid(plotlist = all_plots[[1]])</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-08-16-automating-exploratory-plots_files/figure-html/unnamed-chunk-20-1.png" width="672" /></p>
<p>We can use a loop to combine the plots for each response variable sublist. The result could then be saved using any of the approaches shown above. If you have many subplots per combined plot you likely will want to save the plots at a larger size so the individual plots can be clearly seen.</p>
<pre class="r"><code>response_plots = map(all_plots, ~cowplot::plot_grid(plotlist = .x))
response_plots</code></pre>
<pre><code># $elev</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-08-16-automating-exploratory-plots_files/figure-html/unnamed-chunk-21-1.png" width="672" /></p>
<pre><code>#
# $resp</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-08-16-automating-exploratory-plots_files/figure-html/unnamed-chunk-21-2.png" width="672" /></p>
<pre><code>#
# $grad</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-08-16-automating-exploratory-plots_files/figure-html/unnamed-chunk-21-3.png" width="672" /></p>
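<p>Here is a sketch of saving one combined plot at a larger size, as suggested above. It uses four stand-in plots of <code>mtcars</code> and writes to a temporary directory; the 12 by 12 inch size and the file name are arbitrary choices for the example.</p>
<pre class="r"><code>library(ggplot2)

# Four stand-in plots, not the post's all_plots
plots = list(ggplot(mtcars, aes(wt, mpg) ) + geom_point(),
             ggplot(mtcars, aes(hp, mpg) ) + geom_point(),
             ggplot(mtcars, aes(disp, mpg) ) + geom_point(),
             ggplot(mtcars, aes(qsec, mpg) ) + geom_point() )

combined = cowplot::plot_grid(plotlist = plots)

# Save larger than the 7 by 7 used for single plots so each panel stays readable
out_file = file.path(tempdir(), "combined_example.png")
ggsave(filename = out_file, plot = combined, width = 12, height = 12)

file.exists(out_file)</code></pre>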
</div>
</div>
Creating legends when aesthetics are constants in ggplot2
https://aosmith.rbind.io/2018/07/19/legends-constants-for-aesthetics-in-ggplot2/
Thu, 19 Jul 2018 00:00:00 +0000https://aosmith.rbind.io/2018/07/19/legends-constants-for-aesthetics-in-ggplot2/<p>In general, if you want to map an aesthetic to a variable and get a legend in <strong>ggplot2</strong> you do it inside <code>aes()</code>. If you want to set an aesthetic to a constant value, like making all the points purple, you do it outside <code>aes()</code>.</p>
<p>However, there are situations where you might want to set an aesthetic for a layer to a constant but you also want a legend for that aesthetic. One common alternative is to put your dataset into a long format to take advantage of the strengths of <strong>ggplot2</strong>, but that isn’t an option for every situation. I’ll show another approach here.</p>
<div id="the-setup" class="section level1">
<h1>The setup</h1>
<p>A few situations where we might want legends without mapping an aesthetic to a variable are:<br />
1. Adding a statistic like the mean as a line or symbol and wanting a legend to define it<br />
2. Adding separate layers for subsets of data or based on different datasets*<br />
3. Adding lines based on different fitted models</p>
<p>*<em>This second situation is where reformatting your dataset is often most useful</em></p>
<p>I’ll focus on adding lines from different models. I’m going to be using the ubiquitous <code>mtcars</code> dataset because, well, it’s easy. 😆</p>
</div>
<div id="making-a-plot-with-aesthetics-as-constant" class="section level1">
<h1>Making a plot with aesthetics as constant</h1>
<p>I’ll start by loading the <strong>ggplot2</strong> package.</p>
<pre class="r"><code>library(ggplot2) # v. 3.0.0</code></pre>
<p>I’m going to make a plot of the relationship between <code>mpg</code> and <code>hp</code>, adding three fitted lines from three different linear regression models. I will use a linear, a quadratic, and a cubic model. I use <code>geom_smooth()</code> to make the fitted regression lines, and so add a separate <code>geom_smooth()</code> layer for each model.</p>
<p>I’m going to focus on the <code>color</code> aesthetic here, but this is relevant for other aesthetics, as well.</p>
<p>You’ll see I set a different <code>color</code> per fitted line. Since I’m setting these colors as constants this is done outside <code>aes()</code>.</p>
<pre class="r"><code>ggplot(mtcars, aes(mpg, hp) ) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "black") +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, color = "red") +
geom_smooth(method = "lm", formula = y ~ poly(x, 3), se = FALSE, color = "blue")</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-07-18-creating-legends-when-using-constants-for-aesthetics-in-ggplot2_files/figure-html/unnamed-chunk-2-1.png" width="672" /></p>
<p>It would be nice to know which line came from which model, and adding a legend is one way to do that. The question is, how do we add a legend?</p>
<p>I think for many people it feels intuitive to add the appropriate <code>scale_*()</code> function to the plotting code in hopes of getting a legend. Along those lines I’ll add <code>scale_color_manual()</code> to my plot.</p>
<pre class="r"><code>ggplot(mtcars, aes(mpg, hp) ) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "black") +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, color = "red") +
geom_smooth(method = "lm", formula = y ~ poly(x, 3), se = FALSE, color = "blue") +
scale_color_manual(values = c("black", "red", "blue") )</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-07-18-creating-legends-when-using-constants-for-aesthetics-in-ggplot2_files/figure-html/unnamed-chunk-3-1.png" width="672" /></p>
<p>But nothing changes. Unfortunately, no matter how hard I throw <code>scale_color_manual()</code> at the plot, I won’t get a legend.</p>
<p>Why doesn’t this work?</p>
<p>From the description in the <code>scale_manual</code> documentation, the manual scale functions <em>allow you to specify your own set of mappings from levels in the data to aesthetic values.</em> You can change already created mappings but not <em>construct</em> them. In <strong>ggplot2</strong>, mappings are constructed by <code>aes()</code>. Aesthetics therefore must be inside <code>aes()</code> to get a legend.</p>
</div>
<div id="adding-a-legend-by-moving-aesthetics-into-aes" class="section level1">
<h1>Adding a legend by moving aesthetics into aes()</h1>
<p>I’ll move <code>color</code> inside of <code>aes()</code> within each <code>geom_smooth()</code> layer to construct color mappings. This adds a legend to the plot.</p>
<pre class="r"><code>ggplot(mtcars, aes(mpg, hp) ) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, aes(color = "black") ) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, aes(color = "red") ) +
geom_smooth(method = "lm", formula = y ~ poly(x, 3), se = FALSE, aes(color = "blue") )</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-07-18-creating-legends-when-using-constants-for-aesthetics-in-ggplot2_files/figure-html/unnamed-chunk-4-1.png" width="672" /></p>
<p>A legend is now present, but the colors have changed. The values are no longer recognized as colors since <code>aes()</code> treats these as string constants. To get the desired colors we’ll need to turn to one of the <code>scale_color_*()</code> functions.</p>
</div>
<div id="using-scale_color_identity-to-recognize-color-strings" class="section level1">
<h1>Using scale_color_identity() to recognize color strings</h1>
<p>One way to force <code>ggplot</code> to recognize the color names when they are inside <code>aes()</code> is to use <code>scale_color_identity()</code>. To get a legend with an identity scale you must use <code>guide = "legend"</code>. (The default is <code>guide = "none"</code> for identity scales.)</p>
<pre class="r"><code>ggplot(mtcars, aes(mpg, hp) ) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, aes(color = "black") ) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, aes(color = "red") ) +
geom_smooth(method = "lm", formula = y ~ poly(x, 3), se = FALSE, aes(color = "blue") ) +
scale_color_identity(guide = "legend")</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-07-18-creating-legends-when-using-constants-for-aesthetics-in-ggplot2_files/figure-html/unnamed-chunk-5-1.png" width="672" /></p>
<p>The colors are now correct but the legend still leaves a lot to be desired. The name of the legend isn’t useful, the order is alphabetical instead of by model complexity, and the labels are the color names instead of descriptive names for each model.</p>
<p>The legend name can be changed via <code>name</code>, the order can be changed via <code>breaks</code>, and the labels can be changed via <code>labels</code> in <code>scale_color_identity()</code>. The order of the <code>labels</code> must be the same as the order of the <code>breaks</code>.</p>
<p>This all means the <code>scale_color_identity()</code> code has gotten relatively more complicated. I’ve found this to be pretty standard when mapping aesthetics to constants.</p>
<pre class="r"><code>ggplot(mtcars, aes(mpg, hp) ) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, aes(color = "black") ) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, aes(color = "red") ) +
geom_smooth(method = "lm", formula = y ~ poly(x, 3), se = FALSE, aes(color = "blue") ) +
scale_color_identity(name = "Model fit",
breaks = c("black", "red", "blue"),
labels = c("Linear", "Quadratic", "Cubic"),
guide = "legend")</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-07-18-creating-legends-when-using-constants-for-aesthetics-in-ggplot2_files/figure-html/unnamed-chunk-6-1.png" width="672" /></p>
</div>
<div id="descriptive-strings-and-scale_color_manual" class="section level1">
<h1>Descriptive strings and scale_color_manual()</h1>
<p>An alternative (but not necessarily simpler 😄) approach is to use informative string names instead of the color names within <code>aes()</code>. Then we can use <code>scale_color_manual()</code> to get the legend cleaned up.</p>
<p>Here is the plot using descriptive model names instead of the color names.</p>
<pre class="r"><code>ggplot(mtcars, aes(mpg, hp) ) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, aes(color = "Linear") ) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, aes(color = "Quadratic") ) +
geom_smooth(method = "lm", formula = y ~ poly(x, 3), se = FALSE, aes(color = "Cubic") )</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-07-18-creating-legends-when-using-constants-for-aesthetics-in-ggplot2_files/figure-html/unnamed-chunk-7-1.png" width="672" /></p>
<p>This has nicer labels, but the legend has other problems, similar to those in the above <code>scale_color_identity()</code> example. The legend name isn’t informative, the order is again alphabetical instead of by model complexity, and the colors still need to be changed if we really want black, red, and blue lines. This can all be addressed in <code>scale_color_manual()</code>.</p>
<p>For the first two issues I will again use <code>name</code> and <code>breaks</code> to get things named and in the desired order.</p>
<p>Colors are set by passing a vector of color names to the <code>values</code> argument in <code>scale_color_manual()</code>. Note that <code>values</code> is a required argument in <code>scale_color_manual()</code>; if you don’t want to change the colors in the plot use <code>scale_color_discrete()</code>.</p>
<p>The vector of colors needs to either be in the same order as the <code>breaks</code> or given as a named vector. The latter is “safest” since it is invariant to changing the order of the legend, and I’ll use a named vector in my example code.</p>
<pre class="r"><code>ggplot(mtcars, aes(mpg, hp) ) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, aes(color = "Linear") ) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, aes(color = "Quadratic") ) +
geom_smooth(method = "lm", formula = y ~ poly(x, 3), se = FALSE, aes(color = "Cubic") ) +
scale_color_manual(name = "Model fit",
breaks = c("Linear", "Quadratic", "Cubic"),
values = c("Cubic" = "blue", "Quadratic" = "red", "Linear" = "black") )</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-07-18-creating-legends-when-using-constants-for-aesthetics-in-ggplot2_files/figure-html/unnamed-chunk-8-1.png" width="672" /></p>
</div>
<div id="other-examples" class="section level1">
<h1>Other examples</h1>
<p>You can see what I would consider some of the canonical questions and answers on this topic from Stack Overflow <a href="https://stackoverflow.com/questions/10349206/add-legend-to-ggplot2-line-plot">here</a> and <a href="https://stackoverflow.com/questions/17148679/construct-a-manual-legend-for-a-complicated-plot">here</a>. (I’m sure there are others, but these are two that I’ve been linking to as duplicates recently. 😺)</p>
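<p>For comparison, here is a rough sketch of the long-format alternative mentioned at the start of the post. The three models are fit with <code>lm()</code>, their predictions are stacked into one long dataset with a <code>model</code> column, and <code>color</code> is mapped to that column. The object names (<code>degrees</code>, <code>newdat</code>, <code>preds_long</code>) are made up for this sketch.</p>
<pre class="r"><code>library(ggplot2)

# Polynomial degree for each model, named by the label wanted in the legend
degrees = c(Linear = 1, Quadratic = 2, Cubic = 3)

# Grid of mpg values to predict over
newdat = data.frame(mpg = seq(min(mtcars$mpg), max(mtcars$mpg), length.out = 50) )

# Fit each model and keep its predictions along with the model name
preds = lapply(names(degrees), function(nm) {
     fit = lm(hp ~ poly(mpg, degrees[[nm]]), data = mtcars)
     data.frame(mpg = newdat$mpg,
                hp = predict(fit, newdata = newdat),
                model = nm)
})

# Stack into one long dataset; factor levels set the legend order
preds_long = do.call(rbind, preds)
preds_long$model = factor(preds_long$model, levels = names(degrees) )

# color is mapped to a variable, so the legend is built the usual way
ggplot(mtcars, aes(mpg, hp) ) +
     geom_point() +
     geom_line(data = preds_long, aes(color = model) ) +
     scale_color_manual(name = "Model fit",
                        values = c("Linear" = "black", "Quadratic" = "red", "Cubic" = "blue") )</code></pre>
<p>This takes more set-up than adding <code>geom_smooth()</code> layers, which is part of why the constant-in-<code>aes()</code> approach can be attractive for quick plots.</p>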
</div>
Simulate! Simulate! - Part3: The Poisson edition
https://aosmith.rbind.io/2018/07/18/simulate-poisson-edition/
Wed, 18 Jul 2018 00:00:00 +0000https://aosmith.rbind.io/2018/07/18/simulate-poisson-edition/<p>One of the things I like about simulations is that, with practice, they can be a quick way to check your intuition about a model or relationship.</p>
<p>My most recent example is based on a discussion with a student about quadratic effects.</p>
<p>I’ve never had a great grasp on what the coefficients that define a quadratic relationship mean. Luckily there is this very nice <a href="https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faqhow-do-i-interpret-the-sign-of-the-quadratic-term-in-a-polynomial-regression/">FAQ page</a> from the Institute for Digital Research and Education at UCLA that goes over the meaning of the coefficients in detail, with examples. This has become my go-to page when I need a review (which is apparently every time the topic comes up 😜).</p>
<p>So while we understood what the quadratic effect “meant” in the model, in this particular case the student was working with a generalized linear mixed model for count data. This model was <em>linear on the log scale</em>. If something is quadratic on the log scale (the scale of the model), what does the relationship look like on the original scale (the scale of the data)?</p>
<p>I wasn’t sure. Logically I would say that if something that is a straight line on the log scale is curved on the data scale then something that is curved on the log scale should be <em>more</em> curved on the original scale. Right? But how much more curved? Hmm.</p>
<p>Simulation to the rescue! I decided to simulate some data to see what a relationship like the one the student had estimated could look like on the original scale.</p>
<div id="the-statistical-model" class="section level1">
<h1><a href="#the-statistical-model">The statistical model</a></h1>
<p>Even though what I did was a single iteration simulation, I think it is still useful to write out the statistical model I used for building the simulated data. It has been the combination of seeing the equations and then doing simulations (and then seeing the equations in a new light 😁) that has really helped me understand generalized linear (mixed) models and so I like to include the models explicitly in these posts. But if you’re at a place where you find looking at these equations makes your eyes glaze over, <a href="#simulate-code">jump down to the code</a>.</p>
<p>The statistical model for a generalized linear model looks pretty different from the statistical model of the very special case when we assume normality. Since so many of us (like me!) learned that special case first, this different approach takes some getting used to.</p>
<p>Instead of defining the distribution of the <em>errors</em> we’ll now directly define the distribution of the response variable. (<em>For a more formal coverage of the statistical model for generalized linear (mixed) models see Stroup’s <a href="http://lira.pro.br/wordpress/wp-content/uploads/2015/06/stroup-2015.pdf">Rethinking the Analysis of Non-Normal Data in Plant and Soil Science</a>.</em>)</p>
<p>I’m going to use a Poisson generalized linear model for my simulation, so the response variable will be discrete counts. In my statistical model I first define a response variable that comes from the Poisson distribution.</p>
<p><span class="math display">\[y_t \thicksim Poisson(\lambda_t)\]</span></p>
<ul>
<li><span class="math inline">\(y_t\)</span> is the recorded count for the <span class="math inline">\(t\)</span>th observation of the discrete response variable.<br />
</li>
<li><span class="math inline">\(\lambda_t\)</span> is the unobserved true mean of the Poisson distribution for the <span class="math inline">\(t\)</span>th observation. The Poisson distribution is a single parameter distribution, where the variance is exactly equal to the mean.</li>
</ul>
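<p>A minimal sketch of the “variance equals the mean” property, assuming an arbitrary mean of 4 and a large number of draws (both are illustrative choices, not values from the student’s analysis). The sample mean and sample variance should land close to each other.</p>
<pre class="r"><code>set.seed(16)
draws = rpois(100000, lambda = 4) # Many draws from a single Poisson mean (4 is arbitrary)
mean(draws) # Should be close to 4
var(draws)  # Should also be close to 4</code></pre>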
<p>We will assume that the relationship between the <em>mean</em> of the response and any explanatory variables is linear on the log scale. This can be described as using a <em>log link</em>, since the log is the function that “links” the mean to the linear predictor. If you’re coming from the world of linear models you may be used to describing the relationship between the <em>response variable</em> and any explanatory variables, not the relationship between the <em>mean of the response variable</em> and explanatory variables.</p>
<p>The model I define here is a quadratic model for a single, continuous explanatory variable.</p>
<p><span class="math display">\[log(\lambda_t) = \beta_0 + \beta_1*x_t + \beta_2*x^2_t\]</span></p>
<ul>
<li><span class="math inline">\(x_t\)</span> is the recorded value of the <span class="math inline">\(t\)</span>th observation of the continuous explanatory variable<br />
</li>
<li><span class="math inline">\(x^2_t\)</span> is the square of <span class="math inline">\(x_t\)</span><br />
</li>
<li><span class="math inline">\(\beta_0\)</span>, <span class="math inline">\(\beta_1\)</span>, and <span class="math inline">\(\beta_2\)</span> are parameters (the intercept and slope coefficients) of the linear model</li>
</ul>
<p>If you are new to generalized linear models you might want to take a moment to note the absence of epsilon in the linear predictor.</p>
<p>Notice we can calculate the mean on the original scale instead of the log scale by exponentiating both sides of the above equation. This will be important when we get to writing code to simulate data.</p>
<p><span class="math display">\[\lambda_t = exp(\beta_0 + \beta_1*x_t + \beta_2*x^2_t)\]</span></p>
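<p>As a rough sketch of how curvature on the log scale carries over to the original scale, we can drop the subscript <span class="math inline">\(t\)</span>, treat <span class="math inline">\(\lambda\)</span> as a smooth function of <span class="math inline">\(x\)</span>, and differentiate the exponentiated equation twice.</p>
<p><span class="math display">\[\frac{d\lambda}{dx} = \lambda(\beta_1 + 2\beta_2x), \qquad \frac{d^2\lambda}{dx^2} = \lambda\left[(\beta_1 + 2\beta_2x)^2 + 2\beta_2\right]\]</span></p>
<p>On the log scale the second derivative is the constant <span class="math inline">\(2\beta_2\)</span>. On the original scale it is multiplied by <span class="math inline">\(\lambda\)</span> and picks up a squared slope term, so the curve bends most sharply wherever the mean is large.</p>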
</div>
<div id="simulation-code" class="section level1">
<h1><a href="#simulation-code">Simulation code</a></h1>
<p>The first thing I will do in this simulation is define my true parameter values. I’m simulating a relationship between x and y that is similar to the student’s results so I’ll set the intercept and the linear coefficient (<span class="math inline">\(\beta_0\)</span> and <span class="math inline">\(\beta_1\)</span>, respectively) both to 0.5 and the quadratic coefficient (<span class="math inline">\(\beta_2\)</span>) to 5.</p>
<pre class="r"><code>b0 = .5
b1 = .5
b2 = 5</code></pre>
<p>Next I need an explanatory variable, which I’ll call <code>x</code>. I decided to make this a continuous variable between 0 and 1 by taking 100 random draws from a uniform distribution with a minimum of 0 and a maximum of 1 via <code>runif()</code>. Since I’m taking 100 draws, <span class="math inline">\(t\)</span> in the statistical model goes from 1 to 100.</p>
<p>I’ll set my seed prior to generating random numbers so you’ll get identical results if you run this code.</p>
<pre class="r"><code>set.seed(16)
x = runif(100, min = 0, max = 1)
head(x) # First six values of x</code></pre>
<pre><code># [1] 0.6831101 0.2441174 0.4501114 0.2294351 0.8635079 0.3112003</code></pre>
<p>Once I have my parameters set and an explanatory variable created I can calculate <span class="math inline">\(\lambda_t\)</span>. This is where I find the statistical model to be really handy, as it directly shows me how to write the code. Because I want to calculate the means on the original scale and not the log of the means I use the model equation after exponentiating both sides.</p>
<p>I’ll simulate the 100 means via</p>
<p><span class="math display">\[\lambda_t = exp(0.5 + 0.5*x_t + 5*x^2_t)\]</span></p>
<pre class="r"><code>lambda = exp(b0 + b1*x + b2*x^2)</code></pre>
<p>The step above simulates the <em>mean</em> of each value of the response variable. These values are continuous, not discrete counts.</p>
<pre class="r"><code>head(lambda)</code></pre>
<pre><code># [1] 23.920881 2.509354 5.686283 2.405890 105.634346 3.126231</code></pre>
<p>Now that I have a vector of means I can use it to generate a count for each value of <code>lambda</code> based on the Poisson distribution. I do this via <code>rpois()</code>.</p>
<p>The next bit of code is based on the distribution defined in the statistical model. Remember that we defined <code>y</code> as:</p>
<p><span class="math display">\[y_t \thicksim Poisson(\lambda_t)\]</span></p>
<p>This is the step where we add “Poisson errors” to the mean to generate the response variable. For a fixed x variable, the variation for each simulated <code>y</code> value around the mean is based on the Poisson variance. For linear model simulations we usually add variability to the mean by simulating the errors directly from a normal distribution with a mean of 0. Since the variance is based on the mean in the Poisson distribution, adding the variability isn’t so obvious. I’ve seen this referred to as adding “Poisson noise”, but “Poisson errors” may be a better term.</p>
<p>I randomly draw 100 counts, one for each of the 100 means stored in <code>lambda</code>.</p>
<pre class="r"><code>y = rpois(100, lambda = lambda) </code></pre>
<p>Unlike <code>lambda</code>, the <code>y</code> variable is a discrete count. This is the response variable that will be used in analysis.</p>
<pre class="r"><code>head(y)</code></pre>
<pre><code># [1] 25 4 5 4 114 2</code></pre>
</div>
<div id="results" class="section level1">
<h1>Results!</h1>
<p>Now that I have simulated values for both the response and explanatory variable I can take a look at the relationship between <code>x</code> and <code>y</code>.</p>
<p>Below is what things look like on the log scale (the scale of the model). I was interested to see that, while the relationship was curved up as expected by the quadratic coefficient I used, the curve was really quite shallow.</p>
<pre class="r"><code>plot(x, log(y) )</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-07-13-simulate-poisson-edition_files/figure-html/unnamed-chunk-7-1.png" width="672" /></p>
<p>How do things look on the original scale? The curve is more extreme, much more than I realized it would be.</p>
<pre class="r"><code>plot(x, y)</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-07-13-simulate-poisson-edition_files/figure-html/unnamed-chunk-8-1.png" width="672" /></p>
<p>This was good to see, as it matched pretty well with an added variable plot the student had made. We had originally been concerned that there was some mistake in the plotting code, and being able to explore things via simulation helped allay these fears. Simulation success! 🥂</p>
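<p>As a follow-up check (not part of the original consultation), here is a minimal sketch that fits a Poisson generalized linear model with a log link to data simulated the same way, to see whether the estimates land near the true values of 0.5, 0.5, and 5. The object name <code>fit</code> is just an illustrative choice.</p>
<pre class="r"><code>set.seed(16)
x = runif(100, min = 0, max = 1)
y = rpois(100, lambda = exp(.5 + .5*x + 5*x^2) )

# Quadratic model on the log scale; I() protects the squared term in the formula
fit = glm(y ~ x + I(x^2), family = poisson(link = "log") )
coef(fit) # Estimates are on the log scale
head( predict(fit, type = "response") ) # Fitted means on the original scale</code></pre>
<p>With a single simulated dataset the estimates won’t match the true values exactly; repeating the simulation many times would show the long-run behavior.</p>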
</div>
Time after time: calculating the autocorrelation function for uneven or grouped time series
https://aosmith.rbind.io/2018/06/27/uneven-group-autocorrelation/
Wed, 27 Jun 2018 00:00:00 +0000https://aosmith.rbind.io/2018/06/27/uneven-group-autocorrelation/<p>I first learned how to check for autocorrelation via autocorrelation function (ACF) plots in R in a class on time series. However, the examples we worked on were all single, long term time series with no missing values and no groups. I figured out later that calculating the ACF when the sampling through time is uneven or there are distinct time series for independent sample units takes a bit more thought. It’s easy to mistakenly ignore such structure, which then makes it difficult to determine what sort and how much autocorrelation may be present.</p>
<p>I first ran into this problem myself when I was analyzing data from a <a href="http://methods.sagepub.com/reference/encyclopedia-of-survey-research-methods/n500.xml">rotating panel study design</a>. In the data I was working with, some units were sampled every year, some sampled every 3rd year, some every 9th year, etc. So sampling was annual, but not all sample units had observations from every year. In addition, the sample units were considered independent replicates of each time series, so any autocorrelation of concern would be <em>within</em> sample unit autocorrelation.</p>
<p>It took me some time to figure out how to check for overall residual autocorrelation from models I fit to these data. But I’m glad I took the time to do it; since then I’ve been able to share what I learned with numerous students when they were facing a similar situation with unevenness or grouped time series (or both!).</p>
<p>The kind of time series I’m talking about here are <em>discrete</em> time series, not continuous. Treating time as continuous involves a different approach, generally using “spatial” autocorrelation tools (which I may write about in some later post).</p>
<div id="simulate-data-with-autocorrelation" class="section level1">
<h1>Simulate data with autocorrelation</h1>
<p>In my working example today I’ll use data that has a pattern to the unevenness, much like the data I had from the rotating panel design. The same approach applies, though, for evenly spaced data with groups or when some sampling events are missing because of unplanned events or logistical issues.</p>
<p>Autocorrelated noise can be simulated in R using the <code>arima.sim()</code> function. <a href="https://stat.ethz.ch/pipermail/r-help/2008-July/168487.html">This thread on the R mailing list</a> helped me figure out how to do this. I’m working with the default distribution of the innovations (i.e., errors) in <code>arima.sim()</code>, which is the normal distribution with a mean of 0 and standard deviation of 1.</p>
<p>I’ll use functions from <strong>purrr</strong> to do the looping and <strong>dplyr</strong> for data manipulation.</p>
<pre class="r"><code>library(purrr) # v. 0.2.5
library(dplyr) # v. 0.7.5</code></pre>
<p>I’ll start the simulation by setting the seed. I mix things up in this post by using a seed of 64 instead of my go-to seed, 16. 😆</p>
<pre class="r"><code>set.seed(64)</code></pre>
<p>I decided to simulate a 10-observation time series for 9 different sample units. Each sample unit is an independent time series so I do a loop via <code>map_dfr()</code> to simulate 9 separate datasets and then bind them together into one.</p>
<p>I’ll simulate observations of the response variable <code>y</code> and explanatory variable <code>x</code> for each time series and index <code>time</code> with an integer to represent the time of the observation (1-10). This time variable could be something like year or day or month or even depth in a soil core or height along a tree (since we can use time series tools for space in some situations). The key is that the unit of time is discrete and evenly spaced.</p>
<p>The response variable <code>y</code> will be constructed based on the relationship it has with the explanatory variable <code>x</code> along with autoregressive order 1 (AR(1)) errors. I set the lag 1 correlation to be 0.7.</p>
<p>The <em>lag 1</em> correlation is the correlation between the set of observed values from time <span class="math inline">\(t\)</span> with the values from time <span class="math inline">\(t\text{-}\mathit{1}\)</span>. In this example the lag 1 correlation for one sample unit is the correlation of the observed values at sampling times 2-10 with those at sampling times 1-9. The lag 2 correlation would be between the observations two sampling times apart (3-10 vs 1-8), etc. With 10 observations per group the largest lag possible is lag 9.</p>
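<p>To make the lag definitions concrete, here is a minimal sketch that calculates the lag 1 and lag 2 correlations “by hand” for a single simulated AR(1) series of length 10. The object name <code>e</code> is an illustrative choice. Note that <code>acf()</code> uses a slightly different estimator (it centers on the overall mean and divides by the full series length), so its values won’t match <code>cor()</code> exactly.</p>
<pre class="r"><code>set.seed(64)
e = as.numeric( arima.sim(list(ar = .7), n = 10) ) # One AR(1) series of length 10

cor(e[2:10], e[1:9]) # Lag 1: times 2-10 vs times 1-9
cor(e[3:10], e[1:8]) # Lag 2: times 3-10 vs times 1-8</code></pre>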
<p>When you run the code to simulate the dataset below you will get warnings about not preserving the time series attribute from <code>arima.sim()</code>. I’ve suppressed the warnings in the output, but note these warnings aren’t a problem for what I’m doing here.</p>
<pre class="r"><code>dat = map_dfr(1:9,
~tibble(unit = .x,
x = runif(10, 5, 10),
y = 1 + x + arima.sim(list(ar = .7), 10),
time = 1:10)
)</code></pre>
<p>Here are the top 15 rows of this dataset.</p>
<pre class="r"><code>head(dat, 15)</code></pre>
<pre><code># # A tibble: 15 x 4
# unit x y time
# <int> <dbl> <dbl> <int>
# 1 1 5.22 7.25 1
# 2 1 9.14 11.3 2
# 3 1 5.22 7.10 3
# 4 1 9.95 10.8 4
# 5 1 5.45 5.27 5
# 6 1 9.09 8.06 6
# 7 1 9.69 9.73 7
# 8 1 8.10 8.35 8
# 9 1 7.53 9.49 9
# 10 1 7.93 9.65 10
# 11 2 9.98 12.6 1
# 12 2 6.46 8.23 2
# 13 2 8.67 9.04 3
# 14 2 6.36 7.58 4
# 15 2 8.62 9.66 5</code></pre>
<p>The dataset I made has 9 sample units, all with the full time series of length 10. I want to have three units with samples taken only every third sampling time and three with samples on only the first and last sampling time. The last three units will have samples at every time.</p>
<p>I use <code>filter()</code> twice to remove rows from some of the sample units.</p>
<pre class="r"><code>dat = dat %>%
filter(unit %in% 4:9 | time %in% c(1, 4, 7, 10) ) %>%
filter(!unit %in% 4:6 | time %in% c(1, 10) )</code></pre>
<p>Now you can see the time series is no longer even in every sample unit.</p>
<pre class="r"><code>head(dat, 15)</code></pre>
<pre><code># # A tibble: 15 x 4
# unit x y time
# <int> <dbl> <dbl> <int>
# 1 1 5.22 7.25 1
# 2 1 9.95 10.8 4
# 3 1 9.69 9.73 7
# 4 1 7.93 9.65 10
# 5 2 9.98 12.6 1
# 6 2 6.36 7.58 4
# 7 2 6.51 7.91 7
# 8 2 6.79 7.34 10
# 9 3 6.71 8.52 1
# 10 3 8.64 10.1 4
# 11 3 9.00 7.74 7
# 12 3 9.49 9.91 10
# 13 4 5.20 3.25 1
# 14 4 9.52 11.3 10
# 15 5 9.26 10.8 1</code></pre>
</div>
<div id="fit-model-and-extract-residuals" class="section level1">
<h1>Fit model and extract residuals</h1>
<p>I’m going to focus on checking for <em>residual</em> autocorrelation here, since that is what I do most often. This means checking for autocorrelation that is left over after accounting for other variables in the model. We often hope that other variables explain some of the autocorrelation. 🤞</p>
<p>I’ll fit the model of <code>y</code> vs <code>x</code> via the <code>lm()</code> function and extract the residuals to check for autocorrelation.</p>
<pre class="r"><code>fit1 = lm(y ~ x, data = dat)</code></pre>
<p>I add the residuals to the dataset to keep things organized and to get ready for the next step.</p>
<pre class="r"><code>dat$res = residuals(fit1)</code></pre>
</div>
<div id="problems-with-naively-using-acf" class="section level1">
<h1><a href="#problems-with-naively-using-acf">Problems with naively using acf()</a></h1>
<p>If we hadn’t thought about our spacing issue in our grouped dataset, the next step would be to use the <code>acf()</code> function to check for any residual autocorrelation at various time lags. The <code>acf()</code> function calculates and plots the autocorrelation function of a vector of values. That sounds like what we would want to do; so why do I say this would be naive?</p>
<p>The <code>acf()</code> function expects values to be in order by time and <em>assumes equal spacing in time</em>. This means the <code>acf()</code> function considers any two observations that are next to each other in the dataset to be 1 lag apart even though it may have been three or more years since the last observation. If we calculate the autocorrelation function directly on the residuals as they are now we would ignore those instances where we didn’t sample for several years.</p>
<p>In addition, the <code>acf()</code> function doesn’t know we have independent groups. The last observation of one time series comes immediately before the first observation of another, so <code>acf()</code> will consider these to be one lag apart even though they are unrelated.</p>
<p>If we want to work with the <code>acf()</code> function we’ll need to make sure that the dataset is spaced appropriately. To achieve this we can add empty rows to the dataset for any missing samples. In addition, we need to add rows between sample units so we don’t mistakenly treat them as if they were from the same time series.</p>
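<p>The two problems described above can be seen in a small sketch. The <code>toy</code> dataset here is hypothetical: one unit sampled at times 1, 4, 7, and 10 followed by a second unit sampled only at times 1 and 10. Taking the difference in <code>time</code> between neighboring rows shows the real gaps that <code>acf()</code> would treat as a single lag.</p>
<pre class="r"><code># Hypothetical toy dataset with uneven sampling and two independent units
toy = data.frame(unit = c(1, 1, 1, 1, 2, 2),
                 time = c(1, 4, 7, 10, 1, 10) )
diff(toy$time) # Actual time gaps between neighboring rows</code></pre>
<pre><code># [1]  3  3  3 -9  9</code></pre>
<p>None of these neighboring rows are truly one time step apart, and the negative value marks the boundary between the two units.</p>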
<p>Note that an alternative option for data like this, where we are assuming normally distributed errors, is to work with the <strong>nlme</strong> package. That package has an <code>ACF()</code> function that works on both <code>gls</code> and <code>lme</code> objects that will respect groupings.</p>
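<p>A minimal sketch of that alternative, assuming a hypothetical evenly spaced dataset with three units (the names <code>toy</code> and <code>fit_gls</code> are illustrative). The <code>form</code> argument of <code>ACF()</code> for <code>gls</code> objects takes a one-sided formula giving an integer-valued time covariate and, optionally, a grouping factor.</p>
<pre class="r"><code>library(nlme)
set.seed(64)

# Hypothetical data: 3 independent units, each observed at times 1-10
toy = data.frame(unit = factor( rep(1:3, each = 10) ),
                 time = rep(1:10, times = 3),
                 x = runif(30, 5, 10) )

# AR(1) errors simulated separately for each unit
errs = unlist( lapply(1:3, function(i) as.numeric( arima.sim(list(ar = .7), 10) ) ) )
toy$y = 1 + toy$x + errs

fit_gls = gls(y ~ x, data = toy)
acf_gls = ACF(fit_gls, form = ~ time | unit, maxLag = 9) # Lags calculated within unit
acf_gls</code></pre>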
</div>
<div id="calculate-the-maximum-lag" class="section level1">
<h1><a href="#calculate-the-maximum-lag">Calculate the maximum lag</a></h1>
<p>Since the units are considered to be independent of each other, the maximum lag to check for autocorrelation should be based on the maximum number of observations in a group. Below is one way to check on group size. Because I simulated these data I already know that the longest time series has 10 observations, but when working with real data this sort of check would be standard.</p>
<pre class="r"><code>dat %>%
count(unit)</code></pre>
<pre><code># # A tibble: 9 x 2
# unit n
# <int> <int>
# 1 1 4
# 2 2 4
# 3 3 4
# 4 4 2
# 5 5 2
# 6 6 2
# 7 7 10
# 8 8 10
# 9 9 10</code></pre>
<p>Printing out the result wouldn’t be useful for a large number of groups, so here’s an alternative to look at just the maximum group size.</p>
<pre class="r"><code>dat %>%
count(unit) %>%
filter(n == max(n) )</code></pre>
<pre><code># # A tibble: 3 x 2
# unit n
# <int> <int>
# 1 7 10
# 2 8 10
# 3 9 10</code></pre>
</div>
<div id="order-the-dataset-by-time" class="section level1">
<h1>Order the dataset by time</h1>
<p>Since the <code>acf()</code> function is expecting a vector that is in order by time, always make sure things are in order prior to using <code>acf()</code>. I would use <code>arrange()</code> for this, putting things in order by group and then time within group.</p>
<pre class="r"><code>dat = dat %>%
arrange(unit, time)</code></pre>
</div>
<div id="pad-the-dataset-with-na" class="section level1">
<h1><a href="#pad-the-dataset-with-na">Pad the dataset with NA</a></h1>
<p>When I was working through this problem for the first time, I found these <a href="http://www.unc.edu/courses/2010spring/ecol/562/001/docs/lectures/lecture10.htm#testing">lecture notes</a> that showed the basic idea of adding observations in order to get the spacing between groups right. In my own work I originally achieved this using some joins. Since then I’ve switched to using the <code>complete()</code> function from package <strong>tidyr</strong> for this task.</p>
<pre class="r"><code>library(tidyr) # v. 0.8.1</code></pre>
<p>You must have a variable in the dataset that indexes time (or whatever the autocorrelation is measured along) for this approach to work. (This may seem obvious, but I’ve seen datasets that rely on the order of the dataset rather than having a specific time variable.) In this case that variable is <code>time</code>.</p>
<p>I group the dataset by <code>unit</code> prior to using <code>complete()</code> so every group has rows added to it. I define what values of <code>time</code> I want to be in the dataset within <code>complete()</code>. These values are based on (1) the sampling times present in the dataset and (2) the maximum group size. I need more rows between groups than the maximum lag I’m going to check for autocorrelation. The maximum lag I will explore is lag 9, so I will add 10 extra rows between each sample unit in the dataset.</p>
<p>Since I’m working with an integer variable for <code>time</code> I can make the full sequence I want in each group via <code>1:20</code> (but also see <code>tidyr::full_seq()</code>). This means that I will add rows for times 1 through 20 for every sample unit in the dataset.</p>
<pre class="r"><code>dat_expand = dat %>%
group_by(unit) %>%
complete(time = 1:20) </code></pre>
<p>Here is an example of what the first group looks like in the newly expanded dataset. It has rows added for the times that weren’t sampled along with 10 rows of <code>NA</code> at the end.</p>
<pre class="r"><code>filter(dat_expand, unit == 1)</code></pre>
<pre><code># # A tibble: 20 x 5
# # Groups: unit [1]
# unit time x y res
# <int> <int> <dbl> <dbl> <dbl>
# 1 1 1 5.22 7.25 0.889
# 2 1 2 NA NA NA
# 3 1 3 NA NA NA
# 4 1 4 9.95 10.8 -0.746
# 5 1 5 NA NA NA
# 6 1 6 NA NA NA
# 7 1 7 9.69 9.73 -1.57
# 8 1 8 NA NA NA
# 9 1 9 NA NA NA
# 10 1 10 7.93 9.65 0.290
# 11 1 11 NA NA NA
# 12 1 12 NA NA NA
# 13 1 13 NA NA NA
# 14 1 14 NA NA NA
# 15 1 15 NA NA NA
# 16 1 16 NA NA NA
# 17 1 17 NA NA NA
# 18 1 18 NA NA NA
# 19 1 19 NA NA NA
# 20 1 20 NA NA NA</code></pre>
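<p>The <code>1:20</code> sequence used above could also be built from the observed sampling times. Here is a minimal sketch with <code>tidyr::full_seq()</code>, assuming a hypothetical vector of observed times called <code>obs_times</code> and a padding of 10 extra time points.</p>
<pre class="r"><code>library(tidyr)

obs_times = c(1, 4, 7, 10) # Sampling times observed for one hypothetical unit
full_seq(obs_times, period = 1) # Every time between the min and max, by 1

# Extend the sequence by 10 extra time points to pad between units
pad_times = full_seq( c(obs_times, max(obs_times) + 10), period = 1)
range(pad_times)</code></pre>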
</div>
<div id="plot-autocorrelation-function-of-appropriately-spaced-residuals" class="section level1">
<h1>Plot autocorrelation function of appropriately-spaced residuals</h1>
<p>Now that things are spaced appropriately and in order by time, I can calculate and plot the residual autocorrelation function via <code>acf()</code>, using the residuals in the expanded dataset.</p>
<p>Note the use of <code>na.action = na.pass</code>, which is what makes this approach work. The <code>lag.max</code> argument is used to set the maximum lag at which to calculate autocorrelation.</p>
<p>This is now a plot we can use to check for the presence and amount of autocorrelation. 🎉</p>
<pre class="r"><code>acf(dat_expand$res, lag.max = 9, na.action = na.pass, ci = 0)</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-06-21-time-after-time-uneven-grouped-autocorrelation_files/figure-html/unnamed-chunk-15-1.png" width="672" /></p>
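<p>To see why <code>na.action = na.pass</code> matters, here is a minimal sketch on a short hypothetical vector of padded residuals (<code>res_toy</code>, with made-up values). With the default <code>na.action = na.fail</code> the <code>acf()</code> function stops with an error when it finds <code>NA</code>, while <code>na.pass</code> keeps the <code>NA</code> placeholders so the spacing is respected.</p>
<pre class="r"><code># Hypothetical padded residuals: observed at times 1, 4, 7, 10
res_toy = c(0.9, NA, NA, -0.7, NA, NA, -1.6, NA, NA, 0.3)

# Default na.action = na.fail gives an error; catch it and return a message instead
tryCatch(acf(res_toy, plot = FALSE),
         error = function(e) "acf() failed: missing values present")

# na.pass lets acf() use whatever complete pairs exist at each lag
acf_toy = acf(res_toy, lag.max = 3, na.action = na.pass, plot = FALSE)
acf_toy</code></pre>
<p>In this toy vector only lag 3 has any complete pairs, so the lag 1 and lag 2 values are not informative.</p>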
</div>
<div id="the-confidence-interval-on-the-acf-plot" class="section level1">
<h1>The confidence interval on the ACF plot</h1>
<p>I removed the confidence interval from the plot above with <code>ci = 0</code>. The confidence intervals calculated by <code>acf()</code> use the NA values as part of the sample size for each lag, so the confidence interval is too narrow.</p>
<p>You can build a confidence interval based on the actual number of observations at each lag, but you have to calculate it yourself. I refer you again to those <a href="http://www.unc.edu/courses/2010spring/ecol/562/001/docs/lectures/lecture10.htm#acf">lecture notes</a> for some ideas on how to do this. I’ll put code I’ve used for this in the past below as an example, but I won’t walk through it.</p>
<pre class="r"><code>( nall = map_df(1:9,
~dat %>%
group_by(unit) %>%
arrange(unit, time) %>%
summarise(lag = list( diff(time, lag = .x ) ) )
) %>%
unnest(lag) %>%
group_by(lag) %>%
summarise(n = n() ) )</code></pre>
<pre><code># # A tibble: 9 x 2
# lag n
# <int> <int>
# 1 1 27
# 2 2 24
# 3 3 30
# 4 4 18
# 5 5 15
# 6 6 18
# 7 7 9
# 8 8 6
# 9 9 9</code></pre>
<p>Here’s the ACF plot with 95% CI added via <code>lines()</code>.</p>
<pre class="r"><code>acf(dat_expand$res, lag.max = 9, na.action = na.pass, ci = 0)
lines(1:9,-qnorm(1-.025)/sqrt(nall$n), lty = 2)
lines(1:9, qnorm(1-.025)/sqrt(nall$n), lty = 2)</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-06-21-time-after-time-uneven-grouped-autocorrelation_files/figure-html/unnamed-chunk-17-1.png" width="672" /></p>
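<p>Written out, the bounds drawn by the two <code>lines()</code> calls above are, for each lag <span class="math inline">\(k\)</span>,</p>
<p><span class="math display">\[\pm \frac{z_{0.975}}{\sqrt{n_k}}\]</span></p>
<p>where <span class="math inline">\(z_{0.975} \approx 1.96\)</span> is the value returned by <code>qnorm(1-.025)</code> and <span class="math inline">\(n_k\)</span> is the count for lag <span class="math inline">\(k\)</span> stored in the <code>n</code> column of <code>nall</code>. This is the usual approximate interval for a correlation when there is no true autocorrelation, so lags with few pairs get wider bounds.</p>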
</div>
A closer look at replicate() and purrr::map() for simulations
https://aosmith.rbind.io/2018/06/05/a-closer-look-at-replicate-and-purrr/
Tue, 05 Jun 2018 00:00:00 +0000https://aosmith.rbind.io/2018/06/05/a-closer-look-at-replicate-and-purrr/<p>I’ve done a couple of posts so far on simulations, <a href="https://aosmith.rbind.io/2018/01/09/simulate-simulate-part1/">here</a> and <a href="https://aosmith.rbind.io/2018/04/23/simulate-simulate-part-2/">here</a>, where I demonstrate how to build a function for simulating data from a defined linear model and then explore long-run behavior of models fit to the simulated datasets. The focus of those posts was on the general simulation process, and I didn’t go into much detail on the specific R code. In this post I’ll focus in on the code I use for repeatedly simulating data and extracting output, specifically talking about the function <code>replicate()</code> and the <em>map</em> family of functions from package <strong>purrr</strong>.</p>
<div id="the-replicate-function" class="section level1">
<h1>The replicate() function</h1>
<p>The <code>replicate()</code> function is a member of the <em>apply</em> family of functions in base R.<br />
Specifically, from the documentation:</p>
<blockquote>
<p><code>replicate</code> is a wrapper for the common use of <code>sapply</code> for repeated evaluation of an expression (which will usually involve random number generation).</p>
</blockquote>
<p>Notice the documentation mentions <em>repeated evaluations</em> and that the use of <code>replicate()</code> involves <em>random number generation</em>. Those are primary parts of the simulations I do. While I don’t actually know the <em>apply</em> family of functions very well, I use <code>replicate()</code> a lot (although also see <code>purrr::rerun()</code>). Using <code>replicate()</code> is an alternative to building a <code>for()</code> loop to repeatedly simulate new values.</p>
<p>The <code>replicate()</code> function takes three arguments:</p>
<ul>
<li><code>n</code>, which is the number of replications to perform. This is where I set the number of simulations I want to run.<br />
</li>
<li><code>expr</code>, the expression that should be run repeatedly. I’ve only ever used a function here.<br />
</li>
<li><code>simplify</code>, which controls the type of output the results of <code>expr</code> are saved into. Use <code>simplify = FALSE</code> to get vectors saved into a list instead of in an array.</li>
</ul>
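<p>Since the documentation describes <code>replicate()</code> as a wrapper for <code>sapply()</code>, here is a minimal sketch comparing the two. The object names are illustrative. Both calls start from the same seed, so they should draw the same random numbers.</p>
<pre class="r"><code>set.seed(16)
out_rep = replicate(n = 3, rnorm(5, 0, 1) )

set.seed(16)
out_sap = sapply(1:3, function(i) rnorm(5, 0, 1) ) # "i" is ignored inside the function

all.equal(out_rep, out_sap) # Compare the two results</code></pre>
<p>The <code>replicate()</code> version saves me from writing a function with an argument I never use.</p>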
<div id="simple-example-of-replicate" class="section level2">
<h2>Simple example of replicate()</h2>
<p>Let’s say I wanted to simulate some values from a normal distribution, which I can do using the <code>rnorm()</code> function. Below I’ll simulate five values from a normal distribution with a mean of 0 and a standard deviation of 1 (which are the defaults for <code>mean</code> and <code>sd</code> arguments, respectively).</p>
<p>Since I’m going to generate random numbers I’ll set the seed so anyone following along at home will see the same values.</p>
<pre class="r"><code>set.seed(16)
rnorm(5, mean = 0, sd = 1)</code></pre>
<pre><code># [1] 0.4764134 -0.1253800 1.0962162 -1.4442290 1.1478293</code></pre>
<p>Using <code>rnorm()</code> directly gives me a single set of simulated values. How do I simulate 5 values from this same distribution multiple times? This is where <code>replicate()</code> comes in. It allows me to run the function I put in <code>expr</code> exactly <code>n</code> times.</p>
<p>Here I’ll ask for three runs of 5 values each. Notice I use <code>simplify = FALSE</code> to get a list as output.</p>
<p>The output below is a list of three vectors. Each vector is from a unique run of the function, so contains five random numbers drawn from the normal distribution with a mean of 0 and standard deviation of 1.</p>
<pre class="r"><code>set.seed(16)
replicate(n = 3, rnorm(5, 0, 1), simplify = FALSE )</code></pre>
<pre><code># [[1]]
# [1] 0.4764134 -0.1253800 1.0962162 -1.4442290 1.1478293
#
# [[2]]
# [1] -0.46841204 -1.00595059 0.06356268 1.02497260 0.57314202
#
# [[3]]
# [1] 1.8471821 0.1119334 -0.7460373 1.6582137 0.7217206</code></pre>
<p>Note if I don’t use <code>simplify = FALSE</code> I will get a matrix of values instead of a list. Each column in the matrix is the output from one run of the function. In this case there will be three columns in the output, one for each run, and 5 rows. This can be a useful output type for simulations. I focus on list output throughout the rest of this post only because that’s what I have been using recently for simulations.</p>
<pre class="r"><code>set.seed(16)
replicate(n = 3, rnorm(5, 0, 1) )</code></pre>
<pre><code># [,1] [,2] [,3]
# [1,] 0.4764134 -0.46841204 1.8471821
# [2,] -0.1253800 -1.00595059 0.1119334
# [3,] 1.0962162 0.06356268 -0.7460373
# [4,] -1.4442290 1.02497260 1.6582137
# [5,] 1.1478293 0.57314202 0.7217206</code></pre>
</div>
<div id="an-equivalent-for-loop-example" class="section level2">
<h2>An equivalent for() loop example</h2>
<p>A <code>for()</code> loop can be used in place of <code>replicate()</code> for simulations. With time and practice I’ve found <code>replicate()</code> to be much more convenient in terms of writing the code. However, in my experience some folks find <code>for()</code> loops intuitive when they are starting out in R. I think it’s because <code>for()</code> loops are more explicit on the looping process: the user can see that <code>i</code> is looped over and the output for each <code>i</code> iteration is saved into the output object because the code is written out explicitly.</p>
<p>In my example I’ll save the output of each iteration of the loop into a list called <code>list1</code>. I initialize this as an empty list prior to starting the loop. To match what I did with <code>replicate()</code> I do three iterations of the loop (<code>i in 1:3</code>), drawing 5 values via <code>rnorm()</code> each time.</p>
<p>The result is identical to my <code>replicate()</code> code above. It took a little more code to do it but the process is very clear since it is explicitly written out.</p>
<pre class="r"><code>set.seed(16)
list1 = list() # Make an empty list to save output in
for (i in 1:3) { # Indicate number of iterations with "i"
list1[[i]] = rnorm(5, 0, 1) # Save output in list for each iteration
}
list1</code></pre>
<pre><code># [[1]]
# [1] 0.4764134 -0.1253800 1.0962162 -1.4442290 1.1478293
#
# [[2]]
# [1] -0.46841204 -1.00595059 0.06356268 1.02497260 0.57314202
#
# [[3]]
# [1] 1.8471821 0.1119334 -0.7460373 1.6582137 0.7217206</code></pre>
</div>
<div id="using-replicate-on-a-user-made-function" class="section level2">
<h2>Using replicate() on a user-made function</h2>
<p>When I do simulations to explore the behavior of linear models under different scenarios I will create a function to simulate the data and fit the model. For example, here’s a function I used in an <a href="https://aosmith.rbind.io/2018/01/09/simulate-simulate-part1/">earlier blog post</a> to simulate data from and then fit a two group linear model.</p>
<pre class="r"><code>twogroup_fun = function(nrep = 10, b0 = 5, b1 = -2, sigma = 2) {
ngroup = 2
group = rep( c("group1", "group2"), each = nrep)
eps = rnorm(ngroup*nrep, 0, sigma)
growth = b0 + b1*(group == "group2") + eps
growthfit = lm(growth ~ group)
growthfit
}</code></pre>
<p>The output is a model fit to data generated from the fixed and random (residual) effects.</p>
<pre class="r"><code>twogroup_fun()</code></pre>
<pre><code>#
# Call:
# lm(formula = growth ~ group)
#
# Coefficients:
# (Intercept) groupgroup2
# 4.686 -1.267</code></pre>
<p>To explore the long-run behavior in my simulated scenario I will repeat the data generation and model fitting many times using <code>replicate()</code>. The result is a list of fitted models. I’ll run the function 5 times and save the result as <code>sim_lm</code> to use throughout the next section on <code>map()</code>.</p>
<pre class="r"><code>sim_lm = replicate(5, twogroup_fun(), simplify = FALSE )
length(sim_lm)</code></pre>
<pre><code># [1] 5</code></pre>
</div>
</div>
<div id="using-purrrmap-for-looping-through-lists" class="section level1">
<h1>Using purrr::map() for looping through lists</h1>
<p>So I have a list of fitted models from <code>replicate()</code>; now what?</p>
<p>The <code>replicate()</code> function was about repeatedly running a function. Once I have the repeated runs I can explore the long-run behavior of some statistic by extracting value(s) from the resulting models. This involves looping through the list of models.</p>
<p>Looping through the list can be done using a <code>for()</code> loop, but I prefer to use functions that do the looping without all the typing. In particular, these days I use the <em>map</em> family of functions from the <strong>purrr</strong> package to loop through lists. Before <strong>purrr</strong> I primarily used <code>lapply()</code> (the only other <em>apply</em> family function that I know 😆).</p>
<p>The <code>map()</code> function takes a list as input and puts the output into a list of the same length. The first argument to <code>map()</code> is the list to loop through and the second argument is the function to apply to each element of the list.</p>
<p>For example, I can pull out the coefficients of each model in my 5-run simulation by looping through <code>sim_lm</code> and applying the <code>coef()</code> function to each list element.</p>
<pre class="r"><code>library(purrr) # v. 0.2.5
map(sim_lm, coef)</code></pre>
<pre><code># [[1]]
# (Intercept) groupgroup2
# 5.189474 -1.715602
#
# [[2]]
# (Intercept) groupgroup2
# 4.670188 -1.965463
#
# [[3]]
# (Intercept) groupgroup2
# 5.231922 -2.589953
#
# [[4]]
# (Intercept) groupgroup2
# 6.285158 -3.195090
#
# [[5]]
# (Intercept) groupgroup2
# 4.3296875 -0.9724314</code></pre>
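<p>Since <code>lapply()</code> came up, here is a small sketch showing that the two functions do the same job for this kind of task. To keep it self-contained it fits two toy models to the built-in <code>mtcars</code> dataset rather than reusing <code>sim_lm</code>; the names <code>toy_fits</code>, <code>coef_map</code>, and <code>coef_lapply</code> are only for illustration.</p>

```r
library(purrr)

# Two toy models stored in a list (stand-ins for simulation output)
toy_fits = list( lm(mpg ~ wt, data = mtcars),
                 lm(mpg ~ hp, data = mtcars) )

coef_map = map(toy_fits, coef)       # purrr version
coef_lapply = lapply(toy_fits, coef) # base R version

# Both are lists with one vector of coefficients per model
all.equal(coef_map, coef_lapply)
```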
<div id="other-variants-of-map-for-non-list-outputs" class="section level2">
<h2>Other variants of map() for non-list outputs</h2>
<p>There are many variants of <code>map()</code> that are convenient for saving results into something other than a list. For example, if I am going to extract a single numeric value from each model, such as <span class="math inline">\(R^2\)</span>, I might want the output to be a numeric vector instead of a list. I can use <code>map_dbl()</code> for this.</p>
<p>The unadjusted <span class="math inline">\(R^2\)</span> from a model fit with <code>lm()</code> can be pulled from the model <code>summary()</code> output. The code looks like: <code>summary(model)$r.squared</code>,<br />
where “model” is a fitted model.</p>
<p>So getting <span class="math inline">\(R^2\)</span> involves extracting a value after applying a function, which isn’t quite as straightforward as applying a single function to every model in the list like I did with <code>coef()</code>. This gives me a chance to demonstrate the formula coding style available in <em>map</em> functions. In formula coding a tilde (<code>~</code>) goes in front of the function and <code>.x</code> refers to the list element.</p>
<pre class="r"><code>map_dbl(sim_lm, ~summary(.x)$r.squared)</code></pre>
<pre><code># [1] 0.22823549 0.16199867 0.25730022 0.38591045 0.06375695</code></pre>
<p>If you don’t like the formula style you can use an anonymous function inside <em>map</em> functions, where the function argument is used to refer to the list element.</p>
<pre class="r"><code>map_dbl(sim_lm, function(x) summary(x)$r.squared)</code></pre>
<pre><code># [1] 0.22823549 0.16199867 0.25730022 0.38591045 0.06375695</code></pre>
<p>For data.frame output we can use <code>map_dfr()</code> for row binding or <em>stacking</em> results together into a single data.frame.</p>
<p>Estimated coefficients, their standard errors, and their statistical tests from models fit with <code>lm()</code> can be extracted into a tidy data.frame using <code>broom::tidy()</code>. Looping through the results and doing this for each model via <code>map_dfr()</code> will put the output in one data.frame instead of storing the individual data.frames for each model as one element of a list.</p>
<pre class="r"><code>map_dfr(sim_lm, broom::tidy)</code></pre>
<pre><code># term estimate std.error statistic p.value
# 1 (Intercept) 5.1894736 0.5257947 9.869772 1.092300e-08
# 2 groupgroup2 -1.7156023 0.7435860 -2.307201 3.314134e-02
# 3 (Intercept) 4.6701884 0.7450412 6.268362 6.535353e-06
# 4 groupgroup2 -1.9654632 1.0536474 -1.865390 7.851233e-02
# 5 (Intercept) 5.2319216 0.7333769 7.134015 1.203499e-06
# 6 groupgroup2 -2.5899532 1.0371516 -2.497179 2.243919e-02
# 7 (Intercept) 6.2851581 0.6717450 9.356464 2.460846e-08
# 8 groupgroup2 -3.1950902 0.9499909 -3.363285 3.461704e-03
# 9 (Intercept) 4.3296875 0.6210667 6.971372 1.641131e-06
# 10 groupgroup2 -0.9724314 0.8783210 -1.107148 2.828066e-01</code></pre>
<p>The <code>map_dfr()</code> function has an additional argument, <code>.id</code>, which can be used to store the list names (if the original list had names) or add the list index to the output (if it didn’t have names). I’m using a list that has no names, so each unique model output will be assigned its index number if I use the <code>.id</code> argument. The name of the new column is given as a string to <code>.id</code>.</p>
<pre class="r"><code>map_dfr(sim_lm, broom::tidy, .id = "model")</code></pre>
<pre><code># model term estimate std.error statistic p.value
# 1 1 (Intercept) 5.1894736 0.5257947 9.869772 1.092300e-08
# 2 1 groupgroup2 -1.7156023 0.7435860 -2.307201 3.314134e-02
# 3 2 (Intercept) 4.6701884 0.7450412 6.268362 6.535353e-06
# 4 2 groupgroup2 -1.9654632 1.0536474 -1.865390 7.851233e-02
# 5 3 (Intercept) 5.2319216 0.7333769 7.134015 1.203499e-06
# 6 3 groupgroup2 -2.5899532 1.0371516 -2.497179 2.243919e-02
# 7 4 (Intercept) 6.2851581 0.6717450 9.356464 2.460846e-08
# 8 4 groupgroup2 -3.1950902 0.9499909 -3.363285 3.461704e-03
# 9 5 (Intercept) 4.3296875 0.6210667 6.971372 1.641131e-06
# 10 5 groupgroup2 -0.9724314 0.8783210 -1.107148 2.828066e-01</code></pre>
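<p>For comparison, here is a sketch of what <code>.id</code> does when the list <em>does</em> have names. The list <code>named_list</code>, its names, and the column name <code>"scenario"</code> are all hypothetical; <code>identity()</code> simply returns each data.frame unchanged.</p>

```r
library(purrr)

# A small named list of data.frames (the names "low" and "high" are made up)
named_list = list( low = data.frame(x = 1:2),
                   high = data.frame(x = 3:4) )

# With .id, the list names are stored in a new column called "scenario"
res = map_dfr(named_list, identity, .id = "scenario")
res
```

<p>Each row of the stacked output is labelled with the name of the list element it came from, rather than with an index number.</p>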
<p>Further arguments to the function used within <code>map()</code> can be passed as additional arguments. For example, I can add confidence intervals for estimated coefficients when using the <code>tidy.lm()</code> function via <code>conf.int = TRUE</code>. If I want to get confidence intervals for all models I add this as an additional argument in <code>map_dfr()</code>.</p>
<pre class="r"><code>map_dfr(sim_lm, broom::tidy, conf.int = TRUE)</code></pre>
<pre><code># term estimate std.error statistic p.value conf.low
# 1 (Intercept) 5.1894736 0.5257947 9.869772 1.092300e-08 4.084820
# 2 groupgroup2 -1.7156023 0.7435860 -2.307201 3.314134e-02 -3.277819
# 3 (Intercept) 4.6701884 0.7450412 6.268362 6.535353e-06 3.104915
# 4 groupgroup2 -1.9654632 1.0536474 -1.865390 7.851233e-02 -4.179094
# 5 (Intercept) 5.2319216 0.7333769 7.134015 1.203499e-06 3.691154
# 6 groupgroup2 -2.5899532 1.0371516 -2.497179 2.243919e-02 -4.768928
# 7 (Intercept) 6.2851581 0.6717450 9.356464 2.460846e-08 4.873874
# 8 groupgroup2 -3.1950902 0.9499909 -3.363285 3.461704e-03 -5.190947
# 9 (Intercept) 4.3296875 0.6210667 6.971372 1.641131e-06 3.024875
# 10 groupgroup2 -0.9724314 0.8783210 -1.107148 2.828066e-01 -2.817715
# conf.high
# 1 6.2941272
# 2 -0.1533862
# 3 6.2354619
# 4 0.2481678
# 5 6.7726893
# 6 -0.4109786
# 7 7.6964420
# 8 -1.1992333
# 9 5.6345003
# 10 0.8728526</code></pre>
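<p>The same pass-through mechanism works with any function. Here is a minimal sketch using <code>mean()</code> and its <code>na.rm</code> argument on a toy list; the object names <code>vals</code>, <code>means_na</code>, and <code>means</code> are hypothetical.</p>

```r
library(purrr)

# A toy list of numeric vectors, each containing a missing value
vals = list( c(1, NA, 3), c(4, 5, NA) )

# Without na.rm the mean of each element is NA
means_na = map_dbl(vals, mean)
means_na

# na.rm = TRUE is passed along to mean() for every list element
means = map_dbl(vals, mean, na.rm = TRUE)
means
```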
<p>The <em>map</em> family of functions can easily be used with pipes as one step in a chain of functions. I can, for example, take the estimates I get using <code>broom::tidy</code>, pull out the estimated intercepts, and then plot a histogram of those estimates. I’ll need packages <strong>dplyr</strong> and <strong>ggplot2</strong> for this.</p>
<pre class="r"><code>suppressPackageStartupMessages( library(dplyr) ) # v. 0.7.5
library(ggplot2) # v 2.2.1</code></pre>
<p>You can see all the steps in the pipe chain below. I loop through <code>sim_lm</code> using <code>map_dfr()</code> to extract the coefficients from each element of the list and output a data.frame of results. I use <code>dplyr::filter()</code> to keep only the rows with estimated intercepts and then plot a histogram of these estimates for the whole simulation with <code>ggplot2::qplot()</code>.</p>
<pre class="r"><code>sim_lm %>%
map_dfr(broom::tidy) %>%
filter(term == "(Intercept)") %>%
qplot(x = estimate, data = ., geom = "histogram")</code></pre>
<pre><code># `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-06-05-a-closer-look-at-replicate-and-purrr_files/figure-html/unnamed-chunk-15-1.png" width="672" /></p>
<p>There are more variants of <code>map()</code> that you might find useful, both for simulations and in other iterative work. See the documentation for <code>map()</code> (<code>?map</code>) to see all of them along with additional examples.</p>
</div>
</div>
Simulate! Simulate! - Part 2: A linear mixed model
https://aosmith.rbind.io/2018/04/23/simulate-simulate-part-2/
Mon, 23 Apr 2018 00:00:00 +0000https://aosmith.rbind.io/2018/04/23/simulate-simulate-part-2/<p>I feel like I learn something every time I start simulating new data to update an assignment or explore a question from a client via simulation. I’ve seen instances where residual autocorrelation isn’t detectable when I <em>know</em> it exists (because I simulated it) or I have skewed residuals and/or unequal variances when I simulated residuals from a normal distribution with a single variance. Such results are often due to small sample sizes, which even in this era of big data still aren’t so unusual in ecology. I’ve found exploring the effect of sample size on statistical results to be quite eye-opening 👁.</p>
<p>In <a href="https://aosmith.rbind.io/2018/01/09/simulate-simulate-part1/">my first simulation post</a> I showed how to simulate data for a basic linear model. Today I’ll be talking about how to simulate data for a linear mixed model. So I’m still working with normally distributed residuals but will add an additional level of variation.</p>
<div id="simulate-simulate-dance-to-the-music" class="section level1">
<h1>Simulate, simulate, dance to the music</h1>
<p>I learned the basics of linear mixed models in a class where we learned how to analyze data from “classically designed” experiments. We spent a lot of time writing down the various statistical models in mathematical notation and then fitting said models in SAS. I felt like I understood the basics of mixed models when the class was over (and swore I was done with all the <span class="math inline">\(i\)</span> and <span class="math inline">\(j\)</span> mathematical notation once and for all 😆).</p>
<p>It wasn’t until I started working with clients and teaching labs on mixed models in R that I learned how to do simulations to understand how well such models worked under various scenarios. These simulations took me to a whole new level of understanding of these models and the meaning of all that pesky mathematical notation.</p>
</div>
<div id="the-statistical-model" class="section level1">
<h1>The statistical model</h1>
<p><strong>Danger, equations below!</strong></p>
<p>If you don’t have a good understanding of the statistical model (or even if you do), writing it out in mathematical notation can actually be pretty useful. I’ll write a relatively simple model with random effects below. (Note that I only have an overall mean in this model; I’ll do a short example at the end of the post to show how one can simulate additional fixed effects.)</p>
<p>The study design that is the basis of my model has two different sizes of study units. I’m using a classic forestry example, where stands of trees are somehow chosen for sampling and then multiple plots within each stand are measured. This is a design with two levels, stands and plots; we could add a third level if individual trees were measured in each plot.</p>
<p>Everything today will be perfectly balanced, so the same number of plots will be sampled in each stand.</p>
<p>I’m using (somewhat sloppy) “regression” style notation instead of experimental design notation, where the <span class="math inline">\(t\)</span> indexes the observations.</p>
<p><span class="math display">\[y_t = \mu + (b_s)_t + \epsilon_t\]</span></p>
<ul>
<li><span class="math inline">\(y_t\)</span> is the recorded value for the <span class="math inline">\(t\)</span>th observation of the quantitative response variable; <span class="math inline">\(t\)</span> goes from 1 to the number of observations in the dataset. Since plot is the level of observation in this example (i.e., we have a single observation for each plot), <span class="math inline">\(t\)</span> indexes both the number of plots and the number of rows in the dataset.</li>
<li><span class="math inline">\(\mu\)</span> is the overall mean response</li>
<li><span class="math inline">\(b_s\)</span> is the (random) effect of the <span class="math inline">\(s\)</span>th stand on the response. <span class="math inline">\(s\)</span> goes from 1 to the total number of stands sampled. The stand-level random effects are assumed to come from an iid normal distribution with a mean of 0 and some shared, stand-level variance, <span class="math inline">\(\sigma^2_s\)</span>: <span class="math inline">\(b_s \thicksim N(0, \sigma^2_s)\)</span></li>
<li><span class="math inline">\(\epsilon_t\)</span> is the observation-level random effect (the residual error term). Since plots are the level of observation in my scenario, this is essentially the effect of each plot measurement on the response. These are assumed to come from an iid normal distribution with a mean of 0 and some shared variance, <span class="math inline">\(\sigma^2\)</span>: <span class="math inline">\(\epsilon_t \thicksim N(0, \sigma^2)\)</span></li>
</ul>
</div>
<div id="a-single-simulation-for-the-two-level-model" class="section level1">
<h1>A single simulation for the two-level model</h1>
<p>Let’s jump in and start simulating, as I find the statistical model and all those words I used trying to explain it become clearer once we have a simulated dataset to look at.</p>
<p>I couldn’t think of a good name for a plot-level response 😜 so I’ll call the response variable <code>resp</code>, the stands <code>stand</code> and the plots <code>plot</code>.</p>
<p>I’ll start by setting the seed so these results can be exactly reproduced.</p>
<pre class="r"><code>set.seed(16)</code></pre>
<p>I need to define the “truth” in the simulation by setting all the parameters in the statistical model to a value of my choosing. Here’s what I’ll do today.</p>
<ul>
<li>The true mean (<span class="math inline">\(\mu\)</span>) will be 10</li>
<li>The stand-level variance (<span class="math inline">\(\sigma^2_s\)</span>) will be set at 4, so the standard deviation (<span class="math inline">\(\sigma_s\)</span>) is 2.</li>
<li>The observation-level random effect variance (<span class="math inline">\(\sigma^2\)</span>) will be set at 1, so the standard deviation (<span class="math inline">\(\sigma\)</span>) is 1.</li>
</ul>
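<p>It can help to see what these choices imply. Because the stand effect and the observation-level effect are independent in the model, their variances add. The short sketch below works through that arithmetic; the object names are made up for this example.</p>

```r
# Variances as defined in the list above (object names are hypothetical)
var_stand = 4 # stand-level variance
var_resid = 1 # observation-level (residual) variance

# Total variance of a single observation under the model
var_total = var_stand + var_resid
var_total

# Proportion of the total variance that is due to stands
# (often called the intraclass correlation)
icc = var_stand / var_total
icc
```

<p>With these values the total variance is 5 and stands account for 80% of it, so plots within the same stand should look much more alike than plots from different stands.</p>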
<p>I’ll define the number of groups and number of replicates per group while I’m at it. I’ll use 5 stands and 4 plots per stand. The total number of plots (and so observations) is the number of stands times the number of plots per stand: <code>5*4 = 20</code>.</p>
<pre class="r"><code>nstand = 5
nplot = 4
mu = 10
sds = 2
sd = 1</code></pre>
<p>I need to create a <code>stand</code> variable, containing unique names for the five sampled stands. I use capital letters for this. Each stand name will be repeated four times, because each one was measured four times (i.e., there are four plots in each stand).</p>
<pre class="r"><code>( stand = rep(LETTERS[1:nstand], each = nplot) )</code></pre>
<pre><code># [1] "A" "A" "A" "A" "B" "B" "B" "B" "C" "C" "C" "C" "D" "D" "D" "D" "E"
# [18] "E" "E" "E"</code></pre>
<p>I can make a <code>plot</code> variable, as well, although it’s not needed for modelling since we have a single value per plot. It is fairly common to give plots the same name in each stand (i.e., plots are named 1-4 in each stand), but I’m a big believer in giving plots unique names. I’ll name plots uniquely using lowercase letters. There are a total of 20 plots.</p>
<pre class="r"><code>( plot = letters[1:(nstand*nplot)] )</code></pre>
<pre><code># [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
# [18] "r" "s" "t"</code></pre>
<p>Now I will simulate the stand-level random effects. I defined these as <span class="math inline">\(b_s \thicksim N(0, \sigma^2_s)\)</span>, so will randomly draw from a normal distribution with a mean of 0 and standard deviation of 2 (remember that <code>rnorm()</code> in R uses standard deviation, not variance). I have five stands, so I draw five values.</p>
<pre class="r"><code>( standeff = rnorm(nstand, 0, sds) )</code></pre>
<pre><code># [1] 0.9528268 -0.2507600 2.1924324 -2.8884581 2.2956586</code></pre>
<p>Every plot in a stand has the same “stand effect”, which I simulated with the five values above. This means that the stand itself shifts the measured response higher or lower, relative to other stands, by the same amount for every plot in that stand. So the “stand effect” must be repeated for every plot in a stand.</p>
<p>The <code>stand</code> variable I made helps me know how to repeat the stand effect values. Based on that variable, every stand effect needs to be repeated four times in a row (once for each plot).</p>
<pre class="r"><code>( standeff = rep(standeff, each = nplot) )</code></pre>
<pre><code># [1] 0.9528268 0.9528268 0.9528268 0.9528268 -0.2507600 -0.2507600
# [7] -0.2507600 -0.2507600 2.1924324 2.1924324 2.1924324 2.1924324
# [13] -2.8884581 -2.8884581 -2.8884581 -2.8884581 2.2956586 2.2956586
# [19] 2.2956586 2.2956586</code></pre>
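<p>The <code>each</code> argument matters here. A small sketch with toy values (the names <code>eff</code>, <code>by_each</code>, and <code>by_times</code> are hypothetical) shows how it differs from <code>times</code>, which would line the effects up with the wrong stands.</p>

```r
eff = c(10, 20, 30) # toy "stand effects" for three stands

# each: every value is repeated in a row, matching how stand is laid out
by_each = rep(eff, each = 2)
by_each

# times: the whole vector is repeated, which would misalign effects and stands
by_times = rep(eff, times = 2)
by_times
```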
<p>The observation-level random effect is simulated the same way as for a linear model. Every unique plot measurement has some effect on the response, and that effect is drawn from a normal distribution with a mean of 0 and a standard deviation of 1 (<span class="math inline">\(\epsilon_t \thicksim N(0, \sigma^2)\)</span>).</p>
<p>I make 20 draws from this distribution, one for every plot/observation.</p>
<pre class="r"><code>( ploteff = rnorm(nstand*nplot, 0, sd) )</code></pre>
<pre><code># [1] -0.46841204 -1.00595059 0.06356268 1.02497260 0.57314202
# [6] 1.84718210 0.11193337 -0.74603732 1.65821366 0.72172057
# [11] -1.66308050 0.57590953 0.47276012 -0.54273166 1.12768707
# [16] -1.64779762 -0.31417395 -0.18268157 1.47047849 -0.86589878</code></pre>
<p>I’m going to put all of these variables in a dataset together. This helps me keep things organized for modelling but, more importantly for learning how to do simulations, I think this helps demonstrate how every stand has an overall effect (repeated for every observation in that stand) and every plot has a unique effect. This becomes clear when you peruse the 20-row dataset shown below.</p>
<pre class="r"><code>( dat = data.frame(stand, standeff, plot, ploteff) )</code></pre>
<pre><code># stand standeff plot ploteff
# 1 A 0.9528268 a -0.46841204
# 2 A 0.9528268 b -1.00595059
# 3 A 0.9528268 c 0.06356268
# 4 A 0.9528268 d 1.02497260
# 5 B -0.2507600 e 0.57314202
# 6 B -0.2507600 f 1.84718210
# 7 B -0.2507600 g 0.11193337
# 8 B -0.2507600 h -0.74603732
# 9 C 2.1924324 i 1.65821366
# 10 C 2.1924324 j 0.72172057
# 11 C 2.1924324 k -1.66308050
# 12 C 2.1924324 l 0.57590953
# 13 D -2.8884581 m 0.47276012
# 14 D -2.8884581 n -0.54273166
# 15 D -2.8884581 o 1.12768707
# 16 D -2.8884581 p -1.64779762
# 17 E 2.2956586 q -0.31417395
# 18 E 2.2956586 r -0.18268157
# 19 E 2.2956586 s 1.47047849
# 20 E 2.2956586 t -0.86589878</code></pre>
<p>I now have the fixed values of the parameters, the variable <code>stand</code> to represent the random effect in a model, and the simulated effects of stands and plots drawn from their defined distributions. That’s all the pieces I need to calculate my response variable.</p>
<p>The statistical model</p>
<p><span class="math display">\[y_t = \mu + (b_s)_t + \epsilon_t\]</span></p>
<p>is my guide for how to combine these pieces to create the simulated response variable, <span class="math inline">\(y_t\)</span>. Notice I call the simulated response variable <code>resp</code>.</p>
<pre class="r"><code>( dat$resp = with(dat, mu + standeff + ploteff ) )</code></pre>
<pre><code># [1] 10.484415 9.946876 11.016389 11.977799 10.322382 11.596422 9.861173
# [8] 9.003203 13.850646 12.914153 10.529352 12.768342 7.584302 6.568810
# [15] 8.239229 5.463744 11.981485 12.112977 13.766137 11.429760</code></pre>
<p>It’s time for model fitting! I can fit a model with two sources of variation (stand and plot) with, e.g., the <code>lmer()</code> function from package <strong>lme4</strong>.</p>
<pre class="r"><code>library(lme4)</code></pre>
<pre><code># Loading required package: Matrix</code></pre>
<p>The results for the estimated overall mean and standard deviations of random effects in this model look pretty similar to my defined parameter values.</p>
<pre class="r"><code>fit1 = lmer(resp ~ 1 + (1|stand), data = dat)
fit1</code></pre>
<pre><code># Linear mixed model fit by REML ['lmerMod']
# Formula: resp ~ 1 + (1 | stand)
# Data: dat
# REML criterion at convergence: 72.5943
# Random effects:
# Groups Name Std.Dev.
# stand (Intercept) 2.168
# Residual 1.130
# Number of obs: 20, groups: stand, 5
# Fixed Effects:
# (Intercept)
# 10.57</code></pre>
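<p>To work with the estimates rather than read them off the printed output, <strong>lme4</strong> provides extractor functions such as <code>fixef()</code> and <code>VarCorr()</code>. The sketch below re-creates the simulated dataset in compact form so that it runs on its own; it is meant as an illustration, not a replacement for the step-by-step code above.</p>

```r
library(lme4)

# Re-create the simulated dataset in compact form
set.seed(16)
nstand = 5
nplot = 4
stand = rep(LETTERS[1:nstand], each = nplot)
standeff = rep( rnorm(nstand, 0, 2), each = nplot)
ploteff = rnorm(nstand*nplot, 0, 1)
dat = data.frame(stand, resp = 10 + standeff + ploteff)

fit1 = lmer(resp ~ 1 + (1|stand), data = dat)

# Estimated overall mean
fixef(fit1)

# Variances (vcov column) and standard deviations (sdcor column)
as.data.frame( VarCorr(fit1) )
```

<p>In a balanced design like this one, the estimated overall mean is simply the mean of the response.</p>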
</div>
<div id="make-a-function-for-the-simulation" class="section level1">
<h1>Make a function for the simulation</h1>
<p>A single simulation can help us understand the statistical model, but usually the goal of a simulation is to see how the model behaves over the long run. To repeat this simulation many times in R we’ll want to “functionize” the data simulating and model fitting process.</p>
<p>In my function I’m going to set all the arguments to the parameter values as I defined them above. I allow some flexibility, though, so the argument values can be changed if I want to explore the simulation with, say, a different number of replications or different standard deviations at either level.</p>
<p>This function returns a linear mixed model fit with <code>lmer()</code>.</p>
<pre class="r"><code>twolevel_fun = function(nstand = 5, nplot = 4, mu = 10, sigma_s = 2, sigma = 1) {
standeff = rep( rnorm(nstand, 0, sigma_s), each = nplot)
stand = rep(LETTERS[1:nstand], each = nplot)
ploteff = rnorm(nstand*nplot, 0, sigma)
resp = mu + standeff + ploteff
dat = data.frame(stand, resp)
lmer(resp ~ 1 + (1|stand), data = dat)
}</code></pre>
<p>I test the function, using the same <code>seed</code>, to make sure things are working as expected and that I get the same results as above.</p>
<pre class="r"><code>set.seed(16)
twolevel_fun()</code></pre>
<pre><code># Linear mixed model fit by REML ['lmerMod']
# Formula: resp ~ 1 + (1 | stand)
# Data: dat
# REML criterion at convergence: 72.5943
# Random effects:
# Groups Name Std.Dev.
# stand (Intercept) 2.168
# Residual 1.130
# Number of obs: 20, groups: stand, 5
# Fixed Effects:
# (Intercept)
# 10.57</code></pre>
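<p>A variation that can be handy for checking the data-generating step on its own is a function that returns the dataset instead of the fitted model. This is only a sketch: <code>twolevel_dat()</code> is a hypothetical name. It also builds stand names with <code>paste0()</code>, which is my substitution rather than the original approach, because <code>LETTERS</code> holds only 26 values and indexing past the 26th returns <code>NA</code>.</p>

```r
# Same data-generating steps as twolevel_fun(), but returning the data.
# Stand names come from paste0() (an assumption of this sketch, not the
# original code) so that more than 26 stands can be named.
twolevel_dat = function(nstand = 5, nplot = 4, mu = 10, sigma_s = 2, sigma = 1) {
  standeff = rep( rnorm(nstand, 0, sigma_s), each = nplot)
  stand = rep( paste0("stand", seq_len(nstand) ), each = nplot)
  ploteff = rnorm(nstand*nplot, 0, sigma)
  resp = mu + standeff + ploteff
  data.frame(stand, resp)
}

set.seed(16)
dat_check = twolevel_dat(nstand = 30, nplot = 2)

nrow(dat_check)               # number of rows is nstand*nplot
anyNA(dat_check$stand)        # checks for missing stand names
sum( is.na(LETTERS[1:30]) )   # how many of 30 stands LETTERS could not name
```

<p>This is worth keeping in mind when <code>nstand</code> is set above 26: rows with a missing grouping value are dropped under R’s default <code>na.action</code>, so stands beyond the 26th would not contribute to the fitted model.</p>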
</div>
<div id="repeat-the-simulation-many-times" class="section level1">
<h1>Repeat the simulation many times</h1>
<p>Now that I have a working function to simulate data and fit the model, it’s time to do the simulation many times. The model from each individual simulation is saved to allow exploration of long-run model performance.</p>
<p>This is a task for <code>replicate()</code>, which repeatedly calls a function and saves the output. The output is a list, which is convenient for going through to extract elements from the models later. I’ll re-run the simulation 100 times as an example, although I will do 1000 runs later when I explore the long-run performance of variance estimates.</p>
<pre class="r"><code>sims = replicate(100, twolevel_fun() )
sims[[100]]</code></pre>
<pre><code># Linear mixed model fit by REML ['lmerMod']
# Formula: resp ~ 1 + (1 | stand)
# Data: dat
# REML criterion at convergence: 58.0201
# Random effects:
# Groups Name Std.Dev.
# stand (Intercept) 1.7197
# Residual 0.7418
# Number of obs: 20, groups: stand, 5
# Fixed Effects:
# (Intercept)
# 7.711</code></pre>
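<p>Whether <code>replicate()</code> returns a list depends on what is being repeated, since by default it tries to simplify results into a vector or array when it can. The sketch below uses plain numeric draws to show the difference; passing <code>simplify = FALSE</code> is one way to ask for a list explicitly. The object names are hypothetical.</p>

```r
set.seed(16)

# Default: results that can be simplified are collapsed (here into a matrix
# with one column per run)
simp = replicate(3, rnorm(2) )
is.matrix(simp)
dim(simp)

# simplify = FALSE: the output is a list with one element per run
nosimp = replicate(3, rnorm(2), simplify = FALSE)
is.list(nosimp)
length(nosimp)
```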
</div>
<div id="extract-results-from-the-linear-mixed-model" class="section level1">
<h1>Extract results from the linear mixed model</h1>
<p>After running all the models we will want to extract whatever we are interested in. The <code>tidy()</code> function from package <strong>broom</strong> can be used to conveniently extract both fixed and random effects.</p>
<p>Below is an example on the practice model. You’ll notice there are no p-values for fixed effects. If those are desired and the degrees of freedom can be calculated, see packages <strong>lmerTest</strong> and <a href="https://github.com/bbolker/broom.mixed"><strong>broom.mixed</strong></a>.</p>
<pre class="r"><code>library(broom)
tidy(fit1)</code></pre>
<pre><code># term estimate std.error statistic group
# 1 (Intercept) 10.570880 1.002051 10.54924 fixed
# 2 sd_(Intercept).stand 2.168186 NA NA stand
# 3 sd_Observation.Residual 1.130493 NA NA Residual</code></pre>
<p>If we want to extract only the fixed effects:</p>
<pre class="r"><code>tidy(fit1, effects = "fixed")</code></pre>
<pre><code># term estimate std.error statistic
# 1 (Intercept) 10.57088 1.002051 10.54924</code></pre>
<p>And for the random effects, which can be pulled out as variances via <code>scales</code> instead of the default standard deviations:</p>
<pre class="r"><code>tidy(fit1, effects = "ran_pars", scales = "vcov")</code></pre>
<pre><code># term group estimate
# 1 var_(Intercept).stand stand 4.701030
# 2 var_Observation.Residual Residual 1.278014</code></pre>
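<p>The two scales are related by squaring, which can be checked by hand using the standard deviations printed earlier. The object names below are hypothetical.</p>

```r
# Standard deviations as printed by tidy(fit1) above
sd_stand = 2.168186
sd_resid = 1.130493

# Squaring them gives the variances reported with scales = "vcov",
# up to rounding in the printed values
sd_stand^2
sd_resid^2
```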
</div>
<div id="explore-the-effect-of-sample-size-on-variance-estimation" class="section level1">
<h1>Explore the effect of sample size on variance estimation</h1>
<p>Today I’ll look at how well we estimate variances of random effects for different sample sizes. I’ll simulate data for sampling 5 stands, 20 stands, and 100 stands.</p>
<p>I’m going to load some helper packages for this, including <strong>purrr</strong> for looping, <strong>dplyr</strong> for data manipulation tasks, and <strong>ggplot2</strong> for plotting.</p>
<pre class="r"><code>library(purrr) # v. 0.2.4
suppressPackageStartupMessages( library(dplyr) ) # v. 0.7.4
library(ggplot2) # v. 2.2.1</code></pre>
<p>I’m going to loop through a vector of the three stand sample sizes and then simulate data and fit a model 1000 times for each one. I’m using <strong>purrr</strong> functions for this, and I end up with a list of lists (1000 models for each sample size). It takes a minute or two to fit the 3000 models.</p>
<pre class="r"><code>stand_sims = c(5, 20, 100) %>%
set_names() %>%
map(~replicate(1000, twolevel_fun(nstand = .x) ) )</code></pre>
<p>Next I’ll pull out the stand variance for each model via <code>tidy()</code>.</p>
<p>I use <code>modify_depth()</code> to work on the nested (innermost) list, and then row bind the nested lists into a data.frame to get things in a convenient format for plotting. I finish by filtering things to keep only the <code>stand</code> variance, as I extracted both stand and residual variances from the model.</p>
<pre class="r"><code>stand_vars = stand_sims %>%
modify_depth(2, ~tidy(.x, effects = "ran_pars", scales = "vcov") ) %>%
map_dfr(bind_rows, .id = "stand_num") %>%
filter(group == "stand")
head(stand_vars)</code></pre>
<pre><code># stand_num term group estimate
# 1 5 var_(Intercept).stand stand 2.528572
# 2 5 var_(Intercept).stand stand 7.701715
# 3 5 var_(Intercept).stand stand 4.194774
# 4 5 var_(Intercept).stand stand 3.282683
# 5 5 var_(Intercept).stand stand 12.880303
# 6 5 var_(Intercept).stand stand 4.623168</code></pre>
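<p>If <code>modify_depth()</code> is unfamiliar, it may help to see it as shorthand for one <code>map()</code> inside another. Here is a minimal sketch on a toy list of lists; the list <code>nested</code> and the doubling function are hypothetical stand-ins for the list of models and <code>tidy()</code>.</p>

```r
library(purrr)

# A toy list of lists standing in for the nested simulation output
nested = list( a = list(1, 2), b = list(3, 4) )

# modify_depth() applies the function two levels down
res_depth = modify_depth(nested, 2, function(v) v * 2)

# The same thing written as a map() inside a map()
res_nested = map(nested, function(inner) map(inner, function(v) v * 2) )

all.equal(res_depth, res_nested)
```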
<p>Let’s take a look at the distributions of the variances for each sample size via density plots. We know the true variance is 4, so I’ll add a vertical line at 4.</p>
<pre class="r"><code>ggplot(stand_vars, aes(x = estimate) ) +
geom_density(fill = "blue", alpha = .25) +
facet_wrap(~stand_num) +
geom_vline(xintercept = 4)</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-04-23-simulate-simulate-part-2-a-linear-mixed-model_files/figure-html/unnamed-chunk-21-1.png" width="672" /></p>
<p>Whoops, I need to get my factor levels in order.</p>
<pre class="r"><code>stand_vars = mutate(stand_vars, stand_num = forcats::fct_inorder(stand_num) )</code></pre>
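<p>The reason the facets came out in an odd order is that the sample sizes are stored as text, and text is sorted character by character. A base R sketch of the problem and of what ordering by appearance does (the vector <code>stand_num</code> here is a toy copy of the three labels):</p>

```r
# Sample sizes stored as character, like the stand_num column
stand_num = c("5", "20", "100")

# Default factor levels are sorted as text, so "100" comes first
lev_default = levels( factor(stand_num) )
lev_default

# Setting the levels by order of appearance instead
lev_inorder = levels( factor(stand_num, levels = unique(stand_num) ) )
lev_inorder
```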
<p>I’ll also add some clearer labels for the facets.</p>
<pre class="r"><code>add_prefix = function(string) {
paste("Number stands:", string, sep = " ")
}</code></pre>
<p>And finally I’ll add the median of each distribution as a second vertical line.</p>
<pre class="r"><code>groupmed = stand_vars %>%
group_by(stand_num) %>%
summarise(mvar = median(estimate) )</code></pre>
<p>Looking at the plots we can really see how poorly we can estimate variances when we have few replications. When only 5 stands are sampled, the variance can be estimated as low as 0 and as high as ~18 (😮) when it’s really 4.</p>
<p>By the time we have 20 stands things look better, and things look quite good with 100 stands (although notice variance still ranges from 1 to ~8).</p>
<pre class="r"><code>ggplot(stand_vars, aes(x = estimate) ) +
geom_density(fill = "blue", alpha = .25) +
facet_wrap(~stand_num, labeller = as_labeller(add_prefix) ) +
geom_vline(aes(xintercept = 4, linetype = "True variance"), size = .5 ) +
geom_vline(data = groupmed, aes(xintercept = mvar, linetype = "Median variance"),
size = .5) +
theme_bw() +
scale_linetype_manual(name = "", values = c(2, 1) ) +
theme(legend.position = "bottom",
legend.key.width = unit(.1, "cm") ) +
labs(x = "Estimated Variance", y = NULL)</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-04-23-simulate-simulate-part-2-a-linear-mixed-model_files/figure-html/unnamed-chunk-25-1.png" width="672" /></p>
<p>Here are some additional descriptive statistics of the distribution of variances in each group to complement the info in the plot.</p>
<pre class="r"><code>stand_vars %>%
group_by(stand_num) %>%
summarise_at("estimate", funs(min, mean, median, max) )</code></pre>
<pre><code># # A tibble: 3 x 5
# stand_num min mean median max
# <fct> <dbl> <dbl> <dbl> <dbl>
# 1 5 0 4.04 3.39 17.6
# 2 20 0.994 4.03 3.88 9.15
# 3 100 1.27 4.04 3.92 8.71</code></pre>
<p>So how much of the distribution is below 4 for each estimate?</p>
<p>We can see for 5 samples that the variance is definitely underestimated more often than it is overestimated; almost 60% of the distribution is below the true variance of 4.</p>
<p>Every time I do this sort of simulation I am newly surprised that even large samples tend to underestimate variances slightly more often than overestimate them.</p>
<pre class="r"><code>stand_vars %>%
group_by(stand_num) %>%
summarise(mean(estimate < 4) )</code></pre>
<pre><code># # A tibble: 3 x 2
# stand_num `mean(estimate < 4)`
# <fct> <dbl>
# 1 5 0.577
# 2 20 0.532
# 3 100 0.525</code></pre>
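<p>One way to build intuition for this pattern is to look at a simpler case. Variance estimates can’t go below 0 but can be very large, so their sampling distribution is right-skewed and the median sits below the mean. The sketch below uses ordinary sample variances from <code>var()</code> rather than mixed model estimates, so it is an analogy and not a re-run of the simulation above.</p>

```r
set.seed(16)

# 10000 sample variances, each from 5 draws of a normal distribution
# with a true variance of 4 (standard deviation of 2)
samp_vars = replicate(10000, var( rnorm(5, mean = 0, sd = 2) ) )

mean(samp_vars)     # average of the estimates
median(samp_vars)   # median of the estimates
mean(samp_vars < 4) # proportion of estimates below the true variance

# Theoretical proportion below the truth for this simple case, using the
# chi-squared distribution with 5 - 1 = 4 degrees of freedom
pchisq(4, df = 4)
```

<p>In this simple setting the theoretical proportion is about 0.59, which is in the same neighborhood as the proportion found for 5 stands.</p>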
<p>My take home from all of this? We may need to be cautious with results for studies where the goal is to make inference about the estimates of the variances or for testing variables measured at the level of the largest unit when the number of units is small.</p>
</div>
<div id="an-actual-mixed-model-with-fixed-effects-this-time" class="section level1">
<h1>An actual mixed model (with fixed effects this time)</h1>
<p>I simplified things above so much that I didn’t have any fixed effects variables. We can certainly include fixed effects in a simulation.</p>
<p>Below is a quick example. I’ll create two continuous variables, one measured at the stand level and one measured at the plot level, that have linear relationships with the mean response variable. The study is the same as the one I defined for the previous simulation.</p>
<p>Here’s the new statistical model.</p>
<p><span class="math display">\[y_t = \beta_0 + \beta_1*(Elevation_s)_t + \beta_2*Slope_t + (b_s)_t + \epsilon_t\]</span></p>
<p>Where</p>
<ul>
<li><span class="math inline">\(\beta_0\)</span> is the mean response when both Elevation and Slope are 0<br />
</li>
<li><span class="math inline">\(\beta_1\)</span> is the change in mean response for a 1-unit change in elevation. Elevation is measured at the stand level, so all plots in a stand share a single value of elevation.<br />
</li>
<li><span class="math inline">\(\beta_2\)</span> is the change in mean response for a 1-unit change in slope. Slope is measured at the plot level, so every plot potentially has a unique value of slope.</li>
</ul>
<p>Setting the values for the three new parameters and simulating values for the continuous explanatory variables will be additional steps in the simulation. The random effects are simulated the same way as before.</p>
<p>I define the new parameters below.</p>
<ul>
<li>The intercept (<span class="math inline">\(\beta_0\)</span>) will be -1</li>
<li>The coefficient for elevation (<span class="math inline">\(\beta_1\)</span>) will be set to 0.005</li>
<li>The coefficient for slope (<span class="math inline">\(\beta_2\)</span>) will be set to 0.1</li>
</ul>
<pre class="r"><code>nstand = 5
nplot = 4
b0 = -1
b1 = .005
b2 = .1
sds = 2
sd = 1</code></pre>
<p>Here are the variables I simulated previously.</p>
<pre class="r"><code>set.seed(16)
stand = rep(LETTERS[1:nstand], each = nplot)
standeff = rep( rnorm(nstand, 0, sds), each = nplot)
ploteff = rnorm(nstand*nplot, 0, sd)</code></pre>
<p>I will simulate the explanatory variables by randomly drawing from uniform distributions via <code>runif()</code>. I change the minimum and maximum values of the uniform distribution as needed to get an appropriate spread for a given variable. If the distributions of your explanatory variables are more skewed, you could use a different distribution (like the Gamma distribution).</p>
<p>First I simulate values for elevation. This variable has only five values, as it is a stand-level variable. I need to repeat each value for the four plots measured in each stand like I did when making the <code>stand</code> variable.</p>
<pre class="r"><code>( elevation = rep( runif(nstand, 1000, 1500), each = nplot) )</code></pre>
<pre><code># [1] 1468.339 1468.339 1468.339 1468.339 1271.581 1271.581 1271.581
# [8] 1271.581 1427.050 1427.050 1427.050 1427.050 1166.014 1166.014
# [15] 1166.014 1166.014 1424.256 1424.256 1424.256 1424.256</code></pre>
<p>I can simulate slope the same way, pulling random values from a uniform distribution with different limits. The slope is measured at the plot level, so I have one value for every plot in the dataset.</p>
<pre class="r"><code>( slope = runif(nstand*nplot, 2, 75) )</code></pre>
<pre><code># [1] 48.45477 60.37014 59.58588 44.76939 61.88313 20.29559 71.51617
# [8] 42.35035 63.67044 36.43613 10.58778 14.62304 35.11192 13.19697
# [15] 9.29946 46.56462 34.68245 53.61456 37.00606 42.30044</code></pre>
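<p>If slope were right-skewed rather than roughly uniform, the Gamma distribution mentioned above is one option. This is only a sketch: the shape and scale values are made up for illustration (they give a mean of shape*scale = 30), and the object name <code>slope_skewed</code> is hypothetical.</p>
<pre class="r"><code>nstand = 5
nplot = 4

set.seed(16)
# Hypothetical shape and scale; adjust to match the skew of a real variable
slope_skewed = rgamma(nstand*nplot, shape = 2, scale = 15)
summary(slope_skewed)</code></pre>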
<p>We now have everything we need to create the response variable.</p>
<p>Based on our equation <span class="math display">\[y_t = \beta_0 + \beta_1*(Elevation_s)_t + \beta_2*Slope_t + (b_s)_t + \epsilon_t\]</span></p>
<p>the response variable will be calculated via</p>
<pre class="r"><code>( resp2 = b0 + b1*elevation + b2*slope + standeff + ploteff )</code></pre>
<pre><code># [1] 11.671585 12.325584 13.316671 12.796432 11.868602 8.983889 12.370697
# [8] 8.596145 16.352939 12.693014 7.723378 10.365894 5.925563 2.718576
# [15] 3.999244 4.950275 11.571009 13.595712 13.588022 11.781083</code></pre>
<p>Now we can fit a mixed model for <code>resp2</code> with <code>elevation</code> and <code>slope</code> as fixed effects, <code>stand</code> as the random effect and the residual error term based on plot-to-plot variation. (<em>Notice I didn’t put these variables in a dataset, which I usually like to do to keep things organized and to avoid problems of vectors in my environment getting overwritten by mistake.</em>)</p>
<p>We can see some of the estimates in this one model aren’t very similar to our set values, and doing a full simulation would allow us to explore the variation in the estimates. For example, I expect the coefficient for elevation, based on only five values, will be extremely unstable.</p>
<pre class="r"><code>lmer(resp2 ~ elevation + slope + (1|stand) )</code></pre>
<pre><code># Linear mixed model fit by REML ['lmerMod']
# Formula: resp2 ~ elevation + slope + (1 | stand)
# REML criterion at convergence: 81.9874
# Random effects:
# Groups Name Std.Dev.
# stand (Intercept) 1.099
# Residual 1.165
# Number of obs: 20, groups: stand, 5
# Fixed Effects:
# (Intercept) elevation slope
# -21.31463 0.02060 0.09511</code></pre>
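<p>The two ideas mentioned above, keeping the simulated variables in a dataset and running a full simulation to explore variation in the estimates, could be combined along the following lines. This is a minimal sketch rather than a definitive implementation: the function name <code>sim_fit()</code>, the object names, and the choice of 50 repetitions are all assumptions made for illustration. With so few stands, some fits may print singular-fit messages.</p>
<pre class="r"><code>library(lme4)

# Hypothetical helper: simulate one dataset and fit the mixed model to it
sim_fit = function(nstand = 5, nplot = 4, b0 = -1, b1 = .005, b2 = .1,
                   sds = 2, sd = 1) {
  # Keep all simulated variables together in one data.frame
  dat = data.frame(stand = rep(LETTERS[1:nstand], each = nplot),
                   standeff = rep( rnorm(nstand, 0, sds), each = nplot),
                   ploteff = rnorm(nstand*nplot, 0, sd),
                   elevation = rep( runif(nstand, 1000, 1500), each = nplot),
                   slope = runif(nstand*nplot, 2, 75) )
  dat$resp2 = with(dat, b0 + b1*elevation + b2*slope + standeff + ploteff)
  lmer(resp2 ~ elevation + slope + (1|stand), data = dat)
}

set.seed(16)
# Repeat the simulation and pull out the elevation coefficient from each fit
fits = replicate(50, sim_fit(), simplify = FALSE)
elev_coefs = sapply(fits, function(m) fixef(m)[["elevation"]])
summary(elev_coefs)</code></pre>
<p>The spread of <code>elev_coefs</code> around the true value of 0.005 gives a sense of how unstable a coefficient based on only five stand-level values can be.</p>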
</div>
Unstandardizing coefficients from a GLMM
https://aosmith.rbind.io/2018/03/26/unstandardizing-coefficients/
Mon, 26 Mar 2018 00:00:00 +0000https://aosmith.rbind.io/2018/03/26/unstandardizing-coefficients/<p>Winter term grades are in and I can once again scrape together some time to write blog posts! 🎉</p>
<p>The last post I did about making <a href="https://aosmith.rbind.io/2018/01/31/added-variable-plots/">added variable plots</a> led me to think about other “get model results” topics, such as the one I’m talking about today: unstandardizing coefficients.</p>
<p>I find this comes up particularly for generalized linear mixed models (GLMM), where models don’t always converge if explanatory variables are left unstandardized. The lack of convergence can be caused by explanatory variables with very different magnitudes, and standardizing the variables prior to model fitting can be useful. In such cases, coefficients and confidence interval limits will often need to be converted to their unstandardized values for interpretation. I don’t find thinking about the change in mean response for a 1 standard deviation increase in a variable to be super intuitive, which is the interpretation of a standardized coefficient.</p>
<p>The math for converting the standardized slope estimates to unstandardized ones turns out to be fairly straightforward. Coefficients for each variable need to be divided by the standard deviation of that variable (this is only true for slopes, not intercepts). The math is shown <a href="https://stats.stackexchange.com/questions/74622/converting-standardized-betas-back-to-original-variables">here</a>.</p>
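<p>As a small illustration of that math, here is a sketch using an ordinary linear model on made-up data (the variable names and parameter values are hypothetical). For a model fit to the standardized variable (x - mean)/sd, the unstandardized slope is the standardized slope divided by the standard deviation, while the intercept also needs the mean.</p>
<pre class="r"><code>set.seed(16)
# Made-up data for illustration only
x = rnorm(50, 500, 100)
y = 2 + 0.01*x + rnorm(50)
x_s = as.numeric( scale(x) )

fit_raw = lm(y ~ x)
fit_std = lm(y ~ x_s)

# Slope: divide the standardized slope by the standard deviation of x
slope_unst = coef(fit_std)[["x_s"]] / sd(x)

# Intercept: subtract standardized slope * mean(x) / sd(x)
int_unst = coef(fit_std)[["(Intercept)"]] -
  coef(fit_std)[["x_s"]] * mean(x) / sd(x)

# Both comparisons should agree up to numerical tolerance
all.equal(slope_unst, coef(fit_raw)[["x"]])
all.equal(int_unst, coef(fit_raw)[["(Intercept)"]])</code></pre>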
<p>The first time I went through this process was quite clunky. Since then I’ve managed to tidy things up quite a bit through work with students, and things are now much more organized.</p>
<div id="r-packages" class="section level1">
<h1>R packages</h1>
<p>Model fitting will be done via <strong>lme4</strong>, which is where I’ve most often needed to do this. Data manipulation tools from <strong>dplyr</strong> will be useful for getting results tidied up. I’ll also use helper functions from <strong>purrr</strong> to loop through variables and <strong>broom</strong> for the tidy extraction of fixed-effects coefficients from the model.</p>
<pre class="r"><code>library(lme4) # v. 1.1-15</code></pre>
<pre><code># Loading required package: Matrix</code></pre>
<pre class="r"><code>suppressPackageStartupMessages( library(dplyr) ) # v. 0.7.4
library(purrr) # 0.2.4
library(broom) # 0.4.3</code></pre>
</div>
<div id="the-dataset" class="section level1">
<h1>The dataset</h1>
<p>The dataset I’ll use is named <code>cbpp</code>, and comes with <strong>lme4</strong>. It is a dataset that has a response variable that is counted proportions, so the data will be analyzed via a binomial generalized linear mixed model.</p>
<pre class="r"><code>glimpse(cbpp)</code></pre>
<pre><code># Observations: 56
# Variables: 4
# $ herd <fct> 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5...
# $ incidence <dbl> 2, 3, 4, 0, 3, 1, 1, 8, 2, 0, 2, 2, 0, 2, 0, 5, 0, 0...
# $ size <dbl> 14, 12, 9, 5, 22, 18, 21, 22, 16, 16, 20, 10, 10, 9,...
# $ period <fct> 1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3...</code></pre>
<p>This dataset has no continuous explanatory variables in it, so I’ll add some to demonstrate standardizing/unstandardizing. I create three new variables with very different ranges.</p>
<pre class="r"><code>set.seed(16)
cbpp = mutate(cbpp, y1 = rnorm(56, 500, 100),
y2 = runif(56, 0, 1),
y3 = runif(56, 10000, 20000) )
glimpse(cbpp)</code></pre>
<pre><code># Observations: 56
# Variables: 7
# $ herd <fct> 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5...
# $ incidence <dbl> 2, 3, 4, 0, 3, 1, 1, 8, 2, 0, 2, 2, 0, 2, 0, 5, 0, 0...
# $ size <dbl> 14, 12, 9, 5, 22, 18, 21, 22, 16, 16, 20, 10, 10, 9,...
# $ period <fct> 1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3...
# $ y1 <dbl> 547.6413, 487.4620, 609.6216, 355.5771, 614.7829, 45...
# $ y2 <dbl> 0.87758754, 0.33596115, 0.11495285, 0.26466003, 0.99...
# $ y3 <dbl> 14481.07, 11367.88, 14405.16, 18497.73, 17955.66, 10...</code></pre>
</div>
<div id="analysis" class="section level1">
<h1>Analysis</h1>
<div id="unstandardized-model" class="section level2">
<h2>Unstandardized model</h2>
<p>Here is my initial generalized linear mixed model, using the three continuous explanatory variables as fixed effects and “herd” as the random effect. A warning message indicates that standardizing might be necessary.</p>
<pre class="r"><code>fit1 = glmer( cbind(incidence, size - incidence) ~ y1 + y2 + y3 + (1|herd),
data = cbpp, family = binomial)</code></pre>
<pre><code># Warning: Some predictor variables are on very different scales: consider
# rescaling</code></pre>
<pre><code># Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl =
# control$checkConv, : Model failed to converge with max|grad| = 0.947357
# (tol = 0.001, component 1)</code></pre>
<pre><code># Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue
# - Rescale variables?;Model is nearly unidentifiable: large eigenvalue ratio
# - Rescale variables?</code></pre>
</div>
<div id="standardizing-the-variables" class="section level2">
<h2>Standardizing the variables</h2>
<p>I’ll now standardize the three explanatory variables, which involves subtracting the mean and then dividing by the standard deviation. The <code>scale()</code> function is one way to do this in R.</p>
<p>I do the work inside <code>mutate_at()</code>, which allows me to choose the three variables I want to standardize by name and add “s” as the suffix by using a name in <code>funs()</code>. Adding the suffix allows me to keep the original variables, as I will need them later. I use <code>as.numeric()</code> to convert the matrix that the <code>scale()</code> function returns into a vector.</p>
<pre class="r"><code>cbpp = mutate_at(cbpp, vars( y1:y3 ), funs(s = as.numeric( scale(.) ) ) )
glimpse(cbpp)</code></pre>
<pre><code># Observations: 56
# Variables: 10
# $ herd <fct> 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5...
# $ incidence <dbl> 2, 3, 4, 0, 3, 1, 1, 8, 2, 0, 2, 2, 0, 2, 0, 5, 0, 0...
# $ size <dbl> 14, 12, 9, 5, 22, 18, 21, 22, 16, 16, 20, 10, 10, 9,...
# $ period <fct> 1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3...
# $ y1 <dbl> 547.6413, 487.4620, 609.6216, 355.5771, 614.7829, 45...
# $ y2 <dbl> 0.87758754, 0.33596115, 0.11495285, 0.26466003, 0.99...
# $ y3 <dbl> 14481.07, 11367.88, 14405.16, 18497.73, 17955.66, 10...
# $ y1_s <dbl> 0.34007250, -0.32531152, 1.02536895, -1.78352139, 1....
# $ y2_s <dbl> 1.4243017, -0.4105502, -1.1592536, -0.6520950, 1.837...
# $ y3_s <dbl> -0.2865770, -1.3443309, -0.3123665, 1.0781427, 0.893...</code></pre>
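<p>In dplyr 1.0.0 and later, <code>funs()</code> and the <code>*_at()</code> variants have been superseded by <code>across()</code>. Below is a sketch of the same standardizing step written that way. To keep the example self-contained it uses a made-up data.frame called <code>dat</code> (a hypothetical stand-in for the three added columns of <code>cbpp</code>) rather than the real dataset.</p>
<pre class="r"><code>library(dplyr)

set.seed(16)
# Stand-in for the three continuous variables added to cbpp
dat = data.frame(y1 = rnorm(56, 500, 100),
                 y2 = runif(56, 0, 1),
                 y3 = runif(56, 10000, 20000) )

# Standardize y1 through y3, keeping the originals and adding an "_s" suffix
dat = mutate(dat, across(y1:y3,
                         ~as.numeric( scale(.x) ),
                         .names = "{.col}_s") )
names(dat)</code></pre>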
</div>
<div id="standardized-model" class="section level2">
<h2>Standardized model</h2>
<p>The model with standardized variables converges without any problems.</p>
<pre class="r"><code>fit2 = glmer( cbind(incidence, size - incidence) ~ y1_s + y2_s + y3_s + (1|herd),
data = cbpp, family = binomial)</code></pre>
</div>
</div>
<div id="unstandardizing-slope-coefficients" class="section level1">
<h1>Unstandardizing slope coefficients</h1>
<div id="get-coefficients-and-profile-confidence-intervals-from-model" class="section level2">
<h2>Get coefficients and profile confidence intervals from model</h2>
<p>If I want to use this model for inference I need to unstandardize the coefficients before reporting them to make them more easily interpretable.</p>
<p>The first step in the process is to get the standardized estimates and confidence intervals from the model. I use <code>tidy()</code> from package <strong>broom</strong> for this, which returns a data.frame of coefficients, statistical tests, and confidence intervals. The help page is at <code>?tidy.merMod</code> if you want to explore some of the options.</p>
<p>I use <code>tidy()</code> to extract the fixed effects along with profile likelihood confidence intervals.</p>
<pre class="r"><code>coef_st = tidy(fit2, effects = "fixed",
conf.int = TRUE,
conf.method = "profile")</code></pre>
<pre><code># Computing profile confidence intervals ...</code></pre>
<pre class="r"><code>coef_st</code></pre>
<pre><code># term estimate std.error statistic p.value conf.low
# 1 (Intercept) -2.0963512 0.2161353 -9.6992521 3.037166e-22 -2.565692724
# 2 y1_s 0.2640116 0.1395533 1.8918334 5.851318e-02 -0.009309633
# 3 y2_s 0.1031655 0.1294722 0.7968153 4.255583e-01 -0.151459144
# 4 y3_s 0.1569910 0.1229460 1.2769107 2.016338e-01 -0.086291865
# conf.high
# 1 -1.6536063
# 2 0.5472589
# 3 0.3635675
# 4 0.4033177</code></pre>
</div>
<div id="calculate-standard-deviations-for-each-variable" class="section level2">
<h2>Calculate standard deviations for each variable</h2>
<p>I need the standard deviations for each variable in order to unstandardize the coefficients. If I do this right, I can get the standard deviations into a data.frame that I can then join to <code>coef_st</code>. Once that is done, dividing the estimated slopes and confidence interval limits by the standard deviation will be straightforward.</p>
<p>I will calculate the standard deviations per variable with <code>map()</code> from <strong>purrr</strong>, as it is a convenient way to loop through columns. I pull out the variables I want to calculate standard deviations for via <code>select()</code>. An alternative approach would have been to take the variables from columns and put them in rows (i.e., put the data in <em>long</em> format), and then summarize by groups.</p>
<p>The output from <code>map()</code> returns a list, which can be stacked into a long format data.frame via <code>utils::stack()</code>. This results in a two column data.frame, with a column for the standard deviation (called <code>values</code>) and a column with the variable names (called <code>ind</code>).</p>
<pre class="r"><code>map( select(cbpp, y1:y3), sd) %>%
stack()</code></pre>
<pre><code># values ind
# 1 90.4430192 y1
# 2 0.2951881 y2
# 3 2943.2098667 y3</code></pre>
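<p>The long-format alternative mentioned above might look something like the following. This is a sketch that assumes the <strong>tidyr</strong> package (version 1.0.0 or later, for <code>pivot_longer()</code>) and uses a made-up data.frame <code>dat</code> in place of <code>cbpp</code>; the column names <code>ind</code> and <code>values</code> are chosen to match the output of <code>stack()</code>.</p>
<pre class="r"><code>library(dplyr)
library(tidyr)

set.seed(16)
# Stand-in for the three continuous variables added to cbpp
dat = data.frame(y1 = rnorm(56, 500, 100),
                 y2 = runif(56, 0, 1),
                 y3 = runif(56, 10000, 20000) )

# Put the variables in rows, then summarize by variable name
sd_long = dat %>%
  pivot_longer(cols = y1:y3, names_to = "ind", values_to = "value") %>%
  group_by(ind) %>%
  summarise(values = sd(value) )
sd_long</code></pre>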
<p>The variables in my model and in my output end with <code>_s</code> , so I’ll need to add that suffix to the variable names in the “standard deviations” dataset prior to joining the two data.frames together.</p>
<pre class="r"><code>sd_all = map( select(cbpp, y1:y3), sd) %>%
stack() %>%
mutate(ind = paste(ind, "s", sep = "_") )
sd_all</code></pre>
<pre><code># values ind
# 1 90.4430192 y1_s
# 2 0.2951881 y2_s
# 3 2943.2098667 y3_s</code></pre>
</div>
<div id="joining-the-standard-deviations-to-the-coefficients-table" class="section level2">
<h2>Joining the standard deviations to the coefficients table</h2>
<p>Once the names of the variables match between the datasets I can join the “standard deviations” data.frame to the “coefficients” data.frame. I’m not unstandardizing the intercept at this point, so I’ll use <code>inner_join()</code> to keep only rows that have a match in both data.frames. Notice that the columns I’m joining by have different names in the two data.frames.</p>
<pre class="r"><code>coef_st %>%
inner_join(., sd_all, by = c("term" = "ind") )</code></pre>
<pre><code># term estimate std.error statistic p.value conf.low conf.high
# 1 y1_s 0.2640116 0.1395533 1.8918334 0.05851318 -0.009309633 0.5472589
# 2 y2_s 0.1031655 0.1294722 0.7968153 0.42555829 -0.151459144 0.3635675
# 3 y3_s 0.1569910 0.1229460 1.2769107 0.20163378 -0.086291865 0.4033177
# values
# 1 90.4430192
# 2 0.2951881
# 3 2943.2098667</code></pre>
<p>With everything in one data.frame I can easily divide <code>estimate</code>, <code>conf.low</code> and <code>conf.high</code> by the standard deviation in <code>values</code> via <code>mutate_at()</code>. I will round the results, as well, although I’m ignoring the vast differences in the variable range when I do this rounding.</p>
<pre class="r"><code>coef_st %>%
inner_join(., sd_all, by = c("term" = "ind") ) %>%
mutate_at( vars(estimate, conf.low, conf.high), funs(round( ./values, 4) ) )</code></pre>
<pre><code># term estimate std.error statistic p.value conf.low conf.high
# 1 y1_s 0.0029 0.1395533 1.8918334 0.05851318 -0.0001 0.0061
# 2 y2_s 0.3495 0.1294722 0.7968153 0.42555829 -0.5131 1.2316
# 3 y3_s 0.0001 0.1229460 1.2769107 0.20163378 0.0000 0.0001
# values
# 1 90.4430192
# 2 0.2951881
# 3 2943.2098667</code></pre>
<p>I’ll get rid of the extra variables via <code>select()</code>, so I end up with the unstandardized coefficients and confidence interval limits along with the variable name. I could also get the variable names cleaned up, possibly removing the suffix and/or capitalizing and adding units, etc., although I don’t do that with these fake variables today.</p>
<pre class="r"><code>coef_unst = coef_st %>%
inner_join(., sd_all, by = c("term" = "ind") ) %>%
mutate_at( vars(estimate, conf.low, conf.high), funs(round( ./values, 4) ) ) %>%
select(-(std.error:p.value), -values)
coef_unst</code></pre>
<pre><code># term estimate conf.low conf.high
# 1 y1_s 0.0029 -0.0001 0.0061
# 2 y2_s 0.3495 -0.5131 1.2316
# 3 y3_s 0.0001 0.0000 0.0001</code></pre>
</div>
<div id="estimates-from-the-unstandardized-model" class="section level2">
<h2>Estimates from the unstandardized model</h2>
<p>Note that the estimated coefficients are the same from the model where I manually unstandardize the coefficients (above) and the model fit using unstandardized variables.</p>
<pre class="r"><code>round( fixef(fit1)[2:4], 4)</code></pre>
<pre><code># y1 y2 y3
# 0.0029 0.3495 0.0001</code></pre>
<p>Given that the estimates are the same, couldn’t we simply go back and fit the unstandardized model and ignore the warning message? Unfortunately, the convergence issues can cause problems when trying to calculate profile likelihood confidence intervals, so the simpler approach doesn’t always work.</p>
<p>In this case there are a bunch of warnings (not shown), and the profile likelihood confidence interval limits aren’t successfully calculated for some of the coefficients.</p>
<pre class="r"><code>tidy(fit1, effects = "fixed",
conf.int = TRUE,
conf.method = "profile")</code></pre>
<pre><code># Computing profile confidence intervals ...</code></pre>
<pre><code># term estimate std.error statistic p.value
# 1 (Intercept) -4.582545e+00 1.094857e+00 -4.1855190 2.845153e-05
# 2 y1 2.919191e-03 1.520407e-03 1.9200071 5.485700e-02
# 3 y2 3.495238e-01 4.370778e-01 0.7996834 4.238943e-01
# 4 y3 5.334512e-05 4.049206e-05 1.3174218 1.876973e-01
# conf.low conf.high
# 1 NA NA
# 2 NA NA
# 3 NA NA
# 4 -2.931883e-05 0.0001370332</code></pre>
</div>
</div>
<div id="further-important-work" class="section level1">
<h1>Further (important!) work</h1>
<p>These results are all on the scale of log-odds, and I would exponentiate the unstandardized coefficients to the odds scale for reporting and interpretation.</p>
<p>Along these same lines, one thing I didn’t discuss in this post that is important to consider is the appropriate and interesting unit increase for each variable. Clearly the effect of a “1 unit” increase in the variable is likely not of interest for at least <code>y2</code> (range between 0 and 1) and <code>y3</code> (range between 10000 and 20000). In the first case, 1 unit encompasses the entire range of the variable and in the second case 1 unit appears to be much smaller than the scale of the measurement.</p>
<p>The code to calculate the change in odds for a practically interesting increase in each explanatory variable would be similar to what I’ve done above. I would create a data.frame with the unit increase of interest for each variable in it, join this to the “coefficients” dataset, and multiply all estimates and CI by those values. The multiplication step can occur before or after unstandardizing but must happen before doing exponentiation/inverse-linking. I’d report the unit increase for each variable in any tables of results so the reader can see that the reported estimate is a change in estimated odds/mean for the given practically important increase in the variable.</p>
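<p>A minimal sketch of those steps is below. The standardized estimates, confidence limits, and standard deviations are typed in from the output shown earlier in this post so the example stands alone. The unit increases in <code>incr</code> (50, 0.1, and 1000) are hypothetical values picked only for illustration, and <code>across()</code> assumes dplyr 1.0.0 or later.</p>
<pre class="r"><code>library(dplyr)

# Standardized slopes, profile CI limits, and standard deviations
# (copied from the output above)
coef_sd = data.frame(term = c("y1_s", "y2_s", "y3_s"),
                     estimate = c(0.2640116, 0.1031655, 0.1569910),
                     conf.low = c(-0.009309633, -0.151459144, -0.086291865),
                     conf.high = c(0.5472589, 0.3635675, 0.4033177),
                     values = c(90.4430192, 0.2951881, 2943.2098667) )

# Hypothetical "practically interesting" increase for each variable
incr = data.frame(term = c("y1_s", "y2_s", "y3_s"),
                  increase = c(50, 0.1, 1000) )

# Unstandardize, multiply by the increase, and only then exponentiate
odds = coef_sd %>%
  inner_join(incr, by = "term") %>%
  mutate( across( c(estimate, conf.low, conf.high),
                  ~exp(.x / values * increase) ) ) %>%
  select(-values)
odds</code></pre>
<p>Each row then holds the estimated multiplicative change in odds, with its confidence limits, for the increase listed in the <code>increase</code> column.</p>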
</div>
Making many added variable plots with purrr and ggplot2
https://aosmith.rbind.io/2018/01/31/added-variable-plots/
Wed, 31 Jan 2018 00:00:00 +0000https://aosmith.rbind.io/2018/01/31/added-variable-plots/<p>Last week two of my consulting meetings ended up on the same topic: making added variable plots.</p>
<p>In both cases, the student had a linear model of some flavor that had several continuous explanatory variables. They wanted to plot the estimated relationship between each variable in the model and the response. This could easily lead to a lot of copying and pasting of code, since they want to do the same thing for every explanatory variable in the model. I worked up some example code showing an approach on how one might automate the task in R with functions and loops, and thought I’d generalize it for a blog post.</p>
<div id="the-basics-of-added-variable-plots" class="section level1">
<h1>The basics of added variable plots</h1>
<p>Added variable plots (aka partial regression plots, adjusted variable plots, individual coefficient plots), are “results” plots. They are plots showing the estimated relationship between the response and an explanatory variable <em>after accounting for the other variables in the model</em>. If working with only two continuous explanatory variables, a 3-dimensional plot could be used in place of an added variable plot (if one likes those sorts of plots 😃). Once there are many variables in the model, though, we don’t have enough plotting dimensions to show how all the variables relate to the response simultaneously and so the added variable plot is an alternative.</p>
<p>There are packages available for making added variable plots in R, such as the <strong>effects</strong> package. However, I tend to like a bit more flexibility, which I can get by making my own plots. To do this I need to extract the appropriate predictions and confidence intervals from the model.</p>
<p>When making an added variable plot, it is fairly standard to make the predictions with all other variables fixed to their medians or means. I use medians today. Note that in my example I’m demonstrating code for the relatively simple case where there are no interactions between continuous variables in the model. Continuous-by-continuous interactions would involve a more complicated set-up for making plots.</p>
</div>
<div id="r-packages" class="section level1">
<h1>R packages</h1>
<p>The main workhorses I’m using today are <strong>purrr</strong> for looping through variables/lists and <strong>ggplot2</strong> for plotting. I also use helper functions from <strong>dplyr</strong> for data manipulation and <strong>broom</strong> for getting the model predictions and standard errors.</p>
<pre class="r"><code>suppressPackageStartupMessages( library(dplyr) ) # v. 0.7.4
library(ggplot2) # v. 2.2.1.9000
library(purrr) # v. 0.2.4
library(broom) # v. 0.4.3</code></pre>
</div>
<div id="the-linear-model" class="section level1">
<h1>The linear model</h1>
<p>My example model is a linear model with a transformed response variable, fit using <code>lm()</code>. The process works the same for generalized linear models fit with <code>glm()</code> and would be very similar for other linear models (although you may have to calculate any standard errors manually).</p>
<p>My linear model is based on five continuous variables from the <em>mtcars</em> dataset. (<em>I know, I know, that dataset has been incredibly overused. Seriously, though, I looked around a bit for another dataset to use that had many continuous explanatory variables in it (that wasn’t a time series) but I couldn’t come up with anything. If you know of one, let me know!</em>)</p>
<p>The model I fit uses a log transformation for the response variable, so predictions and confidence interval limits will need to be back-transformed prior to plotting to show the relationship between each explanatory variable and the response on the original scale.</p>
<pre class="r"><code>fit1 = lm( log(mpg) ~ disp + hp + drat + wt, data = mtcars)
summary(fit1)</code></pre>
<pre><code>#
# Call:
# lm(formula = log(mpg) ~ disp + hp + drat + wt, data = mtcars)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.16449 -0.08240 -0.03421 0.08048 0.26221
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 3.5959909 0.2746352 13.094 3.29e-13 ***
# disp -0.0001296 0.0004715 -0.275 0.78553
# hp -0.0014709 0.0005061 -2.906 0.00722 **
# drat 0.0445959 0.0575916 0.774 0.44545
# wt -0.1719512 0.0470572 -3.654 0.00110 **
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.1136 on 27 degrees of freedom
# Multiple R-squared: 0.8733, Adjusted R-squared: 0.8546
# F-statistic: 46.54 on 4 and 27 DF, p-value: 9.832e-12</code></pre>
</div>
<div id="the-explanatory-variables" class="section level1">
<h1>The explanatory variables</h1>
<p>The approach I’m going to take is to loop through the explanatory variables in the model, create datasets for prediction, get the predictions, and make the plots. I’ll end with one added variable plot per variable.</p>
<p>I could write out a vector of variable names to loop through manually, but I prefer to pull them out of the model. In the approach I use, the first variable in the output is the response variable. I don’t need the response variable for what I’m doing, so I remove it.</p>
<pre class="r"><code>( mod_vars = all.vars( formula(fit1) )[-1] )</code></pre>
<pre><code># [1] "disp" "hp" "drat" "wt"</code></pre>
</div>
<div id="a-function-for-making-a-prediction-dataset" class="section level1">
<h1>A function for making a prediction dataset</h1>
<p>The first step in making an added variable plot manually is to create the dataset to use for making the predictions. This dataset will contain the observed data for the explanatory variable of interest (the “focus” variable) with all other variables fixed to their medians.</p>
<p>Below is a function for doing this task. The function takes a dataset, a vector of all the variables in the model (as strings), and the name of the focus variable (as a string). The <strong>dplyr</strong> <code>*_at()</code> functions can take strings as input, so I use <code>summarise_at()</code> to calculate the medians of the non-focus variables. I bind the summary values to the focus variable data from the original dataset with <code>cbind()</code>, since <code>cbind()</code> allows recycling.</p>
<pre class="r"><code>preddat_fun = function(data, allvars, var) {
sums = summarise_at(data,
vars( one_of(allvars), -one_of(var) ),
median)
cbind( select_at(data, var), sums)
}</code></pre>
<p>Here’s what the result of the function looks like for a single focus variable, “disp” (showing first six rows).</p>
<pre class="r"><code>head( preddat_fun(mtcars, mod_vars, "disp") )</code></pre>
<pre><code># disp hp drat wt
# Mazda RX4 160 123 3.695 3.325
# Mazda RX4 Wag 160 123 3.695 3.325
# Datsun 710 108 123 3.695 3.325
# Hornet 4 Drive 258 123 3.695 3.325
# Hornet Sportabout 360 123 3.695 3.325
# Valiant 225 123 3.695 3.325</code></pre>
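<p>In more recent versions of dplyr (1.0.0 and later), the <code>*_at()</code> functions and <code>one_of()</code> have been superseded. A sketch of an equivalent function using <code>across()</code> and <code>all_of()</code> is below; the name <code>preddat_fun2</code> is hypothetical, and the variable names are written out by hand to keep the example self-contained.</p>
<pre class="r"><code>library(dplyr)

preddat_fun2 = function(data, allvars, var) {
  # Medians of every model variable except the focus variable
  sums = summarise(data,
                   across( all_of( setdiff(allvars, var) ), median) )
  # Bind the one-row summary to the observed focus variable (recycled)
  cbind( select(data, all_of(var) ), sums)
}

head( preddat_fun2(mtcars, c("disp", "hp", "drat", "wt"), "disp") )</code></pre>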
</div>
<div id="making-a-prediction-dataset-for-each-variable" class="section level1">
<h1>Making a prediction dataset for each variable</h1>
<p>Now that I have a working function I can loop through each variable in the “mod_vars” vector and create a prediction dataset for each one. I’ll use <code>map()</code> from <strong>purrr</strong> for the loop. I use <code>set_names()</code> prior to <code>map()</code> so each element of the resulting list will be labeled with the name of the focus variable of that dataset. This helps me stay organized.</p>
<p>The result is a list of prediction datasets.</p>
<pre class="r"><code>pred_dats = mod_vars %>%
set_names() %>%
map( ~preddat_fun(mtcars, mod_vars, .x) )
str(pred_dats)</code></pre>
<pre><code># List of 4
# $ disp:'data.frame': 32 obs. of 4 variables:
# ..$ disp: num [1:32] 160 160 108 258 360 ...
# ..$ hp : num [1:32] 123 123 123 123 123 123 123 123 123 123 ...
# ..$ drat: num [1:32] 3.7 3.7 3.7 3.7 3.7 ...
# ..$ wt : num [1:32] 3.33 3.33 3.33 3.33 3.33 ...
# $ hp :'data.frame': 32 obs. of 4 variables:
# ..$ hp : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
# ..$ disp: num [1:32] 196 196 196 196 196 ...
# ..$ drat: num [1:32] 3.7 3.7 3.7 3.7 3.7 ...
# ..$ wt : num [1:32] 3.33 3.33 3.33 3.33 3.33 ...
# $ drat:'data.frame': 32 obs. of 4 variables:
# ..$ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
# ..$ disp: num [1:32] 196 196 196 196 196 ...
# ..$ hp : num [1:32] 123 123 123 123 123 123 123 123 123 123 ...
# ..$ wt : num [1:32] 3.33 3.33 3.33 3.33 3.33 ...
# $ wt :'data.frame': 32 obs. of 4 variables:
# ..$ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
# ..$ disp: num [1:32] 196 196 196 196 196 ...
# ..$ hp : num [1:32] 123 123 123 123 123 123 123 123 123 123 ...
# ..$ drat: num [1:32] 3.7 3.7 3.7 3.7 3.7 ...</code></pre>
</div>
<div id="calculate-model-predictions" class="section level1">
<h1>Calculate model predictions</h1>
<p>Once the prediction datasets are created, the predictions can be calculated from the model and added to each dataset. I do this on the model scale, since I want to make confidence intervals with the standard errors prior to back-transforming.</p>
<p>The <code>augment()</code> function from <strong>broom</strong> works with a variety of model objects, including <em>lm</em> and <em>glm</em> objects. It can take new datasets for prediction with the “newdata” argument, and I use it here to add both the prediction and the standard errors of the predictions to each dataset.</p>
<p>I do this task by looping through the prediction datasets with <code>map()</code>, first to add the predictions via <code>augment()</code> and then to calculate the approximate confidence intervals and back-transform the predictions and confidence interval limits to the original data scale.</p>
<pre class="r"><code>preds = pred_dats %>%
map(~augment(fit1, newdata = .x) ) %>%
map(~mutate(.x,
lower = exp(.fitted - 2*.se.fit),
upper = exp(.fitted + 2*.se.fit),
pred = exp(.fitted) ) )</code></pre>
<p>Here is what the structure of the list elements looks like now (showing only the first list element).</p>
<pre class="r"><code>str(preds$disp)</code></pre>
<pre><code># 'data.frame': 32 obs. of 10 variables:
# $ .rownames: chr "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
# $ disp : num 160 160 108 258 360 ...
# $ hp : num 123 123 123 123 123 123 123 123 123 123 ...
# $ drat : num 3.7 3.7 3.7 3.7 3.7 ...
# $ wt : num 3.33 3.33 3.33 3.33 3.33 ...
# $ .fitted : num 2.99 2.99 2.99 2.97 2.96 ...
# $ .se.fit : num 0.0367 0.0367 0.0574 0.0309 0.0713 ...
# $ lower : num 18.4 18.4 17.8 18.4 16.8 ...
# $ upper : num 21.3 21.3 22.4 20.8 22.3 ...
# $ pred : num 19.8 19.8 20 19.6 19.3 ...</code></pre>
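<p>The multiplier of 2 above gives approximate intervals. If t-based 95% limits are preferred, one option is base R’s <code>predict()</code> with <code>se.fit = TRUE</code>, which also returns the residual degrees of freedom. The sketch below builds a single prediction dataset by hand for the “disp” focus variable; the object names <code>newdat</code>, <code>p</code>, <code>tmult</code>, and <code>out</code> are hypothetical.</p>
<pre class="r"><code>fit1 = lm( log(mpg) ~ disp + hp + drat + wt, data = mtcars)

# Prediction dataset: observed disp, other variables fixed to their medians
newdat = data.frame(disp = mtcars$disp,
                    hp = median(mtcars$hp),
                    drat = median(mtcars$drat),
                    wt = median(mtcars$wt) )

# Predictions and standard errors on the model (log) scale
p = predict(fit1, newdata = newdat, se.fit = TRUE)

# t multiplier based on the residual degrees of freedom
tmult = qt(0.975, df = p$df)

# Back-transform predictions and limits to the original scale
out = data.frame(pred = exp(p$fit),
                 lower = exp(p$fit - tmult*p$se.fit),
                 upper = exp(p$fit + tmult*p$se.fit) )
head(out)</code></pre>
<p>These limits should match what <code>predict()</code> returns with <code>interval = "confidence"</code> after exponentiating.</p>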
</div>
<div id="a-function-for-plotting" class="section level1">
<h1>A function for plotting</h1>
<p>With the predictions successfully made it’s time for plotting. If each plot needs to look really different, the plots could be made individually. However, if they will all have a similar look then it makes sense to create a function for making the plots.</p>
<p>One problem I anticipated running into when automating the plotting is with the x axis labels. The variable names in the dataset aren’t very nice looking. If I want the x axis labels to be more polished in the plots I’ll need replacement labels. I decided to make a vector of nicer labels, one label for each focus variable. This vector needs to be the same length and in the same order as the vector of variable names and the list of prediction datasets so each plot gets the correct new axis label.</p>
<pre class="r"><code>xlabs = c("Displacement (cu.in.)", "Gross horsepower",
"Rear axle ratio", "Weight (1000 lbs)")</code></pre>
<p>My plotting function has three arguments: the dataset to plot, the explanatory variable to plot on the x axis (as a string), and the label for the x axis (also as a string).</p>
<pre class="r"><code>pred_plot = function(data, variable, xlab) {
ggplot(data, aes_string(x = variable, y = "pred") ) +
geom_line(size = 1) +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha = .25) +
geom_rug(sides = "b") +
theme_bw() +
labs(x = xlab,
y = "Miles/(US) gallon") +
ylim(10, 32)
}</code></pre>
<p>Here is the plotting function in action. I plot the “disp” variable, which is the first element of the three lists (prediction datasets, variables, axis labels).</p>
<pre class="r"><code>pred_plot(preds[[1]], mod_vars[1], xlabs[1])</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-01-31-making-many-added-variable-plots-with-purrr-and-ggplot2_files/figure-html/unnamed-chunk-11-1.png" width="672" /></p>
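<p>In more recent versions of <strong>ggplot2</strong>, <code>aes_string()</code> is deprecated, and line thickness is set with <code>linewidth</code> rather than <code>size</code> (as of version 3.4.0). Here is a sketch of how the same function might look using the <code>.data</code> pronoun instead. The function name <code>pred_plot2</code> and the small dataset <code>toy</code> are made up for illustration.</p>
<pre class="r"><code>library(ggplot2)

# Same arguments as pred_plot(); variable is still passed as a string
pred_plot2 = function(data, variable, xlab) {
     ggplot(data, aes(x = .data[[variable]], y = .data[["pred"]]) ) +
          geom_line(linewidth = 1) +
          geom_ribbon(aes(ymin = lower, ymax = upper), alpha = .25) +
          geom_rug(sides = "b") +
          theme_bw() +
          labs(x = xlab,
               y = "Miles/(US) gallon") +
          ylim(10, 32)
}

# Hypothetical prediction dataset with the columns the function expects
toy = data.frame(disp = c(100, 200, 300, 400),
                 pred = c(24, 21, 19, 17),
                 lower = c(22, 19.5, 17.5, 15),
                 upper = c(26, 22.5, 20.5, 19) )

p = pred_plot2(toy, "disp", "Displacement (cu.in.)")</code></pre>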
</div>
<div id="making-all-the-plots" class="section level1">
<h1>Making all the plots</h1>
<p>The very last step is to make all the plots. Because we want to loop through three different lists (the prediction datasets, the variables, and the axis labels), this can be done via <code>pmap()</code> from <strong>purrr</strong>. <code>pmap()</code> loops through all three lists simultaneously.</p>
<pre class="r"><code>all_plots = pmap( list(preds, mod_vars, xlabs), pred_plot)
all_plots</code></pre>
<pre><code># $disp</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-01-31-making-many-added-variable-plots-with-purrr-and-ggplot2_files/figure-html/unnamed-chunk-12-1.png" width="672" /></p>
<pre><code>#
# $hp</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-01-31-making-many-added-variable-plots-with-purrr-and-ggplot2_files/figure-html/unnamed-chunk-12-2.png" width="672" /></p>
<pre><code>#
# $drat</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-01-31-making-many-added-variable-plots-with-purrr-and-ggplot2_files/figure-html/unnamed-chunk-12-3.png" width="672" /></p>
<pre><code>#
# $wt</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-01-31-making-many-added-variable-plots-with-purrr-and-ggplot2_files/figure-html/unnamed-chunk-12-4.png" width="672" /></p>
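<p>To see how <code>pmap()</code> lines up the lists, here is a small toy sketch that has nothing to do with plotting. The objects <code>dats</code>, <code>vars</code>, and <code>axis_labs</code> are made up for illustration. The function is called once per position, with the first element of every list, then the second, and so on. The output takes its names from the first list.</p>
<pre class="r"><code>library(purrr)

# Three parallel inputs of the same length and in the same order
dats = list(disp = 1:3, hp = 4:6)
vars = c("disp", "hp")
axis_labs = c("Displacement", "Horsepower")

# Arguments are matched by position because the outer list is unnamed
out = pmap(list(dats, vars, axis_labs),
           function(data, variable, xlab) {
                paste0(xlab, " (", variable, "): ", length(data), " values")
           })
out</code></pre>
<p>If the lists were in different orders, each call would silently get mismatched pieces, which is why the order of the labels vector matters so much.</p>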
<div id="using-the-plots" class="section level2">
<h2>Using the plots</h2>
<p>The plots can be printed all at once as above or individually using indexes or list names, using code such as <code>all_plots[[1]]</code> or <code>all_plots$disp</code>. Plots can also be saved for use outside of R; for saving individual plots you might use a <code>walk()</code> loop.</p>
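<p>As a rough sketch of what such a saving loop might look like, <code>iwalk()</code> from <strong>purrr</strong> passes both the list element and its name to the function, so the list names can be used as file names. The list <code>plots_sketch</code>, the output folder, and the plot dimensions are all made up for illustration.</p>
<pre class="r"><code>library(ggplot2)
library(purrr)

# A small named list of plots standing in for the real list
plots_sketch = list(
     disp = ggplot(mtcars, aes(disp, mpg) ) + geom_point(),
     wt = ggplot(mtcars, aes(wt, mpg) ) + geom_point()
)

# Save into a temporary folder; swap in a real folder as needed
out_dir = tempdir()

# .x is the plot, .y is its name in the list
iwalk(plots_sketch,
      ~ggsave(filename = file.path(out_dir, paste0(.y, ".png") ),
              plot = .x,
              width = 7, height = 5) )</code></pre>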
<p>It might be nice to combine these individual plots into a single multi-plot. A faceted plot would be an option, but the approach I’ve taken here isn’t a great one for faceting in its current form (although I’m sure it could be modified with that in mind).</p>
<p>The individual plots can be combined into a single figure via <strong>cowplot</strong> functions, though, without too much trouble. Note that <strong>cowplot</strong> is opinionated about the theme, so I use it without loading it.</p>
<p>The <code>plot_grid()</code> function can take a list of plots, which is what I have. It has a variety of options you might want to explore for getting the plots stitched together nicely.</p>
<pre class="r"><code>cowplot::plot_grid(plotlist = all_plots,
labels = "AUTO",
align = "hv")</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-01-31-making-many-added-variable-plots-with-purrr-and-ggplot2_files/figure-html/unnamed-chunk-13-1.png" width="672" /></p>
<p><strong>Addendum:</strong> Package <strong>rms</strong> makes added variable plots via <strong>ggplot2</strong> and <strong>plotly</strong> along with simultaneous confidence bands for any model type the package works with. That includes linear models and generalized linear models excluding the negative binomial family. This may be a useful place to start if you are working with these kinds of models.</p>
</div>
</div>
Reversing the order of a ggplot2 legend
https://aosmith.rbind.io/2018/01/19/reversing-the-order-of-a-ggplot2-legend/
Fri, 19 Jan 2018 00:00:00 +0000https://aosmith.rbind.io/2018/01/19/reversing-the-order-of-a-ggplot2-legend/<p>It’s always nice to get good questions in a workshop. It can help everybody, including the instructor, get a bit of extra learnin’ in.</p>
<p>Every spring I give a <strong>ggplot2</strong> workshop for graduate students in my college. The first half is focused on the terminology and understanding the basics of how to put a plot together (I remember as a beginner feeling like I was throwing darts at things to see what stuck when deciding if something should go inside or outside aes() 🎯 ).</p>
<p>The second half is spent more on making more “final” versions of plots, which is where the question came up.</p>
<pre class="r"><code>library(ggplot2)</code></pre>
<p>I have a “results” dataset I use for one of the “final” plot examples, displaying the results from some statistical model used to answer a question about differences in mean response between some groups vs a control group.</p>
<pre class="r"><code>res = structure(list(Diffmeans = c(-0.27, 0.11, -0.15, -1.27, -1.18
), Lower.CI = c(-0.63, -0.25, -0.51, -1.63, -1.54), Upper.CI = c(0.09,
0.47, 0.21, -0.91, -0.82), plantdate = structure(c(1L, 2L, 2L,
3L, 3L), .Label = c("January 2", "January 28", "February 25"), class = "factor"),
stocktype = structure(c(2L, 2L, 1L, 2L, 1L), .Label = c("bare",
"cont"), class = "factor")), .Names = c("Diffmeans", "Lower.CI",
"Upper.CI", "plantdate", "stocktype"), row.names = c(NA, -5L), class = "data.frame")</code></pre>
<p>In the workshop, making a plot of these results is done in many stages to demonstrate controlling scales and themes and dodging, etc. At one point during the workshop the graph looks like this.</p>
<pre class="r"><code>( g1 = ggplot(res, aes(x = plantdate, y = Diffmeans, group = stocktype) ) +
geom_point(position = position_dodge(width = .75) ) +
geom_errorbar( aes(ymin = Lower.CI, ymax = Upper.CI,
linetype = stocktype,
width = c(.2, .4, .4, .4, .4) ),
position = position_dodge(width = .75) ) +
theme_bw() +
labs(y = "Difference in Growth (cm)",
x = "Planting Date") +
geom_rect(xmin = -Inf, xmax = Inf, ymin = -.25, ymax = .25,
fill = "grey54", alpha = .05) +
scale_y_continuous(breaks = seq(-1.5, .5, by = .25) ) +
coord_flip() +
scale_linetype_manual(values = c("solid", "twodash"),
name = element_blank(),
labels = c("Bare root", "Container") ) )</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-01-19-reversing-the-order-of-a-ggplot2-legend_files/figure-html/unnamed-chunk-3-1.png" width="672" /></p>
<p>One year I had a student ask a great question about the order of the legend. The order of the lines representing the two groups in the plot is exactly opposite of the order of the groups in the legend. He felt the plot would feel more polished if the group order between the elements matched; I agreed.</p>
<p>Now, a lot of the time the answer to “how do I change the order of a categorical variable in <strong>ggplot2</strong>” is <em>change the data to change the plot</em>. (I’ve used <code>forcats::fct_inorder</code> a <em>lot</em> for getting the levels of variables like month names in the correct order for plotting.)</p>
<p>But that doesn’t work in this case. If I change the order of the levels of the factor…</p>
<pre class="r"><code>res$stocktype = factor(res$stocktype, levels = c("cont", "bare") )</code></pre>
<p>…both the order of the groups in the plot and the legend flip, so they are still exactly opposite of each other.</p>
<pre class="r"><code>g1 %+%
res +
scale_linetype_manual(values = c("solid", "twodash"),
name = element_blank(),
labels = c("Container", "Bare root") )</code></pre>
<pre><code># Scale for 'linetype' is already present. Adding another scale for
# 'linetype', which will replace the existing scale.</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-01-19-reversing-the-order-of-a-ggplot2-legend_files/figure-html/unnamed-chunk-5-1.png" width="672" /></p>
<p>After going back to the drawing board (and spending some time online), it turns out the answer is to reverse the legend within <strong>ggplot2</strong>. There is a “reverse” argument in the guide_legend() function. This function can be used either inside a scale_*() function with the “guide” argument or within guides().</p>
<p>In this particular case, the scale_linetype_manual() line can be changed to incorporate guide_legend().</p>
<p>Like this:</p>
<pre class="r"><code>scale_linetype_manual(values = c("solid", "twodash"),
name = element_blank(),
labels = c("Container", "Bare root"),
guide = guide_legend(reverse = TRUE) )</code></pre>
<p>This works to get the plot order and legend matching in a more aesthetically pleasing way.</p>
<pre class="r"><code>g1 %+%
res +
scale_linetype_manual(values = c("solid", "twodash"),
name = element_blank(),
labels = c("Container", "Bare root"),
guide = guide_legend(reverse = TRUE) )</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-01-19-reversing-the-order-of-a-ggplot2-legend_files/figure-html/unnamed-chunk-7-1.png" width="672" /></p>
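<p>The other route, through guides(), might look something like the sketch below. The dataset <code>toy</code> is a made-up stand-in for the results dataset, and the plot is stripped down to the pieces that matter for the legend.</p>
<pre class="r"><code>library(ggplot2)

# Hypothetical stand-in for the results dataset
toy = data.frame(plantdate = c("January 28", "January 28",
                               "February 25", "February 25"),
                 est = c(0.1, -0.2, -1.3, -1.2),
                 stocktype = c("cont", "bare", "cont", "bare") )

g2 = ggplot(toy, aes(x = plantdate, y = est, linetype = stocktype) ) +
     geom_errorbar(aes(ymin = est - .35, ymax = est + .35),
                   width = .2,
                   position = position_dodge(width = .75) ) +
     coord_flip() +
     # Reverse only the linetype legend, leaving the data untouched
     guides(linetype = guide_legend(reverse = TRUE) )</code></pre>
<p>Either approach reverses only the legend, so the order of the groups in the plot itself stays as it was.</p>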
<p>It was a good day; everyone got to learn something new! 🎉</p>
Simulate! Simulate! - Part 1: A linear model
https://aosmith.rbind.io/2018/01/09/simulate-simulate-part1/
Tue, 09 Jan 2018 00:00:00 +0000https://aosmith.rbind.io/2018/01/09/simulate-simulate-part1/<p>Confession: I love simulations.</p>
<p>In simulations you get to define everything about a model and then see how that model behaves over the long run. It’s like getting the luxury of taking many samples instead of only the one real one we have resources for in an actual study.</p>
<p>I find simulations incredibly useful in understanding statistical theory and assumptions of linear models. When someone tells me with great certainty “I don’t need to meet that assumption because [fill in the blank]” or asks “Does it matter that [something complicated]?”, I often turn to simulations instead of textbooks to check.</p>
<p>I like simulations for the same reasons I like building Bayesian models and using resampling methods (i.e., Monte Carlo) for inference. Building the simulation increases my understanding of the problem and makes all the assumptions clearer to me because I must use them explicitly. Plus it’s fun to put the code together and explore the results. 🙂</p>
<div id="simulate-simulate-dance-to-the-music" class="section level1">
<h1>Simulate, simulate, dance to the music</h1>
<p>Simulations have been so helpful in my own understanding of statistical models that I find myself wishing I knew how to use them more in teaching and consulting. Being able to build a simulation could really help folks understand the strengths and weaknesses of their analysis. I haven’t managed to fit it in so far, but it’s always on my mind. Hence, this post.</p>
<p>Today I’m going to go over an example of simulating data from a two-group linear model. I’ll work up to linear mixed models and generalized linear mixed models (the fun stuff! 😆) in subsequent posts.</p>
</div>
<div id="the-statistical-model" class="section level1">
<h1>The statistical model</h1>
<p><strong>Warning: Here there be equations.</strong></p>
<p>If you’re like me and your brain says “I think this section must not pertain to me” when your eyes hit mathematical notation, you can jump right down to the R code in the next section. But if you can power through, these equations are actually pretty useful when setting up a simulation (honest).</p>
<p>A simulation for a linear model is based on the statistical model. The statistical model is an equation that describes the processes we believe gave rise to the observed response variable. It includes parameters to describe the assumed effect of explanatory variables on the response variable as well as a description of any distributions associated with processes we assume are random variation. (There is more in-depth coverage of the statistical model in Stroup’s 2013 <a href="https://books.google.com/books/about/Generalized_Linear_Mixed_Models.html?id=GcGrySpkXRMC">Generalized Linear Mixed Models</a> book if you are interested and have access to it.)</p>
<p>So the statistical model is where we write down the exact assumptions we are making when we fit a linear model to a set of data.</p>
<p>Here is an example of a linear model for two groups. I wrote the statistical model to match the form of the default summary output from a model fit with <code>lm()</code> in R.</p>
<p><span class="math display">\[y_t = \beta_0 + \beta_1*I_{(group_t=\textit{''group2''})} + \epsilon_t\]</span></p>
<ul>
<li><span class="math inline">\(y_t\)</span> is the observed value of the quantitative response variable for observation <span class="math inline">\(t\)</span>; <span class="math inline">\(t\)</span> goes from 1 to the number of observations in the dataset<br />
</li>
<li><span class="math inline">\(\beta_0\)</span> is the mean of the response variable when the group is “group1”</li>
<li><span class="math inline">\(\beta_1\)</span> is the difference in mean response between the groups, “group2” minus “group1”.<br />
</li>
<li>The indicator variable, <span class="math inline">\(I_{(group_t=\textit{''group2''})}\)</span>, is 1 when the group is “group2” and 0 otherwise, as <span class="math inline">\(\beta_1\)</span> only affects the response variable for observations in “group2”</li>
<li><span class="math inline">\(\epsilon_t\)</span> is the random variation present for each observation that is not explained by the group variable. These are assumed to be independent and identically distributed (iid) draws from a normal distribution with a mean of 0 and some shared variance, <span class="math inline">\(\sigma^2\)</span>: <span class="math inline">\(\epsilon_t \thicksim N(0, \sigma^2)\)</span></li>
</ul>
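<p>One way to check your reading of the model is to take the expected value of each side. Since the errors have a mean of 0, only the coefficients are left, which gives the mean of each group:</p>
<p><span class="math display">\[E(y_t \mid group_t=\textit{''group1''}) = \beta_0, \qquad E(y_t \mid group_t=\textit{''group2''}) = \beta_0 + \beta_1\]</span></p>
<p>Subtracting the first mean from the second leaves <span class="math inline">\(\beta_1\)</span>, the difference in mean response between the groups.</p>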
</div>
<div id="a-single-simulation-from-a-two-group-model" class="section level1">
<h1>A single simulation from a two-group model</h1>
<p>I use the statistical model to build a simulation. In this case I’ll call my response variable “growth”, and the two groups “group1” and “group2”. I’ll have 10 observations per group (it’s possible to simulate unbalanced groups but balanced groups are a good place to start).</p>
<p>I’ll set my seed so these particular results can be reproduced.</p>
<pre class="r"><code>set.seed(16)</code></pre>
<p>I start out by defining what the “truth” is in the simulation by setting all the parameters in the statistical model to a value of my choosing. Here’s what I’ll do today.</p>
<ul>
<li>The true group mean (<span class="math inline">\(\beta_0\)</span>) for “group1” will be 5</li>
<li>The mean of “group2” will be 2 less than “group1” (<span class="math inline">\(\beta_1\)</span>)</li>
<li>The shared variance will be set at 4 (<span class="math inline">\(\sigma^2\)</span>), so the standard deviation (<span class="math inline">\(\sigma\)</span>) is 2.</li>
</ul>
<p>I’ll define the number of groups and number of replicates per group while I’m at it. The total number of observations is the number of groups times the number of replicates per group, which is <code>2*10 = 20</code>.</p>
<pre class="r"><code>ngroup = 2
nrep = 10
b0 = 5
b1 = -2
sd = 2</code></pre>
<p>I need to create the variable I’ll call “group” to use as the explanatory variable when fitting a model in R. I use <code>rep()</code> a lot when doing simulations in order to repeat values of variables to appropriately match the scenario I’m working in. Here I’ll repeat each level of <code>group</code> 10 times (<code>nrep</code>).</p>
<pre class="r"><code>( group = rep( c("group1", "group2"), each = nrep) )</code></pre>
<pre><code># [1] "group1" "group1" "group1" "group1" "group1" "group1" "group1"
# [8] "group1" "group1" "group1" "group2" "group2" "group2" "group2"
# [15] "group2" "group2" "group2" "group2" "group2" "group2"</code></pre>
<p>Next I’ll simulate the random errors. Remember I defined these above as <span class="math inline">\(\epsilon_t \thicksim N(0, \sigma^2)\)</span>. To simulate these I’ll take random draws from a normal distribution with a mean of 0 and standard deviation of 2 (note that <code>rnorm()</code> in R takes the standard deviation as input, not the variance). Every observation has a random error, so I need to make 20 draws total (<code>ngroup*nrep</code>).</p>
<pre class="r"><code>( eps = rnorm(ngroup*nrep, 0, sd) )</code></pre>
<pre><code># [1] 0.9528268 -0.2507600 2.1924324 -2.8884581 2.2956586 -0.9368241
# [7] -2.0119012 0.1271254 2.0499452 1.1462840 3.6943642 0.2238667
# [13] -1.4920746 3.3164273 1.4434411 -3.3261610 1.1518191 0.9455202
# [19] -1.0854633 2.2553741</code></pre>
<p>Now I have the fixed values of the parameters, the variable <code>group</code> on which to base the indicator variable, and the simulated errors drawn from the defined distribution. That’s all the pieces I need to calculate my response variable.</p>
<p>The statistical model</p>
<p><span class="math display">\[y_t = \beta_0 + \beta_1*I_{(group_t=\textit{''group2''})} + \epsilon_t\]</span></p>
<p>is my guide for how to combine these pieces to create the simulated response variable, <span class="math inline">\(y_t\)</span>. Notice I create the indicator variable in R with <code>group == "group2"</code> and call the simulated response variable <code>growth</code>.</p>
<pre class="r"><code>( growth = b0 + b1*(group == "group2") + eps )</code></pre>
<pre><code># [1] 5.952827 4.749240 7.192432 2.111542 7.295659 4.063176 2.988099
# [8] 5.127125 7.049945 6.146284 6.694364 3.223867 1.507925 6.316427
# [15] 4.443441 -0.326161 4.151819 3.945520 1.914537 5.255374</code></pre>
<p>It’s not necessary for this simple case, but I often store the variables I will use in fitting the model in a dataset to help keep things organized. This becomes more important when working with more variables. I’ll skip this step today.</p>
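<p>For anyone who does want that step, a minimal sketch might look like the following. The object names <code>dat</code> and <code>fit_dat</code> are made up for illustration, and the variables are re-created here so the code runs on its own.</p>
<pre class="r"><code>set.seed(16)

# Re-create the explanatory and response variables
group = rep( c("group1", "group2"), each = 10)
growth = 5 + -2*(group == "group2") + rnorm(20, 0, 2)

# Store the variables together in a dataset
dat = data.frame(group, growth)

# The model can then be fit using the data argument
fit_dat = lm(growth ~ group, data = dat)</code></pre>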
<p>Once the response and explanatory variables have been created, it’s time for model fitting. I can fit the two group linear model with <code>lm()</code>.</p>
<pre class="r"><code>growthfit = lm(growth ~ group)
summary(growthfit)</code></pre>
<pre><code>#
# Call:
# lm(formula = growth ~ group)
#
# Residuals:
# Min 1Q Median 3Q Max
# -4.039 -1.353 0.336 1.603 2.982
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 5.2676 0.6351 8.294 1.46e-07 ***
# groupgroup2 -1.5549 0.8982 -1.731 0.101
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 2.008 on 18 degrees of freedom
# Multiple R-squared: 0.1427, Adjusted R-squared: 0.0951
# F-statistic: 2.997 on 1 and 18 DF, p-value: 0.1005</code></pre>
</div>
<div id="make-a-function-for-the-simulation" class="section level1">
<h1>Make a function for the simulation</h1>
<p>A single simulation can help us understand the statistical model, but it doesn’t help us see how the model behaves over the long run. To repeat this simulation many times in R we’ll want to “functionize” the data simulating and model fitting process.</p>
<p>In my function I’m going to set all the arguments to the parameter values I’ve defined above. I allow some flexibility, though, so the argument values can be changed if I want to explore the simulation with different coefficient values, a different number of replications, or a different amount of random variation.</p>
<p>This function returns a linear model fit with <code>lm()</code>.</p>
<pre class="r"><code>twogroup_fun = function(nrep = 10, b0 = 5, b1 = -2, sigma = 2) {
ngroup = 2
group = rep( c("group1", "group2"), each = nrep)
eps = rnorm(ngroup*nrep, 0, sigma)
growth = b0 + b1*(group == "group2") + eps
growthfit = lm(growth ~ group)
growthfit
}</code></pre>
<p>I test the function, using the same seed, to make sure things are working as expected and that I get the same results as above.</p>
<pre class="r"><code>set.seed(16)
twogroup_fun()</code></pre>
<pre><code>#
# Call:
# lm(formula = growth ~ group)
#
# Coefficients:
# (Intercept) groupgroup2
# 5.268 -1.555</code></pre>
<p>If I want to change some element of the simulation, I can. Here’s a simulation from the same model but with a smaller standard deviation.</p>
<pre class="r"><code>twogroup_fun(sigma = 1)</code></pre>
<pre><code>#
# Call:
# lm(formula = growth ~ group)
#
# Coefficients:
# (Intercept) groupgroup2
# 5.313 -2.476</code></pre>
</div>
<div id="repeat-the-simulation-many-times" class="section level1">
<h1>Repeat the simulation many times</h1>
<p>Now that I have a working function to simulate data and fit the model, it’s time to do the simulation many times. The model from each individual simulation is saved to allow exploration of long run model performance.</p>
<p>This is a task I’ve commonly used <code>replicate()</code> for. The <code>rerun()</code> function from package <strong>purrr</strong> is equivalent to <code>replicate()</code> with <code>simplify = FALSE</code>, and I’ll use it here for convenience.</p>
<pre class="r"><code>library(purrr)</code></pre>
<p>I’ll run this simulation 1000 times, resulting in a list of fitted two-group linear models based on the simulation parameters I’ve set.</p>
<pre class="r"><code>sims = rerun(1000, twogroup_fun() )</code></pre>
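<p>Note that <code>rerun()</code> has been deprecated in newer versions of <strong>purrr</strong> (starting with version 1.0.0). If it isn’t available to you, the <code>replicate()</code> approach mentioned above does the same job. Here is a sketch, with the function repeated so the code runs on its own; the name <code>sims_alt</code> is made up for illustration.</p>
<pre class="r"><code>twogroup_fun = function(nrep = 10, b0 = 5, b1 = -2, sigma = 2) {
     ngroup = 2
     group = rep( c("group1", "group2"), each = nrep)
     eps = rnorm(ngroup*nrep, 0, sigma)
     growth = b0 + b1*(group == "group2") + eps
     lm(growth ~ group)
}

# simplify = FALSE keeps the output as a list of models
sims_alt = replicate(1000, twogroup_fun(), simplify = FALSE)
length(sims_alt)</code></pre>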
</div>
<div id="extracting-results-from-the-linear-model" class="section level1">
<h1>Extracting results from the linear model</h1>
<p>There are many elements of our model that we might be interested in exploring, including estimated coefficients, estimated standard deviations/variances, and the statistical results (test statistics/p-values).</p>
<p>To get the coefficients and statistical tests of coefficients we can use <code>tidy()</code> from package <strong>broom</strong>.</p>
<pre class="r"><code>library(broom)</code></pre>
<p>This returns the information on coefficients and tests of those coefficients in a tidy format that is easy to work with.</p>
<pre class="r"><code>tidy(growthfit)</code></pre>
<pre><code># term estimate std.error statistic p.value
# 1 (Intercept) 5.267633 0.6351230 8.293878 1.460563e-07
# 2 groupgroup2 -1.554922 0.8981996 -1.731154 1.005290e-01</code></pre>
<p>I have often been interested in understanding how the variances/standard deviations behave over the long run, in particular in mixed models. For a linear model we can extract an estimate of the residual standard deviation from the <code>summary()</code> output. This can be squared to get the variance as needed.</p>
<pre class="r"><code>summary(growthfit)$sigma</code></pre>
<pre><code># [1] 2.008435</code></pre>
</div>
<div id="simulation-results" class="section level1">
<h1>Simulation results</h1>
<p>Now for the fun part! Given we know the truth, how do the parameters behave over many samples?</p>
<p>To extract any results of interest I loop through the list of models, which I’ve stored in <code>sims</code>, and pull out the element of interest. Functions from package <strong>purrr</strong> are useful here for looping through the list of models. I’ll use functions from <strong>dplyr</strong> for any data manipulation and plot distributions via <strong>ggplot2</strong>.</p>
<pre class="r"><code>suppressMessages( library(dplyr) )
library(ggplot2)</code></pre>
<p><strong>Estimated differences in mean response</strong></p>
<p>As this is a linear model about differences among groups, the estimate of <span class="math inline">\(\beta_1\)</span> is one of the statistics of primary interest. What does the distribution of differences in mean growth between groups look like? Here’s a density plot.</p>
<pre class="r"><code>sims %>%
map_df(tidy) %>%
filter(term == "groupgroup2") %>%
ggplot( aes(estimate) ) +
geom_density(fill = "blue", alpha = .5) +
geom_vline( xintercept = -2)</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-01-09-simulate-simulate-part-1-a-linear-model_files/figure-html/unnamed-chunk-16-1.png" width="672" /></p>
<p>It’s simulation results like this one, from a scenario involving relatively few samples of a noisy measurement, that I think can be so compelling. Sure, “on average” we get the correct result, as the peak is right around the true value of <code>-2</code>. However, there is quite a range in the estimated coefficient across simulations, with some samples leading to overestimation and some to underestimation of the parameter. Some models even get the sign of the coefficient wrong. See Gelman and Carlin’s 2014 paper, <a href="http://www.stat.columbia.edu/~gelman/research/published/retropower_final.pdf">Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors</a> if you are interested in further discussion.</p>
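<p>How often the sign comes out wrong can be checked directly. Below is a rough, self-contained sketch in base R; the helper <code>sim_est()</code> and the objects <code>ests</code> and <code>theory</code> are made up for illustration, and the seed is arbitrary. For comparison I also calculate the proportion expected from theory, using the fact that the estimated difference is normally distributed around <code>-2</code> with a standard deviation of <span class="math inline">\(\sigma\sqrt{2/10}\)</span> in this balanced design.</p>
<pre class="r"><code>set.seed(16)

# Simulate one dataset and return only the estimated difference
sim_est = function(nrep = 10, b0 = 5, b1 = -2, sigma = 2) {
     group = rep( c("group1", "group2"), each = nrep)
     growth = b0 + b1*(group == "group2") + rnorm(2*nrep, 0, sigma)
     coef( lm(growth ~ group) )[["groupgroup2"]]
}

ests = replicate(1000, sim_est() )

# Proportion of simulations where the estimate is positive (wrong sign)
mean(ests > 0)

# Proportion expected from theory
theory = pnorm(0, mean = -2, sd = 2*sqrt(2/10), lower.tail = FALSE)
theory</code></pre>
<p>Both numbers should be small, on the order of 1%, so wrong-sign estimates are rare in this scenario but they do happen.</p>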
<p><strong>Estimates of the standard deviation</strong></p>
<p>I can do a similar plot exploration of the residual standard deviation, extracting <code>sigma</code> from the model object and plotting it as a density plot.</p>
<pre class="r"><code>sims %>%
map_dbl(~summary(.x)$sigma) %>%
data.frame(sigma = .) %>%
ggplot( aes(sigma) ) +
geom_density(fill = "blue", alpha = .5) +
geom_vline(xintercept = 2)</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-01-09-simulate-simulate-part-1-a-linear-model_files/figure-html/unnamed-chunk-17-1.png" width="672" /></p>
<p>The estimated standard deviation ranges from about 1 to just over 3, and the distribution is roughly centered on the true value of 2. Like with the coefficient above, the model performs pretty well on average but the estimate from any single model can be quite far from the true standard deviation.</p>
<p>The standard deviation is underestimated a bit more than 50% of the time.</p>
<pre class="r"><code>sims %>%
map_dbl(~summary(.x)$sigma) %>%
{. < 2} %>%
mean()</code></pre>
<pre><code># [1] 0.539</code></pre>
<p><strong>Statistical results</strong></p>
<p>If the goal of a simulation is to get an idea of the statistical power of a test we could look at the proportion of times the null hypothesis was rejected given a fixed alpha level (often 0.05, but of course it can be something else).</p>
<p>Here the proportion of models that correctly rejected the null hypothesis, given that we know it’s not true, is just over 56%. That’s an estimate of statistical power.</p>
<pre class="r"><code>sims %>%
map_df(tidy) %>%
filter(term == "groupgroup2") %>%
pull(p.value) %>%
{. < 0.05} %>%
mean()</code></pre>
<pre><code># [1] 0.563</code></pre>
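<p>For a model this simple there is an analytical power calculation to compare against. Here is a sketch using <code>power.t.test()</code> from the <strong>stats</strong> package with the same settings as the simulation (10 observations per group, a true difference of 2, a standard deviation of 2). The object name <code>pow</code> is made up for illustration.</p>
<pre class="r"><code># Two-sample, two-sided t-test is the default
pow = power.t.test(n = 10, delta = 2, sd = 2, sig.level = 0.05)
pow$power</code></pre>
<p>The analytical power should land close to the simulated proportion. The real payoff of the simulation approach comes with more complicated models, where no such convenient function exists.</p>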
<p>Those are the basics that I generally pull out of models, but any output from the model is fair game. For linear models you could pull out <span class="math inline">\(R^2\)</span> or the overall F-test, etc.</p>
</div>
<div id="where-to-go-from-here" class="section level1">
<h1>Where to go from here?</h1>
<p>I’ll do future posts about simulating from more complicated linear models, likely starting with linear mixed models. In particular I will explore interesting issues that crop up when estimating variances.</p>
</div>
Fiesta 2017 gas mileage
https://aosmith.rbind.io/2018/01/03/fiesta-2017-gas-mileage/
Wed, 03 Jan 2018 00:00:00 +0000https://aosmith.rbind.io/2018/01/03/fiesta-2017-gas-mileage/<p>It’s January - time to see how we did on car gas mileage last year!</p>
<p>We purchased a used 2011 Ford Fiesta in 2016. It replaced a 1991 Honda CRX, which was running like a champ but was starting to look a little, well, limited in its safety features. 👅</p>
<p>I drive ~70 commuting miles each day, and we wanted a car we could afford that was competitive with the CRX on gas mileage. The CRX averaged solidly just over 40 mpg.</p>
<div class="figure"><span id="fig:unnamed-chunk-1"></span>
<img src="https://aosmith.rbind.io/img/fiesta.jpg" alt="The dog wonders if the gas mileage will be good enough" />
<p class="caption">
Figure 1: The dog wonders if the gas mileage will be good enough
</p>
</div>
<div id="the-gas-mileage-data" class="section level1">
<h1>The gas mileage data</h1>
<p>We record gas mileage for all our vehicles on Google Sheets, so I can read the data in directly from there.</p>
<pre class="r"><code>library(googlesheets) # v 0.2.2
library(skimr) # v. 1.0.1
library(dplyr) # v. 0.7.4</code></pre>
<p>The workbook has a sheet for every year of ownership (into the future! 😆) plus service records.</p>
<pre class="r"><code>gs_title("Fiesta mpg")</code></pre>
<pre><code>## Sheet successfully identified: "Fiesta mpg"</code></pre>
<pre><code>## Spreadsheet title: Fiesta mpg
## Spreadsheet author: skylarkguy
## Date of googlesheets registration: 2018-08-17 01:01:45 GMT
## Date of last spreadsheet update: 2018-08-16 01:56:05 GMT
## visibility: private
## permissions: rw
## version: new
##
## Contains 6 worksheets:
## (Title): (Nominal worksheet extent as rows x columns)
## 2016: 1000 x 26
## Service records: 1000 x 26
## 2017: 1000 x 26
## 2018: 1000 x 26
## 2019: 1000 x 26
## 2020: 1000 x 26
##
## Key: 1xzNrd6c3sWYIxPciREz3nmY8hcol4KJ5qekNqvXyLso
## Browser URL: https://docs.google.com/spreadsheets/d/1xzNrd6c3sWYIxPciREz3nmY8hcol4KJ5qekNqvXyLso/</code></pre>
<pre class="r"><code>fiesta2017 = gs_title("Fiesta mpg") %>%
gs_read("2017")</code></pre>
<pre><code>## Sheet successfully identified: "Fiesta mpg"</code></pre>
<pre><code>## Accessing worksheet titled '2017'.</code></pre>
<pre><code>## Parsed with column specification:
## cols(
## date = col_character(),
## gallons = col_double(),
## cost = col_double(),
## mileage = col_double(),
## mpggage = col_double(),
## mpgcalc = col_double()
## )</code></pre>
<p>We calculate gas mileage (<code>mpgcalc</code>) based on recorded gallons and mileage driven every time we get gas. We also record what the car estimated the gas mileage would be (<code>mpggage</code>).</p>
<pre class="r"><code>head(fiesta2017)</code></pre>
<pre><code>## # A tibble: 6 x 6
## date gallons cost mileage mpggage mpgcalc
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1/2/2017 9.78 24.0 359. 38.3 36.8
## 2 1/10/2017 8.87 22.2 330. 38.2 37.2
## 3 1/17/2017 10.4 25.9 NA 39.2 0
## 4 1/20/2017 7.15 17.7 297. 40.6 41.6
## 5 1/24/2017 7.96 19.7 311. 40.4 39.1
## 6 1/30/2017 9.60 23.6 386. 41.1 40.2</code></pre>
<p>Here’s my first chance to use functions from package <strong>skimr</strong> to get a quick summary of the variables in the dataset. (I’ve removed the spark histograms since I haven’t gotten them to play nicely in HTML.)</p>
<pre class="r"><code>skim(fiesta2017)</code></pre>
<pre><code>## Skim summary statistics
## n obs: 64
## n variables: 6
##
## -- Variable type:character ---------------------------------------------------
## variable missing complete n min max empty n_unique
## date 0 64 64 8 10 0 62
##
## -- Variable type:numeric -----------------------------------------------------
## variable missing complete n mean sd p0 p25 p50 p75
## cost 0 64 64 23.75 3.68 12.23 22.65 24.05 26.39
## gallons 0 64 64 8.83 1.33 5.01 8.13 9.13 9.81
## mileage 1 63 64 362.11 66.85 194.9 327.95 373.4 397.9
## mpgcalc 0 64 64 40.67 8.87 0 39.1 40.35 41.7
## mpggage 0 64 64 41.23 1.55 37.9 40.3 41.2 42.2
## p100
## 30.82
## 10.63
## 653.7
## 96.2
## 46.8</code></pre>
<p>I’ll need to convert <code>date</code> to a date and remove that missing value from <code>mileage</code> before proceeding (looks like we forgot to enter the mileage on one gas stop).</p>
<pre class="r"><code>fiesta2017 = fiesta2017 %>%
mutate(date = as.Date(date, format = "%m/%d/%Y") ) %>%
filter( !is.na(mileage) )</code></pre>
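<p>The <code>format</code> string has to match how the dates were typed into the sheet, which here is month/day/four-digit year. A quick sketch on a few made-up strings shows what the conversion does, including what happens when the format doesn’t match. The object names are made up for illustration.</p>
<pre class="r"><code># Dates as character strings, written month/day/year
chardates = c("1/2/2017", "1/10/2017", "12/31/2017")

# %m = month, %d = day, %Y = four-digit year
realdates = as.Date(chardates, format = "%m/%d/%Y")
realdates

# A format that doesn't match the strings returns NA instead of an error
baddates = as.Date(chardates, format = "%Y-%m-%d")
baddates</code></pre>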
<p>The <code>skim</code> output shows an odd <code>mpgcalc</code> value as the max, with a mpg over 90.</p>
<pre class="r"><code>filter(fiesta2017, mpgcalc > 50)</code></pre>
<pre><code>## # A tibble: 1 x 6
## date gallons cost mileage mpggage mpgcalc
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2017-06-05 6.80 22.4 654. 43.3 96.2</code></pre>
<p>Turns out something weird happened on June 5th. There are two data points, one that looks pretty standard and the other that is impossibly high for <code>mpgcalc</code>.</p>
<pre class="r"><code>filter(fiesta2017, date == as.Date("2017-06-05") )</code></pre>
<pre><code>## # A tibble: 2 x 6
## date gallons cost mileage mpggage mpgcalc
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2017-06-05 8.12 23.6 327 41.9 40.3
## 2 2017-06-05 6.80 22.4 654. 43.3 96.2</code></pre>
<p>I’ll have to remove the odd data point as I’m not sure what the mistake is (changing the mileage to <code>3</code> instead of <code>6</code> seems a reasonable guess but that still led to something impossibly large).</p>
<pre class="r"><code>fiesta2017 = filter(fiesta2017, mpgcalc <= 50)</code></pre>
</div>
<div id="plot-gas-mileage-through-time" class="section level1">
<h1>Plot gas mileage through time</h1>
<p>I’ll use <strong>ggplot2</strong> for plotting.</p>
<pre class="r"><code>library(ggplot2) # v. 2.2.1</code></pre>
<p><strong>Observed gas mileage over the year</strong></p>
<p>Here’s a plot of calculated gas mileage over the year, plotted via <strong>ggplot2</strong>. I put a horizontal line at 40 mpg and one at the annual average observed mpg to get an idea of how we’re meeting the “40 mpg” goal.</p>
<pre class="r"><code>ggplot(fiesta2017, aes(date, mpgcalc) ) +
geom_line() +
theme_bw() +
geom_hline(aes(yintercept = 40, color = "40 mpg") ) +
geom_hline(aes(yintercept = mean(fiesta2017$mpgcalc), colour = "Average observed mpg") ) +
labs(y = "Miles per gallon", x = NULL) +
scale_x_date(date_breaks = "1 month",
date_labels = "%b",
limits = c( as.Date("2017-01-01"), as.Date("2017-12-31") ),
expand = c(0, 0) ) +
scale_color_manual(name = NULL, values = c("black", "#009E73") ) +
theme(legend.position = "bottom",
legend.direction = "horizontal")</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-01-03-fiesta-2017-gas-mileage_files/figure-html/unnamed-chunk-12-1.png" width="2100" /></p>
<p>That high point is a high value for both the observed and estimated mpg, so I’m guessing the driving conditions were good for that tank. 😄</p>
<pre class="r"><code>filter(fiesta2017, mpgcalc > 45)</code></pre>
<pre><code>## # A tibble: 1 x 6
## date gallons cost mileage mpggage mpgcalc
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2017-10-03 8.37 25.6 401. 46.8 48</code></pre>
<p><strong>Compare observed mpg vs car-estimated mpg</strong></p>
<p>It’s fun to watch how the estimated mpg is affected by conditions while I’m driving (e.g., AC, defrost, wind), but it looks like whatever that algorithm is tends to overestimate the gas mileage. But not always!</p>
<pre class="r"><code>ggplot(fiesta2017, aes(date, mpgcalc) ) +
geom_line( aes(color = "Calculated mpg") ) +
geom_line( aes(y = mpggage, color = "Estimated mpg") ) +
theme_bw() +
labs(y = "Miles per gallon", x = NULL) +
scale_x_date(date_breaks = "1 month",
date_labels = "%b",
limits = c( as.Date("2017-01-01"), as.Date("2017-12-31") ),
expand = c(0, 0) ) +
scale_color_manual(name = NULL, values = c("black", "#009E73") ) +
theme(legend.position = "bottom",
legend.direction = "horizontal")</code></pre>
<p><img src="https://aosmith.rbind.io/post/2018-01-03-fiesta-2017-gas-mileage_files/figure-html/unnamed-chunk-14-1.png" width="2100" /></p>
</div>
Combining many datasets in R
https://aosmith.rbind.io/2017/12/31/many-datasets/
Sun, 31 Dec 2017 00:00:00 +0000https://aosmith.rbind.io/2017/12/31/many-datasets/<p>At least once a year I meet with a graduate student who has many separate datasets that need to be combined into a single file. The data are usually from a series of data loggers (e.g., iButtons or RFID readers) that record data remotely over a specified time period. The researcher periodically downloads the data from each data logger and then redeploys it for further data collection.</p>
<p>I’m going to set up the background for my particular use case before jumping into the R code to perform this sort of task. <strong>Go straight to “List all files to read in” if you want to get right into R code.</strong></p>
<div id="whats-so-hard-about-reading-in-many-datasets" class="section level1">
<h1>What’s so hard about reading in many datasets?</h1>
<p>For someone who is at least somewhat familiar with a programming language (e.g., SAS, Python, R), reading many datasets in and combining them into a single file might not seem like a big deal. For example, if I do a quick web search on “r read many datasets” I get at least 5 Stack Overflow posts (with answers) as well as several blog entries. These links show code for relatively simple situations of reading many identical datasets into R (a couple SO examples can be found <a href="https://stackoverflow.com/questions/17271833/how-do-i-read-in-multiple-data-sets">here</a> and <a href="https://stackoverflow.com/questions/32888757/reading-multiple-files-into-r-best-practice">here</a>).</p>
<p>However, in my experience this work doesn’t feel very simple to beginner programmers. Most of the graduate students I meet with have never worked with any sort of programming language prior to entering their degree program. By the time I meet with them they usually have had a very basic introduction to R in their intro statistics courses. They may have read a dataset into R only a couple of times at most, and now they have hundreds of them to manage. To further complicate things, there is usually more work that needs to be done beyond reading the datasets in, such as adding important identifying information to each dataset.</p>
<p><em>Aside: I am couching this around R because that’s what is taught in the intro courses in the Statistics Department at my university. I still get a few requests every year for help with SAS programming from faculty and post-docs and so far over six years I’ve had exactly one student client who worked primarily with Python.</em></p>
</div>
<div id="why-would-i-want-to-have-to-do-this-in-r-or-sas-or-python-or" class="section level1">
<h1>Why would I want to have to do this in R (or SAS or Python or …)?</h1>
<p>That is the question I got from the first student I ever advised on this topic. She was collecting data using many data loggers in a field experiment set up as a randomized complete block design with repeated measures. She had 300 comma-delimited files that needed to be concatenated together from her first field season and was planning on a second season that was at least twice as long (so would have at least twice as many files).</p>
<p>Her research collaborators had used these loggers previously, and had given her the following algorithm to follow:</p>
<ol style="list-style-type: decimal">
<li>Open each file in Excel</li>
<li>Manually delete the first 15 rows, which contained information about the data logger that wasn’t related to the study</li>
<li>Add columns to indicate the physical study units the data was collected in (“Block”, “Site”, “Plot”)</li>
<li>Copy and paste into a new, combined file</li>
<li>Repeat with all datasets</li>
<li>Name columns</li>
</ol>
<p>Me:
<img src="https://aosmith.rbind.io/img/2017-12-31_cat_disgusted.png" /><!-- --></p>
<p>But really this gave me a chance to discuss reproducibility and the convenience of using computers to do repetitive tasks with the next generation of researchers. While she would need to expend effort understanding R code, the effort in the short term would be valuable in the long term given she was going to do a second field season.</p>
<p>To be honest, she was pretty skeptical that it made sense to use a programming language to do the work. From her perspective, doing the work via R looked more difficult and more “black-box” than manually copying and pasting in Excel. It helped that we found mistakes in the files she’d already edited when we were setting up the R code (it’s so easy to make mistakes when copying and pasting 300 times!).</p>
<p>Her skepticism continues to be a good reminder to me of what it feels like to be a beginner in a programming language, where you don’t quite trust that the program is doing what you want it to and you don’t exactly fully understand the code. The manual approach you already know how to do can look pretty darn attractive by comparison.</p>
</div>
<div id="list-all-files-to-read-in" class="section level1">
<h1>List all files to read in</h1>
<p>When reading in many files, my first two tasks are:</p>
<ul>
<li>Getting a list of files to read<br />
</li>
<li>Figuring out the steps needed to read and manipulate a single file</li>
</ul>
<p>Here I’ll start by getting a list of the files. I’m using some toy files I made to mirror what the real files looked like. These CSV files are available on the <a href="https://github.com/aosmith16/aosmith/tree/master/static/data">blog GitHub repository</a>.</p>
<p>Listing all files can be done in R via <code>list.files</code> or <code>dir</code>. I’ll use <code>list.files</code> here (no reason, just habit most likely based on what I learned first).</p>
<p>For this particular case I will use four of the arguments in <code>list.files</code>:</p>
<ol style="list-style-type: decimal">
<li><p>The directory containing the files needs to be defined in the <code>"path"</code> argument of <code>list.files</code>. I’m working within an RStudio Project, and will use the <code>here</code> package to indicate the directory the files are in relative to the root directory. See <a href="https://www.tidyverse.org/articles/1/01/">Jenny Bryan’s post here</a> on the merits of <code>here</code> and self-contained projects.</p></li>
<li><p>The <code>"pattern"</code> argument is used to tell R which file paths should be listed. If you want to read in all CSV files in a directory, for example, the pattern to match might be <code>".csv"</code>. In the real scenario, there were additional CSV files in the directory that we didn’t want to read. All the files we wanted to read ended in “AB.csv”, so we first defined the pattern as <code>"AB.csv"</code>. Later we realized that some file names were all lowercase, so used <code>"AB.csv|ab.csv"</code>. The vertical pipe, <code>|</code>, stands for “or”.</p></li>
<li><p>The <code>"recursive"</code> argument is used to indicate whether child folders in the parent directory should also be searched for files to list. It defaults to <code>FALSE</code>. The files in this particular case are not stored in a single folder. Instead they are in child folders within an overall “data” directory. The names of the child folders actually indicate the study units (“Blocks” and “Sites”) the data were collected in.</p></li>
<li><p>The <code>"full.names"</code> argument is used to indicate if the complete file paths should be returned or only the relative file paths. In this case, the only place the information about some of the physical units of the study (“Blocks” and “Sites”) is stored is in the directory path. We needed the full path names in order to extract that information and add it to the dataset.</p></li>
</ol>
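<p>One thing to keep in mind is that <code>"pattern"</code> is treated as a regular expression, so an unescaped period matches any character and the pattern can match anywhere in the file name. Here is a minimal sketch of a stricter pattern, using a temporary folder and file names I made up for illustration (these are not the real files). It escapes the period, anchors the match to the end of the name with <code>$</code>, and uses the <code>ignore.case</code> argument instead of listing both cases.</p>
<pre class="r"><code># Toy directory with made-up file names (not the real data)
toy_dir = file.path(tempdir(), "toy_data")
dir.create(file.path(toy_dir, "Block1", "Siteone"),
           recursive = TRUE, showWarnings = FALSE)
toy_names = c("SIT1_17_12_21_5.2_AB.csv",
              "sit1_17_12_31_2.2_ab.csv",
              "notes.csv",
              "AB.csv.bak")
invisible( file.create( file.path(toy_dir, "Block1", "Siteone", toy_names) ) )

# Escape the period, anchor to the end of the name, and ignore case
toy_files = list.files(path = toy_dir,
                       pattern = "AB\\.csv$",
                       ignore.case = TRUE,
                       full.names = TRUE,
                       recursive = TRUE)
basename(toy_files)</code></pre>
<p>Only the two logger files should be listed; “notes.csv” and “AB.csv.bak” are left out.</p>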
<pre class="r"><code>library(here) # v. 0.1</code></pre>
<p>The <code>list.files</code> function returns a vector of file paths.</p>
<pre class="r"><code>( allfiles = list.files(path = here("static", "data"),
pattern = "AB.csv|ab.csv",
full.names = TRUE,
recursive = TRUE) )</code></pre>
<pre><code># [1] "C:/Users/Owner/Documents/Aosmith/Blog/aosmith/static/data/Block1/Siteone/SIT1_17_12_21_5.2_AB.csv"
# [2] "C:/Users/Owner/Documents/Aosmith/Blog/aosmith/static/data/Block1/Siteone/sit1_17_12_31_2.2_ab.csv"
# [3] "C:/Users/Owner/Documents/Aosmith/Blog/aosmith/static/data/Block1/Siteone/SIT1_17_12_9_5.2_AB.csv"
# [4] "C:/Users/Owner/Documents/Aosmith/Blog/aosmith/static/data/Block2/Sitenew/SIT1_17_12_10_3.2_AB.csv"
# [5] "C:/Users/Owner/Documents/Aosmith/Blog/aosmith/static/data/Block2/Sitenew/SIT1_17_12_21_3.2_AB.csv"
# [6] "C:/Users/Owner/Documents/Aosmith/Blog/aosmith/static/data/Block2/Sitenew/SIT1_17_12_31_5.2_AB.csv"</code></pre>
</div>
<div id="practice-reading-in-one-file" class="section level1">
<h1>Practice reading in one file</h1>
<p>I find things go more smoothly if I work out the file-reading process with a single file before I try to read a bunch of files. It’s an easy step to want to skip because it feels more efficient to do “everything at once”. I’ve never found that to actually be the case. 👅</p>
<p>I’ll practice with the first file listed in <code>allfiles</code>. The top 6 lines of the raw data file are all extraneous header information, which will be skipped via <code>"skip"</code>. There are no column headers (<code>"header"</code>) in the file, so those need to be added (<code>"col.names"</code>).</p>
<pre class="r"><code>( test = read.csv(allfiles[1],
skip = 6,
header = FALSE,
col.names = c("date", "temperature") ) )</code></pre>
<pre><code># date temperature
# 1 15 9
# 2 16 8
# 3 17 15
# 4 18 9
# 5 19 10</code></pre>
<p>That went pretty smoothly, but things get a little hairy from here. The information on the physical units of the study, “Blocks” and “Sites”, is contained only in the file directory path. These need to be extracted from the file path and added to the dataset.</p>
<p>In addition, the “Plot” information is contained in the file name. Plot names are single numbers that are found directly before the period in the file name. In <code>allfiles[1]</code> that number is 5 (the file name is “SIT1_17_12_21_5.2_AB.csv”).</p>
<p>Last, the final two characters of the file name (just before “.csv”) are a code to indicate where the data logger was located. This also needs to be added to the dataset. In the toy example these values are all “AB”, but in the larger set of files this wasn’t true.</p>
<p>All the tasks above are string manipulation tasks. I will tackle these with the functions from the <strong>stringr</strong> package.</p>
<pre class="r"><code>library(stringr) # v. 1.2.0</code></pre>
<p><strong>Extract “Block” names from the file path</strong></p>
<p>Since some information is located within the file path string, splitting the file path up into separate pieces seems like a reasonable first step. This can be done via <code>str_split</code> using <code>"/"</code> as the symbol to split on. As there is only a single character string to split for each dataset, it is convenient to return a matrix instead of a list via <code>simplify = TRUE</code>.</p>
<p>The result is a matrix containing strings in each column.</p>
<pre class="r"><code>( allnames = str_split( allfiles[1], pattern = "/", simplify = TRUE) )</code></pre>
<pre><code># [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# [1,] "C:" "Users" "Owner" "Documents" "Aosmith" "Blog" "aosmith" "static"
# [,9] [,10] [,11] [,12]
# [1,] "data" "Block1" "Siteone" "SIT1_17_12_21_5.2_AB.csv"</code></pre>
<p>The “Block” information is always the third column if counting from the end (it’s the 10th in this case if counting from the beginning). It’s “safer” (i.e., less likely to fail on a different computer) to count from the last column and work backwards here; on a different computer the full file paths may change length but the directory containing the files to read will be the same.</p>
<p>To extract the third column from the end I take the total number of columns and subtract 2.</p>
<pre class="r"><code>allnames[, ncol(allnames) - 2]</code></pre>
<pre><code># [1] "Block1"</code></pre>
<p>This can be added to the dataset as a “block” variable.</p>
<pre class="r"><code>test$block = allnames[, ncol(allnames) - 2]
test</code></pre>
<pre><code># date temperature block
# 1 15 9 Block1
# 2 16 8 Block1
# 3 17 15 Block1
# 4 18 9 Block1
# 5 19 10 Block1</code></pre>
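<p>As an aside, base R can pull the same pieces out of a path without counting matrix columns. This is only a sketch of an alternative, not what we used; the path below is a made-up one with the same layout as the toy files. The <code>basename</code> function returns the last piece of a path and <code>dirname</code> returns everything before it, so nesting them walks up the folders from the end.</p>
<pre class="r"><code># A made-up path with the same layout as the toy files
toy_path = "C:/Users/Owner/Documents/data/Block1/Siteone/SIT1_17_12_21_5.2_AB.csv"

# File name (last piece of the path)
toy_file = basename(toy_path)

# Parent folder, which holds the "Site" name
toy_site = basename( dirname(toy_path) )

# Grandparent folder, which holds the "Block" name
toy_block = basename( dirname( dirname(toy_path) ) )

c(toy_block, toy_site, toy_file)</code></pre>
<p>Like counting columns from the end, this doesn’t depend on how long the first part of the path is on a particular computer.</p>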
<p><strong>Extract “Site” names from the file path</strong></p>
<p>This will be the same as above, except site names are contained in the second-to-last column.</p>
<pre class="r"><code>test$site = allnames[, ncol(allnames) - 1]</code></pre>
<p><strong>Extract “Plot” names from the file name</strong></p>
<p>The last character string in our matrix is the file name, which contains the plot name and logger location. In the test case the plot name is “5” and the logger location is “AB”.</p>
<pre class="r"><code>allnames[, ncol(allnames)]</code></pre>
<pre><code># [1] "SIT1_17_12_21_5.2_AB.csv"</code></pre>
<p>This can be split on the underscores and periods and the “Plot” information extracted in much the same way as the “Block” information. I think this option can feel more approachable to beginners and is a reasonable way to solve the problem.</p>
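<p>A minimal sketch of that splitting approach is below. I use base R’s <code>strsplit</code> here to keep the example self-contained (<code>str_split</code> from <strong>stringr</strong> should work much the same way), and the object names are just for illustration. The pattern <code>"[_.]"</code> means “split on either an underscore or a period”.</p>
<pre class="r"><code>toy_name = "SIT1_17_12_21_5.2_AB.csv"

# Split on either an underscore or a period
pieces = strsplit(toy_name, split = "[_.]")[[1]]
pieces

# Counting from the end: extension, logger location, number after the period, plot
toy_plot = pieces[length(pieces) - 3]
toy_logloc = pieces[length(pieces) - 1]
c(toy_plot, toy_logloc)</code></pre>
<p>As with the file path, counting from the end is the safer choice because the date portion of the name varies in length.</p>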
<p>Another option is to define which part of the file name we want to pull out. This involves using <em>regular expressions</em>. I personally find regular expressions quite difficult and invariably turn to Stack Overflow to find the answers. I will use them here to demonstrate a wider breadth of options.</p>
<p>A basic introduction to regular expressions is on the second page of the “Work with Strings” <a href="https://www.rstudio.com/resources/cheatsheets/">cheatsheet from RStudio</a>. A more in-depth set of examples can be found on the <a href="http://stat545.com/block022_regular-expression.html">UBC STAT 545 page here</a>.</p>
<p>I used <code>str_extract</code> from <strong>stringr</strong> with a regular expression for the <code>"pattern"</code> to extract the plot number. In this case I used a <a href="http://www.regular-expressions.info/lookaround.html">lookaround</a> following this <a href="https://stackoverflow.com/a/35404422/2461552">Stack Overflow answer</a>. These can apparently be costly in terms of performance, but I did not find that to be a deterrent in my case. 😄</p>
<p>The regular expression I use indicates I want to extract a digit (<code>[0-9]</code>) that comes immediately before a period. The <code>(?=\\.)</code> is a <em>positive lookahead</em>, telling R that the digit to match will be followed by a period. The plot names are always just in front of a period, which is why this works.</p>
<pre class="r"><code>str_extract(allnames[, ncol(allnames)], pattern = "[0-9](?=\\.)")</code></pre>
<pre><code># [1] "5"</code></pre>
<p>This can then be added to the dataset.</p>
<pre class="r"><code>test$plot = str_extract(allnames[, ncol(allnames)], pattern = "[0-9](?=\\.)")</code></pre>
<p><strong>Extract data logger location from the file name</strong></p>
<p>The last thing to do is extract the data logger location code from the file names. These are the two characters immediately before “.csv” in the file name.</p>
<p>The <code>str_sub</code> function from <strong>stringr</strong> is useful for extracting characters from fixed locations in a string. The logger location information is in the same position in every file name if counting from the end of the string. The <code>str_sub</code> function allows the user to pull information counting from the end of the string as well as from the beginning. Because our file names differ in length due to the way that dates are stored, the location data does <em>not</em> always have the same indices if counting characters from the beginning of the string.</p>
<p>Negative indices are used to extract from the end of the string. The location information is stored in the 5th and 6th positions from the end. The negative number largest in absolute value is passed to <code>start</code> and the smallest in absolute value to <code>end</code>. (I forget this pretty much every time I subset strings from the end.)</p>
<pre class="r"><code>str_sub(allnames[, ncol(allnames)], start = -6, end = -5)</code></pre>
<pre><code># [1] "AB"</code></pre>
<p>The location data can then be added to the dataset. Since at least one of the file names is all lowercase, I make sure the data logger location information is converted to all caps via <code>toupper</code>.</p>
<pre class="r"><code>test$logloc = toupper( str_sub(allnames[, ncol(allnames)], start = -6, end = -5) )</code></pre>
<p>Here’s what the test dataset looks like now.</p>
<pre class="r"><code>test</code></pre>
<pre><code># date temperature block site plot logloc
# 1 15 9 Block1 Siteone 5 AB
# 2 16 8 Block1 Siteone 5 AB
# 3 17 15 Block1 Siteone 5 AB
# 4 18 9 Block1 Siteone 5 AB
# 5 19 10 Block1 Siteone 5 AB</code></pre>
</div>
<div id="make-a-function-to-read-all-the-files" class="section level1">
<h1>Make a function to read all the files</h1>
<p>Once the process is worked out for one file, I can “functionize” it (i.e., make a function). This allows me to apply the exact same procedure to every dataset as it is read in.</p>
<p>The function I create below takes a single argument: the file path of the dataset. The function reads the file and then adds all the desired columns. It returns the modified dataset.</p>
<pre class="r"><code>read_fun = function(path) {
test = read.csv(path,
skip = 6,
header = FALSE,
col.names = c("date", "temperature") )
allnames = str_split( path, pattern = "/", simplify = TRUE)
test$block = allnames[, ncol(allnames) - 2]
test$site = allnames[, ncol(allnames) - 1]
test$plot = str_extract(allnames[, ncol(allnames)], pattern = "[0-9](?=\\.)")
test$logloc = toupper( str_sub(allnames[, ncol(allnames)], start = -6, end = -5) )
test
}</code></pre>
<p>I’ll test the function with that first file path again to make sure it works like I expect it to.</p>
<pre class="r"><code>read_fun(allfiles[1])</code></pre>
<pre><code># date temperature block site plot logloc
# 1 15 9 Block1 Siteone 5 AB
# 2 16 8 Block1 Siteone 5 AB
# 3 17 15 Block1 Siteone 5 AB
# 4 18 9 Block1 Siteone 5 AB
# 5 19 10 Block1 Siteone 5 AB</code></pre>
<p>Looks good!</p>
</div>
<div id="read-all-the-files" class="section level1">
<h1>Read all the files</h1>
<p>All that’s left to do now is to loop through all the file paths in <code>allfiles</code>, read and modify each one with my function, and stack them together into a single dataset. This can be done in base R with a <code>for</code> or <code>lapply</code> loop. If using either of those options, the final concatenation step can be done via <code>rbind</code> in <code>do.call</code>.</p>
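<p>Here is a rough sketch of what the base R route can look like. To keep it runnable on its own I write two made-up CSV files to a temporary folder and use a small stand-in function, <code>toy_read</code>, in place of <code>read_fun</code>; both are for illustration only.</p>
<pre class="r"><code># Write two toy CSV files to a temporary folder (made-up data)
toy_dir = file.path(tempdir(), "toy_combine")
dir.create(toy_dir, showWarnings = FALSE)
write.csv(data.frame(date = 1:3, temperature = c(9, 8, 15) ),
          file.path(toy_dir, "a_AB.csv"), row.names = FALSE)
write.csv(data.frame(date = 1:2, temperature = c(12, 15) ),
          file.path(toy_dir, "b_AB.csv"), row.names = FALSE)
toy_files = list.files(toy_dir, pattern = "AB\\.csv$", full.names = TRUE)

# Stand-in for read_fun: read one file and record which file it came from
toy_read = function(path) {
     dat = read.csv(path)
     dat$file = basename(path)
     dat
}

# lapply returns a list of data frames; do.call(rbind, ...) stacks them
toy_list = lapply(toy_files, toy_read)
toy_combined = do.call(rbind, toy_list)

# The same thing with a for loop that fills a pre-made list
toy_list2 = vector("list", length(toy_files) )
for (i in seq_along(toy_files) ) {
     toy_list2[[i]] = toy_read(toy_files[i])
}
identical(toy_combined, do.call(rbind, toy_list2) )</code></pre>
<p>Either way the work is in two steps: build a list with one dataset per file, then bind the list together at the end rather than growing a data frame inside the loop.</p>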
<p>These days I’ve been using the <code>map</code> functions from package <strong>purrr</strong> for this, mostly because the <code>map_dfr</code> variant conveniently binds everything together by rows for me.</p>
<pre class="r"><code>library(purrr) # v. 0.2.3</code></pre>
<p>Here’s what using <code>map_dfr</code> looks like, looping through each element of <code>allfiles</code> to read and modify the datasets with the <code>read_fun</code> function and then stacking everything together into a final combined dataset.</p>
<pre class="r"><code>( combined_dat = map_dfr(allfiles, read_fun) )</code></pre>
<pre><code># date temperature block site plot logloc
# 1 15 9 Block1 Siteone 5 AB
# 2 16 8 Block1 Siteone 5 AB
# 3 17 15 Block1 Siteone 5 AB
# 4 18 9 Block1 Siteone 5 AB
# 5 19 10 Block1 Siteone 5 AB
# 6 1 12 Block1 Siteone 2 AB
# 7 2 15 Block1 Siteone 2 AB
# 8 3 21 Block1 Siteone 2 AB
# 9 4 20 Block1 Siteone 2 AB
# 10 5 20 Block1 Siteone 2 AB
# 11 6 13 Block1 Siteone 2 AB
# 12 1 10 Block1 Siteone 5 AB
# 13 2 19 Block1 Siteone 5 AB
# 14 3 17 Block1 Siteone 5 AB
# 15 4 6 Block1 Siteone 5 AB
# 16 5 5 Block1 Siteone 5 AB
# 17 6 10 Block1 Siteone 5 AB
# 18 7 15 Block1 Siteone 5 AB
# 19 8 16 Block1 Siteone 5 AB
# 20 9 10 Block1 Siteone 5 AB
# 21 1 9 Block2 Sitenew 3 AB
# 22 2 8 Block2 Sitenew 3 AB
# 23 3 15 Block2 Sitenew 3 AB
# 24 5 10 Block2 Sitenew 3 AB
# 25 6 9 Block2 Sitenew 3 AB
# 26 7 10 Block2 Sitenew 3 AB
# 27 8 8 Block2 Sitenew 3 AB
# 28 1 11 Block2 Sitenew 5 AB
# 29 2 12 Block2 Sitenew 5 AB
# 30 3 13 Block2 Sitenew 5 AB
# 31 4 18 Block2 Sitenew 5 AB
# 32 5 19 Block2 Sitenew 5 AB
# 33 6 18 Block2 Sitenew 5 AB
# 34 8 19 Block2 Sitenew 5 AB
# 35 7 18 Block2 Sitenew 5 AB
# 36 9 19 Block2 Sitenew 5 AB
# 37 10 10 Block2 Sitenew 5 AB</code></pre>
</div>
<div id="are-we-finally-done" class="section level1">
<h1>Are we finally done?</h1>
<p>Hopefully! 😄</p>
<p>In working with real data, the final “combining” step can lead to errors due to unanticipated complexities. In my experience, this most often happens because some of the datasets are physically different than the rest. I’ve worked on problems where, for example, it turned out some datasets were present in the directory but were empty.</p>
<p>In the real files for this particular example, it turned out some of the files had been previously modified manually to remove the header information. We ended up adding an <code>if</code> statement to the function to test each file as we read it in. If it had the header information we’d use <code>skip</code> while reading in the dataset and if it didn’t we wouldn’t. I did something similar in the case where some of the datasets were blank.</p>
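<p>I don’t have the exact code we used, but a sketch of the idea is below. The function name <code>read_flexible</code>, the toy files, and especially the rule for detecting header information (data rows start with a digit, header rows don’t) are assumptions made up for this example; the real test would depend on what the real header lines look like.</p>
<pre class="r"><code>read_flexible = function(path, n_header = 6) {
     # Empty file: return NULL so it can be dropped before combining
     if ( file.size(path) == 0 ) {
          return(NULL)
     }
     # Assumption: data rows start with a digit and header rows do not
     first_line = readLines(path, n = 1)
     if ( grepl("^[0-9]", first_line) ) {
          nskip = 0
     } else {
          nskip = n_header
     }
     read.csv(path,
              skip = nskip,
              header = FALSE,
              col.names = c("date", "temperature") )
}

# Three made-up files: with header lines, without them, and empty
toy_dir = file.path(tempdir(), "toy_headers")
dir.create(toy_dir, showWarnings = FALSE)
with_header = file.path(toy_dir, "with_header.csv")
no_header = file.path(toy_dir, "no_header.csv")
empty_file = file.path(toy_dir, "empty.csv")
writeLines( c(paste("Logger info line", 1:6), "15,9", "16,8"), with_header)
writeLines( c("15,9", "16,8"), no_header)
invisible( file.create(empty_file) )

# Read each file, drop the NULL from the empty file, and stack the rest
toy_list = lapply( c(with_header, no_header, empty_file), read_flexible)
toy_list = Filter(Negate(is.null), toy_list)
( toy_combined = do.call(rbind, toy_list) )</code></pre>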
<p>After the combined dataset has been created, you might want to save it for further data exploration and/or analysis. If working on an interim set of datasets (such as before a field season is over), saving the R object with <code>saveRDS</code> can be pretty convenient. Saving a final dataset as a CSV may be useful for sharing with collaborators once all datasets have been downloaded and combined, which can be done with, e.g., <code>write.csv</code>.</p>
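<p>A short sketch of both saving options is below, using a small made-up dataset and temporary file paths in place of the real combined dataset.</p>
<pre class="r"><code># Small made-up dataset standing in for combined_dat
toy_dat = data.frame(date = 1:3,
                     temperature = c(9, 8, 15),
                     block = "Block1")

# Interim save: an RDS file stores the R object as it is
rds_path = file.path(tempdir(), "combined_dat.rds")
saveRDS(toy_dat, file = rds_path)
toy_back = readRDS(rds_path)
identical(toy_dat, toy_back)

# Final save for sharing: a plain CSV, without the row names column
csv_path = file.path(tempdir(), "combined_dat.csv")
write.csv(toy_dat, file = csv_path, row.names = FALSE)
toy_csv = read.csv(csv_path)</code></pre>
<p>The RDS file can only be read back into R, which is why I think of it as the interim option and the CSV as the one to share.</p>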
</div>
Using DHARMa for residual checks of unsupported models
https://aosmith.rbind.io/2017/12/21/using-dharma-for-residual-checks-of-unsupported-models/
Thu, 21 Dec 2017 00:00:00 +0000https://aosmith.rbind.io/2017/12/21/using-dharma-for-residual-checks-of-unsupported-models/<div id="why-use-simulations-for-model-checking" class="section level1">
<h1>Why use simulations for model checking?</h1>
<p>One of the difficult things about working with generalized linear models (GLM) and generalized linear mixed models (GLMM) is figuring out how to interpret residual plots. We don’t really expect residual plots from a GLMM to look like one from a linear model, sure, but how do we tell when something looks “bad”?</p>
<p>This is the situation I was in several years ago, working on an analysis involving counts from a fairly complicated study design. I was using a negative binomial generalized linear mixed model, and the residual vs fitted values plot looked, well, “funny”. But was something wrong or was this just how the residuals from a complicated model like this look sometimes?</p>
<p>Here is an example of a plot I wasn’t feeling too good about but also wasn’t certain if what I was seeing indicated a lack of fit.</p>
<p><img src="https://aosmith.rbind.io/img/2017-12-21_bad_residual_plot.png" /><!-- --></p>
<p>To try to figure out if what I was seeing was a problem, I fit models to response data simulated from my model. The beauty of such simulations is that I know that the model definitely <em>does</em> fit the simulated response data; the model is what created the data! I compared residual plots from simulated data models to my real plot to help decide if what I was seeing was unusual. I looked at a fair number of simulated residual plots and decided that, yes, something was wrong with my model. I ended up moving on to a different model that worked better.</p>
<p>Here is an example of one of the simulated residual plots. There was definitely variation in plots from models fit to the simulated data, but this is a good example of what they generally looked like.</p>
<p><img src="https://aosmith.rbind.io/img/2017-12-21_good_residual_plot.png" /><!-- --></p>
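<p>To make the “brute force” idea concrete, here is a minimal sketch of it. It is not my original analysis: I use made-up count data and a simple Poisson GLM from base R as a stand-in for the real negative binomial GLMM. The steps are to simulate new responses from the fitted model, refit the same model to each simulated response, and look at the residual plots side by side.</p>
<pre class="r"><code>set.seed(16)

# Made-up count data and a simple Poisson GLM standing in for the real model
toy = data.frame(x = runif(100) )
toy$y = rpois(100, lambda = exp(0.5 + 1.5*toy$x) )
fit = glm(y ~ x, family = poisson, data = toy)

# Simulate 5 new response vectors from the fitted model
sims = simulate(fit, nsim = 5)

# Residual vs fitted plots: real data first, then a refit to each simulated response
op = par(mfrow = c(2, 3) )
plot(fitted(fit), resid(fit, type = "pearson"), main = "Real data")
for (i in seq_along(sims) ) {
     toy$ysim = sims[[i]]
     fit_sim = glm(ysim ~ x, family = poisson, data = toy)
     plot(fitted(fit_sim), resid(fit_sim, type = "pearson"),
          main = paste("Simulation", i) )
}
par(op)</code></pre>
<p>If the plot from the real data looks like it could be one of the simulated panels, that is some reassurance; if it stands out, something may be off with the model.</p>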
</div>
<div id="the-dharma-package" class="section level1">
<h1>The DHARMa package</h1>
<p>I found my “brute force” simulation approach useful, but I spent a lot of time visually comparing the simulated plots to my real plot. I didn’t have a metric to help me decide if my actual residual plot seemed unusual compared to residual plots from my “true” models. This left me unable to recommend this as a general approach to folks I consult with.</p>
<p>Since then, the author of the <a href="https://github.com/florianhartig/DHARMa"><strong>DHARMa</strong> package</a> has come up with a clever way to use a simulation-based approach for residual checks of GLMM’s. If you are interested in trying the package out, it has <a href="https://cran.r-project.org/web/packages/DHARMa/vignettes/DHARMa.html">a very nice vignette</a> to get you started.</p>
<p>These days I’ve been happily recommending the <strong>DHARMa</strong> package to students I work with for residual checks of GLMM’s. However, students aren’t always working with models that <strong>DHARMa</strong> currently supports. Luckily, <strong>DHARMa</strong> can simulate residuals for any model as long as the user can provide simulated values to the <code>createDHARMa</code> function. Below I show how to do this for a couple of different situations.</p>
</div>
<div id="how-to-use-createdharma" class="section level1">
<h1>How to use createDHARMa()</h1>
<p>In order to use <code>createDHARMa</code>, the user needs to provide three pieces of information.</p>
<ul>
<li>Simulated response vectors</li>
<li>Observed data</li>
<li>The predicted response from the model</li>
</ul>
<p>I think making the simulated response vectors is the biggest bottleneck for folks I’ve worked with, and that is what I’m focusing on today.</p>
</div>
<div id="simulations-for-models-that-have-a-simulate-function" class="section level1">
<h1>Simulations for models that have a simulate() function</h1>
<p>When a model type has a <code>simulate</code> function, getting the simulations needed to use <code>createDHARMa</code> can be comparatively straightforward.</p>
<div id="example-using-glmmtmb" class="section level2">
<h2>Example using glmmTMB()</h2>
<p>The <code>glmmTMB</code> function from package <strong>glmmTMB</strong> is one of those models that <strong>DHARMa</strong> doesn’t currently support. (<em>2018-04-05 update: the development version of DHARMa <a href="https://github.com/florianhartig/DHARMa/issues/16">now supports glmmTMB objects for glmmTMB 0.2.1</a>. I believe the example below is still useful for showing how to work with DHARMa-unsupported model types that have a simulate() function.</em>)</p>
<p>The <code>glmmTMB</code> function does have a <code>simulate</code> function.</p>
<pre class="r"><code>library(DHARMa) # version 0.1.5
library(glmmTMB) # version 0.1.1</code></pre>
<p>I’m going to use one of the models from the <strong>glmmTMB</strong> documentation to demonstrate how to make the simulations and then use them in <code>createDHARMa</code>.</p>
<p>Below I fit a zero-inflated negative binomial model with <code>glmmTMB</code>.</p>
<pre class="r"><code>fit_nbz = glmmTMB(count ~ spp + mined + (1|site),
zi = ~spp + mined,
family = nbinom2, data = Salamanders)</code></pre>
<p>If I try to calculate the scaled residuals via <strong>DHARMa</strong> functions, I get a warning and then an error. <strong>DHARMa</strong> attempts to make predictions from the model to simulate with, but it fails.</p>
<pre class="r"><code>simulateResiduals(fittedModel = fit_nbz, n = 250)</code></pre>
<pre><code># Object of Class DHARMa with simulated residuals based on 250 simulations with refit = FALSE . See ?DHARMa::simulateResiduals for help.
#
# Scaled residual values: 0.12 0.492 0.424 0.672 0.644 0.568 0.456 0.672 0.836 0.556 0.392 0.004 0.452 0.116 0.084 0.296 0.184 0.54 0.576 0.56 ...</code></pre>
<p>This is an indication I’ll need to use <code>createDHARMa</code> to make the residuals instead.</p>
<p>I can simulate from my model via the <code>simulate</code> function (see the documentation for <code>simulate.glmmTMB</code> for details). I usually do at least 1000 simulations if it isn’t too time-consuming. I’ll do only 10 here as an example.</p>
<pre class="r"><code>sim_nbz = simulate(fit_nbz, nsim = 10)</code></pre>
<p>The result is a list of simulated response vectors.</p>
<pre class="r"><code>str(sim_nbz)</code></pre>
<pre><code># 'data.frame': 644 obs. of 10 variables:
# $ sim_1 : num 0 0 0 0 6 2 0 1 1 0 ...
# $ sim_2 : num 0 1 0 0 0 0 0 0 0 2 ...
# $ sim_3 : num 0 0 0 1 4 0 3 0 0 1 ...
# $ sim_4 : num 0 0 0 1 0 7 0 0 0 4 ...
# $ sim_5 : num 0 0 0 1 0 0 0 1 1 4 ...
# $ sim_6 : num 0 0 0 0 0 1 1 0 10 0 ...
# $ sim_7 : num 0 0 1 2 4 1 7 1 2 0 ...
# $ sim_8 : num 0 0 0 0 10 13 1 2 0 9 ...
# $ sim_9 : num 0 0 0 0 1 6 0 4 3 0 ...
# $ sim_10: num 2 0 0 0 5 0 0 0 3 1 ...</code></pre>
<p>I need these in a matrix, not a list, where each column contains a simulated response vector and each row is an observation. I’ll collapse the list into a matrix by using <code>cbind</code> in <code>do.call</code>.</p>
<pre class="r"><code>sim_nbz = do.call(cbind, sim_nbz)
head(sim_nbz)</code></pre>
<pre><code># sim_1 sim_2 sim_3 sim_4 sim_5 sim_6 sim_7 sim_8 sim_9 sim_10
# [1,] 0 0 0 0 0 0 0 0 0 2
# [2,] 0 1 0 0 0 0 0 0 0 0
# [3,] 0 0 0 0 0 0 1 0 0 0
# [4,] 0 0 1 1 1 0 2 0 0 0
# [5,] 6 0 4 0 0 0 4 10 1 5
# [6,] 2 0 0 7 0 1 1 13 6 0</code></pre>
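<p>As an aside, when the simulations come back as a data.frame, <code>as.matrix</code> should give the same matrix as the <code>do.call</code> approach. Below is a minimal sketch with a small made-up data.frame. The object <code>toy_sims</code> and its values are hypothetical, not output from the model.</p>
<pre class="r"><code># Hypothetical stand-in for the simulate() output:
# 3 observations (rows) and 2 simulations (columns)
toy_sims = data.frame(sim_1 = c(0, 1, 4),
                      sim_2 = c(2, 0, 0))

# Both approaches bind the columns into a numeric matrix
mat_cbind = do.call(cbind, toy_sims)
mat_asmat = as.matrix(toy_sims)

# Same dimensions and same values either way
dim(mat_cbind)
all(mat_cbind == mat_asmat)</code></pre>
<p>The <code>do.call</code> version has the advantage of also working when the simulations are stored in a plain list of equal-length vectors.</p>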
<p>Now I can pass these to <code>createDHARMa</code> along with observed values and model predictions. I also set <code>integerResponse</code> to <code>TRUE</code> because I’m working with counts.</p>
<pre class="r"><code>sim_res_nbz = createDHARMa(simulatedResponse = sim_nbz,
observedResponse = Salamanders$count,
fittedPredictedResponse = predict(fit_nbz),
integerResponse = TRUE)</code></pre>
<p>This function creates the scaled residuals that can be used to make residual plots via <strong>DHARMa</strong>’s <code>plotSimulatedResiduals</code>.</p>
<p>Remember I’ve only done 10 simulations here, so this particular set of plots doesn’t look very nice.</p>
<pre class="r"><code>plotSimulatedResiduals(sim_res_nbz)</code></pre>
<pre><code># plotSimulatedResiduals is deprecated, switch your code to using the plot function</code></pre>
<p><img src="https://aosmith.rbind.io/post/2017-12-21-using-dharma-for-simulated-residual-checks-of-unsupported-models_files/figure-html/unnamed-chunk-10-1.png" width="672" /></p>
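<p>To give a feel for why the number of simulations matters, here is a simplified base R sketch of the general idea behind a simulation-based scaled residual: where the observed value falls within the simulated values for that observation. This is my own illustration under made-up data (the objects <code>obs</code> and <code>sims</code> are hypothetical), not the actual code <strong>DHARMa</strong> uses.</p>
<pre class="r"><code>set.seed(16)

# Hypothetical example: 5 observed counts and 250 simulated counts for each
n_obs = 5
n_sim = 250
obs = c(0, 2, 1, 7, 3)
sims = matrix(rpois(n_obs * n_sim, lambda = 2), nrow = n_obs, ncol = n_sim)

# Proportion of simulations below and equal to each observed value
# (obs is recycled down the columns, so row i is compared to obs[i])
prop_below = rowMeans(obs > sims)
prop_equal = rowMeans(obs == sims)

# For integer responses, place ties at random between the two bounds
scaled_res = prop_below + runif(n_obs) * prop_equal
scaled_res</code></pre>
<p>With only 10 simulations the proportions can take very few distinct values, which is one way to see why a larger number of simulations gives smoother-looking residuals.</p>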
</div>
</div>
<div id="simulations-for-models-without-a-simulate-function" class="section level1">
<h1>Simulations for models without a simulate() function</h1>
<p>The <code>simulate</code> function did most of the heavy lifting for me in the <code>glmmTMB</code> example. Not all models have a <code>simulate</code> function, though. This doesn’t mean I can’t use <strong>DHARMa</strong>, but it does mean I have to put more effort in up front.</p>
<p>I will do the following simulations “by hand” in R, more-or-less following the method shown <a href="https://stats.stackexchange.com/a/189052/29350">in this answer on CrossValidated</a>.</p>
<div id="example-using-zeroinfl" class="section level2">
<h2>Example using zeroinfl()</h2>
<p>The <code>zeroinfl</code> function from package <strong>pscl</strong> is an example of a model that doesn’t have a <code>simulate</code> function and is unsupported by <strong>DHARMa</strong>. I will use this in my next example.</p>
<p>I will also load package <strong>VGAM</strong>, which has a function for making random draws from a zero-inflated negative binomial distribution.</p>
<pre class="r"><code>library(pscl) # version 1.5.2
library(VGAM) # version 1.0-4</code></pre>
<p>I will use a <code>zeroinfl</code> documentation example in this section. The zero-inflated negative binomial model is below.</p>
<pre class="r"><code>fit_zinb = zeroinfl(art ~ . | 1,
data = bioChemists,
dist = "negbin")</code></pre>
<p>In order to make my own simulations, I’ll need both the model-predicted count and the model-predicted probability of a 0 for each observation in the dataset. I’ll also need an estimate of the negative binomial dispersion parameter, <span class="math inline">\(\theta\)</span>.</p>
<p>The <code>predict</code> function for <code>zeroinfl</code> models lets the user define the kind of prediction desired. I use <code>predict</code> twice, once to extract the predicted counts and once to extract the predicted probability of 0.</p>
<pre class="r"><code># Predicted probabilities
p = predict(fit_zinb, type = "zero")
# Predicted counts
mus = predict(fit_zinb, type = "count")</code></pre>
<p>I can pull <span class="math inline">\(\theta\)</span> directly out of the model output.</p>
<pre class="r"><code>fit_zinb$theta</code></pre>
<pre><code># [1] 2.264391</code></pre>
<p>Now that I have these, I can make random draws from a zero-inflated negative binomial distribution using <code>rzinegbin</code> from package <strong>VGAM</strong>.</p>
<p>It took me a while to figure out which arguments I needed to use in <code>rzinegbin</code>. I need to provide the predicted counts to the <code>"munb"</code> argument and the predicted probabilities to the <code>"pstr0"</code> argument. The <code>"size"</code> argument is the estimate of the dispersion parameter. The <code>"n"</code> argument indicates the number of simulated values needed, in this case the same number as the rows in the original dataset.</p>
<pre class="r"><code>sim1 = rzinegbin(n = nrow(bioChemists),
size = fit_zinb$theta,
pstr0 = p,
munb = mus)</code></pre>
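<p>To make it clearer what a draw from a zero-inflated negative binomial distribution involves, here is a minimal base R sketch of the same kind of draw done in two steps. It assumes the usual mixture definition (a structural 0 with probability <span class="math inline">\(p\)</span>, otherwise a negative binomial count with mean <span class="math inline">\(\mu\)</span>), and the parameter values are made up for illustration rather than taken from the fitted model.</p>
<pre class="r"><code>set.seed(16)

# Hypothetical values standing in for the model estimates
n = 10000
p_zero = 0.3   # probability of a structural 0
mu = 2         # mean of the negative binomial part
theta = 2.26   # negative binomial dispersion ("size")

# Step 1: decide which draws are structural 0s
is_zero = rbinom(n, size = 1, prob = p_zero)

# Step 2: draw negative binomial counts and keep them for everything else
counts = rnbinom(n, size = theta, mu = mu)
sim_by_hand = ifelse(is_zero == 1, 0, counts)

# The mean of the draws should be close to (1 - p_zero) * mu
mean(sim_by_hand)
(1 - p_zero) * mu</code></pre>
<p>Since <code>rbinom</code> and <code>rnbinom</code> are vectorized, the same approach should also work when <code>prob</code> and <code>mu</code> are vectors with one value per observation.</p>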
<p>I use <code>replicate</code> to draw more than one simulated vector. The first argument of <code>replicate</code>, <code>"n"</code>, indicates the number of times to evaluate the given expression. Here I make 10 simulated response vectors.</p>
<pre class="r"><code>sim_zinb = replicate(10, rzinegbin(n = nrow(bioChemists),
size = fit_zinb$theta,
pstr0 = p,
munb = mus) )</code></pre>
<p>The output of <code>replicate</code> in this case is a matrix, with one simulated response vector in every column and an observation in every row. This is ready for use in <code>createDHARMa</code>.</p>
<pre class="r"><code>head(sim_zinb)</code></pre>
<pre><code># [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] 2 0 0 0 3 1 3 2 0 1
# [2,] 1 2 1 0 1 0 1 0 4 0
# [3,] 0 1 2 0 3 2 2 1 0 0
# [4,] 0 0 2 0 1 1 0 1 1 2
# [5,] 4 3 1 3 1 0 3 5 3 0
# [6,] 0 0 0 0 0 0 1 3 3 1</code></pre>
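<p>If the behavior of <code>replicate</code> is unfamiliar, the small base R sketch below shows the same pattern with Poisson draws. The vector <code>toy_mus</code> is hypothetical. The expression is re-evaluated on every repetition, and because each evaluation returns a vector of the same length the results are simplified into a matrix.</p>
<pre class="r"><code>set.seed(16)

# Hypothetical predicted means for 4 observations
toy_mus = c(0.5, 1, 2, 4)

# Evaluate the random draw 3 times; each evaluation gives a new vector
toy_sim = replicate(3, rpois(n = length(toy_mus), lambda = toy_mus))

# 4 rows (observations) by 3 columns (simulations)
dim(toy_sim)
is.matrix(toy_sim)</code></pre>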
<p>Making the simulations was the hard part. Now that I have them, <code>createDHARMa</code> works exactly the same way as in the <code>glmmTMB</code> example.</p>
<pre class="r"><code>sim_res_zinb = createDHARMa(simulatedResponse = sim_zinb,
observedResponse = bioChemists$art,
fittedPredictedResponse = predict(fit_zinb, type = "response"),
integerResponse = TRUE)</code></pre>
<pre class="r"><code>plotSimulatedResiduals(sim_res_zinb)</code></pre>
<pre><code># plotSimulatedResiduals is deprecated, switch your code to using the plot function</code></pre>
<p><img src="https://aosmith.rbind.io/post/2017-12-21-using-dharma-for-simulated-residual-checks-of-unsupported-models_files/figure-html/unnamed-chunk-19-1.png" width="672" /></p>
</div>
</div>