I first learned about embedding many small subplots into a larger plot, as a way to visualize large datasets, from package ggsubplot. Embedding subplots is still possible in ggplot2 today with the annotation_custom() function. I demonstrate one approach to do this, making many subplots in a loop and then adding them to the larger plot.
When working with counts, having many zeros does not necessarily indicate zero inflation. I demonstrate this by simulating data from the negative binomial and generalized Poisson distributions. I then show one way to check whether the data have more zeros than the model expects.
Analyzing positive data with 0 values can be challenging, since a direct log transformation isn't possible. I discuss some of the things to consider when deciding on an analysis strategy for such data and then explore the effect of the value of the constant, c, when using log(y + c) as the response variable.
In this post I show an example of how to automate the process of making many exploratory plots in ggplot2 with multiple continuous response and explanatory variables. Looping through both x and y variables involves nested loops. In the latter section of the post I go over options for saving the resulting plots, either together in a single document, separately, or by creating combined plots prior to saving.
Checking for autocorrelation must be done carefully when some observations are missing from a time series or the time series is measured for independent groups. I show an approach where I pad the dataset with NA via tidyr::complete() to fill in any missed sampling times and make sure groups are considered independent prior to calculating the autocorrelation function.
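The padding idea can be sketched in base R, using merge() as a stand-in for tidyr::complete() so the example runs without extra packages. The data frame `dat`, its column names, and its values are hypothetical. Computing the ACF separately per group is a simplification chosen here to keep groups independent; it is not necessarily how the post does it.

```r
# Hypothetical data: two independent groups sampled at integer times.
# Time 3 is missing for group "a" and time 2 is missing for group "b".
dat <- data.frame(
  group = c("a", "a", "a", "a", "b", "b", "b", "b"),
  time  = c(1, 2, 4, 5, 1, 3, 4, 5),
  resid = c(0.5, 0.3, -0.2, -0.4, 0.1, 0.6, 0.2, -0.3)
)

# Build every group-by-time combination, then merge so that
# missed sampling times show up as rows with NA residuals.
full_grid <- expand.grid(
  group = unique(dat$group),
  time = seq(min(dat$time), max(dat$time)),
  stringsAsFactors = FALSE
)
padded <- merge(full_grid, dat, all.x = TRUE)
padded <- padded[order(padded$group, padded$time), ]

# Calculate the ACF within each group so no lag spans two groups.
# na.action = na.pass lets acf() use the pairs that are available.
acf_by_group <- lapply(
  split(padded$resid, padded$group),
  acf, na.action = na.pass, plot = FALSE, lag.max = 2
)
```

Without the padding step, the observations at times 2 and 4 in group "a" would be treated as one time step apart instead of two.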
Unstandardizing coefficients so they can be interpreted on the original scale is sometimes necessary when explanatory variables were standardized to help with model convergence when fitting generalized linear mixed models. Here I show one approach to unstandardizing for a generalized linear mixed model fit with lme4.
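The arithmetic behind unstandardizing can be illustrated with a plain linear model. This is a minimal sketch that uses lm() on simulated data instead of an lme4 model so that it runs with base R only. All object names are hypothetical, and the same back-transformation should carry over to fixed-effect estimates on the link scale.

```r
set.seed(16)
x <- rnorm(50, mean = 100, sd = 20)
y <- 2 + 0.05 * x + rnorm(50)

# Standardize x by centering on the mean and scaling by the SD.
x_mean <- mean(x)
x_sd <- sd(x)
x_std <- (x - x_mean) / x_sd
fit_std <- lm(y ~ x_std)

# Unstandardize: divide the slope by the SD, then shift the intercept
# back by the slope contribution at the mean of x.
b <- coef(fit_std)
slope_orig <- unname(b["x_std"] / x_sd)
intercept_orig <- unname(b["(Intercept)"] - b["x_std"] * x_mean / x_sd)

# For comparison, the model fit directly to the unstandardized variable.
fit_raw <- lm(y ~ x)
```

Here `slope_orig` and `intercept_orig` match `coef(fit_raw)` up to numerical precision, which is a handy check that the back-transformation was done correctly.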
I currently work as a consulting statistician, advising natural and social science researchers on statistics, statistical programming, and study design. I create and teach R workshops for applied science graduate students who are just getting started in R, where my goal is to make their transition to a programming language as smooth as possible. See my workshop materials at my website.
Many similar models - Part 2: Automate model fitting with purrr::map() loops
When we have many similar models to fit, automating at least some portions of the task can be a real time saver. In my last post I demonstrated how to make a function for model fitting. Once you have made such a function it’s possible to loop through variable names and fit a model for each one. In this post I am specifically focusing on having many response variables with the same explanatory variables, using purrr::map() and friends for the looping.
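A minimal sketch of the workflow, assuming the built-in mtcars dataset and a hypothetical function `fit_one()`. The formula is assembled with paste() and as.formula(), and lapply() is used as a base R stand-in for the loop; with purrr installed, `purrr::map(responses, fit_one, data = mtcars)` should be a drop-in replacement.

```r
# Hypothetical model-fitting function: takes a response variable name
# as a string and fits response ~ am + wt.
fit_one <- function(response, data) {
  form <- as.formula(paste(response, "~ am + wt"))
  lm(form, data = data)
}

# Name the vector so the resulting list of models is named, too.
responses <- c("mpg", "hp", "qsec")
names(responses) <- responses

# Fit one model per response variable.
models <- lapply(responses, fit_one, data = mtcars)
```

Each element of `models` is an ordinary lm object, so `summary(models$mpg)` or `plot(models$hp)` can be used to check assumptions one model at a time.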
Many similar models - Part 1: How to make a function for model fitting
I worked with several students over the last few months who were fitting many linear models, all with the same basic structure but different response variables. They were struggling to find an efficient way to do this in R while still taking the time to check model assumptions. A first step when working towards a more automated process for fitting many models is to learn how to build model formulas with paste() and as.
The small multiples plot: how to combine ggplot2 plots with one shared axis
There are a variety of ways to combine ggplot2 plots with a single shared axis. However, things can get tricky if you want a lot of control over all plot elements. I demonstrate three different approaches for this: 1. Using facets, which is built into ggplot2 but doesn’t allow much control over the non-shared axes. 2. Using package cowplot, which has a lot of nice features but the plot spacing doesn’t play well with a single shared axis.
Embedding subplots in ggplot2 graphics
The idea of embedded plots for visualizing a large dataset that has an overplotting problem recently came up in some discussions with students. I first learned about embedded graphics from package ggsubplot. You can still see an old post about that package and about embedded graphics in general, with examples. However, ggsubplot is no longer maintained and doesn’t work with current versions of ggplot2. I poked around a bit, and found that annotation_custom() is the go-to function for embedding plots in a ggplot2 graphic.
Custom contrasts in emmeans
Following up on a previous post, where I demonstrated the basic usage of package emmeans for doing post hoc comparisons, here I’ll demonstrate how to make custom comparisons (aka contrasts). These are comparisons that aren’t encompassed by the built-in functions in the package. Remember that you can explore the available built-in emmeans functions for doing comparisons via ?"contrast-methods".
Table of Contents: Reasons for custom comparisons; R packages; The dataset and model; Treatment vs control comparisons; Building custom contrasts; The contrast() function for custom comparisons; Using named lists for better output; Using “at” for simple comparisons; Multiple custom contrasts at once; More complicated custom contrasts; Just the code, please.
Reasons for custom comparisons: There are a variety of reasons you might need custom comparisons instead of some of the standard, built-in ones.
Getting started with emmeans
Package emmeans (formerly known as lsmeans) is enormously useful for folks wanting to do post hoc comparisons among groups after fitting a model. It has a very thorough set of vignettes (see the vignette topics here), is very flexible with a ton of options, and works out of the box with a lot of different model objects (and can be extended to others 👍). I’ve started recommending emmeans all the time to students fitting models in R.
Lots of zeros or too many zeros?: Thinking about zero inflation in count data
In a recent lecture I gave a basic overview of zero-inflation in count distributions. My main take-home message to the students, which I thought was worth posting about here, is that having a lot of zero values does not necessarily mean you have zero inflation. Zero inflation is when there are more 0 values in the data than the distribution allows for. But some distributions can have a lot of zeros!
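A minimal simulation sketch of that point, with parameter values chosen purely for illustration. It draws negative binomial counts with no zero inflation at all and compares the observed proportion of zeros to the proportion the distribution itself expects. A real check would use parameters estimated from a fitted model rather than known ones.

```r
set.seed(16)
n <- 200
mu <- 10       # mean of the counts
theta <- 0.05  # small size parameter, so the counts are very overdispersed

# Simulate counts from a negative binomial with no zero inflation.
y <- rnbinom(n, mu = mu, size = theta)

# Observed proportion of zeros in the simulated data.
obs_zero <- mean(y == 0)

# Proportion of zeros expected under the distribution itself.
exp_zero <- dnbinom(0, mu = mu, size = theta)
```

With these settings roughly three quarters of the values are expected to be 0 even though the mean is 10, so a large pile of zeros is exactly what this distribution predicts.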
How to plot fitted lines with ggplot2
Most analyses aren’t really done until we’ve found a way to visualize the results graphically, and I’ve recently been getting some questions from students on how to plot fitted lines from models. There are some R packages that are made specifically for this purpose; see packages effects and visreg, for example. If using the ggplot2 package for plotting, fitted lines from simple models can be graphed using geom_smooth(). However, once models get more complicated, that convenient function is no longer useful.
Analysis essentials: An example directory structure for an analysis using R
There are a lot of practical skills involved in doing an analysis that are essential but that I rarely (never?) see included in the curriculum, statistics or otherwise. These are skills like how to organize your data, how to approach QAQC, and how to set up a naming algorithm for files. We all need to do these things, but too often we end up learning these skills by muddling through on our own.
The log-0 problem: analysis strategies and options for choosing c in log(y + c)
I periodically find myself having long conversations with consultees about 0’s. Why? Well, the basic suite of statistical tools many of us learn first involves the normal distribution (for the errors). The log transformation tends to feature prominently for working with right-skewed data. Since log(0) returns -Infinity, a common first reaction is to use log(y + c) as the response in place of log(y), where c is some constant added to the y variable to get rid of the 0 values.
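A small sketch of why the choice of c matters, using a made-up response vector `y` and three arbitrary constants. It shows how far the 0 values end up from the smallest positive value on the log scale for each constant.

```r
# Hypothetical positive response with some 0 values.
y <- c(0, 0.5, 2, 10, 50)

# A direct log transformation gives -Inf for the 0 value.
log(y)

# Candidate constants to add before taking logs.
cs <- c(0.001, 0.1, 1)
logged <- sapply(cs, function(const) log(y + const))
colnames(logged) <- paste0("c = ", cs)

# Distance on the log scale between the 0 value and the
# smallest positive value, for each constant.
gap <- logged[2, ] - logged[1, ]
```

The gap shrinks as c grows, so a very small c pushes the zeros far from the rest of the data and can turn them into influential points, while a larger c pulls them in.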