An example of base::split() for looping through groups
I recently had a question from a client about the simplest way to subset a data.frame and apply a function to each subset. “Simplest” could mean many things, of course, since what is simple for one person could appear very difficult to another. In this specific case I suggested using base::split() as a possible option since it is one I find fairly approachable.
It turns out I don’t have a go-to example for how to get started with a split() approach. So here’s a quick blog post about it! 😄
Load R packages
I’ll load purrr for looping through lists.
library(purrr) # 0.3.3
A dataset with groups
I made a small dataset to use with split(). The id variable contains the group information. There are three groups, a, b, and c, with 10 observations per group. There are also two numeric variables, var1 and var2.
dat = structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"),
var1 = c(4, 2.7, 3.4, 2.7, 4.6, 2.9, 2.2, 4.5, 4.6, 2.4,
3, 3.8, 2.5, 4, 3.6, 2.7, 4.5, 4.1, 4.2, 2.2, 4.9, 4.4, 3.6,
3.3, 2.7, 3.9, 4.9, 4.9, 4.3, 3.4), var2 = c(6, 22.3, 19.4,
22.8, 18.6, 14.2, 10.9, 22.7, 22.4, 11.7, 6, 13.3, 12.5,
6.3, 13.6, 20.5, 23.6, 10.9, 8.9, 20.9, 23.7, 15.9, 22.1,
11.6, 22, 17.7, 21, 20.8, 16.7, 21.4)), class = "data.frame", row.names = c(NA,
-30L))
head(dat)
# id var1 var2
# 1 a 4.0 6.0
# 2 a 2.7 22.3
# 3 a 3.4 19.4
# 4 a 2.7 22.8
# 5 a 4.6 18.6
# 6 a 2.9 14.2
Create separate data.frames per group
If the goal is to apply a function to the data from each group, we need to pull out a dataset for each id. One approach is to make a subset for each group and then apply the function of interest to that subset. A classic way to do this is to subset within a for() loop.
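For comparison, here is a minimal sketch of what the manual for() loop approach might look like. It uses a small made-up data.frame, toy, so it runs on its own; toy, groups, and means are names I chose for illustration, and the group mean of var1 stands in for "the function of interest".

```r
# Toy data made up for this sketch (not the dataset used in the post)
toy = data.frame(id = rep(c("a", "b", "c"), each = 4),
                 var1 = c(4.0, 2.7, 3.4, 2.7, 3.0, 3.8, 2.5, 4.0, 4.9, 4.4, 3.6, 3.3),
                 var2 = c(6.0, 22.3, 19.4, 22.8, 6.0, 13.3, 12.5, 6.3, 23.7, 15.9, 22.1, 11.6),
                 stringsAsFactors = FALSE)

# Set up a named vector to hold one result per group
groups = unique(toy$id)
means = numeric(length(groups))
names(means) = groups

for (g in groups) {
    sub = toy[toy$id == g, ]      # manually subset the rows for one group
    means[g] = mean(sub$var1)     # apply the function to that subset
}
means
```

The bookkeeping (finding the groups, subsetting, storing results by name) all has to be written by hand here.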
This is a situation where I find split() to be really convenient. It splits the data by a defined group variable so we don’t have to subset things manually.
The output from split() is a list. If I split a dataset by groups, each element of the list will be a data.frame for one of the groups. Note the group values are used as the names of the list elements. I find the list-naming aspect of split() handy for keeping track of groups in subsequent steps.
Here’s an example, where I split dat by the id variable.
dat_list = split(dat, dat$id)
dat_list
# $a
# id var1 var2
# 1 a 4.0 6.0
# 2 a 2.7 22.3
# 3 a 3.4 19.4
# 4 a 2.7 22.8
# 5 a 4.6 18.6
# 6 a 2.9 14.2
# 7 a 2.2 10.9
# 8 a 4.5 22.7
# 9 a 4.6 22.4
# 10 a 2.4 11.7
#
# $b
# id var1 var2
# 11 b 3.0 6.0
# 12 b 3.8 13.3
# 13 b 2.5 12.5
# 14 b 4.0 6.3
# 15 b 3.6 13.6
# 16 b 2.7 20.5
# 17 b 4.5 23.6
# 18 b 4.1 10.9
# 19 b 4.2 8.9
# 20 b 2.2 20.9
#
# $c
# id var1 var2
# 21 c 4.9 23.7
# 22 c 4.4 15.9
# 23 c 3.6 22.1
# 24 c 3.3 11.6
# 25 c 2.7 22.0
# 26 c 3.9 17.7
# 27 c 4.9 21.0
# 28 c 4.9 20.8
# 29 c 4.3 16.7
# 30 c 3.4 21.4
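Because the list elements are named after the groups, individual groups can be pulled out by name. The sketch below shows a few ways to inspect a split list; it uses a tiny made-up data.frame (toy) rather than dat so that it runs on its own.

```r
# Toy data made up for this sketch
toy = data.frame(id = rep(c("a", "b"), each = 3),
                 var1 = 1:6,
                 stringsAsFactors = FALSE)
toy_list = split(toy, toy$id)

names(toy_list)           # the group values become the element names
toy_list$b                # pull out the data.frame for one group by name
toy_list[["b"]]           # same thing, using [[ ]]
sapply(toy_list, nrow)    # rows per group; the names are carried along
```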
Looping through the list
Once the data are split into separate data.frames per group, we can loop through the list and apply a function to each one using whatever looping approach we prefer.
For example, if I want to fit a linear model of var1 vs var2 for each group I might do the looping with purrr::map() or lapply().
Each element of the new list still has the grouping information attached via the list names.
map(dat_list, ~lm(var1 ~ var2, data = .x) )
# $a
#
# Call:
# lm(formula = var1 ~ var2, data = .x)
#
# Coefficients:
# (Intercept) var2
# 2.64826 0.04396
#
#
# $b
#
# Call:
# lm(formula = var1 ~ var2, data = .x)
#
# Coefficients:
# (Intercept) var2
# 3.80822 -0.02551
#
#
# $c
#
# Call:
# lm(formula = var1 ~ var2, data = .x)
#
# Coefficients:
# (Intercept) var2
# 3.35241 0.03513
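The same loop can be written with lapply() from base R. This is a sketch on a small made-up data.frame (toy) so it is self-contained; the anonymous function plays the role of the ~ formula shorthand used with map().

```r
# Toy data made up for this sketch (not the dataset used in the post)
toy = data.frame(id = rep(c("a", "b", "c"), each = 4),
                 var1 = c(4.0, 2.7, 3.4, 2.7, 3.0, 3.8, 2.5, 4.0, 4.9, 4.4, 3.6, 3.3),
                 var2 = c(6.0, 22.3, 19.4, 22.8, 6.0, 13.3, 12.5, 6.3, 23.7, 15.9, 22.1, 11.6),
                 stringsAsFactors = FALSE)
toy_list = split(toy, toy$id)

# Fit one model per list element; the result is a list with the same names
fits = lapply(toy_list, function(d) lm(var1 ~ var2, data = d))
names(fits)

# Loop through the fitted models to pull out the slope for each group
slopes = sapply(fits, function(f) coef(f)[["var2"]])
slopes
```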
I could also create a function that fits a model and then returns model output. For example, maybe what I really want to do is fit a linear model and extract \(R^2\) for each group model fit.
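r2 = function(data) {
    # Fit the model to whatever data.frame is passed in
    fit = lm(var1 ~ var2, data = data)
    # One-row summary of the fit, including r.squared
    broom::glance(fit)
}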
The output of my r2 function, which uses broom::glance(), is a data.frame.
r2(data = dat)
# # A tibble: 1 x 11
# r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
# <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
# 1 0.0292 -0.00550 0.867 0.841 0.367 2 -37.3 80.5 84.7
# # ... with 2 more variables: deviance <dbl>, df.residual <int>
Since the function output is a data.frame, I can use purrr::map_dfr() to combine the output per group into a single data.frame. The .id argument creates a new variable to store the list names in the output.
map_dfr(dat_list, r2, .id = "id")
# # A tibble: 3 x 12
# id r.squared adj.r.squared sigma statistic p.value df logLik AIC
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
# 1 a 0.0775 -0.0378 0.968 0.672 0.436 2 -12.7 31.5
# 2 b 0.0387 -0.0815 0.832 0.322 0.586 2 -11.2 28.5
# 3 c 0.0285 -0.0930 0.808 0.235 0.641 2 -10.9 27.9
# # ... with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>
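If only the \(R^2\) values are needed, one option is to have the function return a single number and collect the results in a named vector. The sketch below does this in base R with summary() and vapply(); r2_only and toy are names made up for this example, and purrr::map_dbl() could be swapped in for vapply() in the same way.

```r
# Toy data made up for this sketch (not the dataset used in the post)
toy = data.frame(id = rep(c("a", "b", "c"), each = 4),
                 var1 = c(4.0, 2.7, 3.4, 2.7, 3.0, 3.8, 2.5, 4.0, 4.9, 4.4, 3.6, 3.3),
                 var2 = c(6.0, 22.3, 19.4, 22.8, 6.0, 13.3, 12.5, 6.3, 23.7, 15.9, 22.1, 11.6),
                 stringsAsFactors = FALSE)
toy_list = split(toy, toy$id)

# Hypothetical helper: fit the model and return R-squared as a single number
r2_only = function(data) {
    fit = lm(var1 ~ var2, data = data)
    summary(fit)$r.squared
}

# vapply() checks that each result is one numeric value and keeps the list names
r2_vals = vapply(toy_list, r2_only, numeric(1))
r2_vals
```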
Splitting by multiple groups
It is possible to split data by multiple grouping variables in the split() function. The grouping variables must be passed as a list.
Here’s an example, using the built-in mtcars dataset. I show only the first two list elements to demonstrate that the list names are now based on a combination of the values for the two groups. By default these values are separated by a . (but see the sep argument to control this).
mtcars_cylam = split(mtcars, list(mtcars$cyl, mtcars$am) )
mtcars_cylam[1:2]
# $`4.0`
# mpg cyl disp hp drat wt qsec vs am gear carb
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
# Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
#
# $`6.0`
# mpg cyl disp hp drat wt qsec vs am gear carb
# Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
# Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
# Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
# Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
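As a quick illustration of the sep argument, this sketch splits mtcars by the same two variables but joins the group values with an underscore. The object name cylam_sep is just for this example.

```r
# Same split as above, but with "_" between the two group values
cylam_sep = split(mtcars, list(mtcars$cyl, mtcars$am), sep = "_")

# The names now look like "4_0" instead of "4.0"
names(cylam_sep)
```

A separator other than . can make the names easier to pull apart later, for example when the group values themselves contain decimal points.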
If not all combinations of the groups are present in the data, the drop argument in split() allows us to drop the missing combinations. By default (drop = FALSE), combinations that aren’t present are kept as 0-row data.frames.
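Every cyl and am combination is present in mtcars, so to sketch what drop does I split by cyl and gear instead, where there are no 8-cylinder cars with 4 gears. The object names are made up for this example.

```r
# Default: the missing cyl/gear combination is kept as a 0-row data.frame
cylgear = split(mtcars, list(mtcars$cyl, mtcars$gear))
sapply(cylgear, nrow)

# drop = TRUE removes combinations that don't occur in the data
cylgear_drop = split(mtcars, list(mtcars$cyl, mtcars$gear), drop = TRUE)
sapply(cylgear_drop, nrow)

# One fewer list element after dropping
length(cylgear)
length(cylgear_drop)
```

Dropping empty combinations can matter when looping, since a function like lm() will fail on a 0-row data.frame.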
Other thoughts on split()
I feel like split() was a gateway function for me to get started working with lists and associated convenience functions like lapply() and purrr::map() for looping through lists. I think learning to work with lists and “list loops” also made the learning curve for list-columns in data.frames and the nest()/unnest() approach of analysis-by-groups a little less steep for me.
Just the code, please
Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code from here.
library(purrr) # 0.3.3
dat = structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"),
var1 = c(4, 2.7, 3.4, 2.7, 4.6, 2.9, 2.2, 4.5, 4.6, 2.4,
3, 3.8, 2.5, 4, 3.6, 2.7, 4.5, 4.1, 4.2, 2.2, 4.9, 4.4, 3.6,
3.3, 2.7, 3.9, 4.9, 4.9, 4.3, 3.4), var2 = c(6, 22.3, 19.4,
22.8, 18.6, 14.2, 10.9, 22.7, 22.4, 11.7, 6, 13.3, 12.5,
6.3, 13.6, 20.5, 23.6, 10.9, 8.9, 20.9, 23.7, 15.9, 22.1,
11.6, 22, 17.7, 21, 20.8, 16.7, 21.4)), class = "data.frame", row.names = c(NA,
-30L))
head(dat)
dat_list = split(dat, dat$id)
dat_list
map(dat_list, ~lm(var1 ~ var2, data = .x) )
r2 = function(data) {
    fit = lm(var1 ~ var2, data = data)
    broom::glance(fit)
}
r2(data = dat)
map_dfr(dat_list, r2, .id = "id")
mtcars_cylam = split(mtcars, list(mtcars$cyl, mtcars$am) )
mtcars_cylam[1:2]