# A closer look at replicate() and purrr::map() for simulations

I’ve done a couple of posts so far on simulations, here and here, where I demonstrate how to build a function for simulating data from a defined linear model and then explore long-run behavior of models fit to the simulated datasets. The focus of those posts was on the general simulation process, and I didn’t go into much detail on the specific R code. In this post I’ll focus in on the code I use for repeatedly simulating data and extracting output, specifically talking about the function `replicate()`

and the *map* family of functions from package **purrr**.

# The replicate() function

The `replicate()`

function is a member of the *apply* family of functions in base R.

Specifically, from the documentation:

`replicate`

is a wrapper for the common use of`sapply`

for repeated evaluation of an expression (which will usually involve random number generation).

Notice the documentation mentions *repeated evaluations* and that the use of `replicate()`

involves *random number generation*. Those are primary parts of the simulations I do. While I don’t actually know the *apply* family of functions very well, I use `replicate()`

a lot (although also see `purrr::rerun()`

). Using `replicate()`

is an alternative to building a `for()`

loop to repeatedly simulate new values.

The `replicate()`

function takes three arguments:

`n`

, which is the number of replications to perform. This is where I set the number of simulations I want to run.

`expr`

, the expression that should be run repeatedly. I’ve only ever used a function here.

`simplify`

, which controls the type of output the results of`expr`

are saved into. Use`simplify = FALSE`

to get vectors saved into a list instead of in an array.

## Simple example of replicate()

Let’s say I wanted to simulate some values from a normal distribution, which I can do using the `rnorm()`

function. Below I’ll simulate five values from a normal distribution with a mean of 0 and a standard deviation of 1 (which are the defaults for `mean`

and `sd`

arguments, respectively).

Since I’m going generate random numbers I’ll set the seed so anyone following along at home will see the same values.

```
set.seed(16)
rnorm(5, mean = 0, sd = 1)
```

`# [1] 0.4764134 -0.1253800 1.0962162 -1.4442290 1.1478293`

Using `rnorm()`

directly gives me a single set of simulated values. How do I simulate 5 values from this same distribution multiple times? This is where `replicate()`

comes in. It allows me to run the function I put in `expr`

exactly `n`

times.

Here I’ll ask for three runs of 5 values each. Notice I use `simplify = FALSE`

to get a list as output.

The output below is a list of three vectors. Each vector is from a unique run of the function, so contains five random numbers drawn from the normal distribution with a mean of 0 and standard deviation of 1.

```
set.seed(16)
replicate(n = 3, rnorm(5, 0, 1), simplify = FALSE )
```

```
# [[1]]
# [1] 0.4764134 -0.1253800 1.0962162 -1.4442290 1.1478293
#
# [[2]]
# [1] -0.46841204 -1.00595059 0.06356268 1.02497260 0.57314202
#
# [[3]]
# [1] 1.8471821 0.1119334 -0.7460373 1.6582137 0.7217206
```

Note if I don’t use `simplify = FALSE`

I will get a matrix of values instead of a list. Each column in the matrix is the output from one run of the function. In this case there will be three columns in the output, one for each run, and 5 rows. This can be a useful output type for simulations. I focus on list output throughout the rest of this post only because that’s what I have been using recently for simulations.

```
set.seed(16)
replicate(n = 3, rnorm(5, 0, 1) )
```

```
# [,1] [,2] [,3]
# [1,] 0.4764134 -0.46841204 1.8471821
# [2,] -0.1253800 -1.00595059 0.1119334
# [3,] 1.0962162 0.06356268 -0.7460373
# [4,] -1.4442290 1.02497260 1.6582137
# [5,] 1.1478293 0.57314202 0.7217206
```

## An equivalent for() loop example

A `for()`

loop can be used in place of `replicate()`

for simulations. With time and practice I’ve found `replicate()`

to be much more convenient in terms of writing the code. However, in my experience some folks find `for()`

loops intuitive when they are starting out in R. I think it’s because `for()`

loops are more explicit on the looping process: the user can see that `i`

is looped over and the output for each `i`

iteration is saved into the output object because the code is written out explicitly.

In my example I’ll save the output of each iteration of the loop into a list called `list1`

. I initialize this as an empty list prior to starting the loop. To match what I did with `replicate()`

I do three iterations of the loop (`i in 1:3`

), drawing 5 values via `rnorm()`

each time.

The result is identical to my `replicate()`

code above. It took a little more code to do it but the process is very clear since it is explicitly written out.

```
set.seed(16)
list1 = list() # Make an empty list to save output in
for (i in 1:3) { # Indicate number of iterations with "i"
list1[[i]] = rnorm(5, 0, 1) # Save output in list for each iteration
}
list1
```

```
# [[1]]
# [1] 0.4764134 -0.1253800 1.0962162 -1.4442290 1.1478293
#
# [[2]]
# [1] -0.46841204 -1.00595059 0.06356268 1.02497260 0.57314202
#
# [[3]]
# [1] 1.8471821 0.1119334 -0.7460373 1.6582137 0.7217206
```

## Using replicate() on a user-made function

When I do simulations to explore the behavior of linear models under different scenarios I will create a function to simulate the data and fit the model. For example, here’s a function I used in an earlier blog post to simulate data from and then fit a two group linear model.

```
twogroup_fun = function(nrep = 10, b0 = 5, b1 = -2, sigma = 2) {
ngroup = 2
group = rep( c("group1", "group2"), each = nrep)
eps = rnorm(ngroup*nrep, 0, sigma)
growth = b0 + b1*(group == "group2") + eps
growthfit = lm(growth ~ group)
growthfit
}
```

The output is a model fit to data generated from the fixed and random (residual) effects.

`twogroup_fun()`

```
#
# Call:
# lm(formula = growth ~ group)
#
# Coefficients:
# (Intercept) groupgroup2
# 4.686 -1.267
```

To explore the long-run behavior in my simulated scenario I will repeat the data generation and model fitting many times using `replicate()`

. The result is a list of fitted models. I’ll run the function 5 times and save the result as `sim_lm`

to use throughout the next section on `map()`

.

```
sim_lm = replicate(5, twogroup_fun(), simplify = FALSE )
length(sim_lm)
```

`# [1] 5`

# Using purrr::map() for looping through lists

So I have a list of fitted models from `replicate()`

; now what?

The `replicate()`

function was about repeatedly running a function. Once I have the repeated runs I can explore the long-run behavior of some statistic by extracting value(s) from the resulting models. This involves looping through the list of models.

Looping through the list can be done using a `for()`

loop, but I prefer to use functions that do the looping without all the typing. In particular, these days I use the *map* family of functions from the **purrr** package to loop through lists. Before **purrr** I primarily used `lapply()`

(the only other *apply* family function that I know 😆).

The `map()`

function takes a list as input and puts the output into a list of the same length. The first argument to `map()`

is the list to loop through and the second argument is the function to apply to each element of the list.

For example, I can pull out the coefficients of each model in my 5-run simulation by looping through `sim_lm`

and applying the `coef()`

function to each list element.

```
library(purrr) # v. 0.2.5
map(sim_lm, coef)
```

```
# [[1]]
# (Intercept) groupgroup2
# 5.189474 -1.715602
#
# [[2]]
# (Intercept) groupgroup2
# 4.670188 -1.965463
#
# [[3]]
# (Intercept) groupgroup2
# 5.231922 -2.589953
#
# [[4]]
# (Intercept) groupgroup2
# 6.285158 -3.195090
#
# [[5]]
# (Intercept) groupgroup2
# 4.3296875 -0.9724314
```

## Other variants of map() for non-list outputs

There are many variants of `map()`

that are convenient for saving results into something other than a list. For example, if I am going to extract a single numeric value from each model, such as \(R^2\), I might want the output to be a numeric vector instead of a list. I can use `map_dbl()`

for this.

The unadjusted \(R^2\) from a model fit with `lm()`

can be pulled from the model `summary()`

output. The code looks like: `summary(model)$r.squared`

,

where “model” is a fitted model.

So getting \(R^2\) involves extracting a value after applying a function, which isn’t quite as straightforward as applying a single function to every model in the list like I did with `coef()`

. This gives me a chance to demonstrate the formula coding styling available in *map* functions. In formula coding a tilde (`~`

) goes in front of the function and `.x`

refers to the list element.

`map_dbl(sim_lm, ~summary(.x)$r.squared)`

`# [1] 0.22823549 0.16199867 0.25730022 0.38591045 0.06375695`

If you don’t like the formula style you can use an anonymous function inside *map* functions, where the function argument is used to refer to the list element.

`map_dbl(sim_lm, function(x) summary(x)$r.squared)`

`# [1] 0.22823549 0.16199867 0.25730022 0.38591045 0.06375695`

For data.frame output we can use `map_dfr()`

for row binding or *stacking* results together into a single data.frame.

Estimated coefficients, their standard errors, and their statistical tests from models fit with `lm()`

can be extracted into a tidy data.frame using `broom::tidy()`

. Looping through the results and doing this for each model via `map_dfr()`

will put the output in one data.frame instead of storing the individual data.frames for each model as one element of a list.

`map_dfr(sim_lm, broom::tidy)`

```
# term estimate std.error statistic p.value
# 1 (Intercept) 5.1894736 0.5257947 9.869772 1.092300e-08
# 2 groupgroup2 -1.7156023 0.7435860 -2.307201 3.314134e-02
# 3 (Intercept) 4.6701884 0.7450412 6.268362 6.535353e-06
# 4 groupgroup2 -1.9654632 1.0536474 -1.865390 7.851233e-02
# 5 (Intercept) 5.2319216 0.7333769 7.134015 1.203499e-06
# 6 groupgroup2 -2.5899532 1.0371516 -2.497179 2.243919e-02
# 7 (Intercept) 6.2851581 0.6717450 9.356464 2.460846e-08
# 8 groupgroup2 -3.1950902 0.9499909 -3.363285 3.461704e-03
# 9 (Intercept) 4.3296875 0.6210667 6.971372 1.641131e-06
# 10 groupgroup2 -0.9724314 0.8783210 -1.107148 2.828066e-01
```

The `map_dfr()`

function has an additional argument, `.id`

, which can be used to store the list names (if the original list had names) or add the list index to the output (if it didn’t have names). I’m using a list that has no names, so each unique model output will be assigned its index number if I use the `.id`

argument. The name of the new column is given as a string to `.id`

.

`map_dfr(sim_lm, broom::tidy, .id = "model")`

```
# model term estimate std.error statistic p.value
# 1 1 (Intercept) 5.1894736 0.5257947 9.869772 1.092300e-08
# 2 1 groupgroup2 -1.7156023 0.7435860 -2.307201 3.314134e-02
# 3 2 (Intercept) 4.6701884 0.7450412 6.268362 6.535353e-06
# 4 2 groupgroup2 -1.9654632 1.0536474 -1.865390 7.851233e-02
# 5 3 (Intercept) 5.2319216 0.7333769 7.134015 1.203499e-06
# 6 3 groupgroup2 -2.5899532 1.0371516 -2.497179 2.243919e-02
# 7 4 (Intercept) 6.2851581 0.6717450 9.356464 2.460846e-08
# 8 4 groupgroup2 -3.1950902 0.9499909 -3.363285 3.461704e-03
# 9 5 (Intercept) 4.3296875 0.6210667 6.971372 1.641131e-06
# 10 5 groupgroup2 -0.9724314 0.8783210 -1.107148 2.828066e-01
```

Further arguments to the function used within `map()`

can be passed as additional arguments. For example, I can add confidence intervals for estimated coefficients when using the `tidy.lm()`

function via `conf.int = TRUE`

. If I want to get confidence intervals for all models I add this as an additional argument in `map_dfr()`

.

`map_dfr(sim_lm, broom::tidy, conf.int = TRUE)`

```
# term estimate std.error statistic p.value conf.low
# 1 (Intercept) 5.1894736 0.5257947 9.869772 1.092300e-08 4.084820
# 2 groupgroup2 -1.7156023 0.7435860 -2.307201 3.314134e-02 -3.277819
# 3 (Intercept) 4.6701884 0.7450412 6.268362 6.535353e-06 3.104915
# 4 groupgroup2 -1.9654632 1.0536474 -1.865390 7.851233e-02 -4.179094
# 5 (Intercept) 5.2319216 0.7333769 7.134015 1.203499e-06 3.691154
# 6 groupgroup2 -2.5899532 1.0371516 -2.497179 2.243919e-02 -4.768928
# 7 (Intercept) 6.2851581 0.6717450 9.356464 2.460846e-08 4.873874
# 8 groupgroup2 -3.1950902 0.9499909 -3.363285 3.461704e-03 -5.190947
# 9 (Intercept) 4.3296875 0.6210667 6.971372 1.641131e-06 3.024875
# 10 groupgroup2 -0.9724314 0.8783210 -1.107148 2.828066e-01 -2.817715
# conf.high
# 1 6.2941272
# 2 -0.1533862
# 3 6.2354619
# 4 0.2481678
# 5 6.7726893
# 6 -0.4109786
# 7 7.6964420
# 8 -1.1992333
# 9 5.6345003
# 10 0.8728526
```

The *map* family of functions can easily be used with pipes as one step in a chain of functions. I can, for example, take the estimates I get using `broom::tidy`

, pull out the estimated intercepts, and then plot a histogram of those estimates. I’ll need packages **dplyr** and **ggplot2** for this.

```
suppressPackageStartupMessages( library(dplyr) ) # v. 0.7.5
library(ggplot2) # v 2.2.1
```

You can see all the steps in the pipe chain below. I loop through `sim_lm`

using `map_dfr()`

to extract the coefficients from each element of the list and output a data.frame of results. I use `dplyr::filter()`

to keep only the rows with estimated intercepts and then plot a histogram of these estimates for the whole simulation with `ggplot2::qplot()`

.

```
sim_lm %>%
map_dfr(broom::tidy) %>%
filter(term == "(Intercept)") %>%
qplot(x = estimate, data = ., geom = "histogram")
```

`# `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.`

There are more variants of `map()`

that you might find useful, both for simulations and in other iterative work. See the documentation for `map()`

(`?map`

) to see all of them along with additional examples.