1 Overview

In this tutorial, we’ll cover the basics of iterative coding using the “purrr” package inside of the tidyverse.

1.1 Packages (and versions) used in this document

## [1] "R version 4.0.2 (2020-06-22)"
##    Package Version
##  tidyverse   1.3.0
##      purrr   0.3.4

2 Using purrr

The purrr package is incredibly useful when you need to iterate your code. For example, if we wanted to calculate the median of each column in the iris dataset (provided in R) we could do it manually as shown below.

median(iris$Sepal.Length)
## [1] 5.8
median(iris$Sepal.Width)
## [1] 3
median(iris$Petal.Length)
## [1] 4.35
median(iris$Petal.Width)
## [1] 1.3

Or, we could use the map_*() functions inside of the purrr package.

map_dbl(.x = iris[,1:4], .f = median)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##         5.80         3.00         4.35         1.30

What the map_*() function is doing behind the scenes is taking the data (.x =) and applying the function (.f =) to each column (in this instance). There are many different map_*() functions that have their own special uses, and we’ll dive more into these nuances now.

2.1 map_*()

The map_*() functions rely on two main arguments:

map_*(.x, .f)

These arguments represent a list/vector/df and a function, respectively. map_*() then applies the function to each of the elements in the list/vector/df and outputs the result in a prespecified format.

“Precisely which elements are the function applied to,” you might ask. Well, this depends on what type of data you’re feeding into the map_*() function:

  • map(.x = df, .f) will apply .f to each column of the dataframe
  • map(.x = list, .f) will apply .f to each element of the list
  • map(.x = vector, .f) will apply .f to each element of the vector

“But how do we specify the output format,” you may ask. This is simpler than it may seem at first. The specific map_*() function that you choose will return a specific type of output:

  • map() returns a list
  • map_chr() returns a character vector
  • map_dbl() returns a double vector
  • map_int() returns an integer vector
  • map_lgl() returns a logical vector

2.1.1 Practice

In this section, we’ll do a few practice problems to illustrate the usefulness of the map family of functions. First, let’s make a dataframe and a list.

dat <- tibble(
  a = c(1:10),
  b = c(11:20),
  c = c(21:30),
  d = c(31:40),
  e = c(41:50),
  f = c(rep(1:5, each = 2)))
dat
## # A tibble: 10 x 6
##        a     b     c     d     e     f
##    <int> <int> <int> <int> <int> <int>
##  1     1    11    21    31    41     1
##  2     2    12    22    32    42     1
##  3     3    13    23    33    43     2
##  4     4    14    24    34    44     2
##  5     5    15    25    35    45     3
##  6     6    16    26    36    46     3
##  7     7    17    27    37    47     4
##  8     8    18    28    38    48     4
##  9     9    19    29    39    49     5
## 10    10    20    30    40    50     5
list_of_lists <- list(
  list(number = 7, character = "S"),
  list(number = 8, character = "E"),
  list(number = 9, character = "N"))
list_of_lists
## [[1]]
## [[1]]$number
## [1] 7
## 
## [[1]]$character
## [1] "S"
## 
## 
## [[2]]
## [[2]]$number
## [1] 8
## 
## [[2]]$character
## [1] "E"
## 
## 
## [[3]]
## [[3]]$number
## [1] 9
## 
## [[3]]$character
## [1] "N"

Problem 1:
We want to make a list of column means from our dataframe. Because we want our results to be returned as a list, we’re going to use map().

p1 <- map(.x = dat, .f = mean)
class(p1)
## [1] "list"
p1
## $a
## [1] 5.5
## 
## $b
## [1] 15.5
## 
## $c
## [1] 25.5
## 
## $d
## [1] 35.5
## 
## $e
## [1] 45.5
## 
## $f
## [1] 3

Problem 2:
Repeat problem 1, but make an appropriate vector instead of a list.
Note: Because we know what the output is already, we should be able to choose the correct map_*() function. If we look back at the results of Problem 1, we can see that it returned data of type double - leading us to use map_dbl().

p2 <- map_dbl(.x = dat, .f = mean)
class(p2)
## [1] "numeric"
p2
##    a    b    c    d    e    f 
##  5.5 15.5 25.5 35.5 45.5  3.0

Problem 3:
We want to determine which elements in our first column are greater than or equal to 5. We’ll write our own function to accomplish this. Because the use of logical operators (>, <, ==, etc.) returns a logical vector, we know to use map_lgl().

gte5 <- function(x) {
  x >= 5
}

p3 <- map_lgl(.x = dat$a, .f = gte5)
class(p3)
## [1] "logical"
p3
##  [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

Problem 4:
We want to take our third column and change all values that are 25 or greater to be “25+”.
Note: for this we’ll write an anonymous function inside of the correct map_*() function. Because we’re wanting this vector to contain the value “25+” (a character string), we know that we should use the map_chr() function.

p4 <- map_chr(.x = dat$c, .f = function(x) {
  if(x >= 25){x <- "25+"} else{x <- x} })
class(p4)
## [1] "character"
p4
##  [1] "21"  "22"  "23"  "24"  "25+" "25+" "25+" "25+" "25+" "25+"

2.2 .f shortcuts

So far we’ve seen 3 different ways to specify the .f argument in the map_*() functions:

  1. map_dbl(.x = dat, .f = mean) used an existing function
  2. map_lgl(.x = dat$a, .f = gte5) used a function we wrote
  3. map_chr(.x = dat$c, .f = function(x){if(x >= 25){x <- "25+"}else{x <- x}}) used an anonymous function we wrote inside of map_*()

2.3 Using purrr to run multiple models

In our professional lives, we all (at one point or another) will wish to run a bunch of models on various subsets of our data. How will the things you’ve learned in this tutorial help you to do this faster and easier? Let’s find out!

2.3.1 Splitting our dataset

For this example, we’ll be using the iris dataset. Additionally, we want to split our dataset by the 3 different species of iris in the data. We can easily do this with the split() function.

iris_split <- split(x = iris, f = iris$Species)

We’re specifying two arguments in split(): x = is the dataset that we want to split, and f = is the factor in the dataset that we want to split by.

Note: This creates a list in which each element of the list is a dataframe that corresponding to each element of the specified variable. Our list contains 3 dataframes: one for each Species in the dataset.

2.3.2 Model building

Let’s say that we want to predict Sepal.Length from Sepal.Width. We should build a general model to make sure it works.

lm(Sepal.Length ~ Sepal.Width, data = iris)
## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = iris)
## 
## Coefficients:
## (Intercept)  Sepal.Width  
##      6.5262      -0.2234

This isn’t new, so let’s get onto the fun, new things you will be able to do from now on.

2.3.3 Model for each subset

For this, we want to make our results into a list. We’ll use our formula shortcut from earlier to simplify the map() arguments. Also, we’re using our split iris dataframe.

models <- map(iris_split, ~ lm(Sepal.Length ~ Sepal.Width, data = .))
models
## $setosa
## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = .)
## 
## Coefficients:
## (Intercept)  Sepal.Width  
##      2.6390       0.6905  
## 
## 
## $versicolor
## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = .)
## 
## Coefficients:
## (Intercept)  Sepal.Width  
##      3.5397       0.8651  
## 
## 
## $virginica
## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = .)
## 
## Coefficients:
## (Intercept)  Sepal.Width  
##      3.9068       0.9015

Here we can see all of the different pieces of information that are available to us (these are always available when you run an lm() model).

2.3.4 Extract model coefficients

For this, we’ll use the coef() function to pull out the coefficients for us.

coefs <- map(models, coef)
coefs
## $setosa
## (Intercept) Sepal.Width 
##   2.6390012   0.6904897 
## 
## $versicolor
## (Intercept) Sepal.Width 
##   3.5397347   0.8650777 
## 
## $virginica
## (Intercept) Sepal.Width 
##   3.9068365   0.9015345

Now, let’s use our subsetting shortcuts from earlier to pull out the slopes for Sepal.Width associated with our three species.

map(coefs, "Sepal.Width") # Could also use position, which is 2
## $setosa
## [1] 0.6904897
## 
## $versicolor
## [1] 0.8650777
## 
## $virginica
## [1] 0.9015345

2.3.5 Putting it all together and generalizing

Let’s look at how we can write this code more succinctly and how we can pull a wider range of values from it.

This first chunk of code will subset our data and run our model on each subset.

models2 <- iris %>% # original data
  split(iris$Species) %>% # split data
  map(~ lm(Sepal.Length ~ Sepal.Width, data = .)) # model

This second chunk of code will generate a summary for each model, from which we can pull out whatever information we need.

mod_sum <- models2 %>%
  map(summary)
mod_sum
## $setosa
## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52476 -0.16286  0.02166  0.13833  0.44428 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.6390     0.3100   8.513 3.74e-11 ***
## Sepal.Width   0.6905     0.0899   7.681 6.71e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2385 on 48 degrees of freedom
## Multiple R-squared:  0.5514, Adjusted R-squared:  0.542 
## F-statistic: 58.99 on 1 and 48 DF,  p-value: 6.71e-10
## 
## 
## $versicolor
## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73497 -0.28556 -0.07544  0.43666  0.83805 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.5397     0.5629   6.289 9.07e-08 ***
## Sepal.Width   0.8651     0.2019   4.284 8.77e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4436 on 48 degrees of freedom
## Multiple R-squared:  0.2766, Adjusted R-squared:  0.2615 
## F-statistic: 18.35 on 1 and 48 DF,  p-value: 8.772e-05
## 
## 
## $virginica
## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.26067 -0.36921 -0.03606  0.19841  1.44917 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.9068     0.7571   5.161 4.66e-06 ***
## Sepal.Width   0.9015     0.2531   3.562 0.000843 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5714 on 48 degrees of freedom
## Multiple R-squared:  0.2091, Adjusted R-squared:  0.1926 
## F-statistic: 12.69 on 1 and 48 DF,  p-value: 0.0008435

Inside of mod_sum, we can determine which values we want to extract, and then do so via subsetting. For example, the f-statistics:

map(mod_sum, "fstatistic")
## $setosa
##    value    numdf    dendf 
## 58.99373  1.00000 48.00000 
## 
## $versicolor
##    value    numdf    dendf 
## 18.35169  1.00000 48.00000 
## 
## $virginica
##    value    numdf    dendf 
## 12.68707  1.00000 48.00000

or the r-squared:

map(mod_sum, "r.squared")
## $setosa
## [1] 0.5513756
## 
## $versicolor
## [1] 0.2765821
## 
## $virginica
## [1] 0.2090573

or if you know what you want ahead of time, you can put it all into one pipeline. The following chunk of code will split the data, run the model, and extract the r-squared.

mod_rsquared <- iris %>% 
  split(iris$Species) %>% 
  map(~ lm(Sepal.Length ~ Sepal.Width, data = .)) %>% 
  map(summary) %>% 
  map("r.squared")
mod_rsquared
## $setosa
## [1] 0.5513756
## 
## $versicolor
## [1] 0.2765821
## 
## $virginica
## [1] 0.2090573

2.4 Additional .x Values

So far, we’ve only used the functions in purrr that allow for functions to be applied to 1 .x value. However, there are functions that allow for multiple .x values. These are:

  • map2(.x, .y, .f)
  • map2_chr(.x, .y, .f)
  • map2_dbl(.x, .y, .f)
  • map2_int(.x, .y, .f)
  • map2_lgl(.x, .y, .f)

As you can tell, the map2_*() functions allow for two vectors, datasets, or lists of the same length to have a function iterated over them.

Instead of having map3(), map4(), etc., the purrr package has another set of functions to handle iteration over more than two vectors, datasets, or lists:

  • pmap(.l, .f)
  • pmap_chr(.l, .f)
  • pmap_dbl(.l, .f)
  • pmap_int(.l, .f)
  • pmap_lgl(.l, .f)

These functions behave in much the same way as the basic map_*() functions. The .l argument in the pmap_*() functions is a list of the vectors, datasets, or lists that the function (.f) should be iterated over.