In this tutorial, we’ll cover the basics of iterative coding using the “purrr” package inside of the tidyverse.
## [1] "R version 4.0.2 (2020-06-22)"
## Package Version
## tidyverse 1.3.0
## purrr 0.3.4
purrr
The purrr package is incredibly useful when you need to iterate your code. For example, if we wanted to calculate the median of each column in the iris
dataset (provided in R) we could do it manually as shown below.
median(iris$Sepal.Length)
## [1] 5.8
median(iris$Sepal.Width)
## [1] 3
median(iris$Petal.Length)
## [1] 4.35
median(iris$Petal.Width)
## [1] 1.3
Or, we could use the map_*()
functions inside of the purrr package.
map_dbl(.x = iris[,1:4], .f = median)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.80 3.00 4.35 1.30
What the map_*()
function is doing behind the scenes is taking the data (.x =
) and applying the function (.f =
) to each column (in this instance). There are many different map_*()
functions that have their own special uses, and we’ll dive more into these nuances now.
map_*()
The map_*()
functions rely on two main arguments:
map_*(.x, .f)
These arguments represent a list/vector/df and a function, respectively. map_*()
then applies the function to each of the elements in the list/vector/df and outputs the result in a prespecified format.
“Precisely which elements are the function applied to,” you might ask. Well, this depends on what type of data you’re feeding into the map_*()
function:
map(.x = df, .f)
will apply .f
to each column of the dataframemap(.x = list, .f)
will apply .f
to each element of the listmap(.x = vector, .f)
will apply .f
to each element of the vector“But how do we specify the output format,” you may ask. This is simpler than it may seem at first. The specific map_*()
function that you choose will return a specific type of output:
map()
returns a listmap_chr()
returns a character vectormap_dbl()
returns a double vectormap_int()
returns an integer vectormap_lgl()
returns a logical vectorIn this section, we’ll do a few practice problems to illustrate the usefulness of the map
family of functions. First, let’s make a dataframe and a list.
dat <- tibble(
a = c(1:10),
b = c(11:20),
c = c(21:30),
d = c(31:40),
e = c(41:50),
f = c(rep(1:5, each = 2)))
dat
## # A tibble: 10 x 6
## a b c d e f
## <int> <int> <int> <int> <int> <int>
## 1 1 11 21 31 41 1
## 2 2 12 22 32 42 1
## 3 3 13 23 33 43 2
## 4 4 14 24 34 44 2
## 5 5 15 25 35 45 3
## 6 6 16 26 36 46 3
## 7 7 17 27 37 47 4
## 8 8 18 28 38 48 4
## 9 9 19 29 39 49 5
## 10 10 20 30 40 50 5
list_of_lists <- list(
list(number = 7, character = "S"),
list(number = 8, character = "E"),
list(number = 9, character = "N"))
list_of_lists
## [[1]]
## [[1]]$number
## [1] 7
##
## [[1]]$character
## [1] "S"
##
##
## [[2]]
## [[2]]$number
## [1] 8
##
## [[2]]$character
## [1] "E"
##
##
## [[3]]
## [[3]]$number
## [1] 9
##
## [[3]]$character
## [1] "N"
Problem 1:
We want to make a list of column means from our dataframe. Because we want our results to be returned as a list, we’re going to use map()
.
p1 <- map(.x = dat, .f = mean)
class(p1)
## [1] "list"
p1
## $a
## [1] 5.5
##
## $b
## [1] 15.5
##
## $c
## [1] 25.5
##
## $d
## [1] 35.5
##
## $e
## [1] 45.5
##
## $f
## [1] 3
Problem 2:
Repeat problem 1, but make an appropriate vector instead of a list.
Note: Because we know what the output is already, we should be able to choose the correct map_*()
function. If we look back at the results of Problem 1, we can see that it returned data of type double - leading us to use map_dbl()
.
p2 <- map_dbl(.x = dat, .f = mean)
class(p2)
## [1] "numeric"
p2
## a b c d e f
## 5.5 15.5 25.5 35.5 45.5 3.0
Problem 3:
We want to determine which elements in our first column are greater than or equal to 5. We’ll write our own function to accomplish this. Because the use of logical operators (>
, <
, ==
, etc.) returns a logical vector, we know to use map_lgl()
.
gte5 <- function(x) {
x >= 5
}
p3 <- map_lgl(.x = dat$a, .f = gte5)
class(p3)
## [1] "logical"
p3
## [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
Problem 4:
We want to take our third column and change all values that are 25 or greater to be “25+”.
Note: for this we’ll write an anonymous function inside of the correct map_*()
function. Because we’re wanting this vector to contain the value “25+” (a character string), we know that we should use the map_chr()
function.
p4 <- map_chr(.x = dat$c, .f = function(x) {
if(x >= 25){x <- "25+"} else{x <- x} })
class(p4)
## [1] "character"
p4
## [1] "21" "22" "23" "24" "25+" "25+" "25+" "25+" "25+" "25+"
.f
shortcutsSo far we’ve seen 3 different ways to specify the .f
argument in the map_*()
functions:
map_dbl(.x = dat, .f = mean)
used an existing functionmap_lgl(.x = dat$a, .f = gte5)
used a function we wrotemap_chr(.x = dat$c, .f = function(x){if(x >= 25){x <- "25+"}else{x <- x}})
used an anonymous function we wrote inside of map_*()
Our first shortcut comes into play when you need to write an anonymous function inside of map_*()
. Here is our answer to problem 3 written with an anonymous function.
map_lgl(.x = dat$a, .f = function(x){x >= 5})
## [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
map_*()
allows us to instead write these anonymous functions as a formula. In these formulas, we replace the function(x)
with ~
and the argument is replaced with .
inside the formula.
map_lgl(.x = dat$a, .f = ~ . >= 5) # with argument names
## [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
map_lgl(dat$a, ~ . >= 5) # without argument names
## [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
Our next shortcut involves subsetting. If we wanted to extract only the number
elements from each list in our list_of_lists
, we’d need to do this:
map_dbl(list_of_lists, function(x){x[["number"]]})
## [1] 7 8 9
However, since subsetting is such a common practice, map_*()
allows us to simplify the process:
map_dbl(list_of_lists, "number") # This uses the name to subset
## [1] 7 8 9
map_dbl(list_of_lists, 1) # This uses the position to subset
## [1] 7 8 9
In our professional lives, we all (at one point or another) will wish to run a bunch of models on various subsets of our data. How will the things you’ve learned in this tutorial help you to do this faster and easier? Let’s find out!
For this example, we’ll be using the iris
dataset. Additionally, we want to split our dataset by the 3 different species of iris in the data. We can easily do this with the split()
function.
iris_split <- split(x = iris, f = iris$Species)
We’re specifying two arguments in split()
: x =
is the dataset that we want to split, and f =
is the factor in the dataset that we want to split by.
Note: This creates a list in which each element of the list is a dataframe that corresponding to each element of the specified variable. Our list contains 3 dataframes: one for each Species
in the dataset.
Let’s say that we want to predict Sepal.Length
from Sepal.Width
. We should build a general model to make sure it works.
lm(Sepal.Length ~ Sepal.Width, data = iris)
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = iris)
##
## Coefficients:
## (Intercept) Sepal.Width
## 6.5262 -0.2234
This isn’t new, so let’s get onto the fun, new things you will be able to do from now on.
For this, we want to make our results into a list. We’ll use our formula shortcut from earlier to simplify the map()
arguments. Also, we’re using our split iris dataframe.
models <- map(iris_split, ~ lm(Sepal.Length ~ Sepal.Width, data = .))
models
## $setosa
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = .)
##
## Coefficients:
## (Intercept) Sepal.Width
## 2.6390 0.6905
##
##
## $versicolor
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = .)
##
## Coefficients:
## (Intercept) Sepal.Width
## 3.5397 0.8651
##
##
## $virginica
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = .)
##
## Coefficients:
## (Intercept) Sepal.Width
## 3.9068 0.9015
Here we can see all of the different pieces of information that are available to us (these are always available when you run an lm()
model).
For this, we’ll use the coef()
function to pull out the coefficients for us.
coefs <- map(models, coef)
coefs
## $setosa
## (Intercept) Sepal.Width
## 2.6390012 0.6904897
##
## $versicolor
## (Intercept) Sepal.Width
## 3.5397347 0.8650777
##
## $virginica
## (Intercept) Sepal.Width
## 3.9068365 0.9015345
Now, let’s use our subsetting shortcuts from earlier to pull out the slopes for Sepal.Width
associated with our three species.
map(coefs, "Sepal.Width") # Could also use position, which is 2
## $setosa
## [1] 0.6904897
##
## $versicolor
## [1] 0.8650777
##
## $virginica
## [1] 0.9015345
Let’s look at how we can write this code more succinctly and how we can pull a wider range of values from it.
This first chunk of code will subset our data and run our model on each subset.
models2 <- iris %>% # original data
split(iris$Species) %>% # split data
map(~ lm(Sepal.Length ~ Sepal.Width, data = .)) # model
This second chunk of code will generate a summary for each model, from which we can pull out whatever information we need.
mod_sum <- models2 %>%
map(summary)
mod_sum
## $setosa
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52476 -0.16286 0.02166 0.13833 0.44428
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6390 0.3100 8.513 3.74e-11 ***
## Sepal.Width 0.6905 0.0899 7.681 6.71e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2385 on 48 degrees of freedom
## Multiple R-squared: 0.5514, Adjusted R-squared: 0.542
## F-statistic: 58.99 on 1 and 48 DF, p-value: 6.71e-10
##
##
## $versicolor
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.73497 -0.28556 -0.07544 0.43666 0.83805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.5397 0.5629 6.289 9.07e-08 ***
## Sepal.Width 0.8651 0.2019 4.284 8.77e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4436 on 48 degrees of freedom
## Multiple R-squared: 0.2766, Adjusted R-squared: 0.2615
## F-statistic: 18.35 on 1 and 48 DF, p-value: 8.772e-05
##
##
## $virginica
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.26067 -0.36921 -0.03606 0.19841 1.44917
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.9068 0.7571 5.161 4.66e-06 ***
## Sepal.Width 0.9015 0.2531 3.562 0.000843 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5714 on 48 degrees of freedom
## Multiple R-squared: 0.2091, Adjusted R-squared: 0.1926
## F-statistic: 12.69 on 1 and 48 DF, p-value: 0.0008435
Inside of mod_sum
, we can determine which values we want to extract, and then do so via subsetting. For example, the f-statistics:
map(mod_sum, "fstatistic")
## $setosa
## value numdf dendf
## 58.99373 1.00000 48.00000
##
## $versicolor
## value numdf dendf
## 18.35169 1.00000 48.00000
##
## $virginica
## value numdf dendf
## 12.68707 1.00000 48.00000
or the r-squared:
map(mod_sum, "r.squared")
## $setosa
## [1] 0.5513756
##
## $versicolor
## [1] 0.2765821
##
## $virginica
## [1] 0.2090573
or if you know what you want ahead of time, you can put it all into one pipeline. The following chunk of code will split the data, run the model, and extract the r-squared.
mod_rsquared <- iris %>%
split(iris$Species) %>%
map(~ lm(Sepal.Length ~ Sepal.Width, data = .)) %>%
map(summary) %>%
map("r.squared")
mod_rsquared
## $setosa
## [1] 0.5513756
##
## $versicolor
## [1] 0.2765821
##
## $virginica
## [1] 0.2090573
.x
ValuesSo far, we’ve only used the functions in purrr that allow for functions to be applied to 1 .x value. However, there are functions that allow for multiple .x
values. These are:
map2(.x, .y, .f)
map2_chr(.x, .y, .f)
map2_dbl(.x, .y, .f)
map2_int(.x, .y, .f)
map2_lgl(.x, .y, .f)
As you can tell, the map2_*()
functions allow for two vectors, datasets, or lists of the same length to have a function iterated over them.
Instead of having map3()
, map4()
, etc., the purrr package has another set of functions to handle iteration over more than two vectors, datasets, or lists:
pmap(.l, .f)
pmap_chr(.l, .f)
pmap_dbl(.l, .f)
pmap_int(.l, .f)
pmap_lgl(.l, .f)
These functions behave in much the same way as the basic map_*()
functions. The .l
argument in the pmap_*()
functions is a list of the vectors, datasets, or lists that the function (.f
) should be iterated over.