1 Overview

In this lesson, we’ll cover the basics of merging multiple data frames or pieces of data frames. This skill will be incredibly important if you ever need to work with databases that store different aspects of the data in different files.

We’ll focus on a few different methods for merging data from the dplyr package, such as:

Mutating Joins
Filtering Joins
Binding

We’ll also take a deeper look at some of the more simple ways to join dataframes that have the same rows/columns.

1.1 Packages (and versions) used in this document

## [1] "R version 4.0.2 (2020-06-22)"

##    Package Version
##  tidyverse   1.3.0
##      dplyr   1.0.1

1.2 Data used in this document

Because the focus of this tutorial is on merging data files together, we will be utilizing a number of files in this tutorial. To, hopefully, keep this interesting, I compiled a number of data files about topics that are interesting to me (and that I can find information on within my office). These files are stored on my github account and can be accessed via the links used to import them (should you be interested in doing so).

authors <- read_csv("https://raw.githubusercontent.com/KRR1114/public_data_files/master/authors.csv")
genre <- read_csv("https://raw.githubusercontent.com/KRR1114/public_data_files/master/genre.csv")
publishers <- read_csv("https://raw.githubusercontent.com/KRR1114/public_data_files/master/publishers.csv")
series <- read_csv("https://raw.githubusercontent.com/KRR1114/public_data_files/master/series.csv")
awards <- read_csv("https://raw.githubusercontent.com/KRR1114/public_data_files/master/awards.csv")
new_weeks <- read_csv("https://raw.githubusercontent.com/KRR1114/public_data_files/master/new_weeks.csv")
night_angel <- read_csv("https://raw.githubusercontent.com/KRR1114/public_data_files/master/night_angel.csv")
lightbringer <- read_csv("https://raw.githubusercontent.com/KRR1114/public_data_files/master/lightbringer.csv")
weeks_series_i <- read_csv("https://raw.githubusercontent.com/KRR1114/public_data_files/master/weeks_series.csv")
weeks_series_c <- read_csv("https://raw.githubusercontent.com/KRR1114/public_data_files/master/weeks_series2.csv")

2 Mutating Joins

We’ll start off this tutorial with mutating joins. Mutating joins will take the original, primary dataframe and alter it by adding columns to it from other dataframes.

2.1 `inner_join()`

The first mutating join is inner_join(). This is particularly useful if you only want to include complete observations. This function only returns rows from the first dataframe that have a match in the second. For this example, we’ll use the authors and series dataframes.

authors

## # A tibble: 15 x 4
##    first   last    book                  year
##    <chr>   <chr>   <chr>                <dbl>
##  1 Trudi   Canavan Magician's Guild      2001
##  2 Trudi   Canavan Novice                2002
##  3 Trudi   Canavan High Lord             2003
##  4 Raymond Feist   Magician              1982
##  5 Raymond Feist   Silverthorn           1985
##  6 Raymond Feist   Darkness at Sethanon  1986
##  7 Nate    Silver  Signal and the Noise  2012
##  8 Daniel  Tammet  Thinking in Numbers   2012
##  9 Brent   Weeks   Way of Shadows        2008
## 10 Brent   Weeks   Shadow's Edge         2008
## 11 Brent   Weeks   Beyond the Shadows    2008
## 12 Brent   Weeks   Black Prism           2010
## 13 Brent   Weeks   Blinding Knife        2012
## 14 Brent   Weeks   Broken Eye            2014
## 15 Brent   Weeks   Blood Mirror          2016

series

## # A tibble: 13 x 2
##    book                 series                
##    <chr>                <chr>                 
##  1 Magician's Guild     Black Magician Trilogy
##  2 Novice               Black Magician Trilogy
##  3 High Lord            Black Magician Trilogy
##  4 Magician             Riftwar Saga          
##  5 Silverthorn          Riftwar Saga          
##  6 Darkness at Sethanon Riftwar Saga          
##  7 Way of Shadows       Night Angel Series    
##  8 Shadow's Edge        Night Angel Series    
##  9 Beyond the Shadows   Night Angel Series    
## 10 Black Prism          Lightbringer Series   
## 11 Blinding Knife       Lightbringer Series   
## 12 Broken Eye           Lightbringer Series   
## 13 Blood Mirror         Lightbringer Series

Our goal here is to create a dataframe that only contains books listed in the author dataframe that have a corresponding series in the series dataframe. We’ll accomplish this by using the inner_join() function.

All of the *_join() functions have the same three basic arguments. The x and y arguments specify the dataframes that we’re wanting to work with. The by = argument allows us to specify the column(s) (sometimes referred to as “anchor”- or “key”-columns) that dplyr should use to determine what information goes with which row.

inner_join(x = authors, y = series, by = "book")

## # A tibble: 13 x 5
##    first   last    book                  year series                
##    <chr>   <chr>   <chr>                <dbl> <chr>                 
##  1 Trudi   Canavan Magician's Guild      2001 Black Magician Trilogy
##  2 Trudi   Canavan Novice                2002 Black Magician Trilogy
##  3 Trudi   Canavan High Lord             2003 Black Magician Trilogy
##  4 Raymond Feist   Magician              1982 Riftwar Saga          
##  5 Raymond Feist   Silverthorn           1985 Riftwar Saga          
##  6 Raymond Feist   Darkness at Sethanon  1986 Riftwar Saga          
##  7 Brent   Weeks   Way of Shadows        2008 Night Angel Series    
##  8 Brent   Weeks   Shadow's Edge         2008 Night Angel Series    
##  9 Brent   Weeks   Beyond the Shadows    2008 Night Angel Series    
## 10 Brent   Weeks   Black Prism           2010 Lightbringer Series   
## 11 Brent   Weeks   Blinding Knife        2012 Lightbringer Series   
## 12 Brent   Weeks   Broken Eye            2014 Lightbringer Series   
## 13 Brent   Weeks   Blood Mirror          2016 Lightbringer Series

As you can see, inner_join() merged the dataframes together based on the values provided in the "book" column - only values that appeared in both dataframes appeared in our merged dataframe. We used the book column in the by = argument, because it is the only column that our primary and secondary dataframes have in common. If the dataframes had multiple columns in common, we could specify multiple columns in the by = argument (as you’ll see later).

2.2 `left_join()`

The second join that we’ll use is the left_join(). What makes it a “left” join will be more apparent when we look at the code below. For this example, we’ll use the authors and publishers dataframes. Notice that we can pass multiple key columns to our by = argument to ensure that we’re as specific as possible.

authors

## # A tibble: 15 x 4
##    first   last    book                  year
##    <chr>   <chr>   <chr>                <dbl>
##  1 Trudi   Canavan Magician's Guild      2001
##  2 Trudi   Canavan Novice                2002
##  3 Trudi   Canavan High Lord             2003
##  4 Raymond Feist   Magician              1982
##  5 Raymond Feist   Silverthorn           1985
##  6 Raymond Feist   Darkness at Sethanon  1986
##  7 Nate    Silver  Signal and the Noise  2012
##  8 Daniel  Tammet  Thinking in Numbers   2012
##  9 Brent   Weeks   Way of Shadows        2008
## 10 Brent   Weeks   Shadow's Edge         2008
## 11 Brent   Weeks   Beyond the Shadows    2008
## 12 Brent   Weeks   Black Prism           2010
## 13 Brent   Weeks   Blinding Knife        2012
## 14 Brent   Weeks   Broken Eye            2014
## 15 Brent   Weeks   Blood Mirror          2016

publishers

## # A tibble: 15 x 3
##    book                  year publisher     
##    <chr>                <dbl> <chr>         
##  1 Magician's Guild      2001 Orbit         
##  2 Novice                2002 Orbit         
##  3 High Lord             2003 Orbit         
##  4 Magician              1982 Doubleday     
##  5 Silverthorn           1985 Doubleday     
##  6 Darkness at Sethanon  1986 Doubleday     
##  7 Signal and the Noise  2012 Penguin       
##  8 Thinking in Numbers   2012 Back Bay Books
##  9 Way of Shadows        2008 Orbit         
## 10 Shadow's Edge         2008 Orbit         
## 11 Beyond the Shadows    2008 Orbit         
## 12 Black Prism           2010 Orbit         
## 13 Blinding Knife        2012 Orbit         
## 14 Broken Eye            2014 Orbit         
## 15 Blood Mirror          2016 Orbit

auth_pub_l <- left_join(x = authors, y = publishers, 
                        by = c("book", "year"))

What makes this a “left” join is that we are treating the first dataframe (the one on the left) as the primary dataframe. We’re then appending any relevant information from the second dataframe to the primary one.

auth_pub_l

## # A tibble: 15 x 5
##    first   last    book                  year publisher     
##    <chr>   <chr>   <chr>                <dbl> <chr>         
##  1 Trudi   Canavan Magician's Guild      2001 Orbit         
##  2 Trudi   Canavan Novice                2002 Orbit         
##  3 Trudi   Canavan High Lord             2003 Orbit         
##  4 Raymond Feist   Magician              1982 Doubleday     
##  5 Raymond Feist   Silverthorn           1985 Doubleday     
##  6 Raymond Feist   Darkness at Sethanon  1986 Doubleday     
##  7 Nate    Silver  Signal and the Noise  2012 Penguin       
##  8 Daniel  Tammet  Thinking in Numbers   2012 Back Bay Books
##  9 Brent   Weeks   Way of Shadows        2008 Orbit         
## 10 Brent   Weeks   Shadow's Edge         2008 Orbit         
## 11 Brent   Weeks   Beyond the Shadows    2008 Orbit         
## 12 Brent   Weeks   Black Prism           2010 Orbit         
## 13 Brent   Weeks   Blinding Knife        2012 Orbit         
## 14 Brent   Weeks   Broken Eye            2014 Orbit         
## 15 Brent   Weeks   Blood Mirror          2016 Orbit

2.3 `right_join()`

The third join that we’ll use is the right_join(). Can you intuit what the right_join() function will do based off of what left_join() does?

auth_pub_r <- right_join(x = publishers, y = authors,
                         by = c("book", "year"))

As you can see in the code, we’ve switched the dataframes in our x and y arguments. This should give us the same dataframe, right? Let’s check.

setequal(auth_pub_l, auth_pub_r)

## [1] TRUE

Alright, so it looks like they’re equal. So what’s the point of having both right and left if we can structure our code to use either? Well, this becomes important when we reintroduce our friend the pipe operator, %>%. The pipe operator is discussed at length in the tidying data tutorial on this site. Essentially, it allows us to string many pieces of code (i.e., “pipes”) together to form a longer, more readable chunk of code (i.e., a “pipeline”) by applying the output of the code before the operator (%>%) as the first argument in the code following the operator.

For a quick example of this, let’s make a dataframe of authors that have written fantasy books.

fantasy <- authors %>%
  # Let's treat authors as our primary dataframe.
  # We want to join it with genre.
  left_join(genre, by = "book") %>%
  # We then want to focus on fantasy books
  filter(genre == "Fantasy") %>%
  # And then get a dataframe of just author names and book titles.
  select(first, last, book)
fantasy

## # A tibble: 13 x 3
##    first   last    book                
##    <chr>   <chr>   <chr>               
##  1 Trudi   Canavan Magician's Guild    
##  2 Trudi   Canavan Novice              
##  3 Trudi   Canavan High Lord           
##  4 Raymond Feist   Magician            
##  5 Raymond Feist   Silverthorn         
##  6 Raymond Feist   Darkness at Sethanon
##  7 Brent   Weeks   Way of Shadows      
##  8 Brent   Weeks   Shadow's Edge       
##  9 Brent   Weeks   Beyond the Shadows  
## 10 Brent   Weeks   Black Prism         
## 11 Brent   Weeks   Blinding Knife      
## 12 Brent   Weeks   Broken Eye          
## 13 Brent   Weeks   Blood Mirror

2.4 `full_join()`

The fourth join is full_join(). This is the most inclusive of the joins. It returns every row from both dataframes. We’ll use the dataframes that we used in the inner_join() example:

authors

## # A tibble: 15 x 4
##    first   last    book                  year
##    <chr>   <chr>   <chr>                <dbl>
##  1 Trudi   Canavan Magician's Guild      2001
##  2 Trudi   Canavan Novice                2002
##  3 Trudi   Canavan High Lord             2003
##  4 Raymond Feist   Magician              1982
##  5 Raymond Feist   Silverthorn           1985
##  6 Raymond Feist   Darkness at Sethanon  1986
##  7 Nate    Silver  Signal and the Noise  2012
##  8 Daniel  Tammet  Thinking in Numbers   2012
##  9 Brent   Weeks   Way of Shadows        2008
## 10 Brent   Weeks   Shadow's Edge         2008
## 11 Brent   Weeks   Beyond the Shadows    2008
## 12 Brent   Weeks   Black Prism           2010
## 13 Brent   Weeks   Blinding Knife        2012
## 14 Brent   Weeks   Broken Eye            2014
## 15 Brent   Weeks   Blood Mirror          2016

series

## # A tibble: 13 x 2
##    book                 series                
##    <chr>                <chr>                 
##  1 Magician's Guild     Black Magician Trilogy
##  2 Novice               Black Magician Trilogy
##  3 High Lord            Black Magician Trilogy
##  4 Magician             Riftwar Saga          
##  5 Silverthorn          Riftwar Saga          
##  6 Darkness at Sethanon Riftwar Saga          
##  7 Way of Shadows       Night Angel Series    
##  8 Shadow's Edge        Night Angel Series    
##  9 Beyond the Shadows   Night Angel Series    
## 10 Black Prism          Lightbringer Series   
## 11 Blinding Knife       Lightbringer Series   
## 12 Broken Eye           Lightbringer Series   
## 13 Blood Mirror         Lightbringer Series

Let’s say that we just want to put all of this data together, whether we have complete observations (e.g., rows) or not. We can do that with full_join().

full_join(x = authors, y = series, by = "book")

## # A tibble: 15 x 5
##    first   last    book                  year series                
##    <chr>   <chr>   <chr>                <dbl> <chr>                 
##  1 Trudi   Canavan Magician's Guild      2001 Black Magician Trilogy
##  2 Trudi   Canavan Novice                2002 Black Magician Trilogy
##  3 Trudi   Canavan High Lord             2003 Black Magician Trilogy
##  4 Raymond Feist   Magician              1982 Riftwar Saga          
##  5 Raymond Feist   Silverthorn           1985 Riftwar Saga          
##  6 Raymond Feist   Darkness at Sethanon  1986 Riftwar Saga          
##  7 Nate    Silver  Signal and the Noise  2012 <NA>                  
##  8 Daniel  Tammet  Thinking in Numbers   2012 <NA>                  
##  9 Brent   Weeks   Way of Shadows        2008 Night Angel Series    
## 10 Brent   Weeks   Shadow's Edge         2008 Night Angel Series    
## 11 Brent   Weeks   Beyond the Shadows    2008 Night Angel Series    
## 12 Brent   Weeks   Black Prism           2010 Lightbringer Series   
## 13 Brent   Weeks   Blinding Knife        2012 Lightbringer Series   
## 14 Brent   Weeks   Broken Eye            2014 Lightbringer Series   
## 15 Brent   Weeks   Blood Mirror          2016 Lightbringer Series

This time, we kept our rows that didn’t have a series - their values are just missing.

2.5 Joining Joins

Finally, it is important to note that these joins can be used in conjunction with each other to combine data easily based on what aspects you want to have. For a simple example, let’s just combine all of the dataframes so far into one large dataframe called books.

books <- authors %>%
  left_join(publishers, by = c("book", "year")) %>%
  full_join(series, by = "book") %>% 
  left_join(genre, by = "book")
books

## # A tibble: 15 x 7
##    first   last    book             year publisher    series           genre    
##    <chr>   <chr>   <chr>           <dbl> <chr>        <chr>            <chr>    
##  1 Trudi   Canavan Magician's Gui…  2001 Orbit        Black Magician … Fantasy  
##  2 Trudi   Canavan Novice           2002 Orbit        Black Magician … Fantasy  
##  3 Trudi   Canavan High Lord        2003 Orbit        Black Magician … Fantasy  
##  4 Raymond Feist   Magician         1982 Doubleday    Riftwar Saga     Fantasy  
##  5 Raymond Feist   Silverthorn      1985 Doubleday    Riftwar Saga     Fantasy  
##  6 Raymond Feist   Darkness at Se…  1986 Doubleday    Riftwar Saga     Fantasy  
##  7 Nate    Silver  Signal and the…  2012 Penguin      <NA>             Mathemat…
##  8 Daniel  Tammet  Thinking in Nu…  2012 Back Bay Bo… <NA>             Mathemat…
##  9 Brent   Weeks   Way of Shadows   2008 Orbit        Night Angel Ser… Fantasy  
## 10 Brent   Weeks   Shadow's Edge    2008 Orbit        Night Angel Ser… Fantasy  
## 11 Brent   Weeks   Beyond the Sha…  2008 Orbit        Night Angel Ser… Fantasy  
## 12 Brent   Weeks   Black Prism      2010 Orbit        Lightbringer Se… Fantasy  
## 13 Brent   Weeks   Blinding Knife   2012 Orbit        Lightbringer Se… Fantasy  
## 14 Brent   Weeks   Broken Eye       2014 Orbit        Lightbringer Se… Fantasy  
## 15 Brent   Weeks   Blood Mirror     2016 Orbit        Lightbringer Se… Fantasy

3 Filtering Joins

Unlike mutating joins, filtering joins filter unwanted observations out of the primary dataframe.

3.1 `semi_join()`

The first filter join that we’ll look at is semi_join(). semi_join() takes the key column and filters the primary dataframe for only observations that also appear in the secondary dataframe. For this example, we want to know which authors in our authors dataframe received awards for books that appeared in the authors dataframe (these are listed in the awards dataframe).

authors

## # A tibble: 15 x 4
##    first   last    book                  year
##    <chr>   <chr>   <chr>                <dbl>
##  1 Trudi   Canavan Magician's Guild      2001
##  2 Trudi   Canavan Novice                2002
##  3 Trudi   Canavan High Lord             2003
##  4 Raymond Feist   Magician              1982
##  5 Raymond Feist   Silverthorn           1985
##  6 Raymond Feist   Darkness at Sethanon  1986
##  7 Nate    Silver  Signal and the Noise  2012
##  8 Daniel  Tammet  Thinking in Numbers   2012
##  9 Brent   Weeks   Way of Shadows        2008
## 10 Brent   Weeks   Shadow's Edge         2008
## 11 Brent   Weeks   Beyond the Shadows    2008
## 12 Brent   Weeks   Black Prism           2010
## 13 Brent   Weeks   Blinding Knife        2012
## 14 Brent   Weeks   Broken Eye            2014
## 15 Brent   Weeks   Blood Mirror          2016

awards

## # A tibble: 8 x 5
##   first   last   book                       award                           year
##   <chr>   <chr>  <chr>                      <chr>                          <dbl>
## 1 Trudi   Canav… Whispers of the Mist Chil… Aurealis Award for Best Fanta…  1999
## 2 Trudi   Canav… Room for Improvement       Ditmars Best Short Story        2003
## 3 Trudi   Canav… Magician's Apprentice      Aurealis Award for Best Fanta…  2009
## 4 Raymond Fiest  Servant of the Empire      HOMer Awards Best Fantasy Nov…  1991
## 5 Nate    Silver Signal and the Noise       Phi Beta Kappa Award            2013
## 6 Daniel  Tammet Born on a Blue Day         Booklist Editor's Choic Adult…  2007
## 7 Daniel  Tammet Every Word is a Bird We T… Booklist Editor's Choic Adult…  2017
## 8 Brent   Weeks  Blinding Knife             Legend Award for Best Novel     2013

Without semi_join(), we’d need to write something like this:

authors %>% 
  left_join(awards, by = c("first", "last", "book")) %>% 
  filter(!is.na(award)) %>% 
  # This filters out all non-award-winning books
  select("first", "last", "book", "year" = "year.x")

## # A tibble: 2 x 4
##   first last   book                  year
##   <chr> <chr>  <chr>                <dbl>
## 1 Nate  Silver Signal and the Noise  2012
## 2 Brent Weeks  Blinding Knife        2012

The syntax used in the select() portion of the code will be discussed in more depth later, but for now, just know that we’re specifying the variables we want and renaming them to the correct names.

Thanks to semi_join(), we can write the following code instead of the previous code.

authors %>% 
  semi_join(awards, by = "book")

## # A tibble: 2 x 4
##   first last   book                  year
##   <chr> <chr>  <chr>                <dbl>
## 1 Nate  Silver Signal and the Noise  2012
## 2 Brent Weeks  Blinding Knife        2012

3.2 `anti_join()`

anti_join() is kind of the opposite of semi_join() in that it shows you which of the rows in your primary dataframe do not have matches in the secondary dataframe. This can be handy for checking for mistakes, such as typos, in the data.

What if we want to know which of the books in our authors dataframe haven’t won awards? Simple, anti_join().

authors %>% 
  anti_join(awards, by = "book")

## # A tibble: 13 x 4
##    first   last    book                  year
##    <chr>   <chr>   <chr>                <dbl>
##  1 Trudi   Canavan Magician's Guild      2001
##  2 Trudi   Canavan Novice                2002
##  3 Trudi   Canavan High Lord             2003
##  4 Raymond Feist   Magician              1982
##  5 Raymond Feist   Silverthorn           1985
##  6 Raymond Feist   Darkness at Sethanon  1986
##  7 Daniel  Tammet  Thinking in Numbers   2012
##  8 Brent   Weeks   Way of Shadows        2008
##  9 Brent   Weeks   Shadow's Edge         2008
## 10 Brent   Weeks   Beyond the Shadows    2008
## 11 Brent   Weeks   Black Prism           2010
## 12 Brent   Weeks   Broken Eye            2014
## 13 Brent   Weeks   Blood Mirror          2016

Knowing that all of the people in the authors dataframe have at least one award in the awards dataframe (not necessarily for the books listed in the authors dataframe), we can use anti_join() to make sure that all of our authors’ names are spelled correctly.

authors %>% 
  anti_join(awards, by = c("first", "last"))

## # A tibble: 3 x 4
##   first   last  book                  year
##   <chr>   <chr> <chr>                <dbl>
## 1 Raymond Feist Magician              1982
## 2 Raymond Feist Silverthorn           1985
## 3 Raymond Feist Darkness at Sethanon  1986

Because Raymond Feist was returned, we know that his name (at least spelled correctly) doesn’t appear in the awards dataframe. If we go back up and look at it, we can see that it was spelled “Fiest” instead of “Feist.”

4 Binding

When your dataframes contain the same rows or columns you may want to simply bind them together instead of messing with joining functions. This can be done with bind_rows() and bind_cols().

Note: These perform similarly to rbind() and cbind(), respectively; however, bind_rows() and bind_cols() are more efficient for binding dataframes.

4.1 `bind_rows()`

When you have two dataframes with the same variables (columns), and you’re just wanting to append one dataframe onto the other (that is, stack the data in one dataframe on top of the other), you should use bind_rows().

For this, we’ll use the auth_pub_l dataframe that we created earlier with the left_join() function. Mr. Brent Weeks has a new book that has yet to be released (as of the writing of this tutorial). We’ll use bind_rows() to append information about this book that is contained in the new_weeks dataframe.

complete_auth <- auth_pub_l %>% 
  bind_rows(new_weeks)
complete_auth

## # A tibble: 16 x 5
##    first   last    book                  year publisher     
##    <chr>   <chr>   <chr>                <dbl> <chr>         
##  1 Trudi   Canavan Magician's Guild      2001 Orbit         
##  2 Trudi   Canavan Novice                2002 Orbit         
##  3 Trudi   Canavan High Lord             2003 Orbit         
##  4 Raymond Feist   Magician              1982 Doubleday     
##  5 Raymond Feist   Silverthorn           1985 Doubleday     
##  6 Raymond Feist   Darkness at Sethanon  1986 Doubleday     
##  7 Nate    Silver  Signal and the Noise  2012 Penguin       
##  8 Daniel  Tammet  Thinking in Numbers   2012 Back Bay Books
##  9 Brent   Weeks   Way of Shadows        2008 Orbit         
## 10 Brent   Weeks   Shadow's Edge         2008 Orbit         
## 11 Brent   Weeks   Beyond the Shadows    2008 Orbit         
## 12 Brent   Weeks   Black Prism           2010 Orbit         
## 13 Brent   Weeks   Blinding Knife        2012 Orbit         
## 14 Brent   Weeks   Broken Eye            2014 Orbit         
## 15 Brent   Weeks   Blood Mirror          2016 Orbit         
## 16 Brent   Weeks   Burning White         2019 Orbit

So we can use bind_rows() to add one row to our dataframe, but can we use it to bind 2 complete dataframes? We sure can! We’ll bind two dataframes (each having data on one of Brent’s series) with more than one row.

brent_weeks <- night_angel %>% 
  bind_rows(lightbringer)
brent_weeks

## # A tibble: 8 x 4
##   name                year genre   publisher
##   <chr>              <dbl> <chr>   <chr>    
## 1 Way of Shadows      2008 Fantasy Orbit    
## 2 Shadow's Edge       2008 Fantasy Orbit    
## 3 Beyond the Shadows  2008 Fantasy Orbit    
## 4 Black Prism         2010 Fantasy Orbit    
## 5 Blinding Knife      2012 Fantasy Orbit    
## 6 Broken Eye          2014 Fantasy Orbit    
## 7 Blood Mirror        2016 Fantasy Orbit    
## 8 Burning White       2019 Fantasy Orbit

It looks like our new brent_weeks dataframe doesn’t have the name of the series. We’ll remedy that in the next section.

4.2 `bind_cols()`

When you want to append a column from one dataframe onto another dataframe, you can use bind_cols().

CAUTION! This function will not discriminate against joining incorrect rows like the *_join() functions do. This should only be used when the order of the rows matches in both dataframes. We’ll demonstrate ths in the first chunk of code.

brent_weeks %>% 
  bind_cols(weeks_series_i)

## # A tibble: 8 x 5
##   name                year genre   publisher series      
##   <chr>              <dbl> <chr>   <chr>     <chr>       
## 1 Way of Shadows      2008 Fantasy Orbit     Lightbringer
## 2 Shadow's Edge       2008 Fantasy Orbit     Lightbringer
## 3 Beyond the Shadows  2008 Fantasy Orbit     Lightbringer
## 4 Black Prism         2010 Fantasy Orbit     Lightbringer
## 5 Blinding Knife      2012 Fantasy Orbit     Lightbringer
## 6 Broken Eye          2014 Fantasy Orbit     Night Angel 
## 7 Blood Mirror        2016 Fantasy Orbit     Night Angel 
## 8 Burning White       2019 Fantasy Orbit     Night Angel

As we can see, bind_cols() has no way to determine that the information in weeks_series_i is in the incorrect order, which is why you should be very cautious when using the bind_*() functions.

In the next code chunk, we’ll bind the correctly ordered column to brent_weeks. We can tell it’s correct, because the Night Angel trilogy was published, in its entirety, in 2008.

brent_weeks %>% 
  bind_cols(weeks_series_c)

## # A tibble: 8 x 5
##   name                year genre   publisher series      
##   <chr>              <dbl> <chr>   <chr>     <chr>       
## 1 Way of Shadows      2008 Fantasy Orbit     Night Angel 
## 2 Shadow's Edge       2008 Fantasy Orbit     Night Angel 
## 3 Beyond the Shadows  2008 Fantasy Orbit     Night Angel 
## 4 Black Prism         2010 Fantasy Orbit     Lightbringer
## 5 Blinding Knife      2012 Fantasy Orbit     Lightbringer
## 6 Broken Eye          2014 Fantasy Orbit     Lightbringer
## 7 Blood Mirror        2016 Fantasy Orbit     Lightbringer
## 8 Burning White       2019 Fantasy Orbit     Lightbringer

5 Some Issues that may Arise

5.1 Same name - Different variables

What if our dataframes use the same name for different variables? For example:

colnames(authors)

## [1] "first" "last"  "book"  "year"

colnames(awards)

## [1] "first" "last"  "book"  "award" "year"

We can see that the first, last, book, and year variables appear in both dataframes. The first, last, and book variables represent the same information, but the year variables represent two different years - the year of publication and the year the award was won. What happens if we join them?

authors %>% 
  left_join(awards, by = c("first", "last", "book", "year"))

## # A tibble: 15 x 5
##    first   last    book                  year award
##    <chr>   <chr>   <chr>                <dbl> <chr>
##  1 Trudi   Canavan Magician's Guild      2001 <NA> 
##  2 Trudi   Canavan Novice                2002 <NA> 
##  3 Trudi   Canavan High Lord             2003 <NA> 
##  4 Raymond Feist   Magician              1982 <NA> 
##  5 Raymond Feist   Silverthorn           1985 <NA> 
##  6 Raymond Feist   Darkness at Sethanon  1986 <NA> 
##  7 Nate    Silver  Signal and the Noise  2012 <NA> 
##  8 Daniel  Tammet  Thinking in Numbers   2012 <NA> 
##  9 Brent   Weeks   Way of Shadows        2008 <NA> 
## 10 Brent   Weeks   Shadow's Edge         2008 <NA> 
## 11 Brent   Weeks   Beyond the Shadows    2008 <NA> 
## 12 Brent   Weeks   Black Prism           2010 <NA> 
## 13 Brent   Weeks   Blinding Knife        2012 <NA> 
## 14 Brent   Weeks   Broken Eye            2014 <NA> 
## 15 Brent   Weeks   Blood Mirror          2016 <NA>

If we don’t realize that the years represent different things, and we try to use the year column as a key, we can see that we don’t successfully join any information from our secondary dataframe because this combination of keys doesn’t completely match any rows in the secondary dataframe. This is a good tip that something might be wrong with our code or our data.

If we realized that the year variables represent different things and don’t use it as a key, we can see what happens if we have multiple columns with the same name.

authors %>% 
  left_join(awards, by = c("first", "last", "book"))

## # A tibble: 15 x 6
##    first   last    book                 year.x award                      year.y
##    <chr>   <chr>   <chr>                 <dbl> <chr>                       <dbl>
##  1 Trudi   Canavan Magician's Guild       2001 <NA>                           NA
##  2 Trudi   Canavan Novice                 2002 <NA>                           NA
##  3 Trudi   Canavan High Lord              2003 <NA>                           NA
##  4 Raymond Feist   Magician               1982 <NA>                           NA
##  5 Raymond Feist   Silverthorn            1985 <NA>                           NA
##  6 Raymond Feist   Darkness at Sethanon   1986 <NA>                           NA
##  7 Nate    Silver  Signal and the Noise   2012 Phi Beta Kappa Award         2013
##  8 Daniel  Tammet  Thinking in Numbers    2012 <NA>                           NA
##  9 Brent   Weeks   Way of Shadows         2008 <NA>                           NA
## 10 Brent   Weeks   Shadow's Edge          2008 <NA>                           NA
## 11 Brent   Weeks   Beyond the Shadows     2008 <NA>                           NA
## 12 Brent   Weeks   Black Prism            2010 <NA>                           NA
## 13 Brent   Weeks   Blinding Knife         2012 Legend Award for Best Nov…   2013
## 14 Brent   Weeks   Broken Eye             2014 <NA>                           NA
## 15 Brent   Weeks   Blood Mirror           2016 <NA>                           NA

Thankfully, dplyr doesn’t try to join those two columns. Instead, it renames them (intuitively, year.x is from our primary dataframe, and year.y is from our secondary), and that is easy enough for us to fix - as shown in the code below.

authors %>% 
  left_join(awards, by = c("first", "last", "book")) %>%
  rename(pub_year = year.x, award_year = year.y)

## # A tibble: 15 x 6
##    first   last    book              pub_year award                   award_year
##    <chr>   <chr>   <chr>                <dbl> <chr>                        <dbl>
##  1 Trudi   Canavan Magician's Guild      2001 <NA>                            NA
##  2 Trudi   Canavan Novice                2002 <NA>                            NA
##  3 Trudi   Canavan High Lord             2003 <NA>                            NA
##  4 Raymond Feist   Magician              1982 <NA>                            NA
##  5 Raymond Feist   Silverthorn           1985 <NA>                            NA
##  6 Raymond Feist   Darkness at Seth…     1986 <NA>                            NA
##  7 Nate    Silver  Signal and the N…     2012 Phi Beta Kappa Award          2013
##  8 Daniel  Tammet  Thinking in Numb…     2012 <NA>                            NA
##  9 Brent   Weeks   Way of Shadows        2008 <NA>                            NA
## 10 Brent   Weeks   Shadow's Edge         2008 <NA>                            NA
## 11 Brent   Weeks   Beyond the Shado…     2008 <NA>                            NA
## 12 Brent   Weeks   Black Prism           2010 <NA>                            NA
## 13 Brent   Weeks   Blinding Knife        2012 Legend Award for Best …       2013
## 14 Brent   Weeks   Broken Eye            2014 <NA>                            NA
## 15 Brent   Weeks   Blood Mirror          2016 <NA>                            NA

5.2 Same variable - Different name

So now we know how to deal with having the same name applied to different variables, but what do we do if our KEY has different names in each dataframe? Such as,

complete_auth

## # A tibble: 16 x 5
##    first   last    book                  year publisher     
##    <chr>   <chr>   <chr>                <dbl> <chr>         
##  1 Trudi   Canavan Magician's Guild      2001 Orbit         
##  2 Trudi   Canavan Novice                2002 Orbit         
##  3 Trudi   Canavan High Lord             2003 Orbit         
##  4 Raymond Feist   Magician              1982 Doubleday     
##  5 Raymond Feist   Silverthorn           1985 Doubleday     
##  6 Raymond Feist   Darkness at Sethanon  1986 Doubleday     
##  7 Nate    Silver  Signal and the Noise  2012 Penguin       
##  8 Daniel  Tammet  Thinking in Numbers   2012 Back Bay Books
##  9 Brent   Weeks   Way of Shadows        2008 Orbit         
## 10 Brent   Weeks   Shadow's Edge         2008 Orbit         
## 11 Brent   Weeks   Beyond the Shadows    2008 Orbit         
## 12 Brent   Weeks   Black Prism           2010 Orbit         
## 13 Brent   Weeks   Blinding Knife        2012 Orbit         
## 14 Brent   Weeks   Broken Eye            2014 Orbit         
## 15 Brent   Weeks   Blood Mirror          2016 Orbit         
## 16 Brent   Weeks   Burning White         2019 Orbit

brent_weeks

## # A tibble: 8 x 4
##   name                year genre   publisher
##   <chr>              <dbl> <chr>   <chr>    
## 1 Way of Shadows      2008 Fantasy Orbit    
## 2 Shadow's Edge       2008 Fantasy Orbit    
## 3 Beyond the Shadows  2008 Fantasy Orbit    
## 4 Black Prism         2010 Fantasy Orbit    
## 5 Blinding Knife      2012 Fantasy Orbit    
## 6 Broken Eye          2014 Fantasy Orbit    
## 7 Blood Mirror        2016 Fantasy Orbit    
## 8 Burning White       2019 Fantasy Orbit

What is our key here? Is it book like in the complete_auth dataframe, or is it name like in the brent_weeks dataframe? We can get around this fairly easily by changing how we specify our by = argument.

complete_auth %>% 
  semi_join(brent_weeks, by = c("book" = "name"))

## # A tibble: 8 x 5
##   first last  book                year publisher
##   <chr> <chr> <chr>              <dbl> <chr>    
## 1 Brent Weeks Way of Shadows      2008 Orbit    
## 2 Brent Weeks Shadow's Edge       2008 Orbit    
## 3 Brent Weeks Beyond the Shadows  2008 Orbit    
## 4 Brent Weeks Black Prism         2010 Orbit    
## 5 Brent Weeks Blinding Knife      2012 Orbit    
## 6 Brent Weeks Broken Eye          2014 Orbit    
## 7 Brent Weeks Blood Mirror        2016 Orbit    
## 8 Brent Weeks Burning White       2019 Orbit

Notice that the first value in by = refers to the primary dataframe’s naming convention.

Merging Dataframes in R

Original Author: Kyle R. Ripley

Updated on 6/25/2020 by Kyle R. Ripley

1 Overview

1.1 Packages (and versions) used in this document

1.2 Data used in this document

2 Mutating Joins

2.1 `inner_join()`

2.2 `left_join()`

2.3 `right_join()`

2.4 `full_join()`

2.5 Joining Joins

3 Filtering Joins

3.1 `semi_join()`

3.2 `anti_join()`

4 Binding

4.1 `bind_rows()`

4.2 `bind_cols()`

5 Some Issues that may Arise

5.1 Same name - Different variables

5.2 Same variable - Different name

Merging Dataframes in R

Original Author: Kyle R. Ripley

Updated on 6/25/2020 by Kyle R. Ripley

1 Overview

1.1 Packages (and versions) used in this document

1.2 Data used in this document

2 Mutating Joins

2.1 inner_join()

2.2 left_join()

2.3 right_join()

2.4 full_join()

2.5 Joining Joins

3 Filtering Joins

3.1 semi_join()

3.2 anti_join()

4 Binding

4.1 bind_rows()

4.2 bind_cols()

5 Some Issues that may Arise

5.1 Same name - Different variables

5.2 Same variable - Different name

2.1 `inner_join()`

2.2 `left_join()`

2.3 `right_join()`

2.4 `full_join()`

3.1 `semi_join()`

3.2 `anti_join()`

4.1 `bind_rows()`

4.2 `bind_cols()`