In this lesson, we’ll cover the basics of merging multiple data frames or pieces of data frames. This skill will be incredibly important if you ever need to work with databases that store different aspects of the data in different files.
We’ll focus on a few different methods for merging data from the dplyr
package, such as:
We’ll also take a deeper look at some of the more simple ways to join dataframes that have the same rows/columns.
## [1] "R version 4.0.2 (2020-06-22)"
## Package Version
## tidyverse 1.3.0
## dplyr 1.0.1
Because the focus of this tutorial is on merging data files together, we will be utilizing a number of files in this tutorial. To, hopefully, keep this interesting, I compiled a number of data files about topics that are interesting to me (and that I can find information on within my office). These files are stored on my github account and can be accessed via the links used to import them (should you be interested in doing so).
authors <- read_csv("https://raw.githubusercontent.com/KRR1114/public_data_files/master/authors.csv")
genre <- read_csv("https://raw.githubusercontent.com/KRR1114/public_data_files/master/genre.csv")
publishers <- read_csv("https://raw.githubusercontent.com/KRR1114/public_data_files/master/publishers.csv")
series <- read_csv("https://raw.githubusercontent.com/KRR1114/public_data_files/master/series.csv")
awards <- read_csv("https://raw.githubusercontent.com/KRR1114/public_data_files/master/awards.csv")
new_weeks <- read_csv("https://raw.githubusercontent.com/KRR1114/public_data_files/master/new_weeks.csv")
night_angel <- read_csv("https://raw.githubusercontent.com/KRR1114/public_data_files/master/night_angel.csv")
lightbringer <- read_csv("https://raw.githubusercontent.com/KRR1114/public_data_files/master/lightbringer.csv")
weeks_series_i <- read_csv("https://raw.githubusercontent.com/KRR1114/public_data_files/master/weeks_series.csv")
weeks_series_c <- read_csv("https://raw.githubusercontent.com/KRR1114/public_data_files/master/weeks_series2.csv")
We’ll start off this tutorial with mutating joins. Mutating joins will take the original, primary dataframe and alter it by adding columns to it from other dataframes.
inner_join()
The first mutating join is inner_join()
. This is particularly useful if you only want to include complete observations. This function only returns rows from the first dataframe that have a match in the second. For this example, we’ll use the authors
and series
dataframes.
authors
## # A tibble: 15 x 4
## first last book year
## <chr> <chr> <chr> <dbl>
## 1 Trudi Canavan Magician's Guild 2001
## 2 Trudi Canavan Novice 2002
## 3 Trudi Canavan High Lord 2003
## 4 Raymond Feist Magician 1982
## 5 Raymond Feist Silverthorn 1985
## 6 Raymond Feist Darkness at Sethanon 1986
## 7 Nate Silver Signal and the Noise 2012
## 8 Daniel Tammet Thinking in Numbers 2012
## 9 Brent Weeks Way of Shadows 2008
## 10 Brent Weeks Shadow's Edge 2008
## 11 Brent Weeks Beyond the Shadows 2008
## 12 Brent Weeks Black Prism 2010
## 13 Brent Weeks Blinding Knife 2012
## 14 Brent Weeks Broken Eye 2014
## 15 Brent Weeks Blood Mirror 2016
series
## # A tibble: 13 x 2
## book series
## <chr> <chr>
## 1 Magician's Guild Black Magician Trilogy
## 2 Novice Black Magician Trilogy
## 3 High Lord Black Magician Trilogy
## 4 Magician Riftwar Saga
## 5 Silverthorn Riftwar Saga
## 6 Darkness at Sethanon Riftwar Saga
## 7 Way of Shadows Night Angel Series
## 8 Shadow's Edge Night Angel Series
## 9 Beyond the Shadows Night Angel Series
## 10 Black Prism Lightbringer Series
## 11 Blinding Knife Lightbringer Series
## 12 Broken Eye Lightbringer Series
## 13 Blood Mirror Lightbringer Series
Our goal here is to create a dataframe that only contains book
s listed in the author
dataframe that have a corresponding series
in the series
dataframe. We’ll accomplish this by using the inner_join()
function.
All of the *_join()
functions have the same three basic arguments. The x
and y
arguments specify the dataframes that we’re wanting to work with. The by =
argument allows us to specify the column(s) (sometimes referred to as “anchor”- or “key”-columns) that dplyr
should use to determine what information goes with which row.
inner_join(x = authors, y = series, by = "book")
## # A tibble: 13 x 5
## first last book year series
## <chr> <chr> <chr> <dbl> <chr>
## 1 Trudi Canavan Magician's Guild 2001 Black Magician Trilogy
## 2 Trudi Canavan Novice 2002 Black Magician Trilogy
## 3 Trudi Canavan High Lord 2003 Black Magician Trilogy
## 4 Raymond Feist Magician 1982 Riftwar Saga
## 5 Raymond Feist Silverthorn 1985 Riftwar Saga
## 6 Raymond Feist Darkness at Sethanon 1986 Riftwar Saga
## 7 Brent Weeks Way of Shadows 2008 Night Angel Series
## 8 Brent Weeks Shadow's Edge 2008 Night Angel Series
## 9 Brent Weeks Beyond the Shadows 2008 Night Angel Series
## 10 Brent Weeks Black Prism 2010 Lightbringer Series
## 11 Brent Weeks Blinding Knife 2012 Lightbringer Series
## 12 Brent Weeks Broken Eye 2014 Lightbringer Series
## 13 Brent Weeks Blood Mirror 2016 Lightbringer Series
As you can see, inner_join()
merged the dataframes together based on the values provided in the "book"
column - only values that appeared in both dataframes appeared in our merged dataframe. We used the book
column in the by =
argument, because it is the only column that our primary and secondary dataframes have in common. If the dataframes had multiple columns in common, we could specify multiple columns in the by =
argument (as you’ll see later).
left_join()
The second join that we’ll use is the left_join()
. What makes it a “left” join will be more apparent when we look at the code below. For this example, we’ll use the authors
and publishers
dataframes. Notice that we can pass multiple key columns to our by =
argument to ensure that we’re as specific as possible.
authors
## # A tibble: 15 x 4
## first last book year
## <chr> <chr> <chr> <dbl>
## 1 Trudi Canavan Magician's Guild 2001
## 2 Trudi Canavan Novice 2002
## 3 Trudi Canavan High Lord 2003
## 4 Raymond Feist Magician 1982
## 5 Raymond Feist Silverthorn 1985
## 6 Raymond Feist Darkness at Sethanon 1986
## 7 Nate Silver Signal and the Noise 2012
## 8 Daniel Tammet Thinking in Numbers 2012
## 9 Brent Weeks Way of Shadows 2008
## 10 Brent Weeks Shadow's Edge 2008
## 11 Brent Weeks Beyond the Shadows 2008
## 12 Brent Weeks Black Prism 2010
## 13 Brent Weeks Blinding Knife 2012
## 14 Brent Weeks Broken Eye 2014
## 15 Brent Weeks Blood Mirror 2016
publishers
## # A tibble: 15 x 3
## book year publisher
## <chr> <dbl> <chr>
## 1 Magician's Guild 2001 Orbit
## 2 Novice 2002 Orbit
## 3 High Lord 2003 Orbit
## 4 Magician 1982 Doubleday
## 5 Silverthorn 1985 Doubleday
## 6 Darkness at Sethanon 1986 Doubleday
## 7 Signal and the Noise 2012 Penguin
## 8 Thinking in Numbers 2012 Back Bay Books
## 9 Way of Shadows 2008 Orbit
## 10 Shadow's Edge 2008 Orbit
## 11 Beyond the Shadows 2008 Orbit
## 12 Black Prism 2010 Orbit
## 13 Blinding Knife 2012 Orbit
## 14 Broken Eye 2014 Orbit
## 15 Blood Mirror 2016 Orbit
auth_pub_l <- left_join(x = authors, y = publishers,
by = c("book", "year"))
What makes this a “left” join is that we are treating the first dataframe (the one on the left) as the primary dataframe. We’re then appending any relevant information from the second dataframe to the primary one.
auth_pub_l
## # A tibble: 15 x 5
## first last book year publisher
## <chr> <chr> <chr> <dbl> <chr>
## 1 Trudi Canavan Magician's Guild 2001 Orbit
## 2 Trudi Canavan Novice 2002 Orbit
## 3 Trudi Canavan High Lord 2003 Orbit
## 4 Raymond Feist Magician 1982 Doubleday
## 5 Raymond Feist Silverthorn 1985 Doubleday
## 6 Raymond Feist Darkness at Sethanon 1986 Doubleday
## 7 Nate Silver Signal and the Noise 2012 Penguin
## 8 Daniel Tammet Thinking in Numbers 2012 Back Bay Books
## 9 Brent Weeks Way of Shadows 2008 Orbit
## 10 Brent Weeks Shadow's Edge 2008 Orbit
## 11 Brent Weeks Beyond the Shadows 2008 Orbit
## 12 Brent Weeks Black Prism 2010 Orbit
## 13 Brent Weeks Blinding Knife 2012 Orbit
## 14 Brent Weeks Broken Eye 2014 Orbit
## 15 Brent Weeks Blood Mirror 2016 Orbit
right_join()
The third join that we’ll use is the right_join()
. Can you intuit what the right_join()
function will do based off of what left_join()
does?
auth_pub_r <- right_join(x = publishers, y = authors,
by = c("book", "year"))
As you can see in the code, we’ve switched the dataframes in our x
and y
arguments. This should give us the same dataframe, right? Let’s check.
setequal(auth_pub_l, auth_pub_r)
## [1] TRUE
Alright, so it looks like they’re equal. So what’s the point of having both right and left if we can structure our code to use either? Well, this becomes important when we reintroduce our friend the pipe operator, %>%
. The pipe operator is discussed at length in the tidying data tutorial on this site. Essentially, it allows us to string many pieces of code (i.e., “pipes”) together to form a longer, more readable chunk of code (i.e., a “pipeline”) by applying the output of the code before the operator (%>%
) as the first argument in the code following the operator.
For a quick example of this, let’s make a dataframe of authors that have written fantasy books.
fantasy <- authors %>%
# Let's treat authors as our primary dataframe.
# We want to join it with genre.
left_join(genre, by = "book") %>%
# We then want to focus on fantasy books
filter(genre == "Fantasy") %>%
# And then get a dataframe of just author names and book titles.
select(first, last, book)
fantasy
## # A tibble: 13 x 3
## first last book
## <chr> <chr> <chr>
## 1 Trudi Canavan Magician's Guild
## 2 Trudi Canavan Novice
## 3 Trudi Canavan High Lord
## 4 Raymond Feist Magician
## 5 Raymond Feist Silverthorn
## 6 Raymond Feist Darkness at Sethanon
## 7 Brent Weeks Way of Shadows
## 8 Brent Weeks Shadow's Edge
## 9 Brent Weeks Beyond the Shadows
## 10 Brent Weeks Black Prism
## 11 Brent Weeks Blinding Knife
## 12 Brent Weeks Broken Eye
## 13 Brent Weeks Blood Mirror
full_join()
The fourth join is full_join()
. This is the most inclusive of the joins. It returns every row from both dataframes. We’ll use the dataframes that we used in the inner_join()
example:
authors
## # A tibble: 15 x 4
## first last book year
## <chr> <chr> <chr> <dbl>
## 1 Trudi Canavan Magician's Guild 2001
## 2 Trudi Canavan Novice 2002
## 3 Trudi Canavan High Lord 2003
## 4 Raymond Feist Magician 1982
## 5 Raymond Feist Silverthorn 1985
## 6 Raymond Feist Darkness at Sethanon 1986
## 7 Nate Silver Signal and the Noise 2012
## 8 Daniel Tammet Thinking in Numbers 2012
## 9 Brent Weeks Way of Shadows 2008
## 10 Brent Weeks Shadow's Edge 2008
## 11 Brent Weeks Beyond the Shadows 2008
## 12 Brent Weeks Black Prism 2010
## 13 Brent Weeks Blinding Knife 2012
## 14 Brent Weeks Broken Eye 2014
## 15 Brent Weeks Blood Mirror 2016
series
## # A tibble: 13 x 2
## book series
## <chr> <chr>
## 1 Magician's Guild Black Magician Trilogy
## 2 Novice Black Magician Trilogy
## 3 High Lord Black Magician Trilogy
## 4 Magician Riftwar Saga
## 5 Silverthorn Riftwar Saga
## 6 Darkness at Sethanon Riftwar Saga
## 7 Way of Shadows Night Angel Series
## 8 Shadow's Edge Night Angel Series
## 9 Beyond the Shadows Night Angel Series
## 10 Black Prism Lightbringer Series
## 11 Blinding Knife Lightbringer Series
## 12 Broken Eye Lightbringer Series
## 13 Blood Mirror Lightbringer Series
Let’s say that we just want to put all of this data together, whether we have complete observations (e.g., rows) or not. We can do that with full_join()
.
full_join(x = authors, y = series, by = "book")
## # A tibble: 15 x 5
## first last book year series
## <chr> <chr> <chr> <dbl> <chr>
## 1 Trudi Canavan Magician's Guild 2001 Black Magician Trilogy
## 2 Trudi Canavan Novice 2002 Black Magician Trilogy
## 3 Trudi Canavan High Lord 2003 Black Magician Trilogy
## 4 Raymond Feist Magician 1982 Riftwar Saga
## 5 Raymond Feist Silverthorn 1985 Riftwar Saga
## 6 Raymond Feist Darkness at Sethanon 1986 Riftwar Saga
## 7 Nate Silver Signal and the Noise 2012 <NA>
## 8 Daniel Tammet Thinking in Numbers 2012 <NA>
## 9 Brent Weeks Way of Shadows 2008 Night Angel Series
## 10 Brent Weeks Shadow's Edge 2008 Night Angel Series
## 11 Brent Weeks Beyond the Shadows 2008 Night Angel Series
## 12 Brent Weeks Black Prism 2010 Lightbringer Series
## 13 Brent Weeks Blinding Knife 2012 Lightbringer Series
## 14 Brent Weeks Broken Eye 2014 Lightbringer Series
## 15 Brent Weeks Blood Mirror 2016 Lightbringer Series
This time, we kept our rows that didn’t have a series - their values are just missing.
Finally, it is important to note that these joins can be used in conjunction with each other to combine data easily based on what aspects you want to have. For a simple example, let’s just combine all of the dataframes so far into one large dataframe called books
.
books <- authors %>%
left_join(publishers, by = c("book", "year")) %>%
full_join(series, by = "book") %>%
left_join(genre, by = "book")
books
## # A tibble: 15 x 7
## first last book year publisher series genre
## <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 Trudi Canavan Magician's Gui… 2001 Orbit Black Magician … Fantasy
## 2 Trudi Canavan Novice 2002 Orbit Black Magician … Fantasy
## 3 Trudi Canavan High Lord 2003 Orbit Black Magician … Fantasy
## 4 Raymond Feist Magician 1982 Doubleday Riftwar Saga Fantasy
## 5 Raymond Feist Silverthorn 1985 Doubleday Riftwar Saga Fantasy
## 6 Raymond Feist Darkness at Se… 1986 Doubleday Riftwar Saga Fantasy
## 7 Nate Silver Signal and the… 2012 Penguin <NA> Mathemat…
## 8 Daniel Tammet Thinking in Nu… 2012 Back Bay Bo… <NA> Mathemat…
## 9 Brent Weeks Way of Shadows 2008 Orbit Night Angel Ser… Fantasy
## 10 Brent Weeks Shadow's Edge 2008 Orbit Night Angel Ser… Fantasy
## 11 Brent Weeks Beyond the Sha… 2008 Orbit Night Angel Ser… Fantasy
## 12 Brent Weeks Black Prism 2010 Orbit Lightbringer Se… Fantasy
## 13 Brent Weeks Blinding Knife 2012 Orbit Lightbringer Se… Fantasy
## 14 Brent Weeks Broken Eye 2014 Orbit Lightbringer Se… Fantasy
## 15 Brent Weeks Blood Mirror 2016 Orbit Lightbringer Se… Fantasy
Unlike mutating joins, filtering joins filter unwanted observations out of the primary dataframe.
semi_join()
The first filter join that we’ll look at is semi_join()
. semi_join()
takes the key column and filters the primary dataframe for only observations that also appear in the secondary dataframe. For this example, we want to know which authors in our authors
dataframe received awards for book
s that appeared in the authors
dataframe (these are listed in the awards
dataframe).
authors
## # A tibble: 15 x 4
## first last book year
## <chr> <chr> <chr> <dbl>
## 1 Trudi Canavan Magician's Guild 2001
## 2 Trudi Canavan Novice 2002
## 3 Trudi Canavan High Lord 2003
## 4 Raymond Feist Magician 1982
## 5 Raymond Feist Silverthorn 1985
## 6 Raymond Feist Darkness at Sethanon 1986
## 7 Nate Silver Signal and the Noise 2012
## 8 Daniel Tammet Thinking in Numbers 2012
## 9 Brent Weeks Way of Shadows 2008
## 10 Brent Weeks Shadow's Edge 2008
## 11 Brent Weeks Beyond the Shadows 2008
## 12 Brent Weeks Black Prism 2010
## 13 Brent Weeks Blinding Knife 2012
## 14 Brent Weeks Broken Eye 2014
## 15 Brent Weeks Blood Mirror 2016
awards
## # A tibble: 8 x 5
## first last book award year
## <chr> <chr> <chr> <chr> <dbl>
## 1 Trudi Canav… Whispers of the Mist Chil… Aurealis Award for Best Fanta… 1999
## 2 Trudi Canav… Room for Improvement Ditmars Best Short Story 2003
## 3 Trudi Canav… Magician's Apprentice Aurealis Award for Best Fanta… 2009
## 4 Raymond Fiest Servant of the Empire HOMer Awards Best Fantasy Nov… 1991
## 5 Nate Silver Signal and the Noise Phi Beta Kappa Award 2013
## 6 Daniel Tammet Born on a Blue Day Booklist Editor's Choic Adult… 2007
## 7 Daniel Tammet Every Word is a Bird We T… Booklist Editor's Choic Adult… 2017
## 8 Brent Weeks Blinding Knife Legend Award for Best Novel 2013
Without semi_join()
, we’d need to write something like this:
authors %>%
left_join(awards, by = c("first", "last", "book")) %>%
filter(!is.na(award)) %>%
# This filters out all non-award-winning books
select("first", "last", "book", "year" = "year.x")
## # A tibble: 2 x 4
## first last book year
## <chr> <chr> <chr> <dbl>
## 1 Nate Silver Signal and the Noise 2012
## 2 Brent Weeks Blinding Knife 2012
The syntax used in the select()
portion of the code will be discussed in more depth later, but for now, just know that we’re specifying the variables we want and renaming them to the correct names.
Thanks to semi_join()
, we can write the following code instead of the previous code.
authors %>%
semi_join(awards, by = "book")
## # A tibble: 2 x 4
## first last book year
## <chr> <chr> <chr> <dbl>
## 1 Nate Silver Signal and the Noise 2012
## 2 Brent Weeks Blinding Knife 2012
anti_join()
anti_join()
is kind of the opposite of semi_join()
in that it shows you which of the rows in your primary dataframe do not have matches in the secondary dataframe. This can be handy for checking for mistakes, such as typos, in the data.
What if we want to know which of the book
s in our authors
dataframe haven’t won awards? Simple, anti_join()
.
authors %>%
anti_join(awards, by = "book")
## # A tibble: 13 x 4
## first last book year
## <chr> <chr> <chr> <dbl>
## 1 Trudi Canavan Magician's Guild 2001
## 2 Trudi Canavan Novice 2002
## 3 Trudi Canavan High Lord 2003
## 4 Raymond Feist Magician 1982
## 5 Raymond Feist Silverthorn 1985
## 6 Raymond Feist Darkness at Sethanon 1986
## 7 Daniel Tammet Thinking in Numbers 2012
## 8 Brent Weeks Way of Shadows 2008
## 9 Brent Weeks Shadow's Edge 2008
## 10 Brent Weeks Beyond the Shadows 2008
## 11 Brent Weeks Black Prism 2010
## 12 Brent Weeks Broken Eye 2014
## 13 Brent Weeks Blood Mirror 2016
Knowing that all of the people in the authors
dataframe have at least one award
in the awards
dataframe (not necessarily for the books listed in the authors
dataframe), we can use anti_join()
to make sure that all of our authors’ names are spelled correctly.
authors %>%
anti_join(awards, by = c("first", "last"))
## # A tibble: 3 x 4
## first last book year
## <chr> <chr> <chr> <dbl>
## 1 Raymond Feist Magician 1982
## 2 Raymond Feist Silverthorn 1985
## 3 Raymond Feist Darkness at Sethanon 1986
Because Raymond Feist was returned, we know that his name (at least spelled correctly) doesn’t appear in the awards
dataframe. If we go back up and look at it, we can see that it was spelled “Fiest” instead of “Feist.”
When your dataframes contain the same rows or columns you may want to simply bind them together instead of messing with joining functions. This can be done with bind_rows()
and bind_cols()
.
Note: These perform similarly to rbind()
and cbind()
, respectively; however, bind_rows()
and bind_cols()
are more efficient for binding dataframes.
bind_rows()
When you have two dataframes with the same variables (columns), and you’re just wanting to append one dataframe onto the other (that is, stack the data in one dataframe on top of the other), you should use bind_rows()
.
For this, we’ll use the auth_pub_l
dataframe that we created earlier with the left_join()
function. Mr. Brent Weeks has a new book that has yet to be released (as of the writing of this tutorial). We’ll use bind_rows()
to append information about this book that is contained in the new_weeks
dataframe.
complete_auth <- auth_pub_l %>%
bind_rows(new_weeks)
complete_auth
## # A tibble: 16 x 5
## first last book year publisher
## <chr> <chr> <chr> <dbl> <chr>
## 1 Trudi Canavan Magician's Guild 2001 Orbit
## 2 Trudi Canavan Novice 2002 Orbit
## 3 Trudi Canavan High Lord 2003 Orbit
## 4 Raymond Feist Magician 1982 Doubleday
## 5 Raymond Feist Silverthorn 1985 Doubleday
## 6 Raymond Feist Darkness at Sethanon 1986 Doubleday
## 7 Nate Silver Signal and the Noise 2012 Penguin
## 8 Daniel Tammet Thinking in Numbers 2012 Back Bay Books
## 9 Brent Weeks Way of Shadows 2008 Orbit
## 10 Brent Weeks Shadow's Edge 2008 Orbit
## 11 Brent Weeks Beyond the Shadows 2008 Orbit
## 12 Brent Weeks Black Prism 2010 Orbit
## 13 Brent Weeks Blinding Knife 2012 Orbit
## 14 Brent Weeks Broken Eye 2014 Orbit
## 15 Brent Weeks Blood Mirror 2016 Orbit
## 16 Brent Weeks Burning White 2019 Orbit
So we can use bind_rows()
to add one row to our dataframe, but can we use it to bind 2 complete dataframes? We sure can! We’ll bind two dataframes (each having data on one of Brent’s series) with more than one row.
brent_weeks <- night_angel %>%
bind_rows(lightbringer)
brent_weeks
## # A tibble: 8 x 4
## name year genre publisher
## <chr> <dbl> <chr> <chr>
## 1 Way of Shadows 2008 Fantasy Orbit
## 2 Shadow's Edge 2008 Fantasy Orbit
## 3 Beyond the Shadows 2008 Fantasy Orbit
## 4 Black Prism 2010 Fantasy Orbit
## 5 Blinding Knife 2012 Fantasy Orbit
## 6 Broken Eye 2014 Fantasy Orbit
## 7 Blood Mirror 2016 Fantasy Orbit
## 8 Burning White 2019 Fantasy Orbit
It looks like our new brent_weeks
dataframe doesn’t have the name of the series. We’ll remedy that in the next section.
bind_cols()
When you want to append a column from one dataframe onto another dataframe, you can use bind_cols()
.
CAUTION! This function will not discriminate against joining incorrect rows like the *_join()
functions do. This should only be used when the order of the rows matches in both dataframes. We’ll demonstrate ths in the first chunk of code.
brent_weeks %>%
bind_cols(weeks_series_i)
## # A tibble: 8 x 5
## name year genre publisher series
## <chr> <dbl> <chr> <chr> <chr>
## 1 Way of Shadows 2008 Fantasy Orbit Lightbringer
## 2 Shadow's Edge 2008 Fantasy Orbit Lightbringer
## 3 Beyond the Shadows 2008 Fantasy Orbit Lightbringer
## 4 Black Prism 2010 Fantasy Orbit Lightbringer
## 5 Blinding Knife 2012 Fantasy Orbit Lightbringer
## 6 Broken Eye 2014 Fantasy Orbit Night Angel
## 7 Blood Mirror 2016 Fantasy Orbit Night Angel
## 8 Burning White 2019 Fantasy Orbit Night Angel
As we can see, bind_cols()
has no way to determine that the information in weeks_series_i
is in the incorrect order, which is why you should be very cautious when using the bind_*()
functions.
In the next code chunk, we’ll bind the correctly ordered column to brent_weeks
. We can tell it’s correct, because the Night Angel trilogy was published, in its entirety, in 2008.
brent_weeks %>%
bind_cols(weeks_series_c)
## # A tibble: 8 x 5
## name year genre publisher series
## <chr> <dbl> <chr> <chr> <chr>
## 1 Way of Shadows 2008 Fantasy Orbit Night Angel
## 2 Shadow's Edge 2008 Fantasy Orbit Night Angel
## 3 Beyond the Shadows 2008 Fantasy Orbit Night Angel
## 4 Black Prism 2010 Fantasy Orbit Lightbringer
## 5 Blinding Knife 2012 Fantasy Orbit Lightbringer
## 6 Broken Eye 2014 Fantasy Orbit Lightbringer
## 7 Blood Mirror 2016 Fantasy Orbit Lightbringer
## 8 Burning White 2019 Fantasy Orbit Lightbringer
What if our dataframes use the same name for different variables? For example:
colnames(authors)
## [1] "first" "last" "book" "year"
colnames(awards)
## [1] "first" "last" "book" "award" "year"
We can see that the first
, last
, book
, and year
variables appear in both dataframes. The first
, last
, and book
variables represent the same information, but the year
variables represent two different years - the year
of publication and the year
the award was won. What happens if we join them?
authors %>%
left_join(awards, by = c("first", "last", "book", "year"))
## # A tibble: 15 x 5
## first last book year award
## <chr> <chr> <chr> <dbl> <chr>
## 1 Trudi Canavan Magician's Guild 2001 <NA>
## 2 Trudi Canavan Novice 2002 <NA>
## 3 Trudi Canavan High Lord 2003 <NA>
## 4 Raymond Feist Magician 1982 <NA>
## 5 Raymond Feist Silverthorn 1985 <NA>
## 6 Raymond Feist Darkness at Sethanon 1986 <NA>
## 7 Nate Silver Signal and the Noise 2012 <NA>
## 8 Daniel Tammet Thinking in Numbers 2012 <NA>
## 9 Brent Weeks Way of Shadows 2008 <NA>
## 10 Brent Weeks Shadow's Edge 2008 <NA>
## 11 Brent Weeks Beyond the Shadows 2008 <NA>
## 12 Brent Weeks Black Prism 2010 <NA>
## 13 Brent Weeks Blinding Knife 2012 <NA>
## 14 Brent Weeks Broken Eye 2014 <NA>
## 15 Brent Weeks Blood Mirror 2016 <NA>
If we don’t realize that the year
s represent different things, and we try to use the year
column as a key, we can see that we don’t successfully join any information from our secondary dataframe because this combination of keys doesn’t completely match any rows in the secondary dataframe. This is a good tip that something might be wrong with our code or our data.
If we realized that the year
variables represent different things and don’t use it as a key, we can see what happens if we have multiple columns with the same name.
authors %>%
left_join(awards, by = c("first", "last", "book"))
## # A tibble: 15 x 6
## first last book year.x award year.y
## <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 Trudi Canavan Magician's Guild 2001 <NA> NA
## 2 Trudi Canavan Novice 2002 <NA> NA
## 3 Trudi Canavan High Lord 2003 <NA> NA
## 4 Raymond Feist Magician 1982 <NA> NA
## 5 Raymond Feist Silverthorn 1985 <NA> NA
## 6 Raymond Feist Darkness at Sethanon 1986 <NA> NA
## 7 Nate Silver Signal and the Noise 2012 Phi Beta Kappa Award 2013
## 8 Daniel Tammet Thinking in Numbers 2012 <NA> NA
## 9 Brent Weeks Way of Shadows 2008 <NA> NA
## 10 Brent Weeks Shadow's Edge 2008 <NA> NA
## 11 Brent Weeks Beyond the Shadows 2008 <NA> NA
## 12 Brent Weeks Black Prism 2010 <NA> NA
## 13 Brent Weeks Blinding Knife 2012 Legend Award for Best Nov… 2013
## 14 Brent Weeks Broken Eye 2014 <NA> NA
## 15 Brent Weeks Blood Mirror 2016 <NA> NA
Thankfully, dplyr
doesn’t try to join those two columns. Instead, it renames them (intuitively, year.x
is from our primary dataframe, and year.y
is from our secondary), and that is easy enough for us to fix - as shown in the code below.
authors %>%
left_join(awards, by = c("first", "last", "book")) %>%
rename(pub_year = year.x, award_year = year.y)
## # A tibble: 15 x 6
## first last book pub_year award award_year
## <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 Trudi Canavan Magician's Guild 2001 <NA> NA
## 2 Trudi Canavan Novice 2002 <NA> NA
## 3 Trudi Canavan High Lord 2003 <NA> NA
## 4 Raymond Feist Magician 1982 <NA> NA
## 5 Raymond Feist Silverthorn 1985 <NA> NA
## 6 Raymond Feist Darkness at Seth… 1986 <NA> NA
## 7 Nate Silver Signal and the N… 2012 Phi Beta Kappa Award 2013
## 8 Daniel Tammet Thinking in Numb… 2012 <NA> NA
## 9 Brent Weeks Way of Shadows 2008 <NA> NA
## 10 Brent Weeks Shadow's Edge 2008 <NA> NA
## 11 Brent Weeks Beyond the Shado… 2008 <NA> NA
## 12 Brent Weeks Black Prism 2010 <NA> NA
## 13 Brent Weeks Blinding Knife 2012 Legend Award for Best … 2013
## 14 Brent Weeks Broken Eye 2014 <NA> NA
## 15 Brent Weeks Blood Mirror 2016 <NA> NA
So now we know how to deal with having the same name applied to different variables, but what do we do if our KEY has different names in each dataframe? Such as,
complete_auth
## # A tibble: 16 x 5
## first last book year publisher
## <chr> <chr> <chr> <dbl> <chr>
## 1 Trudi Canavan Magician's Guild 2001 Orbit
## 2 Trudi Canavan Novice 2002 Orbit
## 3 Trudi Canavan High Lord 2003 Orbit
## 4 Raymond Feist Magician 1982 Doubleday
## 5 Raymond Feist Silverthorn 1985 Doubleday
## 6 Raymond Feist Darkness at Sethanon 1986 Doubleday
## 7 Nate Silver Signal and the Noise 2012 Penguin
## 8 Daniel Tammet Thinking in Numbers 2012 Back Bay Books
## 9 Brent Weeks Way of Shadows 2008 Orbit
## 10 Brent Weeks Shadow's Edge 2008 Orbit
## 11 Brent Weeks Beyond the Shadows 2008 Orbit
## 12 Brent Weeks Black Prism 2010 Orbit
## 13 Brent Weeks Blinding Knife 2012 Orbit
## 14 Brent Weeks Broken Eye 2014 Orbit
## 15 Brent Weeks Blood Mirror 2016 Orbit
## 16 Brent Weeks Burning White 2019 Orbit
brent_weeks
## # A tibble: 8 x 4
## name year genre publisher
## <chr> <dbl> <chr> <chr>
## 1 Way of Shadows 2008 Fantasy Orbit
## 2 Shadow's Edge 2008 Fantasy Orbit
## 3 Beyond the Shadows 2008 Fantasy Orbit
## 4 Black Prism 2010 Fantasy Orbit
## 5 Blinding Knife 2012 Fantasy Orbit
## 6 Broken Eye 2014 Fantasy Orbit
## 7 Blood Mirror 2016 Fantasy Orbit
## 8 Burning White 2019 Fantasy Orbit
What is our key here? Is it book
like in the complete_auth
dataframe, or is it name
like in the brent_weeks
dataframe? We can get around this fairly easily by changing how we specify our by =
argument.
complete_auth %>%
semi_join(brent_weeks, by = c("book" = "name"))
## # A tibble: 8 x 5
## first last book year publisher
## <chr> <chr> <chr> <dbl> <chr>
## 1 Brent Weeks Way of Shadows 2008 Orbit
## 2 Brent Weeks Shadow's Edge 2008 Orbit
## 3 Brent Weeks Beyond the Shadows 2008 Orbit
## 4 Brent Weeks Black Prism 2010 Orbit
## 5 Brent Weeks Blinding Knife 2012 Orbit
## 6 Brent Weeks Broken Eye 2014 Orbit
## 7 Brent Weeks Blood Mirror 2016 Orbit
## 8 Brent Weeks Burning White 2019 Orbit
Notice that the first value in by =
refers to the primary dataframe’s naming convention.