Tibble Joins

From today’s class, students are anticipated to be able to:

Recognize how to manipulate data through a variety of tibble joins such as:
- Mutating joins: left_join(), right_join(), full_join(), anti_join()
- Filtering joins: semi_join(), anti_join()
Perform binding: bind_rows(), bind_cols()
Join more than 2 tibbles
Join based on multiple conditions
Perform set operations on data: intersect(), union(), setdiff()
Join tibbles with different types of variables

Resources

Video lecture:

Tibble Joins with dplyr

Other resources, in addition to the notes below:

A comprehensive overview can be found in the “Relational Data” chapter in “R for Data Science”.
Concise resources:
- “Two-table verbs” vignette gives a concise overview of tibble joins with dplyr.
- Jenny’s Join Cheatsheet for a quick reference to joins.
Revisit the dplyr cheatsheet for all these concepts packed onto a sheet of paper.

Overview of join functions

Note: In order to merge two tibbles, you need to have an identifier variable that has unique values for every row of observations in both tibbles.

Create two sample tibbles:

# First tibble
df1 <- tibble(ID = 1:3,                     
              Name = c("Sophie", "Josh","Alex"))

# Second tibble
df2 <- tibble(ID = 2:4,                      
              Age = c(20,50,31))

Mutating joins

Join matching rows from df2 to df1

left_join(df1, df2, by = "ID")

## # A tibble: 3 × 3
##      ID Name     Age
##   <int> <chr>  <dbl>
## 1     1 Sophie    NA
## 2     2 Josh      20
## 3     3 Alex      50

Join matching rows from df1 to df2

right_join(df1, df2, by = "ID")

## # A tibble: 3 × 3
##      ID Name    Age
##   <int> <chr> <dbl>
## 1     2 Josh     20
## 2     3 Alex     50
## 3     4 <NA>     31

Retain only rows present in both sets

inner_join(df1, df2, by = "ID")

## # A tibble: 2 × 3
##      ID Name    Age
##   <int> <chr> <dbl>
## 1     2 Josh     20
## 2     3 Alex     50

Retain all values, all rows

full_join(df1, df2, by = "ID")

## # A tibble: 4 × 3
##      ID Name     Age
##   <int> <chr>  <dbl>
## 1     1 Sophie    NA
## 2     2 Josh      20
## 3     3 Alex      50
## 4     4 <NA>      31

Filtering joins

Retain all rows in df1 that have a match in df2

semi_join(df1, df2, by = "ID")

## # A tibble: 2 × 2
##      ID Name 
##   <int> <chr>
## 1     2 Josh 
## 2     3 Alex

Retain all rows in df1 that do not have a match in df2

anti_join(df1, df2, by = "ID")

## # A tibble: 1 × 2
##      ID Name  
##   <int> <chr> 
## 1     1 Sophie

Binding

Append df2 to df1 as new rows

bind_rows(df1, df2)

## # A tibble: 6 × 3
##      ID Name     Age
##   <int> <chr>  <dbl>
## 1     1 Sophie    NA
## 2     2 Josh      NA
## 3     3 Alex      NA
## 4     2 <NA>      20
## 5     3 <NA>      50
## 6     4 <NA>      31

Append df2 to df1 as new columns

bind_cols(df1, df2)

## New names:
## • `ID` -> `ID...1`
## • `ID` -> `ID...3`

## # A tibble: 3 × 4
##   ID...1 Name   ID...3   Age
##    <int> <chr>   <int> <dbl>
## 1      1 Sophie      2    20
## 2      2 Josh        3    50
## 3      3 Alex        4    31

Joining multiple (>2) tibbles

Create a third tibble

df3 <- tibble(ID = 1:5,                      
              Height = c(175,167,190,155,160))

Use piping operator (%>%) to layer multiple join functions

full_join(df1, df2, by = "ID") %>%
  full_join(df3, by = "ID")

## # A tibble: 5 × 4
##      ID Name     Age Height
##   <int> <chr>  <dbl>  <dbl>
## 1     1 Sophie    NA    175
## 2     2 Josh      20    167
## 3     3 Alex      50    190
## 4     4 <NA>      31    155
## 5     5 <NA>      NA    160

Joining tibbles on multiple conditions

Create two new tibbles df4 and df5

df4 <- tibble(FirstName = c("Sophie", "Josh","Alex"),
              LastName=c("Wang","Smith","Smith"),
              Age = c(42,20,50))

df5 <- tibble(First_name = c("Josh","Alex","Sophie"),        
              Last_name=c("Smith","Smith","Jones"),
              Height = c(167,190,155))

full_join(df4, df5, by = c("FirstName" = "First_name", "LastName" = "Last_name"))

## # A tibble: 4 × 4
##   FirstName LastName   Age Height
##   <chr>     <chr>    <dbl>  <dbl>
## 1 Sophie    Wang        42     NA
## 2 Josh      Smith       20    167
## 3 Alex      Smith       50    190
## 4 Sophie    Jones       NA    155

Set operations

Create sample tibbles

# First tibble
df6 <- tibble(Number = 1:3,                     
              Letter = c("A", "B","C"))

# Second tibble
df7 <- tibble(Number = 2:4,                      
              Letter = c("B","C","D"))

Include rows that appear in both tibbles

intersect(df6, df7)

## # A tibble: 2 × 2
##   Number Letter
##    <int> <chr> 
## 1      2 B     
## 2      3 C

Include rows that appear in either or both tibbles

union(df6, df7)

## # A tibble: 4 × 2
##   Number Letter
##    <int> <chr> 
## 1      1 A     
## 2      2 B     
## 3      3 C     
## 4      4 D

Include rows that appear in one df but not another

Include rows that appear in df6 but not in df7

setdiff(df6, df7)

## # A tibble: 1 × 2
##   Number Letter
##    <int> <chr> 
## 1      1 A

Include rows that appear in df7 but not in df6

setdiff(df7, df6)

## # A tibble: 1 × 2
##   Number Letter
##    <int> <chr> 
## 1      4 D

Joining tibbles with different types of variables

You can also join tibbles with sets of predictions:

set.seed(1)
x <- rnorm(5)
model1 <- tibble(x = x, yhat = 2.1 + 3.2 * x)
model2 <- tibble(x = x, yhat = 1.5 + 2.9 * x)
left_join(model1, model2, by = "x")

## # A tibble: 5 × 3
##        x  yhat.x yhat.y
##    <dbl>   <dbl>  <dbl>
## 1 -0.626  0.0953 -0.317
## 2  0.184  2.69    2.03 
## 3 -0.836 -0.574  -0.923
## 4  1.60   7.20    6.13 
## 5  0.330  3.15    2.46

Your turn: learning tibble joins

We think the best way to learn the basics of tibble joins from here is to work through the corresponding part of Worksheet A5.

There will be some class time to go over solutions if you got stuck on any questions.

Attributions

Written by Albina Gibadullina, reviewed by Vincenzo Coia.