By the end of today’s class, students should be able to:
- Reorder levels within factors according to various principles.
- Create date and datetime objects and extract components from them, using the lubridate package.
Resources
Video lecture:
Other resources, in addition to the notes below:
- For factors:
- The stat545.com chapter on Factors
- The forcats package page and the reference guide on that page.
- For dates and times:
Factors
“There is no other object that creates as much trouble as factors.” - Patrick Burns, “The R Inferno”.
In R, we use factors to represent categorical variables: variables that take on a fixed number of known values (i.e. levels). For example, in the penguins data set, species is a factor with three levels: “Adelie”, “Chinstrap”, and “Gentoo”.
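library(tidyverse)       # provides glimpse(), %>% and forcats
library(palmerpenguins)  # provides the penguins data set
glimpse(penguins)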
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Under the hood, R stores a factor with (say) 3 levels as an integer vector with values between 1 and 3, paired with a character vector of length 3 that maps the integers 1, 2, and 3 to the levels.
This is not immediately obvious, because R will print factors using the character string levels rather than the numbers that it stores:
penguins %>% slice_sample(n=10) %>% pull(species)
## [1] Gentoo Adelie Adelie Gentoo Gentoo Adelie Adelie
## [8] Chinstrap Adelie Adelie
## Levels: Adelie Chinstrap Gentoo
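Both pieces of a factor can be inspected directly. The following is a minimal sketch using a small made-up factor x (not the penguins data), so that it runs with base R alone:

```r
# A small made-up factor, for illustration only
x <- factor(c("Gentoo", "Adelie", "Adelie", "Chinstrap"))

lev   <- levels(x)      # the character vector of levels (alphabetical by default)
codes <- as.integer(x)  # the integer codes stored under the hood

lev         # "Adelie" "Chinstrap" "Gentoo"
codes       # 3 1 1 2
typeof(x)   # "integer"

# Looking up each code in the levels vector recovers the printed labels
lev[codes]  # "Gentoo" "Adelie" "Adelie" "Chinstrap"
```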
This dual nature of factors creates a whole slew of hidden traps and headaches, especially for new useRs!
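One well-known example of such a trap, sketched here with a made-up factor f: calling as.numeric() on a factor of number-like strings returns the stored integer codes, not the numbers you see printed.

```r
# Made-up example: numbers that were accidentally read in as a factor
f <- factor(c("10", "5", "20"))

levels(f)                             # "10" "20" "5" (sorted as text, not as numbers)

wrong <- as.numeric(f)                # 1 3 2: the integer codes
right <- as.numeric(as.character(f))  # 10 5 20: convert to character first
```

Converting through as.character() first is one common way around this.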
Nevertheless, factors are important and worth the pain. Many functions throughout the R landscape expect categorical variables to be coded as factors. For example, when making plots in either ggplot2 or base R, we need factors in order to map categorical variables to aesthetic elements like colour.
To make our lives easier, we will work with factors through the forcats package, which is loaded as part of the tidyverse.
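As a rough preview of what reordering levels "according to various principles" can look like, here is a minimal sketch with forcats. The factor sizes and the vector values are made up for illustration; the worksheet may use different data and functions.

```r
library(forcats)

# Made-up data, for illustration only
sizes  <- factor(c("small", "large", "medium", "small", "small", "large"))
values <- c(1, 10, 5, 2, 3, 12)

by_default <- levels(sizes)               # alphabetical order
by_freq    <- levels(fct_infreq(sizes))   # most frequent level first
by_hand    <- levels(fct_relevel(sizes, "small", "medium", "large"))  # manual order
by_value   <- levels(fct_reorder(sizes, values))  # by the median of `values` per level
reversed   <- levels(fct_rev(sizes))      # reverse the current order

by_default  # "large" "medium" "small"
by_freq     # "small" "large" "medium"
by_value    # "small" "medium" "large"
```

None of these functions changes the data values themselves; only the order of the levels changes, which in turn affects things like the order of bars or legend entries in a plot.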
Your turn: learning to use factors
We think the best way to learn the basics of factors is to work through Worksheet A5 (factors portion).
In-class schedule
First part
- Haven’t attempted all of the factors portion of Worksheet A5? Then spend this time attempting unattempted questions.
- Finished attempting all of the questions? Then do the optional R4DS Factors reading, and maybe even do some of the exercises for extra practice.
During this time, the teaching team will also walk around to answer questions and chat about anything factors-related.
Second part
Now’s your chance to ask about any questions you got stuck on and get them answered by the instructor!
Third part: Dates and/or Times
Often you will need to work with dates and times in your data. For example, we could have had a variable in the FEV data set that contains the date of each patient visit.
Dates and times seem simple, but they are actually one of the most complicated things you will encounter in data analysis.
Test Your Understanding
- The output of the following code is equivalent to a factor with the letters “a” to “f”.
(abc <- factor(letters[1:3]))
#> [1] a b c
#> Levels: a b c
(def <- factor(letters[4:6]))
#> [1] d e f
#> Levels: d e f
c(abc, def)
- The output of the following code is a date object.
library(lubridate)
date <- ymd("2020-10-13")
dt <- ymd_hms("2020-10-13 09:30:00")
c(date, dt)
Your turn: Making a date variable
Write down today’s date. (Don’t peek at your neighbours!)
Copy what you wrote down into the Google Sheets link on Canvas.
We’ll look at the results together. Expect pain …
Here’s another example of date/time complications. Ask yourself:
- Are there 365 days in every year?
- Are there 30 days in every month?
- Are there 24 hours in every day?
- Are there 60 seconds in every minute?
The answer to all of these questions is NO. What a headache this can be when trying to compute how much time has elapsed between two date/times!
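A few of these "NO" answers can be checked with lubridate. This is only a sketch: the dates and the time zone "America/Vancouver" are chosen for illustration, and the last part relies on daylight saving time having started there on 14 March 2021.

```r
library(lubridate)

# Not every year has 365 days
leap_2020 <- leap_year(2020)  # TRUE
leap_2021 <- leap_year(2021)  # FALSE

# Not every month has 30 days
feb_days <- days_in_month(ymd("2020-02-01"))  # 29

# Not every day has 24 hours: clocks sprang forward on this day
start <- ymd_hms("2021-03-14 00:00:00", tz = "America/Vancouver")
end   <- ymd_hms("2021-03-15 00:00:00", tz = "America/Vancouver")
hours_in_day <- as.numeric(difftime(end, start, units = "hours"))  # 23
```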
The lubridate package can help us with a lot of the headaches that dates and times cause. It can create date and time objects from different inputs, extract important pieces of information like year/month/day, do hard math with dates and times, and help you navigate time zones.
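Each of these capabilities can be sketched in a few lines. The specific date and the time zone below are chosen only for illustration.

```r
library(lubridate)

# Create the same date from differently formatted inputs
d1 <- ymd("2020-10-13")
d2 <- mdy("10/13/2020")
d3 <- dmy("13-10-2020")
d1 == d2 && d2 == d3   # TRUE

# Create a datetime and extract its components
dt <- ymd_hms("2020-10-13 09:30:00", tz = "America/Vancouver")
year(dt)    # 2020
month(dt)   # 10
day(dt)     # 13
hour(dt)    # 9
wday(d1)    # 3, i.e. Tuesday when weeks start on Sunday (the default)

# Date arithmetic
later <- d1 + days(30)       # "2020-11-12"

# Time zones: the same instant shown on a different clock
utc <- with_tz(dt, "UTC")    # 16:30 UTC
```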
NYC Flights Case Study
In this NYC Flights case study, we’ll show how to use the lubridate package from the tidyverse to work with date variables in a data set.
For the sake of time, we’ll just go over the solutions together, instead of having you attempt exercises on your own first. We think this will be sufficient to get the hang of the basics of lubridate. That being said, want extra practice? Then check out the R4DS Dates and Times chapter!
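To give a flavour of this kind of task, here is a minimal sketch, assuming the case study uses the flights table from the nycflights13 package, which stores departure times in separate year, month, day, hour, and minute columns. The tibble toy_flights below is made up to mimic those columns and is not real flight data; the actual case study may proceed differently.

```r
library(dplyr)
library(lubridate)

# Made-up rows that mimic the year/month/day/hour/minute columns of
# nycflights13::flights; these are not real flight records.
toy_flights <- tibble(
  year   = c(2013, 2013, 2013),
  month  = c(1, 6, 12),
  day    = c(1, 15, 31),
  hour   = c(5, 13, 23),
  minute = c(15, 0, 59)
)

toy_flights <- toy_flights %>%
  mutate(
    dep_date = make_date(year, month, day),                  # a date
    dep_dt   = make_datetime(year, month, day, hour, minute), # a datetime (UTC by default)
    dep_wday = wday(dep_dt, label = TRUE)                     # day of the week as a factor
  )

toy_flights
```

Once the components are combined into a single date or datetime column, the extraction functions shown earlier (year(), month(), wday(), and so on) can be used inside mutate(), filter(), or group_by() like any other function.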