From this topic, students are anticipated to be able to:
- Manipulate a character vector in R using the stringr package.
- Write simple regular expressions (regex).
- Apply stringr and regular expressions to manipulate data in tibbles.
Resources
Video lecture:
- Regular Expressions and stringr for Text Data (only labelled as “age restricted” because it looks at real emails within the Enron company.)
Written material:
- Overview tutorials similar to our worksheet:
- The stat545.com Chapter 11 on character vectors has an elaborate discussion on useful resources for learning more about strings.
Strings
You’ve used a bunch of strings at this point without knowing explicitly what they are: any time you surround text by "
, you’ve been making a string: a storage format for text. In R, they are of type “character”.
sample_string <- "This is a string"
typeof(sample_string)
## [1] "character"
Two places where you’ll often want to manipulate these in data analysis:
- Cleaning up column/variable names
- Cleaning up character column values
Good to know: Constructing strings out of characters and numbers is intuitive, but there’s a gotcha involving particular symbols with special meaning in R. For example, try running quote <- """
in R; it won’t work, because the "
symbol is interpreted as you trying to make a string! To literally include a quote in a string, you can use the \
character to “escape” it:
single_quote <- "\""
cat(single_quote)
## "
You can see more examples of special characters and how to escape them in R4DS Chapter 15.2.
Working with strings
Our main tools for working with strings will be the powerful stringr
package in the tidyverse paired with regular expressions. We think the best way to start learning these is through the guided tutorial in Worksheet B2.
Test Your Understanding
Use these questions to check your understanding of the material.
- True or False: The regular expression
[ab][ab]
will match “ab”, “aa”, and “bb” as possibilities, whereas[ab]{2}
will only match “aa” or “bb”. - True or False: The regular expression
[ab][ab]
will match “ab”, “aa”, and “bb” as possibilities, whereas(ab)(ab)
will only match “aa” or “bb”. - True or False: The regular expression
^ab
will match “ab” as the first characters to a string, whereas[^ab]
will match “a” or “b” as being the first character to a string.
Agenda
Class 1
- Before class, start working on parts I and II of Worksheet B2.
- Class will be dedicated to getting your questions answered.
- Done early? Then do the optional R4DS Strings and R4DS Regular expressions readings (linked above), and do exercises for extra practice.
Class 2
- Before class, start working on parts II and III of Worksheet B2.
- Class will be dedicated to getting your questions answered.
- Done early? Then do the optional R4DS Strings and R4DS Regular expressions readings (linked above), and do exercises for extra practice. Or, start Assignment B4.