Notes

NYC Flights Case Study: Dates/Times, With Solutions

NYC Flights Data: The NYC Flights data set contains (among many other things) on-time performance data for all flights departing a New York City airport in 2013. Let’s load it from the nycflights13 package. Let’s also load the tidyverse; the key package we will use from it today is lubridate. There is a lot to explore in this data set, and many variables! We’ll work with a heavily pared-down version.
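As a rough sketch of the loading step described above: the packages are the ones named in the excerpt, but the columns kept in `flights_small` and the derived `dep_date` column are assumptions made for illustration, not the pared-down version used in the notes.

```r
library(nycflights13)
library(tidyverse)
library(lubridate)

# Hypothetical pared-down version: this column selection is an assumption
flights_small <- flights %>%
  select(year, month, day, dep_time, carrier, flight) %>%
  # Build a proper Date from the separate year/month/day columns
  mutate(dep_date = make_date(year, month, day))

flights_small
```

Combining the separate date parts into a single Date column early on makes later filtering and grouping by date simpler.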

FEV Case Study: Data Manipulation

Introduction: We will perform exploratory data analysis of the fev data set. Let’s start by getting the data set. (Reminder: you will need to run install.packages("rigr") to install the rigr package, which contains the data set, before loading it.) Let’s also load dplyr while we’re at it; we’ll need it to do the exercises!
library(rigr)
library(dplyr)
fev_tbl <- as_tibble(fev) %>% mutate(across(sex:smoke, ~ as.factor(.x)))
The fev data set: It is now widely believed that smoking tends to impair lung function.

FEV Case Study: Data Manipulation, With Solutions

Introduction: We will perform exploratory data analysis of the fev data set. Let’s start by getting the data set. (Reminder: you will need to run install.packages("rigr") to install the rigr package, which contains the data set, before loading it.) Let’s also load dplyr while we’re at it; we’ll need it to do the exercises!
library(rigr)
library(dplyr)
fev_tbl <- as_tibble(fev) %>% mutate(across(sex:smoke, ~ as.factor(.x)))
The fev data set: It is now widely believed that smoking tends to impair lung function.

FEV Case Study: Graphing

Review: We’ll continue exploring the FEV data set from last week. Let’s start by loading the data and required packages.
library(rigr)
library(dplyr)
library(ggplot2)
fev_tbl <- as_tibble(fev) %>% mutate(across(sex:smoke, ~ as.factor(.x)))
Last week, we found that the mean FEV in the smoking group was substantially higher than the mean FEV in the non-smoking group. That is, the smokers appear to have better lung function than the non-smokers. We also had a theory as to why: the association between FEV and smoking status may be confounded, e.g.

FEV Case Study: Graphing, With Solutions

Review: We’ll continue exploring the FEV data set from last week. Let’s start by loading the data and required packages.
library(rigr)
library(dplyr)
library(ggplot2)
fev_tbl <- as_tibble(fev) %>% mutate(across(sex:smoke, ~ as.factor(.x)))
Last week, we found that the mean FEV in the smoking group was substantially higher than the mean FEV in the non-smoking group. That is, the smokers appear to have better lung function than the non-smokers. We also had a theory as to why: the association between FEV and smoking status may be confounded, e.g.

FEV Case Study: Modelling, With Solutions

Review: We’ll continue exploring the FEV data set. Let’s start by loading the data and required packages.
library(rigr)
library(tidyverse)
library(broom)
fev_tbl <- as_tibble(fev) %>% mutate(across(sex:smoke, ~ as.factor(.x)))
Previously, we found that the mean FEV in the smoking group was substantially higher than the mean FEV in the non-smoking group; this speaks to the unadjusted association between smoking and lung function. But we also found that smokers and non-smokers of the same height have similar FEV; that is, there doesn’t seem to be much of an association between smoking and lung function when adjusted for height.
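One way to put the unadjusted and height-adjusted comparisons side by side is with two linear models. This is only a sketch: the use of `lm()` and `tidy()` is an assumption about how the modelling might proceed, and it assumes the `fev` data frame has columns named `fev`, `smoke`, and `height`.

```r
library(rigr)
library(tidyverse)
library(broom)

fev_tbl <- as_tibble(fev) %>%
  mutate(across(sex:smoke, ~ as.factor(.x)))

# Unadjusted association: FEV compared across smoking groups only
unadjusted <- lm(fev ~ smoke, data = fev_tbl)

# Adjusted association: FEV compared across smoking groups at the same height
# (column names fev, smoke, height are assumed here)
adjusted <- lm(fev ~ smoke + height, data = fev_tbl)

# tidy() returns one row per model term as a tibble
tidy(unadjusted)
tidy(adjusted)
```

Comparing the smoking coefficient between the two tables shows how much of the unadjusted difference remains once height is held fixed.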

Automation

Note: This is an optional topic. From today’s class, students are anticipated to be able to: Use make to record which files are inputs vs. intermediates vs. outputs, to capture how scripts and commands convert inputs to outputs, and to re-run parts of an analysis that are out-of-date. Write a Makefile. Interact with make in RStudio. Use make from the shell. Other tools aside from make (we won’t be covering these):

Character Data

From this topic, students are anticipated to be able to: Manipulate a character vector in R using the stringr package. Write simple regular expressions (regex). Apply stringr and regular expressions to manipulate data in tibbles.
Resources: Video lecture: Regular Expressions and stringr for Text Data (only labelled as “age restricted” because it looks at real emails within the Enron company). Written material: Overview tutorials similar to our worksheet: stat545.com Chapter 11: character vectors; R4DS Chapter 15: strings.
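To give a flavour of these objectives, here is a minimal sketch that applies stringr functions and simple regular expressions to a tibble column. The `contacts` data, its column names, and the deliberately simplistic email pattern are all made up for illustration.

```r
library(tibble)
library(dplyr)
library(stringr)

# Hypothetical data, invented for this example
contacts <- tibble(
  name  = c("Ada", "Grace", "Alan"),
  email = c("ada@example.com", "grace@example.org", "not-an-email")
)

result <- contacts %>%
  mutate(
    # Very rough email check: text, "@", text, ".", lowercase letters
    has_email  = str_detect(email, "^[^@\\s]+@[^@\\s]+\\.[a-z]+$"),
    # Everything after the "@"; NA when there is no match
    domain     = str_extract(email, "(?<=@).+$"),
    # Plain character-vector manipulation
    name_upper = str_to_upper(name)
  )

result
```

Because stringr functions are vectorised over character vectors, they slot directly into `mutate()` without any explicit looping.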

Graphing using the Grammar of Graphics through ggplot2

library(tidyverse)
library(gapminder)
library(scales)
Learning Objectives: From this topic, students are anticipated to be able to: Identify the seven components of the grammar of graphics underlying ggplot2. Produce plots with ggplot2 by implementing the components of the grammar of graphics. Customize the look of ggplot2 graphs. Choose an appropriate plot type for Exploratory Data Analysis, based on an understanding of what makes a graph effective.
Resources: Video lectures for this topic (ignore the episode numbering):