FEV Case Study: Data Manipulation

Introduction

We will perform an exploratory data analysis of the fev data set. Let’s start by getting the data set. (Reminder: before loading it, you will need to run install.packages("rigr") to install the rigr package, which contains the data set.) Let’s also load dplyr while we’re at it; we’ll need it to do the exercises!

library(rigr) 
library(dplyr)

fev_tbl <- as_tibble(fev) %>% mutate(across(sex:smoke, ~ as.factor(.x)))
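The mutate(across(...)) call above converts every column from sex through smoke into a factor. Here is a minimal sketch of the same pattern on a made-up tibble; toy_tbl and its values are hypothetical and are not taken from fev.

```r
library(dplyr)

# Hypothetical toy data: column names mirror fev, values are invented
toy_tbl <- tibble(
  age   = c(9, 8, 12),
  sex   = c("female", "male", "female"),
  smoke = c("no", "no", "yes")
)

# sex:smoke selects every column from sex through smoke, by position
toy_fct <- toy_tbl %>% mutate(across(sex:smoke, ~ as.factor(.x)))

# Check which columns are now factors
sapply(toy_fct, is.factor)
```

Storing categorical variables as factors makes the later grouped summaries and printed output easier to read.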

The `fev` data set

It is now widely believed that smoking tends to impair lung function. Much of the data to support this claim arises from studies of pulmonary function in adults who are long-time smokers. A question then arises whether such deleterious effects of smoking can be detected in children who smoke. To address this question, measures of lung function were made in about 600 children seen for a routine checkup in a particular pediatric clinic. The children participating in this study were asked whether they were current smokers.

A common measurement of lung function is the forced expiratory volume (FEV), which measures how much air you can blow out of your lungs in a short period of time. A higher FEV is usually associated with better respiratory function. It is well known that prolonged smoking diminishes FEV in adults, and those adults with diminished FEV also tend to have decreased pulmonary function as measured by other clinical variables, such as blood oxygen and carbon dioxide levels.

Data collected from the study on the relationship between smoking status and lung function (measured by FEV) in children are contained in the fev_tbl dataset. Here is a data dictionary:

Variable Name	Description
seqnbr	case number
subjid	subject identification number
age	subject age (years)
fev	measured forced exhalation volume (litres/second)
height	subject height (inches)
sex	subject sex
smoke	smoking status (yes or no)

Here are the first few rows of the data set:

head(fev_tbl)

## # A tibble: 6 × 7
##   seqnbr subjid   age   fev height sex    smoke
##    <int>  <int> <int> <dbl>  <dbl> <fct>  <fct>
## 1      1    301     9  1.71   57   female no   
## 2      2    451     8  1.72   67.5 female no   
## 3      3    501     7  1.72   54.5 female no   
## 4      4    642     9  1.56   53   male   no   
## 5      5    901     9  1.90   57   male   no   
## 6      6   1701     8  2.34   61   female no

Understanding the data structure

Exercise 1

Am I missing any variables compared to the data dictionary? Let’s check.
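One possible way to approach this kind of check is to compare the column names against a vector of expected names with setdiff(). The sketch below uses a hypothetical toy tibble and a shortened, made-up dictionary, so it illustrates the idea rather than answering the exercise.

```r
library(dplyr)

# Hypothetical toy tibble that is missing one expected column
toy_tbl <- tibble(
  seqnbr = 1:3,
  subjid = c(301, 451, 501),
  age    = c(9, 8, 7)
)

# Variable names we expect to see (a shortened, made-up dictionary)
expected <- c("seqnbr", "subjid", "age", "fev")

# Names that are in the dictionary but absent from the data
missing_vars <- setdiff(expected, names(toy_tbl))
missing_vars
```

An empty result would mean every expected variable is present.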

# FILL IN HERE

Exercise 2

Next: are there any duplicate case numbers? Are there any duplicate subject IDs?
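As a hint, duplicates can be detected by comparing the number of rows with the number of distinct values in a column, or by counting how often each value appears. The sketch below assumes a hypothetical toy tibble with one repeated subject ID; none of these values come from fev.

```r
library(dplyr)

# Hypothetical toy data with one repeated subject ID (451)
toy_tbl <- tibble(
  seqnbr = c(1, 2, 3, 4),
  subjid = c(301, 451, 451, 642)
)

# If a column has no duplicates, its distinct count equals the row count
dup_check <- toy_tbl %>%
  summarize(
    n_rows   = n(),
    n_seqnbr = n_distinct(seqnbr),
    n_subjid = n_distinct(subjid)
  )
dup_check

# List the IDs that appear more than once
dup_ids <- toy_tbl %>%
  count(subjid) %>%
  filter(n > 1)
dup_ids
```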

# FILL IN HERE

Now we know that no cases were entered twice, and each case corresponds to a different patient.

Understanding the patients in the study

Exercise 3

Let’s summarize the age of the patients first, by computing the mean, standard deviation, min, and max of the patient ages.
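As a reminder of the pattern, summarize() can compute several statistics in one call. Here is a minimal sketch on a hypothetical four-row tibble; the ages are invented for illustration.

```r
library(dplyr)

# Hypothetical toy data: invented ages
toy_tbl <- tibble(age = c(3, 9, 12, 19))

# One row of output, one column per summary statistic
age_summary <- toy_tbl %>%
  summarize(
    mean_age = mean(age),
    sd_age   = sd(age),
    min_age  = min(age),
    max_age  = max(age)
  )
age_summary
```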

# FILL IN HERE

Something’s a little odd about these summaries. Remember: this is a smoking study on children.

Why would a 3-year-old be enrolled in a smoking study?
Why would a 19-year-old be enrolled in a study on children?

Exercise 4

I’m now a bit worried: what’s the youngest patient who smokes in this dataset?

# FILL IN HERE

Looks like the youngest patient who smokes is 9. Seems young to me, but much less far-fetched than (say) 3.

Exercise 5

What about the 19-year-olds? Should they be included? The definition of a “child” can vary from study to study. Possible definitions include under 21 and under 18. Let’s find out which patients are 18 or older and what their case numbers are so that we can ask our point of contact for the study about them.
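The general pattern here is to filter() rows on a condition and then select() the columns worth reporting. A minimal sketch, assuming a hypothetical toy tibble with invented case numbers and ages:

```r
library(dplyr)

# Hypothetical toy data: invented case numbers and ages
toy_tbl <- tibble(
  seqnbr = 1:4,
  age    = c(9, 18, 12, 19)
)

# Keep only rows that meet the age condition, then keep the columns of interest
older <- toy_tbl %>%
  filter(age >= 18) %>%
  select(seqnbr, age)
older
```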

# FILL IN HERE

Aside: if it turns out that we need to exclude any of these odd-looking patients from our final data analysis, then we will need to re-run everything after this point with their data removed. Isn’t it nice that we are preparing this report in R Markdown?

Exercise 6

This is a smoking study, so it seems useful to know what proportion of the study participants are smokers. In fact, let’s break it down by sex, and calculate the proportion of girls who are smokers and the proportion of boys who are smokers, as well as the number of girls and the number of boys in the study.
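A useful trick for proportions: the mean of a logical vector is the proportion of TRUE values. Combined with group_by(), this gives a per-group proportion. The sketch below uses a hypothetical five-row tibble with invented values.

```r
library(dplyr)

# Hypothetical toy data: invented sex and smoking status
toy_tbl <- tibble(
  sex   = c("female", "female", "female", "male", "male"),
  smoke = c("yes", "no", "no", "no", "no")
)

# Within each sex: group size and proportion of rows with smoke == "yes"
by_sex <- toy_tbl %>%
  group_by(sex) %>%
  summarize(
    n            = n(),
    prop_smokers = mean(smoke == "yes")
  )
by_sex
```

The comparison smoke == "yes" assumes the smoking variable is coded with the labels "yes" and "no"; adjust it if your coding differs.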

# FILL IN HERE

The proportion of girls who are smokers is higher than the proportion of boys who are smokers.

Exercise 7

Is this because there are more smokers among teenage girls than teenage boys? Or is this a phenomenon that is uniform across age groups? To find out, let’s calculate:

the proportion and number of girls aged 0-6 who are smokers
the proportion and number of girls aged 7-12 who are smokers
the proportion and number of girls aged 13-19 who are smokers
the proportion and number of boys aged 0-6 who are smokers
the proportion and number of boys aged 7-12 who are smokers
the proportion and number of boys aged 13-19 who are smokers

Hint: you will need to create a new variable that has three categories: age 0-6, age 7-12, and age 13-19. You can do so with fev_tbl %>% mutate(age_cat = cut(age, c(0, 6, 12, 19))).
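It helps to see exactly which intervals cut() creates. By default each interval is open on the left and closed on the right, so the breaks c(0, 6, 12, 19) give (0,6], (6,12], and (12,19]. For whole-number ages these match 1-6, 7-12, and 13-19; an age of exactly 0 would fall outside the first interval and become NA. A small sketch with a made-up vector of ages:

```r
# Hypothetical ages chosen to sit on either side of each break
ages <- c(3, 6, 7, 12, 13, 19)

# Default: right = TRUE, so intervals look like (lower, upper]
age_cat <- cut(ages, c(0, 6, 12, 19))

levels(age_cat)
table(age_cat)

# A value equal to the lowest break is not included by default
cut(0, c(0, 6, 12, 19))
```

If you prefer friendlier category names, cut() also accepts a labels argument.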

# FILL IN HERE

There are no smokers (female or male) in the 0-6 group. There is a higher proportion of girls in the 7-12 group who are smokers than boys in the 7-12 group. Ditto the 13-19 group.

Does smoking have an effect on lung function?

Let’s continue exploring the data set with a closer eye to our main research question: does smoking have an effect on lung function in children?

Exercise 8

Let’s start by summarizing the FEV of the smokers and non-smokers. Let’s calculate the mean, standard deviation, and number of observations in each group. We will mainly be comparing the means to gather information about the question, but the standard deviation and number of observations are important to look at too.

# FILL IN HERE

We see that the mean FEV in the smoking group seems to be substantially higher than the mean FEV in the non-smoking group. That is, the smokers appear to have better lung function than the non-smokers.

Does this surprise you? Recall that this is an observational study: children were not randomly assigned to smoke or not smoke. We might then have reason to suspect that the association between FEV and smoking status is confounded by some other factors. For example, we already know that the youngest smoker in our data set is 9, while the non-smokers are as young as 3. This suggests that the smokers in our data set are generally older than the non-smokers. Furthermore, we might expect older children to have higher FEV, because they are bigger and have bigger lungs. Could age be a confounder here? Maybe the smoking group has higher lung function simply because they are older and bigger.
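One simple way to check whether two groups differ in age is to summarize age alongside the outcome within each group. The sketch below does this on a hypothetical six-row tibble; the values are invented purely to show the computation and say nothing about the real fev data.

```r
library(dplyr)

# Hypothetical toy data: invented ages, FEV values, and smoking status
toy_tbl <- tibble(
  age   = c(5, 7, 9, 12, 14, 16),
  fev   = c(1.5, 1.9, 2.4, 3.0, 3.4, 3.8),
  smoke = c("no", "no", "no", "yes", "yes", "yes")
)

# Compare group size, mean age, and mean FEV between the two groups
by_smoke <- toy_tbl %>%
  group_by(smoke) %>%
  summarize(
    n        = n(),
    mean_age = mean(age),
    mean_fev = mean(fev)
  )
by_smoke
```

If the group with the higher mean FEV also has the higher mean age, age is a plausible confounder and worth a closer look.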

We will investigate this point next week after we have learned about graphing tools.