FEV Case Study: Data Manipulation, With Solutions

Introduction

We will perform exploratory data analysis of the fev data set. Let’s start by getting the data set. (Reminder: you will need to run install.packages("rigr") to install the rigr package which contains the data set before loading it. ) Let’s also load dplyr while we’re at it - we’ll need it to do the exercises!

library(rigr) 
library(dplyr)

fev_tbl <- as_tibble(fev) %>% mutate(across(sex:smoke, ~ as.factor(.x)))

The fev data set

It is now widely believed that smoking tends to impair lung function. Much of the data to support this claim arises from studies of pulmonary function in adults who are long-time smokers. A question then arises whether such deleterious effects of smoking can be detected in children who smoke. To address this question, measures of lung function were made in about 600 children seen for a routine check up in a particular pediatric clinic. The children participating in this study were asked whether they were current smokers.

A common measurement of lung function is the forced expiratory volume (FEV), which measures how much air you can blow out of your lungs in a short period of time. A higher FEV is usually associated with better respiratory function. It is well known that prolonged smoking diminishes FEV in adults, and those adults with diminished FEV also tend to have decreased pulmonary function as measured by other clinical variables, such as blood oxygen and carbon dioxide levels.

Data collected from the study on the relationship between smoking status and lung function (measured by FEV) in children are contained in the fev_tbl dataset. Here is a data dictionary:

Variable Name Description
seqnbr case number
subjid subject identification number
age subject age (years)
fev measured forced exhalation volume (litres/second)
height subject height (inches)
sex subject sex
smoke smoking status (yes or no)

Here is a few rows of the data set:

head(fev_tbl)
## # A tibble: 6 × 7
##   seqnbr subjid   age   fev height sex    smoke
##    <int>  <int> <int> <dbl>  <dbl> <fct>  <fct>
## 1      1    301     9  1.71   57   female no   
## 2      2    451     8  1.72   67.5 female no   
## 3      3    501     7  1.72   54.5 female no   
## 4      4    642     9  1.56   53   male   no   
## 5      5    901     9  1.90   57   male   no   
## 6      6   1701     8  2.34   61   female no

Understanding the data structure

Exercise 1

Am I missing any variables compared to the data dictionary? Let’s check.

glimpse(fev_tbl)
## Rows: 654
## Columns: 7
## $ seqnbr <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, …
## $ subjid <int> 301, 451, 501, 642, 901, 1701, 1752, 1753, 1901, 1951, 1952, 20…
## $ age    <int> 9, 8, 7, 9, 9, 8, 6, 6, 8, 9, 6, 8, 8, 8, 8, 7, 5, 6, 9, 9, 5, …
## $ fev    <dbl> 1.708, 1.724, 1.720, 1.558, 1.895, 2.336, 1.919, 1.415, 1.987, …
## $ height <dbl> 57.0, 67.5, 54.5, 53.0, 57.0, 61.0, 58.0, 56.0, 58.5, 60.0, 53.…
## $ sex    <fct> female, female, female, male, male, female, female, female, fem…
## $ smoke  <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no,…

Exercise 2

Next: are there any duplicate case numbers? Are there any duplicate subject IDs?

fev_tbl %>% summarize(num_case = length(unique(seqnbr)), 
                      num_patient = length(unique(subjid)))
## # A tibble: 1 × 2
##   num_case num_patient
##      <int>       <int>
## 1      654         654

Now we know that no cases were entered twice, and each case corresponds to a different patient.

Understanding the patients in the study

Exercise 3

Let’s summarize the age of the patients first, by computing the mean, standard deviation, min, and max of the patient ages.

fev_tbl %>% summarize(mean = mean(age), sd = sd(age), 
                      min = min(age), max = max(age))
## # A tibble: 1 × 4
##    mean    sd   min   max
##   <dbl> <dbl> <int> <int>
## 1  9.93  2.95     3    19

Something’s a little odd about these summaries. Remember: this is a smoking study on children.

  • Why would a 3 year old be enrolled in a smoking study?
  • Why would a 19 year old be enrolled in a study on children?

Exercise 4

I’m now a bit worried: what’s the youngest patient who smokes in this dataset?

fev_tbl %>% filter(smoke == "yes") %>% arrange(age)
## # A tibble: 65 × 7
##    seqnbr subjid   age   fev height sex    smoke
##     <int>  <int> <int> <dbl>  <dbl> <fct>  <fct>
##  1    191  45241     9  1.95   58   male   yes  
##  2    388  16551    10  2.39   66   female yes  
##  3    441  30052    10  3.41   66   female yes  
##  4    446  31502    10  2.98   63   female yes  
##  5    518  50301    10  3.50   68   male   yes  
##  6    574  72552    10  3.04   65   female yes  
##  7    369  11601    11  1.69   60   male   yes  
##  8    456  34201    11  3.17   62.5 female yes  
##  9    496  45201    11  3.10   64   female yes  
## 10    523  51201    11  2.95   67   female yes  
## # ℹ 55 more rows

Looks like the youngest patient who smokes is 9. Seems young to me, but much less far-fetched than (say) 3.

Exercise 5

What about the 19 year olds? Should they be included? The definition of a “child” can vary from study to study. Possible definitions include < 21 and < 18. Let’s find out who the patients 18 or older are and what their case numbers are so that we can ask our point of contact for the study about them.

fev_tbl %>% filter(age >= 18) %>% select(seqnbr, subjid)
## # A tibble: 9 × 2
##   seqnbr subjid
##    <int>  <int>
## 1    608   4051
## 2    609   6144
## 3    610   6252
## 4    618  21351
## 5    619  22251
## 6    626  30441
## 7    638  48141
## 8    645  59944
## 9    652  73751

Aside: if it turns out that we need to exclude any of these odd-looking patients from our final data analysis, then we will need to re-run everything after this point with their data removed. Isn’t it nice that we are preparing this report in R Markdown?

Exercise 6

This is a smoking study, so it seems useful to know what proportion of the study participants are smokers. In fact, let’s break it down by sex, and calculate the proportion of girls who are smokers and the proportion of boys who are smokers, as well as the number of girls and the number of boys in the study.

fev_tbl %>% group_by(sex) %>% 
            summarize(prop_smoke = mean(smoke == "yes"), 
                      n = n())
## # A tibble: 2 × 3
##   sex    prop_smoke     n
##   <fct>       <dbl> <int>
## 1 female     0.123    318
## 2 male       0.0774   336

The proportion of girls who are smokers is higher than the proportion of boys that are smokers.

Exercise 7

Is this because there are more smokers among teenage girls than teenage boys? Or is this a phenomenon that is uniform across age groups? To find out, let’s calculate:

  • the proportion and number of girls aged 0-6 who are smokers
  • the proportion and number of girls aged 7-12 who are smokers
  • the proportion and number of girls aged 13-19 who are smokers
  • the proportion and number of boys aged 0-6 who are smokers
  • the proportion and number of boys aged 7-12 who are smokers
  • the proportion and number of boys aged 13-19 who are smokers

Hint: you will need to create a new variable that has three categories: age 0-6, age 7-12, and age 13-19. You can do so with fev_tbl %>% mutate(age_cat = cut(age, c(0, 6, 12, 19))).

fev_tbl %>% mutate(age_cat = cut(age, c(0, 6, 12, 19))) %>% 
  group_by(age_cat, sex) %>% 
  summarize(prop_smoke=mean(smoke == "yes"), n=n())
## `summarise()` has grouped output by 'age_cat'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 4
## # Groups:   age_cat [3]
##   age_cat sex    prop_smoke     n
##   <fct>   <fct>       <dbl> <int>
## 1 (0,6]   female     0         36
## 2 (0,6]   male       0         40
## 3 (6,12]  female     0.0617   227
## 4 (6,12]  male       0.0342   234
## 5 (12,19] female     0.455     55
## 6 (12,19] male       0.290     62

There are no smokers (female or male) in the 0-6 group. There is a higher proportion of girls in the 7-12 group who are smokers than boys in the 7-12 group. Ditto the 13-19 group.

Does smoking have an effect on lung function?

Let’s continue exploring the data set with a closer eye to our main research question: does smoking have an effect on lung function in children?

Exercise 8

Let’s start by summarizing the FEV of the smokers and non-smokers. Let’s calculate the mean, standard deviation, and number of observations in each group. We will mainly be comparing the means to gather information about the question, but the standard deviation and number of observations are important to look at too.

fev_tbl %>% group_by(smoke) %>% 
  summarize(mean_fev = mean(fev), sd_fev = sd(fev), n = n())
## # A tibble: 2 × 4
##   smoke mean_fev sd_fev     n
##   <fct>    <dbl>  <dbl> <int>
## 1 no        2.57  0.851   589
## 2 yes       3.28  0.750    65

We see that the mean FEV in the smoking group seems to be substantially higher than the average FEV in the non-smoking group. That is, the smokers appear to have better lung function than the non-smokers.

Does this surprise you? Recall that this is an observational study - children were not randomly assigned to smoke or not smoke. We might then have reason to suspect that the association between FEV and smoking status is confounded by some other factors. For example, we already know that the youngest smoker in our data set is 9, while the non-smokers are as young as 3. This suggests that the smokers in our data set are generally older than the non-smokers. Furthermore, we might expect older children to have higher FEV, because they are bigger and have bigger lungs. Could age be a confounder here? Maybe the smoking group has higher lung function simply because they are older and bigger.

We will investigate this point next week after we have learned about graphing tools.