Introduction
We will perform exploratory data analysis of the fev data set.
Let’s start by getting the data set. (Reminder: you will need to run install.packages("rigr") to install the rigr package, which contains the data set, before loading it.) Let’s also load dplyr while we’re at it - we’ll need it to do the exercises!
library(rigr)
library(dplyr)
fev_tbl <- as_tibble(fev) %>% mutate(across(sex:smoke, ~ as.factor(.x)))
The fev data set
It is now widely believed that smoking tends to impair lung function. Much of the data to support this claim arises from studies of pulmonary function in adults who are long-time smokers. A question then arises whether such deleterious effects of smoking can be detected in children who smoke. To address this question, measures of lung function were made in about 600 children seen for a routine check up in a particular pediatric clinic. The children participating in this study were asked whether they were current smokers.
A common measurement of lung function is the forced expiratory volume (FEV), which measures how much air you can blow out of your lungs in a short period of time. A higher FEV is usually associated with better respiratory function. It is well known that prolonged smoking diminishes FEV in adults, and those adults with diminished FEV also tend to have decreased pulmonary function as measured by other clinical variables, such as blood oxygen and carbon dioxide levels.
Data collected from the study on the relationship between smoking status and lung function (measured by FEV) in children are contained in the fev_tbl data set. Here is a data dictionary:
Variable Name | Description |
---|---|
seqnbr | case number |
subjid | subject identification number |
age | subject age (years) |
fev | measured forced expiratory volume (litres/second) |
height | subject height (inches) |
sex | subject sex |
smoke | smoking status (yes or no) |
Here are the first few rows of the data set:
head(fev_tbl)
## # A tibble: 6 × 7
## seqnbr subjid age fev height sex smoke
## <int> <int> <int> <dbl> <dbl> <fct> <fct>
## 1 1 301 9 1.71 57 female no
## 2 2 451 8 1.72 67.5 female no
## 3 3 501 7 1.72 54.5 female no
## 4 4 642 9 1.56 53 male no
## 5 5 901 9 1.90 57 male no
## 6 6 1701 8 2.34 61 female no
Understanding the data structure
Exercise 1
Am I missing any variables compared to the data dictionary? Let’s check.
glimpse(fev_tbl)
## Rows: 654
## Columns: 7
## $ seqnbr <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, …
## $ subjid <int> 301, 451, 501, 642, 901, 1701, 1752, 1753, 1901, 1951, 1952, 20…
## $ age <int> 9, 8, 7, 9, 9, 8, 6, 6, 8, 9, 6, 8, 8, 8, 8, 7, 5, 6, 9, 9, 5, …
## $ fev <dbl> 1.708, 1.724, 1.720, 1.558, 1.895, 2.336, 1.919, 1.415, 1.987, …
## $ height <dbl> 57.0, 67.5, 54.5, 53.0, 57.0, 61.0, 58.0, 56.0, 58.5, 60.0, 53.…
## $ sex <fct> female, female, female, male, male, female, female, female, fem…
## $ smoke <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no,…
Exercise 2
Next: are there any duplicate case numbers? Are there any duplicate subject IDs?
fev_tbl %>% summarize(num_case = length(unique(seqnbr)),
num_patient = length(unique(subjid)))
## # A tibble: 1 × 2
## num_case num_patient
## <int> <int>
## 1 654 654
Both counts equal the number of rows in the data set (654), so we now know that no cases were entered twice, and each case corresponds to a different patient.
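If the counts had not matched the number of rows, we would want to see which rows were duplicated. Here is a minimal sketch of one way to do that. The helper find_duplicates() and the toy tibble are made up for illustration; they are not part of rigr or dplyr, and the toy values are not from the fev data set.

```r
library(dplyr)

# Toy data (made-up values, not from the fev data set):
# subject 102 appears twice, case numbers are all unique.
toy <- tibble(seqnbr = 1:4, subjid = c(101L, 102L, 102L, 103L))

# Hypothetical helper: return the rows whose value in `id_col`
# occurs more than once in `data`.
find_duplicates <- function(data, id_col) {
  data %>%
    group_by({{ id_col }}) %>%
    filter(n() > 1) %>%
    ungroup()
}

dup_subj <- find_duplicates(toy, subjid)  # the two rows sharing subjid 102
dup_case <- find_duplicates(toy, seqnbr)  # no rows: case numbers are unique
```

The same helper could be called as find_duplicates(fev_tbl, subjid); given the counts above, we would expect it to return zero rows.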
Understanding the patients in the study
Exercise 3
Let’s summarize the age of the patients first, by computing the mean, standard deviation, min, and max of the patient ages.
fev_tbl %>% summarize(mean = mean(age), sd = sd(age),
min = min(age), max = max(age))
## # A tibble: 1 × 4
## mean sd min max
## <dbl> <dbl> <int> <int>
## 1 9.93 2.95 3 19
Something’s a little odd about these summaries. Remember: this is a smoking study on children.
- Why would a 3-year-old be enrolled in a smoking study?
- Why would a 19-year-old be enrolled in a study on children?
Exercise 4
I’m now a bit worried: what’s the youngest patient who smokes in this dataset?
fev_tbl %>% filter(smoke == "yes") %>% arrange(age)
## # A tibble: 65 × 7
## seqnbr subjid age fev height sex smoke
## <int> <int> <int> <dbl> <dbl> <fct> <fct>
## 1 191 45241 9 1.95 58 male yes
## 2 388 16551 10 2.39 66 female yes
## 3 441 30052 10 3.41 66 female yes
## 4 446 31502 10 2.98 63 female yes
## 5 518 50301 10 3.50 68 male yes
## 6 574 72552 10 3.04 65 female yes
## 7 369 11601 11 1.69 60 male yes
## 8 456 34201 11 3.17 62.5 female yes
## 9 496 45201 11 3.10 64 female yes
## 10 523 51201 11 2.95 67 female yes
## # ℹ 55 more rows
Looks like the youngest patient who smokes is 9. Seems young to me, but much less far-fetched than (say) 3.
Exercise 5
What about the 19 year olds? Should they be included? The definition of a “child” can vary from study to study. Possible definitions include < 21 and < 18. Let’s find out who the patients 18 or older are and what their case numbers are so that we can ask our point of contact for the study about them.
fev_tbl %>% filter(age >= 18) %>% select(seqnbr, subjid)
## # A tibble: 9 × 2
## seqnbr subjid
## <int> <int>
## 1 608 4051
## 2 609 6144
## 3 610 6252
## 4 618 21351
## 5 619 22251
## 6 626 30441
## 7 638 48141
## 8 645 59944
## 9 652 73751
Aside: if it turns out that we need to exclude any of these odd-looking patients from our final data analysis, then we will need to re-run everything after this point with their data removed. Isn’t it nice that we are preparing this report in R Markdown?
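Here is a minimal sketch of what such an exclusion step might look like. The toy tibble and the vector exclude_seqnbr are assumptions made up for illustration; the real list of case numbers to drop (if any) would come from our point of contact for the study.

```r
library(dplyr)

# Toy data (made-up values, not from the fev data set).
toy <- tibble(seqnbr = 1:5, age = c(9L, 19L, 8L, 18L, 12L))

# Hypothetical exclusion list: case numbers we were told to drop.
exclude_seqnbr <- c(2L, 4L)

# Keep only the cases that are NOT on the exclusion list.
toy_clean <- toy %>% filter(!seqnbr %in% exclude_seqnbr)
```

Putting a step like this near the top of the report, and using the cleaned tibble everywhere afterwards, means a change to the exclusion list only has to be made in one place.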
Exercise 6
This is a smoking study, so it seems useful to know what proportion of the study participants are smokers. In fact, let’s break it down by sex, and calculate the proportion of girls who are smokers and the proportion of boys who are smokers, as well as the number of girls and the number of boys in the study.
fev_tbl %>% group_by(sex) %>%
summarize(prop_smoke = mean(smoke == "yes"),
n = n())
## # A tibble: 2 × 3
## sex prop_smoke n
## <fct> <dbl> <int>
## 1 female 0.123 318
## 2 male 0.0774 336
The proportion of girls who are smokers (about 12%) is higher than the proportion of boys who are smokers (about 8%).
Exercise 7
Is this because there are more smokers among teenage girls than teenage boys? Or is this a phenomenon that is uniform across age groups? To find out, let’s calculate:
- the proportion and number of girls aged 0-6 who are smokers
- the proportion and number of girls aged 7-12 who are smokers
- the proportion and number of girls aged 13-19 who are smokers
- the proportion and number of boys aged 0-6 who are smokers
- the proportion and number of boys aged 7-12 who are smokers
- the proportion and number of boys aged 13-19 who are smokers
Hint: you will need to create a new variable that has three categories: age 0-6, age 7-12, and age 13-19. You can do so with fev_tbl %>% mutate(age_cat = cut(age, c(0, 6, 12, 19))).
fev_tbl %>% mutate(age_cat = cut(age, c(0, 6, 12, 19))) %>%
group_by(age_cat, sex) %>%
summarize(prop_smoke=mean(smoke == "yes"), n=n())
## `summarise()` has grouped output by 'age_cat'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 4
## # Groups: age_cat [3]
## age_cat sex prop_smoke n
## <fct> <fct> <dbl> <int>
## 1 (0,6] female 0 36
## 2 (0,6] male 0 40
## 3 (6,12] female 0.0617 227
## 4 (6,12] male 0.0342 234
## 5 (12,19] female 0.455 55
## 6 (12,19] male 0.290 62
There are no smokers (female or male) in the 0-6 group. In both the 7-12 group and the 13-19 group, a higher proportion of girls than boys are smokers (about 6% vs. 3%, and about 45% vs. 29%, respectively), so the difference is not confined to teenagers.
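A note on the labels produced by cut(): by default the intervals are open on the left and closed on the right, so (0,6] contains ages 1 through 6 and (6,12] contains ages 7 through 12. The sketch below checks this on a few boundary ages and shows the labels argument, which gives more readable category names. The vector ages is made up for illustration.

```r
# A few boundary ages (made-up values) to see where each one lands.
ages <- c(3, 6, 7, 12, 13, 19)

# Default labels: intervals are right-closed, e.g. 6 falls in (0,6].
default_cat <- cut(ages, breaks = c(0, 6, 12, 19))

# Same breaks, with human-readable labels supplied by us.
labelled_cat <- cut(ages, breaks = c(0, 6, 12, 19),
                    labels = c("0-6", "7-12", "13-19"))

# Values outside the breaks (here, an age of 20) become NA.
outside <- cut(20, breaks = c(0, 6, 12, 19))
```

Because values outside the breaks become NA, it is worth confirming that the smallest and largest ages in the data fall within the first and last break before relying on the categories.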
Does smoking have an effect on lung function?
Let’s continue exploring the data set with a closer eye to our main research question: does smoking have an effect on lung function in children?
Exercise 8
Let’s start by summarizing the FEV of the smokers and non-smokers. Let’s calculate the mean, standard deviation, and number of observations in each group. We will mainly be comparing the means to gather information about the question, but the standard deviation and number of observations are important to look at too.
fev_tbl %>% group_by(smoke) %>%
summarize(mean_fev = mean(fev), sd_fev = sd(fev), n = n())
## # A tibble: 2 × 4
## smoke mean_fev sd_fev n
## <fct> <dbl> <dbl> <int>
## 1 no 2.57 0.851 589
## 2 yes 3.28 0.750 65
We see that the mean FEV in the smoking group (3.28) is substantially higher than the mean FEV in the non-smoking group (2.57). That is, the smokers appear to have better lung function than the non-smokers.
Does this surprise you? Recall that this is an observational study - children were not randomly assigned to smoke or not smoke. We might then have reason to suspect that the association between FEV and smoking status is confounded by some other factors. For example, we already know that the youngest smoker in our data set is 9, while the non-smokers are as young as 3. This suggests that the smokers in our data set are generally older than the non-smokers. Furthermore, we might expect older children to have higher FEV, because they are bigger and have bigger lungs. Could age be a confounder here? Maybe the smoking group has higher lung function simply because they are older and bigger.
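To see how a confounder like age could produce this pattern, here is a minimal sketch on toy data. All values are made up purely to illustrate the mechanism; they are not from the fev data set and say nothing about what the real data will show.

```r
library(dplyr)

# Toy data (made-up values): the non-smokers include young children,
# while the smokers are all teenagers.
toy <- tibble(
  age   = c(5, 6, 14, 15, 14, 15),
  smoke = factor(c("no", "no", "no", "no", "yes", "yes")),
  fev   = c(1.5, 1.7, 3.6, 3.8, 3.3, 3.5)
)

# Overall comparison: smokers have the higher mean FEV,
# but they are also older on average.
overall <- toy %>%
  group_by(smoke) %>%
  summarize(mean_fev = mean(fev), mean_age = mean(age), n = n())

# Comparison among children of similar age (14 and older):
# here the smokers have the lower mean FEV.
teens <- toy %>%
  filter(age >= 14) %>%
  group_by(smoke) %>%
  summarize(mean_fev = mean(fev), n = n())
```

In this toy example the direction of the comparison flips once we compare children of similar ages. Whether anything like this happens in fev_tbl is exactly what remains to be checked.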
We will investigate this point next week after we have learned about graphing tools.