R Functions for Data Analysis

From this topic, students are expected to be able to:

Start getting a sense of when to make a function in a data analysis (we will build on this next week).
Workflow for building a function: start interactively, wrap it as a function. return(). Argument names.
Fortify a function:
- generalize the function and use smart defaults; NA handling, and the ellipsis package https://ellipsis.r-lib.org/
- provide useful error messages; sidebar: if statements
- Unit tests, and (sidebar) assertions
Data masking in a function.
Documenting a function

Resources

Video lecture:

R Functions for Data Analysis

Written resources:

Basic function syntax in R: https://swcarpentry.github.io/r-novice-inflammation/02-func-R/
When to use functions in your data analysis:
- stat545.com Functions, Parts 1-3
- R4DS functions chapter

Our own R functions

At this point in the course, we’ve used lots of functions, like mean(), mutate(), and pivot_longer(). But it can be really useful to write your own function. For example, the ability to write your own functions can supercharge your group_by() %>% summarize() workflow: you can write your own function to use inside summarize(), instead of relying solely on functions built into R or available in packages!
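As a small sketch of that last idea, here is a custom function used inside summarize(). The helper name value_spread(), the output column name bill_spread, and the choice of the penguins data from the palmerpenguins package are assumptions made for illustration only.

```r
library(dplyr)
library(palmerpenguins)

# Hypothetical helper: the gap between the largest and smallest value,
# ignoring missing values.
value_spread <- function(x) {
  max(x, na.rm = TRUE) - min(x, na.rm = TRUE)
}

# Use the helper inside summarize(), once per species.
spread_by_species <- penguins %>%
  group_by(species) %>%
  summarize(bill_spread = value_spread(bill_length_mm))

spread_by_species
```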

Here’s a simple example of a function I wrote to simulate rolling a user-inputted number of D10s (a 10-sided die used for tabletop gaming).

roll_d10 <- function(num_dice) { 
    sum(sample(1:10, num_dice, replace=TRUE))
}

roll_d10(2)

## [1] 3

(Sidebar: this is not reproducible code, as the output depends on the random seed, which R made up for us and won’t tell us. If I wanted to make this reproducible, then I would set the seed to (say) 123 by calling set.seed(123) before running my function.)
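A minimal sketch of the seeding idea from the sidebar. The function is redefined so the block runs on its own; the variable names first_roll and second_roll are just for illustration.

```r
# Same definition as above, repeated so this block is self-contained.
roll_d10 <- function(num_dice) {
  sum(sample(1:10, num_dice, replace = TRUE))
}

set.seed(123)               # fix the state of the random number generator
first_roll <- roll_d10(2)

set.seed(123)               # reset to the same state
second_roll <- roll_d10(2)

# Same seed, same sequence of random draws, so the two totals match.
identical(first_roll, second_roll)
```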

Why Functions?

In short, writing functions lets you avoid duplicating code. This is helpful because:

It shortens your code – crucially, without losing interpretability – making it easier and faster to read through and process its overall intent.
If your needs change, then you only need to change your code in one place (the function definition) rather than a bunch of places.
Bullet points 1 and 2 mean that using functions typically leads to fewer bugs and fewer headaches.

A good rule of thumb: whenever you find yourself repeating code more than a few times, consider writing a function.

Testing

When you’re using other people’s functions, like those in packages, they often work. However, as you have probably discovered by this point, it is very easy to inadvertently write code, and therefore functions, that does not work. Because of this, it’s important to test the functions we write to make sure they work.
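A minimal sketch of what such tests might look like for roll_d10(), assuming the testthat package is installed. The particular checks are my own illustration, not the worksheet's.

```r
library(testthat)

# Same definition as earlier, repeated so this block is self-contained.
roll_d10 <- function(num_dice) {
  sum(sample(1:10, num_dice, replace = TRUE))
}

test_that("roll_d10() returns one total within the possible range", {
  set.seed(123)
  total <- roll_d10(3)
  expect_length(total, 1)
  # Three dice can total anywhere from 3 to 30.
  expect_true(total >= 3 && total <= 30)
})

test_that("rolling zero dice gives a total of zero", {
  expect_equal(roll_d10(0), 0)
})
```

Because the output is random, these tests check properties that must hold for every roll rather than one specific value.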

Documentation

You should have also noticed by now that other people’s functions in packages are documented - there’s information about:

what the function does, at a high level
the objects it expects you to input
the object that the function outputs

You can document your own functions in the same way using roxygen2 tags, placed immediately above the function definition. Although roxygen2 tags are designed for use when creating R packages, they provide a standardized way to document a function, and they make it easy to migrate your function to an R package if need be!
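As a minimal sketch, roll_d10() from earlier could be documented like this. The wording of the title, description, and parameter text is my own illustration; @param, @return, and @examples are standard roxygen2 tags.

```r
#' Roll ten-sided dice
#'
#' Simulates rolling a number of D10s and adds up the faces.
#'
#' @param num_dice Number of dice to roll; a single non-negative whole number.
#'
#' @return A single integer: the total of all dice rolled.
#'
#' @examples
#' roll_d10(2)
roll_d10 <- function(num_dice) {
  sum(sample(1:10, num_dice, replace = TRUE))
}
```

The three pieces of information listed above map onto the tags: the title and description say what the function does, @param describes the input, and @return describes the output.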

Your turn: functions and tests, the basics

We think working through Worksheet B1 is a great place to go from here to learn the basics of how to define your own functions and how to test them.

Class 1

Haven’t attempted all of the questions on Worksheet B1? Then work on the ones you haven’t tried yet.
Put any questions you have about the worksheet questions or about functions in general in the Google Doc posted to Canvas.

If frequently asked questions emerge in the Google Doc, then we will discuss them together.

Once you’re done with Worksheet B1, tackle these follow-up challenge questions to check your understanding. If there are no questions to be answered about the worksheet in the Google Doc, then we will discuss these challenge questions.

Naming

Will R do anything to stop you from doing this?

cube_num <- function(num) { 
  num^2  
}

Will R do anything to stop you from writing a function where the input argument is named blahblahblah?
What happens if you do this? Can you think of any adverse consequences?

mean <- function(num) { 
  num^2  
}

Syntax

There are at least two other ways (structurally) to write roll_d10(). What are they? (Hint: one is showcased in the worksheet.)
What would the function return if I added this line of code before the sum() call in roll_d10()? sum(sample(1:4, num_dice, replace=TRUE))
There’s one function on the worksheet test cells that you haven’t used yet: expect_known_hash(). What is it, and when would it be useful?

Class 2

Agenda

We will learn about a few advanced topics:

Ellipses
Curly-curly
Default values

These topics are covered in the R4DS Functions book chapter as well. So if you miss this class, then the R4DS Functions reading is a good alternative.

Counting missing values by group

Let’s start by loading some packages.

library(palmerpenguins)
library(tidyverse)
library(gapminder)

Here’s some code that:

groups penguins by species, then summarizes the number of missing values in each variable.
groups gapminder by continent, then summarizes the number of missing values in each variable.

penguins %>% group_by(species) %>% 
  summarize(across(everything(), ~ sum(is.na(.x))))

## # A tibble: 3 × 8
##   species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>      <int>          <int>         <int>             <int>       <int>
## 1 Adelie         0              1             1                 1           1
## 2 Chinstrap      0              0             0                 0           0
## 3 Gentoo         0              1             1                 1           1
## # ℹ 2 more variables: sex <int>, year <int>

gapminder %>% group_by(continent) %>% 
  summarize(across(everything(), ~ sum(is.na(.x))))

## # A tibble: 5 × 6
##   continent country  year lifeExp   pop gdpPercap
##   <fct>       <int> <int>   <int> <int>     <int>
## 1 Africa          0     0       0     0         0
## 2 Americas        0     0       0     0         0
## 3 Asia            0     0       0     0         0
## 4 Europe          0     0       0     0         0
## 5 Oceania         0     0       0     0         0

Your turn: turn this code into a function

By yourself or in small groups, try to turn the code above into a function. Slack react to tell us either “I’m stuck!” or “I’m done!”

Instructor demo: curly-curly

We will construct a solution to the exercise.
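One possible solution, as a sketch rather than the definitive in-class answer. The function name count_missing_by_group() and the argument names data and group_var are assumptions of mine. The curly-curly operator {{ }} lets the user pass an unquoted column name through to group_by().

```r
library(dplyr)
library(palmerpenguins)

# Hypothetical name and arguments: `data` is a data frame,
# `group_var` is an unquoted column to group by.
count_missing_by_group <- function(data, group_var) {
  data %>%
    group_by({{ group_var }}) %>%   # embrace the argument so dplyr finds the column
    summarize(across(everything(), ~ sum(is.na(.x))))
}

# Reproduces the penguins summary from the code above.
missing_by_species <- count_missing_by_group(penguins, species)
missing_by_species
```

Without the curly-curly, group_by(group_var) would look for a column literally called group_var and fail.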

Your turn: curly-curly practice

Make a modification to our function: allow the user to also pass in which variables they want to summarize. (Right now it just summarizes all of them.) Slack react to tell us either “I’m stuck!” or “I’m done!”

Instructor demo: ellipses

We’ll modify our function using ellipses to get extra functionality: we’ll allow the user to group by more than one variable.
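A sketch of how the ellipsis version might look. The function name count_missing_by_group() is an assumption of mine, and for brevity this sketch leaves out the variable-selection argument from the previous exercise. Everything the user supplies after data is forwarded to group_by() through the dots.

```r
library(dplyr)
library(palmerpenguins)

# Hypothetical name: any number of unquoted grouping columns
# can be passed through `...`.
count_missing_by_group <- function(data, ...) {
  data %>%
    group_by(...) %>%               # forward all grouping columns as-is
    summarize(across(everything(), ~ sum(is.na(.x))))
}

# Group by two variables at once.
missing_by_species_island <- count_missing_by_group(penguins, species, island)
missing_by_species_island
```

Note that dots are forwarded as they are; no curly-curly is needed around them.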

Instructor demo: defaults

We’ll talk about when you might conceptually want to set a default value for function arguments, and then add a new argument to our function, .groups, whose default drops the groups from the output.
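A sketch of the default-value idea, under the assumption that the function is called count_missing_by_group() and takes its grouping columns through dots. The .groups argument is handed straight to summarize(), which accepts values such as "drop" and "keep".

```r
library(dplyr)
library(palmerpenguins)

# Hypothetical name. `.groups` defaults to "drop", so the result is
# ungrouped unless the user asks for something else.
count_missing_by_group <- function(data, ..., .groups = "drop") {
  data %>%
    group_by(...) %>%
    summarize(across(everything(), ~ sum(is.na(.x))), .groups = .groups)
}

# Default: grouping is dropped from the output.
res_default <- count_missing_by_group(penguins, species, island)
is_grouped_df(res_default)

# Override the default: keep both grouping variables.
res_keep <- count_missing_by_group(penguins, species, island, .groups = "keep")
group_vars(res_keep)
```

Because .groups comes after the dots, it has to be supplied by its full name. The leading dot in the name makes a clash with a column passed through the dots less likely.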

Attribution

Some of these notes were originally compiled by previous iterations of the instructional staff, including Vincenzo Coia.