Introduction to plotting with ggplot2

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(gapminder))
suppressPackageStartupMessages(library(scales))

This tutorial will get you warmed up to plotting with ggplot2 in R. It covers:

  • The plotting framework available in R
  • Why you should learn the ggplot2 tool
  • The importance of statistical graphics in communicating information
  • The seven components of the grammar of graphics underlying ggplot2
  • Geometric objects and aesthetics for exploring various plot types.

Orientation to plotting in R

Traditionally, plots in R are produced using “base R” methods, the crown function here being plot(). This method tends to be quite involved, and requires a lot of “coding by hand”.

Then, an R package called lattice was created that aimed to make it easier to create multiple “panels” of plots. It seems to have gone to the wayside in the R community. Personally, I found that using this package often involved several lines of code to set up a plot, which then needed to get overriden by “special cases”.

After lattice came ggplot2, which provides a very powerful and relatively simple framework for making plots. It has a theoretical underpinning, too, based on the Grammar of Graphics, first described by Leland Wilkinson in his “Grammar of Graphics” book. With ggplot2, you can make a great many type of plots with minimal code. It’s been a hit in and outside of the R community.

Check out this comparison of the three by Joseph V. Casillas.

A newer tool is called plotly, which was actually developed outside of R, but the plotly R package accesses the plotly functionality. Plotly graphs allow for interactive exploration of a plot. You can convert ggplot2 graphics to a plotly graph, too.

Just plot it

The human visual cortex is a powerful thing. If you’re wanting to point someone’s attention to a bunch of numbers, I can assure you that you won’t elicit any “aha” moments by displaying a large table like this, either in a report or (especially!) a presentation. Make a plot to communicate your message.

If you really feel the need to tell your audience exactly what every quantity evaluates to, consider putting your table in an appendix. Because chances are, the reader doesn’t care about the exact numeric values. Or, perhaps you just want to point out one or a few numbers, in which case you can put that number directly on a plot.

Challenger example from Jenny Bryan.

The grammar of graphics

As mentioned, ggplot2 is based on the grammar of graphics. You can think of the grammar of graphics as a systematic approach for describing the components of a graph. It has seven components (the ones in bold are required to be specified explicitly in ggplot2):

  • Data
    • Exactly as it sounds: the data that you’re feeding into a plot.
  • Aesthetic mappings
    • This is a specification of how you will connect variables (columns) from your data to a visual dimension. These visual dimensions are called “aesthetics”, and can be (for example) horizontal positioning, vertical positioning, size, colour, shape, etc.
  • Geometric objects
    • This is a specification of what object will actually be drawn on the plot. This could be a point, a line, a bar, etc.
  • Scales
    • This is a specification of how a variable is mapped to its aesthetic. Will it be mapped linearly? On a log scale? Something else?
  • Statistical transformations
    • This is a specification of whether and how the data are combined/transformed before being plotted. For example, in a bar chart, data are transformed into their frequencies; in a box-plot, data are transformed to a five-number summary.
  • Coordinate system
    • This is a specification of how the position aesthetics (x and y) are depicted on the plot. For example, rectangular/cartesian, or polar coordinates.
  • Facet
    • This is a specification of data variables that partition the data into smaller “sub plots”, or panels.

These components are like parameters of statistical graphics, defining the “space” of statistical graphics. In theory, there is a one-to-one mapping between a plot and its grammar components, making this a useful way to specify graphics.

Example: Scatterplot grammar

For example, consider the following plot from the gapminder data set. For now, don’t focus on the code, just the graph itself.

ggplot(gapminder, aes(gdpPercap, lifeExp)) +
  geom_point(alpha = 0.1) +
  scale_x_log10("GDP per capita", labels = scales::dollar_format()) +
  theme_bw() +
  ylab("Life Expectancy")

This scatterplot has the following components of the grammar of graphics.

Grammar Component Specification
data gapminder
aesthetic mapping x: gdpPercap, y: lifeExp
geometric object points
scale x: log10, y: linear
statistical transform none
coordinate system rectangular
facetting none

Note that x and y aesthetics are required for scatterplots (or “point” geometric objects). In general, each geometric object has its own required set of aesthetics.

Activity: Bar chart grammar

Fill out Exercise 1: Bar Chart Grammar (Together) in your worksheet.

Click here if you don’t have it yet.

Working with ggplot2

First, the ggplot2 package comes with the tidyverse meta-package. So, loading that is enough.

There are two main ways to interact with ggplot2:

  1. The qplot() or quickplot() functions (the two are identical): Useful for making a quick plot if you have vectors stored in your workspace that you’d like to plot. Usually not worthwhile using.
  2. The ggplot() function: use to access the full power of ggplot2.

Let’s use the above scatterplot as an example to see how to use the ggplot() function.

First, the ggplot() function takes two arguments: - data: the data frame containing your plotting data. - mapping: aesthetic mappings applying to the entire plot. Expecting the output of the aes() function.

Notice that the aes() function has x and y as its first two arguments, so we don’t need to explicitly name these aesthetics.

ggplot(gapminder, aes(gdpPercap, lifeExp))

This just initializes the plot. You’ll notice that the aesthetic mappings are already in place. Now, we need to add components by adding layers, literally using the + sign. These layers are functions that have further specifications.

For our next layer, let’s add a geometric object to the plot, which have the syntax geom_SOMETHING(). There’s a bit of overplotting, so we can specify some alpha transparency using the alpha argument (you can interpret alpha as neeing 1/alpha points overlaid to achieve an opaque point).

ggplot(gapminder, aes(gdpPercap, lifeExp)) +
  geom_point(alpha = 0.1)

That’s the only geom that we’re wanting to add. Now, let’s specify a scale transformation, because the plot would really benefit if the x-axis is on a log scale. These functions take the form scale_AESTHETIC_TRANSFORM(). As usual, you can tweak this layer, too, using this function’s arguments. In this example, we’re re-naming the x-axis (the first argument), and changing the labels to have a dollar format (a handy function thanks to the scales package).

ggplot(gapminder, aes(gdpPercap, lifeExp)) +
  geom_point(alpha = 0.1) +
  scale_x_log10("GDP per capita", labels = scales::dollar_format())

I’m tired of seeing the grey background, so I’ll add a theme() layer. I like theme_bw(). Then, I’ll re-label the y-axis using the ylab() function. Et voilà!

ggplot(gapminder, aes(gdpPercap, lifeExp)) +
  geom_point(alpha = 0.1) +
  scale_x_log10("GDP per capita", labels = scales::dollar_format()) +
  theme_bw() +
  ylab("Life Expectancy")

A tour of some important geoms

Here, we’ll explore some common plot types, and how to produce them with ggplot2.

Histograms: geom_histogram()

Useful for depicting the distribution of a continuous random variable. Partitions the number line into bins of certain width, counts the number of observations falling into each bin, and erects a bar of that height for each bin.

Required aesthetics:

  • x: A numeric vector.

By default, a histogram plots the count on the y-axis. If you want to use proportion, specify the y = ..prop.. aesthetic.

You can change the smoothness of the plot via two arguments (your choice):

  • bins: the number of bins/bars shown in the plot.
  • binwidth: the with of the bins shown on the plot.

Example:

ggplot(gapminder, aes(lifeExp)) +
  geom_histogram(bins = 50)

Density: geom_density()

Essentially, a “smooth” version of a histogram. Uses kernels to produce the curve.

Required aesthetics:

  • x: A numeric vector.

Good to know:

  • bw argument controls the smoothness: Smaller = rougher.

Example:

ggplot(gapminder, aes(lifeExp)) +
  geom_density()

Jitter plots: geom_jitter()

A scatterplot, but with minor random perturbations of each point. Useful for scatterplots where points are overlaying, or when one variable is categorical.

Required aesthetics:

  • x: any vector
  • y: any vector

Example:

ggplot(gapminder, aes(continent, lifeExp)) +
  geom_jitter()

Box plots: geom_boxplot()

This geom makes a boxplot for a numeric variable in each of a category. Useful for visualizing probability distributions across different categories.

Required aesthetics:

  • x: A factor (categorical variable)
  • y: A numeric variable

Example:

ggplot(gapminder, aes(continent, lifeExp)) +
  geom_boxplot()

Ridge plots: ggridges::geom_density_ridges()

A (superior?) alternative to the boxplot, the ridge plot (also known as the joy plot) places a kernel density for each group, instead of the box.

You’ll need to install the ggridges package. You can do lots more with ridges – check out the ggridges intro vignette.

Required aesthetics (reversed from boxplots!)

  • x: A numeric variable
  • y: A factor (categorical variable)

Example:

ggplot(gapminder, aes(lifeExp, continent)) +
  ggridges::geom_density_ridges()
## Picking joint bandwidth of 2.23

Bar plots: geom_bar() or geom_col()

These geom’s erect a bar over each category.

geom_bar() automatically determines the height of the bar according to the count of each category.

geom_col() requires a manual specification of the bar heights.

Required aesthetics:

  • x: A categorical variable
  • y: A numeric variable (only required for geom_col()!)
    • To use proportion in geom_bar() instead of count, set y = ..prop..

Example: number of 4-, 6-, and 8- cylinder cars in the mtcars dataset:

ggplot(mtcars, aes(cyl)) +
  geom_bar()

Line charts: geom_line()

A line plot connects points with straight lines, from left-to-right. Especially useful if time is on the x-axis.

Required aesthetics:

  • x: a variable having some ordering to it.
  • y: a numeric variable.

Although not required, the group aesthetic will come in handy here. This aesthetic produces a plot independently for each group, and overlays the results.

tsibble::as_tsibble(co2) %>% 
  rename(yearmonth = index,
         conc = value) %>% 
  mutate(month = lubridate::month(yearmonth, label = TRUE),
         year  = lubridate::year(yearmonth)) %>% 
  ggplot(aes(month, conc)) +
  geom_line(aes(group = year), alpha = 0.5) +
  ylab("CO2 Concentration")

Path plots: geom_path()

Like geom_line(), except connects points in the order that they appear in the dataset.