# Introduction to plotting with ggplot2

```r
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(gapminder))
suppressPackageStartupMessages(library(scales))
```

This tutorial will get you warmed up to plotting with `ggplot2` in R. It covers:

• The plotting framework available in R
• Why you should learn the `ggplot2` tool
• The importance of statistical graphics in communicating information
• The seven components of the grammar of graphics underlying `ggplot2`
• Geometric objects and aesthetics for exploring various plot types.

## Orientation to plotting in R

Traditionally, plots in R are produced using “base R” methods, the crown function here being `plot()`. This method tends to be quite involved, and requires a lot of “coding by hand”.

Then, an R package called `lattice` was created that aimed to make it easier to create multiple “panels” of plots. It seems to have fallen by the wayside in the R community. Personally, I found that using this package often involved several lines of code to set up a plot, which then needed to be overridden by “special cases”.

After `lattice` came `ggplot2`, which provides a very powerful and relatively simple framework for making plots. It has a theoretical underpinning, too, based on the grammar of graphics, first described by Leland Wilkinson in his book “The Grammar of Graphics”. With `ggplot2`, you can make a great many types of plots with minimal code. It’s been a hit both inside and outside the R community.

Check out this comparison of the three by Joseph V. Casillas.

A newer tool is called plotly, which was actually developed outside of R, but the `plotly` R package accesses the plotly functionality. Plotly graphs allow for interactive exploration of a plot. You can convert ggplot2 graphics to a plotly graph, too.

## Just plot it

The human visual cortex is a powerful thing. If you’re wanting to point someone’s attention to a bunch of numbers, I can assure you that you won’t elicit any “aha” moments by displaying a large table like this, either in a report or (especially!) a presentation. Make a plot to communicate your message.

If you really feel the need to tell your audience exactly what every quantity evaluates to, consider putting your table in an appendix. Because chances are, the reader doesn’t care about the exact numeric values. Or, perhaps you just want to point out one or a few numbers, in which case you can put that number directly on a plot.

## The grammar of graphics

As mentioned, `ggplot2` is based on the grammar of graphics. You can think of the grammar of graphics as a systematic approach for describing the components of a graph. It has seven components (the ones in bold are required to be specified explicitly in `ggplot2`):

• **Data**
  • Exactly as it sounds: the data that you’re feeding into a plot.
• **Aesthetic mappings**
  • This is a specification of how you will connect variables (columns) from your data to a visual dimension. These visual dimensions are called “aesthetics”, and can be (for example) horizontal positioning, vertical positioning, size, colour, shape, etc.
• **Geometric objects**
  • This is a specification of what object will actually be drawn on the plot. This could be a point, a line, a bar, etc.
• Scales
  • This is a specification of how a variable is mapped to its aesthetic. Will it be mapped linearly? On a log scale? Something else?
• Statistical transformations
  • This is a specification of whether and how the data are combined/transformed before being plotted. For example, in a bar chart, data are transformed into their frequencies; in a box-plot, data are transformed to a five-number summary.
• Coordinate system
  • This is a specification of how the position aesthetics (x and y) are depicted on the plot. For example, rectangular/cartesian, or polar coordinates.
• Facet
  • This is a specification of data variables that partition the data into smaller “sub plots”, or panels.

These components are like parameters of statistical graphics, defining the “space” of statistical graphics. In theory, there is a one-to-one mapping between a plot and its grammar components, making this a useful way to specify graphics.

### Example: Scatterplot grammar

For example, consider the following plot from the `gapminder` data set. For now, don’t focus on the code, just the graph itself.

```r
ggplot(gapminder, aes(gdpPercap, lifeExp)) +
  geom_point(alpha = 0.1) +
  scale_x_log10("GDP per capita", labels = scales::dollar_format()) +
  theme_bw() +
  ylab("Life Expectancy")
```

This scatterplot has the following components of the grammar of graphics.

| Grammar Component | Specification |
|---|---|
| data | `gapminder` |
| aesthetic mapping | x: `gdpPercap`, y: `lifeExp` |
| geometric object | points |
| scale | x: log10, y: linear |
| statistical transform | none |
| coordinate system | rectangular |
| faceting | none |

Note that `x` and `y` aesthetics are required for scatterplots (or “point” geometric objects). In general, each geometric object has its own required set of aesthetics.
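To make the link between the table and `ggplot2` a little more concrete, here is a sketch that spells out every grammar component in code, including the ones `ggplot2` normally fills in by default. The explicit `stat = "identity"`, `scale_y_continuous()`, `coord_cartesian()`, and `facet_null()` calls, as well as the object name `p_explicit`, are additions for illustration only; leaving those calls out should give the same plot.

```r
library(ggplot2)
library(gapminder)

# Every row of the grammar table, written out explicitly.
p_explicit <- ggplot(
  data    = gapminder,                           # data
  mapping = aes(x = gdpPercap, y = lifeExp)      # aesthetic mappings
) +
  geom_point(stat = "identity", alpha = 0.1) +   # geometric object; no statistical transform
  scale_x_log10() +                              # scale: x on log10
  scale_y_continuous() +                         # scale: y linear
  coord_cartesian() +                            # coordinate system: rectangular
  facet_null()                                   # faceting: none

p_explicit

# Each geom object records the aesthetics it cannot do without.
GeomPoint$required_aes
```

Printing `required_aes` for another geom object is a quick way to check what that geom insists on before you start plotting.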

### Activity: Bar chart grammar

Fill out Exercise 1: Bar Chart Grammar (Together) in your worksheet.

## Working with `ggplot2`

First, the `ggplot2` package comes with the `tidyverse` meta-package. So, loading that is enough.

There are two main ways to interact with `ggplot2`:

1. The `qplot()` or `quickplot()` functions (the two are identical): Useful for making a quick plot if you have vectors stored in your workspace that you’d like to plot. Usually not worthwhile using.
2. The `ggplot()` function: use to access the full power of `ggplot2`.

Let’s use the above scatterplot as an example to see how to use the `ggplot()` function.

First, the `ggplot()` function takes two arguments:

• `data`: the data frame containing your plotting data.
• `mapping`: aesthetic mappings applying to the entire plot. Expecting the output of the `aes()` function.

Notice that the `aes()` function has `x` and `y` as its first two arguments, so we don’t need to explicitly name these aesthetics.

```r
ggplot(gapminder, aes(gdpPercap, lifeExp))
```

This just initializes the plot. You’ll notice that the aesthetic mappings are already in place. Now, we need to add components by adding layers, literally using the `+` sign. These layers are functions that have further specifications.
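Because a plot is built up with `+`, a partially built plot can be stored in a variable and extended later. The sketch below illustrates this, and also checks that naming the `x` and `y` aesthetics gives the same mapping as relying on argument order. The object names (`p_base`, `p_points`, `p_faceted`) and the choice to facet by continent are purely illustrative.

```r
library(ggplot2)
library(gapminder)

# Positional and named aesthetics produce the same mapping names.
names(aes(gdpPercap, lifeExp))
names(aes(x = gdpPercap, y = lifeExp))

# A plot is an ordinary R object, so it can be stored and extended later.
p_base <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))

p_points  <- p_base + geom_point(alpha = 0.1)      # add a geometric object
p_faceted <- p_points + facet_wrap(~ continent)    # then one panel per continent

p_points
p_faceted
```

Changing a single component (here, the facet) while reusing everything else is one practical payoff of thinking in terms of the grammar.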

For our next layer, let’s add a geometric object to the plot. These functions have the syntax `geom_SOMETHING()`. There’s a bit of overplotting, so we can specify some alpha transparency using the `alpha` argument (you can interpret `alpha` as needing `1/alpha` points overlaid to achieve an opaque point).

```r
ggplot(gapminder, aes(gdpPercap, lifeExp)) +
  geom_point(alpha = 0.1)
```

That’s the only `geom` that we’re wanting to add. Now, let’s specify a scale transformation, because the plot would really benefit if the x-axis is on a log scale. These functions take the form `scale_AESTHETIC_TRANSFORM()`. As usual, you can tweak this layer, too, using this function’s arguments. In this example, we’re re-naming the x-axis (the first argument), and changing the labels to have a dollar format (a handy function thanks to the `scales` package).

```r
ggplot(gapminder, aes(gdpPercap, lifeExp)) +
  geom_point(alpha = 0.1) +
  scale_x_log10("GDP per capita", labels = scales::dollar_format())
```

I’m tired of seeing the grey background, so I’ll add a `theme()` layer. I like `theme_bw()`. Then, I’ll re-label the y-axis using the `ylab()` function. Et voilà!

```r
ggplot(gapminder, aes(gdpPercap, lifeExp)) +
  geom_point(alpha = 0.1) +
  scale_x_log10("GDP per capita", labels = scales::dollar_format()) +
  theme_bw() +
  ylab("Life Expectancy")
```

## A tour of some important `geom`s

Here, we’ll explore some common plot types, and how to produce them with `ggplot2`.

### Histograms: `geom_histogram()`

Useful for depicting the distribution of a continuous random variable. Partitions the number line into bins of certain width, counts the number of observations falling into each bin, and erects a bar of that height for each bin.

Required aesthetics:

• `x`: A numeric vector.

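By default, a histogram plots the count on the y-axis. If you want proportions instead, map `y = after_stat(count / sum(count))`; for a density scale, where the bar areas sum to one, map `y = after_stat(density)`. (The binning stat does not compute a `prop` variable, and the older `..name..` notation is deprecated in favour of `after_stat()`.)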

You can change the smoothness of the plot via two arguments (your choice):

• `bins`: the number of bins/bars shown in the plot.
• `binwidth`: the width of the bins shown on the plot.
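As a sketch of the `binwidth` option, and of putting something other than the count on the y-axis, the following builds two histograms of life expectancy. The bin width of 5 years and the object names are arbitrary choices for illustration; `after_stat(density)` refers to the `density` variable computed by the binning step.

```r
library(ggplot2)
library(gapminder)

# Fix the bin width in data units (years) instead of the number of bins.
p_width <- ggplot(gapminder, aes(lifeExp)) +
  geom_histogram(binwidth = 5)

# Same bins, but with density on the y-axis, so the bar areas sum to 1.
p_density <- ggplot(gapminder, aes(lifeExp)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5)

p_width
p_density
```

The two plots have the same shape; only the y-axis scale differs.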

Example:

```r
ggplot(gapminder, aes(lifeExp)) +
  geom_histogram(bins = 50)
```

### Density: `geom_density()`

Essentially, a “smooth” version of a histogram. Uses kernels to produce the curve.

Required aesthetics:

• `x`: A numeric vector.

Good to know:

• `bw` argument controls the smoothness: Smaller = rougher.
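To see the effect of `bw`, one option is to overlay two density estimates with different bandwidths. This is only a sketch: the bandwidths of 1 and 5 (in years), the colours, and the object name `p_bw` are arbitrary illustrative choices.

```r
library(ggplot2)
library(gapminder)

# Overlay a small and a large bandwidth to compare smoothness.
p_bw <- ggplot(gapminder, aes(lifeExp)) +
  geom_density(bw = 1, colour = "firebrick") +   # small bandwidth: rougher curve
  geom_density(bw = 5, colour = "steelblue")     # large bandwidth: smoother curve

p_bw
```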

Example:

```r
ggplot(gapminder, aes(lifeExp)) +
  geom_density()
```

### Jitter plots: `geom_jitter()`

A scatterplot, but with minor random perturbations of each point. Useful for scatterplots where points overlap, or when one variable is categorical.

Required aesthetics:

• `x`: any vector
• `y`: any vector
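When one variable is categorical, it can make sense to jitter only along that axis, so that the numeric values are not disturbed. The sketch below does this with `position_jitter()`; the `width`, `seed`, and `alpha` values and the object name `p_jitter` are arbitrary illustrative choices.

```r
library(ggplot2)
library(gapminder)

# Jitter only horizontally: continent is categorical, so sideways noise is harmless,
# while height = 0 leaves the life expectancy values untouched.
# Setting a seed makes the random perturbations reproducible.
p_jitter <- ggplot(gapminder, aes(continent, lifeExp)) +
  geom_jitter(
    position = position_jitter(width = 0.2, height = 0, seed = 1),
    alpha = 0.3
  )

p_jitter
```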

Example:

```r
ggplot(gapminder, aes(continent, lifeExp)) +
  geom_jitter()
```

### Box plots: `geom_boxplot()`

This geom makes a boxplot of a numeric variable for each category. Useful for visualizing probability distributions across different categories.

Required aesthetics:

• `x`: A factor (categorical variable)
• `y`: A numeric variable
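Each box is drawn from a handful of summary statistics computed per category. As a sketch, `layer_data()` can be used to look at the values `ggplot2` computed for the boxplot layer; the object names `p_box` and `box_stats` are illustrative.

```r
library(ggplot2)
library(gapminder)

p_box <- ggplot(gapminder, aes(continent, lifeExp)) +
  geom_boxplot()

# layer_data() returns the data computed for a layer: one row per continent,
# with the whisker ends (ymin, ymax), the hinges (lower, upper),
# and the median (middle).
box_stats <- layer_data(p_box, 1)
box_stats[, c("x", "ymin", "lower", "middle", "upper", "ymax")]
```

Note that `ymin` and `ymax` are the ends of the whiskers, which need not be the smallest and largest observations; points beyond the whiskers are drawn individually as outliers.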

Example:

```r
ggplot(gapminder, aes(continent, lifeExp)) +
  geom_boxplot()
```

### Ridge plots: `ggridges::geom_density_ridges()`

A (superior?) alternative to the boxplot, the ridge plot (also known as the joy plot) places a kernel density for each group, instead of the box.

You’ll need to install the `ggridges` package. You can do lots more with ridges – check out the ggridges intro vignette.

Required aesthetics (reversed from boxplots!)

• `x`: A numeric variable
• `y`: A factor (categorical variable)

Example:

```r
ggplot(gapminder, aes(lifeExp, continent)) +
  ggridges::geom_density_ridges()
```

```
## Picking joint bandwidth of 2.23
```

### Bar plots: `geom_bar()` or `geom_col()`

These geoms erect a bar over each category.

`geom_bar()` automatically determines the height of the bar according to the count of each category.

`geom_col()` requires a manual specification of the bar heights.

Required aesthetics:

• `x`: A categorical variable
• `y`: A numeric variable (only required for `geom_col()`!)
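• To use proportion in `geom_bar()` instead of count, set `y = after_stat(prop)` (written `..prop..` in older code). When `x` is categorical, also set `group = 1`; otherwise each bar is its own group and gets a proportion of 1.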
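The relationship between the two geoms can be sketched by drawing the same bar chart both ways, and then switching to proportions. Converting `cyl` to a factor, computing the counts with base R's `table()`, and the object names are illustrative choices here, not the only way to do it.

```r
library(ggplot2)

# geom_bar() counts the rows in each category for you.
p_bar <- ggplot(mtcars, aes(factor(cyl))) +
  geom_bar()

# geom_col() expects the bar heights to be in the data already,
# so the counts are computed first.
cyl_counts <- as.data.frame(table(cyl = mtcars$cyl), responseName = "n")
p_col <- ggplot(cyl_counts, aes(cyl, n)) +
  geom_col()

# Proportions instead of counts: group = 1 makes the proportions
# relative to the whole data set rather than to each bar.
p_prop <- ggplot(mtcars, aes(factor(cyl))) +
  geom_bar(aes(y = after_stat(prop), group = 1))

p_bar
p_col
p_prop
```

`p_bar` and `p_col` should show bars of identical height; `geom_col()` is the better fit when the heights come from a summary you have already computed.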

Example: number of 4-, 6-, and 8- cylinder cars in the `mtcars` dataset:

```r
ggplot(mtcars, aes(cyl)) +
  geom_bar()
```

### Line charts: `geom_line()`

A line plot connects points with straight lines, from left-to-right. Especially useful if time is on the x-axis.

Required aesthetics:

• `x`: a variable having some ordering to it.
• `y`: a numeric variable.

Although not required, the `group` aesthetic will come in handy here. This aesthetic produces a plot independently for each group, and overlays the results.

```r
tsibble::as_tsibble(co2) %>%
  rename(yearmonth = index,
         conc = value) %>%
  mutate(month = lubridate::month(yearmonth, label = TRUE),
         year  = lubridate::year(yearmonth)) %>%
  ggplot(aes(month, conc)) +
  geom_line(aes(group = year), alpha = 0.5) +
  ylab("CO2 Concentration")
```

### Path plots: `geom_path()`

Like `geom_line()`, except connects points in the order that they appear in the dataset.
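The difference is easiest to see with data whose x values are not in increasing order. The sketch below uses points on a circle, a made-up toy data set chosen only for illustration (as are the object names), and draws it with both geoms.

```r
library(ggplot2)

# Points on a circle, stored in the order in which they are traced out.
angle  <- seq(0, 2 * pi, length.out = 50)
circle <- data.frame(x = cos(angle), y = sin(angle))

# geom_path() follows the row order, so it draws the circle.
p_path <- ggplot(circle, aes(x, y)) +
  geom_path() +
  coord_fixed()

# geom_line() connects the points from left to right (ordered by x),
# which produces a zigzag between the upper and lower halves.
p_line <- ggplot(circle, aes(x, y)) +
  geom_line() +
  coord_fixed()

p_path
p_line
```

If the order of the rows carries meaning (for example, a trajectory through time), `geom_path()` is usually the one you want.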