Grammar of Graphics: ggplot2 in R

Download the Code

There are extra Materials and code references to help you follow the tutorials in this lesson. A full set of code used in the tutorials can also be found on the GitHub repository. Be sure to check:

  • the Materials tab at the top of this Lesson
  • the GitHub Repository for this course: CovidRT

Grammar of Graphics and ggplot2 in R

To understand the ggplot2 library, we need to go back to 1999, the year in which the book The Grammar of Graphics by Leland Wilkinson was published. This work built on earlier work by Bertin (1983) and rose to popularity when Hadley Wickham, a prolific contributor to the R community, released the ggplot2 library, which differs from the original idea by proposing an “alternative parameterization of the grammar”, based around the idea of constructing graphics from multiple layers of data.

The formalization of a grammar “enables us to concisely describe the components of a graphic”. Such a grammar allows us to move beyond named graphics (e.g., the “scatterplot”, “bar plot” etc.) and gain insight into the deep structure that underlies statistical graphics. Hadley wrote extensively on the elegance of this (link to article in the Materials tab).

It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.

ggplot2, Hadley Wickham

Getting Started with ggplot2

With ggplot2, plots are built with a system that lets the user add or alter plot components layer by layer, one at a time. We’ll work through some examples, so this idea will become a lot clearer in due time.

Start by installing the package by executing the following command in your R console:

install.packages("ggplot2")

You will see a few messages and then a confirmation that the package has been downloaded successfully:

trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.6/ggplot2_3.3.0.tgz'
Content type 'application/x-gzip' length 4018200 bytes (3.8 MB)
==================================================
downloaded 3.8 MB

The downloaded binary packages are in
	/var/folders/f6/c67vppf10d350p8d7b_k6nkr0000gn/T//RtmpDs5DkO/downloaded_packages

If you click on the Packages tab in RStudio and use the search bar to search for “ggplot” you will see it in the list of packages on your system (New to programming in R? Learn more about the RStudio interface here).

Create a new R Markdown document, deleting all the filler content except the header (Forgotten how to? Learn more about creating and using R Markdown here). Save it as learnggplot.Rmd.

Create your first chunk in the now-empty notebook and, in it, load the ggplot2 package you’ve just installed:

```{r}
library(ggplot2)
```

This loads the package into your environment, allowing you to use functions that are not part of base R but are provided by external (third-party) libraries such as ggplot2.

Since we’ll also need to prepare our data (because what is there to plot without the data?), we can reuse the same code that reads our csv into a data frame and performs the necessary subsetting operation, just as in the previous tutorial.

Our R Markdown at this point will look like this:

---
title: "learnggplot"
author: "Samuel"
date: "4/21/2020"
output: html_document
---

```{r}
options(scipen=1)
library(ggplot2)
covid <- read.csv("data_input/covid_clean.csv")
covid$date <- as.Date(covid$date)
covid <- covid[order(covid$date), 2:6]
northam <- subset(covid, country=="US" | country=="Canada")
```
  1. Notice that the install.packages() code is not in the notebook, since the package only needs to be installed onto your system once (and not every time the notebook runs)
  2. Notice that in the last line of the chunk (the subset() call), we use the vertical bar (|) character to add an OR condition to our subset operation
  3. Everything else is code we’ve covered in the previous lesson

On point (2), we can also use the & character to add an AND condition. We can of course chain as many “OR” and “AND” conditions as we want:

covid[covid$country == "Germany" & 
             covid$date >= as.Date("2020-04-15") |
             covid$recovered > 74000, ]

The code above subsets for all rows where the country is Germany and the date is on or after 15th April, plus any rows (from any country) where recovered is greater than 74,000. This is because R evaluates & before |, so the condition is grouped as (A & B) | C; add parentheses if you want a different grouping. Notice that we use as.Date("2020-04-15") to convert the character string into a date object, so that the comparison is explicitly between two dates.

While you can chain as many “OR” and “AND” conditions as you want, sometimes it’s easier to use the %in% operator.

In the following code, we create a data frame named euro5 by subsetting the covid data frame for any rows where the value of country is Germany OR Italy OR Spain OR France OR United Kingdom:
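To see how the grouping of & and | affects a subset, here is a minimal sketch on a made-up data frame. The name toy and all of its values are invented for illustration and are not taken from the course data:

```r
# Toy data, invented for illustration only (not the course dataset)
toy <- data.frame(
  country   = c("Germany", "Germany", "Italy", "Italy"),
  date      = as.Date(c("2020-04-10", "2020-04-16", "2020-04-16", "2020-04-10")),
  recovered = c(50000, 80000, 90000, 30000)
)

cutoff <- as.Date("2020-04-15")

# Without parentheses: (Germany AND date >= cutoff) OR recovered > 74000
no_parens <- toy[toy$country == "Germany" &
                   toy$date >= cutoff |
                   toy$recovered > 74000, ]

# With parentheses: Germany AND (date >= cutoff OR recovered > 74000)
with_parens <- toy[toy$country == "Germany" &
                     (toy$date >= cutoff |
                        toy$recovered > 74000), ]

nrow(no_parens)    # includes the Italian row with 90000 recovered
nrow(with_parens)  # only German rows can match
```

When in doubt, adding the parentheses explicitly costs nothing and makes the intent obvious to whoever reads the code next.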

euro5 <- subset(covid, country == "Germany" | 
                  country=="Italy" | 
                  country == "Spain" |
                  country == "France" |
                  country == "United Kingdom")

We get the same result with the %in% operator, which essentially says: subset for any rows where country is in a given vector of values:

euro5countries <- c("Germany", "Italy", "Spain", "France", "United Kingdom")
euro5 <- subset(covid, country %in% euro5countries)
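If you’d like to convince yourself that the two approaches really are equivalent, a small sketch like the following may help. It uses a made-up data frame (toy and wanted are hypothetical names, not part of the course code) so it can be run on its own:

```r
# Toy data, invented for illustration only
toy <- data.frame(
  country = c("Germany", "Italy", "Brazil", "Spain", "Japan"),
  deaths  = c(10, 20, 30, 40, 50)
)

wanted <- c("Germany", "Italy", "Spain")

# Chained OR conditions
by_or <- subset(toy, country == "Germany" | country == "Italy" | country == "Spain")

# The same filter written with %in%
by_in <- subset(toy, country %in% wanted)

# Compare the two results
identical(by_or, by_in)

# %in% returns one TRUE/FALSE per row, which is what subset() filters on
toy$country %in% wanted
```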

It’s your turn now. Use any methods above to create a data frame that includes Germany, Italy, Spain, France, UK, US, and Iran.

Creating your first ggplot visualization

All ggplot2 plots begin with a call to ggplot(), commonly with the data and aesthetic (aes) mappings specified.

In our example, we’ll pass the following arguments:

  • data=euro5
  • mapping=aes(x=date, y=deaths, color=country)

The data frame we wish to plot is named euro5, and we map (using mapping) the x-axis to the date column, the y-axis to the deaths column, and color to the country column within our data frame. Copy and paste the following code into your R Markdown and run the chunk:

ggplot(data=euro5, mapping=aes(x=date, y=deaths, color=country)) +
  geom_line()

This returns the following plot:

Assuming you have done the earlier exercise, you can substitute euro5 with the name of your data frame containing 7 countries, and your plot should look like the following:

The geom_line() in our code adds a geometric object, in this case a line layer, onto our plot. In subsequent examples, we’ll see more geom_ functions, each corresponding to a geometric object “layered” on top of our plot.

What happens when we call ggplot() without any arguments? Well, an empty “canvas” is shown. There are no coordinates because it doesn’t have any information to create one. It doesn’t know what the x- and y-coordinates are, the axis names, the scales or range etc. Try the following code for example:

ggplot()

This changes when you add the mappings. With the aesthetic mappings, the ggplot object knows that the x-axis corresponds to the date variable, and can work out its axis range, the ticks, and the axis name. Common aesthetics (apart from x and y) include:

  • color
  • size
  • shape

Notice that with the aesthetic mapping, a legend is automatically created for us as well.

To reinforce these concepts, let’s do a few more examples. Insert a chunk and add only the following code. Then run it:

plot1 <- ggplot(data=euro5, mapping=aes(x=date, y=deaths, color=country))
plot1

Pay attention to the plot. A coordinate system is created because we supplied the data and the aesthetic mappings (mapping). It is, however, an empty graph because we haven’t added any layers to it.

Insert a new chunk, and add the following code:

plot1 + geom_boxplot()

See what you just did? By adding a new layer (geom_boxplot) onto the ggplot object, you tell ggplot2 which geometric shapes to render on the coordinate system.

Geometric Objects and ggplot Layers

There are many geom_ functions and we’ll explore more of them in future exercises. So far, you’ve seen geom_line() and geom_boxplot(), both of which create a layer of geometric objects.
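One way to make the idea of “adding a layer” concrete is to count the layers stored in the plot object before and after adding a geom. The sketch below uses a made-up data frame (toy, base_plot and line_plot are hypothetical names), and it peeks at the layers element of the plot object, which is an internal detail of ggplot2 that I’m assuming stays accessible in your version:

```r
library(ggplot2)

# Toy data, invented for illustration only
toy <- data.frame(
  date    = rep(as.Date("2020-04-01") + 0:4, times = 2),
  deaths  = c(1, 3, 6, 10, 15, 2, 4, 5, 9, 12),
  country = rep(c("A", "B"), each = 5)
)

# Plot object with data and mappings, but no layers yet
base_plot <- ggplot(data = toy, mapping = aes(x = date, y = deaths, color = country))
length(base_plot$layers)   # number of layers before adding a geom

# Adding a geom returns a new plot object with one more layer
line_plot <- base_plot + geom_line()
length(line_plot$layers)   # number of layers after adding geom_line()
```

Note that base_plot itself is unchanged: the + operator returns a new plot object rather than modifying the original.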

We can rewrite the code in the earlier example this way:

plot1 <- ggplot(data=euro5)
plot1 + geom_boxplot(mapping=aes(x=date, y=deaths, color=country))

Running this code would yield exactly the same plot as earlier. You’ve only moved the aesthetic mapping from a plot object level (inherited by all the layers) down to the layer level.

Let’s see some real examples. Consider the following code, which specifies the aesthetic mapping at the plot object level. Here, geom_line() and geom_point() inherit the aesthetic mappings x, y and color. Notice how you can add multiple layers on top of each other under the ggplot system:

plot1 <- ggplot(data=euro5, mapping=aes(x=date, y=deaths, color=country))
plot1 + 
  geom_line() + 
  geom_point() 

Compare this to the following code, which yields exactly the same plot:

plot1 <- ggplot(data=euro5)
plot1 + 
  geom_line(mapping=aes(x=date, y=deaths, color=country)) + 
  geom_point(mapping=aes(x=date, y=deaths, color=country)) 

In the second method, we set the aesthetic mapping for each layer explicitly. This is redundant in our case, but it also gives us maximum flexibility. For example, we can remove the color mapping from the points layer. This gives us a plot where the line layer has a color mapping to the country but the point layer is exempt from this mapping.

plot1 <- ggplot(data=euro5)
plot1 + 
  geom_line(mapping=aes(x=date, y=deaths, color=country)) + 
  geom_point(mapping=aes(x=date, y=deaths)) 

Aesthetic Mappings in ggplot

Be aware that whatever you pass into the aesthetic mapping (aes()) is interpreted as the name of a variable. It creates a relationship between an aesthetic (color, size, shape etc.) and a variable in your data frame.

A common mistake novice R programmers make is to see color and go “hah! that’s where I’ll put my color”, resulting in output that is not what they wanted:

america <- subset(covid, country=="US")
# wrong: green (unquoted) is looked up as a variable, and no such column exists
ggplot(data=america) +
  geom_line(mapping=aes(x=date, y=recovered, col=green))

america <- subset(covid, country=="US")
# also wrong: "green" is treated as a constant data value, not a color, so the
# line takes the default palette color and a legend labelled "green" appears
ggplot(data=america) +
  geom_line(mapping=aes(x=date, y=recovered, col="green"))

Remember: anything you pass into aes() creates an aesthetic mapping between the aesthetic and a variable in your data frame.

If you want a green line, you have to put the color outside of the aesthetic mapping function. In the following code, the color of the line is not mapped to any variable in our data frame but is set to a fixed dark green value:

america <- subset(covid, country=="US")
ggplot(data=america) +
  geom_line(mapping=aes(x=date, y=recovered), col="darkgreen")
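To check the difference without relying on your eyes, you can ask ggplot2 which colors each version will actually draw with. This is a sketch on made-up data (toy, mapped and fixed are hypothetical names); it uses ggplot_build() to inspect the computed layer data, and assumes the color ends up in a column named colour, as it does in the ggplot2 versions I’ve used:

```r
library(ggplot2)

# Toy data, invented for illustration only
toy <- data.frame(
  date      = as.Date("2020-04-01") + 0:4,
  recovered = c(5, 9, 14, 20, 27)
)

# "green" inside aes(): treated as a data value, so the scale picks the color
mapped <- ggplot(toy) +
  geom_line(aes(x = date, y = recovered, color = "green"))

# "darkgreen" outside aes(): used directly as the line color
fixed <- ggplot(toy) +
  geom_line(aes(x = date, y = recovered), color = "darkgreen")

# ggplot_build() exposes the values each layer will actually draw with
mapped_colors <- unique(ggplot_build(mapped)$data[[1]]$colour)
fixed_colors  <- unique(ggplot_build(fixed)$data[[1]]$colour)

mapped_colors  # a color chosen by the default discrete scale, not "green"
fixed_colors   # the literal color we asked for
```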

We can make this example a lot more complicated if we want to. Run the following code and let’s examine the output:

plot1 <- ggplot(data=euro5, aes(x=date, y=deaths, color=country), size=1)
plot1 + 
  geom_line(aes(linetype=country), color="black", size=0.5) + 
  geom_point(size=0.3)

In line 1, we define our aesthetic mappings

  • Mapping the x and y axis to the date and deaths variables, respectively
  • Mapping color to the country variable in our data frame
  • size=1 is not an aesthetic mapping. It does not map to any existing column (variable) in our data frame, and because it is passed to ggplot() rather than to a geom_ function, it has no effect on the layers
  • Only the data and the mappings inside aes() (x, y and color) are inherited by the geometric layers; unless a layer supplies its own value for one of these, it uses the plot-level mapping defined here

In line 3, we added a layer

  • x, y and color are inherited by this layer, but the color mapping is overridden by this layer’s fixed color="black" (instead of the color mapping to country). The layer also sets a fixed size of 0.5 for the line thickness (from ggplot2 3.4.0 onwards, linewidth is the preferred argument for this)
  • Additionally, we create another aesthetic mapping for this layer, mapping linetype to country, thus giving each country its own line type

In line 4, we added a second layer

  • x, y and color are inherited by this layer, and the layer sets a fixed size of 0.3 for its points
  • Because the color is not overridden, this layer keeps the color mapping to country
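The override behaviour described above can be checked on a small made-up example. In this sketch (toy, p and built are hypothetical names, and the data is invented), the line layer replaces the inherited color mapping with a fixed value while the point layer keeps it; I’ve left out size so the example behaves the same across ggplot2 versions:

```r
library(ggplot2)

# Toy data, invented for illustration only
toy <- data.frame(
  date    = rep(as.Date("2020-04-01") + 0:2, times = 2),
  deaths  = c(1, 3, 6, 2, 4, 5),
  country = rep(c("A", "B"), each = 3)
)

p <- ggplot(toy, aes(x = date, y = deaths, color = country)) +
  geom_line(aes(linetype = country), color = "black") +  # fixed color overrides the mapping
  geom_point()                                            # inherits the color mapping

built <- ggplot_build(p)

line_colors  <- unique(built$data[[1]]$colour)  # colors used by the line layer
point_colors <- unique(built$data[[2]]$colour)  # colors used by the point layer

line_colors           # a single fixed color
length(point_colors)  # one color per country
```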

One-dimensional vs Two-dimensional visualization

Up to this point, we’ve created many visualizations using both the x and y axis. These are two-dimensional plots, and they offer insights into the relationship between multiple variables. When date is plotted against recovered, we can see the statistics on total recovery over a period of time.

Take a look (or run it in your R Markdown document) at the code below:

ggplot(data=covid, aes(x=deaths, y=recovered))+ 
  geom_point()

This is also a two-dimensional plot. It is a scatterplot, and it plots the number of deaths against the number of recovered patients. It communicates the relationship, and the correlation, between the two variables.

Consider the following code and the resulting histogram:

ggplot(data=covid[covid$confirmed > 20000, ], aes(x=deaths))+ 
  geom_histogram(binwidth=300)

This is a histogram, which communicates the distribution of the number of deaths in our dataset. In this case, we see that on the days where confirmed cases exceed 20,000, the number of deaths exhibits the distribution shown above. The graph makes no attempt to correlate one variable with another; the only variable that it uses is deaths, which is on our x axis (hence aes(x=deaths)).

binwidth is set to 300 in our histogram layer, resulting in “thinner” and more numerous bins, one for every interval of 300. Change this value to 1,000 and re-run your code, and observe that the bins are now in widths of 1,000, resulting in fewer (but wider) bins.

An alternative to the histogram is the frequency polygon. Both histograms and frequency polygons divide the x axis into bins and count the number of observations in each bin. The difference is that the histogram displays the counts with bars, whereas the frequency polygon displays them with lines:
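If you’d rather count the bins than eyeball them, the following sketch compares the two bin widths on made-up data (toy, narrow and wide are hypothetical names, and the values are evenly spaced purely for illustration). It relies on ggplot_build() returning one row per bar in the computed layer data:

```r
library(ggplot2)

# Toy data, invented for illustration only: evenly spaced values from 0 to 9000
toy <- data.frame(deaths = seq(0, 9000, by = 100))

narrow <- ggplot(toy, aes(x = deaths)) + geom_histogram(binwidth = 300)
wide   <- ggplot(toy, aes(x = deaths)) + geom_histogram(binwidth = 1000)

# Each row of the built layer data is one bar of the histogram
n_narrow <- nrow(ggplot_build(narrow)$data[[1]])
n_wide   <- nrow(ggplot_build(wide)$data[[1]])

# Smaller binwidth means more bars over the same range
c(binwidth_300 = n_narrow, binwidth_1000 = n_wide)
```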

ggplot(data=covid[covid$confirmed > 20000, ], aes(x=deaths))+ 
  geom_freqpoly(binwidth=300)

You can, of course, add the two layers one after another:

ggplot(data=covid[covid$confirmed > 20000, ], aes(x=deaths))+ 
  geom_histogram(binwidth=1000, fill="navy") +
  geom_freqpoly(binwidth=1000, col="purple")

Here’s a simple quiz: do you think the plot would look the same had we swapped the order and added the geom_freqpoly() layer before geom_histogram()?

Let’s do another example of a one-dimensional plot:

ggplot(data=covid[covid$confirmed > 20000, ])+ 
  geom_dotplot(mapping=aes(x=deaths), binwidth=400, stackdir="center")

We’re plotting exactly the same data, and again visualizing the distribution of the number of cumulative deaths on each day for each country. The plot communicates this underlying distribution, which is one level of information.

With some creativity, we can make our narrative a little richer. We can, for example, add a fill or color mapping to our dots, or map the size of our dots, or change the presentation of our dots, from being stacked to dodged as illustrated below:

Left image: colors and fills are mapped

Right image: colors and fills are mapped, and “dodge” each other

In the plots above, we color the points red if they correspond to dates where the cumulative number of deaths is higher than the number of recoveries. The code looks like this (full code available, refer to Code References above):

x <- covid[covid$confirmed > 20000, ]
# create a new variable in our data frame, name it cumudeaths
x$cumudeaths <- x$deaths > x$recovered
  
# for plot on the left
ggplot(data=x)+ 
  geom_dotplot(mapping=aes(x=deaths, fill=cumudeaths, color=cumudeaths), 
               binwidth=400, 
               stackdir="center")

# for plot on the right
ggplot(data=x)+ 
  geom_dotplot(mapping=aes(x=deaths, fill=cumudeaths, color=cumudeaths), 
               binwidth=400, 
               stackdir="center",
               position="dodge")

The difference really is in the last line: position="dodge" instructs the two sets of dots to align side by side rather than being stacked together.

Reshaping your data with reshape2

Data visualization often works in tandem with a set of reshaping techniques. Consider the illustration below. On the left, our data frame is said to be in a “long” format. Suppose we are monitoring the annual interest rate of two countries over 10 years: we would have 3 columns, but 2×10 = 20 rows, since each country has 10 measurements, one for each year.

On the top right, our data frame is said to be in a “wide” format. The same data would now be 2 rows, but 10 year columns wide (plus a column identifying the country). The bottom right is yet another representation of the data, and is also considered “wide” because as we add more countries, our dataset gets wider. A dataset with 11 countries would have 11 country columns in this wide format, whereas in the long (or “tall”) format it would still be only 3 columns (just extra rows).

There are a number of R packages that help us with the common task of reshaping. One such package is the reshape2 package. Search for it in the Packages tab in RStudio, and if it isn’t there, install it:

install.packages("reshape2")

Next, create a dataset by subsetting only for countries that have more than 10,000 confirmed cases as of the latest date in our data, 17th April 2020, taking each country’s data on that date.

newc <- subset(covid, confirmed > 10000 & date == as.Date("2020-04-17"))
tail(newc)

Printing the last few rows of data, you should get an output that looks like this:

13651	2020-04-17	Spain	190839	20002	74797
13999	2020-04-17	Sweden	13216	1400	550
14086	2020-04-17	Switzerland	27078	1327	16400
14869	2020-04-17	Turkey	78546	1769	8631
15217	2020-04-17	United Kingdom	109769	14607	394
15391	2020-04-17	US	699706	36773	58545

All that is left is to “melt” this wide data frame into a long one, using the melt() function:

# load reshape2 so that melt() is available
library(reshape2)
melted <- melt(newc[,2:5], id="country")
tail(melted)

Your dataset now looks like this:

64	Spain	        recovered	74797	
65	Sweden	        recovered	550	
66	Switzerland	    recovered	16400	
67	Turkey	        recovered	8631	
68	United Kingdom	recovered	394	
69	US	            recovered	58545	

The result has three columns: the first is the country (id="country"); the second is the measurement, which is now a factor with three possible values (confirmed, deaths, recovered); and the third is the value of that measurement for that country.

Inspecting the dimension of your data using dim() confirms that it is now a long data frame, with 69 rows and 3 columns:

dim(melted)
# returns: 69 3
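The long and wide layouts from the interest-rate illustration can also be reproduced in a few lines. This is only a sketch with made-up numbers: the countries “A” and “B”, the years and the rates are all invented, and dcast() (the reshape2 function that goes from long to wide) is introduced here purely to show the round trip:

```r
library(reshape2)

# Made-up interest rates for two hypothetical countries over 10 years
long_rates <- data.frame(
  country = rep(c("A", "B"), each = 10),
  year    = rep(2011:2020, times = 2),
  rate    = c(seq(1, 5.5, by = 0.5), seq(2, 6.5, by = 0.5))
)
dim(long_rates)    # 20 rows, 3 columns

# Long to wide: one row per country, one column per year
wide_rates <- dcast(long_rates, country ~ year, value.var = "rate")
dim(wide_rates)    # 2 rows, 11 columns (country + 10 years)

# Wide back to long with melt()
back_to_long <- melt(wide_rates, id.vars = "country",
                     variable.name = "year", value.name = "rate")
dim(back_to_long)  # 20 rows, 3 columns
```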

Labelling your ggplot using labs()

With the melted data frame, conveniently named melted, we’ll now create our ggplot, taking the opportunity to introduce another popular geometric object, geom_col().

ggplot(data=melted, aes(x=value, y=country, fill=variable)) + 
  geom_col()

The plot above uses our tall data frame melted, and sets the y-axis to country and the x-axis to value. We then add our geometric layer geom_col() to obtain the final plot.

Make one small adjustment to your plot earlier, by setting position="dodge" in your geom_col() layer. Re-run the code to see that the bars are now positioned differently (dodged instead of stacked):

ggplot(data=melted, aes(x=value, y=country, fill=variable)) + 
  geom_col(position="dodge")

Plot Title, Subtitle and Caption

Let’s give our plot a title. This is as simple as adding another layer to our plot; labs() allows you to modify the title, subtitle, caption and axis labels (think x- and y-axis labels). We’ll do a few exercises and this will become clear in due time. We’ll start by adding a title and a caption to the plot (feel free to give it a better title and caption than the ones provided in the code example):

ggplot(data=melted, aes(x=value, y=country, fill=variable)) + 
  geom_col() +
  labs(title="Covid-19 Pandemic, as of 17th April",
       caption="Source: JHU CSSE, graph by @NoahTheDev")

This generates the following plot:

But why stop at title and caption? You can also add a subtitle, as well as a tag, which according to its documentation “can be used for adding identification”, and of course x and y, corresponding to the axis labels. Here’s an example:

ggplot(data=melted, aes(x=value, y=country, fill=variable)) + 
  geom_col() +
  labs(title="Covid-19 Pandemic, by country", 
       subtitle="Number of cases as of 17th April 2020",
       caption="Source: JHU CSSE, graph by @NoahTheDev", 
       tag="illustration 1",
       x="", y="Country")

Spicing up ggplot with Themes

The ggplot2 package also ships with a number of themes, making it very easy to add some distinction or style to your plot. Try adding one of the following as a layer to your plot and observe the output:

  • theme_classic()
  • theme_minimal()
  • theme_linedraw()
ggplot(data=melted, aes(x=value, y=country, fill=variable)) + 
  geom_col() +
  labs(title="Covid-19 Pandemic, 2020", 
       subtitle="Cases, Deaths and Recovery by country",
       caption="Source: JHU CSSE, graph by @NoahTheDev") +
  theme_linedraw()

There are many ways to take theming in ggplot even further. There are also third-party packages that you can install to get even more pre-made themes. The two themes I used in the bottom rows are theme_economist() and theme_clean(), respectively, courtesy of ggthemes:

  1. Install ggthemes: install.packages("ggthemes")
  2. Load the library and you will get access to a whole lot more themes
library(ggthemes)
ggplot(data=melted, aes(x=value, y=country, fill=variable)) + 
  geom_col() +
  labs(title="Covid-19 Pandemic, 2020", 
       subtitle="--- line: Average of confirmed cases",
       caption="Source: JHU CSSE, graph by Finetut.com", 
       x="", y="") +
  theme_economist()

With a little more customization, you can obtain just about any result you want. In fact, the customizability doesn’t even stop there. You can also develop your own theme, and use your own fonts. In the plot below, I am using the New York font from Apple, along with my own set of colors, custom legend positions, and a few more adjustments.

ggplot example

Theme development deserves a whole new course on its own, so I won’t delve into it. However, interested readers can refer to the full code on my GitHub repository and use it as a baseline reference while you develop yours.

Final Words on ggplot

A full, comprehensive tutorial on data visualization and on the ggplot system covering everything from its graphing template, to aesthetic mappings, to facets, to the hundreds of combinations of geometric objects and layers, to its statistical transformations, positional adjustments, powerful coordinate systems and labelling system is out of the scope of our project.

That kind of coverage may merit a whole book, or series of books, on its own. Indeed, I have been authoring data visualization courses and materials.

Key Takeaways

  • Remember the graphing template: ggplot(data=x, aes(x=date,y=profit)), followed by one or more geom_ layers
  • Use aesthetic mappings to map aesthetic elements (color, size, shape etc.) to a variable in the data frame.
  • For an aesthetic that takes a fixed value (color, line style, or size), put it outside the aes() mapping function
  • Use labs() to specify any label elements, including the title, subtitle, caption, x and y axis
  • Use reshaping techniques such as melt() from the reshape2 package to get your data from wide to tall (the preferred format for ggplot2)
  • Use themes built into the ggplot2 package or develop your own (or use other R packages) to spice up your graphic!
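To tie these takeaways together, here is a compact sketch that goes from a wide data frame to a labelled, themed plot. Everything in it is made up for illustration (the countries “A”, “B” and “C”, the counts, and the names wide, long and p), and it assumes ggplot2 3.3.0 or later so that geom_col() can draw horizontal bars from an x=value, y=country mapping:

```r
library(ggplot2)
library(reshape2)

# Made-up counts for three hypothetical countries (wide format)
wide <- data.frame(
  country   = c("A", "B", "C"),
  confirmed = c(1000, 800, 600),
  deaths    = c(50, 30, 20),
  recovered = c(400, 500, 100)
)

# Wide to tall: one row per country and measurement
long <- melt(wide, id.vars = "country")

p <- ggplot(long, aes(x = value, y = country, fill = variable)) +  # mapped aesthetics
  geom_col(position = "dodge", color = "black") +                  # fixed outline color
  labs(title = "Cases by country (toy data)", x = "", y = "Country") +
  theme_minimal()

p
```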

