Base Plotting in R


Download the Code

There are extra Materials and code references to help you follow the tutorials in this lesson. A full set of code used in the tutorials can also be found on the GitHub repository. Be sure to check:

  • the Materials tab at the top of this Lesson
  • the GitHub Repository for this course: CovidRT

The Case for Data Visualization

“The ability to take data, to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it, that’s going to be a hugely important skill in the next decades.”

Google’s Chief Economist Dr. Hal R. Varian

While machine learning and artificial intelligence are getting all the attention, ask any recruiter in the field and they will agree on one thing: the ability “to extract value, visualize and communicate it” is still a hugely under-appreciated skill that many developers and data scientists lack.

Think about it: when was the last time you worked with a programmer (or developer, or computer scientist – take your pick!), were shown a presentation or report at the end of the project, and thought, “Wow! That was very clearly, succinctly, and elegantly presented. I fully understood your points because the report made a compelling case in the way the evidence was presented!”

A report by LinkedIn and another by Forbes, titled “Data Storytelling: The Essential Data Science Skill Everyone Needs” corroborate this thesis. LinkedIn, in that blog post, reported that data analysis is one of the hottest skill categories over the past two years for recruiters (granted the article is a bit dated, but still true), and it was the only category that consistently ranked in the top 4 across all of the countries they analyzed.

Forbes put it more bluntly, stating that much of the hiring emphasis has centered on data preparation and analysis skills, not the “last mile” (communication) skills that help convert insights into actions. Forbes went on to report,

“Many of the heavily-recruited individuals with advanced degrees in economics, mathematics, or statistics struggle with communicating their insights to others effectively – essentially, telling the story of their numbers.”

Forbes, Data Storytelling: The Essential Data Science Skill Everyone Needs

The unfortunate reality is that if your data isn’t understood, no one will act on it and no change will occur. At its core, mastering communication helps drive buy-in across different divisions, and visualization often acts as a catalyst for change. Perhaps, then, a constructive perspective is to not think of data visualization as a skill in itself.

Rather, think of it as your ability to communicate. A skill this fundamental has far-reaching implications and impact regardless of your profession, industry, or seniority in the organization. For brevity I will not delve much deeper into this topic, but Wikipedia has an interesting article on incorporating data in the production and distribution of information in the digital era, where it discusses the emergence of data journalism as a concept.

Numbers have an important story to tell. They rely on you to give them a clear and convincing voice.

Stephen Few

R makes it incredibly simple to create stunning visualizations and communication devices. Let’s get started.

Visualization in R

Open covidRT.Rproj and create a new R Markdown document. I’ll name it plottingBasics, but you can name it anything you want. Delete all the filler content except for the header, that is, the first few lines enclosed between the two --- lines. Save the file as plottingBasics.Rmd.

Your R Markdown should look like this (after deleting every line except the header):

---
title: "plottingBasics"
author: "Samuel"
date: "4/20/2020"
output: html_document
---

Download the cleaned CSV file from the course page (under Materials), and save a copy of it in your current folder. As a convention, I like to put my data files into a folder named data_input. Assuming my directory is on my Desktop and is called covidRT, and contains a project file called covidRT.Rproj, this is the current organization:

/Users/samuel/Desktop
└── covidRT
    ├── covidRT.Rproj
    ├── data_input
    │   └── covid_clean.csv
    ├── learnPlotting.Rmd
    ├── learnPlotting.html
    └── plottingBasics.Rmd

2 directories, 5 files

In plottingBasics.Rmd, add a new chunk, then add the following code into the chunk:

covid <- read.csv("data_input/covid_clean.csv")
head(covid)

Click on the run r chunk play icon on this chunk, and pay attention to the output.

  1. Line 1 uses R’s read.csv function to read a CSV (comma-separated values) file. If you take an Excel workbook and convert the XLS file to a CSV, you can call read.csv() and pass in the path to that file relative to your working directory. You can also pass in an absolute path, such as read.csv("C:\\ProgramFiles\\Samuel\\Documents\\Desktop\\...csv") (backslashes in Windows paths must be doubled inside an R string), or a URL, such as read.csv("https://raw.githubusercontent.com/onlyphantom/coronavirus/master/covid_clean.csv"), provided you have an internet connection.
    • By assigning the value of read.csv() to an object named covid, we have created this Data Frame in the environment.
  2. Line 2 calls head() on our covid object, which returns the first 6 rows of this Data Frame.
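If you don't have the course CSV at hand yet, the same read-then-inspect pattern can be tried on a throwaway file. The sketch below is only an illustration: the temporary file, the column names, and the country "Atlantis" are all made up and are not part of covid_clean.csv.

```r
# Hypothetical stand-in for covid_clean.csv: a few made-up rows in a temp file
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "date,country,confirmed",
  "2020-01-22,Atlantis,0",
  "2020-01-23,Atlantis,2",
  "2020-01-24,Atlantis,5"
), csv_path)

demo <- read.csv(csv_path)  # the path could equally be relative, absolute, or a URL
head(demo, n = 2)           # print only the first 2 rows
class(demo)                 # "data.frame"
```

Assigning the result to demo is what creates the data frame in your environment, just as covid does in the lesson.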

Data Frames in R

There are a few useful things to know about this Data Frame object. It’s a two-dimensional data structure, meaning it has rows and columns. Each column can be thought of as a list of values for one variable. This is not very different from how you work with tables in spreadsheet software.

Being a Data Frame (or dataframe), there are certain functions you can call on this covid object. For a data frame with the name my_data, calling head(my_data) returns the first 6 rows of the data. What if we’d like to see the last 6 rows instead? For that, there’s tail(my_data).

head() and tail() also accept an additional parameter n, for when you wish to see the first or last n rows of your data:

  • head(covid, n=7) prints the first 7 rows of data
  • tail(covid, n=3) prints the last 3 rows of data

If you want to know the number of rows and columns for a data frame named bad_loans, you can use the nrow() and ncol() functions in R:

  • nrow(bad_loans) prints the number of rows in our data frame
  • ncol(bad_loans) prints the number of columns in our data frame
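These inspection functions can be tried together on any data frame. Here is a small sketch using a toy data frame with made-up values (toy is a hypothetical name, standing in for covid or bad_loans); dim() is an extra base R function that reports rows and columns in one call.

```r
# Toy data frame with made-up values
toy <- data.frame(day = 1:10, cases = c(0, 0, 1, 1, 2, 5, 8, 13, 21, 34))

head(toy)         # first 6 rows (the default)
tail(toy, n = 3)  # last 3 rows
nrow(toy)         # 10
ncol(toy)         # 2
dim(toy)          # rows and columns at once: 10 2
```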

Remember the summary() function we saw in the earlier Lesson? Well, we can call summary() on our Data Frame as well:

  • summary(covid) prints a compact summary of our data frame:
       X                 date                      country        confirmed          deaths          recovered      
 Min.   :    1   2020-01-22:  185   Afghanistan        :   87   Min.   :     0   Min.   :    0.0   Min.   :    0.0  
 1st Qu.: 4024   2020-01-23:  185   Albania            :   87   1st Qu.:     0   1st Qu.:    0.0   1st Qu.:    0.0  
 Median : 8048   2020-01-24:  185   Algeria            :   87   Median :     1   Median :    0.0   Median :    0.0  
 Mean   : 8048   2020-01-25:  185   Andorra            :   87   Mean   :  2323   Mean   :  128.6   Mean   :  573.3  
 3rd Qu.:12072   2020-01-26:  185   Angola             :   87   3rd Qu.:    74   3rd Qu.:    1.0   3rd Qu.:    3.0  
 Max.   :16095   2020-01-27:  185   Antigua and Barbuda:   87   Max.   :699706   Max.   :36773.0   Max.   :83114.0  
                 (Other)   :14985   (Other)            :15573  

Among other things, it gives us a quick glance into our data frame. It shows me that, for at least the first few days in our data frame (2020-01-22 to 2020-01-27), every day we make 185 observations (rows). This is the number of countries that our data frame contains, from Afghanistan, Albania, … to Zambia and Zimbabwe. Each day, we have 185 records, one for each of these countries.

It also gives us some statistical measures: the minimum and maximum, the mean and median, and so on. We see that the largest single-day record in our data frame for deaths is 36,773. As a reminder, this data is cleaned and only updated as of 17th April 2020. By the time you’re following this tutorial, the actual numbers will have shot up. Later in this tutorial series, you will learn how to have your code query the real-time data directly from Johns Hopkins University’s repository.

While exploring the data, another useful technique is to print the “structure” of our data. This can be done using str(covid):

'data.frame':	16095 obs. of  6 variables:
 $ X        : int  1 2 3 4 5 6 7 8 9 10 ...
 $ date     : Factor w/ 87 levels "2020-01-22","2020-01-23",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ country  : Factor w/ 185 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ confirmed: int  0 0 0 0 0 0 0 0 0 0 ...
 $ deaths   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ recovered: int  0 0 0 0 0 0 0 0 0 0 ...

Among other things, it tells you:

  • 16095 observations (obs): our data has 16,095 rows. This can also be obtained using nrow(covid).
  • 6 variables: our data has 6 columns. This can also be obtained using ncol(covid).
  • The first few values of each variable (column)
  • The class (data type) of each variable. Here it thinks that country is a Factor and deaths is an integer.

But what is a Factor, and what is an integer? This brings us to the following topic.

Data Types in R

Integer vs Double (Numeric) in R

When we created our data frame covid by reading from a csv source using read.csv(), R read every numeric variable as either an integer or a numeric. In practice, the difference is that integer types can only contain whole numbers, like -1, 7, 400, 20.

Numeric is the more general type: it covers both the integer class and the double class, which holds double-precision floating-point numbers. If your data frame has two variables, UnitsSold and UnitPrice, you will likely have UnitsSold stored as an integer, since you will sell 1 unit of beer, or 2 units, or 12. It is unlikely you will ever sell 3.443 units of beer.

On the other hand, you will most likely store UnitPrice as a numeric, which allows the values to be either integer or double.
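A quick way to see the distinction is to ask R directly. This is a minimal sketch with made-up values; units_sold and unit_price are hypothetical vectors mirroring the UnitsSold and UnitPrice example above.

```r
units_sold <- c(1L, 2L, 12L)   # the L suffix creates integers
unit_price <- c(2.5, 3, 19.9)  # plain numbers are stored as doubles

class(units_sold)       # "integer"
class(unit_price)       # "numeric"
typeof(unit_price)      # "double"
is.numeric(units_sold)  # TRUE: integers count as numeric too
class(units_sold * unit_price)  # "numeric": mixing the two gives doubles
```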

Factors in R

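Factors are used in R to work with categorical variables. When we created our data frame covid by reading from a csv source using read.csv(), R read every character variable as a factor, which was the default behavior in R versions before 4.0.0. From R 4.0.0 onward the default is reversed, and character variables stay as characters unless you ask otherwise. Either way, you can control this explicitly when calling read.csv(): passing stringsAsFactors = FALSE reads character variables as characters, while stringsAsFactors = TRUE converts them to factors (use this on a newer R to reproduce the output shown in this lesson).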
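The effect of stringsAsFactors is easy to check on a tiny input. The sketch below passes made-up CSV content through the text argument of read.csv() instead of a file; the country names are invented for illustration.

```r
# Made-up CSV content, supplied as a string rather than a file
csv_text <- "country,confirmed\nAtlantis,3\nAtlantis,5\nElbonia,1"

as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_char   <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factor$country)   # "factor"
levels(as_factor$country)  # "Atlantis" "Elbonia"
class(as_char$country)     # "character"
```

Setting the parameter explicitly makes your code behave the same regardless of which R version runs it.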

Suppose our data contains the following columns. Which of them are integers, which are numeric, and which should be factors?

  • customer_fname: Adam, Drew, Eve
  • customer_lname: Smith, Mason, Gerken
  • sales_price: $26.10, $19.90, $40
  • sales_unit: 3, 2, 11
  • sales_quarter: Quarter 1, Quarter 3, Quarter 4
  • sales_month: Feb, Sep, Dec
  • email: adam@corporatesales.com, drew@insider.com, eve.g@yahoo.com
  • account: retail, retail, corporate
  • discount: 0.2, 0.15, 0

For the most part, figuring out the appropriate classes to store your data is something best done early on in your data analysis project. Suppose you figure out that a few of these variables need to be converted to the correct types; you can use R’s built-in functions to perform the conversion:

  • as.Date() converts the class of our vectors to the Date format
  • as.integer() converts the class of our vectors to the integer format
  • as.character() converts the class of our vectors to the character format
  • as.numeric() converts the class of our vectors to the numeric format
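As a rough sketch of how these conversions behave, the example below applies them to a few made-up character vectors modeled on the columns listed earlier. The variable names are hypothetical, and as.factor() and sub() are additional base R functions brought in for the illustration.

```r
date_txt  <- c("2020-01-22", "2020-01-23")
price_txt <- c("$26.10", "$19.90", "$40")
quarter   <- c("Quarter 1", "Quarter 3", "Quarter 4")

dates  <- as.Date(date_txt)   # dates written as year-month-day parse directly
prices <- as.numeric(sub("$", "", price_txt, fixed = TRUE))  # strip the $ first
units  <- as.integer(c("3", "2", "11"))
qtr    <- as.factor(quarter)

class(dates)   # "Date"
class(prices)  # "numeric"
class(units)   # "integer"
class(qtr)     # "factor"
diff(dates)    # date arithmetic now works: the two dates are 1 day apart
```

Note that a value like "$26.10" cannot be converted by as.numeric() until the currency symbol is removed; left as is, the conversion yields NA with a warning.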

Edit plottingBasics.Rmd so it looks like this (don’t copy the header):

---
title: "plottingBasics"
author: "Samuel"
date: "4/20/2020"
output: html_document
---

```{r reading-data}
covid <- read.csv("data_input/covid_clean.csv")
covid$date <- as.Date(covid$date)
covid <- covid[order(covid$date), 2:6]
str(covid)
```

```{r inspect}
head(covid)
```
  • Lines 1 to 6 are the header. It is created by default when you first create your R Markdown document
  • Line 9 reads the dataset and assigns the data frame to covid (or any name of your choice)
  • Line 10 performs the conversion to Date, and then assigns the resulting vector back to its original variable
    • Notice how we point to a variable within our dataset using the $ sign. In this case, covid$date points to the variable named date in our covid dataset. Whenever you perform class conversion, remember to assign the post-conversion output back to the original variable to overwrite the old values.
  • Line 11 reorders (sorts) the data frame by the date variable, so the first row is the earliest date and the last row is the latest date
    • The syntax R uses is data[row, column], where you apply your subsetting conditions. In our case, for column, we want the 2nd to 6th columns (discarding the first column, since it’s not helpful to our analysis). For row, we use order(covid$date) to instruct R to order our rows based on the date variable. If you’d rather have it in descending order (latest date at the top, earliest date at the bottom), you can add an additional parameter: order(covid$date, decreasing=TRUE).
  • Line 12 prints the structure of our data frame
  • Line 16 prints the first 6 rows of our data frame
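To see what order() and the data[row, column] syntax do in isolation, here is a small sketch on a toy data frame. The values and the id column are made up; id plays the role of the unhelpful first column that gets dropped.

```r
toy <- data.frame(
  id    = 1:4,
  date  = as.Date(c("2020-01-24", "2020-01-22", "2020-01-23", "2020-01-22")),
  cases = c(5, 0, 2, 1)
)

order(toy$date)                      # row positions sorted by date: 2 4 3 1
sorted <- toy[order(toy$date), 2:3]  # sort the rows, keep columns 2 to 3
newest_first <- toy[order(toy$date, decreasing = TRUE), 2:3]
sorted
```

order() does not return the sorted values themselves. It returns the row positions in sorted order, which is why it fits naturally in the row slot of the square brackets.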

Creating our First Plot

Instead of plotting all 185 countries on a single plot, let’s start small. We will take only a fraction of the rows from our data, specifically the ones where the value for country equals “US”. This is called a subsetting operation and can be done in one of two ways:

  • us_cases <- subset(covid, country == "US")
  • us_cases <- covid[covid$country == "US", ]

The first method uses the subset function explicitly, passing in (1) our data frame and (2) the condition. The second method uses the data[row_conditions, column_conditions] convention. In our case, we wanted all columns, but only the rows where country is equal (==) to “US”. Because we did not specify any subsetting condition in the column_conditions area, this will keep all columns. This is equivalent to:

  • us_cases <- covid[covid$country == "US", 1:5]

Notice that in RStudio’s Environment tab, you now have two data frames, us_cases and covid. You took a subset of the original data frame and assigned it to a new data frame named us_cases.
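If you'd like to convince yourself that the subsetting forms are equivalent, a check along these lines may help. It uses a toy data frame with made-up numbers rather than the course data.

```r
toy <- data.frame(
  country   = c("US", "Italy", "US", "Germany"),
  confirmed = c(1, 3, 5, 2)
)

a <- subset(toy, country == "US")       # explicit subset() call
b <- toy[toy$country == "US", ]         # bracket form, all columns
c_all <- toy[toy$country == "US", 1:2]  # bracket form, columns listed

all.equal(a, b)      # TRUE
all.equal(b, c_all)  # TRUE
nrow(a)              # 2
```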

Creating a plot in R can be as simple as using the plot() function and passing in the values to use for your x-axis and y-axis:

plot(x=us_cases$date, y=us_cases$confirmed)

This plot() function, as you may have guessed, takes additional parameters as well, allowing you greater control over your graphics. These are some common parameters that we’ll be using:

  • main: A main title for your graphical plot
  • sub: A subtitle for your graphic plot
  • xlab and ylab: The labels for your x- and y-axis, respectively
  • pch: Point character, the character for these “marker” / symbols on the plot (by default, a round circle)
  • cex: Character expansion, the size of your point characters (“markers”)
  • type: type of plot
  • col: color for your symbols

Go back to your R Markdown and try to change and adjust a few of these parameters from the following reference:

plot(x=us_cases$date, y=us_cases$confirmed, 
     pch=19, 
     cex=0.3, 
     main="Confirmed Cases vs Recovery in the US",
     sub="Data as of 17th April 2020",
     xlab="Date", ylab="")

For example, change the value of pch to a different value, and run your chunk. Observe the changes. Repeat for other parameters as well. A handy reference for all the different values for point characters (pch):

Credit: statmethods.net

Also give your plot a different main title and subtitle (main and sub respectively). The cex in the reference code above is set to 0.3; this is the factor by which the plotting symbol is magnified. Setting it to 2 means a 2x increase in size, while 0.3 shrinks it down to 30%.
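You can also draw your own reference chart of point characters. The following is one possible sketch, not the chart credited above: it lays out the symbols for pch values 0 to 25 on a grid and labels each with its number. The grid layout is an arbitrary choice.

```r
# Draw symbols 0 to 25 on a grid, labelling each with its pch number
codes <- 0:25
x <- codes %% 7        # column position, 0 to 6
y <- 3 - codes %/% 7   # row position, top row first
plot(x, y, pch = codes, cex = 2,
     xlim = c(-0.5, 6.5), ylim = c(-0.5, 3.5),
     xlab = "", ylab = "", axes = FALSE,
     main = "Point characters (pch) 0 to 25")
text(x, y - 0.35, labels = codes, cex = 0.8)
```

Here cex = 2 doubles the symbol size, while the labels use cex = 0.8 to sit slightly smaller than the default text.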

Practical Tips and Techniques for Plotting in R

Another detail from the reference code above is that it makes generous use of new lines instead of specifying all the parameters of the function call on one line. This makes your code more readable, and lets you add comments by starting a line with the # character. I like to use a tab (or 4 spaces) to indent each line for extra readability.

Consider the following code:

plot(x=us_cases$date, 
     # use log values instead
     y=log(us_cases$confirmed), 
     type="l", 
     col="darkgreen", 
     # box type
     bty="l", 
     # linetype
     lty="dashed")

This produces the following plot:

Even though you may not have been introduced to all the different plot parameters (there are a lot!), I’m sure you can agree that the comments have helped. I’ve incorporated a few changes:

  • Using log(us_cases$confirmed) instead of us_cases$confirmed for values on the y axis.
  • Setting type="l" sets the plot type to be “line” (instead of “points”, default)
  • Use col="darkgreen" to set the color of our line to dark green
    • There are 657 colors altogether. Call colors() to see all of R’s predefined colors by name.
  • Use bty="l" to set the bounding box to be L-shaped (so only lines on the left and bottom axis)
  • Use lty="dashed" to set the line type to “dashed”.
    • The possible values for lty are: "blank", "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash"

If you want some extra practice, modify the reference code above. Add a main title, a subtitle, and the x- and y-axis labels. Use colors() to print all 657 colors, and pick one of your choice for your line. I’ve also attached a quick reference for all colors in the Materials tab of this lesson.

# print only the first 6 colors
head(colors())
[1] "white"         "aliceblue"     "antiquewhite"  "antiquewhite1"
[5] "antiquewhite2" "antiquewhite3"
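Scanning 657 names by eye is tedious, so it can help to search them. This is a small sketch using base R functions; the search term "green" is just an example.

```r
length(colors())           # 657
green_names <- grep("green", colors(), value = TRUE)
head(green_names)          # first few color names containing "green"
"darkgreen" %in% colors()  # TRUE
col2rgb("darkgreen")       # red, green and blue components of a named color
```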

Once you’ve created a plot, you can also add new elements, such as additional lines (lines()) onto that plot. In the following example, we’ve first created a plot, and then use lines() to add a new line to it.

plot(us_cases$date, us_cases$confirmed, 
     pch=19, 
     cex=0.3, 
     main="Confirmed Cases vs Deaths in the US",
     sub="Data as of 17th April 2020",
     las=1,
     xlab="", ylab="",
     type="l", 
     col="darkgreen", lty="solid")
lines(us_cases$date, us_cases$deaths, col="cornsilk3", lwd=2, lty="dashed")

Formatting the axis using las and the scientific notation

Notice that the second line is a “dashed” one and has a slightly thicker line width (lwd=2). As you increase this number, the line becomes heavier as well. Another detail is that the labels on the y-axis are now horizontal (perpendicular to the axis), because we set las=1.

These labels on the y-axis also use the exponential notation (7e+05, 6e+05, …).

  • 1e+5 is 1*10^5 and if you punch this into a calculator you’ll get 100,000
  • 2e+5 is 2*10^5, which is 200,000
  • If you’d rather not use the exponential notation and instead display the values in fixed numeric form, you can use the options() function to control this.
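The switch between the two notations can be observed without drawing a plot. In this sketch, format() shows how R would print a number; the scipen option is a penalty that makes R favor fixed notation as it increases.

```r
old <- options(scipen = 0)       # R's default penalty
format(1e5)                      # "1e+05"
format(1e5, scientific = FALSE)  # "100000", forced for this one call

options(scipen = 1)              # favor fixed notation
format(1e5)                      # "100000"

# Fixed notation with a thousands separator
format(700000, big.mark = ",", scientific = FALSE)  # "700,000"

options(old)                     # restore the previous setting
```

Saving the return value of options() and restoring it afterwards keeps the change from leaking into the rest of your session.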

Adding Plot Legends

You can add as many lines as you want: once you have called plot() to create a new plot, each subsequent call to lines() adds a line to it.

You should use this with caution though, because too many lines on the plot may be confusing and overcrowd the plot without actually adding any level of clarity to the reader. To deal with this, there are numerous strategies:

  • Consider a different line type for each line
  • Consider a different color for each line
  • Consider adding a legend onto the plot to describe each line (by color, or type, for example)

Copy and run the following code in your R Markdown, and notice the addition of options(scipen=1) as well as a call to legend(). The other parts of the code remain largely the same as in previous examples in this lesson.

options(scipen=1)
plot(us_cases$date, us_cases$confirmed, 
     pch=19, 
     cex=0.3, 
     main="Confirmed Cases vs Recovery in the US",
     las=1,
     xlab="", ylab="")
lines(us_cases$date, us_cases$confirmed, col="cornsilk3", lwd=2)
lines(us_cases$date, us_cases$recovered, col="lightblue", lwd=2)
lines(us_cases$date, us_cases$deaths, col="lightpink", lwd=2)
legend("topleft", fill=c("cornsilk3", "lightblue", "lightpink"), legend=c("confirmed", "recovered", "deaths"))

Renders the following in your R Markdown document:

The call to legend (line 11) uses three parameters,

  • the first, "topleft" is the position of the legend
    • You can substitute this for "bottom", "top", "bottomleft", "bottomright", "topright" and observe the changes in your plot outcome
    • You can also use x- and y- coordinates, which we will explore in future lessons
  • the second is a vector of three colors (one for each line) used to fill the box next to each line of our legend text
  • the third is a vector of three character strings, each describing one line. Try changing the values to the capitalized version ("Deaths" instead of "deaths") and observe the changes in your plot outcome

Notice how the plot uses the fixed form instead of the exponential notation (line 1).
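When lines differ by type as well as by color, a legend that draws short line samples can be clearer than filled boxes. The sketch below is a variation on the approach above, not the lesson's own code: it uses made-up numbers, and swaps fill for the col, lty and lwd parameters of legend().

```r
# Made-up cumulative counts for illustration only
days      <- 1:10
confirmed <- cumsum(c(1, 2, 4, 6, 9, 12, 15, 20, 26, 30))
deaths    <- cumsum(c(0, 0, 1, 1, 1, 2, 2, 3, 3, 4))

plot(days, confirmed, type = "l", col = "darkgreen", lwd = 2,
     las = 1, xlab = "Day", ylab = "")
lines(days, deaths, col = "lightpink", lwd = 2, lty = "dashed")
legend("topleft",
       legend = c("confirmed", "deaths"),
       col = c("darkgreen", "lightpink"),
       lty = c("solid", "dashed"),
       lwd = 2,
       bty = "n")  # no box around the legend
```

The order of the values in legend, col and lty must match, since each position describes one line.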

Spend some time customizing this plot by practicing what you’ve learned. If you want an extra challenge, replicate this using data from a country other than the US. As a reminder, you can perform subsetting in one of two ways:

germany <- subset(covid, country=="Germany")
italy <- covid[covid$country == "Italy", ]

Create your plot, with your own colors (col), box types (bty), point character (pch), line types (lty), line width (lwd), main title (main) and subtitle (sub), x- and y- axis labels (xlab and ylab).
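If you find yourself repeating the same steps for several countries, one option is to wrap them in a small function. The helper below is purely hypothetical (plot_country is not part of the course code) and runs on a toy data frame with made-up numbers. It assumes the data has date, country and confirmed columns like covid does.

```r
# Hypothetical helper: subset one country, sort by date, draw a line plot
plot_country <- function(data, name, ...) {
  one <- data[data$country == name, ]
  if (nrow(one) == 0) stop("no rows found for country: ", name)
  one <- one[order(one$date), ]
  plot(one$date, one$confirmed, type = "l", las = 1,
       main = paste("Confirmed cases in", name),
       xlab = "", ylab = "", ...)  # extra arguments such as col or lty pass through
  invisible(one)                   # return the subset without printing it
}

# Toy data with made-up numbers
toy <- data.frame(
  date      = rep(as.Date("2020-03-01") + 0:4, times = 2),
  country   = rep(c("Germany", "Italy"), each = 5),
  confirmed = c(1, 2, 4, 7, 11, 3, 6, 12, 20, 31)
)
germany <- plot_country(toy, "Germany", col = "darkgreen", lwd = 2)
italy   <- plot_country(toy, "Italy", col = "cornsilk3", lty = "dashed")
```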

Remember, you can always hop over to the GitHub repository and download the full code and use it as a reference to guide you along as you develop your project.

When you’re done, move on to the next course to learn how to seriously supercharge your visualizations using a tool widely used in industry: ggplot2!

All pre-defined colors in R:

Source: Sape research group at the Faculty of Informatics of the University of Lugano
