There are extra Materials and code references to help you follow the tutorials in this lesson. A full set of code used in the tutorials can also be found on the GitHub repository. Be sure to check:
“The ability to take data, to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it, that’s going to be a hugely important skill in the next decades.”
Google’s Chief Economist Dr. Hal R. Varian
While machine learning and artificial intelligence are getting all the attention, ask any recruiter in the field and they will agree on one thing: the ability “to extract value, visualize and communicate it” is still a hugely under-appreciated skill that many developers and data scientists lack.
Think about it: when was the last time you worked with a programmer (or developer, or computer scientist, take your pick!), were shown a presentation or report at the end of the project, and thought, “Wow! That was very clearly, succinctly, and elegantly presented. I fully understood your points because the report made a compelling case in the way the evidence was presented!”
A report by LinkedIn and another by Forbes, titled “Data Storytelling: The Essential Data Science Skill Everyone Needs”, corroborate this thesis. LinkedIn, in its blog post, reported that data analysis was one of the hottest skill categories for recruiters over the preceding two years (granted, the article is a bit dated, but the point still holds), and that it was the only category that consistently ranked in the top 4 across all of the countries they analyzed.
Forbes puts it more bluntly, stating that much of the hiring emphasis has centered on data preparation and analysis skills, not the “last mile” (communication) skills that help convert insights into actions. Forbes continues:
“Many of the heavily-recruited individuals with advanced degrees in economics, mathematics, or statistics struggle with communicating their insights to others effectively – essentially, telling the story of their numbers.”
Forbes, Data Storytelling: The Essential Data Science Skill Everyone Needs
The unfortunate reality is that if your data isn’t understood, no one will act on it and no change will occur. At its core, mastering communication helps drive buy-in across different divisions, and visualization often plays the role of a catalyst for change. Perhaps, then, a constructive perspective is to not think of data visualization as a skill in itself.
Rather, think of it as your ability to communicate: a skill so fundamental has far-reaching implications and impact regardless of your profession, industry, or seniority in the organization. For brevity, I will not delve much deeper into this topic, but Wikipedia has an interesting article on incorporating data in the production and distribution of information in the digital era, where it discusses the emergence of data journalism as a concept.
Numbers have an important story to tell. They rely on you to give them a clear and convincing voice.
Stephen Few
R makes it incredibly simple to create stunning visualizations and communication devices. Let’s get started.
Open covidRT.Rproj and create a new R Markdown document. I’ll name it plottingBasics (you can name it anything you want). Delete everything in the filler content except for the header, that is, the first few lines between the enclosing --- lines. Save the file as plottingBasics.Rmd.
Your R Markdown should look like this (after deleting all the other lines except for the header):
---
title: "plottingBasics"
author: "Samuel"
date: "4/20/2020"
output: html_document
---
Download the cleaned CSV file from the course page (under Materials), and then save a copy of it in your current folder. As a convention, I like to put my data files into a folder named data_input. Assuming my directory is on my Desktop and is called covidRT, and contains a project file called covidRT.Rproj, this is the current organization:
/Users/samuel/Desktop
└── covidRT
    ├── covidRT.Rproj
    ├── data_input
    │   └── covid_clean.csv
    ├── learnPlotting.Rmd
    ├── learnPlotting.html
    └── plottingBasics.Rmd
2 directories, 5 files
In plottingBasics.Rmd, add a new chunk, then add the following code into the chunk:
covid <- read.csv("data_input/covid_clean.csv")
head(covid)
Click on the run icon on this chunk, and pay attention to the output.
We use the read.csv() function to read a CSV (comma-separated values) file. If you take an Excel workbook and convert the XLS file to a CSV, you can use read.csv() and then pass in the path to that file relative to your working directory. You can optionally pass in an absolute path, such as read.csv("C:\\ProgramFiles\\Samuel\\Documents\\Desktop\\...csv"), or even a URL, such as read.csv("https://raw.githubusercontent.com/onlyphantom/coronavirus/master/covid_clean.csv"), provided you have an internet connection.
By assigning the output of read.csv() to an object named covid, we have created this Data Frame in the environment.
We then call head() on our covid object, which returns the first 6 rows of this Data Frame.
There are a few useful tips about this Data Frame object. It’s a two-dimensional data structure, meaning it has rows and columns. Each column can be thought of as a list of values for one variable. This is not very different from how you work with tables in spreadsheet software.
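To try read.csv() and head() without depending on the course file, here is a minimal sketch that writes a tiny CSV to a temporary file and reads it back. The column names mirror covid_clean.csv, but the values and the names toy, toy_path and toy_read are made up for illustration.

```r
# Made-up rows for illustration only; not the real course data
toy <- data.frame(
    date = c("2020-01-22", "2020-01-23", "2020-01-24"),
    country = c("US", "US", "US"),
    confirmed = c(1L, 1L, 2L)
)

# Write the toy data to a temporary CSV file, then read it back
toy_path <- tempfile(fileext = ".csv")
write.csv(toy, toy_path, row.names = FALSE)
toy_read <- read.csv(toy_path)

# head() shows up to the first 6 rows; this data frame only has 3
head(toy_read)

# A data frame is two-dimensional: dim() returns rows, then columns
dim(toy_read)
```

Swapping toy_path for "data_input/covid_clean.csv" gives the same call used in the chunk above.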
Being a Data Frame (or dataframe), there are certain functions you can call on this covid object. For a data frame with the name my_data, calling head(my_data) returns the first 6 rows of the data. What if we’d like to see the last 6 rows of data instead? For that, there’s tail(my_data).
head() and tail() also accept an additional parameter n, for when you wish to see the first or last n rows of the data:
head(covid, n=7) prints the first 7 rows of data
tail(covid, n=3) prints the last 3 rows of data
If you want to know the number of rows and columns for a data frame named bad_loans, you can use the nrow() and ncol() functions in R:
nrow(bad_loans) prints the number of rows in our data frame
ncol(bad_loans) prints the number of columns in our data frame
Remember the summary() function we saw in the earlier lesson? Well, we can call summary() on our Data Frame as well:
summary(covid) prints a compact summary of our data frame:
      X               date                  country              confirmed          deaths           recovered
Min.   :    1   2020-01-22:  185   Afghanistan        :   87   Min.   :     0   Min.   :    0.0   Min.   :    0.0
1st Qu.: 4024   2020-01-23:  185   Albania            :   87   1st Qu.:     0   1st Qu.:    0.0   1st Qu.:    0.0
Median : 8048   2020-01-24:  185   Algeria            :   87   Median :     1   Median :    0.0   Median :    0.0
Mean   : 8048   2020-01-25:  185   Andorra            :   87   Mean   :  2323   Mean   :  128.6   Mean   :  573.3
3rd Qu.:12072   2020-01-26:  185   Angola             :   87   3rd Qu.:    74   3rd Qu.:    1.0   3rd Qu.:    3.0
Max.   :16095   2020-01-27:  185   Antigua and Barbuda:   87   Max.   :699706   Max.   :36773.0   Max.   :83114.0
                (Other)   :14985   (Other)            :15573
Among other things, it gives us a quick glance into our data frame. It shows me that, for at least the first few days in our data frame (2020-01-22 to 2020-01-27), every day we make 185 observations (rows). This is the number of countries that our data frame contains, from Afghanistan, Albania, … to Zambia and Zimbabwe. Each day, we have 185 records, one for each of these countries.
It also gives us some statistical measures: the minimum and maximum, the mean and median, and so on. We see that the largest single-day record in our data frame for deaths is 36,773. As a reminder, this data is cleaned and only updated as of 17th April 2020. By the time you’re following this tutorial, the actual numbers will have shot up. Later in this tutorial series, you will learn how to have your code query the real-time data directly from Johns Hopkins University’s repository.
While exploring the data, another useful technique is to print the “structure” of our data. This can be done using str(covid):
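The inspection helpers covered so far (head(), tail(), nrow(), ncol() and summary()) can be tried on any data frame. Here is a small sketch on an invented data frame named toy; the numbers are placeholders, not real case counts.

```r
# Invented numbers for illustration; not real case counts
toy <- data.frame(
    day = 1:10,
    confirmed = c(0, 0, 1, 1, 2, 5, 8, 13, 21, 34)
)

head(toy, n = 3)   # first 3 rows
tail(toy, n = 2)   # last 2 rows

nrow(toy)          # number of rows
ncol(toy)          # number of columns

# summary() reports the minimum, quartiles, median, mean and maximum
# for each numeric column
summary(toy)
```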
'data.frame': 16095 obs. of 6 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ date : Factor w/ 87 levels "2020-01-22","2020-01-23",..: 1 2 3 4 5 6 7 8 9 10 ...
$ country : Factor w/ 185 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ confirmed: int 0 0 0 0 0 0 0 0 0 0 ...
$ deaths : int 0 0 0 0 0 0 0 0 0 0 ...
$ recovered: int 0 0 0 0 0 0 0 0 0 0 ...
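Among other things, it tells you:
The number of observations (16095), which matches nrow(covid).
The number of variables (6), which matches ncol(covid).
The name and class of each variable, along with its first few values.
But what is a Factor, and what is an integer? This brings us to the following topic.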
When we created our data frame covid by reading from a CSV source using read.csv(), R read every numeric variable as either an integer or a numeric. In practice, the difference is that integer types can only contain whole numbers, like -1, 7, 400, 20.
Numeric is the more general type that encompasses multiple types, including the integer class and the double class, which is for double-precision floating-point numbers. If your data frame has two variables, UnitsSold and UnitPrice, you will likely have UnitsSold stored as an integer, since you will sell 1 unit of beer, or 2 units, or 12. It is unlikely you will ever sell 3.443 units of beer.
On the other hand, you will most likely store UnitPrice as a numeric, which allows the values to be either integer or double.
Factors are used in R to work with categorical variables. When we created our data frame covid by reading from a CSV source using read.csv(), R read every character variable as a factor, which is the default in R versions before 4.0.0 (from R 4.0.0 onward, the default is stringsAsFactors = FALSE). This behavior can be changed. When we call read.csv(), passing in an additional parameter stringsAsFactors = FALSE will read character variables as characters rather than automatically treating them as categorical.
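A short sketch can make these classes concrete. The vector names units_sold and unit_price and the CSV text are invented for illustration; class(), factor() and the stringsAsFactors parameter are standard R.

```r
units_sold <- c(1L, 2L, 12L)      # the L suffix marks integer values
unit_price <- c(2.5, 3, 4.75)     # stored as double-precision numbers

class(units_sold)                 # "integer"
class(unit_price)                 # "numeric"
is.numeric(units_sold)            # TRUE: integers count as numeric too

# A categorical variable stored as a factor keeps a fixed set of levels
country <- factor(c("US", "Italy", "US", "Germany"))
levels(country)                   # "Germany" "Italy" "US"

# The same CSV text, read with and without automatic factor conversion
csv_text <- "country,confirmed\nUS,1\nItaly,2"
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_char <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factor$country)          # "factor"
class(as_char$country)            # "character"
```

Setting stringsAsFactors explicitly, as done here, makes the result independent of the R version in use.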
Suppose our data contains the following columns: which among them are integer, which are numeric, and which should be factors?
For the most part, figuring out the appropriate classes to store our data is something best done early on in your data analysis project. Suppose you figure out that a few of these variables need to be converted to the correct types; you can use R’s built-in methods to perform the conversion:
as.Date() converts the class of our vectors to the Date format
as.integer() converts the class of our vectors to the integer format
as.character() converts the class of our vectors to the character format
as.numeric() converts the class of our vectors to the numeric format
Edit plottingBasics.Rmd so it looks like this (don’t copy the header):
---
title: "plottingBasics"
author: "Samuel"
date: "4/20/2020"
output: html_document
---
```{r reading-data}
covid <- read.csv("data_input/covid_clean.csv")
covid$date <- as.Date(covid$date)
covid <- covid[order(covid$date), 2:6]
str(covid)
```
```{r inspect}
head(covid)
```
A few things are worth noting in the code above.
To access a variable in a data frame, we use the name of the data frame, covid (or any name of your choice), followed by the $ sign. In this case, covid$date points to the variable named date in our covid dataset. Whenever you perform class conversion, remember to assign the post-conversion output back to the original variable to override the old values.
We order the data by the date variable, so the first row is the earliest date and the last row in our data is the latest date.
We use the data[row, column] convention, where you apply your subsetting conditions. In our case, for column, we want the 2nd to 6th columns (discarding the first column, since it’s not helpful to our analysis). For row, we use order(covid$date) to instruct R to order our rows based on the date variable. If you’d rather have it in descending order (latest date at the top, earliest date at the bottom), you can add an additional parameter: order(covid$date, decreasing=TRUE).
Instead of plotting all 185 countries on a single plot, let’s start small. We will take only a fraction of the rows from our data, specifically the ones where the value for country equals “US”. This is called a subsetting operation and can be done in one of two ways:
us_cases <- subset(covid, country == "US")
us_cases <- covid[covid$country == "US", ]
The first method uses the subset() function explicitly, passing in (1) our data frame and (2) the condition. The second method uses the data[row_conditions, column_conditions] convention. In our case, we wanted all columns, but only the rows where country is equal (==) to “US”. Because we did not specify any subsetting condition in the column_conditions area, this will keep all columns. This is equivalent to:
us_cases <- covid[covid$country == "US", 1:5]
Notice that in your RStudio’s Environment tab, you now have two data frames, us_cases and covid. You took a subset of the original data frame and assigned it to a new data frame with the name us_cases.
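The conversion, ordering and subsetting steps can be rehearsed end to end on a few rows. This is a sketch on invented data; the data frame toy and the names us_a and us_b are placeholders, and its columns are a cut-down version of the ones in covid.

```r
# Invented rows, deliberately out of date order
toy <- data.frame(
    X = 1:4,
    date = c("2020-01-23", "2020-01-22", "2020-01-23", "2020-01-22"),
    country = c("US", "US", "Italy", "Italy"),
    confirmed = c(1L, 1L, 0L, 0L)
)

# Convert the text dates to the Date class and assign the result back
toy$date <- as.Date(toy$date)
class(toy$date)   # "Date"

# Order the rows by date and keep columns 2 to 4, dropping X
toy <- toy[order(toy$date), 2:4]

# Two ways to keep only the US rows
us_a <- subset(toy, country == "US")
us_b <- toy[toy$country == "US", ]

us_a
us_b
```

Both us_a and us_b should contain the same two US rows, with the earliest date first.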
Creating a plot in R can be as simple as using the plot() function and passing in the values to use for your x-axis and y-axis:
plot(x=us_cases$date, y=us_cases$confirmed)
This plot() function, as you may have guessed, takes additional parameters as well, allowing you greater control over your graphics. These are some common parameters that we’ll be using:
main: A main title for your plot
sub: A subtitle for your plot
xlab and ylab: The labels for your x- and y-axis, respectively
pch: Point character, the symbol used for the “markers” on the plot (by default, a round circle)
cex: Character expansion, the size of your point characters (“markers”)
type: Type of plot
col: Color for your symbols
Go back to your R Markdown and try to change and adjust a few of these parameters from the following reference:
plot(x=us_cases$date, y=us_cases$confirmed,
    pch=19,
    cex=0.3,
    main="Confirmed Cases vs Recovery in the US",
    sub="Data as of 17th April 2020",
    xlab="Date", ylab="")
For example, change the value of pch to a different value, and run your chunk. Observe the changes. Repeat for other parameters as well. A handy reference for all the different values for point characters (pch):
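One way to see the point characters for yourself is to draw them all in a single plot. This is a minimal sketch, assuming the standard pch codes 0 to 25; the grid layout and the names pch_codes, grid_x and grid_y are arbitrary choices for illustration.

```r
# Standard point characters are numbered 0 to 25
pch_codes <- 0:25

# Lay the symbols out on a grid that is 7 columns wide
grid_x <- pch_codes %% 7
grid_y <- pch_codes %/% 7

plot(x = grid_x, y = grid_y,
    pch = pch_codes,
    cex = 2,
    # bg only affects the fillable symbols, 21 to 25
    bg = "lightblue",
    xlim = c(-0.5, 6.5),
    # reversed limits so that code 0 sits in the top row
    ylim = c(3.8, -0.5),
    axes = FALSE,
    main = "Point characters (pch) 0 to 25",
    xlab = "", ylab = "")

# Label each symbol with its pch code
text(x = grid_x, y = grid_y, labels = pch_codes, pos = 1, offset = 1)
```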
Also give your plot a different main title and subtitle (main and sub respectively). The cex in the reference code above is set at 0.3; this is the amount by which our plotting symbol should be magnified. Setting this to 2 means a 2x increase in size, while 0.3 shrinks it down to 30%.
Another detail from the reference code above is that we have made generous use of new lines instead of specifying all parameters for the function call in one line. This has the benefit of making your code more readable, and allows you to add comments by starting a line with the # character. I like to use a tab (or 4 white spaces) to indent my code at each line for extra readability.
Consider the following code:
plot(x=us_cases$date,
    # use log values instead
    y=log(us_cases$confirmed),
    type="l",
    col="darkgreen",
    # box type
    bty="l",
    # linetype
    lty="dashed")
This produces the following plot:
Even though you may not have been introduced to all the different plot parameters (there are a lot!), I’m sure you can agree that the comments have helped. I’ve incorporated a few changes:
Use log(us_cases$confirmed) instead of us_cases$confirmed for values on the y-axis.
type="l" sets the plot type to be “line” (instead of “points”, the default).
col="darkgreen" sets the color of our line to dark green. Run colors() and you’ll see all of R’s predefined colors by name.
bty="l" sets the bounding box to be L-shaped (so only lines on the left and bottom axis).
lty="dashed" sets the line type to “dashed”. Valid values for lty are: "blank", "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash".
If you want some extra practice, modify the reference code above. Add a main title, a subtitle, and the x- and y-axis labels. Use colors() to print all 657 colors, and pick one of your choice for your line. I’ve also attached a quick reference for all colors in the Materials tab of this lesson.
# print only the first 6 colors
head(colors())
[1] "white" "aliceblue" "antiquewhite" "antiquewhite1"
[5] "antiquewhite2" "antiquewhite3"
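With 657 names to choose from, it can help to search the list rather than scroll through it. A minimal sketch, using grep() to filter the names; the search term "green" and the names all_colors and greens are just examples.

```r
all_colors <- colors()
length(all_colors)                 # 657

# Keep only the color names that contain "green"
greens <- grep("green", all_colors, value = TRUE)
head(greens)

# Check that a name is valid before passing it to col=
"darkgreen" %in% all_colors        # TRUE
"cornsilk3" %in% all_colors        # TRUE
```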
Once you’ve created a plot, you can also add new elements, such as additional lines (using lines()), onto that plot. In the following example, we first create a plot, and then use lines() to add a new line to it.
plot(us_cases$date, us_cases$confirmed,
    pch=19,
    cex=0.3,
    main="Confirmed Cases vs Deaths in the US",
    sub="Data as of 17th April 2020",
    las=1,
    xlab="", ylab="",
    type="l",
    col="darkgreen", lty="solid")
lines(us_cases$date, us_cases$deaths, col="cornsilk3", lwd=2, lty="dashed")
las and the scientific notation
Notice the second line is a “dashed” one, and has a slightly thicker line width (lwd=2). As you increase this number, the weight of the line becomes greater as well. Another detail is that the labels on the y-axis are now in horizontal (perpendicular to the axis) orientation, since we set las=1.
These labels on the y-axis also use the exponential notation (7e+05, 6e+05, …). We can use the options() function to control this.
You can add as many lines as you want: once you have called plot() to create a new plot, subsequent calls to lines() will each add a line to the plot.
You should use this with caution though, because too many lines on the plot may be confusing and overcrowd the plot without actually adding any level of clarity to the reader. To deal with this, there are numerous strategies:
Copy and run the following code in your R Markdown, and notice the addition of options(scipen=1) as well as a call to legend(). Other parts of the code remain largely the same as previous examples in this lesson.
options(scipen=1)
plot(us_cases$date, us_cases$confirmed,
    pch=19,
    cex=0.3,
    main="Confirmed Cases vs Recovery in the US",
    las=1,
    xlab="", ylab="")
lines(us_cases$date, us_cases$confirmed, col="cornsilk3", lwd=2)
lines(us_cases$date, us_cases$recovered, col="lightblue", lwd=2)
lines(us_cases$date, us_cases$deaths, col="lightpink", lwd=2)
legend("topleft", fill=c("cornsilk3", "lightblue", "lightpink"), legend=c("confirmed", "recovered", "deaths"))
Renders the following in your R Markdown document:
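The call to legend() (line 11) uses three parameters:
"topleft" is the position of the legend. Try changing it to "bottom", "top", "bottomleft", "bottomright", or "topright" and observe the changes in your plot outcome.
fill sets the colors of the boxes shown beside each legend label.
legend sets the text of the labels. Try changing a label (for example, "Deaths" instead of "deaths") and observe the changes in your plot outcome.
Notice how the plot uses the fixed form instead of the exponential notation, thanks to options(scipen=1) (line 1).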
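The effect of scipen can also be checked without drawing a plot, by formatting a number as text. A small sketch, assuming the value 700000 as a stand-in for the largest axis label; the names old_options, default_label and fixed_label are placeholders.

```r
# Remember the current setting so it can be restored afterwards
old_options <- options(scipen = 0)

# With scipen = 0, R picks whichever notation is shorter
default_label <- format(700000)
default_label    # "7e+05"

# A positive scipen penalises scientific notation
options(scipen = 1)
fixed_label <- format(700000)
fixed_label      # "700000"

# Restore whatever setting was active before
options(old_options)
```

Larger scipen values push R further toward fixed notation, which can help when the numbers grow by more digits.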
Spend some time customizing this plot by practicing what you’ve learned. If you want an extra challenge, replicate this using data from a country other than the US. As a reminder, you can perform subsetting in one of two ways:
germany <- subset(covid, country=="Germany")
italy <- covid[covid$country == "Italy", ]
Create your plot, with your own colors (col), box types (bty), point character (pch), line types (lty), line width (lwd), main title (main) and subtitle (sub), and x- and y-axis labels (xlab and ylab).
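If you are unsure where to begin, here is one possible starting point rather than a model answer. It uses an invented data frame named toy with made-up numbers for two countries, so swap in your own subset of covid; the legend here uses col and lty instead of fill so that it matches the line styles.

```r
# Invented numbers for two countries, for illustration only
toy <- data.frame(
    date = rep(as.Date("2020-03-01") + 0:4, times = 2),
    country = rep(c("Germany", "Italy"), each = 5),
    confirmed = c(10, 20, 40, 80, 160, 15, 30, 60, 120, 240),
    deaths = c(0, 1, 2, 4, 8, 1, 2, 4, 8, 16)
)

italy <- toy[toy$country == "Italy", ]

plot(italy$date, italy$confirmed,
    type = "l",
    col = "cornsilk3",
    lwd = 2,
    las = 1,
    bty = "l",
    main = "Confirmed Cases vs Deaths in Italy",
    sub = "Made-up numbers for illustration",
    xlab = "Date", ylab = "")
lines(italy$date, italy$deaths, col = "lightpink", lwd = 2, lty = "dashed")

# col and lty in legend() mirror the styles of the two lines
legend("topleft",
    col = c("cornsilk3", "lightpink"),
    lty = c("solid", "dashed"),
    lwd = 2,
    legend = c("confirmed", "deaths"))
```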
Remember, you can always hop over to the GitHub repository and download the full code and use it as a reference to guide you along as you develop your project.
When you’re done, move on to the next course to learn how to seriously supercharge your visualization using a popular industry tool, ggplot!
All pre-defined colors in R:
Source: Sape research group at the Faculty of Informatics of the University of Lugano