Introduction

This tutorial introduces data visualization using R and shows how to modify different types of visualizations in the ggplot framework in R. The entire R-markdown document can be downloaded here.

When it comes to data visualization, R offers a myriad of options for showing and summarizing data, which makes R an incredibly flexible tool that offers full control over the distinct layers of plots. Rather than showing how to produce different types of plots (e.g. scatter plots, box plots, and line graphs), this introduction focuses on the three main frameworks for data visualization in R (base, lattice, and ggplot) and shows how you can modify your visualizations (e.g. changing axes and tick labels, changing colors, and showing different plots in one window). How to create different types of plots is shown in a separate tutorial. We keep this introduction separate from the tutorial on producing different types of visualizations because this introduction discusses more general questions about what needs to be kept in mind when visualizing data. The practical part presents the code used to set up graphs so that they can be recreated and also discusses potential problems that you may encounter when setting up a graph.

As there exists a multitude of different ways to visualize data, this section only highlights the different philosophies that underlie the frameworks for data visualization in R (base, lattice, and ggplot) and how to modify visualizations to match one’s individual needs. The major advantage of using R is that the code can be stored, distributed, and run very easily. This means that R represents a flexible framework for creating graphs that enables sustainable, reproducible, and transparent procedures.

Basics of data visualization

Before turning to the practical issues relating to creating graphs, a few words on what one has to keep in mind when visualizing data are in order. On a very general level, graphs should be used to inform the reader about properties and relationships between variables. This implies that…

  • graphs, including axes, must be labeled properly to allow the reader to understand the visualization with ease.

  • visualizations should not use more dimensions than the data that is being visualized has.

  • all elements within a graph should be unambiguous.

  • variable scales should be portrayed accurately (for instance, lines - which imply continuity - should not be used for categorically scaled variables).

  • graphs should be as intuitive as possible and should not mislead the reader.
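
As a small illustration of the points on labeling and scale types, the sketch below draws a bar chart for a categorical variable and labels both axes and the plot. It uses the built-in mtcars data rather than the data of this tutorial, an assumption made only to keep the example self-contained; the object names cars and p_cat are likewise just illustrative.

```r
library(ggplot2)

# mtcars ships with R; cyl (number of cylinders) is treated as categorical here
cars <- transform(mtcars, cyl = factor(cyl))

# bars rather than lines, because the x-axis variable is categorical
p_cat <- ggplot(cars, aes(x = cyl)) +
  geom_bar() +
  labs(x = "Number of cylinders", y = "Number of cars",
       title = "Cars by number of cylinders")
p_cat
```

A line connecting the three bars would suggest that there are meaningful values between, say, four and six cylinders, which is why bars are the safer choice here.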

Different philosophies: base R, lattice, and ggplot

A few words on the different frameworks for creating graphics in R are in order. There are three main frameworks in which to create graphics in R: the base framework, the lattice framework, and the ggplot or tidyverse framework.

The base R framework

The base R framework is the oldest of the three and is included in what is called base R - a collection of about 30 packages that ship with R, several of which (including the graphics package) are loaded automatically when you start R. The idea behind the “base” environment is that creating graphics is analogous to a painter painting on an empty canvas. Each line or element is added to the graph consecutively, which often leads to code that is very comprehensible but also very long.
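
The canvas analogy can be sketched with a few base R calls. The example below is only an illustration that uses the built-in mtcars data (not the data of this tutorial): it first sets up an empty plot and then adds points, a regression line, and a legend one call at a time.

```r
# start with an empty canvas: set up axes and labels but draw no data yet
plot(mtcars$wt, mtcars$mpg, type = "n",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Fuel economy by weight")
# add the points
points(mtcars$wt, mtcars$mpg, pch = 19, col = "gray40")
# add a linear regression line on top of the points
fit <- lm(mpg ~ wt, data = mtcars)
abline(fit, col = "red", lwd = 2)
# add a legend as the final element
legend("topright", legend = "Linear fit", col = "red", lwd = 2, bty = "n")
```

Every element needs its own call, which keeps each step easy to follow but makes the code grow quickly for more complex graphs.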

The lattice framework

The lattice environment was a follow-up to the base framework and complements it insofar as it made it much easier to display various variables and variable levels simultaneously. The philosophy of the lattice package is quite different from the philosophy of base: whereas everything has to be specified in base, graphs created in the lattice environment require only very little code. They are therefore very easily created when one is satisfied with the default design but very labor-intensive when it comes to customizing graphs. However, lattice is very handy when summarizing relationships between multiple variables and variable levels.
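
To give an impression of how compact lattice code can be, here is a minimal sketch that again uses the built-in mtcars data as a stand-in (an assumption for illustration; lattice is not otherwise used in this tutorial). A single call produces one panel per number of cylinders.

```r
library(lattice)

# the formula y ~ x | g draws one panel for each level of g
lat <- xyplot(mpg ~ wt | factor(cyl), data = mtcars,
              xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
              layout = c(3, 1))
print(lat)
```

The conditioning part of the formula (after the vertical bar) is what makes it easy to show several variable levels at once.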

The ggplot framework

The ggplot environment was written by Hadley Wickham and combines the positive aspects of both the base and the lattice package. It was first publicized in the gplot and ggplot1 packages, but the latter was soon repackaged and improved in what is now the most widely used package for data visualization: the ggplot2 package. The ggplot environment implements a philosophy of graphic design that builds on The Grammar of Graphics by Leland Wilkinson (Wilkinson 2012).

The philosophy of ggplot2 is to consider graphics as consisting of basic elements (called aesthetics, which include, for instance, the data set to be plotted and the axes) and layers that are overlaid onto the aesthetics. The idea of the ggplot2 package can be summarized as taking “care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.”

Thus, ggplots typically start with the function call (ggplot) followed by the specification of the data, then the aesthetics (aes), and then a specification of the type of plot that is created (geom_line for line graphs, geom_boxplot for box plots, geom_bar for bar graphs, geom_text for text, etc.). In addition, ggplot allows you to specify all elements that the graph consists of (e.g. the theme and axes).
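
The typical order of these components can be sketched as follows. The example uses the built-in mtcars data and the object name p_demo purely for illustration; neither is part of the tutorial's own data or code.

```r
library(ggplot2)

p_demo <- ggplot(data = mtcars,           # 1. the data
                 aes(x = wt, y = mpg)) +  # 2. the aesthetics
  geom_point() +                          # 3. the geom layer
  labs(x = "Weight (1000 lbs)",           # 4. the axis labels
       y = "Miles per gallon") +
  theme_bw()                              # 5. the theme
p_demo
```

Each component is added with a +, so that a plot can be extended or changed one component at a time.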

As the ggplot framework has become the dominant way to create visualizations in R, we will only focus on this framework in the following practical examples.

Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to and more information on how to use R here. For this tutorial, we need to install certain packages so that the scripts shown below are executed without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, then you can skip this section. To install the necessary packages, simply run the following code - it may take some time (between 1 and 5 minutes) to install all of the packages, so you do not need to worry if it takes a while.

# install packages
install.packages(c("dplyr", "ggplot2", "gridExtra", "knitr", "kableExtra", "stringr"))

Once you have installed R and initiated the session by executing the code shown above, you are good to go.
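
If some of the packages are already installed, one possible way to avoid re-installing them is to check first which ones are missing. The following is a minimal sketch of that idea; the object names pkgs and missing_pkgs are hypothetical and not part of the original setup code.

```r
# packages used in this tutorial
pkgs <- c("dplyr", "ggplot2", "gridExtra", "knitr", "kableExtra", "stringr")
# keep only those that cannot be found in the current library
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
# install what is missing (nothing happens if everything is already there)
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```

Running this repeatedly is harmless, because packages that are already available are skipped.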

Getting started

Before turning to the graphs, we will load the packages and the data for this tutorial. The data set is called lmmdata but we will change the name to plotdata for this tutorial. The data set is based on the Penn Parsed Corpora of Historical English (PPC) and it contains the date when a text was written (Date), the genre of the text (Genre), the name of the text (Text), the relative frequency of prepositions in the text (Prepositions), and the region in which the text was written (Region). We also add two more variables to the data called GenreRedux and DateRedux. GenreRedux collapses the existing genres into five main categories (Conversational, Religious, Legal, Fiction, and NonFiction) while DateRedux collapses the dates when the texts were composed into five main periods (1150-1499, 1500-1599, 1600-1699, 1700-1799, and 1800-1913). Finally, we convert the non-numeric variables into factors.

# activate packages
library(dplyr)
library(ggplot2) 
library(gridExtra)
library(knitr) 
library(kableExtra)
library(stringr)
# load data
plotdata <- read.delim("https://slcladal.github.io/data/lmmdata.txt", header = TRUE) %>%
  mutate(GenreRedux = case_when(str_detect(.$Genre, "Letter") ~ "Conversational",
                                Genre == "Diary" ~ "Conversational",
                                Genre == "Bible"|Genre == "Sermon" ~ "Religious",
                                Genre == "Law"|Genre == "TrialProceeding" ~ "Legal",
                                Genre == "Fiction" ~ "Fiction",
                                TRUE ~ "NonFiction")) %>%
  mutate(DateRedux = case_when(Date < 1500 ~ "1150-1499",
                               Date < 1600 ~ "1500-1599",
                               Date < 1700 ~ "1600-1699",
                               Date < 1800 ~ "1700-1799",
                               TRUE ~ "1800-1913")) %>%
  mutate(Genre = factor(Genre),
         Text = factor(Text),
         Region = factor(Region),
         GenreRedux = factor(GenreRedux),
         DateRedux = factor(DateRedux))

The first six rows of the data look like this:

First 6 rows of the plotdata
Date  Genre          Text     Prepositions  Region  GenreRedux      DateRedux
1736  Science        albin    166.01        North   NonFiction      1700-1799
1711  Education      anon     139.86        North   NonFiction      1700-1799
1808  PrivateLetter  austen   130.78        North   Conversational  1800-1913
1878  Education      bain     151.29        North   NonFiction      1800-1913
1743  Education      barclay  145.72        North   NonFiction      1700-1799
1908  Education      benson   120.77        North   NonFiction      1800-1913

We will now turn to creating the graphs.

Creating a simple graph

When creating a visualization with ggplot, we first call the function ggplot and define the data that the visualization will use. Then we define the aesthetics, which determine the layout, i.e. the x- and y-axes.

ggplot(plotdata, aes(x = Date, y = Prepositions))

In a next step, we add the geom-layer which defines the type of visualization that we want to display. In this case, we use geom_point as we want to show points that stand for the frequencies of prepositions in each text. Note that we add the geom-layer by adding a + at the end of the line!

ggplot(plotdata, aes(x = Date, y = Prepositions)) +
  geom_point()

We can also add another layer, e.g. a layer which shows a smoothed loess line, and we can change the theme by specifying the theme we want to use. Here, we will use theme_bw which stands for the black-and-white theme (we will get into the different types of themes later).

ggplot(plotdata, aes(x = Date, y = Prepositions)) +
  geom_point() +
  geom_smooth(se = F) +
  theme_bw()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

We can also store our plot in an object and then add different layers to it or modify the plot. Here we store the basic graph in an object that we call p and then change the axis labels.

p <- ggplot(plotdata, aes(x = Date, y = Prepositions)) +
  geom_point()
p + labs(x = "Year", y = "Frequency")

We can also integrate plots into data processing pipelines as shown below. When you integrate visualizations into pipelines, you should not specify the data as it is clear from the pipe which data the plot is using.

plotdata %>%
  select(DateRedux, GenreRedux, Prepositions) %>%
  group_by(DateRedux, GenreRedux) %>%
  summarise(Frequency = mean(Prepositions)) %>%
    ggplot(aes(x = DateRedux, y = Frequency, group = GenreRedux, color = GenreRedux)) +
    geom_line()
## `summarise()` regrouping output by 'DateRedux' (override with `.groups` argument)
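
The message printed by summarise can be avoided by stating explicitly what should happen to the grouping, using the .groups argument that the message itself mentions. The sketch below shows this with the built-in mtcars data as a stand-in for plotdata (an assumption made so that the example runs on its own); the names means and p_means are illustrative.

```r
library(dplyr)
library(ggplot2)

# summarise per group; .groups = "drop" returns an ungrouped table
# and suppresses the regrouping message
means <- mtcars %>%
  mutate(cyl = factor(cyl), gear = factor(gear)) %>%
  group_by(cyl, gear) %>%
  summarise(MeanMpg = mean(mpg), .groups = "drop")

# the summary table is piped into ggplot, so no data argument is given
p_means <- means %>%
  ggplot(aes(x = gear, y = MeanMpg, group = cyl, color = cyl)) +
  geom_line() +
  geom_point()
p_means
```

Dropping the groups does not affect the plot, because ggplot takes its grouping from the group aesthetic rather than from the grouping of the data frame.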

Modifying axes and titles

There are different ways to modify axes. The easiest way is to specify the axis labels using labs (as already shown above). To add a custom title, we can use ggtitle.

p + labs(x = "Year", y = "Frequency") +
  ggtitle("Preposition use over time", subtitle="based on the PPC corpus")

To change the range of the axes, we can specify their limits in the coord_cartesian layer.

p + coord_cartesian(xlim = c(1000, 2000), ylim = c(-100, 300))
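
It may help to know that setting limits in coord_cartesian only zooms into the plot, whereas setting limits on a scale removes the observations that fall outside the limits before any statistics are computed. The following sketch illustrates the difference with box plots of the built-in mtcars data (an assumption for illustration; the names base_plot, p_zoom, and p_cut are hypothetical).

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

# zooming: all observations are kept, only the visible window changes
p_zoom <- base_plot + coord_cartesian(ylim = c(10, 25))

# scale limits: observations outside 10-25 are dropped (with a warning)
# before the box plot statistics are computed
p_cut <- base_plot + scale_y_continuous(limits = c(10, 25))

# compare the medians that each version draws for the three groups
suppressWarnings(rbind(zoom = layer_data(p_zoom)$middle,
                       cut  = layer_data(p_cut)$middle))
```

In mtcars, several four-cylinder cars have more than 25 miles per gallon, so the median of that group differs between the two versions, while groups that lie entirely within the limits are unaffected.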

# rebuild p with axis labels, then change the appearance of the tick labels
p <- ggplot(plotdata, aes(x = Date, y = Prepositions)) +
  geom_point() + 
  labs(x = "Year", y = "Frequency")
p + theme(axis.text.x = element_text(face="italic", color="red", size=8, angle=45),
          axis.text.y = element_text(face="bold", color="blue", size=15, angle=90))

# remove the tick labels and the tick marks from both axes
p + theme(
  axis.text.x = element_blank(),
  axis.text.y = element_blank(),
  axis.ticks = element_blank())

# set the axis names and tick positions; Date and Prepositions are numeric,
# so continuous scales with breaks are used
p + scale_x_continuous(name = "Year of composition", breaks = seq(1150, 1900, 50)) +
  scale_y_continuous(name = "Relative Frequency", breaks = seq(70, 190, 20))

Modifying colors

To modify colors, you can include a color specification in the main aesthetics.

ggplot(plotdata, aes(x = Date, y = Prepositions, color = GenreRedux)) +
  geom_point() 

Or you can specify the color in the aesthetics of the geom-layer.

p + geom_point(aes(color = GenreRedux))

To change the default colors manually, you can use scale_color_manual, define the colors you want to use in the values argument, and specify the variable levels that you want to distinguish by color in the breaks argument. You can find an overview of the colors that you can define in R here.

ggplot(plotdata, aes(x = Date, y = Prepositions, color = GenreRedux)) +
  geom_point()  + 
  scale_color_manual(values = c("red", "gray30", "blue", "orange", "gray80"),
                       breaks = c("Conversational", "Fiction", "Legal", "NonFiction", "Religious"))
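
An alternative to pairing values and breaks by position is to pass a named vector to values, so that each color is tied to a level by name. The sketch below shows this with the built-in iris data (used here only so that the example is self-contained); the color choices and the names species_cols and p_named are illustrative.

```r
library(ggplot2)

# illustrative color choices, keyed by the factor levels of Species
species_cols <- c(setosa = "red", versicolor = "gray30", virginica = "blue")

p_named <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() +
  scale_color_manual(values = species_cols)
p_named
```

With a named vector, the assignment of colors does not depend on the order of the factor levels, which can make the code more robust when levels are reordered.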

When the variable that you want to colorize does not have discrete levels, you use scale_color_continuous instead of scale_color_manual.

p + geom_point(aes(color = Prepositions)) + 
  scale_color_continuous()

You can also change colors by specifying color palettes. Color palettes are predefined vectors of colors and there are many different color palettes available. Below are some examples using the Brewer color palette.

p + geom_point(aes(color = GenreRedux)) + 
  scale_color_brewer()

p + geom_point(aes(color = GenreRedux)) + 
  scale_color_brewer(palette = 2)

p + geom_point(aes(color = GenreRedux)) + 
  scale_color_brewer(palette = 3)
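
Brewer palettes can also be selected by name rather than by number, which tends to be easier to read. As a minimal sketch, again with the built-in iris data as a stand-in, the following uses "Dark2", one of the qualitative palettes of the RColorBrewer package; the object name p_brewer is illustrative.

```r
library(ggplot2)

p_brewer <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() +
  # select the palette by name; "Dark2" is a qualitative palette
  scale_color_brewer(palette = "Dark2")
p_brewer
```

Qualitative palettes such as "Dark2" or "Set1" are usually a reasonable choice for categorical variables, whereas the default of scale_color_brewer is a sequential palette.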

We now use the viridis color palette to show how you can use another palette. The example below uses the viridis palette for a discrete variable (GenreRedux).

p + geom_point(aes(color = GenreRedux)) + 
  scale_color_viridis_d()

To use the viridis palette for continuous variables you need to use scale_color_viridis_c instead of scale_color_viridis_d.

p + geom_point(aes(color = Prepositions)) + 
  scale_color_viridis_c()

The Brewer color palettes (see below) are among the most commonly used color palettes, but there are many more. You can find an overview of the color palettes that are available here.

library(RColorBrewer)
display.brewer.all()

Changing lines and shapes and adding transparency

# map the shape of the points to GenreRedux
ggplot(plotdata, aes(x = Date, y = Prepositions, shape = GenreRedux)) +
  geom_point() 

# choose the shapes manually (here: shapes 1 to 5)
ggplot(plotdata, aes(x = Date, y = Prepositions)) + 
  geom_point(aes(shape = GenreRedux)) + 
  scale_shape_manual(values = 1:5)

Similarly, if you want to change the lines in a line plot, you define the linetype in the aesthetics.

plotdata %>%
  select(GenreRedux, DateRedux, Prepositions) %>%
  group_by(GenreRedux, DateRedux) %>%
  summarize(Frequency = mean(Prepositions)) %>%
  ggplot(aes(x = DateRedux, y = Frequency, group = GenreRedux, linetype = GenreRedux)) +
  geom_line()
## `summarise()` regrouping output by 'GenreRedux' (override with `.groups` argument)

You can of course also manually specify the line types.

plotdata %>%
  select(GenreRedux, DateRedux, Prepositions) %>%
  group_by(GenreRedux, DateRedux) %>%
  summarize(Frequency = mean(Prepositions)) %>%
  ggplot(aes(x = DateRedux, y = Frequency, group = GenreRedux, linetype = GenreRedux)) +
  geom_line() +
  scale_linetype_manual(values = c("twodash", "longdash", "solid", "dotted", "dashed"))
## `summarise()` regrouping output by 'GenreRedux' (override with `.groups` argument)

Here is an overview of the most commonly used linetypes in R.

d=data.frame(lt=c("blank", "solid", "dashed", "dotted", "dotdash", "longdash", "twodash", "1F", "F1", "4C88C488", "12345678"))
ggplot() +
scale_x_continuous(name="", limits=c(0,1)) +
scale_y_discrete(name="linetype") +
scale_linetype_identity() +
geom_segment(data=d, mapping=aes(x=0, xend=1, y=lt, yend=lt, linetype=lt))

To make your layers transparent, you need to specify alpha values.

ggplot(plotdata, aes(x = Date, y = Prepositions)) + 
  geom_point(alpha = .2)

Transparency can be particularly useful when using different layers that add different types of visualizations.

ggplot(plotdata, aes(x = Date, y = Prepositions)) + 
  geom_point(alpha = .1) + 
  geom_smooth(se = F)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Transparency can also be linked to other variables.

ggplot(plotdata, aes(x = Date, y = Prepositions, alpha = Region)) + 
  geom_point()
## Warning: Using alpha for a discrete variable is not advised.

ggplot(plotdata, aes(x = Date, y = Prepositions, alpha = Prepositions)) + 
  geom_point()

Adapting sizes

# map the size of the points to a categorical variable (this triggers a warning)
ggplot(plotdata, aes(x = Date, y = Prepositions, size = Region, color = GenreRedux)) +
  geom_point() 
## Warning: Using size for a discrete variable is not advised.

# map the size of the points to a numeric variable
ggplot(plotdata, aes(x = Date, y = Prepositions, color = GenreRedux, size = Prepositions)) +
  geom_point() 

Adding text

# display the frequencies as text labels instead of points
plotdata %>%
  filter(Genre == "Fiction") %>%
  ggplot(aes(x = Date, y = Prepositions, label = Prepositions, color = Region)) +
  geom_text(size = 3) +
  theme_bw()

# display points and place the labels to the left of them (hjust)
plotdata %>%
  filter(Genre == "Fiction") %>%
  ggplot(aes(x = Date, y = Prepositions, label = Prepositions)) +
  geom_text(size = 3, hjust=1.2) +
  geom_point() +
  theme_bw()

# shift the labels with nudge_x and drop labels that would overlap
plotdata %>%
  filter(Genre == "Fiction") %>%
  ggplot(aes(x = Date, y = Prepositions, label = Prepositions)) +
  geom_text(size = 3, nudge_x = -15, check_overlap = TRUE) +
  geom_point() +
  theme_bw()

# add free text at fixed coordinates with annotate
ggplot(plotdata, aes(x = Date, y = Prepositions)) +
  geom_point() +
  annotate(geom = "text", label = "Some text", x = 1200, y = 175, color = "orange") +
  annotate(geom = "text", label = "More text", x = 1850, y = 75, color = "lightblue", size = 8) +
  theme_bw()

# label bars with their values (labels placed above the bars)
plotdata %>%
  group_by(GenreRedux) %>%
  summarise(Frequency = round(mean(Prepositions), 1)) %>%
  ggplot(aes(x = GenreRedux, y = Frequency, label = Frequency)) +
  geom_bar(stat="identity") +
  geom_text(vjust=-1.6, color = "black") +
  coord_cartesian(ylim = c(0, 180)) +
  theme_bw()
## `summarise()` ungrouping output (override with `.groups` argument)

# label grouped bars; position_dodge aligns the labels with the dodged bars
plotdata %>%
  group_by(Region, GenreRedux) %>%
  summarise(Frequency = round(mean(Prepositions), 1)) %>%
  ggplot(aes(x = GenreRedux, y = Frequency, group = Region, fill = Region, label = Frequency)) +
  geom_bar(stat="identity", position = "dodge") +
  geom_text(vjust=1.6, position = position_dodge(0.9)) + 
  theme_bw()
## `summarise()` regrouping output by 'Region' (override with `.groups` argument)

# use geom_label to draw the labels inside boxes
plotdata %>%
  filter(Genre == "Fiction") %>%
  ggplot(aes(x = Date, y = Prepositions, label = Prepositions)) +
  geom_label(size = 3, vjust=1.2) +
  geom_point() +
  theme_bw()

Combining multiple plots

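# one panel per level of GenreRedux, arranged in a single row
ggplot(plotdata, aes(x = Date, y = Prepositions)) +
  facet_grid(~GenreRedux) +
  geom_point() + 
  theme_bw()

# one panel per combination of Region and GenreRedux, wrapped into 5 columns
ggplot(plotdata, aes(x = Date, y = Prepositions)) +
  facet_wrap(vars(Region, GenreRedux), ncol = 5) +
  geom_point() + 
  theme_bw()

# create four separate plots and show the first two side by side
p1 <- ggplot(plotdata, aes(x = Date, y = Prepositions)) + geom_point() + theme_bw()
p2 <- ggplot(plotdata, aes(x = GenreRedux, y = Prepositions)) + geom_boxplot() + theme_bw()
p3 <- ggplot(plotdata, aes(x = DateRedux, group = GenreRedux)) + geom_bar() + theme_bw()
p4 <- ggplot(plotdata, aes(x = Date, y = Prepositions)) + geom_point() + geom_smooth(se = FALSE) + theme_bw()
grid.arrange(p1, p2, nrow = 1)

# p4 spans the top row; p2 and p3 share the bottom row,
# with the left column twice as wide as the right one
grid.arrange(grobs = list(p4, p2, p3), 
             widths = c(2, 1), 
             layout_matrix = rbind(c(1, 1), c(2, 3)))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'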
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Available themes

# draw the same plot with the default theme and several built-in themes
p <- ggplot(plotdata, aes(x = Date, y = Prepositions)) + geom_point() + labs(x = "", y= "") +
  ggtitle("Default") + theme(axis.text.x = element_text(size=6, angle=90))
p1 <- p + theme_bw() + ggtitle("theme_bw") + theme(axis.text.x = element_text(size=6, angle=90))
p2 <- p + theme_classic() + ggtitle("theme_classic") + theme(axis.text.x = element_text(size=6, angle=90))
p3 <- p + theme_minimal() + ggtitle("theme_minimal") + theme(axis.text.x = element_text(size=6, angle=90))
p4 <- p + theme_light() + ggtitle("theme_light") + theme(axis.text.x = element_text(size=6, angle=90))
p5 <- p + theme_dark() + ggtitle("theme_dark") + theme(axis.text.x = element_text(size=6, angle=90))
p6 <- p + theme_void() + ggtitle("theme_void") + theme(axis.text.x = element_text(size=6, angle=90))
p7 <- p + theme_gray() + ggtitle("theme_gray") + theme(axis.text.x = element_text(size=6, angle=90))
grid.arrange(p, p1, p2, p3, p4, p5, p6, p7, ncol = 4)

# modify individual theme elements: white panel background with a red border
ggplot(plotdata, aes(x = Date, y = Prepositions, color = GenreRedux)) +
  geom_point() + 
  theme(panel.background = element_rect(fill = "white", colour = "red"))

Extensive information about how to modify themes can be found here.

Modifying legends

# place the legend above the plot
ggplot(plotdata, aes(x = Date, y = Prepositions, color = GenreRedux)) +
  geom_point() + 
  theme(legend.position = "top")

# remove the legend
ggplot(plotdata, aes(x = Date, y = Prepositions, color = GenreRedux)) +
  geom_point() + 
  theme(legend.position = "none")

# place the legend inside the plot (relative x and y coordinates between 0 and 1)
ggplot(plotdata, aes(x = Date, y = Prepositions, linetype = GenreRedux, color = GenreRedux)) +
  geom_smooth(se = FALSE) +  
  theme(legend.position = c(0.2, 0.7)) 
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# customize the legend title, entries, and text color; both scales get the
# same name, breaks, and labels so that they can share a single legend
ggplot(plotdata, aes(x = Date, y = Prepositions, linetype = GenreRedux, color = GenreRedux)) +
  geom_smooth(se = FALSE) + 
  guides(color=guide_legend(override.aes=list(fill=NA))) +  
  theme(legend.position = "top", 
        legend.text = element_text(color = "green")) +
  scale_linetype_manual(values=1:5, 
                        name=c("Genre"),
                        breaks = names(table(plotdata$GenreRedux)),
                        labels = names(table(plotdata$GenreRedux))) + 
  scale_colour_manual(values=c("red", "gray30", "blue", "orange", "gray80"),
                      name=c("Genre"),
                      breaks=names(table(plotdata$GenreRedux)),  
                      labels = names(table(plotdata$GenreRedux)))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Citation & Session Info

Schweinberger, Martin. 2020. Introduction to Data Visualization in R. Brisbane: The University of Queensland. url: https://slcladal.github.io/introviz.html.

@manual{schweinberger2020introqant,
  author = {Schweinberger, Martin},
  title = {Introduction to Data Visualization in R},
  note = {https://slcladal.github.io/introviz.html},
  year = {2020},
  organization = {The University of Queensland, School of Languages and Cultures},
  address = {Brisbane},
  edition = {2020/09/23}
}
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] RColorBrewer_1.1-2 stringr_1.4.0      kableExtra_1.2.1   knitr_1.30        
## [5] gridExtra_2.3      ggplot2_3.3.2      dplyr_1.0.2       
## 
## loaded via a namespace (and not attached):
##  [1] pillar_1.4.6      compiler_4.0.2    highr_0.8         tools_4.0.2      
##  [5] digest_0.6.25     lattice_0.20-41   nlme_3.1-148      evaluate_0.14    
##  [9] lifecycle_0.2.0   tibble_3.0.3      gtable_0.3.0      viridisLite_0.3.0
## [13] mgcv_1.8-31       pkgconfig_2.0.3   rlang_0.4.7       Matrix_1.2-18    
## [17] rstudioapi_0.11   yaml_2.2.1        xfun_0.16         withr_2.3.0      
## [21] httr_1.4.2        xml2_1.3.2        generics_0.0.2    vctrs_0.3.4      
## [25] grid_4.0.2        webshot_0.5.2     tidyselect_1.1.0  glue_1.4.2       
## [29] R6_2.4.1          rmarkdown_2.3     farver_2.0.3      purrr_0.3.4      
## [33] magrittr_1.5      splines_4.0.2     scales_1.1.1      ellipsis_0.3.1   
## [37] htmltools_0.5.0   rvest_0.3.6       colorspace_1.4-1  labeling_0.3     
## [41] stringi_1.5.3     munsell_0.5.0     crayon_1.3.4


References

Wilkinson, Leland. 2012. “The Grammar of Graphics.” In Handbook of Computational Statistics. Concepts and Methods, edited by James E. Gentle, Wolfgang Karl Härdle, and Yuichi Mori, 375–414. Springer.