Introduction

This tutorial introduces tables and shows how to tabulate data. The entire R markdown document for this tutorial can be downloaded here. A truly excellent and highly recommendable resource on processing data in R is Wickham and Grolemund (2016) [see also Wickham et al. (2019). Alternative but also very useful resources are Stander and Dalla Valle (2017); and Estrellado et al. (2020).

Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to and more information how to use R here. For this tutorials, we need to install certain packages from an R library so that the scripts shown below are executed without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, then you can skip ahead ignore this section. To install the necessary packages, simply run the following code - it may take some time (between 1 and 5 minutes to install all of the libraries so you do not need to worry if it takes some time).

# clean current workspace
rm(list=ls(all=T))
# set options
options(stringsAsFactors = F)         # no automatic data transformation
options("scipen" = 100, "digits" = 4) # supress math annotation
# install libraries
install.packages(c("xlsx", "dplyr", "tidyr"))

Once you have installed R-Studio and initiated the session by executing the code shown above, you are good to go.

1 Gettign started

Tables are one of the most common and important ways to process, handle, and describe data. This tutorial introduces different types of tables, how to load and save different types of tables, as well as how to modify and process tables and tabulated data.

When working with R, there are different kinds or types of tables that have different characteristics. The most common types of tables in “R” are:

  • matrices
  • data frames
  • tibbels

Matrices can only contain one type of data and all data points will be converted to the type of scale with the lowest information value. For instance, if at least one variables in a table represents characters (or strings), then all other variables are also converted into characters (although they may be numbers).

Data frames can contain different types of scales and variables are not converted automatically. In this sense, data frames are more flexible and are preferrable over matrices.

Tibbles are the tidyverse equivalent of data frames which offer new functions and possibilities of handling and inspecting the data. .

2 Loading tables into R

There are several different functions that allow us to read tabulated data into R. If the data is stored as a txt file, there are various functions to read in the data. The most common functions are “read.delim” and “read.table”.

# load data with read.delim
mytable <- read.delim("https://slcladal.github.io/data/mlrdata.txt", header = TRUE)
# show first 6 lines of table
head(mytable)
##         status    attraction money
## 1 Relationship NotInterested 86.33
## 2 Relationship NotInterested 45.58
## 3 Relationship NotInterested 68.43
## 4 Relationship NotInterested 52.93
## 5 Relationship NotInterested 61.86
## 6 Relationship NotInterested 48.47
# load data with read.delim
myothertable <- read.table("https://slcladal.github.io/data/mlrdata.txt", header = TRUE)
# show first 6 lines of table
head(myothertable)
##         status    attraction money
## 1 Relationship NotInterested 86.33
## 2 Relationship NotInterested 45.58
## 3 Relationship NotInterested 68.43
## 4 Relationship NotInterested 52.93
## 5 Relationship NotInterested 61.86
## 6 Relationship NotInterested 48.47

When a tables should be loaded from Excel, it is neccessary to load the “xlsx” library. The example below only works if you exchange the path with your own path to an xls or xlsx file. This is because if we want to load data from the LADAL server, loading the table becomes slightly more complex. You can try it out loading an xls or xlsx file with some xls or xlsx file on your own computer though.

# load library
library(xlsx)
# load table into R
# WARNING! set your own path!
exceldata <- read.xlsx("data/testdata1.xlsx", 1)
# show first 6 lines of table
head(exceldata)

The following section shows how to access and manipulate tables.

3 Dealing with tables in the Tidyverse

The tidyverse is a specific way of writing R code that builds on a family of libraries designed for efficient data science work flows which were developed initially by Hadley Wickham. This new way of writing R code buidls on a shared and underlying design philosophy and grammar. Due to its popularity and ease of use, the tidyverse way to write R code is becoming increasingly popular and we will use it in the following to handle and manipulate tabulated data.

If you have already loaded data into R and now want to process the data, you typically have to modify the data in some form or another to get the information or format you need. The tidyverse offers very user-friendly, intuitive, and handy functions for processing the data to match the needs of your analysis. To have access to the tidyverse functions for data processing, we load the dplyr and tidyr libraries and load and inspect another set of data using the “read.delim” function. The new data is stored as a txt file and has 100 observations (rows) and 3 variables (status, attraction, and money). The data set represents how much money people have spend in someone they were interested in or not (attraction: Intersted versus NotInterested) and their own relationship status (status: Singel versus Relationship).

# load libraries
library(dplyr)
library(tidyr)
# load new data
newdata <- read.delim("data/mlrdata.txt", sep = "\t", header = T)
# inspect data before processing
nrow(newdata); str(newdata); table(newdata$attraction)
## [1] 100
## 'data.frame':    100 obs. of  3 variables:
##  $ status    : chr  "Relationship" "Relationship" "Relationship" "Relationship" ...
##  $ attraction: chr  "NotInterested" "NotInterested" "NotInterested" "NotInterested" ...
##  $ money     : num  86.3 45.6 68.4 52.9 61.9 ...
## 
##    Interested NotInterested 
##            50            50

The output shows that the table represents 3 variables (status, attraction, and money) and 100 rows which contain 50 people who are interested and 50 who are not interested. We will now check out different ways and fucntions to process data.

3.1 Piping

Piping, done with the sequence “%>%”, is a very easy, intuitive, quick, and handy way to process data. Essentially piping means that we take an element that is to the left of the piping symbol and then do something to it; that way, the piping symbol can be translated as “and then”.

We could, for example, load data and then capitalize the column names and then group the data by status and attraction and then get the mean of money spend on deleting all observations except for the first one. A more formal way to write this would be: load %>% capitalize %>% group %>% summarize.

In R this command would look like this:

# example of a data processing pipeline
pipeddata <- read.delim("data/mlrdata.txt", sep = "\t", header = T) %>%
  dplyr::rename(Status = status, Attraction = attraction, Money = money) %>%
  dplyr::group_by(Status, Attraction) %>%
  dplyr::summarise(Mean = mean(Money))
# inspect summarized data
pipeddata
## # A tibble: 4 x 3
## # Groups:   Status [2]
##   Status       Attraction     Mean
##   <chr>        <chr>         <dbl>
## 1 Relationship Interested     99.2
## 2 Relationship NotInterested  51.5
## 3 Single       Interested    157. 
## 4 Single       NotInterested  46.0

The pipe has worked and we get the resulting summary which shows the mean of the money spend based on Attraction and Status.

3.2 Selecting and filtering

Among the most frequent procedures in data processing is selecting certain columns or extracting rows based on variable levels. In the tidyverse, this is done by using the select and filter functions. While select allows to extract columns, filter is used to extract rows, e.g. to get only observations that have a certain feature. Have a look at the example below.

# select and filter
reduceddata <- newdata %>%
  # select the columns attraction and money
  dplyr::select(attraction, money) %>%
  # extract rows which represent cases where the person was intersted in someone
  dplyr::filter(attraction == "Interested")
# inspect new table
nrow(reduceddata); table(reduceddata$attraction)
## [1] 50
## 
## Interested 
##         50

We have now reduced the data by excluding status (we have only selected attraction and money) and we have removed those 50 data rows of people who were not interested. The select function (like most other tyidyverse functions) can also be used together with a minus sign which cuases a column to be removed, thus "“dplyr::select(-money)” would remove the money column (see below).

# select and filter
datawithoutmoney <- newdata %>%
  # remove money
  dplyr::select(-money) 
# inspect data
head(datawithoutmoney)
##         status    attraction
## 1 Relationship NotInterested
## 2 Relationship NotInterested
## 3 Relationship NotInterested
## 4 Relationship NotInterested
## 5 Relationship NotInterested
## 6 Relationship NotInterested

Selecting and filtering are extremely powerful functions that can also be combined with other functions. But before we discuss more complex issues, we will have alook at how we can change columns.

3.3 Changing data and adding columns

Changing and adding data is done with the mutate function. The mutate functions requires that we specify a column name - if we use the same name as the column we are changing, then we change the column but if we specify another column mame, then a new column is created.

We will now create a new column (Spendalot) in which we encode if the person has spend a lot of money (100 AUD or more) on the present or not (less than 100 AUD).

# creating a new column
newdata <- newdata %>%
  dplyr::mutate(Spendalot = ifelse(money >= 100, "Alot", "Alittle")) 
# inspect data
head(newdata)
##         status    attraction money Spendalot
## 1 Relationship NotInterested 86.33   Alittle
## 2 Relationship NotInterested 45.58   Alittle
## 3 Relationship NotInterested 68.43   Alittle
## 4 Relationship NotInterested 52.93   Alittle
## 5 Relationship NotInterested 61.86   Alittle
## 6 Relationship NotInterested 48.47   Alittle

The table now has a new column (Spendalot) because we have specified a column name that did not exist yet - had we written “dplyr::mutate(money = ifelse(money >= 100,”Alot“,”Alittle“))” then we would have changed the money column and replaced the money values with the labels “Alot” and “Alittle”.

3.4 Renaming columns

Oftentimes, column names are not really meaningful or incoherent which makes it easier to wrap your head around what the values in a column refer to. The easiest way around this is rename columns which is, fortunately very simple in the tidyverse. While the column names of our example table are meaningful, I want to capitalize the first letter of each column name. This can be done as follows.

# renaming columns
newdata <- newdata  %>%
  dplyr::rename(Status = status, Attraction = attraction, Money = money)
# inspect data
head(newdata)
##         Status    Attraction Money Spendalot
## 1 Relationship NotInterested 86.33   Alittle
## 2 Relationship NotInterested 45.58   Alittle
## 3 Relationship NotInterested 68.43   Alittle
## 4 Relationship NotInterested 52.93   Alittle
## 5 Relationship NotInterested 61.86   Alittle
## 6 Relationship NotInterested 48.47   Alittle

The renaming was successful as all column names now bein with a capital letter.

3.5 Grouping and summarising

In contrast to mutate, which retains the number of rows, summarizing creates new columns but collapses rows and only provides the summary value (or values if more than one summary is specified). Also, columns that are not grouping variables are removed.

Summarizing is particularly useful when we want to get summaries of groups. We will modify the example from above and extract the mean and the standard deviation of the money spend on presents by relationship status and whether the giver was attracted to the givee.

#grouping and summarizing data 
datasummary <- newdata %>%
  dplyr::group_by(Status, Attraction) %>%
  dplyr::summarise(Mean = round(mean(Money), 2), SD = round(sd(Money), 1))
# inspect summarized data
datasummary
## # A tibble: 4 x 4
## # Groups:   Status [2]
##   Status       Attraction     Mean    SD
##   <chr>        <chr>         <dbl> <dbl>
## 1 Relationship Interested     99.2  14.7
## 2 Relationship NotInterested  51.5  17  
## 3 Single       Interested    157.   23.2
## 4 Single       NotInterested  46.0  19.9

3.6 Gathering and Spreading

One very common problem is that data - or at least parts of it - have to be transformed from long to wide format or vice versa. In the tidyverse, this is doen using the gather and spread function. We will convert the summary table shown above into a wide format (we also remove the SD column as it is no longer needed)

# converting data to wide format 
widedata <- datasummary %>%
  # remove SD column
  dplyr::select(-SD) %>% 
  # convert into wide format
  tidyr::spread(Attraction, Mean)
# inspect wide data
widedata
## # A tibble: 2 x 3
## # Groups:   Status [2]
##   Status       Interested NotInterested
##   <chr>             <dbl>         <dbl>
## 1 Relationship       99.2          51.5
## 2 Single            157.           46.0

We can re-convert the wide into a long format using the gather function.

# converting data to long format 
longdata <- widedata %>%
  # convert into long format
  tidyr::gather(Attraction, Money, Interested:NotInterested)
# inspect wide data
longdata
## # A tibble: 4 x 3
## # Groups:   Status [2]
##   Status       Attraction    Money
##   <chr>        <chr>         <dbl>
## 1 Relationship Interested     99.2
## 2 Single       Interested    157. 
## 3 Relationship NotInterested  51.5
## 4 Single       NotInterested  46.0

There are many more useful functions for processing, handling, and summarizing tables but this should suffice to get you started.

Citation & Session Info

Schweinberger, Martin. 2020. Tabulating data in R. Brisbane: The University of Queensland. url: https://slcladal.github.io/introtables.html.

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] tidyr_1.1.2 dplyr_1.0.2
## 
## loaded via a namespace (and not attached):
##  [1] knitr_1.30       magrittr_1.5     tidyselect_1.1.0 R6_2.4.1        
##  [5] rlang_0.4.7      fansi_0.4.1      stringr_1.4.0    tools_4.0.2     
##  [9] xfun_0.16        utf8_1.1.4       cli_2.0.2        htmltools_0.5.0 
## [13] ellipsis_0.3.1   yaml_2.2.1       digest_0.6.25    assertthat_0.2.1
## [17] tibble_3.0.3     lifecycle_0.2.0  crayon_1.3.4     purrr_0.3.4     
## [21] vctrs_0.3.4      glue_1.4.2       evaluate_0.14    rmarkdown_2.3   
## [25] stringi_1.5.3    compiler_4.0.2   pillar_1.4.6     generics_0.0.2  
## [29] pkgconfig_2.0.3

Go back to the main page


References

Estrellado, Ryan A, Emily A Freer, Jesse Mostipak, Joshua M Rosenberg, and Isabella C Velásquez. 2020. Data Science in Education Using R. Routledge.

Stander, Julian, and Luciana Dalla Valle. 2017. “On Enthusing Students About Big Data and Social Media Visualization and Analysis Using R, Rstudio, and Rmarkdown.” Journal of Statistics Education 25 (2): 60–67.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686.

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. " O’Reilly Media, Inc.".