1 Introduction

This tutorial focuses on how to import data into “R” and how to export results from “R”. In addition, we will have a look at how to set the “workspace” and why this makes sense when working in R. The entire code for the sections below can be downloaded here.

2 Preparation and session set up

As all calculations and visualizations in this tutorial rely on “R”, it is necessary to install “R”, “RStudio”, and “Tinn-R”. If these programs (or, in the case of “R”, environments) are not already installed on your machine, please search for them in your favorite search engine and add the term “download”. Open any of the first few links and follow the installation instructions (they are easy to follow, do not require any specifications, and are pretty much self-explanatory).

In addition, certain “libraries” or “packages” need to be installed so that the scripts shown below run without errors. Before turning to the rest of the tutorial, please install the libraries by running the code below this paragraph. If you have already installed the libraries mentioned below, you can skip this section. To install the necessary libraries, simply run the following code. Installing all of the libraries may take between 1 and 5 minutes, so you do not need to worry if it takes some time.

# clean current workspace
rm(list = ls(all.names = TRUE))
# set options
options(stringsAsFactors = FALSE)     # no automatic data transformation
options("scipen" = 100, "digits" = 4) # suppress scientific notation
# install libraries (each package listed once)
install.packages(c("cluster", "factoextra", "seriation",
                   "pvclust", "ape", "vcd",
                   "exact2x2", "NbClust"))

Once you have installed “R”, “RStudio”, and “Tinn-R”, and have also initiated the session by executing the code shown above, you are good to go.
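
If you would rather not reinstall packages that are already present, one possible approach is to check first which ones are missing. The sketch below is only an illustration: the helper name “missing_packages” is made up for this example, and the actual installation call is left commented out so that nothing is installed by accident.

```r
# hypothetical helper (not part of the original tutorial):
# returns the packages from 'needed' that are not yet installed
missing_packages <- function(needed) {
  installed <- rownames(installed.packages())
  setdiff(needed, installed)
}

# packages used in this tutorial's set-up code
needed <- c("cluster", "factoextra", "seriation", "pvclust",
            "ape", "vcd", "exact2x2", "NbClust")
to_install <- missing_packages(needed)
to_install
# uncomment to install only what is missing
# if (length(to_install) > 0) install.packages(to_install)
```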

3 Setting the workspace

When you are not working within an R project (which automatically treats the folder in which the project is located as the working directory), it is useful to define the working directory (“workspace”) at the beginning of each session.

Defining the working directory means that you specify the path to a folder that serves as the default location, so that you do not have to specify the entire path to the objects or folders you are working with all the time. Instead of having to type, for example, “D:\Uni\UQ\LADAL\SLCLADAL.github.io/mydata.txt” you would only have to type “mydata.txt” if “SLCLADAL.github.io” is defined as the working directory.

Therefore, if you do not work within an R project (Rproj), you should specify the working directory manually by passing a path to the “setwd()” function (as shown below). In the present case, you may want to specify the folder that contains the materials for this tutorial. Before defining a working directory, it can be useful to check what is currently defined as the working directory. To get the present working directory, you can use the command “getwd()” as shown below.

# set workspace (uncomment and adapt the path to your machine)
#setwd("D:\\StatisticsForLinguists")
# show workspace
getwd()

From now on, we no longer need to specify the entire path to the files, just the name of the file we want to load.
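
To see what this means in practice, the following sketch temporarily switches the working directory and then accesses a file by its bare name. It is a minimal illustration under stated assumptions: “tempdir()” merely stands in for your own project folder, and “mydata.txt” is a throwaway file created for the demonstration.

```r
# remember the current working directory so it can be restored
old_wd <- getwd()

# tempdir() is used here as a stand-in for your own project folder
demo_dir <- tempdir()
setwd(demo_dir)

# a bare file name is now resolved relative to the working directory
writeLines("some text", "mydata.txt")
found_by_name <- file.exists("mydata.txt")
found_by_full_path <- file.exists(file.path(demo_dir, "mydata.txt"))

# restore the original working directory
setwd(old_wd)
```

Both checks refer to the same file; the first one only works because the working directory was set beforehand.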

4 Importing Data

This section deals with importing data.

4.1 Importing tables from Excel

The most common way to load data is to load tabulated data, often data that were created in some type of spreadsheet software (e.g. OpenOffice Calc or Microsoft Excel). We will therefore start with importing a single file containing tabulated data. To read spreadsheets directly from Microsoft Excel, you must first install and activate the “xlsx” package. To install this package, enter the command “install.packages("xlsx")” and press “Enter”. The package “xlsx” can then be activated with the command “library(xlsx)”.

install.packages("xlsx")
library(xlsx)

In a next step, we can either store the path to the data in an object and pass that object to the function “read.xlsx”, or write the path directly inside the function call. Both options are shown below.

# define path to data
path <- "data/testdata1.xlsx"
# load data with defined path
mydata <- read.xlsx(path, 1)
# load data without pre-defining the path
mydata <- read.xlsx("data/testdata1.xlsx", 1)

Another way to import data is to navigate to the data using a GUI or browser interface, which is done by using the “choose.files()” command within the “read.xlsx” function.

mydata <- read.xlsx(choose.files(), 1)
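
Note that “choose.files()” is only available on Windows, whereas base R's “file.choose()” opens a comparable dialog on other systems. The sketch below wraps both in a small helper; the name “pick_file” is invented for this illustration and is not part of “R” or of the “xlsx” package.

```r
# hypothetical helper: open a file dialog on any operating system
pick_file <- function() {
  if (.Platform$OS.type == "windows") {
    utils::choose.files(multi = FALSE)  # Windows-only dialog
  } else {
    file.choose()                       # available on all platforms
  }
}

# only ask for a file when R is used interactively
if (interactive()) {
  path <- pick_file()
  print(path)
}
```

The returned path could then be passed on to a reading function such as “read.xlsx” in place of a hard-coded path.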

To have a look at the entire data set, we can simply type the name of the object containing the data. In our case we called the object “mydata”. So enter “mydata” into the “R” console and “R” will show you the loaded data.

mydata
##    Variable1 Variable2
## 1          6        67
## 2         65        16
## 3         12        56
## 4         56        34
## 5         45        54
## 6         84        42
## 7         38        36
## 8         46        47
## 9         64        54
## 10        24        29

The command “summary(mydata)” summarizes the data set.

summary(mydata)
##    Variable1      Variable2   
##  Min.   : 6.0   Min.   :16.0  
##  1st Qu.:27.5   1st Qu.:34.5  
##  Median :45.5   Median :44.5  
##  Mean   :44.0   Mean   :43.5  
##  3rd Qu.:62.0   3rd Qu.:54.0  
##  Max.   :84.0   Max.   :67.0

We can get an overview of the structure of the data with the command “str(mydata)”.

str(mydata)
## 'data.frame':    10 obs. of  2 variables:
##  $ Variable1: num  6 65 12 56 45 84 38 46 64 24
##  $ Variable2: num  67 16 56 34 54 42 36 47 54 29

The command “head(mydata)” outputs the first six lines or elements of a data object.

head(mydata)
##   Variable1 Variable2
## 1         6        67
## 2        65        16
## 3        12        56
## 4        56        34
## 5        45        54
## 6        84        42

Importing data makes it available within “R”, where you can then inspect and edit it.
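
If you do not have the Excel file at hand, you can still try out the inspection commands. The sketch below rebuilds the ten rows printed above by hand (so it is a stand-in for the imported object, not the import itself) and adds “dim()” and “colMeans()” as two further quick checks.

```r
# rebuild the example data by hand from the values printed above
mydata <- data.frame(
  Variable1 = c(6, 65, 12, 56, 45, 84, 38, 46, 64, 24),
  Variable2 = c(67, 16, 56, 34, 54, 42, 36, 47, 54, 29)
)

dim(mydata)       # number of rows and columns
names(mydata)     # column names
colMeans(mydata)  # column means, matching the "Mean" row of summary(mydata)
```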

4.2 Importing plain text tables

Tables are often stored not as spreadsheets but as plain text files (i.e. files with the file extension .txt). This is done to save space and also because it can be “tidier”.

As the interactive way to load plain text tables is the most common way to load data into “R” among people who do not have much experience with “R”, we will elaborate a little bit on this function.

The function we use to load plain text tables is “read.table()” (or, as we will see later on, “read.delim()”). When we use the “choose.files()” function within the “read.table()” function, “R” opens a navigation window that allows us to browse to the data we want to load. In plain text files, tables are typically tab-separated. To interactively open a tab-separated .txt file, you need to enter the following command in R:

mydata <- read.table(choose.files(), header = T, sep = "\t", quote = "", comment.char = "")

The command (or the function, to be precise) contains a number of arguments, which we should briefly discuss here. To get help on a function, enter the command “?function name” in “R” (of course, the name of the function must be used instead of the sequence “function name”, e.g. “?read.table”).

Now to the arguments: first we define a name for the data, “mydata”, to which we assign the result of the function “read.table” with the sequence “<-”.

Then, we use the function “read.table”, which requires only the path to the data; here we supply the path by browsing to the data with “choose.files()”. In the case above, we also explicitly define the following arguments: “header”, “sep”, “quote”, and “comment.char”.

“choose.files()” tells “R” that we are interactively navigating to the file via a browser. “choose.files()” can also be replaced by the exact path to the file (e.g., “D:\MyProjects\RorLinguists/testdata1.txt”). In this case, “R” reads the file directly from the specified source. On a Windows machine, backslashes in a path must be doubled when the path is written as a string in “R” (a single backslash is treated as an escape character); forward slashes, as used on a Mac, work on Windows as well. The “header” argument indicates whether the table has a header row that defines the variables in the table. If the data has a header, we set “header = TRUE” or “header = T”, for short.

The “sep” argument is very important because it indicates how the data points in the file being read are separated. In most cases, the data points will be tab-separated, but there are also comma-separated files (.csv).

The setting “sep = "\t"” indicates that the data points are tab-separated, whereas “sep = " "” would indicate that the data points are separated by spaces and “sep = ","” would indicate that the data points are separated by commas. The argument “quote” informs “R” which characters delineate quotes; setting it to an empty string (“quote = ""”) turns quote handling off. The argument “comment.char” informs “R” which character marks programming comments rather than normal text; “comment.char = ""” turns this off as well.
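
The effect of these arguments can be tried out without any existing data. The following sketch writes two small temporary files (one tab-separated, one comma-separated) and reads them back with “read.table()”. The file contents, the “Word” column, and the object names are invented for illustration; “stringsAsFactors = FALSE” is set explicitly so that the result does not depend on the R version.

```r
# create a small tab-separated file in a temporary location
tab_file <- tempfile(fileext = ".txt")
writeLines(c("Word\tVariable1\tVariable2",
             "don't\t6\t67",
             "#tag\t65\t16"), tab_file)

# the same content as a comma-separated file
csv_file <- tempfile(fileext = ".csv")
writeLines(c("Word,Variable1,Variable2",
             "don't,6,67",
             "#tag,65,16"), csv_file)

# quote = "" keeps the apostrophe in "don't" from starting a quoted string,
# comment.char = "" keeps "#tag" from being treated as a comment
tab_data <- read.table(tab_file, header = TRUE, sep = "\t", quote = "",
                       comment.char = "", stringsAsFactors = FALSE)
csv_data <- read.table(csv_file, header = TRUE, sep = ",", quote = "",
                       comment.char = "", stringsAsFactors = FALSE)

tab_data
identical(tab_data, csv_data)  # both files yield the same data frame
```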

4.3 Loading more than one file: loading corpus data

Loading many files at once, as we typically do when we load a corpus, requires different functions, but it works very similarly to the way individual files are loaded. In such cases, however, it makes sense to specify the path rather than browse to the directory of the corpus.

# define path to corpus
corpuspath <- "D:\\Uni\\UQ\\LADAL\\SLCLADAL.github.io\\data\\testcorpus"

After we have specified the path, we can create a list of all the files that are in that directory:

# define files to load
corpus.files <- list.files(path = corpuspath, pattern = NULL, all.files = TRUE,
  full.names = TRUE, recursive = TRUE, ignore.case = TRUE,
  include.dirs = FALSE)  # list files only: scan() cannot read directories

Now, we loop over the files in the list and scan the content, i.e. we load the corpus into “R”.

# load corpus and start processing
corpus <- lapply(corpus.files, function(x) {
  x <- scan(x, what = "char", sep = "", quote = "", quiet = T, skipNul = T)
  x <- paste(x, sep = " ", collapse = " ")
  } )

To load the corpus, we have used two basic functions, “scan()” and “paste()”. The “scan()” function loads the data while the “paste()” function combines the individual words of each file into a single object.

We can inspect the corpus files with “corpus[[1]]” which shows us the first corpus file.
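
The same steps can be tried on a tiny made-up corpus. The sketch below creates two short text files in a temporary folder and loads them in the way described above; the folder name “democorpus”, the file names, and the texts are assumptions made purely for this illustration.

```r
# create a small demo corpus in a temporary folder
corpus_dir <- file.path(tempdir(), "democorpus")
dir.create(corpus_dir, showWarnings = FALSE)
writeLines(c("Linguistics is the study", "of language."),
           file.path(corpus_dir, "text1.txt"))
writeLines("Grammar is a system of rules.",
           file.path(corpus_dir, "text2.txt"))

# list the .txt files in the folder (full paths)
demo_files <- list.files(corpus_dir, pattern = "\\.txt$", full.names = TRUE)

# scan() splits each file into words, paste() joins them into one string
demo_corpus <- lapply(demo_files, function(f) {
  words <- scan(f, what = "character", sep = "", quote = "", quiet = TRUE)
  paste(words, collapse = " ")
})

# name each text after its file so it can be retrieved by name
names(demo_corpus) <- basename(demo_files)
demo_corpus[["text1.txt"]]
```

Naming the list elements after the files is optional, but it makes it easier to trace a text back to its source.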

# inspect first file
corpus[[1]]
## [1] "Linguistics is the scientific study of language. It involves analysing language form language meaning and language in context. The earliest activities in the documentation and description of language have been attributed to the th-century-BC Indian grammarian Pa?ini who wrote a formal description of the Sanskrit language in his A??adhyayi. Linguists traditionally analyse human language by observing an interplay between sound and meaning. Phonetics is the study of speech and non-speech sounds and delves into their acoustic and articulatory properties. The study of language meaning on the other hand deals with how languages encode relations between entities properties and other aspects of the world to convey process and assign meaning as well as manage and resolve ambiguity. While the study of semantics typically concerns itself with truth conditions pragmatics deals with how situational context influences the production of meaning."

Another way to inspect the corpus is “str(corpus)” which tells us about the structure of the corpus.

# inspect corpus structure
str(corpus)
## List of 7
##  $ : chr "Linguistics is the scientific study of language. It involves analysing language form language meaning and langu"| __truncated__
##  $ : chr "Grammar is a system of rules which governs the production and use of utterances in a given language. These rule"| __truncated__
##  $ : chr "In the early 20th century, Ferdinand de Saussure distinguished between the notions of langue and parole in his "| __truncated__
##  $ : chr "The study of parole (which manifests through cultural discourses and dialects) is the domain of sociolinguistic"| __truncated__
##  $ : chr "Stylistics also involves the study of written, signed, or spoken discourse through varying speech communities, "| __truncated__
##  $ : chr "Linguistics also deals with the social, cultural, historical and political factors that influence language, thr"| __truncated__
##  $ : chr "Related areas of study also includes the disciplines of semiotics (the study of direct and indirect language th"| __truncated__

The last way to inspect corpus data that we will discuss here is to use the “summary()” function which gives us a summary of the structure of the corpus.

# inspect corpus
summary(corpus)
##      Length Class  Mode     
## [1,] 1      -none- character
## [2,] 1      -none- character
## [3,] 1      -none- character
## [4,] 1      -none- character
## [5,] 1      -none- character
## [6,] 1      -none- character
## [7,] 1      -none- character
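
Since every corpus text is stored as a single character string, the summary above is not very informative about the size of the texts. As a possible complement, the sketch below counts words and characters per text; it uses a made-up two-element list instead of the real corpus, and the object names are hypothetical.

```r
# small stand-in for the corpus: a list with one string per text
demo_corpus <- list("Linguistics is the scientific study of language.",
                    "Grammar is a system of rules.")

# number of texts in the corpus
length(demo_corpus)

# number of space-separated words in each text
words_per_text <- sapply(demo_corpus, function(x) length(strsplit(x, " ")[[1]]))
words_per_text

# number of characters in each text
chars_per_text <- nchar(unlist(demo_corpus))
chars_per_text
```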

5 Exporting data

The following section shows how data can be exported from “R” and stored on your computer. The most common way to export data from “R” is to save a tab-separated, plain text file (i.e. a file with the extension .txt). To export data that was processed or generated in “R”, we typically use the “write.table()” function. This function takes the following arguments: The first argument, “x”, is the object to be saved. “x” does not have to be named explicitly, but the object should be given first. The second argument, “file”, tells “R” where to save the file. Since we have set the working directory, we only have to tell “R” which file name we want to give to the stored object. Again, the argument does not have to be named. The remaining arguments, “sep”, “col.names”, and “row.names”, specify the delimiter (for example tab, comma, or space) and whether column names and row names are written to the file.

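# combine all corpus texts into one string
testdatawords <- paste(as.vector(unlist(corpus)), sep = " ", collapse = " ")
# remove everything that is neither a letter nor whitespace
testdatawords <- gsub("[^[:alpha:][:space:]]*", "", testdatawords)
# split the string into individual words
testdatawords <- as.vector(unlist(strsplit(testdatawords, " ")))
# tabulate the words and sort them by frequency in descending order
testdatawords <- table(testdatawords)[order(table(testdatawords), decreasing = TRUE)]
head(testdatawords)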
## testdatawords
##       of      the      and language       in        a 
##       54       50       46       29       19       16

The data we want to output is a table called “testdatawords” which contains the words from the corpus and their frequencies in descending order.

# define path for the output file
outpath <- "data/testdatawords.txt"
# save data to pc
write.table(testdatawords, file = outpath, sep = "\t", col.names = TRUE, row.names = FALSE, quote = FALSE)
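
A simple way to check an export is to read the file back in. The sketch below does this for a tiny made-up frequency table written to a temporary file; converting the table to a data frame with explicit column names is a choice made for this example, and the names “word” and “frequency” are assumptions.

```r
# build a small frequency table from a made-up word vector
freq <- table(c("of", "the", "of", "language", "the", "of"))
freq <- freq[order(freq, decreasing = TRUE)]

# convert to a data frame with explicit column names
freq_df <- as.data.frame(freq, stringsAsFactors = FALSE)
names(freq_df) <- c("word", "frequency")

# export as a tab-separated text file (temporary file used for the demo)
out_file <- tempfile(fileext = ".txt")
write.table(freq_df, file = out_file, sep = "\t",
            col.names = TRUE, row.names = FALSE, quote = FALSE)

# read the file back in to verify the export
check <- read.table(out_file, header = TRUE, sep = "\t", quote = "",
                    comment.char = "", stringsAsFactors = FALSE)
check
```

If the re-imported data frame has the expected columns and values, the delimiter and header settings of the export and the import match.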

To save data directly as a Microsoft Excel file you must first activate the package “xlsx” and then apply the “write.xlsx” command:

library(xlsx)
write.xlsx(testdatawords, "data/testdatawords.xlsx")

There are many other ways to read and write data, and the tidyverse functions in particular can be interesting to explore as they are less prone to changing features of the data (such as converting character variables to factors). However, the functions explored above should give you some idea of how to get started.