1 Introduction

This tutorial introduces regular expressions and how they can be used when working with language data. The entire code for the sections below can be downloaded here.

How can you search texts for complex patterns or combinations of patterns? This question will answered in this tutorial and at the end you will be able to perform very complex searches yourself. The key concept of this tutorial is that of a regular expression. A regular expression (in shhort also called regex or regexp) is a special tsequence of characters (or string) for describing a search pattern. You can think of regular expressions as very powerful combinations of wildcards or as wildcards on steroids.

Preparation and session set up

Before we can start and because everything in this tutorial is based on R (a very flexible, user-friendly, and powerful programming environment), it is necessary to install R and RStudio. If these programs (or, in the case of R, environments) are not already installed on your machine, please search for them in your favorite search engine and add the term “download”. Open any of the first few links and follow the installation instructions (they are easy to follow, do not require any specifications, and are pretty much self-explanatory).

In addition, certain packages need to be installed from an R library so that the scripts shown below are executed without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, then you can skip ahead ignore this section. To install the necessary packages, simply run the following code - it may take some time (between 1 and 5 minutes to install all of the libraries so you do not need to worry if it takes some time).

# clean current workspace
# set options
options(stringsAsFactors = F)
# load packages

Once you have installed R, R-Studio, and have also initiated the session by executing the code shown above, you are good to go.

2 Getting started with Regular Expressions

To put regular expressions into practive, we need some text that we will perform out searches on. In this tutorial, we will use texts from wikipedia about grammar.

# read in first text
text1 <- readLines("https://slcladal.github.io/data/testcorpus/linguistics02.txt")
exampletext <-  paste(text1, sep = " ", collapse = " ")
# inspect exampletext
## [1] "Grammar is a system of rules which governs the production and use of utterances in a given language. These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."

Before we delve into using regular expressions, we will have a look at the regular expressions that can be used in R and also check what they stand for.

There are three basic types of regular expressions:

  • Regular expressions that stand for individual symbols and determine frequencies.

  • Regular expressions that stand for classes of symbols;

  • Regular expressions that stand for structural properties.

The regular exressions below show the first type of regular xepressions, i.e. regular expressions that stand for individual symbols and determine frequencies.

? The preceding item is optional and will be matched at most once
* The preceding item will be matched zero or more times
+ The preceding item will be matched one or more times
{n} The preceding item is matched exactly n times
{n,} The preceding item is matched n or more times
{n,m} The preceding item is matched at least n times, but not more than m times

The regular exressions below show the second type of regular xepressions, i.e. regular expressions that stand for classes of symbols.

[:digit:] Digits: 0 1 2 3 4 5 6 7 8 9
[:lower:] Lowercase characters: a–z
[:upper:] Uppercase characters: A–Z
[:alpha:] Alphabetic characters: a–z and A–Z
[:alnum:] Digits and alphabetic characters
[:punct:] Punctuation characters: . , ; etc.
[:graph:] Graphical characters: [:alnum:] and [:punct:]
[:blank:] Blank characters: Space and tab
[:space:] Space characters: Space, tab, newline, and other space characters
[:print:] Printable characters: [:alnum:], [:punct:] and [:space:]

The regular expressions that denote classes of symbols are enclosed in [] and :. The last type of regular expressions, i.e. regular expressions that stand for structural properties are shown below.

\w Word characters: [[:alnum:]_]
\W No word characters: [ˆ[:alnum:]_]
\s Space characters: [[:blank:]]
\S No space characters: [ˆ[:blank:]]
\d Digits: [[:digit:]]
\D No digits: [ˆ[:digit:]]
\b Word edge
\B No word edge
\< Word beginning
\> Word end

3 Exercises for regular expressions

In the following we will combine and use these regular expressions to change, modify, and replace patters that we will define using the regular expressions above.

Matching patterns

In the following, we use str_view() and str_view_all() to show how regular expressions work because these functions show the matches of a regular expression.

To begin with, we match an exactly defined pattern (“ang”).

str_view_all(exampletext, "ang")

Now, we include . which stands for nay symbol (except a new line symbol).

str_view_all(exampletext, ".n.")

How to cite this tutorial

Schweinberger, Martin. 2020. Regular Expressions in R. Brisbane: The University of Queensland. url: https://slcladal.github.io/regularexpressions.html.