1 Introduction

This tutorial introduces regular expressions and how they can be used when working with language data. The “R”-code for the sections below can be downloaded here.

2 Preparation

As the following focuses on regular expressions in “R”, it is necessary to install “R”, “RStudio”, and “Tinn-R”. If these programms (or, in the case of “R”, environments) are not installed yet, please search for them in your favorite search engine and add the term “download”. Open any of the first few links and follow the installation instructions (they are easy to follow, do not require any specifications, and are pretty much self-explanatory).

In addition, we clean the existign work spece and it is also useful to set certain options so that “R” does, e.g., not automatically convert strings into factors.

# clean current workspace
rm(list=ls(all=T))
# set options
options(stringsAsFactors = F)

Once you have installed “R”, “R-Studio”, “Tinn-R”, and have also initiated the session by executing the code shown above, you are good to go.

. 
[:digit:] Digits: 0 1 2 3 4 5 6 7 8 9
[:lower:] Lowercase characters: a–z
[:upper:] Uppercase characters: A–Z
[:alpha:] Alphabetic characters: a–z and A–Z
[:alnum:] Digits and alphabetic characters
[:punct:] Punctuation characters: . , ; etc.
[:graph:] Graphical characters: [:alnum:] and [:punct:]
[:blank:] Blank characters: Space and tab
[:space:] Space characters: Space, tab, newline, and other space characters
[:print:] Printable characters: [:alnum:], [:punct:] and [:space:]

? The preceding item is optional and will be matched at most once
* The preceding item will be matched zero or more times
+ The preceding item will be matched one or more times
{n} The preceding item is matched exactly n times
{n,} The preceding item is matched n or more times
{n,m} The preceding item is matched at least n times, but not more than m times

\w Word characters: [[:alnum:]_]
\W No word characters: [ˆ[:alnum:]_]
\s Space characters: [[:blank:]]
\S No space characters: [ˆ[:blank:]]
\d Digits: [[:digit:]]
\D No digits: [ˆ[:digit:]]
\b Word edge
\B No word edge
\< Word beginning
\> Word end

Exercises: regular expressions