1 Introduction

This tutorial introduces regular expressions and how they can be used when working with language data. The entire R markdown document for the sections below can be downloaded here.

How can you search texts for complex patterns or combinations of patterns? This tutorial answers that question, and by the end you will be able to perform very complex searches yourself. The key concept of this tutorial is the regular expression. A regular expression (regex or regexp for short) is a special sequence of characters (a string) that describes a search pattern. You can think of regular expressions as very powerful combinations of wildcards, or as wildcards on steroids.

Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to R and more information on how to use it here. For this tutorial, we need to install certain packages so that the scripts shown below run without errors. If you have already installed the packages mentioned below, you can skip this section. Otherwise, run the code below this paragraph before continuing. Installation may take some time (between 1 and 5 minutes), so there is no need to worry if it does not finish immediately.

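# clean current workspace
rm(list = ls(all.names = TRUE))
# set options
options(stringsAsFactors = FALSE)
# load packages (run install.packages("stringr") once if stringr is not installed)
library(stringr)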

Once you have installed RStudio and initiated the session by executing the code shown above, you are good to go.

2 Getting started with Regular Expressions

To put regular expressions into practice, we need some text to perform our searches on. In this tutorial, we will use texts from Wikipedia about grammar.

# read in first text
text1 <- readLines("https://slcladal.github.io/data/testcorpus/linguistics02.txt")
# collapse the lines into a single string
exampletext <- paste(text1, collapse = " ")
# inspect exampletext
exampletext
## [1] "Grammar is a system of rules which governs the production and use of utterances in a given language. These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."

Before we delve into using regular expressions, we will have a look at the regular expressions that can be used in R and also check what they stand for.

There are three basic types of regular expressions:

  • regular expressions that stand for individual symbols and determine frequencies

  • regular expressions that stand for classes of symbols

  • regular expressions that stand for structural properties

The regular expressions below show the first type of regular expressions, i.e. regular expressions that stand for individual symbols and determine frequencies.
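As a minimal sketch of how frequency markers behave, the following example applies a few common quantifiers to a small vector. The vector `words` and the result names are made up for illustration and are not part of the tutorial data; `str_extract()` comes from the stringr package.

```r
library(stringr)

# hypothetical sample vector (not part of the tutorial data)
words <- c("rule", "rules", "ruless", "rle")

# ? : the preceding symbol occurs zero or one time
opt_s <- str_extract(words, "rules?")
# + : the preceding symbol occurs one or more times
plus_s <- str_extract(words, "rules+")
# * : the preceding symbol occurs zero or more times
star_s <- str_extract(words, "rules*")
# {2} : the preceding symbol occurs exactly two times
two_s <- str_extract(words, "rules{2}")

opt_s   # "rule" "rules" "rules" NA
plus_s  # NA "rules" "ruless" NA
star_s  # "rule" "rules" "ruless" NA
two_s   # NA NA "ruless" NA
```

Elements without a match are returned as NA. Quantifiers are greedy by default, which is why `rules*` captures both s symbols in "ruless".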

The regular expressions below show the second type of regular expressions, i.e. regular expressions that stand for classes of symbols.
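To illustrate classes of symbols, here is a short sketch that extracts digits, upper case letters, punctuation, and alphabetic sequences from a sample string. The string `snippet` and the result names are assumptions made for this example only.

```r
library(stringr)

# hypothetical sample string (not part of the tutorial data)
snippet <- "Noam Chomsky, 1957: Syntactic Structures!"

# [[:digit:]] : any digit; + extends the match over adjacent digits
digits <- str_extract_all(snippet, "[[:digit:]]+")[[1]]
# [[:upper:]] : any upper case letter
uppers <- str_extract_all(snippet, "[[:upper:]]")[[1]]
# [[:punct:]] : any punctuation symbol
puncts <- str_extract_all(snippet, "[[:punct:]]")[[1]]
# [[:alpha:]] : any letter; + yields whole alphabetic sequences
alphas <- str_extract_all(snippet, "[[:alpha:]]+")[[1]]

digits  # "1957"
uppers  # "N" "C" "S" "S"
puncts  # "," ":" "!"
alphas  # "Noam" "Chomsky" "Syntactic" "Structures"
```

Note that the class name, including its colons, sits inside an additional pair of square brackets when it is used in a pattern.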

The regular expressions that denote classes of symbols are enclosed in [] and :, for example [:alpha:]. The last type of regular expressions, i.e. regular expressions that stand for structural properties, is shown below.
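As a brief sketch of structural properties, the example below uses three widely supported markers: ^ (start of a string), $ (end of a string), and \\b (word boundary). This is only a selection, and the vector `phrases` and the result names are hypothetical.

```r
library(stringr)

# hypothetical sample vector (not part of the tutorial data)
phrases <- c("grammar rules", "rules of grammar", "ungrammatical")

# ^ : the pattern must occur at the start of the string
starts <- str_detect(phrases, "^grammar")
# $ : the pattern must occur at the end of the string
ends <- str_detect(phrases, "grammar$")
# \\b : word boundary, so grammar must be a whole word
whole <- str_detect(phrases, "\\bgrammar\\b")

starts  # TRUE FALSE FALSE
ends    # FALSE TRUE FALSE
whole   # TRUE TRUE FALSE
```

In R strings the backslash itself has to be escaped, which is why the word boundary is written as \\b rather than \b.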

3 Exercises for regular expressions

In the following, we will combine these regular expressions to change, modify, and replace patterns that we define using the regular expressions above.

Matching patterns

In the following, we use str_view() and str_view_all() to show how regular expressions work because these functions show the matches of a regular expression.

To begin with, we match an exactly defined pattern (“ang”).

str_view_all(exampletext, "ang")
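Because str_view_all() renders its result in a viewer (and is deprecated in favour of str_view() from stringr 1.5.0 onwards), it can help to check matches in the console as well. The sketch below does this with the first sentence of the example text, stored in a variable named `sentence` that is introduced here only for illustration.

```r
library(stringr)

# first sentence of the example text, stored under a name chosen for this sketch
sentence <- "Grammar is a system of rules which governs the production and use of utterances in a given language."

# count how often the literal pattern occurs
n_ang <- str_count(sentence, "ang")
# extract the literal matches
hits <- str_extract_all(sentence, "ang")[[1]]
# extract the whole word(s) containing the pattern
words_with_ang <- str_extract_all(sentence, "[[:alpha:]]*ang[[:alpha:]]*")[[1]]

n_ang           # 1
hits            # "ang"
words_with_ang  # "language"
```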

Now, we include ., which stands for any symbol (except a newline).

str_view_all(exampletext, ".n.")
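The following sketch shows three properties of . on short, made-up strings: it also matches spaces, matches do not overlap, and it does not match a newline by default. All variable names are hypothetical.

```r
library(stringr)

# . matches any symbol, including a space
with_space <- str_extract_all("a given language", ".n.")[[1]]
# matches do not overlap: after "ana" is consumed, no second match remains
no_overlap <- str_extract_all("banana", ".n.")[[1]]
# . does not match a newline by default
across_newline <- str_detect("a\nb", "a.b")

with_space      # "en " "ang"
no_overlap      # "ana"
across_newline  # FALSE
```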

Citation & Session Info

Schweinberger, Martin. 2020. Regular Expressions in R. Brisbane: The University of Queensland. url: https://slcladal.github.io/regularexpressions.html (Version 2020.09.29).

  author = {Schweinberger, Martin},
  title = {Regular Expressions in R},
  note = {https://slcladal.github.io/regularexpressions.html},
  year = {2020},
  organization = {The University of Queensland, School of Languages and Cultures},
  address = {Brisbane},
  edition = {2020/09/29}
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## Matrix products: default
## locale:
## [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.1252    
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## other attached packages:
## [1] DT_0.15          knitr_1.30       kableExtra_1.2.1 stringr_1.4.0   
## loaded via a namespace (and not attached):
##  [1] rstudioapi_0.11   xml2_1.3.2        magrittr_1.5      rvest_0.3.6      
##  [5] munsell_0.5.0     colorspace_1.4-1  viridisLite_0.3.0 R6_2.4.1         
##  [9] rlang_0.4.7       httr_1.4.2        tools_4.0.2       webshot_0.5.2    
## [13] xfun_0.16         crosstalk_1.1.0.1 htmltools_0.5.0   yaml_2.2.1       
## [17] digest_0.6.25     lifecycle_0.2.0   htmlwidgets_1.5.1 glue_1.4.2       
## [21] evaluate_0.14     rmarkdown_2.3     stringi_1.5.3     compiler_4.0.2   
## [25] scales_1.1.1      jsonlite_1.7.1
