1 Introduction

This tutorial introduces regular expressions and how they can be used when working with language data. Regular expressions are powerful tools used to search and manipulate text patterns. They provide a way to find specific sequences of characters within larger bodies of text. Think of them as search patterns on steroids. Regular expressions are useful for tasks like extracting specific words, finding patterns, or replacing text in bulk. They offer a concise and flexible way to describe complex text patterns using symbols and special characters. Regular expressions have applications in linguistics and humanities research, aiding in tasks such as text analysis, corpus linguistics, and language processing. Understanding regular expressions can unlock new possibilities for exploring and analyzing textual data.

This tutorial is aimed at beginners and intermediate users of R and showcases how to use regular expressions (or wildcards) in R. The aim is not to provide a fully-fledged analysis but rather to show and exemplify selected useful functions and methods associated with regular expressions.

To be able to follow this tutorial, we suggest you check out and familiarize yourself with the content of the following R Basics tutorials:

Click here¹ to download the entire R Notebook for this tutorial.



How can you search texts for complex patterns or combinations of patterns? This question will be answered in this tutorial, and at the end you will be able to perform very complex searches yourself. The key concept of this tutorial is that of a regular expression. A regular expression (in short also called regex or regexp) is a special sequence of characters (or string) for describing a search pattern. You can think of regular expressions as very powerful combinations of wildcards or as wildcards on steroids.

If you would like to get deeper into regular expressions, I can recommend Friedl (2006) and, in particular, chapter 17 of Peng (2020) for further study. The latter uses base R rather than tidyverse functions, but this does not affect the usefulness of its discussion of regular expressions. Also, here is a so-called cheatsheet about regular expressions written by Ian Kopacka and provided by RStudio. Nick Thieberger has also recorded a very nice Introduction to Regular Expressions for humanities scholars, which is available on YouTube.

Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to R and more information on how to use it here. For this tutorial, we need to install certain packages so that the scripts shown below run without errors. If you have already installed the packages mentioned below, you can skip this section. Otherwise, install them by running the following code. Installation may take between 1 and 5 minutes, so there is no need to worry if it takes some time.

# set options
options(stringsAsFactors = FALSE)     # no automatic data transformation
options("scipen" = 100, "digits" = 4) # suppress math annotation
# install packages
install.packages("tidyverse")
install.packages("flextable")
install.packages("htmlwidgets")
# install klippy for copy-to-clipboard button in code chunks
install.packages("remotes")
remotes::install_github("rlesur/klippy")

In a next step, we load the packages.

library(tidyverse)
library(flextable)
# activate klippy for copy-to-clipboard button
klippy::klippy()

Once you have installed RStudio and have initiated the session by executing the code shown above, you are good to go.

2 Getting started with Regular Expressions

To put regular expressions into practice, we need some text that we will perform our searches on. In this tutorial, we will use texts from Wikipedia about grammar.

# read in first text
text1 <- readLines("https://slcladal.github.io/data/testcorpus/linguistics02.txt")
et <-  paste(text1, sep = " ", collapse = " ")
# inspect example text
et
## [1] "Grammar is a system of rules which governs the production and use of utterances in a given language. These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."

In addition, we will split the example text into words to have another resource we can use to understand regular expressions.

# split example text
set <- str_split(et, " ") %>%
  unlist()
# inspect
head(set)
## [1] "Grammar" "is"      "a"       "system"  "of"      "rules"

Before we delve into using regular expressions, we will have a look at the regular expressions that can be used in R and also check what they stand for.

There are three basic types of regular expressions:

  • regular expressions that stand for individual symbols and determine frequencies

  • regular expressions that stand for classes of symbols

  • regular expressions that stand for structural properties

The regular expressions below show the first type of regular expressions, i.e. regular expressions that stand for individual symbols and determine frequencies.

Regular expressions that stand for individual symbols and determine frequencies.

| RegEx Symbol/Sequence | Explanation | Example |
|---|---|---|
| ? | The preceding item is optional and will be matched at most once | walk[a-z]? = walk, walks |
| * | The preceding item will be matched zero or more times | walk[a-z]* = walk, walks, walked, walking |
| + | The preceding item will be matched one or more times | walk[a-z]+ = walks, walked, walking |
| {n} | The preceding item is matched exactly n times | walk[a-z]{2} = walked |
| {n,} | The preceding item is matched n or more times | walk[a-z]{2,} = walked, walking |
| {n,m} | The preceding item is matched at least n times, but not more than m times | walk[a-z]{2,3} = walked, walking |
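
The quantifiers above can be tried out on a small vector of word forms. The following is a minimal sketch; the vector forms is a made-up example and not part of the tutorial data. Note that the patterns are wrapped in ^ and $ here: without these anchors, str_detect only checks whether the pattern occurs somewhere inside a word, so walk[a-z]? would also be detected in walking.

```r
library(stringr)

# hypothetical toy vector (not part of the tutorial data)
forms <- c("walk", "walks", "walked", "walking", "talk")

# ^ and $ force the pattern to cover the whole word
q_opt   <- forms[str_detect(forms, "^walk[a-z]?$")]      # zero or one letter after walk
q_star  <- forms[str_detect(forms, "^walk[a-z]*$")]      # zero or more letters
q_plus  <- forms[str_detect(forms, "^walk[a-z]+$")]      # one or more letters
q_two   <- forms[str_detect(forms, "^walk[a-z]{2}$")]    # exactly two letters
q_range <- forms[str_detect(forms, "^walk[a-z]{2,3}$")]  # two or three letters

q_opt    # "walk" "walks"
q_star   # "walk" "walks" "walked" "walking"
q_plus   # "walks" "walked" "walking"
q_two    # "walked"
q_range  # "walked" "walking"
```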

The regular expressions below show the second type of regular expressions, i.e. regular expressions that stand for classes of symbols.

Regular expressions that stand for classes of symbols.

| RegEx Symbol/Sequence | Explanation |
|---|---|
| [ab] | lower case a or b |
| [a-z] | all lower case characters from a to z |
| [AB] | upper case A or B |
| [A-Z] | all upper case characters from A to Z |
| [12] | digits 1 or 2 |
| [0-9] | digits: 0 1 2 3 4 5 6 7 8 9 |
| [:digit:] | digits: 0 1 2 3 4 5 6 7 8 9 |
| [:lower:] | lower case characters: a–z |
| [:upper:] | upper case characters: A–Z |
| [:alpha:] | alphabetic characters: a–z and A–Z |
| [:alnum:] | digits and alphabetic characters |
| [:punct:] | punctuation characters: . , ; etc. |
| [:graph:] | graphical characters: [:alnum:] and [:punct:] |
| [:blank:] | blank characters: space and tab |
| [:space:] | space characters: space, tab, newline, and other space characters |
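
As a rough illustration of how such classes behave, the sketch below applies a few of them to a made-up string (x is a hypothetical example, not tutorial data). The named classes are written inside an additional pair of square brackets, e.g. [[:upper:]], because this form is accepted by both stringr and base R.

```r
library(stringr)

# hypothetical example string
x <- "Room 12B, Floor 3: open 9-5!"

digits   <- str_extract_all(x, "[0-9]+")[[1]]       # runs of digits
uppers   <- str_extract_all(x, "[[:upper:]]")[[1]]  # single upper case letters
no_punct <- str_remove_all(x, "[[:punct:]]")        # strip punctuation

digits    # "12" "3" "9" "5"
uppers    # "R" "B" "F"
no_punct  # "Room 12B Floor 3 open 95"
```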

Named classes of symbols are written between [: and :], for example [:alpha:]. The last type of regular expressions, i.e. regular expressions that stand for structural properties, is shown below.

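Regular expressions that stand for structural properties.

| RegEx Symbol/Sequence | Explanation |
|---|---|
| \\w | Word characters: [[:alnum:]_] |
| \\W | Non-word characters: [^[:alnum:]_] |
| \\s | Space characters: [[:space:]] |
| \\S | Non-space characters: [^[:space:]] |
| \\d | Digits: [[:digit:]] |
| \\D | Non-digits: [^[:digit:]] |
| \\b | Word boundary |
| \\B | Not a word boundary |
| \\< | Word beginning (base R regex functions; stringr does not treat this as a word anchor) |
| \\> | Word end (base R regex functions; stringr does not treat this as a word anchor) |
| ^ | Beginning of a string |
| $ | End of a string |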
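
The following sketch shows some of these structural expressions in action on a made-up sentence (x is a hypothetical example). It assumes stringr for most calls and uses base R gsub, with its default regex engine, for the \\< and \\> word anchors.

```r
library(stringr)

# hypothetical example sentence
x <- "In 2022, the cat sat in category 5."

numbers    <- str_extract_all(x, "\\d+")[[1]]       # runs of digits
whole_word <- str_extract_all(x, "\\bcat\\b")[[1]]  # cat as a whole word only
substrings <- str_count(x, "cat")                   # also counts cat in category
n_spaces   <- str_count(x, "\\s")                   # whitespace characters
starts_in  <- str_detect(x, "^In")                  # string begins with In
ends_dot   <- str_detect(x, "\\.$")                 # string ends in a literal dot

# base R: \\< and \\> mark the beginning and end of a word
base_swap <- gsub("\\<cat\\>", "dog", x)

numbers     # "2022" "5"
whole_word  # "cat"
substrings  # 2
n_spaces    # 7
base_swap   # "In 2022, the dog sat in category 5."
```

The dot in the last stringr pattern is escaped (\\.) because an unescaped dot would match any character.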

3 Practice

In this section, we will explore how to use regular expressions. At the end, we will go through some exercises to help you understand how you can best utilize regular expressions.

Show all words in the split example text that contain a or n.

set[str_detect(set, "[an]")]
##  [1] "Grammar"      "a"            "governs"      "production"   "and"         
##  [6] "utterances"   "in"           "a"            "given"        "language."   
## [11] "apply"        "sound"        "as"           "as"           "meaning,"    
## [16] "and"          "include"      "componential" "as"           "pertaining"  
## [21] "phonology"    "organisation" "phonetic"     "sound"        "formation"   
## [26] "and"          "composition"  "and"          "syntax"       "formation"   
## [31] "and"          "composition"  "phrases"      "and"          "sentences)." 
## [36] "Many"         "modern"       "that"         "deal"         "principles"  
## [41] "grammar"      "are"          "based"        "on"           "Noam"        
## [46] "framework"    "generative"   "linguistics."

Show all words in the split example text that begin with a lower case a.

set[str_detect(set, "^a")]
##  [1] "a"     "and"   "a"     "apply" "as"    "as"    "and"   "as"    "and"  
## [10] "and"   "and"   "and"   "are"

Show all words in the split example text that end in a lower case s.

set[str_detect(set, "s$")]
##  [1] "is"         "rules"      "governs"    "utterances" "rules"     
##  [6] "as"         "as"         "subsets"    "as"         "phrases"   
## [11] "theories"   "principles" "Chomsky's"

Show all words in the split example text in which there is an e, then any other character, and then an n.

set[str_detect(set, "e.n")]
## [1] "governs"  "meaning," "modern"

Show all words in the split example text in which there is an e, then two other characters, and then an n.

set[str_detect(set, "e.{2,2}n")]
## [1] "utterances"

Show all words that consist of exactly three alphabetical characters in the split example text.

set[str_detect(set, "^[:alpha:]{3,3}$")]
##  [1] "the" "and" "use" "and" "and" "and" "and" "and" "the" "are"
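
If you prefer base R, the same kind of search can be sketched with grepl. This is only an illustration on a made-up vector (words is hypothetical): in base R, named classes such as [:alpha:] have to sit inside a bracket expression, i.e. [[:alpha:]], and {3} is a shorter way of writing {3,3}.

```r
# hypothetical toy vector (not part of the tutorial data)
words <- c("the", "(the", "and", "use", "rules", "a")

# stringr: the double-bracket form works here as well
three_stringr <- words[stringr::str_detect(words, "^[[:alpha:]]{3}$")]

# base R equivalent
three_base <- words[grepl("^[[:alpha:]]{3}$", words)]

three_stringr  # "the" "and" "use"
three_base     # "the" "and" "use"
```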

Show all words that consist of six or more alphabetical characters in the split example text.

set[str_detect(set, "^[:alpha:]{6,}$")]
##  [1] "Grammar"      "system"       "governs"      "production"   "utterances"  
##  [6] "include"      "componential" "subsets"      "pertaining"   "phonology"   
## [11] "organisation" "phonetic"     "morphology"   "formation"    "composition" 
## [16] "syntax"       "formation"    "composition"  "phrases"      "modern"      
## [21] "theories"     "principles"   "grammar"      "framework"    "generative"

Replace all lower case as with upper case Es in the example text.

str_replace_all(et, "a", "E")
## [1] "GrEmmEr is E system of rules which governs the production End use of utterEnces in E given lEnguEge. These rules Epply to sound Es well Es meEning, End include componentiEl subsets of rules, such Es those pertEining to phonology (the orgEnisEtion of phonetic sound systems), morphology (the formEtion End composition of words), End syntEx (the formEtion End composition of phrEses End sentences). MEny modern theories thEt deEl with the principles of grEmmEr Ere bEsed on NoEm Chomsky's frEmework of generEtive linguistics."
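
Two details of the replacement above are worth trying out on a shorter string: str_replace changes only the first match, whereas str_replace_all changes every match, and a character class such as [Aa] is needed if upper case letters should be affected too. A small sketch on a made-up string (x is hypothetical):

```r
library(stringr)

# hypothetical example string
x <- "A banana and an apple"

first_only <- str_replace(x, "a", "E")         # first lower case a only
all_lower  <- str_replace_all(x, "a", "E")     # every lower case a
both_cases <- str_replace_all(x, "[Aa]", "E")  # lower and upper case a

first_only  # "A bEnana and an apple"
all_lower   # "A bEnEnE End En Epple"
both_cases  # "E bEnEnE End En Epple"
```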

Remove all non-word characters (everything except letters, digits, and underscores) from the split example text.

str_remove_all(set, "\\W")
##  [1] "Grammar"      "is"           "a"            "system"       "of"          
##  [6] "rules"        "which"        "governs"      "the"          "production"  
## [11] "and"          "use"          "of"           "utterances"   "in"          
## [16] "a"            "given"        "language"     "These"        "rules"       
## [21] "apply"        "to"           "sound"        "as"           "well"        
## [26] "as"           "meaning"      "and"          "include"      "componential"
## [31] "subsets"      "of"           "rules"        "such"         "as"          
## [36] "those"        "pertaining"   "to"           "phonology"    "the"         
## [41] "organisation" "of"           "phonetic"     "sound"        "systems"     
## [46] "morphology"   "the"          "formation"    "and"          "composition" 
## [51] "of"           "words"        "and"          "syntax"       "the"         
## [56] "formation"    "and"          "composition"  "of"           "phrases"     
## [61] "and"          "sentences"    "Many"         "modern"       "theories"    
## [66] "that"         "deal"         "with"         "the"          "principles"  
## [71] "of"           "grammar"      "are"          "based"        "on"          
## [76] "Noam"         "Chomskys"     "framework"    "of"           "generative"  
## [81] "linguistics"

Remove all white spaces in the example text.

str_remove_all(et, " ")
## [1] "Grammarisasystemofruleswhichgovernstheproductionanduseofutterancesinagivenlanguage.Theserulesapplytosoundaswellasmeaning,andincludecomponentialsubsetsofrules,suchasthosepertainingtophonology(theorganisationofphoneticsoundsystems),morphology(theformationandcompositionofwords),andsyntax(theformationandcompositionofphrasesandsentences).ManymoderntheoriesthatdealwiththeprinciplesofgrammararebasedonNoamChomsky'sframeworkofgenerativelinguistics."
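
Note that the pattern " " only removes ordinary spaces. If a text also contains tabs or line breaks, \\s covers all of them. The sketch below uses a made-up string (x is hypothetical) to show the difference, and also how runs of white space can be collapsed into single spaces instead of being removed.

```r
library(stringr)

# hypothetical string with a double space, a tab, and a line break
x <- "Grammar  is\ta system\nof rules"

spaces_only <- str_remove_all(x, " ")            # tab and newline remain
all_white   <- str_remove_all(x, "\\s")          # removes every whitespace character
collapsed   <- str_replace_all(x, "\\s+", " ")   # one space per whitespace run

spaces_only  # "Grammaris\tasystem\nofrules"
all_white    # "Grammarisasystemofrules"
collapsed    # "Grammar is a system of rules"
```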

Highlighting patterns

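We use the str_view and str_view_all functions to show the occurrences of regular expressions in the example text. Note that from stringr 1.5.0 onwards, str_view highlights all matches by itself and str_view_all is deprecated, so the calls below can also be written with str_view.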

To begin with, we match an exactly defined pattern (ang).

str_view_all(et, "ang")
## [1] │ Grammar is a system of rules which governs the production and use of utterances in a given l<ang>uage. These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics.

Now, we include . which stands for any symbol (except a new line symbol).

str_view_all(et, ".n.")
## [1] │ Grammar is a system of rules which gove<rns> the producti<on ><and> use of utter<anc>es <in >a giv<en >l<ang>uage. These rules apply to so<und> as well as me<ani>ng, <and> <inc>lude comp<one>ntial subsets of rules, such as those perta<ini>ng to ph<ono>logy (the org<ani>sati<on >of ph<one>tic so<und> systems), morphology (the formati<on ><and> compositi<on >of words), <and> s<ynt>ax (the formati<on ><and> compositi<on >of phrases <and> s<ent><enc>es). M<any> mode<rn >theories that deal with the pr<inc>iples of grammar are based <on >Noam Chomsky's framework of g<ene>rative l<ing>uistics.
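
Highlighting is meant for visual inspection. If the matches are needed as data, functions such as str_extract_all, str_count, and str_locate return them as R objects. A minimal sketch on a shortened, made-up string (x is hypothetical):

```r
library(stringr)

# hypothetical short example string
x <- "a given language"

hits   <- str_extract_all(x, ".n.")[[1]]  # matched substrings (non-overlapping)
n_hits <- str_count(x, ".n.")             # number of matches
loc    <- str_locate(x, "ang")            # start and end position of the first match

hits    # "en " "ang"
n_hits  # 2
loc     # start 10, end 12
```

Because matches do not overlap, each character can belong to at most one match, which is why the highlighted stretches in the output above never share a character.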

EXERCISE TIME!

  1. What regular expression can you use to extract all forms of walk from a text?
Answer: [Ww][Aa][Ll][Kk].*

More exercises will follow - bear with us ;)
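
The answer to the exercise works well on a text that has already been split into words. On running text, .* is greedy and continues to the end of the line, so a more restrictive pattern may be preferable. The sketch below illustrates this on a made-up sentence (x is hypothetical); the alternative patterns are suggestions, not the only possible solution.

```r
library(stringr)

# hypothetical example sentence
x <- "She walks home. Walking is healthy, so we walked and will walk again."

# pattern from the exercise: one long match from the first walk to the end
greedy <- str_extract_all(x, "[Ww][Aa][Ll][Kk].*")[[1]]

# restrict the match to letters and start at a word boundary
forms <- str_extract_all(x, "\\b[Ww][Aa][Ll][Kk][[:alpha:]]*")[[1]]

# same idea, with case handled by regex(ignore_case = TRUE)
forms_ic <- str_extract_all(x, regex("\\bwalk[a-z]*", ignore_case = TRUE))[[1]]

length(greedy)  # 1
forms           # "walks" "Walking" "walked" "walk"
forms_ic        # "walks" "Walking" "walked" "walk"
```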


Citation & Session Info

Schweinberger, Martin. 2022. Regular Expressions in R. Brisbane: The University of Queensland. url: https://ladal.edu.au/regex.html (Version 2022.11.17).

@manual{schweinberger2022regex,
  author = {Schweinberger, Martin},
  title = {Regular Expressions in R},
  note = {https://ladal.edu.au/regex.html},
  year = {2022},
  organization = {The University of Queensland, School of Languages and Cultures},
  address = {Brisbane},
  edition = {2022.11.17}
}
sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22621)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Australia.utf8  LC_CTYPE=English_Australia.utf8   
## [3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C                      
## [5] LC_TIME=English_Australia.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] flextable_0.9.1 lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0  
##  [5] dplyr_1.1.2     purrr_1.0.1     readr_2.1.4     tidyr_1.3.0    
##  [9] tibble_3.2.1    ggplot2_3.4.2   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.10             assertthat_0.2.1        digest_0.6.31          
##  [4] utf8_1.2.3              mime_0.12               R6_2.5.1               
##  [7] evaluate_0.21           highr_0.10              pillar_1.9.0           
## [10] gdtools_0.3.3           rlang_1.1.1             uuid_1.1-0             
## [13] curl_5.0.0              rstudioapi_0.14         data.table_1.14.8      
## [16] jquerylib_0.1.4         klippy_0.0.0.9500       rmarkdown_2.21         
## [19] textshaping_0.3.6       munsell_0.5.0           shiny_1.7.4            
## [22] compiler_4.2.2          httpuv_1.6.11           xfun_0.39              
## [25] askpass_1.1             pkgconfig_2.0.3         systemfonts_1.0.4      
## [28] gfonts_0.2.0            htmltools_0.5.5         openssl_2.0.6          
## [31] tidyselect_1.2.0        fontBitstreamVera_0.1.1 httpcode_0.3.0         
## [34] fansi_1.0.4             crayon_1.5.2            tzdb_0.4.0             
## [37] withr_2.5.0             later_1.3.1             crul_1.4.0             
## [40] grid_4.2.2              jsonlite_1.8.4          xtable_1.8-4           
## [43] gtable_0.3.3            lifecycle_1.0.3         magrittr_2.0.3         
## [46] scales_1.2.1            zip_2.3.0               cli_3.6.1              
## [49] stringi_1.7.12          cachem_1.0.8            promises_1.2.0.1       
## [52] xml2_1.3.4              bslib_0.4.2             ragg_1.2.5             
## [55] ellipsis_0.3.2          generics_0.1.3          vctrs_0.6.2            
## [58] tools_4.2.2             glue_1.6.2              officer_0.6.2          
## [61] fontquiver_0.2.1        hms_1.1.3               fastmap_1.1.1          
## [64] yaml_2.3.7              timechange_0.2.0        colorspace_2.1-0       
## [67] fontLiberation_0.1.0    knitr_1.43              sass_0.4.6



References

Friedl, Jeffrey EF. 2006. Mastering Regular Expressions. Sebastopol, CA: O’Reilly Media.
Peng, Roger D. 2020. R Programming for Data Science. Leanpub. https://bookdown.org/rdpeng/rprogdatascience/.

  1. If you want to render the R Notebook on your machine, i.e. knit the document to html or pdf, you need to make sure that you have R and RStudio installed, and you also need to download the bibliography file and store it in the same folder as the Rmd file.