--- title: "String Processing in R" author: "Martin Schweinberger" date: "" output: bookdown::html_document2 bibliography: bibliography.bib link-citations: yes --- ```{r uq1, echo=F, eval = T, fig.cap="", message=FALSE, warning=FALSE, out.width='100%'} knitr::include_graphics("https://slcladal.github.io/images/uq1.jpg") ``` # Introduction{-} ```{r diff, echo=FALSE, out.width= "15%", out.extra='style="float:right; padding:10px"'} knitr::include_graphics("https://slcladal.github.io/images/gy_chili.jpg") ``` This tutorial introduces string processing and it is aimed at beginners and intermediate users of R with the aim of showcasing how to work with and process textual data using R. The aim is not to provide a fully-fledged analysis but rather to show and exemplify selected useful functions and methods associated with text processing.

To be able to follow this tutorial, we suggest you check out and familiarize yourself with the content of the following **R Basics** tutorials:

Click [**here**](https://ladal.edu.au/content/string.Rmd)^[If you want to render the R Notebook on your machine, i.e. knitting the document to html or a pdf, you need to make sure that you have R and RStudio installed and you also need to download the [**bibliography file**](https://slcladal.github.io/content/bibliography.bib) and store it in the same folder where you store the Rmd file.] to download the **entire R Notebook** for this tutorial.


[![Binder](https://mybinder.org/badge_logo.svg)](https://binderhub.atap-binder.cloud.edu.au/v2/gh/SLCLADAL/interactive-notebooks-environment/main?urlpath=git-pull%3Frepo%3Dhttps%253A%252F%252Fgithub.com%252FSLCLADAL%252Finteractive-notebooks%26urlpath%3Dlab%252Ftree%252Finteractive-notebooks%252Fnotebooks%252Fstring_cb.ipynb%26branch%3Dmain)
Click [**here**](https://binderhub.atap-binder.cloud.edu.au/v2/gh/SLCLADAL/interactive-notebooks-environment/main?urlpath=git-pull%3Frepo%3Dhttps%253A%252F%252Fgithub.com%252FSLCLADAL%252Finteractive-notebooks%26urlpath%3Dlab%252Ftree%252Finteractive-notebooks%252Fnotebooks%252Fstring_cb.ipynb%26branch%3Dmain) to open an interactive and simplified version of this tutorial that allows you to execute, change, and edit the code used in this tutorial as well as to upload your own data.




**LADAL TOOL**

Click on this [![Binder](https://mybinder.org/badge_logo.svg)](https://binderhub.atap-binder.cloud.edu.au/v2/gh/SLCLADAL/interactive-notebooks-environment/main?urlpath=git-pull%3Frepo%3Dhttps%253A%252F%252Fgithub.com%252FSLCLADAL%252Finteractive-notebooks%26urlpath%3Dlab%252Ftree%252Finteractive-notebooks%252Fnotebooks%252Fstringtool.ipynb%26branch%3Dmain) badge to open an notebook-based tool
that allows you upload your own text(s), to clean the texts, and download the resulting cleaned texts.



**Preparation and session set up** This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to and more information how to use R [here](https://slcladal.github.io/intror.html). For this tutorials, we need to install certain *packages* from an R *library* so that the scripts shown below are executed without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, then you can skip ahead ignore this section. To install the necessary packages, simply run the following code - it may take some time (between 1 and 5 minutes to install all of the packages so you do not need to worry if it takes some time). ```{r prep1, echo=T, eval = F, message=FALSE, warning=FALSE} # install packages install.packages("tidyverse") install.packages("htmlwidgets") # install klippy for copy-to-clipboard button in code chunks install.packages("remotes") remotes::install_github("rlesur/klippy") ``` Now that we have installed the packages, we can activate them as shown below. ```{r prep2, message=FALSE, warning=FALSE} # load packages for website library(tidyverse) # activate klippy for copy-to-clipboard button klippy::klippy() ``` Once you have installed RStudio and initiated the session by executing the code shown above, you are good to go. Before we start with string processing, we will load some example texts on which we will perform the processing. The first example text represents a paragraph about grammar. ```{r readdat1, message=FALSE, warning=FALSE} # read in text exampletext <- base::readRDS(url("https://slcladal.github.io/data/tx1.rda", "rb")) # inspect exampletext ``` The second example text represents the same paragraph about grammar, but split into individual sentences. ```{r readdat2, message=FALSE, warning=FALSE} # read in text splitexampletext <- base::readRDS(url("https://slcladal.github.io/data/tx2.rda", "rb")) # inspect splitexampletext ``` The third example text represents a paragraph about Ferdinand de Saussure - the founder of modern linguistics. ```{r readdat3, message=FALSE, warning=FALSE} additionaltext <- base::readRDS(url("https://slcladal.github.io/data/tx3.rda", "rb")) # inspect additionaltext ``` The fourth example text consist of 3 short plain sentences. ```{r readdat4, message=FALSE, warning=FALSE} sentences <- base::readRDS(url("https://slcladal.github.io/data/tx4.rda", "rb")) # inspect sentences ``` In the following, we will perform various operations on the example texts. # Basic String Processing {-} Before turning to functions provided in the `stringr`, let us just briefly focus on some base functions that are extremely useful when working with texts. A very useful function is, e.g. `tolower` which converts everything to lower case. ```{r lwr, message=FALSE, warning=FALSE} tolower(exampletext) ``` Conversely, `toupper` converts everything to upper case. ```{r upr, message=FALSE, warning=FALSE} toupper(exampletext) ``` The `stringr` package (see [here](https://raw.githubusercontent.com/rstudio/cheatsheets/main/strings.pdf) is part of the so-called *tidyverse* - a collection of packages that allows to write R code in a readable manner - and it is the most widely used package for string processing in . The advantage of using `stringr` is that it makes string processing very easy. All `stringr` functions share a common structure: `str_function(string, pattern)` The two arguments in the structure of `stringr` functions are: *string* which is the character string to be processed and a pattern which is either a simple sequence of characters, a regular expression, or a combination of both. Because the *string* comes first, the `stringr` functions are ideal for piping and thus use in tidyverse style R. All function names of `stringr` begin with str, then an underscore and then the name of the action to be performed. For example, to replace the first occurrence of a pattern in a string, we should use `str_replace()`. In the following, we will use `stringr` functions to perform various operations on the example text. As we have already loaded the `tidyverse` package, we can start right away with using `stringr` functions as shown below. Like `nchar` in `base`, `str_count` provides the number of characters of a text. ```{r stringr2, message=FALSE, warning=FALSE} str_count(splitexampletext) ``` The function `str_detect` informs about whether a pattern is present in a text and outputs a logical vector with *TRUE* if the pattern occurs and *FALSE* if it does not. ```{r stringr3, message=FALSE, warning=FALSE} str_detect(splitexampletext, "and") ``` The function `str_extract` extracts the first occurrence of a pattern, if that pattern is present in a text. ```{r stringr4, message=FALSE, warning=FALSE} str_extract(exampletext, "and") ``` The function `str_extract_all` extracts all occurrences of a pattern, if that pattern is present in a text. ```{r stringr5, message=FALSE, warning=FALSE} str_extract_all(exampletext, "and") ``` The function `str_locate` provides the start and end position of the match of the pattern in a text. ```{r stringr6, message=FALSE, warning=FALSE} str_locate(exampletext, "and") ``` The function `str_locate_all` provides the start and end positions of the match of the pattern in a text and displays the result in matrix-form. ```{r stringr7, message=FALSE, warning=FALSE} str_locate_all(exampletext, "and") ``` The function `str_match` extracts the first occurrence of the pattern in a text. ```{r stringr8, message=FALSE, warning=FALSE} str_match(exampletext, "and") ``` The function `str_match_all` extracts the all occurrences of the pattern from a text. ```{r stringr9, message=FALSE, warning=FALSE} str_match_all(exampletext, "and") ``` The function `str_remove` removes the first occurrence of a pattern in a text. ```{r stringr10, message=FALSE, warning=FALSE} str_remove(exampletext, "and") ``` The function `str_remove_all` removes all occurrences of a pattern from a text. ```{r stringr11, message=FALSE, warning=FALSE} str_remove_all(exampletext, "and") ``` The function `str_replace` replaces the first occurrence of a pattern with something else in a text. ```{r stringr12, message=FALSE, warning=FALSE} str_replace(exampletext, "and", "AND") ``` The function `str_replace_all` replaces all occurrences of a pattern with something else in a text. ```{r stringr13, message=FALSE, warning=FALSE} str_replace_all(exampletext, "and", "AND") ``` The function `str_starts` tests whether a given text begins with a certain pattern and outputs a logical vector. ```{r stringr14, message=FALSE, warning=FALSE} str_starts(exampletext, "and") ``` The function `str_ends` tests whether a text ends with a certain pattern and outputs a logical vector. ```{r stringr15, message=FALSE, warning=FALSE} str_ends(exampletext, "and") ``` Like `strsplit`, the function `str_split` splits a text when a given pattern occurs. If no pattern is provided, then the text is split into individual symbols. ```{r stringr16, message=FALSE, warning=FALSE} str_split(exampletext, "and") ``` The function `str_split_fixed` splits a text when a given pattern occurs but only so often as is indicated by the argument `n`. So, even if the patter occur more often than `n`, `str_split_fixed` will only split the text `n` times. ```{r stringr17, message=FALSE, warning=FALSE} str_split_fixed(exampletext, "and", n = 3) ``` The function `str_subset` extracts those subsets of a text that contain a certain pattern. ```{r stringr18, message=FALSE, warning=FALSE} str_subset(splitexampletext, "and") ``` The function `str_which` provides a vector with the indices of the texts that contain a certain pattern. ```{r stringr19, message=FALSE, warning=FALSE} str_which(splitexampletext, "and") ``` The function `str_view` shows the locations of the first instances of a pattern in a text or vector of texts. ```{r stringr20, message=FALSE, warning=FALSE} str_view(splitexampletext, "and") ``` The function `str_view_all` shows the locations of all instances of a pattern in a text or vector of texts. ```{r stringr21, message=FALSE, warning=FALSE} str_view_all(exampletext, "and") ``` The function `str_pad` adds white spaces to a text or vector of texts so that they reach a given number of symbols. ```{r stringr22, message=FALSE, warning=FALSE} # create text with white spaces text <- " this is a text " str_pad(text, width = 30) ``` The function `str_trim` removes white spaces from the beginning(s) and end(s) of a text or vector of texts. ```{r stringr23, message=FALSE, warning=FALSE} str_trim(text) ``` The function `str_squish` removes white spaces that occur within a text or vector of texts. ```{r stringr24, message=FALSE, warning=FALSE} str_squish(text) ``` The function `str_wrap` removes white spaces from the beginning(s) and end(s) of a text or vector of texts and also those white spaces that occur within a text or vector of texts. ```{r stringr25, message=FALSE, warning=FALSE} str_wrap(text) ``` The function `str_order` provides a vector that represents the order of a vector of texts according to the lengths of texts in that vector. ```{r stringr26, message=FALSE, warning=FALSE} str_order(splitexampletext) ``` The function `str_sort` orders of a vector of texts according to the lengths of texts in that vector. ```{r stringr27, message=FALSE, warning=FALSE} str_sort(splitexampletext) ``` The function `str_to_upper` converts all symbols in a text or vector of texts to upper case. ```{r stringr28, message=FALSE, warning=FALSE} str_to_upper(exampletext) ``` The function `str_to_lower` converts all symbols in a text or vector of texts to lower case. ```{r stringr29, message=FALSE, warning=FALSE} str_to_lower(exampletext) ``` The function `str_c` combines texts into one text ```{r stringr30, message=FALSE, warning=FALSE} str_c(exampletext, additionaltext) ``` The function `str_conv` converts a text into a certain type of encoding, e.g. into `UTF-8` or `Latin1`. ```{r stringr31, message=FALSE, warning=FALSE} str_conv(exampletext, encoding = "UTF-8") ``` The function `str_dup` reduplicates a text or a vector of texts n times. ```{r stringr32, message=FALSE, warning=FALSE} str_dup(exampletext, times=2) ``` The function `str_flatten` combines a vector of texts into one text. The argument `collapse` defines the symbol that occurs between the combined texts. If the argument `collapse` is left out, the texts will be combined without any symbol between the combined texts. ```{r stringr33, message=FALSE, warning=FALSE} str_flatten(sentences, collapse = " ") ``` If the argument `collapse` is left out, the texts will be combined without any symbol between the combined texts. ```{r stringr34, message=FALSE, warning=FALSE} str_flatten(sentences) ``` The function `str_length` provides the length of texts in characters. ```{r stringr35, message=FALSE, warning=FALSE} str_length(exampletext) ``` The function `str_replace_na` replaces NA in texts. It is important to note that NA, if it occurs within a string, is considered to be the literal string `NA`. ```{r stringr36, message=FALSE, warning=FALSE} # create sentences with NA sentencesna <- c("Some text", NA, "Some more text", "Some NA text") # apply str_replace_na function str_replace_na(sentencesna, replacement = "Something new") ``` The function `str_trunc` ends strings with ... after a certain number of characters. ```{r stringr37, message=FALSE, warning=FALSE} str_trunc(sentences, width = 20) ``` The function `str_sub` extracts a string from a text from a start location to an end position (expressed as character positions). ```{r stringr38, message=FALSE, warning=FALSE} str_sub(exampletext, 5, 25) ``` The function `word` extracts words from a text (expressed as word positions). ```{r stringr39, message=FALSE, warning=FALSE} word(exampletext, 2:7) ``` The function `str_glue` combines strings and allows to input variables. ```{r stringr40, message=FALSE, warning=FALSE} name <- "Fred" age <- 50 anniversary <- as.Date("1991-10-12") str_glue( "My name is {name}, ", "my age next year is {age + 1}, ", "and my anniversary is {format(anniversary, '%A, %B %d, %Y')}." ) ``` The function `str_glue_data` is particularly useful when it is used in data pipelines. The data set `mtcars` is a build in data set that is loaded automatically when starting R. ```{r stringr41, message=FALSE, warning=FALSE} mtcars %>% str_glue_data("{rownames(.)} has {hp} hp") ``` ***

EXERCISE TIME!

` 1. Load the text `linguistics04`. How many words does the text consist of?
Answer ```{r message=FALSE, warning=FALSE} readLines("https://slcladal.github.io/data/testcorpus/linguistics04.txt") %>% paste0(collapse = " ") %>% strsplit(" ") %>% unlist() %>% length() ```
2. How many characters does the text consist of?
Answer ```{r message=FALSE, warning=FALSE} readLines("https://slcladal.github.io/data/testcorpus/linguistics04.txt") %>% paste0(collapse = " ") %>% strsplit("") %>% unlist() %>% length() ```
` *** # Advanced String Processing {-} Above, we have used functions and regular expressions to extract and find patters in textual data. Here, we will focus on common methods for cleaning text data that are applied before implementing certain methods. We start by installing and then loading some additional packages, e.g., the `quanteda` (see [here](https://raw.githubusercontent.com/rstudio/cheatsheets/main/quanteda.pdf) for a cheat sheet for the `quanteda` package), the `tm`, and the `udpipe` package, which are extremely useful when dealing with more advanced text processing. ```{r atp0, eval = F, message=F, warning=F} install.packages("quanteda") install.packages("tm") install.packages("udpipe") ``` ```{r atp1, message=F, warning=F} library(quanteda) library(tm) library(udpipe) ``` One common procedure is to split texts into sentences which we can do by using, e.g., the `tokenize_sentence` function from the `quanteda` package. I also unlist the data to have a vector wot work with (rather than a list). ```{r atp3, message=F, warning=F} et_sent <- quanteda::tokenize_sentence(exampletext) %>% unlist() # inspect et_sent ``` Another common procedure is to remove stop words, i.e., words that do not have semantic or referential meaning (like nouns such as *tree* or *cat*, or verbs like *sit* or *speak* or adjectives such as *green* or *loud*) but that indicate syntactic relations, roles, or features.(e.g., articles and pronouns). We can remove stopwords using, e.g., the `removeWords` function from the `tm` package ```{r atp5, message=F, warning=F} et_wostop <- tm::removeWords(exampletext, tm::stopwords("english")) # inspect et_wostop ``` To remove the superfluous whote spaces, we can use, e.g., the `stripWhitespace` function from the `tm` package. ```{r atp6, message=F, warning=F} et_wows <- tm::stripWhitespace(et_wostop) # inspect et_wows ``` It can also be useful to remove numbers. We can do this using, e.g., the `removeNumbers` function from the `tm` package. ```{r atp7, message=F, warning=F} et_wonum <- tm::removeNumbers("This is the 1 and only sentence I will write in 2022.") # inspect et_wonum ``` We may also want to remove any type of punctuation using, e.g., the `removePunctuation` function from the `tm` package. ```{r atp9, message=F, warning=F} et_wopunct <- tm::removePunctuation(exampletext) # inspect et_wopunct ``` We may also want to stem the words in a document, i.e. removing the ends of words to be able to group together semantically related words such as *walk*, *walks*, *walking*, *walked* which would all be stemmed into *walk*. We can stem a text using, e.g., the `stemDocument` function from the `tm` package. ```{r atp11, message=F, warning=F} et_stem <- tm::stemDocument(exampletext, language = "en") # inspect et_stem ``` **Tokenization, lemmatization, pos-tagging, and dependency parsing** A far better option than stemming is lemmatization as lemmatization is based on proper morphological information and vocabularies. For lemmatization, we can use the `udpipe` package which also tokenizes texts, adds part-of-speech tags, and provides information about dependency relations. Before we can tokenize, lemmatize, pos-tag and parse though, we need to download a pre-trained language model. ```{r atp12, eval = F, message=F, warning=F} # download language model m_eng <- udpipe::udpipe_download_model(language = "english-ewt") ``` If you have downloaded a model once, you can also load the model directly from the place where you stored it on your computer. In my case, I have stored the model in a folder called udpipemodels ```{r atp14, message=F, warning=F} # load language model from your computer after you have downloaded it once m_eng <- udpipe_load_model(file = here::here("udpipemodels", "english-ewt-ud-2.5-191206.udpipe")) ``` We can now use the model to annotate out text. ```{r atp17, message=F, warning=F} # tokenise, tag, dependency parsing text_anndf <- udpipe::udpipe_annotate(m_eng, x = exampletext) %>% as.data.frame() %>% dplyr::select(-sentence) # inspect head(text_anndf, 10) ``` We could, of course, perform many more manipulations of textual data but this should suffice to get you started. # Citation & Session Info {-} Schweinberger, Martin. 2022. *String processing in R*. Brisbane: The University of Queensland. url: https://ladal.edu.au/string.html (Version 2022.11.17). ``` @manual{schweinberger2022string, author = {Schweinberger, Martin}, title = {String processing in R}, note = {https://ladal.edu.au/string.html}, year = {2022}, organization = {The University of Queensland, School of Languages and Cultures}, address = {Brisbane}, edition = {2022.11.17} } ``` ```{r fin} sessionInfo() ``` *** [Back to top](#introduction) [Back to LADAL home](https://ladal.edu.au) ***