Introduction

This tutorial introduces string processing in R. It is aimed at beginners and intermediate users of R and showcases how to work with and process textual data. The aim is not to provide a fully-fledged analysis but rather to show and exemplify selected useful functions and methods associated with text processing.

To be able to follow this tutorial, we suggest you first familiarize yourself with the content of the R Basics tutorials.

Click here to download the entire R Notebook for this tutorial.

Click here to open an interactive and simplified version of this tutorial that allows you to execute, change, and edit the code used in this tutorial as well as to upload your own data.

Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to R and more information on how to use it here. For this tutorial, we need to install certain packages so that the scripts shown below run without errors. If you have already installed the packages mentioned below, you can skip this section. To install the necessary packages, simply run the following code. The installation may take some time (between 1 and 5 minutes), so you do not need to worry if it does not finish immediately.

# install packages
install.packages("tidyverse")
install.packages("htmlwidgets")
# install klippy for copy-to-clipboard button in code chunks
install.packages("remotes")
remotes::install_github("rlesur/klippy")

Now that we have installed the packages, we can activate them as shown below.

# load packages for website
library(tidyverse)
# activate klippy for copy-to-clipboard button
klippy::klippy()

Once you have installed RStudio and initiated the session by executing the code shown above, you are good to go.

Before we start with string processing, we will load some example texts on which we will perform the processing.

The first example text represents a paragraph about grammar.

# read in text
exampletext  <- base::readRDS(url("https://slcladal.github.io/data/tx1.rda", "rb"))
# inspect
exampletext

## [1] "Grammar is a system of rules which governs the production and use of utterances in a given language. These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."

The second example text represents the same paragraph about grammar, but split into individual sentences.

# read in text
splitexampletext  <- base::readRDS(url("https://slcladal.github.io/data/tx2.rda", "rb"))
# inspect
splitexampletext

## [1] "Grammar is a system of rules which governs the production and use of utterances in a given language."                                                                                                                                                                                                   
## [2] "These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences)."
## [3] "Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."

The third example text represents a paragraph about Ferdinand de Saussure - the founder of modern linguistics.

additionaltext  <- base::readRDS(url("https://slcladal.github.io/data/tx3.rda", "rb"))
# inspect
additionaltext

## [1] "In the early 20th century, Ferdinand de Saussure distinguished between the notions of langue and parole in his formulation of structural linguistics. According to him, parole is the specific utterance of speech, whereas langue refers to an abstract phenomenon that theoretically defines the principles and system of rules that govern a language. This distinction resembles the one made by Noam Chomsky between competence and performance in his theory of transformative or generative grammar. According to Chomsky, competence is an individual's innate capacity and potential for language (like in Saussure's langue), while performance is the specific way in which it is used by individuals, groups, and communities (i.e., parole, in Saussurean terms). "

The fourth example text consists of three short, plain sentences.

sentences  <- base::readRDS(url("https://slcladal.github.io/data/tx4.rda", "rb"))
# inspect
sentences

## [1] "This is a first sentence."     "This is a second sentence."   
## [3] "And this is a third sentence."

In the following, we will perform various operations on the example texts.

Basic String Processing

Before turning to the functions provided in the stringr package, let us briefly focus on some base functions that are extremely useful when working with texts.

A very useful function is, e.g., tolower, which converts everything to lower case.

tolower(exampletext)

## [1] "grammar is a system of rules which governs the production and use of utterances in a given language. these rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). many modern theories that deal with the principles of grammar are based on noam chomsky's framework of generative linguistics."

Conversely, toupper converts everything to upper case.

toupper(exampletext)

## [1] "GRAMMAR IS A SYSTEM OF RULES WHICH GOVERNS THE PRODUCTION AND USE OF UTTERANCES IN A GIVEN LANGUAGE. THESE RULES APPLY TO SOUND AS WELL AS MEANING, AND INCLUDE COMPONENTIAL SUBSETS OF RULES, SUCH AS THOSE PERTAINING TO PHONOLOGY (THE ORGANISATION OF PHONETIC SOUND SYSTEMS), MORPHOLOGY (THE FORMATION AND COMPOSITION OF WORDS), AND SYNTAX (THE FORMATION AND COMPOSITION OF PHRASES AND SENTENCES). MANY MODERN THEORIES THAT DEAL WITH THE PRINCIPLES OF GRAMMAR ARE BASED ON NOAM CHOMSKY'S FRAMEWORK OF GENERATIVE LINGUISTICS."

The stringr package (see here) is part of the so-called tidyverse, a collection of packages that allows you to write R code in a readable manner, and it is one of the most widely used packages for string processing in R. The advantage of using stringr is that it makes string processing very easy. All stringr functions share a common structure:

str_function(string, pattern)

The two arguments in the structure of stringr functions are: string which is the character string to be processed and a pattern which is either a simple sequence of characters, a regular expression, or a combination of both. Because the string comes first, the stringr functions are ideal for piping and thus use in tidyverse style R.
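Because the string is always the first argument, several stringr calls can be chained into one pipeline. The following is a minimal sketch: the vector x is invented for illustration (it is not one of the example texts), and the native pipe |> assumes R 4.1 or later (the magrittr pipe %>% would work the same way here).

```r
library(stringr)

# hypothetical input vector, invented for illustration
x <- c("  The Cat sat.  ", "A DOG barked. ")

cleaned <- x |>
  str_trim() |>                    # drop leading and trailing white space
  str_to_lower() |>                # convert to lower case
  str_remove_all("[[:punct:]]")    # remove punctuation via a regular expression

cleaned
# [1] "the cat sat"  "a dog barked"
```

Each step receives the result of the previous step as its string argument, so only the pattern (if any) has to be spelled out.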

All function names of stringr begin with str, then an underscore and then the name of the action to be performed. For example, to replace the first occurrence of a pattern in a string, we should use str_replace(). In the following, we will use stringr functions to perform various operations on the example text. As we have already loaded the tidyverse package, we can start right away with using stringr functions as shown below.

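The function str_count counts how often a pattern occurs in a text. If no pattern is provided, it counts the individual characters and thus, like nchar in base, provides the number of characters of a text.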

str_count(splitexampletext)

## [1] 100 295 126
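To make the role of the pattern argument concrete, here is a small sketch with an invented vector (fruit is a hypothetical name, not part of the tutorial data). Without a pattern, str_count behaves like nchar; with a pattern, it counts matches.

```r
library(stringr)

# hypothetical example vector
fruit <- c("banana", "apple")

n_chars <- str_count(fruit)        # number of characters: 6 5
n_base  <- nchar(fruit)            # same result with base R: 6 5
n_a     <- str_count(fruit, "a")   # number of "a"s per element: 3 1
```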

The function str_detect informs about whether a pattern is present in a text and outputs a logical vector with TRUE if the pattern occurs and FALSE if it does not.

str_detect(splitexampletext, "and")

## [1]  TRUE  TRUE FALSE
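Note that a plain pattern such as "and" also matches inside longer words. The sketch below, which uses an invented vector, contrasts the plain pattern with a regular expression that uses word boundaries (\\b) so that only the free-standing word is detected.

```r
library(stringr)

# hypothetical example vector
x <- c("sand and sea", "sandwich")

substring_match <- str_detect(x, "and")        # TRUE TRUE: matches inside "sandwich"
word_match      <- str_detect(x, "\\band\\b")  # TRUE FALSE: whole word only
```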

The function str_extract extracts the first occurrence of a pattern, if that pattern is present in a text.

str_extract(exampletext, "and")

## [1] "and"

The function str_extract_all extracts all occurrences of a pattern, if that pattern is present in a text.

str_extract_all(exampletext, "and")

## [[1]]
## [1] "and" "and" "and" "and" "and" "and"
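The pattern does not have to be a fixed sequence of characters. As a small sketch with an invented string, the regular expression below extracts every word that ends in ology; the [[1]] selects the first (and only) element of the list that str_extract_all returns.

```r
library(stringr)

# hypothetical example string
x <- "phonology, morphology, and syntax"

# \\w+ matches one or more word characters before the literal "ology"
ology <- str_extract_all(x, "\\w+ology")[[1]]
ology
# [1] "phonology"  "morphology"
```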

The function str_locate provides the start and end position of the first match of the pattern in a text.

str_locate(exampletext, "and")

##      start end
## [1,]    59  61

The function str_locate_all provides the start and end positions of all matches of the pattern in a text and returns them as a list with one matrix per text.

str_locate_all(exampletext, "and")

## [[1]]
##      start end
## [1,]    59  61
## [2,]   149 151
## [3,]   302 304
## [4,]   329 331
## [5,]   355 357
## [6,]   382 384

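The function str_match extracts the first occurrence of the pattern in a text and returns it as a character matrix. If the pattern contains capture groups, each group is returned in an additional column.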

str_match(exampletext, "and")

##      [,1] 
## [1,] "and"

The function str_match_all extracts all occurrences of the pattern from a text and returns them as a list of character matrices.

str_match_all(exampletext, "and")

## [[1]]
##      [,1] 
## [1,] "and"
## [2,] "and"
## [3,] "and"
## [4,] "and"
## [5,] "and"
## [6,] "and"
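With a simple pattern such as "and", str_match and str_extract return the same content. The difference shows once the pattern contains capture groups (parts of a regular expression in round brackets). A minimal sketch with an invented string:

```r
library(stringr)

# two capture groups: first word and second word
m <- str_match("Noam Chomsky", "(\\w+) (\\w+)")

m[, 1]   # complete match: "Noam Chomsky"
m[, 2]   # first group:    "Noam"
m[, 3]   # second group:   "Chomsky"

# str_extract only returns the complete match
e <- str_extract("Noam Chomsky", "(\\w+) (\\w+)")
```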

The function str_remove removes the first occurrence of a pattern in a text.

str_remove(exampletext, "and")

## [1] "Grammar is a system of rules which governs the production  use of utterances in a given language. These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."

The function str_remove_all removes all occurrences of a pattern from a text.

str_remove_all(exampletext, "and")

## [1] "Grammar is a system of rules which governs the production  use of utterances in a given language. These rules apply to sound as well as meaning,  include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation  composition of words),  syntax (the formation  composition of phrases  sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."

The function str_replace replaces the first occurrence of a pattern with something else in a text.

str_replace(exampletext, "and", "AND")

## [1] "Grammar is a system of rules which governs the production AND use of utterances in a given language. These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."

The function str_replace_all replaces all occurrences of a pattern with something else in a text.

str_replace_all(exampletext, "and", "AND")

## [1] "Grammar is a system of rules which governs the production AND use of utterances in a given language. These rules apply to sound as well as meaning, AND include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation AND composition of words), AND syntax (the formation AND composition of phrases AND sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."

The function str_starts tests whether a given text begins with a certain pattern and outputs a logical vector.

str_starts(exampletext, "and")

## [1] FALSE

The function str_ends tests whether a text ends with a certain pattern and outputs a logical vector.

str_ends(exampletext, "and")

## [1] FALSE
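Because patterns are interpreted as regular expressions by default, symbols such as the full stop have a special meaning (a full stop matches any character). The following sketch, using an invented vector, shows two ways of testing for a literal full stop at the end of a text: wrapping the pattern in fixed() or escaping it.

```r
library(stringr)

# hypothetical example vector
x <- c("linguistics.", "linguistics")

any_char <- str_ends(x, ".")          # TRUE TRUE: "." matches any character
literal  <- str_ends(x, fixed("."))   # TRUE FALSE: literal full stop
escaped  <- str_ends(x, "\\.")        # TRUE FALSE: escaped full stop
```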

Like strsplit, the function str_split splits a text when a given pattern occurs. If the pattern is an empty string (""), the text is split into individual symbols.

str_split(exampletext, "and")

## [[1]]
## [1] "Grammar is a system of rules which governs the production "                                                                                            
## [2] " use of utterances in a given language. These rules apply to sound as well as meaning, "                                                               
## [3] " include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation "
## [4] " composition of words), "                                                                                                                              
## [5] " syntax (the formation "                                                                                                                               
## [6] " composition of phrases "                                                                                                                              
## [7] " sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."
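As a brief sketch of splitting a text into individual symbols, an empty string can be passed as the pattern. The result is a list, so [[1]] is used to obtain a plain character vector.

```r
library(stringr)

chars      <- str_split("text", "")[[1]]   # "t" "e" "x" "t"
base_chars <- strsplit("text", "")[[1]]    # base R equivalent
```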

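The function str_split_fixed splits a text when a given pattern occurs but returns exactly as many pieces as indicated by the argument n. So, even if the pattern occurs more often, str_split_fixed will only split the text n - 1 times and leave the remainder of the text unsplit in the last piece. The result is a matrix with n columns.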

str_split_fixed(exampletext, "and", n = 3)

##      [,1]                                                        
## [1,] "Grammar is a system of rules which governs the production "
##      [,2]                                                                                     
## [1,] " use of utterances in a given language. These rules apply to sound as well as meaning, "
##      [,3]                                                                                                                                                                                                                                                                                                                                                                                  
## [1,] " include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."
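The effect of n is easier to see on a short string. In this sketch (the string x is invented), n = 2 yields two pieces with the remainder left unsplit, while an n that exceeds the number of available pieces pads the matrix with empty strings.

```r
library(stringr)

# hypothetical example string
x <- "a-b-c-d"

all_pieces <- str_split(x, "-")[[1]]          # "a" "b" "c" "d"
two <- str_split_fixed(x, "-", n = 2)         # 1 x 2 matrix: "a" and "b-c-d"
six <- str_split_fixed(x, "-", n = 6)         # 1 x 6 matrix, last two cells are ""
```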

The function str_subset returns those elements of a vector of texts that contain a certain pattern.

str_subset(splitexampletext, "and")

## [1] "Grammar is a system of rules which governs the production and use of utterances in a given language."                                                                                                                                                                                                   
## [2] "These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences)."

The function str_which provides a vector with the indices of the texts that contain a certain pattern.

str_which(splitexampletext, "and")

## [1] 1 2

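The function str_view prints the texts that contain a pattern and marks the matches with angle brackets. In current versions of stringr (1.5.0 and later), all matches are highlighted and texts without a match are omitted.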

str_view(splitexampletext, "and")

## [1] │ Grammar is a system of rules which governs the production <and> use of utterances in a given language.
## [2] │ These rules apply to sound as well as meaning, <and> include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation <and> composition of words), <and> syntax (the formation <and> composition of phrases <and> sentences).

The function str_view_all shows the locations of all instances of a pattern in a text or vector of texts. As of stringr 1.5.0, it is deprecated in favour of str_view, which highlights all matches itself.

str_view_all(exampletext, "and")

## [1] │ Grammar is a system of rules which governs the production <and> use of utterances in a given language. These rules apply to sound as well as meaning, <and> include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation <and> composition of words), <and> syntax (the formation <and> composition of phrases <and> sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics.

The function str_pad adds white spaces to a text or vector of texts so that they reach a given number of symbols. By default, the white spaces are added on the left.

# create text with white spaces
text <- " this    is a    text   "
str_pad(text, width = 30)

## [1] "       this    is a    text   "
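The arguments side and pad control where the padding is added and which symbol is used. A short sketch with invented strings:

```r
library(stringr)

zero_padded <- str_pad("7", width = 3, pad = "0")         # "007"
right_pad   <- str_pad("ab", width = 5, side = "right")   # "ab   "
both_pad    <- str_pad("ab", width = 6, side = "both")    # "  ab  "
```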

The function str_trim removes white spaces from the beginning(s) and end(s) of a text or vector of texts.

str_trim(text)

## [1] "this    is a    text"

The function str_squish removes white spaces from the beginning and end of a text or vector of texts and reduces repeated white spaces within a text to a single white space.

str_squish(text)

## [1] "this is a text"

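The function str_wrap wraps a text into lines of a given width (set via the argument width, which is 80 by default) by inserting line breaks. As a side effect, it also removes white spaces from the beginning and end of a text and reduces repeated white spaces within a text to a single white space, which is why the short example text below is merely cleaned.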

str_wrap(text)

## [1] "this is a text"
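The wrapping itself only becomes visible when the text is longer than the target width. The sketch below uses an invented string and a width of 30; cat is used so that the inserted line breaks are displayed as actual new lines. Where exactly the breaks fall is decided by str_wrap.

```r
library(stringr)

# hypothetical example string
long <- "This is a first sentence. This is a second sentence. And this is a third sentence."

# insert line breaks so that lines do not exceed 30 characters
wrapped <- str_wrap(long, width = 30)

# display the wrapped text
cat(wrapped)

# split at the line breaks to inspect the individual lines
wrapped_lines <- str_split(wrapped, "\n")[[1]]
```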

The function str_order provides a vector of indices that represents the alphabetical order of the texts in a vector (not their order by length).

str_order(splitexampletext)

## [1] 1 3 2

The function str_sort sorts a vector of texts alphabetically (not according to the lengths of the texts).

str_sort(splitexampletext)

## [1] "Grammar is a system of rules which governs the production and use of utterances in a given language."                                                                                                                                                                                                   
## [2] "Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."                                                                                                                                                                         
## [3] "These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences)."
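Both str_order and str_sort work alphabetically. If the texts should instead be ordered by their length, one option is to combine str_length with the base function order. A minimal sketch with an invented vector:

```r
library(stringr)

# hypothetical example vector
x <- c("pear", "fig", "banana")

alphabetical <- str_sort(x)                # "banana" "fig" "pear"
indices      <- str_order(x)               # 3 2 1
by_length    <- x[order(str_length(x))]    # "fig" "pear" "banana"
```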

The function str_to_upper converts all symbols in a text or vector of texts to upper case.

str_to_upper(exampletext)

## [1] "GRAMMAR IS A SYSTEM OF RULES WHICH GOVERNS THE PRODUCTION AND USE OF UTTERANCES IN A GIVEN LANGUAGE. THESE RULES APPLY TO SOUND AS WELL AS MEANING, AND INCLUDE COMPONENTIAL SUBSETS OF RULES, SUCH AS THOSE PERTAINING TO PHONOLOGY (THE ORGANISATION OF PHONETIC SOUND SYSTEMS), MORPHOLOGY (THE FORMATION AND COMPOSITION OF WORDS), AND SYNTAX (THE FORMATION AND COMPOSITION OF PHRASES AND SENTENCES). MANY MODERN THEORIES THAT DEAL WITH THE PRINCIPLES OF GRAMMAR ARE BASED ON NOAM CHOMSKY'S FRAMEWORK OF GENERATIVE LINGUISTICS."

The function str_to_lower converts all symbols in a text or vector of texts to lower case.

str_to_lower(exampletext)

## [1] "grammar is a system of rules which governs the production and use of utterances in a given language. these rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). many modern theories that deal with the principles of grammar are based on noam chomsky's framework of generative linguistics."

The function str_c combines texts into one text. By default, the texts are joined without any symbol between them.

str_c(exampletext, additionaltext)

## [1] "Grammar is a system of rules which governs the production and use of utterances in a given language. These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics.In the early 20th century, Ferdinand de Saussure distinguished between the notions of langue and parole in his formulation of structural linguistics. According to him, parole is the specific utterance of speech, whereas langue refers to an abstract phenomenon that theoretically defines the principles and system of rules that govern a language. This distinction resembles the one made by Noam Chomsky between competence and performance in his theory of transformative or generative grammar. According to Chomsky, competence is an individual's innate capacity and potential for language (like in Saussure's langue), while performance is the specific way in which it is used by individuals, groups, and communities (i.e., parole, in Saussurean terms). "
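The arguments sep and collapse determine what is placed between the combined texts. The following sketch uses invented strings: sep separates the arguments, str_c is vectorised over its arguments, and collapse joins the elements of a single vector.

```r
library(stringr)

with_sep  <- str_c("Noam", "Chomsky", sep = " ")      # "Noam Chomsky"
numbered  <- str_c("text", 1:3)                       # "text1" "text2" "text3"
collapsed <- str_c(c("a", "b", "c"), collapse = "+")  # "a+b+c"
```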

The function str_conv declares the encoding of a text, e.g. Latin1, and converts the text from that encoding to UTF-8. As the example text is already encoded in UTF-8, it is returned unchanged.

str_conv(exampletext, encoding = "UTF-8")

## [1] "Grammar is a system of rules which governs the production and use of utterances in a given language. These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."
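The effect of str_conv is only visible when the input is not yet encoded in UTF-8. As a sketch, the code below builds a one-byte string (the byte 0xE9, which represents é in Latin1, i.e. ISO-8859-1) and converts it.

```r
library(stringr)

# a single raw byte that encodes "é" in Latin1 (ISO-8859-1)
x <- rawToChar(as.raw(0xe9))

# declare the source encoding; the result is encoded in UTF-8
y <- str_conv(x, "ISO-8859-1")
y
```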

The function str_dup duplicates a text or a vector of texts a given number of times.

str_dup(exampletext, times=2)

## [1] "Grammar is a system of rules which governs the production and use of utterances in a given language. These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics.Grammar is a system of rules which governs the production and use of utterances in a given language. These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."

The function str_flatten combines a vector of texts into one text. The argument collapse defines the symbol that occurs between the combined texts.

str_flatten(sentences, collapse = " ")

## [1] "This is a first sentence. This is a second sentence. And this is a third sentence."

If the argument collapse is left out, the texts will be combined without any symbol between the combined texts.

str_flatten(sentences)

## [1] "This is a first sentence.This is a second sentence.And this is a third sentence."

The function str_length provides the length of texts in characters.

str_length(exampletext)

## [1] 523

The function str_replace_na replaces missing values (NA) in a vector of texts with a string. It is important to note that NA, if it occurs within a string, is considered to be the literal string NA and is left untouched.

# create sentences with NA
sentencesna <- c("Some text", NA, "Some more text", "Some NA text")
# apply str_replace_na function
str_replace_na(sentencesna, replacement = "Something new")

## [1] "Some text"      "Something new"  "Some more text" "Some NA text"

The function str_trunc truncates strings to a given width and marks the truncation with an ellipsis (...), which counts towards the width.

str_trunc(sentences, width = 20)

## [1] "This is a first s..." "This is a second ..." "And this is a thi..."
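The argument side determines where the text is cut. A short sketch using the first of the three sentences as a literal string:

```r
library(stringr)

s <- "This is a first sentence."

cut_right <- str_trunc(s, width = 20)                  # "This is a first s..."
cut_left  <- str_trunc(s, width = 20, side = "left")   # "...a first sentence."
```

In both cases the result is exactly 20 characters long, including the three dots.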

The function str_sub extracts a string from a text from a start location to an end position (expressed as character positions).

str_sub(exampletext, 5, 25)

## [1] "mar is a system of ru"

The function word extracts words from a text (expressed as word positions).

word(exampletext, 2:7)

## [1] "is"     "a"      "system" "of"     "rules"  "which"
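Both str_sub and word also accept negative positions, which are counted from the end of the text. A minimal sketch with invented strings:

```r
library(stringr)

# hypothetical example string
x <- "generative linguistics"

first_chars <- str_sub(x, 1, 10)     # "generative"
last_chars  <- str_sub(x, -11, -1)   # "linguistics": last 11 characters
last_word   <- word(x, -1)           # "linguistics": last word

# start and end can be combined to extract a range of words
two_words <- word("Noam Chomsky's framework", 1, 2)   # "Noam Chomsky's"
```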

The function str_glue combines strings and allows you to insert variables and R expressions, which are written in curly brackets.

name <- "Fred"
age <- 50
anniversary <- as.Date("1991-10-12")
str_glue(
  "My name is {name}, ",
  "my age next year is {age + 1}, ",
  "and my anniversary is {format(anniversary, '%A, %B %d, %Y')}."
)

## My name is Fred, my age next year is 51, and my anniversary is Saturday, October 12, 1991.

The function str_glue_data is particularly useful when it is used in data pipelines. The data set mtcars is a built-in data set that is available automatically when starting R.

mtcars %>% 
  str_glue_data("{rownames(.)} has {hp} hp")

## Mazda RX4 has 110 hp
## Mazda RX4 Wag has 110 hp
## Datsun 710 has 93 hp
## Hornet 4 Drive has 110 hp
## Hornet Sportabout has 175 hp
## Valiant has 105 hp
## Duster 360 has 245 hp
## Merc 240D has 62 hp
## Merc 230 has 95 hp
## Merc 280 has 123 hp
## Merc 280C has 123 hp
## Merc 450SE has 180 hp
## Merc 450SL has 180 hp
## Merc 450SLC has 180 hp
## Cadillac Fleetwood has 205 hp
## Lincoln Continental has 215 hp
## Chrysler Imperial has 230 hp
## Fiat 128 has 66 hp
## Honda Civic has 52 hp
## Toyota Corolla has 65 hp
## Toyota Corona has 97 hp
## Dodge Challenger has 150 hp
## AMC Javelin has 150 hp
## Camaro Z28 has 245 hp
## Pontiac Firebird has 175 hp
## Fiat X1-9 has 66 hp
## Porsche 914-2 has 91 hp
## Lotus Europa has 113 hp
## Ford Pantera L has 264 hp
## Ferrari Dino has 175 hp
## Maserati Bora has 335 hp
## Volvo 142E has 109 hp

EXERCISE TIME!

Load the text linguistics04. How many words does the text consist of?

Answer

  readLines("https://slcladal.github.io/data/testcorpus/linguistics04.txt") %>%
  paste0(collapse = " ") %>%
  strsplit(" ") %>%
  unlist() %>%
  length()

  ## [1] 101

How many characters does the text consist of?

Answer

  readLines("https://slcladal.github.io/data/testcorpus/linguistics04.txt") %>%
  paste0(collapse = " ") %>%
  strsplit("") %>%
  unlist() %>%
  length()

  ## [1] 673
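The same kind of counts can also be obtained with the stringr functions introduced above. The following sketch uses a short invented sentence instead of the downloaded text so that it runs without an internet connection; the pattern \\S+ (one or more non-white-space symbols) is one possible way of approximating words.

```r
library(stringr)

# hypothetical example sentence (not the linguistics04 text)
txt <- "Linguistics is the scientific study of language."

n_words <- str_count(txt, "\\S+")   # number of white-space separated words: 7
n_chars <- str_length(txt)          # number of characters: 48
```

Unlike splitting at single white spaces, counting \\S+ matches does not produce empty elements when a text contains several white spaces in a row.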

Advanced String Processing

Above, we have used functions and regular expressions to extract and find patterns in textual data. Here, we will focus on common methods for cleaning text data that are typically applied before the data are analyzed further.

We start by installing and then loading some additional packages, namely the quanteda package (see here for a cheat sheet for the quanteda package), the tm package, and the udpipe package, which are extremely useful when dealing with more advanced text processing.

install.packages("quanteda")
install.packages("tm")
install.packages("udpipe")

library(quanteda)
library(tm)
library(udpipe)

One common procedure is to split texts into sentences, which we can do by using, e.g., the tokenize_sentence function from the quanteda package. I also unlist the data to have a vector to work with (rather than a list).

et_sent <- quanteda::tokenize_sentence(exampletext) %>%
  unlist()
# inspect
et_sent

## [1] "Grammar is a system of rules which governs the production and use of utterances in a given language."                                                                                                                                                                                                   
## [2] "These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences)."
## [3] "Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."

Another common procedure is to remove stop words, i.e., words that do not have semantic or referential meaning of their own (unlike nouns such as tree or cat, verbs such as sit or speak, or adjectives such as green or loud) but that indicate syntactic relations, roles, or features (e.g., articles and pronouns). We can remove stop words using, e.g., the removeWords function from the tm package.

et_wostop <-  tm::removeWords(exampletext, tm::stopwords("english"))
# inspect
et_wostop

## [1] "Grammar   system  rules  governs  production  use  utterances   given language. These rules apply  sound  well  meaning,  include componential subsets  rules,    pertaining  phonology ( organisation  phonetic sound systems), morphology ( formation  composition  words),  syntax ( formation  composition  phrases  sentences). Many modern theories  deal   principles  grammar  based  Noam Chomsky's framework  generative linguistics."

To remove the superfluous white spaces, we can use, e.g., the stripWhitespace function from the tm package.

et_wows <-  tm::stripWhitespace(et_wostop)
# inspect
et_wows

## [1] "Grammar system rules governs production use utterances given language. These rules apply sound well meaning, include componential subsets rules, pertaining phonology ( organisation phonetic sound systems), morphology ( formation composition words), syntax ( formation composition phrases sentences). Many modern theories deal principles grammar based Noam Chomsky's framework generative linguistics."

It can also be useful to remove numbers. We can do this using, e.g., the removeNumbers function from the tm package.

et_wonum <-  tm::removeNumbers("This is the 1 and only sentence I will write in 2022.")
# inspect
et_wonum

## [1] "This is the  and only sentence I will write in ."

We may also want to remove any type of punctuation using, e.g., the removePunctuation function from the tm package.

et_wopunct <-  tm::removePunctuation(exampletext)
# inspect
et_wopunct

## [1] "Grammar is a system of rules which governs the production and use of utterances in a given language These rules apply to sound as well as meaning and include componential subsets of rules such as those pertaining to phonology the organisation of phonetic sound systems morphology the formation and composition of words and syntax the formation and composition of phrases and sentences Many modern theories that deal with the principles of grammar are based on Noam Chomskys framework of generative linguistics"
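The individual cleaning steps shown so far can also be chained into a single pipeline. The sketch below is one possible ordering, not a fixed recipe: the input sentence is invented, the native pipe |> assumes R 4.1 or later, and the text is converted to lower case first because the stop word list of the tm package is in lower case (which is why the capitalised These was not removed in the stop word example above).

```r
library(tm)

# hypothetical input sentence, invented for illustration
raw <- "This is the 1 and only sentence, written in 2022!"

clean <- raw |>
  tolower() |>                                   # lower case so that stop words match
  tm::removePunctuation() |>                     # remove punctuation
  tm::removeNumbers() |>                         # remove digits
  tm::removeWords(tm::stopwords("english")) |>   # remove stop words
  tm::stripWhitespace() |>                       # collapse repeated white spaces
  trimws()                                       # drop leading/trailing white space

clean
```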

We may also want to stem the words in a document, i.e., to remove the ends of words so that related forms such as walk, walks, walking, and walked are all reduced to walk and can be grouped together. We can stem a text using, e.g., the stemDocument function from the tm package.

et_stem <-  tm::stemDocument(exampletext, language = "en")
# inspect
et_stem

## [1] "Grammar is a system of rule which govern the product and use of utter in a given language. These rule appli to sound as well as meaning, and includ componenti subset of rules, such as those pertain to phonolog (the organis of phonet sound systems), morpholog (the format and composit of words), and syntax (the format and composit of phrase and sentences). Mani modern theori that deal with the principl of grammar are base on Noam Chomski framework of generat linguistics."
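The effect of stemming is easiest to see on isolated word forms. The following minimal sketch applies stemDocument to the forms mentioned above; it assumes that the SnowballC package, which tm relies on for stemming, is installed.

```r
library(tm)

# the word forms mentioned in the text
forms <- c("walk", "walks", "walking", "walked")

# stem each form
stems <- tm::stemDocument(forms)
stems
# [1] "walk" "walk" "walk" "walk"
```

Note that stems do not have to be existing words: in the output above, organisation is reduced to organis.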

Tokenization, lemmatization, pos-tagging, and dependency parsing

Lemmatization is often a better option than stemming because it draws on morphological information and a vocabulary, so that words are mapped to their dictionary forms (lemmas) rather than to truncated stems. For lemmatization, we can use the udpipe package, which also tokenizes texts, adds part-of-speech tags, and provides information about dependency relations.

Before we can tokenize, lemmatize, pos-tag and parse though, we need to download a pre-trained language model.

# download language model
m_eng <- udpipe::udpipe_download_model(language = "english-ewt")

If you have already downloaded a model, you can load it directly from the place where you stored it on your computer. In my case, I have stored the model in a folder called udpipemodels.

# load language model from your computer after you have downloaded it once
m_eng <- udpipe::udpipe_load_model(
  file = here::here("udpipemodels", "english-ewt-ud-2.5-191206.udpipe")
)

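The two steps can also be combined so that the model is only downloaded when it is not yet on disk. The following is a sketch under assumptions, not part of the original workflow: the helper name get_udpipe_model is hypothetical, the folder name udpipemodels is taken from the example above, and the code relies on the model_dir and overwrite arguments of udpipe_download_model as well as the file_model column of the data frame it returns.

```r
# hypothetical helper: download the model only if needed, then load it
get_udpipe_model <- function(language = "english-ewt",
                             model_dir = "udpipemodels") {
  # create the target folder if it does not exist yet
  dir.create(model_dir, showWarnings = FALSE, recursive = TRUE)
  # with overwrite = FALSE, an already downloaded model is not downloaded again
  info <- udpipe::udpipe_download_model(language = language,
                                        model_dir = model_dir,
                                        overwrite = FALSE)
  # load the model from the path reported by the download function
  udpipe::udpipe_load_model(file = info$file_model)
}
```

Calling m_eng <- get_udpipe_model() would then return a model object that can be used in the same way as the one loaded above.
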
We can now use the model to annotate our text.

# tokenise, tag, dependency parsing
text_anndf <- udpipe::udpipe_annotate(m_eng, x = exampletext) %>%
  as.data.frame() %>%
  dplyr::select(-sentence)
# inspect
head(text_anndf, 10)

##    doc_id paragraph_id sentence_id token_id      token      lemma  upos xpos
## 1    doc1            1           1        1    Grammar    Grammar PROPN  NNP
## 2    doc1            1           1        2         is         be   AUX  VBZ
## 3    doc1            1           1        3          a          a   DET   DT
## 4    doc1            1           1        4     system     system  NOUN   NN
## 5    doc1            1           1        5         of         of   ADP   IN
## 6    doc1            1           1        6      rules       rule  NOUN  NNS
## 7    doc1            1           1        7      which      which  PRON  WDT
## 8    doc1            1           1        8    governs     govern  VERB  VBZ
## 9    doc1            1           1        9        the        the   DET   DT
## 10   doc1            1           1       10 production production  NOUN   NN
##                                                    feats head_token_id
## 1                                            Number=Sing             4
## 2  Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin             4
## 3                              Definite=Ind|PronType=Art             4
## 4                                            Number=Sing             0
## 5                                                   <NA>             6
## 6                                            Number=Plur             4
## 7                                           PronType=Rel             8
## 8  Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin             4
## 9                              Definite=Def|PronType=Art            10
## 10                                           Number=Sing             8
##      dep_rel deps misc
## 1      nsubj <NA> <NA>
## 2        cop <NA> <NA>
## 3        det <NA> <NA>
## 4       root <NA> <NA>
## 5       case <NA> <NA>
## 6       nmod <NA> <NA>
## 7      nsubj <NA> <NA>
## 8  acl:relcl <NA> <NA>
## 9        det <NA> <NA>
## 10       obj <NA> <NA>
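
Because the annotation is stored as a data frame, its columns can be processed like those of any other table. As a small sketch, the code below re-creates the token, lemma, and upos columns of the ten rows shown above by hand (so that it runs without a language model) and then builds a lemmatized version of the text and a frequency table of the part-of-speech tags. The object names ann_small, lemma_text, and pos_freq are illustrative.

```r
# re-create three columns of the output above (first ten tokens only)
ann_small <- data.frame(
  token = c("Grammar", "is", "a", "system", "of",
            "rules", "which", "governs", "the", "production"),
  lemma = c("Grammar", "be", "a", "system", "of",
            "rule", "which", "govern", "the", "production"),
  upos  = c("PROPN", "AUX", "DET", "NOUN", "ADP",
            "NOUN", "PRON", "VERB", "DET", "NOUN")
)
# paste the lemmas together to obtain a lemmatized text
lemma_text <- paste(ann_small$lemma, collapse = " ")
lemma_text
# count how often each part-of-speech tag occurs
pos_freq <- table(ann_small$upos)
pos_freq
```

With the full annotation, the same steps would be applied to text_anndf instead of ann_small.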

We could, of course, perform many more manipulations of textual data, but this should suffice to get you started.

Citation & Session Info

Schweinberger, Martin. 2022. String processing in R. Brisbane: The University of Queensland. url: https://ladal.edu.au/string.html (Version 2022.11.17).

@manual{schweinberger2022string,
  author = {Schweinberger, Martin},
  title = {String processing in R},
  note = {https://ladal.edu.au/string.html},
  year = {2022},
  organization = {The University of Queensland, School of Languages and Cultures},
  address = {Brisbane},
  edition = {2022.11.17}
}

sessionInfo()

## R version 4.3.2 (2023-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 22621)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_Australia.utf8  LC_CTYPE=English_Australia.utf8   
## [3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C                      
## [5] LC_TIME=English_Australia.utf8    
## 
## time zone: Australia/Brisbane
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] udpipe_0.8.11   tm_0.7-11       NLP_0.2-1       quanteda_3.3.1 
##  [5] lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
##  [9] purrr_1.0.2     readr_2.1.5     tidyr_1.3.1     tibble_3.2.1   
## [13] ggplot2_3.5.0   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.8         utf8_1.2.4         generics_0.1.3     xml2_1.3.6        
##  [5] slam_0.1-50        stringi_1.8.3      lattice_0.21-9     hms_1.1.3         
##  [9] digest_0.6.34      magrittr_2.0.3     evaluate_0.23      grid_4.3.2        
## [13] timechange_0.3.0   fastmap_1.1.1      rprojroot_2.0.4    jsonlite_1.8.8    
## [17] Matrix_1.6-5       stopwords_2.3      fansi_1.0.6        scales_1.3.0      
## [21] klippy_0.0.0.9500  jquerylib_0.1.4    cli_3.6.2          rlang_1.1.3       
## [25] crayon_1.5.2       munsell_0.5.0      withr_3.0.0        cachem_1.0.8      
## [29] yaml_2.3.8         parallel_4.3.2     tools_4.3.2        tzdb_0.4.0        
## [33] colorspace_2.1-0   fastmatch_1.1-4    here_1.0.1         assertthat_0.2.1  
## [37] vctrs_0.6.5        R6_2.5.1           lifecycle_1.0.4    pkgconfig_2.0.3   
## [41] RcppParallel_5.1.7 pillar_1.9.0       bslib_0.6.1        gtable_0.3.4      
## [45] data.table_1.15.2  glue_1.7.0         Rcpp_1.0.12        xfun_0.42         
## [49] tidyselect_1.2.0   highr_0.10         rstudioapi_0.15.0  knitr_1.45        
## [53] SnowballC_0.7.1    htmltools_0.5.7    rmarkdown_2.25     compiler_4.3.2


If you want to render the R Notebook on your machine, i.e., knit the document to HTML or PDF, make sure that you have R and RStudio installed. You also need to download the bibliography file and store it in the same folder as the Rmd file.

String Processing in R

Martin Schweinberger
