# 1 Introduction

This tutorial introduces string processing, that is, the manipulation of text (character strings), which is useful when working with language data. The entire code for the sections below can be downloaded here.

# 2 Preparation

As all calculations and visualizations in this tutorial rely on “R”, it is necessary to install “R”, “RStudio”, and “Tinn-R”. If these programs (or, in the case of “R”, environments) are not installed yet, please search for them in your favorite search engine and add the term “download”. Open any of the first few links and follow the installation instructions (they are easy to follow, do not require any specifications, and are pretty much self-explanatory).

In addition, certain “libraries” need to be installed so that the scripts shown below are executed without errors. If you have already installed the libraries mentioned below, you can skip this section. To install the necessary libraries, simply run the following code. The installation may take between 1 and 5 minutes, so you do not need to worry if it takes some time.

# clean current workspace
rm(list = ls(all.names = TRUE))
# set options
options(stringsAsFactors = FALSE)
# install libraries
install.packages(c("stringr"))

Once you have installed “R”, “RStudio”, and “Tinn-R” and have initiated the session by executing the code shown above, you are good to go.

# 3 String processing in base

Before we start with string processing, we will create an example text on which we will perform the processing. In addition, we create three more elements: an additional example text, an element which contains the example text split up into sentences, and a short vector of sentences.

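# define the example text (entered directly here rather than read in from a
# file; the content is the text shown in the output below)
exampletext <- "Grammar is a system of rules which governs the production and use of utterances in a given language. These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."
# define the additional example text in the same way
additionaltext <- "In the early 20th century, Ferdinand de Saussure distinguished between the notions of langue and parole in his formulation of structural linguistics. According to him, parole is the specific utterance of speech, whereas langue refers to an abstract phenomenon that theoretically defines the principles and system of rules that govern a language. This distinction resembles the one made by Noam Chomsky between competence and performance in his theory of transformative or generative grammar. According to Chomsky, competence is an individual's innate capacity and potential for language (like in Saussure's langue), while performance is the specific way in which it is used by individuals, groups, and communities (i.e., parole, in Saussurean terms). "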
# split example text into sentences
splitexampletext <- unlist(strsplit(gsub("(\\.) ", "\\1qwertz", exampletext), "qwertz"))
# create vector with sentences
sentences <- c("This is a first sentence.", "This is a second sentence.", "And this is a third sentence.")
# inspect data
exampletext; splitexampletext; additionaltext; sentences
## [1] "Grammar is a system of rules which governs the production and use of utterances in a given language. These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."
## [1] "Grammar is a system of rules which governs the production and use of utterances in a given language."
## [2] "These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences)."
## [3] "Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."
## [1] "In the early 20th century, Ferdinand de Saussure distinguished between the notions of langue and parole in his formulation of structural linguistics. According to him, parole is the specific utterance of speech, whereas langue refers to an abstract phenomenon that theoretically defines the principles and system of rules that govern a language. This distinction resembles the one made by Noam Chomsky between competence and performance in his theory of transformative or generative grammar. According to Chomsky, competence is an individual's innate capacity and potential for language (like in Saussure's langue), while performance is the specific way in which it is used by individuals, groups, and communities (i.e., parole, in Saussurean terms). "
## [1] "This is a first sentence."     "This is a second sentence."
## [3] "And this is a third sentence."

In the following, we will perform various operations on the example text using only built-in, or base, functions.

The function “substr” extracts a substring from the text by position (position is the character position, i.e. the first character has position 1, the second character position 2, etc.).

# extract substring by position
substr(exampletext, start=14, stop=30)
## [1] "system of rules w"

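The function “grep” returns the positions (indices) of those elements of a vector of texts in which a pattern occurs. In the example below, the result 1 means that the pattern occurs in the first element of the vector; if the pattern does not occur in any element, the result is an empty vector.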

# find substring
grep("language", splitexampletext, value=FALSE, ignore.case=FALSE, fixed=FALSE)
## [1] 1

When the argument “value” is set to “TRUE”, grep returns the elements in which the match occurs but not the elements in which it does not occur.

# find substring
grep("language", splitexampletext, value=TRUE, ignore.case=FALSE, fixed=FALSE)
## [1] "Grammar is a system of rules which governs the production and use of utterances in a given language."

The function “grepl” returns a logical vector with “TRUE” if the pattern occurs in the string and “FALSE” if the pattern does not occur in the string.

# find substring
grepl("language", splitexampletext, ignore.case=FALSE, fixed=FALSE)
## [1]  TRUE FALSE FALSE
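The difference between the index vector returned by “grep” and the logical vector returned by “grepl” is easiest to see on a short vector. The following is a minimal sketch; the object names sents, idx, and hit are made up for this illustration and are not part of the tutorial's data.

```r
# toy vector (hypothetical; it mirrors the `sentences` object created above)
sents <- c("This is a first sentence.",
           "This is a second sentence.",
           "And this is a third sentence.")
# grep returns the positions of the elements that contain the pattern
idx <- grep("second|third", sents)
idx
## [1] 2 3
# grepl returns one logical value per element
hit <- grepl("second|third", sents)
hit
## [1] FALSE  TRUE  TRUE
# both results can be used for subsetting and select the same elements
identical(sents[idx], sents[hit])
## [1] TRUE
```

The logical vector has the same length as the input, which makes “grepl” convenient for filtering rows of a table, while “grep” is handy when the positions themselves are of interest.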

The function “sub” replaces the first(!) occurrence of a pattern with another pattern in a given text.

sub("and", "AND", exampletext, ignore.case=FALSE, fixed=FALSE)
## [1] "Grammar is a system of rules which governs the production AND use of utterances in a given language. These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."

The function “gsub” replaces all occurrences of a pattern with another pattern in a given text.

gsub("and", "AND", exampletext, ignore.case=FALSE, fixed=FALSE)
## [1] "Grammar is a system of rules which governs the production AND use of utterances in a given language. These rules apply to sound as well as meaning, AND include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation AND composition of words), AND syntax (the formation AND composition of phrases AND sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."
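Note that “sub” and “gsub” match sequences of characters, not words, so a pattern such as “and” is also matched inside longer words. The sketch below uses a made-up string (txt) to illustrate this; the word boundary marker \\b is shown as one possible way to restrict the match to whole words.

```r
# made-up string in which "and" also occurs inside other words
txt <- "sand and understanding"
# sub replaces only the first match, which here is inside "sand"
first_only <- sub("and", "AND", txt)
first_only
## [1] "sAND and understanding"
# gsub replaces every match, including those inside longer words
all_matches <- gsub("and", "AND", txt)
all_matches
## [1] "sAND AND understANDing"
# \\b marks a word boundary, so only the free-standing word is replaced
whole_word <- gsub("\\band\\b", "AND", txt)
whole_word
## [1] "sand AND understanding"
```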

The function “gregexpr” reports whether a pattern is present in a text and, if so, where each match begins and how long it is. If the pattern is not present, the reported position is -1.

gregexpr("and", exampletext, ignore.case=FALSE, perl=FALSE,
fixed=FALSE)
## [[1]]
## [1]  59 149 302 329 355 382
## attr(,"match.length")
## [1] 3 3 3 3 3 3
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
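The result of “gregexpr” is a list with attributes, which can be awkward to work with directly. The following sketch shows one way to turn it into start and end positions and to pull out the matched strings with the base function “regmatches”; the string txt and the object names are made up for this illustration.

```r
# made-up example string
txt <- "rules and sounds and words"
m <- gregexpr("and", txt)
# start positions of the matches (attributes are dropped by as.integer)
starts <- as.integer(m[[1]])
# lengths of the matches are stored in the "match.length" attribute
len <- attr(m[[1]], "match.length")
# end position = start position + length - 1
ends <- starts + len - 1L
starts
## [1]  7 18
ends
## [1]  9 20
# regmatches extracts the matched substrings themselves
found <- regmatches(txt, m)[[1]]
found
## [1] "and" "and"
```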

The function “strsplit” splits a text wherever a pattern occurs. The pattern itself is then no longer present in the result (here, the full stop and the following space at the end of the first two sentences are removed).

strsplit(exampletext, "\\. ")
## [[1]]
## [1] "Grammar is a system of rules which governs the production and use of utterances in a given language"
## [2] "These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences)"
## [3] "Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."

One way to get around this is to first mark the split positions with some sequence that does not occur in the text, using the “gsub” function, and then to split on the newly introduced sequence. This way, the full stops are retained in the result.

strsplit(gsub("(\\.) ", "\\1somestring", exampletext), "somestring")
## [[1]]
## [1] "Grammar is a system of rules which governs the production and use of utterances in a given language."
## [2] "These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences)."
## [3] "Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."
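An alternative that avoids the placeholder sequence is to split on the space only and to require, via a lookbehind, that the space is preceded by a full stop. This is a sketch rather than the approach used in this tutorial; it assumes Perl-style regular expressions (perl = TRUE), and the string txt is made up for the illustration.

```r
# made-up example text
txt <- "First sentence. Second sentence. Third sentence."
# (?<=\\.) is a lookbehind: the space is matched only if a full stop precedes
# it, but the full stop itself is not part of the match and is thus kept
parts <- strsplit(txt, "(?<=\\.) ", perl = TRUE)[[1]]
parts
## [1] "First sentence."  "Second sentence." "Third sentence."
```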

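The function “paste” combines texts. The argument “sep” specifies what is placed between texts that are combined element by element, and the argument “collapse” specifies what is placed between the elements of a vector when they are collapsed into a single text.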

paste(splitexampletext, sep=" ", collapse= " ")
## [1] "Grammar is a system of rules which governs the production and use of utterances in a given language. These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."
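Because the example above passes only one vector to “paste”, the argument “sep” has no visible effect there. The following sketch, with two made-up vectors a and b, shows the difference between the two arguments.

```r
# two made-up vectors of equal length
a <- c("a", "b", "c")
b <- c("1", "2", "3")
# sep is inserted between the elements that are combined pairwise
pairwise <- paste(a, b, sep = "-")
pairwise
## [1] "a-1" "b-2" "c-3"
# collapse additionally joins the resulting elements into one string
joined <- paste(a, b, sep = "-", collapse = " ")
joined
## [1] "a-1 b-2 c-3"
# with a single vector, only collapse matters
single <- paste(a, collapse = "+")
single
## [1] "a+b+c"
```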

The function “toupper” converts text characters to upper case.

toupper(exampletext)
## [1] "GRAMMAR IS A SYSTEM OF RULES WHICH GOVERNS THE PRODUCTION AND USE OF UTTERANCES IN A GIVEN LANGUAGE. THESE RULES APPLY TO SOUND AS WELL AS MEANING, AND INCLUDE COMPONENTIAL SUBSETS OF RULES, SUCH AS THOSE PERTAINING TO PHONOLOGY (THE ORGANISATION OF PHONETIC SOUND SYSTEMS), MORPHOLOGY (THE FORMATION AND COMPOSITION OF WORDS), AND SYNTAX (THE FORMATION AND COMPOSITION OF PHRASES AND SENTENCES). MANY MODERN THEORIES THAT DEAL WITH THE PRINCIPLES OF GRAMMAR ARE BASED ON NOAM CHOMSKY'S FRAMEWORK OF GENERATIVE LINGUISTICS."

The function “tolower” converts text characters to lower case.

tolower(exampletext)
## [1] "grammar is a system of rules which governs the production and use of utterances in a given language. these rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). many modern theories that deal with the principles of grammar are based on noam chomsky's framework of generative linguistics."

The function “nchar” provides the number of characters of a text.

nchar(exampletext)
## [1] 523

These are the most common base functions for string operations in “R”. We will now turn to string processing functions in the “stringr” package.

Exercises for string processing with “base”

1. How many words does the example text consist of?
2. How many characters does the text consist of?

# 4 String processing with stringr

The package “stringr” is one of the most widely used packages for string processing in “R” as it makes string processing very easy. All “stringr” functions share a common structure:

str_function(string, pattern)

The two arguments in the structure of stringr-functions are: string which is the character string to be processed and a pattern which is either a simple sequence of characters, a regular expression, or a combination of both. Because the string comes first, the stringr-functions are ideal for piping and thus use in tidyverse-R.
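To illustrate why the string-first structure suits piping, here is a minimal sketch that chains three “stringr” functions. It assumes that the pipe operator %>% is available once “stringr” is loaded (as in the mtcars example at the end of this section); the input string is made up, and the individual functions are introduced further below.

```r
library(stringr)
# each function receives the result of the previous step as its first argument
result <- "  Grammar   is a system of RULES.  " %>%
  str_squish() %>%                     # tidy up the white space
  str_to_lower() %>%                   # convert to lower case
  str_replace("rules", "conventions")  # replace the first match
result
## [1] "grammar is a system of conventions."
```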

All function names of “stringr” begin with str, then an underscore and then the name of the action to be performed. For example, to replace the first occurrence of a pattern in a string, we should use str_replace(). In the following, we will use “stringr” functions to perform various operations on the example text. In a first step, we load the “stringr” package.

# load stringr library
library(stringr)

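The function “str_count” counts how often a pattern occurs in a text. If no pattern is provided, as in the example below, every character counts as a match so that, like “nchar” in “base”, “str_count” provides the number of characters of a text.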

str_count(splitexampletext)
## [1] 100 295 126
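When a pattern is supplied, “str_count” returns the number of matches in each text. The sketch below counts the letter “e” in a toy vector (sents, which is made up here and mirrors the “sentences” object created earlier).

```r
library(stringr)
# toy vector (hypothetical; it mirrors the `sentences` object created above)
sents <- c("This is a first sentence.",
           "This is a second sentence.",
           "And this is a third sentence.")
# number of times the letter "e" occurs in each element
e_counts <- str_count(sents, "e")
e_counts
## [1] 3 4 3
```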

The function “str_detect” tests whether a pattern is present in a text and outputs a logical vector with TRUE if the pattern occurs and FALSE if it does not.

str_detect(splitexampletext, "and")
## [1]  TRUE  TRUE FALSE

The function “str_extract” extracts the first occurrence of a pattern, if that pattern is present in a text.

str_extract(exampletext, "and")
## [1] "and"

The function “str_extract_all” extracts all occurrences of a pattern, if that pattern is present in a text.

str_extract_all(exampletext, "and")
## [[1]]
## [1] "and" "and" "and" "and" "and" "and"

The function “str_locate” provides the start and end position of the first match of the pattern in a text.

str_locate(exampletext, "and") 
##      start end
## [1,]    59  61

The function “str_locate_all” provides the start and end positions of all matches of the pattern in a text and displays the result in matrix form.

str_locate_all(exampletext, "and")
## [[1]]
##      start end
## [1,]    59  61
## [2,]   149 151
## [3,]   302 304
## [4,]   329 331
## [5,]   355 357
## [6,]   382 384

The function “str_match” extracts the first occurrence of the pattern in a text and returns the result as a character matrix.

str_match(exampletext, "and") 
##      [,1]
## [1,] "and"

The function “str_match_all” extracts all occurrences of the pattern from a text.

str_match_all(exampletext, "and")
## [[1]]
##      [,1]
## [1,] "and"
## [2,] "and"
## [3,] "and"
## [4,] "and"
## [5,] "and"
## [6,] "and"
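With a simple pattern such as “and”, “str_match” looks much like “str_extract”. The matrix output becomes useful when the pattern contains groups in parentheses, because each group is then returned in a column of its own. The following is a sketch with a made-up vector (terms) and a made-up pattern.

```r
library(stringr)
# made-up vector; the third element does not match the pattern
terms <- c("phonology (the organisation of sounds)",
           "syntax (the formation of phrases)",
           "grammar")
# group 1: the word before the bracket; group 2: the text inside the brackets
m <- str_match(terms, "(\\w+) \\((.*)\\)")
# column 1 holds the full match, columns 2 and 3 hold the two groups
m[, 2]
## [1] "phonology" "syntax"    NA
m[, 3]
## [1] "the organisation of sounds" "the formation of phrases"   NA
```

Texts without a match yield a row of NA values, so the result always has one row per input text.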

The function “str_remove” removes the first occurrence of a pattern in a text.

str_remove(exampletext, "and") 
## [1] "Grammar is a system of rules which governs the production  use of utterances in a given language. These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."

The function “str_remove_all” removes all occurrences of a pattern from a text.

str_remove_all(exampletext, "and")
## [1] "Grammar is a system of rules which governs the production  use of utterances in a given language. These rules apply to sound as well as meaning,  include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation  composition of words),  syntax (the formation  composition of phrases  sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."

The function “str_replace” replaces the first occurrence of a pattern with something else in a text.

str_replace(exampletext, "and", "AND")
## [1] "Grammar is a system of rules which governs the production AND use of utterances in a given language. These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."

The function “str_replace_all” replaces all occurrences of a pattern with something else in a text.

str_replace_all(exampletext, "and", "AND")
## [1] "Grammar is a system of rules which governs the production AND use of utterances in a given language. These rules apply to sound as well as meaning, AND include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation AND composition of words), AND syntax (the formation AND composition of phrases AND sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."

The function “str_starts” tests whether a given text begins with a certain pattern and outputs a logical vector.

str_starts(exampletext, "and") 
## [1] FALSE

The function “str_ends” tests whether a text ends with a certain pattern and outputs a logical vector.

str_ends(exampletext, "and")
## [1] FALSE

Like “strsplit”, the function “str_split” splits a text when a given pattern occurs. If an empty string is provided as the pattern, then the text is split into individual symbols.

str_split(exampletext, "and") 
## [[1]]
## [1] "Grammar is a system of rules which governs the production "
## [2] " use of utterances in a given language. These rules apply to sound as well as meaning, "
## [3] " include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation "
## [4] " composition of words), "
## [5] " syntax (the formation "
## [6] " composition of phrases "
## [7] " sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."

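The function “str_split_fixed” splits a text when a given pattern occurs but returns only as many pieces as indicated by the argument “n”. So, even if the pattern occurs more often, “str_split_fixed” will split the text into “n” pieces only (that is, it splits at most n - 1 times), and the last piece contains the remainder of the text. The result is a character matrix with “n” columns.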

str_split_fixed(exampletext, "and", n = 3)
##      [,1]
## [1,] "Grammar is a system of rules which governs the production "
##      [,2]
## [1,] " use of utterances in a given language. These rules apply to sound as well as meaning, "
##      [,3]
## [1,] " include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."

The function “str_subset” extracts those elements of a vector of texts that contain a certain pattern.

str_subset(splitexampletext, "and") 
## [1] "Grammar is a system of rules which governs the production and use of utterances in a given language."
## [2] "These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences)."

The function “str_which” provides a vector with the indices of the texts that contain a certain pattern.

str_which(splitexampletext, "and")
## [1] 1 2

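The function “str_view” displays a text or vector of texts with the matches of a pattern highlighted. In older versions of “stringr”, only the first match in each text is highlighted; in more recent versions (1.5.0 and later), all matches are highlighted.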

str_view(splitexampletext, "and")

The function “str_view_all” shows the locations of all instances of a pattern in a text or vector of texts. In recent versions of “stringr”, it is deprecated in favour of “str_view”.

str_view_all(exampletext, "and")

The function “str_pad” adds white spaces to a text or vector of texts so that they reach a given number of symbols. By default, the white spaces are added on the left.

# create text with white spaces
text <- " this    is a    text   "
str_pad(text, width = 30)
## [1] "       this    is a    text   "

The function “str_trim” removes white spaces from the beginning(s) and end(s) of a text or vector of texts.

str_trim(text) 
## [1] "this    is a    text"

The function “str_squish” removes white spaces from the beginning and end of a text or vector of texts and reduces repeated white spaces within a text to a single space.

str_squish(text)
## [1] "this is a text"

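The function “str_wrap” wraps a text or vector of texts into lines of a given width by inserting line breaks. As a side effect, it also removes white spaces from the beginning(s) and end(s) of the text and reduces repeated white spaces within the text. Because the text below is shorter than the default width of 80 symbols, only this side effect is visible in the output.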

str_wrap(text)
## [1] "this is a text"
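To see the wrapping itself, the text has to be longer than the requested width. The following sketch wraps a made-up sentence (txt) at a width of 30 symbols; where exactly the line breaks fall is left to “str_wrap”, so no output is shown here.

```r
library(stringr)
# made-up sentence that is longer than the requested width
txt <- "Grammar is a system of rules which governs the production and use of utterances."
# insert line breaks so that each line is at most about 30 symbols wide
wrapped <- str_wrap(txt, width = 30)
# cat prints the line breaks as actual new lines
cat(wrapped, "\n")
# the individual lines can be recovered by splitting on the line break
wrapped_lines <- str_split(wrapped, "\n")[[1]]
str_length(wrapped_lines)
```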

The function “str_order” provides a vector of indices that represents the alphabetical order of the texts in a vector of texts.

str_order(splitexampletext)
## [1] 1 3 2

The function “str_sort” sorts a vector of texts alphabetically.

str_sort(splitexampletext)
## [1] "Grammar is a system of rules which governs the production and use of utterances in a given language."
## [2] "Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."
## [3] "These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences)."
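In the example text, the alphabetical order happens to coincide with the order by length. The following sketch uses a made-up vector (fruits) in which the two differ and shows one possible way to order texts by their length instead, by combining the base function “order” with “str_length”.

```r
library(stringr)
# made-up vector in which alphabetical order and order by length differ
fruits <- c("pear", "fig", "banana")
# alphabetical sorting
sorted_alpha <- str_sort(fruits)
sorted_alpha
## [1] "banana" "fig"    "pear"
# indices that put the vector into alphabetical order
alpha_idx <- str_order(fruits)
alpha_idx
## [1] 3 2 1
# ordering by length: order the vector by the number of characters
sorted_length <- fruits[order(str_length(fruits))]
sorted_length
## [1] "fig"    "pear"   "banana"
```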

The function “str_to_upper” converts all symbols in a text or vector of texts to upper case.

str_to_upper(exampletext) 
## [1] "GRAMMAR IS A SYSTEM OF RULES WHICH GOVERNS THE PRODUCTION AND USE OF UTTERANCES IN A GIVEN LANGUAGE. THESE RULES APPLY TO SOUND AS WELL AS MEANING, AND INCLUDE COMPONENTIAL SUBSETS OF RULES, SUCH AS THOSE PERTAINING TO PHONOLOGY (THE ORGANISATION OF PHONETIC SOUND SYSTEMS), MORPHOLOGY (THE FORMATION AND COMPOSITION OF WORDS), AND SYNTAX (THE FORMATION AND COMPOSITION OF PHRASES AND SENTENCES). MANY MODERN THEORIES THAT DEAL WITH THE PRINCIPLES OF GRAMMAR ARE BASED ON NOAM CHOMSKY'S FRAMEWORK OF GENERATIVE LINGUISTICS."

The function “str_to_lower” converts all symbols in a text or vector of texts to lower case.

str_to_lower(exampletext) 
## [1] "grammar is a system of rules which governs the production and use of utterances in a given language. these rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). many modern theories that deal with the principles of grammar are based on noam chomsky's framework of generative linguistics."

The function “str_c” combines texts into one text. By default, nothing is inserted between the combined texts.

str_c(exampletext, additionaltext)
## [1] "Grammar is a system of rules which governs the production and use of utterances in a given language. These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics.In the early 20th century, Ferdinand de Saussure distinguished between the notions of langue and parole in his formulation of structural linguistics. According to him, parole is the specific utterance of speech, whereas langue refers to an abstract phenomenon that theoretically defines the principles and system of rules that govern a language. This distinction resembles the one made by Noam Chomsky between competence and performance in his theory of transformative or generative grammar. According to Chomsky, competence is an individual's innate capacity and potential for language (like in Saussure's langue), while performance is the specific way in which it is used by individuals, groups, and communities (i.e., parole, in Saussurean terms). "

The function “str_conv” specifies the encoding in which a text is to be interpreted, e.g. “UTF-8” or “Latin1”. This is useful when a text has been read in with the wrong encoding.

str_conv(exampletext, encoding = "UTF-8")
## [1] "Grammar is a system of rules which governs the production and use of utterances in a given language. These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."

The function “str_dup” repeats a text or a vector of texts n times and combines the copies into one text.

str_dup(exampletext, times=2)
## [1] "Grammar is a system of rules which governs the production and use of utterances in a given language. These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics.Grammar is a system of rules which governs the production and use of utterances in a given language. These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."

The function “str_flatten” combines a vector of texts into one text. The argument “collapse” defines the symbol that occurs between the combined texts.

str_flatten(sentences, collapse = " ")
## [1] "This is a first sentence. This is a second sentence. And this is a third sentence."

If the argument “collapse” is left out, the texts will be combined without any symbol between the combined texts.

str_flatten(sentences)
## [1] "This is a first sentence.This is a second sentence.And this is a third sentence."

The function “str_length” provides the length of texts in characters.

str_length(exampletext)
## [1] 523

The function “str_replace_na” replaces missing values (NA) in a vector of texts with a text. It is important to note that only missing values are replaced: if NA occurs within a string, it is considered to be the literal string “NA” and is left unchanged.

# create sentences with NA
sentencesna <- c("Some text", NA, "Some more text", "Some NA text")
# apply str_replace_na function
str_replace_na(sentencesna, replacement = "Something new")
## [1] "Some text"      "Something new"  "Some more text" "Some NA text"

The function “str_trunc” shortens strings to a given maximum number of characters and marks the truncation with … at the end. The … counts towards the maximum number of characters.

str_trunc(sentences, width = 20)
## [1] "This is a first s..." "This is a second ..." "And this is a thi..."

The function “str_sub” extracts a string from a text from a start position to an end position (expressed as character positions).

str_sub(exampletext, 5, 25)
## [1] "mar is a system of ru"

The function “word” extracts words from a text (expressed as word positions).

word(exampletext, 2:7)
## [1] "is"     "a"      "system" "of"     "rules"  "which"
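In the example above, a vector of positions is passed to “word”, so each word is returned as a separate element. The following sketch, with a made-up string (s), shows how a start and an end position can be combined to extract a sequence of words as one text, and how negative positions count from the end of the text.

```r
library(stringr)
# made-up example string
s <- "Grammar is a system of rules"
# words 1 to 3 as a single text
first_three <- word(s, start = 1, end = 3)
first_three
## [1] "Grammar is a"
# negative positions count backwards from the last word
last_word <- word(s, -1)
last_word
## [1] "rules"
# everything from the second word to the last word
rest <- word(s, start = 2, end = -1)
rest
## [1] "is a system of rules"
```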

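The function “str_glue” combines strings and allows variables and “R” expressions to be inserted into the text by placing them in curly brackets. Note that the names of weekdays and months in the output depend on the language settings of the system on which the code is run.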

name <- "Fred"
age <- 50
anniversary <- as.Date("1991-10-12")
str_glue(
"My name is {name}, ",
"my age next year is {age + 1}, ",
"and my anniversary is {format(anniversary, '%A, %B %d, %Y')}."
)
## My name is Fred, my age next year is 51, and my anniversary is Saturday, October 12, 1991.

The function “str_glue_data” is particularly useful when it is used in data pipelines. The data set “mtcars” is a built-in data set that is available automatically when starting “R”.

mtcars %>%
str_glue_data("{rownames(.)} has {hp} hp")
## Mazda RX4 has 110 hp
## Mazda RX4 Wag has 110 hp
## Datsun 710 has 93 hp
## Hornet 4 Drive has 110 hp
## Hornet Sportabout has 175 hp
## Valiant has 105 hp
## Duster 360 has 245 hp
## Merc 240D has 62 hp
## Merc 230 has 95 hp
## Merc 280 has 123 hp
## Merc 280C has 123 hp
## Merc 450SE has 180 hp
## Merc 450SL has 180 hp
## Merc 450SLC has 180 hp
## Cadillac Fleetwood has 205 hp
## Lincoln Continental has 215 hp
## Chrysler Imperial has 230 hp
## Fiat 128 has 66 hp
## Honda Civic has 52 hp
## Toyota Corolla has 65 hp
## Toyota Corona has 97 hp
## Dodge Challenger has 150 hp
## AMC Javelin has 150 hp
## Camaro Z28 has 245 hp
## Pontiac Firebird has 175 hp
## Fiat X1-9 has 66 hp
## Porsche 914-2 has 91 hp
## Lotus Europa has 113 hp
## Ford Pantera L has 264 hp
## Ferrari Dino has 175 hp
## Maserati Bora has 335 hp
## Volvo 142E has 109 hp

Exercises for string processing with “stringr”