Introduction

This tutorial introduces how to extract concordances and keyword-in-context (KWIC) displays with R. The entire R-markdown document for the tutorial can be downloaded here.

In the language sciences, concordancing refers to the extraction of words from a given text or texts (Lindquist 2009, 5). Commonly, concordances are displayed in the form of keyword-in-context displays (KWICs) where the search term is shown in context, i.e. with preceding and following words. Concordancing is central to analyses of text and often represents the first step in more sophisticated analyses of language data (Stefanowitsch 2020). Concordances play such a key role in the language sciences because they are extremely valuable for understanding how a word or phrase is used, how often it is used, and in which contexts it is used. As concordances allow us to analyze the context in which a word or phrase occurs and provide frequency information about word use, they also enable us to analyze collocations or the collocational profiles of words and phrases (Stefanowitsch 2020, 50–51). Finally, concordances are also very commonly used to extract examples.

Concordances in AntConc.

There are various very good software packages that can be used to create concordances, both for offline use (e.g. AntConc (Anthony 2004), SketchEngine (Kilgarriff et al. 2004), MONOCONC (Barlow 1999), and ParaConc (Barlow 2002)) and online use (see e.g. here).

In addition, many available corpora, such as the BYU corpora, can be accessed via web interfaces that have built-in concordancing functions.

Online concordances extracted from the COCA corpus that is part of the BYU corpora.

While these packages are very user-friendly and offer various additional functionalities, and while almost everyone who is engaged in analyzing language has used concordance software, they all suffer from shortcomings that render R a viable alternative. Such issues include that these applications

  • are black boxes that researchers do not have full control over and whose inner workings they cannot inspect

  • are not open source

  • hinder replication because replicating an analysis is more time consuming than with analyses based on Notebooks

  • are commonly not free of charge or have other restrictions on use (a notable exception is AntConc)

R represents an alternative to ready-made concordancing applications because it:

  • is extremely flexible and enables researchers to perform their entire analysis in a single environment

  • allows full transparency and documentation as analyses can be based on Notebooks

  • offers version control measures (this means that the specific versions of the involved software are traceable)

  • makes research more replicable as entire analyses can be reproduced by simply running the Notebooks that the research is based on

The fact that R enables full transparency and replicability is especially relevant given the ongoing Replication Crisis (Yong, n.d.; Aschwanden, n.d.; Diener and Biswas-Diener 2019; Velasco, n.d.; McRae, n.d.). The Replication Crisis is an ongoing methodological crisis, beginning in the early 2010s, that primarily affects parts of the social and life sciences (see also Fanelli 2009). Replication is important so that other researchers, or the public for that matter, can see or, indeed, reproduce exactly what you have done. Fortunately, R allows you to document your entire workflow as you can store everything you do in what is called a script or a notebook (in fact, this document was originally an R notebook). If someone is then interested in how you conducted your analysis, you can simply share this notebook or the script you have written with that person.

Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to and more information on how to use R here. For this tutorial, we need to install certain packages from an R library so that the scripts shown below are executed without errors. If you have already installed the packages mentioned below, you can skip this section. Otherwise, simply run the following code to install them. Installation may take some time (between 1 and 5 minutes for all of the packages), so you do not need to worry if it takes a while.

# clean current workspace
rm(list = ls(all.names = TRUE))
# set options
options(stringsAsFactors = FALSE)     # no automatic data transformation
options("scipen" = 100, "digits" = 4) # suppress math annotation
# install libraries
install.packages(c("quanteda", "dplyr", "stringr", "knitr", "kableExtra", "gutenbergr"))

Once you have installed RStudio and initiated the session by executing the code shown above, you are good to go.

Loading and processing textual data

Before creating concordances or KWICs, we load the necessary packages from the R library, which help to download, process, and display the data as well as create the concordances.

# activate packages
library(quanteda)
library(dplyr)
library(stringr)
library(knitr)
library(kableExtra)
library(gutenbergr)

For this tutorial, we will use Charles Darwin’s On the Origin of Species by means of Natural Selection which we download from the Project Gutenberg archive (see Stroube 2003). Thus, Darwin’s Origin of Species forms the basis of our analysis. You can use the code below to download this text into R (but you have to have access to the internet to do so).

origin <- gutenberg_works(gutenberg_id == "1228") %>%
  gutenberg_download(meta_fields = "gutenberg_id")
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
# inspect data
kable(head(origin), caption = "First 6 rows of the text") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
First 6 rows of the text
gutenberg_id text
1228 ON THE ORIGIN OF SPECIES.
1228
1228 OR THE PRESERVATION OF FAVOURED RACES IN THE STRUGGLE FOR LIFE.
1228
1228
1228 By Charles Darwin, M.A.,

The table above shows that Darwin’s Origin of Species requires formatting so that we can use it. Therefore, we collapse it into a single object (or text) and remove superfluous white spaces.

origin <- origin$text %>%
  paste0(collapse = " ") %>%
  str_squish()
str(origin)
##  chr "ON THE ORIGIN OF SPECIES. OR THE PRESERVATION OF FAVOURED RACES IN THE STRUGGLE FOR LIFE. By Charles Darwin, M."| __truncated__

The result confirms that the entire text is now combined into a single character object.
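To make the collapsing step concrete, here is a minimal base-R sketch on a toy vector. The vector `lines` is a made-up stand-in for the downloaded text column, and `gsub` plus `trimws` are used here as an assumed base-R approximation of what `str_squish` does.

```r
# Toy stand-in for the text column: lines with a blank row and stray spaces
lines <- c("ON THE ORIGIN  OF SPECIES.", "", "  By Charles Darwin, M.A.,")

# paste() joins the lines into one string, gsub() shrinks runs of white space
# to a single space, and trimws() removes white space at both ends
collapsed <- trimws(gsub("\\s+", " ", paste(lines, collapse = " ")))
collapsed
```

The result is a character vector of length one, which is the shape the concordancing step expects.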

Creating simple concordances

Now that we have loaded the data, we can easily extract concordances using the kwic function from the quanteda package. The kwic function takes the text (x) and the search pattern (pattern) as its main arguments but it also allows the specification of the context window, i.e. how many words/elements are shown to the left and right of the keyword (we will go over this later on).

kwic_natural <- kwic(x = origin, pattern = "selection")
# inspect data
kable(head(kwic_natural), caption = "First 6 concordances for the keyword *selection*.") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
First 6 concordances for the keyword selection.
docname from to pre keyword post pattern
text1 269 269 and Origin . Principle of Selection anciently followed , its Effects selection
text1 279 279 Effects . Methodical and Unconscious Selection . Unknown Origin of our selection
text1 294 294 favourable to Man’s power of Selection . CHAPTER 2 . VARIATION selection
text1 380 380 EXISTENCE . Bears on natural selection . The term used in selection
text1 474 474 . CHAPTER 4 . NATURAL SELECTION . Natural Selection : its selection
text1 477 477 . NATURAL SELECTION . Natural Selection : its power compared with selection

We can easily extract the frequency of the search term (selection) using the nrow or the length functions, which provide the number of rows of a table (nrow) or the length of a vector (length).

nrow(kwic_natural)
## [1] 414
length(kwic_natural$keyword)
## [1] 414

The results show that there are 414 instances of the search term (selection) but we can also find out how often different variants (lower case versus upper case) of the search term were found using the table function. This is especially useful when searches involve many different search terms (while it is, admittedly, less useful in the present example).

table(kwic_natural$keyword)
## 
## selection Selection SELECTION 
##       369        39         6

To get a better understanding of the use of a word, it is often useful to extract more context. This is easily done by increasing the size of the context window. To do this, we specify the window argument of the kwic function. In the example below, we set the context window size to 10 words/elements rather than using the default (which is 5 words/elements).

kwic_natural_longer <- kwic(x = origin, pattern = "selection", window = 10)
# inspect data
kable(head(kwic_natural_longer), 
      caption = "First 6 concordances for the keyword *selection* with an extended context (10 elements).") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
First 6 concordances for the keyword selection with an extended context (10 elements).
docname from to pre keyword post pattern
text1 269 269 Domestic Pigeons , their Differences and Origin . Principle of Selection anciently followed , its Effects . Methodical and Unconscious Selection selection
text1 279 279 Selection anciently followed , its Effects . Methodical and Unconscious Selection . Unknown Origin of our Domestic Productions . Circumstances favourable selection
text1 294 294 our Domestic Productions . Circumstances favourable to Man’s power of Selection . CHAPTER 2 . VARIATION UNDER NATURE . Variability . selection
text1 380 380 CHAPTER 3 . STRUGGLE FOR EXISTENCE . Bears on natural selection . The term used in a wide sense . Geometrical selection
text1 474 474 most important of all relations . CHAPTER 4 . NATURAL SELECTION . Natural Selection : its power compared with man’s selection selection
text1 477 477 all relations . CHAPTER 4 . NATURAL SELECTION . Natural Selection : its power compared with man’s selection , its power selection
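To illustrate what the context window does, the following is a rough base-R sketch of a keyword-in-context search. The function `simple_kwic` is a hypothetical helper written for this illustration and is not part of quanteda; it splits on white space only, so unlike kwic it does not treat punctuation as separate elements.

```r
# Hypothetical helper: a bare-bones KWIC over white-space separated tokens
simple_kwic <- function(text, keyword, window = 5) {
  tokens <- strsplit(text, "\\s+")[[1]]
  # case-insensitive match of whole tokens against the keyword
  hits <- which(tolower(tokens) == tolower(keyword))
  data.frame(
    position = hits,
    # up to `window` tokens to the left of each hit
    pre = vapply(hits, function(i) {
      if (i == 1) return("")
      paste(tokens[max(1, i - window):(i - 1)], collapse = " ")
    }, character(1)),
    keyword = tokens[hits],
    # up to `window` tokens to the right of each hit
    post = vapply(hits, function(i) {
      if (i == length(tokens)) return("")
      paste(tokens[(i + 1):min(length(tokens), i + window)], collapse = " ")
    }, character(1)),
    stringsAsFactors = FALSE
  )
}

toy <- "natural selection acts slowly but natural selection never stops"
simple_kwic(toy, "selection", window = 2)
```

With a window of 2, each hit keeps at most two tokens on either side; increasing `window` widens the pre and post columns in the same way as the window argument above.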

Extracting more than single words

While extracting single words is very common, you may want to extract more than just one word. To extract phrases, all you need to do is to specify that the pattern you are looking for is a phrase, as shown below.

kwic_naturalselection <- kwic(origin, pattern = phrase("natural selection"))
# inspect data
kable(head(kwic_naturalselection), caption = "First 6 concordances for the key phrase *natural selection*.") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
First 6 concordances for the key phrase natural selection.
docname from to pre keyword post pattern
text1 379 380 FOR EXISTENCE . Bears on natural selection . The term used in natural selection
text1 473 474 relations . CHAPTER 4 . NATURAL SELECTION . Natural Selection : its natural selection
text1 476 477 4 . NATURAL SELECTION . Natural Selection : its power compared with natural selection
text1 524 525 Circumstances favourable and unfavourable to Natural Selection , namely , intercrossing , natural selection
text1 543 544 action . Extinction caused by Natural Selection . Divergence of Character , natural selection
text1 567 568 to naturalisation . Action of Natural Selection , through Divergence of Character natural selection

Of course you can extend this to longer sequences such as entire sentences. However, you may want to extract more or less concrete patterns rather than words or phrases. To search for patterns rather than words, you need to include regular expressions in your search pattern.

Searches using regular expressions

Regular expressions allow you to search for abstract patterns rather than concrete words or phrases, which provides you with extreme flexibility in what you can retrieve. A regular expression (in short also called regex or regexp) is a special sequence of characters that describes a pattern. You can think of regular expressions as very powerful combinations of wildcards or as wildcards on steroids. For example, the sequence [a-z]{1,3} is a regular expression that stands for one to three lower case characters; if you searched for this regular expression, you would get, for instance, is, a, an, of, the, my, our, and many other short words as results.
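The [a-z]{1,3} example can be tried out with a short base-R sketch. The sentence is invented for illustration, and the pattern is wrapped in word boundaries (an addition made here) so that only complete short words are returned rather than the first letters of longer words.

```r
# Invented example sentence
sentence <- "the origin of species is an abstract of my work"

# extract every complete word that consists of one to three lower case letters
short_words <- regmatches(
  sentence,
  gregexpr("\\b[a-z]{1,3}\\b", sentence, perl = TRUE)
)[[1]]
short_words
```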

There are three basic types of regular expressions:

  • regular expressions that stand for individual symbols and determine frequencies

  • regular expressions that stand for classes of symbols

  • regular expressions that stand for structural properties

The regular expressions below show the first type of regular expressions, i.e. regular expressions that stand for individual symbols and determine frequencies.

Regular expressions that stand for individual symbols and determine frequencies.
RegEx Symbol/Sequence Explanation Example
? The preceding item is optional and will be matched at most once walk[a-z]? = walk, walks
* The preceding item will be matched zero or more times walk[a-z]* = walk, walks, walked, walking
+ The preceding item will be matched one or more times walk[a-z]+ = walks, walked, walking
{n} The preceding item is matched exactly n times walk[a-z]{2} = walked
{n,} The preceding item is matched n or more times walk[a-z]{2,} = walked, walking
{n,m} The preceding item is matched at least n times, but not more than m times walk[a-z]{2,3} = walked, walking
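The quantifiers in the table can be checked against a small set of word forms. This is a sketch in base R; `matches_whole` is a hypothetical helper defined only for this illustration, and it adds the anchors ^ and $ so that a pattern has to cover the whole form to count as a match.

```r
forms <- c("walk", "walks", "walked", "walking")

# Hypothetical helper: return the forms that a pattern matches in full.
# The ^ and $ anchors are added so that partial matches do not count.
matches_whole <- function(regex) {
  forms[grepl(paste0("^", regex, "$"), forms)]
}

matches_whole("walk[a-z]?")      # at most one additional letter
matches_whole("walk[a-z]*")      # any number of additional letters
matches_whole("walk[a-z]+")      # at least one additional letter
matches_whole("walk[a-z]{2}")    # exactly two additional letters
matches_whole("walk[a-z]{2,}")   # two or more additional letters
matches_whole("walk[a-z]{2,3}")  # two or three additional letters
```

Without the anchors, a pattern such as walk[a-z]? would also be found inside walking, because a regular expression only needs to match some part of a string.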

The regular expressions below show the second type of regular expressions, i.e. regular expressions that stand for classes of symbols.

Regular expressions that stand for classes of symbols.
RegEx Symbol/Sequence Explanation
[ab] lower case a and b
[AB] upper case a and b
[12] digits 1 and 2
[:digit:] digits: 0 1 2 3 4 5 6 7 8 9
[:lower:] lower case characters: a–z
[:upper:] upper case characters: A–Z
[:alpha:] alphabetic characters: a–z and A–Z
[:alnum:] digits and alphabetic characters
[:punct:] punctuation characters: . , ; etc.
[:graph:] graphical characters: [:alnum:] and [:punct:]
[:blank:] blank characters: Space and tab
[:space:] space characters: Space, tab, newline, and other space characters
[:print:] printable characters: [:alnum:], [:punct:] and [:space:]
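As a brief sketch of how such classes behave in practice, the base-R lines below apply three of them to an invented string. Note that in R the class name is placed inside a further pair of square brackets, e.g. [[:punct:]].

```r
# Invented example string
x <- "Chapter 4: Natural Selection, p. 80!"

no_punct  <- gsub("[[:punct:]]", "", x)                      # drop punctuation
no_digits <- gsub("[[:digit:]]", "", x)                      # drop digits
capitals  <- regmatches(x, gregexpr("[[:upper:]]", x))[[1]]  # collect upper case letters

no_punct
no_digits
capitals
```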

The regular expressions that denote classes of symbols are enclosed in [] and :. The last type of regular expressions, i.e. regular expressions that stand for structural properties, is shown below.

Regular expressions that stand for structural properties.
RegEx Symbol/Sequence Explanation
\w Word characters: [[:alnum:]_]
\W No word characters: [^[:alnum:]_]
\s Space characters: [[:blank:]]
\S No space characters: [^[:blank:]]
\d Digits: [[:digit:]]
\D No digits: [^[:digit:]]
\b Word edge
\B No word edge
\< Word beginning
\> Word end
^ Beginning of a string
$ End of a string
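The effect of word edges and string anchors is easiest to see on a handful of words. The following base-R sketch uses an invented word list; in R code the backslash has to be doubled inside a string.

```r
# Invented word list
words <- c("natural", "unnatural", "nature", "selection", "deselect")

anywhere     <- words[grepl("natu", words)]                   # natu anywhere in the word
word_start   <- words[grepl("\\bnatu", words, perl = TRUE)]   # natu at a word edge
string_start <- words[grepl("^selec", words)]                 # selec at the beginning of the string
string_end   <- words[grepl("tion$", words)]                  # tion at the end of the string

anywhere
word_start
string_start
string_end
```

The plain pattern also finds unnatural, whereas the version with the word edge does not.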

To include regular expressions in your KWIC searches, you include them in your search pattern and set the argument valuetype to "regex". The search pattern "\\bnatu.*|\\bselec.*" retrieves elements that contain natu or selec followed by any characters and where the n in natu and the s in selec are at a word boundary, i.e. where they are the first letters of a word. Hence, our search would not retrieve words like unnatural or deselect. The | is an operator (like +, -, or *) that stands for or.

# define search patterns
patterns <- c("\\bnatu.*|\\bselec.*")
kwic_regex <- kwic(origin, patterns, valuetype = "regex")
# inspect data
kable(head(kwic_regex), caption = "First 6 concordances for the regular expression \\bnatu.*.") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
First 6 concordances for the regular expression \bnatu.*.
docname from to pre keyword post pattern
text1 269 269 and Origin . Principle of Selection anciently followed , its Effects \bnatu.*|\bselec.*
text1 279 279 Effects . Methodical and Unconscious Selection . Unknown Origin of our \bnatu.*|\bselec.*
text1 294 294 favourable to Man’s power of Selection . CHAPTER 2 . VARIATION \bnatu.*|\bselec.*
text1 301 301 CHAPTER 2 . VARIATION UNDER NATURE . Variability . Individual Differences \bnatu.*|\bselec.*
text1 379 379 FOR EXISTENCE . Bears on natural selection . The term used \bnatu.*|\bselec.*
text1 380 380 EXISTENCE . Bears on natural selection . The term used in \bnatu.*|\bselec.*

Piping concordances

Quite often, we only want to retrieve patterns if they occur in a certain context. For instance, we might be interested in instances of selection but only if the preceding word is natural. Such conditional concordances could be extracted using regular expressions but they are easier to retrieve by piping. Piping is done using the %>% function from the dplyr package and the piping sequence can be translated as and then. We can then filter those concordances that contain natural using the filter function from the dplyr package. Note that the $ stands for the end of a string so that natural$ means that natural is the last element in the string that precedes the keyword.

kwic_pipe <- kwic(x = origin, pattern = "selection") %>%
  filter(str_detect(pre, "natural$|NATURAL$"))
# inspect data
kable(head(kwic_pipe), caption = "First 6 concordances for instances of *selection* that are preceded by *natural*.") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
First 6 concordances for instances of selection that are preceded by natural.
docname from to pre keyword post pattern
text1 380 380 EXISTENCE . Bears on natural selection . The term used in selection
text1 474 474 . CHAPTER 4 . NATURAL SELECTION . Natural Selection : its selection
text1 612 612 disuse , combined with natural selection ; organs of flight and selection
text1 1455 1455 far the theory of natural selection may be extended . Effects selection
text1 6474 6474 do occur ; but natural selection , as will hereafter be selection
text1 14664 14664 a process of " natural selection , " as will hereafter selection

Piping is a very useful helper function and it is very frequently used in R - not only in the context of text processing but in all data science related domains.
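The filtering condition itself can be examined on a toy table in base R. The data frame below is invented and only mimics the pre and keyword columns of a concordance; it is meant as a sketch of what the end-of-string anchor does and does not guarantee.

```r
# Invented rows that mimic the pre and keyword columns of a concordance
toy <- data.frame(
  pre = c("Bears on natural", "power of", "CHAPTER 4 . NATURAL", "an unnatural"),
  keyword = c("selection", "Selection", "SELECTION", "selection"),
  stringsAsFactors = FALSE
)

# same condition as in the pipe above: the pre context ends in natural or NATURAL
kept_loose <- toy[grepl("natural$|NATURAL$", toy$pre), ]

# stricter variant (an assumption made for this sketch): require a word edge
# before natural and ignore case
kept_strict <- toy[grepl("\\bnatural$", toy$pre, ignore.case = TRUE, perl = TRUE), ]

kept_loose$pre
kept_strict$pre
```

On this toy data the looser condition also keeps the row ending in unnatural, so adding a word edge may be worth considering when the preceding word has to be exactly natural.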

Arranging concordances and adding frequency information

When inspecting concordances, it is useful to re-order the concordances so that they do not appear in the order that they appeared in the text or texts but by the context. To reorder concordances, we can use the arrange function from the dplyr package which takes the column according to which we want to re-arrange the data as its main argument.

In the example below, we extract all instances of natural and then arrange the instances according to the content of the post column in alphabetical order.

kwic_ordered <- kwic(x = origin, pattern = "natural") %>%
  arrange(post)
# inspect data
kable(head(kwic_ordered), caption = "First 6 re-ordered concordances for instances of *natural*.") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
First 6 re-ordered concordances for instances of natural.
docname from to pre keyword post pattern
text1 175809 175809 , 190 . System , natural , 413 . Tail : natural
text1 176269 176269 , 159 . Varieties : natural , 44 . struggle between natural
text1 175333 175333 . unconscious , 34 . natural , 80 . sexual , natural
text1 146888 146888 and this would be strictly natural , as it would connect natural
text1 175341 175341 . sexual , 87 . natural , circumstances favourable to , natural
text1 145995 145995 genealogical in order to be natural ; but that the AMOUNT natural

Arranging concordances according to alphabetical properties may, however, not be the most useful option. A more useful option may be to arrange concordances according to the frequency of co-occurring terms or collocates. In order to do this, we need to extract the co-occurring words and calculate their frequency. We can do this by combining the mutate, group_by, and n() functions from the dplyr package with the str_remove_all function from the stringr package. Then, we arrange the concordances by the frequency of the collocates in descending order (that is why we put a - in the arrange function). To do this, we need to

  1. create a new variable or column which represents the word that co-occurs with, or, as in the example below, immediately follows the search term. In the example below, we use the mutate function to create a new column called post_word. We then use the str_remove_all function to remove everything except for the word that immediately follows the search term (we simply remove everything after and including the first white space).

  2. group the data by the word that immediately follows the search term.

  3. create a new column called post_word_freq which represents the frequencies of all the words that immediately follow the search term.

  4. arrange the concordances by the frequency of the collocates in descending order.

kwic_ordered_coll <- kwic(x = origin, pattern = "natural") %>%
  # keep only the first element of the post column, i.e. the word following the keyword
  mutate(post_word = str_remove_all(post, " .*")) %>%
  group_by(post_word) %>%
  mutate(post_word_freq = n()) %>%
  arrange(-post_word_freq)
# inspect data
kable(head(kwic_ordered_coll), caption = "First 6 re-ordered concordances for instances of *natural*.") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
First 6 re-ordered concordances for instances of natural.
docname from to pre keyword post pattern post_word post_word_freq
text1 3394 3394 , I am convinced that Natural Selection has been the main natural , 29
text1 19498 19498 , to the action of natural selection in accumulating ( as natural , 29
text1 22752 22752 , by the term of Natural Selection , in order to natural , 29
text1 31819 31819 , and if so , natural selection will be able to natural , 29
text1 32745 32745 , as I believe , natural selection acts , I must natural , 29
text1 39277 39277 , I do believe that natural selection will always act very natural , 29
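The four steps can also be traced on a small invented table. The sketch below uses base R rather than dplyr (`sub`, `ave`, and `order` stand in for mutate, group_by with n(), and arrange), and the post contexts are made up for illustration.

```r
# Invented concordance-like table with a post column
toy <- data.frame(
  keyword = "natural",
  post = c("selection acts slowly", "history of", "selection will",
           "selection , in", "history ."),
  stringsAsFactors = FALSE
)

# step 1: keep only the word that immediately follows the keyword
toy$post_word <- sub(" .*", "", toy$post)

# steps 2 and 3: for each row, count how often its following word occurs
toy$post_word_freq <- ave(seq_along(toy$post_word), toy$post_word, FUN = length)

# step 4: most frequent following words first
toy <- toy[order(-toy$post_word_freq), ]
toy
```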

Following the same schema, we could add more columns according to which we could arrange the concordances. For example, we could add another column that represents the frequency of words that immediately precede the search term and then arrange according to this column.

Concordances from transcriptions

As many analyses use transcripts as their primary data and because transcripts have features that require additional processing, we will now perform concordancing based on transcripts. As a first step, we load five example transcripts that represent the first five files from the Irish component of the International Corpus of English.

# define corpus files
files <- paste("https://slcladal.github.io/data/ICEIrelandSample/S1A-00", 1:5, ".txt", sep = "")
# load corpus files
transcripts <- sapply(files, function(x){
  x <- readLines(x)
})
# inspect first ten lines of the first transcript
transcripts[[1]][1:10]
##  [1] "<S1A-001 Riding>"
##  [2] ""
##  [3] "<I>"
##  [4] "<S1A-001$A> <#> Well how did the riding go tonight"
##  [5] "<S1A-001$B> <#> It was good so it was <#> Just I I couldn't believe that she was going to let me jump <,> that was only the fourth time you know <#> It was great <&> laughter </&>"
##  [6] "<S1A-001$A> <#> What did you call your horse"
##  [7] "<S1A-001$B> <#> I can't remember <#> Oh Mary 's Town <,> oh"
##  [8] "<S1A-001$A> <#> And how did Mabel do"
##  [9] "<S1A-001$B> <#> Did you not see her whenever she was going over the jumps <#> There was one time her horse refused and it refused three times <#> And then <,> she got it round and she just lined it up straight and she just kicked it and she hit it with the whip <,> and over it went the last time you know <#> And Stephanie told her she was very determined and very well-ridden <&> laughter </&> because it had refused the other times you know <#> But Stephanie wouldn't let her give up on it <#> She made her keep coming back and keep coming back <,> until <,> it jumped it you know <#> It was good"
## [10] "<S1A-001$A> <#> Yeah I 'm not so sure her jumping 's improving that much <#> She uh <,> seemed to be holding the reins very tight"

The first ten lines shown above let us know that, after the header (<S1A-001 Riding>) and the symbol which indicates the start of the transcript (<I>), each utterance is preceded by a sequence which indicates the section, file, and speaker (e.g. <S1A-001$A>). The first utterance is thus uttered by speaker A in file 001 of section S1A. In addition, there are several sequences that provide meta-linguistic information which indicate the beginning of a speech unit (<#>), pauses (<,>), and laughter (<&> laughter </&>).
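As a small illustration of this annotation scheme, the base-R sketch below pulls the speaker label out of two of the utterances shown above and strips the remaining tags. The regular expressions are one possible way of doing this and are offered as an assumption rather than as the approach used later in this tutorial.

```r
# Two utterances copied from the transcript excerpt above
utterances <- c(
  "<S1A-001$A> <#> Well how did the riding go tonight",
  "<S1A-001$B> <#> I can't remember <#> Oh Mary 's Town <,> oh"
)

# the speaker label is the upper case sequence between $ and > in the first tag
speaker <- sub("^<[^$]+\\$([A-Z]+)>.*", "\\1", utterances)

# replace every tag of the form <...> with a space and tidy the white space
clean <- trimws(gsub("\\s+", " ", gsub("<[^>]*>", " ", utterances)))

speaker
clean
```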

To perform the concordancing, we need to change the format of the transcripts because the kwic function only works on character, corpus, or tokens objects. In their present form, the transcripts represent a list which contains vectors of strings. To change the format, we collapse the individual utterances into a single character vector for each transcript.

transcripts_collapsed <- sapply(files, function(x){
  x <- readLines(x)
  x <- paste0(x, collapse = " ")
  x <- str_squish(x)
})
# inspect data
str(transcripts_collapsed)
##  Named chr [1:5] "<S1A-001 Riding> <I> <S1A-001$A> <#> Well how did the riding go tonight <S1A-001$B> <#> It was good so it was <"| __truncated__ ...
##  - attr(*, "names")= chr [1:5] "https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt" "https://slcladal.github.io/data/ICEIrelandSample/S1A-002.txt" "https://slcladal.github.io/data/ICEIrelandSample/S1A-003.txt" "https://slcladal.github.io/data/ICEIrelandSample/S1A-004.txt" ...

We can now extract the concordances.

kwic_trans <- kwic(x = transcripts_collapsed, pattern = phrase("you know"))
# inspect data
kable(head(kwic_trans), caption = "First 6 concordances for *you know* in five example transcripts.") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
First 6 concordances for you know in five example transcripts.
docname from to pre keyword post pattern
https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt 62 63 was only the fourth time you know < # > It was you know
https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt 204 205 it went the last time you know < # > And Stephanie you know
https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt 235 236 had refused the other times you know < # > But Stephanie you know
https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt 272 273 , > it jumped it you know < # > It was you know
https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt 602 603 that one < , > you know and starting anew fresh < you know
https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt 665 666 { > < [ > you know < / [ > < you know

The results show that each non-alphanumeric character is counted as a single word, which reduces the context of the keyword substantially. Also, the docname column contains the full path to the data, which makes it hard to parse the content of the table. To address the first issue, we remove symbols by adding remove_symbols = T and remove punctuation by adding remove_punct = T. In addition, we clean the docname column and extract only the file name.

kwic_trans <- kwic(x = transcripts_collapsed, pattern = phrase("you know"),
                   remove_symbols = T, remove_punct = T)
# clean docnames
kwic_trans$docname <- kwic_trans$docname %>%
  str_replace_all(".*/([A-Z][0-9][A-Z]-[0-9]{1,3})\\.txt", "\\1")
# inspect data
kable(head(kwic_trans), caption = "First 6 concordances for *you know* in three example transcripts.") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
First 6 concordances for you know in three example transcripts.
docname from to pre keyword post pattern
S1A-001 40 41 was only the fourth time you know It was great laughter S1A-001 you know
S1A-001 129 130 it went the last time you know And Stephanie told her she you know
S1A-001 150 151 had refused the other times you know But Stephanie wouldn’t let her you know
S1A-001 175 176 back until it jumped it you know It was good S1A-001 A you know
S1A-001 256 257 quite a few weeks now you know any proper jumping really And you know
S1A-001 355 356 better waiting for that one you know and starting anew fresh S1A-001 you know
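
The docname cleaning step can be checked in isolation. The following is a minimal sketch that applies the same regular expression to two of the file URLs shown above, with the dot before txt escaped so that it matches a literal dot; the object names paths and ids are only used for illustration.

```r
library(stringr)

# file URLs as they appear in the names of the loaded transcripts
paths <- c(
  "https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt",
  "https://slcladal.github.io/data/ICEIrelandSample/S1A-002.txt"
)
# keep only the file identifier captured by the group in parentheses
ids <- str_replace_all(paths, ".*/([A-Z][0-9][A-Z]-[0-9]{1,3})\\.txt", "\\1")
ids
```

The greedy `.*/` consumes everything up to the last slash, so only the part matched inside the parentheses survives the replacement.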

We could also extend the context window and merge the symbols that the kwic function has separated.

Extending the context can also be used to identify the speaker who uttered the search pattern that we are interested in. We will do just that, as this is a common task in linguistic analyses.

To extract speakers, we need to follow these steps:

  1. Create normal concordances of the pattern that we are interested in.

  2. Generate concordances of the pattern that we are interested in with a substantially enlarged context window size.

  3. Extract the speakers from the enlarged context window size.

  4. Add the speakers to the normal concordances using the left_join function from the dplyr package.

kwic_normal <- kwic(transcripts_collapsed, phrase("you know"))
kwic_long <- kwic(transcripts_collapsed, phrase("you know"), window = 500) %>%
  # drop everything up to the last speaker tag before the hit
  mutate(pre = str_remove_all(pre, ".*\\$")) %>%
  # drop everything after the speaker letter
  mutate(pre = str_remove_all(pre, "\\>.*"),
         speaker = str_squish(pre)) %>%
  # keep the hit position so that each hit is matched to its own speaker
  select(docname, from, to, speaker)
# add speaker to normal kwic (match on document and token position)
kwic_combined <- left_join(kwic_normal, kwic_long, by = c("docname", "from", "to"))
# clean docnames
kwic_combined$docname <- kwic_combined$docname %>%
  str_replace_all(".*/([A-Z][0-9][A-Z]-[0-9]{1,3})\\.txt", "\\1")
# inspect data
kable(head(kwic_combined), caption = "First 6 concordances for *you know* with speakers that uttered them.") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
First 6 concordances for you know with speakers that uttered them.
docname from to pre keyword post pattern speaker
S1A-001 62 63 was only the fourth time you know < # > It was you know B
S1A-001 204 205 it went the last time you know < # > And Stephanie you know B
S1A-001 235 236 had refused the other times you know < # > But Stephanie you know B
S1A-001 272 273 , > it jumped it you know < # > It was you know B
S1A-001 602 603 that one < , > you know and starting anew fresh < you know B
S1A-001 665 666 { > < [ > you know < / [ > < you know B

The resulting table shows that we have successfully extracted the speakers (identified by the letters in the speaker column) and cleaned the file names (in the docname column).
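
The speaker extraction and the join can also be tried on toy data without loading the corpus. The sketch below is only an illustration: the data frames kwic_toy and long_toy and all their values are made up, and they merely mimic the columns that kwic returns. It shows how the two regular expressions reduce an enlarged context to the speaker letter, and why it matters that each hit is matched on its document and position rather than on the document name alone.

```r
library(dplyr)
library(stringr)

# toy stand-in for the normal concordance (hypothetical values)
kwic_toy <- data.frame(
  docname = c("S1A-001", "S1A-001"),
  from = c(10L, 40L),
  to = c(11L, 41L),
  keyword = c("you know", "you know"),
  stringsAsFactors = FALSE
)
# toy stand-in for the enlarged preceding context of the same two hits
long_toy <- data.frame(
  docname = c("S1A-001", "S1A-001"),
  from = c(10L, 40L),
  to = c(11L, 41L),
  pre = c(
    "< S1A-001 $ A > < # > Well how did it go < S1A-001 $ B > < # > It was good",
    "< S1A-001 $ B > < # > It was good < S1A-001 $ A > < # > Yeah"
  ),
  stringsAsFactors = FALSE
)
# reduce the context to the letter of the last speaker tag before the hit
speakers <- long_toy %>%
  mutate(speaker = pre %>%
           str_remove(".*\\$") %>%   # drop everything up to the last "$"
           str_remove(">.*") %>%     # drop everything from the next ">" on
           str_squish()) %>%
  select(docname, from, to, speaker)

# one speaker per hit: match on document and token position
joined <- left_join(kwic_toy, speakers, by = c("docname", "from", "to"))
# matching on docname alone pairs every hit with every speaker of that file
by_doc_only <- left_join(kwic_toy, select(speakers, docname, speaker), by = "docname")

joined
nrow(by_doc_only)
```

With the position columns as keys, the result keeps one row per hit. Joining on docname alone multiplies the rows, and recent dplyr versions warn about such many-to-many matches.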

Customizing concordances

As R represents a fully-fledged programming environment, we can, of course, also write our own, customized concordance function. The code below shows how you could go about doing so. Note, however, that this function only works if you enter more than a single file.

mykwic <- function(txts, pattern, context) {
  # activate packages
  require(stringr)
  require(plyr)
  # loop over texts
  conc <- sapply(txts, function(x) {
    # determine length of text
    lngth <- as.vector(unlist(nchar(x)))
    # determine position of hits
    idx <- str_locate_all(x, pattern)
    idx <- idx[[1]]
    # stop early if the pattern does not occur in this text
    if (nrow(idx) < 1) return("No hits found")
    # define start position of hit
    token.start <- idx[,1]
    # define end position of hit
    token.end <- idx[,2]
    # define start position of preceding context
    pre.start <- ifelse(token.start-context < 1, 1, token.start-context)
    # define end position of preceding context
    pre.end <- token.start-1
    # define start position of subsequent context
    post.start <- token.end+1
    # define end position of subsequent context
    post.end <- ifelse(token.end+context > lngth, lngth, token.end+context)
    # extract the texts defined by the positions
    PreceedingContext <- substring(x, pre.start, pre.end)
    Token <- substring(x, token.start, token.end)
    SubsequentContext <- substring(x, post.start, post.end)
    conc <- cbind(PreceedingContext, Token, SubsequentContext)
    # return concordance
    return(conc)
    })
  concdf <- ldply(conc, data.frame)
  colnames(concdf)[1]<- "File"
  return(concdf)
}

We can now test whether this function works by searching for the sequence you know in the transcripts that we loaded earlier. One difference between the kwic function provided by the quanteda package and the customized concordance function used here is that kwic defines the context window in words, while mykwic defines it in characters (which is why we use a notably higher number to define the context window).

myconcordances <- mykwic(transcripts_collapsed, "you know", 50)
## Loading required package: plyr
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
# inspect data
kable(head(myconcordances), caption = "First 6 concordances for you know extracted using the mykwic function.") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
First 6 concordances for you know extracted using the mykwic function.
File | PreceedingContext | Token | SubsequentContext
https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | to let me jump <,> that was only the fourth time | you know | <#> It was great <&> laughter </&> <S1A-001$A> <#
https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | with the whip <,> and over it went the last time | you know | <#> And Stephanie told her she was very determine
https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | ghter </&> because it had refused the other times | you know | <#> But Stephanie wouldn't let her give up on it
https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | k and keep coming back <,> until <,> it jumped it | you know | <#> It was good <S1A-001$A> <#> Yeah I ’m not so
https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | she ’d be far better waiting <,> for that one <,> | you know | and starting anew fresh <S1A-001$A> <#> Yeah but
https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | er 's the best goes top of the league <,> <{> <[> | you know | </[> <S1A-001$A> <#> <[> So </[> </{> it ’s like
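
To see what a character-based context window does, the core of the approach can be run on a single toy string. This is a minimal sketch, not the mykwic function itself: the sentence, the window of 10 characters, and the object names are assumptions chosen for illustration, and pmax and pmin are used in place of ifelse to clip the window at the text boundaries.

```r
library(stringr)

# made-up example text and a context window of 10 characters
txt <- "Well you know it was good and you know it was fun"
context <- 10

# start and end positions of all hits (one row per hit)
hits <- str_locate_all(txt, "you know")[[1]]

# clip the window so that it never starts before 1 or ends after the text
pre   <- substring(txt, pmax(hits[, "start"] - context, 1), hits[, "start"] - 1)
token <- substring(txt, hits[, "start"], hits[, "end"])
post  <- substring(txt, hits[, "end"] + 1, pmin(hits[, "end"] + context, nchar(txt)))

toy_conc <- data.frame(pre, token, post, stringsAsFactors = FALSE)
toy_conc
```

Because the window is measured in characters, the context can start or end in the middle of a word, which is also visible in the table above.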

As this concordance function only works for more than one text, we split the text of Darwin’s On the Origin of Species into chapters and assign each section a name.

# split text into chapters
origin_split <- origin %>%
  str_squish() %>%
  str_split(" [0-9]{1,2}\\. ") %>%
  unlist()
origin_split <- origin_split[which(nchar(origin_split) > 2000)]
# add names
names(origin_split) <- paste0("text", 1:length(origin_split))
# inspect data
nchar(origin_split)
##  text1  text2  text3  text4  text5  text6  text7  text8  text9 text10 text11 
##  10583  69614  29402  35644  94090  72494  66361  69133  61073  58530  62064 
## text12 text13 text14 text15 
##  67890  51322  87415  58207
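
The splitting step can be tried on a short string first. The sketch below uses a made-up toy string in place of the full text and applies the same pattern, a space followed by one or two digits, a full stop, and a space, to cut the text at numbered headings; the object names toy, squished, and parts are only illustrative.

```r
library(stringr)

# made-up stand-in for a text with numbered chapter headings
toy <- "Front matter and contents.  1. Variation under domestication.   2. Variation under nature."

# collapse superfluous white space, then split at " <number>. "
squished <- str_squish(toy)
parts <- unlist(str_split(squished, " [0-9]{1,2}\\. "))

# name the elements so that each text can be identified later
names(parts) <- paste0("text", seq_along(parts))
parts
```

Note that the pattern also matches numbers followed by a full stop inside running text, which is one reason why very short elements are filtered out by their number of characters in the code above.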

Now that we have named elements, we can search for the pattern natural selection. We also need to clean the concordance as some sections do not contain any instances of the search pattern. To clean the data, we select only the columns File, PreceedingContext, Token, and SubsequentContext and then remove all rows where information is missing.

natsel_conc <- mykwic(origin_split, "natural selection", 50) 
natsel_conc <- natsel_conc %>%
  select(File, PreceedingContext, Token, SubsequentContext) %>%
  na.omit()
# inspect data
kable(head(natsel_conc), caption = "First 6 concordances for *natural selection* extracted using the mykwic function.") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
First 6 concordances for natural selection extracted using the mykwic function.
File PreceedingContext Token SubsequentContext
text1 he immutability of species. How far the theory of natural selection may be extended. Effects of its adoption on the s
text2 nd reversions of character probably do occur; but natural selection , as will hereafter be explained, will determine h
text2 ntry than in the other, and thus by a process of " natural selection ," as will hereafter be more fully explained, two
text3 ly important for us, as they afford materials for natural selection to accumulate, in the same manner as man can accu
text3 have not been seized on and rendered definite by natural selection , as hereafter will be explained. Those forms whic
text3 to one in which it differs more, to the action of natural selection in accumulating (as will hereafter be more fully

You can go ahead and modify the customized concordance function to suit your needs.

Citation & Session Info

Schweinberger, Martin. 2020. Concordancing with R. Brisbane: The University of Queensland. url: https://slcladal.github.io/kwics.html (Version 2020.09.29).

@manual{schweinberger2020introqant,
  author = {Schweinberger, Martin},
  title = {Concordancing with R},
  note = {https://slcladal.github.io/kwics.html},
  year = {2020},
  organization = {The University of Queensland, School of Languages and Cultures},
  address = {Brisbane},
  edition = {2020/09/29}
}
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] plyr_1.8.6       gutenbergr_0.2.0 kableExtra_1.2.1 knitr_1.30      
## [5] stringr_1.4.0    dplyr_1.0.2      quanteda_2.1.1  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.5         pillar_1.4.6       compiler_4.0.2     highr_0.8         
##  [5] tools_4.0.2        stopwords_2.0      digest_0.6.25      viridisLite_0.3.0 
##  [9] evaluate_0.14      lifecycle_0.2.0    tibble_3.0.3       gtable_0.3.0      
## [13] lattice_0.20-41    pkgconfig_2.0.3    rlang_0.4.7        Matrix_1.2-18     
## [17] fastmatch_1.1-0    rstudioapi_0.11    curl_4.3           yaml_2.2.1        
## [21] xfun_0.16          xml2_1.3.2         httr_1.4.2         hms_0.5.3         
## [25] fs_1.5.0           generics_0.0.2     vctrs_0.3.4        triebeard_0.3.0   
## [29] webshot_0.5.2      grid_4.0.2         tidyselect_1.1.0   glue_1.4.2        
## [33] data.table_1.13.0  R6_2.4.1           rmarkdown_2.3      readr_1.3.1       
## [37] ggplot2_3.3.2      purrr_0.3.4        magrittr_1.5       urltools_1.7.3    
## [41] usethis_1.6.3      scales_1.1.1       ellipsis_0.3.1     htmltools_0.5.0   
## [45] rvest_0.3.6        colorspace_1.4-1   stringi_1.5.3      lazyeval_0.2.2    
## [49] RcppParallel_5.0.2 munsell_0.5.0      crayon_1.3.4


References

Anthony, Laurence. 2004. “AntConc: A Learner and Classroom Friendly, Multi-Platform Corpus Analysis Toolkit.” Proceedings of IWLeL, 7–13.

Aschwanden, Christie. n.d. “Psychology’s Replication Crisis Has Made the Field Better.” https://fivethirtyeight.com/features/psychologys-replication-crisis-has-made-the-field-better/.

Barlow, Michael. 1999. “Monoconc 1.5 and Paraconc.” International Journal of Corpus Linguistics 4 (1): 173–84.

———. 2002. “ParaConc: Concordance Software for Multilingual Parallel Corpora.” In Proceedings of the Third International Conference on Language Resources and Evaluation. Workshop on Language Resources in Translation Work and Research, 20–24.

Diener, Edward, and Robert Biswas-Diener. 2019. “The Replication Crisis in Psychology.” https://nobaproject.com/modules/the-replication-crisis-in-psychology.

Fanelli, Daniele. 2009. “How Many Scientists Fabricate and Falsify Research? A Systematic Review and Meta-Analysis of Survey Data.” PLoS One 4 (5): e5738.

Kilgarriff, Adam, Pavel Rychly, Pavel Smrz, and David Tugwell. 2004. “Itri-04-08 the Sketch Engine.” Information Technology 105: 116.

Lindquist, Hans. 2009. Corpus Linguistics and the Description of English. Edinburgh: Edinburgh University Press.

McRae, Mike. n.d. “Science’s ’Replication Crisis’ Has Reached Even the Most Respectable Journals, Report Shows.” https://www.sciencealert.com/replication-results-reproducibility-crisis-science-nature-journals.

Stefanowitsch, Anatol. 2020. Corpus Linguistics. A Guide to the Methodology. Textbooks in Language Sciences. Berlin: Language Science Press.

Stroube, Bryan. 2003. “Literary Freedom: Project Gutenberg.” XRDS: Crossroads, the ACM Magazine for Students 10 (1): 3–3.

Velasco, Emily. n.d. “Researcher Discusses the Science Replication Crisis.” https://phys.org/news/2018-11-discusses-science-replication-crisis.html.

Yong, Ed. n.d. “Psychology’s Replication Crisis Is Running Out of Excuses. Another Big Project Has Found That Only Half of Studies Can Be Repeated. And This Time, the Usual Explanations Fall Flat.” https://www.theatlantic.com/science/archive/2018/11/psychologys-replication-crisis-real/576223/.