Introduction

This tutorial introduces how to extract concordances and keyword-in-context (KWIC) displays with R.

Please cite as:
Schweinberger, Martin. 2023. Concordancing with R. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/kwics.html (Version 2023.09.23).

This tutorial is aimed at beginners and intermediate users of R and showcases how to extract keywords and key phrases from textual data and how to process the resulting concordances using R. The aim is not to provide a fully-fledged analysis but rather to show and exemplify selected useful methods associated with concordancing.

To be able to follow this tutorial, we suggest you check out and familiarize yourself with the content of the R Basics tutorials available on the LADAL website.

Click here1 to download the entire R Notebook for this tutorial.

Binder
Click here to open an interactive and simplified version of this tutorial that allows you to execute, change, and edit the code used in this tutorial as well as to upload your own data.



LADAL TOOL

Click on this Binder badge to open a notebook-based tool that allows you to upload your own text(s), perform concordancing on them, and download the resulting KWICs.



In the language sciences, concordancing refers to the extraction of words from a given text or texts (Lindquist 2009, 5). Commonly, concordances are displayed in the form of keyword-in-context displays (KWICs) where the search term is shown in context, i.e. with preceding and following words. Concordancing is central to analyses of text and it often represents the first step in more sophisticated analyses of language data (Stefanowitsch 2020). Concordances play such a key role in the language sciences because they are extremely valuable for understanding how a word or phrase is used, how often it is used, and in which contexts it is used. As concordances allow us to analyze the context in which a word or phrase occurs and provide frequency information about word use, they also enable us to analyze collocations or the collocational profiles of words and phrases (Stefanowitsch 2020, 50–51). Finally, extracting examples from concordances is a very common procedure.

Concordances in AntConc.


There are various very good software packages that can be used to create concordances, for example AntConc (Anthony 2004), MonoConc and ParaConc (Barlow 1999, 2002), or SketchEngine (Kilgarriff et al. 2004).

In addition, many available corpora, such as the BYU corpora, can be accessed via web interfaces that have in-built concordancing functions.

Online concordances extracted from the COCA corpus that is part of the BYU corpora.

While these packages are very user-friendly, offer various additional functionalities, and have been used by almost everyone engaged in analyzing language, they all suffer from shortcomings that render R a viable alternative. Such shortcomings include that these applications

  • are black boxes, meaning that researchers do not have full control over them and cannot see what is going on within the software

  • are not open source

  • hinder replication because replicating an analysis is more time consuming compared to analyses based on Notebooks

  • are commonly not free of charge or have other restrictions on use (a notable exception is AntConc)

R represents an alternative to ready-made concordancing applications because it:

  • is extremely flexible and enables researchers to perform their entire analysis in a single environment

  • allows full transparency and documentation as analyses can be based on Notebooks

  • offers version control measures (this means that the specific versions of the involved software are traceable)

  • makes research more replicable as entire analyses can be reproduced by simply running the Notebooks that the research is based on

The fact that R enables full transparency and replicability is especially relevant given the ongoing Replication Crisis (Yong 2018; Aschwanden 2018; Diener and Biswas-Diener 2019; Velasco 2019; McRae 2018). The Replication Crisis is an ongoing methodological crisis primarily affecting parts of the social and life sciences that began in the early 2010s (see also Fanelli 2009). Replication is important so that other researchers, or the public for that matter, can see or, indeed, reproduce exactly what you have done. Fortunately, R allows you to document your entire workflow as you can store everything you do in what is called a script or a notebook (in fact, this document was originally an R notebook). If someone is then interested in how you conducted your analysis, you can simply share this notebook or the script you have written with that person.

Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to R and more information on how to use it here. For this tutorial, we need to install certain packages so that the scripts shown below are executed without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, then you can skip ahead and ignore this section. To install the necessary packages, simply run the following code - it may take some time (between 1 and 5 minutes), so there is no need to worry if it takes a while.

# install packages
install.packages("quanteda")
install.packages("dplyr")
install.packages("stringr")
install.packages("writexl")
install.packages("here")
install.packages("flextable")
# install klippy for copy-to-clipboard button in code chunks
install.packages("remotes")
remotes::install_github("rlesur/klippy")

Now that we have installed the necessary packages, we activate them as shown below.

# activate packages
library(quanteda)
library(dplyr)
library(stringr)
library(writexl)
library(here)
library(flextable)
# activate klippy for copy-to-clipboard button
klippy::klippy()

Once you have installed R and RStudio and also initiated the session by executing the code shown above, you are good to go.

Loading and processing textual data

For this tutorial, we will use Lewis Carroll’s Alice’s Adventures in Wonderland. You can use the code below to load this text into R (but you need access to the internet to do so).

text <- base::readRDS(url("https://slcladal.github.io/data/alice.rda", "rb"))
First 10 text elements of the example text.

Alice’s Adventures in Wonderland

by Lewis Carroll

CHAPTER I.

Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the

bank, and of having nothing to do: once or twice she had peeped into

the book her sister was reading, but it had no pictures or

conversations in it, “and what is the use of a book,” thought Alice

“without pictures or conversations?”

So she was considering in her own mind (as well as she could, for the

The table above shows that the example text requires formatting so that we can use it. Therefore, we collapse it into a single object (or text) and remove superfluous white spaces.

text <- text %>%
  # collapse lines into a single  text
  paste0(collapse = " ") %>%
  # remove superfluous white spaces
  str_squish()
First 1000 characters of the example text.

Alice’s Adventures in Wonderland by Lewis Carroll CHAPTER I. Down the Rabbit-Hole Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, “and what is the use of a book,” thought Alice “without pictures or conversations?” So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her. There was nothing so _very_ remarkable in that; nor did Alice think it so _very_ much out of the way to hear the Rabbit say to itself, “Oh dear! Oh dear! I shall be late!” (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when

The result confirms that the entire text is now combined into a single character object.
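As a quick check, you can confirm that the text now consists of a single element and see how many characters it contains; the two lines below are a minimal sketch using base R functions only.

# check that the text is now a single element
length(text)
# check how many characters the text contains
nchar(text)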

Creating simple concordances

Now that we have loaded the data, we can easily extract concordances using the kwic function from the quanteda package. The kwic function takes the text (x) and the search pattern (pattern) as its main arguments but it also allows the specification of the context window, i.e. how many words/elements are shown to the left and right of the keyword (we will go over this later on).

mykwic <- kwic(
  # define text
  text, 
  # define search pattern
  pattern = "Alice") %>%
  # make it a data frame
  as.data.frame()
First 10 concordances for the keyword Alice in our example text.

docname | from | to | pre | keyword | post | pattern
text1 | 14 | 14 | I . Down the Rabbit-Hole | Alice | was beginning to get very | Alice
text1 | 73 | 73 | a book , ” thought | Alice | “ without pictures or conversations | Alice
text1 | 153 | 153 | in that ; nor did | Alice | think it so _very_ much | Alice
text1 | 239 | 239 | and then hurried on , | Alice | started to her feet , | Alice
text1 | 309 | 309 | In another moment down went | Alice | after it , never once | Alice
text1 | 348 | 348 | down , so suddenly that | Alice | had not a moment to | Alice
text1 | 531 | 531 | “ Well ! ” thought | Alice | to herself , “ after | Alice
text1 | 656 | 656 | for , you see , | Alice | had learnt several things of | Alice
text1 | 727 | 727 | got to ? ” ( | Alice | had no idea what Latitude | Alice
text1 | 916 | 916 | else to do , so | Alice | soon began talking again . | Alice

You will see that you get a warning stating that you should tokenize the text before extracting concordances. This can be done as shown below. Also, we can specify the package from which we want to use a function by adding the package name plus :: before the function name (see below).

mykwic <- quanteda::kwic(
  # define and tokenize text
  quanteda::tokens(text), 
  # define search pattern
  pattern = "alice") %>%
  # make it a data frame
  as.data.frame()
First 10 concordances for the keyword *alice* in example text.

docname | from | to | pre | keyword | post | pattern
text1 | 14 | 14 | I . Down the Rabbit-Hole | Alice | was beginning to get very | alice
text1 | 73 | 73 | a book , ” thought | Alice | “ without pictures or conversations | alice
text1 | 153 | 153 | in that ; nor did | Alice | think it so _very_ much | alice
text1 | 239 | 239 | and then hurried on , | Alice | started to her feet , | alice
text1 | 309 | 309 | In another moment down went | Alice | after it , never once | alice
text1 | 348 | 348 | down , so suddenly that | Alice | had not a moment to | alice
text1 | 531 | 531 | “ Well ! ” thought | Alice | to herself , “ after | alice
text1 | 656 | 656 | for , you see , | Alice | had learnt several things of | alice
text1 | 727 | 727 | got to ? ” ( | Alice | had no idea what Latitude | alice
text1 | 916 | 916 | else to do , so | Alice | soon began talking again . | alice

We can easily extract the frequency of the search term (alice) using the nrow or the length functions which provide the number of rows of a table (nrow) or the length of a vector (length).

nrow(mykwic)
## [1] 386
length(mykwic$keyword)
## [1] 386

The results show that there are 386 instances of the search term (alice) but we can also find out how often different variants (lower case versus upper case) of the search term were found using the table function. This is especially useful when searches involve many different search terms (while it is, admittedly, less useful in the present example).

table(mykwic$keyword)
## 
## Alice 
##   386

To get a better understanding of the use of a word, it is often useful to extract more context. This is easily done by increasing the size of the context window. To do this, we specify the window argument of the kwic function. In the example below, we set the context window size to 10 words/elements rather than using the default (which is 5 words/elements).

mykwic_longer <- kwic(
  # define text
  text, 
  # define search pattern
  pattern = "alice", 
  # define context window size
  window = 10) %>%
  # make it a data frame
  as.data.frame()
First 10 concordances for the keyword *alice* in the example text with extended context (10 elements).

docname | from | to | pre | keyword | post | pattern
text1 | 14 | 14 | Wonderland by Lewis Carroll CHAPTER I . Down the Rabbit-Hole | Alice | was beginning to get very tired of sitting by her | alice
text1 | 73 | 73 | what is the use of a book , ” thought | Alice | “ without pictures or conversations ? ” So she was | alice
text1 | 153 | 153 | was nothing so _very_ remarkable in that ; nor did | Alice | think it so _very_ much out of the way to | alice
text1 | 239 | 239 | and looked at it , and then hurried on , | Alice | started to her feet , for it flashed across her | alice
text1 | 309 | 309 | rabbit-hole under the hedge . In another moment down went | Alice | after it , never once considering how in the world | alice
text1 | 348 | 348 | , and then dipped suddenly down , so suddenly that | Alice | had not a moment to think about stopping herself before | alice
text1 | 531 | 531 | she fell past it . “ Well ! ” thought | Alice | to herself , “ after such a fall as this | alice
text1 | 656 | 656 | , I think— ” ( for , you see , | Alice | had learnt several things of this sort in her lessons | alice
text1 | 727 | 727 | what Latitude or Longitude I’ve got to ? ” ( | Alice | had no idea what Latitude was , or Longitude either | alice
text1 | 916 | 916 | down . There was nothing else to do , so | Alice | soon began talking again . “ Dinah’ll miss me very | alice


EXERCISE TIME!


  1. Extract the first 10 concordances for the word confused.
Answer
  kwic_confused <- kwic(x = text, pattern = "confused")
  # inspect
  kwic_confused %>%
  as.data.frame() %>%
  head(10)
  ##   docname  from    to                    pre  keyword
  ## 1   text1  6211  6211     , calling out in a confused
  ## 2   text1 19095 19095     . ” This answer so confused
  ## 3   text1 19277 19277 said Alice , very much confused
  ## 4   text1 33304 33304      she knew ) to the confused
  ##                                  post  pattern
  ## 1                    way , “ Prizes ! confused
  ## 2               poor Alice , that she confused
  ## 3                  , “ I don’t think— confused
  ## 4 clamour of the busy farm-yard—while confused
  2. How many instances are there of the word wondering?
Answer
  kwic(x = text, pattern = "wondering") %>%
  as.data.frame() %>%
  nrow()
  ## [1] 7
  3. Extract concordances for the word strange and show the first 5 concordance lines.
Answer
  kwic_strange <- kwic(x = text, pattern = "strange")
  # inspect
  kwic_strange %>%
  as.data.frame() %>%
  head(5)
  ##   docname  from    to                          pre keyword
  ## 1   text1  3530  3530 her voice sounded hoarse and strange
  ## 2   text1 13124 13124         , that it felt quite strange
  ## 3   text1 32879 32879    remember them , all these strange
  ## 4   text1 33086 33086    her became alive with the strange
  ## 5   text1 33396 33396        and eager with many a strange
  ##                               post pattern
  ## 1              , and the words did strange
  ## 2               at first ; but she strange
  ## 3      Adventures of hers that you strange
  ## 4 creatures of her little sister’s strange
  ## 5         tale , perhaps even with strange



Exporting concordances

To export or save a concordance table as an MS Excel spreadsheet, you can use the write_xlsx function from the writexl package as shown below. Be aware that we use the here function from the here package to define where we want to save the file (in this case, the current working directory; if you work with Rproj files in RStudio - as you should - then the current working directory is the directory or folder where your Rproj file is).

write_xlsx(mykwic, here::here("mykwic.xlsx"))
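If you prefer a plain-text format over an Excel spreadsheet, you could save the concordance as a csv file instead. The line below is a minimal sketch of this alternative (the file name mykwic.csv is just an example).

# alternatively, save the concordance as a csv file in the current working directory
write.csv(mykwic, here::here("mykwic.csv"), row.names = FALSE)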

Extracting more than single words

While extracting single words is very common, you may want to extract more than just one word. To extract phrases, all you need to do is to specify that the pattern you are looking for is a phrase, as shown below.

kwic_pooralice <- kwic(text, pattern = phrase("poor alice")) %>%
  as.data.frame()
First 10 concordances for the keyphrase *poor alice* in the example text.

docname | from | to | pre | keyword | post | pattern
text1 | 1,547 | 1,548 | go through , ” thought | poor Alice | , “ it would be | poor alice
text1 | 2,136 | 2,137 | ; but , alas for | poor Alice | ! when she got to | poor alice
text1 | 2,338 | 2,339 | use now , ” thought | poor Alice | , “ to pretend to | poor alice
text1 | 2,892 | 2,893 | to the garden door . | Poor Alice | ! It was as much | poor alice
text1 | 3,607 | 3,608 | right words , ” said | poor Alice | , and her eyes filled | poor alice
text1 | 6,869 | 6,870 | mean it ! ” pleaded | poor Alice | . “ But you’re so | poor alice
text1 | 7,283 | 7,284 | more ! ” And here | poor Alice | began to cry again , | poor alice
text1 | 8,232 | 8,233 | at home , ” thought | poor Alice | , “ when one wasn’t | poor alice
text1 | 11,766 | 11,767 | to it ! ” pleaded | poor Alice | in a piteous tone . | poor alice
text1 | 19,096 | 19,097 | ” This answer so confused | poor Alice | , that she let the | poor alice

You may also want to extract more or less fixed patterns rather than exact words or phrases. To search for patterns that allow variation rather than specific, exactly-defined words, you need to include regular expressions in your search pattern.


EXERCISE TIME!


  1. Extract the first 10 concordances for the phrase the hatter.
Answer
  kwic_thehatter <- kwic(x = text, pattern = phrase("the hatter"))
  # inspect
  kwic_thehatter %>%
  as.data.frame() %>%
  head(10)
  ##    docname  from    to                    pre    keyword
  ## 1    text1 16541 16542   wish I’d gone to see the Hatter
  ## 2    text1 16572 16573 and the March Hare and the Hatter
  ## 3    text1 16824 16825 wants cutting , ” said the Hatter
  ## 4    text1 16870 16871     it’s very rude . ” The Hatter
  ## 5    text1 17011 17012         a bit ! ” said the Hatter
  ## 6    text1 17136 17137      with you , ” said the Hatter
  ## 7    text1 17171 17172  , which wasn’t much . The Hatter
  ## 8    text1 17249 17250  days wrong ! ” sighed the Hatter
  ## 9    text1 17300 17301         in as well , ” the Hatter
  ## 10   text1 17415 17416 should it ? ” muttered the Hatter
  ##                           post    pattern
  ## 1      instead ! ” CHAPTER VII the hatter
  ## 2        were having tea at it the hatter
  ## 3        . He had been looking the hatter
  ## 4    opened his eyes very wide the hatter
  ## 5           . “ You might just the hatter
  ## 6  , and here the conversation the hatter
  ## 7       was the first to break the hatter
  ## 8               . “ I told you the hatter
  ## 9   grumbled : “ you shouldn’t the hatter
  ## 10       . “ Does _your_ watch the hatter
  2. How many instances are there of the phrase the hatter?
Answer
  kwic_thehatter %>%
  as.data.frame() %>%
  nrow()
  ## [1] 51
  3. Extract concordances for the phrase the cat and show the first 5 concordance lines.
Answer
  kwic_thecat <- kwic(x = text, pattern = phrase("the cat"))
  # inspect
  kwic_thecat %>%
  as.data.frame() %>%
  head(5)
  ##   docname  from    to               pre keyword                     post
  ## 1   text1   938   939   ! ” ( Dinah was the cat             . ) “ I hope
  ## 2   text1 15591 15592 a few yards off . The Cat only grinned when it saw
  ## 3   text1 15716 15717   get to , ” said the Cat         . “ I don’t much
  ## 4   text1 15741 15742   you go , ” said the Cat          . “ —so long as
  ## 5   text1 15770 15771  do that , ” said the Cat          , “ if you only
  ##   pattern
  ## 1 the cat
  ## 2 the cat
  ## 3 the cat
  ## 4 the cat
  ## 5 the cat



Searches using regular expressions

Regular expressions allow you to search for abstract patterns rather than concrete words or phrases, which provides you with extreme flexibility in what you can retrieve. A regular expression (in short also called regex or regexp) is a special sequence of characters that describes a pattern. You can think of regular expressions as very powerful combinations of wildcards or as wildcards on steroids. For example, the sequence [a-z]{1,3} is a regular expression that stands for one to three lower case characters and if you searched for this regular expression, you would get, for instance, is, a, an, of, the, my, our, and many other short words as results.
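To see this in action, the minimal sketch below uses the str_extract_all function from the stringr package (loaded above) to pull all matches out of a made-up example sentence. Note that we wrap the regular expression in word boundaries (\\b) here so that only whole short words are returned rather than three-letter chunks of longer words.

# extract all whole words of one to three lower case characters from an example sentence
stringr::str_extract_all("she sat by the bank and peeped into the book", "\\b[a-z]{1,3}\\b")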

There are three basic types of regular expressions:

  • regular expressions that stand for individual symbols and determine frequencies

  • regular expressions that stand for classes of symbols

  • regular expressions that stand for structural properties

The regular expressions below show the first type of regular expressions, i.e. regular expressions that stand for individual symbols and determine frequencies.

Regular expressions that stand for individual symbols and determine frequencies.

RegEx Symbol/Sequence | Explanation | Example
? | The preceding item is optional and will be matched at most once | walk[a-z]? = walk, walks
* | The preceding item will be matched zero or more times | walk[a-z]* = walk, walks, walked, walking
+ | The preceding item will be matched one or more times | walk[a-z]+ = walks, walked, walking
{n} | The preceding item is matched exactly n times | walk[a-z]{2} = walked
{n,} | The preceding item is matched n or more times | walk[a-z]{2,} = walked, walking
{n,m} | The preceding item is matched at least n times, but not more than m times | walk[a-z]{2,3} = walked, walking
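The small sketch below illustrates the quantifiers in the table above, using the str_detect function from the stringr package on a few made-up word forms (the anchors ^ and $ are added so that the whole word has to match the pattern).

# a small set of word forms to test the quantifiers on
words <- c("walk", "walks", "walked", "walking")
# ? : zero or one additional lower case character after walk (matches walk and walks)
stringr::str_detect(words, "^walk[a-z]?$")
# + : one or more additional lower case characters after walk (matches walks, walked, and walking)
stringr::str_detect(words, "^walk[a-z]+$")
# {2} : exactly two additional lower case characters after walk (matches walked)
stringr::str_detect(words, "^walk[a-z]{2}$")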

The regular expressions below show the second type of regular expressions, i.e. regular expressions that stand for classes of symbols.

Regular expressions that stand for classes of symbols.

RegEx Symbol/Sequence | Explanation
[ab] | lower case a and b
[AB] | upper case a and b
[12] | digits 1 and 2
[:digit:] | digits: 0 1 2 3 4 5 6 7 8 9
[:lower:] | lower case characters: a–z
[:upper:] | upper case characters: A–Z
[:alpha:] | alphabetic characters: a–z and A–Z
[:alnum:] | digits and alphabetic characters
[:punct:] | punctuation characters: . , ; etc.
[:graph:] | graphical characters: [:alnum:] and [:punct:]
[:blank:] | blank characters: Space and tab
[:space:] | space characters: Space, tab, newline, and other space characters
[:print:] | printable characters: [:alnum:], [:punct:] and [:space:]

The regular expressions that denote classes of symbols are enclosed in [ ] and :, as in [:digit:]. The last type of regular expressions, i.e. regular expressions that stand for structural properties, is shown below.

Regular expressions that stand for structural properties.

RegEx Symbol/Sequence | Explanation
\\w | Word characters: [[:alnum:]_]
\\W | No word characters: [^[:alnum:]_]
\\s | Space characters: [[:blank:]]
\\S | No space characters: [^[:blank:]]
\\d | Digits: [[:digit:]]
\\D | No digits: [^[:digit:]]
\\b | Word edge
\\B | No word edge
< | Word beginning
> | Word end
^ | Beginning of a string
$ | End of a string

To include regular expressions in your KWIC searches, you include them in your search pattern and set the argument valuetype to "regex". The search pattern "\\balic.*|\\bhatt.*" retrieves elements that contain alic or hatt followed by any characters and where the a in alic and the h in hatt are at a word boundary, i.e. where they are the first letters of a word. Hence, our search would not retrieve words like malice or shatter. The | is an operator (like +, -, or *) that stands for or.
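As a quick sanity check, the minimal sketch below applies this pattern to a few made-up words with the str_detect function from the stringr package: only alice and hatter match, while malice and shatter do not.

# check which words match the pattern: alice and hatter do, malice and shatter do not
stringr::str_detect(c("alice", "malice", "hatter", "shatter"), "\\balic.*|\\bhatt.*")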

# define search patterns
patterns <- c("\\balic.*|\\bhatt.*")
kwic_regex <- kwic(
  # define text
  text, 
  # define search pattern
  patterns, 
  # define valuetype
  valuetype = "regex") %>%
  # make it a data frame
  as.data.frame()
First 10 concordances for the regular expression \balic.* and \bhatt.*.

docname | from | to | pre | keyword | post | pattern
text1 | 1 | 1 |  | Alice’s | Adventures in Wonderland by Lewis | \balic.*|\bhatt.*
text1 | 14 | 14 | I . Down the Rabbit-Hole | Alice | was beginning to get very | \balic.*|\bhatt.*
text1 | 73 | 73 | a book , ” thought | Alice | “ without pictures or conversations | \balic.*|\bhatt.*
text1 | 153 | 153 | in that ; nor did | Alice | think it so _very_ much | \balic.*|\bhatt.*
text1 | 239 | 239 | and then hurried on , | Alice | started to her feet , | \balic.*|\bhatt.*
text1 | 309 | 309 | In another moment down went | Alice | after it , never once | \balic.*|\bhatt.*
text1 | 348 | 348 | down , so suddenly that | Alice | had not a moment to | \balic.*|\bhatt.*
text1 | 531 | 531 | “ Well ! ” thought | Alice | to herself , “ after | \balic.*|\bhatt.*
text1 | 656 | 656 | for , you see , | Alice | had learnt several things of | \balic.*|\bhatt.*
text1 | 727 | 727 | got to ? ” ( | Alice | had no idea what Latitude | \balic.*|\bhatt.*


EXERCISE TIME!


  1. Extract the first 10 concordances for words containing exu.
Answer
  kwic_exu <- kwic(x = text, pattern = ".*exu.*", valuetype = "regex")
  # inspect
  kwic_exu %>%
  as.data.frame() %>%
  head(10)
  ## [1] docname from    to      pre     keyword post    pattern
  ## <0 rows> (or 0-length row.names)
  2. How many instances are there of words beginning with pit?
Answer
  kwic(x = text, pattern = "\\bpit.*", valuetype = "regex") %>%
  as.data.frame() %>%
  nrow()
  ## [1] 5
  3. Extract concordances for words ending with ption and show the first 5 concordance lines.
Answer
  kwic(x = text, pattern = "ption\\b", valuetype = "regex")  %>%
  as.data.frame() %>%
  head(5)
  ##   docname from   to                         pre  keyword
  ## 1   text1 5770 5770 adjourn , for the immediate adoption
  ##                            post  pattern
  ## 1 of more energetic remedies— ” ption\\b



Piping concordances

Quite often, we only want to retrieve patterns if they occur in a certain context. For instance, we might be interested in instances of alice but only if the preceding word is poor or little. Such conditional concordances could be extracted using regular expressions but they are easier to retrieve by piping. Piping is done using the %>% function from the dplyr package and the piping sequence can be translated as and then. We can then filter those concordances in which the preceding context ends in poor or little using the filter function from the dplyr package. Note that the $ stands for the end of a string, so that poor$ means that poor is the last element in the string preceding the keyword.

kwic_pipe <- kwic(x = text, pattern = "alice") %>%
  as.data.frame() %>%
  dplyr::filter(stringr::str_detect(pre, "poor$|little$"))
First 10 concordances for instances of *alice* that are preceded by *poor* or *little*.

docname | from | to | pre | keyword | post | pattern
text1 | 1,548 | 1,548 | through , ” thought poor | Alice | , “ it would be | alice
text1 | 1,731 | 1,731 | ” but the wise little | Alice | was not going to do | alice
text1 | 2,137 | 2,137 | but , alas for poor | Alice | ! when she got to | alice
text1 | 2,339 | 2,339 | now , ” thought poor | Alice | , “ to pretend to | alice
text1 | 3,608 | 3,608 | words , ” said poor | Alice | , and her eyes filled | alice
text1 | 6,870 | 6,870 | it ! ” pleaded poor | Alice | . “ But you’re so | alice
text1 | 7,284 | 7,284 | ! ” And here poor | Alice | began to cry again , | alice
text1 | 8,233 | 8,233 | home , ” thought poor | Alice | , “ when one wasn’t | alice
text1 | 11,767 | 11,767 | it ! ” pleaded poor | Alice | in a piteous tone . | alice
text1 | 19,097 | 19,097 | This answer so confused poor | Alice | , that she let the | alice

Piping is a very useful technique that is used very frequently in R - not only in the context of text processing but in all data science related domains.

Arranging concordances and adding frequency information

When inspecting concordances, it is useful to re-order the concordances so that they do not appear in the order in which they appeared in the text or texts but by their context. To reorder concordances, we can use the arrange function from the dplyr package which takes the column according to which we want to re-arrange the data as its main argument.

In the example below, we extract all instances of alice and then arrange the instances according to the content of the post column in alphabetical order.

kwic_ordered <- kwic(x = text, pattern = "alice") %>%
  as.data.frame() %>%
  dplyr::arrange(post)
First 10 re-ordered concordances for instances of alice.

docname | from | to | pre | keyword | post | pattern
text1 | 7,747 | 7,747 | happen : “ ‘ Miss | Alice | ! Come here directly , | alice
text1 | 2,893 | 2,893 | the garden door . Poor | Alice | ! It was as much | alice
text1 | 2,137 | 2,137 | but , alas for poor | Alice | ! when she got to | alice
text1 | 30,785 | 30,785 | voice , the name “ | Alice | ! ” CHAPTER XII . | alice
text1 | 8,416 | 8,416 | “ Oh , you foolish | Alice | ! ” she answered herself | alice
text1 | 2,612 | 2,612 | and curiouser ! ” cried | Alice | ( she was so much | alice
text1 | 25,783 | 25,783 | I haven’t , ” said | Alice | ) — “ and perhaps | alice
text1 | 32,165 | 32,165 | explain it , ” said | Alice | , ( she had grown | alice
text1 | 32,725 | 32,725 | for you ? ” said | Alice | , ( she had grown | alice
text1 | 1,684 | 1,684 | here before , ” said | Alice | , ) and round the | alice

Arranging concordances according to alphabetical properties may, however, not be the most useful option. A more useful option may be to arrange concordances according to the frequency of co-occurring terms or collocates. In order to do this, we need to extract the co-occurring words and calculate their frequency. We can do this by combining the mutate, group_by, and n() functions from the dplyr package with the str_remove_all function from the stringr package. Then, we arrange the concordances by the frequency of the collocates in descending order (that is why we put a - in the arrange function). In order to do this, we need to

  1. create a new variable or column which represents the word that co-occurs with, or, as in the example below, immediately follows the search term. In the example below, we use the mutate function to create a new column called post_word. We then use the str_remove_all function to remove everything except for the word that immediately follows the search term (we simply remove everything from and including the first white space).

  2. group the data by the word that immediately follows the search term.

  3. create a new column called post_word_freq which represents the frequencies of all the words that immediately follow the search term.

  4. arrange the concordances by the frequency of the collocates in descending order.

kwic_ordered_coll <- kwic(
  # define text
  x = text, 
  # define search pattern
  pattern = "alice") %>%
  # make it a data frame
  as.data.frame() %>%
  # extract word following the keyword
  dplyr::mutate(post_word = str_remove_all(post, " .*")) %>%
  # group following words
  dplyr::group_by(post_word) %>%
  # extract frequencies of the following words
  dplyr::mutate(post_word_freq = n()) %>%
  # arrange/order by the frequency of the following word
  dplyr::arrange(-post_word_freq)
First 10 concordances for instances of alice, re-ordered by the frequency of the word following the keyword.

docname | from | to | pre | keyword | post | pattern | post_word | post_word_freq
text1 | 1,548 | 1,548 | through , ” thought poor | Alice | , “ it would be | alice | , | 78
text1 | 1,684 | 1,684 | here before , ” said | Alice | , ) and round the | alice | , | 78
text1 | 2,339 | 2,339 | now , ” thought poor | Alice | , “ to pretend to | alice | , | 78
text1 | 2,416 | 2,416 | eat it , ” said | Alice | , “ and if it | alice | , | 78
text1 | 2,744 | 2,744 | to them , ” thought | Alice | , “ or perhaps they | alice | , | 78
text1 | 2,950 | 2,950 | of yourself , ” said | Alice | , “ a great girl | alice | , | 78
text1 | 3,608 | 3,608 | words , ” said poor | Alice | , and her eyes filled | alice | , | 78
text1 | 3,751 | 3,751 | oh dear ! ” cried | Alice | , with a sudden burst | alice | , | 78
text1 | 3,918 | 3,918 | narrow escape ! ” said | Alice | , a good deal frightened | alice | , | 78
text1 | 4,181 | 4,181 | so much ! ” said | Alice | , as she swam about | alice | , | 78

We could add more columns according to which to arrange the concordance following the same schema. For example, we could add another column that represents the frequency of the words that immediately precede the search term and then arrange according to this column, as sketched below.
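The code below is a minimal sketch of this idea (the column names pre_word and pre_word_freq and the object name kwic_ordered_coll_pre are just suggestions for illustration).

kwic_ordered_coll_pre <- kwic_ordered_coll %>%
  # extract the word immediately preceding the keyword
  dplyr::mutate(pre_word = stringr::str_remove_all(pre, ".* ")) %>%
  # group by the preceding word
  dplyr::group_by(pre_word) %>%
  # extract the frequencies of the preceding words
  dplyr::mutate(pre_word_freq = dplyr::n()) %>%
  # ungroup and arrange by the frequency of the preceding word
  dplyr::ungroup() %>%
  dplyr::arrange(-pre_word_freq)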

Ordering by subsequent elements

In this section, we will extract the three words following the keyword (alice) and organize the concordances by the frequencies of the following words. We begin by inspecting the first 6 lines of the concordance of alice.

head(mykwic)
##   docname from  to                         pre keyword
## 1   text1   14  14    I . Down the Rabbit-Hole   Alice
## 2   text1   73  73          a book , ” thought   Alice
## 3   text1  153 153           in that ; nor did   Alice
## 4   text1  239 239       and then hurried on ,   Alice
## 5   text1  309 309 In another moment down went   Alice
## 6   text1  348 348     down , so suddenly that   Alice
##                                  post pattern
## 1           was beginning to get very   alice
## 2 “ without pictures or conversations   alice
## 3             think it so _very_ much   alice
## 4               started to her feet ,   alice
## 5               after it , never once   alice
## 6                 had not a moment to   alice

Next, we take the concordances and create a clean post column that is all in lower case and that does not contain any punctuation.

mykwic %>%
  # create new CleanPost
  dplyr::mutate(CleanPost = stringr::str_remove_all(post, "[:punct:]"),
                CleanPost = stringr::str_squish(CleanPost),
                CleanPost = tolower(CleanPost))-> mykwic_following
# inspect
head(mykwic_following)
##   docname from  to                         pre keyword
## 1   text1   14  14    I . Down the Rabbit-Hole   Alice
## 2   text1   73  73          a book , ” thought   Alice
## 3   text1  153 153           in that ; nor did   Alice
## 4   text1  239 239       and then hurried on ,   Alice
## 5   text1  309 309 In another moment down went   Alice
## 6   text1  348 348     down , so suddenly that   Alice
##                                  post pattern                         CleanPost
## 1           was beginning to get very   alice         was beginning to get very
## 2 “ without pictures or conversations   alice without pictures or conversations
## 3             think it so _very_ much   alice             think it so very much
## 4               started to her feet ,   alice               started to her feet
## 5               after it , never once   alice               after it never once
## 6                 had not a moment to   alice               had not a moment to

In a next step, we extract the 1st, 2nd, and 3rd words following the keyword.

mykwic_following %>%
  # extract first element after keyword
  dplyr::mutate(FirstWord = stringr::str_remove_all(CleanPost, " .*")) %>%
  # extract second element after keyword
  dplyr::mutate(SecWord = stringr::str_remove(CleanPost, ".*? "),
                SecWord = stringr::str_remove_all(SecWord, " .*")) %>%
  # extract third element after keyword
  dplyr::mutate(ThirdWord = stringr::str_remove(CleanPost, ".*? "),
                ThirdWord = stringr::str_remove(ThirdWord, ".*? "),
                ThirdWord = stringr::str_remove_all(ThirdWord, " .*")) -> mykwic_following
# inspect
head(mykwic_following)
##   docname from  to                         pre keyword
## 1   text1   14  14    I . Down the Rabbit-Hole   Alice
## 2   text1   73  73          a book , ” thought   Alice
## 3   text1  153 153           in that ; nor did   Alice
## 4   text1  239 239       and then hurried on ,   Alice
## 5   text1  309 309 In another moment down went   Alice
## 6   text1  348 348     down , so suddenly that   Alice
##                                  post pattern                         CleanPost
## 1           was beginning to get very   alice         was beginning to get very
## 2 “ without pictures or conversations   alice without pictures or conversations
## 3             think it so _very_ much   alice             think it so very much
## 4               started to her feet ,   alice               started to her feet
## 5               after it , never once   alice               after it never once
## 6                 had not a moment to   alice               had not a moment to
##   FirstWord   SecWord ThirdWord
## 1       was beginning        to
## 2   without  pictures        or
## 3     think        it        so
## 4   started        to       her
## 5     after        it     never
## 6       had       not         a

Next, we calculate the frequencies of the subsequent words and order in descending order from the 1st to the 3rd word following the keyword.

mykwic_following %>%
  # calculate frequency of following words
  # 1st word
  dplyr::group_by(FirstWord) %>%
  dplyr::mutate(FreqW1 = n()) %>%
  # 2nd word
  dplyr::group_by(SecWord) %>%
  dplyr::mutate(FreqW2 = n()) %>%
  # 3rd word
  dplyr::group_by(ThirdWord) %>%
  dplyr::mutate(FreqW3 = n()) %>%
  # ungroup
  dplyr::ungroup() %>%
  # arrange by following words
  dplyr::arrange(-FreqW1, -FreqW2, -FreqW3) -> mykwic_following
# inspect results
head(mykwic_following, 10)
## # A tibble: 10 × 14
##    docname  from    to pre     keyword post  pattern CleanPost FirstWord SecWord
##    <chr>   <int> <int> <chr>   <chr>   <chr> <fct>   <chr>     <chr>     <chr>  
##  1 text1   15675 15675 so far… Alice   , an… alice   and she … and       she    
##  2 text1   20735 20735 be beh… Alice   , an… alice   and she … and       she    
##  3 text1   25584 25584 quite … Alice   , an… alice   and she … and       she    
##  4 text1   32861 32861 curiou… Alice   , an… alice   and she … and       she    
##  5 text1   32982 32982 , and … Alice   and … alice   and all … and       all    
##  6 text1   16329 16329 said p… Alice   ; “ … alice   and i wi… and       i      
##  7 text1    3608  3608 words … Alice   , an… alice   and her … and       her    
##  8 text1    1684  1684 here b… Alice   , ) … alice   and roun… and       round  
##  9 text1   25692 25692 eyes .… Alice   , an… alice   and trie… and       tried  
## 10 text1    6519  6519 you kn… Alice   , “ … alice   and why … and       why    
## # ℹ 4 more variables: ThirdWord <chr>, FreqW1 <int>, FreqW2 <int>, FreqW3 <int>

The results now show the concordance arranged by the frequency of the words following the keyword.

Concordances from transcriptions

As many analyses use transcripts as their primary data and because transcripts have features that require additional processing, we will now perform concordancing based on transcripts. As a first step, we load five example transcripts that represent the first five files from the Irish component of the International Corpus of English.

# define corpus files
files <- paste("https://slcladal.github.io/data/ICEIrelandSample/S1A-00", 1:5, ".txt", sep = "")
# load corpus files
transcripts <- sapply(files, function(x){
  x <- readLines(x)
  })
First 10 utterances in the sample transcripts.

<S1A-001 Riding>

<I>

<S1A-001$A> <#> Well how did the riding go tonight

<S1A-001$B> <#> It was good so it was <#> Just I I couldn't believe that she was going to let me jump <,> that was only the fourth time you know <#> It was great <&> laughter </&>

<S1A-001$A> <#> What did you call your horse

<S1A-001$B> <#> I can't remember <#> Oh Mary 's Town <,> oh

<S1A-001$A> <#> And how did Mabel do

<S1A-001$B> <#> Did you not see her whenever she was going over the jumps <#> There was one time her horse refused and it refused three times <#> And then <,> she got it round and she just lined it up straight and she just kicked it and she hit it with the whip <,> and over it went the last time you know <#> And Stephanie told her she was very determined and very well-ridden <&> laughter </&> because it had refused the other times you know <#> But Stephanie wouldn't let her give up on it <#> She made her keep coming back and keep coming back <,> until <,> it jumped it you know <#> It was good

<S1A-001$A> <#> Yeah I 'm not so sure her jumping 's improving that much <#> She uh <,> seemed to be holding the reins very tight

The first ten lines shown above let us know that, after the header (<S1A-001 Riding>) and the symbol which indicates the start of the transcript (<I>), each utterance is preceded by a sequence which indicates the section, file, and speaker (e.g. <S1A-001$A>). The first utterance is thus uttered by speaker A in file 001 of section S1A. In addition, there are several sequences that provide meta-linguistic information which indicate the beginning of a speech unit (<#>), pauses (<,>), and laughter (<&> laughter </&>).
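If, for a given analysis, you wanted to remove this meta-linguistic mark-up, you could strip everything enclosed in angle brackets, for instance as sketched below (a minimal illustration; whether you actually want to remove the annotation depends on your research question, and we keep it here because we will use it later to identify speakers).

# remove the angle-bracketed mark-up from the transcripts and clean up white spaces
transcripts_clean <- sapply(transcripts, function(x){
  x <- stringr::str_remove_all(x, "<.*?>")
  x <- stringr::str_squish(x)
  })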

To perform the concordancing, we need to change the format of the transcripts because the kwic function only works on character, corpus, or tokens objects - in their present form, the transcripts represent a list which contains vectors of strings. To change the format, we collapse the individual utterances into a single character string for each transcript.

transcripts_collapsed <- sapply(files, function(x){
  # read-in text
  x <- readLines(x)
  # paste all lines together
  x <- paste0(x, collapse = " ")
  # remove superfluous white spaces
  x <- str_squish(x)
})
First 500 characters of the collapsed sample transcripts.

<S1A-001 Riding> <I> <S1A-001$A> <#> Well how did the riding go tonight <S1A-001$B> <#> It was good so it was <#> Just I I couldn't believe that she was going to let me jump <,> that was only the fourth time you know <#> It was great <&> laughter </&> <S1A-001$A> <#> What did you call your horse <S1A-001$B> <#> I can't remember <#> Oh Mary 's Town <,> oh <S1A-001$A> <#> And how did Mabel do <S1A-001$B> <#> Did you not see her whenever she was going over the jumps <#> There was one time her horse

<S1A-002 Dinner chat 1> <I> <S1A-002$A> <#> He 's been married for three years and is now <{> <[> getting divorced </[> <S1A-002$B> <#> <[> No no </[> </{> he 's got married last year and he 's getting <{> <[> divorced </[> <S1A-002$A> <#> <[> He 's now </[> </{> getting divorced <S1A-002$C> <#> Just right <S1A-002$D> <#> A wee girl of her age like <S1A-002$E> <#> Well there was a guy <S1A-002$C> <#> How long did she try it for <#> An hour a a year <S1A-002$B> <#> Mhm <{> <[> mhm </[> <S1A-002$E

<S1A-003 Dinner chat 2> <I> <S1A-003$A> <#> I <.> wa </.> I want to go to Peru but uh <S1A-003$B> <#> Do you <S1A-003$A> <#> Oh aye <S1A-003$B> <#> I 'd love to go to Peru <S1A-003$A> <#> I want I want to go up the Machu Picchu before it falls off the edge of the mountain <S1A-003$B> <#> Lima 's supposed to be a bit dodgy <S1A-003$A> <#> Mm <S1A-003$B> <#> Bet it would be <S1A-003$B> <#> Mm <S1A-003$A> <#> But I I just I I would like <,> Machu Picchu is collapsing <S1A-003$B> <#> I don't know wh

<S1A-004 Nursing home 1> <I> <S1A-004$A> <#> Honest to God <,> I think the young ones <#> Sure they 're flying on Monday in I think it 's Shannon <#> This is from Texas <S1A-004$B> <#> This English girl <S1A-004$A> <#> The youngest one <,> the dentist <,> she 's married to the dentist <#> Herself and her husband <,> three children and she 's six months pregnant <S1A-004$C> <#> Oh God <S1A-004$B> <#> And where are they going <S1A-004$A> <#> Coming to Dublin to the mother <{> <[> or <unclear> 3 sy

<S1A-005 Masons> <I> <S1A-005$A> <#> Right shall we risk another beer or shall we try and <,> <{> <[> ride the bikes down there or do something like that </[> <S1A-005$B> <#> <[> Well <,> what about the </[> </{> provisions <#> What time <{> <[> <unclear> 4 sylls </unclear> </[> <S1A-005$C> <#> <[> Is is your </[> </{> man coming here <S1A-005$B> <#> <{> <[> Yeah </[> <S1A-005$A> <#> <[> He said </[> </{> he would meet us here <S1A-005$B> <#> Just the boat 's arriving you know a few minutes ' wa

We can now extract the concordances.

kwic_trans <- quanteda::kwic(
  # tokenize transcripts
  quanteda::tokens(transcripts_collapsed), 
  # define search pattern
  pattern = phrase("you know")) %>%
  # make it a data frame
  as.data.frame()
First 10 concordances for you know in the example transcripts.

docname | from | to | pre | keyword | post | pattern
https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | 62 | 63 | was only the fourth time | you know | < # > It was | you know
https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | 204 | 205 | it went the last time | you know | < # > And Stephanie | you know
https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | 235 | 236 | had refused the other times | you know | < # > But Stephanie | you know
https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | 272 | 273 | , > it jumped it | you know | < # > It was | you know
https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | 602 | 603 | that one < , > | you know | and starting anew fresh < | you know
https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | 665 | 666 | { > < [ > | you know | < / [ > < | you know
https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | 736 | 737 | > We didn't discuss it | you know | < S1A-001 $ A > | you know
https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | 922 | 923 | on Tuesday < , > | you know | < # > But I | you know
https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | 1,126 | 1,127 | that she could take her | you know | the wee shoulder bag she | you know
https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | 1,257 | 1,258 | around < , > uhm | you know | their timetable and < , | you know

The results show that each non-alphanumeric character is counted as a single word, which reduces the context of the keyword substantially. Also, the docname column contains the full path to the data, which makes it hard to parse the content of the table. To address the first issue, we specify a tokenizer that does not disrupt the annotation too much. In addition, we clean the docname column and extract only the file name. Lastly, we will expand the context window to 10 so that we have a better understanding of the context in which the phrase was used.

kwic_trans <- quanteda::kwic(
  # tokenize transcripts
  quanteda::tokens(transcripts_collapsed, what = "fasterword"), 
  # define search
  pattern = phrase("you know"),
  # extend context
  window = 10) %>%
  # make it a data frame
  as.data.frame() %>%
  # clean docnames
  dplyr::mutate(docname = str_replace_all(docname, ".*/([A-Z][0-9][A-Z]-[0-9]{1,3}).txt", "\\1"))
First 10 concordances for you know in the example transcripts.

docname | from | to | pre | keyword | post | pattern
S1A-001 | 42 | 43 | let me jump <,> that was only the fourth time | you know | <#> It was great <&> laughter </&> <S1A-001$A> <#> What | you know
S1A-001 | 140 | 141 | the whip <,> and over it went the last time | you know | <#> And Stephanie told her she was very determined and | you know
S1A-001 | 164 | 165 | <&> laughter </&> because it had refused the other times | you know | <#> But Stephanie wouldn't let her give up on it | you know
S1A-001 | 193 | 194 | and keep coming back <,> until <,> it jumped it | you know | <#> It was good <S1A-001$A> <#> Yeah I 'm not | you know
S1A-001 | 402 | 403 | 'd be far better waiting <,> for that one <,> | you know | and starting anew fresh <S1A-001$A> <#> Yeah but I mean | you know
S1A-001 | 443 | 444 | the best goes top of the league <,> <{> <[> | you know | </[> <S1A-001$A> <#> <[> So </[> </{> it 's like | you know
S1A-001 | 484 | 485 | I 'm not sure now <#> We didn't discuss it | you know | <S1A-001$A> <#> Well it sounds like more money <S1A-001$B> <#> | you know
S1A-001 | 598 | 599 | on Monday and do without her lesson on Tuesday <,> | you know | <#> But I was keeping her going cos I says | you know
S1A-001 | 727 | 728 | to take it tomorrow <,> that she could take her | you know | the wee shoulder bag she has <S1A-001$A> <#> Mhm <S1A-001$B> | you know
S1A-001 | 808 | 809 | <,> and <,> sort of show them around <,> uhm | you know | their timetable and <,> give them their timetable and show | you know

Extending the context can also be used to identify the speaker who has uttered the search pattern that we are interested in. We will do just that, as this is a common task in linguistic analyses.

To extract speakers, we need to follow these steps:

  1. Create normal concordances of the pattern that we are interested in.

  2. Generate concordances of the pattern that we are interested in with a substantially enlarged context window size.

  3. Extract the speakers from the enlarged context window size.

  4. Add the speakers to the normal concordances (in the code below, this is done by adding the extracted speakers as a new column with the mutate function from the dplyr package).

kwic_normal <- quanteda::kwic(
  # tokenize transcripts
  quanteda::tokens(transcripts_collapsed, what = "fasterword"), 
  # define search
  pattern = phrase("you know")) %>%
  as.data.frame()
kwic_speaker <- quanteda::kwic(
    # tokenize transcripts
  quanteda::tokens(transcripts_collapsed, what = "fasterword"), 
  # define search
  pattern = phrase("you know"), 
  # extend search window
  window = 500) %>%
  # convert to data frame
  as.data.frame() %>%
  # extract speaker (comes after $ and before >)
  dplyr::mutate(speaker = stringr::str_replace_all(pre, ".*\\$(.*?)>.*", "\\1")) %>%
  # extract speaker
  dplyr::pull(speaker)
# add speaker to normal kwic
kwic_combined <- kwic_normal %>%
  # add speaker
  dplyr::mutate(speaker = kwic_speaker) %>%
  # simplify docname
  dplyr::mutate(docname = stringr::str_replace_all(docname, ".*/([A-Z][0-9][A-Z]-[0-9]{1,3}).txt", "\\1")) %>%
  # remove superfluous columns
  dplyr::select(-to, -from, -pattern)
First 10 concordances for you know in the example transcripts with the speakers who uttered them.

docname | pre | keyword | post | speaker
S1A-001 | was only the fourth time | you know | <#> It was great <&> | B
S1A-001 | it went the last time | you know | <#> And Stephanie told her | B
S1A-001 | had refused the other times | you know | <#> But Stephanie wouldn't let | B
S1A-001 | until <,> it jumped it | you know | <#> It was good <S1A-001$A> | B
S1A-001 | <,> for that one <,> | you know | and starting anew fresh <S1A-001$A> | B
S1A-001 | the league <,> <{> <[> | you know | </[> <S1A-001$A> <#> <[> So | B
S1A-001 | <#> We didn't discuss it | you know | <S1A-001$A> <#> Well it sounds | B
S1A-001 | her lesson on Tuesday <,> | you know | <#> But I was keeping | B
S1A-001 | that she could take her | you know | the wee shoulder bag she | B
S1A-001 | show them around <,> uhm | you know | their timetable and <,> give | B

The resulting table shows that we have successfully extracted the speakers (identified by the letters in the speaker column) and cleaned the file names (in the docname column).
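Having the speaker in its own column also makes it easy to check, for instance, how many instances of the search pattern each speaker in each file produced. The lines below are a minimal sketch of such a check.

# count the number of hits per file and speaker
kwic_combined %>%
  dplyr::count(docname, speaker, sort = TRUE)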

Customizing concordances

As R represents a fully-fledged programming environment, we can, of course, also write our own, customized concordance function. The code below shows how you could go about doing so. Note, however, that this function only works if you enter more than a single file.

mykwic <- function(txts, pattern, context) {
  # activate packages
  require(stringr)
  # list files
  txts <- txts[stringr::str_detect(txts, pattern)]
  conc <- sapply(txts, function(x) {
    # determine length of text
    lngth <- as.vector(unlist(nchar(x)))
    # determine position of hits
    idx <- str_locate_all(x, pattern)
    idx <- idx[[1]]
    ifelse(nrow(idx) >= 1, idx <- idx, return(NA))
    # define start position of hit
    token.start <- idx[,1]
    # define end position of hit
    token.end <- idx[,2]
    # define start position of preceding context
    pre.start <- ifelse(token.start-context < 1, 1, token.start-context)
    # define end position of preceding context
    pre.end <- token.start-1
    # define start position of subsequent context
    post.start <- token.end+1
    # define end position of subsequent context
    post.end <- ifelse(token.end+context > lngth, lngth, token.end+context)
    # extract the texts defined by the positions
    PreceedingContext <- substring(x, pre.start, pre.end)
    Token <- substring(x, token.start, token.end)
    SubsequentContext <- substring(x, post.start, post.end)
    Id <- 1:length(Token)
    conc <- cbind(Id, PreceedingContext, Token, SubsequentContext)
    # return concordance
    return(conc)
    })
  concdf <- do.call(rbind, conc) %>%
    as.data.frame()
  return(concdf)
}

We can now test whether this function works by searching for the sequence you know in the transcripts that we have loaded earlier. One difference between the kwic function provided by the quanteda package and the customized concordance function used here is that the kwic function uses the number of words to define the context window, while the mykwic function uses the number of characters or symbols instead (which is why we use a notably higher number to define the context window).

kwic_youknow <- mykwic(transcripts_collapsed, "you know", 50)
First 6 concordances for *you know* extracted using the mykwic function.

Id | PreceedingContext | Token | SubsequentContext
1 | to let me jump <,> that was only the fourth time | you know | <#> It was great <&> laughter </&> <S1A-001$A> <#
2 | with the whip <,> and over it went the last time | you know | <#> And Stephanie told her she was very determine
3 | ghter </&> because it had refused the other times | you know | <#> But Stephanie wouldn't let her give up on it
4 | k and keep coming back <,> until <,> it jumped it | you know | <#> It was good <S1A-001$A> <#> Yeah I 'm not so
5 | she 'd be far better waiting <,> for that one <,> | you know | and starting anew fresh <S1A-001$A> <#> Yeah but
6 | er 's the best goes top of the league <,> <{> <[> | you know | </[> <S1A-001$A> <#> <[> So </[> </{> it 's like

As this concordance function only works for more than one text, we split the text into chapters and assign each section a name.

# read in text
text_split <- text %>%
  stringr::str_squish() %>%
  stringr::str_split("[CHAPTER]{7,7} [XVI]{1,7}\\. ") %>%
  unlist()
text_split <- text_split[which(nchar(text_split) > 2000)]
# add names
names(text_split) <- paste0("text", 1:length(text_split))
# inspect data
nchar(text_split)
##  text1  text2  text3  text4  text5  text6  text7  text8  text9 text10 text11 
##  11331  10888   9137  13830  11767  13730  12564  13585  12527  11287  10292 
## text12 
##  11518

Now that we have named elements, we can search for the pattern poor alice. We also need to clean the concordance as some sections do not contain any instances of the search pattern. To clean the data, we keep only the concordance columns (PreceedingContext, Token, and SubsequentContext) and remove all rows where information is missing (a sketch of this cleaning step is shown after the table below).

mykwic_pooralice <- mykwic(text_split, "poor Alice", 50)
First 6 concordances of *poor alice* extracted using the mykwic function.

Id | PreceedingContext | Token | SubsequentContext
1 | ; “and even if my head would go through,” thought | poor Alice | , “it would be of very little use without my shoul
2 | d on going into the garden at once; but, alas for | poor Alice | ! when she got to the door, she found she had forg
3 | to be two people. “But it’s no use now,” thought | poor Alice | , “to pretend to be two people! Why, there’s hardl
1 | !” “I’m sure those are not the right words,” said | poor Alice | , and her eyes filled with tears again as she went
1 | lking such nonsense!” “I didn’t mean it!” pleaded | poor Alice | . “But you’re so easily offended, you know!” The M
2 | onder if I shall ever see you any more!” And here | poor Alice | began to cry again, for she felt very lonely and
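The cleaning step mentioned above could be implemented, for instance, as sketched below: we keep the concordance columns and drop the rows that mykwic filled with NA for chapters in which the search pattern did not occur (a minimal sketch based on the columns that mykwic returns).

# keep only complete concordance lines
mykwic_pooralice <- mykwic_pooralice %>%
  dplyr::select(PreceedingContext, Token, SubsequentContext) %>%
  dplyr::filter(!is.na(Token))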

You can go ahead and modify the customized concordance function to suit your needs.

Citation & Session Info

Schweinberger, Martin. 2023. Concordancing with R. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/kwics.html (Version 2023.09.23).

@manual{schweinberger2023kwics,
  author = {Schweinberger, Martin},
  title = {Concordancing with R},
  note = {https://ladal.edu.au/kwics.html},
  year = {2023},
  organization = {The Language Technology and Data Analysis Laboratory (LADAL)},
  address = {Brisbane},
  edition = {2023.09.23}
}
sessionInfo()
## R version 4.3.2 (2023-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 22621)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_Australia.utf8  LC_CTYPE=English_Australia.utf8   
## [3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C                      
## [5] LC_TIME=English_Australia.utf8    
## 
## time zone: Australia/Brisbane
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] flextable_0.9.4 here_1.0.1      writexl_1.5.0   stringr_1.5.1  
## [5] dplyr_1.1.4     quanteda_3.3.1 
## 
## loaded via a namespace (and not attached):
##  [1] fastmatch_1.1-4         xfun_0.42               bslib_0.6.1            
##  [4] lattice_0.21-9          vctrs_0.6.5             tools_4.3.2            
##  [7] generics_0.1.3          curl_5.2.0              tibble_3.2.1           
## [10] klippy_0.0.0.9500       fansi_1.0.6             highr_0.10             
## [13] pkgconfig_2.0.3         Matrix_1.6-5            data.table_1.15.2      
## [16] RcppParallel_5.1.7      assertthat_0.2.1        uuid_1.2-0             
## [19] lifecycle_1.0.4         compiler_4.3.2          textshaping_0.3.7      
## [22] httpuv_1.6.14           fontquiver_0.2.1        fontLiberation_0.1.0   
## [25] htmltools_0.5.7         sass_0.4.8              yaml_2.3.8             
## [28] pillar_1.9.0            later_1.3.2             crayon_1.5.2           
## [31] jquerylib_0.1.4         gfonts_0.2.0            ellipsis_0.3.2         
## [34] openssl_2.1.1           cachem_1.0.8            mime_0.12              
## [37] fontBitstreamVera_0.1.1 stopwords_2.3           tidyselect_1.2.0       
## [40] zip_2.3.1               digest_0.6.34           stringi_1.8.3          
## [43] rprojroot_2.0.4         fastmap_1.1.1           grid_4.3.2             
## [46] cli_3.6.2               magrittr_2.0.3          crul_1.4.0             
## [49] utf8_1.2.4              withr_3.0.0             gdtools_0.3.6          
## [52] promises_1.2.1          rmarkdown_2.25          officer_0.6.5          
## [55] askpass_1.2.0           ragg_1.2.7              shiny_1.8.0            
## [58] evaluate_0.23           knitr_1.45              rlang_1.1.3            
## [61] Rcpp_1.0.12             xtable_1.8-4            glue_1.7.0             
## [64] httpcode_0.3.0          xml2_1.3.6              rstudioapi_0.15.0      
## [67] jsonlite_1.8.8          R6_2.5.1                systemfonts_1.0.5



References

Anthony, Laurence. 2004. “AntConc: A Learner and Classroom Friendly, Multi-Platform Corpus Analysis Toolkit.” Proceedings of IWLeL, 7–13.
Aschwanden, Christie. 2018. “Psychology’s Replication Crisis Has Made the Field Better.” https://fivethirtyeight.com/features/psychologys-replication-crisis-has-made-the-field-better/.
Barlow, Michael. 1999. “Monoconc 1.5 and Paraconc.” International Journal of Corpus Linguistics 4 (1): 173–84.
———. 2002. “ParaConc: Concordance Software for Multilingual Parallel Corpora.” In Proceedings of the Third International Conference on Language Resources and Evaluation. Workshop on Language Resources in Translation Work and Research, 20–24.
Diener, Edward, and Robert Biswas-Diener. 2019. “The Replication Crisis in Psychology.” https://nobaproject.com/modules/the-replication-crisis-in-psychology.
Fanelli, Daniele. 2009. “How Many Scientists Fabricate and Falsify Research? A Systematic Review and Meta-Analysis of Survey Data.” PLoS One 4: e5738.
Kilgarriff, Adam, Pavel Rychly, Pavel Smrz, and David Tugwell. 2004. “Itri-04-08 the Sketch Engine.” Information Technology 105: 116.
Lindquist, Hans. 2009. Corpus Linguistics and the Description of English. Edinburgh: Edinburgh University Press.
McRae, Mike. 2018. “Science’s ’Replication Crisis’ Has Reached Even the Most Respectable Journals, Report Shows.” https://www.sciencealert.com/replication-results-reproducibility-crisis-science-nature-journals.
Stefanowitsch, Anatol. 2020. Corpus Linguistics. A Guide to the Methodology. Textbooks in Language Sciences. Berlin: Language Science Press.
Velasco, Emily. 2019. “Researcher Discusses the Science Replication Crisis.” https://phys.org/news/2018-11-discusses-science-replication-crisis.html.
Yong, Ed. 2018. “Psychology’s Replication Crisis Is Running Out of Excuses. Another Big Project Has Found That Only Half of Studies Can Be Repeated. And This Time, the Usual Explanations Fall Flat.” https://www.theatlantic.com/science/archive/2018/11/psychologys-replication-crisis-real/576223/.

  1. If you want to render the R Notebook on your machine, i.e. knitting the document to html or a pdf, you need to make sure that you have R and RStudio installed and you also need to download the bibliography file and store it in the same folder where you store the Rmd file.↩︎