Introduction

This tutorial introduces collocation and co-occurrence analysis with R and shows how to extract and visualize semantic links between words.¹

This tutorial is aimed at beginners and intermediate users of R with the aim of showcasing how to extract and analyze collocations and N-grams from textual data using R. The aim is not to provide a fully-fledged analysis but rather to show and exemplify selected useful methods associated with collocation analysis.

To be able to follow this tutorial, we suggest you check out and familiarize yourself with the content of the following R Basics tutorials:

Click here² to download the entire R Notebook for this tutorial.

Click here to open a Jupyter notebook that allows you to follow this tutorial interactively. This means that you can execute, change, and edit the code used in this tutorial to help you better understand how the code shown here works (make sure you run all code chunks in the order in which they appear - otherwise you will get an error).

LADAL TOOL

Click on this badge to open an notebook-based tool
that calculates association measures and allows you to download the results.

How can you determine if words occur more frequently together than would be expected by chance?

This tutorial aims to show how you can answer this question.

So, how would you find words that are associated with a specific term and how can you visualize such word nets? This tutorial focuses on co-occurrence and collocations of words. Collocations are words that occur very frequently together. For example, Merry Christmas is a collocation because merry and Christmas occur more frequently together than would be expected by chance. This means that if you were to shuffle all words in a corpus and would then test the frequency of how often merry and Christmas co-occurred, they would occur significantly less often in the shuffled or randomized corpus than in a corpus that contain non-shuffled natural speech.

Co-occurrence and association

Collocations are combinations of words that frequently co-occur in a language, appearing together more often than would be expected by chance.

LADAL TOOL

Click on this badge to open an notebook-based tool
that calculates association measures and allows you to download the results.

We need to differentiate between

collocations: words that are significantly attracted to one another and often occur together (but are not necessarily adjacent) such as black and coffee
n-grams: combinations of words that are adjacent as the bi-grams This is, is a, and a sentence that form the sentence This is a sentence

Such word pairings or groupings exhibit a certain degree of naturalness and tend to form recurring patterns. They play a crucial role in language acquisition, learning, fluency, and usage and they contribute to the natural and idiomatic expression of ideas. A typical example of a collocation is Merry Christmas because the words merry and Christmas occur together more frequently together than would be expected, if words were just randomly stringed together. Other examples of collocations include strong coffee, make a decision, or take a risk. Recognizing and understanding collocations is essential for language learners, as it enhances their ability to produce authentic and contextually appropriate language.

Identifying words pairs (w1 and w2) that collocate (i.e. collocations) and determining their association strength (a measure of how strongly attracted words are to each other) is based on the co-occurrence frequencies of word pairs in a contingency table (see below, O is short for observed frequency).

	w₂ present	w₂ absent
w₁ present	O₁₁	O₁₂	= R₁
w₁ absent	O₂₁	O₂₂	= R₂
	= C₁	= C₂	= N

From this contingency table, we can calculate the frequencies that would be expected if the words did not show any attraction or repulsion (see below, E is short for expected frequency).

	w₂ present	w₂ absent
w₁ present	E₁₁ = (R₁ * C₁) / (N)	E₁₂ = (R₁ * C₂) / (N)	= R₁
w₁ absent	E₂₁ = (R₂ * C₁) / (N)	E₂₂ = (R₂ * C₂) / (N)	= R₂
	= C₁	= C₂	= N

Association measures use the frequency information in the above contingency tables to evaluate the strength of attraction or repulsion between words. As such, association measures are statistical metrics used to quantify the strength and significance of the relationship between words within a collocation. These measures help assess how likely it is for two words to appear together more frequently than expected by chance. Several association measures are commonly used in collocation analysis, including:

Gries’ AM: Gries’ AM (Gries 2022) is probably the best association measure that is on conditional probabilities. For information on how it is calculated, see Gries (2022). In contrast to other association measures, it has three main advantages:
it takes into account that the association between word_1 and word_2 is not symmetric (word_1 may be more strongly attracted with word_2 than vice verse) - in sense it is very similar to ΔP
it is not affected by frequency as other association measures (which is a serious issue as association measures should reflect association strength and not frequency).
it is normalized as it takes into account that the possible range of values differs across elements (some words can have very high values while others cannot)
delta P (ΔP): ΔP (Ellis 2007; Gries 2013) is an association measure based on conditional probabilities that is implied in MS (Gries 2013, 141). ΔP has two advantages: it takes into account that the association between word_1 and word_2 is not symmetric (word_1 may be more strongly attracted with word_2 than vice verse) and it is not affected by frequency as other association measures (which is a serious issue as association measures should reflect association strength and not frequency) (see Gries 2022).

\[ \Delta P_1 = P(w_1 | w_2) = \left( \frac{O11}{R1} \right)- \left(\frac{O21} {R2} \right) \]

\[ \Delta P_2 = P(w_2 | w_1) = \left( \frac{O11}{C1} \right) -\left( \frac{O21}{C2} \right) \]

Pointwise Mutual Information (PMI): PMI measures the likelihood of two words occurring together compared to their individual likelihoods of occurring separately. A higher PMI score suggests a stronger association.

\[ \text{PMI}(w_1, w_2) = \log_2 \left( \frac{P(w_1 \cap w_2)}{P(w_1) \cdot P(w_2)} \right) \]

Log-Likelihood Ratio (LLR): LLR compares the likelihood of the observed word combination occurring with the expected likelihood based on the individual frequencies of the words. Higher LLR values indicate a more significant association (where \(O_i\) is the observed frequency and \(E_i\) is the expected frequency for each combination).

\[ \text{LLR}(w_1, w_2) = 2 \sum_{i=1}^4 \frac{(O_i - E_i)^2}{E_i} \]

Dice Coefficient: This measure considers the co-occurrence of words and calculates the ratio of the overlap between the two words to the sum of their individual frequencies. The Dice coefficient ranges from 0 to 1, with higher values indicating stronger association.

\[ \text{Dice}(w_1, w_2) = \frac{2 \times \text{freq}(w_1 \cap w_2)}{\text{freq}(w_1) + \text{freq}(w_2)} \]

Chi-Square: Chi-square measures the difference between the observed and expected frequencies of word co-occurrence. A higher chi-square value signifies a more significant association (where \(O_i\) is the observed frequency and \(E_i\) is the expected frequency for each combination).

\[ \chi^2(w_1, w_2) = \sum \frac{(O_i - E_i)^2}{E_i} \]

t-Score: The t-score is based on the difference between the observed and expected frequencies, normalized by the standard deviation. Higher T-scores indicate a stronger association.

\[ \text{t-Score}(w_1, w_2) = \frac{\text{freq}(w_1 \cap w_2) - \text{expected\_freq}(w_1 \cap w_2)}{\sqrt{\text{freq}(w_1 \cap w_2)}} \]

Mutual Information (MI): MI measures the reduction in uncertainty about one word’s occurrence based on the knowledge of another word’s occurrence. Higher MI values indicate a stronger association (where \(P(w_1 \cap w_2)\) is the joint probability, and \(P(w_1)\) and \(P(w_2)\) are the individual probabilities).

\[ \text{MI}(w_1, w_2) = \log_2 \left( \frac{P(w_1 \cap w_2)}{P(w_1) \cdot P(w_2)} \right) \]

Minimum Sensitivity (MS): The minimum sensitivity is 1 when W1 and W2 always occur together and never apart. It is 0 when W1 and W2 never occur together. A higher minimum sensitivity indicates a stronger dependence between the two words in a bigram (Pedersen 1998).

\[ \text{MS} = min\left( P(w_1 | w_2) , P(w_2 | w_1) \right) \]

These association measures help researchers and language analysts identify meaningful and statistically significant collocations, assisting in the extraction of relevant information from corpora and improving the accuracy of collocation analysis in linguistic studies.

Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to and more information how to use R here. For this tutorials, we need to install certain packages from an R library so that the scripts shown below are executed without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, then you can skip ahead and ignore this section. To install the necessary packages, simply run the following code - it may take some time (between 1 and 5 minutes to install all of the libraries so you do not need to worry if it takes some time).

# set options
options(stringsAsFactors = F)
options(scipen = 999)
options(max.print=1000)
# install packages
install.packages("FactoMineR")
install.packages("factoextra")
install.packages("flextable")
install.packages("GGally")
install.packages("ggdendro")
install.packages("igraph")
install.packages("network")
install.packages("Matrix")
install.packages("quanteda")
install.packages("quanteda.textstats")
install.packages("quanteda.textplots")
install.packages("dplyr")
install.packages("stringr")
install.packages("tm")
install.packages("sna")
install.packages("tidytext")
# install klippy for copy-to-clipboard button in code chunks
install.packages("remotes")
remotes::install_github("rlesur/klippy")

Next, we load the packages.

# load packages
library(FactoMineR)
library(factoextra)
library(flextable)
library(GGally)
library(ggdendro)
library(igraph)
library(network)
library(Matrix)
library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)
library(dplyr)
library(stringr)
library(tm)
library(sna)
# activate klippy for copy-to-clipboard button
klippy::klippy()

We will use the Charles Darwin’s On the Origin of Species by Means of Natural Selection as a data source and begin by generating a bi-gram list. As a first step, we load the data and split it into individual words.

# read in text
text <- base::readRDS(url("https://slcladal.github.io/data/cdo.rda", "rb")) %>%
  paste0(collapse = " ") %>%
  stringr::str_squish() %>%
  stringr::str_remove_all("- ")

First 200 characters of the example text
.
THE ORIGIN OF SPECIES BY CHARLES DARWIN AN HISTORICAL SKETCH OF THE PROGRESS OF OPINION ON THE ORIGIN OF SPECIES INTRODUCTION When on board H.M.S. 'Beagle,' as naturalist, I was much struck with certa

Once you have installed R, RStudio, and once you have initiated the session by executing the code shown above, you are good to go.

Collocations

As collocates do not have to be immediately adjacent but can be separated by several slots, their retrieval is substantially more difficult compared with n-grams. Nonetheless, there are various ways of finding collocations depending on the data provided, the context, and the association measure (which represents information of how strong the association between the words is). Below, you will see how to detect collocations in two different data structures:

a list of sentences
concordances

In the following, we will extract collocations from the sentences in Charles Darwin’s On the Origin of Species by Means of Natural Selection

Identifying collocations in sentences

Data preparation

In a first step, we split our example text into sentences and clean the data (removing punctuation, converting to lower case, etc.).

text %>% 
  # concatenate the elements in the 'text' object
  paste0(collapse = " ") %>%
  # separate possessives and contractions
  stringr::str_replace_all(fixed("'"), fixed(" '")) %>%
  stringr::str_replace_all(fixed("’"), fixed(" '")) %>%
  # split text into sentences
  tokenizers::tokenize_sentences() %>%
  # unlist sentences
  unlist() %>%
  # remove non-word characters
  stringr::str_replace_all("\\W", " ") %>%
  stringr::str_replace_all("[^[:alnum:] ]", " ") %>%
  # remove superfluous white spaces
  stringr::str_squish() %>%
  # convert to lower case and save in 'sentences' object
  tolower() -> sentences

First 10 sentences in the example text
.
the origin of species by charles darwin an historical sketch of the progress of opinion on the origin of species introduction when on board h m s
beagle as naturalist i was much struck with certain facts in the distribution of the organic beings inhabiting south america and in the geological relations of the present to the past inhabitants of that continent
these facts as will be seen in the latter chapters of this volume seemed to throw some light on the origin of species that mystery of mysteries as it has been called by one of our greatest philosophers
on my return home it occurred to me in 1837 that something might perhaps be made out on this question by patiently accumulating and reflecting on all sorts of facts which could possibly have any bearing on it
after five years work i allowed myself to speculate on the subject and drew up some short notes these i enlarged in 1844 into a sketch of the conclusions which then seemed to me probable from that period to the present day i have steadily pursued the same object
i hope that i may be excused for entering on these personal details as i give them to show that i have not been hasty in coming to a decision
my work is now 1859 nearly finished but as it will take me many more years to complete it and as my health is far from strong i have been urged to publish this abstract
i have more especially been induced to do this as mr
wallace who is now studying the natural history of the malay archipelago has arrived at almost exactly the same general conclusions that i have on the origin of species
in 1858 he sent me a memoir on this subject with a request that i would forward it to sir charles lyell who sent it to the linnean society and it is published in the third volume of the journal of that society

Next, we tabulate the data and reformat it so that we have the relevant information to calculate the association statistics (word 1 and word 2 as well as O11, O12, O21, and O22).

# tokenize the 'sentences' data using quanteda package
sentences %>%
  quanteda::tokens() %>%

  # create a document-feature matrix (dfm) using quanteda
  quanteda::dfm() %>%

  # create a feature co-occurrence matrix (fcm) without considering trigrams
  quanteda::fcm(tri = FALSE) %>%

  # tidy the data using tidytext package
  tidytext::tidy() %>%

  # rearrange columns for better readability
  dplyr::relocate(term, document, count) %>%

  # rename columns for better interpretation
  dplyr::rename(w1 = 1,
                w2 = 2,
                O11 = 3) -> coll_basic

First 10 rows of basic collocation table
w1	w2	O11
the	the	24,287
the	origin	170
the	of	37,291
the	species	6,222
the	by	5,415
the	charles	28
the	darwin	11
the	an	2,049
the	historical	7
the	sketch	8

We now enhance our table by calculating all observed frequencies (O11, O12, O21, O22) as well as row totals (R1, R2), column totals (C1, C2), and the overall total (N).

  # calculate the total number of observations (N)
coll_basic %>%  dplyr::mutate(N = sum(O11)) %>%

  # calculate R1, O12, and R2
  dplyr::group_by(w1) %>%
  dplyr::mutate(R1 = sum(O11),
                O12 = R1 - O11,
                R2 = N - R1) %>%
  dplyr::ungroup(w1) %>%

  # calculate C1, O21, C2, and O22
  dplyr::group_by(w2) %>%
  dplyr::mutate(C1 = sum(O11),
                O21 = C1 - O11,
                C2 = N - C1,
                O22 = R2 - O21) -> colldf

First 10 rows of collocation table
w1	w2	O11	N	R1	O12	R2	C1	O21	C2	O22
the	the	24,287	9,405,996	643,895	619,608	8,762,101	643,895	619,608	8,762,101	8,142,493
the	origin	170	9,405,996	643,895	643,725	8,762,101	2,884	2,714	9,403,112	8,759,387
the	of	37,291	9,405,996	643,895	606,604	8,762,101	450,460	413,169	8,955,536	8,348,932
the	species	6,222	9,405,996	643,895	637,673	8,762,101	89,994	83,772	9,316,002	8,678,329
the	by	5,415	9,405,996	643,895	638,480	8,762,101	80,785	75,370	9,325,211	8,686,731
the	charles	28	9,405,996	643,895	643,867	8,762,101	451	423	9,405,545	8,761,678
the	darwin	11	9,405,996	643,895	643,884	8,762,101	179	168	9,405,817	8,761,933
the	an	2,049	9,405,996	643,895	641,846	8,762,101	33,809	31,760	9,372,187	8,730,341
the	historical	7	9,405,996	643,895	643,888	8,762,101	185	178	9,405,811	8,761,923
the	sketch	8	9,405,996	643,895	643,887	8,762,101	152	144	9,405,844	8,761,957

To determine which terms collocate significantly and with what association strength, we use the following information (that is provided by the table above):

O₁₁ = Number of times word₁ occurs with word₂
O₁₂ = Number of times word₁ occurs without word₂
O₂₁ = Number of times CoocTerm occurs without Term
O₂₂ = Number of terms that are not coocTerm or Term

Example:

	w₂ present	w₂ absent
w₁ present	O₁₁	O₁₂	= R₁
w₁ absent	O₂₁	O₂₂	= R₂
	= C₁	= C₂	= N

We could calculate all collocations in the corpus (based on co-occurrence within the same sentence) or we can find collocations of a specific term - here, we will find collocations fo the term selection.

Now that we have all the relevant information, we will reduce the data and add additional information to the data so that the computing of the association measures runs smoothly.

# reduce and complement data
colldf %>%
# determine Term
  dplyr::filter(w1 == "selection",
                # set minimum number of occurrences of w2
                (O11+O21) > 10,
                # set minimum number of co-occurrences of w1 and w2
                O11 > 5)  %>%
  dplyr::rowwise() %>%
  dplyr::mutate(E11 = R1 * C1 / N, 
                E12 = R1 * C2 / N,
                E21 = R2 * C1 / N, 
                E22 = R2 * C2 / N)  -> colldf_redux

First 10 rows of reduced collocation data frame
w1	w2	O11	N	R1	O12	R2	C1	O21	C2	O22	E11	E12	E21	E22
selection	the	1,783	9,405,996	26,793	25,010	9,379,203	643,895	642,112	8,762,101	8,737,091	1,834.13630	24,958.86	642,060.864	8,737,142
selection	origin	19	9,405,996	26,793	26,774	9,379,203	2,884	2,865	9,403,112	9,376,338	8.21508	26,784.78	2,875.785	9,376,327
selection	of	1,556	9,405,996	26,793	25,237	9,379,203	450,460	448,904	8,955,536	8,930,299	1,283.13629	25,509.86	449,176.864	8,930,026
selection	species	175	9,405,996	26,793	26,618	9,379,203	89,994	89,819	9,316,002	9,289,384	256.34810	26,536.65	89,737.652	9,289,465
selection	by	334	9,405,996	26,793	26,459	9,379,203	80,785	80,451	9,325,211	9,298,752	230.11625	26,562.88	80,554.884	9,298,648
selection	an	90	9,405,996	26,793	26,703	9,379,203	33,809	33,719	9,372,187	9,345,484	96.30501	26,696.69	33,712.695	9,345,490
selection	on	200	9,405,996	26,793	26,593	9,379,203	71,209	71,009	9,334,787	9,308,194	202.83899	26,590.16	71,006.161	9,308,197
selection	when	63	9,405,996	26,793	26,730	9,379,203	26,607	26,544	9,379,389	9,352,659	75.79010	26,717.21	26,531.210	9,352,672
selection	s	38	9,405,996	26,793	26,755	9,379,203	5,975	5,937	9,400,021	9,373,266	17.01980	26,775.98	5,957.980	9,373,245
selection	as	296	9,405,996	26,793	26,497	9,379,203	103,198	102,902	9,302,798	9,276,301	293.95973	26,499.04	102,904.040	9,276,299

Now we can calculate the collocation statistics (the association strength).

colldf_redux %>%
  # determine number of rows
  dplyr::mutate(Rws = nrow(.)) %>%
    # work row-wise
    dplyr::rowwise() %>%
    # calculate fishers' exact test
    dplyr::mutate(p = as.vector(unlist(fisher.test(matrix(c(O11, O12, O21, O22), 
                                                        ncol = 2, byrow = T))[1]))) %>%
    
  # extract AM
    # 1. bias towards top left
    dplyr::mutate(btl_O12 = ifelse(C1 > R1, 0, R1-C1),
                  btl_O11 = ifelse(C1 > R1, R1, R1-btl_O12),
                  btl_O21 = ifelse(C1 > R1, C1-R1, C1-btl_O11),
                  btl_O22 = ifelse(C1 > R1, C2, C2-btl_O12),
                  
    # 2. bias towards top right
                  btr_O11 = 0, 
                  btr_O21 = R1,
                  btr_O12 = C1,
                  btr_O22 = C2-R1) %>%
    
    # 3. calculate AM
    dplyr::mutate(upp = btl_O11/R1,
                  low = btr_O11/R1,
                  op = O11/R1) %>%
    dplyr::mutate(AM = op / upp) %>%
    
    # remove superfluous columns
    dplyr::select(-any_of(c("btr_O21", "btr_O12", "btr_O22", "btl_O12", 
                            "btl_O11", "btl_O21", "btl_O22", "btr_O11"))) %>%

    # extract x2 statistics
    dplyr::mutate(X2 = (O11-E11)^2/E11 + (O12-E12)^2/E12 + (O21-E21)^2/E21 + (O22-E22)^2/E22) %>%

    # extract association measures
    dplyr::mutate(phi = sqrt((X2 / N)),
                Dice = (2 * O11) / (R1 + C1),
                LogDice = log((2 * O11) / (R1 + C1)),
                MI = log2(O11 / E11),
                MS = min((O11/C1), (O11/R1)),
                t.score = (O11 - E11) / sqrt(O11),
                z.score = (O11 - E11) / sqrt(E11),
                PMI = log2( (O11 / N) / ( C1 / N * R1 / N )),
                DeltaP12 = (O11 / (O11 + O12)) - (O21 / (O21 + O22)),
                DeltaP21 =  (O11 / (O11 + O21)) - (O21 / (O12 + O22)),
                DP = (O11 / R1) - (O21 / R2),
                LogOddsRatio = log(((O11 + 0.5) * (O22 + 0.5))  / ( (O12 + 0.5) * (O21 + 0.5) )),
                # calculate LL aka G2
                G2 = 2 * (O11 * log(O11 / E11) + O12 * log(O12 / E12) + O21 * log(O21 / E21) + O22 * log(O22 / E22))) %>%

  # determine Bonferroni corrected significance
  dplyr::mutate(Sig_corrected = dplyr::case_when(p / Rws > .05 ~ "n.s.",
                                                 p / Rws > .01 ~ "p < .05*",
                                                 p / Rws > .001 ~ "p < .01**",
                                                 p / Rws <= .001 ~ "p < .001***",
                                                 T ~ "N.A.")) %>%
  
  # round p-value
    dplyr::mutate(p = round(p, 5)) %>%
    # filter out non significant results
    dplyr::filter(Sig_corrected != "n.s.",
                # filter out instances where the w1 and w2 repel each other
                E11 < O11) %>%
    # arrange by DeltaP12 (association measure)
    dplyr::arrange(-DeltaP12) %>%
    # remove superfluous columns
    dplyr::select(-any_of(c("TermCoocFreq", "AllFreq", "NRows", "O12", "O21", 
                            "O22", "R1", "R2", "C1", "C2", "E11", "E12", "E21",
                            "E22", "upp", "low", "op", "Rws"))) -> assoc_tb

First 10 rows of association statistics table
w1	w2	O11	N	p	AM	X2	phi	Dice	LogDice	MI	MS	t.score	z.score	PMI	DeltaP12	DeltaP21	DP	LogOddsRatio	G2	Sig_corrected
selection	natural	515	9,405,996	0.00000	0.020258841	2,720.22428	0.017005913	0.019726510	-3.925792	2.8302762	0.019221438	19.502767	52.011007	2.8302762	0.016565989	0.017603780	0.016565989	1.9970932	1,150.64107	p < .001***
selection	of	1,556	9,405,996	0.00000	0.058074870	61.11824	0.002549077	0.006520650	-5.032781	0.2781676	0.003454247	6.917370	7.617446	0.2781676	0.010213234	-0.046671620	0.010213234	0.2045038	57.40403	p < .001***
selection	to	776	9,405,996	0.00000	0.028962789	38.30524	0.002018026	0.006317289	-5.064465	0.3156998	0.003545289	5.474939	6.107969	0.3156998	0.005708574	-0.020195137	0.005708574	0.2260098	35.59006	p < .001***
selection	through	149	9,405,996	0.00000	0.010922953	313.56297	0.005773777	0.007370035	-4.910333	1.9390875	0.005561154	9.023314	17.669645	1.9390875	0.004122652	0.009486465	0.004122652	1.3596350	181.59628	p < .001***
selection	by	334	9,405,996	0.00000	0.012465943	47.43872	0.002245764	0.006209448	-5.081683	0.5374853	0.004134431	5.684266	6.848161	0.5374853	0.003888348	-0.004492827	0.003888348	0.3792474	41.64215	p < .001***
selection	theory	82	9,405,996	0.00000	0.011239035	180.97693	0.004386410	0.004810936	-5.336864	1.9802428	0.003060501	6.760323	13.428382	1.9802428	0.002291352	0.010471482	0.002291352	1.3893366	103.32874	p < .001***
selection	been	233	9,405,996	0.00001	0.008696301	21.98381	0.001528794	0.005349312	-5.230787	0.4393942	0.003862668	4.007740	4.666970	0.4393942	0.002289787	-0.002566830	0.002289787	0.3100062	19.77955	p < .001***
selection	variations	77	9,405,996	0.00000	0.009361702	122.94955	0.003615439	0.004397738	-5.426665	1.7165674	0.002873885	6.104990	11.067616	1.7165674	0.002005154	0.008494688	0.002005154	1.2047885	76.55053	p < .001***
selection	will	149	9,405,996	0.00000	0.005561154	28.38208	0.001737080	0.004904300	-5.317643	0.6227757	0.004386223	4.279371	5.310275	0.6227757	0.001955197	0.000777505	0.001955197	0.4384993	24.34962	p < .001***
selection	power	63	9,405,996	0.00000	0.013188193	179.89260	0.004373250	0.003991131	-5.523681	2.2109715	0.002351360	6.222896	13.389887	2.2109715	0.001848759	0.012686769	0.001848759	1.5525839	94.91857	p < .001***

Identifying collocations using kwics

In this section, we will extract collocations and calculate association measures based on concordances and the corpus the concordances were extracted from.

We start by cleaning our corpus and splitting it into chapters.

# clean corpus
text %>%
  # concatenate the elements in the 'text' object
  paste0(collapse = " ") %>%
  # separate possessives and contractions
  stringr::str_replace_all(fixed("'"), fixed(" '")) %>%
  stringr::str_replace_all(fixed("’"), fixed(" '")) %>%
  # split text into different chapters
  stringr::str_split("CHAPTER [IVX]{1,4}") %>%
  # unlist sentences
  unlist() %>%
  # remove non-word characters
  stringr::str_replace_all("\\W", " ") %>%
  stringr::str_replace_all("[^[:alpha:] ]", " ") %>%
  # remove superfluous white spaces
  stringr::str_squish() %>%
  # convert to lower case and save in 'sentences' object
  tolower() -> texts

First 200 characters of the first 10 chapters of the example text
.
the origin of species by charles darwin an historical sketch of the progress of opinion on the origi
variation under domestication causes of variability effects of habit and the use or disuse of partsc
variation under nature variability individual differences doubtful species wide ranging much diffuse
struggle for existence its bearing on natural selection the term used in a wide sense geometrical ra
natural selection or the survival of the fittest natural selection its power compared with man s sel
f under changing conditions of life organic beings present individual differences in almost every pa
laws of variation effects of changed conditions use and disuse combined with natural selection organ
difficulties of the theory difficulties of the theory of descent with modification absence or rarity
miscellaneous objections to the theory of natural selection longevity modifications not necessarily
instinct instincts comparable with habits but different in their origin instincts graduated aphides

We split the corpus into chapter to mirror the fact that most text data will come in the form of corpora which consist of different files containing texts.

Next, we generate a frequency list of words that occur around a keyword (we use the keyword selection in this example but you can also choose a different word).

for this we use the tokens_select function (from the quanteda package) which has the following arguments:

x: a text or collection of texts. The text needs to be tokenised, i.e. split it into individual words, which is why we use the text in the tokens() function.
pattern: a keyword defined by a search pattern
window: the size of the context window (how many word before and after)
valuetype: the type of pattern matching
- “glob” for “glob”-style wildcard expressions;
- “regex” for regular expressions; or
- “fixed” for exact matching
selection: a character to define if the key word should be retained in the resulting frequency list or if it should be removed. The argument offers two options
- “keep”
- “remove”
case_insensitive: logical; if TRUE, ignore case when matching a pattern or dictionary values

kwic_words <- quanteda::tokens_select(tokens(texts), 
                                      pattern = "selection", 
                                      window = 5, 
                                      selection = "keep") %>%
  unlist() %>%
  # tabulate results
  table() %>%
  # convert into data frame
  as.data.frame() %>%
  # rename columns
  dplyr::rename(token = 1,
                n = 2) %>%
  # add a column with type
  dplyr::mutate(type = "kwic")

First 10 rows of the kwic table
token	n	type
a	54	kwic
able	2	kwic
abounding	1	kwic
above	2	kwic
absolute	1	kwic
absurd	1	kwic
accordance	2	kwic
according	2	kwic
account	4	kwic
accumulate	2	kwic

Next, we create a frequency table of the entire clean corpus.

corpus_words <- texts %>%
  # tokenize the corpus files
  quanteda::tokens() %>%
  # unlist the tokens to create a data frame
  unlist() %>%
  as.data.frame() %>%
  # rename the column to 'token'
  dplyr::rename(token = 1) %>%
  # group by 'token' and count the occurrences
  dplyr::group_by(token) %>%
  dplyr::summarise(n = n()) %>%
  # add column stating where the frequency list is 'from'
  dplyr::mutate(type = "corpus")

First 10 rows of the corpus table
token	n	type
a	3,163	corpus
abdomen	3	corpus
aberrant	7	corpus
aberration	2	corpus
abhorrent	1	corpus
abilities	1	corpus
ability	3	corpus
abjectly	1	corpus
able	54	corpus
ably	3	corpus

Next, we combine the two frequency lists.

freq_df <- dplyr::left_join(corpus_words, kwic_words, by = c("token")) %>%
  # rename columns and select relevant columns
  dplyr::rename(corpus = n.x,
                kwic = n.y) %>%
  dplyr::select(-type.x, -type.y) %>%
  # replace NA values with 0 in 'corpus' and 'kwic' columns
  tidyr::replace_na(list(corpus = 0, kwic = 0))

First 10 rows of the combined frequency table
token	corpus	kwic
a	3,163	54
abdomen	3	0
aberrant	7	0
aberration	2	0
abhorrent	1	0
abilities	1	0
ability	3	0
abjectly	1	0
able	54	2
ably	3	0

We now calculate the frequencies of the observed and expected frequencies as well as the row and column totals.

freq_df %>%
  dplyr::filter(corpus > 0) %>%
  dplyr::mutate(corpus = as.numeric(corpus),
                kwic = as.numeric(kwic)) %>%
  dplyr::mutate(corpus= corpus-kwic,
                C1 = sum(kwic),
                C2 = sum(corpus),
                N = C1 + C2) %>%
  dplyr::rowwise() %>%
  dplyr::mutate(R1 = corpus+kwic,
                R2 = N - R1,
                O11 = kwic,
                O12 = R1-O11,
                O21 = C1-O11,
                O22 = C2-O12) %>%
  dplyr::mutate(E11 = (R1 * C1) / N,
                E12 = (R1 * C2) / N,
                E21 = (R2 * C1) / N,
                E22 = (R2 * C2) / N) %>%
  dplyr::select(-corpus, -kwic) -> stats_tb

First 10 rows of the processed frequency table
token	C1	C2	N	R1	R2	O11	O12	O21	O22	E11	E12	E21	E22
a	5,830	188,275	194,105	3,163	190,942	54	3,109	5,776	185,166	95.00162283	3,067.9983772	5,734.998	185,207.0
abdomen	5,830	188,275	194,105	3	194,102	0	3	5,830	188,272	0.09010587	2.9098941	5,829.910	188,272.1
aberrant	5,830	188,275	194,105	7	194,098	0	7	5,830	188,268	0.21024703	6.7897530	5,829.790	188,268.2
aberration	5,830	188,275	194,105	2	194,103	0	2	5,830	188,273	0.06007058	1.9399294	5,829.940	188,273.1
abhorrent	5,830	188,275	194,105	1	194,104	0	1	5,830	188,274	0.03003529	0.9699647	5,829.970	188,274.0
abilities	5,830	188,275	194,105	1	194,104	0	1	5,830	188,274	0.03003529	0.9699647	5,829.970	188,274.0
ability	5,830	188,275	194,105	3	194,102	0	3	5,830	188,272	0.09010587	2.9098941	5,829.910	188,272.1
abjectly	5,830	188,275	194,105	1	194,104	0	1	5,830	188,274	0.03003529	0.9699647	5,829.970	188,274.0
able	5,830	188,275	194,105	54	194,051	2	52	5,828	188,223	1.62190567	52.3780943	5,828.378	188,222.6
ably	5,830	188,275	194,105	3	194,102	0	3	5,830	188,272	0.09010587	2.9098941	5,829.910	188,272.1

To determine which terms collocate significantly and with what association strength, we use the following information (that is provided by the table above):

O11 = Number of times word_x occurs in kwic
O12 = Number of times word_x occurs in corpus (without kwic)
O21 = Number of times other words occur in kwic
O22 = Number of times other words occur in corpus

Example:

	kwic	corpus
token	O₁₁	O₁₂	= R₁
other tokens	O₂₁	O₂₂	= R₂
	= C₁	= C₂	= N

stats_tb %>%
  # determine number of rows
  dplyr::mutate(Rws = nrow(.)) %>%
    # work row-wise
    dplyr::rowwise() %>%
    # calculate fishers' exact test
    dplyr::mutate(p = as.vector(unlist(fisher.test(matrix(c(O11, O12, O21, O22), 
                                                        ncol = 2, byrow = T))[1]))) %>%

      # extract AM
    # 1. bias towards top left
    dplyr::mutate(btl_O12 = ifelse(C1 > R1, 0, R1-C1),
                  btl_O11 = ifelse(C1 > R1, R1, R1-btl_O12),
                  btl_O21 = ifelse(C1 > R1, C1-R1, C1-btl_O11),
                  btl_O22 = ifelse(C1 > R1, C2, C2-btl_O12),
                  
    # 2. bias towards top right
                  btr_O11 = 0, 
                  btr_O21 = R1,
                  btr_O12 = C1,
                  btr_O22 = C2-R1) %>%
    
    # 3. calculate AM
    dplyr::mutate(upp = btl_O11/R1,
                  low = btr_O11/R1,
                  op = O11/R1) %>%
    dplyr::mutate(AM = op / upp) %>%
    
    # remove superfluous columns
    dplyr::select(-any_of(c("btr_O21", "btr_O12", "btr_O22", "btl_O12", 
                            "btl_O11", "btl_O21", "btl_O22", "btr_O11"))) %>% 
  
    # extract x2 statistics
    dplyr::mutate(X2 = (O11-E11)^2/E11 + (O12-E12)^2/E12 + (O21-E21)^2/E21 + (O22-E22)^2/E22) %>%
    # extract expected frequency
    dplyr::mutate(Exp = E11) %>%

    # extract association measures
    dplyr::mutate(phi = sqrt((X2 / N)),
                MS = min((O11/C1), (O11/R1)),
                Dice = (2 * O11) / (R1 + C1),
                LogDice = log((2 * O11) / (R1 + C1)),
                MI = log2(O11 / E11),
                t.score = (O11 - E11) / sqrt(O11),
                z.score = (O11 - E11) / sqrt(E11),
                PMI = log2( (O11 / N) / ((O11+O12) / N) * 
                              ((O11+O21) / N) ),
                DeltaP12 = (O11 / (O11 + O12)) - (O21 / (O21 + O22)),
                DeltaP21 =  (O11 / (O11 + O21)) - (O21 / (O12 + O22)),
                DP = (O11 / R1) - (O21 / R2),
                LogOddsRatio = log(((O11 + 0.5) * (O22 + 0.5))  / ( (O12 + 0.5) * (O21 + 0.5) )),
                # calculate LL aka G2
                G2 = 2 * (O11 * log(O11 / E11) + O12 * log(O12 / E12) + O21 * log(O21 / E21) + O22 * log(O22 / E22))) %>%
  
  # determine Bonferroni corrected significance
  dplyr::mutate(Sig_corrected = dplyr::case_when(p / Rws > .05 ~ "n.s.",
                                                 p / Rws > .01 ~ "p < .05*",
                                                 p / Rws > .001 ~ "p < .01**",
                                                 p / Rws <= .001 ~ "p < .001***",
                                                 T ~ "N.A.")) %>%  
  
  # round p-value
    dplyr::mutate(p = round(p, 5)) %>%
    # filter out non significant results
    dplyr::filter(Sig_corrected != "n.s.",
                # filter out instances where the w1 and w2 repel each other
                E11 < O11) %>%
    # arrange by phi (association measure)
    dplyr::arrange(-DeltaP12) %>%
    # remove superfluous columns
    dplyr::select(-any_of(c("TermCoocFreq", "AllFreq", "NRows", "O12", "O21", 
                            "O22", "R1", "R2", "C1", "C2", "E11", "E12", "E21",
                            "E22", "upp", "low", "op", "Rws"))) -> assoc_tb2

First 10 rows of the association statistic table
token	N	O11	p	AM	X2	Exp	phi	MS	Dice	LogDice	MI	t.score	z.score	PMI	DeltaP12	DeltaP21	DP	LogOddsRatio	Sig_corrected
selection	194,105	540	0.00000	1	17,487.50099	16.21905670	0.30015495	0.0926243568	0.1695447410	-1.774638	5.057198	22.5399430	130.057948	-5.057198	0.9726707	0.06452716	0.9726707	10.557635	p < .001***
methodical	194,105	10	0.00000	1	322.95832	0.30035290	0.04079011	0.0017152659	0.0034246575	-5.676754	5.057198	3.0672977	17.698645	-5.057198	0.9700147	-0.02919696	0.9700147	6.521043	p < .001***
accumulative	194,105	3	0.00003	1	96.88400	0.09010587	0.02234126	0.0005145798	0.0010286302	-6.879527	5.057198	1.6800282	9.693947	-5.057198	0.9699797	-0.03043483	0.9699797	5.421228	p < .001***
rigorous	194,105	3	0.00003	1	96.88400	0.09010587	0.02234126	0.0005145798	0.0010286302	-6.879527	5.057198	1.6800282	9.693947	-5.057198	0.9699797	-0.03043483	0.9699797	5.421228	p < .001***
cotton	194,105	2	0.00090	1	64.58900	0.06007058	0.01824152	0.0003430532	0.0006858711	-7.284821	5.057198	1.3717372	7.915075	-5.057198	0.9699747	-0.03061167	0.9699747	5.084585	p < .001***
incompetent	194,105	2	0.00090	1	64.58900	0.06007058	0.01824152	0.0003430532	0.0006858711	-7.284821	5.057198	1.3717372	7.915075	-5.057198	0.9699747	-0.03061167	0.9699747	5.084585	p < .001***
rigid	194,105	2	0.00090	1	64.58900	0.06007058	0.01824152	0.0003430532	0.0006858711	-7.284821	5.057198	1.3717372	7.915075	-5.057198	0.9699747	-0.03061167	0.9699747	5.084585	p < .001***
agreeable	194,105	1	0.03004	1	32.29433	0.03003529	0.01289867	0.0001715266	0.0003429943	-7.977797	5.057198	0.9699647	5.596803	-5.057198	0.9699697	-0.03078851	0.9699697	4.573587	p < .001***
amoimt	194,105	1	0.03004	1	32.29433	0.03003529	0.01289867	0.0001715266	0.0003429943	-7.977797	5.057198	0.9699647	5.596803	-5.057198	0.9699697	-0.03078851	0.9699697	4.573587	p < .001***
architecture	194,105	1	0.03004	1	32.29433	0.03003529	0.01289867	0.0001715266	0.0003429943	-7.977797	5.057198	0.9699647	5.596803	-5.057198	0.9699697	-0.03078851	0.9699697	4.573587	p < .001***

Visualising collocations

Dotplots

We can now visualize the association strengths in a dotplot as shown in the code chunk below.

# sort the assoc_tb2 data frame in descending order based on the 'phi' column
assoc_tb2 %>%
  dplyr::arrange(-phi) %>%
  # select the top 20 rows after sorting
  head(20) %>%
  # create a ggplot with 'token' on the x-axis (reordered by 'phi') and 'phi' on the y-axis
  ggplot(aes(x = reorder(token, phi, mean), y = phi)) +
  # add a scatter plot with points representing the 'phi' values
  geom_point() +
  # flip the coordinates to have horizontal points
  coord_flip() +
  # set the theme to a basic white and black theme
  theme_bw() +
  # set the x-axis label to "Token" and y-axis label to "Association strength (phi)"
  labs(x = "Token", y = "Association strength (phi)")

Barplots

Another option sis to visualize the association strengths in a barplot as shown in the code chunk below.

# sort the assoc_tb2 data frame in descending order based on the 'phi' column
assoc_tb2 %>%
  dplyr::arrange(-phi) %>%
  # select the top 20 rows after sorting
  head(20) %>%
  # create a ggplot with 'token' on the x-axis (reordered by 'phi') and 'phi' on the y-axis
  ggplot(aes(x = reorder(token, phi, mean), y = phi, label = phi)) +
  # add a bar plot using the 'phi' values
  geom_bar(stat = "identity") +
  # add text labels above the bars with rounded 'phi' values
  geom_text(aes(y = phi - 0.005, label = round(phi, 3)), color = "white", size = 3) + 
  # flip the coordinates to have horizontal bars
  coord_flip() +
  # set the theme to a basic white and black theme
  theme_bw() +
  # set the x-axis label to "Token" and y-axis label to "Association strength (phi)"
  labs(x = "Token", y = "Association strength (phi)")

Dendrograms

Another method for visualizing collocations are dendrograms (tree-diagrams) which show how similarity to indicate groupings based on numeric values (e.g., association strength).

We start by extracting the tokens that we want to show (the top 20 collocates of selection).

# sort the assoc_tb2 data frame in descending order based on the 'phi' column
top20colls <- assoc_tb2 %>%
  dplyr::arrange(-phi) %>%
  # select the top 20 rows after sorting
  head(20) %>%
  # extract the 'token' column 
  dplyr::pull(token)
# inspect the top 20 tokens with the highest 'phi' values
top20colls

##  [1] "selection"    "natural"      "through"      "theory"       "unconscious" 
##  [6] "methodical"   "acts"         "sexual"       "accumulated"  "by"          
## [11] "power"        "man"          "action"       "principle"    "accumulative"
## [16] "rigorous"     "survival"     "effects"      "process"      "aided"

We then need to generate a feature co-occurrence matrix from a document-feature matrix based on the cleaned, lower case sentences of our text.

# tokenize the 'sentences' data using quanteda package
keyword_fcm <- sentences %>%
  quanteda::tokens() %>%
  # create a document-feature matrix (dfm) from the tokens
  quanteda::dfm() %>%
  # select features based on 'top20colls' and the term "selection" pattern
  quanteda::dfm_select(pattern = c(top20colls, "selection")) %>%
  # Create a symmetric feature co-occurrence matrix (fcm) 
  quanteda::fcm(tri = FALSE)
# inspect the first 6 rows and 6 columns of the resulting fcm
keyword_fcm[1:6, 1:6]

## Feature co-occurrence matrix of: 6 by 6 features.
##          features
## features   by natural aided effects power man
##   by      461     282    25      37    54  81
##   natural 282      49     9      28    38  18
##   aided    25       9     0       1     1   0
##   effects  37      28     1       3     1   5
##   power    54      38     1       1     9  15
##   man      81      18     0       5    15  10

Then we generate the dendrogram based on a distance matrix generated from the feature co-occurrence matrix.

# create a hierarchical clustering object using the distance matrix of the fcm as data
hclust(dist(keyword_fcm),     
       # use ward.D as linkage method
       method="ward.D2") %>% 
  # generate visualization (dendrogram)
  ggdendrogram() +              
  # add title
  ggtitle("20 most strongly collocating terms of 'selection'")

Network Graphs

Network graphs, or networks for short, are a powerful and versatile visual representation used to depict relationships or connections among various elements. Network graphs typically consist of nodes, representing individual entities, and edges, indicating the connections or interactions between these entities. Nodes can represent diverse entities such as words (collocates), interlocutors, objects, or concepts, while edges convey the relationships or associations between them.

Here we generate a basic network graph of the collocates of our keyword based on the fcm.

# create a network plot using the fcm
quanteda.textplots::textplot_network(keyword_fcm,
                                     # set the transparency of edges to 0.8 for visibility
                                     edge_alpha = 0.8,
                                     # set the color of edges to gray
                                     edge_color = "gray",
                                     # set the size of edges to 2 for better visibility
                                     edge_size = 2,
                                     # adjust the size of vertex labels 
                                     # based on the logarithm of row sums of the fcm
                                     vertex_labelsize = log(rowSums(keyword_fcm)))

Biplots

An alternative way to display co-occurrence patterns are bi-plots which are used to display the results of a Correspondence Analysis. Bi-plots are useful, in particular, when one is not interested in one particular keyterm and its collocations but in the overall similarity of many terms. Semantic similarity in this case refers to a shared semantic and this distributional profile. As such, words can be deemed semantically similar if they have a similar co-occurrence profile - i.e. they co-occur with the same elements. Biplots can be used to visualize collocations because collocates co-occur and thus share semantic properties which renders then more similar to each other compared with other terms.

# perform correspondence analysis
res.ca <- CA(as.matrix(keyword_fcm), graph = FALSE)
# plot results
fviz_ca_row(res.ca, repel = TRUE, col.row = "gray20")

N-grams

N-grams are contiguous sequences of N items (words, characters, or symbols) in a given text. The term N in N-grams refers to the number of items in the sequence. For example, a bigram (2-gram) consists of two consecutive items, a trigram (3-gram) consists of three, and so on. N-grams are widely used in natural language processing and text analysis to capture patterns and dependencies within a linguistic context. N-grams help analyze the frequency of word sequences in a corpus. This information can reveal common phrases, expressions, or patterns that occur frequently and that often represent multiword expressions such as New York, Prime Minister, or New South Wales. Identifying such multiword expressions can be useful to fuse compound words in subsequent steps of an analysis (e.g., combining wheel chair to wheelchair or wheel-chair). N-grams are fundamental in language modeling, where they are used to estimate the likelihood of a word given its context. This is especially important in predictive text applications and machine translation.

Identifying n-grams using quanteda

The quanteda package (see Benoit et al. 2018) offers excellent and very fast functions for extracting N-grams. It’s a fun way to discover meaningful word pairs in your text! Below, we use the textstat_collocations function for extracting N-grams. This function uses the following main arguments

x: a character, corpus, or tokens object.
method: association measure for detecting collocations. Currently this is limited to “lambda”.
size: integer; the length of the ngram. The default is 2 - if you want to extract tri-grams set size = 3 and if you want to extract four-grams set size = 4 and so on.
min_count: numeric; minimum frequency of collocations that will be scored.
smoothing: numeric; a smoothing parameter added to the observed counts (default is 0.5).
tolower: logical; if TRUE, tokens are transformed to lower-case.

# concatenate the elements in the 'text' object
text %>% 
  paste0(collapse = " ") %>%
  # convert to lower case
  tolower() %>%
  # convert the concatenated text into tokens
  quanteda::tokens() %>%
  # identify and extract bigrams that occur at leats 10 times
  quanteda.textstats::textstat_collocations(size = 2, min_count = 10) %>%
  # convert into a data frame and save results in an object called 'ngrams'
  as.data.frame() %>%
  # order by lambda
  dplyr::arrange(-lambda) -> ngrams


collocation	count	length	lambda	z
la plata	10	2	14.172641	8.972104
asa gray	10	2	13.584854	11.365881
de candolle	20	2	13.232135	9.069094
malay archipelago	11	2	11.795349	8.099435
fritz miiller	12	2	11.781998	14.798749
close interbreeding	11	2	11.060514	7.626630
informs me	14	2	10.547254	7.319025
new zealand	27	2	10.530731	7.372304
reproductive systems	12	2	10.078532	15.838139
laws governing	14	2	10.076309	14.349877
i am	60	2	10.063415	7.084662
systematic affinity	12	2	9.844818	17.664599
consecutive formations	13	2	9.293545	13.332193
reciprocal crosses	15	2	9.284706	17.907863
united states	29	2	9.058088	26.394583

Identifying n-grams using quanteda

Creating N-gram lists manually, especially bi-grams, is surprisingly easy. In our example text, we’ll craft a bi-gram list by doing something quite straightforward: taking each word and introducing it to the next word in line. The difference to the previous method is that we retain the original order of the bi-grams here.

In a first step, we split the text into words and remove any non-word characters.

# process the text
text  %>%
  # convert all text to lowercase
  tolower() %>%
  # remove non-word characters, keeping spaces
  str_replace_all("[^[:alpha:][:space:]]*", "")  %>%
  # remove punctuation
  tm::removePunctuation() %>%
  # squish consecutive spaces into a single space
  stringr::str_squish() %>%
  # split the text into individual words, separated by spaces
  stringr::str_split(" ") %>%
  # unlist the result into a single vector of words  and save result in "text_words"
  unlist() -> text_words

Now, we generate a table with the N-grams(in our case bi-grams).

# create data frame
text_bigrams <- data.frame(text_words[1:length(text_words)-1], 
                       text_words[2:length(text_words)]) %>%
  dplyr::rename(Word1 = 1,
                Word2 = 2) %>%
  dplyr::mutate(Bigram = paste0(Word1, " ", Word2)) %>%
  dplyr::group_by(Bigram) %>%
  dplyr::summarise(Frequency = n()) %>%
  dplyr::arrange(-Frequency)

Top 10 most frequent bigrams and their (relative) freqeuncy in the example text
Bigram	Frequency
of the	2,673
in the	1,440
the same	959
to the	791
on the	743
have been	624
that the	574
it is	500
natural selection	405
and the	351
from the	346
in a	339
of a	337
with the	336
to be	324

It is very useful to perform an N-gram analysis before a collocation analysis to fuse compound words (e.g. New York would become NewYork or New South Wales would become NewSouthWales) to avoid treating new or south as independent elements.

Citation & Session Info

Schweinberger, Martin. 2024. Analyzing Collocations and N-grams in R. Brisbane: The University of Queensland. url: https://ladal.edu.au/coll.html (Version 2024.03.28).

@manual{schweinberger`2024coll,
  author = {Schweinberger, Martin},
  title = {Analyzing Collocations and N-grams in R},
  note = {https://ladal.edu.au/coll.html},
  year = {2024},
  organization = {The University of Queensland, Australia. School of Languages and Cultures},
  address = {Brisbane},
  edition = {2024.03.28}
}

sessionInfo()

## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_Australia.utf8  LC_CTYPE=English_Australia.utf8   
## [3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C                      
## [5] LC_TIME=English_Australia.utf8    
## 
## time zone: Australia/Brisbane
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] sna_2.8                   statnet.common_4.10.0    
##  [3] tm_0.7-14                 NLP_0.3-0                
##  [5] stringr_1.5.1             dplyr_1.1.4              
##  [7] quanteda.textplots_0.95   quanteda.textstats_0.97.2
##  [9] quanteda_4.1.0            Matrix_1.7-1             
## [11] network_1.18.2            igraph_2.1.1             
## [13] ggdendro_0.2.0            GGally_2.2.1             
## [15] flextable_0.9.7           factoextra_1.0.7         
## [17] ggplot2_3.5.1             FactoMineR_2.11          
## 
## loaded via a namespace (and not attached):
##   [1] sandwich_3.1-1          rlang_1.1.4             magrittr_2.0.3         
##   [4] multcomp_1.4-26         tidytext_0.4.2          compiler_4.4.1         
##   [7] systemfonts_1.1.0       vctrs_0.6.5             pkgconfig_2.0.3        
##  [10] fastmap_1.2.0           backports_1.5.0         labeling_0.4.3         
##  [13] utf8_1.2.4              rmarkdown_2.28          ragg_1.3.3             
##  [16] purrr_1.0.2             xfun_0.48               cachem_1.1.0           
##  [19] jsonlite_1.8.9          flashClust_1.01-2       SnowballC_0.7.1        
##  [22] highr_0.11              uuid_1.2-1              broom_1.0.7            
##  [25] parallel_4.4.1          stopwords_2.3           cluster_2.1.6          
##  [28] R6_2.5.1                bslib_0.8.0             stringi_1.8.4          
##  [31] RColorBrewer_1.1-3      car_3.1-3               jquerylib_0.1.4        
##  [34] estimability_1.5.1      nsyllable_1.0.1         Rcpp_1.0.13            
##  [37] assertthat_0.2.1        knitr_1.48              zoo_1.8-12             
##  [40] klippy_0.0.0.9500       splines_4.4.1           tidyselect_1.2.1       
##  [43] abind_1.4-8             rstudioapi_0.17.0       yaml_2.3.10            
##  [46] codetools_0.2-20        lattice_0.22-6          tibble_3.2.1           
##  [49] plyr_1.8.9              withr_3.0.1             askpass_1.2.1          
##  [52] coda_0.19-4.1           evaluate_1.0.1          survival_3.7-0         
##  [55] ggstats_0.7.0           zip_2.3.1               xml2_1.3.6             
##  [58] ggpubr_0.6.0            pillar_1.9.0            carData_3.0-5          
##  [61] janeaustenr_1.0.0       DT_0.33                 generics_0.1.3         
##  [64] munsell_0.5.1           scales_1.3.0            xtable_1.8-4           
##  [67] leaps_3.2               glue_1.8.0              slam_0.1-54            
##  [70] gdtools_0.4.0           emmeans_1.10.5          scatterplot3d_0.3-44   
##  [73] tools_4.4.1             data.table_1.16.2       tokenizers_0.3.0       
##  [76] ggsignif_0.6.4          mvtnorm_1.3-2           fastmatch_1.1-4        
##  [79] grid_4.4.1              tidyr_1.3.1             colorspace_2.1-1       
##  [82] Formula_1.2-5           cli_3.6.3               textshaping_0.4.0      
##  [85] officer_0.6.7           fontBitstreamVera_0.1.1 fansi_1.0.6            
##  [88] gtable_0.3.5            rstatix_0.7.2           sass_0.4.9             
##  [91] digest_0.6.37           fontquiver_0.2.1        ggrepel_0.9.6          
##  [94] TH.data_1.1-2           farver_2.1.2            htmlwidgets_1.6.4      
##  [97] htmltools_0.5.8.1       lifecycle_1.0.4         multcompView_0.1-10    
## [100] fontLiberation_0.1.0    openssl_2.2.2           MASS_7.3-61

Back to LADAL home

References

Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. 2018. “Quanteda: An r Package for the Quantitative Analysis of Textual Data.” Journal of Open Source Software 3 (30): 774. https://doi.org/10.21105/joss.00774.

Ellis, Nick C. 2007. “Language Acquisition as Rational Contingency Learning.” Applied Linguistics 27 (1): 1–24. https://doi.org/10.1093/applin/ami038.

Gries, Stefan Th. 2013. “50-Something Years of Work on Collocations: What Is or Should Be Next….” International Journal of Corpus Linguistics 18 (1): 137–66.

———. 2022. “What Do (Some of) Our Association Measures Measure (Most)? Association?” Journal of Second Language Studies 5 (1): 1–33.

Pedersen, Ted. 1998. “Dependent Bigram Identification.” AAAI/IAAI 1197.

I’m extremely grateful to Joseph Flanagan who provided very helpful feedback and pointed out errors in previous versions of this tutorial. All remaining errors are, of course, my own.↩︎
If you want to render the R Notebook on your machine, i.e. knitting the document to html or a pdf, you need to make sure that you have R and RStudio installed and you also need to download the bibliography file and store it in the same folder where you store the Rmd file.↩︎

Analyzing Collocations and N-grams in R

Martin Schweinberger

2024-03-07

Introduction

Co-occurrence and association

Collocations

Identifying collocations in sentences

Identifying collocations using kwics

Visualising collocations

Dotplots

Barplots

Dendrograms

Network Graphs

Biplots

N-grams

Identifying n-grams using quanteda

Identifying n-grams using quanteda

Citation & Session Info

References