Introduction

This tutorial introduces collocation and co-occurrence analysis with R and shows how to extract and visualize semantic links between words. The entire R markdown document for the present tutorial can be downloaded here. Parts of this tutorial build on and use materials from this tutorial on co-occurrence analysis with R by Andreas Niekler and Gregor Wiedemann (see Wiedemann and Niekler 2017).

How would you find words that are associated with a specific term and how can you visualize such word nets? This tutorial addresses this issue by focusing on co-occurrence and collocations of words. Collocations are words that occur together very frequently. For example, Merry Christmas is a collocation because merry and Christmas occur together more frequently than would be expected by chance. This means that if you were to shuffle all words in a corpus and then count how often merry and Christmas co-occurred, they would co-occur significantly less often in the shuffled (randomized) corpus than in a corpus that contains non-shuffled, natural speech.
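
To make this idea concrete, here is a minimal toy sketch (using an invented mini-text, not the data used in this tutorial): it counts how often merry is immediately followed by christmas in the original word order and compares this with the average count across randomly shuffled versions of the same words.

# toy illustration with an invented mini-text (not the tutorial data)
words <- c("we", "wish", "you", "a", "merry", "christmas", "we", "wish",
           "you", "a", "merry", "christmas", "and", "a", "happy", "new", "year")
# helper that counts how often "merry" is immediately followed by "christmas"
count_bigram <- function(x) sum(head(x, -1) == "merry" & tail(x, -1) == "christmas")
# observed bigram frequency
count_bigram(words)
# average bigram frequency across 1,000 random shuffles (the chance baseline)
set.seed(123)
mean(replicate(1000, count_bigram(sample(words))))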

But how can you determine if words occur more frequently together than would be expected by chance? This tutorial will answer this question.

Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to R and more information on how to use it here. For this tutorial, we need to install certain packages from an R library so that the scripts shown below are executed without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, you can skip this section. To install the necessary packages, simply run the following code - it may take some time (between 1 and 5 minutes to install all of the libraries), so you do not need to worry if it takes a while.

# clean current workspace
rm(list=ls(all=T))
# set options
options(stringsAsFactors = F)
options(scipen = 999)
options(max.print=10000)
# install packages
install.packages("corpus", "dplyr", "quanteda", "stringr", "tidyr", "tm",
                 "GGally", "network", "sna", "ggplot2", "collostructions")

Once you have installed R and RStudio and initiated the session by executing the code shown above, you are good to go.

1 Extracting N-Grams and Collocations

Collocations are terms that co-occur (significantly) more often together than would be expected by chance. A typical example of a collocation is Merry Christmas because the words merry and Christmas occur together more frequently than would be expected if words were just randomly strung together.

N-grams are related to collocates in that they represent words that occur together (bi-grams are two words that occur together, tri-grams three words and so on). Fortunately, creating N-gram lists is very easy. We will use the “Origin” to create a bi-gram list. As a first step, we load the data and split it into individual words.

# load libraries
library(dplyr)
library(stringr)
library(tm)
library(DT)
# read in text
darwin <- readLines("https://slcladal.github.io/data/origindarwin.txt") %>%
  paste(sep = " ", collapse = " ") %>%
  str_squish()
# select sample
darwin <- substr(darwin, 1000000, 1100000)
# further processing
darwin_split <- darwin %>% 
  removePunctuation() %>%
  str_replace_all("\\W", " ") %>%
  tolower() %>%
  str_squish() %>%
  strsplit(" ") %>%
  unlist()
# inspect data
head(darwin_split)
## [1] "red"       "with"      "some"      "few"       "existing"  "organisms"

We can create bi-grams (N-grams consisting of two elements) by pasting every word together with the word that immediately follows it. To do this, we could use a function that is already available, but we can also very simply do this manually. The first step in creating bigrams manually consists in creating a table which holds every word in the corpus and the word that immediately follows it.

# create data frame
darwindf <- data.frame(darwin_split[1:(length(darwin_split)-1)], 
                       darwin_split[2:length(darwin_split)])
# add column names
colnames(darwindf) <- c("Word1", "Word2")
# inspect results
datatable(head(darwindf, 100), rownames = FALSE, options = list(pageLength = 10, scrollX=T), filter = "none")

We can then paste the elements in these two columns together and also inspect the frequency of each bigram.

darwin2grams <- darwindf %>%
  mutate(Bigram = paste(Word1, Word2, sep = " ")) %>%
  group_by(Bigram) %>%
  summarise(Frequency = n()) %>%
  arrange(-Frequency)
# inspect results
datatable(head(darwin2grams, 100), rownames = FALSE, options = list(pageLength = 10, scrollX=T), filter = "none")
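
As mentioned above, ready-made functions exist for this task as well. The following is a minimal alternative sketch using the quanteda package (assuming it is installed; the exact counts may differ slightly from the manual approach depending on tokenization).

# optional alternative: generate the bigram list with quanteda's n-gram functions
library(quanteda)
darwin_bigrams_q <- darwin_split %>%
  paste(collapse = " ") %>%
  tokens() %>%
  tokens_ngrams(n = 2, concatenator = " ") %>%
  dfm()
# inspect the ten most frequent bigrams
topfeatures(darwin_bigrams_q, 10)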

Both N-grams and collocations are not only important concepts in language teaching, but they are also fundamental in Text Analysis and many other research areas working with language data. Unfortunately, words that collocate do not have to be immediately adjacent but can also be separated by several slots. This is unfortunate because it makes the retrieval of collocates substantially more difficult compared with a situation in which we only need to extract words that occur right next to each other.
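
One common way of capturing collocates that are not immediately adjacent is to count co-occurrences within a window of several words. The tutorial below instead counts co-occurrences within sentences, but as an aside, here is a minimal sketch of the window-based approach using quanteda's fcm function (assuming the quanteda package is installed; this sketch is not used in the remainder of the tutorial).

# optional aside: window-based co-occurrence counts (5 words to either side)
library(quanteda)
darwin_fcm <- darwin_split %>%
  paste(collapse = " ") %>%
  tokens() %>%
  fcm(context = "window", window = 5)
# inspect the dimensions of the feature co-occurrence matrix
dim(darwin_fcm)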

In the following, we will extract collocations from Charles Darwin’s On the Origin of Species by Means of Natural Selection. In a first step, we will split the Origin into individual sentences.

# read in text
darwinsentences <- darwin %>%
  str_squish() %>%
  str_replace_all("([A-Z]{2,} [A-Z]{2,}) ([A-Z][a-z]{1,} )", "\\1 qwertz\\2") %>%
  str_replace_all("([a-z]{2,}\\.) ([A-Z] {0,1}[a-z]{0,30})", "\\1qwertz\\2") %>%
    str_replace_all("([a-z]{2,}\\?) ([A-Z] {0,1}[a-z]{0,30})", "\\1qwertz\\2") %>%
  strsplit("qwertz")%>%
  unlist() %>%
  str_remove_all("- ") %>%
  str_replace_all("\\W", " ") %>%
  str_squish()
# inspect data
head(darwinsentences)
## [1] "red with some few existing organisms"                                                                                                                                                                                                                                                          
## [2] "All the descendants of the genus F along its whole line of descent are supposed to have been but little modified and they form a single genus"                                                                                                                                                 
## [3] "But this genus though much isolated will still occupy its proper intermediate position"                                                                                                                                                                                                        
## [4] "The representation of the groups as here given in the diagram on a flat surface is much too simple"                                                                                                                                                                                            
## [5] "The branches ought to have diverged in all directions"                                                                                                                                                                                                                                         
## [6] "If the names of the groups had been simply written down in a linear series the representation would have been still less natural and it is notoriously not possible to represent in a series on a flat surface the affinities which we discover in nature amongst the beings of the same group"

The first element does not represent a full sentence because we selected a sample of the text which began in the middle of a sentence rather than at its beginning. In a next step, we will create a matrix that shows how often each word co-occurred with each other word in the data.

# convert into corpus
darwincorpus <- Corpus(VectorSource(darwinsentences))
# create vector with words to remove
extrawords <- c("the", "can", "get", "got", "can", "one", 
                "dont", "even", "may", "but", "will", 
                "much", "first", "but", "see", "new", 
                "many", "less", "now", "well", "like", 
                "often", "every", "said", "two")
# clean corpus
darwincorpusclean <- darwincorpus %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(tolower) %>%
  tm_map(removeWords, stopwords()) %>%
  tm_map(removeWords, extrawords)
# create document term matrix
darwindtm <- DocumentTermMatrix(darwincorpusclean, control=list(bounds = list(global=c(1, Inf)), weighting = weightBin))
# load library
library(Matrix)
# convert dtm into sparse matrix
darwinsdtm <- sparseMatrix(i = darwindtm$i, j = darwindtm$j, 
                           x = darwindtm$v, 
                           dims = c(darwindtm$nrow, darwindtm$ncol),
                           dimnames = dimnames(darwindtm))
# calculate co-occurrence counts
coocurrences <- t(darwinsdtm) %*% darwinsdtm
# convert into matrix
collocates <- as.matrix(coocurrences)
# inspect results
datatable(collocates[1:10, 1:10], rownames = FALSE, options = list(pageLength = 10, scrollX=T), filter = "none")

We can inspect this co-occurrence matrix and check how many terms (words or elements) it represents using the ncol function from base R. We can also check how often each term co-occurs with other terms in the data by combining the rowSums and summary functions from base R. The output of the summary function tells us that the minimum co-occurrence frequency of a term in the data is 2 and the maximum is 1,474. The difference between the median (36.00) and the mean (74.47) indicates that the frequencies are distributed very non-normally - which is common for language data.

# inspect size of matrix
ncol(collocates)
## [1] 2196
summary(rowSums(collocates))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   22.00   36.00   74.47   74.00 1474.00

We will now use an example of one individual word (selection) to show how collocation strength for individual terms is calculated and how it can be visualized or displayed. The function calculateCoocStatistics is taken from this tutorial (see also Wiedemann and Niekler 2017).

# load function for co-occurrence calculation
source("https://slcladal.github.io/rscripts/calculateCoocStatistics.R")
# define term
coocTerm <- "selection"
# calculate cooccurence statistics
coocs <- calculateCoocStatistics(coocTerm, darwinsdtm, measure="LOGLIK")
# inspect results
coocs[1:20]
##       natural        theory      rendered     variation      admitted 
##     86.103229     19.170607     16.086344     15.956290      7.815544 
##    objections        causes    divergence        longer     injurious 
##      7.815544      7.815544      7.815544      7.815544      7.815544 
##        disuse  modification    variations  difficulties           aid 
##      7.359733      6.625658      6.625658      6.200451      6.200451 
## independently          vary         might   inheritance          show 
##      6.200451      6.200451      6.174221      6.174221      5.942070

2 Visualizing Collocations

There are various visualization options for collocations. Which visualization method is appropriate depends on what the visualization should display.

Association Strength

We start with the most basic option and visualize the collocation strength using a simple dot chart.

coocdf <- coocs %>%
  as.data.frame() %>%
  mutate(CollStrength = coocs,
         Term = names(coocs)) %>%
  filter(CollStrength > 5)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
ggplot(coocdf, aes(x = reorder(Term, CollStrength, mean), y = CollStrength)) +
         geom_point() +
         coord_flip() +
  theme_bw()

The dot chart shows that natural collocates more strongly with selection than any other term. This substantiates that natural and selection are a collocation in Darwin’s On the Origin of Species by Means of Natural Selection (which is unsurprising, of course).

Dendrograms

Another method for visualizing collocations are dendrograms. Dendrograms (also called tree diagrams) show how similar elements are based on one or more features. As such, dendrograms are used to indicate groupings as they show elements (words) that are notably similar or different with respect to their association strength. To use this method, we first need to generate a distance matrix from our co-occurrence matrix.

coolocs <- c(coocdf$Term, "selection")
# remove non-collocating terms
collocates_redux <- collocates[rownames(collocates) %in% coolocs, ]
collocates_redux <- collocates_redux[, colnames(collocates_redux) %in% coolocs]
# create distance matrix
distmtx <- dist(collocates_redux)
# activate library
library("cluster")         # activate library
clustertexts <- hclust(    # hierarchical cluster object
  distmtx,                 # use distance matrix as data
  method="ward.D2")        # ward.D as linkage method
library(ggdendro)
ggdendrogram(clustertexts) +
  ggtitle("Terms strongly collocating with *selection*")

Network Graphs

Network graphs are a very useful tool to show relationships (or the absence of relationships) between elements. They are particularly useful for displaying the relationships that words have with each other and the properties of these word networks.

Basic Network Graphs

In order to display a network, we need to create a network graph by using the network function from the network package.

library(network)
net = network(collocates_redux,
              directed = FALSE,
              ignore.eval = FALSE,
              names.eval = "weights"
)
# vertex names
network.vertex.names(net) = rownames(collocates_redux)
net
##  Network attributes:
##   vertices = 28 
##   directed = FALSE 
##   hyper = FALSE 
##   loops = FALSE 
##   multiple = FALSE 
##   bipartite = FALSE 
##   total edges= 190 
##     missing edges= 0 
##     non-missing edges= 190 
## 
##  Vertex attribute names: 
##     vertex.names 
## 
##  Edge attribute names: 
##     weights

Now that we have generated a network object, we visualize the network.

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggnet2(net, 
       label = TRUE, 
       label.size = 4,
       alpha = 0.2,
       size.cut = 3,
       edge.alpha = 0.3) +
  guides(color = FALSE, size = FALSE)
## Warning in ggnet2(net, label = TRUE, label.size = 4, alpha = 0.2, size.cut =
## 3, : node.size is invariant; size.cut ignored

The network is already informative, but we will customize the network object so that the visualization becomes more appealing and informative. To add information, we create vectors of words representing different groups, e.g. terms that rarely, sometimes, and frequently collocate with selection (I used the dendrogram which displayed the cluster analysis as the basis for this categorization).

Based on these vectors, we can then change or adapt the default values of certain attributes or parameters of the network object (e.g. weights, line types, and colors).

# create vectors with collocation occurrences as categories
mid <- c("theory", "variations", "slight", "variation")
high <- c("natural", "selection")
infreq <- colnames(collocates_redux)[!colnames(collocates_redux) %in% mid & !colnames(collocates_redux) %in% high]
# add color by group
net %v% "Collocation" = ifelse(network.vertex.names(net) %in% infreq, "weak", 
                   ifelse(network.vertex.names(net) %in% mid, "medium", 
                   ifelse(network.vertex.names(net) %in% high, "strong", "other")))
# modify color
net %v% "color" = ifelse(net %v% "Collocation" == "weak", "gray60", 
                  ifelse(net %v% "Collocation" == "medium", "orange", 
                  ifelse(net %v% "Collocation" == "strong", "indianred4", "gray60")))
# rescale edge size
set.edge.attribute(net, "weights", ifelse(net %e% "weights" < 1, 0.1, 
                                   ifelse(net %e% "weights" <= 2, .5, 1)))
# define line type
set.edge.attribute(net, "lty", ifelse(net %e% "weights" <=.1, 3, 
                                      ifelse(net %e% "weights" <= .5, 2, 1)))

We can now display the network object and make use of the added information.

ggnet2(net, 
       color = "color", 
       label = TRUE, 
       label.size = 4,
       alpha = 0.2,
       size = "degree",
       edge.size = "weights",
       edge.lty = "lty",
       edge.alpha = 0.2) +
  guides(color = FALSE, size = FALSE)

Advanced Network Graphs

Wiedemann and Niekler (2017) have written a highly recommendable tutorial on co-occurrence analysis in which they propose an alternative, more complex network visualization for co-occurrences. Their approach is to create and customize a graph object based on the igraph package.

The code below is based on this tutorial and is included to show the flexibility of the approach used by Wiedemann and Niekler (2017).

numberOfCoocs <- 10
# create data frame which will hold the edges of the graph
resultGraph <- data.frame(from = character(), to = character(), sig = numeric(0))
# create temporary data frame
tmpGraph <- data.frame(from = character(), to = character(), sig = numeric(0))
# fill data frame to produce the correct number of lines
tmpGraph[1:numberOfCoocs, 3] <- coocs[1:numberOfCoocs]
# enter search word into the first column in all lines
tmpGraph[, 1] <- coocTerm
# enter co-occurrences into second column
tmpGraph[, 2] <- names(coocs)[1:numberOfCoocs]
# enter collocation strength
tmpGraph[, 3] <- coocs[1:numberOfCoocs]
# attach data frame to resultGraph
resultGraph <- rbind(resultGraph, tmpGraph)

We now calculate co-occurrence statistics for each of the most strongly associated terms and add them to the graph object.

# iterate over most significant numberOfCoocs co-occurrences
for (i in 1:numberOfCoocs){
  # calculate co-occurrence strength for term i
  newCoocTerm <- names(coocs)[i]
  coocs2 <- calculateCoocStatistics(newCoocTerm, darwinsdtm, measure="LOGLIK")
  # fill temporary graph object
  tmpGraph <- data.frame(from = character(), to = character(), sig = numeric(0))
  tmpGraph[1:numberOfCoocs, 3] <- coocs2[1:numberOfCoocs]
  tmpGraph[, 1] <- newCoocTerm
  tmpGraph[, 2] <- names(coocs2)[1:numberOfCoocs]
  tmpGraph[, 3] <- coocs2[1:numberOfCoocs]
  # append results to the result graph data frame
  resultGraph <- rbind(resultGraph, tmpGraph[2:length(tmpGraph[, 1]), ])
}

Now, we can create a network graph object.

# load packages
library(igraph)
# create undirected graph object from the data frame (directed = F)
graphNetwork <- graph.data.frame(resultGraph, directed = F)
# identify nodes with fewer than 2 edges
graphVs <- V(graphNetwork)[degree(graphNetwork) < 2]
# remove these vertices from the graph
graphNetwork <- delete.vertices(graphNetwork, graphVs) 
# assign colors to nodes (search term blue, rest orange)
V(graphNetwork)$color <- ifelse(V(graphNetwork)$name == coocTerm, 'cornflowerblue', 'orange') 
# edges with a significance of at least 50% of the maximum significance in the graph are drawn in coral
halfMaxSig <- max(E(graphNetwork)$sig) * 0.5
E(graphNetwork)$color <- ifelse(E(graphNetwork)$sig > halfMaxSig, "coral", "azure3")
# do not curve the edges
E(graphNetwork)$curved <- 0 
# size the nodes by their degree of networking
V(graphNetwork)$size <- log(degree(graphNetwork)) * 5
# all nodes must be assigned a standard minimum-size
V(graphNetwork)$size[V(graphNetwork)$size < 5] <- 3 
# edge thickness
E(graphNetwork)$width <- 1.5

And finally, we can visualize the network.

# Define the frame and spacing for the plot
par(mai=c(0,0,1,0)) 
# final plot
plot(graphNetwork,              
     layout = layout.fruchterman.reingold,  # Force Directed Layout 
     main = paste("Cooccurrence network for", " \"", coocTerm, "\""),
     vertex.label.family = "sans",
     vertex.shape = "circle",
     vertex.label.dist = 2,           # Labels of the nodes moved slightly
     vertex.frame.color = 'darkolivegreen',
     vertex.label.color = 'black',      # Color of node names
     vertex.label.font = 2,         # Font of node names
     vertex.label = V(graphNetwork)$name,       # node names
     vertex.label.cex = .75 # font size of node names 
)

The advantage of this more elaborate technique for visualizing networks lies in its flexibility. In contrast to the basic network visualization shown above, this network object can be customized to show very specific aspects of the data.

Biplots

An alternative way to display co-occurrence patterns are biplots, which are used to display the results of Correspondence Analyses. They are particularly useful when one is not interested in one particular key term and its collocations but in the overall similarity of many terms. Semantic similarity in this case refers to a shared distributional profile: words can be deemed semantically similar if they have a similar co-occurrence profile, i.e. if they co-occur with the same elements. Biplots can be used to visualize collocations because collocates co-occur and thus share semantic properties, which renders them more similar to each other compared with other terms.

# load packages
library(FactoMineR)
library(factoextra)
# perform correspondence analysis
res.ca <- CA(collocates_redux, graph = FALSE)
# plot results
plot(res.ca, title = "", col.col = "gray20")

The biplot shows that natural and selection collocate, as they are plotted in close proximity. The advantage of the biplot becomes apparent when we focus on other terms because the biplot also shows other collocates such as vary and independently or might and injurious.

3 Determining Significance

In order to identify which words occur together significantly more frequently than would be expected by chance, we have to determine if their co-occurrence frequency is statistically significant. This can be done either for specific key terms or for the entire data. In this example, we will continue to focus on the key word selection.

To determine which terms collocate significantly with the key term (selection), we use multiple (or repeated) Fisher’s Exact tests which require the following information:

  • a = Number of times coocTerm occurs with term j

  • b = Number of times coocTerm occurs without term j

  • c = Number of times other terms occur with term j

  • d = Number of times other terms occur without term j
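
To illustrate the logic behind a single test (with invented frequencies rather than values from the Darwin data), a Fisher's Exact test on such a 2-by-2 table can be run as follows.

# invented frequencies, arranged as the 2-by-2 table described above
toy <- matrix(c(20,    # a: coocTerm occurs with term j
                80,    # b: coocTerm occurs without term j
                100,   # c: other terms occur with term j
                9800), # d: other terms occur without term j
              ncol = 2, byrow = TRUE)
# p-value of the Fisher's Exact test
fisher.test(toy)$p.value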

In a first step, we create a table which holds these quantities.

coocdf <- as.data.frame(as.matrix(collocates))
cooctb <- coocdf %>%
  dplyr::mutate(Term = rownames(coocdf)) %>%
  tidyr::gather(CoocTerm, TermCoocFreq,
                colnames(coocdf)[1]:colnames(coocdf)[ncol(coocdf)]) %>%
  dplyr::mutate(Term = factor(Term),
                CoocTerm = factor(CoocTerm)) %>%
  dplyr::mutate(AllFreq = sum(TermCoocFreq)) %>%
  dplyr::group_by(Term) %>%
  dplyr::mutate(TermFreq = sum(TermCoocFreq)) %>%
  dplyr::ungroup(Term) %>%
  dplyr::group_by(CoocTerm) %>%
  dplyr::mutate(CoocFreq = sum(TermCoocFreq)) %>%
  dplyr::arrange(Term) %>%
  dplyr::mutate(a = TermCoocFreq,
                b = TermFreq - a,
                c = CoocFreq - a, 
                d = AllFreq - (a + b + c)) %>%
  dplyr::mutate(NRows = nrow(coocdf))
# inspect results
datatable(head(cooctb, 100), rownames = FALSE, options = list(pageLength = 10, scrollX=T), filter = "none")

We now select the key term (selection). If we wanted to find all collocations that are present in the data, we would use the entire data rather than only the subset that contains selection.

cooctb_redux <- cooctb %>%
  dplyr::filter(Term == coocTerm)

Next, we calculate which terms are (significantly) over- and under-proportionately used with selection. It is important to note that this procedure informs about both over- and under-use! This is especially crucial when analyzing whether specific words are attracted to or repelled by certain constructions. Of course, this approach is not restricted to analyses of constructions; it can easily be generalized across domains and has also been used in machine learning applications.

coocStatz <- cooctb_redux %>%
  dplyr::rowwise() %>%
  dplyr::mutate(p = as.vector(unlist(fisher.test(matrix(c(a, b, c, d), 
                                                        ncol = 2, byrow = T))[1]))) %>%
  dplyr::mutate(x2 = as.vector(unlist(chisq.test(matrix(c(a, b, c, d), 
                                                        ncol = 2, byrow = T))[1]))) %>%
  dplyr::mutate(phi = sqrt((x2/(a + b + c + d)))) %>%
  dplyr::mutate(expected = as.vector(unlist(chisq.test(matrix(c(a, b, c, d), 
                                                              ncol = 2, byrow = T))$expected[1]))) %>%
  dplyr::mutate(Significance = ifelse(p <= .001, "p<.001",
                               ifelse(p <= .01, "p<.01",
                               ifelse(p <= .05, "p<.05", "n.s."))))
# inspect results
datatable(coocStatz, rownames = FALSE, options = list(pageLength = 10, scrollX=T), filter = "none")

We now add information to the table and remove superfluous columns so that the table can be parsed more easily.

coocStatz <- coocStatz %>%
  dplyr::ungroup() %>%
  dplyr::arrange(p) %>%
  dplyr::mutate(j = 1:n()) %>%
  # perform benjamini-hochberg correction
  dplyr::mutate(corr05 = ((j/NRows)*0.05)) %>%
  dplyr::mutate(corr01 = ((j/NRows)*0.01)) %>%
  dplyr::mutate(corr001 = ((j/NRows)*0.001)) %>%
  # calculate corrected significance status
  dplyr::mutate(CorrSignificance = ifelse(p <= corr001, "p<.001",
                ifelse(p <= corr01, "p<.01",
                       ifelse(p <= corr05, "p<.05", "n.s.")))) %>%
  dplyr::mutate(p = round(p, 6)) %>%
  dplyr::mutate(x2 = round(x2, 1)) %>%
  dplyr::mutate(phi = round(phi, 2)) %>%
  dplyr::arrange(p) %>%
  dplyr::select(-a, -b, -c, -d, -j, -NRows, -corr05, -corr01, -corr001) %>%
  dplyr::mutate(Type = ifelse(expected > TermCoocFreq, "Antitype", "Type"))
# inspect results
datatable(coocStatz, rownames = FALSE, options = list(pageLength = 10, scrollX=T), filter = "none")

The results show that selection collocates significantly with selection (of course) but also, as expected, with natural. The corrected p-values show that, after Benjamini-Hochberg correction for multiple/repeated testing (see Field, Miles, and Field 2012), these are the only significant collocates of selection. Corrections are necessary when performing multiple tests because otherwise the reliability of the test results would be strongly impaired, as repeated testing causes substantive \(\alpha\)-error inflation. The Benjamini-Hochberg correction used here is preferable over the more popular Bonferroni correction because it is less conservative and therefore less likely to result in \(\beta\)-errors (see again Field, Miles, and Field 2012).
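
As a side note, the Benjamini-Hochberg procedure calculated manually above can also be obtained with base R's p.adjust function, which implements the same correction in a slightly different form (it returns adjusted p-values that are then compared against the usual significance levels). A minimal sketch with invented p-values:

# invented example p-values
pvals <- c(0.0001, 0.004, 0.03, 0.2, 0.55)
# Benjamini-Hochberg adjusted p-values
p.adjust(pvals, method = "BH")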

4 Comparative Analyses of Collocation Strength

We now turn to analyses of changes in collocation strength over apparent time. The example focuses on adjective amplification in Australian English. The issue we will analyze here is whether we can unearth changes in the collocation pattern of adjective amplifiers such as very, really, or so. In other words, we will investigate if amplifiers associate with different adjectives among speakers from different age groups.

In a first step, we activate packages and load the data.

# load functions
source("https://SLCLADAL.github.io/rscripts/collexcovar.R")
# load data
ampaus <- read.table("https://SLCLADAL.github.io/data/ampaus.txt", sep = "\t", header = T)
# inspect results
datatable(head(ampaus, 100), rownames = FALSE, options = list(pageLength = 10, scrollX=T), filter = "none")

The data consists of three variables (Adjective, Variant, and Age). In a next step, we perform a co-varying collexeme analysis for really versus all other amplifiers. The collexcovar function takes a data set consisting of three columns labeled keys, colls, and time, so we first rename the columns of our data accordingly.

# rename data
ampaus <- ampaus %>%
  dplyr::rename(keys = Variant, colls = Adjective, time = Age)
# perform analysis
collexcovar_really <- collexcovar(data = ampaus, keyterm = "really")
# inspect results
datatable(collexcovar_really, rownames = FALSE, options = list(pageLength = 10, scrollX=T), filter = "none")

Now that the data has the correct labels, we repeat the co-varying collexeme analysis for the remaining amplifier variants.

# perform analysis
collexcovar_pretty <- collexcovar(data = ampaus, keyterm = "pretty")
collexcovar_so <- collexcovar(data = ampaus, keyterm = "so")
collexcovar_very <- collexcovar(data = ampaus, keyterm = "very")

For the remaining amplifiers (grouped as other), we have to change the label other to bin because the function already uses a label called other. Once we have changed other to bin, we perform the analysis.

ampaus <- ampaus %>%
  dplyr::mutate(keys = ifelse(keys == "other", "bin", keys))
collexcovar_other <- collexcovar(data = ampaus, keyterm = "bin")

Next, we combine the results of the co-varying collexeme analysis into a single table.

# combine tables
collexcovar_ampaus <- rbind(collexcovar_really, collexcovar_very, 
                     collexcovar_so, collexcovar_pretty, collexcovar_other)
collexcovar_ampaus <- collexcovar_ampaus %>%
  dplyr::rename(Age = time,
                Adjective = colls) %>%
  dplyr::mutate(Variant = ifelse(Variant == "bin", "other", Variant)) %>%
  dplyr::arrange(Age)
# inspect results
datatable(collexcovar_ampaus, rownames = FALSE, options = list(pageLength = 10, scrollX=T), filter = "none")

We now modify the data set so that we can plot the collocation strength across apparent time.

ampauscoll <- collexcovar_ampaus %>%
  dplyr::select(Age, Adjective, Variant, Type, phi) %>%
  dplyr::mutate(phi = ifelse(Type == "Antitype", -phi, phi)) %>%
  dplyr::select(-Type) %>%
  tidyr::spread(Adjective, phi) %>%
  tidyr::replace_na(list(bad = 0,
                         funny = 0,
                         hard = 0,
                         good = 0,
                         nice = 0,
                         other = 0)) %>%
  tidyr::gather(Adjective, phi, bad:other) %>%
  tidyr::spread(Variant, phi) %>%
  tidyr::replace_na(list(pretty = 0,
                    really = 0,
                    so = 0,
                    very = 0,
                    other = 0)) %>%
  tidyr::gather(Variant, phi, other:very)
ampauscoll
## # A tibble: 90 x 4
##    Age   Adjective Variant   phi
##    <chr> <chr>     <chr>   <dbl>
##  1 17-25 bad       other   -0.05
##  2 17-25 funny     other   -0.05
##  3 17-25 good      other   -0.05
##  4 17-25 hard      other    0.04
##  5 17-25 nice      other   -0.07
##  6 17-25 other     other    0.12
##  7 26-40 bad       other    0.02
##  8 26-40 funny     other   -0.06
##  9 26-40 good      other    0.01
## 10 26-40 hard      other   -0.04
## # ... with 80 more rows

In a final step, we visualize the results of our analysis.

ggplot(ampauscoll, aes(x = reorder(Age, desc(Age)),
                       y = phi, group = Variant, 
                      color = Variant, linetype = Variant)) +
  facet_wrap(vars(Adjective)) +
  geom_line() +
  guides(color=guide_legend(override.aes=list(fill=NA))) +
  scale_color_manual(values = 
                       c("gray70", "gray70", "gray20", "gray70", "gray20"),
                        name="Variant",
                        breaks = c("other", "pretty", "really", "so", "very"), 
                        labels = c("other", "pretty", "really", "so", "very")) +
  scale_linetype_manual(values = 
                          c("dotted", "dotdash", "longdash", "dashed", "solid"),
                        name="Variant",
                        breaks = c("other", "pretty", "really", "so", "very"), 
                        labels = c("other",  "pretty", "really", "so", "very")) +
  theme(legend.position="top", 
        axis.text.x = element_text(size=12),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank()) +
  theme_set(theme_bw(base_size = 12)) +
  coord_cartesian(ylim = c(-.2, .4)) +
  labs(x = "Age", y = "Collocation Strength") +
  guides(size = FALSE)+
  guides(alpha = FALSE)

The results show that the collocation strength of the different amplifier variants changes quite notably across age groups, and we can also see that there is considerable variability in how the collocation strengths change. For example, the collocation strength between bad and really decreases from older to younger speakers, while the reverse trend emerges for good, which means that really collocates more strongly with good among younger speakers than among older speakers.

5 Collostructional Analysis

Collostructional analysis (Stefanowitsch and Gries 2003, 2005) investigates the lexicogrammatical associations between constructions and lexical elements. There are three basic subtypes of collostructional analysis:

  • Simple Collexeme Analysis

  • Distinctive Collexeme Analysis

  • Co-Varying Collexeme Analysis

The analyses performed here are based on the collostructions package (Flach 2017).

Simple Collexeme Analysis


NOTE!

The code in this section does not work at the moment because the collostructions package has not yet been updated and is therefore still incompatible with the R version used for this tutorial. The code is merely displayed and would work if processed with an older R version.


Simple Collexeme Analysis determines if a word is significantly attracted to a specific construction within a corpus. The idea is that the frequency of the word that is attracted to a construction is significantly higher within the construction than would be expected by chance.

The example here analyzes the Go + Verb construction (e.g. “Go suck a nut!”). The question is which verbs are attracted to this construction (in this case, whether suck is attracted to this construction).

In a first step, we load the collostructions package and inspect the data. In this case, we will only use a sample of 100 rows from the data set as the output would become hard to read.

# load library
library(collostructions)
# draw a sample of the data
goVerb <- goVerb[sample(nrow(goVerb), 100),]
# inspect data
str(goVerb)

The collex function, which calculates the results of a simple collexeme analysis, requires a data frame consisting of three columns that contain in column 1 the word to be tested, in column 2 the frequency of the word in the construction (CXN.FREQ), and in column 3 the frequency of the word in the corpus (CORP.FREQ).
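
For orientation, a data frame in this format might look like the following sketch (the first column name, the words, and the counts are invented and merely illustrate the required three-column structure; CXN.FREQ and CORP.FREQ are the column names described above).

# invented example of the three-column input format expected by collex()
collex_input <- data.frame(WORD      = c("suck", "figure", "fetch"),
                           CXN.FREQ  = c(32, 15, 4),     # frequency in the construction
                           CORP.FREQ = c(210, 890, 120)) # frequency in the corpus
head(collex_input)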

To perform the simple collexeme analysis, we need the overall size of the corpus, the frequency with which a word occurs in the construction under investigation, and the frequency of that construction.

# define corpus size
crpsiz <- sum(goVerb$CORP.FREQ)
# perform simple collexeme analysis
scollex_results <- collex(goVerb, corpsize = crpsiz, am = "logl", 
                          reverse = FALSE, decimals = 5,
                          threshold = 1, cxn.freq = NULL, 
                          str.dir = FALSE)
# inspect results
head(scollex_results)

The results show which words are significantly attracted to the construction. If the ASSOC column did not show attr, then the word would be repelled by the construction.

Covarying Collexeme Analysis

Covarying collexeme analysis determines if the occurrence of a word in the first slot of a construction affects the occurrence of another word in the second slot of the construction. As such, covarying collexeme analysis analyzes constructions with two slots and how the lexical elements within the two slots affect each other.

The data we will use consists of two columns: the first column (CXN.TYPE) contains the word in the first slot (either cannot or can’t), and the second column (COLLEXEME) contains the word in the second slot, i.e. the collexeme, which is the verb that follows cannot or can’t. The first six rows of the data are shown below.

head(cannot)

We now perform the collexeme analysis and inspect the results.

covar_results <- collex.covar(cannot)
# inspect results
head(covar_results)

The results show whether words in the first and second slot attract or repel each other (ASSOC) and provide uncorrected significance levels.

Distinctive Collexeme Analysis

Distinctive Collexeme Analysis determines if the frequencies of items in two alternating constructions or under two conditions differ significantly. This analysis can be extended to analyze if the use of a word differs between two corpora.

Again, we use the cannot data.

collexdist_results <- collex.dist(cannot, raw = TRUE)
# inspect results
head(collexdist_results)

The results show if words are significantly attracted or repelled by either the contraction (can’t) or the full form (cannot). In this example, remember - like all other words shown in the results - is significantly attracted to the contraction (can’t).

Citation & Session Info

Schweinberger, Martin. 2020. Analyzing Co-Occurrences and Collocations in R. Brisbane: The University of Queensland. url: https://slcladal.github.io/collocations.html (Version 2020.09.26).

@manual{schweinberger2020coll,
  author = {Schweinberger, Martin},
  title = {Analyzing Co-Occurrences and Collocations in R},
  note = {https://slcladal.github.io/collocations.html},
  year = {2020},
  organization = {The University of Queensland, Australia. School of Languages and Cultures},
  address = {Brisbane},
  edition = {2020/09/25}
}
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] factoextra_1.0.7 FactoMineR_2.3   igraph_1.2.5     GGally_2.0.0    
##  [5] network_1.16.0   ggdendro_0.1.22  cluster_2.1.0    ggplot2_3.3.2   
##  [9] slam_0.1-47      Matrix_1.2-18    DT_0.15          tm_0.7-7        
## [13] NLP_0.2-0        stringr_1.4.0    dplyr_1.0.2     
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.0     xfun_0.16            purrr_0.3.4         
##  [4] sna_2.5              lattice_0.20-41      colorspace_1.4-1    
##  [7] vctrs_0.3.4          generics_0.0.2       htmltools_0.5.0     
## [10] yaml_2.2.1           utf8_1.1.4           rlang_0.4.7         
## [13] pillar_1.4.6         glue_1.4.2           withr_2.3.0         
## [16] RColorBrewer_1.1-2   lifecycle_0.2.0      plyr_1.8.6          
## [19] munsell_0.5.0        gtable_0.3.0         htmlwidgets_1.5.1   
## [22] leaps_3.1            coda_0.19-3          evaluate_0.14       
## [25] labeling_0.3         knitr_1.30           crosstalk_1.1.0.1   
## [28] parallel_4.0.2       fansi_0.4.1          Rcpp_1.0.5          
## [31] flashClust_1.01-2    scales_1.1.1         scatterplot3d_0.3-41
## [34] jsonlite_1.7.1       farver_2.0.3         digest_0.6.25       
## [37] stringi_1.5.3        ggrepel_0.8.2        grid_4.0.2          
## [40] cli_2.0.2            tools_4.0.2          magrittr_1.5        
## [43] tibble_3.0.3         tidyr_1.1.2          crayon_1.3.4        
## [46] pkgconfig_2.0.3      ellipsis_0.3.1       MASS_7.3-51.6       
## [49] xml2_1.3.2           assertthat_0.2.1     rmarkdown_2.3       
## [52] reshape_0.8.8        statnet.common_4.3.0 R6_2.4.1            
## [55] compiler_4.0.2



References

Field, Andy, Jeremy Miles, and Zoe Field. 2012. Discovering Statistics Using R. Sage.

Flach, Susanne. 2017. “Collostructions: An R Implementation for the Family of Collostructional Methods.” Package version v.0.1.0. https://sfla.ch/collostructions/.

Stefanowitsch, Anatol, and Stefan Th. Gries. 2003. “Collostructions: Investigating the Interaction of Words and Constructions.” International Journal of Corpus Linguistics 8 (2): 209–43.

Stefanowitsch, Anatol, and Stefan Th Gries. 2005. “Covarying Collexemes.” Corpus Linguistics and Linguistic Theory 1 (1): 1–43.

Wiedemann, Gregor, and Andreas Niekler. 2017. “Hands-on: A Five Day Text Mining Course for Humanists and Social Scientists in R.” In Proceedings of the Workshop on Teaching NLP for Digital Humanities (Teach4DH2017), Berlin, Germany, September 12, 2017., 57–65. http://ceur-ws.org/Vol-1918/wiedemann.pdf.