--- title: "Practical Overview of Selected Text Analytics Methods" author: "Martin Schweinberger" date: "" output: bookdown::html_document2 bibliography: bibliography.bib link-citations: yes --- ```{r uq1, echo=F, eval = T, fig.cap="", message=FALSE, warning=FALSE, out.width='100%'} knitr::include_graphics("https://slcladal.github.io/images/uq1.jpg") ``` # Introduction{-} ```{r diff, echo=FALSE, out.width= "15%", out.extra='style="float:right; padding:10px"'} knitr::include_graphics("https://slcladal.github.io/images/gy_chili.jpg") ``` This tutorial introduces Text Analysis [see @bernard1998text; @kabanoff1997introduction; @popping2000computer], i.e. computer-based analysis of language data or the (semi-)automated extraction of information from text.

Please cite as:
Schweinberger, Martin. 2023. *Practical Overview of Selected Text Analytics Methods*. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/textanalysis.html (Version 2023.09.24).


Most of the applications of Text Analysis are based upon a relatively limited number of key procedures or concepts (e.g. concordancing, word frequencies, annotation or tagging, collocation, text classification, Sentiment Analysis, Entity Extraction, Topic Modeling, etc.). In the following, we will explore these procedures and introduce some basic tools that help you perform these tasks.

To be able to follow this tutorial, we suggest you check out and familiarize yourself with the content of the LADAL **R Basics** tutorials.

Click [**here**](https://ladal.edu.au/content/kwics.Rmd)^[If you want to render the R Notebook on your machine, i.e. knitting the document to html or a pdf, you need to make sure that you have R and RStudio installed and you also need to download the [**bibliography file**](https://ladal.edu.au/content/bibliography.bib) and store it in the same folder where you store the Rmd file.] to download the **entire R Notebook** for this tutorial.

[![Binder](https://mybinder.org/badge_logo.svg)](https://binderhub.atap-binder.cloud.edu.au/v2/gh/SLCLADAL/interactive-notebooks-environment/main?urlpath=git-pull%3Frepo%3Dhttps%253A%252F%252Fgithub.com%252FSLCLADAL%252Finteractive-notebooks%26urlpath%3Dlab%252Ftree%252Finteractive-notebooks%252Fnotebooks%252Ftextanalysis_cb.ipynb%26branch%3Dmain)
Click [**here**](https://binderhub.atap-binder.cloud.edu.au/v2/gh/SLCLADAL/interactive-notebooks-environment/main?urlpath=git-pull%3Frepo%3Dhttps%253A%252F%252Fgithub.com%252FSLCLADAL%252Finteractive-notebooks%26urlpath%3Dlab%252Ftree%252Finteractive-notebooks%252Fnotebooks%252Ftextanalysis_cb.ipynb%26branch%3Dmain) to open an interactive Jupyter notebook that allows you to execute, change, and edit the code as well as to upload your own data.


**Preparation and session set up** This tutorial is based on R. If you're new to R or haven't installed it yet, you can find an introduction and installation instructions [here](https://ladal.edu.au/intror.html). To ensure the scripts below run smoothly, we need to install a number of R packages. If you've already installed these packages, you can skip this section. To install them, run the code below (which may take 1 to 5 minutes). ```{r prep1, echo=T, eval = F, message=FALSE, warning=FALSE} # install packages install.packages("quanteda") install.packages("dplyr") install.packages("stringr") install.packages("ggplot2") install.packages("tm") install.packages("udpipe") install.packages("tidytext") install.packages("wordcloud2") install.packages("quanteda.textstats") install.packages("quanteda.textplots") install.packages("ggraph") install.packages("flextable") # install packages that are used further below in this tutorial install.packages("tidyr") install.packages("tokenizers") install.packages("textplot") install.packages("here") # install klippy for copy-to-clipboard button in code chunks install.packages("remotes") remotes::install_github("rlesur/klippy") ``` Once all packages are installed, you can activate them by executing (running) the code chunk below. ```{r prep2, message=FALSE, warning=FALSE} # load packages library(dplyr) library(stringr) library(ggplot2) library(flextable) library(quanteda) library(tm) library(udpipe) library(tidytext) library(wordcloud2) library(quanteda.textstats) library(quanteda.textplots) library(ggraph) library(tidyr) # activate klippy for copy-to-clipboard button klippy::klippy() ``` Once you have initiated the session by executing the code shown above, you are good to go. # Concordancing{-} In Text Analysis, concordancing refers to the extraction of words and their immediate contexts from a given text or texts [@lindquist2009corpus]. Commonly, concordances are displayed as keyword-in-context (KWIC) displays, where the search term is shown with some preceding and following context. A more elaborate tutorial on how to perform concordancing with R is available [here](https://ladal.edu.au/kwics.html).

Concordancing is a text analysis technique that retrieves and displays occurrences of a chosen word or phrase within a text or dataset, showing the surrounding context. It's used to examine word usage, context, and linguistic patterns for research and language analysis purposes.
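To make the idea more concrete before we turn to real data, here is a minimal, purely illustrative sketch (assuming the `quanteda` package installed above is loaded): we create a toy sentence and extract a keyword with three words of context on either side.

```{r}
# minimal illustrative example (toy sentence, not the tutorial data)
toy <- "Alice was beginning to get very tired of sitting by her sister on the bank"
# tokenize the toy sentence and extract the keyword "tired" with 3 words of context
quanteda::kwic(quanteda::tokens(toy), pattern = "tired", window = 3)
```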


```{r antconc, echo=FALSE, out.width= "60%", out.extra='style="float:right; padding:10px"'} knitr::include_graphics("https://slcladal.github.io/images/AntConcConcordance.png") ``` Concordancing is a valuable tool that helps us understand how a term is used in the data, examine word frequency, extract examples, and serves as a fundamental step for more advanced language data analyses. In the following section, we'll use R to explore text, using Lewis Carroll's *Alice's Adventures in Wonderland* as our example text. We'll start by loading the text data, which is available from the LADAL GitHub repository for this tutorial. If you're interested in loading your own data, you can refer to [this tutorial](https://ladal.edu.au/intror.html#Working_with_text). We start by loading our example text. ```{r conc1, message=FALSE, warning=FALSE} # load text text <- base::readRDS(url("https://slcladal.github.io/data/alice.rda", "rb")) ``` ```{r conc1b, echo = F, message=FALSE, warning=FALSE} # inspect data text %>% as.data.frame() %>% head() %>% flextable() %>% flextable::set_table_properties(width = .75, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "First text elements of the example text") %>% flextable::border_outer() ``` The data still consists of short text snippets which is why we collapse these snippets and then split the collapsed data into chapters. ```{r conc2, message=FALSE, warning=FALSE} # combine and split into chapters text_chapters <- text %>% # paste all texts together into one long text paste0(collapse = " ") %>% # replace Chapter I to Chapter XVI with qwertz stringr::str_replace_all("(CHAPTER [XVI]{1,7}\\.{0,1}) ", "qwertz\\1") %>% # convert text to lower case tolower() %>% # split the long text into chapters stringr::str_split("qwertz") %>% # unlist the result (convert into simple vector) unlist() ``` ```{r conc2b, echo = F, message=FALSE, warning=FALSE} # inspect data text_chapters %>% substr(start=1, stop=500) %>% as.data.frame() %>% head() %>% flextable() %>% flextable::set_table_properties(width = .95, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "First 500 characters of the first 6 chapters of the example text") %>% flextable::border_outer() ``` After dividing the data into chapters, we conduct concordancing and extract KWICs (**K**ey**W**ord **I**n **C**ontext). This is accomplished using the `kwic` function from the `quanteda` package, which requires three main arguments: the data (x), the search pattern (pattern), and the window size. To begin, we'll create KWICs for the term *alice* using the `kwic` function from the `quanteda` package, as demonstrated below.

The `kwic` function in the `quanteda` package extracts KeyWord In Context (KWIC) information. Its main arguments are `x` (text data), `pattern` (search term), and `window` (context size) to display words around the pattern.


```{r conc3, message=FALSE, warning=FALSE} # create kwic kwic_alice <- quanteda::kwic(x = text_chapters, # define text(s) # define pattern pattern = "alice", # define window size window = 5) %>% # convert into a data frame as.data.frame() %>% # remove superfluous columns dplyr::select(-to, -from, -pattern) ``` ```{r conc3b, echo = F, message=FALSE, warning=FALSE} # inspect data kwic_alice %>% as.data.frame() %>% head() %>% flextable() %>% flextable::set_table_properties(width = .95, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "First 10 concordances of *alice* in the example text") %>% flextable::border_outer() ``` In our search, we have the flexibility to utilize regular expressions, allowing us to extract not only straightforward terms like *alice* but also more intricate and even abstract patterns. An abstract pattern may involve only a part of the term being specified. For example, if we specify *walk*, we can retrieve words like *walking*, *walker,* *walked*, and *walks* that contain this sequence. To effectively capture such abstract patterns, we employ what are known as *regular expressions*. When incorporating a regular expression in the `pattern` argument, it's crucial to specify the `valuetype` as `regex`, as illustrated below. ```{r conc5, message=FALSE, warning=FALSE} # create kwic kwic_walk <- quanteda::kwic(x = text_chapters, pattern = "walk.*", window = 5, valuetype = "regex") %>% # convert into a data frame as.data.frame() %>% # remove superfluous columns dplyr::select(-to, -from, -pattern) ``` ```{r conc5b, echo = F, message=FALSE, warning=FALSE} # inspect data kwic_walk %>% as.data.frame() %>% head(10) %>% flextable() %>% flextable::set_table_properties(width = .95, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "First 10 cleaned Concordances of *walk* in the example text") %>% flextable::border_outer() ``` When searching for expressions that represent phrases consisting of multiple elements, like *poor alice*, it's essential to explicitly specify in the `pattern` argument that we are searching for a `phrase`. ```{r conc7, message=FALSE, warning=FALSE} # create kwic kwic_pooralice <- quanteda::kwic(x = text_chapters, pattern = quanteda::phrase("poor alice"), window = 5) %>% # convert into a data frame as.data.frame() %>% # remove superfluous columns dplyr::select(-to, -from, -pattern) ``` ```{r conc8b, echo = F, message=FALSE, warning=FALSE} # inspect data kwic_pooralice %>% as.data.frame() %>% head() %>% flextable() %>% flextable::set_table_properties(width = .95, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "First cleaned concordances of the phrase *poor alice* in the example text") %>% flextable::border_outer() ``` We could continue our analysis by exploring in greater detail how the phrase *poor alice* is used in context, perhaps by adjusting the context window size or conducting similar investigations. However, for now, we'll shift our focus to learning how to extract and work with word frequencies. 
# Word Frequency{-} Frequency information is a cornerstone of text analytics, underpinning nearly all analytical methods. Identifying the most common words within a text is a fundamental technique in text analytics, serving as the bedrock of text analysis. This frequency data is typically organized into word frequency lists, which consist of word forms and their corresponding frequencies within a given text or collection of texts. Given the paramount importance of extracting word frequency lists, we will proceed to demonstrate how to do so. In the first step, we'll continue with our example text, convert the chapters to lowercase, eliminate non-word symbols (including punctuation), and then break down the text (the chapters) into individual words. ```{r wf1, message=FALSE, warning=FALSE} # process the text and save result in "text_words" text_words <- text %>% # convert all text to lowercase tolower() %>% # remove non-word characters, keeping spaces str_replace_all("[^[:alpha:][:space:]]*", "") %>% # remove punctuation tm::removePunctuation() %>% # squish consecutive spaces into a single space stringr::str_squish() %>% # split the text into individual words, separated by spaces stringr::str_split(" ") %>% # unlist the result into a single vector of words unlist() ``` ```{r wf1b, echo = F, message=FALSE, warning=FALSE} # inspect data text_words %>% as.data.frame() %>% head(15) %>% flextable() %>% flextable::set_table_properties(width = .5, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "First 15 words in the example text") %>% flextable::border_outer() ``` With our word vector in hand, let's effortlessly construct a table that showcases a word frequency list, as demonstrated below. ```{r wf2, message=FALSE, warning=FALSE} # Create a word frequency table from the 'text_words' data wfreq <- text_words %>% # count the frequency of each unique word table() %>% # convert the frequency table into a data frame as.data.frame() %>% # arrange the data frame rows in descending order of word frequency arrange(desc(Freq)) %>% # rename the columns for clarity dplyr::rename(word = 1, frequency = 2) ``` ```{r wf2b, echo = F, message=FALSE, warning=FALSE} # inspect data wfreq %>% as.data.frame() %>% head(15) %>% flextable() %>% flextable::set_table_properties(width = .5, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "Top 15 words in the example text by frequency.") %>% flextable::border_outer() ``` The most common words often consist of function words that may lack significance. To enhance our analysis, we'll eliminate these function words, often referred to as *stopwords*, from the frequency list. Let's take a look at the refined list without stopwords. 
```{r wf4, message=FALSE, warning=FALSE} # create table wo stopwords wfreq_wostop <- wfreq %>% anti_join(tidytext::stop_words, by = "word") %>% dplyr::filter(word != "") ``` ```{r wf5b, echo = F, message=FALSE, warning=FALSE} # inspect data wfreq_wostop %>% as.data.frame() %>% head(15) %>% flextable() %>% flextable::set_table_properties(width = .5, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "Top 15 lexical words in the example text by frequency.") %>% flextable::border_outer() ``` Word frequency lists can be presented visually in several ways, with bar graphs being the most common and intuitive choice for visualization. ```{r wf6, message=FALSE, warning=FALSE} wfreq_wostop %>% head(10) %>% ggplot(aes(x = reorder(word, -frequency, mean), y = frequency)) + geom_bar(stat = "identity") + labs(title = "10 most frequent non-stop words \nin the example text", x = "") + theme(axis.text.x = element_text(angle = 45, size = 12, hjust = 1)) ``` ## Wordclouds{-} Alternatively, word frequency lists can be visually represented as word clouds, though they provide less detailed information. Word clouds are visual representations where words appear larger based on their frequency, offering a quick visual summary of word importance in a dataset. ```{r wc1, message=FALSE, warning=FALSE} # create a word cloud visualization text %>% # Convert text data to a quanteda corpus quanteda::corpus() %>% # tokenize the corpus, removing punctuation quanteda::tokens(remove_punct = TRUE) %>% # remove English stopwords quanteda::tokens_remove(stopwords("english")) %>% # create a document-feature matrix (DFM) quanteda::dfm() %>% # generate a word cloud using textplot_wordcloud quanteda.textplots::textplot_wordcloud( # maximum words to display in the word cloud max_words = 150, # determine the maximum size of words max_size = 10, # determine the minimum size of words min_size = 1.5, # Define a color palette for the word cloud color = scales::viridis_pal(option = "A")(10)) ```

The `textplot_wordcloud` function creates a word cloud visualization of text data in R. Its main arguments are `x` (a Document-Feature Matrix or DFM), `max_words` (maximum words to display), and `color` (color palette for the word cloud).
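Since the `wordcloud2` package is loaded in the session set-up but has not been used so far, we can also sketch an alternative, interactive word cloud based on the frequency table created above (`wfreq_wostop`); `wordcloud2` simply expects a data frame with words in the first column and frequencies in the second.

```{r}
# alternative, interactive word cloud with wordcloud2
wfreq_wostop %>%
  # use the 100 most frequent lexical words
  head(100) %>%
  # generate the interactive word cloud (size controls the overall scaling)
  wordcloud2::wordcloud2(size = .7)
```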


Another form of word clouds, known as *comparison clouds*, is helpful in discerning disparities between texts. For instance, we can load various texts and assess how they vary in terms of word frequencies. To illustrate this, we'll load Herman Melville's *Moby Dick*, George Orwell's *1984*, and Charles Darwin's *Origin*. First, we'll load these texts and combine them into single documents. ```{r wc2, message=FALSE, warning=FALSE} # load data orwell_sep <- base::readRDS(url("https://slcladal.github.io/data/orwell.rda", "rb")) orwell <- orwell_sep %>% paste0(collapse = " ") melville_sep <- base::readRDS(url("https://slcladal.github.io/data/melville.rda", "rb")) melville <- melville_sep %>% paste0(collapse = " ") darwin_sep <- base::readRDS(url("https://slcladal.github.io/data/darwin.rda", "rb")) darwin <- darwin_sep %>% paste0(collapse = " ") ``` Now, we generate a corpus object from these texts and create a variable with the author name. ```{r wc3, message=FALSE, warning=FALSE} corp_dom <- quanteda::corpus(c(darwin, melville, orwell)) attr(corp_dom, "docvars")$Author = c("Darwin", "Melville", "Orwell") ``` Now, we can remove so-called *stopwords* (non-lexical function words) and punctuation and generate the comparison cloud. ```{r wc4, message=FALSE, warning=FALSE} # create a comparison word cloud for a corpus corp_dom %>% # tokenize the corpus, removing punctuation, symbols, and numbers quanteda::tokens(remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE) %>% # remove English stopwords quanteda::tokens_remove(stopwords("english")) %>% # create a Document-Feature Matrix (DFM) quanteda::dfm() %>% # group the DFM by the 'Author' column from 'corp_dom' quanteda::dfm_group(groups = corp_dom$Author) %>% # trim the DFM, keeping terms that occur at least 10 times quanteda::dfm_trim(min_termfreq = 10, verbose = FALSE) %>% # generate a comparison word cloud quanteda.textplots::textplot_wordcloud( # create a comparison word cloud comparison = TRUE, # set colors for different groups color = c("darkgray", "orange", "purple"), # define the maximum number of words to display in the word cloud max_words = 150) ``` ## Frequency changes{-} We can also explore how the term *alice* is used throughout the chapters of our example text. To begin, let's extract the word count for each chapter. ```{r wf13, message=FALSE, warning=FALSE} # extract the number of words per chapter Words <- text_chapters %>% # split each chapter into words based on spaces stringr::str_split(" ") %>% # measure the length (number of words) in each chapter lengths() # display the resulting data, which contains the word counts per chapter Words ``` Next, we extract the number of matches in each chapter. ```{r wf14, message=FALSE, warning=FALSE} # extract the number of matches of "alice" per chapter Matches <- text_chapters %>% # count the number of times "alice" appears in each chapter stringr::str_count("alice") # display the resulting data, which shows the number of matches of "alice" per chapter Matches ``` Now, we extract the names of the chapters and create a table with the chapter names and the relative frequency of matches per 1,000 words. ```{r wf15, message=FALSE, warning=FALSE} # extract chapters Chapters <- paste0("chapter", 0:(length(text_chapters)-1)) Chapters ``` Next, we combine the information in a single data frame and add a column containing the relative frequency of *alice* in each chapter. 
```{r wf16, message=FALSE, warning=FALSE} # create table of results tb <- data.frame(Chapters, Matches, Words) %>% # create new variable with the relative frequency dplyr::mutate(Frequency = round(Matches/Words*1000, 2)) %>% # reorder chapters dplyr::mutate(Chapters = factor(Chapters, levels =c(paste0("chapter", 0:12)))) ``` ```{r wf17b, echo = F, message=FALSE, warning=FALSE} # inspect data tb %>% as.data.frame() %>% head(15) %>% flextable() %>% flextable::set_table_properties(width = .5, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "Relative frequency of *alice* (per 1,000 words) across the chapters of the example text.") %>% flextable::border_outer() ``` Now, let's visualize the relative frequencies of our search term in each chapter. ```{r wf18, echo=T, eval = T, message=FALSE, warning=FALSE} # create a plot using ggplot ggplot(tb, aes(x = Chapters, y = Frequency, group = 1)) + # add a smoothed line (trendline) in purple color geom_smooth(color = "purple") + # add a line plot in dark gray color geom_line(color = "darkgray") + # remove fill from the legend guides(color = guide_legend(override.aes = list(fill = NA))) + # set a white and black theme theme_bw() + # rotate x-axis text by 45 degrees and adjust alignment theme(axis.text.x = element_text(angle = 45, hjust = 1))+ # customize the y-axis label scale_y_continuous(name = "Relative Frequency (per 1,000 words)") ``` ## Dispersion plots{-} To show when in a text or in a collection of texts certain terms occur, we can use *dispersion plots*. The `quanteda` package offers a very easy-to-use function `textplot_xray` to generate dispersion plots. ```{r dp, warning=F, message=F} # add chapter names names(text_chapters) <- Chapters # generate corpus from chapters text_corpus <- quanteda::corpus(text_chapters) # generate dispersion plots quanteda.textplots::textplot_xray(kwic(text_corpus, pattern = "alice"), kwic(text_corpus, pattern = "hatter"), sort = T) ``` We can modify the plot by saving it into an object and then use `ggplot` to modify its appearance. ```{r dp2, warning=F, message=F} # generate and save dispersion plots dp <- quanteda.textplots::textplot_xray(kwic(text_corpus, pattern = "alice"), kwic(text_corpus, pattern = "cat")) # modify plot dp + aes(color = keyword) + scale_color_manual(values = c('red', 'blue')) + theme(legend.position = "none") ``` ## Over- and underuse{-} Frequency data serves as a valuable lens through which we can explore the essence of a text. For instance, when we examine private dialogues, we often encounter higher occurrences of second-person pronouns compared to more formal text types like scripted monologues or speeches. This insight holds the potential to aid in text classification and assessing text formality. To illustrate, consider the following statistics: the counts of second-person pronouns, *you* and *your*, as well as the total word count excluding these pronouns in private dialogues versus scripted monologues within the Irish segment of the International Corpus of English (ICE). Additionally, the table provides the percentage of second-person pronouns in both text types, enabling us to discern whether private dialogues indeed contain more of these pronouns compared to scripted monologues, such as speeches.
```{r ou1, eval=T, echo=F, message=FALSE, warning=FALSE, paged.print=FALSE} # create a matrix 'numbers' with rows and columns numbers <- matrix(c("you, your", "6761", "659", "Other words", "259625", "105295", "Percent", "2.60", "0.63"), byrow = TRUE, nrow = 3) # assign column names to the matrix colnames(numbers) <- c("", "Private dialogues", "Scripted monologues") ``` ```{r ou1b, echo = F, message=FALSE, warning=FALSE} # inspect data ndf <- numbers %>% as.data.frame() colnames(ndf)[1] <- "." ndf %>% flextable() %>% flextable::set_table_properties(width = .5, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "Use of 2nd person pronouns (and all other words) in ICE Ireland.") %>% flextable::border_outer() ``` This straightforward example highlights that second-person pronouns amount to 2.60 percent (relative to all other words) in private dialogues, yet only 0.63 percent in scripted monologues. To vividly illustrate such variations, we can employ association and mosaic plots, which offer effective visual presentations. ```{r ou2, message=FALSE, warning=FALSE, paged.print=FALSE} # create a matrix 'd' with the specified values and dimensions d <- matrix(c(6761, 659, 259625, 105295), nrow = 2, byrow = TRUE) # assign column names to the matrix colnames(d) <- c("D", "M") # assign row names to the matrix rownames(d) <- c("you, your", "Other words") # generate an association plot using 'assocplot' function assocplot(d) ``` In an association plot, bars above the dashed line signify relative overuse, while bars below indicate relative underuse. Accordingly, the plot reveals that in monologues, there's an underuse of *you* and *your* and an overuse of *other words*. Conversely, in dialogues, the opposite patterns emerge: an overuse of *you* and *your* and an underuse of *other words*. This visual representation helps us grasp the distinctive word usage patterns between these text types. # Collocations{-} Collocations are like linguistic buddies. They're those word pairs that just seem to go hand in hand, like *Merry Christmas*. You see, these words have a special relationship – they occur together way more often than if words were just randomly strung together in a sentence. Before we start, though, it is important to understand that identifying word pairs (w1 and w2) that collocate (i.e. collocations) and determining their association strength (a measure of how strongly attracted words are to each other) is based on the co-occurrence frequencies of word pairs in a contingency table (see below, *O* is short for *observed frequency*).

| | w~2~ present | w~2~ absent | |
| :--- | :-----: | --------: | --- |
| **w~1~ present** | O~11~ | O~12~ | = R~1~ |
| **w~1~ absent** | O~21~ | O~22~ | = R~2~ |
| | = C~1~ | = C~2~ | = N |

In the following, we will extract collocations from the sentences in our example text. In a first step, we split our example text into sentences and clean the data (removing punctuation, converting to lower case, etc.).
```{r} text %>% # concatenate the elements in the 'text' object paste0(collapse = " ") %>% # split text into sentences tokenizers::tokenize_sentences() %>% # unlist sentences unlist() %>% # remove non-word characters stringr::str_replace_all("\\W", " ") %>% stringr::str_replace_all("[^[:alnum:] ]", " ") %>% # remove superfluous white spaces stringr::str_squish() %>% # convert to lower case and save in 'sentences' object tolower() -> sentences ``` ```{r echo = F, message=FALSE, warning=FALSE} sentences %>% as.data.frame() %>% head(10) %>% flextable() %>% flextable::set_table_properties(width = .95, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "First 10 sentences in the example text") %>% flextable::border_outer() ``` Next, we tabulate the data and reformat it so that we have the relevant information to calculate the association statistics (word 1 and word 2 as well as O11, O12, O21, and O22). ```{r} # tokenize the 'sentences' data using quanteda package sentences %>% quanteda::tokens() %>% # create a document-feature matrix (dfm) using quanteda quanteda::dfm() %>% # create a feature co-occurrence matrix (fcm); tri = FALSE returns the full (not only the upper-triangular) matrix quanteda::fcm(tri = FALSE) %>% # tidy the data using tidytext package tidytext::tidy() %>% # rearrange columns for better readability dplyr::relocate(term, document, count) %>% # rename columns for better interpretation dplyr::rename(w1 = 1, w2 = 2, O11 = 3) -> coll_basic ``` ```{r echo = F, message=FALSE, warning=FALSE} coll_basic %>% as.data.frame() %>% head(10) %>% flextable() %>% flextable::set_table_properties(width = .95, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "First 10 rows of basic collocation table") %>% flextable::border_outer() ``` We now enhance our table by calculating all observed frequencies (O11, O12, O21, O22) as well as row totals (R1, R2), column totals (C1, C2), and the overall total (N). ```{r} # calculate the total number of observations (N) coll_basic %>% dplyr::mutate(N = sum(O11)) %>% # calculate R1, O12, and R2 dplyr::group_by(w1) %>% dplyr::mutate(R1 = sum(O11), O12 = R1 - O11, R2 = N - R1) %>% dplyr::ungroup(w1) %>% # calculate C1, O21, C2, and O22 dplyr::group_by(w2) %>% dplyr::mutate(C1 = sum(O11), O21 = C1 - O11, C2 = N - C1, O22 = R2 - O21) -> colldf ``` ```{r echo = F, message=FALSE, warning=FALSE} colldf %>% as.data.frame() %>% head(10) %>% flextable() %>% flextable::set_table_properties(width = .95, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "First 10 rows of collocation table") %>% flextable::border_outer() ``` We could calculate all collocations in the corpus (based on co-occurrence within the same sentence) or we can find collocations of a specific term - here, we will find collocations of the term *alice*. Now that we have all the relevant information, we will reduce the data and add additional information so that the computation of the association measures runs smoothly.
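For reference, the expected frequencies that we add in the next step are derived from the row totals, column totals, and the overall total in the usual way, and the association measures computed further below build on them (the formulas below simply mirror the code in the following chunks):

$$E_{11} = \frac{R_1 \times C_1}{N}, \quad E_{12} = \frac{R_1 \times C_2}{N}, \quad E_{21} = \frac{R_2 \times C_1}{N}, \quad E_{22} = \frac{R_2 \times C_2}{N}$$

$$MI = \log_2 \frac{O_{11}}{E_{11}}, \qquad \phi = \sqrt{\frac{X^2}{N}} \quad \text{with} \quad X^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$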
```{r eval=T, echo=T, message=FALSE, warning=FALSE, paged.print=FALSE} # reduce and complement data colldf %>% # determine Term dplyr::filter(w1 == "alice", # set minimum number of occurrences of w2 (O11+O21) > 10, # set minimum number of co-occurrences of w1 and w2 O11 > 5) %>% dplyr::rowwise() %>% dplyr::mutate(E11 = R1 * C1 / N, E12 = R1 * C2 / N, E21 = R2 * C1 / N, E22 = R2 * C2 / N) -> colldf_redux ``` ```{r echo = F, message=FALSE, warning=FALSE} colldf_redux %>% as.data.frame() %>% head(10) %>% flextable() %>% flextable::set_table_properties(width = .95, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "First 10 rows of reduced collocation data frame") %>% flextable::border_outer() ``` Now we can calculate the collocation statistics (the association strength). ```{r message=FALSE, warning=FALSE, paged.print=FALSE} colldf_redux %>% # determine number of rows dplyr::mutate(Rws = nrow(.)) %>% # work row-wise dplyr::rowwise() %>% # calculate fishers' exact test dplyr::mutate(p = as.vector(unlist(fisher.test(matrix(c(O11, O12, O21, O22), ncol = 2, byrow = T))[1]))) %>% # extract AM # 1. bias towards top left dplyr::mutate(btl_O12 = ifelse(C1 > R1, 0, R1-C1), btl_O11 = ifelse(C1 > R1, R1, R1-btl_O12), btl_O21 = ifelse(C1 > R1, C1-R1, C1-btl_O11), btl_O22 = ifelse(C1 > R1, C2, C2-btl_O12), # 2. bias towards top right btr_O11 = 0, btr_O21 = R1, btr_O12 = C1, btr_O22 = C2-R1) %>% # 3. calculate AM dplyr::mutate(upp = btl_O11/R1, low = btr_O11/R1, op = O11/R1) %>% dplyr::mutate(AM = op / upp) %>% # remove superfluous columns dplyr::select(-btr_O21, -btr_O12, -btr_O22, -btl_O12, -btl_O11, -btl_O21, -btl_O22, -btr_O11) %>% # extract x2 statistics dplyr::mutate(X2 = (O11-E11)^2/E11 + (O12-E12)^2/E12 + (O21-E21)^2/E21 + (O22-E22)^2/E22) %>% # extract association measures dplyr::mutate(phi = sqrt((X2 / N)), MI = log2(O11 / E11), DeltaP12 = (O11 / (O11 + O12)) - (O21 / (O21 + O22)), DeltaP21 = (O11 / (O11 + O21)) - (O21 / (O12 + O22)), LogOddsRatio = log(((O11 + 0.5) * (O22 + 0.5)) / ( (O12 + 0.5) * (O21 + 0.5) ))) %>% # determine Bonferroni corrected significance dplyr::mutate(Sig_corrected = dplyr::case_when(p / Rws > .05 ~ "n.s.", p / Rws > .01 ~ "p < .05*", p / Rws > .001 ~ "p < .01**", p / Rws <= .001 ~ "p < .001***", T ~ "N.A.")) %>% # round p-value dplyr::mutate(p = round(p, 5)) %>% # filter out non significant results dplyr::filter(Sig_corrected != "n.s.", # filter out instances where the w1 and w2 repel each other E11 < O11) %>% # arrange by phi (association measure) dplyr::arrange(-AM) %>% # remove superfluous columns dplyr::select(-any_of(c("TermCoocFreq", "AllFreq", "NRows", "E12", "E21", "E22", "O12", "O21", "O22", "R1", "R2", "C1", "C2"))) -> assoc_tb ``` ```{r echo = F, message=FALSE, warning=FALSE} assoc_tb %>% as.data.frame() %>% head(10) %>% flextable() %>% flextable::set_table_properties(width = .95, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "First 10 rows of association statistics table") %>% flextable::border_outer() ``` The resulting table shows collocations in the example text descending by collocation strength. 
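If you only need a quick, ready-made ranking of collocations rather than the collocates of a single keyword, the `quanteda.textstats` package (already loaded above) provides `textstat_collocations`. Note that, unlike the approach above, it scores *adjacent* word pairs rather than co-occurrence within sentences, so the results are not directly comparable; the following is just a sketch using the `sentences` object created earlier.

```{r}
# quick alternative: score adjacent word pairs (bigrams) by association strength
sentences %>%
  # tokenize the sentences
  quanteda::tokens(remove_punct = TRUE) %>%
  # identify two-word collocations that occur at least 5 times
  quanteda.textstats::textstat_collocations(size = 2, min_count = 5) %>%
  # sort by the lambda association score and show the top 10
  dplyr::arrange(-lambda) %>%
  head(10)
```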
We now use a [network graph](https://ladal.edu.au/net.html), or network for short, to visualise the collocations of our keyword (*alice*). Networks are a powerful and versatile visual representation used to depict relationships or connections among various elements. Network graphs typically consist of nodes, representing individual entities, and edges, indicating the connections or interactions between these entities. We start by extracting the tokens that we want to show (the top 20 collocates of *alice*). ```{r} # sort the assoc_tb data frame in descending order based on the 'phi' column top20colls <- assoc_tb %>% dplyr::arrange(-phi) %>% # select the top 20 rows after sorting head(20) %>% # extract the 'w2' column (the collocates) dplyr::pull(w2) %>% # add keyword c("alice") # inspect the top 20 tokens with the highest 'phi' values top20colls ``` We then need to generate a feature co-occurrence matrix from a document-feature matrix based on the cleaned, lower case sentences of our text. ```{r} # tokenize the 'sentences' data using quanteda package keyword_fcm <- sentences %>% quanteda::tokens() %>% # create a document-feature matrix (dfm) from the tokens quanteda::dfm() %>% # keep only the features contained in 'top20colls' (the collocates plus the keyword) quanteda::dfm_select(pattern = top20colls) %>% # create a symmetric feature co-occurrence matrix (fcm) quanteda::fcm(tri = FALSE) # inspect the first 6 rows and 6 columns of the resulting fcm keyword_fcm[1:6, 1:6] ``` ```{r} # create a network plot using the fcm quanteda.textplots::textplot_network(keyword_fcm, # set the transparency of edges to 0.8 for visibility edge_alpha = 0.8, # set the color of edges to gray edge_color = "gray", # set the size of edges to 2 for better visibility edge_size = 2, # adjust the size of vertex labels # based on the logarithm of row sums of the fcm vertex_labelsize = log(rowSums(keyword_fcm))) ``` # Keywords {-} Keywords play a pivotal role in text analysis, serving as distinctive terms that hold particular significance within a given text, context, or collection. This approach revolves around pinpointing words closely associated with a specific text. In simpler terms, keyness analysis strives to identify words that distinctly represent the content of a given text.

Keyness is a statistical measure that helps identify terms that are characteristic of a text: it assesses how prominently a term stands out in a target text by comparing its observed frequency there to the frequency that would be expected based on a reference corpus (the background data).
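Before we build a keyness table by hand, it is worth noting that `quanteda.textstats` (loaded above) also ships a ready-made keyness function, `textstat_keyness`. The following sketch is purely illustrative: it uses quanteda's bundled inaugural-address corpus rather than the two novels loaded below, so that it can be run independently of the rest of this section.

```{r}
# illustrative sketch: ready-made keyness with textstat_keyness()
dfmat <- quanteda::data_corpus_inaugural %>%
  # keep only the more recent speeches
  quanteda::corpus_subset(Year > 1945) %>%
  # tokenize and remove punctuation and stopwords
  quanteda::tokens(remove_punct = TRUE) %>%
  quanteda::tokens_remove(quanteda::stopwords("english")) %>%
  # create a document-feature matrix
  quanteda::dfm()
# compare the 2017 address (target) against all other addresses (reference)
quanteda.textstats::textstat_keyness(dfmat,
                                     target = quanteda::docvars(dfmat, "Year") == 2017) %>%
  head(10)
```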


To determine if a token is a keyword and if it occurs significantly more frequently in a target corpus compared to a reference corpus, we use the following information (that is provided by the table above): * O11 = Number of times word~x~ occurs in `target corpus` * O12 = Number of times word~x~ occurs in `reference corpus` (without `target corpus`) * O21 = Number of times other words occur in `target corpus` * O22 = Number of times other words occur in `reference corpus` Example: | | target corpus | reference corpus | | :--- | :-----: | --------: | --- | **token** | O~11~ | O~12~ | = R~1~ | **other tokens** | O~21~ | O~22~ | = R~2~ | | = C~1~ | = C~2~ | = N | First, we’ll load two texts. ```{r} # load data text1 <- base::readRDS(url("https://slcladal.github.io/data/orwell.rda", "rb")) %>% paste0(collapse = " ") text2 <- base::readRDS(url("https://slcladal.github.io/data/melville.rda", "rb")) %>% paste0(collapse = " ") ``` ```{r echo = F, message=FALSE, warning=FALSE} text1 %>% substr(start=1, stop=200) %>% as.data.frame() %>% flextable() %>% flextable::set_table_properties(width = .95, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "First 200 characters of text 1") %>% flextable::border_outer() ``` As you can see, text1 is George Orwell's *1984*. ```{r echo = F, message=FALSE, warning=FALSE} text2 %>% substr(start=1, stop=200) %>% as.data.frame() %>% flextable() %>% flextable::set_table_properties(width = .95, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "First 200 characters of text 2") %>% flextable::border_outer() ``` The table shows that text2 is Herman Melville's *Moby Dick*. After loading the two texts, we create a frequency table of first text. ```{r} text1_words <- text1 %>% # remove non-word characters stringr::str_remove_all("[^[:alpha:] ]") %>% # convert to lower tolower() %>% # tokenize the corpus files quanteda::tokens(remove_punct = T, remove_symbols = T, remove_numbers = T) %>% # unlist the tokens to create a data frame unlist() %>% as.data.frame() %>% # rename the column to 'token' dplyr::rename(token = 1) %>% # group by 'token' and count the occurrences dplyr::group_by(token) %>% dplyr::summarise(n = n()) %>% # add column stating where the frequency list is 'from' dplyr::mutate(type = "text1") ``` Now, we create a frequency table of second text. ```{r} text2_words <- text2 %>% # remove non-word characters stringr::str_remove_all("[^[:alpha:] ]") %>% # convert to lower tolower() %>% # tokenize the corpus files quanteda::tokens(remove_punct = T, remove_symbols = T, remove_numbers = T) %>% # unlist the tokens to create a data frame unlist() %>% as.data.frame() %>% # rename the column to 'token' dplyr::rename(token = 1) %>% # group by 'token' and count the occurrences dplyr::group_by(token) %>% dplyr::summarise(n = n()) %>% # add column stating where the frequency list is 'from' dplyr::mutate(type = "text2") ``` In a next step, we combine the tables. 
```{r} texts_df <- dplyr::left_join(text1_words, text2_words, by = c("token")) %>% # rename columns and select relevant columns dplyr::rename(text1 = n.x, text2 = n.y) %>% dplyr::select(-type.x, -type.y) %>% # replace NA values with 0 in 'corpus' and 'kwic' columns tidyr::replace_na(list(text1 = 0, text2 = 0)) ``` ```{r echo = F, message=FALSE, warning=FALSE} texts_df %>% as.data.frame() %>% head(10) %>% flextable() %>% flextable::set_table_properties(width = .95, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "Frequency table of tokens in text1 and text2") %>% flextable::border_outer() ``` We now calculate the frequencies of the observed and expected frequencies as well as the row and column totals. ```{r} texts_df %>% dplyr::mutate(text1 = as.numeric(text1), text2 = as.numeric(text2)) %>% dplyr::mutate(C1 = sum(text1), C2 = sum(text2), N = C1 + C2) %>% dplyr::rowwise() %>% dplyr::mutate(R1 = text1+text2, R2 = N - R1, O11 = text1, O12 = R1-O11, O21 = C1-O11, O22 = C2-O12) %>% dplyr::mutate(E11 = (R1 * C1) / N, E12 = (R1 * C2) / N, E21 = (R2 * C1) / N, E22 = (R2 * C2) / N) %>% dplyr::select(-text1, -text2) -> stats_tb2 ``` ```{r echo = F, message=FALSE, warning=FALSE} stats_tb2 %>% as.data.frame() %>% head(10) %>% flextable() %>% flextable::set_table_properties(width = .95, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "First 10 rows of the processed frequency table") %>% flextable::border_outer() ``` We can now calculate the association strength which, in this case serves as a keyness measure. ```{r} stats_tb2 %>% # determine number of rows dplyr::mutate(Rws = nrow(.)) %>% # work row-wise dplyr::rowwise() %>% # calculate fishers' exact test dplyr::mutate(p = as.vector(unlist(fisher.test(matrix(c(O11, O12, O21, O22), ncol = 2, byrow = T))[1]))) %>% # extract AM # 1. bias towards top left dplyr::mutate(btl_O12 = ifelse(C1 > R1, 0, R1-C1), btl_O11 = ifelse(C1 > R1, R1, R1-btl_O12), btl_O21 = ifelse(C1 > R1, C1-R1, C1-btl_O11), btl_O22 = ifelse(C1 > R1, C2, C2-btl_O12), # 2. bias towards top right btr_O11 = 0, btr_O21 = R1, btr_O12 = C1, btr_O22 = C2-R1) %>% # 3. 
calculate AM dplyr::mutate(upp = btl_O11/R1, low = btr_O11/R1, op = O11/R1) %>% dplyr::mutate(AM = op / upp) %>% # remove superfluous columns dplyr::select(-btr_O21, -btr_O12, -btr_O22, -btl_O12, -btl_O11, -btl_O21, -btl_O22, -btr_O11) %>% # extract x2 statistics dplyr::mutate(X2 = (O11-E11)^2/E11 + (O12-E12)^2/E12 + (O21-E21)^2/E21 + (O22-E22)^2/E22) %>% # extract expected frequency dplyr::mutate(Exp = E11) %>% # extract association measures dplyr::mutate(phi = sqrt((X2 / N)), MI = log2(O11 / E11), DeltaP12 = (O11 / (O11 + O12)) - (O21 / (O21 + O22)), DeltaP21 = (O11 / (O11 + O21)) - (O21 / (O12 + O22)), LogOddsRatio = log(((O11 + 0.5) * (O22 + 0.5)) / ( (O12 + 0.5) * (O21 + 0.5) ))) %>% # determine Bonferroni corrected significance dplyr::mutate(Sig_corrected = dplyr::case_when(p / Rws > .05 ~ "n.s.", p / Rws > .01 ~ "p < .05*", p / Rws > .001 ~ "p < .01**", p / Rws <= .001 ~ "p < .001***", T ~ "N.A.")) %>% # round p-value dplyr::mutate(p = round(p, 5), type = ifelse(E11 > O11, "antitype", "type")) %>% # filter out non significant results dplyr::filter(Sig_corrected != "n.s.") %>% # arrange by phi (association measure) dplyr::arrange(-DeltaP12) %>% # remove superfluous columns dplyr::select(-any_of(c("TermCoocFreq", "AllFreq", "NRows", "E12", "E21", "E22", "O12", "O21", "O22", "R1", "R2", "C1", "C2", "Exp"))) -> assoc_tb3 ``` ```{r echo = F, message=FALSE, warning=FALSE} assoc_tb3 %>% as.data.frame() %>% head(10) %>% flextable() %>% flextable::set_table_properties(width = .95, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "First 10 rows of the association statistic table") %>% flextable::border_outer() ``` We can use a barplot to visualize the association strength (keyness) of words with a text. ```{r message=F, warning=F} # get top 10 keywords for text 1 top <- assoc_tb3 %>% dplyr::ungroup() %>% dplyr::slice_head(n = 10) # get top 10 keywords for text 2 bot <- assoc_tb3 %>% dplyr::ungroup() %>% dplyr::slice_tail(n = 10) # combine into table rbind(top, bot) %>% # create a ggplot ggplot(aes(x = reorder(token, DeltaP12, mean), y = DeltaP12, label = DeltaP12, fill = type)) + # add a bar plot using the 'phi' values geom_bar(stat = "identity") + # add text labels above the bars with rounded 'phi' values geom_text(aes(y = ifelse(DeltaP12> 0, DeltaP12 - 0.05, DeltaP12 + 0.05), label = round(DeltaP12, 3)), color = "white", size = 3) + # flip the coordinates to have horizontal bars coord_flip() + # set the theme to a basic white and black theme theme_bw() + # remove legend theme(legend.position = "none") + # define colors scale_fill_manual(values = c("orange", "darkgray")) + # set the x-axis label to "Token" and y-axis label to "Association strength (phi)" labs(title = "Top 10 keywords for text1 and text 2", x = "Keyword", y = "Association strength (DeltaP12)") ``` # Text Classification{-} Text classification involves methods for categorizing text into predefined groups, like languages, genres, or authors. These categorizations usually rely on the frequency of word types, important terms, phonetic elements, and other linguistic characteristics such as sentence length and words per line. Like many other text analysis methods, text classification often starts with a training dataset already marked with the necessary labels. 
You can create these training datasets and their associated features manually or opt for pre-built training sets offered by specific software or tools.

Text classification is a machine learning task where text documents are categorized into predefined classes or labels based on their content. It involves training a model on labeled data to learn patterns and then using that model to classify new, unlabeled documents. Text classification has numerous applications, such as spam detection, sentiment analysis, and topic categorization.
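As a minimal sketch of this train-then-classify workflow (and as an alternative to the phoneme-based example below), the `quanteda.textmodels` package offers a Naive Bayes classifier. This package is *not* part of the set-up above, so treat the following as an optional illustration with invented example sentences; the chunk is therefore not evaluated.

```{r eval = F}
# optional sketch; assumes that install.packages("quanteda.textmodels") has been run
# tiny, invented labelled training set
train_txt <- c("great film wonderful acting", "awful plot terrible acting",
               "wonderful story great fun", "terrible film awful waste")
train_lab <- c("pos", "neg", "pos", "neg")
# document-feature matrix of the training texts
dfm_train <- quanteda::dfm(quanteda::tokens(train_txt))
# train a Naive Bayes classifier on the labelled data
nb <- quanteda.textmodels::textmodel_nb(dfm_train, y = train_lab)
# classify a new, unlabelled document (features matched to the training dfm)
dfm_new <- quanteda::dfm(quanteda::tokens("what a wonderful film")) %>%
  quanteda::dfm_match(features = quanteda::featnames(dfm_train))
predict(nb, newdata = dfm_new)
```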


In the upcoming example, we'll use phoneme frequency to classify a text. To get started, we'll load a German text and break it down into its constituent phonetic elements. ```{r tc1, message=FALSE, warning=FALSE} # read in German text German <- readLines("https://slcladal.github.io/data/phonemictext1.txt") %>% stringr::str_remove_all(" ") %>% stringr::str_split("") %>% unlist() # inspect data head(German, 20) ``` We now do the same for three other texts - an English and a Spanish text as well as one text in a language that we will determine using classification. ```{r tc2, message=FALSE, warning=FALSE} # read in texts English <- readLines("https://slcladal.github.io/data/phonemictext2.txt") Spanish <- readLines("https://slcladal.github.io/data/phonemictext3.txt") Unknown <- readLines("https://slcladal.github.io/data/phonemictext4.txt") # clean, split texts into phonemes, unlist and convert them into vectors English <- as.vector(unlist(strsplit(gsub(" ", "", English), ""))) Spanish <- as.vector(unlist(strsplit(gsub(" ", "", Spanish), ""))) Unknown <- as.vector(unlist(strsplit(gsub(" ", "", Unknown), ""))) # inspect data head(English, 20) ``` We will now create a table that represents the phonemes and their frequencies in each of the 4 texts. In addition, we will add the language and simplify the column names. ```{r tc3, echo=T, eval = T, message=FALSE, warning=FALSE} # create data tables German <- data.frame(names(table(German)), as.vector(table(German))) English <- data.frame(names(table(English)), as.vector(table(English))) Spanish <- data.frame(names(table(Spanish)), as.vector(table(Spanish))) Unknown <- data.frame(names(table(Unknown)), as.vector(table(Unknown))) # add column with language German$Language <- "German" English$Language <- "English" Spanish$Language <- "Spanish" Unknown$Language <- "Unknown" # simplify column names colnames(German)[1:2] <- c("Phoneme", "Frequency") colnames(English)[1:2] <- c("Phoneme", "Frequency") colnames(Spanish)[1:2] <- c("Phoneme", "Frequency") colnames(Unknown)[1:2] <- c("Phoneme", "Frequency") # combine all tables into a single table classdata <- rbind(German, English, Spanish, Unknown) ``` ```{r tc3b, echo = F, message=FALSE, warning=FALSE} # inspect data classdata %>% as.data.frame() %>% head(10) %>% flextable() %>% flextable::set_table_properties(width = .5, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "First 10 lines of the class data.") %>% flextable::border_outer() ``` Now, we convert the data into a wide format so that we can see how often each phoneme is used in each language. ```{r tc5, echo=T, eval = T, message=FALSE, warning=FALSE} # convert into wide format classdw <- classdata %>% tidyr::spread(Phoneme, Frequency) %>% replace(is.na(.), 0) ``` ```{r tc6b, echo = F, message=FALSE, warning=FALSE} # inspect data classdw[, 1:6] %>% as.data.frame() %>% head(10) %>% flextable() %>% flextable::set_table_properties(width = .5, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "Overview of the class data in wide format.") %>% flextable::border_outer() ``` Next, we normalize the phoneme frequencies so that the values are comparable across texts of different lengths.
This transformation aligns with our classifier's design, which will employ *Language* as the dependent variable and utilize phoneme frequencies as predictors. ```{r tc8, echo=T, eval = T, message=FALSE, warning=FALSE} numvar <- colnames(classdw)[2:length(colnames(classdw))] classdw[numvar] <- lapply(classdw[numvar], as.numeric) # function for normalizing numeric variables normalize <- function(x) { (x-min(x))/(max(x)-min(x)) } # apply normalization classdw[numvar] <- as.data.frame(lapply(classdw[numvar], normalize)) ``` ```{r tc9b, echo = F, message=FALSE, warning=FALSE} # inspect data classdw[, 1:6] %>% as.data.frame() %>% head(10) %>% flextable() %>% flextable::set_table_properties(width = .5, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "Overview of the probabilities.") %>% flextable::border_outer() ``` Before turning to the actual classification, we will use a cluster analysis to see which texts the unknown text is most similar with. ```{r tc10, echo=T, eval = T, message=FALSE, warning=FALSE} # remove language column textm <- classdw[,2:ncol(classdw)] # add languages as row names rownames(textm) <- classdw[,1] # create distance matrix distmtx <- dist(textm) # perform clustering clustertexts <- hclust(distmtx, method="ward.D") # visualize cluster result plot(clustertexts, hang = .25,main = "") ``` As indicated by the cluster analysis, the unidentified text forms a cluster alongside the English texts, strongly suggesting that the unknown text is likely in English. Before we dive into the actual classification process, we'll partition the data into two distinct sets: one excluding *Unknown* (our training set) and the other containing only *Unknown* (our test set). This segmentation allows us to train our model effectively and subsequently test its accuracy. ```{r tc11, echo=T, eval = T, message=FALSE, warning=FALSE} # create training set train <- classdw %>% filter(Language != "Unknown") # create test set test <- classdw %>% filter(Language == "Unknown") ``` ```{r tc12b, echo = F, message=FALSE, warning=FALSE} # inspect data classdw[, 1:6] %>% as.data.frame() %>% head(10) %>% flextable() %>% flextable::set_table_properties(width = .5, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "Overview of the training set probabilities.") %>% flextable::border_outer() ``` In the final stage, we can put our classifier into action. Our chosen classifier is a k-nearest neighbor classifier, which operates on the principle of classifying an unknown element based on its proximity to the clusters within the training set. ```{r message=FALSE, warning=FALSE} # set seed for reproducibility set.seed(12345) # apply k-nearest-neighbor (knn) classifier prediction <- class::knn(train[,2:ncol(train)], test[,2:ncol(test)], cl = train[, 1], k = 3) # inspect the result prediction ``` Using the phoneme frequencies present in the unknown text, our knn-classifier confidently predicts that the text is in English. This prediction aligns with reality, as the text is, indeed, a section of the Wikipedia article for Aldous Huxley's *Brave New World*. 
It's worth noting that the training texts encompassed German, English, and Spanish translations of a subsection from Wikipedia's article on Hermann Hesse's *Steppenwolf*. # Part-of-Speech tagging{-} One widely used method for enhancing text data is part-of-speech tagging, which involves identifying the word type to which each word belongs. In the following section, we will apply part-of-speech tags to a brief English text. Part-of-speech tagging is the process of assigning grammatical categories (such as noun, verb, adjective, etc.) to individual words in a text. It provides valuable insights into the syntactic and grammatical structure of a text, making it easier to analyze and extract meaningful information.

Part-of-speech tagging (POS tagging) is a natural language processing task where each word in a text is assigned a grammatical category, such as noun, verb, adjective, etc. It involves using linguistic patterns and context to determine the appropriate part of speech for each word. POS tagging is crucial for various language analysis tasks, including information retrieval, text summarization, and grammar analysis.


We start by selecting a portion of our example text. ```{r udi1a, message=FALSE, warning=FALSE} # load text sample <- base::readRDS(url("https://slcladal.github.io/data/alice.rda", "rb")) %>% .[1:10] %>% paste0(collapse = " ") # inspect substr(sample, 1, 200) ``` With our text ready for analysis, our next step is to download a pre-trained language model. ```{r udi1b, eval = F, message=FALSE, warning=FALSE} # download language model m_eng <- udpipe::udpipe_download_model(language = "english-ewt") ``` If you have downloaded a model previously, you can also load it directly from the location where you stored it on your computer. In my case, I've placed the model in a folder labeled *udpipemodels*. ```{r udi1c, message=FALSE, warning=FALSE} # load language model from your computer after you have downloaded it once m_eng <- udpipe_load_model(here::here("udpipemodels", "english-ewt-ud-2.5-191206.udpipe")) ``` We can now use the model to annotate our text. ```{r udi1d, message=FALSE, warning=FALSE} # tokenise, tag, dependency parsing text_anndf <- udpipe::udpipe_annotate(m_eng, x = sample) %>% as.data.frame() %>% dplyr::select(-sentence) # inspect head(text_anndf, 10) ``` It can be useful to extract only the words and their pos-tags and convert them back into a text format (rather than a tabular format). ```{r udi2, message=FALSE, warning=FALSE} tagged_text <- paste(text_anndf$token, "/", text_anndf$xpos, collapse = " ", sep = "") # inspect tagged text substr(tagged_text, 1, 200) ``` We could use the pos-tagged data to study differences in the distribution of word classes across different registers, or to find certain syntactic patterns in a collection of texts. # Named Entity Recognition {-} Named Entity Recognition (NER), also known as named entity extraction or entity extraction, is a text analysis technique that automatically identifies and extracts named entities from text, such as people, locations, brands, and more. NER involves the process of extracting textual elements with characteristics commonly associated with proper nouns (e.g., locations, individuals, organizations) rather than other parts of speech. These characteristics may include non-sentence-initial capitalization. Named entities are frequently retrieved in automated summarization and topic modeling. NER can be accomplished through straightforward feature extraction, like extracting all non-sentence-initial capitalized words, or with the aid of training sets. Utilizing training sets (texts annotated to identify entities and non-entities) proves more effective when dealing with unknown or inconsistently capitalized data.

Named Entity Recognition (NER) is a natural language processing task that identifies and classifies words or phrases within text into predefined categories, such as persons, locations, organizations, and more. It employs contextual clues and language patterns to recognize these named entities. NER is essential for various applications, including information extraction, text summarization, and knowledge graph construction.
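If you need proper entity categories (e.g. persons, locations, organizations) rather than the simple proper-noun heuristic used below, the `spacyr` package provides an R wrapper around spaCy's NER models. The following is only a sketch: it assumes that `spacyr` and a spaCy English model have been installed separately (they are not part of this tutorial's set-up), which is why the chunk is not evaluated.

```{r eval = F}
# optional sketch: NER with spacyr (requires a separate spaCy installation)
library(spacyr)
# start the spaCy backend (assumes an English model is available)
spacy_initialize(model = "en_core_web_sm")
# parse a short example sentence and keep entity annotations
parsed <- spacy_parse("Alice met the Queen of Hearts in Wonderland.", entity = TRUE)
# extract the named entities and their types
entity_extract(parsed)
# shut down the spaCy backend
spacy_finalize()
```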


In this context, we will leverage the results obtained from part-of-speech tagging to extract terms tagged as named entities (the label `PROPN` in the `upos` column). ```{r ner1, message=FALSE, warning=FALSE} # tokenise, tag, dependency parsing ner_df <- text_anndf %>% dplyr::filter(upos == "PROPN") %>% dplyr::select(token_id, token, lemma, upos, feats) # inspect head(ner_df) ``` The obtained results can be further processed and categorized into various types such as persons, locations, dates, and other entities. This initial insight should provide you with a starting point for your analysis and exploration. # Dependency Parsing Using UDPipe{-} In addition to part-of-speech tagging, we can create visual representations illustrating the syntactic relationships between the various components of a sentence.

Dependency parsing is a linguistic analysis technique that reveals the grammatical structure of sentences by identifying how words relate to one another. It establishes hierarchical relationships, such as subject-verb, modifier-noun, or object-verb connections, within a sentence. Dependency parsing is fundamental for understanding sentence syntax, semantic roles, and linguistic relationships, playing a critical role in various natural language processing tasks like sentiment analysis, information extraction, and machine translation.


To achieve this, we first construct an object containing a sentence (in this instance, the sentence *John gave Mary a kiss*), and subsequently, we utilize the `textplot_dependencyparser` function to plot or visualize the dependencies. ```{r udi3, message=FALSE, warning=FALSE} # parse text sent <- udpipe::udpipe_annotate(m_eng, x = "John gave Mary a kiss.") %>% as.data.frame() # inspect head(sent) ``` We now generate the plot. ```{r udi5, message=FALSE, warning=FALSE} # generate dependency plot dplot <- textplot::textplot_dependencyparser(sent, size = 3) # show plot dplot ``` Dependency parsing proves invaluable for a range of applications, including analyzing the relationships within sentences and shedding light on the roles of different elements. For instance, it helps distinguish between the agent and the patient in actions like crimes or other activities. This parsing technique enables a deeper understanding of the underlying grammatical and semantic structure of sentences, making it a valuable tool for linguistic analysis, information extraction, and natural language understanding. # Citation & Session Info {-} Schweinberger, Martin. 2023. *Practical Overview of Selected Text Analytics Methods*. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/textanalysis.html (Version 2023.05.31). ``` @manual{schweinberger2023ta, author = {Schweinberger, Martin}, title = {Practical Overview of Selected Text Analytics Methods}, note = {https://ladal.edu.au/textanalysis.html}, year = {2023}, organization = {The Language Technology and Data Analysis Laboratory (LADAL)}, address = {Brisbane}, edition = {2023.05.31} } ``` ```{r fin} sessionInfo() ``` [Back to top](#introduction) [Back to HOME](https://ladal.edu.au) # References{-}