1 Introduction

This tutorial introduces part-of-speech tagging and syntactic parsing using “R”. The entire code for the sections below can be downloaded here.

Part-of-speech tagging, or pos-tagging, is a common procedure when working with natural language data. Despite being used quite frequently, it is a rather complex task that requires fairly advanced statistical methods. In the following, we will explore different options for pos-tagging and syntactic parsing.

2 Preparation and session set up

As all calculations and visualizations in this tutorial rely on “R”, it is necessary to install “R”, “RStudio”, and “Tinn-R”. If these programs (or, in the case of “R”, environments) are not already installed on your machine, please search for them in your favorite search engine and add the term “download”. Open any of the first few links and follow the installation instructions (they are easy to follow, do not require any specifications, and are pretty much self-explanatory).

In addition, certain “libraries” or “packages” need to be installed so that the scripts shown below are executed without errors. Before turning to the code below, please install the libraries by running the code below this paragraph. If you have already installed the libraries mentioned below, you can skip this section. To install the necessary libraries, simply run the following code. Installing all of the libraries may take between 1 and 5 minutes, so you do not need to worry if it takes some time.

# clean current workspace
rm(list = ls(all.names = TRUE))
# set options
options(stringsAsFactors = FALSE)     # no automatic data transformation
options("scipen" = 100, "digits" = 4) # suppress scientific notation
# install libraries
install.packages(c("tm", "NLP", "openNLP", "openNLPdata"))

Once you have installed “R”, “RStudio”, and “Tinn-R”, and have also initiated the session by executing the code shown above, you are good to go.

A word of warning is in order here. The “openNLP” library is written in Java and may require a re-installation of Java as well as re-setting the path variable to Java. A short video on how to set the path variable can be found here: https://www.youtube.com/watch?v=yrRmLOcB9fg.
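Before re-installing anything, it can help to check what R currently sees. The following is a minimal sketch, not part of the original tutorial: it only inspects the JAVA_HOME environment variable, looks for a java executable on the search path, and tests whether the “rJava” package (the bridge between R and Java) can be loaded. The variable names are illustrative.

```r
# Illustrative check only: none of this is required by the tutorial itself.
java_home <- Sys.getenv("JAVA_HOME")  # empty string if the variable is not set
java_bin  <- Sys.which("java")        # empty string if no java executable is on the PATH

if (nzchar(java_home)) {
  message("JAVA_HOME is set to: ", java_home)
} else {
  message("JAVA_HOME is not set.")
}

if (nzchar(java_bin)) {
  message("Found a java executable at: ", java_bin)
} else {
  message("No java executable found on the PATH.")
}

# TRUE only if rJava is installed and its namespace can be loaded
rjava_ok <- requireNamespace("rJava", quietly = TRUE)
message("rJava can be loaded: ", rjava_ok)
```

If the last line reports FALSE although Java is installed, the path settings described above are a likely place to look.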

3 Annotate POS

We extract proper nouns (tag NNP for singular and tag NNPS for plural proper nouns) from paragraphs of presidents’ speeches.

options(stringsAsFactors = FALSE)
# load the libraries used below
library(tm)
library(NLP)
library(openNLP)
# read sotu paragraphs
textdata <- read.csv("https://slcladal.github.io/data/sotu_paragraphs.csv", sep = ";", encoding = "UTF-8")
english_stopwords <- readLines("https://slcladal.github.io/resources/stopwords_en.txt", encoding = "UTF-8")
# create corpus object
corpus <- Corpus(DataframeSource(textdata))
# openNLP annotator objects
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
pos_tag_annotator <- Maxent_POS_Tag_Annotator()
# combine the annotators: sentences first, then words, then POS tags
annotator_pipeline <- Annotator_Pipeline(
  sent_token_annotator,
  word_token_annotator,
  pos_tag_annotator
)
# function for annotation
annotateDocuments <- function(doc, pos_filter = NULL) {
  doc <- as.String(doc)
  doc_with_annotations <- annotate(doc, annotator_pipeline)
  # POS tag and surface form of every word annotation
  tags <- sapply(subset(doc_with_annotations, type == "word")$features, `[[`, "POS")
  tokens <- doc[subset(doc_with_annotations, type == "word")]
  if (!is.null(pos_filter)) {
    # keep only the tokens whose tag is in pos_filter
    res <- tokens[tags %in% pos_filter]
  } else {
    # otherwise return every token as token_TAG
    res <- paste0(tokens, "_", tags)
  }
  res <- paste(res, collapse = " ")
  return(res)
}
# run annotation on a sample of the corpus
annotated_corpus <- lapply(corpus[1:10], annotateDocuments)
# have a look at the first annotated documents
annotated_corpus[[1]]
annotated_corpus[[2]]
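Without a filter, the annotation function returns each document as a single string of token_TAG pairs. The sketch below shows one way such a string could be split back into tokens and tags with base R. The example string is hand-written for illustration (it is not output from the corpus), and the names tagged, pairs, tag_table, and proper_nouns are assumptions made for this example.

```r
# Hand-written example in the token_TAG format (not actual tagger output)
tagged <- "The_DT United_NNP States_NNP is_VBZ large_JJ ._."

# Split the string into token_TAG pairs at the spaces
pairs <- strsplit(tagged, " ", fixed = TRUE)[[1]]

# The tag is everything after the last underscore,
# the token is everything before it
tokens <- sub("_[^_]+$", "", pairs)
tags   <- sub("^.*_", "", pairs)

tag_table <- data.frame(token = tokens, tag = tags, stringsAsFactors = FALSE)
print(tag_table)

# Keep only proper nouns, mirroring pos_filter = c("NNP", "NNPS")
proper_nouns <- tag_table$token[tag_table$tag %in% c("NNP", "NNPS")]
print(proper_nouns)
```

Splitting at the last underscore rather than the first keeps tokens that themselves contain an underscore intact.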

4 Filter NEs for further applications

We annotate the first 1,000 paragraphs of the corpus, extract proper nouns, also referred to as Named Entities (NEs), such as person names and locations, and compute the statistical significance of their co-occurrence.

# sparseMatrix() comes from the Matrix package
library(Matrix)
# annotate the first 1000 paragraphs and keep only proper nouns
sample_corpus <- sapply(corpus[1:1000], annotateDocuments, pos_filter = c("NNP", "NNPS"))
# binary term matrix
minimumFrequency <- 2
filtered_corpus <- Corpus(VectorSource(sample_corpus))
binDTM <- DocumentTermMatrix(filtered_corpus, control = list(bounds = list(global = c(minimumFrequency, Inf)), weighting = weightBin))
# colnames(binDTM)
binDTM <- sparseMatrix(i = binDTM$i, j = binDTM$j, x = binDTM$v, dims = c(binDTM$nrow, binDTM$ncol), dimnames = dimnames(binDTM))
# matrix multiplication for co-occurrence counts
coocCounts <- t(binDTM) %*% binDTM
# term whose co-occurring terms (collocates) are to be measured
coocTerm <- "indians"
# calculateCoocStatistics() is not part of the packages installed above;
# it is a helper function that has to be defined or sourced beforehand
coocs <- calculateCoocStatistics(coocTerm, binDTM, measure = "LOGLIK")
# show the 20 strongest collocates
print(coocs[1:20])
##        ohio      wabash    illinois   augustine     indiana   territory 
##   24.124563    8.226914    8.226914    8.226914    7.107320    6.702815 
##      united   pensacola      states     florida   executive     general 
##    6.641194    6.264679    5.695900    5.599043    5.149651    5.036511 
##     georgia      indian  chickasaws mississippi       creek     western 
##    5.036511    4.727217    4.086121    3.425899    3.106909    3.106909 
##    floridas       union 
##    3.106909    2.530202
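To make the matrix multiplication step more tangible, here is a small sketch on a hand-made binary document-term matrix. The three documents and three terms are invented for illustration and are not taken from the corpus. A dense base R matrix is used instead of a sparse one to keep the example self-contained.

```r
# Toy binary document-term matrix: rows are documents, columns are terms.
# A 1 means the term occurs in the document at least once.
bin_dtm <- matrix(
  c(1, 1, 0,
    1, 0, 1,
    1, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"),
                  c("ohio", "indians", "florida"))
)

# Multiplying the transposed matrix with the matrix itself counts,
# for every pair of terms, the documents that contain both terms
cooc_counts <- t(bin_dtm) %*% bin_dtm
print(cooc_counts)

# The diagonal holds the document frequency of each term
print(diag(cooc_counts))

# Number of documents in which "ohio" and "indians" occur together
print(cooc_counts["ohio", "indians"])
```

In this toy matrix, “ohio” and “indians” share two documents (doc1 and doc3), so the corresponding cell is 2. Raw counts like these are the kind of input that a significance measure such as log-likelihood is then computed from.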

5 German language support

For German language support, run:

# install.packages("openNLPmodels.de", repos = "http://datacube.wu.ac.at")
# require("openNLPmodels.de")

Exercise Time!

Plot a co-occurrence graph for the term “california_nnp” and its collocates, as in tutorial 5.

Merging consecutive tokens with identical POS-tags can be a useful approach to identifying multi-word units (MWUs). Modify the function annotateDocuments so that consecutive tokens with identical POS-tags are merged into a single token (e.g. “United_NNP States_NNP” becomes “United_States_NNP”).

Bring it all together: Create a topic model visualization (topic distribution per decade, Tutorial: Topic Models) based only on paragraphs related to Foreign Policy (Tutorial: Text Classification). Just use nouns (NN, NNS) and proper nouns (NNP, NNPS) for the model (Tutorial: POS-tagging).