1 Introduction

This tutorial introduces classification using “R”. The entire code for the sections below can be downloaded here.

2 Preparation

As all calculations and visualizations in this tutorial rely on “R”, it is necessary to install “R”, “RStudio”, and “Tinn-R”. If these programs (or, in the case of “R”, environments) are not installed yet, please search for them in your favorite search engine and add the term “download”. Open any of the first few links and follow the installation instructions (they are easy to follow, do not require any specifications, and are pretty much self-explanatory).

In addition, certain “libraries” need to be installed so that the scripts shown below run without errors. Before turning to the code, please install the libraries needed to run it. If you have already installed these libraries, you can skip ahead and ignore this section. To install the necessary libraries, simply run the following code. It may take some time (between 1 and 5 minutes to install all of the libraries), so do not worry if it takes a while.

# clean current workspace
rm(list=ls(all=T))
# set options
options(stringsAsFactors = F)
# install libraries
install.packages(c("LiblineaR", "SparseM", "tm", "tokenizers", "quanteda", "ggplot2"))

Once you have installed “R”, “RStudio”, and “Tinn-R”, and have initiated the session by executing the code shown above, you are good to go.

3 Classification of textual data

In this section we show the use of supervised machine learning for text classification. The basic idea is to compute a model based on training data. Training data usually consists of hand-coded documents or text snippets associated with a specific category (class). From these texts, features (e.g. words) are extracted and associated with the categories in the model. The model can then be utilized to categorize new texts.

We cover basic principles of the process such as cross-validation and feature engineering in the following steps:

  1. Read text data
  2. Read training data
  3. Build feature matrix
  4. Classify (LiblineaR)
  5. K-fold cross validation
  6. Optimize C
  7. Optimize features: stopwords, bi-grams, stemming
  8. Final classification

As data, we again use the “State of the Union” addresses, but this time we operate on paragraphs instead of entire documents. The file data/sotu_paragraphs.csv provides the speeches in the appropriate format. For each paragraph, we want to know whether it covers content related to domestic or foreign affairs.

4 Read paragraphs

As before, we read the text source (21,334 paragraphs from 231 speeches). For the moment, we apply only very basic preprocessing.

# load library
library(tm)
library(quanteda)
# load data (adjust the path to where the data is stored on your machine)
testdata <- read.delim("D:\\Uni\\UQ\\LADAL\\SLCLADAL.github.io\\data/testparagraphs.txt", sep = "\t", header = T)
str(testdata)
## 'data.frame':    38577 obs. of  6 variables:
##  $ doc_id       : Factor w/ 38459 levels "","$1,000 being usually, on account of their small compensation, filled by",..: 336 1631 2773 3757 3879 3997 4118 4236 4699 353 ...
##  $ speech_doc_id: Factor w/ 84 levels "","$60,624,464 2. Newspapers and periodicals, 1 cent per pound",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ speech_type  : Factor w/ 5 levels "","10,324,069 4. Parcels, etc., 16 cents a pound",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ president    : Factor w/ 18 levels "","512,977,326 ",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ date         : Factor w/ 76 levels "","1790-01-08",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ text         : Factor w/ 4070 levels "","\\Friendly relations with all, but entangling alliances with none,\\ has long\nbeen a maxim with us. Our true m"| __truncated__,..: 653 895 1336 235 45 3145 3529 2903 3785 3766 ...

Now that the test data is loaded, we also load a list of English stopwords into “R”.

# load stop words
english_stopwords <- readLines("https://slcladal.github.io/resources/stopwords_en.txt", encoding = "UTF-8")
# inspect stop words
head(english_stopwords)
## [1] "a"         "a's"       "able"      "about"     "above"     "according"
# create corpus object
corpus <- Corpus(DataframeSource(testdata))
# preprocessing chain
processedCorpus <- tm_map(corpus, content_transformer(tolower))
processedCorpus <- tm_map(processedCorpus, removePunctuation, preserve_intra_word_dashes = TRUE)
processedCorpus <- tm_map(processedCorpus, removeNumbers)
processedCorpus <- tm_map(processedCorpus, stripWhitespace)
# inspect corpus
as.character(processedCorpus[[2]])
## [1] "i embrace with great satisfaction the opportunity which now presents itself of congratulating you on the present favorable prospects of our public affairs the recent accession of the important state of north carolina to the constitution of the united states of which official information has been received the rising credit and respectability of our country the general and increasing good will toward the government of the union and the concord peace and plenty with which we are blessed are circumstances auspicious in an eminent degree to our national prosperity"

5 Load training data

We provide 300 manually annotated example paragraphs as training data. The file “paragraph_training_data.txt” stores the paragraph id and the corresponding category.

# load training data (already annotated)
trainingData <- read.delim("https://slcladal.github.io/data/paragraph_training_data.txt", sep = "\t", header = T, quote = "")
# inspect data 
colnames(trainingData); str(trainingData)
## [1] "ID"    "LABEL"
## 'data.frame':    300 obs. of  2 variables:
##  $ ID   : int  15155 5251 8312 1950 20521 234 12249 16303 18626 876 ...
##  $ LABEL: Factor w/ 2 levels "DOMESTIC","FOREIGN": 1 1 2 1 1 2 1 1 2 1 ...

What is the ratio of domestic to foreign content in the training data?

classCounts <- table(trainingData[, "LABEL"])
print(classCounts)
## 
## DOMESTIC  FOREIGN 
##      209       91
numberOfDocuments <- nrow(trainingData)

For our first classification attempt, we create a Document-Term Matrix from the preprocessed corpus and use the extracted single words (unigrams) as features for the classification. Since the resulting DTM might contain too many words, we restrict the vocabulary to a minimum frequency.

# Base line: create feature set out of unigrams
DTM <- DocumentTermMatrix(processedCorpus)
# How many features do we have?
dim(DTM)
## [1] 38577 12886
# Probably the DTM is too big for the classifier. Let us reduce it
minimumFrequency <- 5
DTM <- DocumentTermMatrix(processedCorpus, control = list(bounds = list(global = c(minimumFrequency, Inf))))
dim(DTM)
## [1] 38577  5409

6 Classification

Now we build a linear classification model with the LiblineaR package. The package LiblineaR wraps around the open source library LIBLINEAR which provides very fast implementations of two text classification algorithms: 1. logistic regression, and 2. support vector machines (SVM) with a linear kernel. For both algorithms, different regularization methods exist (e.g. L1- and L2-regularization). The combination of algorithm and regularization can be controlled via the type parameter of the LiblineaR function. We stick to the default type, L2-regularized logistic regression (LR), since it usually achieves good performance in text classification.

First, we load the packages. Since LiblineaR requires a special sparse matrix format, we also load the “SparseM” package and a conversion function which allows us to convert slam matrices (as used in the tm package) into SparseM matrices.
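
The conversion function itself is provided by the utils.R script sourced below. Purely for illustration, a minimal sketch of such a conversion could look as follows. This naive version goes through a dense matrix, which is fine for the 300 annotated paragraphs but would be memory-hungry for the full DTM; the sourced implementation presumably works on the sparse triplet representation directly.

# Illustration only: a naive slam-to-SparseM conversion (the tutorial uses the
# convertSlamToSparseM function sourced from utils.R, not this sketch)
convert_slam_to_sparsem_sketch <- function(slam_dtm) {
  require(SparseM)
  # expand the slam simple_triplet_matrix to a dense matrix, then compress it
  # into SparseM's row-compressed format expected by LiblineaR
  SparseM::as.matrix.csr(as.matrix(slam_dtm))
}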

Then, we split the annotated data into a training set (80%) and a test set (20%) using a boolean selector. The expression assigned to selector_idx creates a boolean vector of length 300 containing a FALSE value in every fifth position. This selector is used to select the training set; its negated vector (!) is used to select the test set.

require(LiblineaR)
require(SparseM)
source("https://slcladal.github.io/rscripts/utils.R")
annotatedDTM <- DTM[trainingData[, "ID"], ]
annotatedDTM <- convertSlamToSparseM(annotatedDTM)
annotatedLabels <- trainingData[, "LABEL"]
# split into training and test set
selector_idx <- rep(c(rep(TRUE, 4), FALSE), length.out = numberOfDocuments)
trainingDTM <- annotatedDTM[selector_idx, ]
trainingLabels <- annotatedLabels[selector_idx]
testDTM <- annotatedDTM[!selector_idx, ]
testLabels <- annotatedLabels[!selector_idx]
# create LR classification model
model <- LiblineaR(trainingDTM, trainingLabels)
summary(model)
##            Length Class  Mode     
## TypeDetail    1   -none- character
## Type          1   -none- numeric  
## W          5410   -none- numeric  
## Bias          1   -none- numeric  
## ClassNames    2   factor numeric  
## NbClass       1   -none- numeric
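
For comparison, the type parameter mentioned above can also be set explicitly. The following lines are only an illustration of how the same training data could be fit with the default L2-regularized logistic regression (type = 0) or with an L2-regularized L2-loss linear SVM (type = 2); the rest of the tutorial sticks with the default model.

# illustration: explicit algorithm selection via the type parameter
model_lr  <- LiblineaR(trainingDTM, trainingLabels, type = 0)  # L2-reg. logistic regression (default)
model_svm <- LiblineaR(trainingDTM, trainingLabels, type = 2)  # L2-reg. L2-loss linear SVM
summary(model_svm)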

The model created by the LiblineaR function can now be utilized to predict the labels of the test set. Then we compare the result of the automatic classification to our known labels to determine the accuracy of the process.

classification <- predict(model, testDTM) 
predictedLabels <- classification$predictions
contingencyTable <- table(predictedLabels, testLabels)
print(contingencyTable) 
##                testLabels
## predictedLabels DOMESTIC FOREIGN
##        DOMESTIC        7       1
##        FOREIGN        38      14
# calculate accuracy
accuracy <- sum(diag(contingencyTable)) / length(testLabels)
print(accuracy) 
## [1] 0.35

The accuracy of 0.35 appears moderate for a first try. But how does it actually relate to a baseline? Think of the imbalanced class proportions in our training set. Let us create a pseudo classification as a baseline, in which we do not classify at all but simply assume the label “DOMESTIC” or “FOREIGN” for each paragraph.

We further employ a function called “F.measure” which provides more differentiated measures than simple accuracy (“A”) to determine the classification quality. The F-measure (“F”) is the harmonic mean of precision (“P”) and recall (“R”) (see https://en.wikipedia.org/wiki/Precision_and_recall#Definition_.28classification_context.29 for details).
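
The F.measure function is provided by the utils.R script sourced above. To make transparent what the reported values mean, here is a minimal sketch of how such measures can be computed from predicted and true labels; the column names P, R, S, F, A and Pos. follow the output shown below, but this is an illustration, not the sourced implementation.

# Illustration: precision, recall, specificity, F-measure and accuracy for a
# chosen positive class (the tutorial itself uses F.measure from utils.R)
f_measure_sketch <- function(predicted, true, positiveClassName) {
  tp <- sum(predicted == positiveClassName & true == positiveClassName)
  fp <- sum(predicted == positiveClassName & true != positiveClassName)
  fn <- sum(predicted != positiveClassName & true == positiveClassName)
  tn <- sum(predicted != positiveClassName & true != positiveClassName)
  P <- tp / (tp + fp)              # precision
  R <- tp / (tp + fn)              # recall (true positive rate)
  S <- tn / (tn + fp)              # specificity (true negative rate)
  F <- 2 * P * R / (P + R)         # F-measure: harmonic mean of P and R
  A <- (tp + tn) / length(true)    # accuracy
  c(P = P, R = R, S = S, F = F, A = A, Pos. = sum(true == positiveClassName))
}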

# Create pseudo classification
pseudoLabelsDOM <- factor(rep("DOMESTIC", length(testLabels)), levels(testLabels))
pseudoLabelsFOR <- factor(rep("FOREIGN", length(testLabels)), levels(testLabels))
# Evaluation of former LR classification with F-measures
F.measure(predictedLabels, testLabels, positiveClassName = "DOMESTIC")
##          P          R          S          F          A       Pos. 
##  0.8750000  0.1555556  0.9333333  0.2641509  0.3500000 45.0000000
F.measure(predictedLabels, testLabels, positiveClassName = "FOREIGN")
##          P          R          S          F          A       Pos. 
##  0.2692308  0.9333333  0.1555556  0.4179104  0.3500000 15.0000000
# Evaluation of pseudo classification with F-measures
F.measure(pseudoLabelsDOM, testLabels, positiveClassName = "DOMESTIC")
##          P          R          S          F          A       Pos. 
##  0.7500000  1.0000000  0.0000000  0.8571429  0.7500000 45.0000000
F.measure(pseudoLabelsFOR, testLabels, positiveClassName = "FOREIGN")
##     P     R     S     F     A  Pos. 
##  0.25  1.00  0.00  0.40  0.25 15.00

This little experiment shows that, depending on the definition of our positive class, the accuracy is either 25% or 75% if we do not classify at all. In both cases the specificity (S), the true negative rate, is zero. From this, we can learn two things:

  1. If classes in training/test sets are imbalanced, accuracy might be a misleading measurement. Other measures should be considered in addition.
  2. To utilize accuracy and F-measure in a meaningful way, the less frequent class should be defined as the POSITIVE class (FOREIGN in our case).

7 K-fold cross validation

To evaluate a classifier, the annotated data can be divided into training and test data. The model learns on the former and is evaluated on the latter. In this procedure, we unfortunately lose the test data for learning. If there is little training data available, k-fold cross-validation is a more suitable procedure.

For this, the training data is split into, for example, k = 10 parts. Then k-1 parts are used for training and 1 part is used for testing. This process is repeated k times, with a different part of the overall data set used for testing in each iteration.

The final result is determined as the average of the quality measures of the k runs. This gives a good approximation of the classification quality while making use of all training data.

The get_k_fold_logical_indexes function introduced below returns a logical vector for fold j of the cross-validation. It splits a training data set of size n into k folds. The resulting vector and its negated counterpart can be used for easy selection of training and test data.

get_k_fold_logical_indexes <- function(j, k, n) {
  if (j > k) stop("Cannot select fold larger than nFolds")
  fold_lidx <- rep(FALSE, k)
  fold_lidx[j] <- TRUE
  fold_lidx <- rep(fold_lidx, length.out = n)
  return(fold_lidx)
}
# Example usage
get_k_fold_logical_indexes(1, k = 10, n = 12)
##  [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## [12] FALSE
get_k_fold_logical_indexes(2, k = 10, n = 12)
##  [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12]  TRUE
get_k_fold_logical_indexes(3, k = 10, n = 12)
##  [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE

Now we run 1) the splitting of the annotated data and 2) model computation and testing in a single for-loop.

k <- 10
evalMeasures <- NULL
for (j in 1:k) {
  # create j-th boolean selection vector (over all annotated documents)
  currentFold <- get_k_fold_logical_indexes(j, k, nrow(annotatedDTM))
  
  # select training data split
  foldDTM <- annotatedDTM[!currentFold, ]
  foldLabels <- annotatedLabels[!currentFold]
  
  # create model
  model <- LiblineaR(foldDTM, foldLabels)
  
  # select test data split
  testSet <- annotatedDTM[currentFold, ]
  testLabels <- annotatedLabels[currentFold]
  
  # predict test labels
  predictedLabels <- predict(model, testSet)$predictions
  
  # evaluate predicted against test labels
  kthEvaluation <- F.measure(predictedLabels, testLabels, positiveClassName = "FOREIGN")
  
  # combine evaluation measures for k runs
  evalMeasures <- rbind(evalMeasures, kthEvaluation)
}
# Final evaluation values of k runs:
print(evalMeasures)
##                       P         R          S         F         A Pos.
## kthEvaluation 0.3214286 0.9000000 0.05000000 0.4736842 0.3333333   10
## kthEvaluation 0.4800000 0.7500000 0.07142857 0.5853659 0.4333333   16
## kthEvaluation 0.1538462 0.5714286 0.04347826 0.2424242 0.1666667    7
## kthEvaluation 0.2916667 0.7777778 0.19047619 0.4242424 0.3666667    9
## kthEvaluation 0.2400000 0.8571429 0.17391304 0.3750000 0.3333333    7
## kthEvaluation 0.2592593 1.0000000 0.13043478 0.4117647 0.3333333    7
## kthEvaluation 0.3703704 0.9090909 0.10526316 0.5263158 0.4000000   11
## kthEvaluation 0.2222222 1.0000000 0.12500000 0.3636364 0.3000000    6
## kthEvaluation 0.3448276 1.0000000 0.05000000 0.5128205 0.3666667   10
## kthEvaluation 0.2962963 1.0000000 0.13636364 0.4571429 0.3666667    8
# Average over all folds
print(colMeans(evalMeasures))
##         P         R         S         F         A      Pos. 
## 0.2979917 0.8765440 0.1076358 0.4372397 0.3400000 9.1000000

Accuracy is around 34%, the F-measure around 44%. Let’s try some approaches to optimize the automatic classification.

8 Optimization

These first tries have clarified how to utilize and evaluate machine learning functions for text in R. Now we concentrate on optimization strategies to get better results from the automatic classification process.

8.1 C-Parameter

For a linear classification model, the cost parameter (C-parameter) is the most important parameter to tweak (for other SVM kernels, such as the radial or polynomial kernel, there are additional parameters which influence the shape of the kernel function). The C-parameter determines the cost of misclassifications on the training data during training.

High values of C lead to high costs of misclassification: the decision boundary which the classifier learns will try to avoid any misclassification. But values that are too high can lead to overfitting of the model. This means it adapts very well to the training data, but classification is more likely to fail on new test data.

Low values of C lead to less strict decision boundaries, which accept some misclassifications. Such a model might generalize better to unseen data. In the end, there is no exact method to determine a good C-value beforehand; it is rather an empirical question. To choose an optimal C-value, we simply try a range of values, run k-fold cross-validation for each single value, and pick the C which results in the best classification accuracy / F-measure. This is realized in the following for-loop, which utilizes the function k_fold_cross_validation. This function (have a look into F.measure.R) simply wraps the cross-validation code we used above.
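
The following sketch illustrates what such a wrapper might look like: it simply re-runs the cross-validation loop from the previous section with a given cost value and returns the averaged evaluation measures. The actual k_fold_cross_validation function sourced from utils.R is what the code below uses and may differ in detail.

# Illustration: a wrapper around the k-fold loop from above with a cost parameter
# (the sourced k_fold_cross_validation function is used in the actual code below)
k_fold_cross_validation_sketch <- function(DTM, labels, cost = 1, k = 10) {
  evalMeasures <- NULL
  for (j in 1:k) {
    currentFold <- get_k_fold_logical_indexes(j, k, nrow(DTM))
    model <- LiblineaR(DTM[!currentFold, ], labels[!currentFold], cost = cost)
    predictedLabels <- predict(model, DTM[currentFold, ])$predictions
    evalMeasures <- rbind(evalMeasures,
                          F.measure(predictedLabels, labels[currentFold],
                                    positiveClassName = "FOREIGN"))
  }
  colMeans(evalMeasures)
}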

cParameterValues <- c(0.003, 0.01, 0.03, 0.1, 0.3, 1, 3 , 10, 30, 100)
fValues <- NULL
for (cParameter in cParameterValues) {
  print(paste0("C = ", cParameter))
  evalMeasures <- k_fold_cross_validation(annotatedDTM, annotatedLabels, cost = cParameter)
  fValues <- c(fValues, evalMeasures["F"])
}
## [1] "C = 0.003"
## [1] "C = 0.01"
## [1] "C = 0.03"
## [1] "C = 0.1"
## [1] "C = 0.3"
## [1] "C = 1"
## [1] "C = 3"
## [1] "C = 10"
## [1] "C = 30"
## [1] "C = 100"
plot(fValues, type="o", col="green", xaxt="n")
axis(1,at=1:length(cParameterValues), labels = cParameterValues)

bestC <- cParameterValues[which.max(fValues)]
print(paste0("Best C value: ", bestC, ", F1 = ", max(fValues)))
## [1] "Best C value: 1, F1 = 0.437239695980729"

From this empirical test, we obtain C = 1 as the optimal choice for the cost parameter. On the current training data set with the current features it achieves an F-score of 0.4372397.

8.2 Optimized Preprocessing

The classification model is not the only component with parameters that can be optimized to improve the results. Even more important are the features used for classification. In our preprocessing chain above, we extracted single word types and transformed them into lower case. We now add different preprocessing steps and check their effect on the results. To obtain an optimal cost parameter for each new feature set, we wrapped the code for C-parameter optimization into the optimize_C function.
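
Again, optimize_C comes from the sourced utils.R script. As an illustration of the idea, a sketch of such a wrapper could look like this: it runs cross-validation for a range of C-values and returns the value with the highest F-measure.

# Illustration: search a range of C-values and return the one with the best
# F-measure (the actual optimize_C function is provided by utils.R)
optimize_C_sketch <- function(DTM, labels,
                              cValues = c(0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100)) {
  fValues <- sapply(cValues, function(cParameter) {
    print(paste0("C = ", cParameter))
    k_fold_cross_validation(DTM, labels, cost = cParameter)["F"]
  })
  bestC <- cValues[which.max(fValues)]
  print(paste0("Best C value: ", bestC, ", F1 = ", max(fValues)))
  return(bestC)
}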

Stop word removal

Stop words often do not contribute to the meaning of a text. For the decision between DOMESTIC and FOREIGN affairs, we do not expect any useful information from them. So let’s remove them and see if this improves the classifier.

processedCorpus <- tm_map(corpus, content_transformer(tolower))
processedCorpus <- tm_map(processedCorpus, removeWords, english_stopwords)
processedCorpus <- tm_map(processedCorpus, removePunctuation, preserve_intra_word_dashes = TRUE)
processedCorpus <- tm_map(processedCorpus, removeNumbers)
processedCorpus <- tm_map(processedCorpus, stripWhitespace)
as.character(processedCorpus[[4963]])
## [1] ""
minimumFrequency <- 5
DTM <- DocumentTermMatrix(processedCorpus, control = list(bounds = list(global = c(minimumFrequency, Inf))))
dim(DTM)
## [1] 38577  5045
annotatedDTM <- convertSlamToSparseM(DTM[trainingData[, "ID"], ])
best_C <- optimize_C(annotatedDTM, annotatedLabels)
## [1] "C = 0.003"
## [1] "C = 0.01"
## [1] "C = 0.03"
## [1] "C = 0.1"
## [1] "C = 0.3"
## [1] "C = 1"
## [1] "C = 3"
## [1] "C = 10"
## [1] "C = 30"
## [1] "C = 100"
## [1] "Best C value: 1, F1 = 0.443462480072387"
k_fold_cross_validation(annotatedDTM, annotatedLabels, cost = best_C)
##         P         R         S         F         A      Pos. 
## 0.3026075 0.8837031 0.1175201 0.4434625 0.3500000 9.1000000

Bigrams

Now let us see if the use of bigrams, i.e. concatenations of two consecutive words, can improve the result. Bigrams help to somewhat overcome the bag-of-words assumption of Document-Term Matrices. With them, we can learn multi-word units such as great britain, international affairs or united nations as meaningful features for our task. The package tokenizers provides a simple and fast implementation to generate n-grams.

require(tokenizers)
tokenize_ngrams("This is a test", n = 2, n_min = 1, ngram_delim = "_", simplify = T)
## [1] "this"    "this_is" "is"      "is_a"    "a"       "a_test"  "test"
bigram_corpus <- tm_map(processedCorpus, content_transformer(tokenize_ngrams), n = 2, n_min = 1, ngram_delim = "_", simplify = T)
minimumFrequency <- 5
DTM <- DocumentTermMatrix(bigram_corpus, control = list(bounds = list(global = c(minimumFrequency, Inf))))
dim(DTM)
## [1] 38577  7929
sample(colnames(DTM), 10)
##  [1] "\"exchange\","          "\"revoked\","          
##  [3] "\"creation_national\"," "\"militia_system\","   
##  [5] "\"stipulated_treaty\"," "\"amicable\","         
##  [7] "\"agriculture\","       "\"statute\","          
##  [9] "\"remained\","          "\"prepared\","
annotatedDTM <- convertSlamToSparseM(DTM[trainingData[, "ID"], ])
best_C <- optimize_C(annotatedDTM, annotatedLabels)
## [1] "C = 0.003"
## [1] "C = 0.01"
## [1] "C = 0.03"
## [1] "C = 0.1"
## [1] "C = 0.3"
## [1] "C = 1"
## [1] "C = 3"
## [1] "C = 10"
## [1] "C = 30"
## [1] "C = 100"
## [1] "Best C value: 30, F1 = 0.44114326849621"
k_fold_cross_validation(annotatedDTM, annotatedLabels, cost = best_C)
##         P         R         S         F         A      Pos. 
## 0.3007760 0.8817388 0.1155749 0.4411433 0.3466667 9.1000000

Minimum feature frequency

Up to this point, we dropped features occurring fewer than five times in our data. Let’s see what happens if we include more features by decreasing the minimum frequency.

# More features
minimumFrequency <- 2
DTM <- DocumentTermMatrix(bigram_corpus, control = list(bounds = list(global = c(minimumFrequency, Inf))))
dim(DTM)
## [1] 38577 28773
annotatedDTM <- convertSlamToSparseM(DTM[trainingData[, "ID"], ])
best_C <- optimize_C(annotatedDTM, annotatedLabels)
## [1] "C = 0.003"
## [1] "C = 0.01"
## [1] "C = 0.03"
## [1] "C = 0.1"
## [1] "C = 0.3"
## [1] "C = 1"
## [1] "C = 3"
## [1] "C = 10"
## [1] "C = 30"
## [1] "C = 100"
## [1] "Best C value: 3, F1 = 0.44247683286872"
k_fold_cross_validation(annotatedDTM, annotatedLabels, cost = best_C)
##         P         R         S         F         A      Pos. 
## 0.3025640 0.8817388 0.1225201 0.4424768 0.3500000 9.1000000

Stemming

As a last method, we utilize stemming (before bigram concatenation) to unify different variants of the same word (such as nation and nations).

# Stemming
stemmed_corpus <- tm_map(processedCorpus, stemDocument, language = "english")
stemmed_bigram_corpus <- tm_map(stemmed_corpus, content_transformer(tokenize_ngrams), n = 2, n_min = 1, ngram_delim = "_", simplify = T)
minimumFrequency <- 2
DTM <- DocumentTermMatrix(stemmed_bigram_corpus, control = list(bounds = list(global = c(minimumFrequency, Inf))))
dim(DTM)
## [1] 38577 30588
annotatedDTM <- convertSlamToSparseM(DTM[trainingData[, "ID"], ])
best_C <- optimize_C(annotatedDTM, annotatedLabels)
## [1] "C = 0.003"
## [1] "C = 0.01"
## [1] "C = 0.03"
## [1] "C = 0.1"
## [1] "C = 0.3"
## [1] "C = 1"
## [1] "C = 3"
## [1] "C = 10"
## [1] "C = 30"
## [1] "C = 100"
## [1] "Best C value: 10, F1 = 0.434645029121356"
k_fold_cross_validation(annotatedDTM, annotatedLabels, cost = best_C)
##         P         R         S         F         A      Pos. 
## 0.2982734 0.8507864 0.1305801 0.4346450 0.3500000 9.1000000

Each individual approach to optimizing our text features for classification has some effect on the results. The best results were obtained from a combination of bigram features with a minimum frequency of 2 and stop word removal. Stemming did not have a positive effect on the performance.

But testing different features must be done for each new task / language individually, since there is no “one-size-fits-all” approach.

GENERAL ADVICE: For this tutorial we utilized a rather small training set of 300 examples, 91 of them in the positive class. Better classification accuracy can be expected if more training data is available. Hence, instead of spending too much time on feature optimization, it is probably a better idea to invest in generating more training data first.

9 Final classification

We now apply our best classification model to the entire data set to determine the share of FOREIGN/DOMESTIC affairs related content in each single speech.

# Final classification
DTM <- DocumentTermMatrix(bigram_corpus, control = list(bounds = list(global = c(2, Inf))))
annotatedDTM <- convertSlamToSparseM(DTM[trainingData[, "ID"], ])
best_C <- optimize_C(annotatedDTM, annotatedLabels)
## [1] "C = 0.003"
## [1] "C = 0.01"
## [1] "C = 0.03"
## [1] "C = 0.1"
## [1] "C = 0.3"
## [1] "C = 1"
## [1] "C = 3"
## [1] "C = 10"
## [1] "C = 30"
## [1] "C = 100"
## [1] "Best C value: 3, F1 = 0.44247683286872"
final_model <- LiblineaR(annotatedDTM, annotatedLabels, cost = best_C)
final_labels <- predict(final_model, convertSlamToSparseM(DTM))$predictions
table(final_labels) / sum(table(final_labels))
## final_labels
##   DOMESTIC    FOREIGN 
## 0.07784431 0.92215569

We see that the classifier puts the majority of the around 21,000 paragraphs into the DOMESTIC category. We can visualize the result with ggplot2, plotting the category of each paragraph as a tile per speech. For better readability, we remove two very long speeches from the plot.

speech_year <- substr(testdata$date, 0, 4)
speech_id <- testdata$speech_doc_id
paragraph_position <- unlist(sapply(table(speech_id), FUN = function(x) {1:x}))
presidents_df <- data.frame(
  paragraph_position = paragraph_position,
  speech = paste0(speech_id, ": ", testdata$president, "_", speech_year),
  category = final_labels
)
# preserve speech order in chart by using factors
presidents_df$speech <- factor(presidents_df$speech, levels = unique(presidents_df$speech))
# Remove two very long speeches to beautify the plot (you can also skip this)
presidents_df <- presidents_df[!grepl("Carter_1981|Truman_1946", presidents_df$speech), ]
# plot classes of paragraphs for each speech as tile
require(ggplot2)
ggplot(presidents_df, aes(x = speech, y = paragraph_position, fill = category)) + 
  geom_tile(stat="identity") + coord_flip()

Can you see how DOMESTIC affairs related content becomes more important over the course of the centuries? The position of FOREIGN policy statements also changes around the turn from the 19th to the 20th century: from the beginning of a speech to more dispersed positions throughout the speech, and finally a tendency to place them at the end of speeches.

10 Optional exercises

  1. Divide the training data into an 80% training set and a 20% test set. Train a classifier on the training set and evaluate the performance on the test set. As performance measures use Krippendorff’s alpha and Cohen’s kappa as provided in the irr package for R.
  2. Use the proba = T parameter of the predict method during the final classification to evaluate label probabilities instead of concrete label decisions. Use the output probabilities for the label “FOREIGN” to classify paragraphs as “FOREIGN” only if the label probability is greater than 60%. Visualize the result. What can you observe compared to the previous plot (decision boundary around 50%)? A possible starting point is sketched after this list.
  3. Find a visualization that shows the position of FOREIGN and DOMESTIC paragraphs in each speech.
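
For exercise 2, a possible starting point (a sketch, assuming final_model and the DTM from the final classification are still in the workspace) could look like this:

# Sketch for exercise 2: predict label probabilities and apply a 60% threshold
# for the FOREIGN class instead of the default decision boundary
classificationProb <- predict(final_model, convertSlamToSparseM(DTM), proba = TRUE)
foreignProb <- classificationProb$probabilities[, "FOREIGN"]
final_labels_60 <- ifelse(foreignProb > 0.6, "FOREIGN", "DOMESTIC")
table(final_labels_60) / sum(table(final_labels_60))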