Introduction

This tutorial shows how to summarize texts automatically using R by extracting the most prototypical sentences.

This tutorial is aimed at beginners and intermediate users of R. The aim is not to provide a fully-fledged analysis but rather to show and exemplify selected methods that are useful for summarizing texts.

The entire R Notebook for the tutorial can be downloaded here. If you want to render the R Notebook on your machine, i.e., knit the document to HTML or PDF, you need to make sure that you have R and RStudio installed; you also need to download the bibliography file and store it in the same folder as the Rmd file.
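If you are unsure how to render the document, a minimal sketch is shown below; it assumes that the downloaded notebook is saved as txtsum.Rmd in your working directory (adapt the file name to whatever you called the file).

# knit the notebook to html (assumes the file is called txtsum.Rmd)
# install.packages("rmarkdown")  # only needed if rmarkdown is not yet installed
rmarkdown::render("txtsum.Rmd", output_format = "html_document")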


Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to R and more information on how to use it here. For this tutorial, we need to install certain packages so that the scripts shown below execute without errors. Before turning to the code, please install the packages by running the code below this paragraph. If you have already installed these packages, you can skip ahead and ignore this section. To install the necessary packages, simply run the following code; installing all of the libraries may take some time (between 1 and 5 minutes), so do not worry if it takes a while.

# set options
options(stringsAsFactors = F)          # no automatic data transformation
options("scipen" = 100, "digits" = 12) # suppress math annotation
# install packages
install.packages("xml2")
install.packages("rvest")
install.packages("lexRankr")
install.packages("textmineR")
install.packages("tidyverse")
install.packages("quanteda")
install.packages("igraph")
install.packages("here")
# install klippy for copy-to-clipboard button in code chunks
install.packages("remotes")
remotes::install_github("rlesur/klippy")

Next, we activate the packages.

# activate packages
library(xml2)
library(rvest)
library(lexRankr)
library(textmineR)
library(tidyverse)
library(quanteda)
library(igraph)
library(here)
# activate klippy for copy-to-clipboard button
klippy::klippy()

Once you have installed the packages and initiated the session by executing the code shown above, you are good to go.

Basic text summarization

This section shows an easy-to-use text-summarization method that extracts the most prototypical sentences from a text. This summarizer does not generate new sentences from prototypical words; instead, it evaluates how prototypical (or central) each sentence is and ranks the sentences in a text according to their prototypicality (or centrality).
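To build an intuition for what centrality means here, consider the following minimal sketch. It is not the algorithm that lexRankr implements (LexRank performs a graph-based ranking over a sentence-similarity matrix), but it illustrates the core idea: a sentence is central if it is similar to many other sentences. The toy sentences and the word-overlap similarity are made up for illustration.

# toy illustration of sentence centrality (not the actual LexRank algorithm)
sentences <- c("The cat sat on the mat.",
               "A cat was sitting on a mat.",
               "Stock markets fell sharply today.")
# tokenize each sentence into a set of lower-case words
toks <- lapply(tolower(sentences), function(x){
  unique(unlist(strsplit(gsub("[[:punct:]]", "", x), "\\s+")))
})
# Jaccard similarity between two word sets
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))
# pairwise similarity matrix (diagonal set to 0 to ignore self-similarity)
sim <- outer(seq_along(toks), seq_along(toks),
             Vectorize(function(i, j) jaccard(toks[[i]], toks[[j]])))
diag(sim) <- 0
# rank sentences by how similar they are to all other sentences
sentences[order(rowSums(sim), decreasing = TRUE)]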

For this example, we will download the text of a Guardian article about a meeting between Angela Merkel and Donald Trump at the G20 summit in 2017. As a first step, we define the URL of the web page hosting the article.

# url to scrape
url = "https://www.theguardian.com/world/2017/jun/26/angela-merkel-and-donald-trump-head-for-clash-at-g20-summit"

Next, we extract the text of the article using the xml2 and rvest packages.

# read page html
page <- xml2::read_html(url)
# extract text from page html using selector
page %>%
  # extract paragraphs
  rvest::html_nodes("p") %>%
  # extract text
  rvest::html_text() %>%
  # remove empty elements
  .[. != ""] -> text
# inspect data
head(text)
## [1] "German chancellor plans to make climate change, free trade and mass migration key themes in Hamburg, putting her on collision course with US"                                                                                                        
## [2] "A clash between Angela Merkel and Donald Trump appears unavoidable after Germany signalled that it will make climate change, free trade and the management of forced mass global migration the key themes of the G20 summit in Hamburg next week."   
## [3] "The G20 summit brings together the world’s biggest economies, representing 85% of global gross domestic product (GDP), and Merkel’s chosen agenda looks likely to maximise American isolation while attempting to minimise disunity amongst others. "
## [4] "The meeting, which is set to be the scene of large-scale street protests, will also mark the first meeting between Trump and the Russian president, Vladimir Putin, as world leaders."                                                               
## [5] "Trump has already rowed with Europe once over climate change and refugees at the G7 summit in Italy, and now looks set to repeat the experience in Hamburg but on a bigger stage, as India and China join in the criticism of Washington. "          
## [6] "Last week, the new UN secretary-general, António Guterres, warned the Trump team if the US disengages from too many issues confronting the international community it will be replaced as world leader."

Now that we have the text, we apply the lexRank function from the lexRankr package to determine the prototypicality (or centrality) of each sentence and to extract the three most central sentences.

# perform lexrank for top 3 sentences
top3sentences <- lexRankr::lexRank(text,
                                   # only 1 article; repeat same docId for all of input vector
                                   docId = rep(1, length(text)),
                                   # return 3 sentences
                                   n = 3,
                                   continuous = TRUE)
## Parsing text into sentences and tokens...DONE
## Calculating pairwise sentence similarities...DONE
## Applying LexRank...DONE
## Formatting Output...DONE
# inspect
top3sentences
##   docId sentenceId
## 1     1        1_2
## 2     1        1_5
## 3     1       1_16
##                                                                                                                                                                                                                                                        sentence
## 1             A clash between Angela Merkel and Donald Trump appears unavoidable after Germany signalled that it will make climate change, free trade and the management of forced mass global migration the key themes of the G20 summit in Hamburg next week.
## 2                     Trump has already rowed with Europe once over climate change and refugees at the G7 summit in Italy, and now looks set to repeat the experience in Hamburg but on a bigger stage, as India and China join in the criticism of Washington.
## 3 But the G7, and Trump’s subsequent decision to shun the Paris climate change treaty, clearly left a permanent mark on her, leading to her famous declaration of independence four days later at a Christian Social Union (CSU) rally in a Bavarian beer tent.
##        value
## 1 0.06017053
## 2 0.05656337
## 3 0.04974733

Next, we extract and display the sentences from the table.

top3sentences$sentence
## [1] "A clash between Angela Merkel and Donald Trump appears unavoidable after Germany signalled that it will make climate change, free trade and the management of forced mass global migration the key themes of the G20 summit in Hamburg next week."            
## [2] "Trump has already rowed with Europe once over climate change and refugees at the G7 summit in Italy, and now looks set to repeat the experience in Hamburg but on a bigger stage, as India and China join in the criticism of Washington."                    
## [3] "But the G7, and Trump’s subsequent decision to shun the Paris climate change treaty, clearly left a permanent mark on her, leading to her famous declaration of independence four days later at a Christian Social Union (CSU) rally in a Bavarian beer tent."

The output shows the three most prototypical (or central) sentences of the article. In this case, the sentences are already in chronological order; if they were not, we could order them by sentenceId before displaying them, using functions from the dplyr and stringr packages as shown below (here the order does not change because the prototypicality ranking and the chronological order coincide).

top3sentences %>%
  dplyr::mutate(sentenceId = as.numeric(stringr::str_remove_all(sentenceId, ".*_"))) %>%
  dplyr::arrange(sentenceId) %>%
  dplyr::pull(sentence)
## [1] "A clash between Angela Merkel and Donald Trump appears unavoidable after Germany signalled that it will make climate change, free trade and the management of forced mass global migration the key themes of the G20 summit in Hamburg next week."            
## [2] "Trump has already rowed with Europe once over climate change and refugees at the G7 summit in Italy, and now looks set to repeat the experience in Hamburg but on a bigger stage, as India and China join in the criticism of Washington."                    
## [3] "But the G7, and Trump’s subsequent decision to shun the Paris climate change treaty, clearly left a permanent mark on her, leading to her famous declaration of independence four days later at a Christian Social Union (CSU) rally in a Bavarian beer tent."

EXERCISE TIME!


  1. Extract the top 10 sentences from every chapter of Charles Darwin’s On the Origin of Species. You can download the text using this command: darwin <- base::readRDS(url("https://slcladal.github.io/data/origindarwin.rda", "rb")). You will then have to paste the whole text together, split it into chapters, create a list of sentences in each chapter, and then apply text summarization to each element in the list.

Answer

darwin <- base::readRDS(url("https://slcladal.github.io/data/origindarwin.rda", "rb")) %>%
  # collapse into a single document
  paste0(collapse = " ") %>%
  # split into chapters
  stringr::str_split("CHAPTER") %>%
  # str_split returns a one-element list; extract the character vector of chapters
  unlist()

# split chapters into sentences
chapters <- lapply(darwin, function(x){
  unlist(stringi::stri_split_boundaries(x, type = "sentence"))
})

chapters_clean <- lapply(chapters, function(x){
  # remove chapter headings
  stringr::str_remove_all(x, "[A-Z]{2,} {0,1}[0-9]{0,}")
})

# extract the top sentences from each chapter (n = 3 here; set n = 10 for the full exercise)
top3s <- lapply(chapters_clean, function(x){
  lexRankr::lexRank(x,
                    # each chapter is one document; repeat the same docId for all sentences
                    docId = rep(1, length(x)),
                    # return 3 sentences
                    n = 3,
                    continuous = TRUE) %>%
    dplyr::pull(sentence) %>%
    # remove special characters
    stringr::str_remove_all("[^[:alnum:] ]") %>%
    # remove superfluous white spaces
    stringr::str_squish()
})

# inspect the top sentences of the first 5 chapters
top3s[1:5]



You can now go ahead and experiment with text summarization to see whether it is useful for you and whether you can trust the results it produces on your data.

Citation & Session Info

Schweinberger, Martin. 2022. Automated text summarization with R. Brisbane: The University of Queensland. url: https://slcladal.github.io/txtsum.html (Version 2022.09.13).

@manual{schweinberger2022txtsum,
  author = {Schweinberger, Martin},
  title = {Automated Text Summarization with R},
  note = {https://slcladal.github.io/txtsum.html},
  year = {2022},
  organization = "The University of Queensland, Australia. School of Languages and Cultures},
  address = {Brisbane},
  edition = {2022.09.13}
}
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
##  [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
##  [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
##  [1] here_1.0.1      igraph_1.3.2    quanteda_3.2.1  forcats_0.5.1  
##  [5] stringr_1.4.0   dplyr_1.0.9     purrr_0.3.4     readr_2.1.2    
##  [9] tidyr_1.2.0     tibble_3.1.7    ggplot2_3.3.6   tidyverse_1.3.2
## [13] textmineR_3.0.5 Matrix_1.4-1    lexRankr_0.5.2  rvest_1.0.2    
## [17] xml2_1.3.3     
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.3          sass_0.4.1          jsonlite_1.8.0     
##  [4] modelr_0.1.8        bslib_0.3.1         RcppParallel_5.1.5 
##  [7] assertthat_0.2.1    highr_0.9           selectr_0.4-2      
## [10] renv_0.15.4         googlesheets4_1.0.0 cellranger_1.1.0   
## [13] yaml_2.3.5          pillar_1.7.0        backports_1.4.1    
## [16] lattice_0.20-45     glue_1.6.2          digest_0.6.29      
## [19] colorspace_2.0-3    htmltools_0.5.2     pkgconfig_2.0.3    
## [22] broom_1.0.0         haven_2.5.0         scales_1.2.0       
## [25] tzdb_0.3.0          googledrive_2.0.0   generics_0.1.3     
## [28] ellipsis_0.3.2      withr_2.5.0         klippy_0.0.0.9500  
## [31] cli_3.3.0           magrittr_2.0.3      crayon_1.5.1       
## [34] readxl_1.4.0        evaluate_0.15       stopwords_2.3      
## [37] fs_1.5.2            fansi_1.0.3         SnowballC_0.7.0    
## [40] RcppProgress_0.4.2  tools_4.2.1         hms_1.1.1          
## [43] gargle_1.2.0        lifecycle_1.0.1     munsell_0.5.0      
## [46] reprex_2.0.1        compiler_4.2.1      jquerylib_0.1.4    
## [49] rlang_1.0.4         grid_4.2.1          rmarkdown_2.14     
## [52] gtable_0.3.0        curl_4.3.2          DBI_1.1.3          
## [55] R6_2.5.1            lubridate_1.8.0     knitr_1.39         
## [58] fastmap_1.1.0       utf8_1.2.2          fastmatch_1.1-3    
## [61] rprojroot_2.0.3     stringi_1.7.8       Rcpp_1.0.8.3       
## [64] vctrs_0.4.1         dbplyr_2.2.1        tidyselect_1.1.2   
## [67] xfun_0.31
