Introduction

This tutorial introduces how to extract and process text data from social media sites, web pages, or other documents for later analysis. The entire R markdown document for the present tutorial can be downloaded here. This tutorial builds heavily on and uses materials from this tutorial on web crawling and scraping using R by Andreas Niekler and Gregor Wiedemann (see Wiedemann and Niekler 2017). The tutorial by Andreas Niekler and Gregor Wiedemann is more thorough, goes into more detail than this tutorial, and covers many more very useful text mining methods. An alternative approach for web crawling and scraping would be to use the RCrawler package (Khalil and Fakir 2017), which is not introduced here (inspecting the RCrawler package and its functions is, however, also highly recommended). For a more in-depth introduction to web crawling and scraping, Miner et al. (2012) is a very useful resource.

The automated download of HTML pages is called Crawling. The extraction of the textual data and/or metadata (for example, article date, headlines, author names, article text) from the HTML source code (or the DOM, the document object model of the website) is called Scraping (see Olston and Najork 2010).

Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to and more information on how to use R here. For this tutorial, we need to install certain packages from an R library so that the scripts shown below are executed without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, then you can skip this section. To install the necessary packages, simply run the following code - it may take some time (between 1 and 5 minutes to install all of the libraries), so you do not need to worry if it takes a while.

# clean current workspace
rm(list=ls(all=T))
# set options
options(stringsAsFactors = F)
# install libraries
install.packages(c("rvest", "readtext"))

Getting started

For web crawling and scraping we use the rvest package, and to extract text data from various formats such as PDF, DOC, DOCX and TXT files we use the readtext package. The tasks described in this section consist of:

  1. Download a single web page and extract its content

  2. Extract links from an overview page and extract the articles

  3. Extract text data from PDF and other formats on disk

Crawl single webpage

As a first step, we will download a single web page from The Guardian and extract text together with relevant metadata such as the article date. Let’s define the URL of the article of interest and load the rvest package, which provides very useful functions for web crawling and scraping.

url <- "https://www.theguardian.com/world/2017/jun/26/angela-merkel-and-donald-trump-head-for-clash-at-g20-summit"
require("rvest")

A convenient method to download and parse a webpage is provided by the function read_html, which accepts a URL as its main argument. The function downloads the page and interprets the HTML source code as an HTML or XML object.

html_document <- read_html(url)

HTML and XML objects are a structured representation of HTML/XML source code, which allows us to extract single elements (e.g. headlines <h1>, paragraphs <p>, links <a>, …), their attributes (e.g. <a href="http://...">) or the text wrapped in between elements (e.g. <p>my text...</p>). Elements can be extracted from XML objects with XPath expressions.
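
As a brief illustration of element and attribute extraction (a minimal sketch that simply assumes the page contains ordinary <a> link elements), the following lines select all link elements of the downloaded page via the CSS selector "a" and read out their href attributes:

# illustration only: select all link elements of the page via the CSS
# selector "a" and read out their href attribute
link_nodes <- html_nodes(html_document, "a")
link_urls <- html_attr(link_nodes, name = "href")
# view the first three extracted link targets
head(link_urls, 3)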

XPath (see here) is a query language for selecting nodes in XML tree structures; we use it to select the headline element from the HTML page.

The following XPath expression queries for first-order headline elements (h1) anywhere in the tree (//) which fulfill a certain condition ([...]), namely that the class attribute of the h1 element must contain the value content__headline.

The next expression uses the R pipe operator %>%, which takes the input from the left side of the expression and passes it on to the function on the right side as its first argument. The result of this function is either passed on to the next function via a pipe (%>%) or assigned to a variable if it is the last operation in the pipe. Our pipe takes the html_document object and passes it to the html_node function, which extracts the first node fitting the given XPath expression. The resulting node object is passed to the html_text function, which extracts the text wrapped in the h1 element.

title_xpath <- "//h1[contains(@class, 'content__headline')]"
title_text <- html_document %>%
  html_node(xpath = title_xpath) %>%
  html_text(trim = T)

Let’s see what title_text contains:

cat(title_text)
## Angela Merkel and Donald Trump head for clash at G20 summit

Now we modify the XPath expressions to extract the article info, the paragraphs of the body text, and the article date. Note that there are multiple paragraphs in the article. To extract not only the first but all paragraphs, we utilize the html_nodes function and glue the resulting single text vectors of each paragraph together with the paste0 function.

intro_xpath <- "//div[contains(@class, 'content__standfirst')]//p"
intro_text <- html_document %>%
  html_node(xpath = intro_xpath) %>%
  html_text(trim = T)
cat(intro_text)
## German chancellor plans to make climate change, free trade and mass migration key themes in Hamburg, putting her on collision course with US
body_xpath <- "//div[contains(@class, 'content__article-body')]//p"
body_text <- html_document %>%
  html_nodes(xpath = body_xpath) %>%
  html_text(trim = T) %>%
  paste0(collapse = "\n")

Now, let’s inspect the first 150 characters of the text body.

cat(substr(body_text, 0, 150))
## A clash between Angela Merkel and Donald Trump appears unavoidable after Germany signalled that it will make climate change, free trade and the manage

We now extract the date from the html document.

date_xpath <- "//time"
date_object <- html_document %>%
  html_node(xpath = date_xpath) %>%
  html_attr(name = "datetime") %>%
  as.Date()
cat(format(date_object, "%Y-%m-%d"))
## 2017-06-26

The variables title_text, intro_text, body_text and date_object now contain the raw data for any subsequent text processing.
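
For instance, the extracted pieces can be collected in a single data.frame for later processing (a minimal sketch; the object and column names used here are merely illustrative):

# collect the extracted pieces in one data.frame
# (the object and column names are illustrative)
article_df <- data.frame(
  date  = date_object,
  title = title_text,
  intro = intro_text,
  body  = body_text
)
# inspect the structure of the resulting data.frame
str(article_df)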

Read various file formats

In case you already have a collection of text data files on your disk, you can import them into R in a very convenient way using the readtext package. The package depends on some other programs or libraries on your system to provide extraction for Word and PDF documents, so there may be some hurdles when installing the package.

But once it is successfully installed, it allows for very easy extraction of text data from various file formats. First, we request a list of files in the directory to extract text from. For demonstration purposes, we provide in data/documents a random selection of files in various text formats.

data_files <- list.files(path = "data/documents", full.names = T, recursive = T)
# View first file paths
head(data_files, 3)
## character(0)

The readtext function from the package of the same name automatically detects the file formats of the given file list and extracts the content into a data.frame. The parameter docvarsfrom allows you to set metadata variables by splitting the path names. In our example, docvar3 contains a source type variable derived from the subfolder name.

require(readtext)
extracted_texts <- readtext(data_files, docvarsfrom = "filepaths", dvsep = "/")
# View first rows of the extracted texts
head(extracted_texts)
# View beginning of tenth extracted text
substr(trimws(extracted_texts$text[10]), 0, 100)

Again, the extracted_texts can be written to disk with write.csv2 for later use.
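
A minimal example of such a call could look as follows (the output file name is, of course, just an assumption):

# write the extracted texts to disk for later use
# (the file name is only an example)
write.csv2(extracted_texts, file = "data/text_extracts.csv")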

Citation & Session Info

Schweinberger, Martin. 2020. Web Crawling and Scraping using R. Brisbane: The University of Queensland. url: https://slcladal.github.io/webcrawling.html (Version 2020.09.29).

@manual{schweinberger2020webc,
  author = {Schweinberger, Martin},
  title = {Web Crawling and Scraping using R},
  note = {https://slcladal.github.io/webcrawling.html},
  year = {2020},
  organization = "The University of Queensland, School of Languages and Cultures},
  address = {Brisbane},
  edition = {2020/09/29}
}
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rvest_0.3.6 xml2_1.3.2 
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.25   R6_2.4.1        magrittr_1.5    evaluate_0.14  
##  [5] httr_1.4.2      rlang_0.4.7     stringi_1.5.3   curl_4.3       
##  [9] rmarkdown_2.3   tools_4.0.2     stringr_1.4.0   xfun_0.16      
## [13] yaml_2.2.1      compiler_4.0.2  htmltools_0.5.0 knitr_1.30

References


Khalil, Salim, and Mohamed Fakir. 2017. “RCrawler: An R Package for Parallel Web Crawling and Scraping.” SoftwareX 6: 98–106.

Miner, Gary, John Elder IV, Andrew Fast, Thomas Hill, Robert Nisbet, and Dursun Delen. 2012. Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications. Academic Press.

Olston, Christopher, and Marc Najork. 2010. Web Crawling. Now Publishers Inc.

Wiedemann, Gregor, and Andreas Niekler. 2017. “Hands-on: A Five Day Text Mining Course for Humanists and Social Scientists in R.” In Proceedings of the Workshop on Teaching NLP for Digital Humanities (Teach4DH2017), Berlin, Germany, September 12, 2017., 57–65. http://ceur-ws.org/Vol-1918/wiedemann.pdf.