Introduction

This tutorial introduces web crawling and web scraping with R. Both are common procedures for collecting text data from social media sites, web pages, or other documents for later analysis. Regarding terminology, the automated download of HTML pages is called crawling, while the extraction of textual data and/or metadata (for example, article date, headline, author names, article text) from the HTML source code (or from the DOM, the document object model of the web page) is called scraping (see Olston and Najork 2010).

This tutorial is aimed at intermediate and advanced users of R and showcases how to crawl and scrape web data using R. The goal is not to provide a fully-fledged analysis but rather to exemplify selected useful methods for crawling and scraping web data.

The entire R Notebook for the tutorial can be downloaded here. If you want to render the R Notebook on your machine, i.e. knit the document to HTML or PDF, you need to have R and RStudio installed, and you also need to download the bibliography file and store it in the same folder as the Rmd file.


This tutorial builds heavily on and uses materials from this tutorial on web crawling and scraping using R by Andreas Niekler and Gregor Wiedemann (see Wiedemann and Niekler 2017). Their tutorial is more thorough, goes into more detail, and covers many more useful text mining methods. An alternative approach to web crawling and scraping is the RCrawler package (Khalil and Fakir 2017), which is not introduced here (inspecting the RCrawler package and its functions is, however, highly recommended). For a more in-depth introduction to web crawling and scraping, see Miner et al. (2012).

Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to and more information on how to use R here. For this tutorial, we need to install certain packages so that the scripts shown below run without errors. If you have already installed the packages listed below, you can skip this section. Otherwise, run the following code to install them; this may take some time (between 1 and 5 minutes), so there is no need to worry if it does not finish immediately.

# install packages
install.packages("rvest")
install.packages("readtext")
install.packages("webdriver")
install.packages("tidyverse")
webdriver::install_phantomjs()
# install klippy for copy-to-clipboard button in code chunks
install.packages("remotes")
remotes::install_github("rlesur/klippy")

The call to webdriver::install_phantomjs() in the code above installs the PhantomJS headless browser. This needs to be done only once.
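If you are unsure which of the packages are already installed, a check along the following lines can help. This is only a minimal sketch: missing_packages is a hypothetical helper name (it is not part of any package), and the actual install call is left commented out so that nothing is installed unintentionally.

```r
# hypothetical helper: returns those package names that are not yet installed
missing_packages <- function(pkgs) {
  installed <- rownames(utils::installed.packages())
  setdiff(pkgs, installed)
}

# packages used in this tutorial
to_install <- missing_packages(c("rvest", "readtext", "webdriver", "tidyverse"))

if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install the missing packages
}
```

Packages that are already present are skipped, so the check can be rerun safely at the start of a new session.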

Now that we have installed the packages (and the PhantomJS headless browser), we can load them as shown below.

# set options
options(stringsAsFactors = FALSE)     # do not convert strings to factors
options("scipen" = 100, "digits" = 4) # suppress scientific notation
# load packages
library(tidyverse)
library(rvest)
library(readtext)
library(webdriver)
# activate klippy for copy-to-clipboard button
klippy::klippy()

Once you have installed R and RStudio and once you have initiated the session by executing the code shown above, you are good to go.

Scraping a single website

For web crawling and scraping, we use the rvest package; to extract text data from various formats such as PDF, DOC, DOCX, and TXT files, we use the readtext package. In a first exercise, we will download a single web page from The Guardian and extract its text together with relevant metadata such as the article date. Let’s define the URL of the article of interest and load the content using the read_html function from the rvest package, which provides very useful functions for web crawling and scraping.

# define url
url <- "https://www.theguardian.com/world/2017/jun/26/angela-merkel-and-donald-trump-head-for-clash-at-g20-summit"
# download content
webc <- rvest::read_html(url)
# inspect
webc
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n                <a href="#maincontent" class="dcr-1y2qbjm">Skip t ...

We download and parse the web page using the read_html function, which accepts a URL as its argument. The function downloads the page and parses the HTML source code into an HTML/XML document object.
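Because read_html stops with an error when a page cannot be read, it can be useful to wrap the call when working with URLs that may be unavailable. The following is a minimal sketch, not part of the original workflow: safe_read_html is a hypothetical helper name, and local files stand in for a reachable and an unreachable URL so that the example runs without a network connection.

```r
library(rvest)

# hypothetical helper (not part of rvest): returns the parsed page,
# or NULL with a message if the page cannot be read
safe_read_html <- function(src) {
  tryCatch(
    rvest::read_html(src),
    error = function(e) {
      message("Could not read ", src, ": ", conditionMessage(e))
      NULL
    }
  )
}

# a local file stands in for a reachable URL
tmp <- tempfile(fileext = ".html")
writeLines("<html><body><p>Hello</p></body></html>", tmp)
ok <- safe_read_html(tmp)

# a path that does not exist stands in for an unreachable URL
failed <- safe_read_html(file.path(tempdir(), "no-such-page.html"))

# check whether the second attempt failed
is.null(failed)
```

Returning NULL instead of stopping makes it easier to skip failed pages later, for example when looping over many URLs.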

However, the output contains a lot of information that we do not really need. Thus, we process the data to extract only the text from the webpage.

webc %>%
  # extract paragraphs
  rvest::html_nodes("p") %>%
  # extract text
  rvest::html_text() -> webtxt
# inspect
head(webtxt)
## [1] "German chancellor plans to make climate change, free trade and mass migration key themes in Hamburg, putting her on collision course with US"                                                                                                        
## [2] "A clash between Angela Merkel and Donald Trump appears unavoidable after Germany signalled that it will make climate change, free trade and the management of forced mass global migration the key themes of the G20 summit in Hamburg next week."   
## [3] "The G20 summit brings together the world’s biggest economies, representing 85% of global gross domestic product (GDP), and Merkel’s chosen agenda looks likely to maximise American isolation while attempting to minimise disunity amongst others. "
## [4] "The meeting, which is set to be the scene of large-scale street protests, will also mark the first meeting between Trump and the Russian president, Vladimir Putin, as world leaders."                                                               
## [5] "Trump has already rowed with Europe once over climate change and refugees at the G7 summit in Italy, and now looks set to repeat the experience in Hamburg but on a bigger stage, as India and China join in the criticism of Washington. "          
## [6] "Last week, the new UN secretary-general, António Guterres, warned the Trump team if the US disengages from too many issues confronting the international community it will be replaced as world leader."

The output shows the first six text elements of the web page, which means that we successfully scraped its text content.
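Note that the selector "p" matches every paragraph on the page, not only those of the article body. The sketch below illustrates how a narrower CSS selector restricts the result. It uses a small toy page built with rvest::minimal_html rather than the Guardian article, and the class names article and footer are assumptions made purely for illustration; on a real page, the appropriate selector has to be found by inspecting the HTML source.

```r
library(rvest)

# toy page (class names are hypothetical, chosen for illustration only)
toy <- rvest::minimal_html('
  <h1>Toy headline</h1>
  <div class="article"><p>First paragraph.</p><p>Second paragraph.</p></div>
  <div class="footer"><p>Footer note.</p></div>
')

# all paragraphs on the page, including the footer
all_p <- toy %>%
  rvest::html_nodes("p") %>%
  rvest::html_text()

# only paragraphs nested inside the element with class "article"
article_p <- toy %>%
  rvest::html_nodes("div.article p") %>%
  rvest::html_text()

# compare how many paragraphs each selector matched
length(all_p)
length(article_p)
```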

We can also extract the headline of the article by running the code shown below.

webc %>%
  # extract headline
  rvest::html_nodes("h1") %>%
  # extract text
  rvest::html_text() -> header
# inspect
head(header)
## [1] "Angela Merkel and Donald Trump head for clash at G20 summit"
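Headline, text, and further metadata can then be combined into a single table. The following is a minimal sketch on a toy page rather than on the Guardian article. It assumes that the publication date is stored in a meta tag with the property article:published_time; this is an assumption for illustration, and the actual tag on a given site needs to be checked in the page source.

```r
library(rvest)

# toy page; the meta tag and its content are invented for illustration
toy <- rvest::minimal_html('
  <meta property="article:published_time" content="2017-06-26T10:00:00Z">
  <h1>Toy headline</h1>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
')

# headline: text of the first h1 element
headline <- toy %>%
  rvest::html_node("h1") %>%
  rvest::html_text()

# date: value of the content attribute of the (assumed) meta tag
date_raw <- toy %>%
  rvest::html_node('meta[property="article:published_time"]') %>%
  rvest::html_attr("content")

# body text: all paragraphs collapsed into one string
body_text <- toy %>%
  rvest::html_nodes("p") %>%
  rvest::html_text() %>%
  paste(collapse = "\n")

# combine everything into a one-row data frame
article <- data.frame(
  headline = headline,
  date = as.Date(substr(date_raw, 1, 10)),
  text = body_text,
  stringsAsFactors = FALSE
)
# inspect
str(article)
```

Storing one article per row makes it straightforward to append further articles when several pages are scraped.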

Citation & Session Info

Schweinberger, Martin. 2022. Web Crawling and Scraping using R. Brisbane: The University of Queensland. url: https://slcladal.github.io/webcrawling.html (Version 2022.05.21).

@manual{schweinberger2022webc,
  author = {Schweinberger, Martin},
  title = {Web Crawling and Scraping using R},
  note = {https://slcladal.github.io/webcrawling.html},
  year = {2022},
  organization = {The University of Queensland, School of Languages and Cultures},
  address = {Brisbane},
  edition = {2022.05.21}
}

sessionInfo()
## R version 4.2.0 (2022-04-22 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8   
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
##  [1] webdriver_1.0.6 readtext_0.81   rvest_1.0.2     forcats_0.5.1  
##  [5] stringr_1.4.0   dplyr_1.0.9     purrr_0.3.4     readr_2.1.2    
##  [9] tidyr_1.2.0     tibble_3.1.7    ggplot2_3.3.6   tidyverse_1.3.1
## 
## loaded via a namespace (and not attached):
##  [1] lubridate_1.8.0   png_0.1-7         ps_1.7.0          assertthat_0.2.1 
##  [5] digest_0.6.29     utf8_1.2.2        showimage_1.0.0   R6_2.5.1         
##  [9] cellranger_1.1.0  backports_1.4.1   reprex_2.0.1      evaluate_0.15    
## [13] httr_1.4.3        highr_0.9         pillar_1.7.0      rlang_1.0.2      
## [17] curl_4.3.2        readxl_1.4.0      rstudioapi_0.13   data.table_1.14.2
## [21] callr_3.7.0       jquerylib_0.1.4   klippy_0.0.0.9500 rmarkdown_2.14   
## [25] selectr_0.4-2     munsell_0.5.0     broom_0.8.0       compiler_4.2.0   
## [29] modelr_0.1.8      xfun_0.30         base64enc_0.1-3   pkgconfig_2.0.3  
## [33] htmltools_0.5.2   tidyselect_1.1.2  fansi_1.0.3       crayon_1.5.1     
## [37] tzdb_0.3.0        dbplyr_2.1.1      withr_2.5.0       grid_4.2.0       
## [41] jsonlite_1.8.0    gtable_0.3.0      lifecycle_1.0.1   DBI_1.1.2        
## [45] magrittr_2.0.3    scales_1.2.0      debugme_1.1.0     cli_3.3.0        
## [49] stringi_1.7.6     renv_0.15.4       fs_1.5.2          xml2_1.3.3       
## [53] bslib_0.3.1       ellipsis_0.3.2    generics_0.1.2    vctrs_0.4.1      
## [57] tools_4.2.0       glue_1.6.2        hms_1.1.1         processx_3.5.3   
## [61] fastmap_1.1.0     yaml_2.3.5        colorspace_2.0-3  knitr_1.39       
## [65] haven_2.5.0       sass_0.4.1

References

Khalil, Salim, and Mohamed Fakir. 2017. “RCrawler: An R Package for Parallel Web Crawling and Scraping.” SoftwareX 6: 98–106.
Miner, Gary, John Elder IV, Andrew Fast, Thomas Hill, Robert Nisbet, and Dursun Delen. 2012. Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications. Academic Press.
Olston, Christopher, and Marc Najork. 2010. Web Crawling. Now Publishers Inc.
Wiedemann, Gregor, and Andreas Niekler. 2017. “Hands-on: A Five Day Text Mining Course for Humanists and Social Scientists in R.” In Proceedings of the Workshop on Teaching NLP for Digital Humanities (Teach4DH), Berlin, Germany, September 12, 2017, 57–65. http://ceur-ws.org/Vol-1918/wiedemann.pdf.