Introduction

This tutorial shows how to download and clean works from the Project Gutenberg archive using R. Project Gutenberg is a data base which contains roughly 60,000 texts for which the US copyright has expired. The entire R-markdown document for the sections below can be downloaded here.

Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to R and more information on how to use it here. For this tutorial, we need to install certain packages so that the scripts shown below run without errors. If you have already installed the packages mentioned below, you can skip this section. Otherwise, install them by running the following code. Installation may take between 1 and 5 minutes, so there is no need to worry if it takes some time.

# clean current workspace
rm(list = ls(all.names = TRUE))
# set options
options(stringsAsFactors = FALSE)     # no automatic data transformation
options("scipen" = 100, "digits" = 4) # suppress scientific notation
# install packages (flextable and officer are loaded further below)
install.packages(c("dplyr", "gutenbergr", "stringr", "flextable", "officer"))
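If some of the packages are already present, it can help to check first which ones are still missing. The following is a minimal sketch in base R; the helper name missing_packages is chosen for this illustration and is not part of any package.

```r
# Return the subset of `pkgs` that is not yet installed.
# `missing_packages` is a helper name made up for this sketch.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# packages that are loaded later in this tutorial
needed <- c("dplyr", "gutenbergr", "stringr", "flextable", "officer")
to_install <- missing_packages(needed)

if (length(to_install) > 0) {
  message("Still to install: ", paste(to_install, collapse = ", "))
} else {
  message("All packages are already installed.")
}
```

The vector to_install can then be passed to install.packages() so that only the missing packages are downloaded.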

Once you have installed RStudio and set up the session by running the code shown above, you are good to go.

1 Loading texts from Project Gutenberg

As a first step, we load the necessary packages. To download and work with texts from Project Gutenberg, you specifically need the gutenbergr package.

# activate packages
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(gutenbergr)
library(stringr)
library(flextable)
library(officer)

To inspect which works are available for download, we can simply type gutenberg_metadata, which will output a table containing all available texts.

gutenberg_metadata

The table below shows the first 10 rows of this overview table. As there are currently 51,997 texts available, we limit the output here to 10.

First 10 texts in the overview of texts available in the Gutenberg data base.

| gutenberg_id | title | author | gutenberg_author_id | language | gutenberg_bookshelf | rights | has_text |
|---|---|---|---|---|---|---|---|
| 0 | | | NA | en | | Public domain in the USA. | TRUE |
| 1 | The Declaration of Independence of the United States of America | Jefferson, Thomas | 1638 | en | United States Law/American Revolutionary War/Politics | Public domain in the USA. | TRUE |
| 2 | The United States Bill of Rights: The Ten Original Amendments to the Constitution of the United States | United States | 1 | en | American Revolutionary War/Politics/United States Law | Public domain in the USA. | TRUE |
| 3 | John F. Kennedy's Inaugural Address | Kennedy, John F. (John Fitzgerald) | 1666 | en | | Public domain in the USA. | TRUE |
| 4 | Lincoln's Gettysburg Address: Given November 19, 1863 on the battlefield near Gettysburg, Pennsylvania, USA | Lincoln, Abraham | 3 | en | US Civil War | Public domain in the USA. | TRUE |
| 5 | The United States Constitution | United States | 1 | en | American Revolutionary War/Politics/United States/United States Law | Public domain in the USA. | TRUE |
| 6 | Give Me Liberty or Give Me Death | Henry, Patrick | 4 | en | American Revolutionary War | Public domain in the USA. | TRUE |
| 7 | The Mayflower Compact | | NA | en | | Public domain in the USA. | TRUE |
| 8 | Abraham Lincoln's Second Inaugural Address | Lincoln, Abraham | 3 | en | US Civil War | Public domain in the USA. | TRUE |
| 9 | Abraham Lincoln's First Inaugural Address | Lincoln, Abraham | 3 | en | US Civil War | Public domain in the USA. | TRUE |
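Because the overview table is large, it is usually more practical to narrow it down than to print it in full. The sketch below uses a small invented data frame with some of the same column names as the overview table, so that it runs without the gutenbergr package; the rows are toy data, not real Gutenberg entries. With gutenbergr loaded, the same subsetting expression should work on gutenberg_metadata itself.

```r
# toy stand-in for the overview table (invented rows, real column names)
meta <- data.frame(
  gutenberg_id = c(1, 2, 3, 4),
  title        = c("Text A", "Text B", "Text C", "Text D"),
  language     = c("en", "de", "en", "en"),
  has_text     = c(TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# keep only English works that actually have a downloadable text,
# and show just the id and title columns
english_with_text <- meta[meta$language == "en" & meta$has_text,
                          c("gutenberg_id", "title")]
english_with_text
```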

To find all works by a specific author, you need to specify the author in the gutenberg_works function as shown below.

# load data
darwin <- gutenberg_works(author == "Darwin, Charles")
# inspect data
ft <- flextable(darwin)
ft <- set_caption(ft, "All texts of Charles Darwin available through Project Gutenberg.")
ft <- autofit(ft)
ft
All texts of Charles Darwin available through Project Gutenberg.

| gutenberg_id | title | author | gutenberg_author_id | language | gutenberg_bookshelf | rights | has_text |
|---|---|---|---|---|---|---|---|
| 944 | The Voyage of the Beagle | Darwin, Charles | 485 | en | Travel/Harvard Classics | Public domain in the USA. | TRUE |
| 1227 | The Expression of the Emotions in Man and Animals | Darwin, Charles | 485 | en | | Public domain in the USA. | TRUE |
| 1228 | On the Origin of Species By Means of Natural Selection: Or, the Preservation of Favoured Races in the Struggle for Life | Darwin, Charles | 485 | en | Harvard Classics/Biology/Banned Books from Anne Haight's list | Public domain in the USA. | TRUE |
| 2009 | The Origin of Species by Means of Natural Selection: Or, the Preservation of Favoured Races in the Struggle for Life, 6th Edition | Darwin, Charles | 485 | en | Harvard Classics/Biology | Public domain in the USA. | TRUE |
| 2010 | The Autobiography of Charles Darwin | Darwin, Charles | 485 | en | | Public domain in the USA. | TRUE |
| 2087 | Life and Letters of Charles Darwin — Volume 1 | Darwin, Charles | 485 | en | | Public domain in the USA. | TRUE |
| 2088 | Life and Letters of Charles Darwin — Volume 2 | Darwin, Charles | 485 | en | | Public domain in the USA. | TRUE |
| 2300 | The Descent of Man, and Selection in Relation to Sex | Darwin, Charles | 485 | en | | Public domain in the USA. | TRUE |
| 2355 | The Formation of Vegetable Mould Through the Action of Worms: With Observations on Their Habits | Darwin, Charles | 485 | en | | Public domain in the USA. | TRUE |
| 2485 | The Movements and Habits of Climbing Plants | Darwin, Charles | 485 | en | Botany | Public domain in the USA. | TRUE |
| 2690 | Coral Reefs | Darwin, Charles | 485 | en | | Public domain in the USA. | TRUE |
| 2739 | More Letters of Charles Darwin — Volume 1: A Record of His Work in a Series of Hitherto Unpublished Letters | Darwin, Charles | 485 | en | | Public domain in the USA. | TRUE |
| 2740 | More Letters of Charles Darwin — Volume 2: A Record of His Work in a Series of Hitherto Unpublished Letters | Darwin, Charles | 485 | en | | Public domain in the USA. | TRUE |
| 2871 | The Variation of Animals and Plants under Domestication — Volume 1 | Darwin, Charles | 485 | en | Animals-Domestic/Botany | Public domain in the USA. | TRUE |
| 2872 | The Variation of Animals and Plants under Domestication — Volume 2 | Darwin, Charles | 485 | en | Botany/Animals-Domestic | Public domain in the USA. | TRUE |
| 3054 | Volcanic Islands | Darwin, Charles | 485 | en | | Public domain in the USA. | TRUE |
| 3620 | Geological Observations on South America | Darwin, Charles | 485 | en | South America | Public domain in the USA. | TRUE |
| 3704 | Journal of Researches into the Natural History and Geology of the Countries Visited During the Voyage Round the World of H.M.S. Beagle Under the Command of Captain Fitz Roy, R.N. | Darwin, Charles | 485 | en | Travel/Harvard Classics | Public domain in the USA. | TRUE |
| 3807 | The Different Forms of Flowers on Plants of the Same Species | Darwin, Charles | 485 | en | Botany | Public domain in the USA. | TRUE |
| 4346 | The Effects of Cross & Self-Fertilisation in the Vegetable Kingdom | Darwin, Charles | 485 | en | Botany | Public domain in the USA. | TRUE |
| 5765 | Insectivorous Plants | Darwin, Charles | 485 | en | Botany | Public domain in the USA. | TRUE |
| 22728 | The Foundations of the Origin of Species: Two Essays written in 1842 and 1844 | Darwin, Charles | 485 | en | Biology | Public domain in the USA. | TRUE |
| 22764 | On the Origin of Species by Means of Natural Selection: or the Preservation of Favoured Races in the Struggle for Life. (2nd edition) | Darwin, Charles | 485 | en | Banned Books from Anne Haight's list/Biology/Best Books Ever Listings | Public domain in the USA. | TRUE |
| 24923 | The Variation of Animals and Plants Under Domestication, Vol. I. | Darwin, Charles | 485 | en | Animals-Domestic | Public domain in the USA. | TRUE |
| 28897 | The Variation of Animals and Plants Under Domestication, Volume II (of 2) | Darwin, Charles | 485 | en | Animals-Domestic | Public domain in the USA. | TRUE |
| 31558 | A Monograph on the Sub-class Cirripedia (Volume 1 of 2): The Lepadidae; Or, Pedunculated Cirripedes | Darwin, Charles | 485 | en | Animal/Animals-Wild | Public domain in the USA. | TRUE |
| 34967 | The Descent of Man and Selection in Relation to Sex, Vol. I | Darwin, Charles | 485 | en | | Public domain in the USA. | TRUE |
| 36520 | The Descent of Man and Selection in Relation to Sex, Vol. II (1st Edition) | Darwin, Charles | 485 | en | | Public domain in the USA. | TRUE |
| 38629 | Charles Darwin: His Life Told in an Autobiographical Chapter, and in a Selected Series of His Published Letters | Darwin, Charles | 485 | en | | Public domain in the USA. | TRUE |
| 46408 | A Monograph on the Sub-class Cirripedia (Volume 2 of 2): The Balanidæ, (or Sessile Cirripedes); the Verrucidæ, etc., etc. | Darwin, Charles | 485 | en | | Public domain in the USA. | TRUE |
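Note that authors are stored in the form "Surname, Forename", so an exact comparison with == only matches the complete string. The following base R sketch contrasts exact and partial matching on a small illustrative vector; the author strings other than "Darwin, Charles" are assumptions made up for the example and may not match how these names appear in the Gutenberg metadata.

```r
# illustrative author strings in "Surname, Forename" form (toy data)
authors <- c("Darwin, Charles", "Darwin, Erasmus",
             "Darwin, Francis, Sir", "Shakespeare, William")

# exact match: only the complete string "Darwin, Charles" qualifies
exact <- authors[authors == "Darwin, Charles"]

# partial match: every entry whose surname is Darwin
partial <- authors[grepl("^Darwin,", authors)]

exact
partial
```

Since gutenberg_works accepts a logical condition on the author column, a partial match such as grepl("^Darwin,", author) could presumably be used there in place of the exact comparison.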

To find all texts in a specific language, for example German, you need to specify the language in the gutenberg_works function as shown below. Here, we count the matching texts rather than listing them.

# load data
german <- gutenberg_works(languages = "de", all_languages = TRUE) %>%
  count(language, sort = TRUE)
# inspect data
ft <- flextable(head(german, 15))
ft <- set_caption(ft, "Number of texts in German available through Project Gutenberg.")
ft <- fit_to_width(ft, max_width = 6)
ft
Number of texts in German available through Project Gutenberg.

| language | n |
|---|---|
| de | 1342 |
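Some works are written in more than one language. The sketch below shows, in base R, how "contains German" differs from "is in German only". It assumes that multi-language entries are stored as slash-separated codes such as "de/en"; this format and the values in the vector are assumptions for illustration, not taken from the output above.

```r
# toy language column; "de/en" stands for a bilingual work (assumed format)
language <- c("de", "en", "de/en", "de", "fr")

# split each entry into its individual language codes
codes <- strsplit(language, "/", fixed = TRUE)

# works that include German among their languages
has_german <- vapply(codes, function(x) "de" %in% x, logical(1))

# works whose only language is German
only_german <- language == "de"

sum(has_german)
sum(only_german)
```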

2 Loading individual texts

To download any of these texts, you need to identify the text you want, e.g. by its title. In a next step, you can then use the gutenberg_download function to download the text. To exemplify how this works, we download William Shakespeare’s Romeo and Juliet.

# load data
romeo <- gutenberg_works(title == "Romeo and Juliet") %>%
  gutenberg_download(meta_fields = "title")
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
# inspect data
ft <- flextable(head(romeo, 15))
ft <- set_caption(ft, "First 15 lines of William Shakespeare's *Romeo and Juliet*.")
ft <- autofit(ft)
ft
First 15 lines of William Shakespeare's *Romeo and Juliet*.

| gutenberg_id | text | title |
|---|---|---|
| 1513 | ROMEO AND JULIET | Romeo and Juliet |
| 1513 | | Romeo and Juliet |
| 1513 | by William Shakespeare | Romeo and Juliet |
| 1513 | | Romeo and Juliet |
| 1513 | | Romeo and Juliet |
| 1513 | | Romeo and Juliet |
| 1513 | | Romeo and Juliet |
| 1513 | PERSONS REPRESENTED | Romeo and Juliet |
| 1513 | | Romeo and Juliet |
| 1513 | Escalus, Prince of Verona. | Romeo and Juliet |
| 1513 | Paris, a young Nobleman, kinsman to the Prince. | Romeo and Juliet |
| 1513 | Montague,}Heads of two Houses at variance with each other. | Romeo and Juliet |
| 1513 | Capulet, } | Romeo and Juliet |
| 1513 | An Old Man, Uncle to Capulet. | Romeo and Juliet |
| 1513 | Romeo, Son to Montague. | Romeo and Juliet |

We could also use the gutenberg_id to download this text.

# load data
romeo <- gutenberg_works(gutenberg_id == "1513") %>%
  gutenberg_download(meta_fields = "gutenberg_id")
# inspect data
ft <- flextable(head(romeo, 15))
ft <- set_caption(ft, "First 15 lines of William Shakespeare's *Romeo and Juliet*.")
ft <- autofit(ft)
ft
First 15 lines of William Shakespeare's *Romeo and Juliet*.

| gutenberg_id | text |
|---|---|
| 1513 | ROMEO AND JULIET |
| 1513 | |
| 1513 | by William Shakespeare |
| 1513 | |
| 1513 | |
| 1513 | |
| 1513 | |
| 1513 | PERSONS REPRESENTED |
| 1513 | |
| 1513 | Escalus, Prince of Verona. |
| 1513 | Paris, a young Nobleman, kinsman to the Prince. |
| 1513 | Montague,}Heads of two Houses at variance with each other. |
| 1513 | Capulet, } |
| 1513 | An Old Man, Uncle to Capulet. |
| 1513 | Romeo, Son to Montague. |
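As the output shows, a downloaded text comes as one row per line, including many empty lines. A first cleaning step is therefore often to drop the empty rows and, if needed, to collapse the remaining lines into a single string. The sketch below does this in base R on a small data frame that copies a few of the lines shown above; the object name romeo_toy is made up for this example, and the same steps should carry over to the downloaded romeo object.

```r
# a few lines copied from the output above (toy stand-in for `romeo`)
romeo_toy <- data.frame(
  gutenberg_id = 1513,
  text = c("ROMEO AND JULIET", "", "by William Shakespeare", "", "",
           "PERSONS REPRESENTED", "", "Escalus, Prince of Verona."),
  stringsAsFactors = FALSE
)

# drop rows whose text is empty or consists of whitespace only
nonempty <- romeo_toy[trimws(romeo_toy$text) != "", ]

# collapse the remaining lines into one string, separated by spaces
full_text <- paste(nonempty$text, collapse = " ")

nrow(nonempty)
full_text
```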

3 Loading texts simultaneously

To load more than one text, you can combine conditions with the | (or) operator. A row is kept if either condition is true, so the code below downloads both the text with the gutenberg_id 1513 and the text with the gutenberg_id 1.

# load data
texts <- gutenberg_works(gutenberg_id == "1513" | gutenberg_id == "1") %>%
  gutenberg_download(meta_fields = "title")
# inspect data
ft <- flextable(as.data.frame(table(texts$gutenberg_id)))
ft <- set_caption(ft, "Texts loaded from Project Gutenberg.")
ft <- fit_to_width(ft, max_width = 6)
ft
Texts loaded from Project Gutenberg.

| Var1 | Freq |
|---|---|
| 1 | 2053 |
| 1513 | 5268 |
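When several texts are downloaded together, all lines end up in one table and are only distinguished by their gutenberg_id. A minimal base R sketch of how the lines could be separated again by text is given below; texts_toy and its contents are invented stand-ins for the downloaded texts object.

```r
# toy stand-in for `texts`: lines from two works in one table
texts_toy <- data.frame(
  gutenberg_id = c(1, 1, 1513, 1513, 1513),
  text = c("line a", "line b", "line c", "line d", "line e"),
  stringsAsFactors = FALSE
)

# one character vector of lines per gutenberg_id
by_id <- split(texts_toy$text, texts_toy$gutenberg_id)

# number of lines per text
lengths(by_id)

# lines of a single text, selected by its id (as a character name)
by_id[["1513"]]
```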

Feel free to have a look at different texts provided by the Project Gutenberg!

Citation & Session Info

Schweinberger, Martin. 2020. Downloading Texts from Project Gutenberg using R. Brisbane: The University of Queensland. url: https://slcladal.github.io/gutenberg.html (Version 2020.09.29).

@manual{schweinberger2020gb,
  author = {Schweinberger, Martin},
  title = {Downloading Texts from Project Gutenberg using R},
  note = {https://slcladal.github.io/gutenberg.html},
  year = {2020},
  organization = {The University of Queensland, Australia. School of Languages and Cultures},
  address = {Brisbane},
  edition = {2020/10/14}
}
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] officer_0.3.14   flextable_0.5.11 stringr_1.4.0    gutenbergr_0.2.0
## [5] dplyr_1.0.2     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.5        xml2_1.3.2        knitr_1.30        magrittr_1.5     
##  [5] hms_0.5.3         uuid_0.1-4        tidyselect_1.1.0  R6_2.4.1         
##  [9] rlang_0.4.7       tools_4.0.2       data.table_1.13.0 xfun_0.16        
## [13] systemfonts_0.3.2 htmltools_0.5.0   ellipsis_0.3.1    lazyeval_0.2.2   
## [17] yaml_2.2.1        digest_0.6.25     tibble_3.0.3      lifecycle_0.2.0  
## [21] crayon_1.3.4      zip_2.1.1         readr_1.4.0       purrr_0.3.4      
## [25] base64enc_0.1-3   vctrs_0.3.4       triebeard_0.3.0   curl_4.3         
## [29] glue_1.4.2        evaluate_0.14     rmarkdown_2.4     stringi_1.5.3    
## [33] compiler_4.0.2    pillar_1.4.6      urltools_1.7.3    gdtools_0.2.2    
## [37] generics_0.0.2    pkgconfig_2.0.3
