---
output: github_document
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  cache = TRUE
)

library(tidyverse)
```

# R Text Data Compilation

The goal of this repository is to act as a collection of textual data sets to be used for training and practice in text mining/NLP in R. This repository is not a guide on how to do text analysis/mining, but rather on how to get a data set to start with, with minimal hassle.

# Table of Contents

- [Main page](#r-text-data-compilation)
- [CRAN packages](#cran-packages)
  - [janeaustenr](#janeaustenr)
  - [quRan](#quran)
  - [scriptuRs](#scripturs)
  - [friends](#friends)
  - [hcandersenr](#hcandersenr)
  - [proustr](#proustr)
  - [schrute](#schrute)
  - [textdata](#textdata)
  - [gutenbergr](#gutenbergr)
  - [text2vec](#text2vec)
  - [epubr](#epubr)
- [GitHub packages](#github-packages)
  - [appa](#appa)
  - [sacred](#sacred)
  - [harrypotter](#harrypotter)
  - [hgwellsr](#hgwellsr)
  - [jeeves](#jeeves)
  - [koanr](#koanr)
  - [sherlock](#sherlock)
  - [rperseus](#rperseus)
  - [tidygutenbergr](#tidygutenbergr)
  - [subtools](#subtools)
- [Tidytuesday](#tidytuesday)
- [Wild data](#wild-data)
  - Cornell data
    - [polarity dataset v2.0](#polarity-dataset-v20)
    - [sentence polarity dataset v1.0](#sentence-polarity-dataset-v10)
    - [scale dataset v1.0](#scale-dataset-v10)
    - [subjectivity dataset v1.0](#subjectivity-dataset-v10)
  - [SouthParkData](#southparkdata)
  - [Saudi Newspapers Corpus](#saudi-newspapers-corpus)

## CRAN packages

### janeaustenr

First we have the [janeaustenr](https://github.com/juliasilge/janeaustenr) package popularized by Julia Silge in [tidytextmining](https://www.tidytextmining.com/).

```{r}
#install.packages("janeaustenr")
library(janeaustenr)
```

`janeaustenr` includes 6 books: `emma`, `mansfieldpark`, `northangerabbey`, `persuasion`, `prideprejudice`, and `sensesensibility`, each formatted as a character vector with elements of about 70 characters.

```{r}
head(emma, n = 15)
```

All the books can also be found combined into one data.frame returned by the function `austen_books()`:

```{r}
dplyr::glimpse(austen_books())
```
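Since the goal of this repository is text mining practice, a natural first step is tokenization. Below is a minimal sketch, assuming the tidytext package (which is not otherwise used in this document) is installed:

```{r}
# A sketch: split the combined books into one word per row with tidytext.
library(tidytext)

austen_words <- austen_books() %>%
  unnest_tokens(word, text)

head(austen_words)
```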

Examples:

Still pending.

### quRan

The [quRan](https://github.com/andrewheiss/quRan) package contains the complete text of the Qur'an in Arabic (with and without vowels) and in English (the Yusuf Ali and Saheeh International translations).

```{r}
#install.packages("quRan")
library(quRan)
```

```{r}
dplyr::glimpse(quran_ar)
```

Examples:

[Twitter thread](https://twitter.com/andrewheiss/status/1078428352577327104)

### scriptuRs

The [scriptuRs](https://github.com/andrewheiss/scriptuRs) package contains the full text of the Standard Works for The Church of Jesus Christ of Latter-day Saints: the Old and New Testaments, the Book of Mormon, the Doctrine and Covenants, and the Pearl of Great Price. Each volume is in a data frame with a row for each verse, along with 19 columns of detailed metadata.

```{r}
#install.packages("scriptuRs")
library(scriptuRs)
```

```{r}
dplyr::glimpse(scriptuRs::book_of_mormon)
```

Examples:

- [Tidy text, parts of speech, and unique words in the Bible](https://www.andrewheiss.com/blog/2018/12/26/tidytext-pos-john/)

### friends

The goal of [friends](https://github.com/emilhvitfeldt/friends) is to provide the complete script transcription of the [Friends](https://en.wikipedia.org/wiki/Friends) sitcom. The data originates from the [Character Mining](https://github.com/emorynlp/character-mining) repository, which includes references to scientific explorations using this data. This package simply provides the data in tibble format instead of JSON files.

```{r}
#install.packages("friends")
library(friends)
```

The data set includes the full transcription, line by line with metadata about episode number, season number, character, and more.

```{r}
dplyr::glimpse(friends)
```

Additional data sets are included with more metadata.

```{r}
dplyr::glimpse(friends_emotions)

dplyr::glimpse(friends_entities)

dplyr::glimpse(friends_info)
```

### hcandersenr

The [hcandersenr](https://github.com/emilhvitfeldt/hcandersenr) package includes many of H.C. Andersen's fairy tales in 5 different languages.

```{r}
#install.packages("hcandersenr")
library(hcandersenr)
```

The fairy tales are found in the following data frames: `hcandersen_en`, `hcandersen_da`, `hcandersen_de`, `hcandersen_es`, and `hcandersen_fr` for the English, Danish, German, Spanish, and French versions respectively. Please be advised that not all fairy tales are available in all languages in this package.

```{r}
dplyr::glimpse(hcandersen_en)
```

All the fairy tales are collected in the following data.frame:

```{r}
dplyr::glimpse(hca_fairytales())
```

Examples:

Still pending.

### proustr

This [proustr](https://github.com/ColinFay/proustr) package gives you access to tools designed for Natural Language Processing in French.

```{r}
#install.packages("proustr")
library(proustr)
```

Furthermore, it includes the following 7 books:

- Du côté de chez Swann (1913): `ducotedechezswann`.
- À l'ombre des jeunes filles en fleurs (1919): `alombredesjeunesfillesenfleurs`.
- Le Côté de Guermantes (1921): `lecotedeguermantes`.
- Sodome et Gomorrhe (1922): `sodomeetgomorrhe`.
- La Prisonnière (1923): `laprisonniere`.
- Albertine disparue (1925, also known as La Fugitive): `albertinedisparue`.
- Le Temps retrouvé (1927): `letempretrouve`.

All of them are combined in the `proust_books()` function:

```{r}
dplyr::glimpse(proust_books())
```

### schrute

This [schrute](https://github.com/bradlindblad/schrute) package contains the complete script transcription for The Office (US) television show.

```{r}
#install.packages("schrute")
library(schrute)
```

The data set includes the full transcription, line by line with metadata about episode number, season number, character, and more.

```{r}
glimpse(theoffice)
```

Examples:

- [Tidy Tuesday screencast: analyzing ratings and scripts from The Office](https://www.youtube.com/watch?v=_IvAubTDQME&t=1092s)
- [Lasso regression with tidymodels and The Office](https://www.youtube.com/watch?v=R32AsuKICAY)
- [tidytuesday: Part-of-Speech and textrecipes with The Office](https://www.emilhvitfeldt.com/post/tidytuesday-pos-textrecipes-the-office/)

### textdata

The goal of [textdata](https://github.com/emilhvitfeldt/textdata) is to provide easy access to text-related data sets without bundling them inside a package. Some text data sets are too large to store within an R package, or are licensed in such a way that prevents them from being included in an OSS-licensed package. Instead, this package provides a framework to download, parse, and store the data sets on disk and load them when needed.

```{r}
#install.packages("textdata")
library(textdata)
```

All the functions used in this package will prompt you to download the files. Once they are downloaded and cached, they are easily loaded.

```{r}
glimpse(textdata::dataset_imdb())
```

Available data sets:

```{r}
with(catalogue, split(name, type))
```

### gutenbergr

The [gutenbergr](https://github.com/ropensci/gutenbergr) package allows for search and download of public domain texts from [Project Gutenberg](https://www.gutenberg.org/), which currently includes more than 57,000 free eBooks.

```{r}
#install.packages("gutenbergr")
library(gutenbergr)
```

To use **gutenbergr** you must know the Gutenberg id of the work you wish to analyze. A text search of the works can be done using the `gutenberg_works` function.

```{r}
gutenberg_works(title == "Wuthering Heights")
```

With that id you can use the `gutenberg_download()` function to download the text.

```{r}
gutenberg_download(768)
```

Examples:

Still pending.

### text2vec

While the [text2vec](https://github.com/dselivanov/text2vec) package isn't a data package by itself, it does include a textual data set inside.

```{r}
#install.packages("text2vec")
library(text2vec)
```

The data frame `movie_review` contains 5000 IMDB movie reviews selected for sentiment analysis. It has been preprocessed to include a sentiment label: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1.

```{r}
dplyr::glimpse(movie_review)
```
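A quick way to verify the binary coding described above (a small sketch; it assumes `movie_review` carries its documented `sentiment` column):

```{r}
# Sanity check: counts of 0 (negative, rating < 5) and 1 (positive, rating >= 7).
table(movie_review$sentiment)
```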

### epubr

The [epubr](https://github.com/ropensci/epubr) package allows for extraction of metadata and textual content of epub files.

```{r, eval=FALSE}
install.packages("epubr")
library(epubr)
```

Further information and examples can be found [here](https://github.com/ropensci/epubr).
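A sketch of basic usage under the assumption of a local epub file; `"book.epub"` is a hypothetical path:

```{r, eval=FALSE}
library(epubr)

# Read metadata and text; returns one row per file, with the text nested
# in the `data` list-column. "book.epub" is a hypothetical path.
book <- epub("book.epub")
book$data[[1]]
```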

## GitHub packages

### appa

This [appa](https://github.com/averyrobbins1/appa) package contains complete script transcription for Avatar: The Last Airbender.

```{r}
#devtools::install_github("averyrobbins1/appa")
library(appa)
```

The data set includes the full transcription, line by line with metadata about book number, chapter number, character, and more.

```{r}
dplyr::glimpse(appa)
```

### sacred

The [sacred](https://github.com/JohnCoene/sacred) package includes 9 tidy data sets: `apocrypha`, `book_of_mormon`, `doctrine_and_covenants`, `greek_new_testament`, `king_james_version`, `pearl_of_great_price`, `tanach`, `vulgate`, and `septuagint`, with columns describing the position within each work.

```{r}
#devtools::install_github("JohnCoene/sacred")
library(sacred)
```

```{r}
dplyr::glimpse(apocrypha)
```

Examples:

Still pending.

### harrypotter

The [harrypotter](https://github.com/bradleyboehmke/harrypotter) package includes the text from all 7 main series books.

```{r}
#devtools::install_github("bradleyboehmke/harrypotter")
library(harrypotter)
```

The 7 books, `philosophers_stone`, `chamber_of_secrets`, `prisoner_of_azkaban`, `goblet_of_fire`, `order_of_the_phoenix`, `half_blood_prince`, and `deathly_hallows`, are formatted as character vectors with one chapter per string.

```{r}
dplyr::glimpse(harrypotter::chamber_of_secrets)
```

Examples:

- [Harry Plotter: Celebrating the 20 year anniversary with tidytext and the tidyverse in R](https://paulvanderlaken.com/2017/08/03/harry-plotter-celebrating-the-20-year-anniversary-with-tidytext-the-tidyverse-and-r/)
- [Harry Plotter: Part 2 – Hogwarts Houses and their Stereotypes](https://paulvanderlaken.com/2017/08/22/harry-plotter-part-2-hogwarts-houses-and-their-stereotypes/)

### hgwellsr

The [hgwellsr](https://github.com/erikhoward/hgwellsr) package provides access to the full texts of six novels by H. G. Wells.

```{r}
#devtools::install_github("erikhoward/hgwellsr")
library(hgwellsr)
```

- Ann Veronica (1909): `annveronica`.
- The History of Mr Polly (1910): `mrpolly`.
- The Invisible Man (1897): `invisibleman`.
- The Island of Doctor Moreau (1896): `doctormoreau`.
- The Time Machine (1895): `timemachine`.
- The War of the Worlds (1898): `waroftheworlds`.

```{r}
head(annveronica, 10)
```

### jeeves

The [jeeves](https://github.com/aniruhil/jeeves) package provides access to the full texts of 38 works by P.G. Wodehouse.

```{r message=FALSE}
#devtools::install_github("aniruhil/jeeves")
library(jeeves)
```

```{r}
glimpse(adamselindistress)
```

### koanr

The [koanr](https://github.com/malcolmbarrett/koanr) package includes text from several of the more important Zen koan texts.

```{r message=FALSE}
#devtools::install_github("malcolmbarrett/koanr")
library(koanr)
```

The texts in this package include The Gateless Gate (`gateless_gate`), The Blue Cliff Record (`blue_cliff_record`), The Record of the Transmission of the Light (`record_of_light`), and The Book of Equanimity (`book_of_equanimity`).

```{r}
dplyr::glimpse(gateless_gate)
```

### sherlock

The [sherlock](https://github.com/EmilHvitfeldt/sherlock) package includes text from the Sherlock Holmes books.

```{r message=FALSE}
#devtools::install_github("EmilHvitfeldt/sherlock")
library(sherlock)
```

The goal of sherlock is to provide access to the full texts of Sherlock Holmes stories that are in the public domain. Text and further information regarding copyright laws can be found [here](https://sherlock-holm.es/ascii/).

```{r}
dplyr::glimpse(holmes)
```

### rperseus

The goal of [rperseus](https://github.com/ropensci/rperseus) is to furnish classicists, textual critics, and R enthusiasts with texts from the Classical World. While the English translations of most texts are available through `gutenbergr`, rperseus returns these works in their original languages: Greek, Latin, and Hebrew.

```{r warning=FALSE}
#devtools::install_github("ropensci/rperseus")
library(rperseus)
aeneid_latin <- perseus_catalog %>%
  filter(group_name == "Virgil",
         label == "Aeneid",
         language == "lat") %>%
  pull(urn) %>%
  get_perseus_text()
head(aeneid_latin)
```

See [the vignette for more examples.](https://ropensci.github.io/rperseus/articles/rperseus-vignette.html)

### tidygutenbergr

The [tidygutenbergr](https://github.com/emilHvitfeldt/tidygutenbergr) package contains many functions that fetch data from [Project Gutenberg](https://www.gutenberg.org/) using the **gutenbergr** package and do some light cleaning.

```{r}
#devtools::install_github("emilHvitfeldt/tidygutenbergr")
library(tidygutenbergr)
```

tidygutenbergr contains a couple dozen datasets that can all be found [here](https://emilhvitfeldt.github.io/tidygutenbergr/reference/index.html).

Many books will have metadata on the text such as book number and chapter name/number.

```{r, message=FALSE}
glimpse(a_tale_of_two_cities())
```

### subtools

The [subtools](https://github.com/fkeck/subtools) package doesn't include any textual data, but allows you to read subtitle files.

```{r}
#devtools::install_github("fkeck/subtools")
library(subtools)
```

The use of this package can be seen in the examples below.
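A minimal sketch of reading a single subtitle file, assuming the current subtools API (`read_subtitles()`) and a hypothetical file path:

```{r, eval=FALSE}
library(subtools)

# "episode.srt" is a hypothetical path; read_subtitles() parses one
# subtitle file into a tidy object ready for text analysis.
subs <- read_subtitles("episode.srt")
```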

Examples:

- [Movies and series subtitles in R with subtools](http://www.pieceofk.fr/?p=437)
- [A tidy text analysis of Rick and Morty](http://tamaszilagyi.com/blog/a-tidy-text-analysis-of-rick-and-morty/)
- [You beautiful, naïve, sophisticated newborn series](https://masalmon.eu/2017/11/05/newborn-serie/)

## Tidytuesday

The [tidytuesday](https://github.com/rfordatascience/tidytuesday) project is an amazing collection of data sets that are well suited for beginners to hone their skills. Below is a list of the data sets that contain enough text data to analyze. This list contains data sets that are already present on this page, but they are kept here for completeness.

Examples will not be shown here since that is taken care of in the respective pages.

| Date | Topic |
|------|-------|
| 2019-01-01 | [#rstats and #TidyTuesday Tweets from rtweet](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-01-01) |
| 2019-03-12 | [Board Games Database](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-03-12) |
| 2019-04-23 | [Anime Dataset](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-04-23) |
| 2019-05-28 | [Wine ratings](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-05-28) |
| 2019-06-25 | [UFO Sightings around the world](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-06-25) |
| 2019-09-10 | [Amusement Park injuries](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-09-10) |
| 2019-10-22 | [Horror movie metadata](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-10-22) |
| 2019-12-17 | [Adoptable dogs](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-12-17) |
| 2019-12-24 | [Christmas Music Billboards](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-12-24) |
| 2020-03-17 | [The Office - Words and Numbers](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-03-17) |
| 2020-04-21 | [GDPR Fines](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-04-21) |
| 2020-04-28 | [Broadway Weekly Grosses](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-04-28) |
| 2020-05-05 | [Animal Crossing - New Horizons](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-05-05) |
| 2020-05-26 | [Cocktails](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-05-26) |
| 2020-06-09 | [African American Achievements](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-06-09) |
| 2020-06-16 | [American Slavery and Juneteenth](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-06-16) |
| 2020-08-11 | [Avatar: The last airbender](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-08-11) |
| 2020-09-08 | [Friends](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-09-08) |
| 2020-09-29 | [Beyoncé and Taylor Swift Lyrics](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-09-29) |
| 2020-12-08 | [Women of 2020](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-12-08) |
| 2021-01-12 | [Art Collections](https://github.com/rfordatascience/tidytuesday/tree/master/data/2021/2021-01-12) |
| 2021-03-02 | [Superbowl commercials](https://github.com/rfordatascience/tidytuesday/tree/master/data/2021/2021-03-02) |
| 2021-03-23 | [UN Votes](https://github.com/rfordatascience/tidytuesday/tree/master/data/2021/2021-03-23) |
| 2021-04-20 | [Netflix Shows](https://github.com/rfordatascience/tidytuesday/tree/master/data/2021/2021-04-20) |
| 2021-04-27 | [CEO Departures](https://github.com/rfordatascience/tidytuesday/tree/master/data/2021/2021-04-27) |
| 2021-06-15 | [Du Bois and Juneteenth Revisited](https://github.com/rfordatascience/tidytuesday/tree/master/data/2021/2021-06-15) |

## Wild data

This section includes public data sets and how to import them into R ready for analysis. It is generally advised to save the resulting data so that you don't re-download the data excessively.

[Movie Review Data](http://www.cs.cornell.edu/people/pabo/movie-review-data/)

This website includes a handful of different movie review data sets. Below are the code chunks necessary to load each of them.
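The chunks below download into temporary files; if you prefer a persistent copy, one simple pattern (a sketch, reusing the polarity data set's URL) is to download only when no local copy exists:

```{r, eval=FALSE}
# Download once and reuse the local copy on later runs.
filepath <- "review_polarity.tar.gz"

if (!file.exists(filepath)) {
  download.file(
    "http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz",
    filepath
  )
}
```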

### polarity dataset v2.0

```{r}
library(tidyverse)
library(fs)

filepath <- file_temp() %>%
  path_ext_set("tar.gz")

download.file("http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz", filepath)

file_names <- untar(filepath, list = TRUE)
file_names <- file_names[!str_detect(file_names, "README")]

untar(filepath, files = file_names)

data <- map_df(file_names,
               ~ tibble(text = read_lines(.x),
                        polarity = str_detect(.x, "pos"),
                        cv_tag = str_extract(.x, "(?<=cv)\\d{3}"),
                        html_tag = str_extract(.x, "(?<=cv\\d{3}_)\\d*")))

glimpse(data)
```

### sentence polarity dataset v1.0

```{r}
library(tidyverse)
library(fs)

filepath <- file_temp() %>%
  path_ext_set("tar.gz")

download.file("http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz", filepath)

file_names <- untar(filepath, list = TRUE)
file_names <- file_names[!str_detect(file_names, "README")]

untar(filepath, files = file_names)

data <- map_df(file_names,
               ~ tibble(text = read_lines(.x),
                        polarity = str_detect(.x, "pos")))

glimpse(data)
```

### scale dataset v1.0

```{r}
library(tidyverse)
library(fs)

filepath <- file_temp() %>%
  path_ext_set("tar.gz")

download.file("http://www.cs.cornell.edu/people/pabo/movie-review-data/scale_data.tar.gz", filepath)

file_names <- untar(filepath, list = TRUE)
file_names <- file_names[!str_detect(file_names, "README")]

untar(filepath, files = file_names)

subjs <- str_subset(file_names, "subj")
ids <- str_subset(file_names, "id")
ratings <- str_subset(file_names, "rating")
names <- str_extract(ratings, "(?<=rating.).*") %>%
  str_replace("\\+", " ")

data <- map_df(seq_along(names),
               ~ tibble(text = read_lines(subjs[.x]),
                        id = read_lines(ids[.x]),
                        rating = read_lines(ratings[.x]),
                        name = names[.x]))

glimpse(data)
```

### subjectivity dataset v1.0

```{r}
library(tidyverse)
library(fs)

filepath <- file_temp() %>%
  path_ext_set("tar.gz")

download.file("http://www.cs.cornell.edu/people/pabo/movie-review-data/rotten_imdb.tar.gz", filepath)

file_names <- untar(filepath, list = TRUE)
file_names <- file_names[!str_detect(file_names, "README")]

untar(filepath, files = file_names)

data <- map_df(file_names,
               ~ tibble(text = read_lines(.x),
                        label = if_else(str_detect(.x, "quote"),
                                        "subjective",
                                        "objective")))

glimpse(data)
```

### SouthParkData

The following GitHub repository [BobAdamsEE/SouthParkData](https://github.com/BobAdamsEE/SouthParkData) includes the scripts of the first 19 seasons of South Park. The following code snippet lets you download them all at once.

```{r message=FALSE}
url_base <- "https://raw.githubusercontent.com/BobAdamsEE/SouthParkData/master/by-season"
urls <- paste0(url_base, "/Season-", 1:19, ".csv")

data <- map_df(urls, ~ read_csv(.x))

glimpse(data)
```

Examples:

Still pending.

### Saudi Newspapers Corpus

The following GitHub repository [inparallel/SaudiNewsNet](https://github.com/inparallel/SaudiNewsNet) includes data and text from 31,030 Arabic newspaper articles along with metadata, extracted from various online Saudi newspapers.

```{r}
library(rio)
library(glue)
library(fs)
library(purrr)

dates <- c("2015-07-21", "2015-07-22", "2015-07-23", "2015-07-24", "2015-07-25",
           "2015-07-26", "2015-07-27", "2015-07-31", "2015-08-01", "2015-08-02",
           "2015-08-03", "2015-08-04", "2015-08-06", "2015-08-07", "2015-08-08",
           "2015-08-09", "2015-08-10", "2015-08-11")

tmp_path <- path_temp()

urls <- glue("https://raw.githubusercontent.com/inparallel/SaudiNewsNet/master/dataset/{dates}.zip")
paths <- path(tmp_path, dates, ext = "zip")

data <- map2_dfr(urls, paths, ~ {
  download.file(.x, .y)
  import_list(.y)[[1]]
})

glimpse(data)
```