{"id":18043587,"url":"https://github.com/emilhvitfeldt/r-text-data","last_synced_at":"2026-01-18T19:01:25.140Z","repository":{"id":38842847,"uuid":"142812320","full_name":"EmilHvitfeldt/R-text-data","owner":"EmilHvitfeldt","description":"List of textual data sources to be used for text mining in R","archived":false,"fork":false,"pushed_at":"2021-08-17T03:12:00.000Z","size":10186,"stargazers_count":148,"open_issues_count":1,"forks_count":15,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-07-13T17:49:08.606Z","etag":null,"topics":["data-science","nlp","rstats","text-analysis","text-analytics-in-r","text-mining","tidytext"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/EmilHvitfeldt.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-07-30T02:00:15.000Z","updated_at":"2025-04-12T02:13:36.000Z","dependencies_parsed_at":"2022-09-18T09:13:08.602Z","dependency_job_id":null,"html_url":"https://github.com/EmilHvitfeldt/R-text-data","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/EmilHvitfeldt/R-text-data","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EmilHvitfeldt%2FR-text-data","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EmilHvitfeldt%2FR-text-data/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EmilHvitfeldt%2FR-text-data/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EmilHvitfeldt%2FR-text-data/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/owners/EmilHvitfeldt","download_url":"https://codeload.github.com/EmilHvitfeldt/R-text-data/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EmilHvitfeldt%2FR-text-data/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28548943,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-18T14:59:57.589Z","status":"ssl_error","status_checked_at":"2026-01-18T14:59:46.540Z","response_time":98,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","nlp","rstats","text-analysis","text-analytics-in-r","text-mining","tidytext"],"created_at":"2024-10-30T17:09:10.955Z","updated_at":"2026-01-18T19:01:25.023Z","avatar_url":"https://github.com/EmilHvitfeldt.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"---\noutput: github_document\n---\n\n\u003c!-- README.md is generated from README.Rmd. Please edit that file --\u003e\n\n```{r setup, include = FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"#\u003e\",\n  cache = TRUE\n)\n\nlibrary(tidyverse)\n```\n\n# R Text Data Compilation\n\nThe goal of this repository is to act as a collection of textual data sets to be used for training and practice in text mining/NLP in R. 
This repository is not a guide on how to do text analysis/mining, but rather on how to get a data set so you can get started with minimal hassle.\n\n# Table of Contents\n\n-   [Main page](#R-text-data)\n-   [CRAN packages](#cran-packages)\n    -   [janeaustenr](#janeaustenr)\n    -   [quRan](#quran)\n    -   [scriptuRs](#scripturs)\n    -   [friends](#friends)\n    -   [hcandersenr](#hcandersenr)\n    -   [proustr](#proustr)\n    -   [schrute](#schrute)\n    -   [textdata](#textdata)\n    -   [gutenbergr](#gutenbergr)\n    -   [text2vec](#text2vec)\n    -   [epubr](#epubr)\n-   [Github packages](#github-packages)\n    -   [appa](#appa)\n    -   [sacred](#sacred)\n    -   [harrypotter](#harrypotter)\n    -   [hgwellsr](#hgwellsr)\n    -   [jeeves](#jeeves)\n    -   [koanr](#koanr)\n    -   [sherlock](#sherlock)\n    -   [rperseus](#rperseus)\n    -   [tidygutenbergr](#tidygutenbergr)\n    -   [subtools](#subtools)\n-   [tidytuesday](#tidytuesday)\n-   [Wild data](#wild-data)\n    -   Cornell data\n        -   [polarity dataset v2.0](#polarity-dataset-v20)\n        -   [sentence polarity dataset v1.0](#sentence-polarity-dataset-v10)\n        -   [scale dataset v1.0](#scale-dataset-v10)\n        -   [subjectivity dataset v1.0](#subjectivity-dataset-v10)\n    -   [SouthParkData](#southparkdata)\n    -   [Saudi Newspapers Corpus](#saudi-newspapers-corpus)\n\n## CRAN packages\n\n### janeaustenr\n\nFirst we have the [janeaustenr](https://github.com/juliasilge/janeaustenr) package popularized by Julia Silge in [tidytextmining](https://www.tidytextmining.com/).\n\n```{r}\n#install.packages(\"janeaustenr\")\nlibrary(janeaustenr)\n```\n\n`janeaustenr` includes 6 books: `emma`, `mansfieldpark`, `northangerabbey`, `persuasion`, `prideprejudice` and `sensesensibility`, each formatted as a character vector with elements of about 70 characters.\n\n```{r}\nhead(emma, n = 15)\n```\n\nAll the books can also be found combined into one data.frame in the function 
`austen_books()`\n\n```{r}\ndplyr::glimpse(austen_books())\n```\n\nExamples:\n\n-   \u003chttps://juliasilge.com/blog/if-i-loved-nlp-less/\u003e\n\n### quRan\n\nThe [quRan](https://github.com/andrewheiss/quRan) package contains the complete text of the Qur'an in Arabic (with and without vowels) and in English (the Yusuf Ali and Saheeh International translations).\n\n```{r}\n#install.packages(\"quRan\")\nlibrary(quRan)\n```\n\n```{r}\ndplyr::glimpse(quran_ar)\n```\n\nExamples:\n\n-   [Twitter thread](https://twitter.com/andrewheiss/status/1078428352577327104)\n\n### scriptuRs\n\nThe [scriptuRs](https://github.com/andrewheiss/scriptuRs) package contains the full text of the Standard Works of The Church of Jesus Christ of Latter-day Saints: the Old and New Testaments, the Book of Mormon, the Doctrine and Covenants, and the Pearl of Great Price. Each volume is in a data frame with a row for each verse, along with 19 columns of detailed metadata.\n\n```{r}\n#install.packages(\"scriptuRs\")\nlibrary(scriptuRs)\n```\n\n```{r}\ndplyr::glimpse(scriptuRs::book_of_mormon)\n```\n\nExamples:\n\n-   [Tidy text, parts of speech, and unique words in the Bible](https://www.andrewheiss.com/blog/2018/12/26/tidytext-pos-john/)\n\n### friends\n\nThe goal of [friends](https://github.com/emilhvitfeldt/friends) is to provide the complete script transcription of the [Friends](https://en.wikipedia.org/wiki/Friends) sitcom. The data originates from the [Character Mining](https://github.com/emorynlp/character-mining) repository, which includes references to scientific explorations using this data. 
This package simply provides the data in tibble format instead of JSON files.\n\n```{r}\n#install.packages(\"friends\")\nlibrary(friends)\n```\n\nThe data set includes the full transcription, line by line with metadata about episode number, season number, character, and more.\n\n```{r}\ndplyr::glimpse(friends)\n```\n\nAdditional data sets are included with more metadata.\n\n```{r}\ndplyr::glimpse(friends_emotions)\n\ndplyr::glimpse(friends_entities)\n\ndplyr::glimpse(friends_info)\n```\n\n### hcandersenr\n\nThe [hcandersenr](https://github.com/emilhvitfeldt/hcandersenr) package includes many of H.C. Andersen's fairy tales in 5 different languages.\n\n```{r}\n#install.packages(\"hcandersenr\")\nlibrary(hcandersenr)\n```\n\nThe fairy tales are found in the following data frames: `hcandersen_en`, `hcandersen_da`, `hcandersen_de`, `hcandersen_es` and `hcandersen_fr` for the English, Danish, German, Spanish and French versions respectively. Please be advised that not all fairy tales are available in all languages in this package.\n\n```{r}\ndplyr::glimpse(hcandersen_en)\n```\n\nAll the fairy tales are collected in the following data.frame:\n\n```{r}\ndplyr::glimpse(hca_fairytales())\n```\n\nExamples:\n\nStill pending.\n\n### proustr\n\nThe [proustr](https://github.com/ColinFay/proustr) package gives you access to tools designed to do Natural Language Processing in French.\n\n```{r}\n#install.packages(\"proustr\")\nlibrary(proustr)\n```\n\nFurthermore, it includes the following 7 books:\n\n-   Du côté de chez Swann (1913): `ducotedechezswann`.\n-   À l'ombre des jeunes filles en fleurs (1919): `alombredesjeunesfillesenfleurs`.\n-   Le Côté de Guermantes (1921): `lecotedeguermantes`.\n-   Sodome et Gomorrhe (1922): `sodomeetgomorrhe`.\n-   La Prisonnière (1923): `laprisonniere`.\n-   Albertine disparue (1925, also known as La Fugitive): `albertinedisparue`.\n-   Le Temps retrouvé (1927): `letempretrouve`.\n\nWhich are all found in the `proust_books()` 
function.\n\n```{r}\ndplyr::glimpse(proust_books())\n```\n\n### schrute\n\nThe [schrute](https://github.com/bradlindblad/schrute) package contains the complete script transcription of The Office (US) television show.\n\n```{r}\n#install.packages(\"schrute\")\nlibrary(schrute)\n```\n\nThe data set includes the full transcription, line by line with metadata about episode number, season number, character, and more.\n\n```{r}\nglimpse(theoffice)\n```\n\nExamples:\n\n-   [Tidy Tuesday screencast: analyzing ratings and scripts from The Office](https://www.youtube.com/watch?v=_IvAubTDQME\u0026t=1092s)\n-   [Lasso regression with tidymodels and The Office](https://www.youtube.com/watch?v=R32AsuKICAY)\n-   [tidytuesday: Part-of-Speech and textrecipes with The Office](https://www.emilhvitfeldt.com/post/tidytuesday-pos-textrecipes-the-office/)\n\n### textdata\n\nThe goal of [textdata](https://github.com/emilhvitfeldt/textdata) is to provide easy access to text-related data sets without bundling them inside a package. Some text datasets are too large to store within an R package or are licensed in such a way that prevents them from being included in an OSS-licensed package. Instead, this package provides a framework to download, parse, and store the datasets on disk and load them when needed.\n\n```{r}\n#install.packages(\"textdata\")\nlibrary(textdata)\n```\n\nAll the functions in this package will prompt you to download the files. Once they are downloaded and cached, they are easily loaded.\n\n```{r}\nglimpse(textdata::dataset_imdb())\n```\n\nAvailable data sets:\n\n```{r}\nwith(catalogue, split(name, type))\n```\n\n### gutenbergr\n\nThe [gutenbergr](https://github.com/ropensci/gutenbergr) package allows for search and download of public domain texts from [Project Gutenberg](https://www.gutenberg.org/). 
Project Gutenberg currently includes more than 57,000 free eBooks.\n\n```{r}\n#install.packages(\"gutenbergr\")\nlibrary(gutenbergr)\n```\n\nTo use **gutenbergr** you must know the Gutenberg id of the work you wish to analyze. A text search of the works can be done using the `gutenberg_works` function.\n\n```{r}\ngutenberg_works(title == \"Wuthering Heights\")\n```\n\nWith that id you can use the `gutenberg_download()` function to download the text.\n\n```{r}\ngutenberg_download(768)\n```\n\nExamples:\n\nStill pending.\n\n### text2vec\n\nWhile the [text2vec](https://github.com/dselivanov/text2vec) package isn't a data package by itself, it does include a textual data set inside.\n\n```{r}\n#install.packages(\"text2vec\")\nlibrary(text2vec)\n```\n\nThe data frame `movie_review` contains 5000 IMDB movie reviews selected for sentiment analysis. It has been preprocessed to include a sentiment score: an IMDB rating \\\u003c 5 results in a sentiment score of 0, and a rating \\\u003e= 7 results in a sentiment score of 1.\n\n```{r}\ndplyr::glimpse(movie_review)\n```\n\n### epubr\n\nThe [epubr](https://github.com/ropensci/epubr) package allows for extraction of metadata and textual content of epub files.\n\n```{r, eval=FALSE}\ninstall.packages(\"epubr\")\nlibrary(epubr)\n```\n\nFurther information and examples can be found [here](https://github.com/ropensci/epubr).\n\n## Github packages\n\n### appa\n\nThe [appa](https://github.com/averyrobbins1/appa) package contains the complete script transcription of Avatar: The Last Airbender.\n\n```{r}\n#devtools::install_github(\"averyrobbins1/appa\")\nlibrary(appa)\n```\n\nThe data set includes the full transcription, line by line with metadata about book number, chapter number, character, and more.\n\n```{r}\ndplyr::glimpse(appa)\n```\n\n### sacred\n\nThe [sacred](https://github.com/JohnCoene/sacred) package includes 9 tidy data sets: `apocrypha`, `book_of_mormon`, `doctrine_and_covenants`, `greek_new_testament`, `king_james_version`, `pearl_of_great_price`, `tanach`, 
`vulgate` and `septuagint` with columns describing the position within each work.\n\n```{r}\n#devtools::install_github(\"JohnCoene/sacred\")\nlibrary(sacred)\n```\n\n```{r}\ndplyr::glimpse(apocrypha)\n```\n\nExamples:\n\nStill pending.\n\n### harrypotter\n\nThe [harrypotter](https://github.com/bradleyboehmke/harrypotter) package includes the text from all 7 main series books.\n\n```{r}\n#devtools::install_github(\"bradleyboehmke/harrypotter\")\nlibrary(harrypotter)\n```\n\nThe 7 books, `philosophers_stone`, `chamber_of_secrets`, `prisoner_of_azkaban`, `goblet_of_fire`, `order_of_the_phoenix`, `half_blood_prince` and `deathly_hallows`, are formatted as character vectors with one string per chapter.\n\n```{r}\ndplyr::glimpse(harrypotter::chamber_of_secrets)\n```\n\nExamples:\n\n-   [Harry Plotter: Celebrating the 20 year anniversary with tidytext and the tidyverse in R](https://paulvanderlaken.com/2017/08/03/harry-plotter-celebrating-the-20-year-anniversary-with-tidytext-the-tidyverse-and-r/)\n-   [Harry Plotter: Part 2 – Hogwarts Houses and their Stereotypes](https://paulvanderlaken.com/2017/08/22/harry-plotter-part-2-hogwarts-houses-and-their-stereotypes/)\n\n### hgwellsr\n\nThe [hgwellsr](https://github.com/erikhoward/hgwellsr) package provides access to the full texts of six novels by H. G. Wells.\n\n```{r}\n#devtools::install_github(\"erikhoward/hgwellsr\")\nlibrary(hgwellsr)\n```\n\n-   Ann Veronica (1909): `annveronica`.\n-   The History of Mr Polly (1910): `mrpolly`.\n-   The Invisible Man (1897): `invisibleman`.\n-   The Island of Doctor Moreau (1896): `doctormoreau`.\n-   The Time Machine (1895): `timemachine`.\n-   The War of the Worlds (1898): `waroftheworlds`.\n\n```{r}\nhead(annveronica, 10)\n```\n\n### jeeves\n\nThe [jeeves](https://github.com/aniruhil/jeeves) package provides access to the full texts of 38 works by P.G. 
Wodehouse.\n\n```{r message=FALSE}\n#devtools::install_github(\"aniruhil/jeeves\")\nlibrary(jeeves)\n```\n\n```{r}\nglimpse(adamselindistress)\n```\n\n### koanr\n\nThe [koanr](https://github.com/malcolmbarrett/koanr) package includes text from several of the more important Zen koan texts.\n\n```{r message=FALSE}\n#devtools::install_github(\"malcolmbarrett/koanr\")\nlibrary(koanr)\n```\n\nThe texts in this package include The Gateless Gate (`gateless_gate`), The Blue Cliff Record (`blue_cliff_record`), The Record of the Transmission of the Light (`record_of_light`), and The Book of Equanimity (`book_of_equanimity`).\n\n```{r}\ndplyr::glimpse(gateless_gate)\n```\n\n### sherlock\n\nThe [sherlock](https://github.com/EmilHvitfeldt/sherlock) package includes text from the Sherlock Holmes books.\n\n```{r message=FALSE}\n#devtools::install_github(\"EmilHvitfeldt/sherlock\")\nlibrary(sherlock)\n```\n\nThe goal of sherlock is to provide access to the full texts of Sherlock Holmes stories that are in the public domain. The texts and further information regarding copyright laws can be found [here](https://sherlock-holm.es/ascii/).\n\n```{r}\ndplyr::glimpse(holmes)\n```\n\n### rperseus\n\nThe goal of [rperseus](https://github.com/ropensci/rperseus) is to furnish classicists, textual critics, and R enthusiasts with texts from the Classical World. 
While the English translations of most texts are available through `gutenbergr`, rperseus returns these works in their original language--Greek, Latin, and Hebrew.\n\n```{r warning=FALSE}\n#devtools::install_github(\"ropensci/rperseus\")\nlibrary(rperseus)\naeneid_latin \u003c- perseus_catalog %\u003e% \n  filter(group_name == \"Virgil\",\n         label == \"Aeneid\",\n         language == \"lat\") %\u003e% \n  pull(urn) %\u003e% \n  get_perseus_text()\nhead(aeneid_latin)\n```\n\nSee [the vignette for more examples.](https://ropensci.github.io/rperseus/articles/rperseus-vignette.html)\n\n### tidygutenbergr\n\nThe [tidygutenbergr](https://github.com/emilHvitfeldt/tidygutenbergr) package contains many functions that fetch data from [Project Gutenberg](https://www.gutenberg.org/) using the **gutenbergr** package and do some light cleaning.\n\n```{r}\n#devtools::install_github(\"emilHvitfeldt/tidygutenbergr\")\nlibrary(tidygutenbergr)\n```\n\ntidygutenbergr contains a couple dozen datasets that can all be found [here](https://emilhvitfeldt.github.io/tidygutenbergr/reference/index.html).\n\nMany books will have metadata on the text such as book number and chapter name/number.\n\n```{r, message=FALSE}\nglimpse(a_tale_of_two_cities())\n```\n\n### subtools\n\nThe [subtools](https://github.com/fkeck/subtools) package doesn't include any textual data, but allows you to read subtitle files.\n\n```{r}\n#devtools::install_github(\"fkeck/subtools\")\nlibrary(subtools)\n```\n\nUsage of this package is shown in the examples.\n\nExamples:\n\n-   [Movies and series subtitles in R with subtools](http://www.pieceofk.fr/?p=437)\n-   [A tidy text analysis of Rick and Morty](http://tamaszilagyi.com/blog/a-tidy-text-analysis-of-rick-and-morty/)\n-   [You beautiful, naïve, sophisticated newborn series](https://masalmon.eu/2017/11/05/newborn-serie/)\n\n## Tidytuesday\n\nThe [tidytuesday](https://github.com/rfordatascience/tidytuesday) project is an amazing collection of data sets that 
are well suited for beginners to hone their skills. Below is a list of the data sets that contain enough text data to analyse. This list includes data sets that are already present on this page; they are kept here for completeness.\n\nExamples are not shown here since they are covered on the respective pages.\n\n| Date | Topic |\n|------|-------|\n| 2019-01-01 | [#rstats and #TidyTuesday Tweets from rtweet](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-01-01) |\n| 2019-03-12 | [Board Games Database](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-03-12) |\n| 2019-04-23 | [Anime Dataset](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-04-23) |\n| 2019-05-28 | [Wine ratings](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-05-28) |\n| 2019-06-25 | [UFO Sightings around the world](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-06-25) |\n| 2019-09-10 | [Amusement Park injuries](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-09-10) |\n| 2019-10-22 | [Horror movie metadata](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-10-22) |\n| 2019-12-17 | [Adoptable dogs](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-12-17) |\n| 2019-12-24 | [Christmas Music Billboards](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-12-24) |\n| 2020-03-17 | [The Office - Words and Numbers](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-03-17) |\n| 2020-04-21 | [GDPR Fines](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-04-21) |\n| 2020-04-28 | [Broadway Weekly Grosses](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-04-28) |\n| 2020-05-05 | [Animal Crossing - New Horizons](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-05-05) |\n| 2020-05-26 | 
[Cocktails](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-05-26) |\n| 2020-06-09 | [African American Achievements](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-06-09) |\n| 2020-06-16 | [American Slavery and Juneteenth](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-06-16) |\n| 2020-08-11 | [Avatar: The last airbender](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-08-11) |\n| 2020-09-08 | [Friends](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-09-08) |\n| 2020-09-29 | [Beyoncé and Taylor Swift Lyrics](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-09-29) |\n| 2020-12-08 | [Women of 2020](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-12-08) |\n| 2021-01-12 | [Art Collections](https://github.com/rfordatascience/tidytuesday/tree/master/data/2021/2021-01-12) |\n| 2021-03-02 | [Superbowl commercials](https://github.com/rfordatascience/tidytuesday/tree/master/data/2021/2021-03-02) |\n| 2021-03-23 | [UN Votes](https://github.com/rfordatascience/tidytuesday/tree/master/data/2021/2021-03-23) |\n| 2021-04-20 | [Netflix Shows](https://github.com/rfordatascience/tidytuesday/tree/master/data/2021/2021-04-20) |\n| 2021-04-27 | [CEO Departures](https://github.com/rfordatascience/tidytuesday/tree/master/data/2021/2021-04-27) |\n| 2021-06-15 | [Du Bois and Juneteenth Revisited](https://github.com/rfordatascience/tidytuesday/tree/master/data/2021/2021-06-15) |\n\n## Wild data\n\nThis section includes public data sets and shows how to import them into R ready for analysis. It is generally advised to save the resulting data so that you don't re-download it excessively.\n\n[Movie Review Data](http://www.cs.cornell.edu/people/pabo/movie-review-data/)\n\nThis website includes a handful of different movie review data sets. 
Below are the code chunks necessary to load in the data sets.\n\n### polarity dataset v2.0\n\n```{r}\nlibrary(tidyverse)\nlibrary(fs)\n\nfilepath \u003c- file_temp() %\u003e%\n  path_ext_set(\"tar.gz\")\n\ndownload.file(\"http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz\", filepath)\n\nfile_names \u003c- untar(filepath, list = TRUE)\nfile_names \u003c- file_names[!str_detect(file_names, \"README\")]\n\nuntar(filepath, files = file_names)\n\ndata \u003c- map_df(file_names, \n               ~ tibble(text = read_lines(.x),\n                        polarity = str_detect(.x, \"pos\"),\n                        cv_tag = str_extract(.x, \"(?\u003c=cv)\\\\d{3}\"),\n                        html_tag = str_extract(.x, \"(?\u003c=cv\\\\d{3}_)\\\\d*\")))\n\nglimpse(data)\n```\n\n### sentence polarity dataset v1.0\n\n```{r}\nlibrary(tidyverse)\nlibrary(fs)\n\nfilepath \u003c- file_temp() %\u003e%\n  path_ext_set(\"tar.gz\")\n\ndownload.file(\"http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz\", filepath)\n\nfile_names \u003c- untar(filepath, list = TRUE)\nfile_names \u003c- file_names[!str_detect(file_names, \"README\")]\n\nuntar(filepath, files = file_names)\n\ndata \u003c- map_df(file_names, \n               ~ tibble(text = read_lines(.x),\n                        polarity = str_detect(.x, \"pos\")))\n\nglimpse(data)\n```\n\n### scale dataset v1.0\n\n```{r}\nlibrary(tidyverse)\nlibrary(fs)\n\nfilepath \u003c- file_temp() %\u003e%\n  path_ext_set(\"tar.gz\")\n\ndownload.file(\"http://www.cs.cornell.edu/people/pabo/movie-review-data/scale_data.tar.gz\", filepath)\n\nfile_names \u003c- untar(filepath, list = TRUE)\nfile_names \u003c- file_names[!str_detect(file_names, \"README\")]\n\nuntar(filepath, files = file_names)\n\nsubjs \u003c- str_subset(file_names, \"subj\")\nids \u003c- str_subset(file_names, \"id\")\nratings \u003c- str_subset(file_names, \"rating\")\nnames \u003c- str_extract(ratings, \"(?\u003c=rating.).*\") 
%\u003e%\n  str_replace(\"\\\\+\", \" \")\n\ndata \u003c- map_df(seq_along(names), \n               ~ tibble(text = read_lines(subjs[.x]),\n                        id = read_lines(ids[.x]),\n                        rating = read_lines(ratings[.x]),\n                        name = names[.x]))\n\nglimpse(data)\n```\n\n### subjectivity dataset v1.0\n\n```{r}\nlibrary(tidyverse)\nlibrary(fs)\n\nfilepath \u003c- file_temp() %\u003e%\n  path_ext_set(\"tar.gz\")\n\ndownload.file(\"http://www.cs.cornell.edu/people/pabo/movie-review-data/rotten_imdb.tar.gz\", filepath)\n\nfile_names \u003c- untar(filepath, list = TRUE)\nfile_names \u003c- file_names[!str_detect(file_names, \"README\")]\n\nuntar(filepath, files = file_names)\n\ndata \u003c- map_df(file_names, \n               ~ tibble(text = read_lines(.x),\n                        label = if_else(str_detect(.x, \"quote\"), \n                                        \"subjective\", \n                                        \"objective\")))\n\nglimpse(data)\n```\n\n### SouthParkData\n\nThe GitHub repository [BobAdamsEE/SouthParkData](https://github.com/BobAdamsEE/SouthParkData) includes the scripts of the first 19 seasons of South Park. 
The following code snippet lets you download them all at once.\n\n```{r message=FALSE}\nurl_base \u003c- \"https://raw.githubusercontent.com/BobAdamsEE/SouthParkData/master/by-season\"\nurls \u003c- paste0(url_base, \"/Season-\", 1:19, \".csv\")\n\ndata \u003c- map_df(urls, ~ read_csv(.x))\n\nglimpse(data)\n```\n\nExamples:\n\n-   \u003chttps://www.kaylinpavlik.com/text-mining-south-park/\u003e\n\n### Saudi Newspapers Corpus\n\nThe GitHub repository [inparallel/SaudiNewsNet](https://github.com/inparallel/SaudiNewsNet) includes data and text from 31,030 Arabic newspaper articles along with metadata, extracted from various online Saudi newspapers.\n\n```{r}\nlibrary(rio)\nlibrary(glue)\nlibrary(fs)\nlibrary(purrr)\n\ndates \u003c- c(\"2015-07-21\", \"2015-07-22\", \"2015-07-23\", \"2015-07-24\", \"2015-07-25\",\n           \"2015-07-26\", \"2015-07-27\", \"2015-07-31\", \"2015-08-01\", \"2015-08-02\",\n           \"2015-08-03\", \"2015-08-04\", \"2015-08-06\", \"2015-08-07\", \"2015-08-08\",\n           \"2015-08-09\", \"2015-08-10\", \"2015-08-11\")\n\ntmp_path \u003c- path_temp()\n\nurls \u003c- glue(\"https://raw.githubusercontent.com/inparallel/SaudiNewsNet/master/dataset/{dates}.zip\")\npaths \u003c- path(tmp_path, dates, ext = \"zip\")\n\ndata \u003c- map2_dfr(urls, paths, ~ {\n  download.file(.x, .y)\n  import_list(.y)[[1]]\n})\n\nglimpse(data)\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Femilhvitfeldt%2Fr-text-data","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Femilhvitfeldt%2Fr-text-data","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Femilhvitfeldt%2Fr-text-data/lists"}