https://github.com/EmilHvitfeldt/textdata
Download, parse, store, and load text datasets instead of storing it in packages
- Host: GitHub
- URL: https://github.com/EmilHvitfeldt/textdata
- Owner: EmilHvitfeldt
- License: other
- Created: 2019-01-01T01:41:17.000Z (about 6 years ago)
- Default Branch: main
- Last Pushed: 2024-01-22T17:26:21.000Z (about 1 year ago)
- Last Synced: 2024-04-23T17:52:38.202Z (10 months ago)
- Topics: r, rstats, text-datasets
- Language: R
- Homepage: https://emilhvitfeldt.github.io/textdata/
- Size: 14.3 MB
- Stars: 73
- Watchers: 9
- Forks: 12
- Open Issues: 12
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project
- jimsghstars - EmilHvitfeldt/textdata - Download, parse, store, and load text datasets instead of storing it in packages (R)
README
---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-"
)
```

# textdata

The goal of textdata is to provide easy access to text-related datasets without bundling them inside a package. Some text datasets are too large to store within an R package, or are licensed in a way that prevents them from being included in an OSS-licensed package. Instead, this package provides a framework to download, parse, and store the datasets on disk and load them when needed.
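
The "store on disk once, load when needed" idea can be sketched in base R. This is only an illustration and not how textdata is implemented: `load_cached()`, `build_toy()`, and the `.rds` file layout are hypothetical names chosen for the example, and a small in-memory data frame stands in for a real download.

```r
# Hypothetical helper, NOT part of textdata: build a dataset once,
# store the parsed result on disk, and reuse the stored copy afterwards.
load_cached <- function(name, build, dir = tempdir()) {
  path <- file.path(dir, paste0(name, ".rds"))
  if (!file.exists(path)) {
    # First use: obtain and parse the data (a real implementation would
    # download it here), then store the parsed result on disk.
    saveRDS(build(), path)
  }
  # Every use: load the parsed dataset from disk.
  readRDS(path)
}

# Toy stand-in for "download and parse"; counts how often it runs.
calls <- 0
build_toy <- function() {
  calls <<- calls + 1
  data.frame(word = c("good", "bad"), value = c(3, -3))
}

# Use a fresh directory so the example does not touch any real cache.
cache_dir <- tempfile("textdata-sketch-")
dir.create(cache_dir)

first <- load_cached("toy_lexicon", build_toy, dir = cache_dir)
second <- load_cached("toy_lexicon", build_toy, dir = cache_dir)
```

The second call does not run `build_toy()` again, because the stored file already exists.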
## Installation
You can install the released version of textdata from [CRAN](https://CRAN.R-project.org) with:
``` r
install.packages("textdata")
```

And the development version from [GitHub](https://github.com/) with:
``` r
# install.packages("remotes")
remotes::install_github("EmilHvitfeldt/textdata")
```
## Example

The first time you use one of the functions for accessing an included text dataset, such as `lexicon_afinn()` or `dataset_sentence_polarity()`, the function will prompt you to agree that you understand the dataset's license or terms of use and then download the dataset to your computer.

After the first use, each time you use a function like `lexicon_afinn()`, the function will load the dataset from disk.
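
As a minimal sketch of that workflow, the snippet below calls `lexicon_afinn()` and prints the first rows of the result. It assumes textdata is installed and that you are in an interactive session, since the first call asks you to accept the license before downloading; the `interactive()` guard is an addition for this example so that the code is skipped elsewhere.

``` r
afinn <- NULL

# Assumption: run interactively with textdata installed. The guard keeps
# the snippet from prompting or downloading in non-interactive runs.
if (interactive() && requireNamespace("textdata", quietly = TRUE)) {
  # First call: license prompt, then download and store on disk.
  # Later calls: the stored dataset is loaded from disk.
  afinn <- textdata::lexicon_afinn()
  print(head(afinn))
}
```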
## Included text datasets
The datasets currently included in textdata are:
| Dataset | Function |
| --------------------------------------------------------------- | ----------------------------- |
| v1.0 sentence polarity dataset | `dataset_sentence_polarity()` |
| AFINN-111 sentiment lexicon | `lexicon_afinn()` |
| Hu and Liu's opinion lexicon | `lexicon_bing()` |
| NRC word-emotion association lexicon | `lexicon_nrc()` |
| NRC Emotion Intensity Lexicon | `lexicon_nrc_eil()` |
| The NRC Valence, Arousal, and Dominance Lexicon | `lexicon_nrc_vad()` |
| Loughran and McDonald's opinion lexicon for financial documents | `lexicon_loughran()` |
| AG's News | `dataset_ag_news()` |
| DBpedia ontology | `dataset_dbpedia()` |
| Trec-6 and Trec-50 | `dataset_trec()` |
| IMDb Large Movie Review Dataset | `dataset_imdb()` |
| Stanford NLP GloVe pre-trained word vectors | `embedding_glove6b()` |
| | `embedding_glove27b()` |
| | `embedding_glove42b()` |
| | `embedding_glove840b()` |

Check out each function's documentation for detailed information (including citations) for the relevant dataset.
## Community Guidelines
Note that this project is released with a
[Contributor Code of Conduct](https://github.com/EmilHvitfeldt/textdata/blob/main/CODE_OF_CONDUCT.md).
By contributing to this project, you agree to abide by its terms.
Feedback, bug reports (and fixes!), and feature requests are welcome; file
issues or seek support [here](https://github.com/EmilHvitfeldt/textdata/issues).
For details on how to add a new dataset to this package, check out the vignette!