https://github.com/EmilHvitfeldt/textdata
Download, parse, store, and load text datasets instead of storing it in packages
r rstats text-datasets
- Host: GitHub
- URL: https://github.com/EmilHvitfeldt/textdata
- Owner: EmilHvitfeldt
- License: other
- Created: 2019-01-01T01:41:17.000Z (about 6 years ago)
- Default Branch: main
- Last Pushed: 2024-01-22T17:26:21.000Z (about 1 year ago)
- Last Synced: 2024-04-23T17:52:38.202Z (10 months ago)
- Topics: r, rstats, text-datasets
- Language: R
- Homepage: https://emilhvitfeldt.github.io/textdata/
- Size: 14.3 MB
- Stars: 73
- Watchers: 9
- Forks: 12
- Open Issues: 12
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project
- jimsghstars - EmilHvitfeldt/textdata - Download, parse, store, and load text datasets instead of storing it in packages (R)
README
---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-"
)
```

# textdata
[R-CMD-check](https://github.com/EmilHvitfeldt/textdata/actions/workflows/R-CMD-check.yaml)
[CRAN status](https://CRAN.R-project.org/package=textdata)
[Downloads](https://cran.r-project.org/package=textdata)
[DOI](https://doi.org/10.5281/zenodo.3244433)
[Codecov test coverage](https://app.codecov.io/gh/EmilHvitfeldt/textdata?branch=main)
[Lifecycle: stable](https://lifecycle.r-lib.org/articles/stages.html)

The goal of textdata is to provide easy access to text-related datasets without bundling them inside a package. Some text datasets are too large to store within an R package, or are licensed in a way that prevents them from being included in an OSS-licensed package. Instead, this package provides a framework to download, parse, and store the datasets on disk and load them when needed.
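
The download-once, load-from-disk idea can be illustrated with a small base R sketch. This is not textdata's implementation: `load_cached()`, `fake_fetch()`, and the `toy_lexicon` name are hypothetical, and the "download" is a stand-in that builds a tiny data frame so the example runs without network access.

```r
# Hypothetical helper (not part of textdata): cache a dataset on first use.
load_cached <- function(name, fetch, dir = tempdir()) {
  path <- file.path(dir, paste0(name, ".rds"))
  if (!file.exists(path)) {
    # First use: obtain and parse the data, then store it on disk.
    data <- fetch()
    saveRDS(data, path)
  }
  # Every use: load the stored copy from disk.
  readRDS(path)
}

# Stand-in for a real download-and-parse step; counts how often it is called.
calls <- 0
fake_fetch <- function() {
  calls <<- calls + 1
  data.frame(word = c("good", "bad"), value = c(3, -3))
}

# Use a fresh directory so the sketch does not reuse an older cache file.
cache_dir <- tempfile("textdata-sketch-")
dir.create(cache_dir)

first  <- load_cached("toy_lexicon", fake_fetch, dir = cache_dir)
second <- load_cached("toy_lexicon", fake_fetch, dir = cache_dir)
```

In this sketch the second call never invokes `fake_fetch()`; it only reads the file written by the first call.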
## Installation
You can install the released version of textdata from [CRAN](https://CRAN.R-project.org) with:
``` r
install.packages("textdata")
```

And the development version from [GitHub](https://github.com/) with:
``` r
# install.packages("remotes")
remotes::install_github("EmilHvitfeldt/textdata")
```
## Example

The first time you use one of the functions for accessing an included text dataset, such as `lexicon_afinn()` or `dataset_sentence_polarity()`, the function will prompt you to agree that you understand the dataset's license or terms of use and then download the dataset to your computer.
After the first use, each time you use a function like `lexicon_afinn()`, the function will load the dataset from disk.
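
A minimal sketch of that flow, assuming textdata is installed and the code runs in an interactive session (the first call needs you to answer the prompt). The `can_run` guard is only an illustrative variable so the snippet does nothing when those assumptions do not hold.

```r
# Assumption: textdata is installed and the session is interactive, because
# the first call asks for confirmation before downloading anything.
can_run <- interactive() && requireNamespace("textdata", quietly = TRUE)

if (can_run) {
  # First call: shows the license prompt, then downloads and stores the data.
  # Later calls: load the stored copy from disk without prompting.
  afinn <- textdata::lexicon_afinn()
  print(head(afinn))
}
```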
## Included text datasets
As of today, the datasets included in textdata are:
| Dataset | Function |
| --------------------------------------------------------------- | ----------------------------- |
| v1.0 sentence polarity dataset | `dataset_sentence_polarity()` |
| AFINN-111 sentiment lexicon | `lexicon_afinn()` |
| Hu and Liu's opinion lexicon | `lexicon_bing()` |
| NRC word-emotion association lexicon | `lexicon_nrc()` |
| NRC Emotion Intensity Lexicon | `lexicon_nrc_eil()` |
| The NRC Valence, Arousal, and Dominance Lexicon | `lexicon_nrc_vad()` |
| Loughran and McDonald's opinion lexicon for financial documents | `lexicon_loughran()` |
| AG's News | `dataset_ag_news()` |
| DBpedia ontology | `dataset_dbpedia()` |
| Trec-6 and Trec-50 | `dataset_trec()` |
| IMDb Large Movie Review Dataset | `dataset_imdb()` |
| Stanford NLP GloVe pre-trained word vectors | `embedding_glove6b()` |
| | `embedding_glove27b()` |
| | `embedding_glove42b()` |
| | `embedding_glove840b()` |

Check out each function's documentation for detailed information (including citations) for the relevant dataset.
## Community Guidelines
Note that this project is released with a
[Contributor Code of Conduct](https://github.com/EmilHvitfeldt/textdata/blob/main/CODE_OF_CONDUCT.md).
By contributing to this project, you agree to abide by its terms.
Feedback, bug reports (and fixes!), and feature requests are welcome; file
issues or seek support [here](https://github.com/EmilHvitfeldt/textdata/issues).
For details on how to add a new dataset to this package, check out the vignette!