Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/EmilHvitfeldt/textdata
Download, parse, store, and load text datasets instead of storing them in packages
- Host: GitHub
- URL: https://github.com/EmilHvitfeldt/textdata
- Owner: EmilHvitfeldt
- License: other
- Created: 2019-01-01T01:41:17.000Z (almost 6 years ago)
- Default Branch: main
- Last Pushed: 2024-01-22T17:26:21.000Z (10 months ago)
- Last Synced: 2024-04-23T17:52:38.202Z (7 months ago)
- Topics: r, rstats, text-datasets
- Language: R
- Homepage: https://emilhvitfeldt.github.io/textdata/
- Size: 14.3 MB
- Stars: 73
- Watchers: 9
- Forks: 12
- Open Issues: 12
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project
- jimsghstars - EmilHvitfeldt/textdata - Download, parse, store, and load text datasets instead of storing them in packages (R)
README
---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-"
)
```

# textdata
[![R-CMD-check](https://github.com/EmilHvitfeldt/textdata/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/EmilHvitfeldt/textdata/actions/workflows/R-CMD-check.yaml)
[![CRAN status](https://www.r-pkg.org/badges/version/textdata)](https://CRAN.R-project.org/package=textdata)
[![Downloads](http://cranlogs.r-pkg.org/badges/textdata)](https://cran.r-project.org/package=textdata)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3244433.svg)](https://doi.org/10.5281/zenodo.3244433)
[![Codecov test coverage](https://codecov.io/gh/EmilHvitfeldt/textdata/branch/main/graph/badge.svg)](https://app.codecov.io/gh/EmilHvitfeldt/textdata?branch=main)
[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html)

The goal of textdata is to provide easy access to text-related datasets without bundling them inside a package. Some text datasets are too large to store within an R package, or are licensed in such a way that they cannot be included in an OSS-licensed package. Instead, this package provides a framework to download, parse, and store the datasets on disk and load them when needed.
## Installation
You can install the released version of textdata from [CRAN](https://CRAN.R-project.org) with:
``` r
install.packages("textdata")
```

And the development version from [GitHub](https://github.com/) with:
``` r
# install.packages("remotes")
remotes::install_github("EmilHvitfeldt/textdata")
```
## Example

The first time you use one of the functions for accessing an included text dataset, such as `lexicon_afinn()` or `dataset_sentence_polarity()`, the function will prompt you to agree that you understand the dataset's license or terms of use and then download the dataset to your computer.
![](man/figures/textdata_demo.gif)
After the first use, each time you use a function like `lexicon_afinn()`, the function will load the dataset from disk.
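A minimal sketch of that workflow in an interactive session (the license prompt requires confirmation from the user, so this is not suited to non-interactive scripts):

``` r
library(textdata)

# First call: prompts you to accept the license/terms of use,
# then downloads the lexicon and caches it on disk.
afinn <- lexicon_afinn()

# Later calls skip the prompt and download, loading the cached copy from disk.
afinn <- lexicon_afinn()
head(afinn)
```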
## Included text datasets
The datasets currently included in textdata are:
| Dataset | Function |
| --------------------------------------------------------------- | ----------------------------- |
| v1.0 sentence polarity dataset | `dataset_sentence_polarity()` |
| AFINN-111 sentiment lexicon | `lexicon_afinn()` |
| Hu and Liu's opinion lexicon | `lexicon_bing()` |
| NRC word-emotion association lexicon | `lexicon_nrc()` |
| NRC Emotion Intensity Lexicon | `lexicon_nrc_eil()` |
| The NRC Valence, Arousal, and Dominance Lexicon | `lexicon_nrc_vad()` |
| Loughran and McDonald's opinion lexicon for financial documents | `lexicon_loughran()` |
| AG's News | `dataset_ag_news()` |
| DBpedia ontology | `dataset_dbpedia()` |
| Trec-6 and Trec-50 | `dataset_trec()` |
| IMDb Large Movie Review Dataset | `dataset_imdb()` |
| Stanford NLP GloVe pre-trained word vectors | `embedding_glove6b()` |
| | `embedding_glove27b()` |
| | `embedding_glove42b()` |
| | `embedding_glove840b()` |

Check out each function's documentation for detailed information (including citations) about the relevant dataset.
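Each accessor in the table follows the same agree, download, and cache pattern; a brief sketch using two of them (output columns are not assumed here, just inspected with `head()`):

``` r
library(textdata)

# Movie-review sentences labelled by polarity (v1.0 sentence polarity dataset).
polarity <- dataset_sentence_polarity()
head(polarity)

# Hu and Liu's positive/negative opinion word list.
bing <- lexicon_bing()
head(bing)
```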
## Community Guidelines
Note that this project is released with a
[Contributor Code of Conduct](https://github.com/EmilHvitfeldt/textdata/blob/main/CODE_OF_CONDUCT.md).
By contributing to this project, you agree to abide by its terms.
Feedback, bug reports (and fixes!), and feature requests are welcome; file
issues or seek support [here](https://github.com/EmilHvitfeldt/textdata/issues).
For details on how to add a new dataset to this package, check out the vignette!