https://github.com/emilhvitfeldt/textdata

Download, parse, store, and load text datasets instead of storing them in packages

r rstats text-datasets


---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-"
)
```

# textdata

[![R-CMD-check](https://github.com/EmilHvitfeldt/textdata/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/EmilHvitfeldt/textdata/actions/workflows/R-CMD-check.yaml)
[![CRAN status](https://www.r-pkg.org/badges/version/textdata)](https://CRAN.R-project.org/package=textdata)
[![Downloads](http://cranlogs.r-pkg.org/badges/textdata)](https://cran.r-project.org/package=textdata)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3244433.svg)](https://doi.org/10.5281/zenodo.3244433)
[![Codecov test coverage](https://codecov.io/gh/EmilHvitfeldt/textdata/branch/main/graph/badge.svg)](https://app.codecov.io/gh/EmilHvitfeldt/textdata?branch=main)
[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html)

The goal of textdata is to provide easy access to text-related datasets without bundling them inside a package. Some text datasets are too large to store within an R package, or are licensed in a way that prevents them from being included in an OSS-licensed package. Instead, this package provides a framework to download, parse, and store the datasets on disk and load them when needed.

## Installation

You can install the released version of textdata from [CRAN](https://CRAN.R-project.org) with:

``` r
install.packages("textdata")
```

You can install the development version from [GitHub](https://github.com/) with:

``` r
# install.packages("remotes")
remotes::install_github("EmilHvitfeldt/textdata")
```

## Example

The first time you use one of the functions for accessing an included text dataset, such as `lexicon_afinn()` or `dataset_sentence_polarity()`, the function will prompt you to agree that you understand the dataset's license or terms of use and then download the dataset to your computer.

![](man/figures/textdata_demo.gif)

After the first use, each time you use a function like `lexicon_afinn()`, the function will load the dataset from disk.
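
As a minimal sketch of that workflow, the snippet below checks where the AFINN files are stored and then loads the lexicon. It assumes textdata is installed and relies on the `return_path` argument, which should report the dataset folder without downloading anything. The `interactive()` guard is an addition for this example, since the first call needs a session that can show the license prompt.

``` r
library(textdata)

# Ask where the AFINN lexicon is (or will be) stored on disk.
# With return_path = TRUE the function is expected to return the folder
# path only, without prompting or downloading.
afinn_dir <- lexicon_afinn(return_path = TRUE)
afinn_dir

# The first real call prompts for agreement to the terms of use and then
# downloads the data, so it is only run in an interactive session here.
if (interactive()) {
  afinn <- lexicon_afinn()  # later calls load the stored copy from disk
  head(afinn)
}
```

To keep the files somewhere other than the default location, the same functions accept a `dir` argument, for example `lexicon_afinn(dir = "path/to/folder")`.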

## Included text datasets

The datasets currently available through textdata are:

| Dataset | Function |
| --------------------------------------------------------------- | ----------------------------- |
| v1.0 sentence polarity dataset | `dataset_sentence_polarity()` |
| AFINN-111 sentiment lexicon | `lexicon_afinn()` |
| Hu and Liu's opinion lexicon | `lexicon_bing()` |
| NRC word-emotion association lexicon | `lexicon_nrc()` |
| NRC Emotion Intensity Lexicon | `lexicon_nrc_eil()` |
| The NRC Valence, Arousal, and Dominance Lexicon | `lexicon_nrc_vad()` |
| Loughran and McDonald's opinion lexicon for financial documents | `lexicon_loughran()` |
| AG's News | `dataset_ag_news()` |
| DBpedia ontology | `dataset_dbpedia()` |
| Trec-6 and Trec-50 | `dataset_trec()` |
| IMDb Large Movie Review Dataset | `dataset_imdb()` |
| Stanford NLP GloVe pre-trained word vectors | `embedding_glove6b()` |
| | `embedding_glove27b()` |
| | `embedding_glove42b()` |
| | `embedding_glove840b()` |

Check out each function's documentation for detailed information (including citations) for the relevant dataset.

## Community Guidelines

Note that this project is released with a
[Contributor Code of Conduct](https://github.com/EmilHvitfeldt/textdata/blob/main/CODE_OF_CONDUCT.md).
By contributing to this project, you agree to abide by its terms.
Feedback, bug reports (and fixes!), and feature requests are welcome; file
issues or seek support [here](https://github.com/EmilHvitfeldt/textdata/issues).
For details on how to add a new dataset to this package, check out the vignette!