https://github.com/emilhvitfeldt/textdata

Download, parse, store, and load text datasets instead of storing them in packages

r rstats text-datasets

Host: GitHub
URL: https://github.com/emilhvitfeldt/textdata
Owner: EmilHvitfeldt
License: other
Created: 2019-01-01T01:41:17.000Z (over 7 years ago)
Default Branch: main
Last Pushed: 2024-05-28T22:02:37.000Z (about 2 years ago)
Last Synced: 2025-03-31T15:25:45.346Z (about 1 year ago)
Topics: r, rstats, text-datasets
Language: R
Homepage: https://emilhvitfeldt.github.io/textdata/
Size: 14.4 MB
Stars: 75
Watchers: 8
Forks: 12
Open Issues: 12
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-"
)
```

# textdata 

[![R-CMD-check](https://github.com/EmilHvitfeldt/textdata/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/EmilHvitfeldt/textdata/actions/workflows/R-CMD-check.yaml)

[![CRAN status](https://www.r-pkg.org/badges/version/textdata)](https://CRAN.R-project.org/package=textdata)

[![Downloads](http://cranlogs.r-pkg.org/badges/textdata)](https://cran.r-project.org/package=textdata)

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3244433.svg)](https://doi.org/10.5281/zenodo.3244433)

[![Codecov test coverage](https://codecov.io/gh/EmilHvitfeldt/textdata/branch/main/graph/badge.svg)](https://app.codecov.io/gh/EmilHvitfeldt/textdata?branch=main)

[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html)

The goal of textdata is to provide easy access to text-related datasets without bundling them inside a package. Some text datasets are too large to store within an R package, or are licensed in a way that prevents them from being included in an OSS-licensed package. Instead, this package provides a framework to download, parse, and store the datasets on disk and load them when needed.
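The download-once, load-later idea can be illustrated with a small base R sketch. This is not textdata's implementation: `load_cached()`, `fake_fetch()`, and the toy lexicon are hypothetical names invented for illustration, and the "download" is simulated by a function that builds a data frame.

``` r
# Hypothetical helper (not part of textdata): fetch a dataset once, then
# reuse the copy stored on disk.
load_cached <- function(name, fetch, dir = tempdir()) {
  path <- file.path(dir, paste0(name, ".rds"))
  if (!file.exists(path)) {
    # First use: obtain the data and store it on disk.
    # A real fetch step would download and parse a file.
    saveRDS(fetch(), path)
  }
  # First and later uses: load the stored copy from disk.
  readRDS(path)
}

# Stand-in for a download-and-parse step; counts how often it runs.
fetch_calls <- 0
fake_fetch <- function() {
  fetch_calls <<- fetch_calls + 1
  data.frame(word = c("good", "bad"), value = c(3, -3))
}

# Use a fresh directory so the sketch does not touch any real cache.
cache_dir <- tempfile("textdata-sketch-")
dir.create(cache_dir)

first <- load_cached("toy_lexicon", fake_fetch, dir = cache_dir)
second <- load_cached("toy_lexicon", fake_fetch, dir = cache_dir)
```

In this sketch the second call never runs `fake_fetch()`, because the stored file already exists; that is the behaviour the package aims for with real datasets.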

## Installation

You can install the released version of textdata from [CRAN](https://CRAN.R-project.org) with:

``` r
install.packages("textdata")
```

And the development version from [GitHub](https://github.com/) with:

``` r
# install.packages("remotes")
remotes::install_github("EmilHvitfeldt/textdata")
```

## Example

The first time you use one of the functions for accessing an included text dataset, such as `lexicon_afinn()` or `dataset_sentence_polarity()`, the function will prompt you to agree that you understand the dataset's license or terms of use and then download the dataset to your computer.

![](man/figures/textdata_demo.gif)

After the first use, each time you use a function like `lexicon_afinn()`, the function will load the dataset from disk.
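A typical session might look like the sketch below. It assumes textdata is installed, that the session is interactive so the license prompt can be answered, and that an internet connection is available for the first download. The `return_path` argument is assumed from the function's help page; confirm it with `?lexicon_afinn` before relying on it.

``` r
library(textdata)

# First use: prompts you to accept the dataset's license or terms of use,
# then downloads the data and stores it on disk.
afinn <- lexicon_afinn()

# Inspect the first few rows of the lexicon.
head(afinn)

# Later uses: the stored copy is loaded from disk, with no new download.
afinn <- lexicon_afinn()

# Assumed optional argument (check ?lexicon_afinn):
# return the folder where the dataset is stored instead of the data itself.
lexicon_afinn(return_path = TRUE)
```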

## Included text datasets

As of today, the datasets included in textdata are:

| Dataset                                                         | Function                      |
| --------------------------------------------------------------- | ----------------------------- |
| v1.0 sentence polarity dataset                                  | `dataset_sentence_polarity()` |
| AFINN-111 sentiment lexicon                                     | `lexicon_afinn()`             |
| Hu and Liu's opinion lexicon                                    | `lexicon_bing()`              |
| NRC word-emotion association lexicon                            | `lexicon_nrc()`               |
| NRC Emotion Intensity Lexicon                                   | `lexicon_nrc_eil()`           |
| The NRC Valence, Arousal, and Dominance Lexicon                 | `lexicon_nrc_vad()`           |
| Loughran and McDonald's opinion lexicon for financial documents | `lexicon_loughran()`          |
| AG's News                                                       | `dataset_ag_news()`           |
| DBpedia ontology                                                | `dataset_dbpedia()`           |
| Trec-6 and Trec-50                                              | `dataset_trec()`              |
| IMDb Large Movie Review Dataset                                 | `dataset_imdb()`              |
| Stanford NLP GloVe pre-trained word vectors                     | `embedding_glove6b()`         |
|                                                                 | `embedding_glove27b()`        |
|                                                                 | `embedding_glove42b()`        |
|                                                                 | `embedding_glove840b()`       |

Check out each function's documentation for detailed information (including citations) for the relevant dataset.
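For instance, the help page for a given function can be opened from the R console. The sketch below assumes textdata is installed; `lexicon_nrc` is used only as an example, and any function from the table works the same way.

``` r
# Open the help page (description, arguments, and citation details)
# for one of the dataset functions.
help("lexicon_nrc", package = "textdata")

# List everything the package exports, to see which loaders are available.
sort(getNamespaceExports("textdata"))
```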

## Community Guidelines

Note that this project is released with a [Contributor Code of Conduct](https://github.com/EmilHvitfeldt/textdata/blob/main/CODE_OF_CONDUCT.md). By contributing to this project, you agree to abide by its terms.

Feedback, bug reports (and fixes!), and feature requests are welcome; file issues or seek support [here](https://github.com/EmilHvitfeldt/textdata/issues).

For details on how to add a new dataset to this package, check out the vignette!
