{"id":18043607,"url":"https://github.com/emilhvitfeldt/textdata","last_synced_at":"2025-04-10T03:50:33.016Z","repository":{"id":56936374,"uuid":"163707177","full_name":"EmilHvitfeldt/textdata","owner":"EmilHvitfeldt","description":"Download, parse, store, and load text datasets instead of storing it in packages ","archived":false,"fork":false,"pushed_at":"2024-05-28T22:02:37.000Z","size":15091,"stargazers_count":75,"open_issues_count":12,"forks_count":12,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-03-31T15:25:45.346Z","etag":null,"topics":["r","rstats","text-datasets"],"latest_commit_sha":null,"homepage":"https://emilhvitfeldt.github.io/textdata/","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/EmilHvitfeldt.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-01-01T01:41:17.000Z","updated_at":"2024-05-28T22:00:29.000Z","dependencies_parsed_at":"2024-05-28T23:46:02.208Z","dependency_job_id":"692e3f5a-4907-4085-9813-3d2afe7e6870","html_url":"https://github.com/EmilHvitfeldt/textdata","commit_stats":{"total_commits":122,"total_committers":4,"mean_commits":30.5,"dds":0.05737704918032782,"last_synced_commit":"256ccd6fe771b09efe51ffad70513f83578f3e6e"},"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EmilHvitfeldt%2Ftextdata","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EmilHvitfeldt%2Ftextdata/tags","releases_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/repositories/EmilHvitfeldt%2Ftextdata/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EmilHvitfeldt%2Ftextdata/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/EmilHvitfeldt","download_url":"https://codeload.github.com/EmilHvitfeldt/textdata/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248154999,"owners_count":21056542,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["r","rstats","text-datasets"],"created_at":"2024-10-30T17:09:15.154Z","updated_at":"2025-04-10T03:50:32.976Z","avatar_url":"https://github.com/EmilHvitfeldt.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"---\noutput: github_document\n---\n\n\u003c!-- README.md is generated from README.Rmd. 
Please edit that file --\u003e\n\n```{r, include = FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"#\u003e\",\n  fig.path = \"man/figures/README-\"\n)\n```\n\n# textdata \u003cimg src='man/figures/logo.png' style=\"float:right\" height=\"139\" /\u003e\n\n\u003c!-- badges: start --\u003e\n[![R-CMD-check](https://github.com/EmilHvitfeldt/textdata/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/EmilHvitfeldt/textdata/actions/workflows/R-CMD-check.yaml)\n[![CRAN status](https://www.r-pkg.org/badges/version/textdata)](https://CRAN.R-project.org/package=textdata)\n[![Downloads](http://cranlogs.r-pkg.org/badges/textdata)](https://cran.r-project.org/package=textdata)\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3244433.svg)](https://doi.org/10.5281/zenodo.3244433)\n[![Codecov test coverage](https://codecov.io/gh/EmilHvitfeldt/textdata/branch/main/graph/badge.svg)](https://app.codecov.io/gh/EmilHvitfeldt/textdata?branch=main)\n[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html)\n\u003c!-- badges: end --\u003e\n\nThe goal of textdata is to provide easy access to text-related datasets without bundling them inside a package. Some text datasets are too large to store within an R package or are licensed in a way that prevents them from being included in an OSS-licensed package. 
Instead, this package provides a framework to download, parse, and store the datasets on disk and load them when needed.\n\n## Installation\n\nYou can install the released version of textdata from [CRAN](https://CRAN.R-project.org) with:\n\n``` r\ninstall.packages(\"textdata\")\n```\n\nAnd the development version from [GitHub](https://github.com/) with:\n\n``` r\n# install.packages(\"remotes\")\nremotes::install_github(\"EmilHvitfeldt/textdata\")\n```\n## Example\n\nThe first time you use one of the functions for accessing an included text dataset, such as `lexicon_afinn()` or `dataset_sentence_polarity()`, the function will prompt you to agree that you understand the dataset's license or terms of use and then download the dataset to your computer.\n\n![](man/figures/textdata_demo.gif)\n\nAfter the first use, each time you use a function like `lexicon_afinn()`, the function will load the dataset from disk.\n\n## Included text datasets\n\nAs of today, the datasets included in textdata are:\n\n| Dataset                                                         | Function                      |\n| --------------------------------------------------------------- | ----------------------------- |\n| v1.0 sentence polarity dataset                                  | `dataset_sentence_polarity()` |\n| AFINN-111 sentiment lexicon                                     | `lexicon_afinn()`             |\n| Hu and Liu's opinion lexicon                                    | `lexicon_bing()`              |\n| NRC word-emotion association lexicon                            | `lexicon_nrc()`               |\n| NRC Emotion Intensity Lexicon                                   | `lexicon_nrc_eil()`           |\n| The NRC Valence, Arousal, and Dominance Lexicon                 | `lexicon_nrc_vad()`           |\n| Loughran and McDonald's opinion lexicon for financial documents | `lexicon_loughran()`          |\n| AG's News                                                       | 
`dataset_ag_news()`           |\n| DBpedia ontology                                                | `dataset_dbpedia()`           |\n| Trec-6 and Trec-50                                              | `dataset_trec()`              |\n| IMDb Large Movie Review Dataset\t                                | `dataset_imdb()`              |\n| Stanford NLP GloVe pre-trained word vectors                     | `embedding_glove6b()`         |\n|                                                                 | `embedding_glove27b()`        |\n|                                                                 | `embedding_glove42b()`        |\n|                                                                 | `embedding_glove840b()`       |\n\nCheck out each function's documentation for detailed information (including citations) for the relevant dataset.\n\n## Community Guidelines\n\nNote that this project is released with a\n[Contributor Code of Conduct](https://github.com/EmilHvitfeldt/textdata/blob/main/CODE_OF_CONDUCT.md).\nBy contributing to this project, you agree to abide by its terms. \nFeedback, bug reports (and fixes!), and feature requests are welcome; file \nissues or seek support [here](https://github.com/EmilHvitfeldt/textdata/issues).\nFor details on how to add a new dataset to this package, check out the vignette!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Femilhvitfeldt%2Ftextdata","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Femilhvitfeldt%2Ftextdata","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Femilhvitfeldt%2Ftextdata/lists"}