{"id":13858135,"url":"https://github.com/quanteda/stopwords","last_synced_at":"2025-12-12T01:04:41.474Z","repository":{"id":45115833,"uuid":"110147368","full_name":"quanteda/stopwords","owner":"quanteda","description":"Multilingual Stopword Lists in R","archived":false,"fork":false,"pushed_at":"2022-01-07T08:58:41.000Z","size":1019,"stargazers_count":113,"open_issues_count":2,"forks_count":9,"subscribers_count":11,"default_branch":"master","last_synced_at":"2024-05-21T02:53:32.796Z","etag":null,"topics":["r","text-analysis"],"latest_commit_sha":null,"homepage":"http://stopwords.quanteda.io","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/quanteda.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-11-09T17:44:27.000Z","updated_at":"2024-04-23T19:33:56.000Z","dependencies_parsed_at":"2022-09-14T07:31:02.312Z","dependency_job_id":null,"html_url":"https://github.com/quanteda/stopwords","commit_stats":null,"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quanteda%2Fstopwords","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quanteda%2Fstopwords/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quanteda%2Fstopwords/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quanteda%2Fstopwords/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/quanteda","download_url":"https://codeload.github.com/quanteda/stopwords/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"gi
thub","repositories_count":247557770,"owners_count":20958047,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["r","text-analysis"],"created_at":"2024-08-05T03:01:57.842Z","updated_at":"2025-12-12T01:04:41.375Z","avatar_url":"https://github.com/quanteda.png","language":"R","funding_links":[],"categories":["R"],"sub_categories":[],"readme":"---\noutput: github_document\n---\n\n```{r, echo = FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"##\",\n  fig.path = \"man/images/\"\n)\n```\n\n```{r echo=FALSE, results=\"hide\", message=FALSE}\nlibrary(\"badger\")\n```\n\n# stopwords: the R package\n\n[![CRAN Version](https://www.r-pkg.org/badges/version/stopwords)](https://CRAN.R-project.org/package=stopwords)\n`r badge_devel(\"quanteda/stopwords\", \"royalblue\")`\n[![R build status](https://github.com/quanteda/stopwords/workflows/R-CMD-check/badge.svg)](https://github.com/quanteda/stopwords/actions)\n[![codecov](https://codecov.io/gh/quanteda/stopwords/branch/master/graph/badge.svg)](https://app.codecov.io/gh/quanteda/stopwords)\n[![Downloads](https://cranlogs.r-pkg.org/badges/stopwords)](https://CRAN.R-project.org/package=stopwords)\n[![Total Downloads](https://cranlogs.r-pkg.org/badges/grand-total/stopwords?color=orange)](https://CRAN.R-project.org/package=stopwords)\n\nR package providing \"one-stop shopping\" (or should that be \"one-shop stopping\"?) for stopword lists in R, for multiple languages and sources. 
No longer should text analysis or NLP packages bake in their own stopword lists or functions, since this package can accommodate them all, and is easily extended.\n\nCreated by [David Muhr](https://github.com/davnn), and extended in cooperation with [Kenneth Benoit](https://github.com/kbenoit) and [Kohei Watanabe](https://github.com/koheiw).\n\n## Installation\n\n```{r, eval = FALSE}\n# from CRAN\ninstall.packages(\"stopwords\")\n\n# Or get the development version from GitHub:\n# install.packages(\"devtools\")\ndevtools::install_github(\"quanteda/stopwords\")\n```\n\n## Usage\n\n```{r}\nhead(stopwords::stopwords(\"de\", source = \"snowball\"), 20)\n\nhead(stopwords::stopwords(\"ja\", source = \"marimo\"), 20)\n```\n\nFor compatibility with the former `quanteda::stopwords()`:\n\n```{r}\nhead(stopwords::stopwords(\"german\"), 20)\n```\n\nExplore sources and languages:\n\n```{r}\n# list all sources\nstopwords::stopwords_getsources()\n\n# list languages for a specific source\nstopwords::stopwords_getlanguages(\"snowball\")\n```\n\n## Languages available\n\nThe following coverage of languages is currently available, by source. Note that the inclusiveness of the stopword lists will vary by source, and the number of languages covered by a stopword list does not necessarily mean that the source is better than one with more limited coverage. 
(There may be many reasons to prefer the default \"snowball\" source over the \"stopwords-iso\" source, for instance.)\n\nThe following languages are currently available:\n\n```{r, echo=FALSE}\ndat \u003c- read.csv(\"data-raw/language.csv\", as.is = TRUE, check.names = FALSE)\ndat[is.na(dat)] \u003c- \"\"\ndat[dat == \"TRUE\"] \u003c- \"\\u2713\"\nknitr::kable(dat, align = c(\"l\", \"l\", \"c\", \"c\", \"c\", \"c\", \"l\"))\n```\n\n## Basic usage\n\n```{r}\nhead(stopwords::stopwords(\"de\", source = \"snowball\"), 20)\n\nhead(stopwords::stopwords(\"de\", source = \"stopwords-iso\"), 20)\n```\n\n## Modifying stopword lists\n\nIt is now possible to edit your own stopword lists in an interactive editor, using functions from the **quanteda** package (\u003e= v2.02).  For instance, to edit the English stopword list for the Snowball source:\n\n```{r eval = FALSE}\n# edit the English stopwords\nmy_stopwords \u003c- quanteda::char_edit(stopwords(\"en\", source = \"snowball\"))\n```\n\nTo edit stopwords whose underlying structure is a list, such as the \"marimo\" source, we can use the `list_edit()` function:\n\n```{r eval = FALSE}\n# edit the English stopword list from the marimo source\nmy_stopwordlist \u003c- quanteda::list_edit(stopwords(\"en\", source = \"marimo\", simplify = FALSE))\n```\n\nFinally, it's possible to remove stopwords using pattern matching.  The default is the easy-to-use [\"glob\" style matching](https://en.wikipedia.org/wiki/Glob_(programming)), which is equivalent to fixed matching when no wildcard characters are used.  
So to remove possessive pronouns from the English Snowball word list, for instance, this would work:\n```{r}\nlibrary(\"quanteda\", warn.conflicts = FALSE)\nposspronouns \u003c- stopwords::data_stopwords_marimo$en$pronoun$possessive\nposspronouns\n\nstopwords(\"en\", source = \"snowball\") %\u003e%\n  head(n = 10)\n```\nSee the difference when we remove them -- \"my\", \"ours\", and \"your\" are gone:\n```{r}\nstopwords(\"en\", source = \"snowball\") %\u003e%\n  head(n = 10) %\u003e%\n  char_remove(pattern = posspronouns)\n```\n\nThere is no `char_add()`, since it's just as easy to use `c()` for this, but there is a `char_keep()` for positive selection rather than removal.\n\n\n## Adding stopwords to your own package\n\nIn v2.2, we've removed the function `use_stopwords()` because the dependency on\n**usethis** added too many downstream package dependencies, and **stopwords** is\nmeant to be a lightweight package.\n\nHowever, it is very easy to add a re-export for `stopwords()` to your package by adding this file as `stopwords.R`:\n\n```{r, eval = FALSE}\n#' Stopwords\n#'\n#' @description\n#' Return a character vector of stopwords.\n#' See \\code{stopwords::\\link[stopwords:stopwords]{stopwords()}} for details.\n#' @usage stopwords(language = \"en\", source = \"snowball\")\n#' @name stopwords\n#' @importFrom stopwords stopwords\n#' @export\nNULL\n```\n\nand add `stopwords` to the list of `Imports:` in your `DESCRIPTION` file.\n\n\n## Contributing\n\nAdditional sources can be defined and contributed by adding new data objects, as follows:\n\n1. **Data object**.  Create a named list of character vectors, in UTF-8 format, consisting of the stopwords for each language. The [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) language code will form the name of the list element, and the values of each element will be the character vector of stopwords for literal matches. 
The data object should follow the package naming convention, and be called `data_stopwords_newsource`, where `newsource` is replaced by the name of the new source.\n\n2. **Documentation**.  The new source should be clearly documented, especially the origin from which it was taken.\n\n## License\n\nThis package as well as the source repositories are licensed under MIT.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquanteda%2Fstopwords","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fquanteda%2Fstopwords","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquanteda%2Fstopwords/lists"}