{"id":15370651,"url":"https://github.com/dfalbel/ptstem","last_synced_at":"2025-06-25T14:33:20.227Z","repository":{"id":85587894,"uuid":"57229381","full_name":"dfalbel/ptstem","owner":"dfalbel","description":"Stemming Algorithms for the Portuguese Language","archived":false,"fork":false,"pushed_at":"2020-05-12T19:56:15.000Z","size":1522,"stargazers_count":21,"open_issues_count":2,"forks_count":2,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-06-12T00:06:44.181Z","etag":null,"topics":["hunspell","portuguese-language","r","stem","stemmer","stemming-algorithm"],"latest_commit_sha":null,"homepage":"http://dfalbel.github.io/ptstem/","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dfalbel.png","metadata":{"files":{"readme":"README.Rmd","changelog":"NEWS.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2016-04-27T16:37:54.000Z","updated_at":"2024-08-21T14:19:52.000Z","dependencies_parsed_at":"2023-03-10T23:45:21.671Z","dependency_job_id":null,"html_url":"https://github.com/dfalbel/ptstem","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/dfalbel/ptstem","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dfalbel%2Fptstem","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dfalbel%2Fptstem/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dfalbel%2Fptstem/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dfalbel%2Fptstem/manifests","owner_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/owners/dfalbel","download_url":"https://codeload.github.com/dfalbel/ptstem/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dfalbel%2Fptstem/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259763075,"owners_count":22907408,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["hunspell","portuguese-language","r","stem","stemmer","stemming-algorithm"],"created_at":"2024-10-01T13:42:54.549Z","updated_at":"2025-06-14T05:05:28.557Z","avatar_url":"https://github.com/dfalbel.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"---\noutput: github_document\n---\n\n\u003c!-- README.md is generated from README.Rmd. 
Please edit that file --\u003e\n\n```{r, echo = FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"#\u003e\",\n  fig.path = \"README-\"\n)\n```\n\nptstem\n==========================================\n\u003e Stemming Algorithms for the Portuguese Language\n\n[![Travis-CI Build Status](https://travis-ci.org/dfalbel/ptstem.svg?branch=master)](https://travis-ci.org/dfalbel/ptstem)\n[![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/dfalbel/ptstem?branch=master\u0026svg=true)](https://ci.appveyor.com/project/dfalbel/ptstem)\n[![Coverage Status](https://img.shields.io/codecov/c/github/dfalbel/ptstem/master.svg)](https://codecov.io/github/dfalbel/ptstem?branch=master)\n[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/ptstem)](https://cran.r-project.org/package=ptstem)\n[![](http://cranlogs.r-pkg.org/badges/ptstem)](https://cran.r-project.org/package=ptstem)\n\nThis package wraps 3 stemming algorithms for the Portuguese language that are available in R. It\nunifies the API for the stemmers and provides easy stem completion.\n\n## Installing\n\nYou can install directly from GitHub using:\n\n```{r, eval = F}\ndevtools::install_github(\"dfalbel/ptstem\")\n```\n\nor from CRAN using:\n\n```{r, eval=FALSE}\ninstall.packages(\"ptstem\")\n```\n\n\n## Using\n\nConsider the following text, taken from the [Portuguese Wikipedia article on stemming](https://pt.wikipedia.org/wiki/Stemiza%C3%A7%C3%A3o):\n\n```{r}\ntext \u003c- \"Em morfologia linguística e recuperação de informação a stemização (do inglês, stemming) é\no processo de reduzir palavras flexionadas (ou às vezes derivadas) ao seu tronco (stem), base ou\nraiz, geralmente uma forma da palavra escrita. O tronco não precisa ser idêntico à raiz morfológica\nda palavra; ele geralmente é suficiente que palavras relacionadas sejam mapeadas para o mesmo\ntronco, mesmo se este tronco não for ele próprio uma raiz válida. 
O estudo de algoritmos para\nstemização tem sido realizado em ciência da computação desde a década de 60. Vários motores de\nbuscas tratam palavras com o mesmo tronco como sinônimos como um tipo de expansão de consulta, em\num processo de combinação.\"\n```\n\nThis will use the [`rslp`](https://github.com/dfalbel/rslp) algorithm to stem the text.\n\n```{r}\nlibrary(ptstem)\nptstem(text, algorithm = \"rslp\", complete = FALSE)\n```\n\nYou can complete stemmed words using the argument `complete = TRUE`.\n\n```{r, eval = F}\nptstem(text, algorithm = \"rslp\", complete = TRUE)\n```\n\nThe other implemented algorithms are:\n\n* hunspell: the same algorithm used by the OpenOffice spell checker (available via the [hunspell](https://github.com/ropensci/hunspell) package).\n* porter: available via the SnowballC package.\n\nYou can stem with these algorithms by changing the `algorithm` argument of the `ptstem` function.\n\n```{r}\nlibrary(ptstem)\nptstem(text, algorithm = \"hunspell\")\nptstem(text, algorithm = \"porter\")\n```\n\n## Performance\n\nThe goal of stemming algorithms is to group related words and to separate unrelated words. With this in mind, there are two kinds of possible errors when stemming:\n\n* Understemming: Related words were not grouped because you didn't stem enough.\n* Overstemming: Unrelated words were grouped because you removed too large a part of the word when stemming.\n\nTo measure these errors, the `performance` function was implemented. It returns a `data.frame` with 3 columns: the name of the stemmer and 2 metrics.\n\n* UI: the understemming index. It's the proportion of related words that were not grouped.\n* OI: the overstemming index. It's the proportion of unrelated words that were grouped.\n\nRemember that OI is 0 if you don't stem. 
So I think the true objective of a stemming algorithm is to reduce UI without increasing OI too much.\n\nThe `ptstem` package provides a dataset of grouped words for the Portuguese language (available [here](http://www.inf.ufrgs.br/~fnflores/paice_tool/)). The `performance` function calculates the metrics described above on this dataset.\n\nSee results:\n\n```{r}\nperformance()\n```\n\nThis is not the only approach for measuring the performance of these algorithms. The article [*Assessing the impact of Stemming Accuracy on Information Retrieval – A multilingual perspective*](http://dx.doi.org/10.1016/j.ipm.2016.03.004) describes various ways to analyse stemming performance.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdfalbel%2Fptstem","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdfalbel%2Fptstem","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdfalbel%2Fptstem/lists"}