{"id":29269114,"url":"https://github.com/master/spark-stemming","last_synced_at":"2025-07-04T20:07:30.274Z","repository":{"id":57721344,"uuid":"52890014","full_name":"master/spark-stemming","owner":"master","description":"Spark MLlib wrapper for the Snowball framework","archived":false,"fork":false,"pushed_at":"2018-11-27T22:38:03.000Z","size":168,"stargazers_count":33,"open_issues_count":1,"forks_count":20,"subscribers_count":5,"default_branch":"master","last_synced_at":"2024-11-15T12:27:05.370Z","etag":null,"topics":["nlp","snowball","spark","stemming"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-2-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/master.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-03-01T16:01:33.000Z","updated_at":"2023-04-11T14:23:16.000Z","dependencies_parsed_at":"2022-09-26T21:41:30.023Z","dependency_job_id":null,"html_url":"https://github.com/master/spark-stemming","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/master/spark-stemming","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/master%2Fspark-stemming","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/master%2Fspark-stemming/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/master%2Fspark-stemming/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/master%2Fspark-stemming/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/master","download_url":"https://codeload.github.com/master/spark-stemming/tar.gz/refs/heads/m
aster","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/master%2Fspark-stemming/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263611900,"owners_count":23488429,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["nlp","snowball","spark","stemming"],"created_at":"2025-07-04T20:07:29.611Z","updated_at":"2025-07-04T20:07:30.265Z","avatar_url":"https://github.com/master.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Spark Stemming\n\n[![Build Status](https://travis-ci.org/master/spark-stemming.svg?branch=master)](https://travis-ci.org/master/spark-stemming)\n\n[Snowball](http://snowballstem.org/) is a small string processing language\ndesigned for creating stemming algorithms for use in Information Retrieval.\nThis package lets you use it as part of the [Spark ML\nPipeline](https://spark.apache.org/docs/latest/ml-guide.html) API.\n\n## Linking\n\nLink against this library using SBT:\n\n```\nlibraryDependencies += \"com.github.master\" %% \"spark-stemming\" % \"0.2.1\"\n```\n\nUsing Maven:\n\n```xml\n\u003cdependency\u003e\n    \u003cgroupId\u003ecom.github.master\u003c/groupId\u003e\n    \u003cartifactId\u003espark-stemming_2.10\u003c/artifactId\u003e\n    \u003cversion\u003e0.2.0\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\nOr include it when starting the Spark shell:\n\n```\n$ bin/spark-shell --packages com.github.master:spark-stemming_2.10:0.2.1\n```\n\n## Features\n\nCurrently implemented algorithms:\n\n* Arabic\n* English\n* English (Porter)\n* Romance 
stemmers:\n  * French\n  * Spanish\n  * Portuguese\n  * Italian\n  * Romanian\n* Germanic stemmers:\n  * German\n  * Dutch\n* Scandinavian stemmers:\n  * Swedish\n  * Norwegian (Bokmål)\n  * Danish\n* Russian\n* Finnish\n* Greek\n\nMore details are on the [Snowball stemming algorithms](http://snowballstem.org/algorithms/) page.\n\n## Usage\n\nThe `Stemmer`\n[Transformer](https://spark.apache.org/docs/latest/ml-guide.html#transformers)\ncan be used directly or as part of an ML\n[Pipeline](https://spark.apache.org/docs/latest/ml-guide.html#pipeline). In\nparticular, it combines well with\n[Tokenizer](https://spark.apache.org/docs/latest/ml-features.html#tokenizer).\n\n```scala\nimport org.apache.spark.mllib.feature.Stemmer\n\nval data = sqlContext\n  .createDataFrame(Seq((\"мама\", 1), (\"мыла\", 2), (\"раму\", 3)))\n  .toDF(\"word\", \"id\")\n\nval stemmed = new Stemmer()\n  .setInputCol(\"word\")\n  .setOutputCol(\"stemmed\")\n  .setLanguage(\"Russian\")\n  .transform(data)\n\nstemmed.show\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaster%2Fspark-stemming","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmaster%2Fspark-stemming","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaster%2Fspark-stemming/lists"}