https://github.com/master/spark-stemming
Spark MLlib wrapper for the Snowball framework
- Host: GitHub
- URL: https://github.com/master/spark-stemming
- Owner: master
- License: bsd-2-clause
- Created: 2016-03-01T16:01:33.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2018-11-27T22:38:03.000Z (almost 7 years ago)
- Last Synced: 2024-11-15T12:27:05.370Z (11 months ago)
- Topics: nlp, snowball, spark, stemming
- Language: Java
- Homepage:
- Size: 164 KB
- Stars: 33
- Watchers: 5
- Forks: 20
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Spark Stemming
[Snowball](http://snowballstem.org/) is a small string processing language
designed for creating stemming algorithms for use in Information Retrieval.
This package lets you use it as part of the [Spark ML
Pipeline](https://spark.apache.org/docs/latest/ml-guide.html) API.

## Linking
Link against this library using SBT:
```
libraryDependencies += "com.github.master" %% "spark-stemming" % "0.2.1"
```

Using Maven:
```xml
<dependency>
  <groupId>com.github.master</groupId>
  <artifactId>spark-stemming_2.10</artifactId>
  <version>0.2.0</version>
</dependency>
```
Or include it when starting the Spark shell:
```
$ bin/spark-shell --packages com.github.master:spark-stemming_2.10:0.2.1
```

## Features
Currently implemented algorithms:
* Arabic
* English
* English (Porter)
* Romance stemmers:
  * French
  * Spanish
  * Portuguese
  * Italian
  * Romanian
* Germanic stemmers:
  * German
  * Dutch
* Scandinavian stemmers:
  * Swedish
  * Norwegian (Bokmål)
  * Danish
* Russian
* Finnish
* Greek

More details are on the [Snowball stemming algorithms](http://snowballstem.org/algorithms/) page.
## Usage
The `Stemmer`
[Transformer](https://spark.apache.org/docs/latest/ml-guide.html#transformers)
can be used directly or as part of an ML
[Pipeline](https://spark.apache.org/docs/latest/ml-guide.html#pipeline). In
particular, it combines well with
[Tokenizer](https://spark.apache.org/docs/latest/ml-features.html#tokenizer).

```scala
import org.apache.spark.mllib.feature.Stemmer

// Build a small DataFrame with one word per row.
val data = sqlContext
  .createDataFrame(Seq(("мама", 1), ("мыла", 2), ("раму", 3)))
  .toDF("word", "id")

// Stem the "word" column with the Russian stemmer.
val stemmed = new Stemmer()
  .setInputCol("word")
  .setOutputCol("stemmed")
  .setLanguage("Russian")
  .transform(data)

stemmed.show
```
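
The Tokenizer combination mentioned above could look roughly like the two-stage pipeline below. This is a minimal sketch and is not taken from the project. It assumes a Spark shell started with the package as described under Linking, with `sqlContext` in scope as in the example above, and it assumes that `Stemmer` accepts the array-of-strings column that `Tokenizer` produces. The example above passes a plain string column, so check which input type your version of the library expects. The column names `text`, `tokens`, and `stems` and the sample sentences are illustrative.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.mllib.feature.Stemmer

// Illustrative input: one sentence per row.
val sentences = sqlContext
  .createDataFrame(Seq(("the cats were running", 1), ("she likes jumping", 2)))
  .toDF("text", "id")

// Stage 1: lowercase the text and split it on whitespace.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("tokens")

// Stage 2: stem each token.
// Assumption: Stemmer accepts an array-of-strings input column.
val stemmer = new Stemmer()
  .setInputCol("tokens")
  .setOutputCol("stems")
  .setLanguage("English")

val pipeline = new Pipeline()
  .setStages(Array[PipelineStage](tokenizer, stemmer))

// Neither stage is an estimator, so fit() only assembles the PipelineModel.
val result = pipeline.fit(sentences).transform(sentences)
result.select("text", "stems").show(false)
```

Wrapping both stages in a `Pipeline` keeps tokenization and stemming together as one reusable unit, so the same steps can be applied to other DataFrames that have a `text` column.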