https://github.com/sadit/snowballstemmer.jl
Julia's wrapper for libstemmer
julia nlp snowball stemmer
- Host: GitHub
- URL: https://github.com/sadit/snowballstemmer.jl
- Owner: sadit
- License: other
- Created: 2017-08-31T16:35:48.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2020-04-06T15:01:41.000Z (about 5 years ago)
- Last Synced: 2025-01-10T17:00:14.303Z (5 months ago)
- Topics: julia, nlp, snowball, stemmer
- Language: Julia
- Size: 21.5 KB
- Stars: 2
- Watchers: 2
- Forks: 3
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE.md
README
SnowballStemmer.jl
===============

The SnowballStemmer.jl package extracts the stemmer functionality of the `TextAnalysis.jl` package, which is also a wrapper for [libstemmer](http://snowball.tartarus.org/).
The idea is to expose the stemming functions without forcing your programs to follow the interfaces of `TextAnalysis.jl`.

# Installation
The SnowballStemmer package can be installed using Julia's package manager:
```julia
julia> Pkg.clone("https://github.com/sadit/SnowballStemmer.jl")
```
You may also need to build the internal libraries:
```julia
julia> Pkg.build("SnowballStemmer")
```

# Getting Started
Just import the stemmer package and you are ready to work.
```julia
julia> using SnowballStemmer
```

Listing the available stemmers (supported languages):
```julia
julia> stemmer_types()
16-element Array{AbstractString,1}:
"danish"
"dutch"
"english"
"finnish"
"french"
"german"
"hungarian"
"italian"
"norwegian"
"porter"
"portuguese"
"romanian"
"russian"
"spanish"
"swedish"
"turkish"
```
A stemmer is initialized as follows:
```julia
julia> s = Stemmer("spanish")
```
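Since the language is passed as a plain string, it can help to check the name against the list of available stemmers before constructing one. The following is only a sketch: `pick_language` is a hypothetical helper, not part of SnowballStemmer.jl, and falling back to `"english"` is an assumption made for illustration.

```julia
# Hypothetical helper (not provided by SnowballStemmer.jl): return the
# lowercased `requested` name if it is in `available`, else `fallback`.
function pick_language(requested, available; fallback="english")
    name = lowercase(requested)
    return name in available ? name : fallback
end

# Shortened list for illustration; with the package loaded you would
# pass `stemmer_types()` instead.
available = ["english", "porter", "spanish"]

known = pick_language("Spanish", available)    # "spanish"
unknown = pick_language("klingon", available)  # "english" (fallback)

# The result can then be given to the constructor:
#   s = Stemmer(pick_language("Spanish", stemmer_types()))
```

Raising an error instead of falling back silently may be preferable when an unexpected language should not go unnoticed.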
Then, use the `stem` function over each word:

```julia
julia> [stem(s, text) for text in split("las casas de colores estan sobre las colinas")]
8-element Array{String,1}:
"las"
"cas"
"de"
"color"
"estan"
"sobr"
"las"
"colin"
```

As you may have noticed, there is no integrated tokenizer. For more complex cases you can write your own tokenizer; for simple cases a regular expression is enough.
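For simple cases, the tokenization step can be kept in a small function of its own. The sketch below is only an illustration: `tokenize` is a hypothetical helper, not part of SnowballStemmer.jl, and lowercasing the text before matching is an assumption about what you want from your tokens.

```julia
# Hypothetical helper, not provided by SnowballStemmer.jl:
# lowercase the text and return its word tokens (runs of `\w` characters).
tokenize(text) = [String(m.match) for m in eachmatch(r"\w+", lowercase(text))]

tokens = tokenize("Browsing and SEARCHING are not equivalent")
# tokens == ["browsing", "and", "searching", "are", "not", "equivalent"]

# With a stemmer `s` created as shown earlier, each token can then be stemmed:
#   [stem(s, t) for t in tokens]
```

Keeping the tokenizer separate from the stemmer makes it easy to replace the regular expression later without touching the stemming code.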
The following is an example of use for an English sentence:
```julia
julia> e = Stemmer("english")
SnowballStemmer.Stemmer(Ptr{Void} @0x00007fcbb253c6c0, "english", "UTF_8")

julia> [stem(e, x.match) for x in eachmatch(r"\w+", "browsing and searching are not equivalent; however, no body cares... surprised?")]
11-element Array{String,1}:
"brows"
"and"
"search"
"are"
"not"
"equival"
"howev"
"no"
"bodi"
"care"
"surpris"
```

# Advanced manipulation of text
This package only provides stemming facilities. More advanced functionality can be found in `TextAnalysis.jl` or `TextModel.jl`.