https://github.com/tggo/steblo
Zero-dependency rule-based Ukrainian stemmer in pure Go
https://github.com/tggo/steblo
bleve golang nlp stemmer stemming text-processing ukrainian ukrainian-language
Last synced: 11 days ago
JSON representation
Zero-dependency rule-based Ukrainian stemmer in pure Go
- Host: GitHub
- URL: https://github.com/tggo/steblo
- Owner: tggo
- License: mit
- Created: 2026-05-27T11:44:04.000Z (18 days ago)
- Default Branch: main
- Last Pushed: 2026-05-27T12:43:01.000Z (18 days ago)
- Last Synced: 2026-05-27T14:19:00.072Z (18 days ago)
- Topics: bleve, golang, nlp, stemmer, stemming, text-processing, ukrainian, ukrainian-language
- Language: Go
- Homepage: https://tggo.github.io/steblo/
- Size: 209 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# steblo
[](https://github.com/tggo/steblo/actions/workflows/ci.yml)
[](https://pkg.go.dev/github.com/tggo/steblo)
Zero-dependency, rule-based stemmer for the Ukrainian language in pure Go.
- **Zero runtime dependencies** — no cgo, no models, no regex in the hot path.
- **Concurrency-safe** — stateless, no package-level mutable state.
- **Allocation-free** hot path via `StemRunes` (~127 ns/word; ~4.9M words/sec).
- Optional Bleve analyzer in the decoupled [`bleveuk`](./bleveuk) sub-package.
The canonical algorithm and every design decision live in
[`docs/algorithm.md`](./docs/algorithm.md) — that spec, not the code, is the
source of truth.
## Install
```bash
go get github.com/tggo/steblo
go install github.com/tggo/steblo/cmd/stemctl@latest # CLI
```
## Usage
```go
package main
import (
"fmt"
"github.com/tggo/steblo"
)
func main() {
fmt.Println(steblo.Stem("випробування")) // випробуван
fmt.Println(steblo.StemWith("Чепинога", steblo.Options{Strict: true})) // чепино
}
```
CLI:
```bash
echo "слова українські красиві" | stemctl # слов українськ красив
stemctl --strict --json < words.txt # {"слова":"слов", ...}
stemctl --bench < bench/words.txt # 10000 words, 2.2ms total, 219 ns/word, 1.0 allocs/word
```
## Options
`StemWith(word, Options{…})`. Defaults shown are those used by `Stem`.
| Option | Default | Effect |
|---|---|---|
| `Strict` | `false` | Apply consonant-alternation cleanup after stripping (e.g. `Чепинозі → чепино`). Over-strips by design. |
| `Lowercase` | `true` | Pre-lowercase via `unicode.ToLower`. |
| `NormalizeApostr` | `true` | Unify and delete apostrophe variants before stemming (`об'єднання → обєднан`). |
| `NormalizeYo` | `false` | Map `ё→е`, `ъ→ї` for mixed-Cyrillic corpora. |
Decomposed (NFD) Cyrillic — combining breve (`й`) and diaeresis (`ї`, `ё`) — is
always recomposed, so input from NFD sources (e.g. macOS filenames) stems
identically to NFC. This is unconditional and never alters NFC text.
API surface: `Stem`, `StemWith`, `StemRunes`, `StemRunesWith`, `Options`,
`DefaultOptions`. `StemRunes*` is the allocation-conscious form; its result may
alias the input — clone before mutating.
## Performance
| | ns/word | allocs/word |
|---|---:|---:|
| `Stem` (string API) | ~204 | 1 |
| `StemRunes` (no normalisation copy) | ~127 | **0** |
Measured on Apple M4 Max, Go 1.25, over `bench/words.txt`. Full methodology and
numbers in [`docs/bench.md`](./docs/bench.md). Reproduce with `make bench`.
## Bleve integration
The [`bleveuk`](./bleveuk) sub-package registers a Bleve token filter
(`uk_stem`), a Ukrainian stopword filter (`stop_uk`), and an analyzer (`uk`)
composed of `unicode → lowercase → stop_uk → uk_stem`. Import it for its
side effects, then reference the analyzer by name in your field mapping:
```go
package main
import (
"fmt"
"github.com/blevesearch/bleve/v2"
_ "github.com/tggo/steblo/bleveuk" // registers the "uk" analyzer
)
func main() {
// Map a text field to the Ukrainian analyzer.
fm := bleve.NewTextFieldMapping()
fm.Analyzer = "uk"
dm := bleve.NewDocumentMapping()
dm.AddFieldMappingsAt("text", fm)
im := bleve.NewIndexMapping()
im.DefaultMapping = dm
idx, _ := bleve.NewMemOnly(im)
idx.Index("d1", map[string]string{"text": "державні випробування обладнання"})
idx.Index("d2", map[string]string{"text": "погашення кредиту достроково"})
// "випробувань" and the indexed "випробування" both stem to "випробуван",
// so the query matches d1 even though the surface forms differ.
q := bleve.NewMatchQuery("випробувань")
q.SetField("text")
res, _ := idx.Search(bleve.NewSearchRequest(q))
fmt.Println(res.Hits[0].ID) // d1
}
```
`bleveuk` is a **separate Go module** (`github.com/tggo/steblo/bleveuk`) so that
Bleve's dependency tree never touches the core: `go get github.com/tggo/steblo`
pulls in nothing. Install the integration only if you want it:
```bash
go get github.com/tggo/steblo/bleveuk
```
## Caveats
steblo is a **rule-based truncation stemmer, not a lemmatiser**. It deliberately
over- and under-stems. It does no dictionary lookup, no morphological analysis,
and no mixed-script (Ukrainian/Russian) disambiguation — the caller must detect
script. If you need lemmas, POS, or full paradigms, use a morphological analyser.
`Stem` is **not idempotent**: `Stem(Stem(x))` may differ from `Stem(x)`, because
each call strips one suffix per phase. See
[`docs/algorithm.md`](./docs/algorithm.md) §9a.
## Development
```bash
make test # unit + corpus differential tests
make cover # coverage (core package > 90%)
make bench # benchmarks
make fuzz # fuzz targets
make lint # go vet + staticcheck + golangci-lint if installed
```
## Not in the public repo
Kept local only (gitignored):
- `CLAUDE.md` — internal instructions, full of external links.
- `scripts/build_corpus/` — the corpus generator references external repos by
name; the generated corpus ships, the generator doesn't.
## Sources
Algorithm lineage and reference implementations:
- [drupal ukstemmer](https://www.drupal.org/project/ukstemmer)
- [Amice13/ukr_stemmer](https://github.com/Amice13/ukr_stemmer) · [ukrstemmer-node](https://github.com/Amice13/ukrstemmer-node)
- [Desklop/Uk_Stemmer](https://github.com/Desklop/Uk_Stemmer)
- [titarenko/ukrstemmer](https://github.com/titarenko/ukrstemmer)
- corpus seeded from [brown-uk/corpus](https://github.com/brown-uk/corpus)
## License
MIT — see [`LICENSE`](./LICENSE).