{"id":50520308,"url":"https://github.com/tggo/steblo","last_synced_at":"2026-06-03T03:31:14.813Z","repository":{"id":360690768,"uuid":"1251247198","full_name":"tggo/steblo","owner":"tggo","description":"Zero-dependency rule-based Ukrainian stemmer in pure Go","archived":false,"fork":false,"pushed_at":"2026-05-27T12:43:01.000Z","size":214,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-27T14:19:00.072Z","etag":null,"topics":["bleve","golang","nlp","stemmer","stemming","text-processing","ukrainian","ukrainian-language"],"latest_commit_sha":null,"homepage":"https://tggo.github.io/steblo/","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tggo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-27T11:44:04.000Z","updated_at":"2026-05-27T12:26:21.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/tggo/steblo","commit_stats":null,"previous_names":["tggo/steblo"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/tggo/steblo","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tggo%2Fsteblo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tggo%2Fsteblo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tggo%2Fsteblo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tggo%2Fsteblo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tggo","download_url":"https://codeload.github.com/tggo/steblo/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tggo%2Fsteblo/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33847264,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-03T02:00:06.370Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bleve","golang","nlp","stemmer","stemming","text-processing","ukrainian","ukrainian-language"],"created_at":"2026-06-03T03:31:12.263Z","updated_at":"2026-06-03T03:31:14.807Z","avatar_url":"https://github.com/tggo.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# steblo\n\n[![CI](https://github.com/tggo/steblo/actions/workflows/ci.yml/badge.svg)](https://github.com/tggo/steblo/actions/workflows/ci.yml)\n[![Go Reference](https://pkg.go.dev/badge/github.com/tggo/steblo.svg)](https://pkg.go.dev/github.com/tggo/steblo)\n\nZero-dependency, rule-based stemmer for the Ukrainian language in pure Go.\n\n- **Zero runtime dependencies** — no cgo, no models, no regex in the hot path.\n- **Concurrency-safe** — stateless, no package-level mutable state.\n- **Allocation-free** hot path via `StemRunes` (~127 ns/word; ~4.9M words/sec).\n- Optional Bleve analyzer in the decoupled [`bleveuk`](./bleveuk) sub-package.\n\nThe canonical algorithm and every design decision live in\n[`docs/algorithm.md`](./docs/algorithm.md) — that spec, not the code, is the\nsource of truth.\n\n## Install\n\n```bash\ngo get github.com/tggo/steblo\ngo install github.com/tggo/steblo/cmd/stemctl@latest   # CLI\n```\n\n## Usage\n\n```go\npackage main\n\nimport (\n\t\"fmt\"\n\n\t\"github.com/tggo/steblo\"\n)\n\nfunc main() {\n\tfmt.Println(steblo.Stem(\"випробування\")) // випробуван\n\n\tfmt.Println(steblo.StemWith(\"Чепинога\", steblo.Options{Strict: true})) // чепино\n}\n```\n\nCLI:\n\n```bash\necho \"слова українські красиві\" | stemctl     # слов українськ красив\nstemctl --strict --json \u003c words.txt           # {\"слова\":\"слов\", ...}\nstemctl --bench \u003c bench/words.txt             # 10000 words, 2.2ms total, 219 ns/word, 1.0 allocs/word\n```\n\n## Options\n\n`StemWith(word, Options{…})`. Defaults shown are those used by `Stem`.\n\n| Option | Default | Effect |\n|---|---|---|\n| `Strict` | `false` | Apply consonant-alternation cleanup after stripping (e.g. `Чепинозі → чепино`). Over-strips by design. |\n| `Lowercase` | `true` | Pre-lowercase via `unicode.ToLower`. |\n| `NormalizeApostr` | `true` | Unify and delete apostrophe variants before stemming (`об'єднання → обєднан`). |\n| `NormalizeYo` | `false` | Map `ё→е`, `ъ→ї` for mixed-Cyrillic corpora. |\n\nDecomposed (NFD) Cyrillic — combining breve (`й`) and diaeresis (`ї`, `ё`) — is\nalways recomposed, so input from NFD sources (e.g. macOS filenames) stems\nidentically to NFC. This is unconditional and never alters NFC text.\n\nAPI surface: `Stem`, `StemWith`, `StemRunes`, `StemRunesWith`, `Options`,\n`DefaultOptions`. `StemRunes*` is the allocation-conscious form; its result may\nalias the input — clone before mutating.\n\n## Performance\n\n| | ns/word | allocs/word |\n|---|---:|---:|\n| `Stem` (string API) | ~204 | 1 |\n| `StemRunes` (no normalisation copy) | ~127 | **0** |\n\nMeasured on Apple M4 Max, Go 1.25, over `bench/words.txt`. Full methodology and\nnumbers in [`docs/bench.md`](./docs/bench.md). Reproduce with `make bench`.\n\n## Bleve integration\n\nThe [`bleveuk`](./bleveuk) sub-package registers a Bleve token filter\n(`uk_stem`), a Ukrainian stopword filter (`stop_uk`), and an analyzer (`uk`)\ncomposed of `unicode → lowercase → stop_uk → uk_stem`. Import it for its\nside effects, then reference the analyzer by name in your field mapping:\n\n```go\npackage main\n\nimport (\n\t\"fmt\"\n\n\t\"github.com/blevesearch/bleve/v2\"\n\t_ \"github.com/tggo/steblo/bleveuk\" // registers the \"uk\" analyzer\n)\n\nfunc main() {\n\t// Map a text field to the Ukrainian analyzer.\n\tfm := bleve.NewTextFieldMapping()\n\tfm.Analyzer = \"uk\"\n\tdm := bleve.NewDocumentMapping()\n\tdm.AddFieldMappingsAt(\"text\", fm)\n\tim := bleve.NewIndexMapping()\n\tim.DefaultMapping = dm\n\n\tidx, _ := bleve.NewMemOnly(im)\n\tidx.Index(\"d1\", map[string]string{\"text\": \"державні випробування обладнання\"})\n\tidx.Index(\"d2\", map[string]string{\"text\": \"погашення кредиту достроково\"})\n\n\t// \"випробувань\" and the indexed \"випробування\" both stem to \"випробуван\",\n\t// so the query matches d1 even though the surface forms differ.\n\tq := bleve.NewMatchQuery(\"випробувань\")\n\tq.SetField(\"text\")\n\tres, _ := idx.Search(bleve.NewSearchRequest(q))\n\tfmt.Println(res.Hits[0].ID) // d1\n}\n```\n\n`bleveuk` is a **separate Go module** (`github.com/tggo/steblo/bleveuk`) so that\nBleve's dependency tree never touches the core: `go get github.com/tggo/steblo`\npulls in nothing. Install the integration only if you want it:\n\n```bash\ngo get github.com/tggo/steblo/bleveuk\n```\n\n## Caveats\n\nsteblo is a **rule-based truncation stemmer, not a lemmatiser**. It deliberately\nover- and under-stems. It does no dictionary lookup, no morphological analysis,\nand no mixed-script (Ukrainian/Russian) disambiguation — the caller must detect\nscript. If you need lemmas, POS, or full paradigms, use a morphological analyser.\n\n`Stem` is **not idempotent**: `Stem(Stem(x))` may differ from `Stem(x)`, because\neach call strips one suffix per phase. See\n[`docs/algorithm.md`](./docs/algorithm.md) §9a.\n\n## Development\n\n```bash\nmake test      # unit + corpus differential tests\nmake cover     # coverage (core package \u003e 90%)\nmake bench     # benchmarks\nmake fuzz      # fuzz targets\nmake lint      # go vet + staticcheck + golangci-lint if installed\n```\n\n## Not in the public repo\n\nKept local only (gitignored):\n\n- `CLAUDE.md` — internal instructions, full of external links.\n- `scripts/build_corpus/` — the corpus generator references external repos by\n  name; the generated corpus ships, the generator doesn't.\n\n## Sources\n\nAlgorithm lineage and reference implementations:\n\n- [drupal ukstemmer](https://www.drupal.org/project/ukstemmer)\n- [Amice13/ukr_stemmer](https://github.com/Amice13/ukr_stemmer) · [ukrstemmer-node](https://github.com/Amice13/ukrstemmer-node)\n- [Desklop/Uk_Stemmer](https://github.com/Desklop/Uk_Stemmer)\n- [titarenko/ukrstemmer](https://github.com/titarenko/ukrstemmer)\n- corpus seeded from [brown-uk/corpus](https://github.com/brown-uk/corpus)\n\n## License\n\nMIT — see [`LICENSE`](./LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftggo%2Fsteblo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftggo%2Fsteblo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftggo%2Fsteblo/lists"}