{"id":50720085,"url":"https://github.com/phrozen/yake","last_synced_at":"2026-06-09T23:01:41.662Z","repository":{"id":361968045,"uuid":"1255677609","full_name":"phrozen/yake","owner":"phrozen","description":null,"archived":false,"fork":false,"pushed_at":"2026-06-02T01:57:32.000Z","size":88,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-02T03:23:48.087Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/phrozen.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-01T04:29:59.000Z","updated_at":"2026-06-02T01:57:35.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/phrozen/yake","commit_stats":null,"previous_names":["phrozen/yake"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/phrozen/yake","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/phrozen%2Fyake","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/phrozen%2Fyake/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/phrozen%2Fyake/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/phrozen%2Fyake/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/phrozen","download_url":"https://codeload.github.com/phrozen/yake/tar.g
z/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/phrozen%2Fyake/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34129072,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-09T02:00:06.510Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-09T23:01:40.819Z","updated_at":"2026-06-09T23:01:41.650Z","avatar_url":"https://github.com/phrozen.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# YAKE — Yet Another Keyword Extractor for Go\n\nA zero-dependency Go implementation of [YAKE](https://github.com/LIAAD/yake), an unsupervised, lightweight keyword extraction algorithm. YAKE selects the most relevant keywords from a single document using only statistical features — no external corpora, no training data, and no dictionary lookups beyond a stopword list.\n\n## Installation\n\n```sh\ngo get github.com/phrozen/yake\n```\n\n## Quick Start\n\n```go\npackage main\n\nimport (\n\t\"fmt\"\n\t\"log\"\n\n\t\"github.com/phrozen/yake\"\n)\n\nfunc main() {\n\ttext := \"Google is acquiring data science community Kaggle. 
\" +\n\t\t\"Sources tell us that Google is acquiring Kaggle, \" +\n\t\t\"a platform that hosts data science and machine learning competitions.\"\n\n\ty, err := yake.New(yake.DefaultConfig())\n\tif err != nil {\n\t\tlog.Fatal(err)\n\t}\n\tkeywords := y.Extract(text, 10)\n\n\tfor _, kw := range keywords {\n\t\tfmt.Printf(\"%-30s  %.4f\\n\", kw.Keyword, kw.Score)\n\t}\n\t// Output (sorted by score, lower = better):\n\t// google                         0.0251\n\t// kaggle                         0.0273\n\t// ceo anthony goldbloom          0.0483\n\t// data science                   0.0550\n\t// acquiring data science         0.0603\n\t// ...\n}\n```\n\nLower scores indicate more important keywords.\n\n## How It Works\n\nYAKE extracts keywords by computing five per-term features and combining them into an **H score**:\n\n| Feature | What it captures | Rationale |\n|---|---|---|\n| **Casing** | Proportion of acronym/uppercase occurrences | Uppercase terms (NASA, CEO) tend to be more relevant |\n| **Position** | Median sentence position of the term | Important keywords appear earlier in a document |\n| **Frequency** | Term frequency normalized by corpus statistics | Rare terms are penalized; very common terms are balanced |\n| **Relatedness** | Dispersion of neighboring words | Terms with varied neighbors are more meaningful than fixed collocations |\n| **Spread** | Proportion of sentences containing the term | Well-distributed terms are more important than clustered ones |\n\nThese are combined in a co-occurrence-graph–based formula to produce the final score. 
The algorithm supports n-grams up to a configurable length and deduplicates near-duplicate phrases using Levenshtein similarity.\n\n## Configuration\n\n```go\ncfg := yake.DefaultConfig()\ncfg.Language = \"pt\"               // two-letter language code for built-in stopwords\ncfg.Ngrams = 2                    // max words per keyphrase (default: 3)\ncfg.WindowSize = 2                // co-occurrence window (default: 1)\ncfg.RemoveDuplicates = true       // filter near-duplicates (default: true)\ncfg.DeduplicationThreshold = 0.8  // similarity threshold (default: 0.9)\ncfg.MinimumChars = 4              // min characters per candidate (default: 3)\n\ny, err := yake.New(cfg)           // validated on construction\n```\n\n### Custom Stopwords\n\n```go\nsw := yake.StopWordsFromList([]string{\"fig\", \"table\", \"figure\"})\ncfg := yake.DefaultConfig()\ncfg.StopWords = sw\n```\n\n### Supported Languages\n\nBuilt-in stopword lists are embedded for 34 languages:\n\n`ar bg br cz da de el en es et fa fi fr hi hr hu hy id it ja lt lv nl no pl pt ro ru sk sl sv tr uk zh`\n\nUse `yake.PredefinedStopWords(\"xx\")` to load a list by its two-letter language code. It returns `nil` for unsupported codes.\n\n## API\n\n```go\nfunc DefaultConfig() Config\nfunc New(config Config) (*Yake, error)\nfunc (y *Yake) Extract(text string, n int) []ResultItem\n```\n\n`New` validates the configuration and returns an error for invalid values (zero n-grams, nil punctuation, out-of-range thresholds, etc.).\n\n`ResultItem` carries the raw surface form, the normalized keyword, and the score:\n\n```go\ntype ResultItem struct {\n    Raw     string  // original casing, e.g. \"Machine Learning\"\n    Keyword string  // lowercased, normalized, e.g. \"machine learning\"\n    Score   float64 // lower is better\n}\n```\n\n## Validation\n\nThis implementation is cross-validated against both the [original Python implementation](https://github.com/LIAAD/yake) and the [Rust port](https://github.com/quesurifn/yake-rust). 
The test suite includes 35 cross-validation tests covering:\n\n- Inline English tests (singular, plural, hyphenated, multi-ngram, stopword weighting, deduplication)\n- File-based English tests matching the LIAAD reference samples (Google/Kaggle, Gitter, Genius, Fukushima, Global Crossing)\n- Multilingual tests in 14 languages (including Arabic, German, Dutch, Finnish, French, Italian, Polish, Portuguese, Spanish, Turkish)\n\nAll tests use byte-identical input files and stopword lists. Scores match the Python and Rust reference implementations, apart from minor variance in non-English edge cases caused by tokenizer differences; keyword identity and ranking are consistent.\n\n## Benchmarks\n\nRun `go test -bench=. -benchmem` to measure throughput on your hardware. Reference figures from an Apple M2 Max:\n\n| Benchmark | Time | Memory | Allocs |\n|---|---|---|---|\n| ExtractShort (~20 words) | ~61 µs | ~44 KB | 457 |\n| ExtractMedium (~120 words) | ~199 µs | ~148 KB | 1,638 |\n| Tokenizer | ~8.2 µs | ~4.2 KB | 81 |\n| Sentence Splitter | ~2.4 µs | ~418 B | 11 |\n| Levenshtein | ~430 ns | ~256 B | 2 |\n\n## License\n\nMIT — see the [original paper](https://arxiv.org/abs/2111.07068) for algorithm attribution.\n\n## References\n\n- Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C., Jatowt, A. (2020). [YAKE! Keyword extraction from single documents using multiple local features](https://doi.org/10.1016/j.ins.2019.09.013). *Information Sciences*, 509, 257–289.\n- [LIAAD/yake](https://github.com/LIAAD/yake) — original Python implementation\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fphrozen%2Fyake","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fphrozen%2Fyake","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fphrozen%2Fyake/lists"}