https://github.com/phrozen/yake

Last synced: about 9 hours ago
JSON representation
Host: GitHub
URL: https://github.com/phrozen/yake
Owner: phrozen
Created: 2026-06-01T04:29:59.000Z (9 days ago)
Default Branch: main
Last Pushed: 2026-06-02T01:57:32.000Z (8 days ago)
Last Synced: 2026-06-02T03:23:48.087Z (8 days ago)
Language: Go
Size: 85.9 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project

README

          # YAKE — Yet Another Keyword Extractor for Go

A zero-dependency Go implementation of [YAKE](https://github.com/LIAAD/yake), an unsupervised, lightweight keyword extraction algorithm. YAKE selects the most relevant keywords from a single document using only statistical features — no external corpora, no training data, and no dictionary lookups beyond a stopword list.

## Installation

```sh

go get github.com/phrozen/yake

```

## Quick Start

```go

package main

import (

	"fmt"

	"log"

	"github.com/phrozen/yake"

)

func main() {

	text := "Google is acquiring data science community Kaggle. " +

		"Sources tell us that Google is acquiring Kaggle, " +

		"a platform that hosts data science and machine learning competitions."

	y, err := yake.New(yake.DefaultConfig())

	if err != nil {

		log.Fatal(err)

	}

	keywords := y.Extract(text, 10)

	for _, kw := range keywords {

		fmt.Printf("%-30s  %.4f\n", kw.Keyword, kw.Score)

	}

	// Output (sorted by score, lower = better):

	// google                         0.0251

	// kaggle                         0.0273

	// ceo anthony goldbloom          0.0483

	// data science                   0.0550

	// acquiring data science         0.0603

	// ...

}

```

Lower scores indicate more important keywords.

## How It Works

YAKE extracts keywords by computing five per-term features and combining them into an **H score**:

| Feature | What it captures | Rationale |

|---|---|---|

| **Casing** | Proportion of acronym/uppercase occurrences | Uppercase terms (NASA, CEO) tend to be more relevant |

| **Position** | Median sentence position of the term | Important keywords appear earlier in a document |

| **Frequency** | Term frequency normalized by corpus statistics | Rare terms are penalized; very common terms are balanced |

| **Relatedness** | Dispersion of neighboring words | Terms with varied neighbors are more meaningful than fixed collocations |

| **Spread** | Proportion of sentences containing the term | Well-distributed terms are more important than clustered ones |

These are combined in a co-occurrence-graph–based formula to produce the final score. The algorithm supports n-grams up to a configurable length and deduplicates near-duplicate phrases using Levenshtein similarity.

## Configuration

```go

cfg := yake.DefaultConfig()

cfg.Language = "pt"               // ISO 639-2 code for built-in stopwords

cfg.Ngrams = 2                    // max words per keyphrase (default: 3)

cfg.WindowSize = 2                // co-occurrence window (default: 1)

cfg.RemoveDuplicates = true       // filter near-duplicates (default: true)

cfg.DeduplicationThreshold = 0.8  // similarity threshold (default: 0.9)

cfg.MinimumChars = 4              // min characters per candidate (default: 3)

y, err := yake.New(cfg)           // validated on construction

```

### Custom Stopwords

```go

sw := yake.StopWordsFromList([]string{"fig", "table", "figure"})

cfg := yake.DefaultConfig()

cfg.StopWords = sw

```

### Supported Languages

Built-in stopword lists are embedded for 34 languages:

`ar bg br cz da de el en es et fa fi fr hi hr hu hy id it ja lt lv nl no pl pt ro ru sk sl sv tr uk zh`

Use `yake.PredefinedStopWords("xx")` to load a list by its ISO 639-2 code. Returns `nil` for unsupported codes.

## API

```go

func DefaultConfig() Config

func New(config Config) (*Yake, error)

func (y *Yake) Extract(text string, n int) []ResultItem

```

`New` validates the configuration and returns an error for invalid values (zero n-grams, nil punctuation, out-of-range thresholds, etc.).

`ResultItem` carries the raw surface form, the normalized keyword, and the score:

```go

type ResultItem struct {

    Raw     string  // original casing, e.g. "Machine Learning"

    Keyword string  // lowercased, normalized, e.g. "machine learning"

    Score   float64 // lower is better

}

```

## Validation

This implementation is cross-validated against both the [original Python](https://github.com/LIAAD/yake) and the [Rust port](https://github.com/quesurifn/yake-rust). The test suite includes 35 cross-validation tests covering:

- Inline English tests (singular, plural, hyphenated, multi-ngram, stopword weighting, deduplication)

- File-based English tests matching the LIAAD reference samples (Google/Kaggle, Gitter, Genius, Fukushima, Global Crossing)

- Multilingual tests in 14 languages (Arabic, German, Dutch, Finnish, French, Italian, Polish, Portuguese, Spanish, Turkish)

All tests use byte-identical input files and stopword lists. Scores match the Python and Rust reference implementations. Tokenizer differences account for minor score variance in non-English edge cases; keyword identity and ranking are consistent.

## Benchmarks

Run `go test -bench=. -benchmem` to measure throughput on your hardware (Apple M2 Max reference):

| Benchmark | Time | Memory | Allocs |

|---|---|---|---|

| ExtractShort (~20 words) | ~61 µs | ~44 KB | 457 |

| ExtractMedium (~120 words) | ~199 µs | ~148 KB | 1,638 |

| Tokenizer | ~8.2 µs | ~4.2 KB | 81 |

| Sentence Splitter | ~2.4 µs | ~418 B | 11 |

| Levenshtein | ~430 ns | ~256 B | 2 |

## License

MIT — see the [original paper](https://arxiv.org/abs/2111.07068) for algorithm attribution.

## References

- Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C., Jatowt, A. (2020). [YAKE! Keyword extraction from single documents using multiple local features](https://doi.org/10.1016/j.ins.2019.09.013). *Information Sciences*, 509, 257–289.

- [LIAAD/yake](https://github.com/LIAAD/yake) — original Python implementation
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/phrozen/yake

Awesome Lists containing this project

README