https://github.com/shner-elmo/flashtext2

The fastest FlashText library for Python

Host: GitHub
URL: https://github.com/shner-elmo/flashtext2
Owner: shner-elmo
License: mit
Created: 2023-01-01T00:41:27.000Z (over 3 years ago)
Default Branch: master
Last Pushed: 2024-07-04T14:35:32.000Z (almost 2 years ago)
Last Synced: 2026-02-26T19:46:49.524Z (4 months ago)
Topics: extracting-keywords, flashtext, flashtext2, nlp, pyo3, python, rust, string, text-processing
Language: Python
Homepage:
Size: 1.89 MB
Stars: 26
Watchers: 2
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

  ![PyPi Version](https://badge.fury.io/py/flashtext2.svg)

  ![Supported Python versions](https://img.shields.io/pypi/pyversions/flashtext2.svg?color=%2334D058)

  ![Downloads](https://static.pepy.tech/badge/flashtext2)

  ![Downloads](https://static.pepy.tech/badge/flashtext2/month)

```sh
pip install flashtext2
```

# flashtext2

`flashtext2` is an optimized version of the `flashtext` library for fast keyword extraction and replacement. It is orders of magnitude faster than regular expressions.

## Key Enhancements in flashtext2

- **Rewritten for Better Performance**: Completely rewritten in Rust, making it approximately 3-10x faster than the original version.

- **Unicode Standard Annex #29**: Instead of relying on an arbitrary regex pattern such as `[A-Za-z0-9_]+`, which **flashtext** [uses](https://github.com/vi3k6i5/flashtext/blob/b316c7e9e54b6b4d078462b302a83db85f884a94/flashtext/keyword.py#L13), **flashtext2** uses [Unicode Standard Annex #29](https://www.unicode.org/reports/tr29/) to split strings into tokens. This ensures compatibility with all languages, not just Latin-based ones.

- **Unicode Case Folding**: Instead of converting strings to lowercase for case-insensitive matches, it uses [Unicode case folding](https://www.w3.org/TR/charmod-norm/#definitionCaseFolding), ensuring accurate normalization of characters according to the Unicode standard.

- **Fully Type-Hinted API**: The entire API is fully type-hinted, providing better code clarity and improved development experience.
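To see why an ASCII-only token pattern falls short, the sketch below runs the `[A-Za-z0-9_]+` pattern quoted above against text containing non-ASCII letters, using only Python's standard `re` module. The `\w+` pattern is used here purely as a rough Unicode-aware stand-in; it is an assumption for illustration and is not the UAX #29 word segmentation that `flashtext2` performs.

```python
import re

# "Maße café", written with escapes so the code points are unambiguous
text = "Ma\u00dfe caf\u00e9"

# ASCII-only pattern: non-ASCII letters act as separators and words get cut up
ascii_tokens = re.findall(r"[A-Za-z0-9_]+", text)

# Unicode-aware \w+ (illustrative stand-in only, NOT UAX #29 segmentation)
unicode_tokens = re.findall(r"\w+", text)

print(ascii_tokens)    # ['Ma', 'e', 'caf']
print(unicode_tokens)  # ['Maße', 'café']
```

With the ASCII-only pattern, a keyword such as `Maße` can never match as a whole token, which is the kind of failure Unicode-aware segmentation avoids.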

## Usage


### Keyword Extraction

```python
from flashtext2 import KeywordProcessor

kp = KeywordProcessor(case_sensitive=False)
kp.add_keyword('Python')
kp.add_keyword('flashtext')
kp.add_keyword('program')

text = "I love programming in Python and using the flashtext library."

keywords_found = kp.extract_keywords(text)
print(keywords_found)
# Output: ['Python', 'flashtext']

keywords_found = kp.extract_keywords_with_span(text)
print(keywords_found)
# Output: [('Python', 22, 28), ('flashtext', 43, 52)]
```

### Keyword Replacement

```python
from flashtext2 import KeywordProcessor

kp = KeywordProcessor(case_sensitive=False)
kp.add_keyword('Java', 'Python')
kp.add_keyword('regex', 'flashtext')

text = "I love programming in Java and using the regex library."

new_text = kp.replace_keywords(text)
print(new_text)
# Output: "I love programming in Python and using the flashtext library."
```

### Case Sensitivity

```python
from flashtext2 import KeywordProcessor

text = 'abc aBc ABC'

kp = KeywordProcessor(case_sensitive=True)
kp.add_keyword('aBc')
print(kp.extract_keywords(text))
# Output: ['aBc']

kp = KeywordProcessor(case_sensitive=False)
kp.add_keyword('aBc')
print(kp.extract_keywords(text))
# Output: ['aBc', 'aBc', 'aBc']
```

### Other Examples

Overlapping keywords (returns the longest sequence)

```python
from flashtext2 import KeywordProcessor

kp = KeywordProcessor(case_sensitive=True)
kp.add_keyword('machine')
kp.add_keyword('machine learning')

text = "machine learning is a subset of artificial intelligence"

print(kp.extract_keywords(text))
# Output: ['machine learning']
```
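The longest-match behaviour above can be illustrated with a small token trie in pure Python. This is only a sketch of the general technique: the names `build_trie` and `extract_longest` are hypothetical, tokens are split on whitespace for simplicity, and none of this is the actual Rust implementation inside `flashtext2`.

```python
_END = object()  # sentinel key marking the end of a keyword in the trie


def build_trie(keywords):
    """Build a nested-dict trie keyed by whitespace-separated tokens."""
    trie = {}
    for keyword in keywords:
        node = trie
        for token in keyword.split():
            node = node.setdefault(token, {})
        node[_END] = keyword
    return trie


def extract_longest(trie, text):
    """Scan left to right, keeping the longest keyword starting at each token."""
    tokens = text.split()
    found = []
    i = 0
    while i < len(tokens):
        node, j = trie, i
        match, match_end = None, i
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if _END in node:  # remember the longest complete keyword so far
                match, match_end = node[_END], j
        if match is None:
            i += 1
        else:
            found.append(match)
            i = match_end  # continue after the matched keyword
    return found


trie = build_trie(["machine", "machine learning"])
print(extract_longest(trie, "machine learning is a subset of artificial intelligence"))
# ['machine learning']
print(extract_longest(trie, "machine vision"))
# ['machine']
```

Because the scan keeps walking the trie after the first complete keyword, `machine` is only reported when the longer `machine learning` does not follow.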

Case folding

```python
from flashtext2 import KeywordProcessor

kp = KeywordProcessor(case_sensitive=False)
kp.add_keywords_from_iter(["flour", "Maße", "ᾲ στο διάολο"])

text = "ﬂour, MASSE, ὰι στο διάολο"

print(kp.extract_keywords(text))
# Output: ['flour', 'Maße', 'ᾲ στο διάολο']
```
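The difference between lowercasing and case folding can be reproduced with Python's built-in `str.lower()` and `str.casefold()`. The snippet below is a standard-library illustration of the idea only; it does not call `flashtext2`, and the variable names are chosen for this example.

```python
keyword = "Ma\u00dfe"     # "Maße"
text_word = "MASSE"

# Lowercasing keeps "ß" as-is, so the two words do not compare equal
lower_equal = keyword.lower() == text_word.lower()
# Case folding maps "ß" to "ss", so they do
folded_equal = keyword.casefold() == text_word.casefold()

ligature = "\ufb02our"    # "ﬂour", starting with the single-character "ﬂ" ligature
ligature_lower_equal = ligature.lower() == "flour"
ligature_folded_equal = ligature.casefold() == "flour"

print(lower_equal, folded_equal)                    # False True
print(ligature_lower_equal, ligature_folded_equal)  # False True
```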

### Performance


Compared to the original `flashtext`, extracting keywords is usually 2.5-3x faster, and replacing them is about 10x faster.

There is still room to optimize the code and improve performance.

You can find the benchmarks [here](https://github.com/shner-elmo/FlashText2.0/tree/master/benchmarks).

![Image](benchmarks/extract-keywords.png)

![Image](benchmarks/replace-keywords.png)

In these benchmarks, words average 6 characters and each sentence contains 10k words, so each input is about 60k characters long.
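As a rough illustration of that input size, the sketch below builds a synthetic sentence of 10,000 six-letter words with the standard library. The helper `make_sentence` and its parameters are hypothetical and written for this example; the repository's actual benchmark scripts may generate their data differently.

```python
import random
import string


def make_sentence(n_words=10_000, word_len=6, seed=0):
    """Build a space-separated sentence of random lowercase words."""
    rng = random.Random(seed)
    words = (
        "".join(rng.choices(string.ascii_lowercase, k=word_len))
        for _ in range(n_words)
    )
    return " ".join(words)


sentence = make_sentence()
letters = sum(len(word) for word in sentence.split())

print(letters)        # 60000 letters (10,000 words x 6 characters)
print(len(sentence))  # 69999 once the 9,999 separating spaces are counted
```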

### TODO


* Add multiple ways of normalizing strings: simple case folding, full case folding, and locale-aware folding

* Remove all clones in the source code

Credit to [Vikash Singh](https://github.com/vi3k6i5/), the author of the original `flashtext` package.
