https://github.com/shner-elmo/flashtext2
The fastest FlashText library for Python
extracting-keywords flashtext flashtext2 nlp pyo3 python rust string text-processing
- Host: GitHub
- URL: https://github.com/shner-elmo/flashtext2
- Owner: shner-elmo
- License: mit
- Created: 2023-01-01T00:41:27.000Z (over 3 years ago)
- Default Branch: master
- Last Pushed: 2024-07-04T14:35:32.000Z (almost 2 years ago)
- Last Synced: 2026-02-26T19:46:49.524Z (4 months ago)
- Topics: extracting-keywords, flashtext, flashtext2, nlp, pyo3, python, rust, string, text-processing
- Language: Python
- Size: 1.89 MB
- Stars: 26
- Watchers: 2
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE




```sh
pip install flashtext2
```
# flashtext2
`flashtext2` is an optimized version of the `flashtext` library for fast keyword extraction and replacement.
It is orders of magnitude faster than regular expressions.
## Key Enhancements in flashtext2
- **Rewritten for Better Performance**: Completely rewritten in Rust, making it approximately 3-10x faster than the original version.
- **Unicode Standard Annex #29**: Instead of relying on an arbitrary regex pattern such as `[A-Za-z0-9_]+`, as **flashtext**
[does](https://github.com/vi3k6i5/flashtext/blob/b316c7e9e54b6b4d078462b302a83db85f884a94/flashtext/keyword.py#L13),
**flashtext2** uses the [Unicode Standard Annex #29](https://www.unicode.org/reports/tr29/) to split strings into tokens.
This ensures compatibility with all languages, not just Latin-based ones.
- **Unicode Case Folding**: Instead of converting strings to lowercase for case-insensitive matches, it uses
[Unicode case folding](https://www.w3.org/TR/charmod-norm/#definitionCaseFolding), ensuring accurate normalization
of characters according to the Unicode standard.
- **Fully Type-Hinted API**: The entire API is fully type-hinted, providing better code clarity and improved development experience.
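
The tokenization and case-folding points above can be illustrated with the Python standard library alone. This is a minimal sketch, not `flashtext2` code: `str.casefold()` stands in for Unicode case folding, and the `ASCII_WORD` name is introduced here only for illustration.

```python
import re

# Lowercasing leaves 'ß' untouched, so the two spellings do not compare equal.
lower_equal = "Maße".lower() == "MASSE".lower()
# Case folding maps 'ß' to 'ss', so the two spellings compare equal.
folded_equal = "Maße".casefold() == "MASSE".casefold()
print(lower_equal, folded_equal)  # False True

# The ASCII-only pattern used by the original flashtext splits accented words
# and finds nothing at all in non-Latin text.
ASCII_WORD = re.compile(r"[A-Za-z0-9_]+")
accented = "caf\u00e9 na\u00efve"            # "café naïve" with precomposed characters
print(ASCII_WORD.findall(accented))          # ['caf', 'na', 've']
print(ASCII_WORD.findall("στο διάολο"))      # []
```

Both effects are why an ASCII word pattern combined with lowercasing can miss keywords in text that is not plain English.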
## Usage
### Keyword Extraction
```python
from flashtext2 import KeywordProcessor
kp = KeywordProcessor(case_sensitive=False)
kp.add_keyword('Python')
kp.add_keyword('flashtext')
# 'program' is not reported below: it does not match inside 'programming'
kp.add_keyword('program')
text = "I love programming in Python and using the flashtext library."
keywords_found = kp.extract_keywords(text)
print(keywords_found)
# Output: ['Python', 'flashtext']
keywords_found = kp.extract_keywords_with_span(text)
print(keywords_found)
# Output: [('Python', 22, 28), ('flashtext', 43, 52)]
```
### Keyword Replacement
```python
from flashtext2 import KeywordProcessor
kp = KeywordProcessor(case_sensitive=False)
kp.add_keyword('Java', 'Python')
kp.add_keyword('regex', 'flashtext')
text = "I love programming in Java and using the regex library."
new_text = kp.replace_keywords(text)
print(new_text)
# Output: "I love programming in Python and using the flashtext library."
```
### Case Sensitivity
```python
from flashtext2 import KeywordProcessor
text = 'abc aBc ABC'
kp = KeywordProcessor(case_sensitive=True)
kp.add_keyword('aBc')
print(kp.extract_keywords(text))
# Output: ['aBc']
kp = KeywordProcessor(case_sensitive=False)
kp.add_keyword('aBc')
print(kp.extract_keywords(text))
# Output: ['aBc', 'aBc', 'aBc']
```
### Other Examples
Overlapping keywords (returns the longest sequence)
```python
from flashtext2 import KeywordProcessor
kp = KeywordProcessor(case_sensitive=True)
kp.add_keyword('machine')
kp.add_keyword('machine learning')
text = "machine learning is a subset of artificial intelligence"
print(kp.extract_keywords(text))
# Output: ['machine learning']
```
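
To show how longest-match extraction over tokens can work in principle, here is a small pure-Python sketch. It is an assumption-laden illustration, not the `flashtext2` implementation: the helpers `tokenize`, `build_trie`, and `extract_longest` are hypothetical names, and the tokenizer is a simple regex rather than Unicode Standard Annex #29 segmentation.

```python
import re

# Hypothetical helpers for illustration only; not part of the flashtext2 API.
_END = object()  # sentinel key marking "a keyword ends at this node"


def tokenize(text):
    # Simplified tokenizer: runs of word characters, or single non-word characters.
    return re.findall(r"\w+|\W", text)


def build_trie(keywords):
    """Build a trie whose edges are tokens rather than characters."""
    root = {}
    for keyword in keywords:
        node = root
        for token in tokenize(keyword):
            node = node.setdefault(token, {})
        node[_END] = keyword
    return root


def extract_longest(trie, text):
    """Scan the text once, preferring the longest keyword at each position."""
    tokens = tokenize(text)
    found = []
    i = 0
    while i < len(tokens):
        node, j = trie, i
        match, match_end = None, i
        # Walk as far as the trie allows, remembering the last complete keyword.
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if _END in node:
                match, match_end = node[_END], j
        if match is not None:
            found.append(match)
            i = match_end  # resume after the matched keyword
        else:
            i += 1
    return found


trie = build_trie(["machine", "machine learning", "program"])
print(extract_longest(trie, "machine learning is a subset of artificial intelligence"))
# ['machine learning']
print(extract_longest(trie, "I love programming"))
# []
```

Because the trie is keyed by whole tokens, `program` never matches inside `programming`, and the walk keeps extending past `machine` until no longer keyword is possible.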
Case folding
```python
from flashtext2 import KeywordProcessor
kp = KeywordProcessor(case_sensitive=False)
kp.add_keywords_from_iter(["flour", "Maße", "ᾲ στο διάολο"])
text = "flour, MASSE, ὰι στο διάολο"
print(kp.extract_keywords(text))
# Output: ['flour', 'Maße', 'ᾲ στο διάολο']
```
### Performance
Compared to the original `flashtext`, extracting keywords is usually 2.5-3x faster, and replacing them is about 10x faster.
There is still room to optimize the code and improve performance further.
You can find the benchmarks [here](https://github.com/shner-elmo/FlashText2.0/tree/master/benchmarks).


In the benchmarks, words average 6 characters and each sentence contains 10k words, so each input is roughly 60k characters long.
### TODO
* Add multiple ways of normalizing strings: simple case folding, full case folding, and locale-aware folding
* Remove all clones in the source code

Credit to [Vikash Singh](https://github.com/vi3k6i5/), the author of the original `flashtext` package.