Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mesejo/trex
Efficient string matching with regular expressions
- Host: GitHub
- URL: https://github.com/mesejo/trex
- Owner: mesejo
- License: mit
- Created: 2020-04-04T15:30:43.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2024-11-15T04:50:12.000Z (2 months ago)
- Last Synced: 2024-11-15T05:27:02.443Z (2 months ago)
- Topics: keyword-extraction, nlp, pandas, python, python-library, regex, regular-expression, search-in-text, string-matching, text-mining, trie
- Language: Python
- Homepage: https://trrex.readthedocs.io/en/latest/
- Size: 241 KB
- Stars: 138
- Watchers: 3
- Forks: 6
- Open Issues: 10
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
# Efficient string matching with regular expressions
This package includes a pure Python function that enables you to represent a set of strings as a regular expression.
With this regular expression, you can perform various operations, such as replacing, extracting and matching keywords.
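
To picture how a set of strings can collapse into one regular expression, here is a minimal sketch of the trie-to-regex idea. It is not trrex's actual implementation: the helper names `make_trie`, `trie_to_regex` and `make_pattern` are hypothetical, and the pattern trrex generates may differ in detail.

```python
import re


def make_trie(words):
    """Build a nested-dict trie; the empty-string key marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for char in word:
            node = node.setdefault(char, {})
        node[""] = {}  # end-of-word marker
    return root


def trie_to_regex(node):
    """Recursively turn a trie node into a regex fragment."""
    is_end = "" in node
    # One branch per child character, skipping the end-of-word marker.
    branches = [
        re.escape(char) + trie_to_regex(node[char])
        for char in sorted(key for key in node if key)
    ]
    if not branches:
        return ""
    if len(branches) == 1 and not is_end:
        return branches[0]
    body = "(?:" + "|".join(branches) + ")"
    # If a word ends here, the remaining branches are optional.
    return body + "?" if is_end else body


def make_pattern(words):
    """Wrap the trie regex in word boundaries (an assumption of this sketch)."""
    return r"\b(?:" + trie_to_regex(make_trie(words)) + r")\b"


pattern = make_pattern(["baby", "bat", "bad"])
text = "The baby was scared by the bad bat."
print(pattern)                       # \b(?:ba(?:by|d|t))\b
print(re.findall(pattern, text))     # ['baby', 'bad', 'bat']
print(re.sub(pattern, "***", text))  # The *** was scared by the *** ***.
```

The shared prefix `ba` appears only once in the pattern, so the regex engine does not have to retry each keyword from scratch at every position.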
The name of the package comes from the internal trie used to build the regular expression (**TR**ie to **RE**ge**X**).

## Install trrex
Install it with pip:
```bash
pip install trrex
```

## Usage
```python
import trrex as tx
import re

# Build one regex that matches any of the given keywords
pattern = tx.make(['baby', 'bat', 'bad'])
hits = re.findall(pattern, 'The baby was scared by the bad bat.')
# hits = ['baby', 'bad', 'bat'] (in order of appearance in the text)
```

### pandas
```python
import trrex as tx
import pandas as pd

frame = pd.DataFrame({
"txt": ["The baby", "The bat"]
})
pattern = tx.make(['baby', 'bat', 'bad'], prefix=r"\b(", suffix=r")\b") # need to specify capturing groups
frame["match"] = frame["txt"].str.extract(pattern)
hits = frame["match"].tolist()
print(hits)
# hits = ['baby', 'bat']
```

## Why use trrex?
- trrex builds a *better* regex pattern than a simple regex union, so searching (and replacing) strings is
about 300 times faster than with a regex union pattern, and about 2.5 times faster than the FlashText algorithm. See the
performance comparison below:

  ![Performance comparison](https://github.com/mesejo/trex/blob/images/find_comparison.png?raw=true)
- Plays well with others: it integrates easily with pandas, spaCy and any other tool that accepts regex patterns. See the [documentation](https://trrex.readthedocs.io/en/latest/integration.html)
for examples.
- Pure Python, no other dependencies

## Issues
If you have any issues with this repository, please don't hesitate to [raise them](https://github.com/mesejo/trex/issues/new).
It is actively maintained, and we will do our best to help you.

## Acknowledgments
This project is based on the following resources:
- [Speed up regex](https://stackoverflow.com/questions/42742810/speed-up-millions-of-regex-replacements-in-python-3)
- [Triegex](https://github.com/ZhukovAlexander/triegex)

## Liked the work?
If you've found this repository helpful, why not give it a star? It's an easy way to show your appreciation and support for the project.
Plus, it helps others discover it too!