https://github.com/johann-petrak/python-matchtext

Python 3 package for fast text matching and replacing

Host: GitHub
URL: https://github.com/johann-petrak/python-matchtext
Owner: johann-petrak
License: apache-2.0
Created: 2020-05-31T12:46:28.000Z (over 4 years ago)
Default Branch: master
Last Pushed: 2020-08-17T16:06:17.000Z (over 4 years ago)
Last Synced: 2024-08-08T18:35:56.114Z (5 months ago)
Language: Python
Homepage:
Size: 73.2 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.rst
- License: LICENSE

README

# Python matchtext

[![PyPi version](https://img.shields.io/pypi/v/matchtext.svg)](https://pypi.python.org/pypi/matchtext/)

[![Python compatibility](https://img.shields.io/pypi/pyversions/matchtext.svg)](https://pypi.python.org/pypi/matchtext/)

Python 3 package for fast text matching and replacing.

This library implements two fast approaches for matching keywords/gazetteer entries:

* TokenMatcher: keywords/gazetteer entries are sequences of tokens, optionally associated with some data, and
  the matcher tries to match any of those in a given sequence of tokens.

* StringMatcher: keywords/gazetteer entries are strings, optionally associated with some data, and
  the matcher tries to match any of those in a given string, optionally only at non-word boundaries.
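To make the token-sequence case concrete, here is a minimal sketch of matching gazetteer entries against a token list with a tree of nested dictionaries. It is not the matchtext API: the names `build_token_tree` and `find_token_matches` are hypothetical and only illustrate the general idea of looking up token sequences with associated data.

```python
from typing import Any, Dict, Iterable, List, Sequence, Tuple

_END = object()  # sentinel key marking "an entry ends at this node"


def build_token_tree(entries: Iterable[Tuple[Sequence[str], Any]]) -> Dict:
    """Build a nested-dict tree keyed by token; entry data is stored under _END."""
    root: Dict = {}
    for tokens, data in entries:
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[_END] = data
    return root


def find_token_matches(tree: Dict, tokens: Sequence[str]) -> List[Tuple[int, int, Any]]:
    """Return (start, end, data) for every entry found; end is exclusive."""
    matches = []
    for start in range(len(tokens)):
        node = tree
        for end in range(start, len(tokens)):
            node = node.get(tokens[end])
            if node is None:  # no entry continues with this token
                break
            if _END in node:  # an entry ends exactly here
                matches.append((start, end + 1, node[_END]))
    return matches


tree = build_token_tree([
    (("New", "York"), "city"),
    (("New", "York", "Times"), "newspaper"),
])
print(find_token_matches(tree, ["The", "New", "York", "Times", "said"]))
# [(1, 3, 'city'), (1, 4, 'newspaper')]
```

Each lookup step is a single dictionary access, so the cost of matching at one position is bounded by the length of the longest entry rather than by the number of entries.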

The matchers are implemented to be fast: TokenMatcher is a hash tree, and StringMatcher uses a
character trie implementation underneath. Both matchers implement additional features often required in NLP:

* return the offsets in the original iterable where a match occurs

* mapfunc: tokens/characters can be mapped to some canonical form that is used for matching

* ignorefunc: some tokens/characters can be entirely ignored for matching

* match all/longest: return all matching entries or only the longest one

* skip/noskip: after a match is found, continue matching after the end of the longest match (skip) or at the next position (noskip)
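
The features listed above can be illustrated with a small character trie. This is a sketch under assumptions, not the library's implementation or API: the class `TrieSketch`, its methods, and the `longest_only` and `skip` parameters are hypothetical names chosen to mirror the feature list.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

_END = object()  # sentinel key: an entry ends at this trie node
Match = Tuple[int, int, Any]  # (start, end-exclusive, data), offsets in the original text


class TrieSketch:
    """Illustrative character trie; not the matchtext API."""

    def __init__(self,
                 mapfunc: Optional[Callable[[str], str]] = None,
                 ignorefunc: Optional[Callable[[str], bool]] = None) -> None:
        self.root: Dict = {}
        self.mapfunc = mapfunc or (lambda ch: ch)            # canonical form
        self.ignorefunc = ignorefunc or (lambda ch: False)   # characters to skip

    def add(self, entry: str, data: Any = None) -> None:
        node = self.root
        for ch in entry:
            if not self.ignorefunc(ch):
                node = node.setdefault(self.mapfunc(ch), {})
        node[_END] = data

    def _matches_at(self, text: str, start: int) -> List[Match]:
        """All entries starting at `start`, shortest first."""
        found, node = [], self.root
        for i in range(start, len(text)):
            if self.ignorefunc(text[i]):
                continue
            node = node.get(self.mapfunc(text[i]))
            if node is None:
                break
            if _END in node:
                found.append((start, i + 1, node[_END]))
        return found

    def find(self, text: str, longest_only: bool = True, skip: bool = True) -> List[Match]:
        results: List[Match] = []
        pos = 0
        while pos < len(text):
            if self.ignorefunc(text[pos]):  # never start a match on an ignored character
                pos += 1
                continue
            found = self._matches_at(text, pos)
            if found:
                results.extend(found[-1:] if longest_only else found)
            # skip: resume after the longest match; noskip: resume at the next position
            pos = found[-1][1] if (found and skip) else pos + 1
        return results


matcher = TrieSketch(mapfunc=str.lower, ignorefunc=lambda ch: ch == "-")
matcher.add("email", 1)
matcher.add("email address", 2)
matcher.add("mail", 3)
text = "E-mail address or EMAIL"
print(matcher.find(text))              # [(0, 14, 2), (18, 23, 1)]
print(matcher.find(text, skip=False))  # [(0, 14, 2), (2, 6, 3), (18, 23, 1), (19, 23, 3)]
```

In this sketch the hyphen is ignored and case is folded before lookup, yet the reported offsets still refer to positions in the unmodified input string.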