Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/radeklat/words-to-regular-expression
A command line tool and Python library for converting lists of strings into matching regular expressions (finite automata).
https://github.com/radeklat/words-to-regular-expression
command-line-tool matching-algorithm python python-library regular-expression
Last synced: 4 months ago
JSON representation
A command line tool and Python library for converting lists of strings into matching regular expressions (finite automata).
- Host: GitHub
- URL: https://github.com/radeklat/words-to-regular-expression
- Owner: radeklat
- License: mit
- Created: 2018-07-07T09:57:12.000Z (over 6 years ago)
- Default Branch: develop
- Last Pushed: 2018-12-08T13:06:24.000Z (about 6 years ago)
- Last Synced: 2024-10-09T13:26:16.434Z (4 months ago)
- Topics: command-line-tool, matching-algorithm, python, python-library, regular-expression
- Language: Python
- Size: 200 KB
- Stars: 5
- Watchers: 2
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project
README
[![Master Build Status](https://travis-ci.org/radeklat/words-to-regular-expression.svg?branch=master)](https://travis-ci.org/radeklat/words-to-regular-expression)
[![Develop Build Status](https://travis-ci.org/radeklat/words-to-regular-expression.svg?branch=develop)](https://travis-ci.org/radeklat/words-to-regular-expression)Compatible with Python 3.4+
# Purpose
This library and command line tool compresses multiple strings into one regular expression that can be used to find/match these strings later in larger piece of text.
# Installation
As simple as `pip install w2re`
## Example use
Input string are: `is`, `in`, `it`, `if`, `the`, `than`
As a library:
```python
from w2re import iterable_to_regexp
iterable_to_regexp(['is', 'in', 'it', 'if', 'the', 'than'])
```'(?:i[fnst]|th(?:e|an))'
As command line tool:```bash
echo -e "is\nin\nit\nif\nthe\nthan" | w2re
```
(?:i[fnst]|th(?:e|an))
Input text is [The Zen of Python](https://www.python.org/dev/peps/pep-0020/#id3)Counting words:
```python
from collections import Counter
from re import findallfrom requests import get
from w2re import iterable_to_regexpCounter(
findall(
iterable_to_regexp(['is', 'in', 'it', 'if', 'the', 'than']),
get('https://raw.githubusercontent.com/python/peps/master/pep-0020.txt').text
)
).most_common()
```[('is', 15), ('it', 12), ('in', 11), ('than', 8), ('the', 7), ('if', 2)]
# Features
## Collapsing multiple strings from command line input
This is very useful if you need to search for multiple strings and are not sure how to write the correct regexp (or like me, are lazy and write libraries for it instead).
Terminate your input with EOF (Ctrl+D on empty line in Linux).
```bash
w2re
i am searching for this
and this
and this as well
```(?:i\ am\ searching\ for\ this|and\ this(?:\ as\ wel{2})?)
## Collapsing of repeated sequences
```bash
echo 'hahaha' | w2re
```(?:ha){3}
This unfortunately does not produce a range yet. E.g. `subsubsection`, `subsection` and `section` will become `s(?:ection|ubs(?:ection|ubsection))` rather than expected `(?:sub){0,2}section`.
## Automatic escaping of regular expressions
```bash
echo '* test: ...' | w2re
```\*\ test\:\ \.{3}
## Reading words from a file on command line
w2re -i /usr/share/dict/words
## Command line filter
head -n 10 /usr/share/dict/words | w2re
A(?:\'s|MD(?:\'s)?|OL(?:\'s)?|WS(?:\'s)?|achen(?:\'s)?)## Reading words from iterable
```python
import w2re
w2re.iterable_to_regexp(['is', 'in', 'it', 'if', 'the', 'than'])
```'(?:i[fnst]|th(?:e|an))'
## Reading words from stream```python
import w2re
import io
w2re.stream_to_regexp(io.StringIO('is\nin\nit\nif\nthe\nthan'))
```'(?:i[fnst]|th(?:e|an))'
## Multiple output formats
### `w2re.PythonFormatter`
Standard Python formatted regular expression, based on the [re](https://docs.python.org/3/library/re.html) module. This is the default formatter for command line and library.
```python
import w2re
w2re.iterable_to_regexp(['is', 'in', 'it', 'if', 'the', 'than'], w2re.PythonFormatter)
```'(?:i[fnst]|th(?:e|an))'
### `w2re.PythonWordMatchFormatter`
Standard Python formatted regular expression, based on the [re](https://docs.python.org/3/library/re.html) module. Suitable for matching whole words, rather than strings. Unlike `PythonFormatter`, it won't match `Python` in `Pythonista`.
```python
import w2re
w2re.iterable_to_regexp(['is', 'in', 'it', 'if', 'the', 'than'], w2re.PythonWordMatchFormatter)
```'(?:\\W+|\\A)((?:i[fnst]|th(?:e|an)))(?=\\W+|\\Z)'
### `w2re.BaseFormatter`Base class for implementation of custom formatters. See the [w2re.formatters](https://github.com/radeklat/words-to-regular-expression/blob/develop/w2re/formatters.py) module.