Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/radeklat/words-to-regular-expression

A command line tool and Python library for converting lists of strings into matching regular expressions (finite automata).
https://github.com/radeklat/words-to-regular-expression

command-line-tool matching-algorithm python python-library regular-expression

Last synced: 4 months ago
JSON representation

A command line tool and Python library for converting lists of strings into matching regular expressions (finite automata).

Host: GitHub
URL: https://github.com/radeklat/words-to-regular-expression
Owner: radeklat
License: mit
Created: 2018-07-07T09:57:12.000Z (over 6 years ago)
Default Branch: develop
Last Pushed: 2018-12-08T13:06:24.000Z (about 6 years ago)
Last Synced: 2024-10-09T13:26:16.434Z (4 months ago)
Topics: command-line-tool, matching-algorithm, python, python-library, regular-expression
Language: Python
Size: 200 KB
Stars: 5
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

Awesome Lists containing this project

README

        [![Master Build Status](https://travis-ci.org/radeklat/words-to-regular-expression.svg?branch=master)](https://travis-ci.org/radeklat/words-to-regular-expression)

[![Develop Build Status](https://travis-ci.org/radeklat/words-to-regular-expression.svg?branch=develop)](https://travis-ci.org/radeklat/words-to-regular-expression)

Compatible with Python 3.4+

# Purpose

This library and command line tool compresses multiple strings into one regular expression that can be used to find/match these strings later in larger piece of text.

# Installation

As simple as `pip install w2re`

## Example use

Input string are: `is`, `in`, `it`, `if`, `the`, `than`

As a library:

```python

from w2re import iterable_to_regexp                                         

    

iterable_to_regexp(['is', 'in', 'it', 'if', 'the', 'than'])

```                 

    '(?:i[fnst]|th(?:e|an))'

    

As command line tool:

```bash

echo -e "is\nin\nit\nif\nthe\nthan" | w2re

```

    

    (?:i[fnst]|th(?:e|an))

    

Input text is [The Zen of Python](https://www.python.org/dev/peps/pep-0020/#id3)

Counting words:

```python

from collections import Counter

from re import findall

from requests import get

from w2re import iterable_to_regexp

Counter(

    findall(

        iterable_to_regexp(['is', 'in', 'it', 'if', 'the', 'than']),

        get('https://raw.githubusercontent.com/python/peps/master/pep-0020.txt').text

    )

).most_common()                    

```

    [('is', 15), ('it', 12), ('in', 11), ('than', 8), ('the', 7), ('if', 2)]

# Features

## Collapsing multiple strings from command line input

This is very useful if you need to search for multiple strings and are not sure how to write the correct regexp (or like me, are lazy and write libraries for it instead).

Terminate your input with EOF (Ctrl+D on empty line in Linux).

```bash

w2re

i am searching for this

and this

and this as well

```

    (?:i\ am\ searching\ for\ this|and\ this(?:\ as\ wel{2})?)

## Collapsing of repeated sequences

```bash

echo 'hahaha' | w2re

```

    (?:ha){3}

This unfortunately does not produce a range yet. E.g. `subsubsection`, `subsection` and `section` will become `s(?:ection|ubs(?:ection|ubsection))` rather than expected `(?:sub){0,2}section`.

## Automatic escaping of regular expressions

  

```bash

echo '* test: ...' | w2re

```

    \*\ test\:\ \.{3}

## Reading words from a file on command line

    w2re -i /usr/share/dict/words

## Command line filter

    head -n 10 /usr/share/dict/words | w2re

    

    A(?:\'s|MD(?:\'s)?|OL(?:\'s)?|WS(?:\'s)?|achen(?:\'s)?)

## Reading words from iterable

```python

import w2re                                         

    

w2re.iterable_to_regexp(['is', 'in', 'it', 'if', 'the', 'than'])

``` 

    '(?:i[fnst]|th(?:e|an))'

        

## Reading words from stream

```python

import w2re                 

import io                        

    

w2re.stream_to_regexp(io.StringIO('is\nin\nit\nif\nthe\nthan'))

``` 

    '(?:i[fnst]|th(?:e|an))'

## Multiple output formats

### `w2re.PythonFormatter`

Standard Python formatted regular expression, based on the [re](https://docs.python.org/3/library/re.html) module. This is the default formatter for command line and library.

```python

import w2re                                         

    

w2re.iterable_to_regexp(['is', 'in', 'it', 'if', 'the', 'than'], w2re.PythonFormatter)

```

    '(?:i[fnst]|th(?:e|an))'

### `w2re.PythonWordMatchFormatter`

Standard Python formatted regular expression, based on the [re](https://docs.python.org/3/library/re.html) module. Suitable for matching whole words, rather than strings. Unlike `PythonFormatter`, it won't match `Python` in `Pythonista`.

```python

import w2re                                         

    

w2re.iterable_to_regexp(['is', 'in', 'it', 'if', 'the', 'than'], w2re.PythonWordMatchFormatter)

```

    '(?:\\W+|\\A)((?:i[fnst]|th(?:e|an)))(?=\\W+|\\Z)'

    

### `w2re.BaseFormatter`

Base class for implementation of custom formatters. See the [w2re.formatters](https://github.com/radeklat/words-to-regular-expression/blob/develop/w2re/formatters.py) module.