https://github.com/toilal/rebulk

Define simple search patterns in bulk to perform advanced matching on any string
Host: GitHub
URL: https://github.com/toilal/rebulk
Owner: Toilal
License: mit
Created: 2015-09-05T06:11:51.000Z (almost 11 years ago)
Default Branch: develop
Last Pushed: 2023-12-14T15:00:10.000Z (over 2 years ago)
Last Synced: 2024-04-14T13:53:46.883Z (over 2 years ago)
Language: Python
Homepage:
Size: 494 KB
Stars: 55
Watchers: 5
Forks: 9
Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
README

ReBulk
======

[![Latest Version](http://img.shields.io/pypi/v/rebulk.svg)](https://pypi.python.org/pypi/rebulk)

[![MIT License](http://img.shields.io/badge/license-MIT-blue.svg)](https://pypi.python.org/pypi/rebulk)

[![Build Status](https://img.shields.io/github/workflow/status/Toilal/rebulk/ci)](https://github.com/Toilal/rebulk/actions?query=workflow%3Aci)

[![Coveralls](http://img.shields.io/coveralls/Toilal/rebulk.svg)](https://coveralls.io/r/Toilal/rebulk?branch=master)

[![semantic-release](https://img.shields.io/badge/%20%20%F0%9F%93%A6%F0%9F%9A%80-semantic--release-e10079.svg)](https://github.com/relekang/python-semantic-release)

ReBulk is a Python library that performs advanced searches in strings
that would be hard to implement using the [re
module](https://docs.python.org/3/library/re.html) or [string
methods](https://docs.python.org/3/library/stdtypes.html#str) alone.

It includes features such as `Patterns`, `Match` and `Rule` that allow
developers to build a custom and complex string matcher using a readable
and extendable API.

This project is hosted on GitHub: <https://github.com/Toilal/rebulk>

Install
=======

```sh
$ pip install rebulk
```

Usage
=====

Regular expression, string and function based patterns are declared in a
`Rebulk` object. It uses a fluent API to chain `string`, `regex` and
`functional` methods to define various pattern types.

```python
>>> from rebulk import Rebulk
>>> bulk = Rebulk().string('brown').regex(r'qu\w+').functional(lambda s: (20, 25))
```

When the `Rebulk` object is fully configured, call its `matches` method
with an input string to retrieve all `Match` objects found by the
registered patterns.

```python
>>> bulk.matches("The quick brown fox jumps over the lazy dog")
[<brown:(10, 15)>, <quick:(4, 9)>, <jumps:(20, 25)>]
```

If multiple `Match` objects are found at the same position, only the
longest one is kept.

```python
>>> bulk = Rebulk().string('lakers').string('la')
>>> bulk.matches("the lakers are from la")
[<lakers:(4, 10)>, <la:(20, 22)>]
```
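The "longest match wins" behavior can be pictured with a small standalone sketch. This is not rebulk's own conflict solver: `keep_longest` is a hypothetical helper, and plain `(start, end)` tuples stand in for `Match` objects.

```python
def keep_longest(spans):
    """Drop every span that overlaps a strictly longer span.

    `spans` is a list of (start, end) tuples used as a toy stand-in
    for Match objects; this is not rebulk's implementation.
    """
    kept = []
    for span in spans:
        start, end = span
        beaten = any(
            other != span
            and other[0] < end and start < other[1]    # the two spans overlap
            and (other[1] - other[0]) > (end - start)  # and the other one is longer
            for other in spans
        )
        if not beaten:
            kept.append(span)
    return kept


# 'la' at (4, 6) sits inside 'lakers' at (4, 10), so it is dropped.
print(keep_longest([(4, 10), (4, 6), (20, 22)]))  # [(4, 10), (20, 22)]
```

The input order is preserved for the spans that survive, which mirrors how the example output above lists `lakers` before `la`.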

String Patterns
===============

String patterns are based on the
[str.find](https://docs.python.org/3/library/stdtypes.html#str.find)
method to find matches, but return all matches in the string.
`ignore_case` can be enabled to ignore case.

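```python
>>> Rebulk().string('la').matches("lalalilala")
[<la:(0, 2)>, <la:(2, 4)>, <la:(6, 8)>, <la:(8, 10)>]
>>> Rebulk().string('la').matches("LalAlilAla")
[<la:(8, 10)>]
>>> Rebulk().string('la', ignore_case=True).matches("LalAlilAla")
[<La:(0, 2)>, <lA:(2, 4)>, <lA:(6, 8)>, <la:(8, 10)>]
```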

You can define several patterns with a single `string` method call.

```python
>>> Rebulk().string('Winter', 'coming').matches("Winter is coming...")
[<Winter:(0, 6)>, <coming:(10, 16)>]
```
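To make the "`str.find`, but for every occurrence" idea concrete, here is a minimal sketch of such a loop. The helper name `find_all` and its signature are assumptions made for illustration, not rebulk's API, and the case folding shown is only adequate for simple ASCII input.

```python
def find_all(text, needle, ignore_case=False):
    """Return the (start, end) span of every non-overlapping occurrence."""
    if not needle:
        return []  # an empty needle would never advance the search
    if ignore_case:
        # Assumption: lower-casing does not change string length (true for ASCII).
        text, needle = text.lower(), needle.lower()
    spans = []
    start = text.find(needle)
    while start != -1:
        end = start + len(needle)
        spans.append((start, end))
        start = text.find(needle, end)  # resume right after the previous hit
    return spans


print(find_all("lalalilala", "la"))                    # [(0, 2), (2, 4), (6, 8), (8, 10)]
print(find_all("LalAlilAla", "la"))                    # [(8, 10)]
print(find_all("LalAlilAla", "la", ignore_case=True))  # [(0, 2), (2, 4), (6, 8), (8, 10)]
```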

Regular Expression Patterns
===========================

Regular expression patterns are based on a compiled regular expression.
The [re.finditer](https://docs.python.org/3/library/re.html#re.finditer)
method is used to find matches.

If the [regex module](https://pypi.python.org/pypi/regex) is available,
rebulk can use it instead of the default [re
module](https://docs.python.org/3/library/re.html). Enable it with the
`REBULK_REGEX_ENABLED=1` environment variable.

```python
>>> Rebulk().regex(r'l\w').matches("lolita")
[<lo:(0, 2)>, <li:(2, 4)>]
```

You can define several patterns with a single `regex` method call.

```python
>>> Rebulk().regex(r'Wint\wr', r'com\w{3}').matches("Winter is coming...")
[<Winter:(0, 6)>, <coming:(10, 16)>]
```

All keyword arguments from
[re.compile](https://docs.python.org/3/library/re.html#re.compile) are
supported.

```python
>>> import re  # import required for flags constant
>>> Rebulk().regex('L[A-Z]KERS', flags=re.IGNORECASE) \
...         .matches("The LaKeRs are from La")
[<LaKeRs:(4, 10)>]
>>> Rebulk().regex('L[A-Z]', 'L[A-Z]KERS', flags=re.IGNORECASE) \
...         .matches("The LaKeRs are from La")
[<La:(20, 22)>, <LaKeRs:(4, 10)>]
>>> Rebulk().regex(('L[A-Z]', re.IGNORECASE), ('L[a-z]KeRs')) \
...         .matches("The LaKeRs are from La")
[<La:(20, 22)>, <LaKeRs:(4, 10)>]
```

If the [regex module](https://pypi.python.org/pypi/regex) is available,
repeated captures are supported automatically.

```python
>>> # If regex module is available, repeated_captures is True by default.
>>> matches = Rebulk().regex(r'(\d+)(?:-(\d+))+').matches("01-02-03-04")
>>> matches[0].children # doctest:+SKIP
[<01:(0, 2)>, <02:(3, 5)>, <03:(6, 8)>, <04:(9, 11)>]
>>> # If regex module is not available, or if repeated_captures is forced to False.
>>> matches = Rebulk().regex(r'(\d+)(?:-(\d+))+', repeated_captures=False) \
...                   .matches("01-02-03-04")
>>> matches[0].children
[<01:(0, 2)+initiator=01-02-03-04>, <04:(9, 11)+initiator=01-02-03-04>]
```

-   `abbreviations`

    Defined as a list of 2-tuples, where each tuple is an abbreviation.
    It simply replaces `tuple[0]` with `tuple[1]` in the expression.

        >>> Rebulk().regex(r'Custom-separators', abbreviations=[("-", r"[\W_]+")])\
        ...         .matches("Custom_separators using-abbreviations")
        [<Custom_separators:(0, 17)>]
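The substitution described for `abbreviations` can be sketched with the standard `re` module alone. `expand_abbreviations` is a hypothetical helper written for this illustration; it assumes a plain textual replacement applied in list order, which is how the description above reads, and it is not taken from rebulk's source.

```python
import re


def expand_abbreviations(pattern, abbreviations):
    """Replace each abbreviation (tuple[0]) with its expansion (tuple[1])."""
    for short, expansion in abbreviations:
        pattern = pattern.replace(short, expansion)
    return pattern


expanded = expand_abbreviations(r"Custom-separators", [("-", r"[\W_]+")])
print(expanded)  # Custom[\W_]+separators

found = re.search(expanded, "Custom_separators using-abbreviations")
print(found.span())  # (0, 17)
```

With a plain replacement like this one, an abbreviation key that also appears inside an earlier expansion would be rewritten again, so the order of the tuples matters.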

Functional Patterns
===================

Functional patterns are based on the evaluation of a function.

The function should have the same parameters as the `Rebulk.matches`
method, that is the input string, and must return at least the start
index and end index of the `Match` object.

```python
>>> def func(string):
...     index = string.find('?')
...     if index > -1:
...         return 0, index - 11
>>> Rebulk().functional(func).matches("Why do simple ? Forget about it ...")
[<Why:(0, 3)>]
```

You can also return a dict of keyword arguments for the `Match` object.

You can define several patterns with a single `functional` method call,
and the function used can return multiple matches.

Chain Patterns
==============

Chain patterns are an ordered composition of string, functional and
regex patterns. A repeater can be set to define repetition on a chain
part.

```python
>>> r = Rebulk().regex_defaults(flags=re.IGNORECASE)\
...             .defaults(children=True, formatter={'episode': int, 'version': int})\
...             .chain()\
...             .regex(r'e(?P<episode>\d{1,4})').repeater(1)\
...             .regex(r'v(?P<version>\d+)').repeater('?')\
...             .regex(r'[ex-](?P<episode>\d{1,4})').repeater('*')\
...             .close() # .repeater(1) could be omitted as it's the default behavior
>>> r.matches("This is E14v2-15-16-17").to_dict()  # converts matches to dict
MatchesDict([('episode', [14, 15, 16, 17]), ('version', 2)])
```

Patterns parameters
===================

All patterns have options that can be given as keyword arguments.

-   `validator`

    Function to validate `Match` value given by the pattern. Can also be
    a `dict`, to use `validator` with pattern named with key.

    ```python
    >>> def check_leap_year(match):
    ...     return int(match.value) in [1980, 1984, 1988]
    >>> matches = Rebulk().regex(r'\d{4}', validator=check_leap_year) \
    ...                   .matches("In year 1982 ...")
    >>> len(matches)
    0
    >>> matches = Rebulk().regex(r'\d{4}', validator=check_leap_year) \
    ...                   .matches("In year 1984 ...")
    >>> len(matches)
    1
    ```

Some base validator functions are available in the `rebulk.validators`
module. Most of those functions have to be configured using
`functools.partial` to map them to a function accepting a single `match`
argument.
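As a rough illustration of the `functools.partial` step, the sketch below binds the extra argument of a two-argument check so that only `match` remains. Both `FakeMatch` and `surrounded_by` are hypothetical stand-ins invented for this example; they are not the functions shipped in `rebulk.validators`.

```python
from collections import namedtuple
from functools import partial

# Hypothetical stand-in for a Match, carrying only what the check needs.
FakeMatch = namedtuple("FakeMatch", ["value", "start", "end", "input_string"])


def surrounded_by(chars, match):
    """Hypothetical validator: the match must be bounded by one of `chars`,
    or by the start/end of the input string."""
    text = match.input_string
    before_ok = match.start == 0 or text[match.start - 1] in chars
    after_ok = match.end == len(text) or text[match.end] in chars
    return before_ok and after_ok


# partial() fixes `chars`, leaving a callable that takes a single `match`.
space_bounded = partial(surrounded_by, " .")

text = "year 1984."
print(space_bounded(FakeMatch("1984", 5, 9, text)))  # True
print(space_bounded(FakeMatch("98", 6, 8, text)))    # False
```

A callable shaped like `space_bounded` is what a `validator` keyword expects: one argument in, a truthy or falsy answer out.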

-   `formatter`

    Function to convert `Match` value given by the pattern. Can also be
    a `dict`, to use `formatter` with matches named with key.

    ```python
    >>> def year_formatter(value):
    ...     return int(value)
    >>> matches = Rebulk().regex(r'\d{4}', formatter=year_formatter) \
    ...                   .matches("In year 1982 ...")
    >>> isinstance(matches[0].value, int)
    True
    ```

-   `pre_match_processor` / `post_match_processor`

    Function to mutate or invalidate a match generated by a pattern.

    The function has a single parameter, which is the `Match` object. If
    the function returns `False`, the match is considered invalid. If
    the function returns a match instance, that instance replaces the
    original match in the process.

-   `post_processor`

    Function to change the default output of the pattern. Function
    parameters are the `Matches` list and the `Pattern` object.

-   `name`

    The name of the pattern. It is automatically passed to `Match`
    objects generated by this pattern.

-   `tags`

    A list of strings that qualify this pattern.

-   `value`

    Override value property for generated `Match` objects. Can also be a
    `dict`, to use `value` with pattern named with key.

-   `validate_all`

    By default, validator is called for returned `Match` objects only.
    Enable this option to validate them all, parent and children
    included.

-   `format_all`

    By default, formatter is called for returned `Match` values only.
    Enable this option to format them all, parent and children included.

-   `disabled`

    A `function(context)` to disable the pattern if returning `True`.

-   `children`

    If `True`, all children `Match` objects will be retrieved instead of
    a single parent `Match` object.

-   `private`

    If `True`, `Match` objects generated from this pattern are available
    internally only. They will be removed at the end of the
    `Rebulk.matches` method call.

-   `private_parent`

    Force parent matches to be returned and flag them as private.

-   `private_children`

    Force children matches to be returned and flag them as private.

-   `private_names`

    Match names that will be declared as private.

-   `ignore_names`

    Match names that will be ignored from the pattern output, after
    validation.

-   `marker`

    If `True`, `Match` objects generated from this pattern will be
    marker matches instead of standard matches. They won't be included
    in the `Matches` sequence, but will be available in the
    `Matches.markers` sequence (see `Markers` section).

Match
=====

A `Match` object is the result created by a registered pattern.

It has a `value` property defined, and position indices are available
through the `start`, `end` and `span` properties.

In some cases, it contains children `Match` objects in its `children`
property, and each child `Match` object references its parent in its
`parent` property. A `name` property can also be defined for the match.

If groups are defined in a regular expression pattern, each group match
is converted to a single `Match` object. If a group has a name defined
(`(?P<name>group)`), it is set as the `name` property of the child
`Match` object. The whole regexp match (`re.group(0)`) is converted to
the main `Match` object, and all subgroups (1, 2, ... n) are converted
to `children` matches of the main `Match` object.

```python
>>> matches = Rebulk() \
...         .regex(r"One, (?P<one>\w+), Two, (?P<two>\w+), Three, (?P<three>\w+)") \
...         .matches("Zero, 0, One, 1, Two, 2, Three, 3, Four, 4")
>>> matches
[<One, 1, Two, 2, Three, 3:(9, 33)>]
>>> for child in matches[0].children:
...     '%s = %s' % (child.name, child.value)
'one = 1'
'two = 2'
'three = 3'
```

It's possible to retrieve only children by using the `children`
parameter. You can also customize the way the structure is generated
with the `every`, `private_parent` and `private_children` parameters.

```python
>>> matches = Rebulk() \
...         .regex(r"One, (?P<one>\w+), Two, (?P<two>\w+), Three, (?P<three>\w+)", children=True) \
...         .matches("Zero, 0, One, 1, Two, 2, Three, 3, Four, 4")
>>> matches
[<1:(14, 15)+name=one+initiator=One, 1, Two, 2, Three, 3>, <2:(22, 23)+name=two+initiator=One, 1, Two, 2, Three, 3>, <3:(32, 33)+name=three+initiator=One, 1, Two, 2, Three, 3>]
```

The `Match` object has the following properties that can be given to
`Pattern` objects:

-   `formatter`

    Function to convert `Match` value given by the pattern. Can also be
    a `dict`, to use `formatter` with matches named with key.

    ```python
    >>> def year_formatter(value):
    ...     return int(value)
    >>> matches = Rebulk().regex(r'\d{4}', formatter=year_formatter) \
    ...                   .matches("In year 1982 ...")
    >>> isinstance(matches[0].value, int)
    True
    ```

-   `format_all`

    By default, formatter is called for returned `Match` values only.
    Enable this option to format them all, parent and children included.

-   `conflict_solver`

    A `function(match, conflicting_match)` used to solve conflicts. The
    returned object will be removed from matches by the `ConflictSolver`
    default rule. If the `__default__` string is returned, it falls back
    to the default behavior of keeping the longer match.

Matches
=======

A `Matches` object holds the result of a `Rebulk.matches` method call.

It's a sequence of `Match` objects and it behaves like a list.

All methods accept a `predicate` function to filter `Match` objects
using a callable, and an `index` int to retrieve a single element from
the default returned matches.

It has the following additional methods and properties.

-   `starting(index, predicate=None, index=None)`

    Retrieves a list of `Match` objects that start at the given index.

-   `ending(index, predicate=None, index=None)`

    Retrieves a list of `Match` objects that end at the given index.

-   `previous(match, predicate=None, index=None)`

    Retrieves a list of `Match` objects that are previous and nearest to
    match.

-   `next(match, predicate=None, index=None)`

    Retrieves a list of `Match` objects that are next and nearest to
    match.

-   `tagged(tag, predicate=None, index=None)`

    Retrieves a list of `Match` objects that have the given tag defined.

-   `named(name, predicate=None, index=None)`

    Retrieves a list of `Match` objects that have the given name.

-   `range(start=0, end=None, predicate=None, index=None)`

    Retrieves a list of `Match` objects for the given range, sorted from
    start to end.

-   `holes(start=0, end=None, formatter=None, ignore=None, predicate=None, index=None)`

    Retrieves a list of *hole* `Match` objects for the given range. A
    hole match is created for each range where no match is available.

-   `conflicting(match, predicate=None, index=None)`

    Retrieves a list of `Match` objects that conflict with the given
    match.

-   `chain_before(self, position, seps, start=0, predicate=None, index=None)`

    Retrieves a list of chained matches, before position, matching
    predicate and separated by characters from seps only.

-   `chain_after(self, position, seps, end=None, predicate=None, index=None)`

    Retrieves a list of chained matches, after position, matching
    predicate and separated by characters from seps only.

-   `at_match(match, predicate=None, index=None)`

    Retrieves a list of `Match` objects at the same position as match.

-   `at_span(span, predicate=None, index=None)`

    Retrieves a list of `Match` objects from the given (start, end)
    tuple.

-   `at_index(pos, predicate=None, index=None)`

    Retrieves a list of `Match` objects from the given position.

-   `names`

    Retrieves a sequence of all `Match.name` properties.

-   `tags`

    Retrieves a sequence of all `Match.tags` properties.

-   `to_dict(details=False, first_value=False, enforce_list=False)`

    Converts to an ordered dict, with `Match.name` as key and
    `Match.value` as value.

    It's a subclass of
    [OrderedDict](https://docs.python.org/2/library/collections.html#collections.OrderedDict)
    that contains a `matches` property, which is a dict with
    `Match.name` as key and a list of `Match` objects as value.

    If `first_value` is `False` (the default) and distinct values are
    found for the same name, the values are wrapped in a list. If
    `True`, only the first value is kept, and the lists of values can be
    retrieved with `values_list`, which is a dict with `Match.name` as
    key and a list of `Match.value` as value.

    If `enforce_list` is `True`, all values are wrapped in a list, even
    if a single value is found.

    If `details` is `True`, `Match.value` objects are replaced with the
    complete `Match` object.

-   `markers`

    A custom `Matches` sequence specialized for marker matches (see
    below).
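The idea behind `holes` in the list above, one hole for each range where no match is available, can be sketched on plain spans. The function below is a hypothetical illustration using `(start, end)` tuples rather than `Match` objects, and it leaves out the `formatter`, `ignore`, `predicate` and `index` options.

```python
def holes(spans, start, end):
    """Return the (start, end) gaps inside [start, end) that no span covers."""
    gaps = []
    cursor = start
    for span_start, span_end in sorted(spans):
        if span_end <= cursor:
            continue              # span lies entirely before the cursor
        if span_start >= end:
            break                 # span lies entirely after the range
        if span_start > cursor:
            gaps.append((cursor, span_start))
        cursor = max(cursor, span_end)
    if cursor < end:
        gaps.append((cursor, end))
    return gaps


text = "The quick brown fox"
matched = [(4, 9), (10, 15)]  # 'quick' and 'brown'
for gap_start, gap_end in holes(matched, 0, len(text)):
    print((gap_start, gap_end), repr(text[gap_start:gap_end]))
# (0, 4) 'The '
# (9, 10) ' '
# (15, 19) ' fox'
```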

Markers
=======

If you have defined some patterns with the `marker` property, then
`Matches.markers` points to a special `Matches` sequence that contains
only marker matches. This sequence supports all methods from `Matches`.

Marker matches are not intended to be used in the final result, but can
be used to implement a `Rule`.

Rules
=====

Rules are a convenient and readable way to implement advanced
conditional logic involving several `Match` objects. When a rule is
triggered, it can perform an action on the `Matches` object, such as
filtering out matches, adding tags or renaming.

Rules are implemented by extending the abstract `Rule` class. They are
registered using the `Rebulk.rules` method by giving either a `Rule`
instance, a `Rule` class or a module containing `Rule` classes only.

For a rule to be triggered, the `Rule.when` method must return `True`, a
non-empty list of `Match` objects, or any other truthy object. When
triggered, the `Rule.then` method is called to perform the action, with
the `when_response` parameter set to the response of the `Rule.when`
call.

Instead of implementing the `Rule.then` method, you can define a
`consequence` class property with a Consequence class or instance, such
as `RemoveMatch`, `RenameMatch` or `AppendMatch`. You can also use a
list of consequences when required: `when_response` must then be
iterable, and the elements of this iterable will be given to each
consequence in the same order.

When many rules are registered, it can be useful to set the `priority`
class variable to define a priority integer between all rule executions
(higher priorities are executed first). You can also define
`dependency` to declare another `Rule` class as a dependency of the
current rule, meaning that it will be executed before.

For all rules with the same `priority` value, `when` is called on every
rule first, and `then` is called afterwards.
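The ordering described above can be traced with a toy engine. Everything in this sketch (`ToyRule`, `run_rules`, the log) is hypothetical and written only to illustrate the order of calls; it ignores `dependency` and consequences and does not reproduce rebulk's rule processing.

```python
class ToyRule:
    """Hypothetical stand-in for a rule: a name, a priority and a shared log."""

    def __init__(self, name, priority, log):
        self.name, self.priority, self.log = name, priority, log

    def when(self, matches):
        self.log.append("when:" + self.name)
        return True  # always trigger, to keep the trace easy to read

    def then(self, matches, when_response):
        self.log.append("then:" + self.name)


def run_rules(rules, matches):
    """Run rules by descending priority; within one priority level,
    evaluate every `when` before calling any `then`."""
    for priority in sorted({rule.priority for rule in rules}, reverse=True):
        group = [rule for rule in rules if rule.priority == priority]
        responses = [(rule, rule.when(matches)) for rule in group]
        for rule, response in responses:
            if response:
                rule.then(matches, response)


log = []
run_rules([ToyRule("a", 0, log), ToyRule("b", 10, log), ToyRule("c", 0, log)], [])
print(log)
# ['when:b', 'then:b', 'when:a', 'when:c', 'then:a', 'then:c']
```

Rule `b` runs first because of its higher priority, while `a` and `c` share a level, so both `when` calls happen before either `then`.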

```python
>>> from rebulk import Rule, RemoveMatch
>>> class FirstOnlyRule(Rule):
...     consequence = RemoveMatch
...
...     def when(self, matches, context):
...         grabbed = matches.named("grabbed", 0)
...         if grabbed and matches.previous(grabbed):
...             return grabbed
>>> rebulk = Rebulk()
>>> rebulk.regex("This match(.*?)grabbed", name="grabbed")
<...Rebulk object ...>
>>> rebulk.regex("if it's(.*?)first match", private=True)
<...Rebulk object at ...>
>>> rebulk.rules(FirstOnlyRule)
<...Rebulk object at ...>
>>> rebulk.matches("This match is grabbed only if it's the first match")
[<This match is grabbed:(0, 21)+name=grabbed>]
>>> rebulk.matches("if it's NOT the first match, This match is NOT grabbed")
[]
```