
[![Python][Python-shield]][Python-url]
[![Contributors][contributors-shield]][contributors-url]
[![Watchers][watchers-shield]][watchers-url]
[![Forks][forks-shield]][forks-url]
[![MIT License][license-shield]][license-url]
[![Stargazers][stars-social]][stars-url]

# Pawpaw [![High Performance Text Segmentation, Parsing, & Query][byline-img]][repo]

Pawpaw is a high-performance framework for parsing and text segmentation. The segments it generates are automatically organized into tree graphs, which can be easily serialized, traversed, and queried using a powerful structured query language called *plumule*. Creating a tree graph is simple — just provide spans, gaps, substrings, or even a regex/match object. Additionally, Pawpaw includes a robust pipelining engine, enabling you to create complex, multi-step, rules-based text parsers.

Botanical Drawing: Asimina triloba: the American papaw

- Indexed string and substring representation
  - Efficient memory utilization
  - Fast processing
  - Pythonic relative indexing and slicing
  - Runtime & polymorphic value extraction
  - Tree graphs for all indexed text
- Search and Query
  - Search trees using *plumule*: a powerful structured query language similar to XPATH
  - Combine multiple axes, filters, and subqueries sequentially and recursively to any depth
  - Optionally pre-compile queries for increased performance
- Rules Pipelining Engine
  - Develop complex lexical parsers with just a few lines of code
  - Quickly and easily convert unstructured text into structured, indexed, & searchable tree graphs
  - Pre-process text for downstream NLP/AI/ML consumers
- XML Processing
  - Features a drop-in replacement for ``ElementTree.XmlParser``
  - Full text indices for all elements, attributes, tags, text, etc.
  - Search the resulting XML using either XPATH or plumule
  - Extract *both* ``ElementTree`` and Pawpaw data structures in one pass, with cross-linked nodes between trees
- NLP Support
  - Ideal for *preprocessing* unstructured text for downstream NLP consumption
  - Integrates well with other libraries, such as [NLTK](https://www.nltk.org/)
- Efficient pickling and JSON persistence
  - A security option allows persistence of index-only data, re-injecting referenced strings during deserialization
- Stable & Defect Free
  - Over 5,000 unit tests and counting!
  - Pure Python, with only one external dependency: [regex](https://github.com/mrabarnett/mrab-regex)
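The features above center on one idea: a segment is just a (start, stop) span over a single shared string, so substrings are never copied until you ask for them. A minimal sketch of that idea (illustrative only — not Pawpaw's actual implementation, whose ``Ito`` class is far richer):

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    """Toy stand-in for an indexed segment: a (start, stop) span over a shared string."""
    src: str          # the one shared backing string
    start: int
    stop: int
    desc: str = ''    # a label such as 'word' or 'number'
    children: list['Segment'] = field(default_factory=list)

    def __str__(self) -> str:
        # The substring is materialized lazily; no copies are stored.
        return self.src[self.start:self.stop]

s = 'nine 9'
phrase = Segment(s, 0, 6, 'phrase')
phrase.children.append(Segment(s, 0, 4, 'word'))
phrase.children.append(Segment(s, 5, 6, 'number'))
print([str(c) for c in phrase.children])  # -> ['nine', '9']
```

Because every node holds only offsets plus a reference to the same backing string, a tree over a large document costs little more than the document itself.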


Explore the docs  •  Request a feature or report a bug  •  Explore the code

## Example

With Pawpaw, you can start with flattened text like this:

```
ARTICLE I
Section 1: Congress
All legislative Powers herein granted shall be vested in a Congress of the United States,
which shall consist of a Senate and House of Representatives.

Section 2: The House of Representatives
The House of Representatives shall be composed of Members chosen every second Year by the
People of the several States, and the Electors in each State shall have the Qualifications
requisite for Electors of the most numerous Branch of the State Legislature.

No Person shall be a Representative who shall not have attained to the Age of twenty five
Years, and been seven Years a Citizen of the United States, and who shall not, when elected,
be an Inhabitant of that State in which he shall be chosen.
```

and quickly and easily produce a tree that looks like this:

```mermaid
graph TD;
A1["[article]
#quot;ARTICLE I…#quot;"]:::dark_brown --> A1_k["[key]
#quot;I#quot;"]:::dark_brown;
A1--->Sc1["[section]
#quot;Section 1…#quot;"]:::light_brown;
Sc1-->Sc1_k["[key]
#quot;1#quot;"]:::light_brown
Sc1--->Sc1_p1["[paragraph]
#quot;All legislative Powers…#quot;"]:::peach
Sc1_p1-->Sc1_p1_s1["[sentence]
#quot;All legislative Powers…#quot;"]:::dark_green
Sc1_p1_s1-->Sc1_p1_s1_w1["[word]
#quot;All#quot;"]:::light_green
Sc1_p1_s1-->Sc1_p1_s1_w2["[word]
#quot;legislative#quot;"]:::light_green
Sc1_p1_s1-->Sc1_p1_s1_w3["[word]
#quot;Powers#quot;"]:::light_green
Sc1_p1_s1-->Sc1_p1_s1_w4["..."]:::ellipsis

A1--->Sc2["[section]
#quot;Section 2#quot;"]:::light_brown;
Sc2-->Sc2_k["[key]
#quot;2#quot;"]:::light_brown
Sc2--->Sc2_p1["[paragraph]
#quot;The House of…#quot;"]:::peach
Sc2_p1---->Sc2_p1_s1["[sentence]
#quot;The House of…#quot;"]:::dark_green
Sc2_p1_s1-->Sc2_p1_s1_w1["[word]
#quot;The#quot;"]:::light_green
Sc2_p1_s1-->Sc2_p1_s1_w2["[word]
#quot;House#quot;"]:::light_green
Sc2_p1_s1-->Sc2_p1_s1_w3["[word]
#quot;of#quot;"]:::light_green
Sc2_p1_s1-->Sc2_p1_s1_w4["..."]:::ellipsis
Sc2--->Sc2_p2["[paragraph]
#quot;No Person shall…#quot;"]:::peach
Sc2_p2---->Sc2_p2_s1["[sentence]
#quot;No Person shall…#quot;"]:::dark_green
Sc2_p2_s1-->Sc2_p2_s1_w1["[word]
#quot;No#quot;"]:::light_green
Sc2_p2_s1-->Sc2_p2_s1_w2["[word]
#quot;Person#quot;"]:::light_green
Sc2_p2_s1-->Sc2_p2_s1_w3["[word]
#quot;shall#quot;"]:::light_green
Sc2_p2_s1-->Sc2_p2_s1_w4["..."]:::ellipsis

classDef dark_brown fill:#533E30,stroke:#000000,color:#FFFFFF;
classDef light_brown fill:#D2AC70,stroke:#000000,color:#000000;
classDef peach fill:#E4D1AE,stroke:#000000,color:#000000;
classDef dark_green fill:#517D3D,stroke:#000000,color:#FFFFFF;
classDef light_green fill:#90C246,stroke:#000000,color:#FFFFFF;

classDef ellipsis fill:#FFFFFF,stroke:#FFFFFF,color:#000000;
```

You can then search your tree using *plumule*, Pawpaw's structured query language:

```python
'**[d:section]{**[d:word] & [lcs:power,right]}' # plumule query to find sections containing the words 'power' or 'right'
```

Try out [this demo](docs/demos/us_constitution) yourself, which shows how easy it is to parse, visualize, and query the US Constitution using Pawpaw.

## Usage

Pawpaw has extensive features and capabilities, which you can read about in the [Docs](/Docs). As a quick example, say you have some text that you would like to perform NLP-style segmentation on:

```python
>>> s = 'nine 9 ten 10 eleven 11 TWELVE 12 thirteen 13'
```

You can use a regular expression for segmentation as follows:

```python
>>> import regex
>>> re = regex.compile(r'(?:(?P<phrase>(?P<word>(?P<char>\w)+) (?P<number>(?P<digit>\d)+))\s*)+')
```
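Note that the pattern relies on ``regex``'s ability to retain *every* repeated capture of a group, not just the last one. The standard library's ``re`` module cannot do this, which is easy to demonstrate (the group names here mirror the pattern above):

```python
import re  # the standard library module, shown for comparison only

s = 'nine 9 ten 10 eleven 11 TWELVE 12 thirteen 13'

# With stdlib re, a named group inside a repeated (?:...)+ keeps only its LAST capture:
m = re.fullmatch(r'(?:(?P<word>\w+) (?P<number>\d+)\s*)+', s)
print(m['word'], m['number'])  # -> thirteen 13

# The third-party `regex` module instead keeps all repeated captures, e.g.
#   regex.fullmatch(...).captures('word')
# yields every word matched, which is what lets Pawpaw build one node per capture.
```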

You can then use this regex to feed **Pawpaw**:

```python
>>> import pawpaw
>>> doc = pawpaw.Ito.from_match(re.fullmatch(s))[0]
```

With this single line of code, Pawpaw generates a fully hierarchical tree of phrases, words, chars, numbers, and digits. You can visualize the tree:

```python
>>> tree_vis = pawpaw.visualization.pepo.Tree()
>>> print(tree_vis.dumps(doc))
(0, 45) '0' : 'nine 9 ten 10 eleven…ELVE 12 thirteen 13'
├──(0, 6) 'phrase' : 'nine 9'
│  ├──(0, 4) 'word' : 'nine'
│  │  ├──(0, 1) 'char' : 'n'
│  │  ├──(1, 2) 'char' : 'i'
│  │  ├──(2, 3) 'char' : 'n'
│  │  └──(3, 4) 'char' : 'e'
│  └──(5, 6) 'number' : '9'
│     └──(5, 6) 'digit' : '9'
├──(7, 13) 'phrase' : 'ten 10'
│  ├──(7, 10) 'word' : 'ten'
│  │  ├──(7, 8) 'char' : 't'
│  │  ├──(8, 9) 'char' : 'e'
│  │  └──(9, 10) 'char' : 'n'
│  └──(11, 13) 'number' : '10'
│     ├──(11, 12) 'digit' : '1'
│     └──(12, 13) 'digit' : '0'
├──(14, 23) 'phrase' : 'eleven 11'
│  ├──(14, 20) 'word' : 'eleven'
│  │  ├──(14, 15) 'char' : 'e'
│  │  ├──(15, 16) 'char' : 'l'
│  │  ├──(16, 17) 'char' : 'e'
│  │  ├──(17, 18) 'char' : 'v'
│  │  ├──(18, 19) 'char' : 'e'
│  │  └──(19, 20) 'char' : 'n'
│  └──(21, 23) 'number' : '11'
│     ├──(21, 22) 'digit' : '1'
│     └──(22, 23) 'digit' : '1'
├──(24, 33) 'phrase' : 'TWELVE 12'
│  ├──(24, 30) 'word' : 'TWELVE'
│  │  ├──(24, 25) 'char' : 'T'
│  │  ├──(25, 26) 'char' : 'W'
│  │  ├──(26, 27) 'char' : 'E'
│  │  ├──(27, 28) 'char' : 'L'
│  │  ├──(28, 29) 'char' : 'V'
│  │  └──(29, 30) 'char' : 'E'
│  └──(31, 33) 'number' : '12'
│     ├──(31, 32) 'digit' : '1'
│     └──(32, 33) 'digit' : '2'
└──(34, 45) 'phrase' : 'thirteen 13'
   ├──(34, 42) 'word' : 'thirteen'
   │  ├──(34, 35) 'char' : 't'
   │  ├──(35, 36) 'char' : 'h'
   │  ├──(36, 37) 'char' : 'i'
   │  ├──(37, 38) 'char' : 'r'
   │  ├──(38, 39) 'char' : 't'
   │  ├──(39, 40) 'char' : 'e'
   │  ├──(40, 41) 'char' : 'e'
   │  └──(41, 42) 'char' : 'n'
   └──(43, 45) 'number' : '13'
      ├──(43, 44) 'digit' : '1'
      └──(44, 45) 'digit' : '3'
```
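Every ``(start, stop)`` pair in the dump is a plain offset into the original string — no substrings are copied — so you can verify any node by slicing the source directly:

```python
s = 'nine 9 ten 10 eleven 11 TWELVE 12 thirteen 13'

assert s[0:6] == 'nine 9'      # the first 'phrase' node
assert s[34:42] == 'thirteen'  # the last 'word' node
assert s[43:45] == '13'        # the last 'number' node
```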

And you can search the tree using Pawpaw's *plumule*, a powerful XPATH-like structured query language:

```python
>>> print(*doc.find_all('**[d:digit]'), sep=', ') # all digits
9, 1, 0, 1, 1, 1, 2, 1, 3
>>> print(*doc.find_all('**[d:number]{*[s:i]}'), sep=', ') # all numbers with 'i' in their name
9, 13
```

This example uses a regular expression as its source; however, Pawpaw can work with many other input types. For example, you can use libraries such as [NLTK](https://www.nltk.org/) to grow Pawpaw trees, or you can use Pawpaw's included parser framework to build your own sophisticated parsers quickly and easily.
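Any tool that yields character offsets can serve as such a source. As a rough sketch of the idea — using a trivial whitespace tokenizer in place of NLTK, and plain tuples in place of Pawpaw's node type:

```python
def spans(text: str):
    """Yield (start, stop) character offsets for whitespace-delimited tokens."""
    start = None
    for i, ch in enumerate(text):
        if ch.isspace():
            if start is not None:
                yield start, i
                start = None
        elif start is None:
            start = i
    if start is not None:
        yield start, len(text)

text = 'nine 9 ten 10'
# Each span could seed a Pawpaw node; here we just collect (span, substring) pairs.
tokens = [((a, b), text[a:b]) for a, b in spans(text)]
print(tokens)  # -> [((0, 4), 'nine'), ((5, 6), '9'), ((7, 10), 'ten'), ((11, 13), '10')]
```

The same pattern applies to any tokenizer or chunker that can report offsets into the original text.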

(back to top)

## Getting Started

### Prerequisites

Pawpaw has been written and tested using Python 3.10. The only dependency is
``regex``, which will be fetched and installed automatically if you install Pawpaw
with pip or conda.

### Installation Options

There are lots of ways to install Pawpaw. Versioned instances that have passed all automated tests are available from [PyPI](https://pypi.org/project/pawpaw/):

1. Install with pip from PyPI:
```
pip install pawpaw
```

2. Install with conda from PyPI:
```
conda activate myenv
conda install git pip
pip install pawpaw
```

Alternatively, you can pull from the main branch at GitHub. This ensures you have the latest code; however, the main branch may contain internal inconsistencies and/or failing tests:

1. Install with pip from GitHub:
```
pip install git+https://github.com/rlayers/pawpaw.git
```

2. Install with conda from GitHub:
```
conda activate myenv
conda install git pip
pip install git+https://github.com/rlayers/pawpaw.git
```

3. Clone the repo with git from GitHub:
```
git clone https://github.com/rlayers/pawpaw
```

### Verify Installation

Whichever way you fetch Pawpaw, you can easily verify that it is installed correctly. Just open a Python prompt and type:

```python
>>> from pawpaw import Ito
>>> Ito('Hello, World!')
Ito(span=(0, 13), desc='', substr='Hello, World!')
```

If the last line of output looks like this, you are up and running with Pawpaw!

(back to top)

## History & Roadmap

Pawpaw is a rewrite of *desponia*, a now-deprecated Python 2.x segmentation framework that was itself based on a prior framework called *Ito*. Pawpaw is currently in release-candidate status: many components and features are production-ready, but documentation is still being written and some newer features remain under development. A rough outline of which components are finalized is as follows:

- [x] arborform
  - [x] itorator
    - [x] Desc
    - [x] Extract
    - [x] Reflect
    - [x] Split
    - [x] ValueFunc
  - [x] postorator
    - [x] StackedReduce
    - [x] WindowedJoin
- [x] core
  - [x] Errors
  - [x] Infix
  - [x] Ito
  - [x] ItoChildren
  - [x] nuco
  - [x] Span
  - [x] Types
- [ ] documentation & examples
- [x] query
  - [x] radicle query engine
  - [x] plumule
- [ ] nlp
- [x] visualization
  - [x] ascibox
  - [x] highlighter
  - [x] pepo
  - [x] sgr
- [x] xml
  - [x] XmlHelper
  - [x] XmlParser

(back to top)

## Donations


White bowl full of pawpaws
Pawpaw is distributed under the MIT License and is free to use; however, should you wish to help support the continued development of Pawpaw, you may do so via:



| Link | QR Code |
| :-----: | :-----: |
| [![](https://www.paypalobjects.com/en_US/i/btn/btn_donateCC_LG.gif)](https://www.paypal.com/donate/?hosted_button_id=ALXFKLFU8W2NE) | Paypal QR Code |

(back to top)

## License

Distributed under the MIT License. See [LICENSE](LICENSE) for more information.

(back to top)

## Contacts

Robert L. Ayers:  a.nov.guy@gmail.com

(back to top)

## References

* [Matthew Barnett's regex](https://github.com/mrabarnett/mrab-regex)
* [NLTK](https://www.nltk.org/)

(back to top)

[repo]: https://github.com/rlayers/pawpaw

[byline-img]: https://img.shields.io/badge/-High%20Performance%20Text%20Segmentation%2C%20Parsing%2C%20%26%20Query-FFFFFF

[byline2-img]: https://readme-typing-svg.demolab.com?font=Fira+Code&weight=800&duration=500&pause=1500&color=533E30&vCenter=true&width=375&height=25&lines=High+Performance+Text+Segmentation

[Python-shield]: https://img.shields.io/badge/python-≥3.10-517D3D.svg?style=flat
[Python-url]: https://www.python.org

[contributors-shield]: https://img.shields.io/github/contributors/rlayers/pawpaw.svg?color=90C246&style=flat
[contributors-url]: https://github.com/rlayers/pawpaw/graphs/contributors

[watchers-shield]: https://img.shields.io/github/watchers/rlayers/pawpaw.svg?color=E4D1AE&style=flat
[watchers-url]: https://github.com/rlayers/pawpaw/watchers

[issues-shield]: https://img.shields.io/github/issues/rlayers/pawpaw.svg?style=flat
[issues-url]: https://github.com/rlayers/pawpaw/issues

[forks-social]: https://img.shields.io/github/forks/rlayers/pawpaw.svg?style=social
[forks-shield]: https://img.shields.io/github/forks/rlayers/pawpaw.svg?color=D2AC70&style=flat
[forks-url]: https://github.com/rlayers/pawpaw/network/members

[license-shield]: https://img.shields.io/github/license/rlayers/pawpaw.svg?color=533E30&style=flat
[license-url]: https://github.com/rlayers/pawpaw/blob/master/LICENSE

[stars-social]: https://img.shields.io/github/stars/rlayers/pawpaw.svg?style=social
[stars-shield]: https://img.shields.io/github/stars/rlayers/pawpaw.svg?style=flat
[stars-url]: https://github.com/rlayers/pawpaw/stargazers

[PyCharm-shield]: https://img.shields.io/badge/PyCharm-000000.svg?&style=flat&logo=PyCharm&logoColor=white
[PyCharm-url]: https://www.jetbrains.com/pycharm/