https://github.com/boppreh/extract_by_pattern

Extract data from texts with no clear field separator

Host: GitHub
URL: https://github.com/boppreh/extract_by_pattern
Owner: boppreh
Created: 2017-04-20T02:53:58.000Z (about 8 years ago)
Default Branch: master
Last Pushed: 2017-04-20T03:08:20.000Z (about 8 years ago)
Last Synced: 2025-01-25T23:56:40.069Z (4 months ago)
Language: Python
Size: 3.91 KB
Stars: 2
Watchers: 3
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# extract_by_pattern

Extracts data from monospaced text by giving a similarly formatted example. Useful for extracting structured data from sources with no clear separators.

```python
from extract_by_pattern import extract

str_headers = """
name                 age  sex
address
"""

str_data = ["""
John Smith           55    M
5322 Otter Lane
""", """
Jane Smith           57    F
5322 Otter Lane
"""]

items = list(extract(str_headers, str_data))
print(items[1]['name'])
# 'Jane Smith'
```

The default implementation, `extract_loose`, tries to keep chunks of text together: it works out which header is the most likely match for each chunk, then groups the chunks under that header's name. This is what a human would do if there were empty space between the "fields".
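To make the loose idea concrete, here is a minimal single-line sketch. It is not the library's actual code: `column_bounds` and `loose_split` are hypothetical helpers, and "most likely header" is assumed here to mean the column that overlaps a word the most.

```python
import re

def column_bounds(header_line):
    """Return (name, start, end) per header; a column ends where the next one begins."""
    found = [(m.group(), m.start()) for m in re.finditer(r"\S+", header_line)]
    bounds = []
    for i, (name, start) in enumerate(found):
        start = 0 if i == 0 else start  # first column reaches back to the line start
        end = found[i + 1][1] if i + 1 < len(found) else float("inf")
        bounds.append((name, start, end))
    return bounds

def loose_split(header_line, data_line):
    """Assign each whole word to the column it overlaps most (assumed heuristic)."""
    bounds = column_bounds(header_line)
    fields = {name: [] for name, _, _ in bounds}
    for word in re.finditer(r"\S+", data_line):
        best = max(
            bounds,
            key=lambda b: min(word.end(), b[2]) - max(word.start(), b[1]),
        )
        fields[best[0]].append(word.group())
    # Words that landed under the same header are grouped back together.
    return {name: " ".join(words) for name, words in fields.items()}

header = "name".ljust(21) + "age".ljust(5) + "sex"
row = "Christopher Montgomery 55  M"  # "Montgomery" spills one character into "age"
print(loose_split(header, row))
# {'name': 'Christopher Montgomery', 'age': '55', 'sex': 'M'}
```

Because words are never cut, a name that runs slightly past its column still ends up intact under `name`.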

The `extract_strict` function is not afraid of splitting a word if it crosses the boundary between two headers. Internally it converts the boundaries to a single regular expression, so matching is very fast.
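The boundaries-to-regex idea can be sketched for a single header line as follows. This is an illustration under assumptions, not the library's implementation: `build_strict_pattern` and `strict_split` are hypothetical names, and header names are assumed to be valid Python identifiers so they can serve as regex group names.

```python
import re

def build_strict_pattern(header_line):
    """Compile one regex with a fixed-width named group per header column."""
    found = [(m.group(), m.start()) for m in re.finditer(r"\S+", header_line)]
    parts = []
    for i, (name, start) in enumerate(found):
        start = 0 if i == 0 else start  # first column reaches back to the line start
        if i + 1 < len(found):
            width = found[i + 1][1] - start
            parts.append(f"(?P<{name}>.{{0,{width}}})")
        else:
            parts.append(f"(?P<{name}>.*)")  # last column takes the rest of the line
    return re.compile("".join(parts))

def strict_split(pattern, data_line):
    """Cut the line at the column boundaries, even in the middle of a word."""
    match = pattern.match(data_line)
    return {name: value.strip() for name, value in match.groupdict().items()}

header = "name".ljust(21) + "age".ljust(5) + "sex"
pattern = build_strict_pattern(header)

print(strict_split(pattern, "John Smith".ljust(21) + "55".ljust(5) + "M"))
# {'name': 'John Smith', 'age': '55', 'sex': 'M'}

print(strict_split(pattern, "Christopher Montgomery 55  M"))
# {'name': 'Christopher Montgomer', 'age': 'y 55', 'sex': 'M'}
```

The second call shows the trade-off: the pattern is compiled once and each line needs only one match, but a value wider than its column is cut at the boundary.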
