https://github.com/boppreh/extract_by_pattern
Extract data from texts with no clear field separator
- Host: GitHub
- URL: https://github.com/boppreh/extract_by_pattern
- Owner: boppreh
- Created: 2017-04-20T02:53:58.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2017-04-20T03:08:20.000Z (about 8 years ago)
- Last Synced: 2025-01-25T23:56:40.069Z (4 months ago)
- Language: Python
- Size: 3.91 KB
- Stars: 2
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# extract_by_pattern
Extracts data from monospaced text by giving a similarly formatted example. Useful for extracting structured data from sources with no clear separators.
```python
from extract_by_pattern import extract

str_headers = """
name        age  sex
address
"""

str_data = ["""
John Smith  55   M
5322 Otter Lane
""", """
Jane Smith  57   F
5322 Otter Lane
"""]

items = list(extract(str_headers, str_data))
print(items[1]['name'])
# 'Jane Smith'
```

The `extract_loose` implementation is the default one, and tries to keep chunks of text together, looking for which header is most likely for each chunk, then grouping the chunks under that name. This is what a human would do if there was any empty space between the "fields".
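To illustrate the loose idea, here is a minimal single-line sketch. It is not the library's actual code: the helper names `header_spans` and `extract_line_loose` are hypothetical, and "most likely header" is assumed here to mean the header whose column span overlaps the chunk the most.

```python
import re

def header_spans(header_line):
    """Return (name, start, end) per header; each span runs to the next header."""
    words = [(m.group(), m.start()) for m in re.finditer(r"\S+", header_line)]
    spans = []
    for i, (name, start) in enumerate(words):
        start = 0 if i == 0 else start  # first field always begins at column 0
        end = words[i + 1][1] if i + 1 < len(words) else float("inf")
        spans.append((name, start, end))
    return spans

def extract_line_loose(header_line, data_line):
    """Assign each whitespace-separated chunk to the header it overlaps most."""
    spans = header_spans(header_line)
    fields = {name: [] for name, _, _ in spans}
    for chunk in re.finditer(r"\S+", data_line):
        def overlap(span):
            _, start, end = span
            return max(0, min(chunk.end(), end) - max(chunk.start(), start))
        best_name = max(spans, key=overlap)[0]
        fields[best_name].append(chunk.group())
    # Chunks grouped under the same header are joined back with single spaces.
    return {name: " ".join(chunks) for name, chunks in fields.items()}

header = "name        age  sex"
# Hypothetical row whose name overflows its column by two characters.
print(extract_line_loose(header, "Jane Smith-Doe 57  F"))
# {'name': 'Jane Smith-Doe', 'age': '57', 'sex': 'F'}
```

Because chunks are never cut, a value that spills slightly past its column still ends up whole under the nearest header.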
The `extract_strict` function is not afraid of splitting a word if it crosses the boundary between two headers. Internally it converts the boundaries to a single regular expression, so matching is done very quickly.
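The boundary-to-regex idea can be sketched roughly as follows for a single header line. This is an assumption about the general technique, not the library's implementation: `header_to_regex` and `extract_line_strict` are hypothetical names, and header words are assumed to be valid regex group names.

```python
import re

def header_to_regex(header_line):
    """Compile one header line into a regex with a named group per column."""
    words = [(m.group(), m.start()) for m in re.finditer(r"\S+", header_line)]
    parts = []
    for i, (name, start) in enumerate(words):
        start = 0 if i == 0 else start  # first field always begins at column 0
        if i + 1 < len(words):
            # Fixed-width column: up to the start of the next header.
            width = words[i + 1][1] - start
            parts.append(f"(?P<{name}>.{{0,{width}}})")
        else:
            # Last column takes the rest of the line.
            parts.append(f"(?P<{name}>.*)")
    return re.compile("".join(parts))

def extract_line_strict(header_line, data_line):
    """Cut the data line at the header boundaries, even mid-word."""
    match = header_to_regex(header_line).match(data_line)
    return {name: value.strip() for name, value in match.groupdict().items()}

header = "name        age  sex"
print(header_to_regex(header).pattern)
# (?P<name>.{0,12})(?P<age>.{0,5})(?P<sex>.*)
print(extract_line_strict(header, "John Smith  55   M"))
# {'name': 'John Smith', 'age': '55', 'sex': 'M'}
```

With this approach, a value that overflows its column is cut at the boundary, and the remainder lands in the next field.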