https://github.com/spaceshaman/deckard
Extract structured data from unstructured text: no AI, just regular expressions.
- Host: GitHub
- URL: https://github.com/spaceshaman/deckard
- Owner: SpaceShaman
- License: mit
- Created: 2025-08-10T18:36:48.000Z (5 months ago)
- Default Branch: master
- Last Pushed: 2025-08-17T09:23:54.000Z (5 months ago)
- Last Synced: 2025-08-17T09:28:25.578Z (5 months ago)
- Topics: data-extraction, extract, extract-data, regex, regular-expression
- Language: Python
- Homepage: https://pypi.org/project/deckard
- Size: 33.2 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Deckard
Extract structured data from unstructured text: no AI, just regular expressions.
Deckard is a library of regular-expression patterns for extracting structured data (addresses, phone numbers, email addresses, etc.) and a small set of helper utilities that make using those patterns easier.
> [!IMPORTANT]
> Status: very early-stage project. Right now the repository contains mostly patterns for Poland. I am looking for contributors from around the world. Address formats, phone-number formats and other data representations differ by country, so the goal is to gather country-specific patterns for many regions.
## Key features
- A collection of ready-to-use regex patterns organized by country (for example [`deckard/patterns/pl.py`](./deckard/patterns/pl.py)).
- Universal patterns (e.g. email) live in [`deckard/patterns/standard.py`](./deckard/patterns/standard.py).
- A small helper function `deckard.search` that combines multiple patterns and returns named-group matches ([deckard/main.py](./deckard/main.py)).
## Installation
From PyPI:
```bash
pip install deckard
```
Editable / local development install:
```bash
pip install -e .
```
### For contributors: install dependencies with Poetry
This project uses Poetry to manage its runtime and development dependencies.
1. Install Poetry (see https://python-poetry.org for instructions).
2. From the project root run:
```bash
poetry install
```
This will create a virtual environment and install runtime and development dependencies (including `pytest`).
To run tests using Poetry:
```bash
poetry run pytest
```
Or start a shell in the created virtualenv and run tests directly (on Poetry 2.x the `shell` command is provided by the separate `poetry-plugin-shell` plugin):
```bash
poetry shell
pytest
```
## Quick usage
Example using the current public API:
```python
from deckard import search
from deckard.patterns import standard, pl

text = (
    "Hello, my email is spaceshaman@tuta.io and my phone number is "
    "+48 792 321 321 and my address is ul. Tesotowa 12/6A, 66-700 Bielsko-Biała."
)

result = search([standard.EMAIL, pl.MOBILE_PHONE, pl.ADDRESS], text)

# result.groupdict() will return a dict of named groups, for example:
# {
#     'email': 'spaceshaman@tuta.io',
#     'mobile_phone': '792 321 321',
#     'street': 'ul. Tesotowa',
#     'building': '12',
#     'apartment': '6A',
#     'zip_code': '66-700',
#     'city': 'Bielsko-Biała'
# }
```
The `search` helper composes the provided patterns into a single regex (using lookaheads) and returns the first match as a `regex.Match` object (or `None` if nothing matched).
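The lookahead composition described above can be sketched with the standard-library `re` module. This is an illustration under assumptions, not Deckard's actual implementation: the `search_all` function and the `EMAIL` and `ZIP_CODE` patterns below are hypothetical stand-ins, and the real helper may build its regex differently.

```python
import re

# Illustrative stand-in patterns; these are NOT Deckard's actual patterns.
EMAIL = r"(?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"
ZIP_CODE = r"(?P<zip_code>\d{2}-\d{3})"


def search_all(patterns, text):
    """Combine patterns so each one may match anywhere in text, in any order."""
    # Each pattern is wrapped in a lookahead that starts at the same position.
    # Lookaheads consume no input, so every pattern scans the text
    # independently, and named groups captured inside them are retained.
    combined = "".join(f"(?=.*?{pattern})" for pattern in patterns)
    return re.search(combined, text, flags=re.DOTALL)


match = search_all([ZIP_CODE, EMAIL], "Write to jan@example.com, 00-001 Warszawa.")
print(match.groupdict() if match else None)
```

Because every lookahead must succeed, this sketch returns `None` as soon as one of the patterns is absent from the text, so the result should be checked before calling `groupdict()`.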
## Repository layout
- [`deckard/`](./deckard/): library code
- [`deckard/main.py`](./deckard/main.py): helper `search` function
- [`deckard/patterns/standard.py`](./deckard/patterns/standard.py): universal patterns (e.g. `EMAIL`)
- [`deckard/patterns/pl.py`](./deckard/patterns/pl.py): Poland-specific patterns (address, postal code, phone, etc.)
- [`tests/`](./tests/): unit tests
Examples of existing tests:
- [`tests/test_standard_patterns.py`](./tests/test_standard_patterns.py): test for `standard.EMAIL`
- [`tests/test_search_with_multiple_patterns.py`](./tests/test_search_with_multiple_patterns.py): integration tests combining `standard.EMAIL` with patterns from `pl.py`
- [`tests/pl/test_search_address_pl.py`](./tests/pl/test_search_address_pl.py): tests for Polish address patterns
Every new pattern must come with tests. Pull requests without tests will not be accepted.
## Contributing: how to add new patterns
1. Create a new file under [`deckard/patterns/`](./deckard/patterns/) named by the country code, e.g. `us.py`, `de.py`, `fr.py`.
2. Define constants (UPPERCASE) for each pattern, for example `MOBILE_PHONE`, `ADDRESS`, `ZIP_CODE`.
3. Add tests under `tests/`. Use the existing Polish tests (e.g. `tests/test_search_with_multiple_patterns.py`) as a template. Provide normal and edge-case examples.
4. In the PR description explain local rules (phone number format, postal code format, common street abbreviations, etc.).
5. PRs without tests will not be accepted.
Tips:
- Use clear, consistent named groups in regexes (`(?P<name>...)`) so `groupdict()` returns a predictable structure.
- Document complex patterns with comments and example inputs if necessary.
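As a rough sketch of the steps above, a new country module and its tests might look like the following. Everything here is hypothetical: the `us.py` contents, the test names and the suggested file locations are illustrations only. Whether Deckard stores patterns as raw strings or compiled objects is not stated in this README, so the sketch assumes raw strings and exercises them with the standard-library `re` module.

```python
import re

# Hypothetical contents of deckard/patterns/us.py (illustrative only).
# ZIP code: five digits, optionally followed by a hyphen and four more
# digits (ZIP+4), e.g. "90210" or "62704-1234".
ZIP_CODE = r"\b(?P<zip_code>\d{5}(?:-\d{4})?)\b"


# Hypothetical tests, e.g. in tests/us/test_zip_code_us.py, runnable with pytest.
def test_zip_code_plain():
    match = re.search(ZIP_CODE, "Beverly Hills, CA 90210")
    assert match is not None
    assert match.groupdict() == {"zip_code": "90210"}


def test_zip_code_plus_four():
    match = re.search(ZIP_CODE, "Springfield, IL 62704-1234")
    assert match is not None
    assert match.groupdict() == {"zip_code": "62704-1234"}


def test_zip_code_rejects_longer_digit_runs():
    # Edge case: a six-digit number is not a ZIP code.
    assert re.search(ZIP_CODE, "Order 123456 shipped") is None
```

The named group `zip_code` reuses a group name already shown in the Polish example, which keeps the `groupdict()` structure the same across countries.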
## Discussion and roadmap
The project is not yet final, and everything is open for discussion. Areas for contributors and discussion include:
- Defining a minimal set of patterns every country should provide (email, phone, address, postal code, national ID where applicable).
- Standardizing group names (`street`, `building`, `apartment`, `zip_code`, `city`, `country`, `mobile_phone`, etc.).
- Tools for validation and normalization of extracted values.
- Automating tests with sample documents in various languages.
If you want to help, open an issue or a PR. A short description of the local data format and one or two patterns with tests is a great place to start.
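To make the normalization idea from the roadmap more concrete, here is one possible shape for such a tool. It is only a sketch for discussion: the `normalize` function is hypothetical and not part of Deckard, and it assumes the group names `email` and `mobile_phone` from the usage example.

```python
import re


def normalize(groups):
    """Return a copy of a groupdict() result with simple normalizations applied."""
    normalized = dict(groups)
    if normalized.get("email"):
        # Assumes mailboxes are case-insensitive, which is common in
        # practice but not guaranteed for the local part of an address.
        normalized["email"] = normalized["email"].strip().lower()
    if normalized.get("mobile_phone"):
        # Keep digits only, dropping spaces, hyphens and other separators.
        normalized["mobile_phone"] = re.sub(r"\D", "", normalized["mobile_phone"])
    return normalized


print(normalize({"email": "SpaceShaman@Tuta.io", "mobile_phone": "792 321 321"}))
```

Returning a new dict instead of mutating the input keeps the raw extracted values available for validation or debugging.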
## License
This project is licensed under the MIT License. See the [LICENSE](./LICENSE) file for the full text.
---
Thanks for your interest, and please join the effort. Together we can build an international library of patterns to extract structured data from arbitrary text using robust regular expressions.