https://github.com/scdh/pygexml
Small pythonic wrapper around PAGE XML
- Host: GitHub
- URL: https://github.com/scdh/pygexml
- Owner: SCDH
- License: MIT
- Created: 2026-02-18T18:22:07.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2026-02-19T15:14:00.000Z (4 months ago)
- Last Synced: 2026-02-19T15:46:44.978Z (4 months ago)
- Topics: page-xml, parser, python, xml
- Language: Python
- Homepage: https://scdh.github.io/pygexml/
- Size: 18.6 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
# pygexml
A minimal Python wrapper around the [PAGE-XML][page-xml] format for OCR output.
[![pygexml checks, tests and docs][workflows-badge]][workflows] [![API docs online][api-docs-badge]][api-docs]
## Installation
```
pip install pygexml
```
Requires Python 3.12+.
## Usage
```python
from pygexml import Page

# xml_string is assumed to hold the contents of a PAGE-XML file as a str
page = Page.from_xml_string(xml_string)
for line in page.all_text():
    print(line)
```
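The snippet above expects `xml_string` to contain a PAGE-XML document. For readers unfamiliar with the format, here is a minimal sketch of such a document and of the line text it carries, using only the standard library. It is not how pygexml is implemented. The sample document, the `line_texts` helper and the choice of the 2019-07-15 schema namespace are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the 2019-07-15 PAGE schema (assumed here; other schema
# versions use a different date suffix).
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
NS = {"pc": PAGE_NS}

# Hand-written sample document, illustrative only.
xml_string = f"""<PcGts xmlns="{PAGE_NS}">
  <Page imageFilename="sample.png" imageWidth="800" imageHeight="600">
    <TextRegion id="r1">
      <Coords points="10,10 300,10 300,90 10,90"/>
      <TextLine id="r1l1">
        <Coords points="10,10 300,10 300,45 10,45"/>
        <TextEquiv><Unicode>Hello PAGE</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2">
        <Coords points="10,50 300,50 300,90 10,90"/>
        <TextEquiv><Unicode>second line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def line_texts(xml: str) -> list[tuple[str, str]]:
    """Return (line id, text) pairs for every TextLine, in document order."""
    root = ET.fromstring(xml)
    result = []
    for line in root.iterfind(".//pc:TextLine", NS):
        # Only the TextEquiv that is a direct child of the line is used,
        # so word-level transcriptions (if present) are ignored.
        unicode_el = line.find("pc:TextEquiv/pc:Unicode", NS)
        text = unicode_el.text if unicode_el is not None and unicode_el.text else ""
        result.append((line.get("id", ""), text))
    return result


for line_id, text in line_texts(xml_string):
    print(line_id, text)
```

A string like this sample is the kind of input `Page.from_xml_string` is meant to receive, although which schema versions pygexml accepts is not stated here. See the API docs for that.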
### Data model
| Class | Import from |
|---|---|
| `Page` | `pygexml` |
| `Page`, `TextRegion`, `TextLine`, `Coords` | `pygexml.page` |
| `Point`, `Box`, `Polygon` | `pygexml.geometry` |
`Page`, `TextRegion` and `TextLine` each expose `all_text()` and `all_words()` iterators.
Lookups by ID are available via `lookup_region()` and `lookup_textline()`.
Refer to the [online API docs][api-docs] for details.
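For background on the geometry classes: recent PAGE-XML schema versions store outlines in the `points` attribute of `Coords` as space-separated `x,y` pairs. The following standalone sketch shows how such a string maps to points and an axis-aligned bounding box. `SketchPoint`, `parse_points` and `bounding_box` are hypothetical names used only for this illustration. They are not the `Point`, `Box` or `Polygon` API of `pygexml.geometry`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchPoint:
    """Illustrative stand-in for a 2D point; not pygexml's Point class."""
    x: int
    y: int


def parse_points(points: str) -> list[SketchPoint]:
    """Parse a PAGE-XML points string such as '10,10 300,10 300,45 10,45'."""
    result = []
    for pair in points.split():
        x, y = pair.split(",")
        result.append(SketchPoint(int(x), int(y)))
    return result


def bounding_box(points: list[SketchPoint]) -> tuple[SketchPoint, SketchPoint]:
    """Return the (top-left, bottom-right) corners of the enclosing box."""
    xs = [p.x for p in points]
    ys = [p.y for p in points]
    return SketchPoint(min(xs), min(ys)), SketchPoint(max(xs), max(ys))


outline = parse_points("10,10 300,10 300,45 10,45")
top_left, bottom_right = bounding_box(outline)
print(top_left, bottom_right)
```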
## Development
```bash
pip install ".[dev,test,docs]"
black pygexml test test_util # format
mypy pygexml test test_util # type check
pyright pygexml test test_util # type check
pytest -v # tests
pdoc -o .api_docs pygexml/* # API docs
```
CI runs on Python 3.12, 3.13 and 3.14. [API documentation][api-docs] is published to GitHub Pages on every push to `main`.
## Contributing
[Bug reports, feature requests][gh-issues] and [pull requests][gh-prs] are welcome. Feel free to open draft pull requests early to invite discussion and collaboration.
Please note that this project has a [Code of Conduct](CODE_OF_CONDUCT.md).
## Copyright and License
Copyright (c) 2026 [Mirko Westermeier][gh-memowe] (SCDH, University of Münster)
Released under the [MIT License](LICENSE).
[page-xml]: https://github.com/PRImA-Research-Lab/PAGE-XML
[workflows]: https://github.com/SCDH/pygexml/actions/workflows/ci.yml
[workflows-badge]: https://github.com/SCDH/pygexml/actions/workflows/ci.yml/badge.svg
[api-docs]: https://scdh.github.io/pygexml
[api-docs-badge]: https://img.shields.io/badge/API%20docs-online-blue?logo=gitbook&logoColor=lightgrey
[gh-issues]: https://github.com/SCDH/pygexml/issues
[gh-prs]: https://github.com/SCDH/pygexml/pulls
[gh-memowe]: https://github.com/memowe