https://github.com/cneud/alto-tools
Python tools for performing various operations on ALTO XML files
alto-xml digital-library optical-character-recognition
- Host: GitHub
- URL: https://github.com/cneud/alto-tools
- Owner: cneud
- License: apache-2.0
- Created: 2015-09-04T14:40:20.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2025-02-27T19:09:46.000Z (3 months ago)
- Last Synced: 2025-05-08T22:43:19.949Z (27 days ago)
- Topics: alto-xml, digital-library, optical-character-recognition
- Language: Python
- Homepage:
- Size: 144 KB
- Stars: 46
- Watchers: 3
- Forks: 16
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-ocr - alto-tools - Various tools to work with ALTO files, Python (Software / OCR file formats)
README
# ALTO Tools
Python tools for performing various operations on ALTO XML files
---
## Installation
You can install from [PyPI](https://pypi.org/project/alto-tools/) by running
```bash
pip install alto-tools
```

or clone the repository, enter it and run
```bash
pip install .
```

## Usage
```bash
alto-tools INPUT [OPTION]
```

`INPUT` should be the path to an ALTO xml file or directory containing ALTO xml files.
To pipe the output of another command into `alto-tools`, pass the path `-` as the `INPUT` argument, e.g.
```bash
cat tests/data/PPN720183197-PHYS_0004.xml | alto-tools -t -
```

The following `OPTIONS` are currently supported:
| OPTION | Description |
|------------------------|:------------------------------------------------------------------|
| `-t` `--text` | Extract UTF-8 encoded text content |
| `-c` `--confidence` | Extract mean OCR word confidence score |
| `-i` `--illustrations` | Extract bounding box coordinates of `<Illustration>` elements |
| `-g` `--graphics` | Extract bounding box coordinates of `<GraphicalElement>` elements |
| `-s` `--statistics` | Extract statistical info (no. of textlines, words, glyphs etc.) |

All output is sent to `stdout`.
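
To give a feel for what the text, confidence and bounding box options operate on, here is a minimal sketch using only the Python standard library. It is not the code `alto-tools` itself runs. The inline sample document and the helper names (`local_name`, `elements`, `extract_text`, `mean_confidence`, `bounding_boxes`) are assumptions made up for illustration; the attribute names `CONTENT`, `WC`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` come from the ALTO schema.

```python
import statistics
import xml.etree.ElementTree as ET

# Tiny hand-written ALTO-like sample (hypothetical; real files carry far more detail).
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page>
      <PrintSpace>
        <TextBlock>
          <TextLine>
            <String CONTENT="Hello" WC="0.9"/>
            <SP/>
            <String CONTENT="ALTO" WC="0.7"/>
          </TextLine>
          <TextLine>
            <String CONTENT="world" WC="0.8"/>
          </TextLine>
        </TextBlock>
        <Illustration HPOS="10" VPOS="20" WIDTH="300" HEIGHT="200"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Drop the '{namespace}' prefix so the same code handles any ALTO version."""
    return tag.rsplit("}", 1)[-1]


def elements(root, name):
    """Return all descendants of `root` whose local tag name equals `name`."""
    return [el for el in root.iter() if local_name(el.tag) == name]


def extract_text(root):
    """Join the CONTENT of each String element, one output line per TextLine."""
    lines = []
    for line in elements(root, "TextLine"):
        words = [s.get("CONTENT", "") for s in elements(line, "String")]
        lines.append(" ".join(words))
    return lines


def mean_confidence(root):
    """Average the WC (word confidence) attribute over all String elements."""
    scores = [float(s.get("WC")) for s in elements(root, "String")
              if s.get("WC") is not None]
    return statistics.mean(scores) if scores else None


def bounding_boxes(root, name):
    """Collect (HPOS, VPOS, WIDTH, HEIGHT) as strings for each `name` element."""
    keys = ("HPOS", "VPOS", "WIDTH", "HEIGHT")
    return [tuple(el.get(k) for k in keys) for el in elements(root, name)]


root = ET.fromstring(SAMPLE)
print("\n".join(extract_text(root)))
print(mean_confidence(root))
print(bounding_boxes(root, "Illustration"))
```

Matching on the local tag name rather than a fixed namespace is a deliberate choice in this sketch, since ALTO files from different schema versions declare different namespace URIs.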