Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/cneud/alto-tools
Python tools for performing various operations on ALTO XML files
alto-xml digital-library optical-character-recognition
- Host: GitHub
- URL: https://github.com/cneud/alto-tools
- Owner: cneud
- License: apache-2.0
- Created: 2015-09-04T14:40:20.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2023-10-13T15:00:51.000Z (about 1 year ago)
- Last Synced: 2024-10-06T11:35:50.027Z (about 1 month ago)
- Topics: alto-xml, digital-library, optical-character-recognition
- Language: Python
- Homepage:
- Size: 107 KB
- Stars: 39
- Watchers: 3
- Forks: 15
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-ocr - alto-tools - Various tools to work with ALTO files, Python (Software / OCR file formats)
README
# ALTO Tools
Python tools for performing various operations on ALTO XML files
---
## Installation
You can install from [PyPI](https://pypi.org/project/alto-tools/) by running
```bash
pip install alto-tools
```

or clone the repository, change into its directory, and run
```bash
pip install .
```

## Usage
```bash
alto-tools <INPUT> [OPTION]
```

`INPUT` should be the path to an ALTO XML file or a directory containing ALTO XML files.
The following `OPTIONS` are currently supported:
| OPTION | Description |
|------------------------|:------------------------------------------------------------------|
| `-t` `--text` | Extract UTF-8 encoded text content |
| `-c` `--confidence` | Extract mean OCR word confidence score |
| `-i` `--illustrations` | Extract bounding box coordinates of `<Illustration>` elements |
| `-g` `--graphics` | Extract bounding box coordinates of `<GraphicalElement>` elements |
| `-s` `--statistics` | Extract statistical info (no. of textlines, words, glyphs etc.) |

All output is sent to `stdout`.
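
To give a rough idea of what the text, confidence, and illustration options operate on, here is a minimal sketch using only the Python standard library. It is not the alto-tools implementation: the helper names (`local_name`, `iter_elements`, `extract_text`, `mean_word_confidence`, `illustration_boxes`) and the `SAMPLE_ALTO` document are made up for illustration. It assumes the usual ALTO layout, where words are `String` elements carrying `CONTENT` and `WC` attributes inside `TextLine` elements, and where bounding boxes are given by `HPOS`, `VPOS`, `WIDTH`, and `HEIGHT`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written ALTO fragment used only for this example.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="p1">
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine>
            <String CONTENT="Hello" WC="0.9"/>
            <SP/>
            <String CONTENT="world" WC="0.7"/>
          </TextLine>
          <TextLine>
            <String CONTENT="again" WC="0.8"/>
          </TextLine>
        </TextBlock>
        <Illustration ID="i1" HPOS="10" VPOS="20" WIDTH="300" HEIGHT="200"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Drop the '{namespace}' prefix so matching works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def iter_elements(root, name):
    """Yield all descendants of root whose local tag name equals name."""
    return (
        el for el in root.iter()
        if isinstance(el.tag, str) and local_name(el.tag) == name
    )


def extract_text(root):
    """Join the CONTENT of each String per TextLine, one output line per TextLine."""
    lines = []
    for line in iter_elements(root, "TextLine"):
        words = [s.get("CONTENT", "") for s in iter_elements(line, "String")]
        lines.append(" ".join(words))
    return "\n".join(lines)


def mean_word_confidence(root):
    """Average the WC attribute over all String elements that have one."""
    scores = [
        float(s.get("WC"))
        for s in iter_elements(root, "String")
        if s.get("WC") is not None
    ]
    return sum(scores) / len(scores) if scores else None


def illustration_boxes(root):
    """Return (HPOS, VPOS, WIDTH, HEIGHT) as strings for each Illustration."""
    keys = ("HPOS", "VPOS", "WIDTH", "HEIGHT")
    return [
        tuple(el.get(k) for k in keys)
        for el in iter_elements(root, "Illustration")
    ]


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE_ALTO)
    print(extract_text(root))
    print(mean_word_confidence(root))
    print(illustration_boxes(root))
```

Matching on the local tag name rather than a fixed namespace is a deliberate choice here, since ALTO files in the wild declare different schema versions.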