Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/gatenlp/gateplugin-format_alto
GATE document format for reading ALTO XML documents
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/gatenlp/gateplugin-format_alto
- Owner: GateNLP
- License: lgpl-3.0
- Created: 2019-01-22T16:03:10.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2023-02-10T12:00:01.000Z (almost 2 years ago)
- Last Synced: 2024-04-16T07:59:24.405Z (9 months ago)
- Language: Java
- Homepage:
- Size: 18.6 KB
- Stars: 0
- Watchers: 13
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# GATE Support for ALTO XML documents
This plugin provides support for reading documents stored as [ALTO XML](http://loc.gov/standards/alto). The format is usually used to store OCR-based transcriptions of documents, so it records the position of the text within the page as well as the text itself. It's popular among libraries and museums as a way of providing digital copies of scanned documents and manuscripts. For example, the [British Library](https://www.bl.uk/) offers a number of [collections of digitised books](https://data.bl.uk/digbks/) in this format.
The code provided by this plugin focuses purely on the text content of ALTO XML files and completely ignores the positional information. Specifically, it reads the `String` elements that appear within `TextBlock` elements inside the `PrintSpace` of each page. This means that text in the header, footer, and margins is ignored. This choice is based on previous experience with processing multi-page formats (such as PDFs), where headers and footers make it exceptionally difficult to process text that flows across pages. This may change in future versions.
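To make the selection rule above concrete, here is a minimal sketch that uses only the JDK's built-in XML APIs. It is not the plugin's actual implementation: the class `AltoTextSketch` and the method `extractText` are hypothetical names chosen for illustration. It assumes the text of each `String` element is held in its `CONTENT` attribute, as in the ALTO schema. Joining words with a single space and returning one entry per `TextBlock` are also assumptions, not documented plugin behaviour.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; this is not code from the plugin.
final class AltoTextSketch {

    // Returns one entry per TextBlock found inside a PrintSpace element.
    // Blocks in margins (outside PrintSpace) are never selected.
    static List<String> extractText(String altoXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Namespace awareness lets local-name() match regardless of
            // which ALTO namespace version the file declares.
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(altoXml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList blocks = (NodeList) xpath.evaluate(
                    "//*[local-name()='PrintSpace']//*[local-name()='TextBlock']",
                    doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < blocks.getLength(); i++) {
                NodeList strings = (NodeList) xpath.evaluate(
                        ".//*[local-name()='String']",
                        blocks.item(i), XPathConstants.NODESET);
                StringBuilder text = new StringBuilder();
                for (int j = 0; j < strings.getLength(); j++) {
                    // Assumption: the word text is in the CONTENT attribute.
                    String content = ((Element) strings.item(j)).getAttribute("CONTENT");
                    if (content.isEmpty()) {
                        continue;
                    }
                    if (text.length() > 0) {
                        text.append(' '); // assumed separator between words
                    }
                    text.append(content);
                }
                result.add(text.toString());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse ALTO XML", e);
        }
    }

    public static void main(String[] args) {
        String sample =
                "<alto><Layout><Page>"
              + "<TopMargin><TextBlock><TextLine><String CONTENT=\"Header\"/></TextLine></TextBlock></TopMargin>"
              + "<PrintSpace><TextBlock><TextLine>"
              + "<String CONTENT=\"Hello\"/><SP/><String CONTENT=\"world\"/>"
              + "</TextLine></TextBlock></PrintSpace>"
              + "</Page></Layout></alto>";
        // Only the PrintSpace block is selected; the TopMargin block is skipped.
        extractText(sample).forEach(System.out::println);
    }
}
```

Matching on `local-name()` keeps the sketch independent of the namespace URI, which differs between ALTO schema versions. A real reader would probably bind the namespace explicitly instead.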
Once the plugin is loaded, activate it by setting the MIME type to `application/xml+alto` when loading documents.