https://github.com/mittagessen/kraken
OCR engine for all the languages
alto-xml handwritten-text-recognition hocr htr layout-analysis neural-networks ocr optical-character-recognition page-xml
- Host: GitHub
- URL: https://github.com/mittagessen/kraken
- Owner: mittagessen
- License: apache-2.0
- Created: 2015-05-19T09:24:38.000Z (almost 10 years ago)
- Default Branch: main
- Last Pushed: 2024-10-21T10:26:04.000Z (7 months ago)
- Last Synced: 2024-10-21T15:10:44.440Z (7 months ago)
- Topics: alto-xml, handwritten-text-recognition, hocr, htr, layout-analysis, neural-networks, ocr, optical-character-recognition, page-xml
- Language: Python
- Homepage: http://kraken.re
- Size: 28.9 MB
- Stars: 734
- Watchers: 27
- Forks: 130
- Open Issues: 28
Metadata Files:
- Readme: README.rst
- License: LICENSE
- Citation: CITATION.cff
Awesome Lists containing this project
- awesome-ocr - kraken - Ocropus fork with sane defaults (Software / OCR engines)
README
Description
===========

.. image:: https://github.com/mittagessen/kraken/actions/workflows/test.yml/badge.svg
    :target: https://github.com/mittagessen/kraken/actions/workflows/test.yml

kraken is a turn-key OCR system optimized for historical and non-Latin script
material.

kraken's main features are:

- Fully trainable layout analysis, reading order, and character recognition
- Right-to-Left, BiDi, and Top-to-Bottom script support
- ALTO, PageXML, abbyyXML, and hOCR output
- Word bounding boxes and character cuts
- Multi-script recognition support
- Public repository of model files
- Variable recognition network architecture

Installation
============

kraken only runs on **Linux or Mac OS X**. Windows is not supported.

The latest stable releases can be installed from PyPI:

::

    $ pip install kraken
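
Because only Linux and macOS are supported, a setup script may want to check
the platform before calling ``pip``. The following is a minimal sketch of such
a guard and is not part of kraken itself; the names ``SUPPORTED_SYSTEMS`` and
``kraken_supported`` are hypothetical:

.. code-block:: python

    from __future__ import annotations

    import platform

    # platform.system() reports "Darwin" on macOS and "Linux" on Linux.
    SUPPORTED_SYSTEMS = {"Linux", "Darwin"}


    def kraken_supported(system: str | None = None) -> bool:
        """Return True if the given (or current) system is Linux or macOS."""
        name = platform.system() if system is None else system
        return name in SUPPORTED_SYSTEMS


    if __name__ == "__main__":
        if not kraken_supported():
            raise SystemExit("kraken only runs on Linux or macOS")
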
If you want direct PDF and multi-image TIFF/JPEG2000 support, it is necessary to
install the `pdf` extras package from PyPI:

::

    $ pip install kraken[pdf]

or install `pyvips` manually with pip:

::

    $ pip install pyvips

Conda environment files are provided for the seamless installation of the main
branch as well:

::

    $ git clone https://github.com/mittagessen/kraken.git
    $ cd kraken
    $ conda env create -f environment.yml

or:

::

    $ git clone https://github.com/mittagessen/kraken.git
    $ cd kraken
    $ conda env create -f environment_cuda.yml

for CUDA acceleration with the appropriate hardware.

Finally, you'll have to scrounge up a model to do the actual recognition of
characters. To download the default model for printed French text and place it
in the kraken directory for the current user:

::

    $ kraken get 10.5281/zenodo.10592716

A list of libre models available in the central repository can be retrieved by
running:

::

    $ kraken list
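
When the download is scripted, it can help to validate the identifier before
invoking the CLI. The sketch below only builds the ``kraken get`` argument list
shown above; the helper name ``model_get_command`` and the DOI pattern are
assumptions made for illustration, not part of kraken:

.. code-block:: python

    from __future__ import annotations

    import re

    # Loose DOI shape check (assumption): "10.", a registrant code, "/", a suffix.
    DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


    def model_get_command(doi: str) -> list[str]:
        """Return the argument list for ``kraken get <doi>``."""
        if not DOI_PATTERN.match(doi):
            raise ValueError(f"not a DOI: {doi!r}")
        return ["kraken", "get", doi]

The returned list can be passed to ``subprocess.run(..., check=True)`` on a
machine where kraken is installed.
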

Quickstart
==========

Recognizing text on an image using the default parameters including the
prerequisite steps of binarization and page segmentation:

::

    $ kraken -i image.tif image.txt binarize segment ocr

To binarize a single image using the nlbin algorithm:

::

    $ kraken -i image.tif bw.png binarize

To segment an image (binarized or not) with the new baseline segmenter:

::

    $ kraken -i image.tif lines.json segment -bl

To segment and OCR an image using the default model(s):

::

    $ kraken -i image.tif image.txt segment -bl ocr -m catmus-print-fondue-large.mlmodel
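
To apply the last command to a whole directory of scans, the invocation can be
generated per image. This is a rough sketch that uses only the flags shown
above; the function names, the ``*.tif`` pattern, and the ``dry_run`` switch
are hypothetical, and nothing is executed unless ``dry_run`` is set to
``False``:

.. code-block:: python

    from __future__ import annotations

    import subprocess
    from pathlib import Path


    def build_ocr_command(image: Path, model: str, out_dir: Path) -> list[str]:
        """Return the argument list for one segment-and-recognize run."""
        output = out_dir / (image.stem + ".txt")
        return ["kraken", "-i", str(image), str(output),
                "segment", "-bl", "ocr", "-m", model]


    def ocr_directory(in_dir: Path, out_dir: Path, model: str,
                      pattern: str = "*.tif",
                      dry_run: bool = True) -> list[list[str]]:
        """Build one command per matching image and optionally run them."""
        images = sorted(in_dir.glob(pattern))
        commands = [build_ocr_command(p, model, out_dir) for p in images]
        if not dry_run:
            out_dir.mkdir(parents=True, exist_ok=True)
            for cmd in commands:
                # check=True stops at the first image kraken fails on.
                subprocess.run(cmd, check=True)
        return commands

Building the argument lists separately from running them makes it possible to
inspect the commands first.
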

All subcommands and options are documented. Use the ``--help`` option to get more
information.

Documentation
=============

Have a look at the docs.

Related Software
================

These days kraken is quite closely linked to the eScriptorium project developed
in the same eScripta research group. eScriptorium provides a user-friendly
interface for annotating data, training models, and inference (but also much
more). There is a Gitter channel that is mostly intended for coordinating
technical development but is also a spot to find people with experience
applying kraken to a wide variety of material.

Funding
=======

kraken is developed at the École Pratique des Hautes Études, Université PSL.

.. container:: twocol

    .. container::

        .. image:: https://raw.githubusercontent.com/mittagessen/kraken/main/docs/_static/normal-reproduction-low-resolution.jpg
            :width: 100
            :alt: Co-financed by the European Union

    .. container::

        This project was funded in part by the European Union. (ERC, MiDRASH,
        project number 101071829).

.. container:: twocol

    .. container::

        .. image:: https://raw.githubusercontent.com/mittagessen/kraken/main/docs/_static/normal-reproduction-low-resolution.jpg
            :width: 100
            :alt: Co-financed by the European Union

    .. container::

        This project was partially funded through the RESILIENCE project, funded from
        the European Union’s Horizon 2020 Framework Programme for Research and
        Innovation.

.. container:: twocol

    .. container::

        .. image:: https://projet.biblissima.fr/sites/default/files/2021-11/biblissima-baseline-sombre-ia.png
            :width: 400
            :alt: Received funding from the Programme d’investissements d’Avenir

    .. container::

        This work benefited from State aid managed by the Agence Nationale de la
        Recherche under the Programme d’Investissements d’Avenir, reference
        ANR-21-ESRE-0005 (Biblissima+).