{"id":13415531,"url":"https://github.com/mittagessen/kraken","last_synced_at":"2025-03-14T23:30:53.700Z","repository":{"id":32297407,"uuid":"35872353","full_name":"mittagessen/kraken","owner":"mittagessen","description":"OCR engine for all the languages","archived":false,"fork":false,"pushed_at":"2024-10-21T10:26:04.000Z","size":30286,"stargazers_count":734,"open_issues_count":28,"forks_count":130,"subscribers_count":27,"default_branch":"main","last_synced_at":"2024-10-21T15:10:44.440Z","etag":null,"topics":["alto-xml","handwritten-text-recognition","hocr","htr","layout-analysis","neural-networks","ocr","optical-character-recognition","page-xml"],"latest_commit_sha":null,"homepage":"http://kraken.re","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mittagessen.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2015-05-19T09:24:38.000Z","updated_at":"2024-10-20T18:11:14.000Z","dependencies_parsed_at":"2023-11-14T01:47:15.496Z","dependency_job_id":"30e1cb18-c528-4c91-9eb1-95dd31708a31","html_url":"https://github.com/mittagessen/kraken","commit_stats":null,"previous_names":[],"tags_count":141,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mittagessen%2Fkraken","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mittagessen%2Fkraken/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mittagessen%2Fkraken/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposit
ories/mittagessen%2Fkraken/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mittagessen","download_url":"https://codeload.github.com/mittagessen/kraken/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243663291,"owners_count":20327299,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alto-xml","handwritten-text-recognition","hocr","htr","layout-analysis","neural-networks","ocr","optical-character-recognition","page-xml"],"created_at":"2024-07-30T21:00:50.060Z","updated_at":"2025-03-14T23:30:53.695Z","avatar_url":"https://github.com/mittagessen.png","language":"Python","funding_links":[],"categories":["1. \u003ca name='Software'\u003e\u003c/a\u003eSoftware","Optical Character Recognition Engines and Frameworks","Software","HCR"],"sub_categories":["1.1. \u003ca name='OCRengines'\u003e\u003c/a\u003eOCR engines","CTPN [paper:2016](https://arxiv.org/pdf/1609.03605.pdf)","OCR engines"],"readme":"Description\n===========\n\n.. 
image:: https://github.com/mittagessen/kraken/actions/workflows/test.yml/badge.svg\n    :target: https://github.com/mittagessen/kraken/actions/workflows/test.yml\n\nkraken is a turn-key OCR system optimized for historical and non-Latin script\nmaterial.\n\nkraken's main features are:\n\n  - Fully trainable layout analysis, reading order, and character recognition\n  - `Right-to-Left \u003chttps://en.wikipedia.org/wiki/Right-to-left\u003e`_, `BiDi\n    \u003chttps://en.wikipedia.org/wiki/Bi-directional_text\u003e`_, and Top-to-Bottom\n    script support\n  - `ALTO \u003chttps://www.loc.gov/standards/alto/\u003e`_, PageXML, abbyyXML, and hOCR\n    output\n  - Word bounding boxes and character cuts\n  - Multi-script recognition support\n  - `Public repository \u003chttps://zenodo.org/communities/ocr_models\u003e`_ of model files\n  - Variable recognition network architecture\n\nInstallation\n============\n\nkraken only runs on **Linux or Mac OS X**. Windows is not supported.\n\nThe latest stable releases can be installed from `PyPi \u003chttps://pypi.org\u003e`_:\n\n::\n\n  $ pip install kraken\n\nIf you want direct PDF and multi-image TIFF/JPEG2000 support it is necessary to\ninstall the `pdf` extras package for PyPi:\n\n::\n\n  $ pip install kraken[pdf]\n\nor install `pyvips` manually with pip:\n\n::\n\n  $ pip install pyvips\n\nConda environment files are provided for the seamless installation of the main\nbranch as well:\n\n::\n\n  $ git clone https://github.com/mittagessen/kraken.git\n  $ cd kraken\n  $ conda env create -f environment.yml\n\nor:\n\n::\n\n  $ git clone https://github.com/mittagessen/kraken.git\n  $ cd kraken\n  $ conda env create -f environment_cuda.yml\n\nfor CUDA acceleration with the appropriate hardware.\n\nFinally you'll have to scrounge up a model to do the actual recognition of\ncharacters. 
To download the default model for printed French text and place it\nin the kraken directory for the current user:\n\n::\n\n  $ kraken get 10.5281/zenodo.10592716\n\nA list of libre models available in the central repository can be retrieved by\nrunning:\n\n::\n\n  $ kraken list\n\nQuickstart\n==========\n\nRecognizing text on an image using the default parameters including the\nprerequisite steps of binarization and page segmentation:\n\n::\n\n  $ kraken -i image.tif image.txt binarize segment ocr\n\nTo binarize a single image using the nlbin algorithm:\n\n::\n\n  $ kraken -i image.tif bw.png binarize\n\nTo segment an image (binarized or not) with the new baseline segmenter:\n\n::\n\n  $ kraken -i image.tif lines.json segment -bl\n\n\nTo segment and OCR an image using the default model(s):\n\n::\n\n  $ kraken -i image.tif image.txt segment -bl ocr -m catmus-print-fondue-large.mlmodel\n\nAll subcommands and options are documented. Use the ``help`` option to get more\ninformation.\n\nDocumentation\n=============\n\nHave a look at the `docs \u003chttps://kraken.re\u003e`_.\n\nRelated Software\n================\n\nThese days kraken is quite closely linked to the `eScriptorium\n\u003chttps://gitlab.com/scripta/escriptorium/\u003e`_ project developed in the same eScripta research\ngroup. eScriptorium provides a user-friendly interface for annotating data,\ntraining models, and inference (but also much more). There is a `gitter channel\n\u003chttps://gitter.im/escripta/escriptorium\u003e`_ that is mostly intended for\ncoordinating technical development but is also a spot to find people with\nexperience applying kraken to a wide variety of material.\n\nFunding\n=======\n\nkraken is developed at the `École Pratique des Hautes Études \u003chttps://www.ephe.psl.eu\u003e`_, `Université PSL \u003chttps://www.psl.eu\u003e`_.\n\n.. container:: twocol\n\n   .. container::\n\n        .. 
image:: https://raw.githubusercontent.com/mittagessen/kraken/main/docs/_static/normal-reproduction-low-resolution.jpg\n          :width: 100\n          :alt: Co-financed by the European Union\n\n   .. container::\n\n        This project was funded in part by the European Union (ERC, MiDRASH,\n        project number 101071829).\n\n.. container:: twocol\n\n   .. container::\n\n        .. image:: https://raw.githubusercontent.com/mittagessen/kraken/main/docs/_static/normal-reproduction-low-resolution.jpg\n          :width: 100\n          :alt: Co-financed by the European Union\n\n   .. container::\n\n        This project was partially funded through the RESILIENCE project, funded from\n        the European Union’s Horizon 2020 Framework Programme for Research and\n        Innovation.\n\n\n.. container:: twocol\n\n   .. container::\n\n      .. image:: https://projet.biblissima.fr/sites/default/files/2021-11/biblissima-baseline-sombre-ia.png\n         :width: 400\n         :alt: Received funding from the Programme d’investissements d’Avenir\n\n   .. container::\n\n        This work received state funding managed by the Agence Nationale de la\n        Recherche under the Programme d’Investissements d’Avenir, reference\n        ANR-21-ESRE-0005 (Biblissima+).\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmittagessen%2Fkraken","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmittagessen%2Fkraken","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmittagessen%2Fkraken/lists"}