{"id":13415665,"url":"https://github.com/mauvilsa/tesseract-recognize","last_synced_at":"2025-05-05T20:10:25.837Z","repository":{"id":45647710,"uuid":"78739814","full_name":"mauvilsa/tesseract-recognize","owner":"mauvilsa","description":"Tool that does layout analysis and/or text recognition using tesseract and outputs the result in Page XML format","archived":false,"fork":false,"pushed_at":"2024-04-16T15:54:16.000Z","size":183,"stargazers_count":46,"open_issues_count":0,"forks_count":8,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-03-30T23:22:38.492Z","etag":null,"topics":["cli","docker-image","document-recognition","ocr","optical-character-recognition","pagexml","tesseract","text-detection"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mauvilsa.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-01-12T11:41:46.000Z","updated_at":"2025-01-10T20:29:54.000Z","dependencies_parsed_at":"2024-10-24T08:31:24.307Z","dependency_job_id":"a0799fe8-a335-4cec-bee0-a38107ad56e2","html_url":"https://github.com/mauvilsa/tesseract-recognize","commit_stats":{"total_commits":67,"total_committers":3,"mean_commits":"22.333333333333332","dds":0.04477611940298509,"last_synced_commit":"a7119edbd291b762091f38c3a7c90dbf2e0b3dce"},"previous_names":[],"tags_count":23,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mauvilsa%2Ftesseract-recognize","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/repositories/mauvilsa%2Ftesseract-recognize/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mauvilsa%2Ftesseract-recognize/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mauvilsa%2Ftesseract-recognize/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mauvilsa","download_url":"https://codeload.github.com/mauvilsa/tesseract-recognize/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252569645,"owners_count":21769517,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","docker-image","document-recognition","ocr","optical-character-recognition","pagexml","tesseract","text-detection"],"created_at":"2024-07-30T21:00:51.195Z","updated_at":"2025-05-05T20:10:25.821Z","avatar_url":"https://github.com/mauvilsa.png","language":"C++","funding_links":[],"categories":["1. \u003ca name='Software'\u003e\u003c/a\u003eSoftware","Software"],"sub_categories":["1.4. \u003ca name='OCRCLI'\u003e\u003c/a\u003eOCR CLI","OCR CLI"],"readme":"# NAME\n\ntesseract-recognize - A tool that does layout analysis and/or text recognition using tesseract and outputs the result in Page XML format.\n\n\n# Requirements (Ubuntu 18.04 \u0026 20.04 \u0026 22.04 \u0026 24.04)\n\n## Build\n\n- make\n- cmake\n- g++\n- libtesseract-dev\n- libgs-dev\n- libxslt1-dev\n\n## Runtime\n\n- tesseract-ocr\n- ghostscript\n- libxslt1.1\n\n\n# Installation and usage\n\nTo compile from source follow the instructions here. 
If you only want the tool\nit might be simpler to use docker as explained in the next section.\n\n    git clone --recursive https://github.com/mauvilsa/tesseract-recognize\n    mkdir tesseract-recognize/build\n    cd tesseract-recognize/build\n    cmake -DCMAKE_INSTALL_PREFIX:PATH=$HOME ..\n    make install\n    \n    tesseract-recognize --help\n    tesseract-recognize IMAGE1 IMAGE2 -o OUTPUT.xml\n    tesseract-recognize INPUT.xml -o OUTPUT.xml\n\n\n# Installation and usage (docker)\n\nThe latest docker images are based on Ubuntu 24.04 and use the version of\ntesseract from the default package repositories (see the respective [docker hub\npage](https://hub.docker.com/r/mauvilsa/tesseract-recognize/)).\n\nTo install first pull the docker image of your choosing, using a command such\nas:\n\n    TAG=\"SELECTED_TAG_HERE\"\n    docker pull mauvilsa/tesseract-recognize:$TAG\n\nThe basic docker image only includes language files for recognition of English,\nso for additional languages you need to provide to the docker container the\ncorresponding tessdata files. There is also an additional docker image that can\nbe used to create a volume that includes all languages from the tesseract-ocr-*\nubuntu packages. 
To create this volume run the following:\n\n    docker pull mauvilsa/tesseract-recognize:$TAG-langs\n    docker run \\\n      --rm \\\n      --mount source=tesseract-ocr-tessdata,destination=/opt/tesseract-ocr/tessdata \\\n      -it mauvilsa/tesseract-recognize:$TAG-langs\n\nThere are two ways of using the tesseract-recognize docker image:\nthrough a command line interface or through a REST API, as explained in the next\ntwo sections.\n\n\n## Command line interface\n\nFirst download\n[docker-cli](https://github.com/omni-us/docker-command-line-interface), put it\nin some directory in your path and make it executable, for example:\n\n    wget -O $HOME/.local/bin/docker-cli https://raw.githubusercontent.com/omni-us/docker-command-line-interface/master/docker-cli\n    chmod +x $HOME/.local/bin/docker-cli\n\nAs an additional step, you could look at `docker-cli --help` and read about how\nto configure bash completion.\n\nAfter installing docker-cli, the tesseract-recognize tool can be used like any\nother command, for example:\n\n    docker-cli \\\n      --ipc=host \\\n      -- mauvilsa/tesseract-recognize:$TAG \\\n      tesseract-recognize IMAGE -o OUTPUT.xml\n\nTo recognize other languages, use the tessdata volume mentioned previously\nas follows:\n\n    docker-cli \\\n      --ipc=host \\\n      --mount source=tesseract-ocr-tessdata,destination=/opt/tesseract-ocr/tessdata \\\n      --env TESSDATA_PREFIX=/opt/tesseract-ocr/tessdata \\\n      -- mauvilsa/tesseract-recognize:$TAG \\\n      tesseract-recognize --lang deu IMAGE -o OUTPUT.xml\n\nFor convenience you could set up an alias, for example:\n\n    alias tesseract-recognize-docker=\"docker-cli --ipc=host --mount source=tesseract-ocr-tessdata,destination=/opt/tesseract-ocr/tessdata --env TESSDATA_PREFIX=/opt/tesseract-ocr/tessdata -- mauvilsa/tesseract-recognize:$TAG tesseract-recognize\"\n    tesseract-recognize-docker --help\n\n\n## API interface\n\nThe API interface uses a Python Flask server that can be accessed 
through port\n5000 inside the docker container. For example the server could be started as:\n\n    docker run --rm -t -p 5000:5000 mauvilsa/tesseract-recognize:$TAG \n\nThe API exposes the following endpoints:\n\nMethod | Endpoint                          | Description                      | Parameters (form fields)\n------ | --------------------------------- | -------------------------------- | ------------------------\nGET    | /tesseract-recognize/version      | Returns tool version information | -\nGET    | /tesseract-recognize/help         | Returns tool help                | -\nGET    | /tesseract-recognize/swagger.json | The swagger json                 | -\nPOST   | /tesseract-recognize/process      | Recognize given images or xml    | **images (array, required):** Image files with names as in page xml. **pagexml (optional):** Page xml file to recognize. **options (optional):** Array of strings with options for the tesseract-recognize tool.\n\nFor illustration purposes the curl command can be used. 
To process an input\nimage with a non-default layout level, send a POST request such as:\n\n    curl -o output.xml -F images=@img.png -F options='[\"--layout\", \"word\"]' http://localhost:5000/tesseract-recognize/process\n\nTo process a page xml file, both the xml and the respective images should be\nincluded in the request, for example:\n\n    curl -o output.xml -F images=@img1.png -F images=@img2.png -F pagexml=@input.xml http://localhost:5000/tesseract-recognize/process\n\nThe API is implemented using Flask-RESTPlus, so once the server is\nstarted you can use a browser to get a more detailed view of the exposed\nendpoints by going to http://localhost:5000/tesseract-recognize/swagger.\n\n\n# Viewing results\n\nThe results can be viewed/edited using the Page XML editor available at\nhttps://github.com/mauvilsa/nw-page-editor or using other tools that support\nthis format such as http://www.primaresearch.org/tools and\nhttps://transkribus.eu/Transkribus/ .\n\n\n# Contributing\n\nIf you intend to contribute, before any commits be sure to first execute\ngithook-pre-commit to set up (symlink) the pre-commit hook. This hook takes care\nof automatically updating the tool version.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmauvilsa%2Ftesseract-recognize","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmauvilsa%2Ftesseract-recognize","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmauvilsa%2Ftesseract-recognize/lists"}