{"id":15113837,"url":"https://github.com/maxim2266/ocr","last_synced_at":"2026-02-21T04:33:48.753Z","repository":{"id":41368815,"uuid":"94795735","full_name":"maxim2266/OCR","owner":"maxim2266","description":"A collection of tools for OCR (optical character recognition).","archived":false,"fork":false,"pushed_at":"2024-10-17T14:39:24.000Z","size":74,"stargazers_count":30,"open_issues_count":0,"forks_count":0,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-01-30T17:29:08.281Z","etag":null,"topics":["bash-script","c","extract-text","linux","ocr","ocr-recognition","tesseract"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/maxim2266.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-06-19T16:06:38.000Z","updated_at":"2024-10-17T14:39:34.000Z","dependencies_parsed_at":"2022-09-16T08:22:47.954Z","dependency_job_id":null,"html_url":"https://github.com/maxim2266/OCR","commit_stats":null,"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maxim2266%2FOCR","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maxim2266%2FOCR/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maxim2266%2FOCR/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maxim2266%2FOCR/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/maxim2266","download_url":"https://codeload.github.com/maxim2266/OCR/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https:/
/github.com","kind":"github","repositories_count":237784795,"owners_count":19365931,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bash-script","c","extract-text","linux","ocr","ocr-recognition","tesseract"],"created_at":"2024-09-26T01:23:30.460Z","updated_at":"2025-10-23T05:31:08.891Z","avatar_url":"https://github.com/maxim2266.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"# OCR tools\nThe ever growing collection of tools to perform [OCR](https://en.wikipedia.org/wiki/Optical_character_recognition).\n\n### Motivation\n\nAchieving a good quality OCR in one go is not easy. Depending on the quality of the input,\nthe process may include a number of iterations to improve the original image(s) in order\nto achieve reasonable recognition quality, followed by some (often manual) correction of\nthe recognised text to remove various OCR errors. 
This is not a massive problem when\ndigitising a page or two, but processing a book of 500 pages makes things a lot harder.\nThis project aims to help with complex OCR projects, but instead of providing one monolithic\ntool that would include all the processing a user can possibly want, here we develop a number of\nsmaller instruments that perform only the obviously needed steps, like OCR itself, while also\nallowing user-defined processing to be integrated into the pipeline.\n\n### Tools\n\nThe toolset wraps around a number of well-known programs that perform tasks like PDF\nor image processing, character recognition, etc., aiming to create an environment for\niterative processing of large documents with the ability to utilise custom scripts.\n\nFor example, given a document `text.pdf`, the simplest OCR session may look like\nthe following:\n\n```sh\n▶ mkdir book \u0026\u0026 cd book\n▶ ocr-open ../text.pdf\nocr-open: processing file \"../text.pdf\"\nocr-open: extracting all pages\n▶ ocr\nocr: processing page 1 [ \"./page-01.pgm\" ]\nocr: processing page 2 [ \"./page-02.pgm\" ]\n{ ... }\nocr: processing page 15 [ \"./page-15.pgm\" ]\n▶ ocr-ls --text | xargs cat\n{ ... recognised text }\n▶\n```\nIn this simple example we first create a directory and `cd` into it, then we convert\neach page of the document `text.pdf` to an image using the `ocr-open` tool, and then we do\nthe actual character recognition via the `ocr` tool. The last command gives an example of\nhow other custom tools can be integrated into the process with the help of the `ocr-ls`\nutility. 
Here we use the standard Linux `cat` utility to display the recognised text.\n\nInternally, the toolset operates on images in [PGM](http://netpbm.sourceforge.net/doc/pgm.html)\nformat, which was chosen as the lowest common denominator between all the tools\nwrapped by this toolset, and also because it is understood by the good old `netpbm`\npackage, which is often a bit faster than `imagemagick` when it comes to simple\noperations like image cropping.\n\nAll images are named using the pattern `page-N.pgm`, where `N` is the page number ranging from\n1 to the maximum of 9999, as in the source document, and with a sufficient number\nof leading zeroes to make sure that a list of files sorted alphabetically gives the correct\npage order. The text recognised from each page is stored in a file named using the same pattern,\nbut with the `.txt` extension. Most of the tools in this toolset can operate on a sub-range of\npages via the `-p` or `--pages` command line option; see help (`-h` or `--help`) on\na particular tool. Generally, the toolset is designed to operate on \"pages\" rather\nthan files, for convenience.\n\nAnother thing these tools are designed to do is to check all the parameters and input\nfiles before passing them over to the underlying programs, because the error messages\nfrom those programs are sometimes a bit cryptic.\n\nFor details on the command line options supported by a particular tool, simply\ninvoke the tool with the `-h` or `--help` option.\n\nThe included tools are:\n\n##### `ocr-open`\n\nThis is usually the first command to invoke when starting a new project. The tool\nconverts each page of the specified document to a separate image. There are options\nto specify the range of pages to extract, and the destination directory.\nThe input document can be in either `.pdf` or `.djvu` format. 
Internally the tool invokes\neither the `ddjvu` or the `pdftoppm` program, depending on the type of the input file.\n\n##### `ocr-ls`\n\nThe main purpose of the tool is to produce a list of files for bulk-processing.\nThe tool outputs a list of files, text or images, from the selected range(s) of pages, in order.\nA simple example is given above, where it is used to concatenate all the recognised text.\nFor a more involved example, consider the situation where every page except the first one\nhas a page number at the bottom that we don't want to see in the recognised text,\nand so we want to crop (for example) 6.5% from the bottom of each image, starting from\npage 2 and continuing to the end of the document. This can be achieved with the following\ncommand:\n```sh\nocr-ls -p 2- | xargs -I{} crop-image -b 6.5% {} {}\n```\n_(see below for the description of the `crop-image` command)_\n\n##### `ocr`\n\nThe tool invokes the `tesseract` program to recognise text from the given images. There\nare options to specify the range of pages to process, as well as the directory\nwhere the image files are stored. For each page, the recognised text is written to\nthe same directory, in a file with the same name but with a `.txt` extension.\nFor example, this is how to extract text in Russian and English, from pages\n5 to 10 only, all located in the directory `book`:\n\n```sh\nocr -p 5-10 -d book/ -- -l rus+eng\n```\nNote: everything to the right of `\"--\"` is passed over to the `tesseract` program.\n\n##### `crop-image`\n\nCrops the specified image. The amount of space to crop is given as a percentage of\nthe image's width or height, which is often more convenient than using pixels. Wraps\naround the `pamcut` utility from the `netpbm` toolset.\n\n##### `norm-image`\n\nA tiny utility that crops the image to content and then adds a 5% white border. Wraps around\nthe ImageMagick `convert` tool. 
Rarely useful, except in situations where\npoor quality scanned images have dust specks in the space surrounding the text\nthat sometimes get recognised as punctuation.\n\n##### `norm-text`\n\nA script to normalise text by removing hyphenation and line breaks inside paragraphs.\nNormally, `tesseract` separates paragraphs by empty lines, and this is required\nfor the tool to work correctly. The tool takes its input from `stdin`, and\nwrites to `stdout`.\n\n##### `norm-page`\n\nEnsures the correct paragraph boundary at the end of the page. Takes one or more text\nfiles as input, and writes its output to `stdout`. Can be used in conjunction with\nother tools, for example:\n```sh\nocr-ls -t | xargs norm-page | norm-text\n```\n\n### Installation\n\nThe toolset makes use of external tools that need to be installed first:\n```sh\nsudo apt install netpbm imagemagick tesseract-ocr djvulibre-bin poppler-utils\n```\n\nOptionally, install language packs for `tesseract`, for example:\n```sh\nsudo apt install tesseract-ocr-rus\n```\n\nThe preferred way to install the toolset is to grab the `ocr-*.tar.xz` archive\nattached to the latest [release](https://github.com/maxim2266/ocr/releases)\non GitHub (starting from version 0.8), and extract it to a directory listed on the\n`$PATH`. Alternatively, if recent but as yet unreleased updates are required,\njust clone the project from GitHub\n```sh\ngit clone --recursive https://github.com/maxim2266/ocr\n```\nthen install dependencies for the build\n```sh\nsudo apt install build-essential libmagic-dev\n```\nand finally run `make release` from the root directory of the project. This will compile the\ntoolset and create an archive with all the utilities, which can then be extracted to a directory\non the `$PATH`.\n\nThe toolset has been tested on Linux Mint 19.3, and will probably work on other Debian-based\ndistributions as well. 
Supported `tesseract` version is 4.0.0 or later.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaxim2266%2Focr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmaxim2266%2Focr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaxim2266%2Focr/lists"}