{"id":13543395,"url":"https://github.com/maxim2266/go-ocr","last_synced_at":"2025-04-02T12:32:18.045Z","repository":{"id":57492488,"uuid":"62742290","full_name":"maxim2266/go-ocr","owner":"maxim2266","description":"A tool for extracting text from scanned documents (via OCR), with user-defined post-processing.","archived":true,"fork":false,"pushed_at":"2020-02-20T11:13:16.000Z","size":42,"stargazers_count":34,"open_issues_count":0,"forks_count":8,"subscribers_count":4,"default_branch":"master","last_synced_at":"2024-11-03T10:32:43.293Z","etag":null,"topics":["extract-images","go","ocr","scanned-documents"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/maxim2266.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-07-06T17:57:30.000Z","updated_at":"2024-06-08T03:12:07.000Z","dependencies_parsed_at":"2022-08-28T11:51:24.805Z","dependency_job_id":null,"html_url":"https://github.com/maxim2266/go-ocr","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maxim2266%2Fgo-ocr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maxim2266%2Fgo-ocr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maxim2266%2Fgo-ocr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maxim2266%2Fgo-ocr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/maxim2266","download_url":"https://codeload.github.com/maxim2266/go-ocr/tar.gz/refs/heads/master","host":{"n
ame":"GitHub","url":"https://github.com","kind":"github","repositories_count":246815852,"owners_count":20838530,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["extract-images","go","ocr","scanned-documents"],"created_at":"2024-08-01T11:00:31.034Z","updated_at":"2025-04-02T12:32:13.026Z","avatar_url":"https://github.com/maxim2266.png","language":"Go","funding_links":[],"categories":["Optical Character Recognition Engines and Frameworks"],"sub_categories":["CTPN [paper:2016](https://arxiv.org/pdf/1609.03605.pdf)"],"readme":"_The project is based on older versions of `tesseract` and other tools, and is now superseded by \n[another project](https://github.com/maxim2266/OCR)\nwhich allows for more granular control over the text recognition process._\n\n# go-ocr\nA tool for extracting plain text from scanned documents (`pdf` or `djvu`), with user-defined postprocessing.\n\n### Motivation\nOnce I had a task of OCR'ing a number of scanned documents in `pdf` format. I quickly built a pipeline\nof the tools to extract images from the input files and to convert them to plain text, but then I realised that\nmodern OCR\nsoftware is still less than ideal in terms of recognising text, so a good deal of postprocessing was needed\nin order to remove at least some of those OCR artefacts and irregularities. I ended up with a long pipeline\nof `sed`/`grep` filters which\nalso had to be adjusted per each document and per each document language. 
What I wanted was a tool that could\ncombine the OCR tools invocation with filters application, also giving an easy way of modifying and combining\nthe filter definitions.\n\n### The tool\nGiven an input file in either `pdf` or `djvu` format, the tool performs the following steps:\n\n1. Images get extracted from the input file using `pdfimages` or `ddjvu` tool;\n2. The extracted images get converted to plain text using `tesseract` tool, in parallel;\n3. The specified filters get applied to the text.\n\n### Invocation\n```go-ocr [OPTION]... FILE```\n\nCommand line options:\n```\n-f,--first N        first page number (optional, default: 1)\n-l,--last  N        last page number (optional, default: last page of the document)\n-F,--filter FILE    filter specification file name (optional, may be given multiple times)\n-L,--language LANG  document language (optional, default: 'eng')\n-o,--output FILE    output file name (optional, default: stdout)\n-h,--help           display this help and exit\n-v,--version        output version information and exit\n```\n\n##### Example\nThe following command processes a document `some.pdf` in Russian, from page 12 to page 26 (inclusive),\nwithout any postprocessing, storing the result in the file `document.txt`:\n```\n./go-ocr --first 12 --last 26 --language rus --output document.txt some.pdf\n```\n\n### Filter definitions\nFilter definition file is a plain text file containing rewriting rules and C-style comments.\nEach rewriting rule has the following format:\n```\nscope type \"match\" \"substitution\"\n```\nwhere\n- `scope` is either `line` or `text`;\n- `type` is either `word` or `regex`;\n- `match` and `substitution` are Go strings.\n\nEach rule must be on one line.\n\nEach rule of the scope `line` is applied to each line of the text. There is no\nprocessing done to the line by the tool itself other than trimming the trailing whitespace, which means\nthat a line does not have a trailing newline symbol when the rule is applied. 
After that all the lines get\ncombined into text with newline symbols inserted between them.\n\nEach rule of the scope `text` is applied to the whole text after all the `line` rules. All newline\nsymbols are visible to the rule which allows for combining multiple lines into one.\n\nThe reason for having two different scopes for the rules is that applying a rule to a line is computationally\ncheaper than applying it to the whole text. Also, this makes the line regular expressions a bit simpler as,\nfor example, `\\s` regex cannot match a newline.\n\nRules of type `word` do a simple substitution replacing any `match` string with its corresponding\n`substitution` string.\n\nRules of type `regex` search the input for any match of the `match` regular expression and replace\nit with the `substitution` string. The [syntax](https://golang.org/pkg/regexp/syntax/) of the regular\nexpression is that of the Go `regexp` engine. The `substitution` string may contain\n[references](https://golang.org/pkg/regexp/#Regexp.Expand) to the content of capturing groups\nfrom the corresponding `match` regular expression. From the Go documentation, each reference\n\n\u003e is denoted by a substring of the form $name or ${name}, where name is a non-empty sequence of letters, digits, and underscores. A purely numeric name like $1 refers to the submatch with the corresponding index; other names refer to capturing parentheses named with the (?P\\\u003cname\\\u003e...) syntax. 
A reference to an out of range or unmatched index or a name that is not present in the regular expression is replaced with an empty slice.\n\n\u003e In the $name form, name is taken to be as long as possible: $1x is equivalent to ${1x}, not ${1}x, and, $10 is equivalent to ${10}, not ${1}0.\n\n\u003e To insert a literal $ in the output, use $$ in the template.\n\nAll filter definition files are always processed in the order in which they are specified on the command line.\nWithin each file, the rules are grouped by the `scope`, and applied in the order of specification. This\nallows for each rule to rely on the outcome of all the rules before it.\n\n##### Rewriting rules examples\nRule to replace ellipsis with a single UTF-8 symbol:\n```\nline word\t\"...\"  \"…\"\n```\nRule to replace all whitespace sequences with a single space character:\n```\nline regex\t`\\s+`\t\" \"\n```\nRule to remove all newline characters from the middle of a sentence:\n```\ntext regex\t`([a-z\\(\\),])\\n+([a-z\\(\\)])` \"${1} ${2}\"\n```\n\nMore examples can be found in the files `filter-eng` and `filter-rus`.\n\nIn practice, it is often useful to maintain\none filter definition file with rules to remove common OCR artefacts, and another file with rules\nspecific to a particular document. In general, it is probably impossible to avoid all manual editing\naltogether by using this tool, but from my experience, a few hours spent on setting up the appropriate filters\nfor a 700-page document can dramatically reduce the amount of manual work needed afterwards.\n\n### Other tools\nInternally the program relies on the `pdfimages` and `ddjvu` tools for extracting images from the input file,\nand on the `tesseract` program for the actual OCR'ing. The tool `pdfimages` is usually a part of the `poppler-utils`\npackage, the tool `ddjvu` comes from the `djvulibre-bin` package, and `tesseract` is included in the `tesseract-ocr`\npackage. 
By default, `tesseract` comes with English language support only; other languages\nshould be installed separately, for example, run `sudo apt install tesseract-ocr-rus`\nto install the Russian language support. To find out what languages are currently installed, type\n`tesseract --list-langs`.\n\n### Compilation\nInvoke `make` (or `make debug`) from the directory of the project to compile the code with debug\ninformation included, or `make release` to compile without debug symbols. This creates the executable file `go-ocr`.\n\n### Technical details\nThe tool first runs the `pdfimages` or `ddjvu` program to extract images to a temporary directory, and then invokes\n`tesseract` on each image in parallel to produce lines of plain text. Those lines are then passed through\nthe `line` filters, if any, then assembled into one text string and passed through `text` filters, if any.\n`regex` filters are implemented using the [Regexp.ReplaceAll()](https://golang.org/pkg/regexp/#Regexp.ReplaceAll)\nfunction, and `word` filters are invocations of the [bytes.Replace()](https://golang.org/pkg/bytes/#Replace) function.\n\n### Known issues\nOlder versions of the `pdfimages` tool do not have the `-tiff` option, resulting in an error.\n\n### Platform\nLinux (tested on Linux Mint 18 64bit, based on Ubuntu 16.04), will probably work on macOS as well.\n\nTools:\n```bash\n$ go version\ngo version go1.6.2 linux/amd64\n$ tesseract --version\ntesseract 3.04.01\n...\n$ pdfimages --version\npdfimages version 0.41.0\n...\n$ ddjvu --help\nDDJVU --- DjVuLibre-3.5.27\n...\n\n```\n\n##### License: BSD\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaxim2266%2Fgo-ocr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmaxim2266%2Fgo-ocr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaxim2266%2Fgo-ocr/lists"}