{"id":13588625,"url":"https://github.com/hertzg/tesseract-server","last_synced_at":"2026-04-26T08:13:29.249Z","repository":{"id":39670571,"uuid":"319424402","full_name":"hertzg/tesseract-server","owner":"hertzg","description":"A small lightweight HTTP server that converts photos, images and scanned documents to text using optical character recognition by utilizing the power of Google Tesseract.","archived":false,"fork":false,"pushed_at":"2024-10-29T18:08:10.000Z","size":2251,"stargazers_count":87,"open_issues_count":4,"forks_count":21,"subscribers_count":6,"default_branch":"master","last_synced_at":"2024-10-29T20:12:23.365Z","etag":null,"topics":["api","container","containers","docker","docker-compose","docker-image","hacktoberfest","http-server","image-processing","ocr","rest-api","tesseract","tesseract-server","typescript"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hertzg.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG-next.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-12-07T19:35:40.000Z","updated_at":"2024-10-20T20:37:30.000Z","dependencies_parsed_at":"2024-03-25T20:26:35.524Z","dependency_job_id":"e9127300-47d5-4676-aaed-a0f1b6d8545f","html_url":"https://github.com/hertzg/tesseract-server","commit_stats":null,"previous_names":[],"tags_count":18,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hertzg%2Ftesseract-server","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hertzg%2Ftesseract-server/tags"
,"releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hertzg%2Ftesseract-server/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hertzg%2Ftesseract-server/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hertzg","download_url":"https://codeload.github.com/hertzg/tesseract-server/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247430874,"owners_count":20937874,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["api","container","containers","docker","docker-compose","docker-image","hacktoberfest","http-server","image-processing","ocr","rest-api","tesseract","tesseract-server","typescript"],"created_at":"2024-08-01T15:06:49.645Z","updated_at":"2026-04-26T08:13:24.201Z","avatar_url":"https://github.com/hertzg.png","language":"TypeScript","funding_links":[],"categories":["TypeScript","api"],"sub_categories":[],"readme":"# Tesseract server (OCR over HTTP)\n\nA small lightweight HTTP server that converts photos, images and scanned\ndocuments to text using optical character recognition by utilizing the power of\n[Google Tesseract](https://github.com/tesseract-ocr/tesseract).\n\n![An arrow from papers to json signifying conversion, arrow is labeled as http](feature-image.png)\n\n## Quick Start\n\nThe easiest way to get started is using\n[pre-built docker images](https://hub.docker.com/repository/docker/hertzg/tesseract-server)\n(multi-arch)\n\n```shell script\n$ docker run -p 8884:8884 hertzg/tesseract-server:latest\n```\n\nYou can 
use the service by sending `multipart` http requests containing\n`options` and `file` fields.\n\n\u003c!-- prettier-ignore-start --\u003e\n```shell script\n# Run OCR using english language on file sample.jpg in current directory\n$ curl -F \"options={\\\"languages\\\":[\\\"eng\\\"]}\" -F file=@sample.jpg http://127.0.0.1:8884/tesseract\n\n{\n  \"data\": {\n    \"exit\": {\n      \"code\": 0,\n      \"signal\": null\n    },\n    \"stderr\": \"Warning: Invalid resolution 0 dpi. Using 70 instead.\\nEstimating resolution as 153\\n\",\n    \"stdout\": \" \\n\\n \\n\\nThe Life and Work of\\nFredson Bowers\\n\\nby\\nG. THOMAS TANSELLE\\n\\n \\n\\nN EVERY FIELD OF ENDEAVOR THERE ARE A FEW FIGURES WHOSE AGCOM-\\nplishment and influence cause them to be the symbols of their age;\\ntheir careers and oeuvres become the touchstones by which the\\nfield is measured and its history told. In the related pursuits of\\n\\nanalytical and descriptive bibliography, textual criticism, and scholarly\\nediting, Fredson Bowers was such a figure, dominating the four decades\\nafter 1949, when his Principles of Bibliographical Description was pub-\\nlished. By 1973 the period was already being called “the age of Bowers”:\\nin that year Norman Sanders, writing the chapter on textual scholarship\\nfor Stanley Wells's Shakespeare: Select Bibliographies, gave this title to\\na section of his essay. For most people, it would be achievement enough\\nto ise to such a position in a field as complex as Shakespearean textual\\nstudies; but Bowers played an equally important role in other areas.\\nEditors of nineteenth-century American authors, for example, would\\nalso have to call the recent past “the age of Bowers, as would the writers\\nof descriptive bibliographics of authors and presses. 
His ubiquity in\\nthe broad field of bibliographical and textual study, his seemingly com-\\nplete possession of it, distinguished him from his illustrious predeces-\\nsors and made him the personification of bibliographical scholarship in\\nhis time.\\n\\nWhen in 1969 Bowers was awarded the Gold Medal of the Biblio-\\ngraphical Society in London, John Carter’s citation referred to the\\nPrinciples as “majestic,” called Bowers’s current projects “formidable,”\\nsaid that he had “imposed critical discipline” on the texts of several\\nauthors, described Studies in Bibliography as a “great and continuing\\nachievement,” and included among his characteristics “uncompromising\\nseriousness of purpose” and “professional intensity.” Bowers was not\\nunaccustomed to such encomia, but he had also experienced his share of\\nattacks: his scholarly positions were not universally popular, and he\\nexpressed them with an aggressiveness that almost seemed calculated to\\n\\n \\n\\f\"\n  }\n}\n```\n\u003c!-- prettier-ignore-end --\u003e\n\n## Usage\n\nThe service provides configurations as cli options. 
All the options with their\ndescriptions, types and defaults, including some usage examples, can be seen using\nthe `--help` flag.\n\n\u003c!-- prettier-ignore-start --\u003e\n```shell script\n# Using Docker\n$ docker run hertzg/tesseract-server:latest --help\n```\n```text test-id=\"--help\" test-param-columns=\"160\"\ntesseract-server [options]\n\nA small lightweight http server exposing tesseract as a service.\n\nOptions:\n  --help                                    Show help                                                                                                  [boolean]\n  --version                                 Show version number                                                                                        [boolean]\n  --pool.default.min                        Minimum number of processes to keep waiting in each pool                                       [number] [default: 0]\n  --pool.default.max                        Maximum number of processes to spawn for each pool after which requests are queued             [number] [default: 2]\n  --pool.default.idleTimeoutMillis          Time (in milliseconds) a processes can stay idle in queue before eviction                   [number] [default: 5000]\n  --pool.default.evictionRunIntervalMillis  Time interval (in milliseconds) between eviction checks                                     [number] [default: 5000]\n  --http.listen.address                     Set http listen address                                                                [string] [default: \"0.0.0.0\"]\n  --http.listen.port                        Set http listen port                                                                        [number] [default: 8884]\n  --http.upload.tmpDir                      Path to where temp uploads are saved to                                                   [string] [default: \"/tmp\"]\n  --http.endpoint.status.enable             Enable /status endpoint                                                    
                [boolean] [default: true]\n  --http.endpoint.health.enable             Enable /.well-known/health/* endpoints and health checkers                                 [boolean] [default: true]\n  --http.endpoint.webui.enable              Enable Web UI at /                                                                         [boolean] [default: true]\n  --http.input.optionsField                 Multipart field name containing OCR Options                                            [string] [default: \"options\"]\n  --http.input.fileField                    Multipart field name containing OCR file                                                  [string] [default: \"file\"]\n  --http.output.jsonSpaces                  Enable json pretty printing and set number of spaces to use for indentation                    [number] [default: 0]\n  --processor.lineEndings                   Set line ending policy                                    [string] [choices: \"auto\", \"lf\", \"crlf\"] [default: \"auto\"]\n\nExamples:\n  tesseract-server --http.output.jsonSpaces 2                                       Enable JSON pretty printing\n  tesseract-server --http.endpoint.status.enable false                              Disable Status and Health endpoints\n  --http.endpoint.health.enable false\n\nReferences:\n  GitHub: https://github.com/hertzg/tesseract-server\n  Discussions: https://github.com/hertzg/tesseract-server/discussions\n  Issues: https://github.com/hertzg/tesseract-server/issues\n```\n\u003c!-- prettier-ignore-end --\u003e\n\n## Docker\n\nDocker images are multi-arch images based on `alpine` variant of official `node`\ndocker images supporting `linux/amd64`, `linux/arm/v6`, `linux/arm/v7`,\n`linux/arm64/v8`, `linux/ppc64le` and `linux/s390x` platforms.\n\n## Raspberry Pi support\n\nThe docker images support ARM architectures which means that they can be used on\nat least the following versions of Raspberry Pi:\n\n- RPi 1 Model A\n- RPi 1 Model A+\n- RPi 3 
Model A+\n- RPi 1 Model B\n- RPi 1 Model B+\n- RPi 2 Model B\n- RPi 2 Model B v1.2 (:heavy_check_mark: tested)\n- RPi 3 Model B\n- RPi 3 Model B+ (:heavy_check_mark: tested)\n- RPi 4 Model B (:heavy_check_mark: tested)\n- Compute Module 1\n- Compute Module 3\n- Compute Module 3 Lite\n- Compute Module 3+\n- Compute Module 3+ Lite\n- RPi Zero PCB v1.2\n- RPi Zero PCB v1.3\n- RPi Zero W\n\nIf you have any of those devices and have successfully used the images, feel free\nto report them and help update this list. :open_hands:\n\n## Supported Languages\n\nThe container by default installs tesseract and the following language datapacks:\n\n- `tesseract-ocr` - English (included)\n- `tesseract-ocr-data-deu` - German\n- `tesseract-ocr-data-fra` - French\n- `tesseract-ocr-data-kat` - Georgian\n- `tesseract-ocr-data-pol` - Polish\n- `tesseract-ocr-data-rus` - Russian\n\nTo add more languages you can extend this image and install one or more\n[available language datapacks](https://pkgs.alpinelinux.org/packages?name=tesseract-ocr-data-*\u0026branch=edge\u0026arch=x86_64)\nwith the package manager:\n\n\u003c!-- prettier-ignore-start --\u003e\n```Dockerfile\nFROM hertzg/tesseract-server:latest\nRUN apk add --no-cache tesseract-ocr-data-spa tesseract-ocr-data-ara # and so on\n```\n\u003c!-- prettier-ignore-end --\u003e\n\nAfter starting the container, the new languages will be automatically available.\n\n## HTTP API\n\nThe server exposes a few endpoints; this section describes each one.\n\n### OCR Endpoint - `/tesseract`\n\nThis endpoint performs OCR on the provided `file`. You can control the OCR process\nby providing an `options` field with a `JSON` object containing the configuration.\nThis is the main endpoint. It expects an http `multipart` request containing\n`options` and `file` fields and returns a `json` containing `stdout` and\n`stderr` of the tesseract process.\n\nThe `options` json object fields directly relate to the CLI options of\nthe `tesseract` command.\n\n\u003c!-- prettier-ignore-start 
--\u003e\n```json5\n{\n  \"languages\": ['eng'],               // -l LANG[+LANG]        Specify language(s) used for OCR.\n  \"dpi\": 300,                         // --dpi VALUE           Specify DPI for input image.\n  \"pageSegmentationMethod\": 3,        // --psm NUM             Specify page segmentation mode.\n  \"ocrEngineMode\": 3,                 // --oem NUM             Specify OCR Engine mode.\n  \"tessDataDir\": './dir',             // --tessdata-dir PATH   Specify the location of tessdata path.\n  \"userPatternsFile\": './file',       // --user-patterns PATH  Specify the location of user patterns file.\n  \"userWordsFile\": './file',          // --user-words PATH     Specify the location of user words file.\n  \"configParams\": {                   // -c VAR=VALUE          Set value for config variables.\n    \"VAR\": \"VALUE\",                   // Note: You can use tesseract --print-parameters to see all available parameters\n  },\n}\n```\n\u003c!-- prettier-ignore-end --\u003e\n\nThe returned response has the following shape:\n\n\u003c!-- prettier-ignore-start --\u003e\n```json5\n{\n  \"exit\": {\n    \"code\": 0,                        // Process exit code\n    \"signal\": null                    // Process signal that caused the exit\n  },\n  \"stderr\":  \"...\",                    // Tesseract Errors and warnings\n  \"stdout\":  \"...\"                     // Tesseract output that contains the result\n}\n```\n\u003c!-- prettier-ignore-end --\u003e\n\n### Status Endpoint - `/status`\n\n```shell\n# Get worker status\n$ curl http://127.0.0.1:8884/status\n```\n\nReturns the pools and their statuses as JSON. When you make an OCR request, the first\npool will be created and then re-used. 
This endpoint also shows detailed\ninformation about each pool including process pids and eviction flags.\n\n\u003c!-- prettier-ignore-start --\u003e\n```json5\n{\n  data: {\n    processor: {\n      pools: [\n        {\n          args: '-l eng',\n          resources: [],\n          status: {\n            spareResourceCapacity: 2,\n            size: 0,\n            available: 0,\n            borrowed: 0,\n            pending: 0,\n            max: 2,\n            min: 0,\n          },\n        },\n      ],\n    },\n  },\n}\n```\n\u003c!-- prettier-ignore-end --\u003e\n\n### Health Endpoints\n\nEndpoints:\n\n- `/.well-known/health/healthy`\n- `/.well-known/health/live`\n- `/.well-known/health/ready`\n\nThe difference between liveness and readiness endpoints is the purpose:\nreadiness should be used to denote whether an application is \"ready\" to receive\nrequests, and liveness should be used to denote whether an application is \"live\"\n(vs. in a state where it should be restarted).\n\nThe combined health endpoint is designed for cloud technologies, such as Cloud\nFoundry, which only support a single endpoint for both liveness and readiness\nchecking.\n\n## Deployment Guides\n\n- [Heroku](./docs/heroku-deploy.md)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhertzg%2Ftesseract-server","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhertzg%2Ftesseract-server","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhertzg%2Ftesseract-server/lists"}