{"id":26418602,"url":"https://github.com/agentgill/tesseract-for-aws","last_synced_at":"2026-04-19T14:02:57.364Z","repository":{"id":282873471,"uuid":"949931322","full_name":"agentgill/tesseract-for-aws","owner":"agentgill","description":"Using Tesseract with AWS is extremely difficult due to the size of the binaries and the dependencies.","archived":false,"fork":false,"pushed_at":"2025-03-17T11:17:38.000Z","size":46804,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-17T12:29:19.708Z","etag":null,"topics":["amazonlinux2023","aws","docker-image","dockerfile","lambda","lambda-layers","python3","tesseract","tesseract-ocr"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/agentgill.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-17T11:16:19.000Z","updated_at":"2025-03-17T11:19:38.000Z","dependencies_parsed_at":"2025-03-17T12:39:25.179Z","dependency_job_id":null,"html_url":"https://github.com/agentgill/tesseract-for-aws","commit_stats":null,"previous_names":["agentgill/tesseract-for-aws"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/agentgill/tesseract-for-aws","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agentgill%2Ftesseract-for-aws","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agentgill%2Ftesseract-for-aws/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agentgill%2Ftesseract-f
or-aws/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agentgill%2Ftesseract-for-aws/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/agentgill","download_url":"https://codeload.github.com/agentgill/tesseract-for-aws/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agentgill%2Ftesseract-for-aws/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32009239,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-18T20:23:30.271Z","status":"online","status_checked_at":"2026-04-19T02:00:07.110Z","response_time":55,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["amazonlinux2023","aws","docker-image","dockerfile","lambda","lambda-layers","python3","tesseract","tesseract-ocr"],"created_at":"2025-03-18T01:48:46.410Z","updated_at":"2026-04-19T14:02:57.347Z","avatar_url":"https://github.com/agentgill.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Using Tesseract with AWS\n\nUsing Tesseract with AWS is extremely difficult due to the size of the binaries and the dependencies.\n\nCompile the latest stable version of Tesseract and use with Lambda Layers or Lambda Docker. 
:fire:\n\n## Resources\n\n- [Docs](https://tesseract-ocr.github.io/)\n- [GitHub](https://github.com/tesseract-ocr/tesseract)\n\n## Build Tesseract 5.5.0 (Latest Stable Version)\n\nBuild Tesseract and its dependencies from source.\n\n### Build in Docker\n\nUse Amazon Linux 2023 as the base image.\n\n```bash\ndocker run --platform linux/amd64 -it -v \"${PWD}\":/layer-build amazonlinux:2023 bash\n```\n\n### Install Dependencies\n\n```bash\n# Update system and install development tools\ndnf update -y\ndnf groupinstall -y \"Development Tools\"\ndnf install -y cmake gcc gcc-c++ make autoconf automake libtool pkgconfig\n```\n\n### Install library dependencies\n\n```bash\ndnf install -y zlib zlib-devel libjpeg libjpeg-devel libwebp libwebp-devel \\\n    libtiff libtiff-devel libpng libpng-devel wget libicu libicu-devel\n```\n\n### Install Leptonica using CMake with position-independent code\n\n```bash\ncd /tmp\nwget https://github.com/DanBloomberg/leptonica/releases/download/1.83.1/leptonica-1.83.1.tar.gz\ntar -xzvf leptonica-1.83.1.tar.gz\ncd leptonica-1.83.1\nmkdir build \u0026\u0026 cd build\n# Add -DCMAKE_POSITION_INDEPENDENT_CODE=ON to ensure PIC is enabled\ncmake -DCMAKE_INSTALL_PREFIX=/usr/local -DCMAKE_POSITION_INDEPENDENT_CODE=ON -DBUILD_SHARED_LIBS=ON ..\nmake\nmake install\nldconfig\n```\n\n### Install Tesseract 5.5.0 using CMake\n\n```bash\ncd /tmp\nwget https://github.com/tesseract-ocr/tesseract/archive/refs/tags/5.5.0.tar.gz\ntar -xzvf 5.5.0.tar.gz\ncd tesseract-5.5.0\nmkdir build \u0026\u0026 cd build\ncmake -DCMAKE_INSTALL_PREFIX=/usr/local -DBUILD_SHARED_LIBS=ON -DBUILD_TRAINING_TOOLS=OFF -DCMAKE_POSITION_INDEPENDENT_CODE=ON ..\nmake\nmake install\nldconfig\n```\n\n### Download English language data\n\n```bash\ncd /usr/local/share/tessdata\nwget https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata\n```\n\n### Create Lambda Layer Structure\n\n```bash\nmkdir -p /layer-build/layer/lib\nmkdir -p /layer-build/layer/bin\nmkdir -p 
/layer-build/layer/tessdata\n```\n\n### Copy Tesseract binary and libraries\n\n```bash\ncp /usr/local/bin/tesseract /layer-build/layer/bin/\n\n# Copy required libraries. The CMake build likely installed them to /usr/local/lib64/;\n# if yours used /usr/local/lib/, run the commented-out commands instead.\n# cp /usr/local/lib/libtesseract.so* /layer-build/layer/lib/\ncp /usr/local/lib64/libtesseract.so* /layer-build/layer/lib/\n# cp /usr/local/lib/libleptonica.so* /layer-build/layer/lib/\ncp /usr/local/lib64/libleptonica.so* /layer-build/layer/lib/\n```\n\n### Create symbolic links with expected names\n\n```bash\ncd /layer-build/layer/lib\nln -s libleptonica.so.6 liblept.so\nln -s libleptonica.so.6 liblept.so.6\n```\n\n### Copy language data\n\n```bash\ncp /usr/local/share/tessdata/eng.traineddata /layer-build/layer/tessdata/\n```\n\n### Find and copy all dependencies\n\n```bash\nldd /usr/local/bin/tesseract | grep \"=\u003e /\" | awk '{print $3}' | xargs -I '{}' cp -v '{}' /layer-build/layer/lib/\n```\n\n### Create zip file for Lambda layer\n\n```bash\ncd /layer-build/layer\nzip -r ../tesseract-layer.zip *\n```\n\n## Working with Lambda\n\n### Using with Lambda Docker\n\nIf you need to use Tesseract with Lambda Docker, you can use the following Dockerfile as a starting point.\n\n```dockerfile\nFROM public.ecr.aws/lambda/python:3.13\n\n# Install unzip utility\nRUN dnf install -y unzip\n\n# Copy tesseract layer and extract it to /opt\nCOPY tesseract-layer.zip /tmp/\nRUN unzip /tmp/tesseract-layer.zip -d /opt \u0026\u0026 \\\n    rm /tmp/tesseract-layer.zip\n\n# Copy application files\nCOPY . 
${LAMBDA_TASK_ROOT}\nCOPY requirements.txt .\nRUN pip install -r requirements.txt\n\nCMD [\"main.handler\"]\n\n```\n\n### Using as a Lambda Layer\n\nPublish the zip file as a Lambda Layer using the following command, then associate the layer with your Lambda function.\n\n```bash\naws lambda publish-layer-version --layer-name tesseract --zip-file fileb://tesseract-layer.zip --compatible-runtimes python3.13\n```\n\n## Example Lambda Function\n\n```python\nimport subprocess\nimport os\nfrom typing import Dict, Any\n\nfrom aws_lambda_powertools import Logger\nfrom aws_lambda_powertools.logging.formatter import LambdaPowertoolsFormatter\nfrom aws_lambda_powertools.utilities.typing import LambdaContext\n\nformatter = LambdaPowertoolsFormatter(utc=True)\nlogger = Logger(service=\"universal-ocr-engine\", logger_formatter=formatter)\n\n\ndef handler(event: Dict[str, Any], context: LambdaContext) -\u003e None:\n    \"\"\"Handle OCR processing requests\"\"\"\n\n    # Setup \u0026 verify tesseract\n    env = os.environ.copy()\n    get_tesseract_version(env)\n\n\ndef get_tesseract_version(env: Dict[str, str]) -\u003e None:\n    \"\"\"Get and log the installed Tesseract OCR version.\"\"\"\n    try:\n        # The layer (or Docker image) extracts the binary to /opt/bin\n        tesseract_v = subprocess.run(\n            [\"/opt/bin/tesseract\", \"--version\"],\n            env=env,\n            stdout=subprocess.PIPE,\n            stderr=subprocess.PIPE,\n            check=True,  # raise CalledProcessError on a non-zero exit code\n        )\n        logger.info(f\"tesseract_v: {tesseract_v.stdout.decode()}\")\n    except Exception as e:\n        logger.error(f\"Error running tesseract: {str(e)}\")\n\n```\n\n### Successfully tested with Lambda Layers and Lambda Docker :fire:\n\n```json\n{\"level\":\"INFO\",\"location\":\"handler:69\",\"message\":\"tesseract_v: tesseract 5.5.0\\n leptonica-1.83.1\\n  libjpeg 6b (libjpeg-turbo 2.1.4) : libpng 1.6.37 : libtiff 4.4.0 : zlib 1.2.11 : libwebp 1.2.4\\n Found AVX2\\n Found AVX\\n Found FMA\\n Found SSE4.1\\n\",\"timestamp\":\"2025-03-17 
10:06:57,658+0000\",\"service\":\"universal-ocr-engine\",\"xray_trace_id\":\"1-67d7f441-6d3d372abd2e1c6106494145\"}\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fagentgill%2Ftesseract-for-aws","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fagentgill%2Ftesseract-for-aws","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fagentgill%2Ftesseract-for-aws/lists"}