{"id":13522815,"url":"https://github.com/bweigel/aws-lambda-tesseract-layer","last_synced_at":"2026-03-04T11:09:09.079Z","repository":{"id":37926489,"uuid":"160103319","full_name":"bweigel/aws-lambda-tesseract-layer","owner":"bweigel","description":"A layer for AWS Lambda containing the tesseract C libraries and tesseract executable.","archived":false,"fork":false,"pushed_at":"2026-03-03T16:27:42.000Z","size":44991,"stargazers_count":123,"open_issues_count":7,"forks_count":36,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-03-03T20:44:25.000Z","etag":null,"topics":["amazon-linux","aws-lambda","lambda","lambda-layer","serverless","serverless-framework","tesseract"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bweigel.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2018-12-02T22:38:20.000Z","updated_at":"2026-02-19T08:44:24.000Z","dependencies_parsed_at":"2023-12-18T09:30:00.991Z","dependency_job_id":"819c9b09-fcac-4fbf-9e39-dc7407fc27b3","html_url":"https://github.com/bweigel/aws-lambda-tesseract-layer","commit_stats":null,"previous_names":[],"tags_count":16,"template":false,"template_full_name":null,"purl":"pkg:github/bweigel/aws-lambda-tesseract-layer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bweigel%2Faws-lambda-tesseract-layer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/re
positories/bweigel%2Faws-lambda-tesseract-layer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bweigel%2Faws-lambda-tesseract-layer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bweigel%2Faws-lambda-tesseract-layer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bweigel","download_url":"https://codeload.github.com/bweigel/aws-lambda-tesseract-layer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bweigel%2Faws-lambda-tesseract-layer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30078562,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-04T08:01:56.766Z","status":"ssl_error","status_checked_at":"2026-03-04T08:00:42.919Z","response_time":59,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["amazon-linux","aws-lambda","lambda","lambda-layer","serverless","serverless-framework","tesseract"],"created_at":"2024-08-01T06:00:52.553Z","updated_at":"2026-03-04T11:09:09.066Z","avatar_url":"https://github.com/bweigel.png","language":"TypeScript","readme":"Tesseract OCR Lambda 
Layer\n===\n\n![Tesseract](https://img.shields.io/badge/Tesseract-5.5.2-green?style=flat-square)\n![Leptonica](https://img.shields.io/badge/Leptonica-1.87.0-green?style=flat-square)\n\n![Examples available for Runtimes](https://img.shields.io/badge/Examples_(Lambda_runtimes)-Python_3.12(AL2023),Node.js_20(AL2023)-informational?style=flat-square)\n![Examples available for IaC Tools](https://img.shields.io/badge/Examples_(IaC)-Serverless_Framework,_AWS_CDK-informational?style=flat-square)\n\n\n![Continuos Integration](https://github.com/bweigel/aws-lambda-tesseract-layer/workflows/Continuos%20Integration/badge.svg)\n\n\u003e AWS Lambda layer containing the [tesseract OCR](https://github.com/tesseract-ocr/tesseract) libraries and command-line binary for Lambda Runtimes running on Amazon Linux 2023 and 2.\n\n\u003e :warning: **DEPRECATION NOTICE**:\n\u003e - **Amazon Linux 1 (AL1)**: Removed. No longer supported.\n\u003e - **Amazon Linux 2 (AL2)**: **Deprecated**. Will be removed after 6 months. New projects should use Amazon Linux 2023 (AL2023).\n\u003e   - **Note**: AL2 with Tesseract 5.5+ is not supported in CI due to GCC 7.3.1 lacking C++17 filesystem support. 
Users can build locally with Tesseract 5.4.x or earlier if AL2 is required.\n\u003e - **Recommended**: Use Amazon Linux 2023 (AL2023) for all new projects.\n\n\u003c!-- TOC --\u003e\n\n- [Quickstart](#quickstart)\n- [Ready-to-use binaries](#ready-to-use-binaries)\n    - [Use with Serverless Framework](#use-with-serverless-framework)\n    - [Use with AWS CDK](#use-with-aws-cdk)\n- [Build tesseract layer from source using Docker](#build-tesseract-layer-from-source-using-docker)\n    - [available `Dockerfile`s](#available-dockerfiles)\n    - [Building a different tesseract version and/or language](#building-a-different-tesseract-version-andor-language)\n    - [Deployment size optimization](#deployment-size-optimization)\n    - [Building the layer binaries directly using CDK](#building-the-layer-binaries-directly-using-cdk)\n    - [Layer contents](#layer-contents)\n- [Migration from AL2 to AL2023](#migration-from-al2-to-al2023)\n    - [Why Migrate?](#why-migrate)\n    - [Migration Steps](#migration-steps)\n    - [Common Issues](#common-issues)\n- [Known Issues](#known-issues)\n    - [Avoiding Pillow library issues](#avoiding-pillow-library-issues)\n    - [Unable to import module 'handler': cannot import name '_imaging'](#unable-to-import-module-handler-cannot-import-name-_imaging)\n- [Contributors :heart:](#contributors-heart)\n\n\u003c!-- /TOC --\u003e\n\n# Quickstart\n\nThis repo comes with ready-to-use binaries compiled against the AWS Lambda Runtimes (based on Amazon Linux 2023 and 2).\nExample Projects in Python 3.12 and Node.js 20 using Serverless Framework and CDK are provided:\n\n```bash\n## Demo using Serverless Framework and prebuilt layer\ncd example/serverless\nnpm ci\nnpx sls deploy\n\n## or ..\n\n## Demo using CDK and prebuilt layer\ncd example/cdk\nnpm ci\nnpx cdk deploy\n```\n# Ready-to-use binaries\n\nFor compiled, ready to use binaries that you can put in your layer see [`ready-to-use`](./ready-to-use), or check out the [latest 
release](https://github.com/bweigel/aws-lambda-tesseract-layer/releases/latest).\n\nSee [examples](./example) for some ready-to-use examples.\n\n## Use with Serverless Framework\n\n\u003e [Serverless Framework](https://www.serverless.com/framework/docs/getting-started/)\n\nReference the path to the ready-to-use layer contents in your `serverless.yml`:\n\n```yaml\nservice: tesseract-ocr-layer\n\nprovider:\n  name: aws\n\n# define layer\nlayers:\n  tesseractAl2:\n    # and path to contents\n    path: ready-to-use/amazonlinux-2\n    compatibleRuntimes:\n      - python3.8\n\nfunctions:\n  tesseract-ocr:\n    handler: ...\n    runtime: python3.8\n    # reference layer in function\n    layers:\n      - { Ref: TesseractAl2LambdaLayer }\n    events:\n      - http:\n          path: ocr\n          method: post\n```\n\nDeploy\n\n```\nnpx sls deploy\n```\n\n## Use with AWS CDK\n\n\u003e [AWS CDK](https://github.com/aws/aws-cdk#getting-started)\n\nReference the path to the layer contents in your constructs:\n\n```typescript\nconst app = new App();\nconst stack = new Stack(app, 'tesseract-lambda-ci');\n\nconst al2Layer = new lambda.LayerVersion(stack, 'al2-layer', {\n    // reference the directory containing the ready-to-use layer\n    code: Code.fromAsset(path.resolve(__dirname, './ready-to-use/amazonlinux-2')),\n    description: 'AL2 Tesseract Layer',\n});\nnew lambda.Function(stack, 'python38', {\n    // reference the source code to your function\n    code: lambda.Code.fromAsset(path.resolve(__dirname, 'lambda-handlers')),\n    runtime: Runtime.PYTHON_3_8,\n    // add tesseract layer to function\n    layers: [al2Layer],\n    memorySize: 512,\n    timeout: Duration.seconds(30),\n    handler: 'handler.main',\n});\n```\n\n# Build tesseract layer from source using Docker\n\nYou can build layer contents manually with the [provided `Dockerfile`s](#available-dockerfiles).\n\nBuild layer using your preferred `Dockerfile`:\n\n```bash\n## build (using AL2023 - recommended)\ndocker 
build -t tesseract-lambda-layer -f Dockerfile.al2023 .\n## run container\nexport CONTAINER=$(docker run -d tesseract-lambda-layer false)\n## copy tesseract files from container to local folder layer\ndocker cp $CONTAINER:/opt/build-dist layer\n## remove Docker container\ndocker rm $CONTAINER\nunset CONTAINER\n```\n\n## available `Dockerfile`s\n\n| Dockerfile                              | Base-Image        | compatible Runtimes                                           | Status             |\n| :-------------------------------------- | :---------------- | :------------------------------------------------------------ | :----------------- |\n| `Dockerfile.al2023` (**recommended**)   | Amazon Linux 2023 | Python 3.12+, Node.js 20+, Ruby 3.2+, Java 17+                | ✅ **Active**      |\n| `Dockerfile.al2`                        | Amazon Linux 2    | Python 3.8-3.11, Node.js 18, Ruby 2.7, Java 8/11              | ⚠️ **Deprecated**  |\n| ~~`Dockerfile.al1`~~                    | ~~Amazon Linux 1~~| ~~Python 2.7/3.6/3.7, Ruby 2.5, Java 8, Go 1.x~~              | ❌ **Removed**     |\n\n\n## Building a different tesseract version and/or language\n\nBy default, the build generates Tesseract 5.5.2 OCR libraries with the _fast_ german, english and osd (orientation and script detection) [data files](https://github.com/tesseract-ocr/tesseract/wiki/Data-Files) included.\n\nThe build process can be modified using different build time arguments (defined as `ARG` in `Dockerfile.al2` and `Dockerfile.al2023`), using the `--build-arg` option of `docker build`.\n\n| Build-Argument           | description                                                                                                       | default value | available versions                                                                                                                        |\n| :----------------------- | 
:---------------------------------------------------------------------------------------------------------------- | :------------ | :---------------------------------------------------------------------------------------------------------------------------------------- |\n| `TESSERACT_VERSION`      | the tesseract OCR engine                                                                                          | `5.5.2`       | https://github.com/tesseract-ocr/tesseract/releases                                                                                       |\n| `LEPTONICA_VERSION`      | fundamental image processing and analysis library                                                                 | `1.87.0`      | https://github.com/danbloomberg/leptonica/releases                                                                                        |\n| `OCR_LANG`               | Language to install (in addition to `eng` and `osd`)                                                              | `deu`         | https://github.com/tesseract-ocr/tessdata (`\u003clang\u003e.traineddata`)                                                                          |\n| `TESSERACT_DATA_SUFFIX`  | Trained LSTM models for tesseract. Can be empty (default), `_best` (best inference) and `_fast` (fast inference). 
| `_fast`       | https://github.com/tesseract-ocr/tessdata, https://github.com/tesseract-ocr/tessdata_best, https://github.com/tesseract-ocr/tessdata_fast |\n| `TESSERACT_DATA_VERSION` | Version of the trained LSTM models for tesseract                                                                  | `4.1.0`       | https://github.com/tesseract-ocr/tessdata/releases/tag/4.1.0                                                                              |\n| `COMPILER_FLAGS`         | C++ compiler flags for building Tesseract                                                                         | `\"-mavx2 -std=c++17\"` | Any valid CXXFLAGS (e.g., optimization level, CPU architecture, C++ standard)                                                             |\n\n\n**Example of custom build**\n\n```bash\n## Build with French language support (recommended)\ndocker build --build-arg OCR_LANG=fra -t tesseract-lambda-layer-french -f Dockerfile.al2023 .\n\n## Build with specific Tesseract version and language\ndocker build --build-arg TESSERACT_VERSION=5.0.0 --build-arg OCR_LANG=fra -t tesseract-lambda-layer -f Dockerfile.al2023 .\n\n## Build with custom compiler optimizations (e.g., for different CPU architectures)\ndocker build --build-arg COMPILER_FLAGS=\"-march=native -O3 -std=c++17\" -t tesseract-lambda-layer-optimized -f Dockerfile.al2023 .\n```\n\n## Deployment size optimization\n\nThe library files in the layer are stripped before deployment to reduce their size and make them more suitable for the Lambda environment. See the `Dockerfile`s:\n\n```Dockerfile\nRUN ... \\\n  find ${DIST}/lib -name '*.so*' | xargs strip -s\n```\n\nStripping can cause issues when the build runtime and the Lambda runtime differ (e.g. when building on Amazon Linux 1 and running on Amazon Linux 2).\n\n## Building the layer binaries directly using CDK\n\nYou can build the layer directly and get the artifacts (like in [ready-to-use](./ready-to-use/)). 
This is done using AWS CDK with the [`bundling` option](https://aws.amazon.com/blogs/devops/building-apps-with-aws-cdk/).\n\nRefer to [continous-integration](./continous-integration/README.md) and the [corresponding GitHub workflow](https://github.com/bweigel/aws-lambda-tesseract-layer/actions?query=workflow%3A%22Continuos+Integration%22) for an example.\n\n## Layer contents\n\nWhen a function uses the layer, the layer contents are deployed to `/opt`. See the [AWS Lambda layers documentation](https://docs.aws.amazon.com/lambda/latest/dg/configuration-layers.html) for details.\nSee [ready-to-use](./ready-to-use/) for the layer contents for Amazon Linux 2023 and Amazon Linux 2.\n\n# Migration from AL2 to AL2023\n\n## Why Migrate?\n\n- **Extended Support**: AL2023 receives updates until 2028\n- **Modern Runtimes**: Python 3.12+, Node.js 20+\n- **Performance**: Improved compiler optimizations and newer system libraries\n- **Security**: Latest security patches and cryptographic libraries\n\n## Migration Steps\n\n### 1. Update Runtime\n\n| Current Runtime | → | AL2023 Runtime |\n|-----------------|---|----------------|\n| Python 3.8-3.11 | → | Python 3.12    |\n| Node.js 18      | → | Node.js 20     |\n| Ruby 2.7        | → | Ruby 3.2       |\n\n### 2. 
Update Layer Reference\n\n**Serverless Framework**:\n```yaml\n# Before\nlayers:\n  tesseractAl2:\n    path: ready-to-use/amazonlinux-2\n    compatibleRuntimes:\n      - python3.8\n\n# After\nlayers:\n  tesseractAl2023:\n    path: ready-to-use/amazonlinux-2023\n    compatibleRuntimes:\n      - python3.12\n```\n\n**AWS CDK**:\n```typescript\n// Before\nconst layer = new lambda.LayerVersion(stack, 'layer', {\n  code: Code.fromAsset('ready-to-use/amazonlinux-2'),\n});\nnew lambda.Function(stack, 'fn', {\n  runtime: Runtime.PYTHON_3_8,\n  layers: [layer],\n});\n\n// After\nconst layer = new lambda.LayerVersion(stack, 'layer', {\n  code: Code.fromAsset('ready-to-use/amazonlinux-2023'),\n});\nnew lambda.Function(stack, 'fn', {\n  runtime: Runtime.PYTHON_3_12,\n  layers: [layer],\n});\n```\n\n### 3. Test Locally\n\n```bash\n# Update dependencies for new runtime\npip install --upgrade -r requirements.txt  # Python\nnpm update                                  # Node.js\n\n# Test with SAM CLI\nsam local invoke --runtime python3.12 ...\n```\n\n### 4. Deploy \u0026 Monitor\n\n- Deploy to dev/staging environment first\n- Check CloudWatch logs for compatibility issues\n- Verify OCR functionality works correctly\n- Roll out to production gradually\n\n## Common Issues\n\n**Python 3.12 Compatibility**\n- Some packages need updates for Python 3.12\n- Use `pip install --upgrade` for dependencies\n- Check for deprecated Python APIs\n\n**Node.js Native Modules**\n- Native modules must be recompiled for AL2023\n- Ensure node-gyp is up to date\n- Test with `sam local invoke`\n\n**Library Versions**\n- AL2023 may have different .so library versions\n- Error: \"cannot open shared object file\"\n- Solution: Use the AL2023 layer (not AL2 layer)\n\n# Known Issues\n## Avoiding Pillow library issues\nUse [cloud9 IDE](https://aws.amazon.com/cloud9/) with AMI linux to deploy [example](./example). 
Or alternatively, follow the instructions for getting the correct binaries for Lambda using [EC2](https://forums.aws.amazon.com/thread.jspa?messageID=915630). AWS Lambda runs on the Amazon Linux (AMI) distro, which needs matching Python binaries. This step is not needed for deploying the layer itself; the layer and the example function are deployed separately.\n\n## Unable to import module 'handler': cannot import name '_imaging'\n\nYou might run into an issue like this:\n\n```\n/var/task/PIL/_imaging.cpython-36m-x86_64-linux-gnu.so: ELF load command address/offset not properly aligned\nUnable to import module 'handler': cannot import name '_imaging'\n```\n\nThe root cause is faulty stripping of libraries with [`strip`](https://man7.org/linux/man-pages/man1/strip.1.html), done [here](https://github.com/bweigel/aws-lambda-tesseract-layer/blob/42b725f653520b2b4d7081998ef8dca6b9b9d7df/Dockerfile#L46).\n\n**Quickfix**\n\u003e You can just disable stripping (comment out the line in the `Dockerfile`) and the libraries (`*.so`) won't be stripped. This also means the library files will be larger and your artifact might exceed Lambda limits.\n\n**A lengthier fix**\n\nAWS Lambda runtimes work on top of Amazon Linux. Depending on the runtime, AWS Lambda uses Amazon Linux version 1 or version 2 under the hood.\nFor example, the Python 3.8 runtime uses Amazon Linux 2, whereas Python \u003c= 3.7 uses version 1.\n\nThe current Dockerfile runs on top of Amazon Linux version 1, so artifacts for runtimes running version 2 will throw the above error.\nIn these cases, you can try using a base Docker image for Amazon Linux 2:\n\n```Dockerfile\nFROM lambci/lambda-base-2:build\n...\n```\n\nor, as @secretshardul suggested\n\n\u003esimple solution: Use AWS cloud9 to deploy example folder. 
Layer can be deployed from anywhere.\n\u003ecomplex solution: Deploy EC2 instance with AMI linux and get correct binaries.\n\n# Contributors :heart:\n\n- @secretshardul\n- @TheLucasMoore for providing a Dockerfile that builds working binaries for Python 3.8 / Amazon Linux 2\n","funding_links":[],"categories":["Layers"],"sub_categories":["Utilities"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbweigel%2Faws-lambda-tesseract-layer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbweigel%2Faws-lambda-tesseract-layer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbweigel%2Faws-lambda-tesseract-layer/lists"}