{"id":27288343,"url":"https://github.com/iwstkhr/aws-lambda-tesseract-ocr-example","last_synced_at":"2025-04-11T20:33:04.685Z","repository":{"id":134123909,"uuid":"505992987","full_name":"iwstkhr/aws-lambda-tesseract-ocr-example","owner":"iwstkhr","description":"Tesseract OCR Sample with AWS Lambda Container Images using AWS SAM","archived":false,"fork":false,"pushed_at":"2025-02-22T03:06:14.000Z","size":804,"stargazers_count":3,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-25T16:23:24.294Z","etag":null,"topics":["aws","lambda","tesseract-ocr"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/iwstkhr.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-06-21T20:24:10.000Z","updated_at":"2025-02-22T06:03:55.000Z","dependencies_parsed_at":null,"dependency_job_id":"eaf2cfa8-70da-431d-9e96-b4195819dab4","html_url":"https://github.com/iwstkhr/aws-lambda-tesseract-ocr-example","commit_stats":null,"previous_names":["iwstkhr/aws-lambda-tesseract-ocr-example"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iwstkhr%2Faws-lambda-tesseract-ocr-example","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iwstkhr%2Faws-lambda-tesseract-ocr-example/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iwstkhr%2Faws-lambda-tesseract-ocr-example/releases","manifests_url":"https://repos.ecosyste.ms/api/v
1/hosts/GitHub/repositories/iwstkhr%2Faws-lambda-tesseract-ocr-example/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/iwstkhr","download_url":"https://codeload.github.com/iwstkhr/aws-lambda-tesseract-ocr-example/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248476385,"owners_count":21110267,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","lambda","tesseract-ocr"],"created_at":"2025-04-11T20:31:59.596Z","updated_at":"2025-04-11T20:33:04.670Z","avatar_url":"https://github.com/iwstkhr.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Introduction\n\nDevelopers can run [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) with [pytesseract](https://pypi.org/project/pytesseract/) using [Lambda container images](https://docs.aws.amazon.com/lambda/latest/dg/images-create.html) for efficient and scalable OCR operations.\n\n## Prerequisites\n\nEnsure the following tools are installed on your system:\n\n- [AWS SAM](https://aws.amazon.com/serverless/sam/)\n- Python 3.x\n\n## Setting Up the Project\n\n### Writing the AWS SAM Template\n\nThe following SAM template sets up the Lambda function triggered by EventBridge, since API Gateway has a [maximum timeout limit of 29 seconds](https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html). 
The sample Python script execution exceeds this limit.\n\n```yaml\nAWSTemplateFormatVersion: '2010-09-09'\nTransform: AWS::Serverless-2016-10-31\nDescription: Tesseract OCR Sample with AWS Lambda Container Images using AWS SAM\nResources:\n  TesseractOcrSample:\n    Type: AWS::Serverless::Function\n    Properties:\n      Events:\n        Schedule:\n          Type: Schedule\n          Properties:\n            Enabled: true\n            Schedule: cron(0 * * * ? *)\n      MemorySize: 512\n      PackageType: Image\n      Timeout: 900\n    Metadata:\n      DockerTag: latest\n      DockerContext: ./src/\n      Dockerfile: Dockerfile\n```\n\n## Creating the Dockerfile\n\nCreate a `Dockerfile` to define the runtime environment. If your application processes text in a specific language like Japanese, set the **`LANG` environment variable** (line 3) accordingly to avoid encoding issues.\n\n```dockerfile line=\"3\"\nFROM public.ecr.aws/lambda/python:3.9\n\nENV LANG=ja_JP.UTF-8\nWORKDIR ${LAMBDA_TASK_ROOT}\nCOPY app.py ./\nCOPY requirements.txt ./\nCOPY run-melos.pdf ./\nRUN rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm \\\n    \u0026\u0026 yum update -y \u0026\u0026 yum install -y poppler-utils tesseract tesseract-langpack-jpn \\\n    \u0026\u0026 pip install -U pip \u0026\u0026 pip install -r requirements.txt --target \"${LAMBDA_TASK_ROOT}\"\n\nCMD [\"app.lambda_handler\"]\n```\n\n## Writing the Python Script\n\n### Define `requirements.txt`\n\nAdd the required libraries to `requirements.txt`.\n\n```text\npdf2image==1.16.0\npytesseract==0.3.9\n```\n\n### Implement `app.py`\n\nThe script converts a PDF to images, performs OCR, and logs the results.\n\n```python\nimport re\nfrom datetime import datetime\n\nimport pdf2image\nimport pytesseract\n\n\ndef lambda_handler(event: dict, context: dict) -\u003e None:\n    start = datetime.now()\n    result = ''\n\n    images = to_images('run-melos.pdf', 1, 2)\n    for image in images:\n        result += 
to_string(image)\n    result = normalize(result)\n\n    end = datetime.now()\n    duration = end.timestamp() - start.timestamp()\n\n    print('----------------------------------------')\n    print(f'Start: {start}')\n    print(f'End: {end}')\n    print(f'Duration: {int(duration)} seconds')\n    print(f'Result: {result}')\n    print('----------------------------------------')\n\n\ndef to_images(pdf_path: str, first_page: int = None, last_page: int = None) -\u003e list:\n    \"\"\" Convert PDF pages to PNG images.\n\n    Args:\n        pdf_path (str): PDF path\n        first_page (int): First page to convert (1-based)\n        last_page (int): Last page to convert\n\n    Returns:\n        list: List of PIL images, one per page\n    \"\"\"\n\n    print(f'Converting PDF ({pdf_path}) to PNG images...')\n    images = pdf2image.convert_from_path(\n        pdf_path=pdf_path,\n        fmt='png',\n        first_page=first_page,\n        last_page=last_page,\n    )\n    print(f'Converted {len(images)} page(s) to PNG.')\n    return images\n\n\ndef to_string(image) -\u003e str:\n    \"\"\" Run OCR on an image.\n\n    Args:\n        image: PIL image to process\n\n    Returns:\n        str: Text recognized by OCR\n    \"\"\"\n\n    print('Extracting characters from an image...')\n    return pytesseract.image_to_string(image, lang='jpn')\n\n\ndef normalize(target: str) -\u003e str:\n    \"\"\" Normalize result text.\n\n    Applies the following:\n    - Remove newlines.\n    - Remove spaces between Japanese characters.\n\n    Args:\n        target (str): Target text to be normalized\n\n    Returns:\n        str: Normalized text\n    \"\"\"\n\n    result = target.replace('\\n', '')\n    # Raw string so that \\s reaches the regex engine unchanged.\n    result = re.sub(r'([あ-んア-ン一-鿐])\\s+(?=[あ-んア-ン一-鿐])', r'\\1', result)\n    return result\n```\n\n## Building and Deploying\n\n### Build the Application\n\nRun the following command to build the application:\n\n```shell\nsam build\n```\n\nExecute the following command to run the application 
locally:\n\n```shell\nsam local invoke\n```\n\n### Deploy the Application\n\nIf an ECR repository does not exist, create one:\n\n```shell\naws ecr create-repository --repository-name tesseract-ocr-lambda\n```\n\nDeploy the application:\n\n```shell\nsam deploy \\\n  --stack-name aws-lambda-tesseract-ocr-sample \\\n  --image-repository 123456789012.dkr.ecr.ap-northeast-1.amazonaws.com/tesseract-ocr-lambda \\\n  --capabilities CAPABILITY_IAM\n```\n\nAfter deployment, the Lambda function will run hourly and the OCR results will be written to CloudWatch Logs.\n\n## Cleaning Up\n\nTo clean up the provisioned AWS resources, use the following command:\n\n```shell\nsam delete --stack-name aws-lambda-tesseract-ocr-sample\n```\n\n## Conclusion\n\nRunning Tesseract OCR in AWS Lambda using container images provides an efficient, scalable way to handle complex OCR workflows.\n\nHappy Coding! 🚀\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiwstkhr%2Faws-lambda-tesseract-ocr-example","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fiwstkhr%2Faws-lambda-tesseract-ocr-example","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiwstkhr%2Faws-lambda-tesseract-ocr-example/lists"}