https://github.com/roberto-a-cardenas/intellidoc-engine
Serverless OCR pipeline on AWS using Lambda, API Gateway, S3, and Textract. Accepts base64 PDFs and returns extracted text via API. Built with Terraform.
api-gateway aws aws-lambda cloud-engineering document-processing ocr s3 serverless terraform textract
- Host: GitHub
- URL: https://github.com/roberto-a-cardenas/intellidoc-engine
- Owner: Roberto-A-Cardenas
- License: mit
- Created: 2025-06-04T03:45:37.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-06-04T04:11:50.000Z (about 1 year ago)
- Last Synced: 2025-06-04T10:30:51.074Z (about 1 year ago)
- Topics: api-gateway, aws, aws-lambda, cloud-engineering, document-processing, ocr, s3, serverless, terraform, textract
- Language: HCL
- Homepage:
- Size: 1.79 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# 🧠 IntelliDoc Engine
**IntelliDoc Engine** is a lightweight, serverless OCR processing pipeline on AWS. It accepts a base64-encoded PDF via an HTTP API, stores it in S3, runs Textract to extract the text, and returns structured JSON.
Built for simplicity, cost efficiency, and clean architecture, with a future-ready path to a secure VPC deployment.
## 🎯 Why I Built This
I built this to explore how serverless document automation can be built from the ground up: securely, affordably, and without vendor lock-in.
This project is part of my AWS portfolio and demonstrates real infrastructure-as-code, API integration, and hands-on debugging in action.





## 🧱 Architecture

> 💡 This diagram reflects the current optimized deployment (no VPC). The optional VPC-secured variant is described below.
## Features
- Accepts base64-encoded PDFs via a public HTTP endpoint
- ☁️ Stores documents in Amazon S3
- Uses Amazon Textract to extract structured text
- Returns clean, readable JSON
- 🧱 Deployed entirely via Terraform for reproducibility
---
## 📦 Sample Payload
```json
{
"filename": "real-test.pdf",
"filedata": "JVBERi0xLjQKJe..."
}
```
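As a minimal sketch, a client might build this payload and the matching POST request as follows. The helper names `build_payload` and `build_request` are hypothetical and not part of the repository, and the API URL must come from your own deployment.

```python
import base64
import json
import urllib.request


def build_payload(filename, pdf_bytes):
    """Return the JSON-serializable body shown in the sample payload."""
    return {
        "filename": filename,
        # base64 output is plain ASCII, so it is safe to embed in JSON.
        "filedata": base64.b64encode(pdf_bytes).decode("ascii"),
    }


def build_request(api_url, payload):
    """Wrap the payload in a POST request with a JSON content type."""
    return urllib.request.Request(
        api_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send it, read the PDF in binary mode, pass the bytes to `build_payload`, and hand the result of `build_request` to `urllib.request.urlopen`.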
---
## 🧪 How It Works
1. The user sends a POST request to the API Gateway endpoint with a base64-encoded PDF.
2. Lambda decodes the file and stores it in S3.
3. Textract analyzes the file via its S3 object reference.
4. The extracted text is returned to the user as a clean JSON array.
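The four steps above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the contents of `ocr_processor.py`: it assumes the synchronous Textract `DetectDocumentText` call, uses the filename as the S3 object key, and reads the bucket from a hypothetical `BUCKET_NAME` environment variable. The clients are passed in so the logic can be exercised without AWS access.

```python
import base64
import json
import os


def extract_lines(textract_response):
    """Collect the text of every LINE block in a Textract response."""
    return [
        block["Text"]
        for block in textract_response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def handle_request(event, s3_client, textract_client, bucket):
    """Decode the PDF, store it in S3, run Textract, return the lines as JSON."""
    # Step 1: the request body carries the JSON payload as a string.
    payload = json.loads(event["body"])
    key = payload["filename"]  # assumption: the filename doubles as the object key
    pdf_bytes = base64.b64decode(payload["filedata"])

    # Step 2: store the decoded file in S3.
    s3_client.put_object(Bucket=bucket, Key=key, Body=pdf_bytes)

    # Step 3: point Textract at the stored object instead of sending bytes inline.
    response = textract_client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )

    # Step 4: return the extracted lines as a JSON array.
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(extract_lines(response)),
    }


def lambda_handler(event, context):
    """Lambda entry point; BUCKET_NAME is a hypothetical environment variable."""
    import boto3  # imported here so the helpers above work without boto3 installed

    return handle_request(
        event,
        boto3.client("s3"),
        boto3.client("textract"),
        os.environ["BUCKET_NAME"],
    )
```

Keeping the AWS clients as parameters is a design choice for this sketch: it lets you test the decode, store, and extract flow locally with stand-in objects before deploying.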
---
## Project Structure
```
Intellidoc-Engine/
├── action/                      # GitHub Actions or automation scripts (if used)
├── lambda/                      # Lambda source code and test files
│   ├── base64.txt
│   ├── lambda.zip
│   ├── nlp_parser.py
│   ├── ocr_processor.py
│   ├── ocr_processor.zip
│   ├── payload.json
│   └── requirements.txt
├── terraform/                   # Root Terraform configuration
│   ├── .terraform/
│   ├── lambda/
│   ├── modules/                 # Custom Terraform modules
│   │   ├── api_gateway/
│   │   ├── lambda/
│   │   ├── s3/
│   │   └── vpc/
│   ├── base64.txt
│   ├── clean.b64
│   ├── main.tf
│   ├── outputs.tf
│   ├── payload.json
│   ├── real-test.b64
│   ├── real-test.pdf
│   ├── terraform.tfstate
│   ├── terraform.tfstate.backup
│   └── variables.tf
├── .gitignore
├── LICENSE
├── README.md
├── base64.txt
├── bucket-policy.json
├── Intellidoc-diagram.pdf
├── nlp_parser.py
├── ocr_processor.py
├── ocr_processor.zip
└── requirements.txt
```
## 🛠️ Troubleshooting
- **Internal Server Error (500) from API Gateway:** Check the CloudWatch logs for Lambda exceptions. Make sure `ocr_processor.py` runs correctly and `payload.json` contains a valid base64-encoded PDF.
- **Textract Access Denied:** Confirm the Lambda role has `textract:*` and the required S3 permissions (`s3:GetObject`, `s3:PutObject`, `s3:ListBucket`).
- **S3 Upload Issues:** Verify the bucket name and object key, and confirm the file was actually uploaded. Add debug logs in the Lambda function if needed.
- **Terraform Module Errors:** Ensure the module paths are correct and run `terraform init` before `terraform apply`.
- **Missing or Broken Lambda Zip:** Rebuild with PowerShell: `Compress-Archive -Path ocr_processor.py -DestinationPath ocr_processor.zip -Force`
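
When chasing a 500 error, it can help to rule out a malformed payload before looking at AWS. The following is a small sketch of such a check; `decode_pdf` is a hypothetical helper, not a function from this repository. It decodes the `filedata` string strictly and confirms that the result starts with the `%PDF-` header.

```python
import base64
import binascii


def decode_pdf(filedata):
    """Decode a base64 string and confirm it looks like a PDF.

    Raises ValueError with a readable message if either check fails.
    """
    try:
        # validate=True rejects any character outside the base64 alphabet.
        pdf_bytes = base64.b64decode(filedata, validate=True)
    except (binascii.Error, ValueError) as exc:
        raise ValueError(f"filedata is not valid base64: {exc}") from exc
    # PDF files begin with the "%PDF-" magic bytes.
    if not pdf_bytes.startswith(b"%PDF-"):
        raise ValueError("decoded data does not start with the %PDF- header")
    return pdf_bytes
```

Strict validation also rejects line breaks, so strip whitespace first if your `.b64` file was wrapped by the tool that produced it.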
---
## Future: VPC-Secure Variant
This project was intentionally deployed with a **public Lambda** (not attached to a VPC) to prioritize:
- ⏱️ Fast cold start time
- 💸 Zero NAT Gateway cost
- 🧪 Easy local and remote testing
For enterprise or compliance-heavy environments, it can be upgraded with:
- A VPC-attached Lambda in private subnets
- A NAT Gateway for outbound Textract access
- A fully isolated, audit-compliant architecture
This variant is planned as a **future branch** of IntelliDoc Engine.
---
## License
This project is licensed under the MIT License; see [`LICENSE`](./LICENSE) for details.
---
## 🧠 Author Notes
> This project represents 20+ hours of focused work building, testing, and refining infrastructure from scratch.
> Every piece was handcrafted for real-world scenarios, not for show. No shortcuts.
> Just clean cloud architecture built with intent and care.