https://github.com/roberto-a-cardenas/intellidoc-engine

Serverless OCR pipeline on AWS using Lambda, API Gateway, S3, and Textract. Accepts base64 PDFs and returns extracted text via API. Built with Terraform.


# 🧠 IntelliDoc Engine

**IntelliDoc Engine** is a lightweight, serverless OCR processing pipeline on AWS. It accepts a base64-encoded PDF via an HTTP API, stores it in S3, runs Textract to extract the text, and returns structured JSON.

Built for simplicity, cost efficiency, and clean architecture, with a future-ready path to a secure VPC deployment.

## 🎯 Why I Built This

To explore how serverless document automation can be built from the ground up: securely, affordably, and without vendor lock-in.
This project is part of my AWS portfolio and demonstrates real infrastructure as code, API integration, and hands-on debugging in action.

![AWS Lambda](https://img.shields.io/badge/AWS-Lambda-orange?logo=amazon-aws&logoColor=white)
![API Gateway](https://img.shields.io/badge/AWS-API_Gateway-purple?logo=amazon-aws&logoColor=white)
![S3](https://img.shields.io/badge/AWS-S3-red?logo=amazon-aws&logoColor=white)
![Textract](https://img.shields.io/badge/AWS-Textract-green?logo=amazon-aws&logoColor=white)
![Terraform](https://img.shields.io/badge/IaC-Terraform-blueviolet?logo=terraform)

## 🧱 Architecture

![IntelliDoc Diagram](intellidoc-diagram.png)

> 💡 This diagram reflects the current optimized deployment (no VPC); the optional VPC-secured variant is described below.

## 🚀 Features

- 📄 Accepts base64-encoded PDFs via a public HTTP endpoint
- ☁️ Stores documents in Amazon S3
- 🔍 Uses Amazon Textract to extract structured text
- 🔁 Returns clean, readable JSON
- 🧱 Deployed entirely via Terraform for reproducibility

---

## 📦 Sample Payload

```json
{
  "filename": "real-test.pdf",
  "filedata": "JVBERi0xLjQKJe..."
}
```
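A payload of this shape can be produced from a PDF's raw bytes. The following is a minimal sketch, not code from this repository: `build_payload` is a hypothetical helper name, and the `%PDF-` header check is an added sanity test rather than something the API is known to require.

```python
import base64
import json


def build_payload(filename: str, pdf_bytes: bytes) -> str:
    """Return the JSON request body for a PDF given as raw bytes.

    Hypothetical helper: the field names come from the sample payload,
    everything else is an assumption for illustration.
    """
    # Every PDF starts with the "%PDF-" magic header; reject anything else early.
    if not pdf_bytes.startswith(b"%PDF-"):
        raise ValueError("input does not look like a PDF (missing %PDF- header)")
    return json.dumps({
        "filename": filename,
        # Standard base64, decoded to ASCII so it can be embedded in JSON.
        "filedata": base64.b64encode(pdf_bytes).decode("ascii"),
    })


if __name__ == "__main__":
    # Tiny in-memory stand-in for a real file read with open(path, "rb").
    print(build_payload("real-test.pdf", b"%PDF-1.4\n"))
```

Encoding the header `%PDF-1.4` is what gives the `JVBERi0xLjQK` prefix seen in the sample `filedata` value, which makes that prefix a quick visual check that a payload really carries a PDF.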

---

## 🧪 How It Works

1. Users send a POST request to the API Gateway endpoint with a base64-encoded PDF.
2. Lambda decodes the file and stores it in S3.
3. Textract analyzes the file via an S3 reference.
4. The extracted text is returned to the user as a clean JSON array.
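The four steps above can be sketched as a single Lambda function. This is an illustrative outline under stated assumptions, not the contents of `ocr_processor.py`: the function name `process_document`, the `BUCKET_NAME` environment variable, and the choice of Textract's synchronous `detect_document_text` call are all assumptions, and the event is assumed to carry the JSON payload as a string in `body`.

```python
import base64
import json


def process_document(event, s3, textract, bucket):
    """Decode the upload, store it in S3, and return Textract's text lines.

    `s3` and `textract` are boto3 clients (or test doubles exposing the same
    methods); `bucket` is the target bucket name. All names are hypothetical.
    """
    # Step 1: the request body is assumed to be a JSON string.
    body = json.loads(event["body"])
    key = body["filename"]
    pdf_bytes = base64.b64decode(body["filedata"], validate=True)

    # Step 2: store the decoded PDF in S3.
    s3.put_object(Bucket=bucket, Key=key, Body=pdf_bytes)

    # Step 3: point Textract at the stored object instead of sending bytes.
    result = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )

    # Step 4: keep only LINE blocks and return them as a JSON array.
    lines = [b["Text"] for b in result["Blocks"] if b["BlockType"] == "LINE"]
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(lines),
    }


def lambda_handler(event, context):
    # Imported lazily so process_document can be exercised without AWS access.
    import os

    import boto3

    return process_document(
        event,
        boto3.client("s3"),
        boto3.client("textract"),
        os.environ["BUCKET_NAME"],  # assumed environment variable
    )
```

Passing the clients in as arguments keeps the core logic testable with simple stand-ins. Note that the synchronous Textract call handles single-page documents; multi-page PDFs would likely need the asynchronous Textract operations instead.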

---

## 📁 Project Structure

```
Intellidoc-Engine/
├── action/                     # GitHub Actions or automation scripts (if used)
├── lambda/                     # Lambda source code and test files
│   ├── base64.txt
│   ├── lambda.zip
│   ├── nlp_parser.py
│   ├── ocr_processor.py
│   ├── ocr_processor.zip
│   ├── payload.json
│   └── requirements.txt
├── terraform/                  # Root Terraform configuration
│   ├── .terraform/
│   ├── lambda/
│   ├── modules/                # Custom Terraform modules
│   │   ├── api_gateway/
│   │   ├── lambda/
│   │   ├── s3/
│   │   └── vpc/
│   ├── base64.txt
│   ├── clean.b64
│   ├── main.tf
│   ├── outputs.tf
│   ├── payload.json
│   ├── real-test.b64
│   ├── real-test.pdf
│   ├── terraform.tfstate
│   ├── terraform.tfstate.backup
│   └── variables.tf
├── .gitignore
├── LICENSE
├── README.md
├── base64.txt
├── bucket-policy.json
├── Intellidoc-diagram.pdf
├── nlp_parser.py
├── ocr_processor.py
├── ocr_processor.zip
└── requirements.txt
```

## 🛠️ Troubleshooting

**Internal Server Error (500) from API Gateway**
→ Check the CloudWatch logs for Lambda exceptions. Ensure `ocr_processor.py` runs correctly and that `payload.json` contains a valid base64-encoded PDF.

**Textract Access Denied**
→ Confirm that the Lambda role has `textract:*` and the required S3 permissions (`s3:GetObject`, `s3:PutObject`, `s3:ListBucket`).

**S3 Upload Issues**
→ Verify the bucket name and object key, and confirm that files are actually being uploaded. Add debug logs in the Lambda function.

**Terraform Module Errors**
→ Ensure module paths are correct and run `terraform init` before `terraform apply`.

**Missing or Broken Lambda Zip**
→ Rebuild with PowerShell:
`Compress-Archive -Path ocr_processor.py -DestinationPath ocr_processor.zip -Force`
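
On machines without PowerShell, the same archive can be rebuilt with Python's standard library. This is a minimal sketch of an equivalent, not a script from this repository; `build_lambda_zip` is a hypothetical helper, and the default file names simply mirror the PowerShell command.

```python
import zipfile
from pathlib import Path


def build_lambda_zip(source: str = "ocr_processor.py",
                     dest: str = "ocr_processor.zip") -> Path:
    """Write `source` into a fresh zip at `dest`, replacing any old archive.

    Hypothetical helper; the defaults mirror the PowerShell command.
    """
    src = Path(source)
    if not src.is_file():
        raise FileNotFoundError(f"Lambda source not found: {src}")
    # Mode "w" truncates an existing archive, similar to Compress-Archive -Force.
    with zipfile.ZipFile(dest, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname places the file at the archive root, where Lambda looks
        # for the handler module.
        zf.write(src, arcname=src.name)
    return Path(dest)
```

Calling `build_lambda_zip()` from the directory that contains `ocr_processor.py` produces `ocr_processor.zip` next to it, ready for the next `terraform apply`.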

---

## 🔒 Future: VPC-Secure Variant

This project was intentionally deployed with a **public Lambda** (not attached to a VPC) to prioritize:
- ⏱️ Fast cold start time
- 💸 Zero NAT Gateway cost
- 🧪 Easy local + remote testing

For enterprise or compliance-heavy environments, it can be upgraded with:
- VPC-attached Lambda in private subnets
- NAT Gateway for outbound Textract access
- Fully isolated, audit-compliant architecture

This variant is planned as a **future branch** of IntelliDoc Engine.

---

## 📜 License

This project is licensed under the MIT License; see [`LICENSE`](./LICENSE) for details.

---

## 🧠 Author Notes
> This project represents 20+ hours of focused work building, testing, and refining infrastructure from scratch.
> Every piece was handcrafted for real-world scenarios, not for show. No shortcuts.
> Just clean cloud architecture built with intent and care.