{"id":28560846,"url":"https://github.com/roberto-a-cardenas/intellidoc-engine","last_synced_at":"2026-04-22T23:34:11.269Z","repository":{"id":297153486,"uuid":"995815834","full_name":"Roberto-A-Cardenas/Intellidoc-Engine","owner":"Roberto-A-Cardenas","description":"Serverless OCR pipeline on AWS using Lambda, API Gateway, S3, and Textract. Accepts base64 PDFs and returns extracted text via API. Built with Terraform.","archived":false,"fork":false,"pushed_at":"2025-06-04T04:11:50.000Z","size":1875,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-06-04T10:30:51.074Z","etag":null,"topics":["api-gateway","aws","aws-lambda","cloud-engineering","document-processing","ocr","s3","serverless","terraform","textract"],"latest_commit_sha":null,"homepage":"","language":"HCL","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Roberto-A-Cardenas.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-04T03:45:37.000Z","updated_at":"2025-06-04T04:16:51.000Z","dependencies_parsed_at":"2025-06-04T10:32:47.004Z","dependency_job_id":null,"html_url":"https://github.com/Roberto-A-Cardenas/Intellidoc-Engine","commit_stats":null,"previous_names":["roberto-a-cardenas/intellidoc-engine"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Roberto-A-Cardenas/Intellidoc-Engine","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Roberto-A-Cardenas%2FIntellidoc-Engine","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/Roberto-A-Cardenas%2FIntellidoc-Engine/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Roberto-A-Cardenas%2FIntellidoc-Engine/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Roberto-A-Cardenas%2FIntellidoc-Engine/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Roberto-A-Cardenas","download_url":"https://codeload.github.com/Roberto-A-Cardenas/Intellidoc-Engine/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Roberto-A-Cardenas%2FIntellidoc-Engine/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262928033,"owners_count":23386070,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["api-gateway","aws","aws-lambda","cloud-engineering","document-processing","ocr","s3","serverless","terraform","textract"],"created_at":"2025-06-10T10:01:15.015Z","updated_at":"2026-04-22T23:34:11.237Z","avatar_url":"https://github.com/Roberto-A-Cardenas.png","language":"HCL","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🧠 IntelliDoc Engine\n\n**IntelliDoc Engine** is a lightweight, serverless OCR processing pipeline on AWS. 
It accepts a base64-encoded PDF via an HTTP API, stores it in S3, runs Textract to extract text, and returns structured JSON.\n\nBuilt for simplicity, cost efficiency, and clean architecture, with a future-ready path to a secure VPC deployment.\n\n## 🎯 Why I Built This\n\nTo explore how serverless document automation can be built from the ground up securely, affordably, and without vendor lock-in.  \nThis project is part of my AWS portfolio to demonstrate real infrastructure as code, API integration, and hands-on debugging in action.\n\n![AWS Lambda](https://img.shields.io/badge/AWS-Lambda-orange?logo=amazon-aws\u0026logoColor=white)\n![API Gateway](https://img.shields.io/badge/AWS-API_Gateway-purple?logo=amazon-aws\u0026logoColor=white)\n![S3](https://img.shields.io/badge/AWS-S3-red?logo=amazon-aws\u0026logoColor=white)\n![Textract](https://img.shields.io/badge/AWS-Textract-green?logo=amazon-aws\u0026logoColor=white)\n![Terraform](https://img.shields.io/badge/IaC-Terraform-blueviolet?logo=terraform)\n\n\n## 🧱 Architecture\n\n![IntelliDoc Diagram](intellidoc-diagram.png)\n\n\u003e 💡 This diagram reflects the current optimized deployment (no VPC); the optional VPC-secured variant is described below.\n\n\n## 🚀 Features\n\n- 📄 Accepts base64-encoded PDFs via a public HTTP endpoint\n- ☁️ Stores documents in Amazon S3\n- 🔍 Uses Amazon Textract to extract structured text\n- 🔁 Returns clean, readable JSON\n- 🧱 Deployed entirely via Terraform for reproducibility\n\n---\n\n## 📦 Sample Payload\n\n```json\n{\n  \"filename\": \"real-test.pdf\",\n  \"filedata\": \"JVBERi0xLjQKJe...\"\n}\n```\n\n---\n\n## 🧪 How It Works\n\n1. Users send a POST request to the API Gateway with a base64-encoded PDF.\n2. Lambda decodes and stores the file in S3.\n3. Textract analyzes the file via its S3 reference.\n4. Extracted text is returned to the user as a clean JSON array.\n\n---\n\n## 📁 Project Structure\n\n```\nIntellidoc-Engine/\n├── action/                        # GitHub Actions or automation scripts (if used)\n├── lambda/                        # Lambda source code and test files\n│   ├── base64.txt\n│   ├── lambda.zip\n│   ├── nlp_parser.py\n│   ├── ocr_processor.py\n│   ├── ocr_processor.zip\n│   ├── payload.json\n│   └── requirements.txt\n├── terraform/                     # Root Terraform configuration\n│   ├── .terraform/\n│   ├── lambda/\n│   ├── modules/                   # Custom Terraform modules\n│   │   ├── api_gateway/\n│   │   ├── lambda/\n│   │   ├── s3/\n│   │   └── vpc/\n│   ├── base64.txt\n│   ├── clean.b64\n│   ├── main.tf\n│   ├── outputs.tf\n│   ├── payload.json\n│   ├── real-test.b64\n│   ├── real-test.pdf\n│   ├── terraform.tfstate\n│   ├── terraform.tfstate.backup\n│   └── variables.tf\n├── .gitignore\n├── LICENSE\n├── README.md\n├── base64.txt\n├── bucket-policy.json\n├── Intellidoc-diagram.pdf\n├── nlp_parser.py\n├── ocr_processor.py\n├── ocr_processor.zip\n└── requirements.txt\n\n```\n\n## 🛠️ Troubleshooting\n\n**Internal Server Error (500) from API Gateway**  \n→ Check the CloudWatch logs for Lambda exceptions. Ensure `ocr_processor.py` runs correctly and `payload.json` contains a valid base64-encoded PDF.\n\n**Textract Access Denied**  \n→ Confirm the Lambda role has `textract:*` and the proper S3 permissions (`s3:GetObject`, `s3:PutObject`, `s3:ListBucket`).\n\n**S3 Upload Issues**  \n→ Verify the bucket name and object key, and confirm that files are actually uploaded. Add debug logs in Lambda.\n\n**Terraform Module Errors**  \n→ Ensure module paths are correct and run `terraform init` before `terraform apply`.\n\n**Missing or Broken Lambda Zip**  \n→ Rebuild with PowerShell:  \n`Compress-Archive -Path ocr_processor.py -DestinationPath ocr_processor.zip -Force`\n\n---\n\n## 🔒 Future: VPC-Secure Variant\n\nThis project was intentionally deployed with a **public (non-VPC) Lambda** to prioritize:\n- ⏱️ Fast cold start time\n- 💸 Zero NAT Gateway cost\n- 🧪 Easy local + remote testing\n\nFor enterprise or compliance-heavy environments, it can be upgraded with:\n- VPC-attached Lambda in private subnets\n- NAT Gateway for outbound Textract access\n- Fully isolated, audit-compliant architecture\n\nThis variant is planned as a **future branch** of IntelliDoc Engine.\n\n---\n\n## 📜 License\n\nThis project is licensed under the MIT License; see [`LICENSE`](./LICENSE) for details.\n\n---\n\n## 🧠 Author Notes\n\u003e This project represents 20+ hours of focused work building, testing, and refining infrastructure from scratch.  \n\u003e Every piece was handcrafted for real-world scenarios, not for show. No shortcuts.  \n\u003e Just clean cloud architecture built with intent and care.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Froberto-a-cardenas%2Fintellidoc-engine","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Froberto-a-cardenas%2Fintellidoc-engine","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Froberto-a-cardenas%2Fintellidoc-engine/lists"}