{"id":27914950,"url":"https://github.com/ozcanmiraay/opsbot","last_synced_at":"2026-04-13T21:01:40.961Z","repository":{"id":284804110,"uuid":"950863266","full_name":"ozcanmiraay/opsbot","owner":"ozcanmiraay","description":"AI-powered PDF extraction suite for structured insights from contracts, forms, and documents. Built with Streamlit, LangChain, GPT-4o, and PDFPlumber.","archived":false,"fork":false,"pushed_at":"2025-04-04T23:43:21.000Z","size":10078,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-03T19:19:38.039Z","etag":null,"topics":["automation","contracts","document-ai","gpt-4o","langchain","openai","pdf-extraction","streamlit","structured-data"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ozcanmiraay.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-03-18T19:51:15.000Z","updated_at":"2025-04-04T23:43:24.000Z","dependencies_parsed_at":"2025-03-27T19:33:29.256Z","dependency_job_id":"32cf994f-1188-4022-bc10-e1a7e6a01143","html_url":"https://github.com/ozcanmiraay/opsbot","commit_stats":null,"previous_names":["ozcanmiraay/opsbot"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ozcanmiraay/opsbot","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ozcanmiraay%2Fopsbot","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ozcanmiraay%2Fopsbot/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/repositories/ozcanmiraay%2Fopsbot/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ozcanmiraay%2Fopsbot/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ozcanmiraay","download_url":"https://codeload.github.com/ozcanmiraay/opsbot/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ozcanmiraay%2Fopsbot/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31770726,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-13T20:17:16.280Z","status":"ssl_error","status_checked_at":"2026-04-13T20:17:08.216Z","response_time":93,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automation","contracts","document-ai","gpt-4o","langchain","openai","pdf-extraction","streamlit","structured-data"],"created_at":"2025-05-06T15:32:51.222Z","updated_at":"2026-04-13T21:01:40.956Z","avatar_url":"https://github.com/ozcanmiraay.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 📄 Intelligent PDF Extractor Suite  \nBuilt with 💙 by **Miray Ozcan** | Powered by **Streamlit + GPT-4o + Agentic Document Extractor + LangChain + PDFPlumber**\n\n\u003e **Solve real internal bottlenecks with AI.**  \n\u003e This multi-version toolset tackles one of the biggest pain points faced by product operations and cross-functional teams: 
**valuable business data trapped inside messy PDFs** like contracts, onboarding forms, invoices, or configuration summaries.  \n\n---\n\n## 🔧 Motivation\n\nDuring my internship interview at **Proscia**, I spoke with the Product Operations Lead and uncovered key internal pain points:\n\n\u003e ⚠️ _“The data often lives in PDFs we’ve signed with customers, but it’s never made it into a spreadsheet that’s queryable... It’s scattered, manual, inconsistent, or siloed. If we could automatically pull structured info from these documents and present it cleanly, we’d save hours per deal.”_  \n\nThis repo aims to **automate that transformation pipeline**, turning unstructured PDFs into structured, queryable, and exportable datasets with just a few clicks.\n\n---\n\n## 🚦 App Versions Overview\n\n| Version | Branch | Stack | Best For | Summary |\n|--------|--------|-------|----------|---------|\n| **v0** | `main` | `PDFPlumber + OpenAI` | ✅ Quick prototyping\u003cbr\u003e✅ Lightweight extractions\u003cbr\u003e✅ Page-by-page summaries | Extracts raw text and tables from PDFs using `pdfplumber`, then summarizes via GPT-4o. Ideal for simple forms or multi-page review. |\n| **v1** | `v1`   | `LangChain + GPT-4o` | ✅ Structured data schema\u003cbr\u003e✅ Automating contract ingestion\u003cbr\u003e✅ Extracting JSON records | Uses LangChain's document loader and schema detection to extract structured records with schema customization. Great for generating tabular insights from customer contracts. |\n| **v2** | `v2`   | `Agentic Document Extraction (ADE)` | ✅ Formatted documents\u003cbr\u003e✅ Contracts w/ visual structure\u003cbr\u003e✅ Section-level summaries | Sends full PDFs to an external ADE API and groups semantic chunks into labeled sections. Excellent for internal PDF templates or procurement docs. 
|\n\n---\n\n## 🧠 Feature Comparison\n\n| Feature | v0 | v1 | v2 |\n|--------|----|----|----|\n| Extracts Tables | ✅ | ✅ | ✅ |\n| Extracts Raw Text | ✅ | ✅ | ✅ |\n| Section-Based Summarization | 🚫 | ✅ | ✅ |\n| Structured JSON Record Extraction | 🚫 | ✅ | 🚫 |\n| Custom Schema Selection | 🚫 | ✅ | 🚫 |\n| Full PDF Summarization | ✅ | 🚫 | ✅ |\n| Export CSV | ✅ | ✅ | ✅ |\n| Best For | Simpler docs | Tabular contract data | Long-form/styled PDFs |\n\n---\n\n## ▶️ How to Run\n\nEach version lives in its own branch. Follow these steps to get started:\n\n### 1. ⬇️ Clone the repo:\n```bash\ngit clone https://github.com/ozcanmiraay/opsbot.git\ncd opsbot\n```\n\n### 2. 🐍 Create and activate a virtual environment:\n```bash\npython3 -m venv venv\nsource venv/bin/activate     # macOS/Linux\n# OR\nvenv\\Scripts\\activate        # Windows\n```\n\n### 3. 📦 Install required dependencies:\n```bash\npip install -r requirements.txt\n```\n\n### 4. 🔐 Set up your API keys:\n\nCreate a `.env` file in the root directory with the following content:\n\n```\nOPENAI_API_KEY=your-openai-api-key\nADE_API_KEY=your-agentic-doc-extraction-api-key\n```\n\n- Get your **OpenAI API key** [here](https://platform.openai.com/account/api-keys)  \n- Request access to the **Agentic Document Extractor API (ADE)** [here](https://support.landing.ai/landinglens/docs/visionagent-api-key)\n  \n---\n\n## 🚀 Launch the App\n\n### ⚙️ v0: PDFPlumber Text \u0026 Table Extractor\n```bash\ngit checkout main\nstreamlit run app/streamlit_app.py\n```\n\n### 🧠 v1: LangChain Schema-Based Extractor\n```bash\ngit checkout v1\nstreamlit run app/streamlit_app.py\n```\n\n### 🤖 v2: Agentic Document Intelligence Viewer\n```bash\ngit checkout v2\nstreamlit run ui/streamlit_app.py\n```\n\n---\n\n## 📸 Screenshots\n\n| v0: Lightweight PDF Reader | v1: LangChain Schema Extractor | v2: ADE-Powered Chunk Viewer |\n|----------------------------|-------------------------------|-------------------------------|\n| ![v0 
Screenshot](./screenshots/v0.png) | ![v1 Screenshot](./screenshots/v1.png) | ![v2 Screenshot](./screenshots/v2.png) |\n\n---\n\n## 🎥 Demo Videos\n\n| PDFPlumber \u0026 LangChain-Based Extractor (v0, v1) | LangChain-Based Extractor \u0026 ADE-Powered Viewer (v1, v2) |\n|-------------------------------|--------------------------|\n| [Watch Demo Part 1!](https://www.loom.com/share/93ca1ac870d0480580b5a5d2d93db4f2?sid=126cd365-f141-4e12-817e-20df70afccaa) | [Watch Demo Part 2!](https://www.loom.com/share/91cff4e13f054cb2b4ed158fc494d866?sid=6da7125e-48c6-42ce-832b-9e486032f49f) |\n\n\u003e 🔗 Click on a link to watch a 5-minute walkthrough on Loom.\n\n---\n\n## 🧩 Real-World Use Cases\n\n- **Contract Intelligence**: Pulling features, pricing, infrastructure specs, and deployment configurations from customer contracts.\n- **Sales Enablement**: Exporting client configuration from PDFs into CRM fields automatically.\n- **Internal Alignment**: Creating dashboards where executives and department leaders view only the data relevant to them.\n- **Audit Readiness**: Summarizing past signed forms and validating consistency across regions.\n\n---\n\n## 🛠️ Tech Stack\n\n- **LLM**: GPT-4o via `langchain-openai`\n- **Document Parsing**: `pdfplumber`, `PyPDFLoader`, Agentic Document Extractor by LandingAI\n- **Interface**: Streamlit\n- **Helpers**: LangChain prompt pipelines, recursive chunking, CSV export, HTML table rendering\n\n---\n\n## 🙌 Credits\n\nSpecial thanks to the **Product Operations Lead at Proscia** for their insights and support in identifying real automation opportunities that can drive cross-team efficiency.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fozcanmiraay%2Fopsbot","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fozcanmiraay%2Fopsbot","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fozcanmiraay%2Fopsbot/lists"}