https://github.com/ozcanmiraay/opsbot
AI-powered PDF extraction suite for structured insights from contracts, forms, and documents. Built with Streamlit, LangChain, GPT-4o, and PDFPlumber.
automation contracts document-ai gpt-4o langchain openai pdf-extraction streamlit structured-data
- Host: GitHub
- URL: https://github.com/ozcanmiraay/opsbot
- Owner: ozcanmiraay
- Created: 2025-03-18T19:51:15.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-04-04T23:43:21.000Z (about 1 year ago)
- Last Synced: 2025-06-03T19:19:38.039Z (about 1 year ago)
- Topics: automation, contracts, document-ai, gpt-4o, langchain, openai, pdf-extraction, streamlit, structured-data
- Language: Python
- Homepage:
- Size: 9.61 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Intelligent PDF Extractor Suite
Built by **Miray Ozcan** | Powered by **Streamlit + GPT-4o + Agentic Document Extractor + LangChain + PDFPlumber**
> **Solve real internal bottlenecks with AI.**
> This multi-version toolset tackles one of the biggest pain points faced by product operations and cross-functional teams: **valuable business data trapped inside messy PDFs** like contracts, onboarding forms, invoices, or configuration summaries.
---
## Motivation
While interviewing for an internship at **Proscia**, I spoke with the Product Operations Lead and uncovered key internal pain points:
> ⚠️ _"The data often lives in PDFs we've signed with customers, but it's never made it into a spreadsheet that's queryable... It's scattered, manual, inconsistent, or siloed. If we could automatically pull structured info from these documents and present it cleanly, we'd save hours per deal."_

This repo aims to **automate that transformation pipeline**, turning unstructured PDFs into structured, queryable, and exportable datasets with just a few clicks.

---
## App Versions Overview
| Version | Branch | Stack | Best For | Summary |
|--------|--------|-------|----------|---------|
| **v0** | `main` | `PDFPlumber + OpenAI` | ✅ Quick prototyping<br>✅ Lightweight extractions<br>✅ Page-by-page summaries | Extracts raw text and tables from PDFs using `pdfplumber`, then summarizes via GPT-4o. Ideal for simple forms or multi-page review. |
| **v1** | `v1` | `LangChain + GPT-4o` | ✅ Structured data schema<br>✅ Automating contract ingestion<br>✅ Extracting JSON records | Uses LangChain's document loader and schema detection to extract structured records with schema customization. Great for generating tabular insights from customer contracts. |
| **v2** | `v2` | `Agentic Document Extraction (ADE)` | ✅ Formatted documents<br>✅ Contracts w/ visual structure<br>✅ Section-level summaries | Sends full PDFs to an external ADE API and groups semantic chunks into labeled sections. Excellent for internal PDF templates or procurement docs. |
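
The v0 flow in the table above (pull raw text and tables with `pdfplumber`, then pass them on for summarization) can be sketched roughly as follows. This is a minimal illustration, not the repo's actual code: the function names `extract_pages` and `table_to_csv` are hypothetical, and the GPT-4o summarization step is left out.

```python
import csv
import io


def extract_pages(pdf_path):
    """Return one dict per page with its raw text and any detected tables."""
    # Third-party dependency; imported lazily so table_to_csv works without it.
    import pdfplumber

    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            pages.append({
                "page": number,
                # extract_text() can return None for pages with no text layer.
                "text": page.extract_text() or "",
                # Each table is a list of rows; each row is a list of cells.
                "tables": page.extract_tables(),
            })
    return pages


def table_to_csv(table):
    """Serialize one extracted table (rows of cells, possibly None) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()
```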
---
## Feature Comparison
| Feature | v0 | v1 | v2 |
|--------|----|----|----|
| Extracts Tables | ✅ | ✅ | ✅ |
| Extracts Raw Text | ✅ | ✅ | ✅ |
| Section-Based Summarization | 🚫 | ✅ | ✅ |
| Structured JSON Record Extraction | 🚫 | ✅ | 🚫 |
| Custom Schema Selection | 🚫 | ✅ | 🚫 |
| Full PDF Summarization | ✅ | 🚫 | ✅ |
| Export CSV | ✅ | ✅ | ✅ |
| Best For | Simpler docs | Tabular contract data | Long-form/styled PDFs |
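
To make the "Structured JSON Record Extraction" and "Export CSV" rows concrete, here is a small sketch of turning extracted JSON records into CSV text with the standard library. The record fields (`customer`, `seats`, `region`) and the function name `records_to_csv` are made up for illustration; the schema the app actually produces depends on the document and the selected schema.

```python
import csv
import io
import json


def records_to_csv(records):
    """Write a list of dicts to CSV, using the union of all keys as columns."""
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)  # keep first-seen column order

    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)  # keys missing from a record are filled with restval
    return buffer.getvalue()


# Hypothetical records, standing in for what an LLM extraction step might return.
sample = json.loads(
    '[{"customer": "Acme", "seats": 25}, {"customer": "Globex", "region": "EU"}]'
)
csv_text = records_to_csv(sample)
```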
---
## How to Run
Each version lives in its own branch. Follow these steps to get started:
### 1. Clone the repo:
```bash
git clone https://github.com/ozcanmiraay/opsbot.git
cd opsbot
```
### 2. Create and activate a virtual environment:
```bash
python3 -m venv venv
source venv/bin/activate # macOS/Linux
# OR
venv\Scripts\activate # Windows
```
### 3. Install required dependencies:
```bash
pip install -r requirements.txt
```
### 4. Set up your API keys:
Create a `.env` file in the root directory with the following content:
```
OPENAI_API_KEY=your-openai-api-key
ADE_API_KEY=your-agentic-doc-extraction-api-key
```
- Get your **OpenAI API key** [here](https://platform.openai.com/account/api-keys)
- Request access to the **Agentic Document Extractor API (ADE)** [here](https://support.landing.ai/landinglens/docs/visionagent-api-key)
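
Before launching an app, it can help to confirm that both keys are actually visible to Python. The sketch below is an assumption-laden illustration, not code from the repo: `missing_keys` and `load_env_file` are hypothetical helpers, and `python-dotenv` is assumed to be the loader (if it is not installed, the helper falls back to whatever the shell environment already provides).

```python
import os

REQUIRED_KEYS = ("OPENAI_API_KEY", "ADE_API_KEY")


def load_env_file():
    """Load .env via python-dotenv if available; return False when it is not."""
    try:
        from dotenv import load_dotenv  # third-party, assumed here
    except ImportError:
        return False
    return load_dotenv()


def missing_keys(env=None):
    """Return the required key names that are unset or empty in the mapping."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]
```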
---
## Launch the App
### v0: PDFPlumber Text & Table Extractor
```bash
git checkout main
streamlit run app/streamlit_app.py
```
### v1: LangChain Schema-Based Extractor
```bash
git checkout v1
streamlit run app/streamlit_app.py
```
### v2: Agentic Document Intelligence Viewer
```bash
git checkout v2
streamlit run ui/streamlit_app.py
```
---
## Screenshots
| v0: Lightweight PDF Reader | v1: LangChain Schema Extractor | v2: ADE-Powered Chunk Viewer |
|----------------------------|-------------------------------|-------------------------------|
|  |  |  |
---
## Demo Videos
| PDFPlumber & LangChain-Based Extractor (v0, v1) | LangChain-Based Extractor & ADE-Powered Viewer (v1, v2) |
|-------------------------------|--------------------------|
| [Watch Demo Part 1!](https://www.loom.com/share/93ca1ac870d0480580b5a5d2d93db4f2?sid=126cd365-f141-4e12-817e-20df70afccaa) | [Watch Demo Part 2!](https://www.loom.com/share/91cff4e13f054cb2b4ed158fc494d866?sid=6da7125e-48c6-42ce-832b-9e486032f49f) |
> Click on a link to watch a 5-minute walkthrough on Loom.

---
## Real-World Use Cases
- **Contract Intelligence**: Pulling features, pricing, infrastructure specs, and deployment configurations from customer contracts.
- **Sales Enablement**: Exporting client configuration from PDFs into CRM fields automatically.
- **Internal Alignment**: Creating dashboards where executives and department leaders view only the data relevant to them.
- **Audit Readiness**: Summarizing past signed forms and validating consistency across regions.
---
## Tech Stack
- **LLM**: GPT-4o via `langchain-openai`
- **Document Parsing**: `pdfplumber`, `PyPDFLoader`, Agentic Document Extractor by LandingAI
- **Interface**: Streamlit
- **Helpers**: LangChain prompt pipelines, recursive chunking, CSV export, HTML table rendering
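
For readers unfamiliar with the "recursive chunking" helper listed above, the idea is to split long text on coarse separators first (paragraphs), falling back to finer ones (lines, then words) only when a piece is still too large. LangChain ships this as `RecursiveCharacterTextSplitter`, which the repo may well use; the standard-library sketch below only illustrates the technique, and the function name and default values are hypothetical.

```python
def recursive_chunks(text, max_chars=1000, separators=("\n\n", "\n", " ")):
    """Split text into pieces of at most max_chars, preferring coarse separators."""
    if len(text) <= max_chars:
        return [text] if text else []

    for sep in separators:
        if sep not in text:
            continue
        chunks, current = [], ""
        for part in text.split(sep):
            candidate = part if not current else current + sep + part
            if len(candidate) <= max_chars:
                current = candidate  # keep packing parts into the current chunk
                continue
            if current:
                chunks.append(current)
            if len(part) > max_chars:
                # The part no longer contains sep, so the recursive call
                # falls through to the finer separators (or a hard cut).
                chunks.extend(recursive_chunks(part, max_chars, separators))
                current = ""
            else:
                current = part
        if current:
            chunks.append(current)
        return chunks

    # No separator present at all: fall back to fixed-width slices.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```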
---
## Credits
Special thanks to the **Product Operations Lead at Proscia** for their insights and support in identifying real automation opportunities that can drive cross-team efficiency.