{"id":33329319,"url":"https://github.com/harshman7/insight-agent-idp","last_synced_at":"2026-04-08T20:05:28.406Z","repository":{"id":324466585,"uuid":"1097306683","full_name":"harshman7/insight-agent-idp","owner":"harshman7","description":"AI-powered Intelligent Document Processing (IDP) system with RAG, anomaly detection, and natural language insights. Local, zero-cost alternative to AWS Textract + Bedrock.","archived":false,"fork":false,"pushed_at":"2025-11-16T00:56:02.000Z","size":10252,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-11-16T02:28:15.534Z","etag":null,"topics":["ai-agent","anomaly-detection","document-analytics","expense-tracking","faiss","fastapi","idp","intelligent-document-processing","llm","ocr","ollama","postgresql","python","rag","streamlit"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/harshman7.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-15T23:12:48.000Z","updated_at":"2025-11-16T01:02:07.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/harshman7/insight-agent-idp","commit_stats":null,"previous_names":["harshman7/insight-agent-idp"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/harshman7/insight-agent-idp","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/har
shman7%2Finsight-agent-idp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harshman7%2Finsight-agent-idp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harshman7%2Finsight-agent-idp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harshman7%2Finsight-agent-idp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/harshman7","download_url":"https://codeload.github.com/harshman7/insight-agent-idp/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harshman7%2Finsight-agent-idp/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31571626,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-08T14:31:17.711Z","status":"ssl_error","status_checked_at":"2026-04-08T14:31:17.202Z","response_time":54,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-agent","anomaly-detection","document-analytics","expense-tracking","faiss","fastapi","idp","intelligent-document-processing","llm","ocr","ollama","postgresql","python","rag","streamlit"],"created_at":"2025-11-20T16:01:24.364Z","updated_at":"2026-04-08T20:05:28.395Z","avatar_url":"https://github.com/harshman7.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DocSage \n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"image.png\" alt=\"DocSage Logo\" width=\"150\"/\u003e\n  \n  **Intelligent Document Processing with AI-Powered Analytics**\n  \n  *Local, zero-cost alternative to AWS Textract + Bedrock*\n\u003c/div\u003e\n\n---\n\n**DocSage** is a **local, zero-cost platform** for AI-powered document intelligence that sits on top of an Intelligent Document Processing (IDP) pipeline. 
DocSage features an intelligent AI agent that processes documents and answers questions using natural language.\n\nIt ingests PDF documents (e.g., invoices, bank statements, forms), extracts structured data, and lets you **ask natural language questions** like:\n\n- \"What did I spend on rent in the last 3 months?\"\n- \"Which vendors are above $5,000 this quarter?\"\n- \"Show me anomalies in monthly spend and the supporting documents.\"\n\nThe implementation is designed to **mirror how this would run on AWS** (Textract, Bedrock, S3, RDS, OpenSearch), but uses **100% free, local tools** instead.\n\n---\n\n## Why this project exists\n\nI built this to practice end-to-end architecture for an intelligent document processing system similar to what you'd run on AWS (Textract + Bedrock + RDS + OpenSearch), but using 100% local, free tools. My learning focus was:\n\n- **Designing a tool-using LLM agent** wired into SQL, metrics, and RAG\n- **Building an IDP pipeline** (OCR, classification, field extraction) for financial docs\n- **Structuring a FastAPI + Streamlit system** that's easy to \"lift-and-shift\" to AWS\n\n---\n\n## Architecture Overview\n\n### Conceptual Flow\n\n1. **Ingestion \u0026 IDP**\n   - PDFs are stored in `data/raw_docs/`.\n   - A Python-based IDP pipeline:\n     - Uses OCR (Tesseract) and PDF parsing (`pdfplumber`) to extract text.\n     - Classifies document types (invoice, statement, etc.).\n     - Extracts key fields (dates, amounts, vendors, categories).\n   - Structured outputs are saved as JSON/CSV and loaded into a relational database.\n\n2. **Storage \u0026 Analytics**\n   - All structured data is stored in **PostgreSQL** (via Docker Compose) by default.\n   - SQLite is available as an optional alternative for development.\n   - Derived metrics (e.g., monthly totals, category breakdowns, vendor stats) are computed and exposed as reusable \"metrics functions\".\n\n3. 
**RAG + Vector Search**\n   - Document chunks and summaries are embedded with a free `sentence-transformers` model.\n   - Embeddings are stored in a local **FAISS** index (no external vector DB).\n   - This enables the agent to retrieve **supporting documents** for its answers.\n\n4. **AI Agent (LLM + Tools)**\n   - DocSage features an intelligent AI agent that powers the system.\n   - A local LLM (via [Ollama](https://ollama.com/)) provides reasoning and natural language generation.\n   - The agent is wired with tools (using LangChain/LlamaIndex-style patterns):\n     - `sql_tool`: run parameterized SQL queries on the transactional DB.\n     - `metrics_tool`: call pre-defined Python functions for KPIs.\n     - `rag_tool`: search FAISS for relevant document snippets.\n   - The agent decides which tools to call based on the user's query, aggregates the results, and explains the insight in plain language, referencing underlying data and documents.\n\n5. **API \u0026 UI**\n   - **Backend:** FastAPI application exposing:\n     - `POST /chat/insights` – main endpoint for DocSage's AI agent.\n     - `GET /health` – health check endpoint.\n     - `GET /docs` – interactive API documentation.\n   - **Frontend:** Streamlit app with 8 comprehensive pages:\n     - 📊 **Analytics Dashboard** – Time-series analytics, spending trends, vendor analysis, and forecasting.\n     - 💬 **Chat** – Natural language interface to interact with DocSage.\n     - 📄 **Documents** – Document management with visual overlays, interactive corrections, and real-time upload.\n     - ⚠️ **Anomalies** – Automated anomaly detection (duplicates, unusual amounts, missing fields).\n     - 🔍 **Document Comparison** – Side-by-side document comparison and price change tracking.\n     - 📈 **Insights Report** – AI-generated natural language insights and recommendations.\n     - 🔗 **Receipt Matching** – Automatic receipt-to-invoice matching with fuzzy matching.\n     - 📤 **Export** – Export data to Excel and 
Markdown formats.\n\n---\n\n## Stack (Local Analogues of AWS Services)\n\nThis project intentionally mirrors an AWS-native design:\n\n| AWS Service (Target)      | Local / Free Equivalent             |\n|---------------------------|-------------------------------------|\n| S3 (document storage)     | `data/raw_docs/` on local disk      |\n| Textract (OCR)            | Tesseract + `pytesseract`           |\n| Comprehend / Bedrock NLU  | Local LLM + `sentence-transformers` |\n| RDS / Aurora              | PostgreSQL (Docker) - SQLite optional |\n| OpenSearch / Kendra       | FAISS vector index                  |\n| Bedrock LLM (agents)      | Ollama + LangChain/LlamaIndex       |\n| Lambda / Step Functions   | Python services + scripts           |\n| QuickSight                | Streamlit charts + notebooks        |\n\nThis makes it easy to **lift and shift the architecture to AWS** later by replacing the local components with managed services.\n\n---\n\n## Features\n\n### Core IDP \u0026 Document Processing\n- ✅ **End-to-end IDP pipeline:**\n  - OCR + text extraction from PDFs and images (Tesseract + pdfplumber).\n  - Document classification (invoices, receipts, statements).\n  - Field extraction into structured tables with confidence scores.\n  - Real-time document upload with drag-and-drop support.\n\n### AI-Powered Analytics\n- ✅ **RAG-enabled AI agent:**\n  - DocSage's agent combines SQL analytics with document retrieval.\n  - Answers questions in natural language and cites source docs.\n  - Intelligent tool-using agent that chooses between SQL, metrics, and RAG.\n- ✅ **Time-series analytics:**\n  - Monthly spending trends with interactive charts.\n  - Daily spending visualization (last 30 days).\n  - Vendor trends over time.\n  - Spending forecast using linear regression (3-month prediction).\n- ✅ **Smart expense categorization:**\n  - LLM-based automatic categorization into 12+ categories.\n  - Categories: Office Supplies, Software, Travel, Meals, Services, 
etc.\n\n### Document Intelligence\n- ✅ **Visual document overlay:**\n  - Highlight extracted fields on document images.\n  - Color-coded fields with confidence scores.\n  - Annotated document viewer.\n- ✅ **Interactive document correction:**\n  - Edit extracted data directly in the UI.\n  - Track corrections with confidence scores.\n  - Real-time updates after corrections.\n- ✅ **Document comparison:**\n  - Side-by-side comparison of documents.\n  - Similar document finder.\n  - Price change detection for recurring vendors.\n  - Price trend charts.\n\n### Anomaly Detection \u0026 Quality\n- ✅ **Automated anomaly detection:**\n  - Duplicate transaction detection.\n  - Unusual amount flags (\u003e2 standard deviations).\n  - Missing field detection.\n  - Date anomaly identification.\n  - Severity levels (High, Medium, Low).\n\n### Business Intelligence\n- ✅ **Natural language insights generator:**\n  - AI-generated reports using Ollama LLM.\n  - Spending pattern analysis.\n  - Cost optimization recommendations.\n  - Markdown format with downloadable reports.\n- ✅ **Receipt-to-invoice matching:**\n  - Automatic matching with fuzzy matching.\n  - Vendor name similarity scoring.\n  - Amount and date tolerance matching.\n  - Confidence scores for matches.\n\n### Export \u0026 Reporting\n- ✅ **Export functionality:**\n  - Excel export with multiple sheets (Transactions, Vendors, Categories, Anomalies, Documents).\n  - Summary reports in Markdown format.\n  - Downloadable files with timestamps.\n\n\u003e 📖 **See [FEATURES.md](FEATURES.md) for detailed documentation of all features.**\n\n---\n\n## Prerequisites\n\n1. **Python 3.9+**\n2. **PostgreSQL** (via Docker Compose) - **Primary database**\n   - SQLite is available as an optional alternative (set `USE_SQLITE=True` in `.env`)\n3. **Docker** (required for PostgreSQL via Docker Compose)\n   - Install from: https://www.docker.com/get-started\n4. 
**Ollama** installed and running locally\n   - Install from: https://ollama.com/\n   - Pull a model: `ollama pull llama3` (or `mistral`, `codellama`, etc.)\n5. **Tesseract OCR** (for image OCR)\n   - macOS: `brew install tesseract`\n   - Linux: `sudo apt-get install tesseract-ocr`\n   - Windows: Download from [GitHub](https://github.com/UB-Mannheim/tesseract/wiki)\n\n---\n\n## Installation\n\n1. **Clone the repository:**\n   ```bash\n   git clone \u003crepository-url\u003e\n   cd docsage\n   ```\n\n2. **Create a virtual environment:**\n   ```bash\n   python -m venv venv\n   source venv/bin/activate  # On Windows: venv\\Scripts\\activate\n   ```\n\n3. **Install dependencies:**\n   ```bash\n   pip install -r requirements.txt\n   ```\n\n4. **Set up environment variables:**\n   Create a `.env` file (optional, defaults are provided):\n   ```env\n   # PostgreSQL (default database)\n   POSTGRES_USER=postgres\n   POSTGRES_PASSWORD=postgres\n   POSTGRES_DB=docsage\n   POSTGRES_HOST=localhost\n   POSTGRES_PORT=5432\n   USE_SQLITE=False  # Set to True to use SQLite instead (not recommended for production)\n   \n   # Ollama LLM\n   OLLAMA_BASE_URL=http://localhost:11434\n   OLLAMA_MODEL=llama3\n   ```\n\n5. **Start PostgreSQL (via Docker Compose):**\n   ```bash\n   docker-compose up -d postgres\n   ```\n   \n   **Note:** PostgreSQL is the default and recommended database. To use SQLite instead (not recommended for production), set `USE_SQLITE=True` in your `.env` file and skip this step.\n\n6. **Create the database** (if it doesn't exist):\n   ```bash\n   docker-compose exec postgres psql -U postgres -c \"CREATE DATABASE docsage;\"\n   ```\n   \n   Or if you prefer to create it manually, connect to PostgreSQL and run:\n   ```sql\n   CREATE DATABASE docsage;\n   ```\n\n7. **Verify Ollama is running:**\n   ```bash\n   curl http://localhost:11434/api/tags\n   ```\n\n---\n\n## Usage\n\n### 1. 
Initialize Database\n\n**Important:** Make sure Docker is running and PostgreSQL is started before running these commands.\n\n**First, run the database migration** (if upgrading from an older version):\n\n```bash\n# Activate virtual environment first\nsource venv/bin/activate  # On Windows: venv\\Scripts\\activate\n\n# Run migration\npython scripts/migrate_database.py\n```\n\nThen create tables and optionally seed with sample data:\n\n```bash\n# Create tables only\npython scripts/seed_db.py 0\n\n# Create tables + seed with 50 sample transactions\npython scripts/seed_db.py\n```\n\n### 2. Download Sample Documents (Optional)\n\nYou can download a free invoice/receipt dataset from Hugging Face:\n\n```bash\n# Install dataset library\npip install datasets pillow\n\n# Download sample (first 20 images for testing)\npython3 scripts/download_huggingface_dataset.py --split train --max-images 20\n\n# Or download full training set (2,040 images)\npython3 scripts/download_huggingface_dataset.py --split train\n```\n\nSee `DATASET_GUIDE.md` for detailed instructions.\n\n### 3. Ingest Documents\n\nPlace PDF/image files in `data/raw_docs/` and run:\n\n```bash\npython3 scripts/ingest_docs.py\n```\n\nThis will:\n- Extract text from PDFs/images (using OCR for images)\n- Classify document types\n- Extract structured fields\n- Create transaction records\n\n### 4. Build Vector Embeddings\n\nAfter ingesting documents, build the FAISS index:\n\n```bash\npython3 scripts/build_embeddings.py\n```\n\n### 5. Start the API Server\n\n```bash\npython3 -m app.main\n# Or: uvicorn app.main:app --reload\n```\n\nThe API will be available at `http://localhost:8000`\n\n### 6. 
Start the Streamlit Frontend\n\nIn a new terminal:\n\n```bash\nstreamlit run frontend/streamlit_app.py\n```\n\nNavigate to `http://localhost:8501` in your browser.\n\n---\n\n## API Usage\n\n### Chat Endpoint\n\n```bash\ncurl -X POST \"http://localhost:8000/chat/insights\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"query\": \"What did I spend on rent in the last 3 months?\",\n    \"use_rag\": true,\n    \"use_sql\": true\n  }'\n```\n\nResponse:\n```json\n{\n  \"answer\": \"Based on the transaction data...\",\n  \"sources\": [...],\n  \"sql_query\": \"SELECT SUM(amount) FROM transactions WHERE...\"\n}\n```\n\n---\n\n## Project Structure\n\n```\napp/\n  main.py           # FastAPI entrypoint\n  config.py         # App settings\n  db.py             # Database connection (SQLAlchemy)\n  models.py         # ORM models (Documents, Transactions, etc.)\n  schemas.py        # Pydantic schemas for API requests/responses\n\n  services/\n    idp_pipeline.py          # OCR + extraction pipeline\n    rag.py                   # Embedding + FAISS vector search helpers\n    sql_tools.py             # Safe SQL wrappers used by the agent\n    insights.py              # Metrics / KPI computation functions\n    anomaly_detection.py     # Anomaly detection and alerting\n    categorization.py        # LLM-based expense categorization\n    document_comparison.py   # Document comparison and price tracking\n    document_visualization.py # Visual document overlay with annotations\n    export_service.py        # Excel and Markdown export functionality\n    insights_generator.py    # AI-generated natural language insights\n    receipt_matching.py      # Receipt-to-invoice matching\n\n  agents/\n    insight_agent.py # DocSageAgent class - Core AI agent orchestration logic\n    tools.py        # Tool definitions exposed to the LLM\n\n  vectorstore/\n    faiss_store.py  # FAISS index management\n\ndata/\n  raw_docs/         # Input PDFs\n  processed/        # Extracted 
JSON/CSV\n  embeddings/       # FAISS indexes, metadata\n\nfrontend/\n  streamlit_app.py  # Comprehensive UI with 8 pages: Analytics, Chat, Documents, Anomalies, Comparison, Insights, Receipt Matching, Export\n\nscripts/\n  ingest_docs.py                  # CLI: load PDFs into the system\n  build_embeddings.py             # Build FAISS vector index\n  seed_db.py                      # Initialize database and seed sample data\n  migrate_database.py             # Database migration script (adds new tables/columns)\n  download_huggingface_dataset.py # Download invoice/receipt datasets\n  add_documents_from_folder.py    # Batch document ingestion\n  diagnose_and_fix_transactions.py # Diagnostic and repair utilities\n\nnotebooks/\n  exploratory_idp.ipynb\n  analytics_demo.ipynb\n```\n\n---\n\n## Configuration\n\nKey configuration options in `app/config.py`:\n\n- **Database (PostgreSQL is default):**\n  - `POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB`, `POSTGRES_HOST`, `POSTGRES_PORT`: PostgreSQL connection settings\n  - `USE_SQLITE`: Set to `True` to use SQLite instead (default: `False` - PostgreSQL is recommended)\n- **LLM:**\n  - `OLLAMA_MODEL`: LLM model to use (default: `llama3`)\n  - `OLLAMA_BASE_URL`: Ollama API endpoint (default: `http://localhost:11434`)\n- **Vector Store:**\n  - `EMBEDDING_MODEL`: Embedding model (default: `all-MiniLM-L6-v2`)\n  - `FAISS_INDEX_PATH`: Path to FAISS index file\n- **API:**\n  - `API_HOST`: API host (default: `0.0.0.0`)\n  - `API_PORT`: API port (default: `8000`)\n\n---\n\n## Troubleshooting\n\n### Ollama Connection Error\n\n- Ensure Ollama is running: `ollama serve`\n- Check the model is available: `ollama list`\n- Pull the model if needed: `ollama pull llama3`\n\n### Tesseract OCR Not Found\n\n- Install Tesseract (see Prerequisites)\n- On macOS, ensure it's in PATH: `which tesseract`\n\n### Database Connection Error\n\n- Ensure PostgreSQL is running: `docker-compose ps`\n- Check connection settings in `.env` or 
`app/config.py`\n\n### No Documents Found\n\n- Place PDF/image files in `data/raw_docs/`\n- Run `python scripts/ingest_docs.py`\n- Supported formats: PDF, PNG, JPG, JPEG\n\n### Database Migration Issues\n\n- If you see errors about missing columns or tables, run: `python3 scripts/migrate_database.py`\n- This adds new tables (`document_corrections`) and columns (`confidence_score`, `is_corrected`)\n\n---\n\n## Development\n\n### Adding New Document Types\n\n1. Update `classify_document()` in `app/services/idp_pipeline.py`\n2. Add extraction function (e.g., `extract_form_fields()`)\n3. Update `parse_document()` to handle the new type\n\n### Adding New Metrics\n\n1. Add function to `app/services/insights.py`\n2. Update `create_metrics_tool()` in `app/agents/tools.py`\n\n---\n\n## Deployment\n\nDocSage can be deployed for **free** using Railway or Render with free LLM APIs (Groq or Hugging Face).\n\n### Quick Deploy to Railway (Free)\n\n1. **Get a free Groq API key**: [console.groq.com](https://console.groq.com)\n2. **Deploy to Railway**: Connect your GitHub repo at [railway.app](https://railway.app)\n3. 
**Set environment variables**:\n   - `LLM_PROVIDER=groq`\n   - `GROQ_API_KEY=your_key`\n   - Database credentials (Railway provides these automatically)\n\nSee **[DEPLOYMENT.md](DEPLOYMENT.md)** for detailed deployment instructions including:\n- Railway deployment (recommended)\n- Render deployment\n- Docker Compose production setup\n- Free LLM API setup (Groq, Hugging Face)\n- Environment variable configuration\n\n## Future Enhancements\n\n- [ ] Advanced text chunking strategies\n- [ ] Multi-turn conversation support\n- [ ] PDF report generation (using reportlab)\n- [ ] Real-time document processing webhooks\n- [ ] AWS deployment guide\n- [ ] Email integration for automatic document processing\n- [ ] Multi-language support\n- [ ] Advanced ML models for better extraction accuracy\n- [ ] Budget tracking and alerts\n- [ ] Approval workflows\n\n---\n\n## What I learned\n\n- **How to design tools and guardrails** so an LLM can safely query a SQL DB\n- **How to combine RAG + analytics** (FAISS + metrics functions + SQL) for grounded insights\n- **How to mirror a managed-cloud architecture** with local components first\n\n## Next steps\n\n- Add GitHub Actions to run linting on each push\n- Swap local components for AWS services (Textract, Bedrock, RDS) in a branch\n\n---\n\n## License\n\nMIT License\n\n---\n\n## Contributing\n\nContributions welcome! Please open an issue or submit a pull request.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fharshman7%2Finsight-agent-idp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fharshman7%2Finsight-agent-idp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fharshman7%2Finsight-agent-idp/lists"}