https://github.com/suranjitpartho/clinical-data-intelligence-system
An AI Agent for clinical data intelligence. Built with LangGraph, FastAPI, and pgvector to unify structured SQL records and unstructured clinical notes via a self-healing, schema-aware reasoning engine.
- Host: GitHub
- URL: https://github.com/suranjitpartho/clinical-data-intelligence-system
- Owner: suranjitpartho
- License: mit
- Created: 2026-04-10T04:44:54.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2026-05-18T19:41:47.000Z (5 days ago)
- Last Synced: 2026-05-18T21:51:12.703Z (5 days ago)
- Topics: ai-agents, clinical-data, fastapi, healthcare-ai, langgraph, llm, pgvector, postgresql, rag, self-healing-ai
- Language: Python
- Homepage:
- Size: 3.17 MB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-medical-ai - clinical-data-intelligence-system | ⭐ C+ | LangGraph + FastAPI clinical intelligence layer that unifies SQL records and notes via hybrid retrieval, self-healing SQL, and a reasoning-trace UI for grounded clinician queries. | (Clinical Software & EHR)
README
# CLINICAL DATA INTELLIGENCE SYSTEM
*The Clinical Data Intelligence System is an AI platform designed to make clinical information easy to access through simple, natural conversation. It enables doctors and healthcare staff to instantly search patient records, medical notes, and lab results without needing technical database skills. By seamlessly integrating structured database records with unstructured clinical notes, the system automates manual reporting and provides clear insights that help medical teams save time and provide better care for their patients.*







## Case Study: Solving Clinical Data Fragmentation
> ⭐ **SITUATION:** Clinical environments suffer from fragmented data. Quantitative metrics (billing/labs) live in rigid SQL databases, while qualitative insights (clinical notes) are locked in unstructured text. Clinicians lose hours waiting for manual data pulls, delaying patient care and operational decisions.
>
> ⭐ **TARGET:** The mission was to build a "Clinical Intelligence Layer" that translates natural language into precise database queries.
>
> ⭐ **ACTION:** Engineered a deterministic state machine using *LangGraph* to orchestrate a hybrid retrieval system. Implemented a *Semantically Augmented Data Dictionary* to bridge clinical logic with SQL schemas, integrated *pgvector* for narrative medical searches, and built a *Proactive Discovery & Self-Healing loop* that autonomously corrects hallucinated SQL in real time. Developed a custom *Observability Layer* that mirrors Langfuse Cloud data for node-level latency and cost transparency.
>
> ⭐ **RESULT:** Reduced clinical data retrieval workflows from hours to near real-time responses. Built a dual-layer *Reasoning Trace UI* that exposes both the internal logic and the financial cost of every decision, improving trust, accountability, and operational predictability.
## Core Capabilities
| Feature | Clinical Benefit |
| :--- | :--- |
| **Self-Healing SQL** | Reduces manual query fixes by autonomously correcting SQL syntax errors and schema mismatches. |
| **Proactive Discovery** | Prevents hallucinations by fetching real categorical values before writing SQL. |
| **Hybrid Retrieval** | Combines exact lab results with semantic insights from clinical notes. |
| **Observability Trace** | Provides node-level transparency for latency, token density, and financial cost. |
| **Contextual Rewrite** | Maintains diagnostic accuracy in multi-turn conversations by resolving pronouns. |
| **Dimensional Enrichment** | Automatically fetches medical reference ranges (e.g., lab thresholds) to ground AI synthesis in clinical truth. |
## Technical Architecture
The system is built on a modular, state-managed architecture designed for high availability and clinical precision.
     
#### End-to-End Request Flow
1. **Natural Language Input**: User enters a query (e.g., *"Show abnormal lab results for Patient A"*).
2. **Contextual Rewrite**: The system resolves conversation history and converts ambiguous prompts into standalone, context-rich queries.
3. **Intent Routing**: The Orchestrator determines if the request requires *SQL retrieval* (structured labs), *Semantic RAG* (clinical notes), or a *Hybrid response*.
4. **Multi-Modal Retrieval**:
   - **SQL Node**: Queries structured tables using schema-aware logic.
   - **RAG Node**: Searches clinical notes and protocols using *pgvector*.
5. **Validation & Self-Correction**: When a query fails with a SQL syntax error or schema mismatch, the system captures the *PostgreSQL error traceback* and triggers an autonomous retry loop for immediate self-correction.
6. **Synthesis Layer**: Combines structured data and unstructured evidence into a single, grounded clinical response.
7. **Reasoning Trace**: The execution path is exposed to the UI, providing full transparency of the AI’s decision-making process.
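
The routing decision in step 3 can be sketched as a small function. This is only a keyword-based stand-in: the keyword sets and the `route_intent` name are illustrative assumptions, and the real Orchestrator presumably relies on an LLM classifier rather than word matching.

```python
from typing import Literal

Route = Literal["sql", "rag", "hybrid"]

# Hypothetical keyword hints (assumption); the real router is presumably LLM-based.
STRUCTURED_HINTS = {"lab", "labs", "billing", "average", "count", "results"}
NARRATIVE_HINTS = {"notes", "symptoms", "history", "complained", "narrative"}


def route_intent(query: str) -> Route:
    """Pick a retrieval path from the words in a standalone query."""
    words = {w.strip(".,?!\"'").lower() for w in query.split()}
    wants_sql = bool(words & STRUCTURED_HINTS)
    wants_rag = bool(words & NARRATIVE_HINTS)
    if wants_sql and wants_rag:
        return "hybrid"
    if wants_rag:
        return "rag"
    # Fall back to SQL when no narrative hint is present (an arbitrary default).
    return "sql"


if __name__ == "__main__":
    for q in (
        "Show abnormal lab results for Patient A",
        "Summarize the symptoms in the clinical notes",
        "Compare lab results with the symptoms in his notes",
    ):
        print(q, "->", route_intent(q))
```

Running the router on the rewritten query rather than the raw user message keeps the decision independent of conversation history.
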
#### System Stack Overview
| Layer | Component / Tech | Key Responsibility |
| :--- | :--- | :--- |
| **Orchestration** | **LangGraph** | Managing state-based clinical reasoning and tool loops. |
| **Knowledge Layer** | **Data Dictionary** | Mapping natural language to complex clinical business logic. |
| **Observability** | **Langfuse** | Capturing LLM latency, token usage, and graph execution traces. |
| **API Backend** | **FastAPI** | Providing high-concurrency, asynchronous API endpoints. |
| **Knowledge Base** | **pgvector** | Storing medical narratives and protocol embeddings. |
| **Modern UI** | **React 19** | Delivering a transparent "Reasoning Trace" for clinician trust. |
## Engineering Deep Dive: Challenges & Solutions
✴️ **Challenge: Managing Non-Linear Clinical Logic → Solution: State-Machine Orchestration**
At its core, the system utilizes a *LangGraph-driven State Graph* to manage complex reasoning. Unlike basic linear chains, this architecture allows for *directed cycles*, enabling the agent to revisit previous steps if conditions aren't met. This state-managed approach allows the system to generate a *Reasoning Trace*, exposing its internal "Chain of Thought" to clinicians for verification before final synthesis.
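
A directed cycle with a reasoning trace can be illustrated without any framework. The sketch below is plain Python, not the project's LangGraph code: the node names, the `AgentState` fields, and the "second draft passes" rule are all assumptions made for the example. LangGraph expresses the same back-edge with conditional edges on a state graph.

```python
from typing import Callable, Dict, List, TypedDict


class AgentState(TypedDict):
    query: str
    attempts: int
    sql_ok: bool
    trace: List[str]


def generate_sql(state: AgentState) -> str:
    state["attempts"] += 1
    # Pretend the first draft is invalid and the second one passes.
    state["sql_ok"] = state["attempts"] >= 2
    return "validate"


def validate(state: AgentState) -> str:
    # Directed cycle: return to generate_sql until the draft is valid.
    return "synthesize" if state["sql_ok"] else "generate_sql"


def synthesize(state: AgentState) -> str:
    return "END"


NODES: Dict[str, Callable[[AgentState], str]] = {
    "generate_sql": generate_sql,
    "validate": validate,
    "synthesize": synthesize,
}


def run_graph(state: AgentState, start: str = "generate_sql", max_steps: int = 10) -> AgentState:
    node = start
    for _ in range(max_steps):  # hard cap so a cycle cannot loop forever
        if node == "END":
            break
        state["trace"].append(node)  # the trace is what a UI could display
        node = NODES[node](state)
    return state


if __name__ == "__main__":
    final = run_graph({"query": "abnormal labs", "attempts": 0, "sql_ok": False, "trace": []})
    print(" -> ".join(final["trace"]))
```

The step cap matters: once cycles are allowed, some bound on iterations is needed so a persistently failing node cannot stall a request.
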
✴️ **Challenge: Conversational Context Drift → Solution: Recursive Query Transformation**
To support natural, multi-turn dialogue, the system implements an intelligent *Query Rewrite Node*. This node uses LLM-based transformation to turn ambiguous follow-up questions (e.g., *"What about his labs?"*) into standalone, context-rich queries (*"Show laboratory results for Patient X"*). This prevents "memory contamination" and ensures the intent router always receives a clear, precise instruction.
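
One way to picture the rewrite step is as prompt construction over a bounded window of history. The instruction text, the `build_rewrite_prompt` name, and the `max_turns` parameter below are hypothetical; the project's actual prompt is not shown in this README.

```python
from typing import List, Tuple

Turn = Tuple[str, str]  # (role, text)

# Hypothetical instruction wording (assumption).
REWRITE_INSTRUCTION = (
    "Rewrite the final user question as a standalone query. "
    "Resolve pronouns using the conversation. Return only the rewritten query."
)


def build_rewrite_prompt(history: List[Turn], question: str, max_turns: int = 4) -> str:
    """Assemble the text sent to the LLM for query rewriting."""
    # Only recent turns are included, which limits stale context leaking in.
    recent = history[-max_turns:] if max_turns > 0 else []
    lines = [REWRITE_INSTRUCTION, "", "Conversation:"]
    lines += [f"{role}: {text}" for role, text in recent]
    lines += ["", f"Final user question: {question}", "Standalone query:"]
    return "\n".join(lines)


if __name__ == "__main__":
    history = [
        ("user", "Show vitals for Patient X"),
        ("assistant", "Here are the vitals for Patient X."),
    ]
    print(build_rewrite_prompt(history, "What about his labs?"))
```

The LLM's reply to this prompt, not the original follow-up, is what would be handed to the intent router.
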
✴️ **Challenge: Fragmented Patient Histories → Solution: Multi-Modal Data Fusion (SQL + RAG)**
To provide a 360-degree patient view, the system implements a *multi-modal retrieval strategy*. It simultaneously pulls quantitative data (billing, labs) via exact-match SQL and qualitative narratives (symptoms, history) via semantic search. By utilizing the *BGE-M3* embedding model and *pgvector*, the system captures subtle medical nuances that traditional keyword search would miss.
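
The fusion idea can be shown with toy data. In the sketch below, three-dimensional vectors stand in for BGE-M3 embeddings and an in-memory list stands in for the notes table, so every name and value is an assumption. In the real system pgvector would perform the similarity ranking inside PostgreSQL (for example with its cosine distance operator `<=>`).

```python
import math
from typing import Dict, List, Sequence, Tuple

Note = Tuple[str, Sequence[float]]  # (note text, embedding)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_notes(query_vec: Sequence[float], notes: List[Note], k: int = 2) -> List[str]:
    """Return the k note texts most similar to the query embedding."""
    ranked = sorted(notes, key=lambda n: cosine_similarity(query_vec, n[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def hybrid_context(lab_rows: List[Dict], query_vec: Sequence[float], notes: List[Note], k: int = 2) -> Dict:
    """Bundle exact SQL rows with semantically retrieved narrative text."""
    return {"structured": lab_rows, "narrative": top_notes(query_vec, notes, k)}


if __name__ == "__main__":
    notes = [
        ("Reports chest tightness on exertion", [0.9, 0.1, 0.0]),
        ("Follow-up for ankle sprain", [0.0, 0.2, 0.9]),
        ("Shortness of breath at night", [0.8, 0.3, 0.1]),
    ]
    labs = [{"test_name": "troponin", "status": "abnormal"}]
    print(hybrid_context(labs, [1.0, 0.2, 0.0], notes))
```

Keeping the two result sets in separate keys lets the synthesis step cite which facts came from exact records and which from narrative notes.
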
✴️ **Challenge: SQL Hallucination & Syntactic Errors → Solution: Proactive Discovery & Self-Correction**
To improve precision, the system employs *Proactive Schema Discovery* guided by a *schema-aware data dictionary*. Before generating SQL, the agent consults a custom knowledge map that defines complex clinical relationships and business rules (e.g., precise age-calculation logic). It then fetches current categorical values from the database so that filters in the query match values that actually exist. If a query fails, an autonomous *Self-Correction Loop* captures the database error and feeds it back to the agent for an immediate, self-healing rewrite.
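
Both halves of this approach fit in a short sketch. It uses the standard-library `sqlite3` module as a stand-in for PostgreSQL, and the `labs` table, the function names, and the string-replacing `repair` callback are assumptions; in the real system the repair step would be an LLM call that receives the error text.

```python
import sqlite3
from typing import Callable, List, Tuple


def discover_values(conn: sqlite3.Connection, table: str, column: str) -> List[str]:
    """Fetch the real categorical values so generated filters match live data."""
    # Identifiers are assumed to come from a trusted data dictionary, not user input.
    rows = conn.execute(f"SELECT DISTINCT {column} FROM {table} ORDER BY {column}").fetchall()
    return [r[0] for r in rows]


def run_with_self_correction(
    conn: sqlite3.Connection,
    sql: str,
    repair: Callable[[str, str], str],
    max_retries: int = 2,
) -> Tuple[list, List[str]]:
    """Execute sql; on failure, pass the error to repair() and try again."""
    errors: List[str] = []
    for _ in range(max_retries + 1):
        try:
            return conn.execute(sql).fetchall(), errors
        except sqlite3.Error as exc:
            errors.append(str(exc))
            sql = repair(sql, str(exc))  # stand-in for the LLM rewrite
    raise RuntimeError(f"query still failing after retries: {errors}")


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE labs (patient TEXT, test_name TEXT, status TEXT)")
    conn.executemany(
        "INSERT INTO labs VALUES (?, ?, ?)",
        [("A", "glucose", "abnormal"), ("B", "glucose", "normal")],
    )
    print(discover_values(conn, "labs", "status"))
    bad_sql = "SELECT patient FROM labs WHERE stauts = 'abnormal'"
    rows, errors = run_with_self_correction(
        conn, bad_sql, lambda sql, err: sql.replace("stauts", "status")
    )
    print(rows, errors)
```

Returning the collected errors alongside the rows makes each failed attempt available to a reasoning trace.
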
✴️ **Challenge: Context Loss in Aggregated SQL → Solution: Automated Dimensional Enrichment**
Aggregating data (e.g., averages) often "squashes" clinical context like reference ranges. I engineered a *Metadata-Driven Enrichment Layer* that dynamically injects a separate "Dimensional Context" payload into the synthesis layer. This allows the AI to interpret results against clinical ground truth even for highly aggregated queries.
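
A minimal sketch of the enrichment idea follows. The `REFERENCE_RANGES` values are illustrative placeholders rather than clinical guidance, and the payload keys (`rows`, `dimensional_context`) are assumptions, since the README does not show the real payload shape.

```python
from typing import Dict, List

# Illustrative placeholder ranges (assumption); real values would come from a curated table.
REFERENCE_RANGES: Dict[str, Dict] = {
    "glucose": {"unit": "mg/dL", "low": 70.0, "high": 99.0},
    "hemoglobin": {"unit": "g/dL", "low": 12.0, "high": 17.5},
}


def enrich(aggregates: List[Dict], ranges: Dict[str, Dict] = REFERENCE_RANGES) -> Dict:
    """Attach reference ranges for every test that appears in the aggregated rows."""
    tests = {row["test_name"] for row in aggregates}
    context = {t: ranges[t] for t in sorted(tests) if t in ranges}
    # The ranges travel next to the rows instead of being joined into them,
    # so aggregation cannot squash them.
    return {"rows": aggregates, "dimensional_context": context}


def flag(value: float, ref: Dict) -> str:
    """Classify a value against its reference range."""
    if value < ref["low"]:
        return "low"
    if value > ref["high"]:
        return "high"
    return "within range"


if __name__ == "__main__":
    payload = enrich([{"test_name": "glucose", "avg_value": 112.4}])
    ref = payload["dimensional_context"]["glucose"]
    print(payload)
    print(flag(112.4, ref))
```

With the range attached, the synthesis step can describe an average as high or low rather than reporting a bare number.
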
## Technical Rationale: Why This Stack?
* **LangGraph over LangChain**: Unlike standard chains, LangGraph provides the fine-grained control over *cycles and state* required for a non-linear clinical diagnostic flow.
* **PostgreSQL + pgvector over Pinecone**: By using pgvector, the system can execute complex SQL joins and semantic vector searches natively within the same database environment. This unified storage ensures data consistency and allows dynamic context passing (e.g., using SQL results to immediately filter vector searches) without relying on external vector databases.
* **FastAPI over Django**: Chosen for its high-performance asynchronous capabilities, efficiently orchestrating multiple concurrent LLM calls to deliver near real-time responses critical for medical consultation environments.
* **Langfuse over LangSmith**: Chosen for its open-source, self-hostable architecture, which is essential for clinical data privacy and HIPAA compliance. Unlike alternatives offered primarily as hosted SaaS, Langfuse allows the clinic to maintain full ownership of its telemetry data while providing granular cost tracking to five decimal places and node-level latency tracking.
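
The asynchronous argument behind the FastAPI choice can be seen in a few lines of standard-library `asyncio`. The `fake_llm_call` coroutine and the delays are assumptions standing in for real LLM or database calls; inside a FastAPI `async def` endpoint the same `asyncio.gather` pattern would apply.

```python
import asyncio
import time
from typing import List


async def fake_llm_call(name: str, delay: float) -> str:
    # Stand-in for an awaitable LLM or database call (assumption).
    await asyncio.sleep(delay)
    return f"{name}: done"


async def retrieve_concurrently() -> List[str]:
    # Both branches are awaited together, so wall time tracks the slower
    # branch rather than the sum of the two.
    return await asyncio.gather(
        fake_llm_call("sql_node", 0.05),
        fake_llm_call("rag_node", 0.10),
    )


if __name__ == "__main__":
    start = time.perf_counter()
    results = asyncio.run(retrieve_concurrently())
    print(results, f"{time.perf_counter() - start:.2f}s")
```

`asyncio.gather` returns results in the order the awaitables were passed, so downstream code can rely on position even when the faster call finishes first.
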
## Trust & Transparency
* **Reasoning Trace**: The system exposes its internal "Chain of Thought" to the user, allowing clinicians to verify the logic and node-level execution sequence behind every data retrieval.
* **Deep Observability & System Analytics**: Integrated with *Langfuse Cloud* for real-time telemetry. Features a high-fidelity analytics dashboard that provides sub-second latency tracking, token density analysis, and 5-decimal micro-billing precision for every graph execution.
* **Performance & Financial Audit**: Every reasoning step (Rewrite, Classify, SQL, RAG) is logged with its specific *sub-second latency*, *token density*, and *USD cost*, ensuring transparent audit trails and predictable OPEX for medical departments.
* **Deterministic Guardrails**: Using LangGraph, the system enforces a strict state-managed flow, preventing the AI from wandering into "creative" or ungrounded responses.
* **Clinical Simulation & Privacy**: To protect patient privacy and support HIPAA compliance, this system operates on a *proprietary synthetic dataset* rather than real patient records. I engineered a custom *Clinical Simulation Engine* that generates high-entropy patient records and longitudinal narratives for rigorous testing.
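
The per-node audit described in the bullets above can be sketched as a small trace recorder. This is not the Langfuse API: the `Trace` and `NodeSpan` classes and the `USD_PER_1K_TOKENS` price are hypothetical, chosen only to show latency, token count, and a cost rounded to five decimal places being logged for each node.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical price (assumption); real rates depend on the model and provider.
USD_PER_1K_TOKENS = 0.00015


@dataclass
class NodeSpan:
    node: str
    latency_s: float
    tokens: int
    cost_usd: float


@dataclass
class Trace:
    spans: List[NodeSpan] = field(default_factory=list)

    def record(self, node: str, fn: Callable[[], int]) -> None:
        """Run fn (which returns the tokens it used) and log latency and cost."""
        start = time.perf_counter()
        tokens = fn()
        latency = time.perf_counter() - start
        cost = round(tokens / 1000 * USD_PER_1K_TOKENS, 5)
        self.spans.append(NodeSpan(node, latency, tokens, cost))

    def total_cost(self) -> float:
        return round(sum(s.cost_usd for s in self.spans), 5)


if __name__ == "__main__":
    trace = Trace()
    trace.record("rewrite", lambda: 200)
    trace.record("sql", lambda: 1000)
    for span in trace.spans:
        print(f"{span.node}: {span.latency_s:.4f}s, {span.tokens} tokens, ${span.cost_usd:.5f}")
    print(f"total: ${trace.total_cost():.5f}")
```

Recording one span per node, instead of one per request, is what allows cost and latency to be attributed to individual steps such as Rewrite or SQL.
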
## Project Structure
```text
├── app/                 # FastAPI Backend (Graph logic, Nodes, Models)
│   └── services/        # Core Data Dictionary & AI Prompts
├── frontend/            # React 19 + Tailwind 4 Frontend
├── migrations/          # SQLAlchemy/Alembic Database Migrations
├── scripts/             # Data Seeding & BGE-M3 Embedding Generation
└── requirements.txt     # Backend dependencies
```
### System Requirements
* **RAM:** Minimum 4GB (8GB recommended for local AI execution).
* **Disk Space:** ~5GB for Docker images and local model storage.
* **Docker:** Ensure at least 4GB of memory is allocated to Docker.
## Installation & Setup
### Quick Start with Docker (Recommended)
This system is fully containerized. Deployment via Docker ensures environment consistency across both the Frontend and Backend services.
> [!NOTE]
> **Prerequisite:** Ensure your PostgreSQL instance supports the `pgvector` extension. It is available on Supabase, Neon, and AWS RDS.
**1. Clone the repository**
```bash
git clone https://github.com/suranjitpartho/clinical-data-intelligence-system.git
cd clinical-data-intelligence-system
```
**2. Configure environment**
Copy the example file, then populate `.env` with your API keys and database credentials:
```bash
cp .env.example .env
```
**3. Deploy services**
```bash
docker-compose up --build
```
Once initialized, the unified application will be accessible at **http://localhost:8000**.