https://github.com/md-emon-hasan/autodocthinker
Agentic AI system that lets users upload documents (PDFs, DOCX, etc.) and ask natural language questions. It uses LLM-based RAG to extract relevant information. The architecture includes multi-agent components such as document retrievers, summarizers, web searchers, and tool routers, enabling dynamic reasoning and accurate responses.
agentic-ai ai-agents ai-assistant ai-document-search auto-document-analysis conversation-memory document-intelligence document-qa document-retrieval duckduckgo-tool langgraph llm-apps llm-reasoning planner-executor-agent qna-system rag semantic-search smart-document-search tool-usage-llm vector-search
- Host: GitHub
- URL: https://github.com/md-emon-hasan/autodocthinker
- Owner: Md-Emon-Hasan
- License: mit
- Created: 2025-05-04T14:09:01.000Z (about 1 year ago)
- Default Branch: master
- Last Pushed: 2026-05-07T12:38:57.000Z (about 2 months ago)
- Last Synced: 2026-05-07T14:40:00.567Z (about 2 months ago)
- Topics: agentic-ai, ai-agents, ai-assistant, ai-document-search, auto-document-analysis, conversation-memory, document-intelligence, document-qa, document-retrieval, duckduckgo-tool, langgraph, llm-apps, llm-reasoning, planner-executor-agent, qna-system, rag, semantic-search, smart-document-search, tool-usage-llm, vector-search
- Language: Python
- Homepage: https://autodocthinker.onrender.com
- Size: 32.9 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# AutoDocThinker: Agentic RAG System with Intelligent Search Engine
[Python](https://python.org) [FastAPI](https://fastapi.tiangolo.com/) [LangChain](https://python.langchain.com/) [LangGraph](https://langchain-ai.github.io/langgraph/) [PyTorch](https://pytorch.org/) [Hugging Face](https://huggingface.co/) [ChromaDB](https://www.trychroma.com/) [Docker](https://www.docker.com/) [React](https://reactjs.org/) [Tailwind CSS](https://tailwindcss.com/) [Groq](https://groq.com/) [GitHub](https://github.com/Md-Emon-Hasan/AutoDocThinker)
**AutoDocThinker** (v3.0) is an **Agentic RAG (Retrieval-Augmented Generation)** system that turns static documents into a queryable knowledge base, addressing information overload in data-rich environments. Built on a **Modular Monolithic Architecture** with **FastAPI, LangGraph, and ChromaDB**, it ingests unstructured data (PDFs, Word documents, web URLs, plain text) and lets users query it in natural language.

Where keyword search ignores context, AutoDocThinker offers a **four-mode RAG workflow engine**, with **Naive, Advanced, CRAG (Corrective RAG), and Self-RAG** modes, to adaptively route, retrieve, evaluate, and regenerate answers. The **Hybrid Search engine** fuses **dense vector retrieval (ChromaDB)** with **sparse BM25 indexing** via **Reciprocal Rank Fusion (RRF)**, followed by **CrossEncoder reranking**, to favor precision. Seven **domain-specific presets** (Medical, Legal, Finance, Technical, Education, Customer Support, General) tune prompts and retrieval behavior per use case, while a **chat session system** maintains multi-turn conversation history. The system targets research and Level-1 support automation, with a stated goal of **10x productivity gains**, by synthesizing citation-backed answers in seconds from files that would otherwise sit unused.
[Demo video](https://github.com/user-attachments/assets/a18dc570-35fc-4c42-8bad-fd6be37b6c0a)
---
## **Live Demo**
**Try it now**: [AutoDocThinker: Agentic RAG System with Intelligent Search Engine](https://autodocthinker.onrender.com/)
---
## **Features & Functionalities**
| # | Module | Technology Stack | Implementation Details |
|----|--------------------------|-----------------------------------------|--------------------------------------------------------------|
| 1 | **Backend Framework** | FastAPI + Uvicorn | Async support, auto OpenAPI docs, lifecycle hooks |
| 2 | **LLM Processing** | Groq + LLaMA-3-70B | Configurable temperature, output parsing, retry logic |
| 3 | **Document Parsing** | PyMuPDF + python-docx + BeautifulSoup | PDF, DOCX, TXT, URL, raw text with metadata preservation |
| 4 | **Text Chunking** | RecursiveCharacterTextSplitter | Adaptive chunk optimizer with configurable size and overlap |
| 5 | **Vector Embeddings** | all-MiniLM-L6-v2 (HuggingFace) | Efficient 384-dimensional dense embeddings |
| 6 | **Vector Database** | ChromaDB | Persistent storage, similarity search, source-level deletion |
| 7 | **Sparse Index** | BM25 (rank-bm25) | Keyword-based sparse retrieval with custom tokenizer |
| 8 | **Hybrid Search** | Dense + Sparse fusion via RRF | Reciprocal Rank Fusion merges both retrieval signals |
| 9 | **Reranking** | CrossEncoder (sentence-transformers) | Re-scores top-K candidates for precision-first results |
| 10 | **Compression** | LLM-based context compression | Reduces retrieved chunks to only query-relevant sentences |
| 11 | **RAG Workflows** | LangGraph (4 modes) | Naive, Advanced, CRAG, Self-RAG with conditional edges |
| 12 | **Domain Presets** | 7 domain profiles | General, Medical, Legal, Finance, Education, Technical, Customer Support |
| 13 | **Prompt Engineering** | Domain-aware prompt templates | Separate system prompts per domain and per RAG workflow |
| 14 | **Chat System** | Session-based multi-turn chat | Session management, history store, auto title generation |
| 15 | **Web Fallback** | Wikipedia API + LangChain | Auto-triggered on low-confidence or empty index |
| 16 | **CLI Interface** | Interactive terminal CLI | Commands for ingestion, querying, and session management |
| 17 | **Source Management** | Per-source ingestion tracking | Deduplication, source registry, per-source deletion |
| 18 | **Index Management** | Full index lifecycle control | Status, per-source removal, full clear |
| 19 | **User Interface** | React 18 + Vite + Tailwind CSS | SPA with chat, ingestion, domains, index, and admin pages |
| 20 | **Containerization** | Docker + Docker Compose | Production-ready multi-service deployment |
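
The Reciprocal Rank Fusion step in row 8 can be illustrated with a minimal sketch. This is not the code in `app/retrieval/fusion.py`; the function name `rrf_fuse`, the chunk IDs, and the smoothing constant `k = 60` (a commonly used value) are assumptions made for the example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["c3", "c1", "c7"]   # hypothetical vector-search order
sparse = ["c1", "c9", "c3"]  # hypothetical BM25 order
fused = rrf_fuse([dense, sparse])
print(fused)
```

A chunk that both retrievers rank well (`c1`, `c3`) rises above chunks that only one retriever found, which is the behavior a reranker then refines.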
---
## **Project Structure**
```
AutoDocThinker/
│
├── .github/
│   └── workflows/
│       ├── ci-cd.yml  # Full CI/CD pipeline (lint → test → build → deploy)
│       └── docker.yml  # Docker build & push to GHCR on release
│
├── backend/  # FastAPI backend application
│   ├── .dockerignore
│   ├── .env.example  # Environment variables template
│   ├── .flake8  # Flake8 linting configuration
│   ├── Dockerfile  # Backend Docker image
│   ├── pyproject.toml  # Project metadata and tool config
│   ├── requirements.txt  # Python dependencies
│   ├── run.py  # Backend entry point (Uvicorn launcher)
│   ├── split.py  # Dev utility for splitting test output
│   │
│   ├── app/  # Main application package
│   │   ├── __init__.py
│   │   ├── application.py  # FastAPI app factory
│   │   ├── dependencies.py  # DI container (IoC box)
│   │   ├── exceptions.py  # Global exception handlers
│   │   ├── lifecycle.py  # Startup / shutdown hooks
│   │   ├── logging_config.py  # Structured logging setup
│   │   ├── main.py  # ASGI entry point
│   │   │
│   │   ├── api/  # HTTP route handlers
│   │   │   ├── __init__.py
│   │   │   ├── admin_routes.py  # GET /admin/summary
│   │   │   ├── chat_routes.py  # Chat session CRUD & query
│   │   │   ├── domain_routes.py  # Domain preset listing
│   │   │   ├── health_routes.py  # GET /health
│   │   │   ├── index_routes.py  # Index status, clear, per-source delete
│   │   │   ├── ingestion_routes.py  # File upload, URL, raw text ingestion
│   │   │   ├── rag_routes.py  # RAG query, mode listing, profiles
│   │   │   └── router.py  # Central router aggregator
│   │   │
│   │   ├── chat/  # Chat session management
│   │   │   ├── __init__.py
│   │   │   ├── history_store.py  # In-memory chat history store
│   │   │   ├── memory.py  # LangChain memory adapter
│   │   │   ├── message.py  # Message dataclass
│   │   │   ├── service.py  # Chat service (create/get/query session)
│   │   │   ├── session.py  # Session model
│   │   │   └── title_generator.py  # Auto-generate session titles via LLM
│   │   │
│   │   ├── cli/  # Interactive command-line interface
│   │   │   ├── __init__.py
│   │   │   ├── commands.py  # CLI command definitions
│   │   │   ├── interactive.py  # REPL loop
│   │   │   └── printing.py  # Rich terminal output helpers
│   │   │
│   │   ├── core/  # Core config & constants
│   │   │   ├── __init__.py
│   │   │   ├── config.py  # RAGConfig frozen dataclass (v3.0.0)
│   │   │   ├── constants.py  # App-wide constant values
│   │   │   ├── environment.py  # Env var loader
│   │   │   ├── errors.py  # Base custom exception classes
│   │   │   └── paths.py  # Path resolution helpers
│   │   │
│   │   ├── domain/  # Domain preset system
│   │   │   ├── __init__.py
│   │   │   ├── defaults.py  # Default domain selection logic
│   │   │   ├── models.py  # Domain Pydantic models
│   │   │   ├── registry.py  # Domain registry (name → preset)
│   │   │   ├── selector.py  # Domain auto-selector
│   │   │   ├── validator.py  # Domain input validator
│   │   │   └── presets/  # Per-domain configuration
│   │   │       ├── __init__.py
│   │   │       ├── customer_support.py
│   │   │       ├── education.py
│   │   │       ├── finance.py
│   │   │       ├── general.py
│   │   │       ├── legal.py
│   │   │       ├── medical.py
│   │   │       └── technical.py
│   │   │
│   │   ├── indexing/  # Hybrid index (vector + BM25)
│   │   │   ├── __init__.py
│   │   │   ├── bm25_index.py  # BM25 sparse index implementation
│   │   │   ├── chroma_store.py  # ChromaDB collection wrapper
│   │   │   ├── deduplication.py  # Chunk deduplication logic
│   │   │   ├── hybrid_index.py  # Unified hybrid index interface
│   │   │   ├── locking.py  # Thread-safe write locking
│   │   │   ├── persistence.py  # Index persistence helpers
│   │   │   ├── source_registry.py  # Per-source tracking registry
│   │   │   ├── stats.py  # Index statistics
│   │   │   ├── tokenizer.py  # Custom BM25 tokenizer
│   │   │   └── vector_index.py  # Vector index operations
│   │   │
│   │   ├── ingestion/  # Document ingestion pipeline
│   │   │   ├── __init__.py
│   │   │   ├── chunk_optimizer.py  # Adaptive chunking strategy
│   │   │   ├── document.py  # Document dataclass
│   │   │   ├── document_processor.py  # Load → clean → metadata injection
│   │   │   ├── file_validation.py  # File type and size validation
│   │   │   ├── metadata.py  # Metadata extraction helpers
│   │   │   ├── service.py  # Ingestion orchestrator
│   │   │   ├── source_id.py  # Deterministic source ID generation
│   │   │   ├── supported_types.py  # Allowed file type registry
│   │   │   └── loaders/  # Format-specific document loaders
│   │   │       ├── __init__.py
│   │   │       ├── base.py  # Abstract loader interface
│   │   │       ├── docx_loader.py  # DOCX (python-docx) loader
│   │   │       ├── factory.py  # Routes file_type → loader instance
│   │   │       ├── pdf_loader.py  # PDF (PyMuPDF) loader
│   │   │       ├── text_loader.py  # Raw pasted-text loader
│   │   │       ├── txt_loader.py  # Plain .txt file loader
│   │   │       └── url_loader.py  # Web URL scraper (BeautifulSoup)
│   │   │
│   │   ├── llm/  # LLM & embedding clients
│   │   │   ├── __init__.py
│   │   │   ├── chain_factory.py  # LangChain chain builder
│   │   │   ├── embedding_client.py  # HuggingFace embedding wrapper
│   │   │   ├── fallback.py  # LLM fallback / error recovery
│   │   │   ├── groq_client.py  # Groq API client (LLaMA-3)
│   │   │   ├── output_parser.py  # Structured LLM output parser
│   │   │   └── wikipedia_client.py  # Wikipedia API client
│   │   │
│   │   ├── prompts/  # Prompt templates
│   │   │   ├── __init__.py
│   │   │   ├── answer.py  # Final answer generation prompt
│   │   │   ├── base.py  # Base prompt template
│   │   │   ├── compression.py  # Context compression prompt
│   │   │   ├── crag.py  # CRAG-specific prompts
│   │   │   ├── evaluation.py  # Relevance evaluation prompt
│   │   │   ├── query_rewrite.py  # Query rewriting prompt
│   │   │   ├── self_rag.py  # Self-RAG reflection prompts
│   │   │   └── domain/  # Domain-specific system prompts
│   │   │       ├── __init__.py
│   │   │       ├── customer_support.py
│   │   │       ├── education.py
│   │   │       ├── finance.py
│   │   │       ├── general.py
│   │   │       ├── legal.py
│   │   │       ├── medical.py
│   │   │       └── technical.py
│   │   │
│   │   ├── rag/  # RAG orchestration layer
│   │   │   ├── __init__.py
│   │   │   ├── citations.py  # Citation extraction and formatting
│   │   │   ├── formatting.py  # Response formatter
│   │   │   ├── history.py  # Conversation history helpers
│   │   │   ├── modes.py  # RAG mode enum (naive/advanced/crag/self_rag)
│   │   │   ├── service.py  # RAG service (query entry point)
│   │   │   └── state.py  # LangGraph shared state schema
│   │   │
│   │   ├── retrieval/  # Retrieval & ranking pipeline
│   │   │   ├── __init__.py
│   │   │   ├── bm25_search.py  # BM25 sparse search
│   │   │   ├── compressor.py  # LLM-based chunk compressor
│   │   │   ├── filters.py  # Metadata pre-filters
│   │   │   ├── fusion.py  # Reciprocal Rank Fusion (RRF)
│   │   │   ├── hybrid_search.py  # Combined dense + sparse search
│   │   │   ├── ranking.py  # Score normalization & ranking
│   │   │   ├── reranker.py  # CrossEncoder reranker
│   │   │   ├── scoring.py  # Relevance scoring utilities
│   │   │   ├── service.py  # Retrieval service (main interface)
│   │   │   └── vector_search.py  # ChromaDB vector search
│   │   │
│   │   ├── schemas/  # Pydantic request/response schemas
│   │   │   ├── __init__.py
│   │   │   ├── chat.py  # Chat session schemas
│   │   │   ├── common.py  # Shared base schemas
│   │   │   ├── domain.py  # Domain schemas
│   │   │   ├── error.py  # Error response schema
│   │   │   ├── health.py  # Health check schema
│   │   │   ├── history.py  # History schemas
│   │   │   ├── index.py  # Index schemas
│   │   │   ├── ingestion.py  # Ingestion request/response schemas
│   │   │   ├── rag.py  # RAG query request/response schemas
│   │   │   ├── rag_profile.py  # RAG profile schema
│   │   │   └── source.py  # Source metadata schema
│   │   │
│   │   ├── storage/  # File and vector storage management
│   │   │   ├── __init__.py
│   │   │   ├── cleanup.py  # Storage cleanup utilities
│   │   │   ├── file_storage.py  # File system operations
│   │   │   ├── paths.py  # Storage path resolution
│   │   │   ├── upload_storage.py  # Upload directory management
│   │   │   └── vector_storage.py  # Vector store path management
│   │   │
│   │   ├── utils/  # Shared utility modules
│   │   │   ├── __init__.py
│   │   │   ├── hashing.py  # Content hashing (SHA-256)
│   │   │   ├── retry.py  # Exponential backoff retry decorator
│   │   │   ├── serialization.py  # JSON serialization helpers
│   │   │   ├── testing.py  # Test utility helpers
│   │   │   ├── text.py  # Text normalization utilities
│   │   │   ├── time.py  # Timestamp helpers
│   │   │   └── validation.py  # Input validation utilities
│   │   │
│   │   └── workflows/  # LangGraph workflow definitions
│   │       ├── __init__.py
│   │       ├── finalize.py  # Shared finalization node
│   │       ├── advanced/  # Advanced RAG workflow
│   │       │   ├── __init__.py
│   │       │   ├── compat.py  # Backward-compat adapter
│   │       │   ├── edges.py  # Conditional edge logic
│   │       │   ├── graph.py  # LangGraph graph definition
│   │       │   └── nodes.py  # Workflow node functions
│   │       ├── crag/  # Corrective RAG workflow
│   │       │   ├── __init__.py
│   │       │   ├── compat.py
│   │       │   ├── edges.py
│   │       │   ├── graph.py
│   │       │   └── nodes.py
│   │       ├── naive/  # Naive RAG workflow
│   │       │   ├── __init__.py
│   │       │   ├── compat.py
│   │       │   ├── edges.py
│   │       │   ├── graph.py
│   │       │   └── nodes.py
│   │       └── self_rag/  # Self-RAG workflow
│   │           ├── __init__.py
│   │           ├── compat.py
│   │           ├── edges.py
│   │           ├── graph.py
│   │           └── nodes.py
│   │
│   ├── data/
│   │   └── vector_store/  # ChromaDB persistent storage
│   │
│   ├── notebooks/
│   │   ├── experiment.ipynb  # Exploratory experiments
│   │   └── fix-final.ipynb  # Debug notebook
│   │
│   ├── uploads/  # User-uploaded documents (runtime)
│   │
│   └── tests/  # Full test suite
│       ├── conftest.py  # Shared fixtures and DI overrides
│       ├── api/
│       │   ├── test_admin_routes.py
│       │   ├── test_chat_routes.py
│       │   ├── test_domain_routes.py
│       │   ├── test_health_route.py
│       │   ├── test_index_routes.py
│       │   ├── test_ingest_text_routes.py
│       │   ├── test_ingestion_routes.py
│       │   ├── test_rag_routes.py
│       │   └── test_upload_routes.py
│       ├── chat/
│       │   ├── test_chat_service.py
│       │   ├── test_chat_session.py
│       │   ├── test_history_store.py
│       │   ├── test_make_message.py
│       │   ├── test_memory.py
│       │   └── test_title_generator.py
│       ├── core/
│       │   ├── test_application.py
│       │   ├── test_c_l_i.py
│       │   ├── test_config.py
│       │   ├── test_constants.py
│       │   ├── test_environment.py
│       │   ├── test_errors.py
│       │   ├── test_lifecycle_and_exceptions.py
│       │   ├── test_logging.py
│       │   └── test_paths.py
│       ├── domain/
│       │   ├── test_defaults.py
│       │   ├── test_domain_profile.py
│       │   ├── test_domain_prompt_constants.py
│       │   ├── test_registry.py
│       │   ├── test_selector.py
│       │   └── test_validator.py
│       ├── indexing/
│       │   ├── test_b_m25_index.py
│       │   ├── test_b_m25_search.py
│       │   ├── test_chroma_store.py
│       │   ├── test_compressor.py
│       │   ├── test_deduplication.py
│       │   ├── test_filters.py
│       │   ├── test_fusion.py
│       │   ├── test_hybrid_index.py
│       │   ├── test_hybrid_search.py
│       │   ├── test_locking.py
│       │   ├── test_persistence.py
│       │   ├── test_ranking.py
│       │   ├── test_reranker.py
│       │   ├── test_retrieval_service.py
│       │   ├── test_scoring.py
│       │   ├── test_source_registry.py
│       │   ├── test_stats.py
│       │   ├── test_tokenizer.py
│       │   ├── test_vector_index.py
│       │   └── test_vector_search.py
│       ├── ingestion/
│       │   ├── test_base_loader.py
│       │   ├── test_chunk_optimizer.py
│       │   ├── test_document.py
│       │   ├── test_document_processor.py
│       │   ├── test_docx_loader.py
│       │   ├── test_file_validation.py
│       │   ├── test_ingestion_service.py
│       │   ├── test_loader_factory.py
│       │   ├── test_metadata.py
│       │   ├── test_pdf_loader.py
│       │   ├── test_source_id.py
│       │   ├── test_standalone_functions.py
│       │   ├── test_supported_types.py
│       │   ├── test_text_loader.py
│       │   ├── test_txt_loader.py
│       │   └── test_url_loader.py
│       ├── llm/
│       │   ├── test_chain_factory.py
│       │   ├── test_embedding_client.py
│       │   ├── test_fallback.py
│       │   ├── test_groq_client.py
│       │   ├── test_output_parser.py
│       │   ├── test_prompts.py
│       │   └── test_wikipedia.py
│       ├── rag/
│       │   ├── test_advanced_workflow.py
│       │   ├── test_c_r_a_g_workflow.py
│       │   ├── test_citations.py
│       │   ├── test_finalize.py
│       │   ├── test_history.py
│       │   ├── test_modes.py
│       │   ├── test_naive_workflow.py
│       │   ├── test_process_query.py
│       │   ├── test_r_a_g_service.py
│       │   ├── test_self_r_a_g_workflow.py
│       │   └── test_state.py
│       ├── schemas/
│       │   └── test_schemas.py
│       ├── storage/
│       │   └── test_storage.py
│       └── utils/
│           └── test_utils.py
│
├── frontend/  # React frontend application
│   ├── .dockerignore
│   ├── .gitignore
│   ├── Dockerfile  # Frontend Docker image (Nginx)
│   ├── index.html  # HTML entry point
│   ├── package.json  # Node.js dependencies
│   ├── package-lock.json
│   ├── README.md
│   ├── vite.config.js  # Vite bundler config
│   ├── public/
│   │   └── favicon.svg
│   └── src/
│       ├── api.js  # Centralized API client (fetch wrappers)
│       ├── App.jsx  # Root component with React Router
│       ├── index.css  # Global Tailwind CSS styles
│       ├── main.jsx  # React entry point
│       └── components/
│           ├── AdminPage.jsx  # System summary dashboard
│           ├── ChatPage.jsx  # Multi-turn AI chat interface
│           ├── DomainsPage.jsx  # Domain preset browser
│           ├── IndexPage.jsx  # Index status and management
│           ├── IngestPage.jsx  # Document upload / URL / text ingestion
│           └── Sidebar.jsx  # Navigation sidebar
│
├── .gitignore
├── demo.mp4  # Project demo video
├── demo.png  # Project screenshot
├── docker-compose.yml  # Multi-service orchestration
├── Dockerfile  # Root multi-stage Docker image
├── LICENSE
├── README.md
├── render.yml  # Render.com deployment config
└── run.py  # Root entry point (starts backend)
```
---
## **Architecture Pattern: Modular Monolithic Architecture**
This project follows a **Modular Monolithic Architecture** with the following design patterns:
| Pattern | Where Used | Purpose |
|---------|------------|---------|
| **App Factory** | `app/application.py` | Configurable FastAPI app creation |
| **IoC Container** | `app/dependencies.py` | Dependency injection box wires all services |
| **Frozen Config** | `app/core/config.py` | Immutable `RAGConfig` dataclass for all settings |
| **Strategy** | `app/workflows/*/` | Four interchangeable RAG workflow strategies |
| **State Machine** | `app/workflows/*/graph.py` | LangGraph conditional state transitions |
| **Template Method** | `app/ingestion/loaders/base.py` | Common loader interface per file type |
| **Repository** | `app/indexing/hybrid_index.py` | Unified data access over vector + BM25 stores |
| **Registry** | `app/domain/registry.py` | Name-keyed domain preset lookup |
| **Singleton** | `app/dependencies.py` | Single shared instances of index, LLM, embedder |
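
As a rough illustration of how the Registry and Strategy rows fit together, the sketch below keeps a name-keyed table of interchangeable workflow callables and dispatches on the requested mode. Every name here (`WORKFLOWS`, `register`, `run`, the two toy workflows) is hypothetical and not taken from the repository, where the workflows are LangGraph graphs.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a mode name to a workflow callable.
WORKFLOWS: Dict[str, Callable[[str], str]] = {}


def register(mode: str):
    """Decorator that records a workflow under its mode name."""
    def decorator(fn):
        WORKFLOWS[mode] = fn
        return fn
    return decorator


@register("naive")
def naive_rag(question: str) -> str:
    return f"[naive] retrieve -> answer: {question}"


@register("crag")
def corrective_rag(question: str) -> str:
    return f"[crag] retrieve -> grade -> answer: {question}"


def run(mode: str, question: str) -> str:
    """Select a strategy by name; unknown modes fail loudly."""
    try:
        workflow = WORKFLOWS[mode]
    except KeyError:
        raise ValueError(f"unknown RAG mode: {mode!r}") from None
    return workflow(question)


print(run("naive", "What is RRF?"))
```

Adding a new mode in this arrangement means registering one more callable; the dispatch code in `run` does not change.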
---
## **System Architecture**
```mermaid
graph TD
UI[React Frontend]:::ui -->|HTTP REST| API[FastAPI Server]:::server
API --> IGR[Ingestion Routes]:::route
API --> RAGR[RAG Routes]:::route
API --> CHR[Chat Routes]:::route
API --> IDR[Index Routes]:::route
API --> DMR[Domain & Admin Routes]:::route
IGR --> IS[Ingestion Service]:::processor
IS --> DP[Document Processor + Chunk Optimizer]:::splitter
DP --> HI[Hybrid Index]:::database
HI --> VI[Vector Index / ChromaDB]:::database
HI --> BI[BM25 Sparse Index]:::database
RAGR --> RS[RAG Service]:::rag
RS --> WS{Workflow Selector}:::router
WS -->|naive| NW[Naive RAG]:::workflow
WS -->|advanced| AW[Advanced RAG]:::workflow
WS -->|crag| CW[CRAG Workflow]:::workflow
WS -->|self_rag| SW[Self-RAG Workflow]:::workflow
NW & AW & CW & SW --> RET[Retrieval Service]:::retriever
RET --> HS[Hybrid Search Dense + Sparse]:::retriever
HS --> VI
HS --> BI
HS --> RRF[RRF Fusion + CrossEncoder Reranker]:::retriever
RRF --> LLM[Groq LLM / LLaMA-3-70B]:::llm
CW & SW -->|low confidence| WK[Wikipedia Fallback]:::fallback
WK --> LLM
LLM --> FR[Formatted Response + Citations]:::executor
FR --> API
CHR --> CS[Chat Session Service]:::chat
CS --> HS2[History Store + Title Generator]:::chat
DMR --> DR[Domain Registry - 7 Presets]:::domain
classDef ui fill:#4e79a7,color:white;
classDef server fill:#f28e2b,color:white;
classDef route fill:#e15759,color:white;
classDef processor fill:#76b7b2,color:white;
classDef splitter fill:#edc948,color:#333;
classDef database fill:#8cd17d,color:#333;
classDef rag fill:#499894,color:white;
classDef router fill:#b07aa1,color:white;
classDef workflow fill:#86bcb6,color:#333;
classDef retriever fill:#59a14f,color:white;
classDef fallback fill:#f1ce63,color:#333;
classDef llm fill:#d37295,color:white;
classDef executor fill:#b3b3b3,color:#333;
classDef chat fill:#a0d6e5,color:#333;
classDef domain fill:#ff9da7,color:#333;
```
---
## **Installation**
### Prerequisites
- Python 3.11+
- Node.js 18+ (for frontend)
- Groq API Key
### Using pip
```bash
# Clone the repository
git clone https://github.com/Md-Emon-Hasan/AutoDocThinker.git
cd AutoDocThinker
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install backend dependencies
cd backend
pip install -r requirements.txt
# Copy and configure environment
cp .env.example .env
# Edit .env with your API keys
# Run the backend
python run.py
```
### Using Docker
The project includes a **Root Multi-stage Dockerfile** that builds both the React frontend and the FastAPI backend into a single deployable container.
```bash
# RECOMMENDED: Build and run with Docker Compose
docker-compose up -d --build
# OR: Build the root Docker image manually
docker build -t auto-doc-thinker .
# Run the container
docker run -p 5000:5000 --env-file backend/.env auto-doc-thinker
```
> [!NOTE]
> The container serves the **Frontend UI at the same port as the Backend (5000)** when built via the root Dockerfile.
---
## **Configuration**
Key environment variables in `backend/.env`:
| Variable | Description | Default |
|----------|-------------|---------|
| `GROQ_API_KEY` | Groq API key for LLaMA-3 | required |
| `HUGGINGFACEHUB_API_TOKEN` | HuggingFace token for embeddings | required |
| `GOOGLE_API_KEY` | Google API key (optional integrations) | optional |
| `TAVILY_API_KEY` | Tavily search API key | optional |
| `SERPER_API_KEY` | Serper web search API key | optional |
| `FLASK_ENV` | Environment mode | development |
| `SECRET_KEY` | Application secret key | change in production |
Core RAG parameters are set via the frozen `RAGConfig` dataclass in `backend/app/core/config.py`:
| Parameter | Description | Default |
|-----------|-------------|---------|
| `default_domain` | Domain preset used when none specified | `general` |
| `default_mode` | RAG workflow used when none specified | `advanced` |
| `initial_k` | Candidates retrieved before reranking | `20` |
| `rerank_top_k` | Final chunks passed to LLM after reranking | `5` |
| `crag_high_confidence` | CRAG score threshold for direct answer | `0.6` |
| `crag_low_confidence` | CRAG score threshold for Wikipedia fallback | `0.3` |
| `supported_extensions` | Accepted file types | `.pdf`, `.docx`, `.txt` |
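
To make the table concrete, here is a minimal sketch of what a frozen config with these defaults might look like, plus one way the two CRAG thresholds could drive routing. The field names and defaults come from the table; the `crag_route` function, its return labels, and the exact comparison operators are assumptions, not the project's actual logic.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Tuple


@dataclass(frozen=True)
class RAGConfig:
    # Defaults copied from the table above; the real class may differ.
    default_domain: str = "general"
    default_mode: str = "advanced"
    initial_k: int = 20
    rerank_top_k: int = 5
    crag_high_confidence: float = 0.6
    crag_low_confidence: float = 0.3
    supported_extensions: Tuple[str, ...] = (".pdf", ".docx", ".txt")


def crag_route(score: float, cfg: RAGConfig) -> str:
    """Hypothetical routing: answer directly, fall back, or take a middle path."""
    if score >= cfg.crag_high_confidence:
        return "answer"
    if score < cfg.crag_low_confidence:
        return "wikipedia_fallback"
    return "refine"  # assumed label for the in-between band


cfg = RAGConfig()
print(crag_route(0.72, cfg), crag_route(0.45, cfg), crag_route(0.10, cfg))

try:
    cfg.initial_k = 50  # frozen dataclasses reject attribute assignment
except FrozenInstanceError:
    print("config is immutable")
```

Because the dataclass is frozen, changing a setting means constructing a new instance, for example `RAGConfig(initial_k=50)`.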
---
## **API Endpoints**
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/docs` | GET | Swagger interactive API documentation |
| `/redoc` | GET | ReDoc API documentation |
| `/rag-modes` | GET | List available RAG modes |
| `/rag-profiles` | GET | List RAG profiles per domain |
| `/rag/query` | POST | Run a RAG query (domain + mode + history) |
| `/ingest/source` | POST | Ingest document from file path or URL source |
| `/ingest/upload` | POST | Upload a file (PDF / DOCX / TXT) |
| `/ingest/text` | POST | Ingest raw pasted text |
| `/index/status` | GET | Get index stats (chunk count, sources) |
| `/index/source/{source_id}` | DELETE | Remove a specific ingested source |
| `/index` | DELETE | Clear the entire index |
| `/chat/sessions` | POST | Create a new chat session |
| `/chat/sessions/{id}` | GET | Retrieve an existing session |
| `/chat/sessions/{id}/select-profile` | POST | Set domain and RAG mode for a session |
| `/chat/sessions/{id}/query` | POST | Send a message in a session |
| `/domains` | GET | List all available domain presets |
| `/admin/summary` | GET | System summary (domains, chunk count) |
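
A call to `/rag/query` might be prepared as in the sketch below, using only the standard library. The payload field names (`question`, `domain`, `mode`, `history`) are guesses based on the endpoint description, and the base URL assumes the port used in the Docker example; the Swagger page at `/docs` is the place to confirm the real request schema.

```python
import json
from urllib import request

BASE_URL = "http://localhost:5000"  # assumed; port taken from the Docker example

# Field names are assumptions; check the generated /docs page for the real schema.
payload = {
    "question": "What does the contract say about termination?",
    "domain": "legal",
    "mode": "crag",
    "history": [],
}

req = request.Request(
    f"{BASE_URL}/rag/query",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)


def send(prepared):
    """Send the prepared request and decode the JSON body."""
    with request.urlopen(prepared, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))

# With a server running, the request is sent with: send(req)
```

The request object is built separately from the network call so that the payload can be inspected or logged before anything is sent.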
---
## **Usage**
1. **Select a Domain**: Choose the domain that best matches your documents (e.g., Medical, Legal, Finance)
2. **Select a RAG Mode**: Pick `naive` for speed, `advanced` for quality, `crag` or `self_rag` for highest accuracy
3. **Upload a Document**: Choose PDF, DOCX, TXT, paste a URL, or type raw text directly
4. **Click "Ingest"**: System loads, chunks, embeds, and indexes into the Hybrid Index (Vector + BM25)
5. **Ask Questions**: Chat with your documents using natural language in the Chat page
6. **Get AI Answers**: Responses include source citations; if no relevant documents exist, Wikipedia fallback activates automatically
7. **Manage Index**: Use the Index page to view ingested sources or remove specific documents
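
The chunking part of step 4 can be pictured with a simplified stand-in. The project uses `RecursiveCharacterTextSplitter`, which also respects separators such as paragraphs and sentences; the sketch below is only a plain sliding window over characters, and `chunk_text` with its default sizes is a hypothetical helper.

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into fixed-size character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end
    return chunks


sample = "".join(str(i % 10) for i in range(500))
parts = chunk_text(sample)
print(len(parts), [len(p) for p in parts])
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighboring chunk, at the cost of some duplicated text in the index.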
---
## **Running Tests**
### Backend Tests
Navigate to the `backend` directory first:
```bash
cd backend
```
Then run the tests:
```bash
# Run all tests
pytest tests/ -v
# Run with coverage
pytest tests/ -v --cov=app --cov-report=html
# Run async tests
pytest tests/ -v --asyncio-mode=auto
# Run a specific test module
pytest tests/rag/ -v
pytest tests/indexing/ -v
```
---
## **Testing Strategy & Quality Assurance**
We employ a comprehensive testing strategy using **Pytest** and **unittest.mock** to ensure reliability and maintainability across all modules.
### **1. Unit Testing (White-Box Testing)**
- **Isolation**: Each module (Ingestion, Indexing, Retrieval, RAG, Chat, LLM, Schemas, Storage, Utils) is tested in isolation.
- **Mocking**: External dependencies (Groq API, ChromaDB, Wikipedia, HuggingFace) are mocked to ensure tests are deterministic and do not require network access.
- **Technique**: `patch` and `MagicMock` simulate external behaviors, error conditions, and edge cases.
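
A test in this style might look like the following sketch, which uses `MagicMock` to simulate a failing LLM client without any network access. The helper `answer_with_fallback` and the `generate` and `search` method names are invented for the example and do not mirror the project's actual client interfaces.

```python
from unittest.mock import MagicMock


def answer_with_fallback(llm, wiki, question):
    """Hypothetical helper: try the LLM, fall back to Wikipedia on failure."""
    try:
        return llm.generate(question)
    except RuntimeError:
        return wiki.search(question)


def test_fallback_used_when_llm_fails():
    llm = MagicMock()
    llm.generate.side_effect = RuntimeError("rate limited")  # simulated failure
    wiki = MagicMock()
    wiki.search.return_value = "summary from Wikipedia"

    result = answer_with_fallback(llm, wiki, "What is BM25?")

    assert result == "summary from Wikipedia"
    llm.generate.assert_called_once_with("What is BM25?")
    wiki.search.assert_called_once_with("What is BM25?")


test_fallback_used_when_llm_fails()
```

Setting `side_effect` to an exception makes the mock raise on call, which is how error paths can be exercised deterministically.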
### **2. Edge Case & Error Handling**
- **Boundary Value Analysis**: Testing empty inputs, invalid file types, oversized payloads, missing sessions, and unknown domains.
- **Exception Handling**: Verifying the system gracefully handles API rate limits (429), LLM downtime (500), and invalid ingestion requests (400) with correct HTTP status codes.
### **3. Integration Testing (Simulated)**
- **Workflow Graph**: Each of the four LangGraph workflows (Naive, Advanced, CRAG, Self-RAG) is tested by simulating state transitions through nodes and conditional edges.
- **API Endpoints**: All FastAPI routes are tested with `TestClient` to verify HTTP status codes, response schemas, and error payloads.
- **Hybrid Index**: End-to-end ingestion → hybrid search → RRF fusion → reranker pipeline tested with in-memory mocks.
### **4. Module Coverage**
Tests span every backend module: `api`, `chat`, `core`, `domain`, `indexing`, `ingestion`, `llm`, `rag`, `schemas`, `storage`, `utils`, and all four workflow variants.
### **5. Code Quality Metrics**
- **100% Test Coverage Goal**: Every code path executed during test runs.
- **Linting**: Strict adherence to **PEP 8** standards enforced via `flake8` (config in `.flake8`), `isort`, and `black`.
- **Type Safety**: Pydantic v2 models enforce runtime data validation across all API boundaries.
---
## **Log Management**
The application uses a structured logging system for monitoring and debugging, configured in `app/logging_config.py`.
- **Storage**: Logs are stored in `logs/app.log`.
- **Rotation**: Automatic log rotation (10MB per file, keeping last 5 backups) prevents disk overflow.
- **Format**: `YYYY-MM-DD HH:MM:SS - logger_name - LEVEL - [file:line] - message`
- **Levels**:
  - `INFO`: General operational events (requests, ingestion, state transitions).
  - `DEBUG`: Detailed debugging information (only in development).
  - `ERROR`: Exceptions and critical failures (stack traces included).
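
The rotation and format settings listed above map onto the standard library's `RotatingFileHandler` roughly as follows. This is a sketch under that assumption, not the contents of `app/logging_config.py`; the function name `build_logger` and the logger name are hypothetical.

```python
import logging
import os
from logging.handlers import RotatingFileHandler

LOG_FORMAT = (
    "%(asctime)s - %(name)s - %(levelname)s - "
    "[%(filename)s:%(lineno)d] - %(message)s"
)


def build_logger(name="autodocthinker", log_path="logs/app.log", debug=False):
    """Return a logger that writes to a size-rotated file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG if debug else logging.INFO)
    if logger.handlers:  # already configured; avoid duplicate handlers
        return logger
    os.makedirs(os.path.dirname(log_path) or ".", exist_ok=True)
    handler = RotatingFileHandler(
        log_path,
        maxBytes=10 * 1024 * 1024,  # 10 MB per file
        backupCount=5,              # keep the last 5 backups
        encoding="utf-8",
    )
    handler.setFormatter(logging.Formatter(LOG_FORMAT, datefmt="%Y-%m-%d %H:%M:%S"))
    logger.addHandler(handler)
    return logger
```

Calling `build_logger(debug=True)` would enable `DEBUG` output for development, while the default keeps the level at `INFO`.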
---
## **Tech Stack**
| Category | Technologies |
|----------|--------------|
| **Backend** | FastAPI, Uvicorn, Python 3.11 |
| **AI / LLM** | Groq API (LLaMA-3-70B), LangChain, LangGraph |
| **Embeddings** | HuggingFace `all-MiniLM-L6-v2`, CrossEncoder reranker |
| **Vector Database** | ChromaDB (persistent dense vector store) |
| **Sparse Index** | BM25 via `rank-bm25` |
| **Hybrid Search** | Dense + Sparse fusion with Reciprocal Rank Fusion (RRF) |
| **Web Fallback** | Wikipedia API via LangChain |
| **Frontend** | React 18, Vite, Tailwind CSS |
| **DevOps** | Docker, Docker Compose, GitHub Actions, Render |
---
## **CI/CD Pipeline**
This project uses **GitHub Actions** for continuous integration and deployment.
### Pipeline Stages
```
┌─────────┐    ┌─────────┐    ┌──────────┐    ┌─────────┐    ┌──────────┐
│  Lint   │───▶│  Test   │───▶│ Security │───▶│  Build  │───▶│  Deploy  │
│ (Black, │    │(pytest) │    │ (Safety, │    │(Docker) │    │ (Render) │
│ Flake8) │    │         │    │ Bandit)  │    │         │    │          │
└─────────┘    └─────────┘    └──────────┘    └─────────┘    └──────────┘
```
### Workflow Files
| File | Trigger | Purpose |
|------|---------|---------|
| `ci-cd.yml` | Push/PR to main | Full CI/CD pipeline |
| `docker.yml` | Release published | Build & push to GHCR |
### Required Secrets
| Secret | Description |
|--------|-------------|
| `GROQ_API_KEY` | Groq API key for test runs |
| `RENDER_DEPLOY_HOOK` | Render deploy webhook URL |
---
## **Author**
**Md Emon Hasan**
- Email: [emon.mlengineer@gmail.com](mailto:emon.mlengineer@gmail.com)
- LinkedIn: [md-emon-hasan](https://www.linkedin.com/in/md-emon-hasan-695483237/)
- GitHub: [Md-Emon-Hasan](https://github.com/Md-Emon-Hasan)
- Facebook: [Md-Emon-Hasan](https://www.facebook.com/mdemon.hasan2001/)
- WhatsApp: [+8801834363533](https://wa.me/8801834363533)
---
## **License**
MIT License - see [LICENSE](LICENSE) file for details.
---
## **Contributing**
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request