MCP-RAG
- Host: GitHub
- URL: https://github.com/anuragb7/mcp-rag
- Owner: AnuragB7
- License: MIT
- Created: 2025-05-25T10:54:31.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-05-25T11:02:54.000Z (4 months ago)
- Last Synced: 2025-05-25T11:38:02.438Z (4 months ago)
- Language: Python
- Homepage: https://medium.com/@anurag.bombarde.dev/building-a-production-ready-mcp-rag-from-protocol-to-practice-55b9c466bf24
- Size: 0 Bytes
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# 📚 MCP-RAG
A Retrieval-Augmented Generation (RAG) system built on the Model Context Protocol (MCP) that handles large files (up to 200MB) using intelligent chunking strategies, multi-format document support, and enterprise-grade reliability.
[Python](https://www.python.org/downloads/)
[License: MIT](https://opensource.org/licenses/MIT)
[Model Context Protocol](https://github.com/modelcontextprotocol)

## 🌟 Features
### 📄 **Multi-Format Document Support**
- **PDF**: Intelligent page-by-page processing with table detection
- **DOCX**: Paragraph and table extraction with formatting preservation
- **Excel**: Sheet-aware processing with column context (.xlsx/.xls)
- **CSV**: Smart row batching with header preservation
- **PPTX**: PowerPoint presentation support
- **Images**: Support for JPEG, PNG, WebP, GIF, etc., with OCR text extraction

### 🚀 **Large File Processing**
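As an illustration of how this kind of multi-format routing can work (the registry and function names below are hypothetical placeholders, not this repo's API), a processor can be selected by file extension:

```python
from pathlib import Path

# Hypothetical registry mapping file extensions to processor names.
# The repo's actual processors are not shown here; these are placeholders.
PROCESSORS = {
    ".pdf": "pdf_processor",
    ".docx": "docx_processor",
    ".xlsx": "excel_processor",
    ".xls": "excel_processor",
    ".csv": "csv_processor",
    ".pptx": "pptx_processor",
    ".png": "image_ocr_processor",
    ".jpg": "image_ocr_processor",
    ".jpeg": "image_ocr_processor",
    ".webp": "image_ocr_processor",
    ".gif": "image_ocr_processor",
}

def pick_processor(filename: str) -> str:
    """Select a document processor by file extension (case-insensitive)."""
    ext = Path(filename).suffix.lower()
    try:
        return PROCESSORS[ext]
    except KeyError:
        raise ValueError(f"Unsupported file type: {ext}")
```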
- **Adaptive chunking**: Different strategies based on file size
- **Memory management**: Streaming processing for 50MB+ files
- **Progress tracking**: Real-time progress indicators
- **Timeout handling**: Graceful handling of long-running operations

### 🧠 **Advanced RAG Capabilities**
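A minimal sketch of size-based adaptive chunking; the parameter values and size tiers here are illustrative assumptions, with only the 50MB streaming cutoff taken from the feature list above:

```python
def chunk_params(file_size_bytes: int) -> dict:
    """Choose a chunking strategy from file size. Illustrative values;
    the repo's real thresholds are not documented in this README."""
    mb = file_size_bytes / (1024 * 1024)
    if mb < 10:
        # Small files: fine-grained chunks, processed fully in memory
        return {"chunk_size": 1000, "overlap": 200, "streaming": False}
    if mb < 50:
        # Medium files: larger chunks to cap the number of embeddings
        return {"chunk_size": 2000, "overlap": 200, "streaming": False}
    # 50MB+ files: stream the input so memory use stays bounded
    return {"chunk_size": 4000, "overlap": 400, "streaming": True}
```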
- **Semantic search**: Vector similarity with confidence scores
- **Cross-document queries**: Search across multiple documents simultaneously
- **Source attribution**: Citations with similarity scores
- **Hybrid retrieval**: Combine semantic and keyword search

### 🔌 **Model Context Protocol (MCP) Integration**
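The core of the semantic search described above is vector similarity with a per-chunk score that doubles as source attribution; a stripped-down sketch in plain Python (no vector database, and the field names are hypothetical):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def semantic_search(query_vec, index, top_k=3):
    """Rank indexed chunks by similarity; the returned (source, score)
    pairs are what citation/attribution is built from."""
    scored = [
        {"source": doc["source"], "score": cosine(query_vec, doc["vector"])}
        for doc in index
    ]
    scored.sort(key=lambda d: d["score"], reverse=True)
    return scored[:top_k]
```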
- **Universal tool interface**: Standardized AI-to-tool communication
- **Auto-discovery**: LangChain agents automatically find and use tools
- **Secure communication**: Built-in permission controls
- **Extensible architecture**: Easy to add new document processors

### 🏢 **Enterprise Ready**
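The auto-discovery idea can be illustrated with a toy registry. Note this is **not** the actual MCP SDK API, just the underlying pattern: tools self-describe so an agent can list and invoke them by name.

```python
# Illustrative tool registry, not the real MCP server interface.
TOOLS = {}

def tool(name: str, description: str):
    """Decorator that registers a function as a discoverable tool."""
    def register(fn):
        TOOLS[name] = {"fn": fn, "description": description}
        return fn
    return register

@tool("search_documents", "Semantic search over ingested documents")
def search_documents(query: str) -> str:
    return f"results for: {query}"

def discover_tools():
    """What an agent would see when listing the available tools."""
    return {name: meta["description"] for name, meta in TOOLS.items()}
```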
- **Custom LLM endpoints**: Support for any OpenAI-compatible API
- **Vector database options**: ChromaDB (local) + Milvus (production)
- **Batch processing**: Handles API rate limits and batch size constraints
- **Error recovery**: Retry logic and graceful degradation

## 🏗️ Architecture
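Retry with exponential backoff is the usual shape of the error-recovery behavior listed above; a small sketch (illustrative, not the repo's implementation):

```python
import time

def with_retries(fn, attempts=3, base_delay=0.01):
    """Call fn, retrying with exponential backoff on any exception.
    Re-raises the last error so the caller can degrade gracefully."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            # 0.01s, 0.02s, 0.04s, ... between attempts
            time.sleep(base_delay * (2 ** attempt))
```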
```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│    Streamlit    │     │    LangChain     │     │   MCP Server    │
│    Frontend     │◄───►│      Agent       │◄───►│     (Tools)     │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                 │
        ┌────────────────────────┼────────────────────────┐
        ▼                        ▼                        ▼
┌────────────────┐     ┌─────────────────┐         ┌─────────────┐
│   Document     │     │ Vector Database │         │   LLM API   │
│   Processors   │     │   (ChromaDB)    │         │  Endpoint   │
└────────────────┘     └─────────────────┘         └─────────────┘
```

## 🚀 Quick Start
### Prerequisites
- Python 3.11+
- OpenAI API key or compatible LLM endpoint
- 8GB+ RAM (for large file processing)

### Installation
**Clone the repository**

```bash
git clone https://github.com/yourusername/rag-large-file-processor.git
cd rag-large-file-processor
```

**Create a virtual environment and install dependencies**

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

**Configure the environment**

```bash
# Create .env file
cat > .env << EOF
OPENAI_API_KEY=your_openai_api_key_here
BASE_URL=https://api.openai.com/v1
MODEL_NAME=gpt-4o
VECTOR_DB_TYPE=chromadb
EOF
```

**Run the app**

```bash
streamlit run streamlit_app.py
```