{"id":49363583,"url":"https://github.com/jaybhuvaa/smartllm-router","last_synced_at":"2026-04-27T18:03:19.513Z","repository":{"id":328594655,"uuid":"1115551610","full_name":"jaybhuvaa/SmartLLM-Router","owner":"jaybhuvaa","description":"Intelligent LLM Cost Optimizer - Routes queries to optimal models based on complexity, implements semantic caching, and provides cost analytics. Built with FastAPI \u0026 Ollama.","archived":false,"fork":false,"pushed_at":"2025-12-15T04:01:37.000Z","size":54,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-12-16T09:40:17.203Z","etag":null,"topics":["ai","cache","cost-optimization","fastapi","llm","machine-learning","ollama","openai","pyth","semantic"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jaybhuvaa.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-12-13T04:31:35.000Z","updated_at":"2025-12-15T04:01:41.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/jaybhuvaa/SmartLLM-Router","commit_stats":null,"previous_names":["jaybhuvaa/smartllm-router"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/jaybhuvaa/SmartLLM-Router","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaybhuvaa%2FSmartLLM-Router","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaybhuvaa%2FSmartLLM-Router/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaybhuvaa%2FSmartLLM-Router/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaybhuvaa%2FSmartLLM-Router/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jaybhuvaa","download_url":"https://codeload.github.com/jaybhuvaa/SmartLLM-Router/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaybhuvaa%2FSmartLLM-Router/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32348058,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-27T17:12:42.749Z","status":"ssl_error","status_checked_at":"2026-04-27T17:12:41.658Z","response_time":128,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","cache","cost-optimization","fastapi","llm","machine-learning","ollama","openai","pyth","semantic"],"created_at":"2026-04-27T18:03:07.558Z","updated_at":"2026-04-27T18:03:19.497Z","avatar_url":"https://github.com/jaybhuvaa.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SmartLLM Router 🚀\n\n**Intelligent LLM Cost Optimizer** - A production-grade middleware that routes LLM requests to optimal models based on query complexity, implements semantic caching, and provides cost analytics.\n\n![Python](https://img.shields.io/badge/Python-3.11+-blue.svg)\n![FastAPI](https://img.shields.io/badge/FastAPI-0.109+-green.svg)\n![Tests](https://img.shields.io/badge/Tests-40%20Passing-success.svg)\n![License](https://img.shields.io/badge/License-MIT-yellow.svg)\n\n## 🎯 Key Features\n\n- **Smart Routing**: Automatically classifies query complexity and routes to the most cost-effective model\n- **Semantic Caching**: Uses sentence-transformer embeddings for similarity-based response caching\n- **Cost Analytics**: Real-time tracking of costs, savings, and performance metrics\n- **Multi-Model Support**: Local Ollama models (TinyLlama, Llama 3.2) with cloud API fallback support\n- **Resource Optimized**: Designed to run on systems with 8GB RAM\n\n## 📊 Actual Performance Metrics\n\n| Metric | Achieved | Description |\n|--------|----------|-------------|\n| Cost Savings | **100%** | Using free local Ollama models vs paid APIs |\n| Cache Hit Rate | **40-50%** | Semantic similarity matching (threshold: 0.85) |\n| Simple Query Latency | **2-10s** | TinyLlama responses |\n| Complex Query Latency | **30-240s** | Llama 3.2 responses (8GB RAM constraint) |\n| Routing Accuracy | **\u003e90%** | Correct complexity classification |\n\n## 🏗️ Architecture\n\n```\n┌─────────────────────────────────────────────────────────────────┐\n│                      CLIENT REQUEST                              │\n└─────────────────────────────────────────────────────────────────┘\n                                │\n                                ▼\n┌─────────────────────────────────────────────────────────────────┐\n│                    SMARTLLM ROUTER API                          │\n├─────────────────────────────────────────────────────────────────┤\n│  ┌──────────────┐  ┌──────────────┐  ┌────────────────────┐    │\n│  │   Request    │──│   Prompt     │──│ Query Complexity   │    │\n│  │   Handler    │  │ Preprocessor │  │    Classifier      │    │\n│  └──────────────┘  └──────────────┘  └────────────────────┘    │\n│                                              │                  │\n│                    ┌─────────────────────────┼─────────────┐    │\n│                    ▼                         ▼             ▼    │\n│  ┌─────────────────────┐    ┌─────────────────────────────────┐│\n│  │   SEMANTIC CACHE    │    │         MODEL ROUTER            ││\n│  │ (Sentence Embeddings)│    │  ┌─────────┬─────────────────┐ ││\n│  └─────────────────────┘    │  │TinyLlama│   Llama 3.2     │ ││\n│            │                │  │(Simple/ │   (Complex)     │ ││\n│            │                │  │ Medium) │                 │ ││\n│            └────────────────┴──┴─────────┴─────────────────┴──┘│\n│                             │                                   │\n│                             ▼                                   │\n│  ┌─────────────────────────────────────────────────────────────┐│\n│  │                   COST ANALYTICS                            ││\n│  └─────────────────────────────────────────────────────────────┘│\n└─────────────────────────────────────────────────────────────────┘\n```\n\n## 🧠 Model Selection \u0026 Reasoning\n\n### Why Local Ollama Models?\n\n| Decision | Reasoning |\n|----------|-----------|\n| **No Paid APIs** | Eliminated recurring costs; perfect for portfolio projects |\n| **TinyLlama (637MB)** | Ultra-fast responses for simple queries; fits easily in 8GB RAM |\n| **Llama 3.2 (2GB)** | Best quality-to-size ratio for complex reasoning on limited hardware |\n| **Dropped Phi-3** | Initially tested but timeout issues on 8GB RAM; too resource-intensive |\n\n### Model Configuration\n\n```\nSimple Queries  → TinyLlama  (2-10s response, lightweight)\nMedium Queries  → TinyLlama  (prioritize speed over marginal quality gain)\nComplex Queries → Llama 3.2  (30-240s response, better reasoning)\n```\n\n### Why This Configuration?\n\n1. **Hardware Constraint**: 8GB RAM limits concurrent model loading\n2. **Speed Priority**: Users prefer fast responses for simple questions\n3. **Quality When Needed**: Complex system design questions get the more capable model\n4. **Zero Cost**: All local inference = $0 API bills\n\n## 🛠️ Development Journey\n\n### Phase 1: Initial Setup\n- Created FastAPI project structure with proper separation of concerns\n- Implemented Pydantic models for type-safe request/response handling\n- Set up configuration management with environment variables\n\n### Phase 2: Core Features\n- **Complexity Classifier**: Rule-based system analyzing:\n  - Technical terms (40+ terms including ML, system design, security)\n  - Reasoning patterns (why, how, explain, compare, analyze)\n  - Code detection (Python, JavaScript, SQL patterns)\n  - Query length and structure\n  - System design keywords (scale, distributed, million, availability)\n\n- **Semantic Cache**: \n  - Sentence-transformers (`all-MiniLM-L6-v2`) for embeddings\n  - Cosine similarity matching with 0.85 threshold\n  - In-memory storage (Redis-ready architecture)\n\n### Phase 3: Multi-Model Routing\n- Integrated Ollama for local LLM inference\n- Tested multiple model combinations\n- Optimized timeouts (300s) for resource-constrained environments\n\n### Phase 4: Testing \u0026 CI/CD\n- 40 comprehensive tests (classifier, cache, integration)\n- GitHub Actions pipeline with Python 3.11/3.12 matrix\n- Docker build verification\n- Code coverage reporting\n\n## 🐛 Problems Faced \u0026 Solutions\n\n| Problem | Solution |\n|---------|----------|\n| **Phi-3 timeouts** | Switched to Llama 3.2 which handles 8GB RAM better |\n| **Classifier too strict** | Lowered complex threshold from ≥5 to ≥4; added system design detection |\n| **GitHub CI test failures** | Aligned test queries with actual classifier scoring logic |\n| **Docker build cache timeout** | Simplified CI to use standard `docker build` command |\n| **Kubernetes query misclassified** | Added explicit system design keyword detection (+3 score boost) |\n| **Sentence-transformers import** | Added lazy loading with hash-based fallback |\n\n## 🚀 Quick Start\n\n### Prerequisites\n\n- Python 3.11+\n- [Ollama](https://ollama.ai/) installed locally\n- 8GB+ RAM recommended\n\n### Installation\n\n```bash\n# Clone the repository\ngit clone https://github.com/jaybhuvaa/SmartLLM-Router.git\ncd SmartLLM-Router\n\n# Create virtual environment\npython -m venv venv\nsource venv/bin/activate  # On Windows: venv\\Scripts\\activate\n\n# Install dependencies\npip install -r requirements.txt\n\n# Pull required Ollama models\nollama pull tinyllama\nollama pull llama3.2\n```\n\n### Configuration\n\nCreate a `.env` file:\n\n```env\n# Model Configuration\nDEFAULT_SIMPLE_MODEL=ollama/tinyllama\nDEFAULT_MEDIUM_MODEL=ollama/tinyllama\nDEFAULT_COMPLEX_MODEL=ollama/llama3.2\n\n# Ollama\nOLLAMA_BASE_URL=http://localhost:11434\n\n# Caching\nCACHE_SIMILARITY_THRESHOLD=0.85\nCACHE_TTL_HOURS=24\n\n# Server\nHOST=0.0.0.0\nPORT=8000\nDEBUG=true\n```\n\n### Running Locally\n\n```bash\n# Make sure Ollama is running\nollama serve\n\n# Start the API server\nuvicorn src.main:app --reload\n\n# Server runs at http://localhost:8000\n# API docs at http://localhost:8000/docs\n```\n\n## 📖 API Usage\n\n### Chat Endpoint\n\n```bash\ncurl -X POST \"http://localhost:8000/api/v1/chat\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"message\": \"What is Python?\"}'\n```\n\nResponse:\n```json\n{\n  \"response\": \"Python is a high-level programming language...\",\n  \"model_used\": \"ollama/tinyllama\",\n  \"complexity\": \"simple\",\n  \"was_cached\": false,\n  \"input_tokens\": 4,\n  \"output_tokens\": 45,\n  \"actual_cost\": 0.0,\n  \"baseline_cost\": 0.00147,\n  \"latency_ms\": 3500,\n  \"request_id\": \"abc123\"\n}\n```\n\n### Complex Query Example\n\n```bash\ncurl -X POST \"http://localhost:8000/api/v1/chat\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"message\": \"Design a distributed cache system for a social media platform that handles 10 million requests per second with high availability\"}'\n```\n\nResponse:\n```json\n{\n  \"response\": \"I will design a distributed cache system...\",\n  \"model_used\": \"ollama/llama3.2\",\n  \"complexity\": \"complex\",\n  \"was_cached\": false,\n  \"actual_cost\": 0.0,\n  \"baseline_cost\": 0.049,\n  \"latency_ms\": 230846\n}\n```\n\n### Classification Endpoint\n\n```bash\ncurl -X POST \"http://localhost:8000/api/v1/classify\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"message\": \"Design a distributed system for...\"}'\n```\n\n### Analytics Endpoints\n\n```bash\n# Get summary\ncurl \"http://localhost:8000/api/v1/analytics/summary\"\n\n# Get cache statistics\ncurl \"http://localhost:8000/api/v1/analytics/cache-stats\"\n\n# Get savings report\ncurl \"http://localhost:8000/api/v1/analytics/savings-report\"\n```\n\n## 🧪 Testing\n\n```bash\n# Run all tests (40 tests)\npytest\n\n# Run with coverage\npytest --cov=src --cov-report=html\n\n# Run specific test file\npytest tests/test_classifier.py -v\n\n# Run integration tests\npytest tests/test_integration.py -v\n```\n\n### Test Coverage\n\n| Module | Coverage |\n|--------|----------|\n| Complexity Classifier | 85% |\n| Semantic Cache | 85% |\n| Integration Tests | 100% |\n| **Overall** | **63%** |\n\n## 📁 Project Structure\n\n```\nsmartllm-router/\n├── src/\n│   ├── main.py                 # FastAPI app entry\n│   ├── config.py               # Settings management\n│   ├── models/\n│   │   └── schemas.py          # Pydantic models\n│   ├── routers/\n│   │   ├── chat.py             # Main chat endpoint\n│   │   └── analytics.py        # Analytics endpoints\n│   ├── services/\n│   │   ├── complexity_classifier.py  # Query routing logic\n│   │   ├── semantic_cache.py         # Embedding-based caching\n│   │   ├── llm_providers.py          # Ollama integration\n│   │   └── cost_tracker.py           # Cost analytics\n│   └── utils/\n│       └── token_counter.py\n├── tests/\n│   ├── test_classifier.py      # 24 tests\n│   ├── test_cache.py           # 13 tests\n│   └── test_integration.py     # 13 tests\n├── benchmarks/\n│   └── benchmark.py            # Performance testing\n├── .github/\n│   └── workflows/\n│       └── ci.yml              # GitHub Actions CI\n├── docker-compose.yml\n├── Dockerfile\n├── requirements.txt\n└── README.md\n```\n\n## 🎯 How Routing Works\n\nThe complexity classifier analyzes queries using a scoring system:\n\n### Scoring Features\n\n| Feature | Points | Condition |\n|---------|--------|-----------|\n| Query Length | +1 | \u003e30 words |\n| Query Length | +2 | \u003e100 words |\n| Code Presence | +2 | Contains code blocks/patterns |\n| Reasoning Words | +1 | 1+ matches (why, how, explain) |\n| Reasoning Words | +2 | 3+ matches |\n| Technical Terms | +1 | 2+ terms |\n| Technical Terms | +2 | 5+ terms |\n| **System Design** | +2 | 1+ keywords (design, scale, distributed) |\n| **System Design** | +3 | 3+ keywords |\n| Multi-step Task | +1 | Contains step patterns |\n| Multiple Sentences | +1 | 4+ sentences |\n\n### Classification Thresholds\n\n```\nScore 0-1  → SIMPLE  → TinyLlama (fast, basic)\nScore 2-3  → MEDIUM  → TinyLlama (speed priority)\nScore 4+   → COMPLEX → Llama 3.2 (quality priority)\n```\n\n## 💰 Cost Model\n\n| Model | Input (per 1K) | Output (per 1K) | Our Usage |\n|-------|----------------|-----------------|-----------|\n| GPT-4 | $0.03 | $0.06 | Baseline comparison |\n| GPT-3.5-turbo | $0.0005 | $0.0015 | Baseline comparison |\n| **Ollama/TinyLlama** | **$0.00** | **$0.00** | ✅ Simple/Medium |\n| **Ollama/Llama3.2** | **$0.00** | **$0.00** | ✅ Complex |\n\n**Result: 100% cost savings using local models!**\n\n\n\n## 🔮 Future Enhancements\n\n- [ ] Redis-backed persistent caching\n- [ ] PostgreSQL request logging\n- [ ] ML-based classifier (replace rule-based)\n- [ ] A/B testing for model comparison\n- [ ] Prometheus metrics endpoint\n- [ ] Cloud deployment (Railway/Fly.io)\n- [ ] Rate limiting per user\n- [ ] Response streaming\n\n## 🛠️ Configuration Options\n\n| Variable | Default | Description |\n|----------|---------|-------------|\n| `DEFAULT_SIMPLE_MODEL` | `ollama/tinyllama` | Model for simple queries |\n| `DEFAULT_MEDIUM_MODEL` | `ollama/tinyllama` | Model for medium queries |\n| `DEFAULT_COMPLEX_MODEL` | `ollama/llama3.2` | Model for complex queries |\n| `CACHE_SIMILARITY_THRESHOLD` | `0.85` | Minimum similarity for cache hit |\n| `CACHE_TTL_HOURS` | `24` | Cache entry expiration |\n| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama API endpoint |\n\n## 📝 License\n\nMIT License - feel free to use this project for learning and portfolio purposes!\n\n## 🙏 Acknowledgments\n\n- [FastAPI](https://fastapi.tiangolo.com/) - Modern Python web framework\n- [Ollama](https://ollama.ai/) - Local LLM inference\n- [sentence-transformers](https://www.sbert.net/) - Embedding generation\n- The open-source LLM community\n\n---\n\nBuilt with ❤️ by [Jaykumar Bhuva](https://github.com/jaybhuvaa)\n\n**GitHub**: https://github.com/jaybhuvaa/SmartLLM-Router\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjaybhuvaa%2Fsmartllm-router","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjaybhuvaa%2Fsmartllm-router","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjaybhuvaa%2Fsmartllm-router/lists"}