https://github.com/jaybhuvaa/smartllm-router
Intelligent LLM Cost Optimizer - Routes queries to optimal models based on complexity, implements semantic caching, and provides cost analytics. Built with FastAPI & Ollama.
https://github.com/jaybhuvaa/smartllm-router
ai cache cost-optimization fastapi llm machine-learning ollama openai pyth semantic
Last synced: 2 months ago
JSON representation
Intelligent LLM Cost Optimizer - Routes queries to optimal models based on complexity, implements semantic caching, and provides cost analytics. Built with FastAPI & Ollama.
- Host: GitHub
- URL: https://github.com/jaybhuvaa/smartllm-router
- Owner: jaybhuvaa
- Created: 2025-12-13T04:31:35.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-12-15T04:01:37.000Z (6 months ago)
- Last Synced: 2025-12-16T09:40:17.203Z (6 months ago)
- Topics: ai, cache, cost-optimization, fastapi, llm, machine-learning, ollama, openai, pyth, semantic
- Language: Python
- Homepage:
- Size: 52.7 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# SmartLLM Router ๐
**Intelligent LLM Cost Optimizer** - A production-grade middleware that routes LLM requests to optimal models based on query complexity, implements semantic caching, and provides cost analytics.




## ๐ฏ Key Features
- **Smart Routing**: Automatically classifies query complexity and routes to the most cost-effective model
- **Semantic Caching**: Uses sentence-transformer embeddings for similarity-based response caching
- **Cost Analytics**: Real-time tracking of costs, savings, and performance metrics
- **Multi-Model Support**: Local Ollama models (TinyLlama, Llama 3.2) with cloud API fallback support
- **Resource Optimized**: Designed to run on systems with 8GB RAM
## ๐ Actual Performance Metrics
| Metric | Achieved | Description |
|--------|----------|-------------|
| Cost Savings | **100%** | Using free local Ollama models vs paid APIs |
| Cache Hit Rate | **40-50%** | Semantic similarity matching (threshold: 0.85) |
| Simple Query Latency | **2-10s** | TinyLlama responses |
| Complex Query Latency | **30-240s** | Llama 3.2 responses (8GB RAM constraint) |
| Routing Accuracy | **>90%** | Correct complexity classification |
## ๐๏ธ Architecture
```
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ CLIENT REQUEST โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
โผ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ SMARTLLM ROUTER API โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโ โ
โ โ Request โโโโ Prompt โโโโ Query Complexity โ โ
โ โ Handler โ โ Preprocessor โ โ Classifier โ โ
โ โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโ โ
โ โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโ โ
โ โผ โผ โผ โ
โ โโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ โ SEMANTIC CACHE โ โ MODEL ROUTER โโ
โ โ (Sentence Embeddings)โ โ โโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโ โโ
โ โโโโโโโโโโโโโโโโโโโโโโโ โ โTinyLlamaโ Llama 3.2 โ โโ
โ โ โ โ(Simple/ โ (Complex) โ โโ
โ โ โ โ Medium) โ โ โโ
โ โโโโโโโโโโโโโโโโโโดโโโดโโโโโโโโโโดโโโโโโโโโโโโโโโโโโดโโโโ
โ โ โ
โ โผ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ โ COST ANALYTICS โโ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
```
## ๐ง Model Selection & Reasoning
### Why Local Ollama Models?
| Decision | Reasoning |
|----------|-----------|
| **No Paid APIs** | Eliminated recurring costs; perfect for portfolio projects |
| **TinyLlama (637MB)** | Ultra-fast responses for simple queries; fits easily in 8GB RAM |
| **Llama 3.2 (2GB)** | Best quality-to-size ratio for complex reasoning on limited hardware |
| **Dropped Phi-3** | Initially tested but timeout issues on 8GB RAM; too resource-intensive |
### Model Configuration
```
Simple Queries โ TinyLlama (2-10s response, lightweight)
Medium Queries โ TinyLlama (prioritize speed over marginal quality gain)
Complex Queries โ Llama 3.2 (30-240s response, better reasoning)
```
### Why This Configuration?
1. **Hardware Constraint**: 8GB RAM limits concurrent model loading
2. **Speed Priority**: Users prefer fast responses for simple questions
3. **Quality When Needed**: Complex system design questions get the more capable model
4. **Zero Cost**: All local inference = $0 API bills
## ๐ ๏ธ Development Journey
### Phase 1: Initial Setup
- Created FastAPI project structure with proper separation of concerns
- Implemented Pydantic models for type-safe request/response handling
- Set up configuration management with environment variables
### Phase 2: Core Features
- **Complexity Classifier**: Rule-based system analyzing:
- Technical terms (40+ terms including ML, system design, security)
- Reasoning patterns (why, how, explain, compare, analyze)
- Code detection (Python, JavaScript, SQL patterns)
- Query length and structure
- System design keywords (scale, distributed, million, availability)
- **Semantic Cache**:
- Sentence-transformers (`all-MiniLM-L6-v2`) for embeddings
- Cosine similarity matching with 0.85 threshold
- In-memory storage (Redis-ready architecture)
### Phase 3: Multi-Model Routing
- Integrated Ollama for local LLM inference
- Tested multiple model combinations
- Optimized timeouts (300s) for resource-constrained environments
### Phase 4: Testing & CI/CD
- 40 comprehensive tests (classifier, cache, integration)
- GitHub Actions pipeline with Python 3.11/3.12 matrix
- Docker build verification
- Code coverage reporting
## ๐ Problems Faced & Solutions
| Problem | Solution |
|---------|----------|
| **Phi-3 timeouts** | Switched to Llama 3.2 which handles 8GB RAM better |
| **Classifier too strict** | Lowered complex threshold from โฅ5 to โฅ4; added system design detection |
| **GitHub CI test failures** | Aligned test queries with actual classifier scoring logic |
| **Docker build cache timeout** | Simplified CI to use standard `docker build` command |
| **Kubernetes query misclassified** | Added explicit system design keyword detection (+3 score boost) |
| **Sentence-transformers import** | Added lazy loading with hash-based fallback |
## ๐ Quick Start
### Prerequisites
- Python 3.11+
- [Ollama](https://ollama.ai/) installed locally
- 8GB+ RAM recommended
### Installation
```bash
# Clone the repository
git clone https://github.com/jaybhuvaa/SmartLLM-Router.git
cd SmartLLM-Router
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Pull required Ollama models
ollama pull tinyllama
ollama pull llama3.2
```
### Configuration
Create a `.env` file:
```env
# Model Configuration
DEFAULT_SIMPLE_MODEL=ollama/tinyllama
DEFAULT_MEDIUM_MODEL=ollama/tinyllama
DEFAULT_COMPLEX_MODEL=ollama/llama3.2
# Ollama
OLLAMA_BASE_URL=http://localhost:11434
# Caching
CACHE_SIMILARITY_THRESHOLD=0.85
CACHE_TTL_HOURS=24
# Server
HOST=0.0.0.0
PORT=8000
DEBUG=true
```
### Running Locally
```bash
# Make sure Ollama is running
ollama serve
# Start the API server
uvicorn src.main:app --reload
# Server runs at http://localhost:8000
# API docs at http://localhost:8000/docs
```
## ๐ API Usage
### Chat Endpoint
```bash
curl -X POST "http://localhost:8000/api/v1/chat" \
-H "Content-Type: application/json" \
-d '{"message": "What is Python?"}'
```
Response:
```json
{
"response": "Python is a high-level programming language...",
"model_used": "ollama/tinyllama",
"complexity": "simple",
"was_cached": false,
"input_tokens": 4,
"output_tokens": 45,
"actual_cost": 0.0,
"baseline_cost": 0.00147,
"latency_ms": 3500,
"request_id": "abc123"
}
```
### Complex Query Example
```bash
curl -X POST "http://localhost:8000/api/v1/chat" \
-H "Content-Type: application/json" \
-d '{"message": "Design a distributed cache system for a social media platform that handles 10 million requests per second with high availability"}'
```
Response:
```json
{
"response": "I will design a distributed cache system...",
"model_used": "ollama/llama3.2",
"complexity": "complex",
"was_cached": false,
"actual_cost": 0.0,
"baseline_cost": 0.049,
"latency_ms": 230846
}
```
### Classification Endpoint
```bash
curl -X POST "http://localhost:8000/api/v1/classify" \
-H "Content-Type: application/json" \
-d '{"message": "Design a distributed system for..."}'
```
### Analytics Endpoints
```bash
# Get summary
curl "http://localhost:8000/api/v1/analytics/summary"
# Get cache statistics
curl "http://localhost:8000/api/v1/analytics/cache-stats"
# Get savings report
curl "http://localhost:8000/api/v1/analytics/savings-report"
```
## ๐งช Testing
```bash
# Run all tests (40 tests)
pytest
# Run with coverage
pytest --cov=src --cov-report=html
# Run specific test file
pytest tests/test_classifier.py -v
# Run integration tests
pytest tests/test_integration.py -v
```
### Test Coverage
| Module | Coverage |
|--------|----------|
| Complexity Classifier | 85% |
| Semantic Cache | 85% |
| Integration Tests | 100% |
| **Overall** | **63%** |
## ๐ Project Structure
```
smartllm-router/
โโโ src/
โ โโโ main.py # FastAPI app entry
โ โโโ config.py # Settings management
โ โโโ models/
โ โ โโโ schemas.py # Pydantic models
โ โโโ routers/
โ โ โโโ chat.py # Main chat endpoint
โ โ โโโ analytics.py # Analytics endpoints
โ โโโ services/
โ โ โโโ complexity_classifier.py # Query routing logic
โ โ โโโ semantic_cache.py # Embedding-based caching
โ โ โโโ llm_providers.py # Ollama integration
โ โ โโโ cost_tracker.py # Cost analytics
โ โโโ utils/
โ โโโ token_counter.py
โโโ tests/
โ โโโ test_classifier.py # 24 tests
โ โโโ test_cache.py # 13 tests
โ โโโ test_integration.py # 13 tests
โโโ benchmarks/
โ โโโ benchmark.py # Performance testing
โโโ .github/
โ โโโ workflows/
โ โโโ ci.yml # GitHub Actions CI
โโโ docker-compose.yml
โโโ Dockerfile
โโโ requirements.txt
โโโ README.md
```
## ๐ฏ How Routing Works
The complexity classifier analyzes queries using a scoring system:
### Scoring Features
| Feature | Points | Condition |
|---------|--------|-----------|
| Query Length | +1 | >30 words |
| Query Length | +2 | >100 words |
| Code Presence | +2 | Contains code blocks/patterns |
| Reasoning Words | +1 | 1+ matches (why, how, explain) |
| Reasoning Words | +2 | 3+ matches |
| Technical Terms | +1 | 2+ terms |
| Technical Terms | +2 | 5+ terms |
| **System Design** | +2 | 1+ keywords (design, scale, distributed) |
| **System Design** | +3 | 3+ keywords |
| Multi-step Task | +1 | Contains step patterns |
| Multiple Sentences | +1 | 4+ sentences |
### Classification Thresholds
```
Score 0-1 โ SIMPLE โ TinyLlama (fast, basic)
Score 2-3 โ MEDIUM โ TinyLlama (speed priority)
Score 4+ โ COMPLEX โ Llama 3.2 (quality priority)
```
## ๐ฐ Cost Model
| Model | Input (per 1K) | Output (per 1K) | Our Usage |
|-------|----------------|-----------------|-----------|
| GPT-4 | $0.03 | $0.06 | Baseline comparison |
| GPT-3.5-turbo | $0.0005 | $0.0015 | Baseline comparison |
| **Ollama/TinyLlama** | **$0.00** | **$0.00** | โ
Simple/Medium |
| **Ollama/Llama3.2** | **$0.00** | **$0.00** | โ
Complex |
**Result: 100% cost savings using local models!**
## ๐ฎ Future Enhancements
- [ ] Redis-backed persistent caching
- [ ] PostgreSQL request logging
- [ ] ML-based classifier (replace rule-based)
- [ ] A/B testing for model comparison
- [ ] Prometheus metrics endpoint
- [ ] Cloud deployment (Railway/Fly.io)
- [ ] Rate limiting per user
- [ ] Response streaming
## ๐ ๏ธ Configuration Options
| Variable | Default | Description |
|----------|---------|-------------|
| `DEFAULT_SIMPLE_MODEL` | `ollama/tinyllama` | Model for simple queries |
| `DEFAULT_MEDIUM_MODEL` | `ollama/tinyllama` | Model for medium queries |
| `DEFAULT_COMPLEX_MODEL` | `ollama/llama3.2` | Model for complex queries |
| `CACHE_SIMILARITY_THRESHOLD` | `0.85` | Minimum similarity for cache hit |
| `CACHE_TTL_HOURS` | `24` | Cache entry expiration |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama API endpoint |
## ๐ License
MIT License - feel free to use this project for learning and portfolio purposes!
## ๐ Acknowledgments
- [FastAPI](https://fastapi.tiangolo.com/) - Modern Python web framework
- [Ollama](https://ollama.ai/) - Local LLM inference
- [sentence-transformers](https://www.sbert.net/) - Embedding generation
- The open-source LLM community
---
Built with โค๏ธ by [Jaykumar Bhuva](https://github.com/jaybhuvaa)
**GitHub**: https://github.com/jaybhuvaa/SmartLLM-Router