An open API service indexing awesome lists of open source software.

https://github.com/jaybhuvaa/smartllm-router

Intelligent LLM Cost Optimizer - Routes queries to optimal models based on complexity, implements semantic caching, and provides cost analytics. Built with FastAPI & Ollama.
https://github.com/jaybhuvaa/smartllm-router

ai cache cost-optimization fastapi llm machine-learning ollama openai pyth semantic

Last synced: 2 months ago
JSON representation

Intelligent LLM Cost Optimizer - Routes queries to optimal models based on complexity, implements semantic caching, and provides cost analytics. Built with FastAPI & Ollama.

Awesome Lists containing this project

README

          

# SmartLLM Router ๐Ÿš€

**Intelligent LLM Cost Optimizer** - A production-grade middleware that routes LLM requests to optimal models based on query complexity, implements semantic caching, and provides cost analytics.

![Python](https://img.shields.io/badge/Python-3.11+-blue.svg)
![FastAPI](https://img.shields.io/badge/FastAPI-0.109+-green.svg)
![Tests](https://img.shields.io/badge/Tests-40%20Passing-success.svg)
![License](https://img.shields.io/badge/License-MIT-yellow.svg)

## ๐ŸŽฏ Key Features

- **Smart Routing**: Automatically classifies query complexity and routes to the most cost-effective model
- **Semantic Caching**: Uses sentence-transformer embeddings for similarity-based response caching
- **Cost Analytics**: Real-time tracking of costs, savings, and performance metrics
- **Multi-Model Support**: Local Ollama models (TinyLlama, Llama 3.2) with cloud API fallback support
- **Resource Optimized**: Designed to run on systems with 8GB RAM

## ๐Ÿ“Š Actual Performance Metrics

| Metric | Achieved | Description |
|--------|----------|-------------|
| Cost Savings | **100%** | Using free local Ollama models vs paid APIs |
| Cache Hit Rate | **40-50%** | Semantic similarity matching (threshold: 0.85) |
| Simple Query Latency | **2-10s** | TinyLlama responses |
| Complex Query Latency | **30-240s** | Llama 3.2 responses (8GB RAM constraint) |
| Routing Accuracy | **>90%** | Correct complexity classification |

## ๐Ÿ—๏ธ Architecture

```
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ CLIENT REQUEST โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”‚
โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ SMARTLLM ROUTER API โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚
โ”‚ โ”‚ Request โ”‚โ”€โ”€โ”‚ Prompt โ”‚โ”€โ”€โ”‚ Query Complexity โ”‚ โ”‚
โ”‚ โ”‚ Handler โ”‚ โ”‚ Preprocessor โ”‚ โ”‚ Classifier โ”‚ โ”‚
โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚
โ”‚ โ”‚ โ”‚
โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚
โ”‚ โ–ผ โ–ผ โ–ผ โ”‚
โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”โ”‚
โ”‚ โ”‚ SEMANTIC CACHE โ”‚ โ”‚ MODEL ROUTER โ”‚โ”‚
โ”‚ โ”‚ (Sentence Embeddings)โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚โ”‚
โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚TinyLlamaโ”‚ Llama 3.2 โ”‚ โ”‚โ”‚
โ”‚ โ”‚ โ”‚ โ”‚(Simple/ โ”‚ (Complex) โ”‚ โ”‚โ”‚
โ”‚ โ”‚ โ”‚ โ”‚ Medium) โ”‚ โ”‚ โ”‚โ”‚
โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”˜โ”‚
โ”‚ โ”‚ โ”‚
โ”‚ โ–ผ โ”‚
โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”โ”‚
โ”‚ โ”‚ COST ANALYTICS โ”‚โ”‚
โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
```

## ๐Ÿง  Model Selection & Reasoning

### Why Local Ollama Models?

| Decision | Reasoning |
|----------|-----------|
| **No Paid APIs** | Eliminated recurring costs; perfect for portfolio projects |
| **TinyLlama (637MB)** | Ultra-fast responses for simple queries; fits easily in 8GB RAM |
| **Llama 3.2 (2GB)** | Best quality-to-size ratio for complex reasoning on limited hardware |
| **Dropped Phi-3** | Initially tested but timeout issues on 8GB RAM; too resource-intensive |

### Model Configuration

```
Simple Queries โ†’ TinyLlama (2-10s response, lightweight)
Medium Queries โ†’ TinyLlama (prioritize speed over marginal quality gain)
Complex Queries โ†’ Llama 3.2 (30-240s response, better reasoning)
```

### Why This Configuration?

1. **Hardware Constraint**: 8GB RAM limits concurrent model loading
2. **Speed Priority**: Users prefer fast responses for simple questions
3. **Quality When Needed**: Complex system design questions get the more capable model
4. **Zero Cost**: All local inference = $0 API bills

## ๐Ÿ› ๏ธ Development Journey

### Phase 1: Initial Setup
- Created FastAPI project structure with proper separation of concerns
- Implemented Pydantic models for type-safe request/response handling
- Set up configuration management with environment variables

### Phase 2: Core Features
- **Complexity Classifier**: Rule-based system analyzing:
- Technical terms (40+ terms including ML, system design, security)
- Reasoning patterns (why, how, explain, compare, analyze)
- Code detection (Python, JavaScript, SQL patterns)
- Query length and structure
- System design keywords (scale, distributed, million, availability)

- **Semantic Cache**:
- Sentence-transformers (`all-MiniLM-L6-v2`) for embeddings
- Cosine similarity matching with 0.85 threshold
- In-memory storage (Redis-ready architecture)

### Phase 3: Multi-Model Routing
- Integrated Ollama for local LLM inference
- Tested multiple model combinations
- Optimized timeouts (300s) for resource-constrained environments

### Phase 4: Testing & CI/CD
- 40 comprehensive tests (classifier, cache, integration)
- GitHub Actions pipeline with Python 3.11/3.12 matrix
- Docker build verification
- Code coverage reporting

## ๐Ÿ› Problems Faced & Solutions

| Problem | Solution |
|---------|----------|
| **Phi-3 timeouts** | Switched to Llama 3.2 which handles 8GB RAM better |
| **Classifier too strict** | Lowered complex threshold from โ‰ฅ5 to โ‰ฅ4; added system design detection |
| **GitHub CI test failures** | Aligned test queries with actual classifier scoring logic |
| **Docker build cache timeout** | Simplified CI to use standard `docker build` command |
| **Kubernetes query misclassified** | Added explicit system design keyword detection (+3 score boost) |
| **Sentence-transformers import** | Added lazy loading with hash-based fallback |

## ๐Ÿš€ Quick Start

### Prerequisites

- Python 3.11+
- [Ollama](https://ollama.ai/) installed locally
- 8GB+ RAM recommended

### Installation

```bash
# Clone the repository
git clone https://github.com/jaybhuvaa/SmartLLM-Router.git
cd SmartLLM-Router

# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Pull required Ollama models
ollama pull tinyllama
ollama pull llama3.2
```

### Configuration

Create a `.env` file:

```env
# Model Configuration
DEFAULT_SIMPLE_MODEL=ollama/tinyllama
DEFAULT_MEDIUM_MODEL=ollama/tinyllama
DEFAULT_COMPLEX_MODEL=ollama/llama3.2

# Ollama
OLLAMA_BASE_URL=http://localhost:11434

# Caching
CACHE_SIMILARITY_THRESHOLD=0.85
CACHE_TTL_HOURS=24

# Server
HOST=0.0.0.0
PORT=8000
DEBUG=true
```

### Running Locally

```bash
# Make sure Ollama is running
ollama serve

# Start the API server
uvicorn src.main:app --reload

# Server runs at http://localhost:8000
# API docs at http://localhost:8000/docs
```

## ๐Ÿ“– API Usage

### Chat Endpoint

```bash
curl -X POST "http://localhost:8000/api/v1/chat" \
-H "Content-Type: application/json" \
-d '{"message": "What is Python?"}'
```

Response:
```json
{
"response": "Python is a high-level programming language...",
"model_used": "ollama/tinyllama",
"complexity": "simple",
"was_cached": false,
"input_tokens": 4,
"output_tokens": 45,
"actual_cost": 0.0,
"baseline_cost": 0.00147,
"latency_ms": 3500,
"request_id": "abc123"
}
```

### Complex Query Example

```bash
curl -X POST "http://localhost:8000/api/v1/chat" \
-H "Content-Type: application/json" \
-d '{"message": "Design a distributed cache system for a social media platform that handles 10 million requests per second with high availability"}'
```

Response:
```json
{
"response": "I will design a distributed cache system...",
"model_used": "ollama/llama3.2",
"complexity": "complex",
"was_cached": false,
"actual_cost": 0.0,
"baseline_cost": 0.049,
"latency_ms": 230846
}
```

### Classification Endpoint

```bash
curl -X POST "http://localhost:8000/api/v1/classify" \
-H "Content-Type: application/json" \
-d '{"message": "Design a distributed system for..."}'
```

### Analytics Endpoints

```bash
# Get summary
curl "http://localhost:8000/api/v1/analytics/summary"

# Get cache statistics
curl "http://localhost:8000/api/v1/analytics/cache-stats"

# Get savings report
curl "http://localhost:8000/api/v1/analytics/savings-report"
```

## ๐Ÿงช Testing

```bash
# Run all tests (40 tests)
pytest

# Run with coverage
pytest --cov=src --cov-report=html

# Run specific test file
pytest tests/test_classifier.py -v

# Run integration tests
pytest tests/test_integration.py -v
```

### Test Coverage

| Module | Coverage |
|--------|----------|
| Complexity Classifier | 85% |
| Semantic Cache | 85% |
| Integration Tests | 100% |
| **Overall** | **63%** |

## ๐Ÿ“ Project Structure

```
smartllm-router/
โ”œโ”€โ”€ src/
โ”‚ โ”œโ”€โ”€ main.py # FastAPI app entry
โ”‚ โ”œโ”€โ”€ config.py # Settings management
โ”‚ โ”œโ”€โ”€ models/
โ”‚ โ”‚ โ””โ”€โ”€ schemas.py # Pydantic models
โ”‚ โ”œโ”€โ”€ routers/
โ”‚ โ”‚ โ”œโ”€โ”€ chat.py # Main chat endpoint
โ”‚ โ”‚ โ””โ”€โ”€ analytics.py # Analytics endpoints
โ”‚ โ”œโ”€โ”€ services/
โ”‚ โ”‚ โ”œโ”€โ”€ complexity_classifier.py # Query routing logic
โ”‚ โ”‚ โ”œโ”€โ”€ semantic_cache.py # Embedding-based caching
โ”‚ โ”‚ โ”œโ”€โ”€ llm_providers.py # Ollama integration
โ”‚ โ”‚ โ””โ”€โ”€ cost_tracker.py # Cost analytics
โ”‚ โ””โ”€โ”€ utils/
โ”‚ โ””โ”€โ”€ token_counter.py
โ”œโ”€โ”€ tests/
โ”‚ โ”œโ”€โ”€ test_classifier.py # 24 tests
โ”‚ โ”œโ”€โ”€ test_cache.py # 13 tests
โ”‚ โ””โ”€โ”€ test_integration.py # 13 tests
โ”œโ”€โ”€ benchmarks/
โ”‚ โ””โ”€โ”€ benchmark.py # Performance testing
โ”œโ”€โ”€ .github/
โ”‚ โ””โ”€โ”€ workflows/
โ”‚ โ””โ”€โ”€ ci.yml # GitHub Actions CI
โ”œโ”€โ”€ docker-compose.yml
โ”œโ”€โ”€ Dockerfile
โ”œโ”€โ”€ requirements.txt
โ””โ”€โ”€ README.md
```

## ๐ŸŽฏ How Routing Works

The complexity classifier analyzes queries using a scoring system:

### Scoring Features

| Feature | Points | Condition |
|---------|--------|-----------|
| Query Length | +1 | >30 words |
| Query Length | +2 | >100 words |
| Code Presence | +2 | Contains code blocks/patterns |
| Reasoning Words | +1 | 1+ matches (why, how, explain) |
| Reasoning Words | +2 | 3+ matches |
| Technical Terms | +1 | 2+ terms |
| Technical Terms | +2 | 5+ terms |
| **System Design** | +2 | 1+ keywords (design, scale, distributed) |
| **System Design** | +3 | 3+ keywords |
| Multi-step Task | +1 | Contains step patterns |
| Multiple Sentences | +1 | 4+ sentences |

### Classification Thresholds

```
Score 0-1 โ†’ SIMPLE โ†’ TinyLlama (fast, basic)
Score 2-3 โ†’ MEDIUM โ†’ TinyLlama (speed priority)
Score 4+ โ†’ COMPLEX โ†’ Llama 3.2 (quality priority)
```

## ๐Ÿ’ฐ Cost Model

| Model | Input (per 1K) | Output (per 1K) | Our Usage |
|-------|----------------|-----------------|-----------|
| GPT-4 | $0.03 | $0.06 | Baseline comparison |
| GPT-3.5-turbo | $0.0005 | $0.0015 | Baseline comparison |
| **Ollama/TinyLlama** | **$0.00** | **$0.00** | โœ… Simple/Medium |
| **Ollama/Llama3.2** | **$0.00** | **$0.00** | โœ… Complex |

**Result: 100% cost savings using local models!**

## ๐Ÿ”ฎ Future Enhancements

- [ ] Redis-backed persistent caching
- [ ] PostgreSQL request logging
- [ ] ML-based classifier (replace rule-based)
- [ ] A/B testing for model comparison
- [ ] Prometheus metrics endpoint
- [ ] Cloud deployment (Railway/Fly.io)
- [ ] Rate limiting per user
- [ ] Response streaming

## ๐Ÿ› ๏ธ Configuration Options

| Variable | Default | Description |
|----------|---------|-------------|
| `DEFAULT_SIMPLE_MODEL` | `ollama/tinyllama` | Model for simple queries |
| `DEFAULT_MEDIUM_MODEL` | `ollama/tinyllama` | Model for medium queries |
| `DEFAULT_COMPLEX_MODEL` | `ollama/llama3.2` | Model for complex queries |
| `CACHE_SIMILARITY_THRESHOLD` | `0.85` | Minimum similarity for cache hit |
| `CACHE_TTL_HOURS` | `24` | Cache entry expiration |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama API endpoint |

## ๐Ÿ“ License

MIT License - feel free to use this project for learning and portfolio purposes!

## ๐Ÿ™ Acknowledgments

- [FastAPI](https://fastapi.tiangolo.com/) - Modern Python web framework
- [Ollama](https://ollama.ai/) - Local LLM inference
- [sentence-transformers](https://www.sbert.net/) - Embedding generation
- The open-source LLM community

---

Built with โค๏ธ by [Jaykumar Bhuva](https://github.com/jaybhuvaa)

**GitHub**: https://github.com/jaybhuvaa/SmartLLM-Router