Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/muhtasham/simulator
A high-performance simulator for LLM inference optimization, modeling compute-bound prefill and memory-bound decode phases. Explore batching strategies, analyze throughput-latency trade-offs, and optimize inference deployments without real model overhead.
- Host: GitHub
- URL: https://github.com/muhtasham/simulator
- Owner: Muhtasham
- Created: 2024-12-10T03:02:24.000Z (26 days ago)
- Default Branch: main
- Last Pushed: 2024-12-10T03:07:03.000Z (26 days ago)
- Last Synced: 2024-12-12T04:05:42.895Z (24 days ago)
- Topics: llm-inference
- Language: HTML
- Homepage:
- Size: 1.29 MB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# LLM Inference Simulator
A simulator for exploring different batching strategies and load patterns in LLM inference.
## Installation & Setup
```bash
# Install the uv package manager
pip install uv

# Clone the repository
git clone https://github.com/muhtasham/simulator.git
cd simulator

# Install dependencies
uv pip install -r requirements.txt
```

## Understanding Ticks
In this simulator:
- A **tick** is the basic unit of time
- `prefill_time=2` means the prefill phase takes 2 ticks
- `itl=1` (Inter-Token Latency) means generating each token takes 1 tick
- Metrics are often reported per 1000 ticks for easier comparison
- Example: a request rate of `460./1000.` means 460 requests per 1000 ticks

## Running Examples
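As a sanity check on the tick units defined above, the minimum (queue-free) latency of a single request follows from simple arithmetic. This is a standalone sketch, not code from the simulator, and it assumes the first token becomes available the moment prefill completes:

```python
def ideal_latency(prefill_time: int, itl: int, output_tokens: int) -> tuple[int, int]:
    """Lower-bound TTFT and E2E latency (in ticks) for one request
    with no queueing or batching delays.

    Assumption (not necessarily the simulator's exact accounting):
    the first token appears when prefill finishes, and each
    subsequent token takes `itl` ticks.
    """
    ttft = prefill_time
    e2e = prefill_time + (output_tokens - 1) * itl
    return ttft, e2e

# Parameters used throughout this README
ttft, e2e = ideal_latency(prefill_time=2, itl=1, output_tokens=10)
print(ttft, e2e)  # 2 11
```

Measured TTFT in the batch examples (~52 ticks) sits far above this 2-tick floor because 100 requests queue behind a batch size of 4.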
Each example demonstrates different aspects of the simulator:
```bash
# Basic examples with simple configurations
uv run examples/batch_duration_demo.py

# Detailed metrics visualization
uv run examples/metrics_visualization.py

# Advanced batching strategies comparison
uv run examples/batching_strategies.py

# Queue growth analysis for long runs
uv run examples/queue_growth.py
```

## Features
- Multiple batching strategies (Static, In-Flight, Chunked Context)
- Various load generation patterns (Batch, Concurrent, Request Rate)
- Rich metrics visualization
- Configurable batch sizes and request parameters
- Queue growth analysis for long-running simulations

## Batching Strategies and Performance
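The relative-improvement figures quoted in the subsections below are simple ratios of the reported per-1K-tick throughputs. A quick check, using the request throughputs reported in this README:

```python
def improvement(new: float, baseline: float) -> float:
    """Percent throughput gain of `new` over `baseline`."""
    return (new / baseline - 1.0) * 100.0

# Requests/(1K ticks)/instance, as reported in the sections below
static_req = 190.00   # Static Batching
ifb_req = 267.33      # In-Flight Batching
chunked_req = 310.00  # Chunked Context

print(f"IFB vs Static:  {improvement(ifb_req, static_req):.0f}%")    # 41%
print(f"Chunked vs IFB: {improvement(chunked_req, ifb_req):.0f}%")   # 16%
```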
### Static Batching
A baseline strategy that admits new requests only when all batch slots are empty, so a fresh batch cannot start until the previous one finishes.
```python
# Configuration
engine = sim.Engine(
    max_batch_size=4,                # Maximum of 4 requests in a batch
    load_generator=BatchLoadGenerator(
        initial_batch=100,           # Send 100 requests at start
        prefill_time=2,              # Each prefill takes 2 ticks
        itl=1,                       # Each token generation takes 1 tick
        target_output_len_tokens=10, # Generate 10 tokens per request
    ),
    batcher=StaticBatcher(),
)
```

Performance:
```bash
Average E2E Latency: 58.16
Average TTFT: 52.80
Average ITL: 1.00
Requests/(1K ticks)/instance = 190.00
Tokens/(1K ticks)/instance = 1680.00
```

### In-Flight Batching (IFB)
Allows mixing prefill and decode phases in the same batch.
```python
# Configuration
engine = sim.Engine(
    max_batch_size=4,
    load_generator=BatchLoadGenerator(
        initial_batch=100,
        prefill_time=2,
        itl=1,
        target_output_len_tokens=10,
    ),
    batcher=IFBatcher(),
)
```

Performance:
```bash
Average E2E Latency: 58.44
Average TTFT: 52.90
Average ITL: 1.39
Requests/(1K ticks)/instance = 267.33 # 41% improvement over Static
Tokens/(1K ticks)/instance = 2376.24
```

### Chunked Context
Improves utilization by splitting each prefill into chunks, so decode steps can interleave between them.
```python
# Configuration
load_generator = BatchLoadGenerator(
    initial_batch=100,
    prefill_time=2,
    itl=1,
    target_output_len_tokens=10,
    total_prefill_chunks=2,  # Split prefill into 2 chunks
)
engine = sim.Engine(
    max_batch_size=4,
    load_generator=load_generator,
    batcher=IFBatcher(),
)
```

Performance:
```bash
Average E2E Latency: 57.42
Average TTFT: 54.51
Average ITL: 1.14
Requests/(1K ticks)/instance = 310.00 # ~16% improvement over basic IFB
Tokens/(1K ticks)/instance = 2730.00
```

### One Prefill Per Batch
Admits at most one prefill request into a batch at a time, balancing compute-bound prefill against memory-bound decode.
```python
# Configuration
engine = sim.Engine(
    max_batch_size=4,
    load_generator=load_generator,
    batcher=IFBatcherWithOnePrefillOnly(),
)
```

Performance:
```bash
Average E2E Latency: 55.94
Average TTFT: 52.13
Average ITL: 1.00
Requests/(1K ticks)/instance = 360.00 # Best throughput
Tokens/(1K ticks)/instance = 3170.00
```

## Load Generation Patterns
### Concurrent Load
Maintains a target level of concurrent requests.
```python
# Configuration
load_generator = ConcurrentLoadGenerator(
    target_concurrency=6,  # Maintain 6 concurrent requests
    target_output_len_tokens=10,
    total_prefill_chunks=2,
    prefill_time=2,
    itl=1,
)
```

Performance:
```bash
Average E2E Latency: 15.14
Average TTFT: 7.87
Average ITL: 1.00
Requests/(1K ticks)/instance = 360.00
Tokens/(1K ticks)/instance = 3170.00
```

### Request Rate
Generates requests at a constant rate.
```python
# Configuration
load_generator = RequestRateLoadGenerator(
    request_rate=460./1000.,  # 460 requests per 1000 ticks
    target_output_len_tokens=10,
    total_prefill_chunks=2,
    prefill_time=2,
    itl=1,
)
```

Performance:
```bash
Average E2E Latency: 17.66
Average TTFT: 11.03
Average ITL: 1.00
Requests/(1K ticks)/instance = 350.00
Tokens/(1K ticks)/instance = 3060.00
```

## Queue Growth Analysis
Compare performance between short (100 ticks) and long (10000 ticks) runs:
```bash
Request Rate Load Generator (460 requests/1000 ticks)
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Metric           ┃ 100 ticks ┃ 10000 ticks ┃ Difference ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Final Queue Size │ 6         │ 1138        │ 1132       │
│ Average TTFT     │ 11.03     │ 1245.77     │ 1234.75    │
│ Average E2E      │ 17.66     │ 1253.78     │ 1236.12    │
└──────────────────┴───────────┴─────────────┴────────────┘

Concurrent Load Generator (6 concurrent requests)
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Metric           ┃ 100 ticks ┃ 10000 ticks ┃ Difference ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Final Queue Size │ 2         │ 2           │ 0          │
│ Average TTFT     │ 7.87      │ 8.61        │ 0.74       │
│ Average E2E      │ 15.14     │ 17.32       │ 2.19       │
└──────────────────┴───────────┴─────────────┴────────────┘
```

Key observations:
- The Request Rate generator shows significant queue growth over time: arrivals (460 requests/1K ticks) outpace the engine's ~350 requests/1K ticks of capacity, so the backlog grows by roughly 110 requests per 1K ticks
- The Concurrent Load generator maintains a stable queue size and stable latencies
- TTFT and E2E latency increase dramatically as the queue grows
- The One Prefill Per Batch strategy achieves the best throughput (3170 tokens/1K ticks)
- IFB improves throughput by 41% over Static Batching
- Chunked Context further improves throughput by ~16% over basic IFB

## Key Metrics
- **E2E Latency**: End-to-end latency for request completion (in ticks)
- **TTFT**: Time to first token (in ticks)
- **ITL**: Inter-token latency (ticks between tokens)
- **Throughput**: Requests and tokens processed per 1K ticks per instance
- **Queue Size**: Number of requests waiting to be processed
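All of these metrics can be derived from three per-request timestamps (arrival, first token, completion) plus a token count. A minimal standalone sketch of that bookkeeping (illustrative only; these class and field names are assumptions, not the simulator's actual API):

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    """Per-request timestamps, all measured in ticks (hypothetical structure)."""
    arrival_tick: int
    first_token_tick: int
    finish_tick: int
    output_tokens: int

def summarize(records: list[RequestRecord], total_ticks: int) -> dict[str, float]:
    """Compute the key metrics above from completed-request records."""
    n = len(records)
    ttft = sum(r.first_token_tick - r.arrival_tick for r in records) / n
    e2e = sum(r.finish_tick - r.arrival_tick for r in records) / n
    # Mean gap between consecutive tokens after the first one
    itl = sum(
        (r.finish_tick - r.first_token_tick) / (r.output_tokens - 1)
        for r in records if r.output_tokens > 1
    ) / n
    per_1k = 1000.0 / total_ticks
    return {
        "avg_ttft": ttft,
        "avg_e2e": e2e,
        "avg_itl": itl,
        "req_per_1k_ticks": n * per_1k,
        "tok_per_1k_ticks": sum(r.output_tokens for r in records) * per_1k,
    }

# Two toy requests completed during a 100-tick run
records = [
    RequestRecord(arrival_tick=0, first_token_tick=2, finish_tick=11, output_tokens=10),
    RequestRecord(arrival_tick=5, first_token_tick=9, finish_tick=18, output_tokens=10),
]
print(summarize(records, total_ticks=100))
```

Queue size is the one metric not derivable from completed requests alone: it counts requests that have arrived but not yet been admitted, so it must be sampled from the engine's pending queue during the run.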