https://github.com/muhtasham/simulator

🚀 A high-performance simulator for LLM inference optimization, modeling compute-bound prefill and memory-bound decode phases. Explore batching strategies, analyze throughput-latency trade-offs, and optimize inference deployments without real model overhead.

# LLM Inference Simulator

A simulator for exploring different batching strategies and load patterns in LLM inference.

## Installation & Setup

```bash
# Install uv package manager
pip install uv

# Clone the repository
git clone https://github.com/muhtasham/simulator.git
cd simulator

# Install dependencies
uv pip install -r requirements.txt
```

## Understanding Ticks

In this simulator:

- A **tick** is the basic unit of time
- `prefill_time=2` means the prefill phase takes 2 ticks
- `itl=1` (Inter-Token Latency) means generating each token takes 1 tick
- Metrics are often reported per 1000 ticks for easier comparison
- Example: a `460./1000.` request rate means 460 requests per 1000 ticks (see the quick check below)
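
These units make quick sanity checks easy. Here is a minimal sketch (plain arithmetic only, not the simulator's API; the names mirror the configuration parameters used below):

```python
# Illustrative tick arithmetic; not simulator code.
prefill_time = 2               # prefill phase: 2 ticks
itl = 1                        # 1 tick per generated token
target_output_len_tokens = 10  # tokens generated per request

# A single unbatched request occupies the engine for roughly:
e2e_ticks = prefill_time + itl * target_output_len_tokens
print(e2e_ticks)  # 12 ticks

# A request rate of 460./1000. means, over a 1000-tick window:
print(460.0 / 1000.0 * 1000)  # 460.0 requests
```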

## Running Examples

Each example demonstrates different aspects of the simulator:

```bash
# Basic examples with simple configurations
uv run examples/batch_duration_demo.py

# Detailed metrics visualization
uv run examples/metrics_visualization.py

# Advanced batching strategies comparison
uv run examples/batching_strategies.py

# Queue growth analysis for long runs
uv run examples/queue_growth.py
```

## Features

- Multiple batching strategies (Static, In-Flight, Chunked Context)
- Various load generation patterns (Batch, Concurrent, Request Rate)
- Rich metrics visualization
- Configurable batch sizes and request parameters
- Queue growth analysis for long-running simulations

## Batching Strategies and Performance

### Static Batching

The baseline strategy: a new batch is admitted only when all slots are empty, i.e. after every request in the current batch has finished.

```python
# Configuration
engine = sim.Engine(
    max_batch_size=4,                # Maximum 4 requests in a batch
    load_generator=BatchLoadGenerator(
        initial_batch=100,           # Send 100 requests at start
        prefill_time=2,              # Each prefill takes 2 ticks
        itl=1,                       # Each token generation takes 1 tick
        target_output_len_tokens=10  # Generate 10 tokens per request
    ),
    batcher=StaticBatcher()
)
```

Performance:

```bash
Average E2E Latency: 58.16
Average TTFT: 52.80
Average ITL: 1.00
Requests/(1K ticks)/instance = 190.00
Tokens/(1K ticks)/instance = 1680.00
```

### In-Flight Batching (IFB)

Allows mixing prefill and decode phases in the same batch, so slots freed by finished requests are refilled immediately. Note the higher average ITL: decode steps can be delayed when a prefill shares the batch.

```python
# Configuration
engine = sim.Engine(
    max_batch_size=4,
    load_generator=BatchLoadGenerator(
        initial_batch=100,
        prefill_time=2,
        itl=1,
        target_output_len_tokens=10
    ),
    batcher=IFBatcher()
)
```

Performance:

```bash
Average E2E Latency: 58.44
Average TTFT: 52.90
Average ITL: 1.39
Requests/(1K ticks)/instance = 267.33 # 41% improvement over Static
Tokens/(1K ticks)/instance = 2376.24
```

### Chunked Context

Improves performance by splitting each prefill into chunks that can be interleaved with ongoing decode steps.

```python
# Configuration
load_generator = BatchLoadGenerator(
    initial_batch=100,
    prefill_time=2,
    itl=1,
    target_output_len_tokens=10,
    total_prefill_chunks=2  # Split prefill into 2 chunks
)
engine = sim.Engine(
    max_batch_size=4,
    load_generator=load_generator,
    batcher=IFBatcher()
)
```

Performance:

```bash
Average E2E Latency: 57.42
Average TTFT: 54.51
Average ITL: 1.14
Requests/(1K ticks)/instance = 310.00 # ~16% improvement over basic IFB
Tokens/(1K ticks)/instance = 2730.00
```

### One Prefill Per Batch

Limits each batch to at most one prefill request at a time, balancing compute-bound prefill work against memory-bound decode work.

```python
# Configuration
engine = sim.Engine(
    max_batch_size=4,
    load_generator=load_generator,
    batcher=IFBatcherWithOnePrefillOnly()
)
```

Performance:

```bash
Average E2E Latency: 55.94
Average TTFT: 52.13
Average ITL: 1.00
Requests/(1K ticks)/instance = 360.00 # Best throughput
Tokens/(1K ticks)/instance = 3170.00
```

## Load Generation Patterns

### Concurrent Load

Maintains a target number of concurrent in-flight requests.

```python
# Configuration
load_generator = ConcurrentLoadGenerator(
    target_concurrency=6,  # Maintain 6 concurrent requests
    target_output_len_tokens=10,
    total_prefill_chunks=2,
    prefill_time=2,
    itl=1
)
```

Performance:

```bash
Average E2E Latency: 15.14
Average TTFT: 7.87
Average ITL: 1.00
Requests/(1K ticks)/instance = 360.00
Tokens/(1K ticks)/instance = 3170.00
```

### Request Rate

Generates requests at a constant rate.

```python
# Configuration
load_generator = RequestRateLoadGenerator(
    request_rate=460./1000.,  # 460 requests per 1000 ticks
    target_output_len_tokens=10,
    total_prefill_chunks=2,
    prefill_time=2,
    itl=1
)
```

Performance:

```bash
Average E2E Latency: 17.66
Average TTFT: 11.03
Average ITL: 1.00
Requests/(1K ticks)/instance = 350.00
Tokens/(1K ticks)/instance = 3060.00
```

## Queue Growth Analysis

Compare performance between short (100 ticks) and long (10000 ticks) runs:

```bash
Request Rate Load Generator (460 requests/1000 ticks)
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”“
โ”ƒ Metric โ”ƒ 100 ticks โ”ƒ 10000 ticks โ”ƒ Difference โ”ƒ
โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ฉ
โ”‚ Final Queue Size โ”‚ 6 โ”‚ 1138 โ”‚ 1132 โ”‚
โ”‚ Average TTFT โ”‚ 11.03 โ”‚ 1245.77 โ”‚ 1234.75 โ”‚
โ”‚ Average E2E โ”‚ 17.66 โ”‚ 1253.78 โ”‚ 1236.12 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Concurrent Load Generator (6 concurrent requests)
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”“
โ”ƒ Metric โ”ƒ 100 ticks โ”ƒ 10000 ticks โ”ƒ Difference โ”ƒ
โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ฉ
โ”‚ Final Queue Size โ”‚ 2 โ”‚ 2 โ”‚ 0 โ”‚
โ”‚ Average TTFT โ”‚ 7.87 โ”‚ 8.61 โ”‚ 0.74 โ”‚
โ”‚ Average E2E โ”‚ 15.14 โ”‚ 17.32 โ”‚ 2.19 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
```

Key observations:

- Request Rate generator shows significant queue growth over time (a quick flow-balance check follows below)
- Concurrent Load generator maintains a stable queue size and stable latencies
- TTFT and E2E latency increase dramatically as the queue grows
- One Prefill Per Batch strategy achieves the best throughput (3170 tokens/1K ticks)
- IFB improves throughput by ~41% over Static Batching
- Chunked Context further improves throughput by ~16% over basic IFB
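
The Request Rate numbers are consistent with simple flow arithmetic: requests arrive faster than the engine's sustained throughput, and the deficit accumulates in the queue. A rough check using the figures reported above (assuming steady-state rates and ignoring warm-up):

```python
# Flow-balance sanity check using the numbers reported above.
arrival_rate = 460 / 1000   # requests arriving per tick (Request Rate generator)
service_rate = 350 / 1000   # sustained requests served per tick (measured above)
run_ticks = 10_000

backlog = (arrival_rate - service_rate) * run_ticks
print(backlog)  # ~1100 requests, close to the observed final queue size of 1138
```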

## Key Metrics

- **E2E Latency**: End-to-end latency for request completion (in ticks)
- **TTFT**: Time to first token (in ticks)
- **ITL**: Inter-token latency (ticks between tokens)
- **Throughput**: Requests and tokens processed per 1K ticks per instance
- **Queue Size**: Number of requests waiting to be processed
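
For intuition, the sketch below shows how these metrics relate to per-request timestamps. The record format is hypothetical (invented for illustration), not the simulator's internal representation:

```python
# Hypothetical per-request records: (arrival_tick, first_token_tick, finish_tick, n_tokens).
requests = [
    (0, 3, 12, 10),
    (0, 5, 14, 10),
]
sim_ticks = 100  # length of the simulated run

ttft = [first - arrival for arrival, first, _, _ in requests]
e2e = [finish - arrival for arrival, _, finish, _ in requests]
# ITL: average gap between consecutive tokens after the first one.
itl = [(finish - first) / (n - 1) for _, first, finish, n in requests]

req_per_1k = len(requests) / sim_ticks * 1000                 # requests / 1K ticks
tok_per_1k = sum(n for *_, n in requests) / sim_ticks * 1000  # tokens / 1K ticks
print(ttft, e2e, itl, req_per_1k, tok_per_1k)
```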