https://github.com/muhtasham/simulator

🚀 A high-performance simulator for LLM inference optimization, modeling compute-bound prefill and memory-bound decode phases. Explore batching strategies, analyze throughput-latency trade-offs, and optimize inference deployments without real model overhead.

# LLM Inference Simulator

A simulator for exploring different batching strategies and load patterns in LLM inference.

## Installation & Setup

```bash
# Install uv package manager
pip install uv

# Clone the repository
git clone https://github.com/muhtasham/simulator.git
cd simulator

# Install dependencies
uv pip install -r requirements.txt
```

## Understanding Ticks

In this simulator:

- A **tick** is the basic unit of time
- `prefill_time=2` means the prefill phase takes 2 ticks
- `itl=1` (Inter-Token Latency) means generating each token takes 1 tick
- Metrics are often reported per 1000 ticks for easier comparison
- Example: a `460./1000.` request rate means 460 requests per 1000 ticks (see the quick check below)
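
These units make quick sanity checks easy. Here is a minimal sketch (plain arithmetic only, not the simulator's API; the names mirror the configuration parameters used below):

```python
# Illustrative tick arithmetic; not simulator code.
prefill_time = 2               # prefill phase: 2 ticks
itl = 1                        # 1 tick per generated token
target_output_len_tokens = 10  # tokens generated per request

# A single unbatched request occupies the engine for roughly:
e2e_ticks = prefill_time + itl * target_output_len_tokens
print(e2e_ticks)  # 12 ticks

# A request rate of 460./1000. means, over a 1000-tick window:
print(460.0 / 1000.0 * 1000)  # 460.0 requests
```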

## Running Examples

Each example demonstrates different aspects of the simulator:

```bash
# Basic examples with simple configurations
uv run examples/batch_duration_demo.py

# Detailed metrics visualization
uv run examples/metrics_visualization.py

# Advanced batching strategies comparison
uv run examples/batching_strategies.py

# Queue growth analysis for long runs
uv run examples/queue_growth.py
```

## Features

- Multiple batching strategies (Static, In-Flight, Chunked Context)
- Various load generation patterns (Batch, Concurrent, Request Rate)
- Rich metrics visualization
- Configurable batch sizes and request parameters
- Queue growth analysis for long-running simulations

## Batching Strategies and Performance

### Static Batching

The baseline strategy: a new batch is admitted only when all slots are empty, i.e. after every request in the current batch has finished.

```python
# Configuration
engine = sim.Engine(
    max_batch_size=4,                # Maximum 4 requests in a batch
    load_generator=BatchLoadGenerator(
        initial_batch=100,           # Send 100 requests at start
        prefill_time=2,              # Each prefill takes 2 ticks
        itl=1,                       # Each token generation takes 1 tick
        target_output_len_tokens=10  # Generate 10 tokens per request
    ),
    batcher=StaticBatcher()
)
```

Performance:

```bash
Average E2E Latency: 58.16
Average TTFT: 52.80
Average ITL: 1.00
Requests/(1K ticks)/instance = 190.00
Tokens/(1K ticks)/instance = 1680.00
```

### In-Flight Batching (IFB)

Allows mixing prefill and decode phases in the same batch, so slots freed by finished requests are refilled immediately. Note the higher average ITL: decode steps can be delayed when a prefill shares the batch.

```python
# Configuration
engine = sim.Engine(
    max_batch_size=4,
    load_generator=BatchLoadGenerator(
        initial_batch=100,
        prefill_time=2,
        itl=1,
        target_output_len_tokens=10
    ),
    batcher=IFBatcher()
)
```

Performance:

```bash
Average E2E Latency: 58.44
Average TTFT: 52.90
Average ITL: 1.39
Requests/(1K ticks)/instance = 267.33 # 41% improvement over Static
Tokens/(1K ticks)/instance = 2376.24
```

### Chunked Context

Improves performance by splitting each prefill into chunks that can be interleaved with ongoing decode steps.

```python
# Configuration
load_generator = BatchLoadGenerator(
    initial_batch=100,
    prefill_time=2,
    itl=1,
    target_output_len_tokens=10,
    total_prefill_chunks=2  # Split prefill into 2 chunks
)
engine = sim.Engine(
    max_batch_size=4,
    load_generator=load_generator,
    batcher=IFBatcher()
)
```

Performance:

```bash
Average E2E Latency: 57.42
Average TTFT: 54.51
Average ITL: 1.14
Requests/(1K ticks)/instance = 310.00 # ~16% improvement over basic IFB
Tokens/(1K ticks)/instance = 2730.00
```

### One Prefill Per Batch

Limits each batch to at most one prefill request at a time, balancing compute-bound prefill work against memory-bound decode work.

```python
# Configuration
engine = sim.Engine(
    max_batch_size=4,
    load_generator=load_generator,
    batcher=IFBatcherWithOnePrefillOnly()
)
```

Performance:

```bash
Average E2E Latency: 55.94
Average TTFT: 52.13
Average ITL: 1.00
Requests/(1K ticks)/instance = 360.00 # Best throughput
Tokens/(1K ticks)/instance = 3170.00
```

## Load Generation Patterns

### Concurrent Load

Maintains a target number of concurrent in-flight requests.

```python
# Configuration
load_generator = ConcurrentLoadGenerator(
    target_concurrency=6,  # Maintain 6 concurrent requests
    target_output_len_tokens=10,
    total_prefill_chunks=2,
    prefill_time=2,
    itl=1
)
```

Performance:

```bash
Average E2E Latency: 15.14
Average TTFT: 7.87
Average ITL: 1.00
Requests/(1K ticks)/instance = 360.00
Tokens/(1K ticks)/instance = 3170.00
```

### Request Rate

Generates requests at a constant rate.

```python
# Configuration
load_generator = RequestRateLoadGenerator(
    request_rate=460./1000.,  # 460 requests per 1000 ticks
    target_output_len_tokens=10,
    total_prefill_chunks=2,
    prefill_time=2,
    itl=1
)
```

Performance:

```bash
Average E2E Latency: 17.66
Average TTFT: 11.03
Average ITL: 1.00
Requests/(1K ticks)/instance = 350.00
Tokens/(1K ticks)/instance = 3060.00
```

## Queue Growth Analysis

Compare performance between short (100 ticks) and long (10000 ticks) runs:

```bash
Request Rate Load Generator (460 requests/1000 ticks)
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”“
โ”ƒ Metric โ”ƒ 100 ticks โ”ƒ 10000 ticks โ”ƒ Difference โ”ƒ
โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ฉ
โ”‚ Final Queue Size โ”‚ 6 โ”‚ 1138 โ”‚ 1132 โ”‚
โ”‚ Average TTFT โ”‚ 11.03 โ”‚ 1245.77 โ”‚ 1234.75 โ”‚
โ”‚ Average E2E โ”‚ 17.66 โ”‚ 1253.78 โ”‚ 1236.12 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Concurrent Load Generator (6 concurrent requests)
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”“
โ”ƒ Metric โ”ƒ 100 ticks โ”ƒ 10000 ticks โ”ƒ Difference โ”ƒ
โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ฉ
โ”‚ Final Queue Size โ”‚ 2 โ”‚ 2 โ”‚ 0 โ”‚
โ”‚ Average TTFT โ”‚ 7.87 โ”‚ 8.61 โ”‚ 0.74 โ”‚
โ”‚ Average E2E โ”‚ 15.14 โ”‚ 17.32 โ”‚ 2.19 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
```

Key observations:

- Request Rate generator shows significant queue growth over time (a quick flow-balance check follows below)
- Concurrent Load generator maintains a stable queue size and stable latencies
- TTFT and E2E latency increase dramatically as the queue grows
- One Prefill Per Batch strategy achieves the best throughput (3170 tokens/1K ticks)
- IFB improves throughput by ~41% over Static Batching
- Chunked Context further improves throughput by ~16% over basic IFB
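
The Request Rate numbers are consistent with simple flow arithmetic: requests arrive faster than the engine's sustained throughput, and the deficit accumulates in the queue. A rough check using the figures reported above (assuming steady-state rates and ignoring warm-up):

```python
# Flow-balance sanity check using the numbers reported above.
arrival_rate = 460 / 1000   # requests arriving per tick (Request Rate generator)
service_rate = 350 / 1000   # sustained requests served per tick (measured above)
run_ticks = 10_000

backlog = (arrival_rate - service_rate) * run_ticks
print(backlog)  # ~1100 requests, close to the observed final queue size of 1138
```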

## Key Metrics

- **E2E Latency**: End-to-end latency for request completion (in ticks)
- **TTFT**: Time to first token (in ticks)
- **ITL**: Inter-token latency (ticks between tokens)
- **Throughput**: Requests and tokens processed per 1K ticks per instance
- **Queue Size**: Number of requests waiting to be processed
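
For intuition, the sketch below shows how these metrics relate to per-request timestamps. The record format is hypothetical (invented for illustration), not the simulator's internal representation:

```python
# Hypothetical per-request records: (arrival_tick, first_token_tick, finish_tick, n_tokens).
requests = [
    (0, 3, 12, 10),
    (0, 5, 14, 10),
]
sim_ticks = 100  # length of the simulated run

ttft = [first - arrival for arrival, first, _, _ in requests]
e2e = [finish - arrival for arrival, _, finish, _ in requests]
# ITL: average gap between consecutive tokens after the first one.
itl = [(finish - first) / (n - 1) for _, first, finish, n in requests]

req_per_1k = len(requests) / sim_ticks * 1000                 # requests / 1K ticks
tok_per_1k = sum(n for *_, n in requests) / sim_ticks * 1000  # tokens / 1K ticks
print(ttft, e2e, itl, req_per_1k, tok_per_1k)
```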