{"id":22912346,"url":"https://github.com/muhtasham/simulator","last_synced_at":"2025-04-01T11:17:00.172Z","repository":{"id":267723990,"uuid":"901092003","full_name":"Muhtasham/simulator","owner":"Muhtasham","description":"🚀 A high-performance simulator for LLM inference optimization, modeling compute-bound prefill and memory-bound decode phases. Explore batching strategies, analyze throughput-latency trade-offs, and optimize inference deployments without real model overhead.","archived":false,"fork":false,"pushed_at":"2024-12-10T03:07:03.000Z","size":1353,"stargazers_count":6,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-26T12:55:29.941Z","etag":null,"topics":["llm-inference"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Muhtasham.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-10T03:02:24.000Z","updated_at":"2025-02-14T16:14:46.000Z","dependencies_parsed_at":"2024-12-12T04:15:45.083Z","dependency_job_id":null,"html_url":"https://github.com/Muhtasham/simulator","commit_stats":null,"previous_names":["muhtasham/simulator"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Muhtasham%2Fsimulator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Muhtasham%2Fsimulator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Muhtasham%2Fsimulator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Muhtasham%2Fsimulator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Muhtasham","download_url":"https://codeload.github.com/Muhtasham/simulator/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246628225,"owners_count":20808106,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llm-inference"],"created_at":"2024-12-14T04:27:50.410Z","updated_at":"2025-04-01T11:17:00.148Z","avatar_url":"https://github.com/Muhtasham.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LLM Inference Simulator\n\nA simulator for exploring different batching strategies and load patterns in LLM inference.\n\n## Installation \u0026 Setup\n\n```bash\n# Install uv package manager\npip install uv\n\n# Clone the repository\ngit clone https://github.com/muhtasham/simulator.git\ncd simulator\n\n# Install dependencies\nuv pip install -r requirements.txt\n```\n\n## Understanding Ticks\n\nIn this simulator:\n\n- A **tick** is the basic unit of time\n- `prefill_time=2` means the prefill phase takes 2 ticks\n- `itl=1` (Inter-Token Latency) means generating each token takes 1 tick\n- Metrics are often reported per 1000 ticks for easier comparison\n- Example: `460./1000.` request rate means 460 requests per 1000 ticks\n\n## Running Examples\n\nEach example demonstrates different aspects of the simulator:\n\n```bash\n# Basic examples with simple configurations\nuv run examples/batch_duration_demo.py\n\n# Detailed metrics visualization\nuv run examples/metrics_visualization.py\n\n# Advanced batching strategies comparison\nuv run examples/batching_strategies.py\n\n# Queue growth analysis for long runs\nuv run examples/queue_growth.py\n```\n\n## Features\n\n- Multiple batching strategies (Static, In-Flight, Chunked Context)\n- Various load generation patterns (Batch, Concurrent, Request Rate)\n- Rich metrics visualization\n- Configurable batch sizes and request parameters\n- Queue growth analysis for long-running simulations\n\n## Batching Strategies and Performance\n\n### Static Batching\n\nBasic batching strategy that only batches requests when all slots are empty.\n\n```python\n# Configuration\nengine = sim.Engine(\n    max_batch_size=4,  # Maximum 4 requests in a batch\n    load_generator=BatchLoadGenerator(\n        initial_batch=100,  # Send 100 requests at start\n        prefill_time=2,    # Each prefill takes 2 ticks\n        itl=1,             # Each token generation takes 1 tick\n        target_output_len_tokens=10  # Generate 10 tokens per request\n    ),\n    batcher=StaticBatcher()\n)\n```\n\nPerformance:\n\n```bash\nAverage E2E Latency: 58.16\nAverage TTFT: 52.80\nAverage ITL: 1.00\nRequests/(1K ticks)/instance = 190.00\nTokens/(1K ticks)/instance = 1680.00\n```\n\n### In-Flight Batching (IFB)\n\nAllows mixing prefill and decode phases in the same batch.\n\n```python\n# Configuration\nengine = sim.Engine(\n    max_batch_size=4,\n    load_generator=BatchLoadGenerator(\n        initial_batch=100,\n        prefill_time=2,\n        itl=1,\n        target_output_len_tokens=10\n    ),\n    batcher=IFBatcher()\n)\n```\n\nPerformance:\n\n```bash\nAverage E2E Latency: 58.44\nAverage TTFT: 52.90\nAverage ITL: 1.39\nRequests/(1K ticks)/instance = 267.33  # 41% improvement over Static\nTokens/(1K ticks)/instance = 2376.24\n```\n\n### Chunked Context\n\nOptimizes performance by separating prefill into chunks.\n\n```python\n# Configuration\nload_generator = BatchLoadGenerator(\n    initial_batch=100,\n    prefill_time=2,\n    itl=1,\n    target_output_len_tokens=10,\n    total_prefill_chunks=2  # Split prefill into 2 chunks\n)\nengine = sim.Engine(\n    max_batch_size=4,\n    load_generator=load_generator,\n    batcher=IFBatcher()\n)\n```\n\nPerformance:\n\n```bash\nAverage E2E Latency: 57.42\nAverage TTFT: 54.51\nAverage ITL: 1.14\nRequests/(1K ticks)/instance = 310.00  # 15% improvement over basic IFB\nTokens/(1K ticks)/instance = 2730.00\n```\n\n### One Prefill Per Batch\n\nLimits to one prefill request at a time for balanced compute/memory usage.\n\n```python\n# Configuration\nengine = sim.Engine(\n    max_batch_size=4,\n    load_generator=load_generator,\n    batcher=IFBatcherWithOnePrefillOnly()\n)\n```\n\nPerformance:\n\n```bash\nAverage E2E Latency: 55.94\nAverage TTFT: 52.13\nAverage ITL: 1.00\nRequests/(1K ticks)/instance = 360.00  # Best throughput\nTokens/(1K ticks)/instance = 3170.00\n```\n\n## Load Generation Patterns\n\n### Concurrent Load\n\nMaintains a target level of concurrent requests.\n\n```python\n# Configuration\nload_generator = ConcurrentLoadGenerator(\n    target_concurrency=6,    # Maintain 6 concurrent requests\n    target_output_len_tokens=10,\n    total_prefill_chunks=2,\n    prefill_time=2,\n    itl=1\n)\n```\n\nPerformance:\n\n```bash\nAverage E2E Latency: 15.14\nAverage TTFT: 7.87\nAverage ITL: 1.00\nRequests/(1K ticks)/instance = 360.00\nTokens/(1K ticks)/instance = 3170.00\n```\n\n### Request Rate\n\nGenerates requests at a constant rate.\n\n```python\n# Configuration\nload_generator = RequestRateLoadGenerator(\n    request_rate=460./1000.,  # 460 requests per 1000 ticks\n    target_output_len_tokens=10,\n    total_prefill_chunks=2,\n    prefill_time=2,\n    itl=1\n)\n```\n\nPerformance:\n\n```bash\nAverage E2E Latency: 17.66\nAverage TTFT: 11.03\nAverage ITL: 1.00\nRequests/(1K ticks)/instance = 350.00\nTokens/(1K ticks)/instance = 3060.00\n```\n\n## Queue Growth Analysis\n\nCompare performance between short (100 ticks) and long (10000 ticks) runs:\n\n```bash\nRequest Rate Load Generator (460 requests/1000 ticks)\n┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┓\n┃ Metric           ┃ 100 ticks  ┃ 10000 ticks ┃ Difference ┃\n┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━┩\n│ Final Queue Size │ 6          │ 1138        │ 1132       │\n│ Average TTFT     │ 11.03      │ 1245.77     │ 1234.75    │\n│ Average E2E      │ 17.66      │ 1253.78     │ 1236.12    │\n└──────────────────┴────────────┴─────────────┴────────────┘\n\nConcurrent Load Generator (6 concurrent requests)\n┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┓\n┃ Metric           ┃ 100 ticks  ┃ 10000 ticks ┃ Difference ┃\n┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━┩\n│ Final Queue Size │ 2          │ 2           │ 0          │\n│ Average TTFT     │ 7.87       │ 8.61        │ 0.74       │\n│ Average E2E      │ 15.14      │ 17.32       │ 2.19       │\n└──────────────────┴────────────┴─────────────┴────────────┘\n```\n\nKey observations:\n\n- Request Rate generator shows significant queue growth over time\n- Concurrent Load generator maintains stable queue size and latencies\n- TTFT and E2E latency increase dramatically with queue growth\n- One Prefill Per Batch strategy achieves best throughput (3170 tokens/1K ticks)\n- IFB improves throughput by 41% over Static Batching\n- Chunked Context further improves throughput by 15% over basic IFB\n\n## Key Metrics\n\n- **E2E Latency**: End-to-end latency for request completion (in ticks)\n- **TTFT**: Time to first token (in ticks)\n- **ITL**: Inter-token latency (ticks between tokens)\n- **Throughput**: Requests and tokens processed per 1K ticks per instance\n- **Queue Size**: Number of requests waiting to be processed\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmuhtasham%2Fsimulator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmuhtasham%2Fsimulator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmuhtasham%2Fsimulator/lists"}