{"id":31939650,"url":"https://github.com/alibaba/infersim","last_synced_at":"2025-10-14T08:44:35.431Z","repository":{"id":316992318,"uuid":"1065587626","full_name":"alibaba/InferSim","owner":"alibaba","description":"A Lightweight LLM Inference Performance Simulator","archived":false,"fork":false,"pushed_at":"2025-09-28T04:04:08.000Z","size":53,"stargazers_count":6,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-28T05:39:48.048Z","etag":null,"topics":["llm-inference","simulator"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alibaba.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-28T02:55:45.000Z","updated_at":"2025-09-28T05:12:30.000Z","dependencies_parsed_at":"2025-09-28T05:39:49.581Z","dependency_job_id":"18c45bd9-f8a4-4742-8fc9-67bdea9e73ae","html_url":"https://github.com/alibaba/InferSim","commit_stats":null,"previous_names":["alibaba/infersim"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/alibaba/InferSim","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alibaba%2FInferSim","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alibaba%2FInferSim/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alibaba%2FInferSim/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub
/repositories/alibaba%2FInferSim/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alibaba","download_url":"https://codeload.github.com/alibaba/InferSim/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alibaba%2FInferSim/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279018302,"owners_count":26086345,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-14T02:00:06.444Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llm-inference","simulator"],"created_at":"2025-10-14T08:44:33.850Z","updated_at":"2025-10-14T08:44:35.423Z","avatar_url":"https://github.com/alibaba.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# InferSim: A Lightweight LLM Inference Performance Simulator\n\nInferSim is a lightweight simulator for LLM inference, written in pure Python without any third-party dependencies. It calculates TTFT (time to first token), TPOT (time per output token), and throughput in TGS (tokens/GPU/second) from the computational cost in FLOPs (Floating-Point Operations), the GPU computing power in FLOPS (Floating-Point Operations per Second), the GPU memory bandwidth, and the MFU (Model FLOPs Utilization) obtained by benchmarking state-of-the-art LLM kernels. 
For multi-GPU, multi-node deployment, InferSim also estimates the communication latency according to data volume and bandwidth.\n\nThe main use cases of InferSim include:\n- **Model-Sys co-design**: predicting inference performance given the hyperparameters of a model.\n- **Inference performance analysis**: quantifying performance bottlenecks, such as compute-bound or IO-bound, and supporting optimization efforts.\n\nFor more details, please check [InferSim Technical Report](https://github.com/user-attachments/files/22580184/infersim_tech_report.pdf).\n\n## Simulation Result\n\n| Model | GPU | Prefill TGS(Actual) | Prefill TGS(Sim) | Decode TGS(Actual) | Decode TGS(Sim) | Notes |\n| :--- | :---: | :---: | :---: | :---: | :---: | :--- |\n| DeepSeek-V3 | H800 | 7839 | 9034 | 2324 | 2675 | Actual data from [deepseek/profile-data](https://github.com/deepseek-ai/profile-data/). Simulated with same setup: [example/deepseek-v3/](./example/deepseek-v3/). |\n| Qwen3-30B-A3B-BF16 | H20 | 16594 | 17350 | 2749 | 2632 | Actual data tested with SGLang, simulation example: [example/qwen3-30B-A3B/](./example/qwen3-30B-A3B/). |\n| Qwen3-8B-FP8 | H20 | 15061 | 16328 | 2682 | 2581 | Actual data tested with SGLang, simulation example: [example/qwen3-8B/](./example/qwen3-8B/). |\n\n## Supported Features\n\n- **Attention**: MHA/GQA, MLA. MFU benchmarks from FlashInfer, FlashAttention-3, FlashMLA.\n- **MoE**: GroupedGEMM. MFU benchmarks from DeepGEMM.\n- **Linear**: GEMM. 
MFU benchmarks from DeepGEMM.\n- **Parallelization**: DP Attn, EP MoE.\n- **Large EP**: DeepEP dispatch and combine, with normal and low_latency mode.\n\n## Help\n\n```\n$ python3 main.py --help\nusage: main.py [-h] --config-path CONFIG_PATH [--device-type {H20,H800}] [--world-size WORLD_SIZE] [--num-nodes NUM_NODES]\n               [--max-prefill-tokens MAX_PREFILL_TOKENS] [--decode-bs DECODE_BS] [--target-tgs TARGET_TGS]\n               [--target-tpot TARGET_TPOT] [--target-isl TARGET_ISL] [--target-osl TARGET_OSL] [--use-fp8-gemm]\n               [--use-fp8-kv] [--enable-deepep] [--enable-tbo] [--sm-ratio SM_RATIO] [--prefill-only] [--decode-only]\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --config-path CONFIG_PATH\n                        The path of the hf model config.json\n  --device-type {H20,H800}\n                        Device type\n  --world-size WORLD_SIZE\n                        Num of GPUs\n  --num-nodes NUM_NODES\n                        Num of nodes\n  --max-prefill-tokens MAX_PREFILL_TOKENS\n                        Max prefill tokens\n  --decode-bs DECODE_BS\n                        Decoding batchsize. 
If not specified, bs = tgs * tpot.\n  --target-tgs TARGET_TGS\n                        Target tokens/s per GPU\n  --target-tpot TARGET_TPOT\n                        TPOT in ms\n  --target-isl TARGET_ISL\n                        Input sequence length, in tokens\n  --target-osl TARGET_OSL\n                        Output sequence length, in tokens\n  --use-fp8-gemm        Use fp8 gemm\n  --use-fp8-kv          Use fp8 kvcache\n  --enable-deepep       Enable DeepEP\n  --enable-tbo          Enable two batch overlap\n  --sm-ratio SM_RATIO   In TBO DeepEP normal mode, the SM ratio used for computation\n  --prefill-only        Only simulate prefill\n  --decode-only         Only simulate decoding\n```\n\n## Example\n\n```\n$ bash example/qwen3-30B-A3B/decode.sh\n\n================ Simulator Result ================\nDevice type:                             H20\nWorld size:                              4\nAttn type:                               MHA/GQA\nUse FP8 GEMM:                            0\nUse FP8 KV:                              0\n------------------Model Weights-------------------\nOne attn params size (MB):               36.00\nOne expert params size (MB):             9.00\nPer GPU params size (GB):                15.19\n---------------------KV Cache---------------------\nKV cache space (GB):                     60.81\nInput seq len:                           4096\nOutput seq len:                          2048\nTarget decode batchsize:                 100\nTarget per-token KV cache size (KB):     103.79\nCurrent per-token KV cache size (KB):    96.00\n----------------------FLOPs-----------------------\nNum hidden layers:                       48\nPer-token per-layer attn core (GFLOPs):  0.08\nPer-token per-layer MoE/FFN (GFLOPs):    0.08\nPer-token per-layer others (GFLOPs):     0.04\nPer-token attn core (GFLOPs):            4.03\nPer-token MoE (GFLOPs):                  3.62\nPer-token others (GFLOPs):               1.81\nPer-token total (GFLOPs):                
9.46\n---------------------Decoding---------------------\nAttn core MFU:                           0.15\nAttn core latency (us):                  361.77\nKV loading latency (us):                 298.02\nQKV_proj latency (us):                   31.03\nO_proj latency (us):                     16.95\nRouted experts/FFN MFU:                  0.18\nRouted experts/FFN latency (us):         269.28\nExperts loading latency (us):            85.83\nComm before MoE/FFN (us):                4.24\nComm after MoE/FFN (us):                 4.24\nTPOT (ms):                               38.00\nThroughput (TGS):                        2632\n```\n\n## Acknowledgement\n\nThis work is developed and maintained by Alimama AI Infra Team \u0026 Future Living Lab, Alibaba Group.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falibaba%2Finfersim","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falibaba%2Finfersim","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falibaba%2Finfersim/lists"}