{"id":23345738,"url":"https://github.com/harleyszhang/llm_counts","last_synced_at":"2025-08-23T10:31:26.967Z","repository":{"id":183944002,"uuid":"671044141","full_name":"harleyszhang/llm_counts","owner":"harleyszhang","description":"llm theoretical performance analysis tools and support params, flops, memory and latency analysis.","archived":false,"fork":false,"pushed_at":"2025-07-11T17:40:43.000Z","size":7786,"stargazers_count":102,"open_issues_count":0,"forks_count":9,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-08-17T08:46:36.245Z","etag":null,"topics":["gpu-performance","llama","llm","llm-inference","profiler","python3","transformer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/harleyszhang.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-07-26T12:15:25.000Z","updated_at":"2025-08-10T15:19:07.000Z","dependencies_parsed_at":"2024-08-21T15:34:42.775Z","dependency_job_id":"6f0f1ef0-cc88-47d1-a702-d89c808ef6e6","html_url":"https://github.com/harleyszhang/llm_counts","commit_stats":{"total_commits":8,"total_committers":3,"mean_commits":"2.6666666666666665","dds":0.375,"last_synced_commit":"3a037d22ae7ae29a4b3bb2f2b2963926a543f39d"},"previous_names":["harleyszhang/llm_profiler","harleyszhang/llm_counts"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/harleyszhang/llm_counts","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harleyszhang%2Fllm_counts","tags_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/repositories/harleyszhang%2Fllm_counts/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harleyszhang%2Fllm_counts/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harleyszhang%2Fllm_counts/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/harleyszhang","download_url":"https://codeload.github.com/harleyszhang/llm_counts/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harleyszhang%2Fllm_counts/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271746655,"owners_count":24813575,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-23T02:00:09.327Z","response_time":69,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gpu-performance","llama","llm","llm-inference","profiler","python3","transformer"],"created_at":"2024-12-21T07:01:23.678Z","updated_at":"2025-08-23T10:31:26.961Z","avatar_url":"https://github.com/harleyszhang.png","language":"Python","funding_links":[],"categories":["Building","Summary"],"sub_categories":["Tools"],"readme":"# llm_profiler\n\nTheoretical performance analysis tools for LLMs, supporting parameter, FLOPs, memory, and latency analysis.\n\n## Main features\n\n- Supports the qwen2.5 and qwen3 dense model series.\n- Supports tensor-parallel inference.\n- Supports hardware such as `A100`, `V100`, and `T4` as well as mainstream decoder-only autoregressive models; more can be added in the configuration files.\n- 
Analyzes performance bottlenecks: whether each `layer` is `memory bound` or `compute bound`, and the bottleneck caused by the `kv_cache`.\n- Reports the parameter count, FLOPs, memory, and `latency` per layer and for the whole model.\n- For inference, computes memory and latency separately for the prefill and decode stages, as well as the theoretical maximum `bs`, and more.\n- Supports setting the compute efficiency and memory-read efficiency (these may differ across inference frameworks; once set, the output can approximate real-world values).\n- Formatted output of the theoretical inference performance analysis results.\n\n## Usage\n\nCall the `llm_profile()` function in `llm_profiler/llm_profiler.py` with the relevant parameters.\n\n```python\ndef llm_profile(model_name=\"llama-13b\",\n                gpu_name: str = \"v100-sxm-32gb\",\n                bytes_per_param: int = BYTES_FP16,\n                bs: int = 1,\n                seq_len: int = 522,\n                generate_len=1526,\n                ds_zero: int = 0,\n                dp_size: int = 1,\n                tp_size: int = 1,\n                pp_size: int = 1,\n                sp_size: int = 1,\n                layernorm_dtype_bytes: int = BYTES_FP16,\n                kv_cache_bytes: int = BYTES_FP16,\n                flops_efficiency: float = FLOPS_EFFICIENCY,\n                hbm_memory_efficiency: float = HBM_MEMORY_EFFICIENCY,\n                intra_node_memory_efficiency=INTRA_NODE_MEMORY_EFFICIENCY,\n                inter_node_memory_efficiency=INTER_NODE_MEMORY_EFFICIENCY,\n                mode: str = \"inference\",\n            ) -\u003e dict:\n\n    \"\"\"Pretty-print dicts of the total floating-point operations, MACs, parameters, and latency of an LLM.\n\n    Args:\n        model_name (str, optional): model name to query the pre-defined `model_configs.json`. Defaults to \"llama-13b\".\n        gpu_name (str, optional): gpu name to query the pre-defined `model_configs.json`. Defaults to \"v100-sxm-32gb\".\n        bs (int, optional): batch size per GPU. Defaults to 1.\n        seq_len (int, optional): input sequence (prompt) length in tokens. Defaults to 522.\n        generate_len (int, optional): The maximum number of tokens to generate, ignoring the number of tokens in the prompt. Defaults to 1526.\n        dp_size (int, optional): data parallelism size. 
Defaults to 1.\n        tp_size (int, optional): tensor parallelism size. Defaults to 1.\n        pp_size (int, optional): pipeline parallelism size. Defaults to 1.\n        sp_size (int, optional): sequence parallelism size. Defaults to 1.\n        layernorm_dtype_bytes (int, optional): number of bytes in the data type for the layernorm activations. Defaults to BYTES_FP16.\n        kv_cache_bytes (int, optional): number of bytes in the data type for the kv_cache. Defaults to BYTES_FP16.\n        flops_efficiency (float, optional): flops efficiency, ranging from 0 to 1. Defaults to FLOPS_EFFICIENCY.\n        hbm_memory_efficiency (float, optional): GPU HBM memory efficiency, ranging from 0 to 1. Defaults to HBM_MEMORY_EFFICIENCY.\n        intra_node_memory_efficiency (float, optional): intra-node memory efficiency, ranging from 0 to 1. Defaults to INTRA_NODE_MEMORY_EFFICIENCY.\n        inter_node_memory_efficiency (float, optional): inter-node memory efficiency, ranging from 0 to 1. 
Defaults to INTER_NODE_MEMORY_EFFICIENCY.\n\n    Returns:\n        dict: summary dictionaries of the inference analysis (also printed in formatted form)\n    \"\"\"\n```\n\nExample output for the `llama2-70b` model with tp_size = 8 and bs = 32:\n\n```bash\n-------------------------- LLM main infer config --------------------------\n{   'inference_config': {   'model_name': 'llama2-70b',\n                            'num_attention_heads': 64,\n                            'num_kv_heads': 8,\n                            'head_dim': 128,\n                            'hidden_size': 8192,\n                            'intermediate_size': 28672,\n                            'vocab_size': 32000,\n                            'max_seq_len': 4096,\n                            'bs': 32,\n                            'seq_len': 1024,\n                            'tp_size': 8,\n                            'pp_size': 1,\n                            'generate_len': 128},\n    'gpu_config': {   'name': 'a100-sxm-40gb',\n                      'memory_in_GB': '40 GB',\n                      'gpu_hbm_bw': '1555 GB/s',\n                      'gpu_intra_node_bw': '600 GB/s',\n                      'gpu_fp16_TFLOPS': '312 TFLOPS'}}\n\n-------------------------- LLM infer performance analysis --------------------------\n{   'weight_memory_per_gpu': '17.18 GB',\n    'consume_memory_per_gpu': '20.57 GB',\n    'prefill_flops': '4574.25 T',\n    'decode_flops_per_step': '4.38 T',\n    'TTFT': 2.7060724961666294,\n    'TTOT': 0.040541745771914876,\n    'kv_cache_latency': '959.04 us',\n    'total_infer_latency': '7.9 s',\n    'support_max_batch_total_tokens': 240249}\n\n---------------------------- LLM Params per_layer analysis ----------------------------\n{   'qkvo_proj': '150.99 M',\n    'mlp': '704.64 M',\n    'rmsnorm': '16.38 K',\n    'input_embedding': '262.14 M',\n    'output_embedding': '262.14 M'}\n{'params_model': '68.71 G'}\n\n---------------------------- LLM Prefill Flops per_layer analysis 
----------------------------\n{   'attention_kernel': '1.1 T',\n    'qkvo_proj': '9.9 T',\n    'mlp': '46.18 T',\n    'rmsnorm': '4.29 G',\n    'positional_embedding': '536.87 M',\n    'input_embedding': '0'}\n{'prefill flops_model': '4574.25 T'}\n\n---------------------------- LLM Memory analysis (Prefill) ----------------------------\n{   'weight_memory_per_gpu': '17.18 GB',\n    'prefill_max_bs': '388B',\n    'prefill_act_per_gpu': '1.88 GB'}\n\n---------------------------- LLM Memory analysis (Decode) ----------------------------\n{   'decode_act_per_gpu': '1.88 GB',\n    'kv_cache_memory_per_gpu': '1.51 GB',\n    'consume_memory_per_gpu': '20.57 GB',\n    'decode_max_bs': '215.0B',\n    'max_batch_total_tokens': '240.25 KB'}\n\n---------------------------- LLM Latency analysis (Prefill) ----------------------------\n{   'prefill_qkvo_proj': '352.41 ms',\n    'prefill_attn_kernel': '131.39 ms',\n    'prefill_mlp': '1.64 s',\n    'prefill_rmsnorm': '61.38 ms',\n    'prefill_tp_comm': '501.08 ms',\n    'prefill_kv_cache_rw': '959.04 us',\n    'prefill_latency': '2.71 s'}\n\n---------------------------- LLM Latency analysis (Decode) ----------------------------\n{   'decode_qkvo_proj': '6.5 ms',\n    'decode_attn_kernel': '2.56 ms',\n    'decode_mlp': '30.26 ms',\n    'decode_rmsnorm': '64.62 us',\n    'decode_tp_comm': '640.0 us',\n    'decode_kv_cache_rw': '121.75 us',\n    'kv_cache_latency': '959.04 us',\n    'decode_latency': '40.54 ms'}\n```\n\n## Model structure visualization\n\nllama2-70b model, A100-SXM40GB, tp_size = 8 and bs = 32, prefill stage:\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"images/grpah_prefill_llama2-70b_tp8_bs32_seqlen1024_genlen128.png\" width=\"50%\" alt=\"prefill stage\"\u003e\n\u003c/div\u003e\n\nllama2-70b model, A100-SXM40GB, tp_size = 8 and bs = 32, decode stage:\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"images/grpah_decode_llama2-70b_tp8_bs32_seqlen1024_genlen128.png\" width=\"50%\" alt=\"decode stage\"\u003e\n\u003c/div\u003e\n\nQwen3 MoE model structure visualization. The expert 
structure is simply moe_mlp and is not expanded in the figure below. graph_prefill_Qwen3-30B-A3B_tp1_bs16_seqlen600_genlen128:\n\n![graph_prefill_Qwen3-30B-A3B](./images/Qwen3-30B-A3B/graph_prefill_Qwen3-30B-A3B_tp1_bs16_seqlen600_genlen128.png)\n\ngraph_decode_Qwen3-30B-A3B_tp1_bs16_seqlen600_genlen128:\n\n![decode_Qwen3-30B-A3B](./images/Qwen3-30B-A3B/graph_decode_Qwen3-30B-A3B_tp1_bs16_seqlen600_genlen128.png)\n\n## Distribution of model parameters, FLOPs, and latency\n\nllama2-70b model, A100-SXM40GB, tp_size = 8 and bs = 32, parameter count distribution:\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"images/params_llama2-70b_tp8_bs32_seqlen1024_genlen128.png\" width=\"50%\" alt=\"parameter count distribution\"\u003e\n\u003c/div\u003e\n\nllama2-70b model, A100-SXM40GB, tp_size = 8 and bs = 32, FLOPs distribution in the prefill stage:\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"images/flops_prefill_llama2-70b_tp8_bs32_seqlen1024_genlen128.png\" width=\"50%\" alt=\"FLOPs distribution in the prefill stage\"\u003e\n\u003c/div\u003e\n\nllama2-70b model, A100-SXM40GB, tp_size = 8 and bs = 32, generate_len = 128, FLOPs distribution in the decode stage:\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"images/flops_decode_llama2-70b_tp8_bs32_seqlen1024_genlen128.png\" width=\"50%\" alt=\"FLOPs distribution in the decode stage\"\u003e\n\u003c/div\u003e\n\nllama2-70b model, A100-SXM40GB, tp_size = 8 and bs = 32, latency distribution in the prefill stage:\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"images/latency_prefill_llama2-70b_tp8_bs32_seqlen1024_genlen128.png\" width=\"50%\" alt=\"latency distribution in the prefill stage\"\u003e\n\u003c/div\u003e\n\nllama2-70b model, A100-SXM40GB, tp_size = 8 and bs = 32, latency distribution in the decode stage:\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"images/latency_decode_llama2-70b_tp8_bs32_seqlen1024_genlen128.png\" width=\"50%\" alt=\"latency distribution in the decode stage\"\u003e\n\u003c/div\u003e\n\n## References\n- [Theoretical foundations of Transformer performance analysis](https://github.com/HarleysZhang/dl_note/blob/main/6-llm_note/transformer_basic/Transformer%E6%80%A7%E8%83%BD%E5%88%86%E6%9E%90%E7%90%86%E8%AE%BA%E5%9F%BA%E7%A1%80.md)\n- [llm_analysis](https://github.com/cli99/llm-analysis)\n- [Transformer Inference 
Arithmetic](https://kipp.ly/blog/transformer-inference-arithmetic/)\n- [LLM-Viewer](https://github.com/hahnyuan/LLM-Viewer.git)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fharleyszhang%2Fllm_counts","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fharleyszhang%2Fllm_counts","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fharleyszhang%2Fllm_counts/lists"}