{"id":27362209,"url":"https://github.com/ByteDance-Seed/Triton-distributed","last_synced_at":"2025-04-13T03:01:11.681Z","repository":{"id":286246546,"uuid":"959035472","full_name":"ByteDance-Seed/Triton-distributed","owner":"ByteDance-Seed","description":"Distributed Triton for Parallel Systems","archived":false,"fork":false,"pushed_at":"2025-04-05T07:19:04.000Z","size":67475,"stargazers_count":79,"open_issues_count":0,"forks_count":4,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-04-05T08:23:39.502Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"MLIR","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ByteDance-Seed.png","metadata":{"files":{"readme":"README-cn.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-04-02T06:57:03.000Z","updated_at":"2025-04-05T08:09:31.000Z","dependencies_parsed_at":"2025-04-05T08:34:06.603Z","dependency_job_id":null,"html_url":"https://github.com/ByteDance-Seed/Triton-distributed","commit_stats":null,"previous_names":["bytedance-seed/triton-distributed"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ByteDance-Seed%2FTriton-distributed","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ByteDance-Seed%2FTriton-distributed/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ByteDance-Seed%2FTriton-distributed/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ByteDance-Seed%2FTriton-distributed/ma
nifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ByteDance-Seed","download_url":"https://codeload.github.com/ByteDance-Seed/Triton-distributed/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248657867,"owners_count":21140844,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-04-13T03:00:50.135Z","updated_at":"2025-04-13T03:01:11.648Z","avatar_url":"https://github.com/ByteDance-Seed.png","language":"MLIR","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n 👋 Hi, everyone!\n    \u003cbr\u003e\n    We are the \u003cb\u003eByteDance Seed team.\u003c/b\u003e\n\u003c/div\u003e\n\n\u003cp align=\"center\"\u003e\n  You can get to know us better through the following channels👇\n  \u003cbr\u003e\n  \u003ca href=\"https://team.doubao.com/\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/Website-%231e37ff?style=for-the-badge\u0026logo=bytedance\u0026logoColor=white\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/user-attachments/assets/93481cda-a7f3-47f3-b333-fe6b3da86b78\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/WeChat-07C160?style=for-the-badge\u0026logo=wechat\u0026logoColor=white\"\u003e\u003c/a\u003e\n \u003ca href=\"https://www.xiaohongshu.com/user/profile/668e7e15000000000303157d?xsec_token=ABl2-aqekpytY6A8TuxjrwnZskU-6BsMRE_ufQQaSAvjc%3D\u0026xsec_source=pc_search\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/Xiaohongshu-%23FF2442?style=for-the-badge\u0026logo=xiaohongshu\u0026logoColor=white\"\u003e\u003c/a\u003e\n  \u003ca 
href=\"https://www.zhihu.com/org/dou-bao-da-mo-xing-tuan-dui/\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/zhihu-%230084FF?style=for-the-badge\u0026logo=zhihu\u0026logoColor=white\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n![seed logo](https://github.com/user-attachments/assets/c42e675e-497c-4508-8bb9-093ad4d1f216)\n\n# Triton-distributed\n\u003c!-- \n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/bytedance/flux\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/Triton-distributed-Project Page-yellow\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://arxiv.org/pdf/xxxx.xxxx\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/Triton-distributed-Tech Report-red\"\u003e\u003c/a\u003e\n  \u003cbr\u003e\n  \u003ca href=\"https://github.com/user-attachments/assets/d3fcb3bf-466b-4efe-8c3f-5f85258202ae\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/Triton-distributed-Wechat Communication Group-07C160\"\u003e\u003c/a\u003e\n  \u003ca href=\"XXX\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/License-MIT-blue\"\u003e\u003c/a\u003e\n\u003c/p\u003e --\u003e\n\n[Original Triton README](upstream-README.md) | [English README](README.md)\n\nTriton-distributed is a distributed compiler built on OpenAI Triton and designed for computation-communication overlapping.\n\nWith Triton-distributed, developers can write efficient kernels whose performance is comparable to that of optimized libraries such as NVIDIA's [Distributed-GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/65_distributed_gemm) and ByteDance's [FLUX](https://github.com/bytedance/flux/blob/main/README.md). Triton-distributed currently targets mainly NVIDIA and AMD GPUs, and it can also be ported to other hardware platforms. If you would like to use it on custom hardware, please contact us.\n\n## Getting started\n### Install from source\n\n[Build guide](docs/distributed/build.md)\n\n### How to use Triton-distributed\nTriton-distributed provides a set of easy-to-use primitives for developing distributed kernels that fuse computation and communication. The primitives are divided into low-level and high-level primitives. The low-level primitives have been released; the high-level primitives are planned for a future release.\n\n[Triton-distributed primitives](docs/distributed/primitives.md)\n\nWith these primitives, users can write communication kernels easily. For example, the kernel below implements a low-latency AllToAll (in inference scenarios its latency is lower than that of [DeepEP](https://github.com/deepseek-ai/DeepEP)). On a 32-GPU H800 cluster, this example takes 137 microseconds (128 tokens per GPU, topk=8, hidden_size=7168, 
fp8 data type), compared with 182 microseconds for DeepEP (DeepEP does not use NVLink for inference).\n```py\nimport triton\nimport triton.language as tl\n\n# `tid` and `libshmem_device` come from Triton-distributed (imports omitted here).\n\n\n@triton.jit\ndef all_to_all_kernel(\n    data_src,\n    data_dst,\n    splits_src,\n    splits_dst,\n    signal,\n    splits_cumsum,\n    scale_src,\n    scale_dst,\n    rank: int,\n    call_count: int,\n    WITH_SCALE: tl.constexpr,\n    WORLD_SIZE: tl.constexpr,\n    HIDDEN: tl.constexpr,\n    MAX_M: tl.constexpr,\n    EXPERTS_PER_RANK: tl.constexpr,\n    NUM_TOT_EXPERTS: tl.constexpr,\n    ELEMENT_SIZE: tl.constexpr = 2,\n    SCALE_ELEMENT_SIZE: tl.constexpr = 4,\n):\n    # `pid` is used as the target peer of every put issued by this program\n    pid = tl.program_id(0)\n    threadidx = tid(axis=0)\n\n    # range of experts that belong to peer `pid`\n    exp_st = pid * EXPERTS_PER_RANK\n    exp_ed = exp_st + EXPERTS_PER_RANK\n\n    # rows of `data_src` routed to those experts\n    m_st = tl.load(splits_cumsum + exp_st)\n    m_ed = tl.load(splits_cumsum + exp_ed)\n    num_rows_cur_block = m_ed - m_st\n\n    src_off = m_st\n    # this rank writes into its own MAX_M-row slot of the destination buffer\n    dst_off = rank * MAX_M\n\n    # per-expert row counts, derived from the cumulative sums\n    split_src_ptr = splits_src + exp_st\n    off0 = exp_st + tl.arange(0, EXPERTS_PER_RANK)\n    off1 = exp_st + tl.arange(0, EXPERTS_PER_RANK) + 1\n    cumsum_sts = tl.load(splits_cumsum + off0)\n    cumsum_eds = tl.load(splits_cumsum + off1)\n    tl.store(split_src_ptr + tl.arange(0, EXPERTS_PER_RANK), cumsum_eds - cumsum_sts)\n\n    # alternate between two buffer halves on successive calls\n    act_pos = call_count % 2\n    data_dst_ptr = data_dst + act_pos * WORLD_SIZE * MAX_M * HIDDEN + dst_off * HIDDEN\n    split_dst_ptr = splits_dst + act_pos * NUM_TOT_EXPERTS + rank * EXPERTS_PER_RANK\n    signal_ptr = signal + act_pos * WORLD_SIZE + rank\n\n    # non-blocking puts of the token data and the split sizes to peer `pid`\n    libshmem_device.putmem_nbi_block(\n        data_dst_ptr,\n        data_src + src_off * HIDDEN,\n        num_rows_cur_block * HIDDEN * ELEMENT_SIZE,\n        pid,\n    )\n    libshmem_device.putmem_nbi_block(\n        split_dst_ptr,\n        split_src_ptr,\n        EXPERTS_PER_RANK * 4,  # splits are stored as `int32` (4 bytes each)\n        pid,\n    )\n    if WITH_SCALE:\n        # send the scales and set the signal on the peer in a single call\n        scale_dst_ptr = scale_dst + act_pos * WORLD_SIZE * MAX_M + dst_off\n        libshmem_device.putmem_signal_nbi_block(\n            scale_dst_ptr,\n            scale_src + src_off,\n          
  num_rows_cur_block * SCALE_ELEMENT_SIZE,\n            signal_ptr,\n            call_count,\n            libshmem_device.NVSHMEM_SIGNAL_SET,\n            pid,\n        )\n\n    # order the preceding puts before any later operation to the same peer\n    libshmem_device.fence()\n    if threadidx == 0:\n        if not WITH_SCALE:\n            # no scale transfer carried the signal, so set it explicitly\n            libshmem_device.signal_op(\n                signal_ptr,\n                call_count,\n                libshmem_device.NVSHMEM_SIGNAL_SET,\n                pid,\n            )\n        # wait until the local signal slot for peer `pid` equals `call_count`\n        libshmem_device.signal_wait_until(\n            signal + act_pos * WORLD_SIZE + pid,\n            libshmem_device.NVSHMEM_CMP_EQ,\n            call_count,\n        )\n```\n\nIn addition, users can combine the communication part with the computation part to design kernels that fuse computation and communication. Example implementations are provided under `third_party/distributed/distributed/kernels`.\n\n## Performance\nTriton-distributed achieves performance close to that of hand-written distributed operator libraries, and in some cases better.\n\n\n### AllGather GEMM on a single H800 node\n![AG-GEMM-intra-node](asset/ag-gemm-intra-node.png)\n\n### GEMM ReduceScatter on a single H800 node\n![GEMM-RS-intra-node](asset/gemm-rs-intranode-perf.png)\n\n### AllGather GEMM on two H800 nodes\n![AG-GEMM-inter-node](asset/ag-inter-node-gemm.png)\n\n### GEMM ReduceScatter on two H800 nodes\n![GEMM-RS-inter-node](asset/gemm-rs-inter-node.png)\n\n### Distributed Flash-Decode scaling from one node to four nodes\n![flash-decode-inter-node](asset/flash-decode-scaling.png)\n\n### Performance on other platforms\n[AMD GPUs](docs/distributed/amd-perf.md)\n\n## Roadmap\n### Features\n- [x] Release low-level primitives\n- [ ] Release high-level primitives\n- [ ] Tutorials\n- [ ] Pre-built binary\n### Kernels\n- [x] Release single-node GEMM TP overlapping kernels\n- [x] Release single-node MoE TP overlapping kernels\n- [x] Release single-node distributed Flash-Decoding kernels\n- [ ] Release single-node MoE EP overlapping kernels\n- [x] Release cross-node GEMM TP overlapping kernels\n- [x] Release cross-node MoE TP overlapping kernels\n- [x] Release cross-node distributed Flash-Decoding kernels\n- [x] Release cross-node EP all-to-all kernels (similar to [DeepEP](https://github.com/deepseek-ai/DeepEP))\n- [ ] Provide tutorials for kernel implementation\n### Backends\nCompute capability\n- 
[x] NVIDIA SM90a support\n- [x] NVIDIA SM80 support\n- [x] NVIDIA SM89 support\n- [x] AMD CDNA3 support\n\nCommunication capability\n- [x] NVLink\n- [x] IB\n- [ ] PCIe\n\n### Performance\n- [ ] Performance report\n\n## License\nTriton-distributed is mainly released under the MIT license.\nSome of our code is under the Apache-2.0 license:\n- `third_party/distributed/distributed/kernels/flash_decode.py`\n\nSome of the original Triton code is also under the Apache-2.0 license:\n- `include/triton/Dialect/TritonGPU/Transforms/PipelineExpander.h`\n- `lib/Dialect/TritonGPU/Transforms/Pipeliner/PipelineExpander.cpp`\n- `python/triton/_C/include/triton/Dialect/TritonGPU/Transforms/PipelineExpander.h`\n- `utils/generate-test-checks.py`\n\n## Citation\nIf you use Triton-distributed in academic research, please cite:\n```bibtex\n@misc{zheng2025tilelink,\n      title={TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives},\n      author={Size Zheng and Jin Fang and Xuegui Zheng and Qi Hou and Wenlei Bao and Ningxin Zheng and Ziheng Jiang and Dongyang Wang and Jianxi Ye and Haibin Lin and Li-Wen Chang and Xin Liu},\n      year={2025},\n      eprint={TBD},\n      archivePrefix={MLSys}\n}\n```\n\n# About [ByteDance Seed Team](https://team.doubao.com/)\n\nFounded in 2023, the ByteDance Seed Team is dedicated to building the industry's most advanced AI foundation models. The team aspires to become a world-class research team and to make significant contributions to the advancement of science and society.\n\n---\n\n# Discussion\n\u003cimg src=\"asset/wechat-group-temporal.png\" width=\"200\" height=\"200\" alt=\"WeChat discussion group\"\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FByteDance-Seed%2FTriton-distributed","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FByteDance-Seed%2FTriton-distributed","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FByteDance-Seed%2FTriton-distributed/lists"}