{"id":21389450,"url":"https://github.com/feifeibear/odysseus-transformer","last_synced_at":"2025-07-13T15:33:13.947Z","repository":{"id":242867302,"uuid":"810207285","full_name":"feifeibear/Odysseus-Transformer","owner":"feifeibear","description":"Odysseus: Playground of LLM Sequence Parallelism","archived":false,"fork":false,"pushed_at":"2024-06-17T08:29:19.000Z","size":479,"stargazers_count":49,"open_issues_count":1,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-08-21T17:00:39.287Z","etag":null,"topics":["llm","megatron-lm","pytorch"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/feifeibear.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-04T08:54:12.000Z","updated_at":"2024-08-13T06:36:20.000Z","dependencies_parsed_at":"2024-06-05T12:57:44.300Z","dependency_job_id":"30c908c0-9a0a-4319-9410-ac1ecc1115cb","html_url":"https://github.com/feifeibear/Odysseus-Transformer","commit_stats":null,"previous_names":["feifeibear/odysseus-transformer"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/feifeibear%2FOdysseus-Transformer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/feifeibear%2FOdysseus-Transformer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/feifeibear%2FOdysseus-Transformer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/feifeibear%2FOdysseus-Tran
sformer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/feifeibear","download_url":"https://codeload.github.com/feifeibear/Odysseus-Transformer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225896453,"owners_count":17541499,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llm","megatron-lm","pytorch"],"created_at":"2024-11-22T12:26:36.550Z","updated_at":"2024-11-22T12:26:37.117Z","avatar_url":"https://github.com/feifeibear.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Odysseus: Playground of LLM Sequence Parallelism \nThis repository is a playground for various sequence parallelism implementations.\nIt explores a set of parallelization strategies for long-sequence LLMs and implements four methods: \n1. [Tensor Parallelism with Sequence Parallelism (TP-SP)](https://arxiv.org/abs/2205.05198); see the MLSys '23 paper: Reducing Activation Recomputation in Large Transformer Models.\n2. [DeepSpeed-Ulysses](https://arxiv.org/abs/2309.14509); see the paper: DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models. It uses the implementation from [feifeibear/long-context-attention](https://github.com/feifeibear/long-context-attention).\n3. [Ring-Attention](https://arxiv.org/abs/2310.01889); see the paper: Ring Attention with Blockwise Transformers for Near-Infinite Context. 
It uses the implementation from [zhuzilin/ring-flash-attention](https://github.com/zhuzilin/ring-flash-attention).\n4. Odysseus, a novel method proposed in this repository.\n\nAs illustrated in the figure below, **Odysseus**, our sequence parallelization strategy, decouples the parallelization of Attention and MLP within Transformers. \nFor Attention, it uses TP-SP to split the Q, K, V, and O Linear weights, applies AllGather to input tensors and ReduceScatter to output tensors, and partitions activations along the sequence dimension. \nThe MLP uses naive sequence parallelism: the input is split along the sequence dimension, so no communication on activations is required, but gradients must be synchronized during backpropagation.\n**The communication cost of Odysseus is higher than that of Ulysses at large GPU scale.**\n**The communication cost of Odysseus is lower than that of TP-SP in long-sequence scenarios.**\nOdysseus can be used orthogonally with Ring-Attention.\n\n\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"./media/Odysseus.jpg\" alt=\"Image description\"\u003e\n\u003c/div\u003e\n\nThe communication and memory costs of these methods are summarized in the table below. RS stands for ReduceScatter and AG stands for AllGather. L is the sequence length, d is the hidden dimension, i is the intermediate hidden size, with GPT-4 having i = 4d, and N is the number of GPUs.\n\nWhen the sequence length $L$ exceeds the intermediate hidden size $i$ ($L \u003e i$), Odysseus+ZeRO3 has a lower communication cost than TP-SP and Ulysses+ZeRO3. 
Notably, all three methods maintain similar memory consumption.\n\n| Method          | Activation Comm | Activation Comm Volume | Gradient Comm | Gradient Comm Volume | Mem Activation | Mem Param/Grad |\n|-----------------|------------|--------------|----------|--------------------------|------------|------------|\n| TP              | 2AllReduce | 8O(Ld)       | 0        | 0                        | full       | 1/N        |\n| TP-SP           | 6RS+4AG    | 10O(Ld)       | 0        | 0                        | 1/N        | 1/N        |\n| Ulysses+ZeRO3   | 8All2All   | 8O(Ld)/N      | RS+2AG (Full) | 4O($d^2$)+3O(di)           | 1/N        | 1/N      |\n| Ring+ZeRO3      | P2Ps       | 4O(Ld)       | RS+2AG (Full) | 4O($d^2$)+3O(di)           | 1/N        | 1/N      |\n| Odysseus+ZeRO3  | 3RS+2AG    | 5O(Ld)       | RS+2AG (MLP) | 3O(di) | 1/N        | 1/N        |\n\n\nWe benchmarked the four methods on 8xA100 GPUs, with a global batch size of 1 and without gradient checkpointing or offloading. The elapsed time and memory usage are presented below. **The results differ from the analysis presented above.**\n\n1. Odysseus and TP-SP demonstrate better memory efficiency than Ulysses and Ring. Although the four methods are theoretically equivalent in memory consumption, we suspect that FSDP's memory efficiency is inferior to manually partitioning the weights of Linear layers.\n2. Odysseus and TP-SP exhibit similar speed. However, the Odysseus MLP ZeRO modules, where both the AG (AllGather) and RS (ReduceScatter) operations are synchronous, still have room for improvement by applying asynchronous versions.\n\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"./media/odysseus_perf.png\" alt=\"Image description\"\u003e\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"./media/ody_perf_2.png\" alt=\"Image description\"\u003e\n\u003c/div\u003e\n\n### Usage\n1. Install the dependencies listed in requirements.txt\n2. 
Install [feifeibear/long-context-attention](https://github.com/feifeibear/long-context-attention) and [zhuzilin/ring-flash-attention](https://github.com/zhuzilin/ring-flash-attention).\n3. Run `bash run.sh`\n\n\n### Acknowledgements\n\n[jzhang38/EasyContext](https://github.com/jzhang38/EasyContext)\n\n### Citation\n\nIf you use Odysseus in your project, I kindly request that you acknowledge my contribution with the following citation.\n\n```\n@misc{fang2024odysseus,\n  title={Odysseus: Upgrade DeepSpeed-Ulysses by Decoupling the Parallel Strategies of Attention and MLP},\n  author={Fang, Jiarui},\n  howpublished={\\url{https://github.com/feifeibear/Odysseus-Transformer}},\n  year={2024},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffeifeibear%2Fodysseus-transformer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffeifeibear%2Fodysseus-transformer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffeifeibear%2Fodysseus-transformer/lists"}