# Multipack Sampler

The Multipack sampler is designed for padding-free distributed training of large language models. It uses an approximate solution to the identical-machines scheduling problem to maximize batch-packing efficiency. On the OpenChat V1 training set, it achieves >99% theoretical efficiency, while the interleaved sampler achieves only ~75%.

## V2 Update

Multipack V2 reduces the complexity of the packing algorithm from `O(n k log n)` to `O(n log k log n)` without degrading packing efficiency, achieving better throughput for large numbers of nodes.

The V2 release also provides two variants with different packing optimization objectives:

- `MultipackDistributedBatchSampler`: Designed for models with quadratic attention. It optimizes packing efficiency while also balancing long and short sequences across nodes, to minimize the difference in quadratic load.
- `MultipackDistributedBatchSampler_LinearAttention`: For models with linear attention. It considers only packing efficiency and achieves higher efficiency than the quadratic variant; however, it tends to place all long sequences on a single node.

## Benchmark

Please refer to `test_multipack.ipynb`.

- Efficiency: percentage of the actual batch size relative to the maximum batch size

  = `number of tokens per batch / max capacity of tokens per batch`

- Utilization: useful work as a fraction of total node time (all nodes wait for the slowest node)

  = `number of tokens per batch / (max number of tokens on a single node * node count)`

- L^2 lag: spread of the quadratic load between nodes

  = `sqrt(max over nodes(sum length^2) - min over nodes(sum length^2))`

```
OpenChat V1 (testdata.json)

Sampler Multipack QuadraticAttention:
Batch count for ranks: [37, 37, 37, 37, 37, 37, 37, 37]
Packing Time: 20ms

L^2 lag avg: 438 max: 717
Efficiency: 98.16%
Utilization: 99.70%
==========

Sampler Multipack LinearAttention:
Batch count for ranks: [36, 36, 36, 36, 36, 36, 36, 36]
Packing Time: 18ms

L^2 lag avg: 6500 max: 6761
Efficiency: 99.64%
Utilization: 99.64%
==========

Sampler Interleaved:
Batch count for ranks: [48, 48, 48, 48, 48, 48, 48, 48]
Packing Time: 0ms

L^2 lag avg: 1914 max: 2000
Efficiency: 75.67%
Utilization: 96.79%
==========
```

## Usage

Compatible with the PyTorch `DataLoader`:

```python
import numpy as np
from torch.utils.data import DataLoader

from multipack_sampler import MultipackDistributedBatchSampler

batch_max_len = 16 * 2048  # batch size * max context length

# Token length of each sample in the dataset
lengths = np.array([len(tokens) for tokens in data])

sampler = MultipackDistributedBatchSampler(
    batch_max_length=batch_max_len,
    lengths=lengths,
    seed=0,
)

dataloader = DataLoader(data, batch_sampler=sampler)
```

## License

MIT
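The efficiency, utilization, and L^2 lag formulas above can be computed directly from the per-node sequence lengths of a packed batch. A minimal sketch, assuming each node's batch is represented as a list of sequence lengths (the function name is illustrative, not part of the library):

```python
import math

def packing_stats(node_lengths, batch_max_length):
    """Compute packing metrics for one distributed batch.

    node_lengths: list of lists, the sequence lengths packed onto each node.
    batch_max_length: per-node token capacity.
    """
    node_count = len(node_lengths)
    tokens_per_node = [sum(ls) for ls in node_lengths]
    total_tokens = sum(tokens_per_node)

    # Efficiency: tokens actually packed / total capacity across all nodes
    efficiency = total_tokens / (batch_max_length * node_count)

    # Utilization: tokens packed / effective work when every node
    # has to wait for the slowest (largest) node
    utilization = total_tokens / (max(tokens_per_node) * node_count)

    # L^2 lag: spread of the quadratic (attention) load between nodes
    sq_loads = [sum(l * l for l in ls) for ls in node_lengths]
    l2_lag = math.sqrt(max(sq_loads) - min(sq_loads))

    return efficiency, utilization, l2_lag
```

For example, two nodes packed as `[3, 5]` and `[4, 4]` with a capacity of 10 tokens each give an efficiency of 0.8 (16 of 20 tokens used) and a utilization of 1.0 (both nodes carry 8 tokens).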
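For intuition about the bin-packing step underlying padding-free batching, here is a sketch of a simple first-fit-decreasing packer. This is *not* the library's algorithm (Multipack uses a faster approximate solver for the identical-machines scheduling problem, with distributed balancing on top); it only illustrates the idea of filling fixed-capacity batches with whole sequences instead of padding:

```python
def pack_ffd(lengths, batch_max_length):
    """First-fit-decreasing: pack sequence indices into batches whose total
    token count never exceeds batch_max_length (illustrative only)."""
    # Consider sequences longest-first; long sequences are hardest to place.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)

    batches = []  # list of lists of sequence indices
    loads = []    # parallel list: current token count of each batch
    for i in order:
        # Put the sequence into the first batch where it still fits...
        for b, load in enumerate(loads):
            if load + lengths[i] <= batch_max_length:
                batches[b].append(i)
                loads[b] += lengths[i]
                break
        else:
            # ...or open a new batch if it fits nowhere.
            batches.append([i])
            loads.append(lengths[i])
    return batches
```

With `lengths = [5, 3, 7, 2]` and a capacity of 8, this yields three batches: `[2]`, `[0, 1]`, and `[3]`; every sequence is placed and no batch exceeds the capacity.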