{"id":26148828,"url":"https://github.com/hpcaitech/elixir","last_synced_at":"2025-04-14T03:41:07.822Z","repository":{"id":149017623,"uuid":"600793745","full_name":"hpcaitech/Elixir","owner":"hpcaitech","description":"Elixir: Train a Large Language Model on a Small GPU Cluster","archived":false,"fork":false,"pushed_at":"2023-06-08T06:14:37.000Z","size":263,"stargazers_count":14,"open_issues_count":1,"forks_count":5,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-27T17:46:17.891Z","etag":null,"topics":["efficient","large-language-models","memory-management"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hpcaitech.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-02-12T16:11:51.000Z","updated_at":"2025-03-24T16:01:48.000Z","dependencies_parsed_at":null,"dependency_job_id":"e7e48344-c2ef-45fa-a230-6af3532c803f","html_url":"https://github.com/hpcaitech/Elixir","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hpcaitech%2FElixir","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hpcaitech%2FElixir/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hpcaitech%2FElixir/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hpcaitech%2FElixir/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hpcaitech","download_url":"https://codel
oad.github.com/hpcaitech/Elixir/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248818525,"owners_count":21166438,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["efficient","large-language-models","memory-management"],"created_at":"2025-03-11T05:21:52.740Z","updated_at":"2025-04-14T03:41:07.818Z","avatar_url":"https://github.com/hpcaitech.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Elixir (Gemini2.0)\nElixir, also known as Gemini, is a technology designed to facilitate the training of large models on a small GPU cluster.\nIts goal is to eliminate data redundancy and leverage CPU memory to accommodate very large models.\nIn addition, Elixir automatically profiles each training step prior to execution and selects the optimal configuration for the ratio of redundancy and the device for each parameter.\nThis repository is used to benchmark the performance of Elixir.\nElixir will be integrated into ColossalAI for usability.\n\n## Environment\n\nThis version is a beta release, so the running environment is somewhat restrictive.\nWe only describe the environment we used, as we have not yet tested compatibility with other setups.\nWe have set the CUDA version to `11.6` and the PyTorch version to `1.13.1+cu11.6`.\n\nThree dependencies should be installed from source.\n- [ColossalAI](https://github.com/hpcaitech/ColossalAI) (required): just clone it and run `pip install .` on the latest master branch.\n- [Apex](https://github.com/NVIDIA/apex) (optional): clone 
it, check out tag `22.03`, and install it.\n- [Xformers](https://github.com/facebookresearch/xformers) (optional): clone it, check out tag `v0.0.17`, and install it.\n\nFinally, install all packages in `requirements.txt`.\n\n## Tools\n\n### CUDA Memory Profiling\n\nThe function `cuda_memory_profiling` in `elixir.tracer.memory_tracer` profiles each kind of memory usage during training.\nIt reports the CUDA memory occupied by parameters and gradients, and the maximum size of activations generated during training.\nMoreover, it is fast and efficient enough to profile the OPT-175B model on a single GPU.\nYou can try it yourself with the folder `activation` in the directory `example`.\n\n(You probably need at least 16GB of CUDA memory to run the OPT-175B example, but just try it first.)\n\n### Hardware Performance Profiling\n\nSee the folder `profile`.\nYou can profile the aggregate bandwidth of GPU-CPU communications and the aggregate velocity of Adam optimizers.\n\n## Examples\n\nHere is a simple example to wrap your model and optimizer for [fine-tuning](https://github.com/hpcaitech/Elixir/tree/main/example/fine-tune).\n\n```python\nimport torch\nfrom transformers import BertForSequenceClassification\nfrom elixir.search import minimum_waste_search\nfrom elixir.wrapper import ElixirModule, ElixirOptimizer\n\nmodel = BertForSequenceClassification.from_pretrained('bert-base-uncased')\noptimizer = torch.optim.Adam(model.parameters(), lr=1e-4, eps=1e-8)\n\nsr = minimum_waste_search(model, world_size)\nmodel = ElixirModule(model, sr, world_group)\noptimizer = ElixirOptimizer(model, optimizer)\n```\n\nHere is an advanced example for performance, which is used in our [benchmark](https://github.com/hpcaitech/Elixir/blob/main/example/common/elx.py).\n\n```python\nimport torch\nimport torch.distributed as dist\nfrom colossalai.nn.optimizer import HybridAdam\nfrom elixir.search import optimal_search\nfrom elixir.wrapper import ElixirModule, ElixirOptimizer\n\n# get the world communication group\nglobal_group = 
dist.GroupMember.WORLD\n# get the communication world size\nglobal_size = dist.get_world_size()\n\n# initialize the model on the CPU\nmodel = get_model(model_name)\n# HybridAdam allows some parameters to be updated on the CPU and the rest on the GPU\noptimizer = HybridAdam(model.parameters(), lr=1e-3)\n\nsr = optimal_search(\n    model,\n    global_size,\n    unified_dtype=torch.float16,  # enable for FP16 training\n    overlap=True,  # enable for overlapping communications\n    verbose=True,  # print detailed processing information\n    inp=data,  # provide example input data in dictionary format\n    step_fn=train_step  # provide an example step function\n)\nmodel = ElixirModule(\n    model,\n    sr,\n    global_group,\n    prefetch=True,  # prefetch chunks to overlap communications\n    dtype=torch.float16,  # use AMP\n    use_fused_kernels=True  # enable fused kernels in Apex\n)\noptimizer = ElixirOptimizer(\n    model,\n    optimizer,\n    initial_scale=64,  # loss scale used in AMP\n    init_step=True  # enable for the stability of training\n)\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhpcaitech%2Felixir","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhpcaitech%2Felixir","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhpcaitech%2Felixir/lists"}