https://github.com/hpcaitech/elixir

Elixir: Train a Large Language Model on a Small GPU Cluster
https://github.com/hpcaitech/elixir

efficient large-language-models memory-management

Last synced: over 1 year ago
JSON representation

Elixir: Train a Large Language Model on a Small GPU Cluster

Host: GitHub
URL: https://github.com/hpcaitech/elixir
Owner: hpcaitech
Created: 2023-02-12T16:11:51.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2023-06-08T06:14:37.000Z (about 3 years ago)
Last Synced: 2025-03-27T17:46:17.891Z (over 1 year ago)
Topics: efficient, large-language-models, memory-management
Language: Python
Homepage:
Size: 257 KB
Stars: 14
Watchers: 2
Forks: 5
Open Issues: 1
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

          # Elixir (Gemini2.0)

Elixir, also known as Gemini, is a technology designed to facilitate the training of large models on a small GPU cluster.

Its goal is to eliminate data redundancy and leverage CPU memory to accommodate really large models.

In addition, Elixir automatically profiles each training step prior to execution and selects the optimal configuration for the ratio of redundancy and the device for each parameter.

This repository is used to benchmark the performance of Elixir.

Elixir will be integrated into ColossalAI for usability.

## Environment

This version is a beta release, so the running environment is somewhat restrictive.

We are only demonstrating our running environment here, as we have not yet tested its compatibility.

We have set the CUDA version to `11.6` and the PyTorch version to `1.13.1+cu11.6`.

Three dependent package should be installed from source.

- [ColossalAI](https://github.com/hpcaitech/ColossalAI) (necessary): just clone it and use `pip install .` from the newest master branch.

- [Apex](https://github.com/NVIDIA/apex) (optional): clone it, checkout to tag `22.03`, and install it.

- [Xformers](https://github.com/facebookresearch/xformers) (optional): clone it, checkout to tag `v0.0.17`, and install it.

Finally, install all packages in the `requirements.txt`.

## Tools

### CUDA Memory Profiling

Function `cuda_memory_profiling` in `elixir.tracer.memory_tracer` can help you profile each kind of memory occupation during training.

It tells you the CUDA memory occupation of parameters, gradient and maximum size of activations generated during training.

Moreover, it is an efficient and fast tool which enables quickly profiling OPT-175B model on a single GPU.

You can try it by yourself with the folder `activation` in the directory `example`.

(I think you should have at least 16GB CUDA memory to run the OPT-175B example but that doesn't matter. Just try it first.)

### Hardware Performance Profiling

See the folder `profile`.

You can profile the aggregate bandwidth of GPU-CPU communications and the aggreagte velocity of Adam optimizers.

## Examples

Here is a simple example to wrap your model and optimizer for [fine-tuning](https://github.com/hpcaitech/Elixir/tree/main/example/fine-tune).

```python

from elixir.search import minimum_waste_search

from elixir.wrapper import ElixirModule, ElixirOptimizer

model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, eps=1e-8)

sr = minimum_waste_search(model, world_size)

model = ElixirModule(model, sr, world_group)

optimizer = ElixirOptimizer(model, optimizer)

```

Here is an advanced example for performance, which is used in our [benchmarkhere](https://github.com/hpcaitech/Elixir/blob/main/example/common/elx.py).

```python

import torch

import torch.distributed as dist

from colossalai.nn.optimizer import HybridAdam

from elixir.wrapper import ElixirModule, ElixirOptimizer

# get the world communication group

global_group = dist.GroupMember.WORLD

# get the communication world size

global_size = dist.get_world_size()

# initialize the model in CPU

model = get_model(model_name)

# HybridAdam allows a part of parameters updated on CPU and a part updated on GPU

optimizer = HybridAdam(model.parameters(), lr=1e-3)

sr = optimal_search(

    model,

    global_size,

    unified_dtype=torch.float16,  # enable for FP16 training

    overlap=True,  # enable for overlapping communications

    verbose=True,  # print detailed processing information

    inp=data,  # proivde an example input data in dictionary format

    step_fn=train_step  # provide an example step function

)

model = ElixirModule(

    model,

    sr,

    global_group,

    prefetch=True,  # prefetch chunks to overlap communications

    dtype=torch.float16,  # use AMP

    use_fused_kernels=True  # enable fused kernels in Apex

)

optimizer = ElixirOptimizer(

    model,

    optimizer,

    initial_scale=64,  # loss scale used in AMP

    init_step=True  # enable for the stability of training

)

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/hpcaitech/elixir

Awesome Lists containing this project

README