https://github.com/alpa-projects/alpa

Training and serving large-scale neural networks with auto parallelization.
alpa auto-parallelization compiler deep-learning distributed-computing distributed-training high-performance-computing jax llm machine-learning

Host: GitHub
URL: https://github.com/alpa-projects/alpa
Owner: alpa-projects
License: apache-2.0
Created: 2021-02-22T03:21:23.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2023-12-09T16:26:31.000Z (over 1 year ago)
Last Synced: 2024-05-23T01:11:56.792Z (about 1 year ago)
Topics: alpa, auto-parallelization, compiler, deep-learning, distributed-computing, distributed-training, high-performance-computing, jax, llm, machine-learning
Language: Python
Homepage: https://alpa.ai
Size: 7.11 MB
Stars: 2,990
Watchers: 45
Forks: 343
Open Issues: 75
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-local-ai - Alpa - Alpa is a system for training and serving large-scale neural networks. (Training)
awesome - alpa-projects/alpa - Training and serving large-scale neural networks with auto parallelization. (Python)
awesome-distributed-ml - Alpa: Auto Parallelization for Large-Scale Neural Networks
awesome-lm-system - Alpa
awesome-ray - Alpa - Auto parallelization for large-scale neural networks using Jax, XLA, and Ray (Models and Projects / Ray + JAX / TPU)
awesome-Auto-Parallelism - Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | UC Berkeley, Google, etc. | [arxiv](https://arxiv.org/pdf/2201.12023.pdf) | Jax, XLA | 2022 | Integer linear programming for intra-operator, dynamic programming for inter-operator (Data Parallelism + Model Parallelism (or Tensor Parallelism) + Pipeline Parallelism)
awesome-technostructure - alpa-projects/alpa - Training and serving large-scale neural networks with auto parallelization. ([:robot: machine-learning](https://github.com/stars/ketsapiwiq/lists/robot-machine-learning))

README

**Note: Alpa is not actively maintained currently. It is available as a research artifact. The core algorithm in Alpa has been merged into XLA, which is still being maintained. https://github.com/openxla/xla/tree/main/xla/hlo/experimental/auto_sharding**

[![CI](https://github.com/alpa-projects/alpa/actions/workflows/ci.yml/badge.svg)](https://github.com/alpa-projects/alpa/actions/workflows/ci.yml)

[![Build Jaxlib](https://github.com/alpa-projects/alpa/actions/workflows/build_jaxlib.yml/badge.svg)](https://github.com/alpa-projects/alpa/actions/workflows/build_jaxlib.yml)

[**Documentation**](https://alpa-projects.github.io) | [**Slack**](https://forms.gle/YEZTCrtZD6EAVNBQ7)

Alpa is a system for training and serving large-scale neural networks.

Scaling neural networks to hundreds of billions of parameters has enabled dramatic breakthroughs such as GPT-3, but training and serving these large-scale neural networks require complicated distributed system techniques.

Alpa aims to automate large-scale distributed training and serving with just a few lines of code.

The key features of Alpa include:  

💻 **Automatic Parallelization**. Alpa automatically parallelizes users' single-device code on distributed clusters with data, operator, and pipeline parallelism. 

🚀 **Excellent Performance**. Alpa achieves linear scaling when training models with billions of parameters on distributed clusters.

✨ **Tight Integration with Machine Learning Ecosystem**. Alpa is backed by open-source, high-performance, and production-ready libraries such as [Jax](https://github.com/google/jax), [XLA](https://www.tensorflow.org/xla), and [Ray](https://github.com/ray-project/ray).
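
To make two of the parallelism styles named above more concrete, here is a toy, pure-Python sketch. It is not Alpa code and uses no Alpa API: "devices" are simulated with list slices, the model is a one-weight linear regressor, and every function name is hypothetical. Data parallelism splits the batch and averages per-shard gradients; operator parallelism splits a single matrix multiplication across the columns of the weight matrix.

```python
# Toy illustration only: "devices" are simulated with plain Python lists.

def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def data_parallel_grad(w, xs, ys, num_devices):
    """Split the batch into equal shards, take one gradient per shard, average."""
    shard = len(xs) // num_devices  # assumes the batch divides evenly
    grads = [
        grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_devices)
    ]
    return sum(grads) / num_devices

def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def operator_parallel_matmul(a, b, num_devices):
    """Give each device a slice of b's columns, then concatenate the outputs."""
    cols = len(b[0]) // num_devices  # assumes the columns divide evenly
    parts = [
        matmul(a, [row[i * cols:(i + 1) * cols] for row in b])
        for i in range(num_devices)
    ]
    return [sum((part[r] for part in parts), []) for r in range(len(a))]

xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
full = grad_mse(0.5, xs, ys)
sharded = data_parallel_grad(0.5, xs, ys, num_devices=2)

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]
split_result = operator_parallel_matmul(a, b, num_devices=2)
```

In both cases the partitioned computation gives the same result as the single-device one (`sharded` matches `full`, and `split_result` matches `matmul(a, b)`). Choosing such partitions by hand for a real model is the work that Alpa aims to automate.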

## Serving

The code below shows how to use the Hugging Face Transformers interface with Alpa's distributed backend for large-model inference.

Detailed documentation is in [Serving OPT-175B using Alpa](https://alpa-projects.github.io/tutorials/opt_serving.html).

```python
from transformers import AutoTokenizer
from llm_serving.model.wrapper import get_model

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-2.7b")
tokenizer.add_bos_token = False

# Load the model. Alpa automatically downloads the weights to the specified path
model = get_model(model_name="alpa/opt-2.7b", path="~/opt_weights/")

# Generate
prompt = "Paris is the capital city of"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output = model.generate(input_ids=input_ids, max_length=256, do_sample=True)
generated_string = tokenizer.batch_decode(output, skip_special_tokens=True)

print(generated_string)
```

## Training

Use Alpa's decorator ``@parallelize`` to scale your single-device training code to distributed clusters.

Check out the [documentation](https://alpa-projects.github.io) site and [examples](https://github.com/alpa-projects/alpa/tree/main/examples) folder for installation instructions, tutorials, examples, and more.

```python
import alpa
import jax.numpy as jnp
from jax import grad

# Parallelize the training step in Jax by simply using a decorator
@alpa.parallelize
def train_step(model_state, batch):
    def loss_func(params):
        out = model_state.forward(params, batch["x"])
        return jnp.mean((out - batch["y"]) ** 2)

    grads = grad(loss_func)(model_state.params)
    new_model_state = model_state.apply_gradient(grads)
    return new_model_state

# The training loop now automatically runs on your designated cluster.
# create_train_state and data_loader are placeholders for your own code.
model_state = create_train_state()
for batch in data_loader:
    model_state = train_step(model_state, batch)
```
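
The snippet above leaves `create_train_state` and `data_loader` undefined. As a minimal sketch of what they could look like, the block below fills them in with a hypothetical `TrainState` class for a linear model and a synthetic data generator. It assumes only JAX is installed and uses `jax.jit` as a single-device stand-in for `@alpa.parallelize`, so it runs without a cluster. The class, its fields, the learning rate, and `make_data_loader` are illustrative assumptions, not part of Alpa's API.

```python
from typing import NamedTuple

import jax
import jax.numpy as jnp


class TrainState(NamedTuple):
    """Hypothetical container mirroring the `model_state` object used above."""
    params: dict
    learning_rate: float = 0.1

    def forward(self, params, x):
        # Linear model: one weight per input feature plus a scalar bias.
        return x @ params["w"] + params["b"]

    def apply_gradient(self, grads):
        # One plain SGD update applied to every parameter leaf.
        new_params = jax.tree_util.tree_map(
            lambda p, g: p - self.learning_rate * g, self.params, grads)
        return self._replace(params=new_params)


def create_train_state():
    return TrainState(params={"w": jnp.zeros(3), "b": jnp.zeros(())})


def make_data_loader(num_batches=200, batch_size=32, seed=0):
    """Yield synthetic batches drawn from y = x @ [1, -2, 3] + 0.5."""
    true_w = jnp.array([1.0, -2.0, 3.0])
    key = jax.random.PRNGKey(seed)
    for _ in range(num_batches):
        key, subkey = jax.random.split(key)
        x = jax.random.normal(subkey, (batch_size, 3))
        yield {"x": x, "y": x @ true_w + 0.5}


@jax.jit  # assumption: single-device stand-in for @alpa.parallelize
def train_step(model_state, batch):
    def loss_func(params):
        out = model_state.forward(params, batch["x"])
        return jnp.mean((out - batch["y"]) ** 2)

    grads = jax.grad(loss_func)(model_state.params)
    return model_state.apply_gradient(grads)


model_state = create_train_state()
for batch in make_data_loader():
    model_state = train_step(model_state, batch)
```

Because the training step is a pure function of `model_state` and `batch`, the body stays the same whichever decorator compiles it. That is the property the `@alpa.parallelize` example relies on.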

## Learning more

- [Papers](docs/publications/publications.rst)

- [Google AI blog](https://ai.googleblog.com/2022/05/alpa-automated-model-parallel-deep.html)

- [OSDI 2022 talk slides](https://docs.google.com/presentation/d/1CQ4S1ff8yURk9XmL5lpQOoMMlsjw4m0zPS6zYDcyp7Y/edit?usp=sharing)

- [ICML 2022 big model tutorial](https://sites.google.com/view/icml-2022-big-model/home)

- [GTC 2023 talk video](https://www.nvidia.com/en-us/on-demand/session/gtcspring23-s51337/)

## Getting Involved

- Connect to Alpa developers via the [Alpa slack](https://forms.gle/YEZTCrtZD6EAVNBQ7).

- Please read the [contributor guide](https://alpa-projects.github.io/developer/developer_guide.html) if you are interested in contributing code.

## License

Alpa is licensed under the [Apache-2.0 license](https://github.com/alpa-projects/alpa/blob/main/LICENSE).
