https://github.com/volcengine/veScale

A PyTorch Native LLM Training Framework

llm-training pytorch

Host: GitHub
URL: https://github.com/volcengine/veScale
Owner: volcengine
License: apache-2.0
Created: 2024-02-26T19:01:27.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2024-08-25T22:43:22.000Z (over 1 year ago)
Last Synced: 2024-12-01T14:08:59.881Z (about 1 year ago)
Topics: llm-training, pytorch
Language: Python
Homepage: http://vescale.xyz
Size: 2.52 MB
Stars: 674
Watchers: 34
Forks: 34
Open Issues: 4
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

Awesome Lists containing this project

awesome-production-machine-learning - veScale - veScale is a PyTorch native LLM training framework. (Computation and Communication Optimisation)
StarryDivineSky - volcengine/veScale
awesome-distributed-ml - veScale: A PyTorch Native LLM Training Framework

README

# Breaking Changes Coming Soon ...

# A PyTorch Native LLM Training Framework

_**An Industrial-Level Framework for Ease of Use**_

- 🔥 **PyTorch Native**: veScale is rooted in PyTorch-native data structures, operators, and APIs, enjoying the ecosystem of PyTorch that dominates the ML world.

- 🛡 **Zero Model Code Change**: veScale decouples distributed system design from model architecture, requiring zero or near-zero changes to users' model code.

- 🚀 **Single Device Abstraction**: veScale provides single-device semantics to users, automatically distributing and orchestrating model execution across a cluster of devices.

- 🎯 **Automatic Parallelism Planning**: veScale parallelizes model execution with a combination of strategies (tensor, sequence, data, ZeRO, and pipeline parallelism) under semi- or full automation [coming soon].

- ⚡ **Eager & Compile Mode**: veScale supports not only Eager-mode automation for parallel training and inference but also Compile-mode for ultimate performance [coming soon].

- 📀 **Automatic Checkpoint Resharding**: veScale manages distributed checkpoints automatically with online resharding across different cluster sizes and different parallelism strategies. 
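
The checkpoint resharding described in the last bullet can be pictured with a toy example. The sketch below is not veScale code: `shard` and `reshard` are hypothetical helpers, and it assumes a flat tensor (a plain Python list here) split into contiguous, near-equal shards. It only shows the idea that a checkpoint saved by one cluster size can be read back by another.

```python
def shard(values, world_size):
    """Split a flat list into `world_size` contiguous, near-equal shards."""
    base, extra = divmod(len(values), world_size)
    shards, start = [], 0
    for rank in range(world_size):
        # The first `extra` ranks take one additional element.
        size = base + (1 if rank < extra else 0)
        shards.append(values[start:start + size])
        start += size
    return shards


def reshard(shards, new_world_size):
    """Rebuild the logical tensor, then split it for a new cluster size."""
    full = [v for s in shards for v in s]
    return shard(full, new_world_size)


weights = list(range(10))
saved = shard(weights, 4)    # checkpoint written by 4 ranks
loaded = reshard(saved, 2)   # same checkpoint read by 2 ranks

print(saved)   # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
print(loaded)  # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
```

This toy version materializes the full tensor for simplicity. A production system would more plausibly have each rank read only the byte ranges that overlap its new shard.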

## Latest News

- [2024-7-25] veScale's [pipeline parallelism](https://github.com/volcengine/veScale/blob/main/vescale/pipe/README.md) is open-sourced, with an API, graph parser, stage abstraction, schedules, and execution runtime, along with the [nD distributed timeline](https://github.com/volcengine/veScale/blob/main/vescale/ndtimeline/README.md).

- [2024-5-31] veScale's [fast checkpointing system](https://github.com/volcengine/veScale/blob/main/vescale/checkpoint/README.md) is open-sourced, with automatic checkpoint resharding, caching, load balancing, fast copying, deduplication, and asynchronous I/O.

- [2024-5-21] veScale's examples ([Mixtral](https://github.com/volcengine/veScale/tree/main/examples/mixtral_4D_training), [Llama 2](https://github.com/volcengine/veScale/tree/main/examples/llama2_4D_finetune), and [nanoGPT](https://github.com/volcengine/veScale/tree/main/examples/nanogpt_4D_finetune)) are open-sourced, with bit-wise correctness of training loss curves.

- [2024-5-13] veScale debuted at MLSys 2024 as a [poster](https://volcengine.github.io/veScaleWeb/blog/mlsys2024.html).

- [2024-4-16] Our [internal LLM training system](https://volcengine.github.io/veScaleWeb/blog/megascale.html) was presented at NSDI 2024.

## Coming Soon

_**veScale**_ is still in its early phase. We are refactoring our internal LLM training system components to meet open-source standards. The tentative timeline is as follows:

- High-level [nD parallel API](https://github.com/volcengine/veScale/issues/39) for extreme ease of use

- Power-user plan API for easy customization of nD parallel training

- End-to-end vescale/examples with 5D parallel training (TP, SP, DP, ZeRO, PP)
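
To make the idea of nD parallel training more concrete, here is a minimal sketch of how global ranks might be laid out on a multi-dimensional device mesh. It does not use veScale's API. The axis order `(PP, DP, TP)`, the row-major layout, and the helper names are all assumptions chosen for illustration. SP and ZeRO are left out because they commonly reuse the TP and DP groups rather than adding mesh axes, though that is a convention, not something this README states.

```python
# Assumed axis order, for illustration only: (PP, DP, TP), with TP varying fastest.
MESH_SHAPE = (2, 2, 2)  # 8 devices in total


def rank_to_coords(rank, mesh_shape):
    """Unravel a global rank into row-major mesh coordinates."""
    coords = []
    for dim in reversed(mesh_shape):
        rank, idx = divmod(rank, dim)
        coords.append(idx)
    return tuple(reversed(coords))


def coords_to_rank(coords, mesh_shape):
    """Inverse of rank_to_coords."""
    rank = 0
    for idx, dim in zip(coords, mesh_shape):
        rank = rank * dim + idx
    return rank


def group_along(rank, mesh_shape, axis):
    """Ranks that differ from `rank` only along the given mesh axis."""
    coords = list(rank_to_coords(rank, mesh_shape))
    group = []
    for idx in range(mesh_shape[axis]):
        coords[axis] = idx
        group.append(coords_to_rank(coords, mesh_shape))
    return group


print(rank_to_coords(5, MESH_SHAPE))   # (1, 0, 1)
print(group_along(5, MESH_SHAPE, 2))   # TP group: [4, 5]
print(group_along(5, MESH_SHAPE, 1))   # DP group: [5, 7]
print(group_along(5, MESH_SHAPE, 0))   # PP group: [1, 5]
```

Each group is the set of ranks that would communicate for one strategy: tensor-parallel collectives within `[4, 5]`, gradient reduction within `[5, 7]`, and pipeline stage hand-off within `[1, 5]`.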

## Table of Contents ([web view](https://volcengine.github.io/veScaleWeb/))

**[Introduction](./docs/texts/introduction.md)**

**[Quick Start](./docs/texts/quick-start.md)**

**[DTensor](./vescale/dtensor/README.md)**

**Parallel**

  * [Overview](./docs/texts/parallel_overview.md)

  * [Tensor Parallel & Sequence Parallel](./vescale/dmodule/README.md)

  * [Data Parallel](./vescale/ddp/README.md)

  * [Optimizer Parallel](./vescale/optim/README.md)

  * [Pipeline Parallel](./vescale/pipe/README.md)

  * [nD Device Mesh](./vescale/devicemesh_api/README.md)

**Plan**

  * [Auto TP & SP Plan](./vescale/dmp/README.md)

**[Checkpoint](./vescale/checkpoint/README.md)**

## [We Are Hiring!](https://volcengine.github.io/veScaleWeb/misc/join-us.html) ##

## [License](./LICENSE)

The veScale Project is under the Apache License v2.0.
