Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/volcengine/veScale
A PyTorch Native LLM Training Framework
- Host: GitHub
- URL: https://github.com/volcengine/veScale
- Owner: volcengine
- License: apache-2.0
- Created: 2024-02-26T19:01:27.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2024-08-25T22:43:22.000Z (4 months ago)
- Last Synced: 2024-12-01T14:08:59.881Z (11 days ago)
- Topics: llm-training, pytorch
- Language: Python
- Homepage: http://vescale.xyz
- Size: 2.52 MB
- Stars: 674
- Watchers: 34
- Forks: 34
- Open Issues: 4
- Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
Awesome Lists containing this project
- awesome-production-machine-learning - veScale - veScale is a PyTorch native LLM training framework. (Training Orchestration)
- StarryDivineSky - volcengine/veScale
- awesome-distributed-ml - veScale: A PyTorch Native LLM Training Framework
README
# A PyTorch Native LLM Training Framework
_**An Industrial-Level Framework for Ease of Use**_
- 🔥 **PyTorch Native**: veScale is rooted in PyTorch-native data structures, operators, and APIs, enjoying the ecosystem of PyTorch that dominates the ML world.
- 🛡 **Zero Model Code Change**: veScale decouples distributed system design from model architecture, requiring zero or near-zero changes to users' model code.
- 🚀 **Single Device Abstraction**: veScale provides single-device semantics to users, automatically distributing and orchestrating model execution in a cluster of devices.
- 🎯 **Automatic Parallelism Planning**: veScale parallelizes model execution with a synergy of strategies (tensor, sequence, data, ZeRO, pipeline parallelism) under semi- or full-automation [coming soon].
- ⚡ **Eager & Compile Mode**: veScale supports not only Eager-mode automation for parallel training and inference but also Compile-mode for ultimate performance [coming soon].
- 📀 **Automatic Checkpoint Resharding**: veScale manages distributed checkpoints automatically with online resharding across different cluster sizes and different parallelism strategies.
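The resharding idea behind the last bullet can be illustrated with a minimal, framework-agnostic sketch: a parameter saved as N shards on one cluster is logically reassembled and re-split into M shards for a different cluster size. veScale's actual checkpoint system does this online with deduplication, load balancing, and asynchronous I/O; the function names and flat-list representation below are purely illustrative assumptions.

```python
# Illustrative sketch of checkpoint resharding (NOT veScale's API):
# a parameter saved as N contiguous shards is re-split into M shards.

def shard(param, num_shards):
    """Split a flat parameter (here a plain list) into near-equal contiguous shards."""
    base, extra = divmod(len(param), num_shards)
    shards, start = [], 0
    for rank in range(num_shards):
        size = base + (1 if rank < extra else 0)  # first `extra` ranks get one more
        shards.append(param[start:start + size])
        start += size
    return shards

def reshard(shards, new_num_shards):
    """Reassemble shards from one cluster size and re-split for another."""
    full = [x for s in shards for x in s]  # logical all-gather of the saved shards
    return shard(full, new_num_shards)

# Parameter saved from a 4-device run, reloaded on a 3-device cluster.
saved = shard(list(range(10)), 4)   # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
reloaded = reshard(saved, 3)        # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

The real system avoids materializing the full tensor on any single device, but the gather-then-resplit invariant is the same.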
## Latest News
- [2024-7-25] veScale's [pipeline parallelism](https://github.com/volcengine/veScale/blob/main/vescale/pipe/README.md) open sourced with API, graph parser, stage abstraction, schedules and execution runtime along with [nD distributed timeline](https://github.com/volcengine/veScale/blob/main/vescale/ndtimeline/README.md).
- [2024-5-31] veScale's [fast checkpointing system](https://github.com/volcengine/veScale/blob/main/vescale/checkpoint/README.md) open sourced with automatic checkpoint resharding, caching, load-balancing, fast copying, deduplicating, and asynchronous io.
- [2024-5-21] veScale's examples ([Mixtral](https://github.com/volcengine/veScale/tree/main/examples/mixtral_4D_training), [LLama2](https://github.com/volcengine/veScale/tree/main/examples/llama2_4D_finetune), and [nanoGPT](https://github.com/volcengine/veScale/tree/main/examples/nanogpt_4D_finetune)) open sourced with bit-wise correctness of training loss curves.
- [2024-5-13] The debut of veScale in MLSys 2024 as a [poster](https://volcengine.github.io/veScaleWeb/blog/mlsys2024.html).
- [2024-4-16] Our [internal LLM training system](https://volcengine.github.io/veScaleWeb/blog/megascale.html) presented in NSDI 2024.
## Coming Soon
_**veScale**_ is still in its early phase. We are refactoring our internal LLM training system components to meet open-source standards. The tentative timeline is as follows:
- High-level [nD parallel api](https://github.com/volcengine/veScale/issues/39) for extreme ease of use
- Power-user plan api for easy customization of nD parallel training
- End-to-end vescale/examples with 5D parallel training (TP, SP, DP, ZeRO, PP)
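One basic constraint on any such nD layout is that the parallel degrees must factor the total device count. The sketch below checks that constraint for a hypothetical 5D configuration; the dimension names mirror the strategies listed above (with SP commonly reusing the TP group and ZeRO sharding over the DP group, as in Megatron-style designs), but the function and its signature are assumptions, not veScale's API.

```python
# Hypothetical validation of a 5D parallel layout (TP, SP, DP, ZeRO, PP).
# SP typically shares the TP process group and ZeRO shards over the DP group,
# so only tp * dp * pp must multiply to the world size.

def validate_layout(world_size, tp, dp, pp):
    if tp * dp * pp != world_size:
        raise ValueError(
            f"tp({tp}) * dp({dp}) * pp({pp}) = {tp * dp * pp} != {world_size}")
    return {"tp/sp": tp, "dp/zero": dp, "pp": pp}

# 16 devices split as 2-way TP/SP, 4-way DP/ZeRO, 2-way PP.
layout = validate_layout(world_size=16, tp=2, dp=4, pp=2)
```

An invalid factorization (e.g. `tp=3` on 16 devices) would raise before any process groups are built, which is the cheap place to catch it.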
## Table of Contents ([web view](https://volcengine.github.io/veScaleWeb/))
**[Introduction](./docs/texts/introduction.md)**
**[Quick Start](./docs/texts/quick-start.md)**
**[DTensor](./vescale/dtensor/README.md)**
**Parallel**
* [Overview](./docs/texts/parallel_overview.md)
* [Tensor Parallel & Sequence Parallel](./vescale/dmodule/README.md)
* [Data Parallel](./vescale/ddp/README.md)
* [Optimizer Parallel](./vescale/optim/README.md)
* [Pipeline Parallel](./vescale/pipe/README.md)
* [nD Device Mesh](./vescale/devicemesh_api/README.md)

**Plan**
* [Auto TP & SP Plan](./vescale/dmp/README.md)

**[Checkpoint](./vescale/checkpoint/README.md)**
## [We Are Hiring!](https://volcengine.github.io/veScaleWeb/misc/join-us.html)
## [License](./LICENSE)
The veScale Project is under the Apache License v2.0.