Fast-LLM

[![Docker][ci-badge]][ci-workflow]
[![Documentation][docs-badge]][docs-workflow]
[![License][license-badge]][license]

*Accelerating your LLM training to full speed*

Made with ❤️ by [ServiceNow Research][servicenow-research]

## Overview

Fast-LLM is a cutting-edge open-source library for training large language models with exceptional speed, scalability, and flexibility. Built on [PyTorch][pytorch] and [Triton][triton], Fast-LLM empowers AI teams to push the limits of generative AI, from research to production.

Optimized for training models of all sizes—from small 1B-parameter models to massive 70B+ models trained across large clusters—Fast-LLM delivers faster training, lower costs, and seamless scalability. Its fine-tuned kernels, advanced parallelism techniques, and efficient memory management make it the go-to choice for diverse training needs.

As a truly open-source project, Fast-LLM allows full customization and extension without proprietary restrictions. Developed transparently by a community of professionals on GitHub, the library benefits from collaborative innovation, with every change discussed and reviewed in the open to ensure trust and quality. Fast-LLM combines professional-grade tools with unified support for GPT-like architectures, offering the cost efficiency and flexibility that serious AI practitioners demand.

> [!NOTE]
> Fast-LLM is not affiliated with Fast.AI, FastHTML, FastAPI, FastText, or other similarly named projects. Our library's name refers to its speed and efficiency in language model training.

## Why Fast-LLM?

1. 🚀 **Fast-LLM is Blazingly Fast**:
- ⚡️ Optimized kernel efficiency and reduced overheads.
- 🔋 Optimized memory usage for best performance.
- ⏳ Minimizes training time and cost.

2. 📈 **Fast-LLM is Highly Scalable**:
- 📡 Distributed training across multiple GPUs and nodes using 3D parallelism (Data, Tensor, and Pipeline).
- 🔗 Supports sequence length parallelism to handle longer sequences effectively.
- 🧠 ZeRO-1, ZeRO-2, and ZeRO-3 implementations for improved memory efficiency.
- 🎛️ Mixed precision training support for better performance.
- 🏋️‍♂️ Large batch training and gradient accumulation support.
- 🔄 Reproducible training with deterministic behavior.

3. 🎨 **Fast-LLM is Incredibly Flexible**:
- 🤖 Compatible with all common language model architectures in a unified class.
- ⚡ Efficient dropless Mixture-of-Experts (MoE) implementation with SoTA performance.
- 🧩 Customizable language model architectures, data loaders, loss functions, and optimizers (in progress).
- 🤗 Seamless integration with [Hugging Face Transformers][transformers].

4. 🎯 **Fast-LLM is Super Easy to Use**:
- 📦 [Pre-built Docker images](https://github.com/ServiceNow/Fast-LLM/pkgs/container/fast-llm) for quick deployment.
- 📝 Simple YAML configuration for hassle-free setup.
- 💻 Command-line interface for easy launches.
- 📊 Detailed logging and real-time monitoring features.
- 📚 Extensive [documentation][docs] and practical tutorials (in progress).

5. 🌐 **Fast-LLM is Truly Open Source**:
- ⚖️ Licensed under [Apache 2.0][license] for maximum freedom to use Fast-LLM at work, in your projects, or for research.
- 💻 Transparently developed on GitHub with public [roadmap][roadmap] and [issue tracking][issues].
- 🤝 Contributions and collaboration are always welcome!
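To make the ZeRO bullet above concrete, here is a back-of-envelope memory model. This is a sketch following the ZeRO paper's standard accounting, not Fast-LLM's actual allocator: with fp16 weights and gradients plus fp32 Adam states (master weights, momentum, variance), each parameter costs roughly 16 bytes before sharding, and each ZeRO stage shards one more of those components across GPUs.

```python
def zero_mem_gb(n_params: float, n_gpus: int, stage: int) -> float:
    """Approximate per-GPU memory (GB) for parameters, gradients, and Adam
    optimizer states under mixed precision, per the ZeRO paper's accounting.
    Activations and temporary buffers are excluded."""
    P, G, O = 2, 2, 12  # bytes/param: fp16 weights, fp16 grads, fp32 Adam states
    if stage == 0:                       # no sharding (plain data parallelism)
        per_param = P + G + O
    elif stage == 1:                     # ZeRO-1: shard optimizer states
        per_param = P + G + O / n_gpus
    elif stage == 2:                     # ZeRO-2: also shard gradients
        per_param = P + (G + O) / n_gpus
    elif stage == 3:                     # ZeRO-3: also shard parameters
        per_param = (P + G + O) / n_gpus
    else:
        raise ValueError(f"unknown ZeRO stage: {stage}")
    return n_params * per_param / 1e9

# Example: a 7B-parameter model on 32 GPUs.
for stage in range(4):
    print(f"ZeRO-{stage}: {zero_mem_gb(7e9, 32, stage):.1f} GB/GPU")
```

For a 7B model on 32 GPUs this drops the states from 112 GB/GPU (unsharded) to 3.5 GB/GPU under ZeRO-3, which is why these stages matter for fitting large models in 80 GB of HBM.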

## Usage

We'll walk you through how to use Fast-LLM to train a large language model on a cluster with multiple nodes and GPUs. We'll show an example setup using a Slurm cluster and a Kubernetes cluster.

For this demo, we will train a Mistral-7B model from scratch for 100 steps on random data. The config file `examples/mistral-4-node-benchmark.yaml` is pre-configured for a multi-node setup with 4 DGX nodes, each with 8 A100-80GB or H100-80GB GPUs.
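As a rough illustration of what such a YAML looks like, here is a sketch. The field names below are placeholders, not Fast-LLM's actual configuration schema; consult `examples/mistral-4-node-benchmark.yaml` in the repository for the real keys.

```yaml
# Hypothetical sketch only: field names are illustrative placeholders,
# not Fast-LLM's actual configuration schema.
model:
  base: mistral-7b        # architecture to instantiate from scratch
training:
  train_iters: 100        # the 100-step benchmark run
  batch_size: 32          # global batch (sequences per step)
  sequence_length: 8192
distributed:
  nodes: 4
  gpus_per_node: 8        # 4 x 8 = 32 A100/H100 GPUs
data:
  source: random          # synthetic tokens for benchmarking
```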

> [!NOTE]
> Fast-LLM scales from a single GPU to large clusters. You can start small and expand based on your resources.

Expect to see a significant speedup in training time compared to other libraries! For training Mistral-7B, Fast-LLM is expected to achieve a throughput of **9,800 tokens/s/H100** (batch size 32, sequence length 8k) on a 4-node cluster with 32 H100s.
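That figure is easy to sanity-check with a little arithmetic, assuming (as the setup implies) that the batch size of 32 is the global batch across all 32 GPUs:

```python
# Back-of-envelope check of the quoted throughput figure.
# Assumption: "batch size 32" is the global batch across the 32 GPUs.
gpus = 32
tok_per_s_per_gpu = 9_800    # quoted throughput per H100
batch_size = 32              # sequences per optimizer step (assumed global)
seq_len = 8_192              # "8k" tokens per sequence

tokens_per_step = batch_size * seq_len             # 262,144 tokens/step
cluster_tok_per_s = gpus * tok_per_s_per_gpu       # 313,600 tokens/s total
step_time_s = tokens_per_step / cluster_tok_per_s  # seconds per step
tokens_100_steps = 100 * tokens_per_step           # tokens in the demo run

print(f"step time ~{step_time_s:.2f}s; "
      f"100-step run = {tokens_100_steps / 1e6:.1f}M tokens")
```

At that rate each step takes under a second, so the 100-step demo processes about 26M tokens in roughly a minute and a half of compute time.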

### Running Fast-LLM on a Slurm Cluster

#### Prerequisites

- A [Slurm](https://slurm.schedmd.com/) cluster with at least 4 DGX nodes with 8 A100-80GB or H100-80GB GPUs each.
- CUDA 12.1 or higher.
- Dependencies: [PyTorch][pytorch], [Triton][triton], and [Apex](https://github.com/NVIDIA/apex) installed on all nodes.

#### Steps

1. Deploy the [nvcr.io/nvidia/pytorch:24.07-py3](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) Docker image to all nodes (recommended), as it ships with all the necessary dependencies.
2. Install Fast-LLM on all nodes:

```bash
sbatch <