{"id":13451295,"url":"https://github.com/huggingface/nanotron","last_synced_at":"2025-05-13T23:06:15.696Z","repository":{"id":218022166,"uuid":"690106318","full_name":"huggingface/nanotron","owner":"huggingface","description":"Minimalistic large language model 3D-parallelism training","archived":false,"fork":false,"pushed_at":"2025-05-12T11:55:52.000Z","size":15614,"stargazers_count":1854,"open_issues_count":112,"forks_count":187,"subscribers_count":47,"default_branch":"main","last_synced_at":"2025-05-12T12:39:22.794Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/huggingface.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-09-11T14:40:28.000Z","updated_at":"2025-05-12T08:17:27.000Z","dependencies_parsed_at":"2024-02-10T08:24:26.051Z","dependency_job_id":"0424a4c9-dfc9-4eea-9233-464c4e7b7410","html_url":"https://github.com/huggingface/nanotron","commit_stats":{"total_commits":840,"total_committers":23,"mean_commits":36.52173913043478,"dds":0.6273809523809524,"last_synced_commit":"2cde8f63519f15aa449803441ec70225644bd25d"},"previous_names":["huggingface/nanotron"],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fnanotron","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fnanotron/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hugg
ingface%2Fnanotron/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fnanotron/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/huggingface","download_url":"https://codeload.github.com/huggingface/nanotron/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253744528,"owners_count":21957309,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T07:00:51.302Z","updated_at":"2025-05-13T23:06:10.676Z","avatar_url":"https://github.com/huggingface.png","language":"Python","funding_links":[],"categories":["Python","LLM Training / Finetuning","A01_文本生成_文本对话","微调 Fine-Tuning","Model Training and Orchestration","LLM Training Frameworks","Training","7. Training \u0026 Fine-tuning Ecosystem","HuggingFace SmolLM (v2 Oct. 
2024)","Librerías para usar NLP en español"],"sub_categories":["大语言对话模型及数据","Modelos de Embeddings para Sentence Similarity y Semantic Search"],"readme":"\u003ch1 align=\"center\"\u003e⚡️ Nanotron\u003c/h1\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://github.com/huggingface/nanotron/releases\"\u003e\n        \u003cimg alt=\"GitHub release\" src=\"https://img.shields.io/github/release/huggingface/nanotron.svg\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/huggingface/nanotron/blob/master/LICENSE\"\u003e\n        \u003cimg alt=\"License\" src=\"https://img.shields.io/github/license/huggingface/nanotron.svg?color=green\"\u003e\n    \u003c/a\u003e\n\u003c/p\u003e\n\n\u003ch4 align=\"center\"\u003e\n    \u003cp\u003e\n        \u003ca href=\"#installation\"\u003eInstallation\u003c/a\u003e •\n        \u003ca href=\"#quick-start\"\u003eQuick Start\u003c/a\u003e •\n        \u003ca href=\"#features\"\u003eFeatures\u003c/a\u003e •\n        \u003ca href=\"#benchmarks\"\u003eBenchmarks\u003c/a\u003e •\n        \u003ca href=\"CONTRIBUTING.md\"\u003eContributing\u003c/a\u003e\n    \u003cp\u003e\n\u003c/h4\u003e\n\n\u003ch3 align=\"center\"\u003e\n    \u003ca href=\"https://huggingface.co/nanotron\"\u003e\u003cimg style=\"float: middle; padding: 10px 10px 10px 10px;\" width=\"60\" height=\"55\" src=\"https://huggingface.co/datasets/huggingface/brand-assets/resolve/main/hf-logo.png\" /\u003e\u003c/a\u003e\n\u003c/h3\u003e\n\u003ch3 align=\"center\"\u003e\n\u003cp\u003ePretraining models made easy\n\u003c/h3\u003e\n\nNanotron is a library for pretraining transformer models. It provides a simple and flexible API to pretrain models on custom datasets. Nanotron is designed to be easy to use, fast, and scalable. It is built with the following principles in mind:\n\n- **Simplicity**: Nanotron is designed to be easy to use. 
It provides a simple and flexible API to pretrain models on custom datasets.\n- **Performance**: Optimized for speed and scalability, Nanotron uses the latest techniques to train models faster and more efficiently.\n\n📚 **Check out our [Ultrascale Playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook)** - A comprehensive guide to efficiently scale LLM training with Nanotron!\n\n## Installation\n\nTo run the code in this project, first create a Python virtual environment using e.g. `uv`:\n\n\n```shell\nuv venv nanotron --python 3.11 \u0026\u0026 source nanotron/bin/activate \u0026\u0026 uv pip install --upgrade pip\n```\n\n\u003e [!TIP]\n\u003e For Hugging Face cluster users, add `export UV_LINK_MODE=copy` to your `.bashrc` to suppress cache warnings from `uv`\n\nNext, install PyTorch:\n\n```shell\nuv pip install torch --index-url https://download.pytorch.org/whl/cu124\n```\n\nThen install the core dependencies with:\n\n```shell\nuv pip install -e .\n```\n\nTo run the example scripts, install the remaining dependencies as follows:\n\n```shell\nuv pip install datasets transformers datatrove[io] numba wandb\n# Fused kernels\nuv pip install ninja triton \"flash-attn\u003e=2.5.0\" --no-build-isolation\n```\n\nNext, log in to your Hugging Face and Weights and Biases accounts as follows:\n\n```shell\nhuggingface-cli login\nwandb login\n```\n\nFinally, check whether your system has Git LFS installed so that you can load and push models/datasets to the Hugging Face Hub:\n\n```shell\ngit-lfs --version\n```\n\nIf it isn't installed, run:\n\n```shell\nsudo apt-get install git-lfs\n```\n\n\n## Quick Start\n\n### Training a tiny Llama model\n\nThe following command will train a tiny Llama model on a single node of 8 x H100s in about 10 minutes:\n\n```shell\nCUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=8 run_train.py --config-file examples/config_tiny_llama.yaml\n```\n\nThe model will be saved in the `checkpoints` directory as specified in the config 
file.\n\n\u003e [!NOTE]\n\u003e You can use `examples/config_tiny_llama.py` to generate your own training config.\n\nFor detailed instructions on training your first model, check out our [Your First Training guide](docs/your-first-training.md). For multi-node training with Slurm, see our [Multi-Node Training guide](docs/multi-node-training.md).\n\n### Run generation from your checkpoint\n\n```shell\ntorchrun --nproc_per_node=1 run_generate.py --ckpt-path checkpoints/{checkpoint_number}/ --tp 1 --pp 1\n```\n\nIncrease the value of `--tp` (tensor parallel) to accelerate generation with multiple GPUs, and use a larger value of `--pp` (pipeline parallel) for very large models.\n\n### Debugging with VSCode\nTo debug with VSCode, add the following configuration to your `launch.json` file:\n\n```json\n{\n    \"name\": \"run_train.py\",\n    \"type\": \"python\",\n    \"request\": \"launch\",\n    \"program\": \"torchrun\", // or full path to torchrun by running `which torchrun`\n    \"console\": \"integratedTerminal\",\n    \"justMyCode\": false,\n    \"args\": [\n        \"--nproc_per_node=2\",\n        \"run_train.py\",\n        \"--config-file=examples/config_tiny_llama.yaml\", // or use examples/config_tiny_llama.py to generate your own config\n    ],\n    \"env\": {\n        // \"NANOTRON_BENCHMARK\": \"1\", // enable to benchmark your training for a couple of steps\n        \"CUDA_DEVICE_MAX_CONNECTIONS\": \"1\",\n        \"WANDB_MODE\": \"disabled\",\n    }\n},\n```\n\u003e [!NOTE]\n\u003e For more information, see [Debugging Nanotron example (on multiple GPUs)](/examples/contributor-guide/README.md#debugging-nanotron-example-on-multiple-gpus).\n\n### Custom examples\nYou can find more examples in the [`/examples`](/examples) directory:\n\u003c!-- Make a table of the examples we support --\u003e\n| Example | Description |\n| --- | --- |\n| `custom-dataloader` | Plug a custom dataloader into nanotron |\n| `datatrove` | Use the datatrove library to load data |\n| `doremi` | Use 
DoReMi to speed up training |\n| `mamba` | Train an example Mamba model |\n| `moe` | Train an example Mixture-of-Experts (MoE) model |\n| `mup` | Use spectral µTransfer to scale up your model |\n| `examples/config_tiny_llama_with_s3_upload.yaml` | Automatically upload checkpoints to S3 |\n\nWe're working on adding more examples. Feel free to open a PR to add your own example. 🚀\n\n## Benchmarks\n\nWe've conducted extensive benchmarking of Nanotron across various model sizes and configurations. The complete benchmark data, configurations, and logs are available in our [ultrascale-playbook-data](https://huggingface.co/datasets/nanotron/ultrascale-playbook-data/tree/main) repository.\n\n![Model Efficiency Benchmarks](docs/benchmark_summary.svg)\n\nThe diagram above showcases the best configurations we discovered for each model size and node count in nanotron v0.5, highlighting optimal MFU (Model FLOPs Utilization) and memory usage. These represent the most efficient training setups identified through our comprehensive benchmarking process. Stay tuned for even more optimizations coming soon! 
🚀\n\nFor detailed analysis and best practices derived from these benchmarks, see our [Ultrascale Playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook).\n\n## Features\nWe currently support the following features:\n- [x] 3D parallelism (DP+TP+PP)\n- [x] Expert parallelism for MoEs\n- [x] AFAB and 1F1B schedules for PP\n- [x] Explicit APIs for TP and PP that enable easy debugging\n- [x] ZeRO-1 optimizer\n- [x] FP32 gradient accumulation\n- [x] Parameter tying/sharding\n- [x] Custom module checkpointing for large models\n- [x] Spectral µTransfer parametrization for scaling up neural networks\n- [x] Mamba example\n\nAnd we have on our roadmap:\n- [ ] FP8 training\n- [ ] ZeRO-3 optimizer (a.k.a. FSDP)\n- [ ] `torch.compile` support\n- [ ] Ring attention\n- [ ] Interleaved 1F1B schedule\n\n## Credits\nWe would like to thank everyone working on LLMs, especially those sharing their work openly from which we took great inspiration: Nvidia for `Megatron-LM/apex`, Microsoft for `DeepSpeed`, HazyResearch for `flash-attn`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhuggingface%2Fnanotron","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhuggingface%2Fnanotron","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhuggingface%2Fnanotron/lists"}