{"id":17760740,"url":"https://github.com/microsoft/nnscaler","last_synced_at":"2025-04-05T20:04:17.168Z","repository":{"id":238977181,"uuid":"783126023","full_name":"microsoft/nnscaler","owner":"microsoft","description":"nnScaler: Compiling DNN models for Parallel Training","archived":false,"fork":false,"pushed_at":"2025-02-14T02:41:56.000Z","size":1582,"stargazers_count":103,"open_issues_count":5,"forks_count":13,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-03-29T19:02:10.733Z","etag":null,"topics":["compiler","deep-learning","distributed-training","llm","machine-learning","parallel-computing"],"latest_commit_sha":null,"homepage":"https://nnscaler.readthedocs.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microsoft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":"SUPPORT.md","governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-04-07T02:26:05.000Z","updated_at":"2025-03-23T07:22:29.000Z","dependencies_parsed_at":"2024-05-09T09:10:22.496Z","dependency_job_id":"421f4916-32b4-4b7d-9196-daaaf882186e","html_url":"https://github.com/microsoft/nnscaler","commit_stats":null,"previous_names":["microsoft/nnscaler"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fnnscaler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fnnscaler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fnnscaler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/repositories/microsoft%2Fnnscaler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/microsoft","download_url":"https://codeload.github.com/microsoft/nnscaler/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247393566,"owners_count":20931812,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["compiler","deep-learning","distributed-training","llm","machine-learning","parallel-computing"],"created_at":"2024-10-26T19:11:32.133Z","updated_at":"2025-04-05T20:04:17.131Z","avatar_url":"https://github.com/microsoft.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cimg src=\"docs/source/images/nnScaler-c-1.png\" alt=\"drawing\" width=\"100\" align=\"left\"/\u003e  \n\nnnScaler: Compiling DNN models for Parallel Training over Multiple Devices\n==============\n\n\n# What is nnScaler?\n\n---------\nnnScaler is a parallelization engine that compiles a deep neural network (DNN) model designed for single-GPU execution into a program capable of running in parallel across multiple GPUs.\n\n\u003cimg src=\"docs/source/images/nnScaler_flow.png\" alt=\"drawing\" width=\"600\"/\u003e\n\n# Latest News\nnnScaler (code-named CUBE) has been adopted by multiple product and research projects. This section includes some of the latest news from the team and partner projects.\n* **2025-02-12** nnScaler 0.7 released: https://github.com/microsoft/nnscaler/releases/tag/0.7\n* **2024-10-07** Diff-Transformer utilizes nnScaler for 
differential attention mechanism: [DIFFERENTIAL TRANSFORMER](https://arxiv.org/abs/2410.05258)\n* **2024-05-09** YOCO utilizes nnScaler for long-sequence training: [(YOCO)You only cache once: Decoder-decoder architectures for language models](https://arxiv.org/abs/2405.05254)\n* **2024-04-22** Post training for the long context version of [Phi-3 series](https://arxiv.org/abs/2404.14219)\n* **2024-02-21** LongRoPE utilizes nnScaler to reduce both the training and inference costs: [LongRoPE: Extending LLM context window beyond 2 million tokens](https://arxiv.org/abs/2402.13753)\n\n### System Highlights:\n\n* Ease of Use: Only a few lines of code need to be changed to enable automated parallelization.\n* Pythonic: The parallelization output is in PyTorch code, making it easy for users to understand and convenient for further development or customization.\n* Extensibility: nnScaler exposes an API to support new operators for emerging models.\n* Reliability: Verified through various end-to-end training sessions, nnScaler is a dependable system.\n* Performance: By exploring a large parallelization space, nnScaler can significantly enhance parallel training performance.\n\n**_DNN scientists_** can concentrate on model design with PyTorch on a single GPU while leaving parallelization complexities to nnScaler. nnScaler introduces innovative parallelism techniques that surpass existing methods in performance. Additionally, nnScaler supports the extension of DNN modules with new structures or execution patterns, enabling users to parallelize their custom DNN models.\n\n**_DNN system experts_** can leverage nnScaler to explore new DNN parallelization mechanisms and policies for emerging models. They can provide user-defined functions for new operators that nnScaler does not recognize, which enables seamless parallelization of novel DNN models. 
One example is facilitating long-sequence support in LLMs.\n\n\n# Quick start\n\n---------\n\n## Installation\n\n### Prerequisite\n\nInstall the following packages before installing nnScaler:\n\n    Python \u003e= 3.9, \u003c 3.11 (3.10 is recommended)\n\n    PyTorch \u003e= 2.0, \u003c 2.4 (2.2.0 is recommended)\n\n### Install nnScaler from source\nRun the following commands in the nnScaler directory: \n\n    pip install -r requirements.txt\n    pip install -e .\n\nTo avoid *cppimport* errors, also add the nnScaler directory to the **PYTHONPATH** environment variable:\n\n    export NNSCALER_HOME=$(pwd)\n    export PYTHONPATH=${NNSCALER_HOME}:$PYTHONPATH\n\n[//]: # (Reference output: Successfully installed MarkupSafe-2.1.5 contourpy-1.3.0 cppimport-22.8.2 cycler-0.12.1 dill-0.3.8 filelock-3.15.4 fonttools-4.53.1 fsspec-2024.6.1 importlib-resources-6.4.4 jinja2-3.1.4 kiwisolver-1.4.5 mako-1.3.5 matplotlib-3.9.2 more-itertools-10.4.0 mpmath-1.3.0 networkx-3.3 numpy-2.1.0 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.6.68 nvidia-nvtx-cu12-12.1.105 packaging-24.1 pillow-10.4.0 psutil-6.0.0 pulp-2.9.0 pybind11-2.13.5 pyparsing-3.1.4 python-dateutil-2.9.0.post0 pyyaml-6.0.2 six-1.16.0 sympy-1.13.2 torch-2.4.0 tqdm-4.66.5 triton-3.0.0 typing-extensions-4.12.2)\n\n\n## Example Llama-3\n\n### Prerequisite for Llama-3\n\nInstall the packages required to run Llama-3. In addition, installing flash-attn requires a specific version of the CUDA library. For example, [CUDA V11.8](https://developer.nvidia.com/cuda-11-8-0-download-archive) is needed when using PyTorch 2.2.0. 
\n\n    python -m pip install transformers==4.40.0 flash-attn==2.5.5 tensorboard\n\n### Model Access\n\nRequest access to the Llama-3 model on [HuggingFace](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct). You will receive an access token, which should be set as an environment variable: \n\n    export HF_TOKEN=\u003cHUGGINGFACE_ACCESS_TOKEN\u003e\n\n### Code Changes for Parallelization\n\nYou can find all the example code at `examples/llama`. As shown below, a user needs to:\n* Wrap the Model: Include loss computation and other necessary components.\n* Configure Components: Set up the model, optimizer, and dataloader.\n* Initialize and Start: In the main function, create an nnScaler trainer with the above configurations and start the training process.\n\n```python\nimport torch\nfrom transformers import AutoModelForCausalLM\n\n# import the nnScaler built-in parallelization-capable trainer\nfrom nnscaler.cli.trainer import Trainer\n\n# wrap model to include loss computing, etc.\nclass WrapperModel(torch.nn.Module):\n    def __init__(self, model_id):\n        super().__init__()\n        self.model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation='flash_attention_2')\n\n    def forward(self, samples):\n        outputs = self.model.model(\n            input_ids=samples['net_input']['src_tokens'],\n            use_cache=False,\n            return_dict=False,\n        )\n        loss = torch.sum(chunk_linear_cross_entropy(outputs[0], self.model.lm_head.weight, samples['target'], ...))\n        return loss, samples['ntokens'], samples['nsentences']\n\ndef main(args):\n    # data config\n    dataloader_config = ...\n    \n    # model config\n    model_config = ModelConfig(\n        type=WrapperModel,\n        args={\n            'model_id': args.model_id,\n        },\n    )\n    # optimizer hyperparameters \n    optimizer_config = OptimizerConfig(\n        type=MixedPrecisionAdamW,\n        args={'lr': 2e-5, 'betas': (0.9, 0.95), 'weight_decay': 0.0, 'fused': True},\n        #...\n    )\n    #...\n    
\n    # setup trainer with configs of dataloader/model/optimizer, etc. \n    trainer = Trainer(train_args=TrainerArgs(\n            #...\n            model=model_config,\n            optimizer=optimizer_config,\n            dataloader=dataloader_config,\n            #...\n        ))\n    trainer.run()\n\n```\n\n### Run the example Llama-3 training\n\nNow start the example; nnScaler handles all parallelization tasks automatically. \n\n```shell\ncd examples/llama\n\n# prepare training data:\npython bookcorpus.py --data_path_or_name bookcorpus/bookcorpus --tokenizer_path_or_name meta-llama/Meta-Llama-3-8B-Instruct --save_path ./bookcorpus_llama3_4K --sequence_length 4096\n\n# build the mini model\npython create_mini_model.py --model_id meta-llama/Meta-Llama-3-8B-Instruct --output_id ./llama3_mini\n\n# compile and run using data parallelism + zero1\ntorchrun --nproc_per_node=2 train.py --plan_ngpus 1 --runtime_ngpus 2 --name llama3_debug --model_id ./llama3_mini --dataset_path ./bookcorpus_llama3_4K\n\n```\n\n## Example nanoGPT\n\nWe also provide an example to demonstrate how to parallelize a model through a [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/)-compatible interface in nnScaler.\n\n* Find the [nanoGPT](https://github.com/karpathy/nanoGPT) example in the nnScaler repo:\n```shell\n    cd examples/nanogpt\n```\n* Install nanoGPT's dependencies:\n```shell\n    pip install -r requirements.txt\n```\n* Prepare the dataset:\n```shell\n    python nanoGPT/data/shakespeare_char/prepare.py\n```\n* Test with Single GPU\n\nNow you can run `train_nnscaler.py` with [torchrun](https://pytorch.org/docs/stable/elastic/run.html):\n\n    torchrun --nproc_per_node=1 train_nnscaler.py nanoGPT/config/train_shakespeare_char.py\n\nThis will train a baby GPT model on a single GPU.\nIt will take several minutes and the best validation loss will be around 1.47.\n\n* Test with Multi-GPU\n\nBy default, nnScaler parallelizes a model over GPUs 
with _data parallelism_.\nIf you have 4 GPUs on one node:\n\n    torchrun --nproc_per_node=4 train_nnscaler.py nanoGPT/config/train_shakespeare_char.py\n\nOr if you have multiple nodes, for example 2 nodes with 4 GPUs each:\n\n    # on each node\n    torchrun --nnodes=2 --nproc_per_node=4 --rdzv-id=NNSCALER_NANOGPT --rdzv-backend=c10d --rdzv-endpoint=\u003cIP\u003e \\\n        train_nnscaler.py nanoGPT/config/train_shakespeare_char.py\n\nNOTE: The local batch size is fixed by default, so using more workers will result in a larger global batch size.\n\n💡 For advanced usages, please stay tuned for our future release.\n\n# Reference\n\n---------\nYou may find the Artifact Evaluation for OSDI'24 with the guidance [here](https://github.com/microsoft/nnscaler/tree/osdi24ae). \nPlease cite nnScaler in your publications if it helps your research:\n\n    @inproceedings{lin2024nnscaler,\n    title = {nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training},\n    author={Lin, Zhiqi and Miao, Youshan and Zhang, Quanlu and Yang, Fan and Zhu, Yi and Li, Cheng and Maleki, Saeed and Cao, Xu and Shang, Ning and Yang, Yilei and Xu, Weijiang and Yang, Mao and Zhang, Lintao and Zhou, Lidong},\n    booktitle={18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)},\n    pages={347--363},\n    year={2024}\n    }\n\n## Contributing\n\nThis project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.\n\nWhen you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. 
You will only need to do this once across all repos using our CLA.\n\nThis project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). For more information, see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.\n\n## Trademarks\n\nThis project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft's Trademark \u0026 Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third-party's policies.\n\n## Contact\n\nYou can find our public repo at \u003chttps://github.com/microsoft/nnscaler\u003e or the Microsoft internal repo at \u003chttps://aka.ms/ms-nnscaler\u003e.\nFor any questions or inquiries, please contact us at [nnscaler@service.microsoft.com](mailto:nnscaler@service.microsoft.com).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Fnnscaler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmicrosoft%2Fnnscaler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Fnnscaler/lists"}