{"id":13604198,"url":"https://github.com/MachineLearningSystem/varuna","last_synced_at":"2025-04-11T23:31:59.572Z","repository":{"id":185462007,"uuid":"513461039","full_name":"MachineLearningSystem/varuna","owner":"MachineLearningSystem","description":null,"archived":false,"fork":true,"pushed_at":"2022-03-05T20:00:35.000Z","size":117889,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2024-11-07T08:42:20.367Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"microsoft/varuna","license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MachineLearningSystem.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-07-13T09:34:00.000Z","updated_at":"2022-07-11T16:37:20.000Z","dependencies_parsed_at":"2023-08-02T05:32:18.444Z","dependency_job_id":null,"html_url":"https://github.com/MachineLearningSystem/varuna","commit_stats":null,"previous_names":["machinelearningsystem/varuna"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fvaruna","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fvaruna/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fvaruna/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fvaruna/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MachineLearningSystem","download_url":"https://codeload.github.com/MachineLearningSyste
m/varuna/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248495053,"owners_count":21113557,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T19:00:41.480Z","updated_at":"2025-04-11T23:31:58.535Z","avatar_url":"https://github.com/MachineLearningSystem.png","language":null,"readme":"# _Varuna_\n\n_Varuna_ is a tool for efficient training of large DNN models on commodity GPUs and networking. It implements a combination of pipeline parallelism and data parallelism in PyTorch, and enables training on a changing set of resources smoothly.\n\nThis repository is an implementation of the paper:\n\n[\"Varuna: Scalable, Low-cost Training of Massive Deep Learning Models\"](https://arxiv.org/abs/2111.04007), to appear in EuroSys'22.\n\n## Setup \u0026 Installation\n\nVaruna requires python 3, [PyTorch](https://pytorch.org/get-started/locally/) (1.5+) and [apex](https://github.com/NVIDIA/apex). \n\nThe patch `apex.patch` in this directory needs to be applied to apex before building it. 
Varuna's code and this patch have been tested for [this commit](https://github.com/NVIDIA/apex/commit/0c2c6eea6556b208d1a8711197efc94899e754e1) of apex.\n~~~~\ngit clone https://github.com/NVIDIA/apex\ncp apex.patch /path/to/apex/\ncd /path/to/apex\ngit apply apex.patch\npip install -v --disable-pip-version-check --no-cache-dir --global-option=\"--cpp_ext\" --global-option=\"--cuda_ext\" ./\n~~~~\nTo install, clone this repository, cd into it and run\n~~~~\npython setup.py install\n~~~~\n## Running\n\nVaruna trains large DNN models by parallelising them into sequential pipeline stages and data-parallel replicas across a set of GPUs. These methods are called pipeline parallelism and data parallelism, respectively.\nTo enable parallel training with Varuna, there are several steps the user must follow. Detailed docs are in the `docs/` folder as webpages (`html/index.html`) or PDF (`varuna.pdf`).\n\nExamples of models working with Varuna can also be found in `examples/`. Please see this folder to run examples with BERT and Megatron-LM.\n\nSome of the steps are briefly described below.\n\n### CutPoint demarcation\n\nVaruna slices a DNN model into sequential pipeline stages. For this, the model should be annotated with Varuna `CutPoint` instances between different operations/parts of model computation. These are nn.Module instances that are potential slice points in the model. For each CutPoint, Varuna can either ignore it or activate it as a partition boundary. 
(see [CutPoints](docs/html/cutpoint.html))\n\n~~~~\nfrom varuna import CutPoint\n\nclass SampleModel(nn.Module):\n  def __init__(...):\n    ....\n    self.cutpoints = [CutPoint() for i in range(num_cutpoints)]\n    ....\n\n  def forward(input...):\n    input = self.some_operation(input)\n    input = self.cutpoints[0](input)     # marked as a potential stage boundary\n    input = self.some_other_operation(input)\n    ....\n    for i in range(sub_modules):\n      x = sub_module_i(input, ...)\n      x = self.cutpoints[i+1](x)        # each cutpoint instance should be used only once in a model\n    ....\n\n~~~~\n\nOperations separated by CutPoints should preferably have no shared modules/parameters. For weight sharing between different parts of the module, you should register separate nn.Parameter instances (even for the same tensor) and pass the pair of parameter names as shared_weights to Varuna.\n\n### Wrapping the model in Varuna\n\nThe nn.Module for your DNN instance should be wrapped in a `Varuna` instance before training and before optimizer creation. (see [Varuna](docs/html/varuna.html)) Wrapping in `Varuna` returns a model partitioned according to the given stage_to_rank_map (which is passed by the varuna launcher) and moved to the GPU. After this initialization, each rank in the job has only the parts of the model required by it. Varuna internally handles fp16 mixed precision training and shared parameters (such as the initial and last embedding weights in BERT/GPT-2). \nOptimizer creation should be after this since it requires model parameters as input. The optimizer needs to be registered with Varuna using a setter.\n~~~~\n    model = MyModel()             # full model on CPU\n    # provide dummy input function to varuna for initialization. 
Inputs must be in dictionary form.\n    def get_batch_fn(size, device=None):\n        batch = dataset[:size]\n        if device is not None:\n          batch = [t.to(device) for t in batch]\n        inputs, mask = batch\n        return {'inputs': inputs, 'mask': mask, 'extra_norm': True}\n\n    shared_weights = [(\"language_model.embedding.word_embeddings.weight\",\"lm_head_weight\")]  # parameter sharing between stages\n    model = Varuna(model, args.stage_to_rank_map, get_batch_fn, global_batch_size, \n                        args.chunk_size, args.fp16, local_rank=args.local_rank, \n                        device=args.local_rank, shared_weights=shared_weights)\n\n    # now model is a subset of the original model, moved to the GPU on each process\n\n    optimizer = get_optimizer(model)\n    model.set_optimizer(optimizer)\n\n~~~~\n\n### Training loop\n\nThe Varuna training loop does not require separate forward \u0026 backward steps; the script may just call the `step` function. The input to this function should be of the per-process batch size (batch_size / data_parallel_workers) and should be a dictionary of arg names and values. The step function makes micro-batches out of this input batch, completes the fwd/bwd pipeline schedule and reduces the gradients/overflow over the whole job, returning the loss and an overflow boolean. For example, with a global batch size of 1024 and 8 data-parallel workers, each process passes 1024/8 = 128 samples to `step`; with a chunk_size of 4, these are split into 128/4 = 32 micro-batches. 
\n\n~~~~\n\ninputs = {\n    \"input_ids\": tokens,\n    \"position_ids\": position_ids,\n    \"attention_mask\": attention_mask,\n    \"loss_mask\": loss_mask,\n    \"labels\": labels\n}\n\nloss, overflow = model.step(inputs)\nloss = torch.Tensor([loss])\n\nif not overflow:\n  optimizer.step()\n\n~~~~\n\n### Launcher and Arguments\n\nTo launch a distributed training job with Varuna, use the run_varuna.py script as follows:\n~~~~\npython -m varuna.run_varuna --machine_list \u003cfile_with_ips\u003e --gpus_per_node \u003cnum_gpus_per_node\u003e --batch-size \u003ctotal_effective_batch_size\u003e --nstages \u003cnumber_of_pipeline_stages\u003e --chunk_size \u003cmicro_batch_size_for_pipeline\u003e --code_dir \u003cworking_dir_for_training\u003e user_training_script.py \u003c...user args...\u003e\n~~~~\nSee [Launching varuna](docs/html/launching.html).\n\nThis expects all machines in the machine_list to be set up with the necessary code/libraries in code_dir and to have gpus_per_node working GPUs. The job is launched with all workers running the user_training_script and args.\n\nThe launcher passes a few arguments to the user training script for Varuna. These should be passed during `Varuna` initialisation in the Python script:\n* rank: process rank in the overall distributed job\n* local_rank: process rank within the local node\n* stage_to_rank_map: Varuna config info about stage placement\n* chunk_size: micro-batch size for the Varuna pipeline\n* batch-size: per-process batch size\n\n### Changing resources: job morphing\n\nVaruna enables training on a changing set of nodes/GPUs. It does this by monitoring the machine_list text file, which holds the IPs of the nodes available at any time.\nTraining jobs are launched from a long-lived manager.\nOn detecting a change, Varuna checkpoints, stops and relaunches the job from the manager. To allow for this on-demand checkpoint/stop, Varuna relies on user signals (SIGUSR1 on Unix). 
The user therefore needs to add a simple handler for this signal to their training script.\nSee [Morphing](docs/html/morphing.html).\n\n~~~~\nimport signal\n\nif __name__ == \"__main__\":\n\n    def handler(signum, _):\n        save_checkpoint(iteration, model, optimizer, lr_scheduler)\n        exit()\n\n    signal.signal(signal.SIGUSR1, handler)\n~~~~\n\n### Profiling, config selection\n\nVaruna supports auto-configuration of the data-parallel and pipeline-parallel dimensions, which saves the user from running and comparing different configs for performance. To enable this, the user needs to run one-time profiling of the model and network conditions using the `Profiler` class in Varuna.\nSee [Varuna Profiling](docs/html/profiler.html).\nThe profiler is instantiated similarly to `Varuna` and runs as a distributed process:\n\n~~~~\nmodel = BuildModel()\nprofiler = Profiler(model, args.device, fp16=args.fp16)\n\ndef get_batch(size):\n  # function to get sample batches of a given size for profiling\n  return batch\n\nprofiler.initialize(get_batch)\nmicrobatches_to_profile = list(range(1, max_micro_BS))\nprofile = profiler.profile_all(get_batch, microbatches_to_profile, out_folder=args.save)\n\n~~~~\n\nEach process profiles the compute of different cutpoints and at the same time measures communication with other processes. This builds and saves a profile of the model in the specified location, from where it can be accessed by the `AutoConfig` class. `AutoConfig` calculates the different configs for a given number of GPUs and simulates them using information from the pre-built profile, comparing them and returning the best-performing setting in a few seconds. 
This calculation is triggered by run_varuna when the nstages and chunk_size arguments are omitted and a profile location is passed.\n","funding_links":[],"categories":["Paper-Code"],"sub_categories":["Parallellism Training"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2Fvaruna","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FMachineLearningSystem%2Fvaruna","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2Fvaruna/lists"}