{"id":18306496,"url":"https://github.com/blacksamorez/tensor_parallel","last_synced_at":"2025-12-14T17:42:47.032Z","repository":{"id":63617195,"uuid":"547971597","full_name":"BlackSamorez/tensor_parallel","owner":"BlackSamorez","description":"Automatically split your PyTorch models on multiple GPUs for training \u0026 inference","archived":false,"fork":false,"pushed_at":"2024-01-02T10:23:37.000Z","size":323,"stargazers_count":652,"open_issues_count":35,"forks_count":41,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-05-16T00:04:39.949Z","etag":null,"topics":["deep-learning","machine-learning","natural-language-processing","nlp","python","pytorch","pytorch-transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BlackSamorez.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2022-10-08T17:15:13.000Z","updated_at":"2025-05-04T02:01:42.000Z","dependencies_parsed_at":"2024-01-18T04:51:50.779Z","dependency_job_id":"f6557369-363a-4226-9d90-4c63c4d933c5","html_url":"https://github.com/BlackSamorez/tensor_parallel","commit_stats":{"total_commits":127,"total_committers":4,"mean_commits":31.75,"dds":"0.26771653543307083","last_synced_commit":"9d22bd9b3c2a9c1271288dc6775d74a168af6c51"},"previous_names":["blacksamorez/petals_local_parallel"],"tags_count":21,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BlackSamorez%2Ftensor_parallel","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BlackSamorez%2Ftensor_pa
rallel/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BlackSamorez%2Ftensor_parallel/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BlackSamorez%2Ftensor_parallel/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BlackSamorez","download_url":"https://codeload.github.com/BlackSamorez/tensor_parallel/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254442854,"owners_count":22071878,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","machine-learning","natural-language-processing","nlp","python","pytorch","pytorch-transformers"],"created_at":"2024-11-05T16:00:29.020Z","updated_at":"2025-12-14T17:42:46.972Z","avatar_url":"https://github.com/BlackSamorez.png","language":"Python","readme":"# tensor_parallel\n[![PyPI version](https://img.shields.io/pypi/v/tensor-parallel.svg?color=blue)](https://pypi.org/project/tensor-parallel/)\n[![Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n[![CI status](https://github.com/BlackSamorez/tensor_parallel/actions/workflows/run-tests.yaml/badge.svg?branch=main)](https://github.com/BlackSamorez/tensor_parallel/actions)\n\n\u003cp align=\"center\"\u003e\n    🚀 \u0026nbsp;\u003cb\u003e\u003ca href=\"https://www.kaggle.com/code/blacksamorez/tensor-parallel-int4-llm/\"\u003eTry new 40B LLMs demo in Kaggle\u003c/a\u003e\u003c/b\u003e\n\u003c/p\u003e\n\nRun large PyTorch models on multiple GPUs in one line of code with 
potentially linear speedup.\n\n```python\nimport transformers\nimport tensor_parallel as tp\ntokenizer = transformers.AutoTokenizer.from_pretrained(\"facebook/opt-13b\")\nmodel = transformers.AutoModelForCausalLM.from_pretrained(\"facebook/opt-13b\")  # use opt-125m for testing\n\nmodel = tp.tensor_parallel(model, [\"cuda:0\", \"cuda:1\"])  # \u003c- each GPU has half the weights\n\ninputs = tokenizer(\"A cat sat\", return_tensors=\"pt\")[\"input_ids\"].to(\"cuda:0\")\noutputs = model.generate(inputs, num_beams=5)\nprint(tokenizer.decode(outputs[0])) # A cat sat on my lap for a few minutes ...\n\nmodel(input_ids=inputs, labels=inputs).loss.backward()  # training works as usual\n```\n\n## Installation\nLatest stable version (recommended):\n```\npip install tensor_parallel\n```\nBleeding edge version:\n```\npip install https://github.com/BlackSamorez/tensor_parallel/archive/main.zip\n```\n\n\n## Usage\n\n\nSimply wrap your PyTorch model with `tp.tensor_parallel` and use it normally.\nFor best memory efficiency, call `tp.tensor_parallel` while the model is still on CPU.  
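Under the hood, tensor parallelism slices each layer's weight matrix across devices and gathers the partial results. As a toy, CPU-only sketch of the column-splitting idea (illustrative tensors only, not the library's internals):

```python
import torch

# Toy illustration of tensor parallelism: split a linear layer's weight
# matrix column-wise into two shards, multiply each shard separately
# (in the real library each shard lives on its own GPU), then concatenate.
torch.manual_seed(0)
x = torch.randn(4, 8)        # a batch of 4 inputs with 8 features
weight = torch.randn(8, 6)   # full weight matrix of one linear layer

full = x @ weight                  # reference: single-device result

w0, w1 = weight.chunk(2, dim=1)    # two column shards, one per "GPU"
part0 = x @ w0                     # would run on "cuda:0"
part1 = x @ w1                     # would run on "cuda:1"
sharded = torch.cat([part0, part1], dim=1)  # gather the partial outputs

assert torch.allclose(full, sharded, atol=1e-6)
```

Each device stores only its shard of the weights, which is why per-GPU memory drops roughly linearly with the number of devices.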
\n\nHere are a few use cases:\n- [`examples/training_flan-t5-xl.ipynb`](./examples/training_flan-t5-xl.ipynb) - fine-tune the full FLAN-T5 model on text summarization\n- [`tensor_parallel int8 LLM`](https://www.kaggle.com/code/blacksamorez/tensor-parallel-int8-llm/) - adapter-tuning a large language model with LLM.int8() + tensor_parallel\n- __TBA__ - defining a custom parallelism strategy\n\n\nAdvanced parameters to `tensor_parallel`:\n- `device_ids: List[device]` - which devices to use; defaults to all available GPUs\n- `output_device: device` - model outputs will be placed on this device\n- `tensor_parallel_config: tp.Config` - use a custom parallelism strategy, see [`slicing_configs.py`](./src/tensor_parallel/slicing_configs.py)\n- `distributed: bool` - if True, use the torch.distributed backend instead of threading (requires `torchrun`)\n- `sharded: bool` - if True, find all trainable parameters that weren't split by tensor parallelism and split them using the [ZeRO-3 algorithm](https://deepspeed.readthedocs.io/en/latest/zero3.html)\n   - weights will be split between GPUs and re-assembled before each forward pass\n   - TL;DR: use this when training to avoid duplicate parameters (enabled by default!)\n   - `sharded_param_names: List[str]` - parameter names that should be sharded this way; default = found automatically\n\n  \n### Saving the model\n\nTo save a model so that it can be used in a non-`tensor_parallel` context, use the `save_tensor_parallel` context manager.\n\n```python\nimport torch\nimport transformers\nimport tensor_parallel as tp\n\nmodel = tp.tensor_parallel(\n    transformers.AutoModelForCausalLM.from_pretrained(\"facebook/opt-13b\"),\n)\n\n# A whole lot of training...\n\nwith tp.save_tensor_parallel(model):\n    torch.save(model.state_dict(), \"/tmp/model_state_dict.pt\")  # torch.save needs a file path, not a directory\n    # or\n    model.save_pretrained(\"/tmp/\")\n```\n\nSuch code saves the model as if it had never been split. 
It works by gathering model parts during `state_dict` creation.\n  \n### Memory-efficient dispatch\n\nNormally, creating and dispatching a `tensor_parallel` model requires the whole model to be in memory. This can be troublesome, but there is another way.\n\nIt's possible to convert a `state_dict` of a basic model into the corresponding `tensor_parallel` `state_dict` using the helper function `convert_state_dict`. The state dict can then be dispatched and loaded into the model:\n\n```python\nimport accelerate\nimport torch\nimport transformers\n\nimport tensor_parallel as tp\n\n# Initialize a weightless tensor_parallel model from MyModel\nwith accelerate.init_empty_weights():\n    model = tp.TensorParallel(\n        MyModel(),\n        device_ids=[0, 1],  # and prepare it to be put on GPUs 0 and 1\n    )\n\n# Load a partial state_dict for MyModel\nstate_dict = torch.load(\"my_model_part_1_of_5.bin\")\n\n# Convert it into a tensor_parallel state_dict\ntensor_parallel_state_dict = tp.convert_state_dict(\n    state_dict,\n    tensor_parallel_config=model.tensor_parallel_config,\n    world_size=len(model.devices),\n)\n\n# Dispatch the converted partial state_dict (load_state_dict doesn't work with meta tensors, so we use accelerate)\ndevice_map = tp.infer_sharded_device_map(model)\nfor param_name, param in tensor_parallel_state_dict.items():\n    module_name = param_name\n    while len(module_name) \u003e 0 and module_name not in device_map:\n        module_name = \".\".join(module_name.split(\".\")[:-1])\n    param_device = device_map[module_name]\n    accelerate.utils.set_module_tensor_to_device(model, param_name, param_device, value=param)\n```\n\nWith this approach, no more than one part of the model needs to be loaded into memory at once.\n  \n## FAQ\n\n- __Q:__ I don't have a multi-GPU server. Can I use tensor_parallel in Google Colab?\n- __A:__ Colab has a single GPU, so there's no point in tensor parallelism. 
However, [Kaggle offers two T4 GPUs for free](https://www.kaggle.com/code/muellerzr/multi-gpu-and-accelerate) to all phone-verified accounts.\n\n\n- __Q:__ What is tensor parallelism?\n- __A:__ You split each layer's weights into parts, multiply each part on a separate GPU, then gather the results. Read more [here](https://colossalai.org/docs/concepts/paradigms_of_parallelism/).\n \n\n- __Q:__ Should I use `TensorParallel` or `DataParallel`?\n- __A:__ TensorParallel for large models, DataParallel for smaller ones.\n\n\n- __Q:__ How does it compare against FullyShardedDataParallel and ZeRO?\n- __A:__ ZeRO is better if you can fit a large batch; TensorParallel is better for small batches.\n\n\nWhy use `tensor_parallel`...\n- vs. [DeepSpeed](https://github.com/microsoft/DeepSpeed) and [FairScale](https://github.com/facebookresearch/fairscale/)\n  - DeepSpeed has many parallelization strategies, but requires careful configuration\n  - tensor_parallel has one strategy that works with 1 line of code\n  - tensor_parallel works in a Jupyter notebook\n- vs. [MegatronLM](https://github.com/NVIDIA/Megatron-LM)\n  - MegatronLM has _great_ tensor parallelism for one model architecture\n  - tensor_parallel has _good_ parallelism for any architecture\n  - tensor_parallel is way easier to install\n- vs. [parallelformers](https://github.com/tunib-ai/parallelformers)\n  - parallelformers is inference-only, tensor_parallel supports training\n- vs. [`alpa`](https://github.com/alpa-projects/alpa)\n  - alpa is a powerful tool for automatic distributed training / inference in JAX\n  - tensor_parallel works with PyTorch\n- vs. 
[`Model.parallelize()`](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2Model.parallelize)\n  - both are easy to use, and both fit large models\n  - in parallelize, one GPU works at a time\n  - in tensor_parallel, GPUs work in parallel\n\nIn short, use `tensor_parallel` for quick prototyping on a single machine.\nUse DeepSpeed+Megatron or alpa for million-dollar training runs.\n\n\n## Troubleshooting\n\nIf you experience NCCL errors or random hangs, you may have code errors that are not displayed properly.\nTo debug these errors, we recommend rerunning with `export TENSOR_PARALLEL_USE_NATIVE=1` set, or on a single device.\n\nIf you find a bug or encounter a problem, please report it to [our issue tracker](https://github.com/BlackSamorez/tensor_parallel/issues).\nWe will do our best to help, but it may take some time before we get to it.\nPlease create issues only if your problem is specifically with `tensor_parallel`.\nFor example, if you need help installing `transformers` or optimizing your code, please seek it elsewhere.\n\n### Code style\n\nWe use [black](https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html) and [isort](https://pycqa.github.io/isort/) for all pull requests.\nBefore committing your code, simply run `black . \u0026\u0026 isort .` and you will be fine.\n\n--------------------------------------------------------------------------------\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fblacksamorez%2Ftensor_parallel","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fblacksamorez%2Ftensor_parallel","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fblacksamorez%2Ftensor_parallel/lists"}