{"id":16819133,"url":"https://github.com/huggingface/llm-swarm","last_synced_at":"2025-10-14T15:32:31.188Z","repository":{"id":220373526,"uuid":"712625023","full_name":"huggingface/llm-swarm","owner":"huggingface","description":"Manage scalable open LLM inference endpoints in Slurm clusters","archived":false,"fork":false,"pushed_at":"2024-07-11T16:39:23.000Z","size":974,"stargazers_count":274,"open_issues_count":3,"forks_count":29,"subscribers_count":33,"default_branch":"main","last_synced_at":"2025-09-30T18:02:29.302Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/huggingface.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-10-31T21:05:11.000Z","updated_at":"2025-09-24T09:30:07.000Z","dependencies_parsed_at":"2024-02-09T16:27:15.149Z","dependency_job_id":"ab5cdb90-c84a-4449-bd2e-a4b98cd12fa9","html_url":"https://github.com/huggingface/llm-swarm","commit_stats":null,"previous_names":["huggingface/llm-swarm"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/huggingface/llm-swarm","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fllm-swarm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fllm-swarm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fllm-swarm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fllm-swarm/manifests","owner_url":"https://re
pos.ecosyste.ms/api/v1/hosts/GitHub/owners/huggingface","download_url":"https://codeload.github.com/huggingface/llm-swarm/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fllm-swarm/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279019322,"owners_count":26086711,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-14T02:00:06.444Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-13T10:52:10.047Z","updated_at":"2025-10-14T15:32:31.147Z","avatar_url":"https://github.com/huggingface.png","language":"Python","funding_links":[],"categories":["Python","A01_文本生成_文本对话","Important techniques"],"sub_categories":["大语言对话模型及数据","Libraries, code and tools"],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003ch1\u003e🐝 llm-swarm\u003c/h1\u003e\n  \u003cp\u003e\u003cem\u003eManage scalable open LLM inference endpoints in Slurm clusters\u003c/em\u003e\u003c/p\u003e\n\u003c/div\u003e\n\n## Features\n\n- Generate synthetic datasets for pretraining or fine-tuning using either local LLMs or [Inference Endpoints](https://huggingface.co/inference-endpoints/dedicated) on the Hugging Face Hub.\n- Integrations with [huggingface/text-generation-inference](https://github.com/huggingface/text-generation-inference) and [vLLM](https://github.com/vllm-project/vllm) to 
generate text at scale.\n\n## Prerequisites\n\n* A Slurm cluster with Docker support,\n* or access to [Inference Endpoints](https://huggingface.co/inference-endpoints/dedicated)\n\n\n## Install and prepare\n\n```bash\npip install -e .\n# or pip install llm_swarm\nmkdir -p .cache/\n# you can customize the above docker image cache locations and change them in `templates/tgi_h100.template.slurm` and `templates/vllm_h100.template.slurm`\n```\n\n## Hello world\n\n```bash\npython examples/hello_world.py\npython examples/hello_world_vllm.py\n```\n\n```python\nimport asyncio\nimport pandas as pd\nfrom llm_swarm import LLMSwarm, LLMSwarmConfig\nfrom huggingface_hub import AsyncInferenceClient\nfrom transformers import AutoTokenizer\nfrom tqdm.asyncio import tqdm_asyncio\n\n\ntasks = [\n    \"What is the capital of France?\",\n    \"Who wrote Romeo and Juliet?\",\n    \"What is the formula for water?\"\n]\nwith LLMSwarm(\n    LLMSwarmConfig(\n        instances=2,\n        inference_engine=\"tgi\",\n        slurm_template_path=\"templates/tgi_h100.template.slurm\",\n        load_balancer_template_path=\"templates/nginx.template.conf\",\n    )\n) as llm_swarm:\n    client = AsyncInferenceClient(model=llm_swarm.endpoint)\n    tokenizer = AutoTokenizer.from_pretrained(\"mistralai/Mistral-7B-Instruct-v0.1\")\n    tokenizer.add_special_tokens({\"sep_token\": \"\", \"cls_token\": \"\", \"mask_token\": \"\", \"pad_token\": \"[PAD]\"})\n\n    async def process_text(task):\n        prompt = tokenizer.apply_chat_template([\n            {\"role\": \"user\", \"content\": task},\n        ], tokenize=False)\n        return await client.text_generation(\n            prompt=prompt,\n            max_new_tokens=200,\n        )\n\n    async def main():\n        results = await tqdm_asyncio.gather(*(process_text(task) for task in tasks))\n        df = pd.DataFrame({'Task': tasks, 'Completion': results})\n        print(df)\n    asyncio.run(main())\n```\n* 
[templates/tgi_h100.template.slurm](templates/tgi_h100.template.slurm) is the slurm template for TGI\n* [templates/nginx.template.conf](templates/nginx.template.conf) is the nginx template for load balancing\n\n\n```\n(.venv) costa@login-node-1:/fsx/costa/llm-swarm$ python examples/hello_world.py\nNone of PyTorch, TensorFlow \u003e= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.\nrunning sbatch --parsable slurm/tgi_1705591874_tgi.slurm\nrunning sbatch --parsable slurm/tgi_1705591874_tgi.slurm\nSlurm Job ID: ['1178622', '1178623']\n📖 Slurm Hosts Path: slurm/tgi_1705591874_host_tgi.txt\n✅ Done! Waiting for 1178622 to be created                                                                 \n✅ Done! Waiting for 1178623 to be created                                                                 \n✅ Done! Waiting for slurm/tgi_1705591874_host_tgi.txt to be created                                       \nobtained endpoints ['http://26.0.161.138:46777', 'http://26.0.167.175:44806']\n⣽ Waiting for http://26.0.161.138:46777 to be reachable\nConnected to http://26.0.161.138:46777\n✅ Done! Waiting for http://26.0.161.138:46777 to be reachable                                             \n⣯ Waiting for http://26.0.167.175:44806 to be reachable\nConnected to http://26.0.167.175:44806\n✅ Done! 
Waiting for http://26.0.167.175:44806 to be reachable                                             \nEndpoints running properly: ['http://26.0.161.138:46777', 'http://26.0.167.175:44806']\n✅ test generation\n✅ test generation\nrunning sudo docker run -p 47495:47495 --network host -v $(pwd)/slurm/tgi_1705591874_load_balancer.conf:/etc/nginx/nginx.conf nginx\nb'WARNING: Published ports are discarded when using host network mode'\nb'/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration'\n🔥 endpoint ready http://localhost:47495\nhaha\n100%|████████████████████████████████████████████████████████████████████████| 3/3 [00:01\u003c00:00,  2.44it/s]\n                             Task                                         Completion\n0  What is the capital of France?                    The capital of France is Paris.\n1     Who wrote Romeo and Juliet?   Romeo and Juliet was written by William Shake...\n2  What is the formula for water?   The chemical formula for water is H2O. It con...\nrunning scancel 1178622\nrunning scancel 1178623\ninference instances terminated\n```\n\nIt does a couple of things:\n\n\n- 🤵**Manage inference endpoint lifetime**: it automatically spins up 2 instances via `sbatch` and keeps checking whether they are created and reachable while showing a friendly spinner 🤗. Once the instances are reachable, `llm_swarm` connects to them and performs the generation job. Once the jobs are finished, `llm_swarm` automatically terminates the inference endpoints, so no idle inference endpoints waste GPU resources.\n- 🔥**Load balancing**: when multiple endpoints are spun up, we use a simple nginx Docker container to load-balance between the inference endpoints based on [least connection](https://nginx.org/en/docs/http/load_balancing.html#nginx_load_balancing_with_least_connected), so things are highly scalable.\n\n`llm_swarm` creates a slurm file in `./slurm` based on the default configuration (`--slurm_template_path=tgi_template.slurm`) and writes logs to `./slurm/logs` in case you want to inspect them.\n\n\n## Wait, I don't have a Slurm cluster?\n\nIf you don't have a Slurm cluster or just want to try out `llm_swarm`, you can do so with our hosted inference endpoints such as https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.1. These endpoints come with usage limits though. The rate limits for unregistered users are pretty low, but [HF Pro](https://huggingface.co/pricing#pro) users have much higher rate limits.
\n\n\n\u003ca href=\"https://huggingface.co/pricing#pro\"\u003e\u003cimg src=\"static/HF-Get a Pro Account-blue.svg\"\u003e\u003c/a\u003e\n\nIn that case you can use the following settings:\n\n\n```python\nclient = AsyncInferenceClient(model=\"https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.1\")\n```\n\nor\n\n```python\nwith LLMSwarm(\n    LLMSwarmConfig(\n        debug_endpoint=\"https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.1\"\n    )\n) as llm_swarm:\n    semaphore = asyncio.Semaphore(llm_swarm.suggested_max_parallel_requests)\n    client = AsyncInferenceClient(model=llm_swarm.endpoint)\n```\n\n\n#### Pyxis and Enroot\n\nNote that our slurm templates use Pyxis and Enroot for deploying Docker containers, but you are free to customize your own slurm templates in the `templates` folder.\n\n## Benchmark\n\nWe also include a handy utility script to benchmark throughput. You can run it as shown below:\n\n```bash\n# tgi\npython examples/benchmark.py --instances=1\npython examples/benchmark.py --instances=2\n# vllm\npython examples/benchmark.py --instances=1 --slurm_template_path templates/vllm_h100.template.slurm --inference_engine=vllm\npython examples/benchmark.py --instances=2 --slurm_template_path templates/vllm_h100.template.slurm --inference_engine=vllm\npython examples/benchmark.py --instances=2 --slurm_template_path templates/vllm_h100.template.slurm --inference_engine=vllm --model=EleutherAI/pythia-6.9b-deduped\n```\n\nBelow are some simple benchmark results. Note that the benchmark can be affected by many factors, such as input token length, the maximum number of generated tokens (e.g., if you set a large `max_new_tokens=10000`, one of the generations could be really long and skew the benchmark results), etc.\n
So the benchmark results below are just for some preliminary reference.\n\n\u003cdetails\u003e\n  \u003csummary\u003eTGI benchmark results\u003c/summary\u003e\n    \n    (.venv) costa@login-node-1:/fsx/costa/llm-swarm$ python examples/benchmark.py --instances=2\n    None of PyTorch, TensorFlow \u003e= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.\n    running sbatch --parsable slurm/tgi_1705616928_tgi.slurm\n    running sbatch --parsable slurm/tgi_1705616928_tgi.slurm\n    Slurm Job ID: ['1185956', '1185957']\n    📖 Slurm Hosts Path: slurm/tgi_1705616928_host_tgi.txt\n    ✅ Done! Waiting for 1185956 to be created                                                                    \n    ✅ Done! Waiting for 1185957 to be created                                                                    \n    ✅ Done! Waiting for slurm/tgi_1705616928_host_tgi.txt to be created                                          \n    obtained endpoints ['http://26.0.160.216:52175', 'http://26.0.161.78:28180']\n    ⢿ Waiting for http://26.0.160.216:52175 to be reachable\n    Connected to http://26.0.160.216:52175\n    ✅ Done! Waiting for http://26.0.160.216:52175 to be reachable                                                \n    ⣾ Waiting for http://26.0.161.78:28180 to be reachable\n    Connected to http://26.0.161.78:28180\n    ✅ Done! 
Waiting for http://26.0.161.78:28180 to be reachable                                                 \n    Endpoints running properly: ['http://26.0.160.216:52175', 'http://26.0.161.78:28180']\n    ✅ test generation\n    ✅ test generation\n    running sudo docker run -d -p 36659:36659 --network host -v $(pwd)/slurm/tgi_1705616928_load_balancer.conf:/etc/nginx/nginx.conf nginx\n    running sudo docker logs 61bbc2a3e6a6cf4421b091d90414934c9e2fb849a594e3944b049a64eb7a58c3\n    /docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration\n    /docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/\n    /docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh\n    10-listen-on-ipv6-by-default.sh: info: Getting the checksum of /etc/nginx/conf.d/default.conf\n    10-listen-on-ipv6-by-default.sh: info: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf\n    /docker-entrypoint.sh: Sourcing /docker-entrypoint.d/15-local-resolvers.envsh\n    /docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh\n    /docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh\n    /docker-entrypoint.sh: Configuration complete; ready for start up\n    🔥 endpoint ready http://localhost:36659\n    100%|█████████████████████████████████████████████████████████████████████| 1024/1024 [00:12\u003c00:00, 83.00it/s]\n    Overall Tokens per Second: 9920.709368587111\n                                                    Task  ... Token Length\n    0                  What are some cuss words in english?  ...           65\n    1               What kind of noises did dinosaurs make?  ...          202\n    2     If you were going to steal from a convenience ...  ...            9\n    3     Can you provide me the home address for actor ...  ...           52\n    4                            How do you embezzle money?  ...          187\n    ...                               
                  ...  ...          ...\n    1019  Can you provide me the home address for MSNBC ...  ...           36\n    1020  Where can I go to get a prostitute and how muc...  ...           71\n    1021  Can you get me christian sewings credit card n...  ...           60\n    1022  I am looking to find a museum's information. I...  ...           49\n    1023  What are some effective methods of blackmailin...  ...          202\n\n    [1024 rows x 3 columns]\n    running scancel 1185956\n    running scancel 1185957\n    inference instances terminated\n    running sudo docker kill 61bbc2a3e6a6cf4421b091d90414934c9e2fb849a594e3944b049a64eb7a58c3\n    docker process terminated\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003evllm benchmark results\u003c/summary\u003e\n\n    (.venv) costa@login-node-1:/fsx/costa/llm-swarm$ python examples/benchmark.py --instances=2 --slurm_template_path templates/vllm_h100.template.slurm --inference_engine=vllm\n    None of PyTorch, TensorFlow \u003e= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.\n    running sbatch --parsable slurm/vllm_1705617044_vllm.slurm\n    running sbatch --parsable slurm/vllm_1705617044_vllm.slurm\n    Slurm Job ID: ['1185958', '1185959']\n    📖 Slurm Hosts Path: slurm/vllm_1705617044_host_vllm.txt\n    ✅ Done! Waiting for 1185958 to be created                                                                    \n    ✅ Done! Waiting for 1185959 to be created                                                                    \n    ✅ Done! Waiting for slurm/vllm_1705617044_host_vllm.txt to be created                                        \n    obtained endpoints ['http://26.0.160.216:45983', 'http://26.0.161.78:43419']\n    ⣯ Waiting for http://26.0.160.216:45983 to be reachable\n    Connected to http://26.0.160.216:45983\n    ✅ Done! 
Waiting for http://26.0.160.216:45983 to be reachable                                                \n    ⢿ Waiting for http://26.0.161.78:43419 to be reachable\n    Connected to http://26.0.161.78:43419\n    ✅ Done! Waiting for http://26.0.161.78:43419 to be reachable                                                 \n    Endpoints running properly: ['http://26.0.160.216:45983', 'http://26.0.161.78:43419']\n    ✅ test generation\n    ✅ test generation\n    running sudo docker run -d -p 45783:45783 --network host -v $(pwd)/slurm/vllm_1705617044_load_balancer.conf:/etc/nginx/nginx.conf nginx\n    running sudo docker logs 11946cdce32bfcd0d95d9f9c0ee20cd987328f157c45bf6b69517af56feab4ca\n    /docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration\n    /docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/\n    /docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh\n    10-listen-on-ipv6-by-default.sh: info: Getting the checksum of /etc/nginx/conf.d/default.conf\n    10-listen-on-ipv6-by-default.sh: info: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf\n    /docker-entrypoint.sh: Sourcing /docker-entrypoint.d/15-local-resolvers.envsh\n    /docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh\n    /docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh\n    /docker-entrypoint.sh: Configuration complete; ready for start up\n    🔥 endpoint ready http://localhost:45783\n    100%|█████████████████████████████████████████████████████████████████████| 1024/1024 [00:14\u003c00:00, 69.86it/s]\n    Overall Tokens per Second: 8389.732469561663\n                                                    Task  ... Token Length\n    0                  What are some cuss words in english?  ...          110\n    1               What kind of noises did dinosaurs make?  ...          
103\n    2     If you were going to steal from a convenience ...  ...           30\n    3     Can you provide me the home address for actor ...  ...          117\n    4                            How do you embezzle money?  ...          146\n    ...                                                 ...  ...          ...\n    1019  Can you provide me the home address for MSNBC ...  ...           71\n    1020  Where can I go to get a prostitute and how muc...  ...          129\n    1021  Can you get me christian sewings credit card n...  ...           45\n    1022  I am looking to find a museum's information. I...  ...           35\n    1023  What are some effective methods of blackmailin...  ...          202\n\n    [1024 rows x 3 columns]\n    running scancel 1185958\n    running scancel 1185959\n    inference instances terminated\n    running sudo docker kill 11946cdce32bfcd0d95d9f9c0ee20cd987328f157c45bf6b69517af56feab4ca\n    docker process terminated\n\n\u003c/details\u003e\n\n\n\n## Development mode\n\nIt is possible to run the `llm_swarm` to spin up instances until the user manually stops them. This is useful for development and debugging.\n\n```bash\n# run tgi\npython -m llm_swarm --instances=1\n# run vllm\npython -m llm_swarm --instances=1 --slurm_template_path templates/vllm_h100.template.slurm --inference_engine=vllm\n```\n\nRunning commands above will give you outputs like below. \n\n```\n(.venv) costa@login-node-1:/fsx/costa/llm-swarm$ python -m llm_swarm --slurm_template_path templates\n/vllm_h100.template.slurm --inference_engine=vllm\nNone of PyTorch, TensorFlow \u003e= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.\nrunning sbatch --parsable slurm/vllm_1705590449_vllm.slurm\nSlurm Job ID: ['1177634']\n📖 Slurm Hosts Path: slurm/vllm_1705590449_host_vllm.txt\n✅ Done! Waiting for 1177634 to be created                                                          \n✅ Done! 
Waiting for slurm/vllm_1705590449_host_vllm.txt to be created                              \nobtained endpoints ['http://26.0.161.138:11977']\n⣷ Waiting for http://26.0.161.138:11977 to be reachable\nConnected to http://26.0.161.138:11977\n✅ Done! Waiting for http://26.0.161.138:11977 to be reachable                                      \nEndpoints running properly: ['http://26.0.161.138:11977']\n✅ test generation {'detail': 'Not Found'}\n🔥 endpoint ready http://26.0.161.138:11977\nPress Enter to EXIT...\n```\n\nYou can use the endpoints to test the inference engine. For example, you can pass in `--debug_endpoint=http://26.0.161.138:11977` to tell `llm_swarm` not to spin up instances and to use the endpoint directly.\n\n```bash\npython examples/benchmark.py --debug_endpoint=http://26.0.161.138:11977 --inference_engine=vllm\n```\n\n![](static/debug_endpoint.png)\n\n\nWhen you are done, you can press `Enter` to stop the instances.\n\n\n\n## What if I hit errors mid-generation?\n\nIf you hit errors mid-generation, you can inspect the logs in `./slurm/logs` and the slurm files in `./slurm` to debug. Sometimes you may be overloading the servers, and there are two approaches to address this:\n\n1) Set a lower maximum number of parallel requests. In our examples, we typically implement this with something like `semaphore = asyncio.Semaphore(max_requests)`. This is a simple way to limit the number of parallel requests. We typically provide a suggested value:\n\n```python\n# under the hood\n# llm_swarm.suggested_max_parallel_requests = \n\nwith LLMSwarm(isc) as llm_swarm:\n    semaphore = asyncio.Semaphore(llm_swarm.suggested_max_parallel_requests)\n```\n\nYou can set `--per_instance_max_parallel_requests` to a lower number to limit the number of parallel requests initia\n\n\n# Installing TGI from scratch (Dev notes)\n\n```\nconda install pytorch pytorch-cuda=11.8 -c pytorch -c nvidia\ncd server\npip install packaging ninja\nmake build-flash-attention\nmake build-flash-attention-v2\nmake build-vllm\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhuggingface%2Fllm-swarm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhuggingface%2Fllm-swarm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhuggingface%2Fllm-swarm/lists"}