{"id":28637806,"url":"https://github.com/qwenlm/parscale","last_synced_at":"2025-06-12T18:35:11.349Z","repository":{"id":293675225,"uuid":"984036705","full_name":"QwenLM/ParScale","owner":"QwenLM","description":"Parallel Scaling Law for Language Model — Beyond Parameter and Inference Time Scaling","archived":false,"fork":false,"pushed_at":"2025-05-16T14:18:59.000Z","size":2254,"stargazers_count":27,"open_issues_count":0,"forks_count":2,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-05-16T15:31:56.086Z","etag":null,"topics":["large-language-models","llm","machine-learning","scaling-law"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2505.10475","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/QwenLM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-15T09:49:05.000Z","updated_at":"2025-05-16T15:24:47.000Z","dependencies_parsed_at":"2025-05-16T15:33:45.193Z","dependency_job_id":"a85e3f69-a9ef-44d1-af0c-09a7666f4ec3","html_url":"https://github.com/QwenLM/ParScale","commit_stats":null,"previous_names":["qwenlm/parscale"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/QwenLM/ParScale","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/QwenLM%2FParScale","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/QwenLM%2FParScale/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/QwenLM%2FParScale/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposi
tories/QwenLM%2FParScale/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/QwenLM","download_url":"https://codeload.github.com/QwenLM/ParScale/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/QwenLM%2FParScale/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259519282,"owners_count":22870331,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["large-language-models","llm","machine-learning","scaling-law"],"created_at":"2025-06-12T18:35:11.276Z","updated_at":"2025-06-12T18:35:11.337Z","avatar_url":"https://github.com/QwenLM.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n\n# Parallel Scaling Law for Language Model\n\n\n_Yet Another Scaling Law beyond Parameters and Inference Time Scaling_\n\n[![Paper](https://img.shields.io/badge/arXiv-2505.10475-red)](https://arxiv.org/abs/2505.10475)\n[![huggingface](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-FFD21E)](https://huggingface.co/ParScale)\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"figures/logo.jpg\" style=\"width: 10%;\" /\u003e\n\u003c/div\u003e\n\n\n\u003cp align=\"center\"\u003e\n    💡\u0026nbsp;\u003ca href=\"#-key-findings\"\u003eKey Findings\u003c/a\u003e\n    | 📈\u0026nbsp;\u003ca href=\"#-scaling-law\"\u003eScaling Law\u003c/a\u003e\n    | ⚡\u0026nbsp;\u003ca href=\"#-cost-analysis\"\u003eCost Analysis\u003c/a\u003e\n    | 🔥\u0026nbsp;\u003ca href=\"#-models\"\u003eModels\u003c/a\u003e\n    | 
📚\u0026nbsp;\u003ca href=\"#-citation\"\u003eCitation\u003c/a\u003e\n\u003c/p\u003e\n\u003c/div\u003e\n\n## 🌟 About\n\n- It is widely believed that scaling language models requires a heavy cost in either **space** (parameter scaling) or **time** (inference-time scaling). \n- We introduce the *third* paradigm for scaling LLMs, which leverages **parallel computation** during both training and inference (Parallel Scaling, or *ParScale*).\n- We apply $P$ diverse and learnable transformations to the input, execute forward passes of the model in parallel, and dynamically aggregate the $P$ outputs. \n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"figures/teaser.png\" style=\"width: 80%;\" /\u003e\n\u003c/div\u003e\n\n---\n\n## 💡 Key Findings\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"figures/scaling_comparison.png\" style=\"width: 80%;\" /\u003e\n\u003c/div\u003e\n\nHere are the core insights and benefits distilled from our theoretical analysis and empirical evaluations:\n\n📈 **Logarithmic Scaling Law**: We theoretically and empirically establish that **scaling with $P$ parallel streams is comparable to scaling the number of parameters by** $O(\\log P)$. This suggests that parallel computation can serve as an efficient substitute for parameter growth, especially for larger models.\n\n✅ **Universal Applicability**: Unlike inference-time scaling, which requires specialized data and has limited applicability, ParScale works with any model architecture, optimization method, data, or downstream task.\n\n\n🧠 **Stronger Performance on Reasoning Tasks**: Reasoning-intensive tasks (e.g., coding or math) benefit more from ParScale, which suggests that scaling computation can effectively push the boundary of reasoning. 
\n\n⚡ **Superior Inference Efficiency**: ParScale can incur up to **22x less memory increase** and **6x less latency increase** than parameter scaling that achieves the same performance improvement (batch size=1).\n\n🧱 **Cost-Efficient Training via Two-Stage Strategy**: Training a parallel-scaled model doesn't require starting from scratch. With a two-stage training strategy, we can post-train the parallel components using only a small amount of data.\n\n🔁 **Dynamic Adaptation at Inference Time**: We find that ParScale remains effective with frozen main parameters for different $P$. This illustrates the potential of dynamic parallel scaling: switching $P$ to dynamically adapt model capabilities during inference.\n\nWe release the inference code in `modeling_qwen2_parscale.py` and `configuration_qwen2_parscale.py`. Our 67 checkpoints are available at [🤗 HuggingFace](https://huggingface.co/ParScale).\n\n---\n\n## 📈 Scaling Law\n\n- We carry out large-scale pre-training experiments on the Stack-V2 and Pile corpora, varying $P$ from 1 to 8 and model parameters from 500M to 4.4B. \n- We use the results to fit a new *parallel scaling law* that generalizes the Chinchilla scaling law.\n- We release our parametric fitting code in `parametric_fit.py`.\n- Feel free to try [🤗 HuggingFace Space](https://huggingface.co/spaces/ParScale/Parallel_Scaling_Law) for a visualization of the parallel scaling law!\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"figures/scaling_law.png\" style=\"width: 70%;\" /\u003e\n\u003cimg src=\"figures/scaling_law2.png\" style=\"width: 70%;\" /\u003e\n\u003c/div\u003e\n\n---\n\n## ⚡ Cost Analysis\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"figures/cost.png\" style=\"width: 70%;\" /\u003e\n\u003c/div\u003e\n\n- We further compare the inference efficiency of parallel scaling and parameter scaling at equivalent performance levels. \n- We release our analysis code in `cost_analysis.py`. 
Before using it, you should first install [llm-analysis](https://github.com/cli99/llm-analysis):\n\n```bash\ngit clone https://github.com/cli99/llm-analysis.git\ncd llm-analysis\npip install .\n```\n\n- You can use the following command to analyze the inference memory and latency cost for our 4.4B model, with $P=2$ and batch size=2:\n```bash\npython cost_analysis.py --hidden_size 2560 --intermediate_size 13824 --P 2 --batch_size 2\n```\n\n---\n\n## 🔥 Models\n\n✨ are our recommendation for strong models!\n\n### Base models for scaling training data to 1T tokens\n\nThese models demonstrate strong competitiveness among existing small models, including SmolLM, gemma, and Llama-3.2.\n\n|Model|Description|Download|\n|:-:|:-:|:-:|\n|ParScale-1.8B-P1|✨ Baseline $P=1$|[🤗 ParScale/ParScale-1.8B-P1](https://huggingface.co/ParScale/ParScale-1.8B-P1)|\n|ParScale-1.8B-P2|✨ ParScale $P=2$|[🤗 ParScale/ParScale-1.8B-P2](https://huggingface.co/ParScale/ParScale-1.8B-P2)|\n|ParScale-1.8B-P4|✨ ParScale $P=4$|[🤗 ParScale/ParScale-1.8B-P4](https://huggingface.co/ParScale/ParScale-1.8B-P4)|\n|ParScale-1.8B-P8|✨ ParScale $P=8$|[🤗 ParScale/ParScale-1.8B-P8](https://huggingface.co/ParScale/ParScale-1.8B-P8)|\n\n### Instruct models for scaling training data to 1T tokens\n\nWe post-trained the aforementioned base model on SmolTalk-1M to enable conversational capabilities.\n\n|Model|Description|Download|\n|:-:|:-:|:-:|\n|ParScale-1.8B-P1-Inst|✨ Baseline $P=1$|[🤗 ParScale/ParScale-1.8B-P1-Inst](https://huggingface.co/ParScale/ParScale-1.8B-P1-Inst)|\n|ParScale-1.8B-P2-Inst|✨ ParScale $P=2$|[🤗 ParScale/ParScale-1.8B-P2-Inst](https://huggingface.co/ParScale/ParScale-1.8B-P2-Inst)|\n|ParScale-1.8B-P4-Inst|✨ ParScale $P=4$|[🤗 ParScale/ParScale-1.8B-P4-Inst](https://huggingface.co/ParScale/ParScale-1.8B-P4-Inst)|\n|ParScale-1.8B-P8-Inst|✨ ParScale $P=8$|[🤗 ParScale/ParScale-1.8B-P8-Inst](https://huggingface.co/ParScale/ParScale-1.8B-P8-Inst)|\n\n\n### Continual Pretraining Qwen-2.5-3B\n\nWe froze 
the parameters of Qwen-2.5-3B and only fine-tuned the newly introduced parameters on Stack-V2-Python. Since the following models share the same backbone parameters as Qwen-2.5-3B, they have the potential for dynamic ParScale: switching P to adapt model capabilities during inference.\n\n|Model|Description|Download|\n|:-:|:-:|:-:|\n|ParScale-Qwen-3B-P2-Python|✨ ParScale $P=2$|[🤗 ParScale/ParScale-Qwen-3B-P2-Python](https://huggingface.co/ParScale/ParScale-Qwen-3B-P2-Python)|\n|ParScale-Qwen-3B-P4-Python|✨ ParScale $P=4$|[🤗 ParScale/ParScale-Qwen-3B-P4-Python](https://huggingface.co/ParScale/ParScale-Qwen-3B-P4-Python)|\n|ParScale-Qwen-3B-P8-Python|✨ ParScale $P=8$|[🤗 ParScale/ParScale-Qwen-3B-P8-Python](https://huggingface.co/ParScale/ParScale-Qwen-3B-P8-Python)|\n\n- For full continual pretraining on Stack-V2-Python\n\n|Model|Description|Download|\n|:-:|:-:|:-:|\n|ParScale-QwenInit-3B-P1-Python|Baseline $P=1$|[🤗 ParScale/ParScale-QwenInit-3B-P1-Python](https://huggingface.co/ParScale/ParScale-QwenInit-3B-P1-Python)|\n|ParScale-QwenInit-3B-P2-Python|ParScale $P=2$|[🤗 ParScale/ParScale-QwenInit-3B-P2-Python](https://huggingface.co/ParScale/ParScale-QwenInit-3B-P2-Python)|\n|ParScale-QwenInit-3B-P4-Python|ParScale $P=4$|[🤗 ParScale/ParScale-QwenInit-3B-P4-Python](https://huggingface.co/ParScale/ParScale-QwenInit-3B-P4-Python)|\n|ParScale-QwenInit-3B-P8-Python|ParScale $P=8$|[🤗 ParScale/ParScale-QwenInit-3B-P8-Python](https://huggingface.co/ParScale/ParScale-QwenInit-3B-P8-Python)|\n\n- For full continual pretraining on Pile\n\n|Model|Description|Download|\n|:-:|:-:|:-:|\n|ParScale-QwenInit-3B-P1-Pile|Baseline $P=1$|[🤗 ParScale/ParScale-QwenInit-3B-P1-Pile](https://huggingface.co/ParScale/ParScale-QwenInit-3B-P1-Pile)|\n|ParScale-QwenInit-3B-P2-Pile|ParScale $P=2$|[🤗 ParScale/ParScale-QwenInit-3B-P2-Pile](https://huggingface.co/ParScale/ParScale-QwenInit-3B-P2-Pile)|\n|ParScale-QwenInit-3B-P4-Pile|ParScale $P=4$|[🤗 
ParScale/ParScale-QwenInit-3B-P4-Pile](https://huggingface.co/ParScale/ParScale-QwenInit-3B-P4-Pile)|\n|ParScale-QwenInit-3B-P8-Pile|ParScale $P=8$|[🤗 ParScale/ParScale-QwenInit-3B-P8-Pile](https://huggingface.co/ParScale/ParScale-QwenInit-3B-P8-Pile)|\n\n\n### Checkpoints Used to Fit the Scaling Law\n\nDownload link: https://huggingface.co/ParScale/ParScale-{size}-{P}-{dataset}\n\n- {size}: model size, from {0.7B, 0.9B, 1.3B, 1.8B, 3B, 4.7B}\n- {P}: number of parallels, from {P1, P2, P4, P8}\n- {dataset}: training dataset, from {Python, Pile}\n- $6\\times 4 \\times 2=48$ checkpoints in total.\n\n### Usage Example with 🤗 Hugging Face\n\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nname = \"ParScale/ParScale-1.8B-P8\" # or anything else you like\nmodel = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True).to(\"cuda\")\ntokenizer = AutoTokenizer.from_pretrained(name)\ninputs = tokenizer.encode(\"Hello, how are you today?\", return_tensors=\"pt\").to(\"cuda\")\noutputs = model.generate(inputs, max_new_tokens=128)[0]\nprint(tokenizer.decode(outputs))\n```\n\n\n## 📚 Citation\n\n```bibtex\n@article{ParScale,\n      title={Parallel Scaling Law for Language Models}, \n      author={Mouxiang Chen and Binyuan Hui and Zeyu Cui and Jiaxi Yang and Dayiheng Liu and Jianling Sun and Junyang Lin and Zhongxin Liu},\n      year={2025},\n      eprint={2505.10475},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG},\n      journal={arXiv preprint arXiv:2505.10475},\n      url={https://arxiv.org/abs/2505.10475}, \n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqwenlm%2Fparscale","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fqwenlm%2Fparscale","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqwenlm%2Fparscale/lists"}