{"id":23335861,"url":"https://github.com/zhanshijinwat/Steel-LLM","last_synced_at":"2025-08-23T04:32:58.334Z","repository":{"id":227986940,"uuid":"772869201","full_name":"zhanshijinwat/Steel-LLM","owner":"zhanshijinwat","description":"Train a 1B LLM with 1T tokens from scratch by personal","archived":false,"fork":false,"pushed_at":"2024-12-18T16:11:01.000Z","size":51934,"stargazers_count":430,"open_issues_count":5,"forks_count":48,"subscribers_count":8,"default_branch":"main","last_synced_at":"2024-12-18T17:22:35.750Z","etag":null,"topics":["large-language-model","llama","llm","qwen"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zhanshijinwat.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-16T05:16:40.000Z","updated_at":"2024-12-18T16:11:05.000Z","dependencies_parsed_at":"2024-03-30T07:29:53.394Z","dependency_job_id":"3652419c-7752-4142-af6e-a2a6c897ba54","html_url":"https://github.com/zhanshijinwat/Steel-LLM","commit_stats":null,"previous_names":["zhanshijinwat/steel-llm"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhanshijinwat%2FSteel-LLM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhanshijinwat%2FSteel-LLM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhanshijinwat%2FSteel-LLM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhanshijinwat%2FSteel-LLM/manifests","owner_url":"ht
tps://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zhanshijinwat","download_url":"https://codeload.github.com/zhanshijinwat/Steel-LLM/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230665554,"owners_count":18261516,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["large-language-model","llama","llm","qwen"],"created_at":"2024-12-21T02:01:33.443Z","updated_at":"2025-08-23T04:32:58.321Z","avatar_url":"https://github.com/zhanshijinwat.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话"],"sub_categories":["大语言对话模型及数据"],"readme":"\u003cdiv align=\"center\"\u003e\n\n# Steel-LLM: An Open-Source Chinese Pretrained Language Model\n\u003c/div\u003e\n\n\\[ 中文 | [English](README_en.md) \\]\n\n## 👋 Introduction\nSteel-LLM is a personal project to pretrain a Chinese large language model from scratch. We pretrained a Chinese LLM of roughly 1B parameters on 1T tokens of data. It took 8 months from the start of the project to fine-tuning the first version of the model. We share the whole process in detail, including data collection, data processing, pretraining framework selection, and model design, and we open-source all of the code, so that anyone with 8 to a few dozen GPUs can reproduce our work. Thanks to open-source Chinese data, Steel-LLM outperforms some larger models released earlier by institutions on Chinese benchmarks, scoring 42 on CEVAL and 36 on CMMLU.\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\".github/steel.png\" width=\"200\"/\u003e\n\u003c/div\u003e\n\n\u003cp align=\"center\"\u003e\n        \u003ca href=\"https://huggingface.co/gqszhanshijin/Steel-LLM\" target=\"_blank\"\u003e\n        \u003cimg alt=\"Static Badge\" src=\"https://img.shields.io/badge/%F0%9F%A4%97Hugging%20Face-Steel%20LLM-yellow\"\u003e\n        \u003c/a\u003e\n        \u003cspan style=\"display: inline-block; width: 10px;\"\u003e\u003c/span\u003e\n        \u003ca href=\"https://www.modelscope.cn/models/zhanshijin/Steel-LLM\"\u003e\u003cimg alt=\"Static Badge\" 
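src=\"https://img.shields.io/badge/%F0%9F%A4%96modelscope-Steel%20LLM-purple\"\u003e\n        \u003c/a\u003e\n        \u003cspan style=\"display: inline-block; width: 10px;\"\u003e\u003c/span\u003e\n        \u003ca href=\".github/wechat_liangangAI.jpg\"\u003e\n        \u003cimg alt=\"Static Badge\" src=\"https://img.shields.io/badge/WeChat-%E7%82%BC%E9%92%A2AI-brightgreen?logo=wechat\u0026logoColor=white\"\u003e\n        \u003c/a\u003e\u003cbr\u003e\n        \u003ca href=\"https://www.zhihu.com/people/zhan-shi-jin-27\"\u003e\n        \u003cimg alt=\"Static Badge\" src=\"https://img.shields.io/badge/%E2%98%91%EF%B8%8Fzhihu%20blog-%E6%88%98%E5%A3%AB%E9%87%91-blue\"\u003e\n        \u003c/a\u003e\n        \u003cspan style=\"display: inline-block; width: 10px;\"\u003e\u003c/span\u003e\n        \u003ca href=\"https://arxiv.org/abs/2502.06635\" target=\"_blank\"\u003e\n        \u003cimg alt=\"Arxiv\" src=\"https://img.shields.io/badge/Arxiv-Steel%20LLM-7289da?logo=arxiv\u0026logoColor=red\u0026color=red\" /\u003e\n        \u003c/a\u003e\n\n\n        \n\nThe name \"Steel\" (钢) was inspired by Omnipotent Youth Society (万能青年旅店, \"Wan Qing\" for short), an excellent band from the North China Plain. The band made its first album under limited conditions and described the process as \"making steel with backyard methods\" (土法炼钢), yet the album turned out to be a classic. Our conditions for training an LLM are similarly limited, but we also hope to forge good \"steel\".\n\n## 🔔 Announcements\n\n### Updates\nWe will continue to explore mathematical ability, reinforcement learning, complex reasoning, and related directions.\n\n[2025/3/10] Published a blog post on reinforcement learning, \"Rejection-Sampling Fine-Tuning to Accelerate RL Convergence, and an Investigation of Model Forgetting\" (in Chinese): https://mp.weixin.qq.com/s/Qk4bN6yFkI39Ye9fsS4NXA\n\n[2025/3/6] 🎉🎉🎉 \"Steel-LLM: From Scratch to Open Source -- A Personal Journey in Building a Chinese-Centric LLM\" was accepted to an ICLR 2025 workshop.\n\n[2025/2/13] Uploaded the technical report: https://arxiv.org/abs/2502.06635\n\n[2025/1/17] Released Steel-LLM-chat-v2. English data was added during fine-tuning, with the Chinese-to-English data ratio kept the same as in pretraining. The CEVAL score rose from 38 to 41.9 and the CMMLU score from 33 to 36.\n\n[2024/11/13] 🔥 Published a project summary article, \"A Personal Journey of Pretraining a 1B LLM from Scratch\" (in Chinese): https://mp.weixin.qq.com/s/POUugkCNZTzmlKWZVVD1CQ 🔥\n\n[2024/10/28] Released the first version of the chat model, which scores 38 on CEVAL and 33 on CMMLU.\n\n[2024/10/24] Published the details of Steel-LLM fine-tuning and evaluation. The fine-tuning experiments explored CoT, benchmark-targeted tuning, and more. Blog post: https://mp.weixin.qq.com/s/KK0G0spNw0D9rPUESkHMew\n\n[2024/9/2] Uploaded the checkpoints at steps 480k, 660k, 720k, 980k, and 1060k (the final checkpoint) to HuggingFace.\n\n[2024/8/18] 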
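Pretraining is complete; fine-tuning and evaluation will follow.\n\n[2024/7/18] Continued training on 8*H800. wandb: https://api.wandb.ai/links/steel-llm-lab/vqf297nr\n\n[2024/6/30] Released the checkpoint at 200k pretraining steps: [HuggingFace link](https://huggingface.co/gqszhanshijin/Steel-LLM/tree/main)\n\n[2024/5/21] Formal training started; checkpoints will be released from time to time.\n\n[2024/5/19] Completed the model modifications based on Qwen1.5; the model size is 1.12B:\n- The FFN layer uses softmax MoE, which trains faster at the same parameter count\n- Uses a two-layer SwiGLU\n\nRelated blog post: https://zhuanlan.zhihu.com/p/700395878\n\n[2024/5/5] Blog post on the pretraining program modifications: https://zhuanlan.zhihu.com/p/694223107\n\n[2024/4/24] Finished improving the training program: it is compatible with HuggingFace-format models, supports resuming data loading from a checkpoint, and supports appending new data.\n\n[2024/4/14] Finished data collection and processing and generated the bin files required by the pretraining program. Blog post on data collection and processing: https://zhuanlan.zhihu.com/p/687338497\n\n\n### 🧑‍🤝‍🧑 Community\nYou are welcome to join the discussion group, which already has more than 200 members. Add this WeChat ID to join: a1843450905\n\n\u003cbr\u003e\n\n## 🤖 Pretraining\n### Data collection\nThe datasets used and their links are listed below; see [**this article**](https://zhuanlan.zhihu.com/p/687338497) for a more detailed introduction.\n\n- [Skywork/SkyPile-150B dataset](https://huggingface.co/datasets/Skywork/SkyPile-150B/tree/main/data)\n- [wanjuan1.0 (NLP part)](https://opendatalab.org.cn/OpenDataLab/WanJuan1_dot_0?source=Q1NETg)\n- [Filtered Chinese Wikipedia data](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered)\n- [Baidu Baike data](https://huggingface.co/datasets/xuqinyang/BaiduBaike-5.63M)\n- [Baidu Baike Q\u0026A data](https://aistudio.baidu.com/datasetdetail/107726)\n- [Zhihu Q\u0026A data](https://huggingface.co/datasets/wangrui6/Zhihu-KOL)\n- [BELLE dialogue data](https://github.com/LianjiaTech/BELLE/tree/main/data/10M)\n- [MOSS project dialogue data](https://hf-mirror.com/datasets/YeungNLP/moss-003-sft-data)\n- [firefly1.1M](https://hf-mirror.com/datasets/YeungNLP/firefly-train-1.1M)\n- [starcoder](https://hf-mirror.com/datasets/bigcode/starcoderdata)\n\n### Data processing\n(See [**this article**](https://mp.weixin.qq.com/s/yqmtHLuuNV9075qHgzhcPw) for details.)\n#### step1: Format conversion\n- Source data: four categories of data are converted into a unified format:\n  - Plain text: Baidu Baike (the title and the paragraphs must be merged manually), Chinese Wikipedia\n  - Dialogue (single-turn and multi-turn): Baidu Baike Q\u0026A data, BELLE dialogue data (BELLE_3_5M), MOSS project dialogue data, Zhihu Q\u0026A data, BELLE task data (BELLE_2_5M), firefly1.1M\n  - Code data: starcoder\n  - Other data: the wanjuan and SkyPile datasets need no separate processing\n- Target format: `{\"text\": \"asdfasdf...\"}`, saved as .jsonl files.\n- How to run: `python 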
data/pretrain_data_prepare/step1_data_process.py`\n#### step2: Data processing with data-juicer\n- How to run: `sh data/pretrain_data_prepare/step2/run_step2.sh`\n\n- The specific data-juicer operators used are listed in \u003ca href=\".github/datajuicer_op.md\"\u003ethis document\u003c/a\u003e.\n\n#### step3: Generate the final bin format used for training\nFirst modify filename_sets in the code to specify the data paths, then run the following program:\n\n`python pretrain_modify_from_TinyLlama/scripts/prepare_steel_llm_data.py`\n\nInput data format: jsonl files that contain a 'text' field\n\n### tokenizer\nNo separate tokenizer is trained; the tokenizer of [Qwen/Qwen1.5-MoE-A2.7B-Chat](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B-Chat) is used.\n\n### Model architecture\n(See [**this article**](https://mp.weixin.qq.com/s/JaZyf1jOEOtNDCcFqSj8TQ) for details.)\n\nThe model is based on Qwen1.5, with the following changes:\n- The FFN layer uses softmax MoE, which trains faster at the same parameter count\n- Uses a two-layer SwiGLU\n\n### Pretraining framework\n(See [**this article**](https://mp.weixin.qq.com/s/KPRir6bK3MZZ-vMFTfhUQQ) for details.)\n\nThe TinyLlama pretraining program was improved as follows:\n\n  - Compatible with HuggingFace-format models\n  - Fully restores the data-loading progress when a checkpoint is loaded\n  - Data consistency checks\n  - New data can be appended to the dataset without affecting the data already trained on\n\nStart pretraining:\n\n`python Steel-LLM/pretrain_modify_from_TinyLlama/pretrain/pretrain_steel_llm.py`\n\n### Evaluation\n(See [**this article**](https://mp.weixin.qq.com/s/KK0G0spNw0D9rPUESkHMew) for details.)\n\nSteel-LLM was evaluated on CEVAL and CMMLU. Steel-LLM aims to be a Chinese LLM, and 80% of its training data is Chinese, so it was not tested extensively on English benchmarks.\nThe scores of the other models come from the CEVAL paper, the MiniCPM technical report, the MAP-Neo technical report, and similar sources. Scores for more models can be found in the earlier \u003ca href=\"https://mp.weixin.qq.com/s/KK0G0spNw0D9rPUESkHMew\"\u003eblog post\u003c/a\u003e.\n\n| Model                        | CEVAL  | CMMLU |\n|------------------------------|--------|-------|\n| Steel-LLM-chat-v2            | 41.90  | 36.08 |\n| Steel-LLM-chat-v1            | 38.57  | 33.48 |\n| Tiny-Llama-1.1B              | 25.02  | 24.03 |\n| Gemma-2b-it                  | 32.3   | 33.07 |\n| Phi2(2B)                     | 23.37  | 24.18 |\n| Deepseek-coder-1.3B-instruct | 28.33  | 27.75 |\n| CT-LLM-SFT-2B                | 41.54  | 41.48 |\n| MiniCPM-2B-sft-fp32          | 49.14  | 51.0  |\n| Qwen1.5-1.8B-Chat            | 56.84  | 54.11 |\n| ChatGLM-6B                   | 38.9   | -     |\n| Moss                         | 33.1   | -     |\n| LLAMA-65B                 
   | 34.7   | -     |\n| Qwen-7B                      | 58.96  | 60.35 |\n| Gemma-7B                     | 42.57  | 44.20 |\n| OLMo-7B                      | 35.18  | 35.55 |\n| MAP-NEO-7B                   | 56.97  | 55.01 |\n\n\u003cbr\u003e\n\n## ⛏️ Quick start\n```python\nfrom modelscope import AutoModelForCausalLM, AutoTokenizer\n\nmodel_name = \"zhanshijin/Steel-LLM\"\n\n# Load the model and tokenizer from ModelScope; dtype and device placement are chosen automatically.\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_name,\n    torch_dtype=\"auto\",\n    device_map=\"auto\"\n)\ntokenizer = AutoTokenizer.from_pretrained(model_name)\n\n# Chinese prompt meaning \"Who developed you?\"\nprompt = \"你是谁开发的\"\nmessages = [\n    {\"role\": \"user\", \"content\": prompt}\n]\n# Render the conversation with the model's chat template and append the assistant-turn prefix.\ntext = tokenizer.apply_chat_template(\n    messages,\n    tokenize=False,\n    add_generation_prompt=True\n)\nmodel_inputs = tokenizer([text], return_tensors=\"pt\").to(model.device)\n\ngenerated_ids = model.generate(\n    **model_inputs,\n    max_new_tokens=512\n)\n# Drop the prompt tokens so that only the newly generated tokens are decoded.\ngenerated_ids = [\n    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)\n]\n\nresponse = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]\nprint(response)\n\n```\n\n### Hardware resources\nGPU: 8* H800 80G (about 30 days of training)\n\nGPU: 8* A100 80G (about 60 days of training)\n\nDisk: 4TB\n\n## Citation\n\n**BibTeX:**\n```bibtex\n@article{gu2025steel,\n  title={Steel-LLM: From Scratch to Open Source--A Personal Journey in Building a Chinese-Centric LLM},\n  author={Gu, Qingshui and Li, Shu and Zheng, Tianyu and Zhang, Zhaoxiang},\n  journal={arXiv preprint arXiv:2502.06635},\n  year={2025}\n}\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzhanshijinwat%2FSteel-LLM","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzhanshijinwat%2FSteel-LLM","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzhanshijinwat%2FSteel-LLM/lists"}