{"id":17227053,"url":"https://github.com/peterh0323/ancient-chat-llm","last_synced_at":"2025-10-18T13:32:14.260Z","repository":{"id":219703040,"uuid":"748929855","full_name":"PeterH0323/ancient-chat-llm","owner":"PeterH0323","description":"ancient-chat-llm: A LLM which is proficient in Chinese culture 古语说： 一个精通中国文化的大模型","archived":false,"fork":false,"pushed_at":"2024-02-04T03:25:23.000Z","size":1743,"stargazers_count":30,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-12-03T14:49:19.891Z","etag":null,"topics":["chatbot","chinese","gpt","internlm2","large-language-model","llm","xtuner"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PeterH0323.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-01-27T04:18:42.000Z","updated_at":"2024-11-26T15:21:33.000Z","dependencies_parsed_at":"2024-02-01T09:24:13.667Z","dependency_job_id":"20ecdcb5-50ed-4e80-b2b3-ddc80206e2f7","html_url":"https://github.com/PeterH0323/ancient-chat-llm","commit_stats":null,"previous_names":["peterh0323/ancient-chat-llm"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PeterH0323%2Fancient-chat-llm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PeterH0323%2Fancient-chat-llm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PeterH0323%2Fancient-chat-llm/releases","manifests_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/repositories/PeterH0323%2Fancient-chat-llm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PeterH0323","download_url":"https://codeload.github.com/PeterH0323/ancient-chat-llm/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":236376198,"owners_count":19139155,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chatbot","chinese","gpt","internlm2","large-language-model","llm","xtuner"],"created_at":"2024-10-15T04:17:59.481Z","updated_at":"2025-10-18T13:32:14.155Z","avatar_url":"https://github.com/PeterH0323.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003c!-- for modelscope yaml info\n---\nlanguage:\n- zh\ntags:\n- ancient-chat-llm\n- internlm2\nframeworks:\n- pytorch\ntasks:\n- text-generation\nlicense: Apache License 2.0\n---\n--\u003e\n\n# ancient-chat-llm 古语说: a large language model proficient in Chinese culture\n\n\u003c!-- PROJECT SHIELDS --\u003e\n\u003c!-- \n[![Contributors][contributors-shield]][contributors-url]\n[![Forks][forks-shield]][forks-url]\n[![Issues][issues-shield]][issues-url]\n[![MIT License][license-shield]][license-url]\n[![Stargazers][stars-shield]][stars-url]\n--\u003e\n\n\u003cbr /\u003e\n\u003c!-- PROJECT LOGO --\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/PeterH0323/ancient-chat-llm/\"\u003e\n    \u003cimg src=\"assets/logo.png\" alt=\"Logo\" width=\"30%\"\u003e\n  \u003c/a\u003e\n\n\u003ch3 align=\"center\"\u003eancient-chat-llm\u003c/h3\u003e\n  \u003cp align=\"center\"\u003e\n    \u003cbr 
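/\u003e\n    \u003ca href=\"https://openxlab.org.cn/apps/detail/HinGwenWong/ancient-chat-llm\"\u003eView Demo\u003c/a\u003e\n    ·\n    \u003ca href=\"https://github.com/PeterH0323/ancient-chat-llm/issues\"\u003eReport a Bug \u0026 Request a Feature\u003c/a\u003e\n  \u003c/p\u003e\n\u003c/p\u003e\n\n## Overview\n\n**ancient-chat-llm (古语说)** is a large language model that rewrites modern Chinese input as Classical Chinese and answers users' **questions about Chinese culture**, covering classical works including but not limited to **Tang poetry, Song ci, and the Analects**. It can also **translate Classical Chinese into modern vernacular Chinese**. The model was instruction-tuned from [InternLM2](https://github.com/InternLM/InternLM) with [xtuner](https://github.com/InternLM/xtuner).\n\n**Open-source work is not easy. If this project helps you, please give it a star ⭐ in the top-right corner. Your stars are our greatest encouragement. Thank you!**  \n\n## NEWS\n\n- [2024.2.3] Cleaned the data and released an updated model\n- [2024.1.28] Added a model fine-tuned on poetry, classical texts, and related knowledge\n- [2024.1.16] Released a model fine-tuned on an idiom dataset\n\n## Introduction\n\nChinese culture is vast, profound, and long-standing. From ancient poetry and prose to modern literary and artistic work, it displays the wisdom and creativity of the Chinese nation.\n\n- **Chinese classical texts** are an important part of Chinese civilization. They carry a wealth of historical and cultural information and reflect the character of ancient society and the wisdom of its people. Beyond their great historical value, they are a key window for understanding ancient culture and passing on Chinese civilization. Among them, the Classic of Poetry (《诗经》) is the earliest anthology of Chinese poetry. It collects poems from the early Western Zhou to the middle of the Spring and Autumn period and portrays the lives and feelings of ancient people; its graceful language and profound thought are still recited and studied today. Another important text is the **Analects (《论语》)**, a classic of the Confucian school that records the words, deeds, and ideas of Confucius and his disciples. It stresses core Confucian values such as benevolence and ritual propriety and has deeply influenced culture and society in China and across East Asia. Philosophical classics such as the **Tao Te Ching (《道德经》) and the I Ching (《易经》)**, and works on warfare and strategy such as **The Art of War (《孙子兵法》) and Strategies of the Warring States (《战国策》)**, are also major representatives of ancient Chinese texts.\n\n- **Classical Chinese poetry** carries deep cultural heritage and shines with the wisdom and talent of its poets. Take Li Bai's **《将进酒》 (Bring in the Wine)**: the lines “人生得意须尽欢，莫使金樽空对月” express an open-minded, optimistic attitude toward life that has inspired generations of readers. Verses like these are treasures of classical poetry and a source of pride for Chinese culture, a heritage to be appreciated and passed on.\n\n- **Chinese idioms (chengyu)** have a fixed structure and fixed wording, express a set meaning, and are used in a sentence as a single unit. A large share of them has been handed down from ancient times; each stands for a story or allusion, and some are miniature sentences in their own right. Some come from historical events, such as “完璧归赵” and “负荆请罪”, condensing a whole story into a short form that is easy to understand and remember. Others come from literary works, such as “柳暗花明” and “刻舟求剑”, conveying a deeper lesson through vivid imagery.\n\n**That is why we built this model: we want to teach Chinese culture to a large language model so that it masters as much of it as possible and can help share that culture with the world.**\n\n\n## Demo\n\nDemo URL: https://openxlab.org.cn/apps/detail/HinGwenWong/ancient-chat-llm\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"assets/demo.png\" alt=\"Demo\" 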
width=\"70%\"\u003e\n\u003c/p\u003e\n\nModel comparison: coming soon\n\n\n## Model Zoo\n\n| Model | Base model | Data size | ModelScope(HF) | Transformers(HF) | OpenXLab(HF) |\n| --- | --- | --- | --- | --- | --- |\n| ancient-chat-llm-7b | internlm2-chat-7b | 230013 single-turn conversations | [ModelScope](https://modelscope.cn/models/HinGwenWoong/ancient-chat-7b) | [hugging face](https://huggingface.co/hingwen/ancient-chat-7b) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/HinGwenWong/ancient-chat-llm-7b) |\n\n\n\u003cdetails\u003e\n\u003csummary\u003e Load from ModelScope\u003c/summary\u003e\n\n```python\nimport torch\nfrom modelscope import snapshot_download, AutoTokenizer, AutoModelForCausalLM\nmodel_dir = snapshot_download('HinGwenWoong/ancient-chat-7b')\ntokenizer = AutoTokenizer.from_pretrained(model_dir, device_map=\"auto\", trust_remote_code=True)\n# Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and might cause OOM Error.\nmodel = AutoModelForCausalLM.from_pretrained(model_dir, device_map=\"auto\", trust_remote_code=True, torch_dtype=torch.float16)\nmodel = model.eval()\nresponse, history = model.chat(tokenizer, \"你好\", history=[])\nprint(response)\nresponse, history = model.chat(tokenizer, \"李白简介\", 
history=history)\nprint(response)\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e Load from Hugging Face \u003c/summary\u003e\n\n```python\nimport torch\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\ntokenizer = AutoTokenizer.from_pretrained(\"hingwen/ancient-chat-7b\", trust_remote_code=True)\n# Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and might cause OOM Error.\nmodel = AutoModelForCausalLM.from_pretrained(\"hingwen/ancient-chat-7b\", device_map=\"auto\", trust_remote_code=True, torch_dtype=torch.float16)\nmodel = model.eval()\nresponse, history = model.chat(tokenizer, \"你好\", history=[])\nprint(response)\nresponse, history = model.chat(tokenizer, \"李白简介\", history=history)\nprint(response)\n```\n\n\u003c/details\u003e\n\n## Knowledge Base\n\n- [x] Classical Chinese translation\n- [x] Idioms (chengyu)\n- [x] The Analects (论语)\n- [x] Tang poetry\n- [x] Song ci\n- [x] Chu Ci (楚辞)\n- [x] Four Books and Five Classics (四书五经)\n- [x] Hundred Family Surnames (百家姓)\n- [x] Di Zi Gui (弟子规)\n- [ ] Records of the Grand Historian (史记)\n- [ ] Imperial court system\n- [ ] Twenty-four solar terms\n- [ ] ...\n\n## Environment Setup\n\nThis project is trained with [xtuner](https://github.com/InternLM/xtuner) by fine-tuning [internlm2-chat-7b](https://huggingface.co/internlm/internlm2-chat-7b).\n\n1. Clone this repository\n\n```bash\ngit clone https://github.com/PeterH0323/ancient-chat-llm.git\ncd ancient-chat-llm\n```\n\n2. 
Create the virtual environment\n\n```bash\nconda env create -f environment.yml\nconda activate ancient-chat-llm\npip install -r requirements-raw.txt\n```\n\n## Dataset Preparation\n\nThe open-source datasets currently used are listed below. Data for the remaining knowledge-base topics was collected with web crawlers and similar tools:\n\n- Classical Chinese: https://huggingface.co/datasets/RUCAIBox/Erya-dataset/tree/main\n- Classical poetry: https://github.com/chinese-poetry/chinese-poetry\n\nDataset layout (unused files omitted):\n\n```bash\ndataset/\n├── Erya-dataset\n│   ├── dataset # extracted from finetune.tgz\n│   └── stage_2 # extracted from trans.tgz \n├── chinese-poetry\n│   ├── 五代诗词\n│   ├── 元曲\n│   ├── 全唐诗\n│   ├── 四书五经\n│   ├── 宋词\n│   ├── 幽梦影\n│   ├── 御定全唐詩\n│   ├── 曹操诗集\n│   ├── 楚辞\n│   ├── 水墨唐诗\n│   ├── 纳兰性德\n│   ├── 蒙学\n│   ├── 论语\n│   └── 诗经\n├── gen_dataset.py\n└── whoami.jsonl\n```\n\nGenerate the dataset with the script:\n\n```bash\ncd dataset\npython gen_dataset.py --data-root=./dataset --output=data.jsonl\n```\n\nAfter generation, the jsonl file appears at the path given by `--output`. Sample entries:\n\n```json\n    {\n        \"conversation\": [\n            {\n                \"system\": \"你精通中国文化和中文知识，你总能解答用户关于中国文化和中文的相关知识。\",\n                \"input\": \"帮我翻译成文言文：“往西走一里，绕着新寺西边的峰嘴往北走。”\",\n                \"output\": \"西一里，转新寺西峰之嘴而北。\"\n            }\n        ]\n    },\n    {\n        \"conversation\": [\n            {\n                \"system\": \"你精通中国文化和中文知识，你总能解答用户关于中国文化和中文的相关知识。\",\n                \"input\": \"背诵《将进酒》\",\n                \"output\": \"君不见黄河之水天上来，奔流到海不复回。\\n君不見高堂明鏡悲白髮，朝如青絲暮成雪。\\n人生得意須盡歡，莫使金樽空對月。\\n天生我材必有用，千金散盡還復來。\\n烹羊宰牛且爲樂，會須一飲三百盃。\\n岑夫子，丹丘生，將進酒，君莫停。\\n與君歌一曲，請君爲我側耳聽。\\n鐘鼓饌玉不足貴，但願長醉不願醒。\\n古來聖賢皆寂寞，惟有飲者留其名。\\n陳王昔時宴平樂，斗酒十千恣讙謔。\\n主人何爲言少錢，徑須沽取對君酌。\\n五花馬，千金裘，呼兒將出換美酒，與爾同銷萬古愁。\"\n            }\n        ]\n    },\n    ...\n```\n\n\n\n## Training\n\n1. Before training, add `SYSTEM_TEMPLATE.ancient_chat` to `xtuner/xtuner/utils/templates.py` in the `xtuner` source:\n\n```diff\nSYSTEM_TEMPLATE = ConfigDict(\n    moss_sft=('You are an AI assistant whose name is {bot_name}.\\n'\n              'Capabilities and tools that {bot_name} can possess.\\n'\n              '- Inner thoughts: enabled.\\n'\n              '- Web search: enabled. 
API: Search(query)\\n'\n              '- Calculator: enabled. API: Calculate(expression)\\n'\n              '- Equation solver: enabled. API: Solve(equation)\\n'\n              '- Text-to-image: disabled.\\n'\n              '- Image edition: disabled.\\n'\n              '- Text-to-speech: disabled.\\n'),\n    alpaca=('Below is an instruction that describes a task. '\n            'Write a response that appropriately completes the request.\\n'),\n    arxiv_gentile=('If you are an expert in writing papers, please generate '\n                   \"a good paper title for this paper based on other authors' \"\n                   'descriptions of their abstracts.\\n'),\n    colorist=('You are a professional color designer. Please provide the '\n              'corresponding colors based on the description of Human.\\n'),\n    coder=('You are a professional programer. Please provide the '\n           'corresponding code based on the description of Human.\\n'),\n    lawyer='你现在是一名专业的中国律师，请根据用户的问题给出准确、有理有据的回复。\\n',\n    medical='如果你是一名医生，请根据患者的描述回答医学问题。\\n',\n    sql=('If you are an expert in SQL, please generate a good SQL Query '\n         'for Question based on the CREATE TABLE statement.\\n'),\n+    ancient_chat=\"你精通中国文化和中文知识，你总能解答用户关于中国文化和中文的相关知识。\\n\",\n)\n```\n\n2. In `./finetune_configs/internlm2_chat_7b/internlm2_chat_7b_qlora_custom_data_finetune.py`, change the dataset path and the model path to your local paths\n\n```diff\n# Model\n- pretrained_model_name_or_path = 'internlm/internlm2-7b'\n+ pretrained_model_name_or_path = '/path/to/internlm/internlm2-7b' # Optional: if the model is already downloaded, use its absolute path\n\n# Data\n- data_path = 'timdettmers/openassistant-guanaco'\n+ data_path = '/path/to/data.jsonl' # Absolute path to the jsonl file generated in the dataset step\nprompt_template = PROMPT_TEMPLATE.default\nmax_length = 2048\npack_to_max_length = True\n```\n\n3. 
Start training with the following command:\n\n```bash\nxtuner train finetune_configs/internlm2_chat_7b/internlm2_chat_7b_qlora_custom_data_finetune.py --deepspeed deepspeed_zero2\n```\n\nNote: if you run out of GPU memory, reduce `batch_size` and `max_length`; if plenty of memory is left, increase them.\n\n## Deployment\n\n### Web Demo Deployment\n\n1. Convert the pth checkpoint to Hugging Face format\n\n```bash\nxtuner convert pth_to_hf ./finetune_configs/internlm2_chat_7b/internlm2_chat_7b_qlora_custom_data_finetune.py \\\n                         ./work_dirs/internlm2_chat_7b_qlora_custom_data_finetune/epoch_10.pth \\\n                         ./work_dirs/internlm2_chat_7b_qlora_custom_data_finetune/epoch_10_hf\n```\n\n2. Merge the fine-tuned weights with the base model to produce a new model\n\n```bash\nexport MKL_SERVICE_FORCE_INTEL=1 # Works around the error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1 library.\nxtuner convert merge /path/to/internlm2-chat-7b \\\n                     ./work_dirs/internlm2_chat_7b_qlora_custom_data_finetune/epoch_10_hf \\\n                     ./work_dirs/internlm2_chat_7b_qlora_custom_data_finetune/epoch_10_merge\n```\n\n3. Launch the web demo\n\n```bash\nstreamlit run web_demo.py --server.address=0.0.0.0 --server.port 7860\n```\n\n\u003c!-- # You can also start it directly from the command line (CLI)\nxtuner chat ./work_dirs/internlm2_chat_7b_qlora_custom_data_finetune/epoch_10_merge \\\n            --prompt-template internlm2_chat \\\n            --system-template ancient_chat --\u003e\n\n### LMDeploy \n\n1. Install lmdeploy\n\n```bash\npip install 'lmdeploy[all]==v0.2.1'\n```\n\n2. Run 4-bit quantization\n\n```bash\nlmdeploy lite auto_awq ./work_dirs/internlm2_chat_7b_qlora_custom_data_finetune/epoch_10_merge \\\n                       --calib-dataset 'c4' \\\n                       --calib-samples 128 \\\n                       --calib-seqlen 2048 \\\n                       --w-bits 4 \\\n                       --w-group-size 128 \\\n                       --work-dir ./work_dirs/internlm2_chat_7b_qlora_custom_data_finetune/epoch_10_merge-4bit\n```\n\n## Model Evaluation\n\nEvaluation uses the [opencompass](https://github.com/open-compass/opencompass) framework.\n\n1. 
Set up the environment\n\n```bash\ngit clone https://github.com/open-compass/opencompass\ncd opencompass\npip install -e .\nexport PYTHONPATH=$(pwd)\n```\n\n2. Run the evaluation\n\n- CEval\n\n```bash\npython run.py --datasets ceval_gen \\\n              --hf-path ./work_dirs/internlm2_chat_7b_qlora_custom_data_finetune/epoch_10_merge/ \\\n              --tokenizer-path ./work_dirs/internlm2_chat_7b_qlora_custom_data_finetune/epoch_10_merge/ \\\n              --tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True \\\n              --model-kwargs trust_remote_code=True device_map='auto' \\\n              --max-seq-len 2048 \\\n              --max-out-len 16 \\\n              --batch-size 4 \\\n              --num-gpus 1 \\\n              --debug\n```\n\nEvaluation results: [ceval_gen](eval_report/ceval_gen)\n\n## TODO\n\n- [x] Quantize the model\n- [x] Keep iterating on the model\n- [ ] Clean the dataset\n- [ ] Use other large models to augment the dataset\n\n## Afterword\n\nThis is a personal learning project and still has many shortcomings. For example, the dataset has not yet been carefully tuned, and the model sometimes produces incorrect punctuation.\n\nDiscussion is welcome. If you have datasets to share, please leave a note in an issue.\n\n## 💕 Acknowledgements\n\n- [**xtuner**](https://github.com/InternLM/xtuner)\n\nThanks to the InternLM (书生·浦语) LLM hands-on camp run by Shanghai AI Laboratory for providing valuable technical guidance and powerful compute support for this project.\n\n## License\n\nThis project is released under the [Apache License 2.0](https://github.com/PeterH0323/ancient-chat-llm/LICENSE). Please also comply with the licenses of the models and datasets it uses.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpeterh0323%2Fancient-chat-llm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpeterh0323%2Fancient-chat-llm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpeterh0323%2Fancient-chat-llm/lists"}