{"id":21158174,"url":"https://github.com/liucongg/unilmchatchitrobot","last_synced_at":"2025-10-13T09:13:49.596Z","repository":{"id":40608615,"uuid":"285762710","full_name":"liucongg/UnilmChatchitRobot","owner":"liucongg","description":"Unilm for Chinese Chitchat Robot.基于Unilm模型的夸夸式闲聊机器人项目。","archived":false,"fork":false,"pushed_at":"2021-01-21T01:49:28.000Z","size":10363,"stargazers_count":158,"open_issues_count":4,"forks_count":32,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-09-03T02:34:56.006Z","etag":null,"topics":["chatbot","chinese","generation","nlp","unilm"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/liucongg.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-08-07T07:07:18.000Z","updated_at":"2025-08-07T12:25:33.000Z","dependencies_parsed_at":"2022-08-19T03:00:41.782Z","dependency_job_id":null,"html_url":"https://github.com/liucongg/UnilmChatchitRobot","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/liucongg/UnilmChatchitRobot","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liucongg%2FUnilmChatchitRobot","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liucongg%2FUnilmChatchitRobot/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liucongg%2FUnilmChatchitRobot/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liucongg%2FUnilmChatchitRobot/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/liucongg","downloa
d_url":"https://codeload.github.com/liucongg/UnilmChatchitRobot/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liucongg%2FUnilmChatchitRobot/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279014484,"owners_count":26085535,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-13T02:00:06.723Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chatbot","chinese","generation","nlp","unilm"],"created_at":"2024-11-20T12:16:57.317Z","updated_at":"2025-10-13T09:13:49.580Z","avatar_url":"https://github.com/liucongg.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Unilm for Chinese Chitchat Robot\nA compliment-style (kuakua) chitchat robot project based on the Unilm model.\n\n## Project Description\n* This project is a compliment-style (kuakua) chitchat robot based on the Unilm model.\n* The currently released model was trained only on data from Douban kuakua (praise) groups, hence the name. If you are interested, you can also use this project's code to train on other dialogue corpora.\n* For a detailed introduction, see the Zhihu article: [Unilm dialogue generation for a compliment-style chitchat robot (夸夸式闲聊机器人之Unilm对话生成)](https://zhuanlan.zhihu.com/p/170358507).\n* During final dialogue generation, sensitive words in the generated text are filtered out.\n\n## File Structure\n* kuakua_robot_model: directory where the trained model is saved (for the download, see Data and Model)\n* unilm_model: directory of the pretrained Unilm model (for the download, see [Unilm pretrained model](https://github.com/YunwenTechnology/Unilm))\n* data_dir: folder that holds the data\n   * dirty_words.txt: sensitive-word dictionary\n   * douban_kuakua_qa.txt: raw Douban kuakua group corpus\n   * sample.json: training-set sample; the raw data must be converted to this format (the project does not provide preprocessing code)\n* configuration_unilm.py: config file for the Unilm model\n* modeling_unilm.py: Unilm model file\n* run_train.py: model training script\n* interactive_conditional_samples.py: 
prediction script; generates dialogue with the trained model\n\n## Requirements\n* transformers == 3.0.2\n* pytorch \u003e= 1.4\n\n## Data and Model\n* The raw corpus comes from Douban kuakua group data; see data/douban_kuakua_qa.txt.\n* After cleaning, more than 60,000 single-turn dialogues remain; for the data format, see data/sample.json.\n* Trained model download: [link](https://pan.baidu.com/s/1OWxxeGC2rndwVCGav8kbqQ), extraction code: [e0au]()\n\n## Train\n```\nnohup python3 -u run_train.py --data_dir data/ \\\n                              --src_file kuakua_data.json \\\n                              --model_type unilm \\\n                              --model_name_or_path unilm_model/ \\\n                              --output_dir kuakua_robot_model/ \\\n                              --max_seq_length 128 \\\n                              --max_position_embeddings 512 \\\n                              --do_train \\\n                              --do_lower_case \\\n                              --train_batch_size 32 \\\n                              --learning_rate 2e-5 \\\n                              --logging_steps 100 \\\n                              --num_train_epochs 10 \u003e log.log 2\u003e\u00261 \u0026\n```\nTraining was run on a 2060s machine with a batch_size of 32 for 10 epochs.\nThe loss curve is shown below:\n![avatar](image/tensorboard.png)\n\nJudging from the loss curve, the model has not fully converged and the loss still has considerable room to fall, so further training should improve the results.\n\n## Test\n```\npython3 interactive_conditional_samples.py --device 0 \\\n                                           --topk 3 \\\n                                           --topp 0.95 \\\n                                           --dirty_path data/dirty_words.txt \\\n                                           --model_name_or_path kuakua_robot_model/ \\\n                                           --repetition_penalty 1.2 \\\n                                           --max_len 32 \\\n                                           --no_cuda False\n```\nAdjust topk, topp, and repetition_penalty to change the quality of the generated responses.\n\n## 
Result\n\nsample1\n```\nuser:工作不开心\nchatbot:你可以的！工作不开心也可以去吃火锅，看个影响自己心情的电视剧\nuser:工作不开心\nchatbot:我也是，不过现在开始努力找工作啦！一起加油鸭～\nuser:工作不开心\nchatbot:那就祝福你，能找到喜欢的工作吧。\nuser:工作不开心\nchatbot:你也要加油鸭！\nuser:工作不开心\nchatbot:我也是，不过总有一天就会好起来的\n```\nsample2\n```\nuser:我分手了\nchatbot:分手快乐，你会遇到更好的。\nuser:我分手了\nchatbot:旧的不去新的不来，你会遇见更好的\nuser:我分手了\nchatbot:分的好，下一个更乖\nuser:我分手了\nchatbot:旧的不去新的不来\nuser:我分手了\nchatbot:分手快乐，祝你快乐！\nuser:我分手了\nchatbot:旧的不去新的不来，加油！\n```\nsample3\n```\nuser:好像去吃火锅，但是我在减肥\nchatbot:你好棒！一定会瘦下来的！\nuser:好像去吃火锅，但是我在减肥\nchatbot:吃火锅是为了能更好的减肥，你真棒！\nuser:好像去吃火锅，但是我在减肥\nchatbot:吃火锅对胃好哦！我也在减肥，一起加油呀\nuser:好像去吃火锅，但是我在减肥\nchatbot:吃火锅对减肥有什么效果？我也想去吃，但是现在没勇气去吃\nuser:好像去吃火锅，但是我在减肥\nchatbot:你真的很棒，我也在减肥\n```\n\n## Limitations\nThe model's biggest problem is that training did not take context into account: it was trained on single-turn dialogues only, so its responses may be off-topic.\n\n## Future Work\n- [ ] May later add detailed comments to the project and add the data preprocessing code.\n- [ ] May later train a better Unilm-based chitchat dialogue model on a larger dataset.\n- [ ] May later take context into account and train a model that supports multi-turn dialogue.\n\n\n## References\n* [Unilm](https://github.com/YunwenTechnology/Unilm)\n\n## Citing\n```\n@misc{UnilmChatchitRobot,\n  author = {Cong Liu},\n  title = {Unilm for Chinese Chitchat Robot},\n  year = {2019},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  url=\"https://github.com/liucongg/UnilmChatchitRobot\",\n}\n```\n\n## Contact the Author\n* e-mail: logcongcong@gmail.com\n* Zhihu: [刘聪NLP](https://www.zhihu.com/people/LiuCongNLP)\n* Zhihu column: [NLP工作站](https://zhuanlan.zhihu.com/c_1131882304422936576)\n* Github: [liucongg](https://github.com/liucongg)\n* WeChat official account: [NLP工作站]()\n\n![](image/logcong.png)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fliucongg%2Funilmchatchitrobot","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fliucongg%2Funilmchatchitrobot","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fliucongg%2Funilmchatchitrobot/lists"}