{"id":15027470,"url":"https://github.com/candlewill/dialog_corpus","last_synced_at":"2025-05-15T20:05:11.317Z","repository":{"id":41063253,"uuid":"84949566","full_name":"candlewill/Dialog_Corpus","owner":"candlewill","description":"用于训练中英文对话系统的语料库 Datasets for Training Chatbot System","archived":false,"fork":false,"pushed_at":"2020-09-23T21:06:45.000Z","size":101533,"stargazers_count":2047,"open_issues_count":2,"forks_count":496,"subscribers_count":83,"default_branch":"master","last_synced_at":"2025-05-15T20:04:55.115Z","etag":null,"topics":["chatbot","corpus","dataset","dialog","system"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/candlewill.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-03-14T13:01:29.000Z","updated_at":"2025-04-23T05:47:43.000Z","dependencies_parsed_at":"2022-07-14T07:10:29.486Z","dependency_job_id":null,"html_url":"https://github.com/candlewill/Dialog_Corpus","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/candlewill%2FDialog_Corpus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/candlewill%2FDialog_Corpus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/candlewill%2FDialog_Corpus/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/candlewill%2FDialog_Corpus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/candlewill","download_url":"https://codeload.github.com/candlewill/Dialog_Corpus/tar.gz/refs/he
ads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254414499,"owners_count":22067272,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chatbot","corpus","dataset","dialog","system"],"created_at":"2024-09-24T20:06:30.053Z","updated_at":"2025-05-15T20:05:03.065Z","avatar_url":"https://github.com/candlewill.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Chinese and English Corpora for Dialogue Systems\nDatasets for Training Chatbot System\n\u003cbr\u003eThis project collects dialogue corpora found online for training Chinese (and English) chatbots.\n\n### Public corpora\nThe datasets collected so far are listed below; each link leads to the original source.\n\n1. [dgk_shooter_min.conv.zip](https://github.com/rustch3n/dgk_lost_conv)\n\u003cbr\u003eChinese movie dialogue corpus. It is fairly noisy, and many utterances are not correctly paired as question and answer.\n\n2. [The NUS SMS Corpus](https://github.com/kite1988/nus-sms-corpus)\n\u003cbr\u003eContains Chinese and English SMS messages; reportedly the largest public SMS corpus in the world.\n\n3. [ChatterBot basic Chinese chat corpus](https://github.com/gunthercox/chatterbot-corpus/tree/master/chatterbot_corpus/data)\n\u003cbr\u003eA small set of basic Chinese chat data provided by the ChatterBot chat engine; small in volume but relatively high in quality.\n\n4. [Datasets for Natural Language Processing](https://github.com/karthikncode/nlp-datasets)\n\u003cbr\u003eA collection of NLP datasets gathered by someone else, covering three main areas: Question Answering, Dialogue Systems, and Goal-Oriented Dialogue Systems. All text is English; it could be machine-translated into Chinese for use in Chinese dialogue systems.\n\n5. [Xiaohuangji (小黄鸡)](https://github.com/rustch3n/dgk_lost_conv/tree/master/results)\n\u003cbr\u003eReportedly the corpus behind the Xiaohuangji chatbot: xiaohuangji50w_fenciA.conv.zip (word-segmented) and xiaohuangji50w_nofenci.conv.zip (not segmented)\n\n6. 
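[Egret (白鹭时代) Chinese question-answering corpus](https://github.com/Samurais/egret-wenda-corpus)\n\u003cbr\u003eCompiled from the 10,000+ questions in the Q\u0026A board of the official Egret forum, keeping only records marked with a “best answer”. The raw data was reviewed manually so that each question has one acceptable answer. The corpus currently contains only 2907 question-answer pairs. ([backup](./egret-wenda-corpus.zip))\n\n7. [Chat corpus repository](https://github.com/Marsan-Ma/chat_corpus)\n\u003cbr\u003echat corpus collection from various open sources\n\u003cbr\u003eIncludes: open subtitles, English movie subtitles, Chinese lyrics, and English tweets\n\n8. [Insurance industry QA corpus](https://github.com/Samurais/insuranceqa-corpus-zh)\n\u003cbr\u003eA dataset produced by translating [insuranceQA](https://github.com/shuzi/insuranceQA). train_data contains 12,889 questions and 141,779 records, with positive:negative = 1:10; test_data contains 2,000 questions and 22,000 records, with positive:negative = 1:10; valid_data contains 2,000 questions and 22,000 records, with positive:negative = 1:10\n\n### Unreleased corpora\n\nThese corpora are circulated or mentioned online, but we have not yet obtained them, either because of our own limitations or because the original authors have not released them. They are listed here only for future searching.\n\n1. Microsoft Xiaoice (微软小冰)\n\n### Copyright\n\nAll original corpora belong to their original authors.\n\n### Contact\n\n[何云超 (Yunchao He)](yunchaohe@gmail.com)\n\u003cbr\u003eweibo: [@Yunchao_He](http://weibo.com/heyunchao)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcandlewill%2Fdialog_corpus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcandlewill%2Fdialog_corpus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcandlewill%2Fdialog_corpus/lists"}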