{"id":26483821,"url":"https://github.com/chatopera/efaqa-corpus-raw","last_synced_at":"2025-09-19T22:48:24.224Z","repository":{"id":216882614,"uuid":"742684450","full_name":"chatopera/efaqa-corpus-raw","owner":"chatopera","description":"Emotional First Aid Raw Dataset, 心理咨询问答原始语料库","archived":false,"fork":false,"pushed_at":"2024-01-13T04:11:00.000Z","size":1090,"stargazers_count":14,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-03-14T06:46:47.798Z","etag":null,"topics":["chatbot","corpus","data","nlp","psychological"],"latest_commit_sha":null,"homepage":"https://github.com/chatopera/efaqa-corpus-raw","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chatopera.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2024-01-13T03:55:08.000Z","updated_at":"2025-01-21T12:51:40.000Z","dependencies_parsed_at":"2024-01-13T12:20:11.374Z","dependency_job_id":null,"html_url":"https://github.com/chatopera/efaqa-corpus-raw","commit_stats":null,"previous_names":["chatopera/efaqa-corpus-raw"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chatopera%2Fefaqa-corpus-raw","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chatopera%2Fefaqa-corpus-raw/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chatopera%2Fefaqa-corpus-raw/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chatopera%2Fefaqa-corpus-raw/manifests","owner_url":"https://repos.ecosy
ste.ms/api/v1/hosts/GitHub/owners/chatopera","download_url":"https://codeload.github.com/chatopera/efaqa-corpus-raw/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244554067,"owners_count":20471173,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chatbot","corpus","data","nlp","psychological"],"created_at":"2025-03-20T04:58:11.077Z","updated_at":"2025-09-19T22:48:19.187Z","avatar_url":"https://github.com/chatopera.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# Psychological Counseling Corpora\n\n| Corpus | Links | Description |\n| --- | --- | --- |\n| Emotional First Aid Dataset (psychological counseling Q\u0026A corpus) | [GitHub](https://github.com/chatopera/efaqa-corpus-zh), [Gitee](https://gitee.com/chatopera/efaqa-corpus-zh) | Manually annotated multi-turn dialogues |\n| Emotional First Aid Raw Dataset (raw psychological counseling Q\u0026A corpus) | [GitHub](https://github.com/chatopera/efaqa-corpus-raw), [Gitee](https://gitee.com/chatopera/efaqa-corpus-raw) | Crawled raw data, not annotated |\n\n# Emotional First Aid Raw Dataset\n\n[![PyPI pyversions](https://img.shields.io/pypi/pyversions/efaqa-corpus-raw.svg)](https://pypi.python.org/pypi/efaqa-corpus-raw/) [![PyPI download month](https://img.shields.io/pypi/dm/efaqa-corpus-raw.svg)](https://pypi.python.org/pypi/efaqa-corpus-raw/) [![PyPI version shields.io](https://img.shields.io/pypi/v/efaqa-corpus-raw.svg)](https://pypi.python.org/pypi/efaqa-corpus-raw/)  [![License](https://cdndownload2.chatopera.com/cskefu/licenses/chunsong1.0.svg)](https://www.cskefu.com/licenses/v1.html \"Open source license agreement\")\n\nLet artificial intelligence better serve humanity.\n\u003cdiv 
align=left\u003e\n\n\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;-- Hai Liang W., [@hailiang-wang](https://github.com/hailiang-wang), [Chatopera](https://www.chatopera.com/) \n\u003c/div\u003e\n\n心理咨询问答原始语料库，仅限研究用途。\n\n心理咨询问答原始语料库（以下也称为“本数据集”，“本语料库”）是为应用人工智能技术于心理咨询领域制作的高品质语料，语料是爬取心理咨询、心理健康领域公开的网站的数据，经过整理和脱敏制作而成。消息总文本达**四千四百多万字符**。\n\n爬取开放数据网站，比如给\\*心理、简\\*心理、豆\\*讨论组等。目前，一些网站已经关闭了数据的开放访问，使得本语料库更有宝贵价值。\n\n![img](./assets/screenshot_20240113084135.jpg)\n\n## 数据格式\n\n以下为每条的数据格式说明：\n\n| 根节点 Key | 数组子元素 | 示例 | 描述 |\n| --- | --- | --- | --- |\n| title | - | `最近感觉好困好累，感觉好压抑` | 发布者发起的话题 |\n| date | - | `2017-12-31 21:20:25` | 发布者发布的时间 |\n| owner | - | `匿名` | 发布者昵称 |\n| id | - | `5e6b9b94d037ed455ee9c9d7` | 唯一标识 ID |\n| chats |  |  | 针对话题的交流，元素为 JSONArray，按照发生时间升序排列，即越靠近现在的 index 越大，越排在数组的后面，格式见下 |\n| | sender | `audience` 或 `owner` | 发布者角色，`audience` 代表评论者，`owner` 代表发布者 |\n| | name | `Audience1`, `Audience2` | 当 `sender` 为`audience`时存在，本评论发布者的名字（脱敏后） |\n| | time | `21:20:44` | 发布的时刻 | \n| | value | `您好` | 评论内容 |\n\n其中，每个话题都只有一个发布者 `owner`；数据进行了必要的脱敏，比如去掉了原始的爬取的 URL 地址、去掉了图片信息、重新生成了评论者的名称等。\n\n## 数据示例\n\n```\n{\n  \"title\": \"女 最近感觉好困好累，感觉好压抑，没有人理解自己，好多好多问题弄得我自己身心疲惫，活着好累啊。人为什么要活着啊，最好躺在那里永远不要起来\",\n  \"date\": \"2017-12-31 21:20:25\",\n  \"owner\": \"匿名\",\n  \"id\": \"5e6b9b94d037ed455ee9c9d7\",\n  \"chats\": [\n    {\n      \"sender\": 
\"audience\",\n      \"value\": \"您好！\",\n      \"time\": \"21:20:44\",\n      \"name\": \"Audience5\"\n    },\n    {\n      \"sender\": \"audience\",\n      \"value\": \"您今年多大了？这种好累的感觉有多久？\",\n      \"time\": \"21:22:13\",\n      \"name\": \"Audience3\"\n    },\n    {\n      \"sender\": \"audience\",\n      \"value\": \"你好，理解你的心情\",\n      \"time\": \"21:27:07\",\n      \"name\": \"Audience1\"\n    },\n    {\n      \"sender\": \"audience\",\n      \"value\": \"您好！发生了什么有影响的事件了吗？\",\n      \"time\": \"21:28:51\",\n      \"name\": \"Audience10\"\n    },\n    {\n      \"time\": \"07:26:01\",\n      \"sender\": \"owner\",\n      \"value\": \"很多事情，老公的不理解，婆婆的无理取闹，大姑姐也闹，做的我身心疲惫\"\n    },\n    {\n      \"time\": \"07:26:45\",\n      \"sender\": \"owner\",\n      \"value\": \"如果没有孩子这日子没法过了\"\n    },\n    {\n      \"sender\": \"audience\",\n      \"value\": \"请升级你的软件否则无法收到信息\",\n      \"time\": \"08:13:41\",\n      \"name\": \"Audience9\"\n    }\n  ]\n}\n```\n\n## Corpus Size\n\nStatistics for this corpus ([Emotional First Aid Raw Dataset](https://github.com/chatopera/efaqa-corpus-raw)):\n\nTopics: 172,316 (every topic has comments)\n\nTotal messages: 2,381,273 (topics plus comments)\n\nMessage text size: 44,514,786 (total characters across all topics and comments)\n\nAverage comments per topic: 12.8\n\nThis corpus is also the source for the [Emotional First Aid Dataset (efaqa-corpus-zh)](https://github.com/chatopera/efaqa-corpus-zh): that dataset is the result of manual annotation on top of this raw corpus. Because of the heavy workload, only part of the raw corpus has been annotated.\n\n## Download and Installation\n\nInstall the package and download the corpus files.\n\n### 1/3 Install Source Code Package\n\n```bash\npip install -U efaqa-corpus-raw\n```\n\n### 2/3 Configure License ID\n\nFirst, obtain the license ID (证书标识) of a license purchased from the [license store](https://store.chatopera.com/product/efaqa002): on the license detail page in the store, click 【复制证书标识】 (Copy License ID).\n\n![img2](./assets/screenshot_20240113112212.png)\n\nNext, set the environment variable.\n\n* For Shell Users\n\ne.g. Shell, CMD Scripts on Linux, Windows, macOS.\n\n```bash\n# Linux / macOS\nexport EFAQA_RAW_LICENSE=YOUR_LICENSE\n## e.g. 
if your license id is `FOOBAR`, run `export EFAQA_RAW_LICENSE=FOOBAR`\n\n# Windows\n## 1/2 Command Prompt\nset EFAQA_RAW_LICENSE=YOUR_LICENSE\n## 2/2 PowerShell\n$env:EFAQA_RAW_LICENSE='YOUR_LICENSE'\n```\n\n* For Python Code Users\n\nJupyter Notebook, etc.\n\n```python\nimport os\nos.environ[\"EFAQA_RAW_LICENSE\"] = \"YOUR_LICENSE\"\n_licenseid = os.environ.get(\"EFAQA_RAW_LICENSE\", None)\nprint(\"EFAQA_RAW_LICENSE=\", _licenseid)\n```\n\n### 3/3 Download Corpus Package\n\nFinally, run the following command to download the corpus package files.\n\n```bash\npython -c \"import efaqa_corpus_raw\"\n```\n\n**Note: the corpus files are downloaded on first use after installation; download speed depends on your network.**\n\n## Loading and Reading\n\n```python\nimport efaqa_corpus_raw\ndata = efaqa_corpus_raw.corpus\nfor conversation in data:\n    print(conversation[\"id\"], conversation[\"title\"])\n```\n\n## Notice\n\n**This dataset may not be resold or shared with any person or organization other than the purchaser. If this happens, our company will actively defend its rights, and the infringer will bear legal and financial liability.** Respecting intellectual property is everyone's responsibility.\n\nThe data and code may be used for research. Any published content, such as media, journals, magazines, or blogs, must cite this work and include its URL.\n\n```\n@online{EfaqaCorpusRaw:chatopera2024,\n  author = {Hai Liang Wang},\n  title = {心理咨询问答原始语料库efaqa-corpus-raw},\n  year = 2024,\n  url = {https://github.com/chatopera/efaqa-corpus-raw},\n  urldate = {2024-01-13}\n}\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchatopera%2Fefaqa-corpus-raw","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchatopera%2Fefaqa-corpus-raw","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchatopera%2Fefaqa-corpus-raw/lists"}