{"id":13512019,"url":"https://github.com/SophonPlus/ChineseNlpCorpus","last_synced_at":"2025-03-30T21:31:29.359Z","repository":{"id":37664733,"uuid":"126583907","full_name":"SophonPlus/ChineseNlpCorpus","owner":"SophonPlus","description":"Collecting, organizing, and publishing Chinese natural language processing corpora and datasets, and working with like-minded contributors to advance Chinese natural language processing.","archived":false,"fork":false,"pushed_at":"2019-01-29T11:21:37.000Z","size":11755,"stargazers_count":5873,"open_issues_count":26,"forks_count":1399,"subscribers_count":117,"default_branch":"master","last_synced_at":"2024-11-01T13:35:57.695Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SophonPlus.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-03-24T09:21:56.000Z","updated_at":"2024-11-01T09:10:30.000Z","dependencies_parsed_at":"2022-07-13T18:20:58.946Z","dependency_job_id":null,"html_url":"https://github.com/SophonPlus/ChineseNlpCorpus","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SophonPlus%2FChineseNlpCorpus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SophonPlus%2FChineseNlpCorpus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SophonPlus%2FChineseNlpCorpus/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SophonPlus%2FChineseNlpCorpus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SophonPlus","download_url":"https://codeload.github.com/SophonPlus/ChineseNlpCorpus/tar.gz/refs/heads/m
aster","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246385377,"owners_count":20768667,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T03:01:25.304Z","updated_at":"2025-03-30T21:31:24.331Z","avatar_url":"https://github.com/SophonPlus.png","language":"Jupyter Notebook","readme":"# ChineseNlpCorpus\nCollecting, organizing, and publishing Chinese natural language processing corpora and datasets, and working with like-minded contributors to advance Chinese natural language processing.\n\n## Sentiment / Opinion / Review Polarity Analysis\n\n| Dataset | Overview | Download |\n| ----- | -------- | ------- |\n| ChnSentiCorp_htl_all | Over 7,000 hotel reviews: over 5,000 positive and over 2,000 negative | [View](./datasets/ChnSentiCorp_htl_all/intro.ipynb) |\n| waimai_10k | User reviews collected from a food-delivery platform: 4,000 positive and about 8,000 negative | [View](./datasets/waimai_10k/intro.ipynb) |\n| online_shopping_10_cats | 10 categories, over 60,000 reviews in total, about 30,000 each positive and negative,\u003cbr /\u003e covering books, tablets, mobile phones, fruit, shampoo, water heaters, Mengniu, clothing, computers, and hotels | [View](./datasets/online_shopping_10_cats/intro.ipynb) |\n| weibo_senti_100k | Over 100,000 sentiment-labeled Sina Weibo posts, about 50,000 each positive and negative | [View](./datasets/weibo_senti_100k/intro.ipynb) |\n| simplifyweibo_4_moods | Over 360,000 sentiment-labeled Sina Weibo posts covering 4 emotions:\u003cbr /\u003e about 200,000 for joy, and about 50,000 each for anger, disgust, and low mood | [View](./datasets/simplifyweibo_4_moods/intro.ipynb) |\n| dmsc_v2 | 28 movies, over 700,000 users, over 2 million ratings/reviews | [View](./datasets/dmsc_v2/intro.ipynb) |\n| yf_dianping | 240,000 restaurants, 540,000 users, 4.4 million reviews/ratings | [View](./datasets/yf_dianping/intro.ipynb) |\n| yf_amazon | 520,000 products, over 1,100 categories, 1.42 million users, 7.2 million reviews/ratings | [View](./datasets/yf_amazon/intro.ipynb) |\n\n## Chinese Named Entity Recognition\n\n| Dataset | Overview | Download |\n| ----- | -------- | ------- |\n| dh_msra | Over 50,000 annotated Chinese named entity recognition samples (locations, organizations, persons) | [View](./datasets/dh_msra/intro.ipynb) |\n\n## 
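Recommender Systems\n\n| Dataset | Overview | Download |\n| ----- | -------- | ------- |\n| ez_douban | Over 50,000 movies (over 30,000 with titles, over 20,000 without), 28,000 users, 2.8 million ratings | [View](./datasets/ez_douban/intro.ipynb) |\n| dmsc_v2 | 28 movies, over 700,000 users, over 2 million ratings/reviews | [View](./datasets/dmsc_v2/intro.ipynb) |\n| yf_dianping | 240,000 restaurants, 540,000 users, 4.4 million reviews/ratings | [View](./datasets/yf_dianping/intro.ipynb) |\n| yf_amazon | 520,000 products, over 1,100 categories, 1.42 million users, 7.2 million reviews/ratings | [View](./datasets/yf_amazon/intro.ipynb) |\n\n## FAQ Question Answering Systems\n\n| Dataset | Overview | Download |\n| ----- | -------- | ------- |\n| 保险知道 (insurance QA) | Over 8,000 insurance-industry question-and-answer records, including user questions, community answers, and best answers | [View](./datasets/baoxianzhidao/intro.ipynb) |\n| 安徽电信知道 (Anhui Telecom QA) | 156,000 telecom question-and-answer records, including user questions, community answers, and best answers | [View](./datasets/anhuidianxinzhidao/intro.ipynb) |\n| 金融知道 (finance QA) | 770,000 finance-industry question-and-answer records, including user questions, community answers, and best answers | [View](./datasets/financezhidao/intro.ipynb) |\n| 法律知道 (law QA) | 36,000 legal question-and-answer records, including user questions, community answers, and best answers | [View](./datasets/lawzhidao/intro.ipynb) |\n| 联通知道 (China Unicom QA) | 203,000 China Unicom question-and-answer records, including user questions, community answers, and best answers | [View](./datasets/liantongzhidao/intro.ipynb) |\n| 农行知道 (Agricultural Bank of China QA) | 40,000 Agricultural Bank of China question-and-answer records, including user questions, community answers, and best answers | [View](./datasets/nonghangzhidao/intro.ipynb) |\n| 保险知道 (insurance QA) | 588,000 insurance-industry question-and-answer records, including user questions, community answers, and best answers | [View](./datasets/baoxianzhidao/intro.ipynb) |\n\n\n\n## Join Us\n\n- Vision: serve 3 billion people with AI products and technology\n- Team: geek spirit and technology-driven; building technology with a human touch to make the world better\n- Product: automated marketing bots for specific industry verticals, with strong customer demand and broad prospects\n- Positions: [Applied researcher, natural language human-computer interaction](./docs/recruit/researcher.md), [NLP algorithm engineer](./docs/recruit/engineer.md), [System architect (AI products)](./docs/recruit/architect.md)\n\n![](./docs/images/recruit/recruit_banner.png)","funding_links":[],"categories":["Jupyter Notebook","NLP","Corpus 中文语料"],"sub_categories":["Multi-Modal Representation \u0026 Retrieval 多模态表征与检索"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSophonPlus%2FChineseNlpCorpus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FSophonPlus%2FChineseNlpCorpus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSophonPlus%2FChineseNlpCorpus/lists"}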