{"id":19119439,"url":"https://github.com/cluebenchmark/cluener2020","last_synced_at":"2025-04-08T09:12:49.226Z","repository":{"id":37490759,"uuid":"231908902","full_name":"CLUEbenchmark/CLUENER2020","owner":"CLUEbenchmark","description":"CLUENER2020 中文细粒度命名实体识别 Fine Grained Named Entity Recognition","archived":false,"fork":false,"pushed_at":"2022-11-21T08:05:14.000Z","size":888,"stargazers_count":1481,"open_issues_count":59,"forks_count":303,"subscribers_count":19,"default_branch":"master","last_synced_at":"2025-04-01T08:39:00.571Z","etag":null,"topics":["albert","bert","chinese","chinese-ner","chinesener","dataset","fine-grained-ner","named-entity-recognition","ner","roberta","seq2seq","sequence-labeling","sequence-to-sequence"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2001.04351","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CLUEbenchmark.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-01-05T11:44:46.000Z","updated_at":"2025-04-01T05:42:19.000Z","dependencies_parsed_at":"2023-01-21T12:01:34.381Z","dependency_job_id":null,"html_url":"https://github.com/CLUEbenchmark/CLUENER2020","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CLUEbenchmark%2FCLUENER2020","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CLUEbenchmark%2FCLUENER2020/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CLUEbenchmark%2FCLUENER2020/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CLUEbenchmark%2F
CLUENER2020/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CLUEbenchmark","download_url":"https://codeload.github.com/CLUEbenchmark/CLUENER2020/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247809964,"owners_count":20999816,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["albert","bert","chinese","chinese-ner","chinesener","dataset","fine-grained-ner","named-entity-recognition","ner","roberta","seq2seq","sequence-labeling","sequence-to-sequence"],"created_at":"2024-11-09T05:09:41.238Z","updated_at":"2025-04-08T09:12:49.199Z","avatar_url":"https://github.com/CLUEbenchmark.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"  # CLUENER: Fine-Grained Named Entity Recognition \n  \n  **For more details, see our \u003ca href='https://github.com/CLUEbenchmark/CLUENER2020/blob/master/CLUENER2020_paper.pdf'\u003etechnical report\u003c/a\u003e: https://arxiv.org/abs/2001.04351**\n  ![./pics/header.png](https://github.com/CLUEbenchmark/CLUENER2020/blob/master/cluener.png)\n\n\u003ca href=\"https://www.cluebenchmarks.com/clueai.html\"\u003eclueai toolkit: NLP development in three minutes with three lines of code (zero-shot learning)\u003c/a\u003e\n\n  ## Data categories:\n    The data is annotated with 10 label categories: address (address), book title (book), company (company), game (game), government (government), movie (movie), person name (name), organization (organization), position (position), and scenic spot (scene).\n\n  ## Category definitions \u0026 annotation rules:\n    Address (address): ** Province ** City ** District ** Street No. **, ** Road, ** Subdistrict, ** Village, etc. (each is also labeled when it appears on its own). Addresses are annotated as completely as possible, down to the finest level of detail.\n    Book title (book): Novels, magazines, exercise books, textbooks, study guides, atlases, cookbooks, and other books sold in bookstores, including e-books.\n    Company (company): ** Company, ** Group, ** Bank (except the central bank and the People's Bank of China, both of which are government bodies), e.g. 新东方 (New Oriental); also includes outlets such as 新华网 (Xinhuanet) and 中国军网 (China Military Online).\n    Game (game): 
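Common games. Note that some games are adapted from novels or TV series, so examine the specific context to decide whether the mention actually refers to a game.\n    Government (government): Covers administrative bodies at two levels, central and local. Central bodies include the State Council, its constituent departments (the ministries and commissions, the People's Bank of China, and the National Audit Office), agencies directly under the State Council (e.g. the customs, taxation, industry and commerce, and environmental protection administrations), the military, etc.\n    Movie (movie): Films, including documentaries released in cinemas. When a film is adapted from a book, use the surrounding context to decide whether the mention is the film title or the book title.\n    Person name (name): Generally a person's name, including characters in novels (宋江, 武松, 郭靖), nicknames of fictional characters (及时雨, 花和尚), and aliases of famous people, provided the alias maps to one specific person.\n    Organization (organization): Basketball teams, football teams, orchestras, clubs and societies, etc.; also includes sects and gangs in novels, e.g. 少林寺, 丐帮, 铁掌帮, 武当, 峨眉.\n    Position (position): Historical official titles such as 巡抚 (provincial governor), 知州 (prefect), and 国师 (imperial preceptor); modern roles such as general manager, reporter, president, artist, and collector.\n    Scenic spot (scene): Common tourist attractions such as 长沙公园 (Changsha Park), 深圳动物园 (Shenzhen Zoo), aquariums, botanical gardens, the Yellow River, and the Yangtze River.\n  \n  ## Data download:\n  \u003ca href='https://www.cluebenchmarks.com/introduce.html'\u003eData download\u003c/a\u003e\n    \n  ## Data distribution:\n    Training set: 10748\n    Validation set: 1343\n\n    The training set breaks down by label category as follows (note: every entity that occurs in an example is annotated, so an example containing two address entities counts twice toward the address category):\n    [Training set] Label distribution:\n    Address (address):2829\n    Book title (book):1131\n    Company (company):2897\n    Game (game):2325\n    Government (government):1797\n    Movie (movie):1109\n    Person name (name):3661\n    Organization (organization):3075\n    Position (position):3052\n    Scenic spot (scene):1462\n\n    [Validation set] Label distribution:\n    Address (address):364\n    Book title (book):152\n    Company (company):366\n    Game (game):287\n    Government (government):244\n    Movie (movie):150\n    Person name (name):451\n    Organization (organization):344\n    Position (position):425\n    Scenic spot (scene):199\n\n\n  ## Data fields:\n    Taking train.json as an example, each record has two fields: text \u0026 label. The text field holds the sentence, and the label field holds every entity in the sentence that belongs to one of the 10 categories.\n    For example:\n      text: \"北京勘察设计协会副会长兼秘书长周荫如\"\n      label: {\"organization\": {\"北京勘察设计协会\": [[0, 7]]}, \"name\": {\"周荫如\": [[15, 17]]}, \"position\": {\"副会长\": [[8, 10]], \"秘书长\": [[12, 14]]}}\n      Here organization, name, and position are the entity categories.\n      \"organization\": {\"北京勘察设计协会\": [[0, 7]]}: in the original text, \"北京勘察设计协会\" is an entity of category organization, with start_index 0 and end_index 7 (note: indices are 0-based, and in this example end_index is the index of the entity's last character)\n      \"name\": {\"周荫如\": [[15, 17]]}: in the original text, \"周荫如\" is an entity of category name, with start_index 15 and end_index 17\n      \"position\": {\"副会长\": [[8, 10]], \"秘书长\": [[12, 14]]}: in the original text, \"副会长\" is an entity of category position, with start_index 8 and end_index 10; \"秘书长\" is likewise an entity of category position,\n      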
with start_index 12 and end_index 14\n\n## Data source:\n    This dataset is built on THUCTC, the open-source text classification dataset from Tsinghua University: a subset of the data was selected and annotated with fine-grained named entities. The original data comes from Sina News RSS.\n\n## Results comparison\n\n  | Model    | \u003ca href='https://www.cluebenchmarks.com/ner.html'\u003eOnline F1\u003c/a\u003e |\n|:-------------:|:-----:|\n| Bert-base   |  78.82  |\n| RoBERTa-wwm-large-ext | 80.42 |\n| Bi-Lstm + CRF | 70.00 |\n\nPer-entity evaluation results (F1 score):\n\n| Entity   | bilstm+crf | bert-base | roberta-wwm-large-ext | Human Performance |\n|:-------------:|:-----:|:-----:|:-----:|:-----:|\n| Person Name   | 74.04 | 88.75 | **89.09** | 74.49 |\n| Organization  | 75.96 | 79.43 | **82.34** | 65.41 |\n| Position      | 70.16 | 78.89 | **79.62** | 55.38 |\n| Company       | 72.27 | 81.42 | **83.02** | 49.32 |\n| Address       | 45.50 | 60.89 | **62.63** | 43.04 |\n| Game          | 85.27 | 86.42 | **86.80** | 80.39 |\n| Government    | 77.25 | 87.03 | **88.17** | 79.27 |\n| Scene         | 52.42 | 65.10 | **70.49** | 51.85 |\n| Book          | 67.20 | 73.68 | **74.60** | 71.70 |\n| Movie         | 78.97 | 85.82 | **87.46** | 63.21 |\n| Overall@Macro |   70.00 | 78.82  | **80.42** | 63.41  |\n\n## Baseline models (one-click run)\n\n  1. TensorFlow BERT-family models: \u003ca href='https://github.com/CLUEbenchmark/CLUENER2020/tree/master/tf_version'\u003etf_version\u003c/a\u003e\n  (test, f1 80.42) \n  \n  2. PyTorch baseline: \u003ca href='https://github.com/CLUEbenchmark/CLUENER2020/tree/master/pytorch_version'\u003epytorch_version\u003c/a\u003e(79.63) \n \n  3. BiLSTM+CRF baseline: \u003ca href=\"https://github.com/CLUEbenchmark/CLUENER2020/tree/master/bilstm_crf_pytorch\"\u003ebilstm+crf\u003c/a\u003e\n  (test, f1 70.0) \n\n#### QQ group for technical exchange and discussion: 836811304\n\n\n#### Cite Us\n\nIf the contents of this repository help your research, please cite the following report: https://arxiv.org/abs/2001.04351\n```\n@article{xu2020cluener2020,\n  title={CLUENER2020: Fine-grained Name Entity Recognition for Chinese},\n  author={Xu, Liang and Dong, Qianqian and Yu, Cong and Tian, Yin and Liu, Weitang and Li, Lu and Zhang, Xuanwei},\n  journal={arXiv 
preprint arXiv:2001.04351},\n  year={2020}\n }\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcluebenchmark%2Fcluener2020","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcluebenchmark%2Fcluener2020","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcluebenchmark%2Fcluener2020/lists"}