{"id":19467371,"url":"https://github.com/thunlp/few-nerd","last_synced_at":"2025-04-05T15:06:16.299Z","repository":{"id":40264752,"uuid":"365198924","full_name":"thunlp/Few-NERD","owner":"thunlp","description":"Code and data of ACL 2021 paper \"Few-NERD: A Few-shot Named Entity Recognition Dataset\"","archived":false,"fork":false,"pushed_at":"2023-09-07T09:34:19.000Z","size":69,"stargazers_count":396,"open_issues_count":3,"forks_count":54,"subscribers_count":11,"default_branch":"main","last_synced_at":"2025-03-29T14:08:25.778Z","etag":null,"topics":["deep-learning","entity-typing","few-shot-learning","named-entity-recognition","nlp"],"latest_commit_sha":null,"homepage":"https://ningding97.github.io/fewnerd","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thunlp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-05-07T10:34:41.000Z","updated_at":"2025-03-28T02:08:09.000Z","dependencies_parsed_at":"2024-08-03T09:17:18.338Z","dependency_job_id":null,"html_url":"https://github.com/thunlp/Few-NERD","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thunlp%2FFew-NERD","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thunlp%2FFew-NERD/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thunlp%2FFew-NERD/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thunlp%2FFew-NERD/manifests","owner_url":"
https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thunlp","download_url":"https://codeload.github.com/thunlp/Few-NERD/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247353745,"owners_count":20925329,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","entity-typing","few-shot-learning","named-entity-recognition","nlp"],"created_at":"2024-11-10T18:34:55.826Z","updated_at":"2025-04-05T15:06:16.283Z","avatar_url":"https://github.com/thunlp.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n\n# Few-NERD: Not Only a Few-shot NER Dataset\n\n![](https://img.shields.io/github/last-commit/thunlp/Few-NERD?color=green) ![](https://img.shields.io/badge/contributions-welcome-red) ![](https://img.shields.io/github/issues/thunlp/Few-NERD?color=yellow) \n\n\nThis is the source code of the ACL-IJCNLP 2021 paper:  [**Few-NERD: A Few-shot Named Entity Recognition Dataset**](https://arxiv.org/abs/2105.07464). Check out the [website](https://ningding97.github.io/fewnerd/) of Few-NERD. \n\n\n\n\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\* **Updates** \\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\\*\n\n- 09/03/2022: We have added the training script for supervised training using BERT tagger. 
Run `bash data/download.sh supervised` to download the data, and then run `bash run_supervised.sh`.\n\n- 01/09/2021: We have revised the results of the supervised setting of Few-NERD on arXiv, thanks to the help of [PedroMLF](https://github.com/PedroMLF).\n\n- 19/08/2021: **Important💥** Along with the released episode data, we have updated the training script. Simply add `--use_sampled_data` when running `train_demo.py` to train and test on the released episode data.\n\n- 02/06/2021: To simplify training, we have released the data sampled by episode. Click [here](https://cloud.tsinghua.edu.cn/f/0e38bd108d7b49808cc4/?dl=1) to download. The files are named as follows: `{train/dev/test}_{N}_{K}.jsonl`. We sampled 20000, 1000, and 5000 episodes for train, dev, and test, respectively.\n\n- 26/05/2021: The current Few-NERD (SUP) is sentence-level. We will soon release Few-NERD (SUP) 1.1, which is paragraph-level and contains more contextual information.\n\n- 11/06/2021: We have modified the word tokenization and we will soon update the latest results. We sincerely thank [tingtingma](https://github.com/mtt1998) and [Chandan Akiti](https://github.com/chandan047).\n\n\n\n## Contents\n\n- [Website](https://ningding97.github.io/fewnerd/)\n- [Overview](#overview)\n- [Getting Started](#requirements)\n  - [Requirements](#requirements)\n  - [Few-NERD Dataset](#few-nerd-dataset)\n    - [Get the Data](#get-the-data)\n    - [Data Format](#data-format)\n  - [Structure](#structure)\n  - [Key Implementations](#Key-Implementations)\n    - [N way K~2K shot Sampler](#Sampler)\n  - [How to Run](#How-to-Run)\n- [Citation](#Citation)\n- [Connection](#Connection)\n\n## Overview\n\nFew-NERD is a large-scale, fine-grained manually annotated named entity recognition dataset, which contains *8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities and 4,601,223 tokens*. 
Three benchmark tasks are built: one is supervised, Few-NERD (SUP), and the other two are few-shot, Few-NERD (INTRA) and Few-NERD (INTER).  \n\nThe schema of Few-NERD is:\n\n\u003cimg src=\"https://ftp.bmp.ovh/imgs/2021/05/30bd39a84c96e12a.png\" width=\"40%\" align=\"center\"/\u003e\n\n\n\nFew-NERD is manually annotated based on the context. For example, in the sentence \"*London is the fifth album by the British rock band…*\", the named entity `London` is labeled as `Art-Music`.\n\n\n\n## Requirements\n\nRun the following command to install the remaining dependencies:\n\n```shell\npip install -r requirements.txt\n```\n\n## Few-NERD Dataset \n\n### Get the Data\n\n- Few-NERD contains 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities and 4,601,223 tokens.\n- We have split the data into three training modes: one for the supervised setting, `supervised`, and two for the few-shot setting, `inter` and `intra`. Each contains three files: `train.txt`, `dev.txt`, and `test.txt`. The `supervised` datasets are randomly split. The `inter` datasets are randomly split within each coarse type, i.e., each file contains all 8 coarse types but different fine-grained types. The `intra` datasets are randomly split by coarse type.\n- The split dataset is downloaded automatically once you run the model. **To download the data manually, run `data/download.sh` with the parameter `supervised`, `inter`, or `intra` to indicate the type of the dataset.**\n\nTo obtain the three benchmark datasets of Few-NERD, run the script `data/download.sh` with the parameter `supervised`, `inter`, or `intra`, as below:\n\n```shell\nbash data/download.sh supervised\n```\n\nTo get the data sampled by episode, run:\n\n```shell\nbash data/download.sh episode-data\nunzip -d data/ data/episode-data.zip\n```\n\n### Data Format\n\nThe data are pre-processed into the typical NER format shown below (`token\\tlabel`). 
\n\n```latex\nBetween\tO\n1789\tO\nand\tO\n1793\tO\nhe\tO\nsat\tO\non\tO\na\tO\ncommittee\tO\nreviewing\tO\nthe\tO\nadministrative\tMISC-law\nconstitution\tMISC-law\nof\tMISC-law\nGalicia\tMISC-law\nto\tO\nlittle\tO\neffect\tO\n.\tO\n```\n\n## Structure\n\nThe structure of our project is:\n\n```shell\n--util\n| -- framework.py\n| -- data_loader.py\n| -- viterbi.py             # viterbi decoder for structshot only\n| -- word_encoder\n| -- fewshotsampler.py\n\n-- proto.py                 # prototypical model\n-- nnshot.py                # nnshot model\n\n-- train_demo.py            # main training script\n```\n\n\n\n## Key Implementations\n\n#### Sampler\n\nAs described in our paper, we design an *N way K~2K shot* sampling strategy; the implementation is in `util/fewshotsampler.py`.\n\n#### ProtoBERT\n\nPrototypical nets with BERT are implemented in `model/proto.py`.\n\n#### NNShot \u0026 StructShot\n\nNNShot with BERT is implemented in `model/nnshot.py`. \n\nStructShot is realized by adding an extra Viterbi decoder in `util/framework.py`. \n\n**Note that the backbone BERT encoder we used for the StructShot model is not pre-trained on the NER task.**\n\n## How to Run\n\nRun `train_demo.py`. The arguments are presented below. 
The default parameters are for the `proto` model on the `inter` mode dataset.\n\n```shell\n-- mode                 training mode, must be inter, intra, or supervised\n-- trainN               N in train\n-- N                    N in val and test\n-- K                    K shot\n-- Q                    Num of query per class\n-- batch_size           batch size\n-- train_iter           num of iters in training\n-- val_iter             num of iters in validation\n-- test_iter            num of iters in testing\n-- val_step             val after training how many iters\n-- model                model name, must be proto, nnshot or structshot\n-- max_length           max length of tokenized sentence\n-- lr                   learning rate\n-- weight_decay         weight decay\n-- grad_iter            accumulate gradient every x iterations\n-- load_ckpt            path to load model\n-- save_ckpt            path to save model\n-- fp16                 use nvidia apex fp16\n-- only_test            no training process, only test\n-- ckpt_name            checkpoint name\n-- seed                 random seed\n-- pretrain_ckpt        bert pre-trained checkpoint\n-- dot                  use dot instead of L2 distance in distance calculation\n-- use_sgd_for_bert     use SGD instead of AdamW for BERT.\n# only for structshot\n-- tau                  StructShot parameter to re-normalize the transition probabilities\n```\n\n- For the hyperparameter `--tau` in structshot, we use `0.32` in the 1-shot setting, `0.318` in the 5-way-5-shot setting, and `0.434` in the 10-way-5-shot setting.\n\n- Taking the `structshot` model on the `inter` dataset as an example, the experiments can be run as follows.\n\n**5-way-1~5-shot**\n\n```bash\npython3 train_demo.py  --mode inter \\\n--lr 1e-4 --batch_size 8 --trainN 5 --N 5 --K 1 --Q 1 \\\n--train_iter 10000 --val_iter 500 --test_iter 5000 --val_step 1000 \\\n--max_length 64 --model structshot --tau 0.32\n```\n\n**5-way-5~10-shot**\n\n```bash\npython3 train_demo.py  --mode inter 
\\\n--lr 1e-4 --batch_size 1 --trainN 5 --N 5 --K 5 --Q 5 \\\n--train_iter 10000 --val_iter 500 --test_iter 5000 --val_step 1000 \\\n--max_length 32 --model structshot --tau 0.318\n```\n\n**10-way-1~5-shot**\n\n```bash\npython3 train_demo.py  --mode inter \\\n--lr 1e-4 --batch_size 4 --trainN 10 --N 10 --K 1 --Q 1 \\\n--train_iter 10000 --val_iter 500 --test_iter 5000 --val_step 1000 \\\n--max_length 64 --model structshot --tau 0.32\n```\n\n**10-way-5~10-shot**\n\n```bash\npython3 train_demo.py  --mode inter \\\n--lr 1e-4 --batch_size 1 --trainN 10 --N 10 --K 5 --Q 1 \\\n--train_iter 10000 --val_iter 500 --test_iter 5000 --val_step 1000 \\\n--max_length 32 --model structshot --tau 0.434\n```\n\n\n\n## Citation\n\nIf you use Few-NERD in your work, please cite our paper:\n\n```bibtex\n@inproceedings{ding-etal-2021-nerd,\n    title = \"Few-{NERD}: A Few-shot Named Entity Recognition Dataset\",\n    author = \"Ding, Ning  and\n      Xu, Guangwei  and\n      Chen, Yulin  and\n      Wang, Xiaobin  and\n      Han, Xu  and\n      Xie, Pengjun  and\n      Zheng, Haitao  and\n      Liu, Zhiyuan\",\n    booktitle = \"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)\",\n    month = aug,\n    year = \"2021\",\n    address = \"Online\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://aclanthology.org/2021.acl-long.248\",\n    doi = \"10.18653/v1/2021.acl-long.248\",\n    pages = \"3198--3213\",\n}\n```\n\n## License\n\nFew-NERD dataset is distributed under the CC BY-SA 4.0 license. 
The code is distributed under the Apache 2.0 license.\n\n\n## Connection\n\nIf you have any questions, feel free to contact:\n\n- [dingn18@mails.tsinghua.edu.cn](mailto:dingn18@mails.tsinghua.edu.cn)\n- [yl-chen21@mails.tsinghua.edu.cn](mailto:yl-chen21@mails.tsinghua.edu.cn)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthunlp%2Ffew-nerd","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthunlp%2Ffew-nerd","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthunlp%2Ffew-nerd/lists"}