{"id":28676510,"url":"https://github.com/zjunlp/lrebench","last_synced_at":"2025-06-13T23:04:59.066Z","repository":{"id":59119184,"uuid":"487758317","full_name":"zjunlp/LREBench","owner":"zjunlp","description":"[EMNLP 2022 Findings] Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study","archived":false,"fork":false,"pushed_at":"2024-02-23T09:20:52.000Z","size":1239,"stargazers_count":34,"open_issues_count":0,"forks_count":1,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-04-13T18:53:03.659Z","etag":null,"topics":["benchmark","chinese","data-augmentation","data-augumentation","dataset","efficient","emnlp","few-shot","information-extraction","kg","knowledge-graph","knowprompt","long-tail","low-resource","lrebench","prompt","re","relation-extraction","self-training"],"latest_commit_sha":null,"homepage":"https://zjunlp.github.io/project/LREBench","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zjunlp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-05-02T07:43:24.000Z","updated_at":"2025-02-18T08:41:25.000Z","dependencies_parsed_at":"2024-02-12T09:08:32.299Z","dependency_job_id":"31d5be9b-4e4d-4141-b36b-7a48b02d3279","html_url":"https://github.com/zjunlp/LREBench","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/zjunlp/LREBench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zjunlp%2FLREBench","tags_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/repositories/zjunlp%2FLREBench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zjunlp%2FLREBench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zjunlp%2FLREBench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zjunlp","download_url":"https://codeload.github.com/zjunlp/LREBench/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zjunlp%2FLREBench/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259732772,"owners_count":22903087,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","chinese","data-augmentation","data-augumentation","dataset","efficient","emnlp","few-shot","information-extraction","kg","knowledge-graph","knowprompt","long-tail","low-resource","lrebench","prompt","re","relation-extraction","self-training"],"created_at":"2025-06-13T23:04:58.410Z","updated_at":"2025-06-13T23:04:59.037Z","avatar_url":"https://github.com/zjunlp.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LREBench: A low-resource relation extraction benchmark.\n\nThis repo is the official implementation of the EMNLP 2022 (Findings) paper *[LREBench: Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study](https://arxiv.org/pdf/2210.10678.pdf)* [[poster](https://drive.google.com/file/d/1APDfO3gn27LWckDWTMdXr24WeOl0rfF9/view)]. 
\n\nThis paper presents an empirical study to build relation extraction systems in low-resource settings. Based upon recent PLMs, three schemes are comprehensively investigated to evaluate the performance in low-resource settings: $(i)$ different types of prompt-based methods with few-shot labeled data;  $(ii)$ diverse balancing methods to address the long-tailed distribution issue; $(iii)$ data augmentation technologies and self-training to generate more labeled in-domain data.\n\n\u003cdiv align=center\u003e\n\u003cimg src=\"figs/intro.png\" alt=\"intro\" width=70% height=70% /\u003e\n\u003c/div\u003e\n\n\n## Contents\n\n- [LREBench](#lrebench)\n  - [Environment](#environment)\n  - [Datasets](#datasets)\n  - [Normal Prompt-based Tuning](#normal-prompt-based-tuning)\n    - [1 Initialize Answer Words](#1-initialize-answer-words)\n    - [2 Split Datasets](#2-split-datasets)\n    - [3 Prompt-based Tuning](#3-prompt-based-tuning)\n    - [4 Different prompts](#4-different-prompts)\n  - [Balancing](#balancing)\n    - [1 Re-sampling](#1-re-sampling)\n    - [2 Re-weighting Loss](#2-re-weighting-loss)\n  - [Data Augmentation](#data-augmentation)\n    - [1 Prepare the environment](#1-prepare-the-environment)\n    - [2 Try different DA methods](#2-try-different-da-methods)\n  - [Self-training for Semi-supervised learning](#self-training-for-semi-supervised-learning)\n  - [Standard Fine-tuning Baseline](#standard-fine-tuning-baseline)\n\n\n## Environment\n\nTo install requirements:\n\n```shell\n\u003e\u003e conda create -n LREBench python=3.9\n\u003e\u003e conda activate LREBench\n\u003e\u003e pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu113\n```\n\n\n## Datasets\n\nWe provide 8 benchmark datasets and prompts used in our experiments.\n\n- [SemEval](https://github.com/zjunlp/KnowPrompt/tree/master/dataset/semeval)\n- [TACREV](https://github.com/zjunlp/KnowPrompt/tree/master/dataset/tacrev)\n- 
[Wiki80](https://github.com/thunlp/OpenNRE/blob/master/benchmark/download_wiki80.sh)\n- [SciERC](http://nlp.cs.washington.edu/sciIE/)\n- [ChemProt](https://github.com/ncbi-nlp/BLUE_Benchmark)\n- [DialogRE](https://dataset.org/dialogre/)\n- [DuIE2.0](https://www.luge.ai/#/luge/dataDetail?id=5)\n- [CMeIE](https://tianchi.aliyun.com/dataset/dataDetail?dataId=95414)\n\nAll processed full-shot datasets can be [downloaded](https://drive.google.com/drive/folders/1OXxMr4SXUhehJx1XmDdrhJAppfo_ZNUh?usp=sharing) and need to be placed in the [*dataset*](dataset) folder.\nEach dataset folder is expected to contain **rel2id.json**, **train.json** and **test.json**.\n\n\n## Normal Prompt-based Tuning\n\n\u003cdiv align=center\u003e\n\u003cimg src=\"figs/prompt.png\" alt=\"prompt\" width=70% height=70% /\u003e\n\u003c/div\u003e\n\n### 1 Initialize Answer Words\n\nUse the command below to get answer words first.\n\n```shell\n\u003e\u003e python get_label_word.py --modelpath roberta-large --dataset semeval\n```\n\nThe file `{modelpath}_{dataset}.pt` will be saved in the *dataset* folder. Set `modelpath` and `dataset` to the names of the pre-trained language model and the dataset you intend to use.\n\n### 2 Split Datasets\n\nWe provide sampling code for obtaining the 8-shot ([sample_8shot.py](sample_8shot.py)) and 10% ([sample_10.py](sample_10.py)) datasets; the remaining instances are used as unlabeled data for self-training. If a class has fewer than 8 instances, it is removed from the training and test sets when sampling 8-shot datasets, and **new_test.json** and **new_rel2id.json** are produced. 
\n\n```shell\n\u003e\u003e python sample_8shot.py -h\n    usage: sample_8shot.py [-h] --input_dir INPUT_DIR --output_dir OUTPUT_DIR\n\n    optional arguments:\n      -h, --help            show this help message and exit\n      --input_dir INPUT_DIR, -i INPUT_DIR\n                            The directory of the training file.\n      --output_dir OUTPUT_DIR, -o OUTPUT_DIR\n                            The directory of the sampled files.\n\u003e\u003e python sample_10.py -h\n    usage: sample_10.py [-h] --input_file INPUT_FILE --output_dir OUTPUT_DIR\n\n    optional arguments:\n      -h, --help            show this help message and exit\n      --input_file INPUT_FILE, -i INPUT_FILE\n                            The directory of the training file.\n      --output_dir OUTPUT_DIR, -o OUTPUT_DIR\n                            The directory of the sampled files.\n```\n\nFor example:\n\n```shell\n\u003e\u003e python sample_8shot.py -i dataset/semeval -o dataset/semeval/8-shot\n\u003e\u003e cd dataset/semeval\n\u003e\u003e mkdir 8-1\n\u003e\u003e cp 8-shot/new_rel2id.json 8-1/rel2id.json\n\u003e\u003e cp 8-shot/new_test.json 8-1/test.json\n\u003e\u003e cp 8-shot/train_8_1.json 8-1/train.json\n\u003e\u003e cp 8-shot/unlabel_8_1.json 8-1/label.json\n```\n\n\n### 3 Prompt-based Tuning\n\nAll running scripts for each dataset are in the *scripts* folder. 
\nFor example, train *KnowPrompt* on SemEval, CMeIE and ChemProt with the following commands:\n\n```shell\n\u003e\u003e bash scripts/semeval.sh  # RoBERTa-large\n\u003e\u003e bash scripts/CMeIE.sh    # Chinese RoBERTa-large\n\u003e\u003e bash scripts/ChemProt.sh # BioBERT-large\n```\n\n\n### 4 Different prompts\n\n\u003cdiv align=center\u003e\n\u003cimg src=\"figs/prompts.png\" alt=\"prompts\" width=70% height=70% /\u003e\n\u003c/div\u003e\n\nSimply add parameters to the scripts.\n\nTemplate Prompt: `--use_template_words 0`\n\nSchema Prompt: `--use_template_words 0 --use_schema_prompt True`\n\nPTR: refer to [PTR](https://github.com/thunlp/PTR)\n\n\n## Balancing\n\n\u003cdiv align=center\u003e\n\u003cimg src=\"figs/balance.png\" alt=\"balance\" width=40% height=40% /\u003e\n\u003c/div\u003e\n\n### 1 Re-sampling\n\n- Create the re-sampled training file based on the 10% training set by *resample.py*.\n  \n  ```shell\n  \u003e\u003e python resample.py -h\n      usage: resample.py [-h] --input_file INPUT_FILE --output_dir OUTPUT_DIR --rel_file REL_FILE\n  \n      optional arguments:\n        -h, --help            show this help message and exit\n        --input_file INPUT_FILE, -i INPUT_FILE\n                              The path of the training file.\n        --output_dir OUTPUT_DIR, -o OUTPUT_DIR\n                              The directory of the sampled files.\n        --rel_file REL_FILE, -r REL_FILE\n                              the path of the relation file\n  ```\n  \n  For example,\n  \n  ```shell\n  \u003e\u003e mkdir dataset/semeval/10sa-1\n  \u003e\u003e python resample.py -i dataset/semeval/10/train10per_1.json -r dataset/semeval/rel2id.json -o dataset/semeval/sa\n  \u003e\u003e cd dataset/semeval\n  \u003e\u003e cp rel2id.json test.json 10sa-1/\n  \u003e\u003e cp sa/sa_1.json 10sa-1/train.json\n  ```\n\n### 2 Re-weighting Loss\n\nSimply add the *useloss* parameter to the script to choose a re-weighting loss.\n\nFor example: `--useloss 
MultiFocalLoss`.\n(choices: MultiDSCLoss, MultiFocalLoss, GHMC_Loss, LDAMLoss)\n\n\n## Data Augmentation\n\n\u003cdiv align=center\u003e\n\u003cimg src=\"figs/DA.png\" alt=\"DA\" width=70% height=70% /\u003e\n\u003c/div\u003e\n\n### 1 Prepare the environment\n\n```shell\n\u003e\u003e pip install nlpaug nlpcda\n```\n\nPlease follow the instructions from [nlpaug](https://github.com/makcedward/nlpaug#installation) and [nlpcda](https://github.com/425776024/nlpcda) for more information (Thanks a lot!).\n\n### 2 Try different DA methods\n\nWe provide several data augmentation methods:\n\n- English (nlpaug): TF-IDF, contextual word embedding (BERT and RoBERTa), and WordNet synonym replacement (-lan en, -d).\n- Chinese (nlpcda): synonym replacement (-lan cn)\n- All DA methods can be applied to contexts, entities, or both (--locations). \n- Generate augmented data\n  ```shell\n  \u003e\u003e python DA.py -h\n      usage: DA.py [-h] --input_file INPUT_FILE --output_dir OUTPUT_DIR --language {en,cn}\n                    [--locations {sent1,sent2,sent3,ent1,ent2} [{sent1,sent2,sent3,ent1,ent2} ...]]\n                    [--DAmethod {word2vec,TF-IDF,word_embedding_bert,word_embedding_roberta,random_swap,synonym}]\n                    [--model_dir MODEL_DIR] [--model_name MODEL_NAME] [--create_num CREATE_NUM] [--change_rate CHANGE_RATE]\n  \n      optional arguments:\n        -h, --help            show this help message and exit\n        --input_file INPUT_FILE, -i INPUT_FILE\n                              the training set file\n        --output_dir OUTPUT_DIR, -o OUTPUT_DIR\n                              The directory of the sampled files.\n        --language {en,cn}, -lan {en,cn}\n                              DA for English or Chinese\n        --locations {sent1,sent2,sent3,ent1,ent2} [{sent1,sent2,sent3,ent1,ent2} ...], -l {sent1,sent2,sent3,ent1,ent2} [{sent1,sent2,sent3,ent1,ent2} ...]\n                              List of positions that you want to manipulate\n        --DAmethod 
{word2vec,TF-IDF,word_embedding_bert,word_embedding_roberta,random_swap,synonym}, -d {word2vec,TF-IDF,word_embedding_bert,word_embedding_roberta,random_swap,synonym}\n                              Data augmentation method\n        --model_dir MODEL_DIR, -m MODEL_DIR\n                              the path of pretrained models used in DA methods\n        --model_name MODEL_NAME, -mn MODEL_NAME\n                              model from huggingface\n        --create_num CREATE_NUM, -cn CREATE_NUM\n                              The number of samples augmented from one instance.\n        --change_rate CHANGE_RATE, -cr CHANGE_RATE\n                              the changing rate of text\n  ```\n\n  For example, to run context-level DA based on contextual word embedding on ChemProt:\n\n  ```shell\n  python DA.py \\\n      -i dataset/ChemProt/10/train10per_1.json \\\n      -o dataset/ChemProt/aug \\\n      -lan en \\\n      -d word_embedding_bert \\\n      -mn dmis-lab/biobert-large-cased-v1.1 \\\n      -l sent1 sent2 sent3\n  ```\n\n- Remove duplicate instances and produce the final augmented data\n\n  ```shell\n  \u003e\u003e python merge_dataset.py -h\n  usage: merge_dataset.py [-h] [--input_files INPUT_FILES [INPUT_FILES ...]] [--output_file OUTPUT_FILE]\n  \n  optional arguments:\n    -h, --help            show this help message and exit\n    --input_files INPUT_FILES [INPUT_FILES ...], -i INPUT_FILES [INPUT_FILES ...]\n                          List of input files containing datasets to merge\n    --output_file OUTPUT_FILE, -o OUTPUT_FILE\n                          Output file containing merged dataset\n  ```\n\n  For example:\n\n  ```bash\n  python merge_dataset.py \\\n      -i dataset/ChemProt/train10per_1.json dataset/ChemProt/aug/aug.json \\\n      -o dataset/ChemProt/aug/merge.json\n  ```\n\n## Self-training for Semi-supervised learning\n\n\u003cdiv align=center\u003e\n\u003cimg src=\"figs/self-training.png\" alt=\"st\" width=70% height=70% /\u003e\n\u003c/div\u003e\n\n- Train a teacher 
model on a small amount of labeled data (8-shot or 10%)\n- Place the unlabeled data **label.json** in the corresponding dataset folder.\n- Assign pseudo labels using the trained teacher model: add `--labeling True` to the script and obtain the pseudo-labeled dataset **label2.json**.\n- Combine the gold-labeled data and the pseudo-labeled data. For example:\n  ```shell\n  \u003e\u003e python self-train_combine.py -g dataset/semeval/10-1/train.json -p dataset/semeval/10-1/label2.json -la dataset/semeval/10la-1\n  \u003e\u003e cd dataset/semeval\n  \u003e\u003e cp rel2id.json test.json 10la-1/\n  ```\n- Train the final student model: add `--stutrain True` to the script\n\n\n## Standard Fine-tuning Baseline\n\u003cdiv align=center\u003e\n\u003cimg src=\"figs/fine-tuning.png\" alt=\"ft\" width=70% height=70% /\u003e\n\u003c/div\u003e \n\n[Fine-tuning](Fine-tuning/)\n\n# Citation\nIf you use the code, please cite the following paper:\n\n\n```bibtex\n@inproceedings{xu-etal-2022-towards-realistic,\n    title = \"Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study\",\n    author = \"Xu, Xin  and\n      Chen, Xiang  and\n      Zhang, Ningyu  and\n      Xie, Xin  and\n      Chen, Xi  and\n      Chen, Huajun\",\n    editor = \"Goldberg, Yoav  and\n      Kozareva, Zornitsa  and\n      Zhang, Yue\",\n    booktitle = \"Findings of the Association for Computational Linguistics: EMNLP 2022\",\n    month = dec,\n    year = \"2022\",\n    address = \"Abu Dhabi, United Arab Emirates\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://aclanthology.org/2022.findings-emnlp.29\",\n    doi = \"10.18653/v1/2022.findings-emnlp.29\",\n    pages = 
\"413--427\"\n}\n\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzjunlp%2Flrebench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzjunlp%2Flrebench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzjunlp%2Flrebench/lists"}