{"id":13754352,"url":"https://github.com/universal-ie/UIE","last_synced_at":"2025-05-09T22:32:03.685Z","repository":{"id":37643627,"uuid":"482491448","full_name":"universal-ie/UIE","owner":"universal-ie","description":"Unified Structure Generation for Universal Information Extraction","archived":false,"fork":false,"pushed_at":"2022-07-30T14:21:04.000Z","size":211,"stargazers_count":897,"open_issues_count":37,"forks_count":100,"subscribers_count":8,"default_branch":"main","last_synced_at":"2024-11-16T07:33:26.834Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/universal-ie.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-04-17T10:23:44.000Z","updated_at":"2024-11-14T10:34:20.000Z","dependencies_parsed_at":"2022-07-14T08:08:56.627Z","dependency_job_id":null,"html_url":"https://github.com/universal-ie/UIE","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/universal-ie%2FUIE","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/universal-ie%2FUIE/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/universal-ie%2FUIE/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/universal-ie%2FUIE/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/universal-ie","download_url":"https://codeload.github.com/universal-ie/UIE/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count
":253335857,"owners_count":21892747,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T09:01:56.426Z","updated_at":"2025-05-09T22:31:58.674Z","avatar_url":"https://github.com/universal-ie.png","language":"Python","funding_links":[],"categories":["关系抽取、信息抽取"],"sub_categories":["其他_文本生成、文本对话"],"readme":"# UIE\n\n- Code for [``Unified Structure Generation for Universal Information Extraction``](https://aclanthology.org/2022.acl-long.395/)\n- Please contact [Yaojie Lu](http://luyaojie.github.io) ([@luyaojie](mailto:yaojie2017@iscas.ac.cn)) for questions and suggestions.\n\n## Update\n- [2022-06-12] Update pre-training code.\n- [2022-05-10] Update data preprocessing code.\n\n## Requirements\n\nGeneral\n\n- Python (verified on 3.8)\n- CUDA (verified on 11.1/10.2)\n\nPython Packages\nCUDA 10.2\n``` bash\nconda create -n uie python=3.8\nconda install -y pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch\npip install -r requirements.txt\n```\n\nCUDA 11.1\n``` bash\nconda create -n uie python=3.8\npip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html\npip install -r requirements.txt\n```\n\n## Quick Start\n\n### Datasets of Extraction Tasks\n\nDetails of preprocessing see [Data preprocessing](dataset_processing/).\n\nAfter that, please link the preprocessed dataset as:\n``` bash\nln -s dataset_processing/converted_data/ data\n```\n\n### Data Format\n\nData folder contains seven files：\n\n```text\ndata/text2spotasoc/absa/14lap\n├── entity.schema       # Entity 
Types for converting SEL to Record\n├── relation.schema     # Relation Types for converting SEL to Record\n├── event.schema        # Event Types for converting SEL to Record\n├── record.schema       # Spot/Asoc Type for constructing SSI\n├── test.json\n├── train.json\n└── val.json\n```\n\n`train.json`, `val.json`, and `test.json` are data files; each line is a JSON instance.\nEach JSON instance contains `text` and `record` fields, in which `text` is plain text, and `record` is the SEL representation of the extraction structure.\nFor detailed definitions, see [DATASETS.md](docs/DATASETS.md).\n\nNote:\n- The extra tokens of T5 are used as structure indicators, such as `\u003cextra_id_0\u003e`, `\u003cextra_id_1\u003e`, `\u003cextra_id_5\u003e`.\n\n| Token  | Role |\n| ------------- | ------------- |\n| `\u003cextra_id_0\u003e`  | Start of Label Name |\n| `\u003cextra_id_1\u003e`  | End of Label Name   |\n| `\u003cextra_id_2\u003e`  | Start of Input Text |\n| `\u003cextra_id_5\u003e`  | Start of Text Span  |\n| `\u003cextra_id_6\u003e`  | NULL span for Rejection |\n\n- `record.schema` is the record schema file for building SSI.\nIt contains three lines: the first line is the spot name list, the second line is the asoc name list. 
The third line is the spot-to-asoc dictionary (not used in the code; it can be ignored).\n\n  ```text\n  [\"aspect\", \"opinion\"]\n  [\"neutral\", \"positive\", \"negative\"]\n  {\"aspect\": [\"neutral\", \"positive\", \"negative\"], \"opinion\": []}\n  ```\n\n### Pretrained Models\nThe pre-trained models are available from the links below (CAS Cloud Box, Google Drive, or Hugging Face), or can be downloaded with the `gdown` command (`pip install gdown`).\n\nuie-en-base [[CAS Cloud Box]](https://pan.cstcloud.cn/s/w2hTaHYaRWw) [[Google Drive]](https://drive.google.com/file/d/12Dkh6KLDPvXrkQ1I-1xLqODQSYjkwnvs/view) [[Huggingface]](https://huggingface.co/luyaojie/uie-base-en)\n\nuie-en-large [[CAS Cloud Box]](https://pan.cstcloud.cn/s/2vrXYBVTbk) [[Google Drive]](https://drive.google.com/file/d/15OFkWw8kJA1k2g_zehZ0pxcjTABY2iF1/view) [[Huggingface]](https://huggingface.co/luyaojie/uie-large-en)\n\nuie-char-small (Chinese) [[CAS Cloud Box]](https://pan.cstcloud.cn/s/J7HOsDHHQHY)\n\n``` bash\n# Example of Google Drive\ngdown 12Dkh6KLDPvXrkQ1I-1xLqODQSYjkwnvs \u0026\u0026 unzip uie-base-en.zip\ngdown 15OFkWw8kJA1k2g_zehZ0pxcjTABY2iF1 \u0026\u0026 unzip uie-large-en.zip\n```\n\nPut all models in `hf_models/`, which the default running scripts expect.\n\n### Model Fine-tuning\n\nFirst, create the `output` directory.\n\nThe training scripts are as follows:\n\n- `run_uie_finetune.py`: Python code entry\n- `run_uie_finetune.bash`: Model training and evaluation script.\n- `scripts_exp/run_exp.bash`: Model environment configuration and parameter setting entry.\n\nThe training command is as follows (see the bash scripts and Python files for the corresponding command-line\narguments):\n\n```bash\n. 
config/data_conf/base_model_conf_absa.ini  \u0026\u0026 model_name=uie-base-en dataset_name=absa/14lap bash scripts_exp/run_exp.bash\n```\n\n- `config/data_conf/base_model_conf_absa.ini` selects the training settings defined in `base_model_conf_absa.ini`.\n- `model_name=uie-base-en` selects the uie-base-en model.\n- `dataset_name=absa/14lap` specifies the dataset path.\n\nTrained models are saved in the `output_dir` specified by `run_uie_finetune.bash`.\n\nSimple Training Command\n```\nbash run_uie_finetune.bash -v -d 0 \\\n  -b 16 \\\n  -k 3 \\\n  --lr 1e-4 \\\n  --warmup_ratio 0.06 \\\n  -i absa/14lap \\\n  --epoch 50 \\\n  --spot_noise 0.1 \\\n  --asoc_noise 0.1 \\\n  -f spotasoc \\\n  --map_config config/offset_map/closest_offset_en.yaml \\\n  -m hf_models/uie-base-en \\\n  --random_prompt\n```\n\nProgress logs\n```\n...\n***** Running training *****\n  Num examples = 906\n  Num Epochs = 50\n  Instantaneous batch size per device = 16\n  Total train batch size (w. parallel, distributed \u0026 accumulation) = 16\n  Gradient Accumulation steps = 1\n  Total optimization steps = 2850\n  Num examples = 219\n  Batch size = 64\n...\n```\n\nFinal Result (specific scores may differ across machines and environments)\n```\n...\ntest offset-rel-strict-P 67.01461377870564\ntest offset-rel-strict-R 59.11602209944752\ntest offset-rel-strict-F1 62.81800391389433\n...\n```\n\n| Metric      | Definition |\n| ----------- | ----------- |\n| ent-(P/R/F1)      | Micro-F1 of Entity (Entity Type, Entity Span) |\n| rel-strict-(P/R/F1)   | Micro-F1 of Relation Strict (Relation Type, Arg1 Span, Arg1 Type, Arg2 Span, Arg2 Type) |\n| rel-boundary-(P/R/F1)   | Micro-F1 of Relation Boundary (Relation Type, Arg1 Span, Arg2 Span) |\n| evt-trigger-(P/R/F1)   | Micro-F1 of Event Trigger (Event Type, Trigger Span) |\n| evt-role-(P/R/F1)   | Micro-F1 of Event Argument (Event Type, Arg Role, Arg Span) |\n\n\n### Model Pre-training\n\n[TODO] Add detailed description.\n\n### 
Data Collator\n\nWe construct different sequence-to-sequence tasks using different data collators.\n- For pre-training, `HybirdDataCollator` constructs different seq2seq pairs for different tasks, and `DataCollatorForMetaSeq2Seq` constructs the SSI with **_Sampling Strategy_**.\n- For fine-tuning, `DataCollatorForMetaSeq2Seq` constructs the dynamic seq2seq pair with **_Rejection Mechanism_**.\n\n#### HybirdDataCollator\n\nWe unify different types of (text, structure) pairs for pre-training with `HybirdDataCollator`.\nIt contains multiple data collators for different instances:\n- `DataCollatorForMetaSeq2Seq` for the pair task, similar to the fine-tuning stage\n- `DataCollatorForSeq2Seq` for the record task\n- `DataCollatorForT5MLM` for the text task\n\n#### DataCollatorForMetaSeq2Seq\n\n**_Sampling Strategy_** and **_Rejection Mechanism_** can be adopted in the training process. \n\n- In `uie/seq2seq/data_collator/meta_data_collator.py`, class _DataCollatorForMetaSeq2Seq_ collates data and class _DynamicSSIGenerator_ handles prompt sampling\n- In `run_uie_finetune.py`, class _DataTrainingArguments_ contains the related parameters\n\nThe related parameters in class _DataTrainingArguments_ are briefly introduced here: \n\n- About **_Sampling Strategy_**\n``` text\n    - max_prefix_length       Maximum length of SSI\n    - ordered_prompt          Whether to sort the spot/asoc of SSI or not\n    - record_schema           record schema read from record.schema\n``` \n\n- About **_Rejection Mechanism_**\n``` text\n  - spot_noise              The noise rate of null spot\n  - asoc_noise              The noise rate of null asoc\n```\n\n### Scripts for Model Evaluation\n\nVerifying the performance of UIE requires converting the generated **SEL** expression into a **Record** and then evaluating it.\n\n#### 1. 
Convert structured expressions to record structures (sel2record.py)\nAfter training, `pred_folder` will contain 'eval_preds_seq2seq.txt' or 'test_preds_seq2seq.txt'\n\n``` text\n $ python scripts/sel2record.py -h     \nusage: sel2record.py [-h] [-g GOLD_FOLDER] [-p PRED_FOLDER [PRED_FOLDER ...]] [-c MAP_CONFIG] [-d DECODING] [-v]\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -g GOLD_FOLDER        folder of golden answer\n  -p PRED_FOLDER [PRED_FOLDER ...]\n                        multiple different prediction folders\n  -c MAP_CONFIG, --config MAP_CONFIG\n                        offset matching strategy configuration file, more configuration files are placed in config/offset_map\n  -d DECODING           specify structure parser, default is SpotAsoc structure\n  -v, --verbose         print more detailed log information\n```\n\n#### 2. Validate model performance (eval_extraction.py)\nAfter converting, `pred_folder` will contain 'eval_preds_record.txt' or 'test_preds_record.txt'\n\n```text\n $ python scripts/eval_extraction.py -h   \nusage: eval_extraction.py [-h] [-g GOLD_FOLDER] [-p PRED_FOLDER [PRED_FOLDER ...]] [-v] [-w] [-m] [-case]\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -g GOLD_FOLDER        Golden Dataset folder\n  -p PRED_FOLDER [PRED_FOLDER ...]\n                        Predicted model folder\n  -v                    Show more information during running\n  -w                    Write evaluation results to predicted folder\n  -m                    Refers to the matching policy\n  -case                 Show case study\n```\n\n#### 3. 
Verify the performance of the mapping label (check_offset_map_gold_as_pred.bash)\nTo verify the effect of the structure parser, we take the gold `SEL` answer as the prediction result and evaluate its performance.\n``` bash\nbash scripts/check_offset_map_gold_as_pred.bash \u003cdata-folder\u003e \u003cmap-config\u003e\n```\n\n## Citation\n\nIf this repository helps you, please cite this paper:\n\nYaojie Lu, Qing Liu, Dai Dai, Xinyan Xiao, Hongyu Lin, Xianpei Han, Le Sun, Hua Wu.\nUnified Structure Generation for Universal Information Extraction.\nIn Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5755–5772, Dublin, Ireland. Association for Computational Linguistics.\n\n```\n@inproceedings{lu-etal-2022-unified,\n    title = \"Unified Structure Generation for Universal Information Extraction\",\n    author = \"Lu, Yaojie  and\n      Liu, Qing  and\n      Dai, Dai  and\n      Xiao, Xinyan  and\n      Lin, Hongyu  and\n      Han, Xianpei  and\n      Sun, Le  and\n      Wu, Hua\",\n    booktitle = \"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)\",\n    month = may,\n    year = \"2022\",\n    address = \"Dublin, Ireland\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://aclanthology.org/2022.acl-long.395\",\n    pages = \"5755--5772\",\n}\n```\n\n## License\nThe code is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License for Noncommercial use only.\nAny commercial use should get formal permission first.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Funiversal-ie%2FUIE","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Funiversal-ie%2FUIE","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Funiversal-ie%2FUIE/lists"}