<p align="left">
    <b> English | <a href="https://github.com/zjunlp/IEPile/blob/main/README_CN.md">Chinese</a> </b>
</p>

# IEPile: A Large-Scale Information Extraction Corpus

This is the official repository for [IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus](https://arxiv.org/abs/2402.14710)

[**Datasets**](https://huggingface.co/datasets/zjunlp/iepile) |
[**Paper**](https://huggingface.co/papers/2402.14710) |
[**Usage**](./README.md#3using-iepile-to-train-models) |
[**Limitations**](./README.md#8limitations) |
[**Statement & License**](./README.md#7statement-and-license) |
[**Citation**](./README.md#9cite)

> Please note that IEPile may undergo **updates** (we will announce each release). It is recommended to use the most current version.

- [IEPile: A Large-Scale Information Extraction Corpus](#iepile-a-large-scale-information-extraction-corpus)
  - [News](#news)
  - [1.Introduction](#1introduction)
  - [2.Data](#2data)
    - [2.1Construction of IEPile](#21construction-of-iepile)
    - [2.2Data Format of IEPile](#22data-format-of-iepile)
  - [3.Using IEPile to Train Models](#3using-iepile-to-train-models)
    - [3.1Environment](#31environment)
    - [3.2Download Data and Models](#32download-data-and-models)
    - [3.3LoRA Fine-tuning](#33lora-fine-tuning)
  - [4.Continued Training with In-Domain Data](#4continued-training-with-in-domain-data)
    - [4.1Training Data Conversion](#41training-data-conversion)
    - [4.2Continued Training](#42continued-training)
    - [4.3Continued Training OneKE](#43continued-training-oneke)
      - [4.3.1Full SFT](#431full-sft)
      - [4.3.2LoRA SFT](#432lora-sft)
  - [5.Prediction](#5prediction)
    - [5.1Test Data Conversion](#51test-data-conversion)
    - [5.2IEPile Test Data](#52iepile-test-data)
    - [5.3Basic Model + LoRA Prediction](#52basic-model--lora-prediction)
    - [5.4IE-Specific Model Prediction](#53ie-specific-model-prediction)
  - [Model Usage](#model-usage)
    - [Model Download](#model-download)
    - [Environmental Installation](#environmental-installation)
    - [Fast Running](#fast-running)
    - [VLLM Inference](#vllm-inference)
    - [GGUF Format Conversion](#gguf-format-conversion)
    - [Ollama Inference](#ollama-inference)
    - [Inference on Mac](#inference-on-mac)
    - [Multi GPU Inference](#multi-gpu-inference)
  - [6.Evaluation](#6evaluation)
  - [7.Statement and License](#7statement-and-license)
  - [8.Limitations](#8limitations)
  - [9.Cite](#9cite)
  - [10.Acknowledgements](#10acknowledgements)

## News
* [2024/05] The paper [IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus](https://doi.org/10.48550/arXiv.2402.14710) was accepted by the ACL 2024 main conference.
* [2024/04] We released [OneKE](https://huggingface.co/zjunlp/OneKE), a new bilingual (Chinese and English) schema-based information extraction model based on Chinese-Alpaca-2-13B.
* [2024/02] We released [IEPile](https://huggingface.co/datasets/zjunlp/iepile), a large-scale (0.32B tokens) high-quality bilingual (Chinese and English) Information Extraction (IE) instruction dataset, along with two models trained on `IEPile`: [baichuan2-13b-iepile-lora](https://huggingface.co/zjunlp/baichuan2-13b-iepile-lora) and [llama2-13b-iepile-lora](https://huggingface.co/zjunlp/llama2-13b-iepile-lora).
* [2023/10] We released [InstructIE](https://huggingface.co/datasets/zjunlp/InstructIE), a new bilingual (Chinese and English) theme-based IE instruction dataset, with an accompanying [paper](https://arxiv.org/abs/2305.11527).
* [2023/08] We introduced [knowlm-13b-ie](https://huggingface.co/zjunlp/knowlm-13b-ie/tree/main), a dedicated 13B model for Information Extraction (IE).
* [2023/05] We initiated an instruction-based Information Extraction project.

## 1.Introduction

**`IEPile`** dataset download links: [Google Drive](https://drive.google.com/file/d/1jPdvXOTTxlAmHkn5XkeaaCFXQkYJk5Ng/view?usp=sharing) | [Hugging Face](https://huggingface.co/datasets/zjunlp/iepile) | [WiseModel](https://wisemodel.cn/datasets/zjunlp/IEPile) | [ModelScope](https://modelscope.cn/datasets/ZJUNLP/IEPile)

> Please be aware that the data available at the dataset links above already excludes any part related to the ACE2005 dataset.
Should you require access to the unfiltered, complete dataset and have obtained the necessary permissions, please contact us by email at guihonghao@zju.edu.cn or zhangningyu@zju.edu.cn, and we will provide the complete dataset resources.

Model download links for **`LLaMA2-IEPile`** | **`Baichuan2-IEPile`** | **`OneKE`**: [zjunlp/llama2-13b-iepile-lora](https://huggingface.co/zjunlp/llama2-13b-iepile-lora/tree/main) | [zjunlp/baichuan2-13b-iepile-lora](https://huggingface.co/zjunlp/baichuan2-13b-iepile-lora) | [zjunlp/OneKE](https://huggingface.co/zjunlp/OneKE)

![statistic](./assets/statistic.jpg)

We collected and cleaned existing Information Extraction (IE) datasets, integrating a total of 26 **English** IE datasets and 7 **Chinese** IE datasets. As shown in the figure, these datasets cover multiple domains, including **general**, **medical**, and **financial**.

In this study, we adopted the proposed "`schema-based batched instruction generation strategy`" to create **IEPile**, a large-scale, high-quality, **bilingual** (Chinese and English) IE instruction-tuning dataset containing approximately `0.32B` tokens.

Based on **IEPile**, we fine-tuned the `Baichuan2-13B-Chat` and `LLaMA2-13B-Chat` models using the `LoRA` technique.
Experiments demonstrate that the fine-tuned `Baichuan2-IEPile` and `LLaMA2-IEPile` models perform remarkably well on fully supervised training sets and achieve improvements on **zero-shot information extraction tasks**.

![zero_en](./assets/zero_en.jpg)

![zero_zh](./assets/zero_zh.jpg)

<details>
  <summary><b>Supervision Results</b></summary>

![supervision_ner](./assets/supervision_ner.jpg)

![supervision_re](./assets/supervision_re.jpg)

![supervision_ee](./assets/supervision_ee.jpg)

</details>

## 2.Data

### 2.1Construction of IEPile

We concentrate on instruction-based IE, so the construction of the schema within the instructions is crucial: schemas reflect the specific extraction requirements and are dynamically variable. Previous approaches to existing IE datasets often employ a rather coarse schema-processing strategy when constructing instructions, using all schemas within a label set for instruction building. This raises two potential issues:
1. **Inconsistency between training and evaluation in the number of schema queries per instruction**. For example, a model's performance will decrease if it is trained on about 20 schema queries but tested with either 10 or 30, even if the training and evaluation schemas are similar in content.
2. **Inadequate differentiation among schemas in the instructions**. For example, semantically similar schemas like "layoffs", "depart", and "dismissals" may present co-occurrence ambiguities that could confuse the LLMs.
Such schemas should co-occur more frequently within the instruction.

Therefore, we introduce the following solutions: 1) Hard Negative Schema and 2) Batched Instruction Generation.

![iepile](./assets/iepile.jpg)

<details>
  <summary><b>Hard Negative Schema</b></summary>

Assume that a dataset $\mathcal{D}$ possesses a full label set $L$. For a given text $S$, the schemas present in its annotation constitute the positive schema set $Pos\_L$, while the others form the negative schema set $Neg\_L$. In our analysis, we found that the primary cause of model misjudgment is the semantic ambiguity of schemas. In traditional approaches, $Neg\_L$ is simply defined as $L - Pos\_L$. However, this overlooks a critical point: negative schemas that are semantically close to positive schemas deserve special attention. Inspired by the theory of contrastive learning, we construct a hard negative schema dictionary $\mathcal{K}$, where each key is a schema and the associated value is a collection of schemas semantically similar to it. Based on this, we define the hard negative schema set as $Hard\_L = \mathcal{K}[Pos\_L]$ and the remaining negative schema set as $Other\_L = L - Pos\_L - Hard\_L$. The final $Neg\_L$ consists of $Hard\_L$ plus a small subset of $Other\_L$. Through this strategy, we not only present semantically similar schemas more frequently within the instruction but also reduce the number of training instances without sacrificing model performance.

</details>

<details>
  <summary><b>Batched Instruction Generation</b></summary>

Subsequently, we obtain the final schema set $L' = Pos\_L + Neg\_L$.
We employ a batched instruction generation method, limiting the number of schemas queried in each instruction to $split\_num$, which ranges from 4 to 6. $L'$ is thus divided into $\lceil |L'| / split\_num \rceil$ batches, with each batch querying $split\_num$ schemas. Consequently, even if the number of schemas queried during evaluation differs from that during training, the batched mechanism distributes the queries across groups of $split\_num$ schemas, thereby mitigating the decline in generalization performance.

</details>

### 2.2Data Format of IEPile

Each instance in `IEPile` contains four fields: `task`, `source`, `instruction`, and `output`.

Below is a **data example**:

```json
{
    "task": "NER",
    "source": "CoNLL2003",
    "instruction": "{\"instruction\": \"You are an expert in named entity recognition. Please extract entities that match the schema definition from the input. Return an empty list if the entity type does not exist. Please respond in the format of a JSON string.\", \"schema\": [\"person\", \"organization\", \"else\", \"location\"], \"input\": \"284 Robert Allenby ( Australia ) 69 71 71 73 , Miguel Angel Martin ( Spain ) 75 70 71 68 ( Allenby won at first play-off hole )\"}",
    "output": "{\"person\": [\"Robert Allenby\", \"Allenby\", \"Miguel Angel Martin\"], \"organization\": [], \"else\": [], \"location\": [\"Australia\", \"Spain\"]}"
}
```

This instance belongs to the `NER` task and comes from the `CoNLL2003` dataset; the schema list to be extracted is ["`person`", "`organization`", "`else`", "`location`"], and the text to extract from is "*284 Robert Allenby ( Australia ) 69 71 71 73 , Miguel Angel Martin ( Spain ) 75 70 71 68 ( Allenby won at first play-off hole )*".
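Because the `instruction` and `output` fields are themselves JSON strings nested inside the instance, consuming the data takes two rounds of `json.loads`. A minimal sketch (the instance below is an abridged, hand-built version of the example above, not loaded from the released files):

```python
import json

# A minimal IEPile-style NER instance (abridged; the input text is shortened).
instance = {
    "task": "NER",
    "source": "CoNLL2003",
    "instruction": json.dumps({
        "instruction": "You are an expert in named entity recognition. ...",
        "schema": ["person", "organization", "else", "location"],
        "input": "284 Robert Allenby ( Australia ) ... Miguel Angel Martin ( Spain ) ...",
    }),
    "output": json.dumps({
        "person": ["Robert Allenby", "Allenby", "Miguel Angel Martin"],
        "organization": [],
        "else": [],
        "location": ["Australia", "Spain"],
    }),
}

# The instruction field is itself a JSON string, so it needs its own json.loads.
inner = json.loads(instance["instruction"])
schemas = inner["schema"]          # the schema list queried by this instruction

# The gold output is likewise a JSON string keyed by schema.
gold = json.loads(instance["output"])
assert list(gold.keys()) == schemas  # output keys follow the instruction's schema order
```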
The output is `{"person": ["Robert Allenby", "Allenby", "Miguel Angel Martin"], "organization": [], "else": [], "location": ["Australia", "Spain"]}`.

> Note that the order of schemas in the output is consistent with the order in the instruction.

<details>
  <summary><b>More Task Instances</b></summary>

```json
{
  "task": "EE",
  "source": "PHEE",
  "instruction": "{\"instruction\": \"You are an expert in event extraction. Please extract events from the input that conform to the schema definition. Return an empty list for events that do not exist, and return NAN for arguments that do not exist. If an argument has multiple values, please return a list. Respond in the format of a JSON string.\", \"schema\": [{\"event_type\": \"potential therapeutic event\", \"trigger\": true, \"arguments\": [\"Treatment.Time_elapsed\", \"Treatment.Route\", \"Treatment.Freq\", \"Treatment\", \"Subject.Race\", \"Treatment.Disorder\", \"Effect\", \"Subject.Age\", \"Combination.Drug\", \"Treatment.Duration\", \"Subject.Population\", \"Subject.Disorder\", \"Treatment.Dosage\", \"Treatment.Drug\"]}, {\"event_type\": \"adverse event\", \"trigger\": true, \"arguments\": [\"Subject.Population\", \"Subject.Age\", \"Effect\", \"Treatment.Drug\", \"Treatment.Dosage\", \"Treatment.Freq\", \"Subject.Gender\", \"Treatment.Disorder\", \"Subject\", \"Treatment\", \"Treatment.Time_elapsed\", \"Treatment.Duration\", \"Subject.Disorder\", \"Subject.Race\", \"Combination.Drug\"]}], \"input\": \"Our findings reveal that even in patients without a history of seizures, pregabalin can cause a cortical negative myoclonus.\"}",
  "output": "{\"potential therapeutic event\": [], \"adverse event\": [{\"trigger\": \"cause \", \"arguments\": {\"Subject.Population\": \"NAN\", \"Subject.Age\": \"NAN\", \"Effect\": \"cortical negative myoclonus\", \"Treatment.Drug\": \"pregabalin\", \"Treatment.Dosage\": \"NAN\", \"Treatment.Freq\": \"NAN\", \"Subject.Gender\": \"NAN\", \"Treatment.Disorder\": \"NAN\", \"Subject\": \"patients without a history of seizures\", \"Treatment\": \"pregabalin\", \"Treatment.Time_elapsed\": \"NAN\", \"Treatment.Duration\": \"NAN\", \"Subject.Disorder\": \"NAN\", \"Subject.Race\": \"NAN\", \"Combination.Drug\": \"NAN\"}}]}"
}

{
  "task": "RE",
  "source": "NYT11",
  "instruction": "{\"instruction\": \"You are an expert in relationship extraction. Please extract relationship triples that match the schema definition from the input. Return an empty list for relationships that do not exist. Please respond in the format of a JSON string.\", \"schema\": [\"neighborhood of\", \"nationality\", \"children\", \"place of death\"], \"input\": \" In the way New Jersey students know that Thomas Edison 's laboratory is in West Orange , the people of Colma know that Wyatt Earp 's ashes are buried at Hills of Eternity , a Jewish cemetery he was n't ; his wife was , and that Joe DiMaggio is at Holy Cross Cemetery , where visitors often lean bats against his gravestone . \"}",
  "output": "{\"neighborhood of\": [], \"nationality\": [], \"children\": [], \"place of death\": [{\"subject\": \"Thomas Edison\", \"object\": \"West Orange\"}]}"
}
```

</details>

Below are the explanations for each field:

| Field | Description |
| :---: | :---: |
| task | The task to which the instance belongs, one of five types (`NER`, `RE`, `EE`, `EET`, `EEA`). |
| source | The dataset to which the instance belongs. |
| instruction | The instruction fed to the model, serialized into a JSON string via `json.dumps`, comprising three parts: `"instruction"`, `"schema"`, and `"input"`. |
| output | The output, a JSON string of a dictionary whose keys are the schemas and whose values are the extracted content. |

The **instruction** field in `IEPile` adopts a JSON-like string structure, essentially a dictionary-type string composed of the following three components:
(1) **`'instruction'`**: The task description, outlining the task to be performed (one of `NER`, `RE`, `EE`, `EET`, `EEA`).
(2) **`'schema'`**: A list of schemas to be extracted (`entity types`, `relation types`, `event types`).
(3) **`'input'`**: The text from which information is to be extracted.

The file [instruction.py](./ie2instruction/convert/utils/instruction.py) provides the instructions used for the various tasks.

## 3.Using IEPile to Train Models

### 3.1Environment

Before you begin, create an appropriate **virtual environment** as follows:

```bash
conda create -n IEPile python=3.9   # Create a virtual environment
conda activate IEPile               # Activate the environment
pip install -r requirements.txt     # Install dependencies
```

### 3.2Download Data and Models

**`IEPile`** dataset download links: [Google Drive](https://drive.google.com/file/d/1jPdvXOTTxlAmHkn5XkeaaCFXQkYJk5Ng/view?usp=sharing) | [Hugging Face](https://huggingface.co/datasets/zjunlp/IEPile)

```bash
IEPile
├── train.json    # Training set
└── dev.json      # Validation set
```

Here are some of the models supported by the code in this repository:
[[llama](https://huggingface.co/meta-llama), [alpaca](https://github.com/tloen/alpaca-lora), [vicuna](https://huggingface.co/lmsys), [zhixi](https://github.com/zjunlp/KnowLM), [falcon](https://huggingface.co/tiiuae), [baichuan](https://huggingface.co/baichuan-inc), [chatglm](https://huggingface.co/THUDM), [qwen](https://huggingface.co/Qwen), [moss](https://huggingface.co/fnlp), [openba](https://huggingface.co/OpenBA)]

```bash
mkdir data         # Put data here
mkdir models       # Put base models here
mkdir results      # Put prediction results here
mkdir lora         # Put LoRA fine-tuning results here
```

Data should be placed in the `./data` directory.

### 3.3LoRA Fine-tuning

> Important note: all the commands below should be executed from within the `IEPile` directory. For example, to run the fine-tuning script, use `bash ft_scripts/fine_llama.bash`. Please ensure your current working directory is correct.
> Please make sure that each entry in the training/validation files includes the `instruction` and `output` fields.

```bash
output_dir='lora/llama2-13b-chat-v1'
mkdir -p ${output_dir}
CUDA_VISIBLE_DEVICES="0,1,2,3" torchrun --nproc_per_node=4 --master_port=1287 src/test_finetune.py \
    --do_train --do_eval \
    --overwrite_output_dir \
    --model_name_or_path 'models/llama2-13b-chat' \
    --stage 'sft' \
    --model_name 'llama' \
    --template 'llama2' \
    --train_file 'data/train.json' \
    --valid_file 'data/dev.json' \
    --output_dir=${output_dir} \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --preprocessing_num_workers 16 \
    --num_train_epochs 10 \
    --learning_rate 5e-5 \
    --max_grad_norm 0.5 \
    --optim "adamw_torch" \
    --max_source_length 400 \
    --cutoff_len 700 \
    --max_target_length 300 \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --save_total_limit 10 \
    --lora_r 16 \
    --lora_alpha 32 \
    --lora_dropout 0.05 \
    --bf16
```

* `CUDA_VISIBLE_DEVICES="0,1,2,3"`: specifies which GPUs are available for the current training task. Here, "0,1,2,3" means the four GPUs with IDs 0, 1, 2, and 3 are used. If your machine has more than four GPUs, this setting lets you select any four of them.
* `--nproc_per_node=4`: the number of processes launched on each node. Since four GPUs are specified in this example, four separate processes are started, one per GPU.
* For training on **a single GPU**, use `CUDA_VISIBLE_DEVICES=0 python src/finetune.py` to start training; `CUDA_VISIBLE_DEVICES=0` designates GPU 0 for the task.
* `model_name`: the **name of the model architecture** you want to use (7B, 13B, Base, and Chat variants share the same architecture). Currently supported models: ["`llama`", "`alpaca`", "`vicuna`", "`zhixi`", "`falcon`", "`baichuan`", "`chatglm`", "`qwen`", "`moss`", "`openba`"]. **Please note** that this parameter is distinct from `--model_name_or_path`.
* `model_name_or_path`: the model path; download the corresponding model from [HuggingFace](https://huggingface.co/models).
* `template`: the **name of the prompt template**, e.g. `alpaca`, `baichuan`, `baichuan2`, `chatglm3`. See [src/datamodule/template.py](./src/datamodule/template.py) for all supported template names. The default is the `alpaca` template. **For `Chat` versions of models, use the matching template; `Base` versions can default to `alpaca`**.
* `train_file`, `valid_file (optional)`: the **file paths** of the training and validation sets. Note: only **JSON format** files are currently supported.
⚠️If `valid_file` is not specified, a subset of `val_set_size` entries will be automatically allocated from `train_file` to serve as the validation set.
* `output_dir`: the **path to save the weight parameters** after LoRA fine-tuning.
* `val_set_size`: the number of samples in the **validation set**; the default is 1000.
* `per_device_train_batch_size`, `per_device_eval_batch_size`: the `batch_size` on each GPU device; adjust according to available memory. For an RTX 3090, a value between 2 and 4 is recommended.
* `max_source_length`, `max_target_length`, `cutoff_len`: the maximum input and output lengths, and the cutoff length; the cutoff length can simply be thought of as maximum input length + maximum output length. Set appropriate values according to your needs and memory size.
* If you run out of GPU memory when saving the model after the evaluation phase, set `evaluation_strategy` to `no`.

> Quantization can be performed by setting bits to 4; it is recommended for the RTX 3090.

To learn more about parameter configuration, please refer to [src/args](./src/args).

The specific script for fine-tuning the `LLaMA2-13B-Chat` model can be found in [ft_scripts/fine_llama.bash](./ft_scripts/fine_llama.bash).

The specific script for fine-tuning the `Baichuan2-13B-Chat` model can be found in [ft_scripts/fine_baichuan.bash](./ft_scripts/fine_baichuan.bash).

## 4.Continued Training with In-Domain Data

Although the `Baichuan2-IEPile` and `LLaMA2-IEPile` models have undergone extensive instruction fine-tuning on multiple general datasets and thus possess a degree of **general information extraction capability**, they may still exhibit limitations when processing data in **specific domains** (such as `law`, `education`, `science`, `telecommunications`). To address this, it is recommended to conduct **secondary training** of these models on datasets specific to those domains.
This will help the models better adapt to the semantic and structural characteristics of the specific domains, enhancing their **information extraction capability within those domains**.

### 4.1Training Data Conversion

First, the data needs to be **formatted** to include the `instruction` and `output` fields. For this purpose, we provide the script [convert_func.py](./ie2instruction/convert_func.py), which can batch-convert data into a format directly usable by the model.

> Before using the [convert_func.py](./ie2instruction/convert_func.py) script, please refer to the [data](./data) directory, which details the data format required for each task. See `sample.json` for the format before conversion, `schema.json` for the organization of the schema, and `train.json` for the format after conversion.

> Additionally, you can directly use the bilingual (Chinese and English) information extraction dataset [zjunlp/InstructIE](https://huggingface.co/datasets/zjunlp/InstructIE), which covers 12 themes such as characters, vehicles, works of art, natural science, man-made objects, and astronomical objects.

```bash
python ie2instruction/convert_func.py \
    --src_path data/NER/sample.json \
    --tgt_path data/NER/train.json \
    --schema_path data/NER/schema.json \
    --language zh \
    --task NER \
    --split_num 6 \
    --random_sort \
    --split train
```

* `language`: supports two languages, `zh` (Chinese) and `en` (English); a different instruction template is used for each.
* `task`: currently supports five task types: ['`RE`', '`NER`', '`EE`', '`EET`', '`EEA`'].
* `split_num`: the maximum number of schemas allowed in a single instruction. The default value is 4; setting it to -1 means no splitting is done.
The recommended number of schema splits varies by task: **6 for NER; 4 for RE, EE, EET, and EEA**.
* `random_sort`: whether to randomize the order of schemas in the instructions. The default is False, meaning schemas are sorted alphabetically.
* `split`: specifies the dataset type, either `train` or `test`.

The converted training data will contain four fields: `task`, `source`, `instruction`, `output`.

**`Generation of Hard Negative Samples`**: promotes the co-occurrence of semantically close and easily confused schemas while reducing the number of training samples.

```bash
python ie2instruction/convert_func.py \
    --src_path data/SPO/sample.json \
    --tgt_path data/SPO/train.json \
    --schema_path data/SPO/schema.json \
    --cluster_mode \
    --hard_negative_path data/hard_negative/SPO_DuIE2.0.json \
    --language zh \
    --task SPO \
    --split_num 4 \
    --random_sort \
    --split train
```

Note the addition of the `--cluster_mode` and `--hard_negative_path data/hard_negative/SPO_DuIE2.0.json` parameters; `--hard_negative_path` points to the dictionary of hard negative samples.
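The sampling behind this dictionary can be illustrated with a short sketch. The `hard_dict` below is made up for illustration, and `build_negatives` is not the repository's implementation; it only mirrors the $Neg\_L = Hard\_L +$ (a small subset of $Other\_L$) recipe from Section 2.1:

```python
import random

# Hypothetical hard-negative dictionary K: schema -> semantically similar schemas.
hard_dict = {
    "layoffs": ["depart", "dismissals"],
    "depart": ["layoffs", "dismissals"],
    "acquire": ["merge"],
}

def build_negatives(full_label_set, pos_schemas, k=1, seed=0):
    """Neg_L = Hard_L plus a small sample of Other_L (cf. Section 2.1)."""
    # Hard_L = K[Pos_L]: hard negatives are similar schemas not already positive.
    hard = {n for p in pos_schemas for n in hard_dict.get(p, [])} - set(pos_schemas)
    # Other_L = L - Pos_L - Hard_L: the remaining, easier negatives.
    other = set(full_label_set) - set(pos_schemas) - hard
    rng = random.Random(seed)
    sampled = rng.sample(sorted(other), min(k, len(other)))
    return sorted(hard) + sampled

L = ["layoffs", "depart", "dismissals", "acquire", "merge", "found"]
neg = build_negatives(L, pos_schemas=["layoffs"], k=1)
# "depart" and "dismissals" always appear; one easier negative is sampled at random
```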
[hard_dict.json](./data/hard_negative/hard_dict.json) contains the hard negative sample dictionaries for all datasets involved in IEPile.

### 4.2Continued Training

Model download links for **`LLaMA2-IEPile`** | **`Baichuan2-IEPile`** | **`LLaMA3-IEPile`** | **`Qwen1.5-IEPile`** | **`OneKE`**: [zjunlp/llama2-13b-iepile-lora](https://huggingface.co/zjunlp/llama2-13b-iepile-lora/tree/main) | [zjunlp/baichuan2-13b-iepile-lora](https://huggingface.co/zjunlp/baichuan2-13b-iepile-lora) | [zjunlp/llama3-8b-iepile-lora](https://huggingface.co/zjunlp/llama3-8b-iepile-lora) | [zjunlp/qwen1.5-14b-iepile-lora](https://huggingface.co/zjunlp/qwen1.5-14b-iepile-lora) | [zjunlp/OneKE](https://huggingface.co/zjunlp/OneKE)

| checkpoint_dir | model_name_or_path | model_name | fp16/bf16 | template |
| --- | --- | --- | --- | --- |
| llama2-13b-iepile-lora | LLaMA2-13B-Chat | llama | bf16 | llama2 |
| baichuan2-13b-iepile-lora | Baichuan2-13B-Chat | baichuan | bf16 | baichuan2 |
| llama3-8b-iepile-lora | LLaMA3-8B-Instruct | llama | bf16 | alpaca |
| qwen1.5-14b-iepile-lora | Qwen1.5-14B-Chat | qwen2 | bf16 | qwen |
| OneKE | OneKE | llama | bf16 | llama2_zh |

```bash
output_dir='lora/llama2-13b-chat-v1-continue'
mkdir -p ${output_dir}
CUDA_VISIBLE_DEVICES="0,1,2,3" torchrun --nproc_per_node=4 --master_port=1287 src/test_finetune.py \
    --do_train --do_eval \
    --overwrite_output_dir \
    --model_name_or_path 'models/llama2-13B-Chat' \
    --checkpoint_dir 'zjunlp/llama2-13b-iepile-lora' \
    --stage 'sft' \
    --model_name 'llama' \
    --template 'llama2' \
    --train_file 'data/train.json' \
    --valid_file 'data/dev.json' \
    --output_dir=${output_dir} \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --preprocessing_num_workers 16 \
    --num_train_epochs 10 \
    --learning_rate 5e-5 \
    --max_grad_norm 0.5 \
    --optim "adamw_torch" \
    --max_source_length 400 \
    --cutoff_len 700 \
    --max_target_length 300 \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --save_total_limit 10 \
    --lora_r 64 \
    --lora_alpha 64 \
    --lora_dropout 0.05 \
    --bf16
```

* Please refer to [3.3LoRA Fine-tuning](./README.md#33lora-fine-tuning) for further parameter descriptions.
* To continue training from fine-tuned LoRA weights, simply point the `--checkpoint_dir` parameter to the path of the LoRA weights, for example `'zjunlp/llama2-13b-iepile-lora'`.

> Quantization can be performed by setting bits to 4; it is recommended for the RTX 3090.

> Please note that when using **`LLaMA2-IEPile`** or **`Baichuan2-IEPile`**, keep both `lora_r` and `lora_alpha` at 64; we do not provide recommended values for other settings.

* To continue training from full fine-tuned model weights, just set the `--model_name_or_path` parameter to the path of the weights, such as `'zjunlp/KnowLM-IE-v2'`, without setting `--checkpoint_dir`.

The script can be found at [ft_scripts/fine_continue.bash](./ft_scripts/fine_continue.bash).

### 4.3Continued Training OneKE

#### 4.3.1Full SFT

```bash
output_dir='lora/OneKE-continue'
mkdir -p ${output_dir}
CUDA_VISIBLE_DEVICES="0,1,2,3" torchrun --nproc_per_node=4 --master_port=1287 src/test_finetune.py \
    --do_train --do_eval \
    --overwrite_output_dir \
    --model_name_or_path 'models/OneKE' \
    --stage 'sft' \
    --model_name 'llama' \
    --template 'llama2_zh' \
    --train_file 'data/train.json' \
    --valid_file 'data/dev.json' \
    --output_dir=${output_dir} \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --preprocessing_num_workers 16 \
    --num_train_epochs 10 \
    --learning_rate 5e-5 \
    --max_grad_norm 0.5 \
    --optim "adamw_torch" \
    --max_source_length 400 \
    --cutoff_len 700 \
    --max_target_length 300 \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --save_total_limit 10 \
    --bf16
```

#### 4.3.2LoRA SFT

```bash
output_dir='lora/OneKE-continue-lora'
mkdir -p ${output_dir}
CUDA_VISIBLE_DEVICES="0,1,2,3" torchrun --nproc_per_node=4 --master_port=1287 src/test_finetune.py \
    --do_train --do_eval \
    --overwrite_output_dir \
    --model_name_or_path 'models/OneKE' \
    --stage 'sft' \
    --model_name 'llama' \
    --template 'llama2_zh' \
    --train_file 'data/train.json' \
    --valid_file 'data/dev.json' \
    --output_dir=${output_dir} \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --preprocessing_num_workers 16 \
    --num_train_epochs 10 \
    --learning_rate 5e-5 \
    --max_grad_norm 0.5 \
    --optim "adamw_torch" \
    --max_source_length 400 \
    --cutoff_len 700 \
    --max_target_length 300 \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --save_total_limit 10 \
    --lora_r 64 \
    --lora_alpha 64 \
    --lora_dropout 0.05 \
    --bf16
```

## 5.Prediction

### 5.1Test Data Conversion

Before converting test data, please visit the [data](./data) directory to understand the data structure required for each task: 1) for the input data format, see `sample.json`; 2) for the schema format, see `schema.json`; 3) for the converted data format, see `train.json`.
**Unlike training data, test data input does not need to include annotation fields (`entity`, `relation`, `event`)**.\n\n\n```bash\npython ie2instruction/convert_func.py \\\n    --src_path data/NER/sample.json \\\n    --tgt_path data/NER/test.json \\\n    --schema_path data/NER/schema.json \\\n    --language zh \\\n    --task NER \\\n    --split_num 6 \\\n    --split test\n```\n\nWhen setting `split` to **test**, select the appropriate number of schemas according to the task type: **6 is recommended for NER, while 4 is recommended for RE, EE, EET, EEA**. The transformed test data will contain five fields: `id`, `task`, `source`, `instruction`, `label`.\n\nThe `label` field will be used for subsequent evaluation. If the input data lacks the annotation fields (`entity`, `relation`, `event`), the transformed test data will not contain the `label` field, which is suitable for scenarios where no original annotated data is available.\n\n\n\n### 5.2IEPile Test Data\n\nDownload the IEPile dataset from [Google Drive](https://drive.google.com/file/d/1jPdvXOTTxlAmHkn5XkeaaCFXQkYJk5Ng/view?usp=sharing) | [Hugging Face](https://huggingface.co/datasets/zjunlp/iepile) | [WiseModel](https://wisemodel.cn/datasets/zjunlp/IEPile) | [ModelScope](https://modelscope.cn/datasets/ZJUNLP/IEPile)\n\nThe file tree is shown as follows:\n\n```bash\nIEPile\n├── train.json      # Training Set\n├── dev.json        # Validation Set\n├── IE-en           # English Unified Format Data\n│   ├── NER\n│   │   ├── CoNLL2003\n│   │   │   ├── train.json\n│   │   │   ├── dev.json\n│   │   │   ├── schema.json   # schema information file\n│   │   │   └── test.json\n│   │   ├── ...\n│   ├── RE\n│   ├── EE\n│   ├── EET\n│   ├── EEA\n├── IE-zh           # Chinese Unified Format Data\n│   ├── NER\n│   ├── RE\n│   ├── EE\n│   ├── EET\n│   ├── EEA\n```\n\nBatch test instruction data can be obtained through the following script:\n\n```bash\nbash ie2instruction/eval_data_convert.bash\n```\n\n\u003e You need to set 
the dir_path in the first line of the script to the actual absolute path of the IEPile dataset.\n\u003e Note: Due to the possible inconsistency in the order of labels in the converted schema sequence, there may be slight deviations in the evaluation results.\n\n\n\n\n### 5.3Basic Model + LoRA Prediction\n\nModel download links for **`LLaMA2-IEPile`** | **`Baichuan2-IEPile`** : [zjunlp/llama2-13b-iepile-lora](https://huggingface.co/zjunlp/llama2-13b-iepile-lora/tree/main) | [zjunlp/baichuan2-13b-iepile-lora](https://huggingface.co/zjunlp/baichuan2-13b-iepile-lora) \n\n\n| checkpoint_dir | model_name_or_path | model_name | fp16/bf16 | template | \n| --- | --- | --- | --- | --- |\n| llama2-13b-iepile-lora | LLaMA2-13B-Chat | llama | bf16 | llama2 |\n| baichuan2-13b-iepile-lora | BaiChuan2-13B-Chat | baichuan | bf16 | baichuan2 |\n| llama3-8b-iepile-lora | LLaMA3-8B-Instruct | llama | bf16 | alpaca |\n| qwen1.5-14b-iepile-lora | Qwen1.5-14B-Chat | qwen2 | bf16 | qwen |\n\n⚠️ When performing the **Basic Model + LoRA Prediction**, it's necessary not only to download the LoRA weight parameters but also the base model parameters. For example, when using `baichuan2-13b-iepile-lora` (specified with `--checkpoint_dir`), you must also download `BaiChuan2-13B-Chat` (specified with `--model_name_or_path`). 
🚫**You cannot** merely set `--model_name_or_path lora/baichuan2-13b-iepile-lora`.\n\n\n```bash\nCUDA_VISIBLE_DEVICES=0 python src/inference.py \\\n    --stage sft \\\n    --model_name_or_path 'models/llama2-13B-Chat' \\\n    --checkpoint_dir 'lora/llama2-13b-IEPile-lora' \\\n    --model_name 'llama' \\\n    --template 'llama2' \\\n    --do_predict \\\n    --input_file 'data/NER/test.json' \\\n    --output_file 'results/llama2-13b-IEPile-lora_output.json' \\\n    --finetuning_type lora \\\n    --output_dir 'lora/test' \\\n    --predict_with_generate \\\n    --cutoff_len 512 \\\n    --bf16 \\\n    --max_new_tokens 300 \\\n    --bits 4\n```\n\n* During inference, `model_name`, `template`, and `bf16` must be the same as the settings used during training.\n* `model_name_or_path`: Specify the path to the base model being used, which must match the corresponding LoRA model.\n* `checkpoint_dir`: The path to the LoRA weight files.\n* `output_dir`: This parameter does not take effect during inference and any path can be specified.\n* `input_file`, `output_file`: Specify the input path for the test file and the output path for the prediction results, respectively.\n* `cutoff_len`, `max_new_tokens`: Set the maximum input length and the number of new tokens to be generated, adjusting according to device performance.\n\n\u003e Quantization can be performed by setting `bits` to 4; it is recommended for the RTX 3090.\n\n\n\n### 5.4IE-Specific Model Prediction\n\n| checkpoint_dir | model_name_or_path | model_name | fp16/bf16 | template | \n| --- | --- | --- | --- | --- |\n| OneKE | OneKE | llama | bf16 | llama2_zh |\n\n\nModel download links for **`OneKE(based on chinese-alpaca2)`**: [zjunlp/OneKE](https://huggingface.co/zjunlp/OneKE)\n\n\n```bash\nCUDA_VISIBLE_DEVICES=0 python src/inference.py \\\n    --stage sft \\\n    --model_name_or_path 'models/OneKE' \\\n    --model_name 'llama' \\\n    --template 'llama2_zh' \\\n    --do_predict \\\n    --input_file 'data/NER/test.json' \\\n  
  --output_file 'results/OneKE_output.json' \\\n    --output_dir 'lora/test' \\\n    --predict_with_generate \\\n    --cutoff_len 512 \\\n    --bf16 \\\n    --max_new_tokens 300 \\\n    --bits 4\n```\n\n`model_name_or_path`: The path to the weights of the model specialized for Information Extraction (IE).\n\n\n\n## Model Usage\n\n### Model Download\n\n[HuggingFace](https://huggingface.co/zjunlp/OneKE), [ModelScope](https://modelscope.cn/models/ZJUNLP/OneKE), [WiseModel](https://wisemodel.cn/models/zjunlp/OneKE)\n\n\n### Environment Installation\n\n```bash\nconda create -n OneKE python=3.9\nconda activate OneKE\npip install -r requirements.txt\n```\n\n### Fast Running\n\nIt is recommended to have at least **20GB of GPU memory for training and inference**.\n\n\n```python\nimport torch\nfrom transformers import (\n    AutoConfig,\n    AutoTokenizer,\n    AutoModelForCausalLM,\n    GenerationConfig,\n    BitsAndBytesConfig\n)\n\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\nmodel_path = 'zjunlp/OneKE' # path where the downloaded model is stored locally\nconfig = AutoConfig.from_pretrained(model_path, trust_remote_code=True)\ntokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)\n\n\n# 4-bit quantized OneKE\nquantization_config=BitsAndBytesConfig(     \n    load_in_4bit=True,\n    llm_int8_threshold=6.0,\n    llm_int8_has_fp16_weight=False,\n    bnb_4bit_compute_dtype=torch.bfloat16,\n    bnb_4bit_use_double_quant=True,\n    bnb_4bit_quant_type=\"nf4\",\n)\n\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_path,\n    config=config,\n    device_map=\"auto\",  \n    quantization_config=quantization_config,\n    torch_dtype=torch.bfloat16,\n    trust_remote_code=True,\n)\nmodel.eval()\n\n\nsystem_prompt = '\u003c\u003cSYS\u003e\u003e\\nYou are a helpful assistant. 你是一个乐于助人的助手。\\n\u003c\u003c/SYS\u003e\u003e\\n\\n'\nsintruct = \"{\\\"instruction\\\": \\\"You are an expert in named entity recognition. 
Please extract entities that match the schema definition from the input. Return an empty list if the entity type does not exist. Please respond in the format of a JSON string.\\\", \\\"schema\\\": [\\\"person\\\", \\\"organization\\\", \\\"else\\\", \\\"location\\\"], \\\"input\\\": \\\"284 Robert Allenby ( Australia ) 69 71 71 73 , Miguel Angel Martin ( Spain ) 75 70 71 68 ( Allenby won at first play-off hole )\\\"}\"\nsintruct = '[INST] ' + system_prompt + sintruct + '[/INST]'\n\ninput_ids = tokenizer.encode(sintruct, return_tensors=\"pt\").to(device)\ninput_length = input_ids.size(1)\ngeneration_output = model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_length=1024, max_new_tokens=512, return_dict_in_generate=True), pad_token_id=tokenizer.eos_token_id)\ngeneration_output = generation_output.sequences[0]\ngeneration_output = generation_output[input_length:]\noutput = tokenizer.decode(generation_output, skip_special_tokens=True)\n\nprint(output)\n```\n\n\n### VLLM Inference\n\nThe environment configuration for vLLM can be found in its official installation guide ([Installation](https://vllm.readthedocs.io/en/latest/getting_started/installation.html)).\n\n\nDeploy the service:\n\n```bash\npython -m vllm.entrypoints.openai.api_server --model zjunlp/OneKE\n```\n\nCall the API from a terminal:\n\n```bash\ncurl http://localhost:8000/v1/completions -H \"Content-Type: application/json\" -d '{\"model\": \"/data2/lkw/OneKE\", \"prompt\": \"[INST] \u003c\u003cSYS\u003e\u003eYou are a helpful assistant. 你是一个乐于助人的助手。\u003c\u003c/SYS\u003e\u003e{\\\"instruction\\\": \\\"You are an expert in named entity recognition. Please extract entities that match the schema definition from the input. Return an empty list if the entity type does not exist. 
Please respond in the format of a JSON string.\\\", \\\"schema\\\": [\\\"person\\\", \\\"organization\\\", \\\"else\\\", \\\"location\\\"], \\\"input\\\": \\\"284 Robert Allenby ( Australia ) 69 71 71 73 , Miguel Angel Martin ( Spain ) 75 70 71 68 ( Allenby won at first play-off hole )\\\"}[/INST]\", \"max_tokens\": 1024, \"temperature\": 0}'\n```\n\n\n### GGUF Format Conversion\n\nTo convert model weights from Hugging Face format to GGUF format, we first need to clone the GitHub repository of llama.cpp, which contains the necessary conversion scripts. Please follow the steps below:\n\n\n```bash\ngit clone https://github.com/ggerganov/llama.cpp.git\ncd llama.cpp\n```\n\n\nNext, use the provided Python script convert_hf_to_gguf.py to perform the format conversion. Ensure you have installed the required Python environment and dependencies. Below is the specific command to execute the conversion:\n\n```bash\npython3 convert_hf_to_gguf.py \\\n    /disk/disk_20T/ghh/OneKE \\\n    --outfile /disk/disk_20T/ghh/OneKE-gguf \\\n    --outtype bf16 \n```\n\n\nPlease note that the first positional argument specifies the location of the original model files, while the `--outfile` parameter defines the save location for the converted GGUF file. The `--outtype` parameter sets the precision of the numerical values in the output file.\n\n\n\n### Ollama Inference\n\nThe environment configuration for Ollama can be found in its official documentation: https://github.com/ollama/ollama/tree/main\n\n```bash\ncurl -fsSL https://ollama.com/install.sh | sh\n```\n\nCreate a Modelfile:\n\n```bash\nFROM ./OneKE-13B-BF16.gguf\nPARAMETER temperature 0\nPARAMETER num_ctx 4096\nTEMPLATE \"\"\"[INST] \u003c\u003cSYS\u003e\u003eYou are a helpful assistant. 
你是一个乐于助人的助手。\u003c\u003c/SYS\u003e\u003e{{ .Prompt }}[/INST]\"\"\"\n```\n\nStart ollama\n\n```bash\nollama serve\n```\n\nEnter commands in another terminal window\n\n```bash\nollama create oneke -f Modelfile\n\nollama run oneke\n```\n\nInput and Output\n\n```\n\u003e\u003e\u003e {\\\"instruction\\\": \\\"你是专门进行实体抽取的专家。请从input中抽取出符合schema定义的实体，不存在的实体类型\n... 返回空列表。请按照JSON字符串的格式回答。\\\", \\\"schema\\\": [\\\"人物\\\", \\\"地理位置\\\", \\\"组织机构\\\"], \\\"input\n... \\\": \\\"在这里恕弟不恭之罪，敢在尊前一诤：前人论书，每曰“字字有来历，笔笔有出处”，细读公字，何尝跳出前人\n... 藩篱，自隶变而后，直至明季，兄有何新出？\\\"}\n {\"人物\": [], \"地理位置\": [], \"组织机构\": []}\n\n\u003e\u003e\u003e {\\\"instruction\\\": \\\"你是专门进行实体抽取的专家。请从input中抽取出符合schema定义的实体，不存在的实体类型\n... 返回空列表。请按照JSON字符串的格式回答。\\\", \\\"schema\\\": [\\\"组织机构\\\", \\\"地理位置\\\", \\\"人物\\\"], \\\"input\n... \\\": \\\"胡老说，当画画疲倦时就到院里去看看，给这盆花浇点水，给那棵花剪剪枝，回来再接着画，画累了再出去，\n... 如此循环往复，脑体结合，有益健康，胜过吃药。\\\"}\n {\"组织机构\": [], \"地理位置\": [], \"人物\": [\"胡\"]}\n\n\u003e\u003e\u003e {\\\"instruction\\\": \\\"你是专门进行事件提取的专家。请从input中抽取出符合schema定义的事件，不存在的事件返回\n... 空列表，不存在的论元返回NAN，如果论元存在多值请返回列表。请按照JSON字符串的格式回答。\\\", \\\"schema\\\": [{\n... \\\"event_type\\\": \\\"产品行为-获奖\\\", \\\"trigger\\\": true, \\\"arguments\\\": [\\\"获奖人\\\", \\\"颁奖机构\\\", \\\"奖项\\\n... \", \\\"时间\\\"]}, {\\\"event_type\\\": \\\"组织行为-罢工\\\", \\\"trigger\\\": true, \\\"arguments\\\": [\\\"罢工人数\\\", \\\"\n... 罢工人员\\\", \\\"所属组织\\\", \\\"时间\\\"]}, {\\\"event_type\\\": \\\"组织关系-裁员\\\", \\\"trigger\\\": true, \\\"argument\n... s\\\": [\\\"裁员方\\\", \\\"时间\\\", \\\"裁员人数\\\"]}, {\\\"event_type\\\": \\\"组织关系-解散\\\", \\\"trigger\\\": true, \\\"ar\n... guments\\\": [\\\"解散方\\\", \\\"时间\\\"]}], \\\"input\\\": \\\"消失的“外企光环”，5月份在华裁员900余人，香饽饽变“臭”\n... 
了\\\"}\n {\"产品行为-获奖\": [], \"组织行为-罢工\": [], \"组织关系-裁员\": [{\"trigger\": \"裁员\", \"arguments\": {\"裁员方\n\": \"NAN\", \"时间\": \"5月份\", \"裁员人数\": \"900余人\"}}], \"组织关系-解散\": []}\n```\n\n\nDelete after exiting\n\n```bash\nollama stop oneke\n\nollama rm oneke\n```\n\n\n### Inference on Mac\n\n```python\nimport torch\nfrom transformers import (\n    AutoConfig,\n    AutoTokenizer,\n    AutoModelForCausalLM,\n    GenerationConfig,\n    BitsAndBytesConfig\n)\n\ndevice = torch.device(\"mps\")\nmodel_path = 'zjunlp/OneKE' #选择你下载的模型存储在本地的位置\nconfig = AutoConfig.from_pretrained(model_path, trust_remote_code=True)\ntokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)\n\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_path,\n    config=config,\n    device_map=\"auto\",  \n    torch_dtype=torch.bfloat16,\n    trust_remote_code=True,\n)\nmodel.eval()\nmodel = model.to(device)\n```\n\n`PYTORCH_ENABLE_MPS_FALLBACK=1 python test.py` 命令行启动。\n\n\n### Multi GPU Inference\n\n```python\nimport torch\nfrom transformers import AutoConfig, AutoModel, AutoTokenizer, GenerationConfig\nfrom accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_in_model, dispatch_model\n\nmax_memory_each_gpu = '15GiB' \ngpu_device_ids = [0, 1] \nno_split_module_classes = [\"LlamaDecoderLayer\"]\nmodel_path = '/disk/disk_20T/ghh/OneKE' #选择你下载的模型存储在本地的位置\n\nmax_memory = {\n    device_id: max_memory_each_gpu for device_id in gpu_device_ids\n}\n\nconfig = AutoConfig.from_pretrained(model_path, trust_remote_code=True)\ntokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)\n\nwith init_empty_weights():\n    model = AutoModel.from_config(config, torch_dtype=torch.float16, trust_remote_code=True)\n\ndevice_map = infer_auto_device_map(model, max_memory=max_memory, no_split_module_classes=no_split_module_classes)\n\nprint(\"auto determined device_map\", device_map)\ndevice_map[\"llm.model.embed_tokens\"] = 0\ndevice_map[\"llm.model.layers.0\"] 
= 0\ndevice_map[\"llm.lm_head\"] = 0\ndevice_map[\"vpm\"] = 0\ndevice_map[\"resampler\"] = 0\nprint(\"modified device_map\", device_map)\n\nload_checkpoint_in_model(model, model_path, device_map=device_map)\n\nmodel = dispatch_model(model, device_map=device_map)\ntorch.set_grad_enabled(False)\nmodel.eval()\n\n\nsystem_prompt = '\u003c\u003cSYS\u003e\u003e\\nYou are a helpful assistant. 你是一个乐于助人的助手。\\n\u003c\u003c/SYS\u003e\u003e\\n\\n'\nsintruct = \"{\\\"instruction\\\": \\\"You are an expert in named entity recognition. Please extract entities that match the schema definition from the input. Return an empty list if the entity type does not exist. Please respond in the format of a JSON string.\\\", \\\"schema\\\": [\\\"person\\\", \\\"organization\\\", \\\"else\\\", \\\"location\\\"], \\\"input\\\": \\\"284 Robert Allenby ( Australia ) 69 71 71 73 , Miguel Angel Martin ( Spain ) 75 70 71 68 ( Allenby won at first play-off hole )\\\"}\"\nsintruct = '[INST] ' + system_prompt + sintruct + '[/INST]'\n\ninput_ids = tokenizer.encode(sintruct, return_tensors=\"pt\")\ninput_length = input_ids.size(1)\ngeneration_output = model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_length=1024, max_new_tokens=512, return_dict_in_generate=True), pad_token_id=tokenizer.eos_token_id)\ngeneration_output = generation_output.sequences[0]\ngeneration_output = generation_output[input_length:]\noutput = tokenizer.decode(generation_output, skip_special_tokens=True)\n\nprint(output)\n```\n\n\n\n## 6.Evaluation\n\nWe provide scripts for evaluating the F1 scores for various tasks.\n\n```bash\npython ie2instruction/eval_func.py \\\n  --path1 data/NER/processed.json \\\n  --task NER \n```\n\n* `task`: Currently supports five types of tasks: ['`RE`', '`NER`', '`EE`', '`EET`', '`EEA`'].\n* You can set `sort_by` to `source` to calculate the F1 scores on each dataset separately.\n\n\n\n## 7.Statement and License\nWe believe that annotated data contains the wisdom of humanity, 
and its existence is to promote the benefit of all humankind and help enhance our quality of life. We strongly urge all users not to use our corpus for any actions that may harm national or public security or violate legal regulations.\nWe have done our best to ensure the quality and legality of the data provided. However, we also recognize that despite our efforts, there may still be some unforeseen issues, such as concerns about data protection and risks and problems caused by data misuse. We will not be responsible for these potential problems.\nFor original data that is subject to usage permissions stricter than the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en) agreement, IEPile will adhere to those stricter terms. In all other cases, our operations will be based on the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en) license agreement.\n\n\n\n\n## 8.Limitations\n\nFrom the data perspective, our study primarily focuses on schema-based IE, which limits our ability to generalize to human instructions that do not follow our specific format requirements. \nAdditionally, we do not explore the field of Open Information Extraction (Open IE); however, if we remove schema constraints, our dataset would be suitable for Open IE scenarios.\nBesides, IEPile is confined to data in English and Chinese, and in the future, we hope to include data in more languages.\n\nFrom the model perspective, due to computational resource limitations, our research only assessed two models: Baichuan and LLaMA, along with some baseline models. 
Our dataset can be applied to any other large language models (LLMs), such as Qwen, ChatGLM, and Gemma.\n\n\n\n## 9.Cite\nIf you use IEPile or the code, please cite the paper:\n\n```bibtex\n@article{DBLP:journals/corr/abs-2402-14710,\n  author       = {Honghao Gui and\n                  Lin Yuan and\n                  Hongbin Ye and\n                  Ningyu Zhang and\n                  Mengshu Sun and\n                  Lei Liang and\n                  Huajun Chen},\n  title        = {IEPile: Unearthing Large-Scale Schema-Based Information Extraction\n                  Corpus},\n  journal      = {CoRR},\n  volume       = {abs/2402.14710},\n  year         = {2024},\n  url          = {https://doi.org/10.48550/arXiv.2402.14710},\n  doi          = {10.48550/ARXIV.2402.14710},\n  eprinttype    = {arXiv},\n  eprint       = {2402.14710},\n  timestamp    = {Tue, 09 Apr 2024 07:32:43 +0200},\n  biburl       = {https://dblp.org/rec/journals/corr/abs-2402-14710.bib},\n  bibsource    = {dblp computer science bibliography, https://dblp.org}\n}\n```\n\n## 10.Acknowledgements\nWe are very grateful for the inspiration provided by the [MathPile](mathpile) and [KnowledgePile](https://huggingface.co/datasets/Query-of-CC/Knowledge_Pile) projects. 
Special thanks are due to the builders and maintainers of the following datasets: [AnatEM](https://doi.org/10.1093/BIOINFORMATICS/BTT580)、[BC2GM](https://link.springer.com/chapter/10.1007/978-3-030-68763-2_48)、[BC4CHEMD](https://link.springer.com/chapter/10.1007/978-3-030-68763-2_48)、[NCBI-Disease](https://linkinghub.elsevier.com/retrieve/pii/S1532046413001974)、[BC5CDR](https://openreview.net/pdf?id=9EAQVEINuum)、[HarveyNER](https://aclanthology.org/2022.naacl-main.243/)、[CoNLL2003](https://aclanthology.org/W03-0419/)、[GENIA](https://pubmed.ncbi.nlm.nih.gov/12855455/)、[ACE2005](https://catalog.ldc.upenn.edu/LDC2006T06)、[MIT Restaurant](https://ieeexplore.ieee.org/document/6639301)、[MIT Movie](https://ieeexplore.ieee.org/document/6639301)、[FabNER](https://link.springer.com/article/10.1007/s10845-021-01807-x)、[MultiNERD](https://aclanthology.org/2022.findings-naacl.60/)、[Ontonotes](https://aclanthology.org/N09-4006/)、[FindVehicle](https://arxiv.org/abs/2304.10893)、[CrossNER](https://ojs.aaai.org/index.php/AAAI/article/view/17587)、[MSRA NER](https://aclanthology.org/W06-0115/)、[Resume NER](https://aclanthology.org/P18-1144/)、[CLUE NER](https://arxiv.org/abs/2001.04351)、[Weibo NER](https://aclanthology.org/D15-1064/)、[Boson](https://github.com/InsaneLife/ChineseNLPCorpus/tree/master/NER/boson)、[ADE 
Corpus](https://jbiomedsem.biomedcentral.com/articles/10.1186/2041-1480-3-15)、[GIDS](https://arxiv.org/abs/1804.06987)、[CoNLL2004](https://aclanthology.org/W04-2412/)、[SciERC](https://aclanthology.org/D18-1360/)、[Semeval-RE](https://aclanthology.org/S10-1006/)、[NYT11-HRL](https://ojs.aaai.org/index.php/AAAI/article/view/4688)、[KBP37](https://arxiv.org/abs/1508.01006)、[NYT](https://link.springer.com/chapter/10.1007/978-3-642-15939-8_10)、[Wiki-ZSL](https://aclanthology.org/2021.naacl-main.272/)、[FewRel](https://aclanthology.org/D18-1514/)、[CMeIE](https://link.springer.com/chapter/10.1007/978-3-030-60450-9_22)、[DuIE](https://link.springer.com/chapter/10.1007/978-3-030-32236-6_72)、[COAE2016](https://github.com/Sewens/COAE2016)、[IPRE](https://arxiv.org/abs/1907.12801)、[SKE2020](https://aistudio.baidu.com/datasetdetail/177191)、[CASIE](https://ojs.aaai.org/index.php/AAAI/article/view/6401)、[PHEE](https://aclanthology.org/2022.emnlp-main.376/)、[CrudeOilNews](https://aclanthology.org/2022.lrec-1.49/)、[RAMS](https://aclanthology.org/2020.acl-main.718/)、[WikiEvents](https://aclanthology.org/2021.naacl-main.69/)、[DuEE](https://link.springer.com/chapter/10.1007/978-3-030-60457-8_44)、[DuEE-Fin](https://link.springer.com/chapter/10.1007/978-3-031-17120-8_14)、[FewFC](https://ojs.aaai.org/index.php/AAAI/article/view/17720)、[CCF law](https://aistudio.baidu.com/projectdetail/4201483), and more. These datasets have significantly contributed to the advancement of this research. We are also grateful for the valuable contributions in the field of information extraction made by [InstructUIE](http://arxiv.org/abs/2304.08085) and [YAYI-UIE](http://arxiv.org/abs/2312.15548), both in terms of data and model innovation. Our research results have benefitted from their creativity and hard work as well. Additionally, our heartfelt thanks go to [hiyouga/LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory); our fine-tuning code implementation owes much to their work. 
The assistance provided by these academic resources has been instrumental in the completion of our research, and for this, we are deeply appreciative.