{"id":13754202,"url":"https://github.com/hitz-zentroa/GoLLIE","last_synced_at":"2025-05-09T22:31:34.815Z","repository":{"id":198556758,"uuid":"700942096","full_name":"hitz-zentroa/GoLLIE","owner":"hitz-zentroa","description":"Guideline following Large Language Model for Information Extraction","archived":false,"fork":false,"pushed_at":"2024-10-27T20:44:54.000Z","size":11369,"stargazers_count":367,"open_issues_count":3,"forks_count":25,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-04-30T03:03:36.262Z","etag":null,"topics":["code-llama","event-extraction","gollie","guidelines","hugginface-hub","huggingface","inference","information-extraction","llama","llama2","llm","llms","named-entity-recognition","relation-extraction","state-of-the-art","text-generation","training","transformer"],"latest_commit_sha":null,"homepage":"https://hitz-zentroa.github.io/GoLLIE/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hitz-zentroa.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-10-05T15:38:20.000Z","updated_at":"2025-04-29T09:57:58.000Z","dependencies_parsed_at":"2024-06-12T16:41:56.562Z","dependency_job_id":"8398ff34-ec31-445a-92aa-aabc817ae1aa","html_url":"https://github.com/hitz-zentroa/GoLLIE","commit_stats":{"total_commits":681,"total_committers":6,"mean_commits":113.5,"dds":"0.35242290748898675","last_synced_commit":"ca0edffb4d0241415473150b3c4e20ef1f6f182b"},"previous_names":["hitz-zentroa/gollie"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/repositories/hitz-zentroa%2FGoLLIE","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hitz-zentroa%2FGoLLIE/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hitz-zentroa%2FGoLLIE/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hitz-zentroa%2FGoLLIE/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hitz-zentroa","download_url":"https://codeload.github.com/hitz-zentroa/GoLLIE/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253335686,"owners_count":21892713,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["code-llama","event-extraction","gollie","guidelines","hugginface-hub","huggingface","inference","information-extraction","llama","llama2","llm","llms","named-entity-recognition","relation-extraction","state-of-the-art","text-generation","training","transformer"],"created_at":"2024-08-03T09:01:49.459Z","updated_at":"2025-05-09T22:31:29.802Z","avatar_url":"https://github.com/hitz-zentroa.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话"],"sub_categories":["大语言对话模型及数据"],"readme":"\n\u003cp align=\"center\"\u003e\n    \u003cbr\u003e\n    \u003cimg src=\"assets/GoLLIE.png\" style=\"height: 250px;\"\u003e\n    \u003cbr\u003e\n    \u003ch2 align=\"center\"\u003e\u003cb\u003eG\u003c/b\u003euideline f\u003cb\u003eo\u003c/b\u003ellowing \u003cb\u003eL\u003c/b\u003earge \u003cb\u003eL\u003c/b\u003eanguage Model for 
\u003cb\u003eI\u003c/b\u003enformation \u003cb\u003eE\u003c/b\u003extraction\u003c/h2\u003e\n\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://twitter.com/intent/tweet?text=Wow+this+new+model+is+amazing:\u0026url=https%3A%2F%2Fgithub.com%2Fhitz-zentroa%2FGoLLIE\"\u003e\u003cimg alt=\"Twitter\" src=\"https://img.shields.io/twitter/url?style=social\u0026url=https%3A%2F%2Fgithub.com%2Fhitz-zentroa%2FGoLLIE\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/hitz-zentroa/GoLLIE/blob/main/LICENSE\"\u003e\u003cimg alt=\"GitHub license\" src=\"https://img.shields.io/github/license/hitz-zentroa/GoLLIE\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://huggingface.co/collections/HiTZ/gollie-651bf19ee315e8a224aacc4f\"\u003e\u003cimg alt=\"Pretrained Models\" src=\"https://img.shields.io/badge/🤗HuggingFace-Pretrained Models-green\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://hitz-zentroa.github.io/GoLLIE/\"\u003e\u003cimg alt=\"Blog\" src=\"https://img.shields.io/badge/📒-Blog Post-blue\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://arxiv.org/abs/2310.03668\"\u003e\u003cimg alt=\"Paper\" src=\"https://img.shields.io/badge/📖-Paper-orange\"\u003e\u003c/a\u003e\n\u003cbr\u003e\n     \u003ca href=\"http://www.hitz.eus/\"\u003e\u003cimg src=\"https://img.shields.io/badge/HiTZ-Basque%20Center%20for%20Language%20Technology-blueviolet\"\u003e\u003c/a\u003e\n    \u003ca href=\"http://www.ixa.eus/?language=en\"\u003e\u003cimg src=\"https://img.shields.io/badge/IXA-%20NLP%20Group-ff3333\"\u003e\u003c/a\u003e\n    \u003cbr\u003e\n     \u003cbr\u003e\n\u003c/p\u003e\n\n\u003cp align=\"justify\"\u003e\nWe present  \u003cimg src=\"assets/GoLLIE.png\" width=\"20\"\u003e GoLLIE, a Large Language Model trained to follow annotation guidelines. GoLLIE outperforms previous approaches on zero-shot Information Extraction and allows the user to perform inferences with annotation schemas defined on the fly. 
Unlike previous approaches, GoLLIE is able to follow detailed definitions and does not rely only on the knowledge already encoded in the LLM. Code and models are publicly available.\n\n- 📒 Blog Post: [GoLLIE: Guideline-following Large Language Model for Information Extraction](https://hitz-zentroa.github.io/GoLLIE/)\n- 📖 Paper: [GoLLIE: Annotation Guidelines improve Zero-Shot Information-Extraction](https://openreview.net/forum?id=Y3wpuxd7u9)\n- \u003cimg src=\"assets/GoLLIE.png\" width=\"20\"\u003eGoLLIE in the 🤗HuggingFace Hub: [HiTZ/gollie](https://huggingface.co/collections/HiTZ/gollie-651bf19ee315e8a224aacc4f)\n- 🚀 Example Jupyter Notebooks: [GoLLIE Notebooks](notebooks/)\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"assets/zero_shot_results.png\"\u003e\n\u003c/p\u003e\n\n\n## Schema definition and inference example\n\nThe labels are represented as Python classes, and the guidelines or instructions are introduced as docstrings. The model starts generating after the `result = [` line.\n\u003c!---\n```Python\n# Entity definitions\n@dataclass\nclass Launcher(Template):\n    \"\"\"Refers to a vehicle designed primarily to transport payloads from the Earth's \n    surface to space. Launchers can carry various payloads, including satellites, \n    crewed spacecraft, and cargo, into various orbits or even beyond Earth's orbit. \n    They are usually multi-stage vehicles that use rocket engines for propulsion.\"\"\"\n\n    mention: str  \n    \"\"\"\n    The name of the launcher vehicle. \n    Such as: \"Saturn V\", \"Atlas V\", \"Soyuz\", \"Ariane 5\"\n    \"\"\"\n    space_company: str # The company that operates the launcher. Such as: \"Blue Origin\", \"ESA\", \"Boeing\", \"ISRO\", \"Northrop Grumman\", \"Arianespace\"\n    crew: List[str] # Names of the crew members boarding the Launcher. 
Such as: \"Neil Armstrong\", \"Michael Collins\", \"Buzz Aldrin\"\n    \n\n@dataclass\nclass Mission(Template):\n    \"\"\"Any planned or accomplished journey beyond Earth's atmosphere with specific objectives, \n    either crewed or uncrewed. It includes missions to satellites, the International \n    Space Station (ISS), other celestial bodies, and deep space.\"\"\"\n    \n    mention: str\n    \"\"\"\n    The name of the mission. \n    Such as: \"Apollo 11\", \"Artemis\", \"Mercury\"\n    \"\"\"\n    date: str # The start date of the mission\n    departure: str # The place from which the vehicle will be launched. Such as: \"Florida\", \"Houston\", \"French Guiana\"\n    destination: str # The place or planet to which the launcher will be sent. Such as: \"Moon\", \"low-orbit\", \"Saturn\"\n\n# This is the text to analyze\ntext = (\n    \"The Ares 3 mission to Mars is scheduled for 2032. The Starship rocket built by SpaceX will take off from Boca Chica, \"\n    \"carrying the astronauts Max Rutherford, Elena Soto, and Jake Martinez.\"\n)\n\n# The annotation instances that take place in the text above are listed here\nresult = [\n    Mission(mention='Ares 3', date='2032', departure='Boca Chica', destination='Mars'),\n    Launcher(mention='Starship', space_company='SpaceX', crew=['Max Rutherford', 'Elena Soto', 'Jake Martinez'])\n]\n```\n--\u003e\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"assets/snippets/space_transparent.png\"\u003e\n\u003c/p\u003e\n\n## Installation\n\nYou will need to install the following dependencies to run the GoLLIE codebase:\n```bash\n# PyTorch \u003e= 2.0.0 | https://pytorch.org/get-started\n# We recommend that you install the 2.1.0 version or newer, as it includes important bug fixes.\n\n# transformers \u003e= 4.33.1\npip install --upgrade transformers\n\n# PEFT \u003e= 0.4.0\npip install --upgrade peft\n\n# bitsandbytes \u003e= 0.40.0\npip install --upgrade bitsandbytes\n\n# Flash Attention 2.0\npip install flash-attn --no-build-isolation\npip 
install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary\n```\n\nYou will also need these dependencies:\n```bash\npip install numpy black Jinja2 tqdm rich psutil datasets ruff wandb fschat\n```\n\n## Pretrained models\nWe release three GoLLIE models based on [Code Llama](https://huggingface.co/codellama) (7B, 13B, and 34B). The models are available in the 🤗HuggingFace Hub.\n\n| Model | Supervised average F1 | Zero-shot average F1 |                     🤗HuggingFace Hub                     |\n|---|:---------------------:|:--------------------:|:---------------------------------------------------------:|\n| GoLLIE-7B |         73.0          |         55.3         |  [HiTZ/GoLLIE-7B](https://huggingface.co/HiTZ/GoLLIE-7B)  |\n| GoLLIE-13B |         73.9          |         56.0         | [HiTZ/GoLLIE-13B](https://huggingface.co/HiTZ/GoLLIE-13B) |\n| GoLLIE-34B |       **75.0**        |       **57.2**       | [HiTZ/GoLLIE-34B](https://huggingface.co/HiTZ/GoLLIE-34B) |\n\n## How to use GoLLIE\n\nPlease take a look at our 🚀 Example Jupyter Notebooks to learn how to use GoLLIE: [GoLLIE Notebooks](notebooks/)\n\n## Currently supported tasks\n\nThis is the list of tasks used for training and evaluating GoLLIE. However, as demonstrated in the 🚀 [Create Custom Task notebook](notebooks/Create%20Custom%20Task.ipynb), GoLLIE can perform a wide range of unseen tasks. \nFor more info, read our [📖Paper](https://arxiv.org/abs/2310.03668).\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"assets/datasets.png\"\u003e\n\u003c/p\u003e\n\nWe plan to continue adding more tasks to the list. If you want to contribute, please feel free to open a PR or contact us. 
You can use the tasks already implemented in the `src/tasks` folder as examples.\n\n\n## Generate the GoLLIE dataset\nThe configuration files used to generate the GoLLIE dataset are available in the [configs/data_configs/](configs/data_configs/) folder.\nYou can generate the dataset by running the following command (see [bash_scripts/generate_data.sh](bash_scripts/generate_data.sh) for more info): \n```bash\nCONFIG_DIR=\"configs/data_configs\"\nOUTPUT_DIR=\"data/processed_w_examples\"\n\npython -m src.generate_data \\\n     --configs \\\n        ${CONFIG_DIR}/ace_config.json \\\n        ${CONFIG_DIR}/bc5cdr_config.json \\\n        ${CONFIG_DIR}/broadtwitter_config.json \\\n        ${CONFIG_DIR}/casie_config.json \\\n        ${CONFIG_DIR}/conll03_config.json \\\n        ${CONFIG_DIR}/crossner_ai_config.json \\\n        ${CONFIG_DIR}/crossner_literature_config.json \\\n        ${CONFIG_DIR}/crossner_music_config.json \\\n        ${CONFIG_DIR}/crossner_politics_config.json \\\n        ${CONFIG_DIR}/crossner_science_config.json \\\n        ${CONFIG_DIR}/diann_config.json \\\n        ${CONFIG_DIR}/e3c_config.json \\\n        ${CONFIG_DIR}/europarl_config.json \\\n        ${CONFIG_DIR}/fabner_config.json \\\n        ${CONFIG_DIR}/harveyner_config.json \\\n        ${CONFIG_DIR}/mitmovie_config.json \\\n        ${CONFIG_DIR}/mitrestaurant_config.json \\\n        ${CONFIG_DIR}/multinerd_config.json \\\n        ${CONFIG_DIR}/ncbidisease_config.json \\\n        ${CONFIG_DIR}/ontonotes_config.json \\\n        ${CONFIG_DIR}/rams_config.json \\\n        ${CONFIG_DIR}/tacred_config.json \\\n        ${CONFIG_DIR}/wikievents_config.json \\\n        ${CONFIG_DIR}/wnut17_config.json \\\n     --output ${OUTPUT_DIR} \\\n     --overwrite_output_dir \\\n     --include_examples\n```\n\n**We do not redistribute the datasets used to train and evaluate GoLLIE**. 
Not all of them are publicly available; some require a license to access them.\n\nFor the datasets available in the HuggingFace Datasets library, the script will download them automatically.\n\nFor the following datasets, you must provide the path to the dataset by modifying the corresponding [configs/data_configs/](configs/data_configs/) file: [ACE05](https://catalog.ldc.upenn.edu/LDC2006T06) ([Preprocessing script](https://github.com/hitz-zentroa/GoLLIE/blob/main/src/tasks/ace/preprocess_ace.py)), [CASIE](https://github.com/Ebiquity/CASIE/tree/master/data), [CrossNER](https://github.com/zliucr/CrossNER), [DIANN](http://nlp.uned.es/diann/), [E3C](https://github.com/hltfbk/E3C-Corpus/tree/main/preprocessed_data/clinical_entities/English), [HarveyNER](https://github.com/brickee/HarveyNER/tree/main/data/tweets), [MitMovie](https://groups.csail.mit.edu/sls/downloads/movie/), [MitRestaurant](https://groups.csail.mit.edu/sls/downloads/restaurant/), [RAMS](https://nlp.jhu.edu/rams/), [TACRED](https://nlp.stanford.edu/projects/tacred/), [WikiEvents](https://github.com/raspberryice/gen-arg).\n\nRegarding the ACE05 dataset, you can obtain the splits from the code of the OneIE paper: [http://blender.cs.illinois.edu/software/oneie/](http://blender.cs.illinois.edu/software/oneie/)\n\nIf you encounter difficulties generating the dataset, please don't hesitate to contact us.\n\n## How to train your own GoLLIE\n\nFirst, you need to generate the GoLLIE dataset. See the previous section for more info.\n\nSecond, you must create a configuration file. Please see the [configs/model_configs](configs/model_configs) folder for examples. \n\nFinally, you can train your own GoLLIE by running the following command (see the [bash_scripts/](bash_scripts/) folder for more examples): \n```bash\nCONFIGS_FOLDER=\"configs/model_configs\"\npython3 -m src.run ${CONFIGS_FOLDER}/GoLLIE+-7B_CodeLLaMA.yaml\n```\n\n## How to evaluate a model\nFirst, you need to generate the GoLLIE dataset. 
See the previous section for more info.\n\nSecond, you must create a configuration file. Please see the [configs/model_configs/eval](configs/model_configs/eval) folder for examples. \n\nFinally, you can evaluate your own GoLLIE by running the following command (see the [bash_scripts/eval](bash_scripts/eval) folder for more examples): \n```bash\nCONFIGS_FOLDER=\"configs/model_configs/eval\"\npython3 -m src.run ${CONFIGS_FOLDER}/GoLLIE+-7B_CodeLLaMA.yaml\n```\n\n\n\n## Citation\n```bibtex\n@inproceedings{\n    sainz2024gollie,\n    title={Go{LLIE}: Annotation Guidelines improve Zero-Shot Information-Extraction},\n    author={Oscar Sainz and Iker Garc{\\'\\i}a-Ferrero and Rodrigo Agerri and Oier Lopez de Lacalle and German Rigau and Eneko Agirre},\n    booktitle={The Twelfth International Conference on Learning Representations},\n    year={2024},\n    url={https://openreview.net/forum?id=Y3wpuxd7u9}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhitz-zentroa%2FGoLLIE","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhitz-zentroa%2FGoLLIE","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhitz-zentroa%2FGoLLIE/lists"}