{"id":28653901,"url":"https://github.com/tiger-ai-lab/tigerscore","last_synced_at":"2025-06-13T07:08:04.239Z","repository":{"id":198136437,"uuid":"698832667","full_name":"TIGER-AI-Lab/TIGERScore","owner":"TIGER-AI-Lab","description":"\"TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks\" [TMLR 2024]","archived":false,"fork":false,"pushed_at":"2024-12-21T20:44:46.000Z","size":275773,"stargazers_count":28,"open_issues_count":1,"forks_count":2,"subscribers_count":5,"default_branch":"main","last_synced_at":"2024-12-21T21:27:54.476Z","etag":null,"topics":["evaluation","language-model","llm","metrics"],"latest_commit_sha":null,"homepage":"https://tiger-ai-lab.github.io/TIGERScore/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TIGER-AI-Lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-10-01T05:21:44.000Z","updated_at":"2024-12-21T20:44:50.000Z","dependencies_parsed_at":"2023-10-04T16:08:21.106Z","dependency_job_id":"5749be25-28a9-4a10-9af1-09fa010905c7","html_url":"https://github.com/TIGER-AI-Lab/TIGERScore","commit_stats":null,"previous_names":["tiger-ai-lab/tigerscore"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/TIGER-AI-Lab/TIGERScore","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FTIGERScore","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FTIGERScore/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FTIGERScor
e/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FTIGERScore/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TIGER-AI-Lab","download_url":"https://codeload.github.com/TIGER-AI-Lab/TIGERScore/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FTIGERScore/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259599331,"owners_count":22882357,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["evaluation","language-model","llm","metrics"],"created_at":"2025-06-13T07:08:03.632Z","updated_at":"2025-06-13T07:08:04.191Z","avatar_url":"https://github.com/TIGER-AI-Lab.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# **TIGERScore**\r\nThis repo contains the code, data, and models for TMLR 2024 paper \"[TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks](https://arxiv.org/abs/2310.00752)\"\r\n\r\n\r\n\u003cdiv align=\"center\"\u003e\r\n 🔥 🔥 🔥 Check out our \u003ca href = \"https://tiger-ai-lab.github.io/TIGERScore/\"\u003e[Project Page]\u003c/a\u003e for more results and analysis!\r\n\u003c/div\u003e\r\n\r\n\u003cbr\u003e\r\n\u003cdiv align=\"center\"\u003e\r\n  \u003cimg src=\"github_overview.png\" width=\"80%\" title=\"Introduction Figure\"\u003e\r\n\u003c/div\u003e\r\n\r\n## 🔥News\r\n\r\n- [12/2] TIGERScore now support running with llama.cpp, check [Quantization Support Cpu](#quantization-support-cpu) for details\r\n\r\n## 
**Table of Contents**\r\n\r\n- [📌 Introduction](#introduction)\r\n- [🤗 Datasets and Models](#datasets-and-models)\r\n- [⚙️ Installation](#installation)\r\n- [🛠️ Usage](#usage)\r\n- [📜 License](#license)\r\n- [📖 Citation](#citation)\r\n\r\n\r\n\r\n\r\n## **Introduction**\r\nWe present 🐯 TIGERScore, a **T**rained metric that follows **I**nstruction **G**uidance to perform **E**xplainable and **R**eference-free evaluation over a wide spectrum of text generation tasks. \r\n\r\nExisting automatic metrics are lagging and suffer from issues such as 1) **Dependency on references**, 2) **Limited to specific domains**, 3) **Lack of attribution**. In contrast, TIGERScore is designed to be driven by natural language instructions and to provide detailed error analysis that pinpoints the mistakes in the generated text.\r\n\r\nSpecifically, TIGERScore takes an instruction, an associated input context, and a hypothesis output that might contain errors. TIGERScore then evaluates this hypothesis output and lists several errors, each consisting of the error location, aspect, explanation, and penalty score (score reduced, starting from 0). The sum of the reduced scores is taken as the overall rating of this output. \r\n\r\nExperiments show that TIGERScore surpasses existing baseline metrics in correlation with human ratings on all 6 held-in tasks and 1 held-out task, achieving the highest overall performance. 
We hope that TIGERScore, as a powerful, interpretable, and easy-to-use metric, can promote research in the LLM community.\r\n\r\n## Datasets and Models\r\n\r\n| Datasets |\r\n| ----- |\r\n| 📏 [MetricInstruct](https://huggingface.co/datasets/TIGER-Lab/MetricInstruct) |\r\n\r\n| Models 🐯                                           \t | \r\n|---------------------------------------------------------------\t |\r\n|  🦙 [TIGERScore-7B](https://huggingface.co/TIGER-Lab/TIGERScore-7B)   \t| \r\n|  🦙 [TIGERScore-13B](https://huggingface.co/TIGER-Lab/TIGERScore-13B) \t| \r\n|  🦙 [TIGERScore-7B-GGUF](https://huggingface.co/TIGER-Lab/TIGERScore-7B-GGUF)   \t| \r\n|  🦙 [TIGERScore-13B-GGUF](https://huggingface.co/TIGER-Lab/TIGERScore-13B-GGUF) \t| \r\n|  \u003cimg src=\"https://raw.githubusercontent.com/01-ai/Yi/main/assets/img/Yi_logo_icon_light.svg\" style=\"height: 1em; vertical-align: middle;\" title=\"Yi\"\u003e [TIGERScore-Yi-6B](https://huggingface.co/TIGER-Lab/TIGERScore-Yi-6B) |\r\n\r\n| Other Resources                                           \t | \r\n|---------------------------------------------------------------\t |\r\n| [🤗 TIGERScore Collections](https://huggingface.co/collections/TIGER-Lab/tigerscore-657020bfae61260b6131f1ca)|\r\n| [🤗 Huggingface Demo](https://huggingface.co/spaces/TIGER-Lab/TIGERScore) |\r\n\r\n\r\n\r\n\r\n\r\n## Installation\r\n\r\nTo use the TIGERScore pipeline directly, first install it as a Python package. \r\n```bash\r\npip install git+https://github.com/TIGER-AI-Lab/TIGERScore.git\r\n```\r\nPlease check that `torch.cuda.is_available()` returns `True` on your local machine.\r\n\r\nIn addition, to use TIGERScore with vLLM (detailed [here](#vllm-support-recommended)), you need to manually install vLLM following the [vLLM documentation](https://docs.vllm.ai/en/latest/getting_started/installation.html). 
\r\n\r\n- If your CUDA version is 12.1\r\n```bash\r\npip install vllm\r\npip3 install -U xformers --index-url https://download.pytorch.org/whl/cu121\r\n```\r\n- If your CUDA version is 11.8\r\n```bash\r\n# Replace `cp39` with your Python version (e.g., `cp38`, `cp39`, `cp311`).\r\npip install https://github.com/vllm-project/vllm/releases/download/v0.2.2/vllm-0.2.2+cu118-cp39-cp39-manylinux1_x86_64.whl\r\npip3 install -U xformers --index-url https://download.pytorch.org/whl/cu118\r\n```\r\n\r\nIf you want to use the training scripts, install the dependencies by running the following command:\r\n```bash\r\npip install -r requirements.txt\r\n```\r\n## Usage\r\n\r\n### Basic Usage\r\nAfter installation, you can score text generations with the following example Python code (see [`tigerscore_example_usage.ipynb`](./tigerscore_example_usage.ipynb) for more use cases):\r\n```python\r\n# gpu device setup\r\nimport os\r\nos.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"\r\n# example  \r\ninstruction = \"Write an apology letter.\"\r\ninput_context = \"Reason: You canceled a plan at the last minute due to illness.\"\r\nhypo_output = \"Hey [Recipient],\\n\\nI'm really sorry for ditching our plan. I suddenly got an opportunity for a vacation so I took it. I know this might have messed up your plans and I regret that.\\n\\nDespite being under the weather, I would rather go for an adventure. I hope you can understand my perspective and I hope this incident doesn't change anything between us.\\n\\nWe can reschedule our plan for another time. 
Sorry again for the trouble.\\n\\nPeace out,\\n[Your Name]\\n\\n---\"\r\n\r\n# Load the scorer with one of the options below and evaluate the example in 3 lines of code\r\nfrom tigerscore import TIGERScorer\r\nscorer = TIGERScorer(model_name=\"TIGER-Lab/TIGERScore-7B\") # on GPU\r\n# scorer = TIGERScorer(model_name=\"TIGER-Lab/TIGERScore-7B\", quantized=True) # 4 bit quantization on GPU\r\n# scorer = TIGERScorer(model_name=\"TIGER-Lab/TIGERScore-7B\", use_vllm=True) # VLLM on GPU\r\n# scorer = TIGERScorer(model_name=\"TIGER-Lab/TIGERScore-7B-GGUF\", use_llamacpp=True) # 4 bit quantization on CPU\r\nresults = scorer.score([instruction], [hypo_output], [input_context])\r\n\r\n# print the results, which is a list of JSON outputs containing the automatically parsed results\r\nprint(results)\r\n``` \r\nThe result is a list of dicts, each containing a structured error analysis.\r\n```json\r\n[\r\n    {\r\n        \"num_errors\": 3,\r\n        \"score\": -12.0,\r\n        \"errors\": {\r\n            \"error_0\": {\r\n                \"location\": \"\\\"I'm really glad for ditching our plan.\\\"\",\r\n                \"aspect\": \"Inappropriate language or tone\",\r\n                \"explanation\": \"The phrase \\\"ditching our plan\\\" is informal and disrespectful. It should be replaced with a more respectful and apologetic phrase like \\\"cancelling our plan\\\".\",\r\n                \"severity\": \"Major\",\r\n                \"score_reduction\": \"4.0\"\r\n            },\r\n            \"error_1\": {\r\n                \"location\": \"\\\"I suddenly got an opportunity for a vacation so I took it.\\\"\",\r\n                \"aspect\": \"Lack of apology or remorse\",\r\n                \"explanation\": \"This sentence shows no remorse for cancelling the plan at the last minute. 
It should be replaced with a sentence that expresses regret for the inconvenience caused.\",\r\n                \"severity\": \"Major\",\r\n                \"score_reduction\": \"4.0\"\r\n            },\r\n            \"error_2\": {\r\n                \"location\": \"\\\"I would rather go for an adventure.\\\"\",\r\n                \"aspect\": \"Incorrect reason for cancellation\",\r\n                \"explanation\": \"This sentence implies that the reason for cancelling the plan was to go on an adventure, which is incorrect. The correct reason was illness. This sentence should be replaced with a sentence that correctly states the reason for cancellation.\",\r\n                \"severity\": \"Major\",\r\n                \"score_reduction\": \"4.0\"\r\n            }\r\n        },\r\n        \"raw_output\": \"...\"\r\n    }\r\n]\r\n```\r\n\r\n### VLLM Support (**Recommended**)\r\n```python\r\nscorer = TIGERScorer(model_name=\"TIGER-Lab/TIGERScore-7B\", use_vllm=True) # VLLM on GPU\r\n```\r\nTIGERScore supports fast inference with VLLM. On a single A6000 (48GB) GPU, it takes only **0.2s - 0.3s** for TIGERScore-13b to score each instance.\r\n\r\n### Quantization Support (GPU)\r\n```python\r\nscorer = TIGERScorer(model_name=\"TIGER-Lab/TIGERScore-7B\", quantized=True) # 4 bit quantization on GPU\r\n```\r\nSetting the initialization parameter `quantized=True` loads the model in its 4-bit version via the Hugging Face `load_in_4bit=True` option. \r\n\r\nPlease note that quantization decreases the memory requirement by a large margin: you can run TIGERScore on a GPU with about 20+GB of memory. However, inference might be slower than with the original bfloat16 version, so the trade-off is yours to make.\r\n\r\n### LlamaCPP Support (CPU)\r\n```python\r\nscorer = TIGERScorer(model_name=\"TIGER-Lab/TIGERScore-7B-GGUF\", use_llamacpp=True)\r\n```\r\nWe also provide the Llamacpp version of TIGERScore-7B/13B. 
By using the GGUF version we provide, you can run TIGERScore on pure CPU devices. It generally takes **20s** for TIGERScore-13b to score each instance.\r\n\r\n## Data Preparation\r\nDataset preprocessing scripts and intermediate results can be found [here](https://drive.google.com/file/d/1DAjvig-A_57CuBvENLg8A2PycOaz9ZkT/view?usp=sharing).\r\n### Prompting template\r\nThe folder [`xgptscore`](./tigerscore/xgptscore/) contains all the templates that we used to query ChatGPT or GPT-4 to identify errors in the hypothesis output for the different tasks that TIGERScore covers. We call these API query methods XGPTScore, an e**X**plainable **Scoring** method that queries **GPT** models.\r\n\r\nThe overall pipeline of XGPTScore is:\r\n\r\n1. We define a query template that asks GPT models to identify errors in the hypothesis output based on the task instruction, source text, and reference text.\r\n2. We manually construct various evaluation aspects to focus on for different tasks. ([`./constants.py`](./tigerscore/xgptscore/constants.py))\r\n3. Then, by applying the templates and specifying the aspects to focus on in the template, GPT models are required to return the identified errors in a predefined format (such as JSON).\r\n\r\nCheck [`xgptscore/README.md`](./tigerscore/xgptscore/README.md) for more details, including how to use our query template with a single function, `xgptscore()`.\r\n\r\n### Dataset Components\r\nMetricInstruct consists of data from 2 sampling channels, the **real-world channel** and the **synthetic channel**. 
\r\n- The real-world channel data is generated by the script [`generate_distill_data.sh`](./tigerscore/eval_scripts/generate_distill_data.sh).\r\n- The synthetic channel data is generated by the script [`generate_synthesis_distill_data.sh`](./tigerscore/eval_scripts/generate_synthesis_distill_data.sh).\r\nThe overall purpose of the 2-channel data collection is to cover as many error types as possible in the training data so that our model generalizes better.\r\n\r\nAfter getting these data, we apply a series of heuristics to filter out bad data and augment the data:\r\n1. Drop items that are too long, too short, badly formatted, etc. (pattern matching)\r\n2. Prompt GPT-4 to drop items with unreasonable error analysis contents ([`check_data.sh`](./tigerscore/eval_scripts/check_data.sh))\r\n3. Our evaluation aspects might be limited because they are manually defined and fixed. Therefore, we propose to generate high-quality outputs with free-form error aspects using [`generate_inst_synthetic_data.sh`](./tigerscore/eval_scripts/generate_inst_synthetic_data.sh) as a supplement to the synthetic channel. \r\n\r\n### 📏 MetricInstruct \r\n\r\nYou can load the preprocessed data used to finetune TIGERScore-V1 directly from Hugging Face 🤗:\r\n```python\r\nfrom datasets import load_dataset\r\ndataset = load_dataset(\"TIGER-Lab/MetricInstruct\")\r\n```\r\n\r\n## **Training Scripts**\r\n\r\nWe provide our training and testing scripts in the folder [`finetune`](./tigerscore/finetune/), where we use 🧮 \r\n- [`finetune_llama.sh`](./tigerscore/finetune/finetune_llama.sh) to finetune the model.\r\n- [`format_distill_data.sh`](./tigerscore/finetune/format_distill_data.sh) to transform the data into the format for finetuning, that is, a single instruction and input context with an output.\r\n- [`test_llama_vllm.sh`](./tigerscore/finetune/test_llama_vllm.sh) to test our finetuned model and compute the correlation as its performance. 
\r\nPlease check these scripts for more details of our training and testing process.\r\n- [`eval_baseline.sh`](./tigerscore/eval_scripts/eval_baseline.sh) to reproduce the baseline experiment results. See [`./tigerscore/common/README.md`](./tigerscore/common/README.md) to install the env.\r\n\r\n## **Citation**\r\n\r\nPlease cite our paper if you find our data, model, or code useful. \r\n\r\n```\r\n@article{Jiang2023TIGERScoreTB,\r\n  title={TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks},\r\n  author={Dongfu Jiang and Yishan Li and Ge Zhang and Wenhao Huang and Bill Yuchen Lin and Wenhu Chen},\r\n  journal={ArXiv},\r\n  year={2023},\r\n  volume={abs/2310.00752},\r\n  url={https://api.semanticscholar.org/CorpusID:263334281}\r\n}\r\n```\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftiger-ai-lab%2Ftigerscore","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftiger-ai-lab%2Ftigerscore","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftiger-ai-lab%2Ftigerscore/lists"}