{"id":13935590,"url":"https://github.com/Unbabel/COMET","last_synced_at":"2025-07-19T20:33:18.355Z","repository":{"id":39958973,"uuid":"267619792","full_name":"Unbabel/COMET","owner":"Unbabel","description":" A Neural Framework for MT Evaluation","archived":false,"fork":false,"pushed_at":"2024-07-29T12:41:23.000Z","size":10013,"stargazers_count":501,"open_issues_count":39,"forks_count":78,"subscribers_count":20,"default_branch":"master","last_synced_at":"2024-11-09T16:48:58.800Z","etag":null,"topics":["artificial-intelligence","evaluation-metrics","machine-learning","machine-translation","natural-language-processing","nlp"],"latest_commit_sha":null,"homepage":"https://unbabel.github.io/COMET/html/index.html","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Unbabel.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-05-28T14:59:58.000Z","updated_at":"2024-11-06T21:06:37.000Z","dependencies_parsed_at":"2024-01-08T13:42:42.783Z","dependency_job_id":"48f5b16d-7a57-421c-a185-1e142abe7d12","html_url":"https://github.com/Unbabel/COMET","commit_stats":{"total_commits":437,"total_committers":35,"mean_commits":"12.485714285714286","dds":0.528604118993135,"last_synced_commit":"8503fe799658b753055ced0b1f0950e4404b5065"},"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Unbabel%2FCOMET","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Unbabel%2FCOMET/tags","releases_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Unbabel%2FCOMET/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Unbabel%2FCOMET/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Unbabel","download_url":"https://codeload.github.com/Unbabel/COMET/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":226677122,"owners_count":17666010,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","evaluation-metrics","machine-learning","machine-translation","natural-language-processing","nlp"],"created_at":"2024-08-07T23:01:54.747Z","updated_at":"2024-11-27T03:30:51.991Z","avatar_url":"https://github.com/Unbabel.png","language":"Python","funding_links":[],"categories":["Python","Evaluation and Monitoring","Neuronale Übersetzungstools","Libraries"],"sub_categories":["Bewertung der Übersetzungsqualität","Books"],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/Unbabel/COMET/master/docs/source/_static/img/COMET_lockup-dark.png\"\u003e\n  \u003cbr /\u003e\n  \u003cbr /\u003e\n  \u003ca href=\"https://github.com/Unbabel/COMET/blob/master/LICENSE\"\u003e\u003cimg alt=\"License\" src=\"https://img.shields.io/github/license/Unbabel/COMET\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/Unbabel/COMET/stargazers\"\u003e\u003cimg alt=\"GitHub stars\" src=\"https://img.shields.io/github/stars/Unbabel/COMET\" /\u003e\u003c/a\u003e\n  \u003ca href=\"\"\u003e\u003cimg alt=\"PyPI\" 
src=\"https://img.shields.io/pypi/v/unbabel-comet\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/psf/black\"\u003e\u003cimg alt=\"Code Style\" src=\"https://img.shields.io/badge/code%20style-black-black\" /\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n**NEWS:** \n1) We added a new method to extract free-text explanations from XCOMET outputs! [Check this section](https://github.com/Unbabel/COMET?tab=readme-ov-file#explaining-translation-errors)\n2) We now support [DocCOMET](https://statmt.org/wmt22/pdf/2022.wmt-1.6.pdf), a document-level extension of COMET which can utilize contextual information. Using context improves accuracy on discourse phenomena tasks as well as referenceless evaluation of [chat translation quality](https://arxiv.org/pdf/2403.08314).\n3) We released our new eXplainable COMET models ([XCOMET-XL](https://huggingface.co/Unbabel/XCOMET-XL) and [-XXL](https://huggingface.co/Unbabel/XCOMET-XXL)) which, along with quality scores, detect which errors in the translation are minor, major or critical according to the MQM typology.\n\nPlease check all available models [here](https://github.com/Unbabel/COMET/blob/master/MODELS.md)\n \n# Quick Installation\n\nCOMET requires Python 3.8 or above. Simple installation from PyPI:\n\n```bash\npip install --upgrade pip  # ensures that pip is current \npip install unbabel-comet\n```\n\n**Note:** To use some COMET models such as `Unbabel/wmt22-cometkiwi-da` you must acknowledge its license on Hugging Face Hub and [log in to the Hugging Face Hub](https://huggingface.co/docs/huggingface_hub/quick-start#:~:text=Once%20you%20have%20your%20User%20Access%20Token%2C%20run%20the%20following%20command%20in%20your%20terminal%3A).\n\n\nTo develop locally, run the following commands:\n```bash\ngit clone https://github.com/Unbabel/COMET\ncd COMET\npip install poetry\npoetry install\n```\n\nFor development, you can run the CLI tools directly, e.g.,\n\n```bash\nPYTHONPATH=. 
./comet/cli/score.py\n```\n\n# Table of Contents\n\n1. [Scoring MT outputs](#scoring-mt-outputs)\n    1. [CLI Usage](#cli-usage)\n        1. [Basic scoring command](#basic-scoring-command)\n        2. [Reference-free evaluation](#reference-free-evaluation)\n        3. [Comparing multiple systems](#comparing-multiple-systems)\n        4. [Minimum Bayes Risk Decoding](#minimum-bayes-risk-decoding)\n2. [COMET Models](#comet-models)\n    1. [Interpreting Scores](#interpreting-scores)\n    2. [Languages Covered](#languages-covered)\n    3. [COMET for African Languages](#comet-for-african-languages)\n    4. [Scoring within Python](#scoring-within-python)\n    5. [Explaining Translation Errors](#explaining-translation-errors)\n3. [Train your own Metric](#train-your-own-metric)\n4. [Unittest](#unittest)\n5. [Publications](#publications)\n\n\n# Scoring MT outputs:\n\n## CLI Usage:\n\nTest examples:\n\n```bash\necho -e \"10 到 15 分钟可以送到吗\\nPode ser entregue dentro de 10 a 15 minutos?\" \u003e\u003e src.txt\necho -e \"Can I receive my food in 10 to 15 minutes?\\nCan it be delivered in 10 to 15 minutes?\" \u003e\u003e hyp1.txt\necho -e \"Can it be delivered within 10 to 15 minutes?\\nCan you send it for 10 to 15 minutes?\" \u003e\u003e hyp2.txt\necho -e \"Can it be delivered between 10 to 15 minutes?\\nCan it be delivered between 10 to 15 minutes?\" \u003e\u003e ref.txt\n```\n\n### Basic scoring command:\n```bash\ncomet-score -s src.txt -t hyp1.txt -r ref.txt\n```\n\u003e you can set the number of gpus using `--gpus` (0 to test on CPU).\n\nFor better error analysis, you can use XCOMET models such as [`Unbabel/XCOMET-XL`](https://huggingface.co/Unbabel/XCOMET-XL), you can export the identified errors using the `--to_json` flag:\n\n```bash\ncomet-score -s src.txt -t hyp1.txt -r ref.txt --model Unbabel/XCOMET-XL --to_json output.json\n```\n\nScoring multiple systems:\n```bash\ncomet-score -s src.txt -t hyp1.txt hyp2.txt -r ref.txt\n```\n\nWMT test sets via 
[SacreBLEU](https://github.com/mjpost/sacrebleu):\n\n```bash\ncomet-score -d wmt22:en-de -t PATH/TO/TRANSLATIONS\n```\n\nScoring with context:\n```bash\necho -e \"Pies made from apples like these. \u003c/s\u003e Oh, they do look delicious.\\nOh, they do look delicious.\" \u003e\u003e src.txt\necho -e \"Des tartes faites avec des pommes comme celles-ci. \u003c/s\u003e Elles ont l’air delicieux.\\nElles ont l’air delicieux\" \u003e\u003e hyp1.txt\necho -e \"Des tartes faites avec des pommes comme celles-ci. \u003c/s\u003e Ils ont l’air delicieux.\\nIls ont l’air delicieux.\" \u003e\u003e hyp2.txt\n```\n\nwhere `\u003c/s\u003e` is the separator token of the specific tokenizer (here: `xlm-roberta-large`) that the underlying model uses. \n\n```bash\ncomet-score -s src.txt -t hyp1.txt hyp2.txt --model Unbabel/wmt20-comet-qe-da --enable-context\n```\n\nIf you are only interested in a system-level score use the following command:\n\n```bash\ncomet-score -s src.txt -t hyp1.txt -r ref.txt --quiet --only_system\n```\n\n### Reference-free evaluation:\n\n```bash\ncomet-score -s src.txt -t hyp1.txt --model Unbabel/wmt22-cometkiwi-da\n```\n\n**Note:** To use the `Unbabel/wmt23-cometkiwi-da-xl` you first have to acknowledge its license on [Hugging Face Hub](https://huggingface.co/Unbabel/Unbabel/wmt23-cometkiwi-da-xl).\n\n### Comparing multiple systems:\n\nWhen comparing multiple MT systems we encourage you to run the `comet-compare` command to get **statistical significance** with Paired T-Test and bootstrap resampling [(Koehn, et al 2004)](https://aclanthology.org/W04-3250/).\n\n```bash\ncomet-compare -s src.de -t hyp1.en hyp2.en hyp3.en -r ref.en\n```\n\n### Minimum Bayes Risk Decoding:\n\nThe MBR command allows you to rank translations and select the best one according to COMET metrics. 
For more details you can read our paper on [Quality-Aware Decoding for Neural Machine Translation](https://aclanthology.org/2022.naacl-main.100.pdf).\n\n\n```bash\ncomet-mbr -s [SOURCE].txt -t [MT_SAMPLES].txt --num_sample [X] -o [OUTPUT_FILE].txt\n```\n\nIf working with a very large candidate list you can use the `--rerank_top_k` flag to prune the list to the top-k most promising candidates according to a reference-free metric.\n\nExample for a candidate list of 1000 samples:\n\n```bash\ncomet-mbr -s [SOURCE].txt -t [MT_SAMPLES].txt -o [OUTPUT_FILE].txt --num_sample 1000 --rerank_top_k 100 --gpus 4 --qe_model Unbabel/wmt23-cometkiwi-da-xl\n```\n\nYour source and samples files should be [formatted in this way](https://unbabel.github.io/COMET/html/running.html#:~:text=Example%20with%202%20source%20and%203%20samples%3A).\n\n# COMET Models\n\nWithin COMET, there are several evaluation models available. You can refer to the [MODELS](MODELS.md) page for a comprehensive list of all available models. Here is a concise list of the main reference-based and reference-free models:\n\n- **Default Model:** [`Unbabel/wmt22-comet-da`](https://huggingface.co/Unbabel/wmt22-comet-da) - This model employs a reference-based regression approach and is built upon the XLM-R architecture. It has been trained on direct assessments from WMT17 to WMT20 and provides scores ranging from 0 to 1, where 1 signifies a perfect translation.\n- **Reference-free Model:** [`Unbabel/wmt22-cometkiwi-da`](https://huggingface.co/Unbabel/wmt22-cometkiwi-da) - This reference-free model employs a regression approach and is built on top of InfoXLM. It has been trained using direct assessments from WMT17 to WMT20, as well as direct assessments from the MLQE-PE corpus. Similar to other models, it generates scores ranging from 0 to 1. 
For those interested, we also offer larger versions of this model: [`Unbabel/wmt23-cometkiwi-da-xl`](https://huggingface.co/Unbabel/wmt23-cometkiwi-da-xl) with 3.5 billion parameters and [`Unbabel/wmt23-cometkiwi-da-xxl`](https://huggingface.co/Unbabel/wmt23-cometkiwi-da-xxl) with 10.7 billion parameters.\n- **eXplainable COMET (XCOMET):** [`Unbabel/XCOMET-XXL`](https://huggingface.co/Unbabel/XCOMET-XXL) - Our latest model is trained to identify error spans and assign a final quality score, resulting in an explainable neural metric. We offer this version in XXL with 10.7 billion parameters, as well as the XL variant with 3.5 billion parameters ([`Unbabel/XCOMET-XL`](https://huggingface.co/Unbabel/XCOMET-XL)). These models have demonstrated the highest correlation with MQM and are our best performing evaluation models.\n\nPlease be aware that different models may be subject to varying licenses. To learn more, kindly refer to the [LICENSES.models](LICENSE.models.md) and model licenses sections.\n\nIf you intend to compare your results with papers published before 2022, it's likely that they used older evaluation models. In such cases, please refer to [`Unbabel/wmt20-comet-da`](https://huggingface.co/Unbabel/wmt20-comet-da) and [`Unbabel/wmt20-comet-qe-da`](https://huggingface.co/Unbabel/wmt20-comet-qe-da), which were the primary checkpoints used in previous versions (\u003c2.0) of COMET.\n\nAlso, [UniTE Metric](https://aclanthology.org/2022.acl-long.558/) developed by the NLP2CT Lab at the University of Macau and Alibaba Group can be used directly through COMET check [here for more details](https://huggingface.co/Unbabel/unite-mup).\n\n## Interpreting Scores:\n\n**New:** An excellent reference for learning how to interpret machine translation metrics is the analysis paper by Kocmi et al. 
(2024), available [at this link.](https://arxiv.org/pdf/2401.06760.pdf)\n\nWhen using COMET to evaluate machine translation, it's important to understand how to interpret the scores it produces.\n\nIn general, COMET models are trained to predict quality scores for translations. These scores are typically normalized using a [z-score transformation](https://simplypsychology.org/z-score.html) to account for individual differences among annotators. While the raw score itself does not have a direct interpretation, it is useful for ranking translations and systems according to their quality.\n\nHowever, since 2022 we have introduced a new training approach that scales the scores between 0 and 1. This makes it easier to interpret the scores: a score close to 1 indicates a high-quality translation, while a score close to 0 indicates a translation that is no better than random chance. Also, with the introduction of XCOMET models we can now analyse which text spans are part of minor, major or critical errors according to the MQM typology.\n\nIt's worth noting that when using COMET to compare the performance of two different translation systems, it's important to run the `comet-compare` command to obtain statistical significance measures. This command compares the output of two systems using a statistical hypothesis test, providing an estimate of the probability that the observed difference in scores between the systems is due to chance. 
This is an important step to ensure that any differences in scores between systems are statistically significant.\n\nOverall, the added interpretability of scores in the latest COMET models, combined with the ability to assess statistical significance between systems using `comet-compare`, makes COMET a valuable tool for evaluating machine translation.\n\n## Languages Covered:\n\nAll the above-mentioned models are built on top of XLM-R variants, which cover the following languages:\n\nAfrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.\n\n**Thus, results for language pairs containing uncovered languages are unreliable!**\n\n### COMET for African Languages:\n\nIf you are interested in COMET metrics for African languages, please visit [afriCOMET](https://github.com/masakhane-io/africomet). 
\n\n## Scoring within Python:\n\n```python\nfrom comet import download_model, load_from_checkpoint\n\n# Choose your model from Hugging Face Hub\nmodel_path = download_model(\"Unbabel/XCOMET-XL\")\n# or for example:\n# model_path = download_model(\"Unbabel/wmt22-comet-da\")\n\n# Load the model checkpoint:\nmodel = load_from_checkpoint(model_path)\n\n# Data must be in the following format:\ndata = [\n    {\n        \"src\": \"10 到 15 分钟可以送到吗\",\n        \"mt\": \"Can I receive my food in 10 to 15 minutes?\",\n        \"ref\": \"Can it be delivered between 10 to 15 minutes?\"\n    },\n    {\n        \"src\": \"Pode ser entregue dentro de 10 a 15 minutos?\",\n        \"mt\": \"Can you send it for 10 to 15 minutes?\",\n        \"ref\": \"Can it be delivered between 10 to 15 minutes?\"\n    }\n]\n# Call predict method:\nmodel_output = model.predict(data, batch_size=8, gpus=1)\n```\n\nAs output, we get the following information:\n```python\n# Sentence-level scores (list)\n\u003e\u003e\u003e model_output.scores\n[0.9822099208831787, 0.9599897861480713]\n\n# System-level score (float)\n\u003e\u003e\u003e model_output.system_score\n0.971099853515625\n\n# Detected error spans (list of list of dicts)\n\u003e\u003e\u003e model_output.metadata.error_spans\n[\n  [{'confidence': 0.4160953164100647,\n   'end': 21,\n   'severity': 'minor',\n   'start': 13,\n   'text': 'my food'}],\n  [{'confidence': 0.40004390478134155,\n   'end': 19,\n   'severity': 'minor',\n   'start': 3,\n   'text': 'you send it for'}]\n]\n```\n\nHowever, note that not all COMET models return metadata with detected error spans.\n\n\n## Explaining translation errors:\n\nCheck [this notebook](https://gist.github.com/mtreviso/b618b499bc6de0414a3e11157e91cf02) for a minimal example on how you can combine xCOMET with [xTower](https://huggingface.co/sardinelab/xTower13B) to generate a natural language explanation for each error span. 
\n\nFor the Portuguese-English example above, we would call xTower with the following **prompt**:\n\u003e You are provided with a Source, Translation, Translation quality analysis, and Translation quality score (weak, moderate, good, excellent, best). The Translation quality analysis contains a translation with marked error spans with different levels of severity (minor or major). Given this information, generate an explanation for each error and a fully correct translation. \u003cbr\u003e\u003cbr\u003e\n\u003e Portuguese source: Pode ser entregue dentro de 10 a 15 minutos? \u003cbr\u003e\n\u003e English translation: Can you send it for 10 to 15 minutes? \u003cbr\u003e\n\u003e Translation quality analysis: Can `\u003cerror1 severity='minor'\u003e`you send it for`\u003c/error1\u003e` 10 to 15 minutes? \u003cbr\u003e\n\u003e Translation quality score: excellent\n\n\nAnd get this as output:\n\u003e Explanation for error1: The phrase \"Can you send it for 10 to 15 minutes?\" is a mistranslation of the original Portuguese sentence. The correct interpretation should focus on the delivery time rather than the duration of sending. The original sentence is asking about the delivery time, not the duration of sending. \u003cbr\u003e\n\u003e Translation correction: Can it be delivered within 10 to 15 minutes?\n\n\nFor more information, check [xTower documentation](https://huggingface.co/sardinelab/xTower13B).\n\n\n# Train your own Metric: \n\nInstead of using pretrained models you can train your own model with the following command:\n```bash\ncomet-train --cfg configs/models/{your_model_config}.yaml\n```\n\nYou can then use your own metric to score:\n\n```bash\ncomet-score -s src.de -t hyp1.en -r ref.en --model PATH/TO/CHECKPOINT\n```\n\nYou can also upload your model to [Hugging Face Hub](https://huggingface.co/docs/hub/index). Use [`Unbabel/wmt22-comet-da`](https://huggingface.co/Unbabel/wmt22-comet-da) as an example. 
Then you can use your model directly from the hub.\n\n# unittest:\nIn order to run the toolkit tests you must run the following command:\n\n```bash\npoetry run coverage run --source=comet -m unittest discover\npoetry run coverage report -m # Expected coverage 76%\n```\n\n**Note:** Testing on CPU takes a long time\n\n# Publications\n\nIf you use COMET please cite our work **and don't forget to say which model you used!**\n\n- [xCOMET: Transparent Machine Translation Evaluation through Fine-grained Error Detection](https://arxiv.org/pdf/2310.10482.pdf)\n\n- [Scaling up CometKiwi: Unbabel-IST 2023 Submission for the Quality Estimation Shared Task](https://arxiv.org/pdf/2309.11925.pdf)\n\n- [CometKiwi: IST-Unbabel 2022 Submission for the Quality Estimation Shared Task](https://aclanthology.org/2022.wmt-1.60/)\n\n- [COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task](https://aclanthology.org/2022.wmt-1.52/)\n\n- [Searching for Cometinho: The Little Metric That Could](https://aclanthology.org/2022.eamt-1.9/)\n\n- [Are References Really Needed? Unbabel-IST 2021 Submission for the Metrics Shared Task](https://aclanthology.org/2021.wmt-1.111/)\n\n- [Uncertainty-Aware Machine Translation Evaluation](https://aclanthology.org/2021.findings-emnlp.330/) \n\n- [COMET - Deploying a New State-of-the-art MT Evaluation Metric in Production](https://www.aclweb.org/anthology/2020.amta-user.4)\n\n- [Unbabel's Participation in the WMT20 Metrics Shared Task](https://aclanthology.org/2020.wmt-1.101/)\n\n- [COMET: A Neural Framework for MT Evaluation](https://www.aclweb.org/anthology/2020.emnlp-main.213)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FUnbabel%2FCOMET","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FUnbabel%2FCOMET","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FUnbabel%2FCOMET/lists"}