{"id":26027767,"url":"https://github.com/gsarti/qe4pe","last_synced_at":"2025-03-06T16:57:18.011Z","repository":{"id":280968320,"uuid":"864444215","full_name":"gsarti/qe4pe","owner":"gsarti","description":"Code for \"QE4PE: Word-level Quality Estimation for Human Post-Editing\" ✍️","archived":false,"fork":false,"pushed_at":"2025-03-06T08:51:25.000Z","size":3744,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-06T09:41:31.332Z","etag":null,"topics":["behavioral-logs","dutch","human-evaluation","italian","machine-translation","machine-translation-evaluation","machine-translation-metrics","post-editing","quality-estimation","unbabel-comet","word-level-quality-estimation"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gsarti.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-28T08:29:08.000Z","updated_at":"2025-03-06T08:54:08.000Z","dependencies_parsed_at":"2025-03-06T09:51:38.093Z","dependency_job_id":null,"html_url":"https://github.com/gsarti/qe4pe","commit_stats":null,"previous_names":["gsarti/qe4pe"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gsarti%2Fqe4pe","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gsarti%2Fqe4pe/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gsarti%2Fqe4pe/releases","manifests_url":"https:
//repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gsarti%2Fqe4pe/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gsarti","download_url":"https://codeload.github.com/gsarti/qe4pe/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242250925,"owners_count":20096895,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["behavioral-logs","dutch","human-evaluation","italian","machine-translation","machine-translation-evaluation","machine-translation-metrics","post-editing","quality-estimation","unbabel-comet","word-level-quality-estimation"],"created_at":"2025-03-06T16:57:17.557Z","updated_at":"2025-03-06T16:57:18.003Z","avatar_url":"https://github.com/gsarti.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# QE4PE: Word-level Quality Estimation for Human Post-Editing\n\n[Gabriele Sarti](https://gsarti.com) • [Vilém Zouhar](https://vilda.net/) •  [Grzegorz Chrupała](https://grzegorz.chrupala.me/) • [Ana Guerberof Arenas](https://scholar.google.com/citations?user=i6bqaTsAAAAJ) • [Malvina Nissim](https://malvinanissim.github.io/) • [Arianna Bisazza](https://www.cs.rug.nl/~bisazza/)\n\n\u003cp float=\"left\"\u003e\n    \u003cimg src=\"figures/highlevel_qe4pe.png\" alt=\"QE4PE annotation pipeline\" width=\"350\"/\u003e\n    \u003cimg src=\"figures/quality_edited.png\" alt=\"DivEMT annotation pipeline\" width=\"400\"/\u003e\n\u003c/p\u003e\n\n\u003e **Abstract:** Word-level quality estimation (QE) detects erroneous spans in machine translations, 
which can direct and facilitate human post-editing. While the accuracy of word-level QE systems has been assessed extensively, their usability and downstream influence on the speed, quality and editing choices of human post-editing remain understudied. Our QE4PE study investigates the impact of word-level QE on machine translation (MT) post-editing in a realistic setting involving 42 professional post-editors across two translation directions. We compare four error-span highlight modalities, including supervised and uncertainty-based word-level QE methods, for identifying potential errors in the outputs of a state-of-the-art neural MT model. Post-editing effort and productivity are estimated by behavioral logs, while quality improvements are assessed by word- and segment-level human annotation. We find that domain, language and editors' speed are critical factors in determining highlights' effectiveness, with modest differences between human-made and automated QE highlights underlining a gap between accuracy and usability in professional workflows.\n\nThis repository contains data, scripts and notebooks associated with the paper [\"QE4PE: Word-level Quality Estimation for Human Post-Editing\"](https://arxiv.org/abs/2503.03044). If you use any of the following contents for your work, we kindly ask you to cite our paper:\n\n```bibtex\n@misc{sarti-etal-2024-qe4pe,\n      title={{QE4PE}: Word-level Quality Estimation for Human Post-Editing}, \n      author={Gabriele Sarti and Vilém Zouhar and Grzegorz Chrupała and Ana Guerberof-Arenas and Malvina Nissim and Arianna Bisazza},\n      year={2025},\n      eprint={2503.03044},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https://arxiv.org/abs/2503.03044}, \n}\n```\n\n## 🐮 Groningen Translation Environment (GroTE)\n\nGroTE is a simple Gradio-based interface for post-editing machine translation outputs with error spans. 
It lets you visualize and edit translations in a web interface hosted on [HF Spaces](https://huggingface.co/spaces), with real-time logging of granular editing actions. Find out more about setting up and running GroTE in the [GroTE repository](https://github.com/gsarti/grote).\n\n## The QE4PE Dataset\n\nProcessed QE4PE logs for `pre`, `main` and `post` tags, MQM/ESA annotations and questionnaire responses are available as [🤗 Datasets](https://huggingface.co/datasets/gsarti/qe4pe). Summary of the data:\n\n- Post-edits over [NLLB 3.3B](https://huggingface.co/facebook/nllb-200-3.3B) outputs for \u003e400 segments from [WMT23](https://www2.statmt.org/wmt23/) (social media and biomedical abstracts): **15 edits per direction** (3 oracle post-edits + 12 core set translators) for En-\u003eIt and En-\u003eNl.\n- A single set of [MQM](https://themqm.org/) and [ESA](https://aclanthology.org/2024.wmt-1.131/) annotations from 12 human annotators for **MT outputs and all post-edited versions** across both directions for a subset of ~150 segments.\n- Fine-grained editing logs for core set translators across `pre`, `main` and `post` editing phases.\n- Pre- and post-task questionnaires for all post-editors.\n\nThe raw logfiles produced by our [🐮 GroTE](https://github.com/gsarti/grote) interface are available in the `task` folder in the same repository as the datasets. Refer to the [main QE4PE dataset readme](https://huggingface.co/datasets/gsarti/qe4pe) and readmes in each task folder for more details about the provided data.\n\n## Reproducing Our Processing Pipeline (⚠️ WIP)\n\nThis section provides a step-by-step guide to reproducing the data processing and analysis steps for the QE4PE study.\n\n**IMPORTANT:** While we describe how to regenerate all outputs we used for our analysis, they are all pre-computed and available in the [🤗 Datasets](https://huggingface.co/datasets/gsarti/qe4pe) repository. 
We are adding the scripts little by little; please be patient and reach out if needed! 🤗\n\n### 1. Setup\n\nInstall the required dependencies and the `qe4pe` package:\n\n```bash\npip install -r requirements-dev.txt\npip install -e .\n```\n\nDownload the QE4PE repository from the [🤗 Datasets](https://huggingface.co/datasets/gsarti/qe4pe) repository and place it in the `data` folder (it can be pulled as a git submodule with `git submodule update --init --recursive` and `git submodule update --recursive`).\n\n### 2. Generate WMT23 Outputs\n\nTODO: Add script for generation with NLLB 3.3B\n\nThe generated outputs are saved in `data/setup/wmt23/nllb_\u003cSIZE\u003e/wmttest2023.\u003cLANG\u003e`, with `\u003cSIZE\u003e` being either `3b` or `600m` and `\u003cLANG\u003e` being `ita` or `nld`.\n\n### 3. Annotate Outputs with XCOMET\n\nTODO: Add script for XCOMET annotations\n\nThe generated outputs are saved in `data/setup/wmt23/nllb_\u003cNLLB_SIZE\u003e/wmttest2023_xcomet-\u003cXCOMET_SIZE\u003e_\u003cLANG\u003e.json`, with `\u003cNLLB_SIZE\u003e` being either `3b` or `600m`, `\u003cLANG\u003e` being `ita` or `nld`, and `\u003cXCOMET_SIZE\u003e` being `xl` or `xxl`.\n\n### 4. From WMT23 Outputs to Selected Segments\n\nRun `qe4pe filter-wmt-data` to recover selected segments for `pre`, `main` and `post` editing phases from the full set of WMT23 segments and their translations available in `data/setup/wmt23`. Intermediate outputs are saved in `data/setup/processed`.\n\n### 5. Generate Highlights for Selected Segments\n\nTODO: Add scripts for generating highlights with XCOMET and the unsupervised methods.\n\nHighlighted segments are saved in the `data/setup/highlights` folder.\n\n### 6. Generate QA Dataframe from HTML MQM/ESA Annotations\n\nRaw QA annotations are provided in `data/setup/qa/eng-ita` and `data/setup/qa/eng-nld`.\n\nTODO: Add script for converting HTML annotations to a QA dataframe.\n\nThe final dataframe is saved in `data/setup/qa/qa_df.csv`.\n\n### 7. 
Putting it All Together: Merging Outputs, Logs and QA into a Unified Dataset\n\nRun `qe4pe process-task-data --TASK_PATH` to perform the preprocessing of outputs and logs for a specific task in `data/setup/task`, e.g. `qe4pe process-task-data data/task/main`. The processing is controlled by the task `processing_config.json` file, which specifies paths and additional info (e.g. for `main` QA annotations are merged with other fields).\n\nThe processed data is saved in `data/processed/task` as `processed_\u003cTASK\u003e.csv`.\n\n## Reproducing Our Analysis\n\n### 1. Visualizing the Selection Process\n\nTODO: Add notebook with plots from the selection process.\n\n### 2. Reproducing the Paper Analysis from Processed Data\n\nFollow the [analysis notebook](notebooks/analysis.ipynb) to reproduce the main plots and results from the paper. While some plots were retouched in Inkscape for the final version (marked as `_edited` in `figures/`), we provide the code to generate them from the processed data.\n\nModeling results can be reproduced from the [modeling notebook](notebooks/modeling.Rmd).\n\nTODO: Add additional analysis scripts for appendix plots.\n\n## See an Issue?\n\nIf you encounter any issues while running the scripts or notebooks, please open an issue in this repository. We will be happy to help you out!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgsarti%2Fqe4pe","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgsarti%2Fqe4pe","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgsarti%2Fqe4pe/lists"}