# QE4PE: Word-level Quality Estimation for Human Post-Editing

[Gabriele Sarti](https://gsarti.com) • [Vilém Zouhar](https://vilda.net/) • [Grzegorz Chrupała](https://grzegorz.chrupala.me/) • [Ana Guerberof Arenas](https://scholar.google.com/citations?user=i6bqaTsAAAAJ) • [Malvina Nissim](https://malvinanissim.github.io/) • [Arianna Bisazza](https://www.cs.rug.nl/~bisazza/)


*Overview of the QE4PE annotation pipeline.*

> **Abstract:** Word-level quality estimation (QE) detects erroneous spans in machine translations, which can direct and facilitate human post-editing. While the accuracy of word-level QE systems has been assessed extensively, their usability and downstream influence on the speed, quality and editing choices of human post-editing remain understudied. Our QE4PE study investigates the impact of word-level QE on machine translation (MT) post-editing in a realistic setting involving 42 professional post-editors across two translation directions. We compare four error-span highlight modalities, including supervised and uncertainty-based word-level QE methods, for identifying potential errors in the outputs of a state-of-the-art neural MT model. Post-editing effort and productivity are estimated by behavioral logs, while quality improvements are assessed by word- and segment-level human annotation. We find that domain, language and editors' speed are critical factors in determining highlights' effectiveness, with modest differences between human-made and automated QE highlights underlining a gap between accuracy and usability in professional workflows.

This repository contains the data, scripts and notebooks associated with the paper ["QE4PE: Word-level Quality Estimation for Human Post-Editing"](https://arxiv.org/abs/2503.03044). If you use any of the following contents for your work, we kindly ask you to cite our paper:

```bibtex
@misc{sarti-etal-2024-qe4pe,
  title={{QE4PE}: Word-level Quality Estimation for Human Post-Editing},
  author={Gabriele Sarti and Vilém Zouhar and Grzegorz Chrupała and Ana Guerberof-Arenas and Malvina Nissim and Arianna Bisazza},
  year={2025},
  eprint={2503.03044},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.03044},
}
```

## 🐮 Groningen Translation Environment (GroTE)

GroTE is a simple Gradio-based interface for post-editing machine translation outputs with error spans. It allows users to visualize and edit translations in a web interface hosted on [HF Spaces](https://huggingface.co/spaces), with real-time logging of granular editing actions. Find out more about setting up and running GroTE in the [GroTE repository](https://github.com/gsarti/grote).

## The QE4PE Dataset

Processed QE4PE logs for the `pre`, `main` and `post` tasks, MQM/ESA annotations and questionnaire responses are available as [🤗 Datasets](https://huggingface.co/datasets/gsarti/qe4pe). Summary of the data:

- Post-edits over [NLLB 3.3B](https://huggingface.co/facebook/nllb-200-3.3B) outputs for >400 segments from [WMT23](https://www2.statmt.org/wmt23/) (social media and biomedical abstracts): **15 edits per direction** (3 oracle post-edits + 12 core set translators) for En->It and En->Nl.
- A single set of [MQM](https://themqm.org/) and [ESA](https://aclanthology.org/2024.wmt-1.131/) annotations from 12 human annotators for **MT outputs and all post-edited versions** across both directions for a subset of ~150 segments.
- Fine-grained editing logs for core set translators across `pre`, `main` and `post` editing phases.
- Pre- and post-task questionnaires for all post-editors.

The raw logfiles produced by our [🐮 GroTE](https://github.com/gsarti/grote) interface are available in the `task` folder in the same repository as the datasets. Refer to the [main QE4PE dataset readme](https://huggingface.co/datasets/gsarti/qe4pe) and readmes in each task folder for more details about the provided data.
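For quick exploration, the processed data can also be loaded directly with the 🤗 `datasets` library. The snippet below is a minimal sketch; the `main` configuration name is an assumption based on the task names above, so check the dataset card for the exact configurations and splits.

```python
# A minimal sketch: load the processed QE4PE data with 🤗 Datasets.
# The "main" config name is an assumption based on the task names above;
# refer to the dataset card for the exact configurations and splits.
from datasets import load_dataset

qe4pe = load_dataset("gsarti/qe4pe", "main")
print(qe4pe)  # shows the available splits and columns
```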

## Reproducing Our Processing Pipeline (⚠️ WIP)

This section provides a step-by-step guide to reproduce the data processing and analysis steps for the QE4PE study.

**IMPORTANT:** While we describe how to regenerate all the outputs used in our analysis, they are all pre-computed and available in the [🤗 Datasets](https://huggingface.co/datasets/gsarti/qe4pe) repository. We are adding the scripts little by little, so please be patient and reach out if needed! 🤗

### 1. Setup

Install the required dependencies and the `qe4pe` package:

```bash
pip install -r requirements-dev.txt
pip install -e .
```

Download the QE4PE dataset from the [🤗 Datasets](https://huggingface.co/datasets/gsarti/qe4pe) repository and place it in the `data` folder; it can be pulled as a git submodule with `git submodule update --init --recursive` and kept up to date with `git submodule update --recursive`.
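If you prefer not to use git submodules, the same repository can be fetched with `huggingface_hub`. This is a minimal sketch; placing the contents directly under `data` is an assumption based on the layout described above.

```python
# A minimal sketch: download the QE4PE 🤗 Datasets repository into data/
# without git submodules. Assumes the scripts expect its contents under data/.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="gsarti/qe4pe",
    repo_type="dataset",
    local_dir="data",
)
```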

### 2. Generate WMT23 Outputs

TODO: Add script for generation with NLLB 3.3B

The generated outputs are saved in `data/setup/wmt23/nllb_<size>/wmttest2023.<lang>`, with `<size>` being either `3b` or `600m` and `<lang>` being `ita` or `nld`.
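Until the official script is added, the sketch below shows the intended generation step with 🤗 Transformers. Decoding parameters and batching are illustrative and not necessarily those used in the study.

```python
# A minimal sketch of MT generation with NLLB 3.3B (not the official script;
# decoding parameters are illustrative).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-3.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

sources = ["Adverse events were reported in 12% of patients."]
batch = tokenizer(sources, return_tensors="pt", padding=True)
outputs = model.generate(
    **batch,
    # Use "nld_Latn" for English -> Dutch
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("ita_Latn"),
    max_new_tokens=256,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```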

### 3. Annotate Outputs with XCOMET

TODO: Add script for XCOMET annotations

The generated outputs are saved in `data/setup/wmt23/nllb_<size>/wmttest2023_xcomet-<xcomet_size>_<lang>.json`, with `<size>` being either `3b` or `600m`, `<lang>` being `ita` or `nld`, and `<xcomet_size>` being `xl` or `xxl`.
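Until the official script is added, the sketch below shows how XCOMET can be run with the `unbabel-comet` package to obtain segment-level scores and error spans; the exact settings used in the study may differ.

```python
# A minimal sketch of XCOMET annotation with unbabel-comet (not the official
# script; batch size and GPU settings are illustrative).
from comet import download_model, load_from_checkpoint

ckpt_path = download_model("Unbabel/XCOMET-XL")  # or "Unbabel/XCOMET-XXL"
model = load_from_checkpoint(ckpt_path)

data = [
    {
        "src": "Adverse events were reported in 12% of patients.",
        "mt": "Eventi avversi sono stati segnalati nel 12% dei pazienti.",
    }
]
output = model.predict(data, batch_size=8, gpus=1)
print(output.scores)                # segment-level quality scores
print(output.metadata.error_spans)  # predicted error spans with severities
```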

### 4. From WMT23 Outputs to Selected Segments

Run `qe4pe filter-wmt-data` to recover selected segments for `pre`, `main` and `post` editing phases from the full set of WMT23 segments and their translations available in `data/setup/wmt23`. Intermediate outputs are saved in `data/setup/processed`.

### 5. Generate Highlights for Selected Segments

TODO: Add scripts for generating highlights with XCOMET and the unsupervised methods.

Highlighted segments are saved in the `data/setup/highlights` folder.

### 6. Generate QA Dataframe from HTML MQM/ESA Annotations

Raw QA annotations are provided in `data/setup/qa/eng-ita` and `data/setup/qa/eng-nld`.

TODO: Add script for converting HTML annotations to a QA dataframe.

The final dataframe is saved in `data/setup/qa/qa_df.csv`.

### 7. Putting it All Together: Merging Outputs, Logs and QA into a Unified Dataset

Run `qe4pe process-task-data <TASK_PATH>` to preprocess the outputs and logs for a specific task in `data/task`, e.g. `qe4pe process-task-data data/task/main`. The processing is controlled by the task's `processing_config.json` file, which specifies paths and additional info (e.g. for the `main` task, QA annotations are merged with the other fields).

The processed data is saved in `data/processed/task` as `processed_<task_name>.csv`.
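The processed files are plain CSVs and can be inspected directly, e.g. with pandas. The filename below assumes the `main` task was processed; adjust the path to the task of interest.

```python
# A minimal sketch: inspect a processed task file with pandas.
# The filename assumes the "main" task was processed.
import pandas as pd

df = pd.read_csv("data/processed/task/processed_main.csv")
print(df.shape)
print(df.columns.tolist())
```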

## Reproducing Our Analysis

### 1. Visualizing the Selection Process

TODO: Add notebook with plots from the selection process.

### 2. Reproducing the Paper Analysis from Processed Data

Follow the [analysis notebook](notebooks/analysis.ipynb) to reproduce the main plots and results from the paper. While some plots were retouched in Inkscape for the final version (marked as `_edited` in `figures/`), we provide the code to generate them from the processed data.

Modeling results can be reproduced from the [modeling notebook](notebooks/modeling.Rmd).

TODO: Add additional analysis scripts for appendix plots.

## See an Issue?

If you encounter any issues while running the scripts or notebooks, please open an issue in this repository. We will be happy to help you out!