https://github.com/mideind/byte-gec


Host: GitHub
URL: https://github.com/mideind/byte-gec
Owner: mideind
Created: 2023-05-22T08:22:58.000Z (about 3 years ago)
Default Branch: master
Last Pushed: 2023-07-03T14:18:06.000Z (almost 3 years ago)
Last Synced: 2025-05-27T18:54:07.521Z (about 1 year ago)
Language: Python
Size: 2.77 MB
Stars: 8
Watchers: 7
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md


# GEC for Icelandic
This repository contains example scripts and evaluation data for the paper [Byte-Level Grammatical Error Correction
Using Synthetic and Curated Corpora](https://arxiv.org/pdf/2305.17906.pdf), accepted to the ACL'23 main conference.

## Data
We provide data for evaluating Icelandic GEC models, along with references to the data used for training the models in the paper.

### Test sets
All evaluation data for the models is included in the ``data/testsets`` directory. Files ending in ``.is_err`` contain the source (errored) text, and files ending in ``.is_corr`` contain the corrected references. We refer to the paper for a description of each test set.
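
As a minimal sketch of how a test set could be read, the snippet below pairs each source line with its reference. It assumes one sentence per line and line-by-line alignment between the two files, which this README does not state explicitly; the path in the usage comment is hypothetical as well.

```python
from pathlib import Path
from typing import List, Tuple


def load_parallel(src_path: Path, ref_path: Path) -> List[Tuple[str, str]]:
    """Read a source file and a reference file as aligned sentence pairs.

    Assumption: both files hold one sentence per line, in the same order.
    """
    src = src_path.read_text(encoding="utf-8").splitlines()
    ref = ref_path.read_text(encoding="utf-8").splitlines()
    if len(src) != len(ref):
        raise ValueError(f"line count mismatch: {len(src)} vs {len(ref)}")
    return list(zip(src, ref))


def changed_fraction(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of pairs whose reference differs from the source."""
    if not pairs:
        return 0.0
    return sum(1 for src, ref in pairs if src != ref) / len(pairs)


# Hypothetical usage (file names are an assumption):
# base = Path("data/testsets/test.500.dyslex")
# pairs = load_parallel(base.with_name(base.name + ".is_err"),
#                       base.with_name(base.name + ".is_corr"))
# print(len(pairs), changed_fraction(pairs))
```

The length check is there so that a misaligned pair of files fails loudly instead of silently producing wrong source/reference pairs.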

### Error corpora
The Icelandic Error Corpus and the accompanying specialized corpora can be downloaded from the CLARIN website at the following URLs:

http://hdl.handle.net/20.500.12537/105
http://hdl.handle.net/20.500.12537/106
http://hdl.handle.net/20.500.12537/132
http://hdl.handle.net/20.500.12537/133

Note that sentences from these corpora appear in the following test sets provided with this submission: ``test.500.dyslex``, ``test.500.L2``, ``test.500.child``.
If the test sets are used for evaluation, these sentences need to be filtered out from the training data.
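
One way to do this filtering is sketched below. It is not the procedure used in the paper: it assumes training data is available as (source, target) sentence pairs and that exact matching after whitespace normalization is strict enough. All function names are illustrative.

```python
from typing import Iterable, Iterator, Set, Tuple


def normalize(sentence: str) -> str:
    """Collapse runs of whitespace so trivially different copies still match."""
    return " ".join(sentence.split())


def build_blocklist(test_sentences: Iterable[str]) -> Set[str]:
    """Collect normalized test sentences (sources and references alike)."""
    return {normalize(s) for s in test_sentences}


def filter_training_pairs(
    pairs: Iterable[Tuple[str, str]], blocklist: Set[str]
) -> Iterator[Tuple[str, str]]:
    """Yield only training pairs where neither side occurs in the test data."""
    for src, tgt in pairs:
        if normalize(src) in blocklist or normalize(tgt) in blocklist:
            continue
        yield src, tgt
```

Checking both sides of each pair is a deliberately conservative choice: a training pair is dropped even if only its corrected side overlaps with a test reference.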

### Icelandic Gigaword Corpus
For generating the synthetic error data, we used the Icelandic Gigaword Corpus. This corpus can be downloaded from CLARIN as well:

http://hdl.handle.net/20.500.12537/254

The paper describes how the synthetic data was generated by noising this corpus.

## Scripts
In the ``example_scripts`` directory you can find scripts for training the different GEC models.

### Installation
`pip install -r requirements.txt`

For evaluation using GLEU, you need to clone the GLEU scripts:
`git clone https://github.com/cnap/gec-ranking.git`

and run them with `./gec-ranking/scripts/compute_gleu -r $REF_FILE -s $SRC_FILE -o $GENERATED_FILE > gleu_results`

### Structure
The scripts are organized in the following way:

- byt5 - scripts for training on synthetic data and finetuning the byte-level ByT5 models. Uses the `transformers` library.
- mt5 - scripts for training on synthetic data and finetuning the mT5 models. Uses the `transformers` library.
- mbart - scripts for training on synthetic data and finetuning the mBART-ENIS models. Uses the `fairseq` library.
- noising - scripts for adding noise to the data. Has its own README.
- infer.py - script for inference using the trained ByT5 models. Uses the `transformers` library.

Note that most path arguments have been removed from the scripts, so you need to add them manually.
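
For orientation, here is a minimal sketch of what inference with a ByT5 checkpoint through `transformers` can look like. It is not the repository's `infer.py`: the `model_dir` argument, the batch size and the length limit are all assumptions, and it presumes a checkpoint in Hugging Face format.

```python
from typing import Iterator, List


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Split a list of sentences into consecutive batches."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def correct_sentences(
    sentences: List[str],
    model_dir: str,  # hypothetical path to a ByT5 checkpoint in Hugging Face format
    batch_size: int = 8,
    max_length: int = 512,  # ByT5 counts bytes, so this is a byte budget
) -> List[str]:
    # Imported lazily so the batching helper works without these packages.
    import torch
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = T5ForConditionalGeneration.from_pretrained(model_dir)
    model.eval()

    corrected: List[str] = []
    with torch.no_grad():
        for batch in batched(sentences, batch_size):
            inputs = tokenizer(
                batch,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=max_length,
            )
            outputs = model.generate(**inputs, max_length=max_length)
            corrected.extend(
                tokenizer.batch_decode(outputs, skip_special_tokens=True)
            )
    return corrected
```

Because the model works on bytes, long sentences consume the length budget much faster than with a subword model, so inputs longer than `max_length` bytes are truncated here.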

## Models
For training the GEC models described in the paper, the following pre-trained models were used:
- mT5 (base) - Available on Hugging Face (https://huggingface.co/google/mt5-base)
- ByT5 (base) - Available on Hugging Face (https://huggingface.co/google/byt5-base)
- mBART-ENIS - This model is not currently published, but its training is described in the paper (see Appendix A). It builds on the pre-trained mBART model (https://github.com/facebookresearch/fairseq/tree/main/examples/mbart).

The best-performing model (referred to as ``ByT5-Synth-550k+EC`` in the paper) is published on the CLARIN website:

http://hdl.handle.net/20.500.12537/255

This model is a ByT5-base model further trained for 550,000 updates on the synthetic error corpus and finetuned on the Icelandic Error Corpus.
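
To illustrate what "byte-level" means for the input, the sketch below mimics how ByT5 maps text to IDs: each UTF-8 byte becomes one token, shifted by the three special tokens (pad, eos, unk) that occupy IDs 0 to 2. It is a standalone illustration and does not call the `transformers` tokenizer, which additionally appends an end-of-sequence ID.

```python
from typing import List

BYT5_SPECIAL_TOKENS = 3  # pad=0, eos=1, unk=2 come before the 256 byte values


def to_byte_ids(text: str) -> List[int]:
    """Map text to ByT5-style IDs: one ID per UTF-8 byte, offset by 3."""
    return [b + BYT5_SPECIAL_TOKENS for b in text.encode("utf-8")]


def from_byte_ids(ids: List[int]) -> str:
    """Invert to_byte_ids (assumes no special-token IDs are present)."""
    return bytes(i - BYT5_SPECIAL_TOKENS for i in ids).decode("utf-8")


word = "Það"  # 3 characters, but Þ and ð take two bytes each in UTF-8
ids = to_byte_ids(word)
print(len(word), len(ids))  # prints: 3 5
print(from_byte_ids(ids) == word)  # prints: True
```

Since every byte is in the vocabulary, no misspelling can fall outside it, though sequences become longer than with subword tokenization.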

## Abstract of paper
Grammatical error correction (GEC) is the task of correcting typos, spelling, punctuation and grammatical issues in text. Approaching the problem as a sequence-to-sequence task, we compare the use of a common subword unit vocabulary and byte-level encoding. Initial synthetic training data is created using an error-generating pipeline, and used for finetuning two subword-level models and one byte-level model. Models are then finetuned further on hand-corrected error corpora, including texts written by children, university students, dyslexic and second-language writers, and evaluated over different error types and origins. We show that a byte-level model enables higher correction quality than a subword approach, not only for simple spelling errors, but also for more complex semantic, stylistic and grammatical issues. In particular, initial training on synthetic corpora followed by finetuning on a relatively small parallel corpus of real-world errors helps the byte-level model correct a wide range of commonly occurring errors. Our experiments are run for the Icelandic language but should hold for other similar languages, particularly morphologically rich ones.

## Citing this paper
(Will be updated with the ACL Anthology citation once published.)

```
@article{ingolfsdottir-byte:2023,
author = "Svanhvít Lilja Ingólfsdóttir and Pétur Orri Ragnarsson and Haukur Páll Jónsson and Haukur Barri Símonarson and Vilhjálmur Þorsteinsson and Vésteinn Snæbjarnarson",
title = "{Byte-Level Grammatical Error Correction Using Synthetic and Curated Corpora}",
journal = {ArXiv},
year = {2023},
volume = {abs/2305.17906},
url = {https://arxiv.org/abs/2305.17906}}
```
