{"id":21769973,"url":"https://github.com/mideind/byte-gec","last_synced_at":"2026-03-16T22:03:19.030Z","repository":{"id":186414897,"uuid":"643821760","full_name":"mideind/byte-gec","owner":"mideind","description":null,"archived":false,"fork":false,"pushed_at":"2023-07-03T14:18:06.000Z","size":2909,"stargazers_count":8,"open_issues_count":0,"forks_count":0,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-05-27T18:54:07.521Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mideind.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-05-22T08:22:58.000Z","updated_at":"2025-04-08T10:19:59.000Z","dependencies_parsed_at":null,"dependency_job_id":"2b19d5f4-a5e5-49aa-9a05-b15111c0e1ce","html_url":"https://github.com/mideind/byte-gec","commit_stats":null,"previous_names":["mideind/byte-gec"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mideind/byte-gec","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mideind%2Fbyte-gec","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mideind%2Fbyte-gec/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mideind%2Fbyte-gec/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mideind%2Fbyte-gec/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mideind","download_url":"https://codeload.github.com/mideind/byte-gec/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/midei
nd%2Fbyte-gec/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262102772,"owners_count":23259328,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-26T14:10:51.338Z","updated_at":"2026-03-16T22:03:18.974Z","avatar_url":"https://github.com/mideind.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# GEC for Icelandic\nThis repository contains example scripts and evaluation data for the paper [Byte-Level Grammatical Error Correction\nUsing Synthetic and Curated Corpora](https://arxiv.org/pdf/2305.17906.pdf), accepted to the ACL'23 main conference.\n\n## Data\nWe provide data for evaluating Icelandic GEC models and references to the data used to train the models in the paper.\n\n### Test sets\nAll evaluation data for the models is included in the ``data/testsets`` directory. Files ending in ``.is_err`` contain the source (errored) sentences, and files ending in ``.is_corr`` contain the corrected references. We refer to the paper for a description of each test set.\n\n### Error corpora\nThe Icelandic Error Corpus and the accompanying specialized corpora can be downloaded from the CLARIN website at the following URLs:\n\nhttp://hdl.handle.net/20.500.12537/105\nhttp://hdl.handle.net/20.500.12537/106\nhttp://hdl.handle.net/20.500.12537/132\nhttp://hdl.handle.net/20.500.12537/133\n\nNote that sentences from these corpora appear in the following test sets provided with this submission: ``test.500.dyslex``, ``test.500.L2``, ``test.500.child``. 
\nIf you evaluate on these test sets, these sentences must be filtered out of the training data.\n\n### Icelandic Gigaword Corpus\nFor generating the synthetic error data, we used the Icelandic Gigaword Corpus. This corpus can be downloaded from CLARIN as well:\n\nhttp://hdl.handle.net/20.500.12537/254\n\nThe paper describes how the synthetic data was generated by noising this corpus.\n\n## Scripts\nThe ``example_scripts`` directory contains scripts for training the different GEC models.\n\n### Installation\nInstall the dependencies with `pip install -r requirements.txt`.\n\nFor evaluation using GLEU, you need to install the GLEU package:\n`git clone https://github.com/cnap/gec-ranking.git`\n\nThen run it with `./gec-ranking/scripts/compute_gleu -r $REF_FILE -s $SRC_FILE -o $GENERATED_FILE \u003e gleu_results`\n\n### Structure\nThe scripts are organized in the following way:\n\n- byt5 - scripts for synthetic-data training and finetuning of byte-level ByT5 models. Uses the `transformers` library.\n- mt5 - scripts for synthetic-data training and finetuning of mT5 models. Uses the `transformers` library.\n- mbart - scripts for synthetic-data training and finetuning of mBART-ENIS models. Uses the `fairseq` library.\n- noising - scripts for adding noise to the data. Has its own README.\n- infer.py - script for inference using the trained ByT5 models. Uses the `transformers` library.\n\nNote that most of the path arguments have been removed from the scripts, so you need to add them manually.\n\n## Models\nFor training the GEC models described in the paper, the following pre-trained models were used:\n- mT5 (base) - Available on Hugging Face (https://huggingface.co/google/mt5-base)\n- ByT5 (base) - Available on Hugging Face (https://huggingface.co/google/byt5-base)\n- mBART-ENIS - This model is not currently published, but its training is described in the paper (see Appendix A). 
It is trained on top of the pre-trained mBART (https://github.com/facebookresearch/fairseq/tree/main/examples/mbart).\n\nThe best-performing model (referred to as ``ByT5-Synth-550k+EC`` in the paper) is published on the CLARIN website:\n\nhttp://hdl.handle.net/20.500.12537/255\n\nThis model is a ByT5-base model further trained for 550,000 updates on the synthetic error corpus and finetuned on the Icelandic Error Corpus.\n\n## Abstract of paper\nGrammatical error correction (GEC) is the task of correcting typos, spelling, punctuation and grammatical issues in text. Approaching the problem as a sequence-to-sequence task, we compare the use of a common subword unit vocabulary and byte-level encoding. Initial synthetic training data is created using an error-generating pipeline, and used for finetuning two subword-level models and one byte-level model. Models are then finetuned further on hand-corrected error corpora, including texts written by children, university students, dyslexic and second-language writers, and evaluated over different error types and origins. We show that a byte-level model enables higher correction quality than a subword approach, not only for simple spelling errors, but also for more complex semantic, stylistic and grammatical issues. In particular, initial training on synthetic corpora followed by finetuning on a relatively small parallel corpus of real-world errors helps the byte-level model correct a wide range of commonly occurring errors. 
Our experiments are run for the Icelandic language but should hold for other similar languages, particularly morphologically rich ones.\n\n## Citing this paper\n(Will be updated with the ACL Anthology citation once published.)\n\n```\n@article{ingolfsdottir-byte:2023,\n    author    = \"Svanhvít Lilja Ingólfsdóttir and Pétur Orri Ragnarsson and Haukur Páll Jónsson and Haukur Barri Símonarson and Vilhjálmur Þorsteinsson and Vésteinn Snæbjarnarson\",\n    title     = \"{Byte-Level Grammatical Error Correction Using Synthetic and Curated Corpora}\",\n    journal   = {ArXiv},\n    year      = {2023},\n    volume    = {abs/2305.17906},\n    url       = {https://arxiv.org/abs/2305.17906}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmideind%2Fbyte-gec","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmideind%2Fbyte-gec","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmideind%2Fbyte-gec/lists"}