{"id":13754405,"url":"https://github.com/grammarly/gector","last_synced_at":"2026-02-21T13:23:56.612Z","repository":{"id":40954987,"uuid":"265850286","full_name":"grammarly/gector","owner":"grammarly","description":"Official implementation of the papers \"GECToR – Grammatical Error Correction: Tag, Not Rewrite\" (BEA-20) and \"Text Simplification by Tagging\" (BEA-21)","archived":false,"fork":false,"pushed_at":"2024-05-21T06:44:30.000Z","size":685,"stargazers_count":920,"open_issues_count":27,"forks_count":220,"subscribers_count":21,"default_branch":"master","last_synced_at":"2025-05-18T13:02:00.368Z","etag":null,"topics":["bert","grammatical-error-correction","natural-language-processing","nlp","roberta","sequence-labeling","text-simplification","transformers","xlnet"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/grammarly.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-05-21T13:04:03.000Z","updated_at":"2025-05-17T16:20:46.000Z","dependencies_parsed_at":"2024-08-03T09:17:23.099Z","dependency_job_id":null,"html_url":"https://github.com/grammarly/gector","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/grammarly/gector","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/grammarly%2Fgector","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/grammarly%2Fgector/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/grammarly
%2Fgector/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/grammarly%2Fgector/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/grammarly","download_url":"https://codeload.github.com/grammarly/gector/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/grammarly%2Fgector/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29681563,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-21T12:30:22.644Z","status":"ssl_error","status_checked_at":"2026-02-21T12:29:55.402Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","grammatical-error-correction","natural-language-processing","nlp","roberta","sequence-labeling","text-simplification","transformers","xlnet"],"created_at":"2024-08-03T09:01:58.457Z","updated_at":"2026-02-21T13:23:56.562Z","avatar_url":"https://github.com/grammarly.png","language":"Python","funding_links":[],"categories":["Python","其他_NLP自然语言处理"],"sub_categories":["其他_文本生成、文本对话"],"readme":"# GECToR – Grammatical Error Correction: Tag, Not Rewrite\n\nThis repository provides code for training and testing state-of-the-art models for grammatical error correction with the official PyTorch implementation of the following paper:\n\u003e [GECToR – Grammatical Error Correction: Tag, Not 
Rewrite](https://aclanthology.org/2020.bea-1.16/) \u003cbr\u003e\n\u003e [Kostiantyn Omelianchuk](https://github.com/komelianchuk), [Vitaliy Atrasevych](https://github.com/atrasevych), [Artem Chernodub](https://github.com/achernodub), [Oleksandr Skurzhanskyi](https://github.com/skurzhanskyi) \u003cbr\u003e\n\u003e Grammarly \u003cbr\u003e\n\u003e [15th Workshop on Innovative Use of NLP for Building Educational Applications (co-located with ACL 2020)](https://sig-edu.org/bea/2020) \u003cbr\u003e\n\nIt is mainly based on `AllenNLP` and `transformers`.\n## Installation\nThe following command installs all necessary packages:\n```.bash\npip install -r requirements.txt\n```\nThe project was tested using Python 3.7.\n\n## Datasets\nAll the public GEC datasets used in the paper can be downloaded from [here](https://www.cl.cam.ac.uk/research/nl/bea2019st/#data).\u003cbr\u003e\nSynthetically created datasets can be generated/downloaded [here](https://github.com/awasthiabhijeet/PIE/tree/master/errorify).\u003cbr\u003e\nTo train the model, the data has to be preprocessed and converted to a special format with the command:\n```.bash\npython utils/preprocess_data.py -s SOURCE -t TARGET -o OUTPUT_FILE\n```\n## Pretrained models\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003cth\u003ePretrained encoder\u003c/th\u003e\n    \u003cth\u003eConfidence bias\u003c/th\u003e\n    \u003cth\u003eMin error prob\u003c/th\u003e\n    \u003cth\u003eCoNLL-2014 (test)\u003c/th\u003e\n    \u003cth\u003eBEA-2019 (test)\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eBERT \u003ca href=\"https://grammarly-nlp-data-public.s3.amazonaws.com/gector/bert_0_gectorv2.th\"\u003e[link]\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e0.1\u003c/td\u003e\n    \u003ctd\u003e0.41\u003c/td\u003e\n    \u003ctd\u003e61.0\u003c/td\u003e\n    \u003ctd\u003e68.0\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eRoBERTa \u003ca 
href=\"https://grammarly-nlp-data-public.s3.amazonaws.com/gector/roberta_1_gectorv2.th\"\u003e[link]\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e0.2\u003c/td\u003e\n    \u003ctd\u003e0.5\u003c/td\u003e\n    \u003ctd\u003e64.0\u003c/td\u003e\n    \u003ctd\u003e71.8\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eXLNet \u003ca href=\"https://grammarly-nlp-data-public.s3.amazonaws.com/gector/xlnet_0_gectorv2.th\"\u003e[link]\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e0.2\u003c/td\u003e\n    \u003ctd\u003e0.5\u003c/td\u003e\n    \u003ctd\u003e63.2\u003c/td\u003e\n    \u003ctd\u003e71.2\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n**Note**: The scores in the table differ from those in the paper because a later version of transformers is used. To reproduce the results reported in the paper, use [this version](https://github.com/grammarly/gector/tree/fea1532608) of the repository. \n\n## Train model\nTo train the model, simply run:\n```.bash\npython train.py --train_set TRAIN_SET --dev_set DEV_SET \\\n                --model_dir MODEL_DIR\n```\nThere are many parameters to specify, among them:\n- `cold_steps_count` the number of epochs in which we train only the last linear layer\n- `transformer_model {bert,distilbert,gpt2,roberta,transformerxl,xlnet,albert}` model encoder\n- `tn_prob` probability of getting sentences with no errors; helps to balance precision/recall\n- `pieces_per_token` maximum number of subwords per token; helps to avoid CUDA out-of-memory errors\n\nIn our experiments, we used a 98/2 train/dev split.\n\n## Training parameters\nWe describe all the parameters that we use for training and evaluation [here](https://github.com/grammarly/gector/blob/master/docs/training_parameters.md). \n\u003cbr\u003e\n\n## Model inference\nTo run your model on the input file, use the following command:\n```.bash\npython predict.py --model_path MODEL_PATH [MODEL_PATH ...] 
\\\n                  --vocab_path VOCAB_PATH --input_file INPUT_FILE \\\n                  --output_file OUTPUT_FILE\n```\nAmong the parameters:\n- `min_error_probability` - minimum error probability (as in the paper)\n- `additional_confidence` - confidence bias (as in the paper)\n- `special_tokens_fix` - used to reproduce some reported results of pretrained models\n\nFor evaluation, use [M^2Scorer](https://github.com/nusnlp/m2scorer) and [ERRANT](https://github.com/chrisjbryant/errant).\n\n## Text Simplification\nThis repository also provides the implementation of the following paper:\n\u003e [Text Simplification by Tagging](https://aclanthology.org/2021.bea-1.2/) \u003cbr\u003e\n\u003e [Kostiantyn Omelianchuk](https://github.com/komelianchuk), [Vipul Raheja](https://github.com/vipulraheja), [Oleksandr Skurzhanskyi](https://github.com/skurzhanskyi) \u003cbr\u003e\n\u003e Grammarly \u003cbr\u003e\n\u003e [16th Workshop on Innovative Use of NLP for Building Educational Applications (co-located with EACL 2021)](https://sig-edu.org/bea/current) \u003cbr\u003e\n\nFor data preprocessing, training, and testing, the same interface as for GEC can be used. For both the training and evaluation stages, `utils/filter_brackets.py` is used to remove noise. 
During inference, we use the `--normalize` flag.\n\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003cth\u003e\u003c/th\u003e\n    \u003cth colspan=\"2\"\u003eSARI\u003c/th\u003e\n    \u003cth rowspan=\"2\"\u003eFKGL\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003cth\u003eModel\u003c/th\u003e\n    \u003cth\u003eTurkCorpus\u003c/th\u003e\n    \u003cth\u003eASSET\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eTST-FINAL \u003ca href=\"https://grammarly-nlp-data-public.s3.amazonaws.com/gector/roberta_1_tst.th\"\u003e[link]\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003e39.9\u003c/td\u003e\n    \u003ctd\u003e40.3\u003c/td\u003e\n    \u003ctd\u003e7.65\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eTST-FINAL + tweaks\u003c/td\u003e\n    \u003ctd\u003e41.0\u003c/td\u003e\n    \u003ctd\u003e42.7\u003c/td\u003e\n    \u003ctd\u003e7.61\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\nInference tweak parameters: \u003cbr\u003e\n```\niteration_count = 2\nadditional_keep_confidence = -0.68\nadditional_del_confidence = -0.84\nmin_error_probability = 0.04\n```\nFor evaluation, use the [EASSE](https://github.com/feralvam/easse) package.\n\n**Note**: The scores in the table are very close to those in the paper but do not fully match them, for two reasons:\n- in the paper, we reported average scores of 4 models trained with different seeds;\n- we merged the codebases for the GEC and Text Simplification tasks and updated them to a newer version of the transformers library.\n\n## Notable works based on GECToR\n\n- Vanilla PyTorch implementation of GECToR with AMP and distributed support by DeepSpeed [[code](https://github.com/cofe-ai/fast-gector)]\n- Improving Sequence Tagging approach for Grammatical Error Correction task 
[[paper](https://s3.eu-central-1.amazonaws.com/ucu.edu.ua/wp-content/uploads/sites/8/2021/04/Improving-Sequence-Tagging-Approach-for-Grammatical-Error-Correction-Task-.pdf)][[code](https://github.com/MaksTarnavskyi/gector-large)]\n- LM-Critic: Language Models for Unsupervised Grammatical Error Correction [[paper](https://arxiv.org/pdf/2109.06822.pdf)][[code](https://github.com/michiyasunaga/LM-Critic)]\n\n## Citation\nIf you find this work useful for your research, please cite our papers:\n\n#### GECToR – Grammatical Error Correction: Tag, Not Rewrite\n```\n@inproceedings{omelianchuk-etal-2020-gector,\n    title = \"{GECT}o{R} {--} Grammatical Error Correction: Tag, Not Rewrite\",\n    author = \"Omelianchuk, Kostiantyn  and\n      Atrasevych, Vitaliy  and\n      Chernodub, Artem  and\n      Skurzhanskyi, Oleksandr\",\n    booktitle = \"Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications\",\n    month = jul,\n    year = \"2020\",\n    address = \"Seattle, WA, USA → Online\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://www.aclweb.org/anthology/2020.bea-1.16\",\n    pages = \"163--170\",\n    abstract = \"In this paper, we present a simple and efficient GEC sequence tagger using a Transformer encoder. Our system is pre-trained on synthetic data and then fine-tuned in two stages: first on errorful corpora, and second on a combination of errorful and error-free parallel corpora. We design custom token-level transformations to map input tokens to target corrections. Our best single-model/ensemble GEC tagger achieves an F{\\_}0.5 of 65.3/66.5 on CONLL-2014 (test) and F{\\_}0.5 of 72.4/73.6 on BEA-2019 (test). 
Its inference speed is up to 10 times as fast as a Transformer-based seq2seq GEC system.\",\n}\n```\n\n#### Text Simplification by Tagging\n```\n@inproceedings{omelianchuk-etal-2021-text,\n    title = \"{T}ext {S}implification by {T}agging\",\n    author = \"Omelianchuk, Kostiantyn  and\n      Raheja, Vipul  and\n      Skurzhanskyi, Oleksandr\",\n    booktitle = \"Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications\",\n    month = apr,\n    year = \"2021\",\n    address = \"Online\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://aclanthology.org/2021.bea-1.2\",\n    pages = \"11--25\",\n    abstract = \"Edit-based approaches have recently shown promising results on multiple monolingual sequence transduction tasks. In contrast to conventional sequence-to-sequence (Seq2Seq) models, which learn to generate text from scratch as they are trained on parallel corpora, these methods have proven to be much more effective since they are able to learn to make fast and accurate transformations while leveraging powerful pre-trained language models. Inspired by these ideas, we present TST, a simple and efficient Text Simplification system based on sequence Tagging, leveraging pre-trained Transformer-based encoders. Our system makes simplistic data augmentations and tweaks in training and inference on a pre-existing system, which makes it less reliant on large amounts of parallel training data, provides more control over the outputs and enables faster inference speeds. Our best model achieves near state-of-the-art performance on benchmark test datasets for the task. 
Since it is fully non-autoregressive, it achieves faster inference speeds by over 11 times than the current state-of-the-art text simplification system.\",\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgrammarly%2Fgector","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgrammarly%2Fgector","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgrammarly%2Fgector/lists"}