{"id":13564387,"url":"https://github.com/google-research/electra","last_synced_at":"2025-05-15T07:07:38.067Z","repository":{"id":37369797,"uuid":"246201728","full_name":"google-research/electra","owner":"google-research","description":"ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators","archived":false,"fork":false,"pushed_at":"2024-03-23T03:59:14.000Z","size":113,"stargazers_count":2351,"open_issues_count":62,"forks_count":351,"subscribers_count":58,"default_branch":"master","last_synced_at":"2025-04-14T12:59:21.837Z","etag":null,"topics":["deep-learning","nlp","tensorflow"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/google-research.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-03-10T03:42:50.000Z","updated_at":"2025-04-11T15:24:09.000Z","dependencies_parsed_at":"2024-08-01T13:30:28.266Z","dependency_job_id":null,"html_url":"https://github.com/google-research/electra","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Felectra","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Felectra/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Felectra/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Felectra/manifests","owner_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/owners/google-research","download_url":"https://codeload.github.com/google-research/electra/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254292043,"owners_count":22046426,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","nlp","tensorflow"],"created_at":"2024-08-01T13:01:30.496Z","updated_at":"2025-05-15T07:07:33.054Z","avatar_url":"https://github.com/google-research.png","language":"Python","funding_links":[],"categories":["Python","Libraries and Tools","TensorFlow Models","Language Model"],"sub_categories":["2023","Natural Language Processing"],"readme":"# ELECTRA\n\n## Introduction\n\n**ELECTRA** is a method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish \"real\" input tokens vs \"fake\" input tokens generated by another neural network, similar to the discriminator of a [GAN](https://arxiv.org/pdf/1406.2661.pdf). At small scale, ELECTRA achieves strong results even when trained on a single GPU. 
At large scale, ELECTRA achieves state-of-the-art results on the [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) dataset.\n\nFor a detailed description and experimental results, please refer to our ICLR 2020 paper [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://openreview.net/pdf?id=r1xMH1BtvB).\n\nThis repository contains code to pre-train ELECTRA, including small ELECTRA models on a single GPU. It also supports fine-tuning ELECTRA on downstream tasks including classification tasks (e.g., [GLUE](https://gluebenchmark.com/)), QA tasks (e.g., [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/)), and sequence tagging tasks (e.g., [text chunking](https://www.clips.uantwerpen.be/conll2000/chunking/)).\n\nThis repository also contains code for **Electric**, a version of ELECTRA inspired by [energy-based models](http://yann.lecun.com/exdb/publis/pdf/lecun-06.pdf). Electric provides a more principled view of ELECTRA as a \"negative sampling\" [cloze model](https://en.wikipedia.org/wiki/Cloze_test). It can also efficiently produce [pseudo-likelihood scores](https://arxiv.org/pdf/1910.14659.pdf) for text, which can be used to re-rank the outputs of speech recognition or machine translation systems. 
For details on Electric, please refer to our EMNLP 2020 paper [Pre-Training Transformers as Energy-Based Cloze Models](https://www.aclweb.org/anthology/2020.emnlp-main.20.pdf).\n\n\n\n## Released Models\n\nWe are initially releasing three pre-trained models:\n\n| Model | Layers | Hidden Size | Params | GLUE score (test set) | Download |\n| --- | --- | --- | --- | ---  | --- |\n| ELECTRA-Small | 12 | 256 | 14M | 77.4  | [link](https://storage.googleapis.com/electra-data/electra_small.zip) |\n| ELECTRA-Base | 12 | 768 | 110M | 82.7 | [link](https://storage.googleapis.com/electra-data/electra_base.zip) |\n| ELECTRA-Large | 24 | 1024 | 335M |  85.2 | [link](https://storage.googleapis.com/electra-data/electra_large.zip) |\n\nThe models were trained on uncased English text. They correspond to ELECTRA-Small++, ELECTRA-Base++, and ELECTRA-1.75M in our paper. We hope to release other models, such as multilingual models, in the future.\n\nOn [GLUE](https://gluebenchmark.com/), ELECTRA-Large scores slightly better than ALBERT/XLNET, ELECTRA-Base scores better than BERT-Large, and ELECTRA-Small scores slightly worse than [TinyBERT](https://arxiv.org/abs/1909.10351) (but uses no distillation). See the expected results section below for detailed performance numbers.\n\n\n\n## Requirements\n* Python 3\n* [TensorFlow](https://www.tensorflow.org/) 1.15 (although we hope to support TensorFlow 2.0 at a future date)\n* [NumPy](https://numpy.org/)\n* [scikit-learn](https://scikit-learn.org/stable/) and [SciPy](https://www.scipy.org/) (for computing some evaluation metrics).\n\n## Pre-training\nUse `build_pretraining_dataset.py` to create a pre-training dataset from a dump of raw text. It has the following arguments:\n\n* `--corpus-dir`: A directory containing raw text files to turn into ELECTRA examples. 
A text file can contain multiple documents with empty lines separating them.\n* `--vocab-file`: File defining the wordpiece vocabulary.\n* `--output-dir`: Where to write out ELECTRA examples.\n* `--max-seq-length`: The number of tokens per example (128 by default).\n* `--num-processes`: If \u003e1, parallelize across multiple processes (1 by default).\n* `--blanks-separate-docs`: Whether blank lines indicate document boundaries (True by default).\n* `--do-lower-case/--no-lower-case`: Whether to lower case the input text (True by default).\n\nUse `run_pretraining.py` to pre-train an ELECTRA model. It has the following arguments:\n\n* `--data-dir`: a directory where pre-training data, model weights, etc. are stored. By default, the training loads examples from `\u003cdata-dir\u003e/pretrain_tfrecords` and a vocabulary from `\u003cdata-dir\u003e/vocab.txt`.\n* `--model-name`: a name for the model being trained. Model weights will be saved in `\u003cdata-dir\u003e/models/\u003cmodel-name\u003e` by default.\n* `--hparams` (optional): a JSON dict or path to a JSON file containing model hyperparameters, data paths, etc. See `configure_pretraining.py` for the supported hyperparameters.\n\nIf training is halted, re-running `run_pretraining.py` with the same arguments will continue the training where it left off.\n\nYou can continue pre-training from the released ELECTRA checkpoints by:\n1. Setting the model-name to point to a downloaded model (e.g., `--model-name electra_small` if you downloaded weights to `$DATA_DIR/electra_small`).\n2. Setting `num_train_steps` by (for example) adding `\"num_train_steps\": 4010000` to the `--hparams`. This will continue training the small model for 10000 more steps (it has already been trained for 4e6 steps).\n3. Increasing the learning rate to account for the linear learning rate decay. For example, to start with a learning rate of 2e-4 you should set the `learning_rate` hparam to 2e-4 * (4e6 + 10000) / 10000.\n4. 
For ELECTRA-Small, you also need to specify `\"generator_hidden_size\": 1.0` in the `hparams` because we did not use a small generator for that model.\n\n##  Quickstart: Pre-train a small ELECTRA model.\nThese instructions pre-train a small ELECTRA model (12 layers, 256 hidden size). Unfortunately, the data we used in the paper is not publicly available, so we will use the [OpenWebTextCorpus](https://skylion007.github.io/OpenWebTextCorpus/) released by Aaron Gokaslan and Vanya Cohen instead. The fully-trained model (~4 days on a V100 GPU) should perform roughly in between [GPT](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf) and BERT-Base in terms of GLUE performance. By default the model is trained on length-128 sequences, so it is not suitable for running on question answering. See the \"expected results\" section below for more details on model performance.\n\n#### Setup\n1. Place a vocabulary file in `$DATA_DIR/vocab.txt`. Our ELECTRA models all used the exact same vocabulary as English uncased BERT, which you can download [here](https://storage.googleapis.com/electra-data/vocab.txt).\n2. Download the [OpenWebText](https://skylion007.github.io/OpenWebTextCorpus/) corpus (12G) and extract it (i.e., run `tar xf openwebtext.tar.xz`). Place it in `$DATA_DIR/openwebtext`.\n3. Run `python3 build_openwebtext_pretraining_dataset.py --data-dir $DATA_DIR --num-processes 5`. It pre-processes/tokenizes the data and outputs examples as [tfrecord](https://www.tensorflow.org/tutorials/load_data/tfrecord) files under `$DATA_DIR/pretrain_tfrecords`. The tfrecords require roughly 30G of disk space.\n\n#### Pre-training the model.\nRun `python3 run_pretraining.py --data-dir $DATA_DIR --model-name electra_small_owt`\nto train a small ELECTRA model for 1 million steps on the data. This takes slightly over 4 days on a Tesla V100 GPU. 
However, the model should achieve decent results after 200k steps (10 hours of training on the V100 GPU).\n\nTo customize the training, add `--hparams '{\"hparam1\": value1, \"hparam2\": value2, ...}'` to the run command. `--hparams` can also be a path to a `.json` file containing the hyperparameters. Some particularly useful options:\n\n* `\"debug\": true` trains a tiny ELECTRA model for a few steps.\n* `\"model_size\": one of \"small\", \"base\", or \"large\"`: determines the size of the model\n* `\"electra_objective\": false` trains a model with masked language modeling instead of replaced token detection (essentially BERT with dynamic masking and no next-sentence prediction).\n* `\"num_train_steps\": n` controls how long the model is pre-trained for.\n* `\"pretrain_tfrecords\": \u003cpaths\u003e` determines where the pre-training data is located. Note that you need to specify the files themselves, not just the directory (e.g., `\u003cdata-dir\u003e/pretrain_tf_records/pretrain_data.tfrecord*`)\n* `\"vocab_file\": \u003cpath\u003e` and `\"vocab_size\": n` can be used to set a custom wordpiece vocabulary.\n* `\"learning_rate\": lr, \"train_batch_size\": n`, etc. can be used to change training hyperparameters\n* `\"model_hparam_overrides\": {\"hidden_size\": n, \"num_hidden_layers\": m}`, etc. can be used to change the hyperparameters for the underlying transformer (the `\"model_size\"` flag sets the default values).\n\nSee `configure_pretraining.py` for the full set of supported hyperparameters.\n\n#### Evaluating the pre-trained model.\n\nTo evaluate the model on a downstream task, see the fine-tuning instructions below. To evaluate the generator/discriminator on the openwebtext data run `python3 run_pretraining.py --data-dir $DATA_DIR --model-name electra_small_owt --hparams '{\"do_train\": false, \"do_eval\": true}'`. 
This will print out eval metrics such as the accuracy of the generator and discriminator, and also write the metrics out to `data-dir/model-name/results`.\n\n## Fine-tuning\n\nUse `run_finetuning.py` to fine-tune and evaluate an ELECTRA model on a downstream NLP task. It expects three arguments:\n\n* `--data-dir`: a directory where data, model weights, etc. are stored. By default, the script loads finetuning data from `\u003cdata-dir\u003e/finetuning_data/\u003ctask-name\u003e` and a vocabulary from `\u003cdata-dir\u003e/vocab.txt`.\n* `--model-name`: the name of the pre-trained model: the pre-trained weights should exist in `data-dir/models/model-name`.\n* `--hparams`: a JSON dict containing model hyperparameters, data paths, etc. (e.g., `--hparams '{\"task_names\": [\"rte\"], \"model_size\": \"base\", \"learning_rate\": 1e-4, ...}'`). See `configure_pretraining.py` for the supported hyperparameters.  Instead of a dict, this can also be a path to a `.json` file containing the hyperparameters. You must specify the `\"task_names\"` and `\"model_size\"` (see examples below).\n\nEval metrics will be saved in `data-dir/model-name/results` and model weights will be saved in `data-dir/model-name/finetuning_models` by default. Evaluation is done on the dev set by default. To customize the training, add `--hparams '{\"hparam1\": value1, \"hparam2\": value2, ...}'` to the run command. Some particularly useful options:\n\n* `\"debug\": true` fine-tunes a tiny ELECTRA model for a few steps.\n* `\"task_names\": [\"task_name\"]`: specifies the tasks to train on. This is a list because the codebase nominally supports multi-task learning (although be warned this has not been thoroughly tested).\n* `\"model_size\": one of \"small\", \"base\", or \"large\"`: determines the size of the model; you must set this to the same size as the pre-trained model.\n* `\"do_train\" and \"do_eval\"`: train and/or evaluate a model (both are set to true by default). 
To use `\"do_eval\": true` with `\"do_train\": false`, you need to specify the `init_checkpoint`, e.g., `python3 run_finetuning.py --data-dir $DATA_DIR --model-name electra_base --hparams '{\"model_size\": \"base\", \"task_names\": [\"mnli\"], \"do_train\": false, \"do_eval\": true, \"init_checkpoint\": \"\u003cdata-dir\u003e/models/electra_base/finetuning_models/mnli_model_1\"}'`\n* `\"num_trials\": n`: If \u003e1, does multiple fine-tuning/evaluation runs with different random seeds.\n* `\"learning_rate\": lr, \"train_batch_size\": n`, etc. can be used to change training hyperparameters.\n* `\"model_hparam_overrides\": {\"hidden_size\": n, \"num_hidden_layers\": m}`, etc. can be used to change the hyperparameters for the underlying transformer (the `\"model_size\"` flag sets the default values).\n\n### Setup\nGet a pre-trained ELECTRA model either by training your own (see pre-training instructions above), or by downloading the released ELECTRA weights and unzipping them under `$DATA_DIR/models` (e.g., you should have a directory `$DATA_DIR/models/electra_large` if you are using the large model).\n\n\n### Finetune ELECTRA on a GLUE  task\n\nDownload the GLUE data by running [this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e). Set up the data by running `mv CoLA cola \u0026\u0026 mv MNLI mnli \u0026\u0026 mv MRPC mrpc \u0026\u0026 mv QNLI qnli \u0026\u0026 mv QQP qqp \u0026\u0026 mv RTE rte \u0026\u0026 mv SST-2 sst \u0026\u0026 mv STS-B sts \u0026\u0026 mv diagnostic/diagnostic.tsv mnli \u0026\u0026 mkdir -p $DATA_DIR/finetuning_data \u0026\u0026 mv * $DATA_DIR/finetuning_data`.\n\nThen run `run_finetuning.py`. 
For example, to fine-tune ELECTRA-Base on MNLI:\n```\npython3 run_finetuning.py --data-dir $DATA_DIR --model-name electra_base --hparams '{\"model_size\": \"base\", \"task_names\": [\"mnli\"]}'\n```\nOr fine-tune a small model pre-trained using the above instructions on CoLA:\n```\npython3 run_finetuning.py --data-dir $DATA_DIR --model-name electra_small_owt --hparams '{\"model_size\": \"small\", \"task_names\": [\"cola\"]}'\n```\n\n### Finetune ELECTRA on question answering\n\nThe code supports [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) 1.1 and 2.0, as well as datasets in [the 2019 MRQA shared task](https://github.com/mrqa/MRQA-Shared-Task-2019).\n\n* **Squad 1.1**: Download the [train](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json) and [dev](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json) datasets and move them under `$DATA_DIR/finetuning_data/squadv1/(train|dev).json`\n* **Squad 2.0**: Download the datasets from the [SQuAD Website](https://rajpurkar.github.io/SQuAD-explorer/) and move them under `$DATA_DIR/finetuning_data/squad/(train|dev).json`\n* **MRQA tasks**: Download the data from [here](https://github.com/mrqa/MRQA-Shared-Task-2019#datasets). Move the data to `$DATA_DIR/finetuning_data/(newsqa|naturalqs|triviaqa|searchqa)/(train|dev).jsonl`.\n\nThen run (for example)\n```\npython3 run_finetuning.py --data-dir $DATA_DIR --model-name electra_base --hparams '{\"model_size\": \"base\", \"task_names\": [\"squad\"]}'\n```\n\nThis repository uses the official evaluation code released by the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) authors and [the MRQA shared task](https://github.com/mrqa/MRQA-Shared-Task-2019) to compute metrics.\n\n### Finetune ELECTRA on sequence tagging\n\nDownload the CoNLL-2000 text chunking dataset from [here](https://www.clips.uantwerpen.be/conll2000/chunking/) and put it under `$DATA_DIR/finetuning_data/chunk/(train|dev).txt`. 
Then run\n```\npython3 run_finetuning.py --data-dir $DATA_DIR --model-name electra_base --hparams '{\"model_size\": \"base\", \"task_names\": [\"chunk\"]}'\n```\n\n### Adding a new task\nThe easiest way to run on a new task is to implement a new `finetune.task.Task`, add it to `finetune.task_builder.py`, and then use `run_finetuning.py` as normal. For classification/qa/sequence tagging, you can inherit from a `finetune.classification.classification_tasks.ClassificationTask`, `finetune.qa.qa_tasks.QATask`, or `finetune.tagging.tagging_tasks.TaggingTask`.\nFor preprocessing data, we use the same tokenizer as [BERT](https://github.com/google-research/bert).\n\n\n## Expected Results\nHere are expected results for ELECTRA on various tasks (test set for chunking, dev set for the other tasks). Note that variance in fine-tuning can be [quite large](https://arxiv.org/abs/2002.06305), so for some tasks you may see big fluctuations in scores when fine-tuning from the same checkpoint multiple times. The below scores show median performance over a large number of random seeds.  ELECTRA-Small/Base/Large are our released models. 
ELECTRA-Small-OWT is the OpenWebText-trained model from above (it performs a bit worse than ELECTRA-Small due to being trained for less time and on a smaller dataset).\n\n|  | CoLA | SST | MRPC | STS  | QQP  | MNLI | QNLI | RTE | SQuAD 1.1 | SQuAD 2.0 | Chunking |\n| --- | --- | --- | --- | ---  | ---  | --- | --- | --- | ---| ---| --- |\n| Metrics | MCC | Acc | Acc | Spearman  | Acc  | Acc | Acc | Acc | EM | EM | F1 |\n| ELECTRA-Large| 69.1 | 96.9 | 90.8 | 92.6 | 92.4 | 90.9 | 95.0 | 88.0 | 89.7 | 88.1 | 97.2 |\n| ELECTRA-Base | 67.7 | 95.1 | 89.5 | 91.2 | 91.5  | 88.8  | 93.2 | 82.7 | 86.8 | 80.5 | 97.1 |\n| ELECTRA-Small | 57.0 | 91.2 | 88.0 |  87.5 | 89.0  | 81.3 | 88.4 | 66.7 | 75.8 | 70.1 |  96.5 |\n| ELECTRA-Small-OWT | 56.8 | 88.3 | 87.4 |  86.8 | 88.3  | 78.9 | 87.9 | 68.5 | -- | -- |  -- |\n\nSee [here](https://github.com/google-research/electra/issues/3) for losses / training curves of the models during pre-training.\n\n## Electric\n\nTo train [Electric](https://www.aclweb.org/anthology/2020.emnlp-main.20.pdf), use the same pre-training script and command as ELECTRA. Pass `\"electra_objective\": false` and  `\"electric_objective\": true` to the hyperparameters. We plan to release pre-trained Electric models soon!\n\n## Citation\nIf you use this code for your publication, please cite the original paper:\n```\n@inproceedings{clark2020electra,\n  title = {{ELECTRA}: Pre-training Text Encoders as Discriminators Rather Than Generators},\n  author = {Kevin Clark and Minh-Thang Luong and Quoc V. Le and Christopher D. Manning},\n  booktitle = {ICLR},\n  year = {2020},\n  url = {https://openreview.net/pdf?id=r1xMH1BtvB}\n}\n```\n\nIf you use the code for Electric, please cite the Electric paper:\n```\n@inproceedings{clark2020electric,\n  title = {Pre-Training Transformers as Energy-Based Cloze Models},\n  author = {Kevin Clark and Minh-Thang Luong and Quoc V. Le and Christopher D. 
Manning},\n  booktitle = {EMNLP},\n  year = {2020},\n  url = {https://www.aclweb.org/anthology/2020.emnlp-main.20.pdf}\n}\n```\n\n## Contact Info\nFor help or issues using ELECTRA, please submit a GitHub issue.\n\nFor personal communication related to ELECTRA, please contact [Kevin Clark](https://cs.stanford.edu/~kevclark/) (`kevclark@cs.stanford.edu`).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle-research%2Felectra","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgoogle-research%2Felectra","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle-research%2Felectra/lists"}