{"id":13753007,"url":"https://github.com/alexa/bort","last_synced_at":"2025-05-09T20:34:41.732Z","repository":{"id":40960744,"uuid":"299374986","full_name":"alexa/bort","owner":"alexa","description":"Repository for the paper \"Optimal Subarchitecture Extraction for BERT\"","archived":true,"fork":false,"pushed_at":"2022-06-22T02:57:06.000Z","size":104,"stargazers_count":470,"open_issues_count":3,"forks_count":40,"subscribers_count":15,"default_branch":"master","last_synced_at":"2024-11-16T05:32:30.081Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alexa.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-09-28T16:56:22.000Z","updated_at":"2024-09-18T10:33:32.000Z","dependencies_parsed_at":"2022-07-26T12:17:18.522Z","dependency_job_id":null,"html_url":"https://github.com/alexa/bort","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alexa%2Fbort","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alexa%2Fbort/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alexa%2Fbort/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alexa%2Fbort/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alexa","download_url":"https://codeload.github.com/alexa/bort/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253321810,"owner
s_count":21890470,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T09:01:14.426Z","updated_at":"2025-05-09T20:34:39.429Z","avatar_url":"https://github.com/alexa.png","language":"Python","funding_links":[],"categories":["BERT优化","Python"],"sub_categories":["大语言对话模型及数据"],"readme":"# Bort\n##### Companion code for the paper \"Optimal Subarchitecture Extraction for BERT.\"\n\nBort is an optimal subset of architectural parameters for the BERT architecture, extracted by applying a fully polynomial-time approximation scheme (FPTAS) for neural architecture search. Bort has an effective (that is, not counting the embedding layer) size of 5.5\\% the original BERT-large architecture, and 16\\% of the net size. 
It can also be pretrained in 288 GPU hours, which is 1.2\\% of the time required to pretrain the highest-performing BERT parametric architectural variant, RoBERTa-large.\nIt is 7.9x faster than BERT-base (20x faster than BERT/RoBERTa-large) on a CPU, and it performs better than other compressed variants of the architecture and some of the non-compressed variants: it obtains a relative average performance improvement of between 0.3\\% and 31\\% with respect to BERT-large on multiple public natural language understanding (NLU) benchmarks.\n\nHere are the corresponding GLUE scores on the test set:\n\n|Model|Score|CoLA|SST-2|MRPC|STS-B|QQP|MNLI-m|MNLI-mm|QNLI(v2)|RTE|WNLI|AX|\n|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|\n|Bort      |83.6|63.9|96.2|94.1/92.3|89.2/88.3|66.0/85.9|88.1|87.8|92.3|82.7|71.2|51.9|\n|BERT-Large|80.5|60.5|94.9|89.3/85.4|87.6/86.5|72.1/89.3|86.7|85.9|92.7|70.1|65.1|39.6|\n\n\nAnd SuperGLUE scores on the test set:\n\n|Model|Score|BoolQ|CB|COPA|MultiRC|ReCoRD|RTE|WiC|WSC|AX-b|AX-g|\n|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|\n|Bort       |74.1|83.7|81.9/86.5|89.6|83.7/54.1|49.8/49.0|81.2|70.1|65.8|48.0|96.1/61.5|\n|BERT-Large|69.0|77.4|75.7/83.6|70.6|70.0/24.1|72.0/71.3|71.7|69.6|64.4|23.0|97.8/51.7|\n\n\nAnd here are the architectural parameters:\n\n|Model|Parameters (M) |Layers |Attention heads|Hidden size| Intermediate size| Embedding size (M) | Encoder proportion (%)|\n|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|\n|Bort       |56 |  4| 8  | 1024  | 768 | 39  | 30.3 |\n|BERT-Large|340| 24| 16 | 1024 | 4096 | 31.8| 90.6 |\n\n\n## Setup:\n1. You need to install the requirements from the `requirements.txt` file:\n```\npip install -r requirements.txt\n```\nThis code has been tested with Python 3.6.5+.\nTo save yourself some headaches, we recommend you install Horovod from source, _after_ you install MXNet. This is only needed if you are pre-training the architecture. 
For this, run the following commands (you'll need a C++ compiler that supports the C++11 standard, such as gcc \u003e 4.8):\n```bash\n    pip uninstall horovod\n    HOROVOD_CUDA_HOME=/usr/local/cuda-10.1 \\\n    HOROVOD_WITH_MXNET=1 \\\n    HOROVOD_GPU_ALLREDUCE=NCCL \\\n    pip install horovod==0.16.2 --no-cache-dir\n```\n\n2. You also need to download the model from [here](https://alexa-saif-bort.s3.amazonaws.com/bort.params). If you have the AWS CLI, all you need to do is run:\n```\naws s3 cp s3://alexa-saif-bort/bort.params model/\n```\n\n3. To run the tests, you also need to download the sample text from [Gluon](https://github.com/dmlc/gluon-nlp/blob/v0.9.x/scripts/bert/sample_text.txt) and put it in `test_data/`:\n```\nwget https://raw.githubusercontent.com/dmlc/gluon-nlp/v0.9.x/scripts/bert/sample_text.txt\nmv sample_text.txt test_data/\n```\n\n\n\n## Pre-training:\n\nBort is already pre-trained, but if you want to try out other datasets, you can follow the steps here. Note that this does not run the FPTAS described in the paper, and works only for a fixed architecture (Bort).\n\n1. First, you will need to tokenize the pre-training text:\n```bash\npython create_pretraining_data.py \\\n            --input_file \u003cinput text\u003e \\\n            --output_dir \u003coutput directory\u003e \\\n            --dataset_name \u003cdataset name\u003e \\\n            --dupe_factor \u003cduplication factor\u003e \\\n            --num_outputs \u003cnumber of output files\u003e\n```\nWe recommend using `--dataset_name openwebtext_ccnews_stories_books_cased` for the vocabulary.\nIf your data file is too large, the script will throw out-of-memory errors. We recommend splitting it into smaller chunks and then running the script on each chunk in turn.\n\n2. 
Then run the pre-training distillation script:\n```bash\n./run_pretraining_distillation.sh \u003cnum gpus\u003e \u003ctraining data\u003e \u003ctesting data\u003e [optional teacher checkpoint]\n```\nPlease see the contents of `run_pretraining_distillation.sh` for usage examples and additional optional configuration. If you have installed Horovod, we highly recommend you use `run_pretraining_distillation_hvd.py` instead.\n\n## Fine-tuning:\n\n1. To fine-tune Bort, run:\n```bash\n./run_finetune.sh \u003cyour task here\u003e\n```\nWe recommend you experiment with the hyperparameters in `run_finetune.sh`.\nThis code supports all the tasks outlined in the paper, but for the RACE dataset, you need to [download](http://www.cs.cmu.edu/~glai1/data/race/) the data and extract it. The default location for extraction is `~/.mxnet/datasets/race`. The same goes for SuperGLUE's MultiRC, since the Gluon implementation is the old version. You can [download](https://github.com/nyu-mll/jiant/blob/master/scripts/download_superglue_data.py) the data and extract it to `~/.mxnet/datasets/superglue_multirc/`.\n\nIt is normal to get very odd results for the fine-tuning step, since this repository only contains the training part of Agora.\nHowever, you can easily implement your own version of that algorithm.\nWe recommend you use the following initial set of hyperparameters, and follow the requirements described in the papers at the end of this file:\n```\nseeds={0,1,2,3,4}\nlearning_rates={1e-4, 1e-5, 9e-6}\nweight_decays={0, 10, 100, 350}\nwarmup_rates={0.35, 0.40, 0.45, 0.50}\nbatch_sizes={8, 16}\n```\n\n\n\n## Troubleshooting:\n##### Dependency errors\nBort requires a rather unusual environment to run. For this reason, most runtime problems can be fixed by installing the requirements from the `requirements.txt` file. 
Also make sure you have reinstalled Horovod as outlined above.\n##### Script failing when downloading the data\nThis is inherent to the way Bort is fine-tuned, since it expects the data to be preexisting for some arbitrary implementation of Agora. You can get around that error by downloading the data before running the script, e.g.:\n```\nfrom data.classification import BoolQTask\ntask = BoolQTask()\ntask.dataset_train()[1]; task.dataset_val()[1]; task.dataset_test()[1]\n```\n##### Out-of-memory errors\nWhile Bort is designed to be efficient in terms of the space it occupies in memory, a very large batch size or sequence length will still cause you to run out of memory. More often than not, reducing the sequence length from `512` to `256` will solve out-of-memory issues. 80% of the time, it works every time.\n##### Slow fine-tuning/pre-training\nWe strongly recommend using distributed training for both fine-tuning and pre-training. If your Horovod acts weird, remember that it needs to be built _after_ the installation of MXNet (or any framework for that matter).\n##### Low task-specific performance\nIf you observe near-random task-specific performance, that is to be expected. Bort is a rather small architecture and the optimizer/scheduler/learning rate combination is quite aggressive. We _highly_ recommend you fine-tune Bort using an implementation of Agora. More details on how to do that are in the references below, specifically the second paper. Note that we needed to implement \"replay\" (i.e., re-doing some iterations of Agora) to get it to converge better.\n\n\n## References\nIf you use Bort or the other algorithms in your work, we'd love to hear about it! 
Also, please cite the so-called \"Bort trilogy\" papers:\n```\n@article{deWynterApproximation,\n    title={An Approximation Algorithm for Optimal Subarchitecture Extraction},\n    author={Adrian de Wynter},\n    year={2020},\n    eprint={2010.08512},\n    archivePrefix={arXiv},\n    primaryClass={cs.LG},\n    journal={CoRR},\n    volume={abs/2010.08512},\n    url={http://arxiv.org/abs/2010.08512}\n}\n```\n```\n@article{deWynterAlgorithm,\n      title={An Algorithm for Learning Smaller Representations of Models With Scarce Data},\n      author={Adrian de Wynter},\n      year={2020},\n      eprint={2010.07990},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG},\n      journal={CoRR},\n      volume={abs/2010.07990},\n      url={http://arxiv.org/abs/2010.07990}\n}\n```\n```\n@article{deWynterPerryOptimal,\n      title={Optimal Subarchitecture Extraction for BERT},\n      author={Adrian de Wynter and Daniel J. Perry},\n      year={2020},\n      eprint={2010.10499},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG},\n      journal={CoRR},\n      volume={abs/2010.10499},\n      url={http://arxiv.org/abs/2010.10499}\n}\n```\nLastly, if you use the GLUE/SuperGLUE/RACE tasks, don't forget to give proper attribution to the original authors.\n\n## Security\n\nSee [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.\n\n## License\n\nThis project is licensed under the Apache-2.0 License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falexa%2Fbort","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falexa%2Fbort","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falexa%2Fbort/lists"}