{"id":16828456,"url":"https://github.com/guolinke/tupe","last_synced_at":"2025-07-09T05:33:46.227Z","repository":{"id":46194776,"uuid":"274556244","full_name":"guolinke/TUPE","owner":"guolinke","description":"Transformer with Untied Positional Encoding (TUPE). Code of paper \"Rethinking Positional Encoding in Language Pre-training\". Improve existing models like BERT.","archived":false,"fork":false,"pushed_at":"2021-11-08T10:19:06.000Z","size":626,"stargazers_count":250,"open_issues_count":11,"forks_count":27,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-03-27T01:10:03.377Z","etag":null,"topics":["bert","language-model","pretraining","transformer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/guolinke.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-06-24T02:30:16.000Z","updated_at":"2024-10-30T07:42:23.000Z","dependencies_parsed_at":"2022-08-12T12:40:58.042Z","dependency_job_id":null,"html_url":"https://github.com/guolinke/TUPE","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/guolinke%2FTUPE","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/guolinke%2FTUPE/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/guolinke%2FTUPE/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/guolinke%2FTUPE/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/guolinke","download_url":"https
://codeload.github.com/guolinke/TUPE/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248691663,"owners_count":21146411,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","language-model","pretraining","transformer"],"created_at":"2024-10-13T11:26:44.997Z","updated_at":"2025-04-13T09:37:39.949Z","avatar_url":"https://github.com/guolinke.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# TUPE (Transformer with Untied Positional Encoding)\n\nImplementation for the paper [Rethinking Positional Encoding in Language Pre-training](https://arxiv.org/abs/2006.15595). \n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"tupe.png\" width=\"400\"\u003e\n\u003c/p\u003e\n\n## Brief Introduction\n\nThis repo demonstrates TUPE (Transformer with Untied Positional Encoding). The algorithm details can be found in our paper. TUPE can outperform other baselines on the GLUE benchmark by a large margin. In particular, it can achieve a higher score than the baselines while using only 30% of the pre-training computational cost. \n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"exp.png\" width=\"600\"\u003e\n\u003c/p\u003e\n\n\nDue to limited computational resources, we use the most widely used pre-trained model, BERT-Base, for verification. However, please note that our method can also be applied to larger (and better) Transformer-based models, like RoBERTa, ELECTRA and UniLM, to further improve them. 
Besides, since the modification is simple, you can easily apply TUPE to your own models.\n\nOur implementation is based on [fairseq](https://github.com/pytorch/fairseq), with several changes:\n1. Updated [`fairseq/modules/transformer_sentence_encoder.py`](fairseq/modules/transformer_sentence_encoder.py) and [`fairseq/modules/multihead_attention.py`](fairseq/modules/multihead_attention.py) for untied positional encoding.\n2. Minor changes to support `max-epoch` with `warmup-ratio` in fine-tuning, instead of setting different `total-num-update` and `warmup-updates` for each task.\n\n\n## Requirements and Installation\n\nFor more details, see [fairseq](https://github.com/pytorch/fairseq). Briefly,\n\n* [PyTorch](http://pytorch.org/)\n* Python version \u003e= 3.5\n* NVIDIA's [apex](https://github.com/NVIDIA/apex) library with the `--cuda_ext` installation option, for mixed precision training\n* You may need [NCCL](https://github.com/NVIDIA/nccl) for multi-node distributed training\n\n**Installing from source**\n\nTo install TUPE from source and develop locally:\n```bash\ngit clone https://github.com/guolinke/TUPE\ncd TUPE\npip install --editable .\n```\n\n## Getting Started\n\n### Data Pre-Processing\n\nThe pre-processing relies on [mosesdecoder](https://github.com/moses-smt/mosesdecoder). Run the following commands to pull it as a git submodule.\n\n```bash\ncd TUPE\ngit submodule update --init\n```\n\n#### Pretraining Data\nRefer to the steps in [`preprocess/pretrain/process.sh`](preprocess/pretrain/process.sh).\n\n#### Downstream Data\nRefer to the steps in [`preprocess/glue/process.sh`](preprocess/glue/process.sh).\n\n### Pre-Training\n\n```bash\nDATA_DIR=./path_to_your_data/\nSAVE_DIR=./your_own_save_path/\nTOTAL_UPDATES=1000000\nWARMUP_UPDATES=10000\nPEAK_LR=0.0001\nMAX_POSITIONS=512\nMAX_SENTENCES=16\nUPDATE_FREQ=1\nSEED=your_seed\npython train.py $DATA_DIR --fp16 --num-workers 16 --ddp-backend=c10d \\\n    --task masked_lm --criterion masked_lm --arch bert_base 
\\\n    --sample-break-mode complete --tokens-per-sample $MAX_POSITIONS \\\n    --optimizer adam --adam-betas '(0.9, 0.999)' --adam-eps 1e-6 --clip-norm 1.0 \\\n    --lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \\\n    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \\\n    --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ --seed $SEED \\\n    --mask-prob 0.15 \\\n    --embedding-normalize \\\n    --max-update $TOTAL_UPDATES --log-format simple --log-interval 100 \\\n    --keep-updates-list 100000 300000 600000 1000000 \\\n    --save-interval-updates 25000 --keep-interval-updates 3 --no-epoch-checkpoints --skip-invalid-size-inputs-valid-test \\\n    --save-dir $SAVE_DIR --rel-pos\n\n```\n\nThe settings above are for 16 V100 GPUs, which gives a batch size of 256 (`n_gpu * MAX_SENTENCES * UPDATE_FREQ`). You may need to change `MAX_SENTENCES` or `UPDATE_FREQ` according to your environment. To disable relative position, remove `--rel-pos`.\n\n\n### Fine-Tuning\n\n```bash\nDATA_DIR=./path_to_your_downstream_data\nSAVE_DIR=./path_to_your_save_dir\nBERT_MODEL_PATH=./path_to_your_checkpoint\nBATCH_SIZE=32\nN_EPOCH=10     # 5 for MNLI, QNLI, QQP\nSEED=your_seed\nWARMUP_RATIO=0.06\nN_CLASSES=2     # 3 for MNLI, 1 for STS-B\nLR=0.00005     # search from 2e-5, 3e-5, 4e-5, 5e-5\nMETRIC=accuracy     # mcc for CoLA, pearson for STS-B\n\npython train.py $DATA_DIR --fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \\\n    --restore-file $BERT_MODEL_PATH \\\n    --max-positions 512 \\\n    --max-sentences $BATCH_SIZE \\\n    --max-tokens 4400 \\\n    --task sentence_prediction \\\n    --reset-optimizer --reset-dataloader --reset-meters \\\n    --required-batch-size-multiple 1 \\\n    --init-token 0 --separator-token 2 \\\n    --arch bert_base \\\n    --criterion sentence_prediction \\\n    --num-classes $N_CLASSES \\\n    --dropout 0.1 --attention-dropout 0.1 \\\n    
--weight-decay 0.01 --optimizer adam --adam-betas '(0.9, 0.999)' --adam-eps 1e-06 \\\n    --clip-norm 1.0 --validate-interval-updates 2 \\\n    --lr-scheduler polynomial_decay --lr $LR --warmup-ratio $WARMUP_RATIO \\\n    --max-epoch $N_EPOCH --seed $SEED --save-dir $SAVE_DIR --no-progress-bar --log-interval 100 --no-epoch-checkpoints --no-last-checkpoints --no-best-checkpoints \\\n    --find-unused-parameters --skip-invalid-size-inputs-valid-test --truncate-sequence --embedding-normalize \\\n    --tensorboard-logdir . \\\n    --best-checkpoint-metric $METRIC --maximize-best-checkpoint-metric --rel-pos\n```\n\nTo speed up fine-tuning, we set `N_EPOCH=5` for MNLI, QNLI and QQP, and `N_EPOCH=10` for the other tasks. For MNLI, we set `N_CLASSES=3` and add `--valid-subset valid,valid1` to evaluate MNLI-m and MNLI-mm together. STS-B is a regression task, so we set `N_CLASSES=1`, add `--regression-target`, and set `METRIC=pearson`. For CoLA, we set `METRIC=mcc`.\n\n`LR` is searched over `{2e-5, 3e-5, 4e-5, 5e-5}`. Each `LR` is run with 5 different seeds, and we use the median as the result for that `LR`. 
The result of the best `LR` is used.\n\n**NOTE**: If your pre-trained model used `--rel-pos`, you should also set `--rel-pos` for fine-tuning; otherwise, remove it.\n\nWe also release the [checkpoint](https://guolinke.blob.core.windows.net/tupe/tupe_ckp.tar.gz) of TUPE-R (with `--rel-pos`), for reproducibility.\n\n## Reference\n\nYou can cite our paper as:\n```\n@inproceedings{\nke2021rethinking,\ntitle={Rethinking Positional Encoding in Language Pre-training},\nauthor={Guolin Ke and Di He and Tie-Yan Liu},\nbooktitle={International Conference on Learning Representations},\nyear={2021},\nurl={https://openreview.net/forum?id=09-528y2Fgf}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fguolinke%2Ftupe","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fguolinke%2Ftupe","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fguolinke%2Ftupe/lists"}