{"id":13752992,"url":"https://github.com/epfml/collaborative-attention","last_synced_at":"2025-10-25T17:07:39.939Z","repository":{"id":82332285,"uuid":"268808826","full_name":"epfml/collaborative-attention","owner":"epfml","description":"Code for Multi-Head Attention: Collaborate Instead of Concatenate","archived":false,"fork":false,"pushed_at":"2023-06-12T21:27:03.000Z","size":49,"stargazers_count":151,"open_issues_count":6,"forks_count":22,"subscribers_count":14,"default_branch":"master","last_synced_at":"2024-12-11T14:04:44.139Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2006.16362","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/epfml.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-06-02T13:29:08.000Z","updated_at":"2024-11-19T12:35:13.000Z","dependencies_parsed_at":"2024-08-03T09:04:47.623Z","dependency_job_id":null,"html_url":"https://github.com/epfml/collaborative-attention","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2Fcollaborative-attention","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2Fcollaborative-attention/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2Fcollaborative-attention/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2Fcollaborative-attention/manifests","owner_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/owners/epfml","download_url":"https://codeload.github.com/epfml/collaborative-attention/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230554330,"owners_count":18244234,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T09:01:14.116Z","updated_at":"2025-10-25T17:07:39.851Z","avatar_url":"https://github.com/epfml.png","language":"Python","readme":"# Collaborative Attention\n\nCode for the paper [Multi-Head Attention: Collaborate Instead of Concatenate](https://arxiv.org/abs/2006.16362), Jean-Baptiste Cordonnier, Andreas Loukas, Martin Jaggi.\n\n\u003e Clone this repo with submodules: `git clone --recurse-submodules https://github.com/epfml/collaborative-attention.git`\n\nWe provide a Python package to reparametrize any pretrained attention layer into a collaborative attention layer.\nThis makes it possible to decrease the key/query dimension without affecting the performance of the model.\nOur factorization can be used either for pretraining, as a drop-in replacement for concatenated-heads attention, or before fine-tuning, as a compression method.\n\n[![tests](https://github.com/epfml/collaborative-attention/workflows/tests/badge.svg)](https://github.com/epfml/collaborative-attention/actions?query=workflow%3Atests)\n\n## Install\n\nClone this repository and install the package with pip:\n\n```bash\n# you need to have PyTorch installed\ngit clone https://github.com/epfml/collaborative-attention.git\npip install -U -e collaborative-attention\n```\n\n## Quick Start\n\nWe provide 
code to reparametrize any attention layer into our efficient collaborative version.\nThe following code factorizes a pretrained BERT-base model with collaborative heads.\n\n```python\nfrom transformers import AutoModel\nfrom collaborative_attention import swap_to_collaborative, BERTCollaborativeAdapter\nimport copy\nimport torch\n\nmodel = AutoModel.from_pretrained(\"bert-base-cased-finetuned-mrpc\")\n\n# reparametrize the model with tensor decomposition to use collaborative heads\n# decrease dim_shared_query_key (to 384, for example) to compress the model\ncollab_model = copy.deepcopy(model)\nswap_to_collaborative(collab_model, BERTCollaborativeAdapter, dim_shared_query_key=768)\n\n# check that the output is not altered too much\nany_input = torch.LongTensor(3, 25).random_(1000, 10000)\ncollab_model.eval()  # to disable dropout\nout_collab = collab_model(any_input)\n\nmodel.eval()\nout_original = model(any_input)\n\nprint(\"Max l1 error: {:.1e}\".format((out_collab[0] - out_original[0]).abs().max().item()))\n# \u003e\u003e\u003e Max l1 error: 1.9e-06\n\n# You can evaluate the new model, fine-tune it, or save it.\n# You can also pretrain collaborative heads from scratch (if you were wondering).\n```\n\n## Explore the Code\n\n- The collaborative multi-head attention layer is defined in [src/collaborative_attention/collaborative_attention.py](src/collaborative_attention/collaborative_attention.py).\n- We use [tensorly](http://tensorly.org/stable/index.html) to decompose a trained attention head and reparametrize it as a collaborative layer. 
You can look at the decomposition code in [src/collaborative_attention/swap.py](src/collaborative_attention/swap.py), which defines the `swap_to_collaborative` function.\nWhen run on a GPU, the decomposition takes less than a minute per layer.\n\n## Other Transformers\n\nOur framework can be adapted to any transformer that we know of.\nOur code base is modular, so collaborative heads can be swapped into any transformer.\nWe use small adapter classes that extract the parameters of the layers we want to transform.\nWe have defined adapters for the following transformers:\n\n| Model | Adapter Class | File |\n| ----- | ------------- | ---- |\n| [BERT](https://arxiv.org/abs/1810.04805) | BERTCollaborativeAdapter | `src/collaborative_attention/adapter_bert.py` |\n| [DistilBERT](https://arxiv.org/abs/1910.01108) | DistilBERTCollaborativeAdapter | `src/collaborative_attention/adapter_distilbert.py` |\n| [ALBERT](https://arxiv.org/abs/1909.11942) | ALBERTCollaborativeAdapter | `src/collaborative_attention/adapter_albert.py` |\n\nAdding a new model is simple: define your own adapter based on `CollaborativeLayerAdapter`. You only have to write a few one-liner functions, and you can take inspiration from the files above. We are happy to merge PRs quickly; just copy-paste a test into `tests/` to make sure your adapter works.\n\n## Results\n\n### Natural Language Understanding\n\nDownload the GLUE data following [these instructions](https://github.com/huggingface/transformers/tree/master/examples/text-classification) and set the `GLUE_DIR` environment variable.\n\nProceed in two steps:\n1. Fine-tune the original model (`bert-base-cased`, for example) on the task, without `--mix_heads` and `--mix_size`.\n2. 
Use the fine-tuned model saved in `output/` for the decomposition (pass it as the `model_name_or_path` argument); the script swaps the model to collaborative heads and fine-tunes it again.\n\nHere is an example command with a model already fine-tuned on MRPC:\n\n```\npython run_glue.py \\\n    --model_name_or_path=bert-base-cased-finetuned-mrpc \\\n    --task_name=mrpc \\\n    --data_dir=$GLUE_DIR \\\n    --output_dir=output/ \\\n    --do_train \\\n    --do_eval \\\n    --max_seq_length=128 \\\n    --per_gpu_train_batch_size=32 \\\n    --learning_rate=2e-5 \\\n    --num_train_epochs=3.0 \\\n    --overwrite_output_dir \\\n    --save_total_limit=3 \\\n    --mix_heads \\\n    --mix_size 384\n```\n\n### Neural Machine Translation\n\n```\ncd fairseq/\npip install --editable ./\n# on macOS:\n# CFLAGS=\"-stdlib=libc++\" pip install --editable ./\n```\n\nDownload and preprocess the data following these [instructions](https://github.com/pytorch/fairseq/tree/master/examples/scaling_nmt).\n\nReproduce our experiments on a machine with 4 GPUs with the following command:\n\n```bash\n# set COLAB to \"none\" to run the original transformer\n# set KEY_DIM for different key dimensions\nKEY_DIM=512 COLAB=\"encoder_cross_decoder\" CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py data-bin/wmt16_en_de_bpe32k \\\n    --arch transformer_wmt_en_de \\\n    --save-dir checkpoints/wmt16-en-de/base-d-$KEY_DIM-colab-$COLAB \\\n    --share-all-embeddings \\\n    --optimizer adam \\\n    --adam-betas '(0.9, 0.98)' \\\n    --clip-norm 0.0 \\\n    --lr 0.0007 \\\n    --min-lr 1e-09 \\\n    --lr-scheduler inverse_sqrt \\\n    --warmup-updates 4000 \\\n    --warmup-init-lr 1e-07 \\\n    --dropout 0.1 \\\n    --weight-decay 0.0 \\\n    --criterion label_smoothed_cross_entropy \\\n    --label-smoothing 0.1 \\\n    --max-tokens 3584 \\\n    --update-freq 2 \\\n    --fp16 \\\n    --collaborative-heads $COLAB \\\n    --key-dim $KEY_DIM\n```\n\n### Vision Transformers\n\nFollow the DeiT setup:\n\n```\ncd deit\nconda install -c pytorch pytorch 
torchvision\npip install timm==0.3.2 tensorly\n```\n\nTo train Base3 models, run the following command:\n\n```\npython -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --model deit_base3_patch16_224_collab384 --batch-size 256 --data-path /imagenet --output_dir ../outputs\n```\n\nor, for the concatenated attention:\n\n```\npython -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --model deit_base3_patch16_224_key384 --batch-size 256 --data-path /imagenet --output_dir ../outputs\n```\n\nYou can reparametrize a pretrained model by running the following command on a single-GPU machine:\n\n```\npython --model deit_base_patch16_224 --shared_key_query_dim 384 --output_dir ./models\n```\n\nwhich will create a new checkpoint for the reparametrized model in `./models/deit_base_patch16_224_collab384.pt`.\n\nTo evaluate this model, run:\n\n```\npython main.py --eval --model deit_base_patch16_224_collab384 --data-path /imagenet --pretrained --models_directory ./models\n```\n\n## Citation\n\nIf you find this code useful, please cite the paper:\n\n```\n@misc{cordonnier2020multihead,\n    title={Multi-Head Attention: Collaborate Instead of Concatenate},\n    author={Jean-Baptiste Cordonnier and Andreas Loukas and Martin Jaggi},\n    year={2020},\n    eprint={2006.16362},\n    archivePrefix={arXiv},\n    primaryClass={cs.LG}\n}\n```\n","funding_links":[],"categories":["BERT Optimization"],"sub_categories":["Large language dialogue models and data"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fepfml%2Fcollaborative-attention","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fepfml%2Fcollaborative-attention","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fepfml%2Fcollaborative-attention/lists"}