{"id":20405687,"url":"https://github.com/skyworkai/moh","last_synced_at":"2025-04-04T17:03:53.266Z","repository":{"id":259170047,"uuid":"869383878","full_name":"SkyworkAI/MoH","owner":"SkyworkAI","description":"MoH: Multi-Head Attention as Mixture-of-Head Attention","archived":false,"fork":false,"pushed_at":"2024-10-29T15:22:54.000Z","size":5515,"stargazers_count":230,"open_issues_count":3,"forks_count":9,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-28T16:04:05.385Z","etag":null,"topics":["attention","dit","llms","mixture-of-experts","moe","transformer","vit"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2410.11842","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SkyworkAI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-08T07:52:37.000Z","updated_at":"2025-03-26T14:51:42.000Z","dependencies_parsed_at":"2024-10-23T07:47:18.661Z","dependency_job_id":"4d91b3bc-3f92-493b-ac35-c477cdc2ee0e","html_url":"https://github.com/SkyworkAI/MoH","commit_stats":null,"previous_names":["skyworkai/moh"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SkyworkAI%2FMoH","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SkyworkAI%2FMoH/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SkyworkAI%2FMoH/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SkyworkAI%2FMoH/manifests","owner_url":"https://repos.ecosy
ste.ms/api/v1/hosts/GitHub/owners/SkyworkAI","download_url":"https://codeload.github.com/SkyworkAI/MoH/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247217174,"owners_count":20903008,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["attention","dit","llms","mixture-of-experts","moe","transformer","vit"],"created_at":"2024-11-15T05:12:28.648Z","updated_at":"2025-04-04T17:03:53.209Z","avatar_url":"https://github.com/SkyworkAI.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=center\u003e\n\u003cimg src=\"figures/fig1.png\" width=\"280px\"\u003e\n\u003c/div\u003e\n\n\u003ch2 align=\"center\"\u003e \u003ca href=\"https://arxiv.org/abs/2410.11842\"\u003eMoH: Multi-Head Attention as Mixture-of-Head Attention\n\n\u003c/a\u003e\u003c/h2\u003e\n\u003ch5 align=\"center\"\u003e If you like our project, please give us a star ⭐ on GitHub for the latest update.\u003c/h5\u003e\n\n\u003ch5 align=center\u003e\n\n\u003c!-- [![Demo](https://img.shields.io/badge/⚡-Hugging%20Face%20Demo-yellow.svg)](https://huggingface.co/spaces/Chat-UniVi/Chat-UniVi) 
--\u003e\n[![hf](https://img.shields.io/badge/🤗-Hugging%20Face-blue.svg)](https://huggingface.co/Chat-UniVi)\n[![arXiv](https://img.shields.io/badge/Arxiv-2410.11842-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2410.11842)\n[![License](https://img.shields.io/badge/Code%20License-Apache2.0-yellow)](https://github.com/SkyworkAI/MoH/blob/main/LICENSE)\n[![Hits](https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2FSkyworkAI%2FMoH\u0026count_bg=%2379C83D\u0026title_bg=%23555555\u0026icon=\u0026icon_color=%23E7E7E7\u0026title=Visitor\u0026edge_flat=false)](https://hits.seeyoufarm.com)\n[![GitHub issues](https://img.shields.io/github/issues/SkyworkAI/MoH?color=critical\u0026label=Issues)](https://github.com/SkyworkAI/MoH/issues?q=is%3Aopen+is%3Aissue)\n[![GitHub closed issues](https://img.shields.io/github/issues-closed/SkyworkAI/MoH?color=success\u0026label=Issues)](https://github.com/SkyworkAI/MoH/issues?q=is%3Aissue+is%3Aclosed)\n\u003c/h5\u003e\n\n\u003cdetails open\u003e\u003csummary\u003e💡 I also have other projects that may interest you ✨. 
\u003c/summary\u003e\u003cp\u003e\n\u003c!--  may --\u003e\n    \n\u003e [**Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding**](https://arxiv.org/abs/2311.08046) \u003cbr\u003e\n\u003e Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, Li Yuan \u003cbr\u003e\n[![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/PKU-YuanGroup/Chat-UniVi)  [![github](https://img.shields.io/github/stars/PKU-YuanGroup/Chat-UniVi.svg?style=social)](https://github.com/PKU-YuanGroup/Chat-UniVi) [![arXiv](https://img.shields.io/badge/Arxiv-2311.08046-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2311.08046) [![Conference](http://img.shields.io/badge/CVPR-2024(Highlight)-FFD93D.svg)](https://cvpr.thecvf.com/) \u003cbr\u003e\n    \n\u003e [**MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts**](https://github.com/SkyworkAI/MoE-plus-plus) \u003cbr\u003e\n\u003e Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan \u003cbr\u003e\n[![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/SkyworkAI/MoE-plus-plus)  [![github](https://img.shields.io/github/stars/SkyworkAI/MoE-plus-plus.svg?style=social)](https://github.com/SkyworkAI/MoE-plus-plus) [![arXiv](https://img.shields.io/badge/Arxiv-2410.07348-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2410.07348) \u003cbr\u003e\n\n--\u003e\n\n\u003c/p\u003e\u003c/details\u003e\n\n## 📣 News\n* **[2024/10/23]** We updated [tokenizer config](https://huggingface.co/Chat-UniVi/MoH-LLaMA3-8B/blob/main/tokenizer_config.json) for [MoH-LLaMA3-8B](https://huggingface.co/Chat-UniVi/MoH-LLaMA3-8B).\n* **[2024/10/22]**  Now [MoH-LLaMA3-8B](https://huggingface.co/Chat-UniVi/MoH-LLaMA3-8B) is available.\n* **[2024/10/10]**  MoH-LLaMA3-8B weights are being approved and will be available for download after approval.\n* **[2024/10/09]**  Model weight and inference code are available now! 
Welcome to **watch** 👀 this repository for the latest updates.\n\n## ⚡ Overview\nWe propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages:\n* First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. \n* Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential.\n\n\u003cdiv align=center\u003e\n\u003cimg src=\"figures/fig2.png\" width=\"800px\"\u003e\n\u003c/div\u003e\n\n\n## 😮 Highlights\n### 💡 General Framework\nWe evaluate our proposed MoH across various popular model frameworks, including Vision Transformers (ViT) for image classification, Diffusion models with Transformers (DiT) for class-conditional image generation, and Large Language Models (LLMs) for language tasks.\n\n\u003cdiv align=center\u003e\n\n|                   Code                    |                                                                                                                         HuggingFace Model                                                                                                                         |  \n|:-----------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|\n|     **[MoH-ViT](https://github.com/SkyworkAI/MoH/tree/main/MoH-ViT)**      | 🤗 [MoH-ViT-B-75](https://huggingface.co/Chat-UniVi/MoH-ViT-B-75), [MoH-ViT-B-50](https://huggingface.co/Chat-UniVi/MoH-ViT-B-50), [MoH-ViT-S-80](https://huggingface.co/Chat-UniVi/MoH-ViT-S-80), 
[MoH-ViT-S-75](https://huggingface.co/Chat-UniVi/MoH-ViT-S-75) |\n|     **[MoH-DiT](https://github.com/SkyworkAI/MoH/tree/main/MoH-DiT)**      |                                                                                                 😊 [MoH-DiT-90](https://huggingface.co/Chat-UniVi/MoH-DiT-XL-90)                                                                                                  | \n| **[MoH-LLaMA3-8B](https://github.com/SkyworkAI/MoH/tree/main/MoH-LLaMA3)** |                                                                                                                        😊 [MoH-LLaMA3-8B](https://huggingface.co/Chat-UniVi/MoH-LLaMA3-8B)                                                                                                                         | \n\n\u003c/div\u003e\n\n### 🔥 High Performance\nExtensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention while using only **50%~90%** of the attention heads.\n\n### 🤗 Support Continue-Tuning Starting from the Multi-Head Attention Models\nWe demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% while using only 75% of the attention heads.\n\n\u003cdiv align=center\u003e\n\u003cimg src=\"figures/fig3.png\" width=\"800px\"\u003e\n\u003c/div\u003e\n\nThe MoH model quickly recovers to over **95%** of the performance of the original model within a training budget of 10B tokens. 
Performance then improves gradually as the number of training tokens increases.\n\n## 🚀 Main Results\n### ViT for ImageNet-1K Classification\n\n\u003cdiv align=center\u003e\n\u003cimg src=\"figures/fig4.png\" width=\"800px\"\u003e\n\u003c/div\u003e\n\n### DiT for Class-Conditional Image Generation (ImageNet-1K)\n\n\u003cdiv align=center\u003e\n\u003cimg src=\"figures/fig9.png\" width=\"800px\"\u003e\n\u003c/div\u003e\n\n\u003cdiv align=center\u003e\n\u003cimg src=\"figures/fig5.png\" width=\"800px\"\u003e\n\u003c/div\u003e\n\n### Training LLMs from Scratch\n\n\u003cdiv align=center\u003e\n\u003cimg src=\"figures/fig6.png\" width=\"800px\"\u003e\n\u003c/div\u003e\n\n### Continue-Tuning LLaMA3-8B\n\n\u003cdiv align=center\u003e\n\u003cimg src=\"figures/fig7.png\" width=\"800px\"\u003e\n\u003c/div\u003e\n\n\n## 😍 Why is MoH better than Multi-Head Attention?\n### Flexible Head Assignment Patterns\nWe observe significant variation in attention head assignments across different categories and task topics, indicating that the MoH model adapts to diverse tasks by employing distinct head assignment patterns. 
This characteristic of MoH allows different attention heads to focus on different types of tasks, making parameter utilization more efficient than in multi-head attention.\n\n\u003cdiv align=center\u003e\n\u003cimg src=\"figures/fig8.png\" width=\"800px\"\u003e\n\u003c/div\u003e\n\n### Weighted Summation of Heads\nBy replacing the standard summation in multi-head attention with a weighted summation, MoH enhances the flexibility of the attention mechanism and increases the performance potential.\n\n\n## 🗝️ Training \u0026 Validating\n* The training instructions for MoH-ViT are in [MoH-ViT](MoH-ViT/README.md).\n* The training instructions for MoH-DiT are in [MoH-DiT](MoH-DiT/README.md).\n* The instructions for MoH-LLaMA3-8B are in [MoH-LLaMA3-8B](MoH-LLaMA3/README.md).\n\n\n## 👍 Acknowledgement\n* [Skywork-MoE](https://github.com/SkyworkAI/Skywork-MoE) It is an advanced MoE language model.\n* [LLaMA3](https://github.com/meta-llama/llama3) It is an advanced open-source language model.\n* [TransNeXt (CVPR 2024)](https://github.com/DaiShiResearch/TransNeXt) It is an advanced ViT model.\n* [DiT (ICCV 2023)](https://github.com/facebookresearch/DiT) It is an advanced diffusion model with the Transformer.\n\n## 🤝 Related Projects\n* [MoE++](https://github.com/SkyworkAI/MoE-plus-plus) MoE++ achieves better performance while delivering 1.1~2.1x expert forward throughput compared to a vanilla MoE model of the same size, which lays a solid foundation for developing advanced and efficient MoE-related models.\n* [Chat-UniVi (CVPR 2024 Highlight)](https://github.com/PKU-YuanGroup/Chat-UniVi) The model is an efficient large language and video assistant. 
This framework exhibits remarkable interactive capabilities between images and videos.\n\n\n## 🔒 License\n* The majority of this project is released under the Apache 2.0 license as found in the [LICENSE](https://github.com/SkyworkAI/MoH/blob/main/LICENSE) file.\n* The service is a research preview intended for non-commercial use only, subject to the model [License](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md) of LLaMA, [Terms of Use](https://openai.com/policies/terms-of-use) of the data generated by OpenAI, and [Privacy Practices](https://chrome.google.com/webstore/detail/sharegpt-share-your-chatg/daiacboceoaocpibfodeljbdfacokfjb) of ShareGPT. Please contact us if you find any potential violations.\n\n\n## ✏️ Citation\nIf you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:\n```\n@article{jin2024moh,\n  title={MoH: Multi-Head Attention as Mixture-of-Head Attention},\n  author={Jin, Peng and Zhu, Bo and Yuan, Li and Yan, Shuicheng},\n  journal={arXiv preprint arXiv:2410.11842},\n  year={2024}\n}\n```\n\n## ✨ Contributors\n\u003ca href=\"https://github.com/SkyworkAI/MoH/graphs/contributors\"\u003e\n  \u003cimg src=\"https://contrib.rocks/image?repo=SkyworkAI/MoH\" /\u003e\n\u003c/a\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fskyworkai%2Fmoh","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fskyworkai%2Fmoh","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fskyworkai%2Fmoh/lists"}