{"id":18322504,"url":"https://github.com/tencentarc/mllm-npu","last_synced_at":"2025-06-17T23:05:36.283Z","repository":{"id":247372970,"uuid":"823640020","full_name":"TencentARC/mllm-npu","owner":"TencentARC","description":"mllm-npu: training multimodal large language models on Ascend NPUs","archived":false,"fork":false,"pushed_at":"2024-08-29T02:51:55.000Z","size":51267,"stargazers_count":91,"open_issues_count":3,"forks_count":2,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-04-03T10:04:49.657Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TencentARC.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-03T12:22:39.000Z","updated_at":"2025-03-23T08:02:49.000Z","dependencies_parsed_at":"2024-08-29T03:10:09.939Z","dependency_job_id":null,"html_url":"https://github.com/TencentARC/mllm-npu","commit_stats":null,"previous_names":["tencentarc/mllm-npu"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/TencentARC/mllm-npu","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2Fmllm-npu","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2Fmllm-npu/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2Fmllm-npu/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2Fmllm-npu/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/T
encentARC","download_url":"https://codeload.github.com/TencentARC/mllm-npu/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentARC%2Fmllm-npu/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260453741,"owners_count":23011575,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-05T18:24:53.992Z","updated_at":"2025-06-17T23:05:31.262Z","avatar_url":"https://github.com/TencentARC.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n    \u003cimg src=\"./images/title.png\" width=\"50%\"\u003e\n\u003c/p\u003e\n\n\u003ch3 align=\"center\"\u003eTraining Multimodal Large Language Models on Ascend NPUs\u003c/h3\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"./images/bar.png\"\u003e\n\u003c/p\u003e\n\n\u003ch4 align=\"center\"\u003e\n    \u003cp\u003e\n        \u003ca href=\"./README.md\"\u003eEnglish\u003c/a\u003e |\n        \u003ca href=\"./README_ZH.md\"\u003e中文\u003c/a\u003e  \n    \u003c/p\u003e\n\u003c/h4\u003e\n\n\u003c/br\u003e\n\nIn recent years, the widespread use of NPUs has provided more training and usage resources for LLMs, especially MLLMs.\nHowever, the current use of NPUs still has more or less adaptation issues.\nTherefore, we provide a framework that can flexibly select different visual encoders, adapters, LLMs, and corresponding generation components to form MLLMs for training, inferring, and image generation.\n\nFor example, we give an implementation of a 
high-performance MLLM (i.e., SEED-X) using this framework. Of course, you can also choose different modules in this framework to build your own MLLM.\n\n- MLLM: the standard multimodal large language models for multimodal comprehension.\n\n- [SEED-X](https://github.com/AILab-CVC/SEED-X/tree/main): a unified and versatile foundation model which is capable of responding to a variety of user needs through unifying **multi-granularity comprehension and generation**.\n\n\n## 🌟 Highlights\n\n* **modular design**: this project is flexible and it's easy to change the large language models or vision encoders with configs.\n\n* **training recipe**: this project provides the complete code for pre-training or supervised finetuning of multimodal large language models on (Ascend) NPUs.\n\n* **acceleration**: this project provides NPU replacements for existing GPU-accelerated components.\n\n## 📢 News\n\n* **2024-07-24** 🔥 We release 7 Chinese and English pure text and multi-modal evaluation benchmarks.\n\n* **2024-07-08** 🔥 We release NPU-based multi-modal inference and pre-training code, and various ways to use SEED-X.\n\n## 📋 TODOs\n\nThis project is **under active development**, please stay tuned ☕️!\n\n- [ ] Model zoo on NPU.\n- [ ] Multimodal benchmarks.\n\n\n\n## 🔨 Install\n\n- Dependencies \u0026 Environment\n  - python \u003e= 3.8 ([Anaconda](https://www.anaconda.com/download/#linux) recommended)\n  - [torch = 2.1.0+cpu](https://pytorch.org/) + [torch-npu = 2.1.0](https://pypi.org/project/torch-npu/2.1.0/)\n  - ASCEND NPU ([910B]() recommended) + [CANN](https://www.hiascend.com/en/software/cann)\n    - CANN version\n    \n\n    ```bash\n    \u003e cat /usr/local/Ascend/ascend-toolkit/latest/x86_64-linux/ascend_toolkit_install.info \n    package_name=Ascend-cann-toolkit\n    version=8.0.T6\n    innerversion=V100R001C17B214\n    compatible_version=[V100R001C15,V100R001C18],[V100R001C30],[V100R001C13],[V100R003C11],[V100R001C29],[V100R001C10]\n  
  arch=x86_64\n    os=linux\n    path=/usr/local/Ascend/ascend-toolkit/8.0.T6/x86_64-linux\n    ```\n\n- Installation\n  - Clone the repo and install dependent packages\n\n  ```bash\n  git clone https://github.com/TencentARC/mllm-npu.git\n  cd mllm-npu\n  pip install -r requirements.txt\n  ```\n\n## 💻 Demo\n\n### Quick Start\n\nTo quickly try out this framework, you can execute the following scripts.\n\n```bash\n# For image comprehension\npython ./demo/img2txt_inference.py\n\n# For image generation\npython ./demo/txt2img_generation.py\n```\n\n### Gradio Web UI\n\nTo launch a Gradio demo locally, please run the following commands one by one. If you plan to launch multiple model workers to compare different checkpoints, you only need to launch the controller and the web server ONCE.\n\n1. Launch a controller\n\n    ```bash\n    python mllm_npu/serve/controller.py --host 0.0.0.0 --port 10000\n    ```\n\n2. Launch a model worker\n\n    ```bash\n    python mllm_npu/serve/worker.py --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000\n    ```\n\n3. Launch a Gradio web app\n\n    ```bash\n    python mllm_npu/serve/gradio_app.py\n    ```\n\n4. 
You can also use this service through the API; see [demo](./demo/demo.ipynb) for the format.\n\n    ```json\n    {\n        \"input_text\": \"put your input text here\",\n        \"image\": \"put your input image (base64)\",\n        \"image_gen\": False or True\n    }\n    ```\n   \n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"./images/gradio_inference.png\" width=\"90%\"\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"./images/gradio_generation.png\" width=\"90%\"\u003e\n\u003c/p\u003e\n\n## ⚙️ Model\n\nWe mainly adopt the `GeneraliazedMultimodalModels` in [mllm.py](./mllm_npu/models/mllm.py) as the general architecture of multimodal large language models, such as LLaVA, which contains three basic modules:\n- (1) a **language model**, e.g., LLaMA-2.\n- (2) a **projector** to project image features into language embeddings.\n- (3) a **vision encoder**, e.g., ViT.\n\nThe MLLM is built according to the model config with `hydra.utils.instantiate`, and you can find some samples in [models](./mllm_npu/configs/models).\n\n\u003cdiv align=\"center\"\u003e\u003cimg src=\"images/mllm.png\"\u003e\u003c/div\u003e\n\nWe currently support two mainstream architectures:\n\n* standard multimodal models (`GeneraliazedMultimodalModels`): aimed at multimodal comprehension, containing a vision encoder, a vision-language projector, and a large language model.\n\n* [SEED-X](https://github.com/AILab-CVC/SEED-X) (`SEED`): a versatile multimodal model for comprehension and generation that extends the standard multimodal model with an output projector for generating images with Stable Diffusion.\n\n    | Architecture | Any Resolution | Comprehension | Generation |\n    | :----------- | :------------: | :-----------: | :--------: |\n    | MLLM         | ✔️              | ✔️             | ✖️          |\n    | SEED-X       | ✔️              | ✔️             | ✔️          |\n\n## 🌐 Data\n\nYou can prepare your own data to pre-train or fine-tune your 
model. Specifically, we provide four different tasks and corresponding formats (please refer to the [examples](./data/)). To use the data more efficiently, we organize it with [webdataset](https://webdataset.github.io/webdataset/). Please refer to [data.yaml](./seed_npu/configs/dataset/pretrain_data.yaml) for the index of the data. You can adjust the data sampling rate and other settings in this file.\n\nPlease refer to [dataset](./data/data.md) for more data information.\n\n## 🏃 Train\n\n### Prepare Tokenizers\n\nFor multimodal comprehension, we need to add special tokens to the tokenizers, such as `\u003cimg\u003e` or `\u003cpatch\u003e`. You can specify the path of the tokenizer in [scripts/tools/add_special_tokens_to_tokenizer.py](./scripts/tools/add_special_tokens_to_tokenizer.py) and run this script directly to obtain the updated tokenizer.\n\n### Pre-training\nYou need to specify the **model config** and **data config** in the training scripts, such as [`scripts/mllm_llama3_8b_siglip_vit_pretrain.sh`](./scripts/mllm_llama3_8b_siglip_vit_pretrain.sh).\n\n```bash\nbash scripts/mllm_llama3_8b_siglip_vit_pretrain.sh\n```\n\n### Supervised Finetuning / Instruction Tuning\n\nFor supervised finetuning, you can keep most settings unchanged and:\n\n1. specify the initial weights of SFT through the \"pretrained_model_name_path\" in the model configuration file.  \n2. adjust the SFT data and its instruction format.  \n3. follow the pre-training script for the rest.\n\n## 🌟 Benchmark Evaluation\nWe collected some popular English/Chinese plain text and multi-modal benchmarks (e.g., mmlu, cmmlu, etc.), see [here](./evaluate/evaluate.md) for details.\n\n\n## 🚅 Acceleration\n\nOn the GPU, there are some common acceleration components that can significantly improve the model calculation speed, such as [flash-attn](https://github.com/Dao-AILab/flash-attention) and [xformers](https://github.com/facebookresearch/xformers). 
\nSince there is currently no direct implementation of these components on the NPU, we provide some optional acceleration implementations; please see [acceleration](./mllm_npu/acceleration/acceleration.md) for details.\n\n\n## 💡 Citation\n\nIf you find the work helpful, please consider citing:\n\n- mllm-npu\n\n    ```bibtex\n    @misc{mllm_npu,\n        title={mllm-npu},\n        author={Li, Chen and Cheng, Tianheng and Ge, Yuying and Wang, Teng and Ge, Yixiao},\n        howpublished={\\url{https://github.com/TencentARC/mllm-npu}},\n        year={2024},\n    }\n    ```\n\n- SEED-X\n\n    ```bibtex\n    @article{ge2024seed,\n        title={SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation},\n        author={Ge, Yuying and Zhao, Sijie and Zhu, Jinguo and Ge, Yixiao and Yi, Kun and Song, Lin and Li, Chen and Ding, Xiaohan and Shan, Ying},\n        journal={arXiv preprint arXiv:2404.14396},\n        year={2024}\n    }\n    ```\n\n## 🔎 License\nThis project is under the Apache-2.0 License. For models built with LLaMA or Qwen models, please also adhere to their licenses!\n\n\n## 👍 Acknowledgement\n\nThis project is developed based on the source code of [SEED-X]().\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftencentarc%2Fmllm-npu","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftencentarc%2Fmllm-npu","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftencentarc%2Fmllm-npu/lists"}