https://github.com/UbiquitousLearning/mllm

Fast Multimodal LLM on Mobile Devices
Host: GitHub
URL: https://github.com/UbiquitousLearning/mllm
Owner: UbiquitousLearning
License: mit
Created: 2023-08-30T16:33:36.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2025-10-18T13:14:31.000Z (about 1 month ago)
Last Synced: 2025-10-21T20:44:03.530Z (about 1 month ago)
Topics: large-language-models, llama, multimodal
Language: C++
Homepage: https://ubiquitouslearning.github.io/mllm_website
Size: 46.2 MB
Stars: 1,121
Watchers: 26
Forks: 136
Open Issues: 39
Metadata Files:
- Readme: README.md
- License: LICENSE
mllm





fast and lightweight multimodal LLM inference engine for mobile and edge devices





| Arm CPU | X86 CPU | Qualcomm NPU(QNN) |





[![Website](https://img.shields.io/badge/website-visit-green)](https://ubiquitouslearning.github.io/mllm_website/)

[![Documentation](https://img.shields.io/badge/view-docs-blue)](https://ubiquitouslearning.github.io/mllm_website/introduction/getstarted/)

[![Android App](https://img.shields.io/badge/android-app-pink)](https://github.com/lx200916/ChatBotApp/)

[![Actions Status](https://github.com/UbiquitousLearning/mllm/workflows/Tests/badge.svg)](https://github.com/UbiquitousLearning/mllm/actions)



- Plain C/C++ implementation without dependencies

- Optimized for multimodal LLMs like Fuyu-8B and LLaVA

- Supported: ARM NEON, x86 AVX2, Qualcomm NPU (QNN), etc.

- Various quantization schemes

- End-to-end Android app demo

Why on-device multimodal LLMs? They are a key building block for [intelligent personal agents](https://arxiv.org/pdf/2401.05459.pdf), text-based image search and retrieval, screen VQA, and many other mobile apps, all without giving away your private data (chat history, screenshots, photos, etc.).

## Recent update

- [2024 November 21] Support new model: Phi 3 Vision https://github.com/UbiquitousLearning/mllm/pull/186

- [2024 August 30] Support new model: MiniCPM 2B https://github.com/UbiquitousLearning/mllm/pull/132

- [2024 August 15] Support new model: Phi 3 mini https://github.com/UbiquitousLearning/mllm/pull/119

- [2024 August 10] Support Qualcomm NPU: https://github.com/UbiquitousLearning/mllm/pull/112 | [try it out](https://github.com/UbiquitousLearning/mllm/tree/main/src/backends/qnn) | [paper](https://arxiv.org/pdf/2407.05858v1)

### Contents

- [Android Demo](#android-demo)

- [Support models](#support-models)

- [Quick Start](#quick-start)

    - [Get the Code](#get-the-code)

    - [Check prerequisites](#check-prerequisites)

    - [Run Qwen with Hexagon NPU accelerating using QNN](#run-qwen-with-hexagon-npu-accelerating-using-qnn)

    - [Run with the CPU of Android](#run-with-the-cpu-of-android)

    - [Run for Linux](#run-for-linux)

- [Customization](#customization)

    - [Convert models](#convert-models)

    - [Convert vocabulary](#convert-vocabulary)

    - [Quantize models](#quantize-models)

- [Roadmap](#roadmap)

- [Documentation](#documentation)

- [Contribution](#contribution)

- [Acknowledgments](#acknowledgments)

- [License](#license)

## Android Demo

*Demo clips: Android Intent Invocation, Image Understanding, Chat CPU, Chat NPU.*

## Support models

### Language models

| Model | CPU FP32 | CPU INT4 | Hexagon NPU INT8 |
|---|---|---|---|
| [LLaMA 2 7B](https://github.com/facebookresearch/llama) | [✔️](https://huggingface.co/mllmTeam/llama-2-7b-mllm/tree/main) | [✔️](https://huggingface.co/mllmTeam/llama-2-7b-mllm/tree/main) | |
| [LLaMA 3 1B](https://github.com/meta-llama/llama3) | [✔️](https://huggingface.co/mllmTeam/llama-3.2-1b-mllm/tree/main) | [✔️](https://huggingface.co/mllmTeam/llama-3.2-1b-mllm/tree/main) | |
| [LLaMA 3 3B](https://github.com/meta-llama/llama3) | [✔️](https://huggingface.co/mllmTeam/llama-3.2-3b-mllm/tree/main) | [✔️](https://huggingface.co/mllmTeam/llama-3.2-3b-mllm/tree/main) | |
| [Alpaca 7B](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2) | [✔️](https://huggingface.co/mllmTeam/chinese-alpaca-7b-mllm/tree/main) | [✔️](https://huggingface.co/mllmTeam/chinese-alpaca-7b-mllm/tree/main) | |
| [TinyLLaMA 1.1B](https://github.com/jzhang38/TinyLlama) | [✔️](https://huggingface.co/mllmTeam/tinyllama-1.1b-mllm/tree/main) | [✔️](https://huggingface.co/mllmTeam/tinyllama-1.1b-mllm/tree/main) | |
| [LLaVA 7B](https://github.com/haotian-liu/LLaVA) | [✔️](https://huggingface.co/mllmTeam/llava-1.5-7b-mllm/tree/main) | [✔️](https://huggingface.co/mllmTeam/llava-1.5-7b-mllm/tree/main) | |
| [Gemma 2B](https://github.com/google/gemma_pytorch) | [✔️](https://huggingface.co/mllmTeam/gemma-2b-mllm/tree/main) | [✔️](https://huggingface.co/mllmTeam/gemma-2b-mllm/tree/main) | |
| [Gemma 2 2B](https://github.com/google/gemma_pytorch) | [✔️](https://huggingface.co/mllmTeam/gemma-2-2b-mllm/tree/main) | [✔️](https://huggingface.co/mllmTeam/gemma-2-2b-mllm/tree/main) | |
| [Qwen 1.5 0.5B](https://github.com/QwenLM/Qwen) | [✔️](https://huggingface.co/mllmTeam/qwen-1.5-0.5b-mllm/tree/main) | [✔️](https://huggingface.co/mllmTeam/qwen-1.5-0.5b-mllm/tree/main) | |
| [Qwen 1.5 1.8B](https://github.com/QwenLM/Qwen) | [✔️](https://huggingface.co/mllmTeam/qwen-1.5-1.8b-chat-mllm) | [✔️](https://huggingface.co/mllmTeam/qwen-1.5-1.8b-chat-mllm) | [✔️](https://huggingface.co/mllmTeam/qwen-1.5-1.8b-chat-mllm) |
| [Qwen 2.5 1.5B](https://github.com/QwenLM/Qwen2.5) | [✔️](https://huggingface.co/mllmTeam/qwen-2.5-1.5b-mllm/tree/main) | [✔️](https://huggingface.co/mllmTeam/qwen-2.5-1.5b-mllm/tree/main) | |
| [Mistral 7B](https://github.com/mistralai/mistral-src) | [✔️](https://huggingface.co/mllmTeam/mistral-7b-instruct-v0.2-mllm/tree/main) | [✔️](https://huggingface.co/mllmTeam/mistral-7b-instruct-v0.2-mllm/tree/main) | |
| [Yi 6B](https://huggingface.co/01-ai/Yi-1.5-6B) | [✔️](https://huggingface.co/mllmTeam/yi-1.5-6b-chat-mllm/tree/main) | [✔️](https://huggingface.co/mllmTeam/yi-1.5-6b-chat-mllm/tree/main) | |
| [StableLM 2 1.6B](https://github.com/Stability-AI/StableLM) | [✔️](https://huggingface.co/mllmTeam/stablelm-2-1.6b-chat-mllm/tree/main) | [✔️](https://huggingface.co/mllmTeam/stablelm-2-1.6b-chat-mllm/tree/main) | |
| [OPT 1.3B](https://github.com/facebookresearch/metaseq/tree/main/projects/OPT) | [✔️](https://huggingface.co/mllmTeam/opt-1.3b-mllm/tree/main) | [✔️](https://huggingface.co/mllmTeam/opt-1.3b-mllm/tree/main) | |
| [Phi 3 mini 3.8B](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) | [✔️](https://huggingface.co/mllmTeam/phi-3-mini-instruct-mllm/tree/main) | [✔️](https://huggingface.co/mllmTeam/phi-3-mini-instruct-mllm/tree/main) | |
| [MiniCPM 2B](https://huggingface.co/openbmb/MiniCPM-2B-dpo-fp32) | [✔️](https://huggingface.co/mllmTeam/minicpm-2b-dpo-mllm/tree/main) | [✔️](https://huggingface.co/mllmTeam/minicpm-2b-dpo-mllm/tree/main) | |
| [MiniCPM 3 4B](https://huggingface.co/openbmb/MiniCPM3-4B) | [✔️](https://huggingface.co/mllmTeam/minicpm3-4b-mllm/tree/main) | [✔️](https://huggingface.co/mllmTeam/minicpm3-4b-mllm/tree/main) | |
| [MiniCPM MoE 8x2B](https://huggingface.co/openbmb/MiniCPM-MoE-8x2B) | [✔️](https://huggingface.co/mllmTeam/minicpm-moe-8x2b-mllm/tree/main) | [✔️](https://huggingface.co/mllmTeam/minicpm-moe-8x2b-mllm/tree/main) | |
| [SmolLM 1.7B](https://huggingface.co/HuggingFaceTB/SmolLM-1.7B-Instruct) | [✔️](https://huggingface.co/mllmTeam/smollm-1.7b-instruct-mllm/tree/main) | [✔️](https://huggingface.co/mllmTeam/smollm-1.7b-instruct-mllm/tree/main) | |
| [DCLM 1B](https://huggingface.co/TRI-ML/DCLM-1B) | [✔️](https://huggingface.co/mllmTeam/dclm-1b-mllm/tree/main) | [✔️](https://huggingface.co/mllmTeam/dclm-1b-mllm/tree/main) | |
| [OpenELM 1.1B](https://github.com/apple/corenet/tree/main/projects/openelm) | [✔️](https://huggingface.co/mllmTeam/openelm-1.1b-mllm/tree/main) | [✔️](https://huggingface.co/mllmTeam/openelm-1.1b-mllm/tree/main) | |
| [PhoneLM 1.5B](https://github.com/UbiquitousLearning/PhoneLM) | [✔️](https://huggingface.co/mllmTeam/phonelm-1.5b-mllm/tree/main) | [✔️](https://huggingface.co/mllmTeam/phonelm-1.5b-mllm/tree/main) | [✔️](https://huggingface.co/mllmTeam/phonelm-1.5b-mllm/tree/main) |

### Multimodal models

| Model | CPU FP32 | CPU INT4 |
|---|---|---|
| [Fuyu 8B](https://www.adept.ai/blog/fuyu-8b) | [✔️](https://huggingface.co/mllmTeam/fuyu-8b-mllm/tree/main) | [✔️](https://huggingface.co/mllmTeam/fuyu-8b-mllm/tree/main) |
| [Vision Transformer](https://github.com/google-research/vision_transformer) | [✔️](https://huggingface.co/mllmTeam/vit-base-patch16-224-mllm/tree/main) | [✔️](https://huggingface.co/mllmTeam/vit-base-patch16-224-mllm/tree/main) |
| [CLIP](https://github.com/openai/CLIP) | [✔️](https://huggingface.co/mllmTeam/clip-vit-base-patch32-mllm/tree/main) | [✔️](https://huggingface.co/mllmTeam/clip-vit-base-patch32-mllm/tree/main) |
| [ImageBind](https://github.com/facebookresearch/ImageBind) (3 modalities) | [✔️](https://huggingface.co/mllmTeam/imagebind_huge-mllm/tree/main) | [✔️](https://huggingface.co/mllmTeam/imagebind_huge-mllm/tree/main) |
| [LLaVA 7B](https://github.com/haotian-liu/LLaVA) | [✔️](https://huggingface.co/mllmTeam/llava-1.5-7b-mllm/tree/main) | [✔️](https://huggingface.co/mllmTeam/llava-1.5-7b-mllm/tree/main) |
| [Phi 3 Vision](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct) | [✔️](https://huggingface.co/mllmTeam/Phi-3-vision-instruct-mllm/tree/main) | [✔️](https://huggingface.co/mllmTeam/Phi-3-vision-instruct-mllm/tree/main) |

## Quick Start

### Get the Code

```bash

git clone https://github.com/UbiquitousLearning/mllm

cd mllm

```

### Check prerequisites

Building mllm requires the following tools:

- gcc (11.4+) / clang (11.0+)

- CMake >= 3.18

- Android NDK Toolchains >= 26

> Note: building the OpenMP libraries on macOS may fail with the Apple LLVM compiler, so OpenMP is disabled on macOS by default and performance there may be slower. Building mllm on Linux is recommended.
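
The minimum versions listed above can be checked mechanically. The following is a minimal sketch and not part of mllm: `parse_version` and `meets_minimum` are hypothetical helpers that compare dotted version strings numerically. How the installed version is obtained (for example from the output of `cmake --version`) is left out.

```python
def parse_version(text):
    """Turn a dotted version such as '3.18.4' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(installed, minimum):
    """Return True if `installed` is at least `minimum` (both dotted strings)."""
    a, b = parse_version(installed), parse_version(minimum)
    # Pad the shorter tuple with zeros so '3.18' compares equal to '3.18.0'.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# Minimums copied from the prerequisites list; the dictionary keys are illustrative.
MINIMUMS = {"gcc": "11.4", "clang": "11.0", "cmake": "3.18", "ndk": "26"}

if __name__ == "__main__":
    print(meets_minimum("3.22.1", MINIMUMS["cmake"]))
```

Comparing integer tuples rather than raw strings avoids the usual pitfall where `"3.9"` sorts after `"3.18"` lexicographically.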

### Run Qwen with Hexagon NPU accelerating using QNN

*`NOTE:` The QNN backend is a preliminary version that can do end-to-end inference. It is still under active development for better performance and more supported models.*

We support running Qwen-1.5-1.8B-Chat using [Qualcomm QNN](https://www.qualcomm.com/developer/software/qualcomm-ai-engine-direct-sdk) to get Hexagon NPU acceleration on devices with Snapdragon 8 Gen3. Details of the QNN environment setup and design are [here](./src/backends/qnn/README.md). The prefilling stage is performed by QNN and the CPU, and the inference stage is performed by the CPU.

Build the target with the QNN backend.

```bash

cd scripts

./build_qnn_android.sh

```

Download the model from [here](https://huggingface.co/mllmTeam/qwen-1.5-1.8b-chat-mllm/blob/main/), or use the following commands:

```bash

mkdir ../models && cd ../models

# Download the int8 model used by the NPU and the q4k model used by the CPU
# (URLs are quoted so the shell does not treat '?' as a glob character)

wget "https://huggingface.co/mllmTeam/qwen-1.5-1.8b-chat-mllm/resolve/main/qwen-1.5-1.8b-chat-int8.mllm?download=true" -O qwen-1.5-1.8b-chat-int8.mllm

wget "https://huggingface.co/mllmTeam/qwen-1.5-1.8b-chat-mllm/resolve/main/qwen-1.5-1.8b-chat-q4k.mllm?download=true" -O qwen-1.5-1.8b-chat-q4k.mllm

```
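
The download URLs in this README share one pattern: organization, repository, `resolve/main`, file name, and a `?download=true` query. As a minimal sketch, the hypothetical helper `resolve_url` below rebuilds such a URL from its parts. It is an illustration of the pattern seen in the commands, not an mllm or Hugging Face API.

```python
from urllib.parse import quote

HF_BASE = "https://huggingface.co"


def resolve_url(repo, filename, org="mllmTeam", revision="main"):
    """Build the direct-download URL for one file in a model repository."""
    # Percent-encode each path segment separately so '/' stays a separator.
    parts = (org, repo, "resolve", revision, filename)
    path = "/".join(quote(part) for part in parts)
    return f"{HF_BASE}/{path}?download=true"


if __name__ == "__main__":
    print(resolve_url("qwen-1.5-1.8b-chat-mllm", "qwen-1.5-1.8b-chat-int8.mllm"))
```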

Run on an Android phone with at least 16GB of memory.

```bash

cd ../scripts

./run_qwen_npu.sh

```

The executable takes two arguments. `-s` sets the sequence length for prefilling; the default in the provided demo is 64. `-c` selects the QNN prefilling mode: when set to 1, the input is split into chunks of sequence length 32 that are executed in a pipeline; when set to 0, the input is executed as a single chunk.
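
To make the two `-c` modes concrete, here is a minimal sketch of the splitting step only. The function names are hypothetical, and the sketch leaves a final partial chunk shorter than 32; whether mllm pads that chunk is not stated in this README, so treat that detail as an assumption.

```python
def split_into_chunks(tokens, chunk_size=32):
    """Split a token sequence into consecutive chunks of at most `chunk_size`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def plan_prefill(tokens, chunked):
    """Mimic the two modes: chunked -> chunks of 32, otherwise one chunk."""
    return split_into_chunks(tokens) if chunked else [list(tokens)]


if __name__ == "__main__":
    prompt = list(range(64))  # stand-in for 64 token ids (the demo's -s value)
    print([len(chunk) for chunk in plan_prefill(prompt, chunked=True)])
    print([len(chunk) for chunk in plan_prefill(prompt, chunked=False)])
```

With the demo's sequence length of 64, the chunked mode yields two chunks of 32, which is what allows the stages to overlap in a pipeline.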

The results are as follows:

```

> ./main_qwen_npu -s 64 -c 1

[Q] <|im_start|>system

You are a helpful assistant.<|im_end|>

<|im_start|>user

Give me a short introduction to large language model.<|im_end|>

<|im_start|>assistant

[A] A short introduction to a large language model is a type of artificial intelligence language model that is designed to understand and generate human language text. These models are typically trained on large amounts of text data, such as books, articles, and other written materials, to learn the patterns and structures of human language. They use a combination of natural language processing (NLP)

```
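
The `[Q]` part of the transcript above uses `<|im_start|>` and `<|im_end|>` markers around each turn. A minimal sketch of assembling such a prompt follows; `build_chat_prompt` is a hypothetical helper, and the exact whitespace (one newline after each role name and after each end marker) is an assumption read off the transcript rather than taken from the mllm source.

```python
def build_chat_prompt(user_message, system_message="You are a helpful assistant."):
    """Assemble a system turn, a user turn, and an open assistant turn."""
    turns = [("system", system_message), ("user", user_message)]
    prompt = "".join(
        f"<|im_start|>{role}\n{text}<|im_end|>\n" for role, text in turns
    )
    # Leave the assistant turn open so the model continues from here.
    return prompt + "<|im_start|>assistant\n"


if __name__ == "__main__":
    print(build_chat_prompt("Give me a short introduction to large language model."))
```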

### Run with the CPU of Android

#### Build

  ```bash

  export ANDROID_NDK=/path/to/your/ndk

  cd scripts

  ./build_android.sh

  ```

#### Run Fuyu-8B

Download the model from [here](https://huggingface.co/mllmTeam/fuyu-8b-mllm/tree/main/), or use the following commands:

```bash

mkdir ../models && cd ../models

# Download fuyu-8b-q4_k.mllm

wget "https://huggingface.co/mllmTeam/fuyu-8b-mllm/resolve/main/fuyu-8b-q4_k.mllm?download=true" -O fuyu-8b-q4_k.mllm

```

Run on an Android phone with at least 12GB of memory.

```bash

cd ../scripts

./run_fuyu.sh

```

The results are as follows:

```

> ./demo_fuyu

[Q] [../assets/bus.png]Generate a coco-style caption.

[A]  A blue bus driving down▁the▁road next▁to a sidewalk.

[Q] [../assets/two_cats.jpg]What's this?

[A]  Two cats laying on▁top▁of a pink blanket.

```

#### Run LLaMA-2-7B

Download the model from [here](https://huggingface.co/mllmTeam/llama-2-7b-mllm/tree/main/), or use the following commands:

```bash

mkdir ../models && cd ../models

# Download llama-2-7b-chat-q4_k.mllm

wget "https://huggingface.co/mllmTeam/llama-2-7b-mllm/resolve/main/llama-2-7b-chat-q4_k.mllm?download=true" -O llama-2-7b-chat-q4_k.mllm

```

Run on an Android phone with at least 12GB of memory.

```bash

cd ../scripts

./run_llama.sh

```

The results are as follows:

```

> ./demo_llama

[Q]  Hello, who are you?

[A]

I am a bot, I'm here to help answer your questions and provide information. I'm just an AI, I don't have personal experiences or feelings like humans do, but I'm here to help you with any questions you may have. Is there something specific you would like to know or talk about?

[Q]  What can you do?

[A]

I can help with a wide range of tasks, including:

  

* Answering questions on various topics, such as science, history, technology, and more.

* Providing information on different subjects, such as culture, entertainment, and lifestyle.

* Helping with language-related tasks, such as translating text or improving grammar and syntax.

* Generating text, such as articles, stories, or even entire books.

* Assisting with coding

[Q]  Please introduce Beijing University of Posts and Telecommunications.

[A]

Beijing University of Posts and Telecommunications (BUPT) is a public research university located in Beijing, China. It was established in 1955 and is one of the top universities in China, with a strong focus on engineering, technology, and telecommunications.

BUPT offers a wide range of undergraduate and graduate programs in fields such as computer science, electrical engineering, software engineering, and communication engineering. The university

```

#### Run ImageBind

Download the model from [here](https://huggingface.co/mllmTeam/imagebind_huge-mllm/tree/main), or use the following commands:

```bash

mkdir ../models && cd ../models

# Download imagebind_huge-q4_k.mllm

wget "https://huggingface.co/mllmTeam/imagebind_huge-mllm/resolve/main/imagebind_huge-q4_k.mllm?download=true" -O imagebind_huge-q4_k.mllm

```

Run on an Android phone with at least 4GB of memory.

```bash

cd ../scripts

./run_imagebind.sh

```

The results are as follows:

```

> ./demo_imagebind 

vision X text :

0.9985647 0.0013827 0.0000526 

0.0000365 0.9998636 0.0000999 

0.0000115 0.0083149 0.9916736 

vision X audio :

0.8054272 0.1228001 0.0717727 

0.0673458 0.8429284 0.0897258 

0.0021967 0.0015335 0.9962698 

```
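
Each row in the two matrices above sums to 1, which suggests the scores are normalized per row (for example by a softmax); that reading is an inference from the numbers, not something the output states. Under the further assumption that rows index the images and columns index the text or audio inputs, the best match for each row is simply the largest entry. A minimal sketch with hypothetical helper names:

```python
# Scores copied from the demo output above.
VISION_X_TEXT = [
    [0.9985647, 0.0013827, 0.0000526],
    [0.0000365, 0.9998636, 0.0000999],
    [0.0000115, 0.0083149, 0.9916736],
]
VISION_X_AUDIO = [
    [0.8054272, 0.1228001, 0.0717727],
    [0.0673458, 0.8429284, 0.0897258],
    [0.0021967, 0.0015335, 0.9962698],
]


def best_matches(scores):
    """For each row, return the column index holding the highest score."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


def rows_are_normalized(scores, tol=1e-5):
    """Check that every row sums to 1 within a small tolerance."""
    return all(abs(sum(row) - 1.0) <= tol for row in scores)


if __name__ == "__main__":
    for name, matrix in (("text", VISION_X_TEXT), ("audio", VISION_X_AUDIO)):
        print(name, best_matches(matrix), rows_are_normalized(matrix))
```

In both matrices the largest value of each row sits on the diagonal, so the i-th image pairs with the i-th text and the i-th audio input.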

### Run for Linux

#### Build

```bash

cd scripts

./build.sh

 ```

#### Run Fuyu-8B

```bash

cd ./bin

./demo_fuyu -m ../models/fuyu-8b-q4_k.mllm -v ../vocab/fuyu_vocab.mllm

 ```

#### Run LLaMA-2-7B

```bash

cd ./bin

./demo_llama -m ../models/llama-2-7b-chat-q4_k.mllm -v ../vocab/llama2_vocab.mllm

```

#### Run ImageBind

```bash

cd ./bin

./demo_imagebind -m ../models/imagebind_huge-q4_k.mllm -v ../vocab/clip_vocab.mllm

```
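
The three Linux commands above share the same shape: a demo binary, `-m` for the model file, and `-v` for the vocabulary file. The sketch below collects those pairings in one table and builds the argument list, which can be handy when scripting runs. `DEMOS` and `demo_command` are hypothetical names, and the paths are the ones shown above, relative to `bin`.

```python
import shlex

# Demo binary -> (model path, vocabulary path), copied from the commands above.
DEMOS = {
    "demo_fuyu": ("../models/fuyu-8b-q4_k.mllm", "../vocab/fuyu_vocab.mllm"),
    "demo_llama": ("../models/llama-2-7b-chat-q4_k.mllm", "../vocab/llama2_vocab.mllm"),
    "demo_imagebind": ("../models/imagebind_huge-q4_k.mllm", "../vocab/clip_vocab.mllm"),
}


def demo_command(name):
    """Return the argument list for one demo, suitable for subprocess.run."""
    model, vocab = DEMOS[name]
    return [f"./{name}", "-m", model, "-v", vocab]


if __name__ == "__main__":
    for name in DEMOS:
        print(shlex.join(demo_command(name)))
```

Passing the list to `subprocess.run(demo_command("demo_llama"), cwd="bin")` would launch the demo without going through a shell.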

## Customization

### Convert models

You can download models from [here](https://huggingface.co/mllmTeam), or convert a PyTorch/safetensors model to an mllm model yourself.

```bash

cd tools/convertor

pip install -r ./requirements.txt

# Single-file PyTorch model

python converter.py --input_model=model.pth --output_model=model.mllm --type=torch

# Multi-file PyTorch model (pass the index file)

python converter.py --input_model=pytorch_model.bin.index.json --output_model=model.mllm --type=torch

# Single-file safetensors model

python converter.py --input_model=model.safetensors --output_model=model.mllm --type=safetensor

# Multi-file safetensors model (pass the index file)

python converter.py --input_model=model.safetensors.index.json --output_model=model.mllm --type=safetensor

``` 
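
The four invocations above differ only in the input file and the `--type` value. As a minimal sketch, the hypothetical helper `converter_args` below infers `--type` from the file name and builds the argument list. It uses only the three flags shown above; the extension-to-type mapping is an assumption drawn from those examples, and `converter.py` itself may accept other inputs.

```python
def converter_args(input_model, output_model="model.mllm"):
    """Build the converter.py argument list, inferring --type from the file name."""
    name = input_model.lower()
    if name.endswith((".safetensors", ".safetensors.index.json")):
        model_type = "safetensor"
    elif name.endswith((".pth", ".bin.index.json")):
        model_type = "torch"
    else:
        # Refuse to guess rather than pass a wrong --type to the converter.
        raise ValueError(f"cannot infer --type from {input_model!r}")
    return [
        "python",
        "converter.py",
        f"--input_model={input_model}",
        f"--output_model={output_model}",
        f"--type={model_type}",
    ]


if __name__ == "__main__":
    print(" ".join(converter_args("model.safetensors.index.json")))
```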

### Convert vocabulary

You can convert a vocabulary to the mllm vocabulary format as follows.

```bash

cd tools/convertor

python vocab.py --input_file=tokenizer.json --output_file=vocab.mllm --type=Unigram

```

### Quantize models

You can quantize an mllm model to an int4 model yourself.

mllm supports only two quantization modes: Q4_0 and Q4_K.

```bash

cd bin

./quantize model.mllm model_q4_k.mllm Q4_K

```
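
For intuition about what a 4-bit block format stores, here is a minimal sketch of symmetric block quantization in the style of ggml's Q4_0: weights are grouped into blocks, and each block keeps one scale plus one 4-bit code per weight. This illustrates the general technique only. It is not mllm's quantizer, and the block size of 32 and the 16-bit scale are assumptions borrowed from ggml's layout.

```python
BLOCK_SIZE = 32  # weights per block in ggml's Q4_0; assumed, not confirmed for mllm


def quantize_block(values):
    """Quantize one block of floats to 4-bit codes (0..15) plus one scale."""
    peak = max(values, key=abs)   # signed value with the largest magnitude
    scale = peak / -8.0           # maps `peak` itself to code 0
    inv = 1.0 / scale if scale else 0.0
    # Shift by 8 to make codes unsigned; adding 0.5 before truncation rounds.
    codes = [min(15, int(v * inv + 8.5)) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the scale and the 4-bit codes."""
    return [(code - 8) * scale for code in codes]


def bits_per_weight(block_size=BLOCK_SIZE, scale_bits=16):
    """Average storage cost per weight, including the per-block scale."""
    return (4 * block_size + scale_bits) / block_size


if __name__ == "__main__":
    scale, codes = quantize_block([0.0, 1.0, -2.0, 4.0])
    print(scale, codes, dequantize_block(scale, codes))
    print(bits_per_weight())
```

Under these assumptions a block costs slightly more than 4 bits per weight because the scale is amortized over the block, which is why a quantized file is a bit larger than a quarter of its 16-bit counterpart.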

## Roadmap

- More backends like QNN

- More models like PandaGPT

- More optimizations like LUT-GEMM

- [More..](https://ubiquitouslearning.github.io/mllm_website/roadmap/roadmap/)

## Documentation

See the [documentation](https://ubiquitouslearning.github.io/mllm_website/introduction/getstarted/) for more information.

## Contribution

Read the [contribution guide](https://ubiquitouslearning.github.io/mllm_website/contributing/contributing/) before you contribute.

## Acknowledgments

mllm reuses many low-level kernel implementations from [ggml](https://github.com/ggerganov/ggml) on ARM CPU. It also uses [stb](https://github.com/nothings/stb) and [wenet](https://github.com/wenet-e2e/wenet) for pre-processing images and audio.

mllm has also benefited from the following projects: [llama.cpp](https://github.com/ggerganov/llama.cpp) and [MNN](https://github.com/alibaba/MNN).

## License

### Overall Project License

This project is licensed under the terms of the MIT License. Please see the [LICENSE](LICENSE) file in the root directory for the full text of the MIT License.

### Apache 2.0 Licensed Components

Certain components ([wenet](https://github.com/wenet-e2e/wenet)) of this project are licensed under the Apache License 2.0. These components are clearly identified in their respective subdirectories, along with a copy of the Apache License 2.0. For the full text of the Apache License 2.0, please refer to the [LICENSE-APACHE](third_party/wenet_audio/LICENSE) file located in the relevant subdirectories.

## Citation

```

@inproceedings{xu2025fast,

  title={Fast On-device LLM Inference with NPUs},

  author={Xu, Daliang and Zhang, Hao and Yang, Liming and Liu, Ruiqi and Huang, Gang and Xu, Mengwei and Liu, Xuanzhe},

  booktitle={International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},

  year={2025}

}

@misc{yi2023mllm,

  title = {mllm: fast and lightweight multimodal LLM inference engine for mobile and edge devices},

  author = {Rongjie Yi and Xiang Li and Zhenyan Lu and Hao Zhang and Daliang Xu and Liming Yang and Weikai Xie and Chenghua Wang and Xuanzhe Liu and Mengwei Xu},

  year = {2023},

  publisher = {mllm Team},

  url = {https://github.com/UbiquitousLearning/mllm}

}

```