![tinychat_logo](assets/figures/tinychat_logo.png)

# TinyChatEngine: On-Device LLM/VLM Inference Library

Running large language models (LLMs) and visual language models (VLMs) on the edge is useful: copilot services (coding, office, smart reply) on laptops, cars, robots, and more. Users get instant responses with better privacy, as the data stays local.

This is enabled by LLM model compression techniques, [SmoothQuant](https://github.com/mit-han-lab/smoothquant) and [AWQ (Activation-aware Weight Quantization)](https://github.com/mit-han-lab/llm-awq), co-designed with TinyChatEngine, which implements the compressed low-precision models.
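To give a sense of what low-precision compression looks like, here is a minimal Python sketch of symmetric, group-wise INT4 weight quantization. It is illustrative only: the actual AWQ/TinyChatEngine formats add details such as activation-aware scale search, zero points, and device-specific weight packing, and the function names here are hypothetical.

```python
# Minimal sketch of group-wise symmetric INT4 quantization (illustrative;
# not the actual TinyChatEngine/AWQ format).

def quantize_int4(weights, group_size=32):
    """Quantize a flat list of float weights to int4 values in [-8, 7],
    with one fp32 scale shared per group of `group_size` weights."""
    assert len(weights) % group_size == 0
    qweights, scales = [], []
    for g in range(0, len(weights), group_size):
        group = weights[g:g + group_size]
        scale = max(abs(w) for w in group) / 7.0 or 1.0  # avoid zero scale
        scales.append(scale)
        qweights.extend(max(-8, min(7, round(w / scale))) for w in group)
    return qweights, scales

def dequantize_int4(qweights, scales, group_size=32):
    """Recover approximate float weights from int4 values and scales."""
    return [q * scales[i // group_size] for i, q in enumerate(qweights)]
```

Each group of 32 weights shares one scale, so the stored model shrinks to roughly 4 bits per weight plus a small per-group overhead.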
Feel free to check out our [slides](assets/slides.pdf) for more details!

### Code LLaMA Demo on NVIDIA GeForce RTX 4070 laptop:
![coding_demo_gpu](assets/figures/coding_demo_gpu.gif)

### VILA Demo on Apple MacBook M1 Pro:
![vlm_demo_m1](assets/figures/vlm_demo_m1.gif)

### LLaMA Chat Demo on Apple MacBook M1 Pro:
![chat_demo_m1](assets/figures/chat_demo_m1.gif)


## Overview
### LLM Compression: SmoothQuant and AWQ
[SmoothQuant](https://github.com/mit-han-lab/smoothquant): Smooth the activation outliers by migrating the quantization difficulty from activations to weights, with a mathematically equivalent transformation (100\*1 = 10\*10).

![smoothquant_intuition](assets/figures/smoothquant_intuition.png)

[AWQ (Activation-aware Weight Quantization)](https://github.com/mit-han-lab/llm-awq): Protect salient weight channels by analyzing activation magnitudes rather than the weights themselves.

### LLM Inference Engine: TinyChatEngine
- **Universal**: x86 (Intel/AMD), ARM (Apple M1/M2, Raspberry Pi), CUDA (Nvidia GPU).
- **No library dependency**: From-scratch C/C++ implementation.
- **High performance**: Real-time on MacBook & GeForce laptops.
- **Easy to use**: Download, compile, and you are ready to go!

![overview](assets/figures/overview.png)


## News

- **(2024/05)** 🏆 AWQ and TinyChat received the **Best Paper Award** at **MLSys 2024**. 🎉
- **(2024/05)** 🔥 We released support for the **Llama-3** model family! Check out our [example](#step-by-step-to-deploy-llama-3-8b-instruct-with-tinychatengine) and [model zoo](https://huggingface.co/mit-han-lab/tinychatengine-model-zoo).
- **(2024/02)** 🔥 AWQ and TinyChat have been accepted to **MLSys 2024**!
- **(2024/02)** 🔥 We extended support for **vision language models (VLMs)**.
Feel free to try running **[VILA](#deploy-vision-language-model-vlm-chatbot-with-tinychatengine)** on your edge device.
<!-- - **(2024/01)** 🔥We released TinyVoiceChat, a voice chatbot that can be deployed on your edge devices, such as MacBook and Jetson Orin Nano. Check out our [demo video](https://youtu.be/Bw5Dm3aWMnA?si=CCvZDmq3HwowEQcC) and follow the [instructions](#deploy-speech-to-speech-chatbot-with-tinychatengine-demo) to deploy it on your device! -->
- **(2023/10)** We extended support for the coding assistant [Code Llama](#tinychatengine-model-zoo). Feel free to check out our [model zoo](https://huggingface.co/mit-han-lab/tinychatengine-model-zoo).
- **(2023/10)** ⚡ We released a new CUDA backend that supports Nvidia GPUs with compute capability >= 6.1, for both server and edge GPUs. It is also ~40% faster than the previous version. Feel free to check it out!


## Prerequisites

### MacOS

For MacOS, install boost and llvm:

```bash
brew install boost
brew install llvm
```

For M1/M2 users, install Xcode from the App Store to enable the Metal compiler for GPU support.

### Windows with CPU

For Windows, download and install the GCC compiler with MSYS2. Follow this tutorial for installation: https://code.visualstudio.com/docs/cpp/config-mingw.

- Install the required dependencies with MSYS2:

```
pacman -S --needed base-devel mingw-w64-x86_64-toolchain make unzip git
```

- Add the binary directories (e.g., C:\msys64\mingw64\bin and C:\msys64\usr\bin) to the PATH environment variable.

### Windows with Nvidia GPU (Experimental)

- Install the CUDA toolkit for Windows ([link](https://developer.nvidia.com/cuda-toolkit)).
  When installing CUDA, please choose an installation path that does not contain spaces.

- Install Visual Studio with C and C++ support by following the [instructions](https://learn.microsoft.com/en-us/cpp/build/vscpp-step-0-installation?view=msvc-170).

- Follow the instructions below and use the x64 Native Tools Command Prompt from Visual Studio to compile TinyChatEngine.


## Step-by-step to Deploy Llama-3-8B-Instruct with TinyChatEngine

Here, we provide step-by-step instructions to deploy Llama-3-8B-Instruct with TinyChatEngine from scratch.

- Download the repo.
  ```bash
  git clone --recursive https://github.com/mit-han-lab/TinyChatEngine
  cd TinyChatEngine
  ```

- Install Python packages.
  - The primary codebase of TinyChatEngine is written in pure C/C++. The Python packages are only used for downloading (and converting) models from our model zoo.
    ```bash
    conda create -n TinyChatEngine python=3.10 pip -y
    conda activate TinyChatEngine
    pip install -r requirements.txt
    ```
- Download the quantized Llama model from our model zoo.
  ```bash
  cd llm
  ```
  - On an x86 device (e.g., Intel/AMD laptop)
    ```bash
    python tools/download_model.py --model LLaMA_3_8B_Instruct_awq_int4 --QM QM_x86
    ```
  - On an ARM device (e.g., M1/M2 MacBook, Raspberry Pi)
    ```bash
    python tools/download_model.py --model LLaMA_3_8B_Instruct_awq_int4 --QM QM_ARM
    ```
  - On a CUDA device (e.g., Jetson AGX Orin, PC/Server)
    ```bash
    python tools/download_model.py --model LLaMA2_7B_chat_awq_int4 --QM QM_CUDA
    ```
  - Check this [table](#tinychatengine-model-zoo) for the detailed list of supported models.
- *(CUDA only)* Based on the platform you are using and the compute capability of your GPU, modify the Makefile accordingly. If using Windows with an Nvidia GPU, please modify `-arch=sm_xx` in [Line 54](llm/Makefile#L54).
  If using other platforms with an Nvidia GPU, please modify `-gencode arch=compute_xx,code=sm_xx` in [Line 60](llm/Makefile#L60).
- Compile and start the chat locally.
  ```bash
  make chat -j
  ./chat

  TinyChatEngine by MIT HAN Lab: https://github.com/mit-han-lab/TinyChatEngine
  Using model: LLaMA_3_8B_Instruct
  Using AWQ for 4bit quantization: https://github.com/mit-han-lab/llm-awq
  Loading model... Finished!
  USER: Write a syllabus for the parallel computing course.
  ASSISTANT: Here is a sample syllabus for a parallel computing course:

  **Course Title:** Parallel Computing
  **Instructor:** [Name]
  **Description:** This course covers the fundamental concepts of parallel computing, including parallel algorithms, programming models, and architectures. Students will learn how to design, implement, and optimize parallel programs using various languages and frameworks.
  **Prerequisites:** Basic knowledge of computer science and programming concepts.
  **Course Objectives:**
  * Understand the principles of parallelism and its applications
  * Learn how to write parallel programs using different languages (e.g., OpenMP, MPI)
  ...
  ```


<!-- ## Deploy speech-to-speech chatbot with TinyChatEngine [[Demo]](https://youtu.be/Bw5Dm3aWMnA?si=CCvZDmq3HwowEQcC)

TinyChatEngine offers versatile capabilities suitable for various applications. Additionally, we introduce a sophisticated voice chatbot. Here, we provide easy-to-follow instructions to deploy a speech-to-speech chatbot (Llama-3-8B-Instruct) with TinyChatEngine.
- Follow the instructions above to set up the basic environment, i.e., [Prerequisites](#prerequisites) and [Step-by-step to Deploy Llama-3-8B-Instruct with TinyChatEngine](#step-by-step-to-deploy-llama-3-8b-instruct-with-tinychatengine).

- Run the shell script to set up the environment for the speech-to-speech chatbot.
  ```bash
  cd llm
  ./voicechat_setup.sh
  ```

- Start the speech-to-speech chat locally.
  ```bash
  ./voicechat  # chat.exe -v on Windows
  ```

- If you encounter any issues or errors during setup, please explore [here](llm/application/README.md) to follow the step-by-step guide to debug.
 -->

## Deploy vision language model (VLM) chatbot with TinyChatEngine

<!-- TinyChatEngine supports not only LLM but also VLM. We introduce a sophisticated text/voice chatbot for VLM. Here, we provide easy-to-follow instructions to deploy vision language model chatbot (VILA-7B) with TinyChatEngine. We recommend using M1/M2 MacBooks for this VLM feature. -->
TinyChatEngine supports not only LLMs but also VLMs. We introduce a sophisticated chatbot for VLMs. Here, we provide easy-to-follow instructions to deploy a vision language model chatbot (VILA-7B) with TinyChatEngine.
We recommend using M1/M2 MacBooks for this VLM feature.

- Follow the instructions above to set up the basic environment, i.e., [Prerequisites](#prerequisites) and [Step-by-step to Deploy Llama-3-8B-Instruct with TinyChatEngine](#step-by-step-to-deploy-llama-3-8b-instruct-with-tinychatengine).

- To display images in the terminal, please download and install the following toolkit.
  - Install [termvisage](https://github.com/AnonymouX47/termvisage).
  - (For MacOS) Install [iTerm2](https://iterm2.com/index.html).
  - (For other OS) Please refer to [here](https://github.com/AnonymouX47/termvisage?tab=readme-ov-file#requirements) to get the appropriate terminal ready.

<!-- - (Optional) To enable the speech-to-speech chatbot for VLM, please follow the [instruction above](#deploy-speech-to-speech-chatbot-with-tinychatengine-demo) to run the shell script to set up the environment.
  ```bash
  cd llm
  ./voicechat_setup.sh
  ``` -->

- Download the quantized VILA-7B model from our model zoo.

  - On an x86 device (e.g., Intel/AMD laptop)
    ```bash
    python tools/download_model.py --model VILA_7B_awq_int4_CLIP_ViT-L --QM QM_x86
    ```
  - On an ARM device (e.g., M1/M2 MacBook, Raspberry Pi)
    ```bash
    python tools/download_model.py --model VILA_7B_awq_int4_CLIP_ViT-L --QM QM_ARM
    ```

- (For MacOS) Start the chatbot locally. Please use an appropriate terminal (e.g., iTerm2).
  - Image/Text to text
    ```bash
    ./vila ../assets/figures/vlm_demo/pedestrian.png
    ```

  <!-- - Image/Speech to speech
    ```bash
    ./voice_vila ../assets/figures/vlm_demo/pedestrian.png
    ``` -->

    - There are several images under the path `../assets/figures/vlm_demo`. Feel free to try different images with VILA on your device!

  <!-- - For other OS, please modify Line 4 in [vila.sh](llm/scripts/vila.sh) and [voice_vila.sh](llm/scripts/voice_vila.sh) to use the correct terminal.
 -->
  - For other OS, please modify Line 4 in [vila.sh](llm/scripts/vila.sh) to use the correct terminal.

## Backend Support

| Precision | x86<br /> (Intel/AMD CPU) | ARM<br /> (Apple M1/M2 & RPi) | Nvidia GPU |
| ------ | --------------------------- | --------- | --------- |
| FP32   |  ✅  |  ✅  |      |
| W4A16  |      |      |  ✅  |
| W4A32  |  ✅  |  ✅  |      |
| W4A8   |  ✅  |  ✅  |      |
| W8A8   |  ✅  |  ✅  |      |

- For Raspberry Pi, we recommend a board with 8 GB of RAM. Our testing was primarily conducted on a Raspberry Pi 4 Model B Rev 1.4 with aarch64. For other versions, please feel free to try it out and let us know if you encounter any issues.
- For Nvidia GPUs, our CUDA backend supports compute capability >= 6.1. For GPUs with compute capability < 6.1, please feel free to try it out, but we haven't tested them yet and thus cannot guarantee the results.

## Quantization and Model Support

The goal of TinyChatEngine is to support various quantization methods on various devices. At present, it supports quantized weights for int8 OPT models that originate from [smoothquant](https://github.com/mit-han-lab/smoothquant), using the provided conversion script [opt_smooth_exporter.py](llm/tools/opt_smooth_exporter.py). For LLaMA models, scripts are available for converting Huggingface-format checkpoints to our int4 weight [format](llm/tools/llama_exporter.py), and for quantizing them with the method [suited to your device](llm/tools/model_quantizer.py). Before converting and quantizing your models, it is recommended to apply the fake quantization from [AWQ](https://github.com/mit-han-lab/llm-awq) to achieve better accuracy.
We are currently working on supporting more models; please stay tuned!

### Device-specific int4 Weight Reordering

To mitigate the runtime overhead associated with weight reordering, TinyChatEngine performs this process offline during model conversion. In this section, we explore the weight layouts of QM_ARM and QM_x86. These layouts are tailored for ARM and x86 CPUs, supporting 128-bit and 256-bit SIMD operations, respectively. We also support QM_CUDA for Nvidia GPUs, including server and edge GPUs.

| Platforms  | ISA | Quantization methods |
| ------------- | ------------- |  ------------- |
| Intel & AMD |  x86-64  | QM_x86  |
| Apple M1/M2 Mac & Raspberry Pi | ARM | QM_ARM  |
| Nvidia GPU | CUDA | QM_CUDA  |

- Example layout of QM_ARM: For QM_ARM, consider the initial configuration of a 128-bit weight vector, \[w0, w1, ..., w30, w31\], where each wi is a 4-bit quantized weight. TinyChatEngine rearranges these weights into the sequence \[w0, w16, w1, w17, ..., w15, w31\] by interleaving the lower half and upper half of the weights. This arrangement lets both halves be decoded with 128-bit AND and shift operations, as depicted in the subsequent figure, eliminating runtime reordering overhead and improving performance.

## TinyChatEngine Model Zoo

We offer a selection of models that have been tested with TinyChatEngine. These models can be readily downloaded and deployed on your device. To download a model, locate the target model's ID in the table below and use the associated script.
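As a concrete illustration, the QM_ARM interleaving described in the weight-reordering section above can be sketched in a few lines of Python (illustrative only; the real reordering operates on packed 4-bit weights inside the model conversion tools):

```python
# Sketch of QM_ARM weight reordering: a 128-bit vector holds 32 4-bit
# weights [w0, w1, ..., w31]. Interleaving the lower and upper halves to
# [w0, w16, w1, w17, ..., w15, w31] places w_i next to w_(i+16), so the
# two halves can be decoded with a single 128-bit AND and a shift.

def reorder_qm_arm(block):
    """Interleave a 32-weight block: [w0..w31] -> [w0, w16, w1, w17, ...]."""
    assert len(block) == 32
    out = []
    for i in range(16):
        out.append(block[i])       # lower half: w0..w15
        out.append(block[i + 16])  # upper half: w16..w31
    return out
```

Because this reordering happens offline during conversion, the runtime kernel only pays for the cheap AND/shift decode.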
Check out our model zoo [here](https://huggingface.co/mit-han-lab/tinychatengine-model-zoo).

| Models | Precision | ID | x86 backend | ARM backend | CUDA backend |
| ------ | --------- | -- | ----------- | ----------- | ------------ |
| LLaMA_3_8B_Instruct | fp32 | LLaMA_3_8B_Instruct_fp32 | ✅ | ✅ |  |
|  | int4 | LLaMA_3_8B_Instruct_awq_int4 | ✅ | ✅ |  |
| LLaMA2_13B_chat | fp32 | LLaMA2_13B_chat_fp32 | ✅ | ✅ |  |
|  | int4 | LLaMA2_13B_chat_awq_int4 | ✅ | ✅ | ✅ |
| LLaMA2_7B_chat | fp32 | LLaMA2_7B_chat_fp32 | ✅ | ✅ |  |
|  | int4 | LLaMA2_7B_chat_awq_int4 | ✅ | ✅ | ✅ |
| LLaMA_7B | fp32 | LLaMA_7B_fp32 | ✅ | ✅ |  |
|  | int4 | LLaMA_7B_awq_int4 | ✅ | ✅ | ✅ |
| CodeLLaMA_13B_Instruct | fp32 | CodeLLaMA_13B_Instruct_fp32 | ✅ | ✅ |  |
|  | int4 | CodeLLaMA_13B_Instruct_awq_int4 | ✅ | ✅ | ✅ |
| CodeLLaMA_7B_Instruct | fp32 | CodeLLaMA_7B_Instruct_fp32 | ✅ | ✅ |  |
|  | int4 | CodeLLaMA_7B_Instruct_awq_int4 | ✅ | ✅ | ✅ |
| Mistral-7B-Instruct-v0.2 | fp32 | Mistral_7B_v0.2_Instruct_fp32 | ✅ | ✅ |  |
|  | int4 | Mistral_7B_v0.2_Instruct_awq_int4 | ✅ | ✅ |  |
| VILA-7B | fp32 | VILA_7B_CLIP_ViT-L_fp32 | ✅ | ✅ |  |
|  | int4 | VILA_7B_awq_int4_CLIP_ViT-L | ✅ | ✅ |  |
| LLaVA-v1.5-13B | fp32 | LLaVA_13B_CLIP_ViT-L_fp32 | ✅ | ✅ |  |
|  | int4 | LLaVA_13B_awq_int4_CLIP_ViT-L | ✅ | ✅ |  |
| LLaVA-v1.5-7B | fp32 | LLaVA_7B_CLIP_ViT-L_fp32 | ✅ | ✅ |  |
|  | int4 | LLaVA_7B_awq_int4_CLIP_ViT-L | ✅ | ✅ |  |
| StarCoder | fp32 | StarCoder_15.5B_fp32 | ✅ | ✅ |  |
|  | int4 | StarCoder_15.5B_awq_int4 | ✅ | ✅ |  |
| opt-6.7B | fp32 | opt_6.7B_fp32 | ✅ | ✅ |  |
|  | int8 | opt_6.7B_smooth_int8 | ✅ | ✅ |  |
|  | int4 | opt_6.7B_awq_int4 | ✅ | ✅ |  |
| opt-1.3B | fp32 | opt_1.3B_fp32 | ✅ | ✅ |  |
|  | int8 | opt_1.3B_smooth_int8 | ✅ | ✅ |  |
|  | int4 | opt_1.3B_awq_int4 | ✅ | ✅ |  |
| opt-125m | fp32 | opt_125m_fp32 | ✅ | ✅ |  |
|  | int8 | opt_125m_smooth_int8 | ✅ | ✅ |  |
|  | int4 | opt_125m_awq_int4 | ✅ | ✅ |  |

For instance, to download the quantized LLaMA-2-7B-chat model (for int4 models, use `--QM` to choose the quantized model for your device):

- On an Intel/AMD laptop:
  ```bash
  python tools/download_model.py --model LLaMA2_7B_chat_awq_int4 --QM QM_x86
  ```
- On an M1/M2 MacBook:
  ```bash
  python tools/download_model.py --model LLaMA2_7B_chat_awq_int4 --QM QM_ARM
  ```
- On an Nvidia GPU:
  ```bash
  python tools/download_model.py --model LLaMA2_7B_chat_awq_int4 --QM QM_CUDA
  ```

To deploy a quantized model with TinyChatEngine, compile and run the chat program.

- On CPU platforms
```bash
make chat -j
# ./chat <model_name> <precision> <num_threads>
./chat LLaMA2_7B_chat INT4 8
```

- On GPU platforms
```bash
make chat -j
# ./chat <model_name> <precision>
./chat LLaMA2_7B_chat INT4
```


## Related Projects

[TinyEngine: Memory-efficient and High-performance Neural Network Library for Microcontrollers](https://github.com/mit-han-lab/tinyengine)

[SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models](https://github.com/mit-han-lab/smoothquant)

[AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://github.com/mit-han-lab/llm-awq)


## Acknowledgement

[llama.cpp](https://github.com/ggerganov/llama.cpp)

[whisper.cpp](https://github.com/ggerganov/whisper.cpp)

[transformers](https://github.com/huggingface/transformers)