https://github.com/NexaAI/nexa-sdk
Nexa SDK is a comprehensive toolkit for supporting GGML and ONNX models. It supports text generation, image generation, vision-language models (VLM), audio language models, automatic speech recognition (ASR), and text-to-speech (TTS).
- Host: GitHub
- URL: https://github.com/NexaAI/nexa-sdk
- Owner: NexaAI
- License: apache-2.0
- Created: 2024-08-16T20:13:07.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-02-03T21:26:12.000Z (11 months ago)
- Last Synced: 2025-02-04T15:01:04.629Z (11 months ago)
- Topics: asr, audio, edge-computing, language-model, llm, on-device-ai, on-device-ml, sdk, sdk-python, stable-diffusion, transformers, tts, vlm, whisper
- Language: Python
- Homepage: https://docs.nexa.ai/
- Size: 195 MB
- Stars: 4,315
- Watchers: 424
- Forks: 613
- Open Issues: 54
Metadata Files:
- Readme: README.md
- License: LICENSE
- Codeowners: .github/CODEOWNERS
Awesome Lists containing this project
- StarryDivineSky - NexaAI/nexa-sdk
README
# NexaSDK - Run any AI model on any backend
NexaSDK is an easy-to-use developer toolkit for running any AI model locally — across NPUs, GPUs, and CPUs — powered by our NexaML engine, built entirely from scratch for peak performance on every hardware stack. Unlike wrappers that depend on existing runtimes, NexaML is a unified inference engine built at the kernel level. It’s what lets NexaSDK achieve Day-0 support for new model architectures (LLMs, multimodal, audio, vision). NexaML supports 3 model formats: GGUF, MLX, and Nexa AI's own `.nexa` format.
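As a quick sketch of what "one line of code" means in practice, the same `nexa infer` command covers all three formats. The model names below all appear later in this README; which ones run depends on your hardware and platform:
```bash
# GGUF build (CPU/GPU on macOS, Linux, Windows)
nexa infer NexaAI/Qwen3-VL-4B-Instruct-GGUF

# MLX build (Apple Silicon macOS only)
nexa infer NexaAI/Qwen3-4B-4bit-MLX

# NPU build in the .nexa format (see the Qualcomm NPU setup below)
nexa infer NexaAI/OmniNeural-4B
```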
### ⚙️ Differentiation
| Features | **NexaSDK** | **Ollama** | **llama.cpp** | **LM Studio** |
|----------|--------------|-------------|----------------|----------------|
| NPU support | ✅ NPU-first | ❌ | ❌ | ❌ |
| Support any model in GGUF, MLX, NEXA format | ✅ Low-level Control | ❌ | ⚠️ | ❌ |
| Full multimodality support | ✅ Image, Audio, Text | ⚠️ | ⚠️ | ⚠️ |
| Cross-platform support | ✅ Desktop, Mobile, Automotive, IoT | ⚠️ | ⚠️ | ⚠️ |
| One line of code to run | ✅ | ✅ | ⚠️ | ✅ |
| OpenAI-compatible API + Function calling | ✅ | ✅ | ✅ | ✅ |
Legend: ✅ Supported | ⚠️ Partial or limited support | ❌ Not supported
## Recent Wins
- 📣 Day-0 Support for **Qwen3-VL-4B and 8B** in GGUF, MLX, and .nexa formats for NPU/GPU/CPU. We are the only framework that supports these models in GGUF format. [Featured in Qwen's post about our partnership](https://x.com/Alibaba_Qwen/status/1978154384098754943).
- 📣 Day-0 Support for **IBM Granite 4.0** on NPU/GPU/CPU. [The NexaML engine was featured right next to vLLM, llama.cpp, and MLX in IBM's blog](https://x.com/IBM/status/1978154384098754943).
- 📣 Day-0 Support for **Google EmbeddingGemma** on NPU. We are [featured in Google's social post](https://x.com/googleaidevs/status/1969188152049889511).
- 📣 Supported **vision capability for Gemma3n**: First-ever [Gemma-3n](https://sdk.nexa.ai/model/Gemma3n-E4B) **multimodal** inference for GPU & CPU, in GGUF format.
- 📣 AMD NPU Support for [SDXL](https://huggingface.co/NexaAI/sdxl-turbo-amd-npu) image generation
- 📣 Intel NPU support for [DeepSeek-R1-Distill-Qwen-1.5B](https://sdk.nexa.ai/model/DeepSeek-R1-Distill-Qwen-1.5B-Intel-NPU) and [Llama3.2-3B](https://sdk.nexa.ai/model/Llama3.2-3B-Intel-NPU)
- 📣 Apple Neural Engine Support for real-time speech recognition with [Parakeet v3 model](https://sdk.nexa.ai/model/parakeet-v3-ane)
# Quick Start
## Step 1: Download Nexa CLI with one click
### macOS
* [arm64 with Apple Neural Engine support](https://public-storage.nexa4ai.com/nexa_sdk/downloads/nexa-cli_macos_arm64.pkg)
* [x86_64](https://public-storage.nexa4ai.com/nexa_sdk/downloads/nexa-cli_macos_x86_64.pkg)
### Windows
* [arm64 with Qualcomm NPU support](https://public-storage.nexa4ai.com/nexa_sdk/downloads/nexa-cli_windows_arm64.exe)
* [x86_64 with Intel / AMD NPU support](https://public-storage.nexa4ai.com/nexa_sdk/downloads/nexa-cli_windows_x86_64.exe)
### Linux
#### For x86_64:
```bash
curl -fsSL https://github.com/NexaAI/nexa-sdk/releases/latest/download/nexa-cli_linux_x86_64.sh -o install.sh && chmod +x install.sh && ./install.sh && rm install.sh
```
#### For arm64:
```bash
curl -fsSL https://github.com/NexaAI/nexa-sdk/releases/latest/download/nexa-cli_linux_arm64.sh -o install.sh && chmod +x install.sh && ./install.sh && rm install.sh
```
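Whichever platform you installed on, a quick sanity check confirms the CLI is available. Both commands are documented in the CLI Reference below:
```bash
# Print all available CLI commands; confirms `nexa` is on your PATH
nexa -h

# List cached models (empty on a fresh install)
nexa list
```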
## Step 2: Run models with one line of code
You can run any compatible GGUF, MLX, or `.nexa` model from 🤗 Hugging Face by using `nexa infer <model>`, where `<model>` is the Hugging Face repo name.
### GGUF models
> [!TIP]
> GGUF runs on macOS, Linux, and Windows on CPU/GPU. Note that certain GGUF models (e.g. Qwen3-VL-4B and 8B) are only supported by NexaSDK.
📝 Run and chat with LLMs, e.g. Qwen3:
```bash
nexa infer ggml-org/Qwen3-1.7B-GGUF
```
🖼️ Run and chat with Multimodal models, e.g. Qwen3-VL-4B:
```bash
nexa infer NexaAI/Qwen3-VL-4B-Instruct-GGUF
```
### MLX models
> [!TIP]
> MLX is macOS-only (Apple Silicon). Many MLX models in the Hugging Face mlx-community organization have quality issues and may not run reliably.
> We recommend starting with models from our curated [NexaAI Collection](https://huggingface.co/NexaAI/collections) for best results. For example:
📝 Run and chat with LLMs, e.g. Qwen3:
```bash
nexa infer NexaAI/Qwen3-4B-4bit-MLX
```
🖼️ Run and chat with Multimodal models, e.g. Gemma3n:
```bash
nexa infer NexaAI/gemma-3n-E4B-it-4bit-MLX
```
### Qualcomm NPU models
> [!TIP]
> You need to download the [arm64 build with Qualcomm NPU support](https://public-storage.nexa4ai.com/nexa_sdk/downloads/nexa-cli_windows_arm64.exe) and make sure your laptop has a Snapdragon® X Elite chip.
#### Quick Start (Windows arm64, Snapdragon X Elite)
1. **Login & Get Access Token (required for Pro Models)**
- Create an account at [sdk.nexa.ai](https://sdk.nexa.ai)
- Go to **Deployment → Create Token**
- Run this once in your terminal (replace with your token):
```bash
nexa config set license '<your_token_here>'
```
2. Run and chat with our multimodal model, OmniNeural-4B, or other NPU-ready models:
```bash
nexa infer NexaAI/OmniNeural-4B
nexa infer NexaAI/Granite-4-Micro-NPU
nexa infer NexaAI/Qwen3-VL-4B-Instruct-NPU
```
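Once a few models are cached, the housekeeping commands from the CLI Reference below keep disk usage under control. A minimal example, reusing a model name from this section:
```bash
# Show all locally cached models with their sizes
nexa list

# Delete a single cached model, or wipe the entire cache
nexa remove NexaAI/OmniNeural-4B
nexa clean
```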
## CLI Reference
| Essential Command | What it does |
|----------------------------------|----------------------------------------------------------------------|
| `nexa -h` | Show all CLI commands |
| `nexa pull <model>` | Interactive download & cache of a model |
| `nexa infer <model>` | Run local inference |
| `nexa list` | Show all cached models with sizes |
| `nexa remove <model>` / `nexa clean` | Delete one / all cached models |
| `nexa serve --host 127.0.0.1:8080` | Launch an OpenAI‑compatible REST server |
| `nexa run <model>` | Chat with a model via an existing server |
👉 To interact with multimodal models, you can drag photos or audio clips directly into the CLI — you can even drop multiple images at once!
See [CLI Reference](https://nexaai.mintlify.app/nexa-sdk-go/NexaCLI) for full commands.
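Since `nexa serve` exposes an OpenAI-compatible REST server, any OpenAI-style client should work against it. A minimal sketch with curl, assuming the standard `/v1/chat/completions` route and request schema (check the CLI Reference if your build differs):
```bash
# Terminal 1: launch the server (documented in the table above)
nexa serve --host 127.0.0.1:8080

# Terminal 2: send a chat completion request in the OpenAI schema
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-org/Qwen3-1.7B-GGUF",
    "messages": [{"role": "user", "content": "Hello! What can you do?"}]
  }'
```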
## Acknowledgements
We would like to thank the following projects:
- [ggml](https://github.com/ggml-org/ggml)
- [mlx-lm](https://github.com/ml-explore/mlx-lm)
- [mlx-vlm](https://github.com/Blaizzy/mlx-vlm)
- [mlx-audio](https://github.com/Blaizzy/mlx-audio)