https://github.com/n24q02m/qwen3-embed
Lightweight ONNX inference for Qwen3 embedding and reranking models
- Host: GitHub
- URL: https://github.com/n24q02m/qwen3-embed
- Owner: n24q02m
- License: apache-2.0
- Created: 2026-02-13T16:56:22.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2026-03-01T06:00:48.000Z (3 months ago)
- Last Synced: 2026-03-01T06:05:23.189Z (3 months ago)
- Topics: embeddings, machine-learning, onnx, onnx-runtime, open-source, python, qwen3, reranking, semantic-search, text-embedding
- Language: Python
- Homepage: https://pypi.org/project/qwen3-embed/
- Size: 306 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Codeowners: .github/CODEOWNERS
- Security: SECURITY.md
- Agents: AGENTS.md
# Qwen3 Embed
**Lightweight Qwen3 text embedding and reranking via ONNX Runtime and GGUF**
Trimmed fork of [fastembed](https://github.com/qdrant/fastembed), keeping only Qwen3 models.
## Features
- **Last-token pooling**: Uses the final token representation (with left-padding) instead of mean pooling.
- **MRL support**: Matryoshka Representation Learning lets you truncate embeddings to any dimension from 32 to 1024 while retaining most of the quality.
- **Instruction-aware**: Query embedding supports task instructions for better retrieval performance.
- **Causal LM reranking**: The reranker scores each query-document pair from the yes/no logits of a causal language model, producing calibrated scores in [0, 1].
- **Multiple backends**: ONNX Runtime (INT8, Q4F16) and GGUF (Q4_K_M via llama-cpp-python).
- **GPU optional, no PyTorch**: Runs on ONNX Runtime or llama-cpp-python -- no heavy ML framework required. Auto-detects GPU (CUDA, DirectML) when available.
- **Multilingual**: Both models support multi-language inputs.
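
The last-token pooling mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration of the general technique, not the library's internal code: the function names `last_token_pool` and `l2_normalize` and the toy tensors are assumptions made for this example.

```python
import numpy as np


def last_token_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Pick one vector per sequence: the hidden state of its final real token.

    hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    """
    # With left padding, every sequence ends at the last position.
    if attention_mask[:, -1].all():
        return hidden_states[:, -1]
    # Fallback for right padding: index the last position where the mask is 1.
    last_idx = attention_mask.sum(axis=1) - 1
    return hidden_states[np.arange(hidden_states.shape[0]), last_idx]


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each vector to unit length."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


# Toy batch: 2 sequences, 3 positions, 2-dim hidden states, left-padded.
hidden = np.array([
    [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]],  # first position is padding
    [[5.0, 6.0], [7.0, 8.0], [0.0, 5.0]],  # no padding
])
mask = np.array([[0, 1, 1], [1, 1, 1]])
pooled = l2_normalize(last_token_pool(hidden, mask))
```

Left padding is what makes the fast path possible: the final position of every row holds a real token, so pooling reduces to a single slice.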
## Supported Models
### ONNX (default)
| Model | Type | Dims | Max Tokens | Size |
|:------|:-----|:-----|:-----------|:-----|
| `n24q02m/Qwen3-Embedding-0.6B-ONNX` | Embedding | 32-1024 (MRL) | 32768 | 573 MB |
| `n24q02m/Qwen3-Embedding-0.6B-ONNX-Q4F16` | Embedding | 32-1024 (MRL) | 32768 | 517 MB |
| `n24q02m/Qwen3-Reranker-0.6B-ONNX` | Reranker | - | 40960 | 573 MB |
| `n24q02m/Qwen3-Reranker-0.6B-ONNX-Q4F16` | Reranker | - | 40960 | 518 MB |
| `n24q02m/Qwen3-Reranker-0.6B-ONNX-YesNo` | Reranker | - | 40960 | 598 MB |
### GGUF (optional, requires `llama-cpp-python`)
| Model | Type | Dims | Max Tokens | Size |
|:------|:-----|:-----|:-----------|:-----|
| `n24q02m/Qwen3-Embedding-0.6B-GGUF` | Embedding | 32-1024 (MRL) | 32768 | 378 MB |
| `n24q02m/Qwen3-Reranker-0.6B-GGUF` | Reranker | - | 40960 | 378 MB |
### HuggingFace Repos
| Format | Embedding | Reranker |
|:-------|:----------|:--------|
| ONNX | [n24q02m/Qwen3-Embedding-0.6B-ONNX](https://huggingface.co/n24q02m/Qwen3-Embedding-0.6B-ONNX) | [n24q02m/Qwen3-Reranker-0.6B-ONNX](https://huggingface.co/n24q02m/Qwen3-Reranker-0.6B-ONNX) |
| GGUF | [n24q02m/Qwen3-Embedding-0.6B-GGUF](https://huggingface.co/n24q02m/Qwen3-Embedding-0.6B-GGUF) | [n24q02m/Qwen3-Reranker-0.6B-GGUF](https://huggingface.co/n24q02m/Qwen3-Reranker-0.6B-GGUF) |
## Installation
```bash
pip install qwen3-embed
# For GGUF support (the quotes keep shells such as zsh from globbing the brackets)
pip install "qwen3-embed[gguf]"
```
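
To check at runtime whether the optional GGUF dependency is installed before selecting a GGUF model, a standard-library probe is enough. This is a sketch only: `has_module` is a hypothetical helper, and `llama_cpp` is the import name of the `llama-cpp-python` package.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


# llama-cpp-python installs the top-level module `llama_cpp`.
if has_module("llama_cpp"):
    print("GGUF backend available")
else:
    print('GGUF backend missing; install with: pip install "qwen3-embed[gguf]"')
```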
## Usage
### Text Embedding
```python
from qwen3_embed import TextEmbedding
# INT8 (default)
model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-ONNX")
# Alternatives (pick one; each line loads a separate model):
# Q4F16 (smaller, slightly less accurate)
# model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-ONNX-Q4F16")
# GGUF (requires: pip install "qwen3-embed[gguf]")
# model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-GGUF")
documents = [
    "Qwen3 is a multilingual embedding model.",
    "ONNX Runtime enables fast CPU inference.",
]
embeddings = list(model.embed(documents))
# Each embedding: numpy array of shape (1024,), L2-normalized
# Matryoshka Representation Learning (MRL) -- truncate to smaller dims
embeddings_256 = list(model.embed(documents, dim=256))
# Each embedding: numpy array of shape (256,), L2-normalized
# Query with instruction (for retrieval tasks)
queries = list(model.query_embed(
    ["What is Qwen3?"],
    task="Given a question, retrieve relevant passages",
))
```
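
For intuition about what MRL truncation involves, the sketch below keeps the leading components of a unit vector and rescales the result to unit length, so that dot products can still be read as cosine similarities. Whether `embed(..., dim=...)` does exactly this internally is an assumption; `truncate_embedding` and `cosine_scores` are hypothetical helpers and the vectors are toy data.

```python
import numpy as np


def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and rescale to unit length."""
    head = np.asarray(vec, dtype=np.float64)[..., :dim]
    norm = np.linalg.norm(head, axis=-1, keepdims=True)
    return head / np.maximum(norm, 1e-12)


def cosine_scores(query: np.ndarray, docs: np.ndarray) -> np.ndarray:
    """For unit-length vectors the dot product equals cosine similarity."""
    return docs @ query


# Toy 3-dim "embedding" of unit length, truncated to 2 dims.
full = np.array([3.0, 4.0, 12.0]) / 13.0
small = truncate_embedding(full, 2)

# Score two toy unit-length documents against the truncated query.
docs = np.eye(2)
scores = cosine_scores(small, docs)
```

Queries and documents need to be truncated to the same dimension before they are compared.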
### Reranking
```python
from qwen3_embed import TextCrossEncoder
reranker = TextCrossEncoder(model_name="n24q02m/Qwen3-Reranker-0.6B-ONNX")
# YesNo variant: ~10x less RAM (~598MB vs ~12GB at inference)
# reranker = TextCrossEncoder(model_name="n24q02m/Qwen3-Reranker-0.6B-ONNX-YesNo")
query = "What is Qwen3?"
documents = [
    "Qwen3 is a series of large language models by Alibaba.",
    "The weather today is sunny.",
    "Qwen3-Embedding supports multilingual text embedding.",
]
scores = list(reranker.rerank(query, documents))
# scores: list of float in [0, 1], higher = more relevant
# Or rerank pairs directly
pairs = [
    ("What is AI?", "Artificial intelligence is a branch of computer science."),
    ("What is ML?", "Machine learning is a subset of AI."),
]
pair_scores = list(reranker.rerank_pairs(pairs))
```
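
The yes/no logit scoring behind these scores can be illustrated with plain Python. This sketch assumes the score is a two-way softmax over the "yes" and "no" logits, which is a common formulation for this kind of reranker; the library's exact computation is not shown in this README, and `yes_no_score` and `rank_by_score` are hypothetical helpers.

```python
import math


def yes_no_score(yes_logit: float, no_logit: float) -> float:
    """Two-way softmax over the 'yes' and 'no' logits, i.e. P(yes)."""
    # softmax([yes, no])[0] == sigmoid(yes - no); both branches avoid overflow.
    diff = yes_logit - no_logit
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def rank_by_score(documents, scores):
    """Return (document, score) pairs sorted from most to least relevant."""
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)


# Toy logits for three documents: (yes_logit, no_logit).
toy_logits = [(2.0, 0.0), (-1.0, 3.0), (0.5, 0.5)]
toy_scores = [yes_no_score(y, n) for y, n in toy_logits]
ranked = rank_by_score(["doc A", "doc B", "doc C"], toy_scores)
```

Because only the difference between the two logits matters, the score always lies in [0, 1], and equal logits give exactly 0.5.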
## Configuration
### GPU Acceleration
Both the ONNX and GGUF backends auto-detect a GPU when one is available (`Device.AUTO` is the default).
#### ONNX
GPU inference requires `onnxruntime-gpu` (CUDA) or `onnxruntime-directml` (Windows) in place of the CPU-only `onnxruntime` package:
```bash
pip install onnxruntime-gpu # NVIDIA CUDA
# or
pip install onnxruntime-directml # Windows AMD/Intel/NVIDIA
```
```python
from qwen3_embed import TextEmbedding, Device
# Auto-detect GPU (default)
model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-ONNX")
# Force CPU
model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-ONNX", cuda=Device.CPU)
# Force CUDA
model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-ONNX", cuda=Device.CUDA)
```
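
To see which execution providers the installed ONNX Runtime build exposes, you can query it directly. The sketch below is not how `Device.AUTO` is implemented; `pick_provider` and its preference order are assumptions for illustration, while `get_available_providers()` is ONNX Runtime's own call.

```python
def pick_provider(available: list[str]) -> str:
    """Return the first preferred provider that is available, else CPU."""
    # Assumed preference order: CUDA, then DirectML, then CPU.
    preferred = ["CUDAExecutionProvider", "DmlExecutionProvider", "CPUExecutionProvider"]
    for name in preferred:
        if name in available:
            return name
    return "CPUExecutionProvider"


try:
    import onnxruntime as ort

    available = ort.get_available_providers()
except ImportError:
    # ONNX Runtime is not installed; assume CPU only for this demo.
    available = ["CPUExecutionProvider"]

print("available:", available)
print("selected:", pick_provider(available))
```

If only `CPUExecutionProvider` is listed, the CPU-only `onnxruntime` package is probably the one installed.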
#### GGUF
GPU is handled by `llama-cpp-python`. The default `pip install qwen3-embed[gguf]` is CPU-only.
For CUDA GPU support, build with:
```bash
CMAKE_ARGS="-DGGML_CUDA=on" pip install "qwen3-embed[gguf]"
```
```python
from qwen3_embed import TextEmbedding, Device
# Auto-detect GPU (default, offloads all layers)
model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-GGUF")
# Force CPU only
model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-GGUF", cuda=Device.CPU)
```
## Development
```bash
mise run setup # Install deps + pre-commit hooks
mise run lint # ruff check + format --check
mise run test # pytest
mise run fix # ruff auto-fix + format
```
## Related Projects
- [wet-mcp](https://github.com/n24q02m/wet-mcp) -- MCP web search server with vector-based docs search, uses qwen3-embed for local embedding
- [mnemo-mcp](https://github.com/n24q02m/mnemo-mcp) -- MCP memory server with semantic search powered by qwen3-embed
- [better-code-review-graph](https://github.com/n24q02m/better-code-review-graph) -- Knowledge graph for code reviews, uses qwen3-embed for local ONNX embedding
- [modalcom-ai-workers](https://github.com/n24q02m/modalcom-ai-workers) -- GPU-serverless workers that convert Qwen3 models to ONNX/GGUF format
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md).
## License
Apache-2.0 -- See [LICENSE](LICENSE). Original fastembed by [Qdrant](https://github.com/qdrant/fastembed).