{"id":51236730,"url":"https://github.com/smart-models/smart-embedder","last_synced_at":"2026-06-28T21:01:02.888Z","repository":{"id":363330924,"uuid":"1253530731","full_name":"smart-models/Smart-Embedder","owner":"smart-models","description":"A lightweight, self-hosted embedding server built for hybrid search pipelines. It runs entirely on your own hardware on an NVIDIA GPU for high throughput, or on CPU when no GPU is available with no cloud dependency and no data leaving your machine.","archived":false,"fork":false,"pushed_at":"2026-06-08T11:50:48.000Z","size":113,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-08T13:26:37.444Z","etag":null,"topics":["bge-m3","docker","embedding","fastapi","hybrid-search","qwen-embeddings","qwen-reranker","rag","reranker","self-hosted"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/smart-models.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-29T15:00:33.000Z","updated_at":"2026-06-08T11:50:50.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/smart-models/Smart-Embedder","commit_stats":null,"previous_names":["smart-models/smart-embedder"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/smart-models/Smart-Embedder","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub
/repositories/smart-models%2FSmart-Embedder","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smart-models%2FSmart-Embedder/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smart-models%2FSmart-Embedder/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smart-models%2FSmart-Embedder/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/smart-models","download_url":"https://codeload.github.com/smart-models/Smart-Embedder/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smart-models%2FSmart-Embedder/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34903523,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-28T02:00:05.809Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bge-m3","docker","embedding","fastapi","hybrid-search","qwen-embeddings","qwen-reranker","rag","reranker","self-hosted"],"created_at":"2026-06-28T21:01:01.972Z","updated_at":"2026-06-28T21:01:02.862Z","avatar_url":"https://github.com/smart-models.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SMART EMBEDDER\n\n![Version 1.2.0](https://img.shields.io/badge/Version-1.2.0-blue)\n![GPU 
Accelerated](https://img.shields.io/badge/GPU-Accelerated-green)\n![CPU Support](https://img.shields.io/badge/CPU-Supported-lightgrey)\n![CUDA 12.6](https://img.shields.io/badge/CUDA-12.6-76B900?logo=nvidia\u0026logoColor=white)\n![Python 3.10+](https://img.shields.io/badge/Python-3.10%2B-blue?logo=python\u0026logoColor=white)\n![FastAPI](https://img.shields.io/badge/FastAPI-Latest-009688?logo=fastapi\u0026logoColor=white)\n![Docker](https://img.shields.io/badge/Docker-Ready-2496ED?logo=docker\u0026logoColor=white)\n\n**Smart Embedder** is a lightweight, self-hosted embedding server built for **hybrid search** pipelines. It runs entirely on your own hardware — on an **NVIDIA GPU** for high throughput, or on **CPU** when no GPU is available — with no cloud dependency and no data leaving your machine.\n\nHybrid search combines dense vector similarity, sparse lexical matching (BM25-style), and optional ColBERT late-interaction scoring into a single retrieval pipeline. Smart Embedder exposes all three vector types from a single endpoint, plus a reranking endpoint to re-score candidate passages after retrieval — everything a hybrid search stack needs in one lightweight service.\n\nThe server is built on FastAPI and wraps [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3), a single model that produces dense, sparse, and ColBERT vectors simultaneously. The default stack uses BGE-M3 for all vector types and bge-reranker-v2-m3 for reranking — a solid baseline that runs on 8 GB VRAM or CPU with no extra configuration. 
**For higher retrieval quality**, Smart Embedder offers two optional upgrades selectable independently at startup:\n\n- **Dense embedding → [Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B):** replaces only the dense vector path; sparse and ColBERT vectors still come from BGE-M3, keeping the full hybrid signal intact.\n- **Reranking → [Qwen3-Reranker-0.6B](https://huggingface.co/Qwen/Qwen3-Reranker-0.6B):** replaces the cross-encoder reranker with a stronger model at the cost of higher inference time.\n\nBoth Qwen models are still compact (0.6B parameters) but benefit from a dedicated GPU — a machine with 8 GB+ VRAM will see the best results. CPU execution remains supported for both, with conservative batch sizes applied automatically.\n\n**Key properties at a glance:**\n\n| Property | Detail |\n|---|---|\n| **Deployment** | Local — GPU (NVIDIA CUDA) or CPU, Docker or Python venv |\n| **Hybrid search vectors** | Dense + sparse lexical + ColBERT from one model, one endpoint |\n| **Reranking** | Cross-encoder passage reranking, same service |\n| **Footprint** | BGE-M3 + reranker fit in 8 GB VRAM; CPU mode needs no GPU |\n| **QDRANT-ready** | Sparse vectors in native `{indices, values}` format via `sparse_as_indices` |\n\n---\n\nHigh-performance FastAPI server for BGE-M3 embeddings and selectable reranking:\n\n| Feature | Detail |\n|---|---|\n| **Embeddings** | BGE-M3 dense/sparse/ColBERT by default; optional Qwen3 dense with BGE-M3 sparse and ColBERT |\n| **Reranking** | Interactive startup choice: `BAAI/bge-reranker-v2-m3` or `Qwen/Qwen3-Reranker-0.6B` |\n| **Authentication** | Optional Bearer token on non-public endpoints |\n| **Rate Limiting** | Token bucket, 3600 req/min per IP, burst 120 |\n| **Backpressure** | Embedding queue max 200, rerank slots max 32, HTTP 503 on overflow |\n| **Graceful Shutdown** | 30s drain for in-flight requests |\n| **Prometheus Metrics** | Counters, histograms, gauges for both models |\n| **Dynamic Batching** | 
Embedding batch size adapts to GPU VRAM and max input length at startup |\n\n## Quick Start\n\n### 1. Setup\n\n```bat\npython -m venv .venv\n.venv\\Scripts\\activate\npip install -r requirements-gpu.txt\nREM CPU-only machine instead: pip install -r requirements-cpu.txt\n```\n\n### 2. Start the Server\n\n`start_server.bat` and `start_server.sh` are parameterized and choose execution target and device:\n\n```bat\nstart_server.bat [local|docker] [cpu|gpu|auto]\n```\n\n```bash\n./start_server.sh [local|docker] [cpu|gpu|auto]\n```\n\n| Command | What it does |\n|---|---|\n| `start_server.bat` / `./start_server.sh` | Default: `docker auto` |\n| `start_server.bat docker auto` | Docker startup with device auto-detection |\n| `start_server.bat local gpu` | Local venv, CUDA auto-detect |\n| `start_server.bat local cpu` | Local venv, forces CPU (`CUDA_VISIBLE_DEVICES=-1`) |\n| `start_server.bat docker gpu` | `docker compose -f docker-compose.gpu.yml build \u0026\u0026 up -d` with NVIDIA runtime |\n| `start_server.bat docker cpu` | Compose with override `docker-compose.cpu.yml` (no GPU) |\n\nArguments are case-insensitive. Built-in validation: unrecognized parameters print usage and exit with code 1.\n\nStartup asks two independent model questions:\n\n1. Dense embedding backend: choose BGE for current all-BGE embeddings, or QWEN to return Qwen dense vectors while keeping BGE sparse and ColBERT vectors.\n2. Reranker backend: choose BGE or QWEN for `/rerank`.\n\nThe interactive model selection is provided by the launcher scripts. Direct\n`uvicorn` or `docker compose` startup does not prompt; it uses environment\nvariables or `.env`, falling back to BGE defaults.\n\n**`local` mode**: requires `.venv` already created (see step 1). The script activates venv, checks for `uvicorn`, installs dependencies if missing, then starts the server.\n\n**`docker` mode**: requires Docker Desktop / Engine in PATH. The script builds the image and starts the container in background. 
For logs:\n\n```bat\ndocker compose -f docker-compose.gpu.yml logs -f embedder\n```\n\nIn Docker Desktop the project appears as **smart-embedder** (containers `smart-embedder-gpu` / `smart-embedder-cpu`).\n\nOr directly without wrapper:\n\n```bat\nuvicorn bge-m3_server:app --host 0.0.0.0 --port 8000\n```\n\nWait for these log lines:\n```\nINFO - Reranker ready.\nINFO - Server ready to accept requests\n```\n\n### 3. Automatic Test\n\nIn a second terminal (with server running):\n\n```bat\npython test_server.py\n```\n\nExpected output: **17/17 tests passed**. With `--token` and `API_TOKEN`\nconfigured, the authentication check is included and the expected output is\n**18/18 tests passed**.\n\n`test_server.py` accepts `--url` to point to a different host and `--token`\nwhen `API_TOKEN` is configured:\n```bat\npython test_server.py --url http://localhost:8000\npython test_server.py --token \u003ctoken\u003e\n```\n\n### 4. Benchmark\n\nMeasures latency (avg/p50/p95/p99) and throughput on `embed_dense`, `embed_full`, `rerank` scenarios:\n\n```bat\npython benchmark.py --concurrency 8 --requests 100 --batch-size 4\n```\n\n| Flag | Default | Description |\n|---|---|---|\n| `--url` | `http://localhost:8000` | Server target |\n| `--token` | `API_TOKEN` env or empty | Bearer token if server requires auth |\n| `--concurrency` | `8` | Concurrent requests in-flight |\n| `--requests` | `100` | Requests per scenario |\n| `--batch-size` | `4` | Sentences/passages per request |\n| `--warmup` | `5` | Warmup requests (excluded from metrics) |\n| `--timeout` | `60` | Timeout for single request |\n| `--max-batch-size` | `128` | Local guardrail on payload limits; `0` disables |\n| `--scenarios` | all | CSV: `embed_dense,embed_full,rerank` |\n| `--sleep-between` | `0` | Pause between scenarios (use `65` if rate-limit active) |\n\n\u003e Note: Default rate limits (3600 req/min, burst 120) are tuned for benchmarks\n\u003e on a single client at `conc\u003c=16`. 
For extreme stress testing:\n\u003e `RATE_LIMIT_REQUESTS_PER_MINUTE=1000000 docker compose -f docker-compose.gpu.yml up -d`.\n\nOutput: ASCII table with `Reqs / OK / Fail / Conc / Wall / Req/s / Units/s / Avg / P50 / P95 / P99 / Min / Max`.\n\n**Latest measured run (RTX 4060 Laptop 8GB, batch=4, conc=8, `transformers==4.57.3`):**\n\n| Scenario | Req/s | Units/s | P50 | P95 | P99 |\n|---|---|---|---|---|---|\n| `embed_dense` | 44.5 | 178 | 176.9ms | 185.4ms | 187.3ms |\n| `embed_full` (dense+sparse+colbert) | 28.7 | 115 | 294.1ms | 350.6ms | 498.9ms |\n| `rerank` | 37.8 | 151 | 205.7ms | 250.3ms | 263.8ms |\n\n---\n\n## Docker\n\n### Prerequisites\n\n- Docker Desktop / Docker Engine with Compose v2+\n- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) (GPU mode only)\n\n```bash\n# Register the NVIDIA runtime with Docker (writes the Docker daemon config, so it needs root)\nsudo nvidia-ctk runtime configure --runtime=docker\nsudo systemctl restart docker\n```\n\n### First Startup\n\n```bash\n# Verify CUDA tag exists before building\ndocker pull nvidia/cuda:12.6.3-runtime-ubuntu22.04\n\n# Build and start on GPU (first time: downloads selected embedding and reranker models)\ndocker compose -f docker-compose.gpu.yml up --build\n\n# Or via bat wrapper (Windows)\nstart_server.bat docker gpu\n\n# CPU execution (compose override)\ndocker compose -f docker-compose.gpu.yml -f docker-compose.cpu.yml up --build\n# Equivalent:\nstart_server.bat docker cpu\n```\n\nWait for these log lines:\n```\nINFO - Reranker ready.\nINFO - Server ready to accept requests\n```\n\nServer available at `http://localhost:8000`.  \nModels are saved in the default named volume\n`smart-embedder-hf-cache` and mounted at `/app/model_cache`;\nsubsequent restarts do not re-download them.\n\nThe first Docker startup with QWEN selected downloads the selected Qwen model\ninto the Hugging Face cache volume. 
Startup can take longer than BGE on an empty\ncache; later runs reuse the cached model.\n\n### Useful Commands\n\n```bash\n# Startup in background\ndocker compose -f docker-compose.gpu.yml up -d\n\n# Real-time logs\ndocker compose -f docker-compose.gpu.yml logs -f embedder\n\n# Stop\ndocker compose -f docker-compose.gpu.yml down\n\n# Rebuild after code changes (deps cached if requirements-gpu.txt unchanged)\ndocker compose -f docker-compose.gpu.yml up --build\n\n# Complete rebuild from scratch\ndocker compose -f docker-compose.gpu.yml build --no-cache\n```\n\n### Verify GPU in Container\n\n```bash\ndocker compose -f docker-compose.gpu.yml run --rm embedder python3 -c \"\nimport torch\nprint('PyTorch:', torch.__version__)\nprint('CUDA available:', torch.cuda.is_available())\nif torch.cuda.is_available():\n    print('GPU:', torch.cuda.get_device_name(0))\n\"\n```\n\n### Exposure on Local Network\n\nBy default the server is bound to `127.0.0.1:8000` (localhost only).\n\nTo change the exposed port (e.g. 8000 already in use), set `PORT` in `.env` or\nshell before startup. 
Docker remaps the published host port; the container stays\non 8000 internally:\n\n```bash\nPORT=8001 docker compose -f docker-compose.gpu.yml up -d\n# or persistent: add PORT=8001 to .env\n```\n\nIn `local` mode the launcher passes `PORT` to `uvicorn --port`.\n\nFor LAN access modify `docker-compose.gpu.yml` (remove the `127.0.0.1:` bind prefix):\n\n```yaml\nports:\n  - \"${PORT:-8000}:8000\"\n```\n\n\u003e Warning: If exposed on network, add a reverse proxy with authentication\n\u003e (nginx, Traefik).\n\n---\n\n## Project Files\n\n| File | Description |\n|---|---|\n| `bge-m3_server.py` | Main server |\n| `requirements-gpu.txt` | Python dependencies (GPU / CUDA PyTorch wheel) |\n| `requirements-cpu.txt` | Python dependencies (CPU-only PyTorch wheel) |\n| `Dockerfile.gpu` | GPU image build (CUDA 12.6, non-root, hardened) |\n| `Dockerfile.cpu` | CPU-only image build (slim Python base, no CUDA) |\n| `docker-compose.gpu.yml` | Container orchestration with GPU and model volume |\n| `docker-compose.cpu.yml` | Compose override: slim CPU image, removes GPU reservation |\n| `.env.example` | Environment variables template (copy to `.env` for local override) |\n| `.dockerignore` | Excludes `.venv`, cache, docs from build context |\n| `start_server.bat` | Windows startup script parameterized (`local\\|docker` x `cpu\\|gpu\\|auto`) |\n| `start_server.sh` | Unix shell startup script parameterized (`local\\|docker` x `cpu\\|gpu\\|auto`) |\n| `test_server.py` | Runtime test suite (17 checks, 18 with `--token`) |\n| `benchmark.py` | Benchmark latency/throughput with summary table |\n\n---\n\n## API Endpoints\n\n### `POST /embeddings/`\n\nGenerates embeddings for a list of texts.\n\n**Request:**\n```json\n{\n  \"sentences\": [\"Hello world!\", \"Ciao mondo!\"],\n  \"return_dense\": true,\n  \"return_sparse\": true,\n  \"return_colbert\": true,\n  \"normalize_dense\": false,\n  \"sparse_as_indices\": false\n}\n```\n\n**`sparse_as_indices` (default: `false`):** When `true`, sparse 
vectors are returned\nin QDRANT-compatible format instead of the default token-id dict:\n\n```json\n\"sparse\": {\"indices\": [10, 1389, 2349], \"values\": [0.277, 0.292, 0.313]}\n```\n\nUse with `SparseVector(indices=..., values=...)` when upserting to QDRANT.\n\nThe active embedding backends are selected at server startup. With the default\nBGE dense backend, `dense`, `sparse`, and `colbert` all come from `BAAI/bge-m3`.\nWith Qwen dense selected, only `dense` changes to `Qwen/Qwen3-Embedding-0.6B`;\n`sparse` and `colbert` still come from `BAAI/bge-m3`.\n\n**Response:**\n```json\n{\n  \"data\": [\n    {\n      \"id\": 0,\n      \"text\": \"Hello world!\",\n      \"embeddings\": {\n        \"dense\": [0.021, -0.013, ...],\n        \"sparse\": {\"12\": 0.08, \"435\": 0.12, ...},\n        \"colbert\": [[0.01, ...], ...]\n      }\n    }\n  ],\n  \"model_name\": \"Qwen/Qwen3-Embedding-0.6B\",\n  \"dense_model_name\": \"Qwen/Qwen3-Embedding-0.6B\",\n  \"sparse_model_name\": \"BAAI/bge-m3\",\n  \"colbert_model_name\": \"BAAI/bge-m3\",\n  \"processing_time_ms\": 104.5,\n  \"warnings\": [\n    {\n      \"code\": \"input_truncated\",\n      \"severity\": \"warning\",\n      \"message\": \"Input text was truncated to the model token limit.\",\n      \"target\": {\n        \"field\": \"sentences\",\n        \"index\": 0,\n        \"pointer\": \"/sentences/0\"\n      },\n      \"details\": {\n        \"model\": \"BAAI/bge-m3\",\n        \"max_tokens\": 8192,\n        \"original_tokens\": 9000,\n        \"truncated_tokens\": 808,\n        \"truncation_side\": \"end\"\n      }\n    }\n  ]\n}\n```\n\n**cURL:**\n```bash\ncurl -X POST \"http://localhost:8000/embeddings/\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"sentences\": [\"Hello world!\"], \"return_dense\": true}'\n```\n\nIf `API_TOKEN` is set:\n```bash\ncurl -X POST \"http://localhost:8000/embeddings/\" \\\n  -H \"Authorization: Bearer \u003ctoken\u003e\" \\\n  -H \"Content-Type: application/json\" \\\n  -d 
'{\"sentences\": [\"Hello world!\"], \"return_dense\": true}'\n```\n\n---\n\n### `POST /rerank`\n\nRanks a list of passages by relevance to a query.\n\n**Request:**\n```json\n{\n  \"query\": \"What is machine learning?\",\n  \"passages\": [\n    \"Machine learning is a subset of AI.\",\n    \"The weather is nice today.\",\n    \"Deep learning uses neural networks.\"\n  ],\n  \"normalize\": true\n}\n```\n\n**Response:**\n```json\n{\n  \"results\": [\n    {\"index\": 0, \"passage\": \"Machine learning is a subset of AI.\", \"score\": 0.987},\n    {\"index\": 2, \"passage\": \"Deep learning uses neural networks.\", \"score\": 0.821},\n    {\"index\": 1, \"passage\": \"The weather is nice today.\", \"score\": 0.003}\n  ],\n  \"model_name\": \"BAAI/bge-reranker-v2-m3\",\n  \"processing_time_ms\": 52.2,\n  \"warnings\": []\n}\n```\n\n- `normalize: true` returns a score in `[0, 1]` (sigmoid)\n- `normalize: false` returns a raw score (negative values possible)\n- With QWEN selected, scores are yes-probabilities and `normalize` is kept as an API-compatible no-op\n- Do not compare BGE `normalize: false` raw logits directly with QWEN scores\n- Passages are returned sorted by **descending** score\n- The `index` field returns the original position in the input list\n- `model_name` reports the reranker selected at startup\n\nFor over-token query-passage pairs, the server preserves the query where\npossible, truncates passages from the end, returns `200 OK`, and includes\n`query_truncated` or `passage_truncated` entries in `warnings`.\n\nWarning token counts are computed during server-side preparation. 
Rerank inputs\nare then decoded back to text and tokenized again by the model backend, so\n`original_tokens`, `max_tokens`, and `truncated_tokens` should be treated as\ndiagnostic metadata rather than exact proof of final backend tokenization.\n\n**cURL:**\n```bash\ncurl -X POST \"http://localhost:8000/rerank\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"query\": \"machine learning\", \"passages\": [\"ML is AI\", \"Nice weather\"], \"normalize\": true}'\n```\n\nIf `API_TOKEN` is set:\n```bash\ncurl -X POST \"http://localhost:8000/rerank\" \\\n  -H \"Authorization: Bearer \u003ctoken\u003e\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"query\": \"machine learning\", \"passages\": [\"ML is AI\", \"Nice weather\"], \"normalize\": true}'\n```\n\n---\n\n### `GET /health`\n\n```bash\ncurl \"http://localhost:8000/health\"\n```\n\nReturns server status, GPU info, active embedding/reranker models, batch size.\n\nRelevant model fields:\n\n```json\n{\n  \"version\": \"1.2.0\",\n  \"model\": \"BAAI/bge-m3\",\n  \"dense_embedding_model\": \"Qwen/Qwen3-Embedding-0.6B\",\n  \"reranker_model\": \"BAAI/bge-reranker-v2-m3\"\n}\n```\n\n---\n\n### `GET /stats`\n\n```bash\ncurl \"http://localhost:8000/stats\"\n```\n\nReturns uptime, total requests, total sentences, total batches, rejected requests, hardware.\n\n---\n\n### `GET /metrics`\n\n```bash\ncurl \"http://localhost:8000/metrics\"\n```\n\nPrometheus scraping endpoint in text/plain format.\n\n---\n\n### `GET /docs`\n\nInteractive Swagger documentation: `http://localhost:8000/docs`\n\nIf `API_TOKEN` is configured, Swagger shows the lock on `POST` endpoints.\nUse the **Authorize** button and enter only the token, without `Bearer` prefix.\n\n---\n\n## Configuration\n\nLimits are tunable via **environment variable** (override in `docker-compose.gpu.yml` or shell before startup):\n\n| Env var | Default | Description |\n|---|---|---|\n| `PORT` | `8000` | Host port to expose. 
Docker: published host port (container stays on 8000). Local: `uvicorn --port`. Set in `.env` or shell if 8000 is taken |\n| `BGE_EMBED_MAX_LENGTH` | `MAX_INPUT_LENGTH` fallback / `8192` | Max tokens for BGE-M3 embedding input; applies to dense, sparse, and ColBERT outputs |\n| `QWEN_EMBED_MAX_LENGTH` | `32768` | Max tokens for Qwen dense embedding input |\n| `BGE_RERANK_MAX_LENGTH` | `8192` | Max query+passage tokens for BGE rerank; BAAI notes this reranker was fine-tuned at 1024 and recommends 1024 for practical use |\n| `QWEN_RERANK_MAX_LENGTH` | `32768` | Max query+passage tokens for Qwen reranker when QWEN is selected |\n| `MAX_INPUT_LENGTH` | `8192` | Legacy fallback for `BGE_EMBED_MAX_LENGTH`; prefer the backend-specific variables above |\n| `REQUEST_TIMEOUT` | `90` | Global HTTP timeout (sec); keep above `RERANK_GPU_TIMEOUT` |\n| `DENSE_EMBEDDING_MODEL` | `BAAI/bge-m3` | Dense embedding backend selected by launcher (`BAAI/bge-m3` or `Qwen/Qwen3-Embedding-0.6B`) |\n| `RERANKER_MODEL` | `BAAI/bge-reranker-v2-m3` | Reranker selected by launcher (`BAAI/bge-reranker-v2-m3` or `Qwen/Qwen3-Reranker-0.6B`) |\n| `QWEN_RERANK_BATCH_SIZE` | launcher-tuned / `16` fallback | Max query-passage pairs per Qwen reranker micro-batch |\n| `API_TOKEN` | empty | Optional bearer token for non-public endpoints; empty disables authentication |\n| `MAX_QUEUE_SIZE` | `200` | Max requests in queue `/embeddings/` (backpressure) |\n| `RERANK_MAX_QUEUE` | `32` | Max concurrent slots for `/rerank` (backpressure) |\n| `RERANK_GPU_TIMEOUT` | `60` | Hard timeout for a single rerank inference (sec); keep below `REQUEST_TIMEOUT` |\n| `RATE_LIMIT_REQUESTS_PER_MINUTE` | `3600` | Rate limit per IP (60 req/s) |\n| `RATE_LIMIT_BURST_SIZE` | `120` | Token bucket burst (~2s of traffic) |\n| `PYTORCH_CUDA_ALLOC_CONF` | `expandable_segments:True` | CUDA caching-allocator config; reduces fragmentation OOM on variable-length batches (single-GPU, no NCCL). 
Set in `Dockerfile.gpu` and `docker-compose.gpu.yml` |\n\nTexts longer than the active backend-specific token limit are truncated and\nreported in the response `warnings` array. The server no longer rejects requests\nbased on character-count payload limits.\n\nWith `API_TOKEN` set, all non-public endpoints require:\n\n```http\nAuthorization: Bearer \u003ctoken\u003e\n```\n\nService endpoints (`/health`, `/stats`, `/metrics`, `/docs`, `/redoc`, `/openapi.json`) remain accessible without token.\n\nWhen `DENSE_EMBEDDING_MODEL=Qwen/Qwen3-Embedding-0.6B`, only dense vectors change. Sparse lexical weights and ColBERT vectors still come from `BAAI/bge-m3`, so mixed requests are supported through the same `/embeddings/` endpoint.\n\nThe Qwen dense path intentionally does not add query/document instruction prefixes. This keeps the existing `/embeddings/` API transparent, but deployments optimizing retrieval quality should benchmark task-specific Qwen formatting separately before changing request semantics.\n\nWhen QWEN reranking is selected and GPU mode is used, the launch scripts can\nauto-tune Qwen rerank limits from detected GPU VRAM if `QWEN_RERANK_BATCH_SIZE`\nor `QWEN_RERANK_MAX_LENGTH` are not set in the environment or `.env`:\n\n| GPU VRAM | QWEN_RERANK_BATCH_SIZE | QWEN_RERANK_MAX_LENGTH |\n|---|---:|---:|\n| \u003c= 6 GB | 4 | 4096 |\n| \u003c= 8 GB | 8 | 8192 |\n| \u003e 8 GB | 16 | 8192 |\n\nWhen QWEN reranking is selected in CPU mode, the launch scripts use conservative\ndefaults unless overridden:\n\n| Device | QWEN_RERANK_BATCH_SIZE | QWEN_RERANK_MAX_LENGTH |\n|---|---:|---:|\n| CPU | 1 | 2048 |\n\n`.env.example` pins `QWEN_RERANK_MAX_LENGTH=32768`, the documented model\nmaximum. 
To use the launcher VRAM-tuned values above, leave\n`QWEN_RERANK_MAX_LENGTH` unset in your shell and remove or comment it from\nyour local `.env`.\n\nBenchmark defaults are tuned for **NVIDIA RTX 4060 Laptop 8GB**: observed\nthroughput ~29-45 req/s at `conc=8` depending on scenario (see benchmark\ntable above).\nOverride:\n\n```bash\n# Ad-hoc (shell env)\nRATE_LIMIT_REQUESTS_PER_MINUTE=10000 docker compose -f docker-compose.gpu.yml up -d\n\n# Persistent - copy .env.example to .env and modify\ncp .env.example .env\ndocker compose -f docker-compose.gpu.yml up -d\n```\n\nCompose automatically loads `.env` in the same directory. `.env` is in `.gitignore`; `.env.example` is the versioned template.\n\n`MULTI_GPU_DEVICES = None` (in `bge-m3_server.py`) can be changed to\n`['cuda:0', 'cuda:1']` for multi-GPU.\n\n**Embedding batch size** is automatically calculated from available VRAM and the\nactive embedding max length. With default BGE embeddings, this is\n`BGE_EMBED_MAX_LENGTH=8192`. If Qwen dense embeddings are selected, the tuning\nuses the larger of `BGE_EMBED_MAX_LENGTH` and `QWEN_EMBED_MAX_LENGTH`, because\nmixed dense+sparse/ColBERT requests can exercise both tokenizers:\n\n| Condition | batch_size | MAX_REQUESTS_IN_BATCH |\n|---|---:|---:|\n| GPU \u003e 8 GB | 12 | 16 |\n| GPU \u003c= 8 GB and \u003e 4 GB | 6 | 16 |\n| GPU \u003c= 4 GB | 3 | 16 |\n| CPU | 1 | 8 |\n\nIf the active embedding tuning length is `\u003c=512`, the server switches to the\nshort-sequence profile:\n\n| VRAM | batch_size | MAX_REQUESTS_IN_BATCH |\n|---|---:|---:|\n| \u003e 8 GB | 128 | 64 |\n| \u003e 6 GB | 64 | 32 |\n| \u003e 4 GB | 32 | 16 |\n| \u003c= 4 GB | 16 | 16 |\n\n---\n\n## Prometheus Metrics\n\n### Embedding\n\n| Metric | Type | Label |\n|---|---|---|\n| `embedding_requests_total` | Counter | `status`, `endpoint` |\n| `embedding_requests_rejected_total` | Counter | `reason` |\n| `embedding_sentences_processed_total` | Counter | - |\n| `embedding_request_duration_seconds` | 
Histogram | `endpoint` |\n| `embedding_batch_size` | Histogram | - |\n| `embedding_gpu_inference_duration_seconds` | Histogram | - |\n| `embedding_queue_size` | Gauge | - |\n| `embedding_active_requests` | Gauge | - |\n| `embedding_gpu_memory_allocated_bytes` | Gauge | Legacy process GPU allocated memory, kept for existing dashboards |\n| `embedding_gpu_memory_reserved_bytes` | Gauge | Legacy process GPU reserved memory, kept for existing dashboards |\n| `embedding_server_info` | Info | `model`, `dense_embedding_model`, `bge_embed_max_length`, `qwen_embed_max_length`, `bge_rerank_max_length`, `qwen_rerank_max_length`, `version`, `gpu_available`, `device` |\n\n### GPU Process\n\nThese gauges are process-level CUDA readings updated after embedding and rerank\ninference. They include all loaded models and both endpoint paths.\n\n| Metric | Type | Description |\n|---|---|---|\n| `gpu_memory_allocated_bytes` | Gauge | Process GPU tensor memory allocated by PyTorch |\n| `gpu_memory_reserved_bytes` | Gauge | Process GPU memory reserved by the PyTorch caching allocator |\n| `gpu_memory_free_bytes` | Gauge | CUDA device free memory from `torch.cuda.mem_get_info()` |\n| `gpu_memory_total_bytes` | Gauge | CUDA device total memory from `torch.cuda.mem_get_info()` |\n\n### Reranker\n\n| Metric | Type | Label |\n|---|---|---|\n| `rerank_requests_total` | Counter | `status` |\n| `rerank_requests_rejected_total` | Counter | `reason` |\n| `rerank_pairs_processed_total` | Counter | - |\n| `rerank_request_duration_seconds` | Histogram | - |\n| `rerank_inference_duration_seconds` | Histogram | - |\n| `rerank_active_requests` | Gauge | - |\n\n### Useful PromQL Queries\n\n```promql\n# Embedding throughput (req/sec)\nrate(embedding_requests_total[1m])\n\n# Latency P95\nhistogram_quantile(0.95, rate(embedding_request_duration_seconds_bucket[5m]))\n\n# Error rate (%); sum() on both sides so the error series divide by all statuses, not by themselves\nsum(rate(embedding_requests_total{status=\"error\"}[5m])) / sum(rate(embedding_requests_total[5m])) * 100\n\n# GPU tensor memory allocated 
by PyTorch (GB)\ngpu_memory_allocated_bytes / 1024 / 1024 / 1024\n\n# GPU memory reserved by PyTorch caching allocator (GB)\ngpu_memory_reserved_bytes / 1024 / 1024 / 1024\n\n# CUDA device memory visible to the process (GB)\ngpu_memory_free_bytes / 1024 / 1024 / 1024\n\n# Reranker throughput (pairs/sec)\nrate(rerank_pairs_processed_total[1m])\n```\n\n### Setup Prometheus\n\n```yaml\n# prometheus.yml\nglobal:\n  scrape_interval: 15s\n\nscrape_configs:\n  - job_name: 'smart-embedder'\n    static_configs:\n      - targets: ['localhost:8000']\n    metrics_path: '/metrics'\n```\n\n### Grafana Dashboard - Recommended Panels\n\n1. **Embedding Request Rate** - `rate(embedding_requests_total[1m])`\n2. **Latency P50/P95/P99** - `histogram_quantile(0.X, ...)`\n3. **Queue Size** - `embedding_queue_size`\n4. **GPU Memory** - `gpu_memory_allocated_bytes`, `gpu_memory_reserved_bytes`, `gpu_memory_free_bytes`\n5. **Rerank Request Rate** - `rate(rerank_requests_total[1m])`\n6. **Batch Size Distribution** - `embedding_batch_size`\n\n---\n\n## Security and Limits\n\n### Rate Limiting\n- **Algorithm**: Token Bucket per IP\n- **Limit**: `RATE_LIMIT_REQUESTS_PER_MINUTE=3600` req/min, `RATE_LIMIT_BURST_SIZE=120`\n- **Response**: HTTP `429` with header `Retry-After: 60`\n\n### GPU Execution\n- Embedding and rerank inference share a **single-worker GPU executor**, so the\n  two paths never run forward passes concurrently. This bounds peak VRAM to the\n  larger resident model instead of the sum, preventing concurrency-driven CUDA\n  OOM on small GPUs. 
The CUDA default stream already serializes kernels, so this\n  costs effectively no throughput.\n\n### Backpressure\n- **/embeddings/ queue max**: `MAX_QUEUE_SIZE=200`\n- **/rerank slots max**: `RERANK_MAX_QUEUE=32` (admission bound on the shared single-worker GPU executor)\n- **Acquire timeout**: 0.5s\n- Rejections are reflected in both `/stats` (`rejected_requests`) and Prometheus (`embedding_requests_rejected_total` or `rerank_requests_rejected_total`, depending on endpoint).\n- Rate limit uses direct connection IP (`request.client.host`). If the server is behind a trusted reverse proxy, update the middleware to extract IP from `X-Forwarded-For`.\n\n### Timeout\n- `REQUEST_TIMEOUT=90s` is the global HTTP timeout (504 to the caller).\n- `GPU_PROCESS_TIMEOUT=15s` (CUDA) / `30s` (CPU) limits embedding batch inference on the thread pool.\n- `RERANK_GPU_TIMEOUT=60s` limits rerank inference and should stay below `REQUEST_TIMEOUT`.\n- Timeouts are tracked in Prometheus as `embedding_requests_total{status=\"timeout\"}` or `rerank_requests_total{status=\"timeout\"}`.\n\n### Graceful Shutdown\n- Blocks new requests (middleware)\n- Waits for queue drain\n- Completes in-flight requests (max 30s)\n- Cancels processing loop and closes the shared GPU executor\n\n---\n\n## Troubleshooting\n\n### Server Won't Start\n\n```bat\npython -c \"import torch; print(torch.cuda.is_available())\"\npip install -r requirements-gpu.txt --upgrade\n```\n\n### `429 Too Many Requests` Errors\nClient exceeds rate limit. Increase `RATE_LIMIT_REQUESTS_PER_MINUTE` or reduce call frequency.\n\n### `503 Service Unavailable` Errors\nQueue is full. Increase `MAX_QUEUE_SIZE` or scale horizontally with a load balancer.\n\n### `504 Gateway Timeout` Errors\nEmbedding inference exceeded `GPU_PROCESS_TIMEOUT` (15s on CUDA, 30s on CPU)\nor rerank inference exceeded `RERANK_GPU_TIMEOUT`. 
Reduce batch size or check\nGPU availability.\n\n### Prometheus Metrics Not Visible\n```bash\ncurl http://localhost:8000/metrics\n```\nVerify that the target in `prometheus.yml` is reachable and that port 8000 is not blocked by a firewall.\n\n### Docker: GPU Not Detected in Container\n```bash\n# Verify NVIDIA Container Toolkit\ndocker run --rm --gpus all nvidia/cuda:12.6.3-runtime-ubuntu22.04 nvidia-smi\n```\nIf it fails, reinstall the NVIDIA Container Toolkit and restart Docker.\n\n### Docker: CUDA Tag Not Found\n```\nError: manifest for nvidia/cuda:12.6.3-runtime-ubuntu22.04 not found\n```\nFind the correct tag on [hub.docker.com/r/nvidia/cuda/tags](https://hub.docker.com/r/nvidia/cuda/tags) and update the first line of `Dockerfile.gpu`.\n\n### Docker: Container Unhealthy on First Startup\nDefault Compose and Dockerfile healthchecks allow a 300s startup period for\nfirst-run model downloads. On slow networks or empty caches, increase the\nhealthcheck start period above 300s in your custom Compose override, for example:\n```yaml\nstart_period: 600s\n```\n\n---\n\n## References\n\n- [BAAI/bge-m3 - Hugging Face](https://huggingface.co/BAAI/bge-m3)\n- [Qwen/Qwen3-Embedding-0.6B - Hugging Face](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B)\n- [BAAI/bge-reranker-v2-m3 - Hugging Face](https://huggingface.co/BAAI/bge-reranker-v2-m3)\n- [Qwen/Qwen3-Reranker-0.6B - Hugging Face](https://huggingface.co/Qwen/Qwen3-Reranker-0.6B)\n- [FlagEmbedding - GitHub](https://github.com/FlagOpen/FlagEmbedding)\n- [FastAPI Documentation](https://fastapi.tiangolo.com/)\n- [Prometheus Python Client](https://github.com/prometheus/client_python)\n\n---\n\n## License\n\nFollows the selected model licenses (`BAAI/bge-m3`, `BAAI/bge-reranker-v2-m3`,\nand optionally `Qwen/Qwen3-Embedding-0.6B` and 
`Qwen/Qwen3-Reranker-0.6B`).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsmart-models%2Fsmart-embedder","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsmart-models%2Fsmart-embedder","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsmart-models%2Fsmart-embedder/lists"}