{"id":50547107,"url":"https://github.com/artain-ai/ignite-ms","last_synced_at":"2026-06-04T00:00:40.502Z","repository":{"id":358757740,"uuid":"1242800193","full_name":"Artain-AI/ignite-ms","owner":"Artain-AI","description":"Fast self-hosted embedding engine for search, RAG, and reindexing workloads on NVIDIA GPUs. Built in Rust + TensorRT for teams that care about scale, cost, and control.","archived":false,"fork":false,"pushed_at":"2026-06-01T03:57:40.000Z","size":142,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-01T05:24:12.866Z","etag":null,"topics":["batch-inference","batch-processing","cuda","embeddings","gpu","high-performance","huggingface","machine-learning","multi-gpu","nlp","rag","rust","self-hosted","semantic-search","tensorrt","text-embeddings","vector-search"],"latest_commit_sha":null,"homepage":"https://dev.to/artain/embedding-685-million-texts-in-32-minutes-46o7","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Artain-AI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE","maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":"CLA.md"}},"created_at":"2026-05-18T19:11:13.000Z","updated_at":"2026-06-01T03:57:30.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/Artain-AI/ignite-ms","commit_stats":null,"previous_names":["artain-ai/ignite-ms"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/Artain-AI
/ignite-ms","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Artain-AI%2Fignite-ms","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Artain-AI%2Fignite-ms/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Artain-AI%2Fignite-ms/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Artain-AI%2Fignite-ms/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Artain-AI","download_url":"https://codeload.github.com/Artain-AI/ignite-ms/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Artain-AI%2Fignite-ms/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33884734,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-03T02:00:06.370Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["batch-inference","batch-processing","cuda","embeddings","gpu","high-performance","huggingface","machine-learning","multi-gpu","nlp","rag","rust","self-hosted","semantic-search","tensorrt","text-embeddings","vector-search"],"created_at":"2026-06-04T00:00:28.008Z","updated_at":"2026-06-04T00:00:40.443Z","avatar_url":"https://github.com/Artain-AI.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 
IgniteMS\n\n[![License](https://img.shields.io/github/license/Artain-AI/ignite-ms)](LICENSE)\n[![Release](https://img.shields.io/github/v/release/Artain-AI/ignite-ms)](https://github.com/Artain-AI/ignite-ms/releases)\n\n256,000 msg/s on 8x A100. Up to 3.6x faster than Hugging Face TEI on the same hardware.\n\n*357,893 msg/s sustained in production with workload-specific tuning.*\n\nIgniteMS is a batch text embedding engine. Rust, native TensorRT, no Python at runtime. You give it text, it gives you embeddings.\n\nUse it for workloads where millions of texts need embeddings quickly: vector DB reindexing, search rebuilds after model swaps, corpus-scale processing.\n\n## Numbers\n\np4d.24xlarge (8x A100 80GB), 1M MSMARCO passages, TensorRT 11 mixed precision:\n\n| Model | GPUs | msg/s | tok/s | TEI msg/s | Speedup |\n|-------|-----:|------:|------:|----------:|--------:|\n| e5-small-v2 | 1 | 56,002 | 2,860,377 | 16,412 | 3.4x |\n| e5-small-v2 | 8 | 254,979 | 12,988,479 | 88,912 | 2.9x |\n| e5-small | 1 | 55,958 | 3,178,595 | 15,378 | 3.6x |\n| e5-small | 8 | 255,958 | 14,539,275 | 76,480 | 3.3x |\n| e5-base | 1 | 18,626 | 1,058,018 | 8,843 | 2.1x |\n| e5-base | 8 | 126,614 | 7,192,032 | 57,423 | 2.2x |\n| e5-large | 1 | 5,861 | 332,982 | 4,029 | 1.5x |\n| e5-large | 8 | 40,445 | 2,297,994 | 28,664 | 1.4x |\n\n### Baselines (1 GPU, e5-small-v2)\n\n| Tool | msg/s | Relative |\n|------|------:|---------:|\n| IgniteMS | 56,002 | 1.0x |\n| TEI | 16,412 | 0.29x |\n| Fastembed (ORT+CUDA) | 8,907 | 0.16x |\n| SentenceTransformers | 2,468 | 0.04x |\n\n60 models supported out of the box: E5, BGE, GTE, MiniLM, MPNet, Nomic, Jina, mxbai, Snowflake Arctic, LaBSE, stella, plus language-specific models for Chinese, French, Russian, Korean, Indonesian, and domain models for scientific/biomedical text. Supports both encoder (BERT-style) and decoder (LLM-based) architectures with mean-pool or last-token pooling. 
Works with any Hugging Face model that exports to ONNX and compiles to TensorRT. Models are downloaded and compiled on first run. See [MODELS.md](MODELS.md) for the full list with verified throughput and correctness results.\n\n### Production run\n\nReal production pipeline, not a controlled benchmark:\n\n| Metric | Value | Note |\n|--------|------:|------|\n| Messages embedded | 685,520,494 | |\n| Sustained throughput | 357,893 msg/s | average across full run |\n| Peak throughput | 506,589 msg/s | short text, GPUs saturated |\n| Low throughput | 196,676 msg/s | dense/long text files, reader-bound |\n| Wall clock | 1,915s (31.9 min) | |\n| Hardware | 1x p4d.24xlarge | 8x A100 40GB, spot |\n\nFull pipeline: read zstd-compressed social media events (Reddit, Hacker News), extract and normalize text, tokenize, infer on 8 GPUs, write aggregated parquet output. Not a GPU microbenchmark.\n\nFor cost context: at ~$12.68/hr p4d spot pricing, this production run cost about $0.01 per 1M messages embedded. On the same 68-token/message dataset, OpenAI `text-embedding-3-small` would be about $1.36 per 1M messages at current API pricing.\n\n## Why it's fast\n\nNo single trick. Just removing waste everywhere:\n\n- **TensorRT** compiles kernels specific to the GPU architecture and batch shape. Not generic ONNX or PyTorch.\n- **Bucketed batching** groups texts by token length so you're not padding a 6-token string to 512.\n- **CPU-side pipeline** keeps tokenization, batching, and GPU dispatch moving together without waiting on each other.\n- **Rust end-to-end.** No GIL, no Python request path, no HTTP serialization at runtime.\n- **Multi-GPU in one process.** Lock-free work stealing across GPUs. Most serving stacks run one container per GPU and glue them together with HTTP. 
We don't.\n- **Engine caching.** TRT engines compile once and get reused until something actually changes (model, runtime version, or batch profile).\n\n## Quickstart\n\nDocker (just needs Docker + NVIDIA runtime):\n\n```bash\npython3 quickstart.py\n```\n\nNative (needs Rust, CUDA 12+, TensorRT 11+):\n\n```bash\npython3 quickstart.py --native\n```\n\nDownloads a public dataset, embeds it, writes output. First run takes ~5 minutes for TensorRT engine compilation. After that, engines are cached and startup is instant.\n\n## Docker\n\n```bash\ndocker run --rm --gpus all \\\n  -v \"$PWD/data:/data\" \\\n  -v ignite-ms-cache:/cache \\\n  ghcr.io/artain-ai/ignite-ms:v1.1.0 \\\n  embed \\\n  --model intfloat/e5-small-v2 \\\n  --input /data/input.jsonl \\\n  --output /data/embeddings.npy \\\n  --cache-dir /cache \\\n  --gpus all\n```\n\nUse the versioned image for reproducible deployments. The current `v1.1.0` release targets TensorRT 11 mixed-precision engines. `latest` may move and is intended for quick experiments, not production pinning.\n\nDocker hosts need an NVIDIA driver, Docker, and the NVIDIA container runtime. They do not need the CUDA toolkit or TensorRT installed on the host. The image has the production CLI (`ignite-ms`), benchmark CLI (`ignite-ms-bench`), and all dependencies for model prep.\n\n## Benchmark\n\nReproduce the numbers:\n\n```bash\npython3 benchmark.py                                          # Docker, defaults\npython3 benchmark.py --mode native --model e5-small-v2        # native\npython3 benchmark.py --gpu-counts 1,8 --skip-tei              # IgniteMS only\n```\n\nDownloads data, prepares models, runs both IgniteMS and TEI, reports results. See [BENCHMARKING.md](BENCHMARKING.md) for full results, methodology, and caveats.\n\nThe benchmark reports messages/sec plus token-oriented metrics such as tokens/sec, padded tokens/sec, average sequence length, batch fill, and estimated TFLOP/s. 
Messages/sec is useful for corpus throughput; token metrics are better for comparing runs with different text lengths.\n\n## Input / Output\n\nInput: JSONL (`{\"text\": \"...\"}`) or plain text, one per line. Handles `.zst` and `.gz` compression.\n\nOutput: `.npy` (NumPy array) or `.parquet` (with IDs). Row order preserved.\n\n```bash\nignite-ms embed \\\n  --model intfloat/e5-small-v2 \\\n  --input corpus.jsonl.zst \\\n  --output embeddings.npy \\\n  --gpus all\n```\n\n## Layout\n\n```\ncrates/ignite-ms/          core engine\ncrates/ignite-ms-embed/    production CLI (ignite-ms)\ncrates/ignite-ms-bench/    benchmark CLI (ignite-ms-bench)\nnative/                    TensorRT C++ bridge\nexamples/                  library usage\nbenchmark.py               IgniteMS vs TEI benchmark\nquickstart.py              one-command demo\n```\n\n## Building from source\n\n```bash\ncargo build --release -p ignite-ms-embed\ncargo build --release -p ignite-ms-bench\n```\n\nNeeds CUDA 12+ and TensorRT 11+ headers on the host.\n\n## Requirements\n\nDocker mode: NVIDIA GPU, NVIDIA driver, Docker, NVIDIA container runtime. CUDA and TensorRT are included in the image.\n\nNative mode: NVIDIA GPU, CUDA 12+, TensorRT 11+, Rust 1.85+, Python 3.10+.\n\n## Security\n\nReport vulnerabilities privately. See [SECURITY.md](SECURITY.md).\n\n## Contributing\n\nContributions require CLA. See [CONTRIBUTING.md](CONTRIBUTING.md).\n\n## License\n\nApache 2.0.\n\nArtain may offer future versions under different terms. Versions released under Apache 2.0 stay Apache 2.0.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fartain-ai%2Fignite-ms","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fartain-ai%2Fignite-ms","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fartain-ai%2Fignite-ms/lists"}