{"id":33918797,"url":"https://github.com/ml-rust/blazr","last_synced_at":"2026-01-11T16:02:40.078Z","repository":{"id":328289242,"uuid":"1110588200","full_name":"ml-rust/blazr","owner":"ml-rust","description":"A blazing-fast inference server for hybrid neural architectures, supporting Mamba SSM, Multi-Head Latent Attention (MLA), Mixture of Experts (MoE), and standard transformers.","archived":false,"fork":false,"pushed_at":"2025-12-05T12:23:08.000Z","size":82,"stargazers_count":1,"open_issues_count":2,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-12-13T14:44:10.891Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ml-rust.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-12-05T12:22:58.000Z","updated_at":"2025-12-10T04:17:48.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/ml-rust/blazr","commit_stats":null,"previous_names":["ml-rust/blazr"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/ml-rust/blazr","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ml-rust%2Fblazr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ml-rust%2Fblazr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ml-rust%2Fblazr/releases","manifests_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/ml-rust%2Fblazr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ml-rust","download_url":"https://codeload.github.com/ml-rust/blazr/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ml-rust%2Fblazr/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28312170,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-11T14:58:17.114Z","status":"ssl_error","status_checked_at":"2026-01-11T14:55:53.580Z","response_time":60,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-12-12T08:33:31.986Z","updated_at":"2026-01-11T16:02:40.066Z","avatar_url":"https://github.com/ml-rust.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# blazr\n\n[![Rust](https://img.shields.io/badge/rust-1.70%2B-orange.svg)](https://www.rust-lang.org) [![License: Apache-2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![Build Status](https://img.shields.io/badge/build-passing-brightgreen.svg)]()\n\nA blazing-fast inference server for hybrid neural architectures, supporting Mamba2 SSM, Multi-Head Latent Attention (MLA), Mixture of Experts (MoE), and standard transformers.\n\n## Features\n\n- **Auto-detection** - Automatically detects model architecture, format 
(HuggingFace vs oxidizr), and tokenizer vocabulary from checkpoint tensors\n- **Hybrid Architecture Support** - Seamlessly handles mixed Mamba2 and attention layers in a single model\n- **HuggingFace Compatible** - Loads standard HuggingFace Llama models (tested with llama3.2-1b) alongside custom oxidizr checkpoints\n- **OpenAI-Compatible API** - Drop-in replacement with `/v1/completions` and `/v1/chat/completions` endpoints\n- **High Performance** - Written in Rust using the Candle ML framework with optional CUDA acceleration\n- **Multiple Tokenizers** - Supports cl100k_base, o200k_base, llama3, and deepseek_v3 vocabularies via splintr\n\n## Quick Start\n\n### Installation\n\n```bash\n# Clone the repository\ngit clone https://github.com/farhan-syah/blazr.git\ncd blazr\n\n# Build (CPU-only)\ncargo build --release\n\n# Build with CUDA support (requires CUDA 12.x)\ncargo build --release --features cuda\n```\n\n### Basic Usage\n\n#### Generate Text\n\n```bash\nblazr generate \\\n  --model ./checkpoints/nano \\\n  --prompt \"Once upon a time\" \\\n  --max-tokens 100 \\\n  --vocab llama3\n```\n\n#### Start Server\n\n```bash\nblazr serve --model ./checkpoints/nano --port 8080\n```\n\nThen make API requests:\n\n```bash\ncurl http://localhost:8080/v1/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"prompt\": \"Hello, world!\",\n    \"max_tokens\": 50,\n    \"temperature\": 0.7\n  }'\n```\n\n#### Model Info\n\n```bash\nblazr info --model ./checkpoints/nano\n```\n\n## Supported Architectures\n\nblazr auto-detects and supports:\n\n- **Mamba2** - State Space Models with selective attention\n- **MLA** - Multi-Head Latent Attention with compressed KV cache\n- **MoE** - Mixture of Experts with top-k routing and optional shared expert\n- **Standard Transformers** - GQA (Grouped Query Attention) with MLP layers\n\nModels can mix and match these layer types freely.\n\n### Auto-Detection\n\nblazr automatically detects:\n- **Architecture** - Identifies layer 
types (Mamba2, MLA, MoE, Transformer) from tensor name patterns\n- **Model Format** - Distinguishes between oxidizr format (`layers.X.`) and HuggingFace format (`model.layers.X.`)\n- **Tokenizer Vocabulary** - Infers vocabulary from `vocab_size` if `--vocab` is not specified:\n  - ~100k tokens → `cl100k_base`\n  - ~128k tokens → `llama3`\n  - ~129k tokens → `deepseek_v3`\n  - ~200k tokens → `o200k_base`\n\n## Tokenizer\n\nblazr uses [splintr](https://github.com/farhan-syah/splintr) for high-performance BPE tokenization with pretrained vocabularies.\n\n### Supported Vocabularies\n\n| Vocabulary     | Description                    | Vocab Size | Use Case                          |\n|----------------|--------------------------------|------------|-----------------------------------|\n| `cl100k_base`  | GPT-4, GPT-3.5-turbo          | ~100k      | OpenAI-compatible models          |\n| `o200k_base`   | GPT-4o                        | ~200k      | Extended multilingual support     |\n| `llama3`       | Meta Llama 3 family           | ~128k      | Llama 3.x models (default)        |\n| `deepseek_v3`  | DeepSeek V3/R1                | ~129k      | DeepSeek models                   |\n\nAll vocabularies include 54 agent tokens for chat, reasoning, and tool-use applications.\n\n### Custom Vocabularies\n\nCustom vocabularies are not yet supported. If you need a custom vocabulary, you have two options:\n1. Train your model with one of the supported vocabularies above\n2. 
Modify blazr's tokenizer module to load your `.tiktoken` file (base64-encoded tokens with ranks)\n\n## Documentation\n\n- [API Reference](docs/api.md) - Complete API endpoint documentation\n- [Architecture](docs/architecture.md) - Technical details on hybrid model support\n- [Configuration](docs/configuration.md) - Model configuration and tuning options\n\n## CLI Commands\n\n```bash\n# Generate text from a prompt\nblazr generate --model \u003cpath\u003e --prompt \"text\" [OPTIONS]\n\n# Start inference server\nblazr serve --model \u003cpath\u003e [--port 8080] [--host 0.0.0.0]\n\n# Display model configuration\nblazr info --model \u003cpath\u003e\n\n# Decode token IDs (debugging)\nblazr decode --ids \"123,456,789\" --vocab llama3\n```\n\n### Options\n\n**Generation:**\n- `--model` - Model path (local directory or HuggingFace ID like `meta-llama/Llama-3.2-1B`)\n- `--prompt` - Input text prompt\n- `--max-tokens` - Maximum tokens to generate (default: 100)\n- `--temperature` - Sampling temperature (default: 0.7)\n- `--top-p` - Nucleus sampling threshold (default: 0.9)\n- `--top-k` - Top-k sampling (default: 40)\n- `--vocab` - Tokenizer vocabulary (`llama3`, `cl100k_base`, `o200k_base`, `deepseek_v3`). 
Auto-detected if not specified.\n- `--cpu` - Force CPU inference even if CUDA is available\n\n**Server:**\n- `--model` - Model path (local directory or HuggingFace ID)\n- `--port` - Port to listen on (default: 8080)\n- `--host` - Host to bind to (default: 0.0.0.0)\n- `--cpu` - Force CPU inference even if CUDA is available\n\n## Model Format\n\nblazr loads models from SafeTensors checkpoints in two formats:\n\n### oxidizr Format\n\n```\ncheckpoint_dir/\n├── model.safetensors    # Model weights\n└── config.json          # Model configuration (optional)\n```\n\nTensor naming: `embed_tokens`, `layers.X.mamba2`, `layers.X.self_attn`, `lm_head`\n\n### HuggingFace Format\n\n```\ncheckpoint_dir/\n├── model.safetensors    # Model weights\n└── config.json          # Standard HuggingFace config\n```\n\nTensor naming: `model.embed_tokens`, `model.layers.X.self_attn`, `lm_head`\n\nblazr automatically detects the format and architecture from tensor names. If `config.json` is missing or incomplete, all parameters are inferred from tensor shapes.\n\n## Requirements\n\n- Rust 1.70 or later\n- (Optional) CUDA 12.x for GPU acceleration\n\n## License\n\nApache-2.0 License - see [LICENSE](LICENSE) for details.\n\n## Related Projects\n\n- [oxidizr](https://github.com/farhan-syah/oxidizr) - Training framework for hybrid Mamba2 + MLA + MoE architectures\n- [splintr](https://github.com/farhan-syah/splintr) - High-performance BPE tokenizer with Python bindings\n\n## Contributing\n\nContributions are welcome! Please open an issue or submit a pull request.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fml-rust%2Fblazr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fml-rust%2Fblazr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fml-rust%2Fblazr/lists"}