{"id":37231705,"url":"https://github.com/tzervas/unsloth-rs","last_synced_at":"2026-01-25T04:03:00.696Z","repository":{"id":332418098,"uuid":"1129034271","full_name":"tzervas/unsloth-rs","owner":"tzervas","description":"Memory-optimized GPU kernels for LLM fine-tuning in Rust (2-5x speedup, 70-80% less VRAM)","archived":false,"fork":false,"pushed_at":"2026-01-24T07:13:00.000Z","size":454,"stargazers_count":0,"open_issues_count":13,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-01-24T16:45:13.774Z","etag":null,"topics":["cuda","gpu","machine-learning","optimization","rust"],"latest_commit_sha":null,"homepage":null,"language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tzervas.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":"ROADMAP.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-06T14:11:28.000Z","updated_at":"2026-01-24T07:13:03.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/tzervas/unsloth-rs","commit_stats":null,"previous_names":["tzervas/unsloth-rs"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/tzervas/unsloth-rs","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tzervas%2Funsloth-rs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tzervas%2Funsloth-rs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tzervas%2Funsloth-rs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tzervas%2Funsloth-rs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tzervas","download_url":"https://codeload.github.com/tzervas/unsloth-rs/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tzervas%2Funsloth-rs/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28742983,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-25T02:46:29.005Z","status":"ssl_error","status_checked_at":"2026-01-25T02:44:29.968Z","response_time":113,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","gpu","machine-learning","optimization","rust"],"created_at":"2026-01-15T03:45:28.759Z","updated_at":"2026-01-25T04:03:00.677Z","avatar_url":"https://github.com/tzervas.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# unsloth-rs\n\nRust implementations of transformer building blocks for LLM inference and fine-tuning.\n\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)\n\n## Overview\n\n`unsloth-rs` provides Rust implementations of common transformer operations built on the [Candle](https://github.com/huggingface/candle) ML framework:\n\n- Multi-head attention with grouped-query attention (GQA) support\n- Rotary position embeddings (RoPE)\n- RMS normalization\n- SwiGLU activation\n\n## Status\n\n**⚠️ Early Development** - This project is in early development. Current implementations are CPU reference implementations with GPU dispatch that uses Candle's CUDA backend.\n\n### Implemented\n- ✅ Multi-head attention (CPU reference, Candle CUDA backend)\n- ✅ Rotary position embeddings (RoPE)\n- ✅ RMS normalization\n- ✅ SwiGLU activation\n- ✅ Memory estimation utilities\n- ✅ Ternary quantization (5-15x compression achieved)\n- ✅ Mixed precision training utilities (FP32/FP16/BF16)\n- ✅ Benchmarking suite (CPU)\n- ✅ 148 passing tests (100% pass rate)\n\n### In Progress\n- 🚧 Flash Attention CubeCL GPU kernel (Phase 1 complete, Phase 2 ready for RTX 5080 validation)\n- 🚧 Ternary GPU kernels (Phase 2-4 implemented, awaiting GPU profiling)\n- 🚧 CI/CD pipeline setup\n\n### Planned\n- ⏳ Gradient checkpointing (configuration exists, implementation planned)\n- ⏳ GPU performance validation on RTX 5080/3090 Ti\n- ⏳ RoPE, RMSNorm, SwiGLU GPU kernels\n- ⏳ Advanced sparsity optimizations\n- ⏳ Multi-GPU support\n\n## Installation\n\n```toml\n[dependencies]\nunsloth-rs = \"0.1\"\n```\n\nFor CUDA support (uses Candle's CUDA backend):\n\n```toml\n[dependencies]\nunsloth-rs = { version = \"0.1\", features = [\"cuda\"] }\n```\n\n## Usage\n\n### Attention\n\n```rust\nuse unsloth_rs::kernels::{FusedAttention, FusedAttentionConfig};\nuse candle_core::{Device, Tensor};\n\nfn main() -\u003e anyhow::Result\u003c()\u003e {\n    let device = Device::Cpu;\n    \n    let config = FusedAttentionConfig {\n        hidden_size: 768,\n        num_heads: 12,\n        head_dim: 64,\n        num_kv_heads: Some(4),  // GQA support\n        ..Default::default()\n    };\n    \n    let attention = FusedAttention::new(config, \u0026device)?;\n    \n    // Create random input tensor: randn(mean, std_dev, shape, device)\n    // 0.0f32 is Rust syntax for a 32-bit float literal with value 0.0\n    let hidden_states = Tensor::randn(0.0f32, 1.0, (1, 128, 768), \u0026device)?;\n    let output = attention.forward(\u0026hidden_states, None, None)?;\n    \n    Ok(())\n}\n```\n\n### Memory Estimation\n\n```rust\nuse unsloth_rs::memory::{estimate_forward_memory, CheckpointConfig};\n\nfn main() {\n    let checkpoint = CheckpointConfig {\n        enabled: true,\n        checkpoint_every: 2,\n    };\n    \n    let mem_bytes = estimate_forward_memory(\n        4,     // batch_size\n        2048,  // seq_len\n        4096,  // hidden_size\n        32,    // num_layers\n        \u0026checkpoint,\n    );\n    \n    println!(\"Estimated memory: {} GB\", mem_bytes as f64 / 1e9);\n}\n```\n\n## Benchmarks\n\nRun benchmarks with:\n\n```bash\ncargo bench\n```\n\nBenchmarks test CPU performance across various configurations. GPU benchmarks require the `cuda` feature.\n\n## Development Roadmap\n\nFor detailed development plans and task breakdowns, see:\n\n- **[ROADMAP.md](ROADMAP.md)** - Strategic development plan with phases and timelines\n- **[TASKS.md](TASKS.md)** - Actionable task list with priorities and estimates\n- **[SUMMARY.md](SUMMARY.md)** - Project review summary and execution guide\n\n## Contributing\n\nContributions are welcome, particularly:\n- GPU kernel implementations using CubeCL\n- Performance optimizations\n- Additional transformer operations\n\nSee [TASKS.md](TASKS.md) for specific tasks that need implementation.\n\n## License\n\nLicensed under the MIT License. See [LICENSE](LICENSE) for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftzervas%2Funsloth-rs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftzervas%2Funsloth-rs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftzervas%2Funsloth-rs/lists"}