{"id":47528986,"url":"https://github.com/fab2s/flodl","last_synced_at":"2026-04-18T22:11:15.948Z","repository":{"id":343798298,"uuid":"1177322631","full_name":"fab2s/floDl","owner":"fab2s","description":"rust recursive deep learning framework","archived":false,"fork":false,"pushed_at":"2026-04-13T16:53:15.000Z","size":18834,"stargazers_count":36,"open_issues_count":1,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-04-13T18:30:53.435Z","etag":null,"topics":["deep-learning","graph","machine-learning","rust"],"latest_commit_sha":null,"homepage":"https://flodl.dev","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/fab2s.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-09T23:15:59.000Z","updated_at":"2026-04-08T22:14:53.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/fab2s/floDl","commit_stats":null,"previous_names":["fab2s/rdl","fab2s/flodl"],"tags_count":10,"template":false,"template_full_name":null,"purl":"pkg:github/fab2s/floDl","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fab2s%2FfloDl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fab2s%2FfloDl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fab2s%2FfloDl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fab2s
%2FfloDl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/fab2s","download_url":"https://codeload.github.com/fab2s/floDl/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fab2s%2FfloDl/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31817128,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-14T18:05:02.291Z","status":"ssl_error","status_checked_at":"2026-04-14T18:05:01.765Z","response_time":153,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","graph","machine-learning","rust"],"created_at":"2026-03-27T20:52:04.380Z","updated_at":"2026-04-18T22:11:15.927Z","avatar_url":"https://github.com/fab2s.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/fab2s/floDl/main/docs/floDl.png\" alt=\"floDl\" width=\"640\"\u003e\n\u003c/p\u003e\n\n\u003ch1 align=\"center\"\u003efloDl\u003c/h1\u003e\n\n\u003cp align=\"center\"\u003e\nA Rust-native deep learning framework built on libtorch.\u003cbr\u003e\nSame GPU kernels as PyTorch. No Python. No GIL. No GC. 
Just Rust.\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://flodl.dev\"\u003e\u003cimg src=\"https://img.shields.io/badge/web-flodl.dev-6c8cff\" alt=\"Website\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/fab2s/floDl/actions\"\u003e\u003cimg src=\"https://github.com/fab2s/floDl/actions/workflows/ci.yml/badge.svg\" alt=\"CI\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://crates.io/crates/flodl\"\u003e\u003cimg src=\"https://img.shields.io/crates/v/flodl.svg\" alt=\"crates.io\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://docs.rs/flodl\"\u003e\u003cimg src=\"https://docs.rs/flodl/badge.svg\" alt=\"docs.rs\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/fab2s/floDl/blob/main/LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/license-MIT-blue.svg\" alt=\"MIT License\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"#if-you-know-pytorch-you-know-flodl\"\u003ePyTorch Users\u003c/a\u003e \u0026bull;\n  \u003ca href=\"https://flodl.dev/thesis\"\u003e\u003cb\u003eThesis\u003c/b\u003e\u003c/a\u003e \u0026bull;\n  \u003ca href=\"#getting-started\"\u003eGetting Started\u003c/a\u003e \u0026bull;\n  \u003ca href=\"#the-graph-builder\"\u003eGraph Builder\u003c/a\u003e \u0026bull;\n  \u003ca href=\"#graph-tree-hierarchical-composition\"\u003eGraph Tree\u003c/a\u003e \u0026bull;\n  \u003ca href=\"#the-training-experience\"\u003eTraining\u003c/a\u003e \u0026bull;\n  \u003ca href=\"#multi-gpu-training\"\u003eMulti-GPU\u003c/a\u003e \u0026bull;\n  \u003ca href=\"#pytorch-parity\"\u003eParity\u003c/a\u003e \u0026bull;\n  \u003ca href=\"#performance\"\u003eBenchmarks\u003c/a\u003e \u0026bull;\n  \u003ca href=\"https://github.com/fab2s/floDl/blob/main/ROADMAP.md\"\u003eRoadmap\u003c/a\u003e \u0026bull;\n  \u003ca href=\"https://github.com/fab2s/floDl/blob/main/docs/pytorch_migration.md\"\u003eMigration Guide\u003c/a\u003e \u0026bull;\n  \u003ca 
href=\"https://github.com/fab2s/floDl/blob/main/docs/tutorials/13-data-loading.md\"\u003eData Loading\u003c/a\u003e\n\u003c/p\u003e\n\n---\n\n\u003e **What's new in 0.5.0** -- the `fdl` CLI maturity pass. New proc-macro\n\u003e crate [`flodl-cli-macros`](https://crates.io/crates/flodl-cli-macros)\n\u003e adds `#[derive(FdlArgs)]` -- any Rust binary gets typed argv parsing,\n\u003e JSON schema, shell completions, and env-var fallback for free.\n\u003e `fdl.yml` consolidates to a single `commands:` map with three clean\n\u003e kinds (`run:` / `path:` / preset). New\n\u003e [`--env` overlays](docs/cli.md#environment-overlays) and\n\u003e [`fdl config show`](docs/cli.md#fdl-config) surface per-environment\n\u003e config with per-field origin annotations, so you can see the\n\u003e resolved YAML before running a two-hour job. Migration from 0.4.0:\n\u003e see [UPGRADE.md](UPGRADE.md).\n\n---\n\n## If You Know PyTorch, You Know floDl\n\n\u003ctable\u003e\n\u003ctr\u003e\u003cth\u003ePyTorch\u003c/th\u003e\u003cth\u003efloDl\u003c/th\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003e\n\n```python\nmodel = nn.Sequential(\n    nn.Linear(2, 16),\n    nn.GELU(),\n    nn.LayerNorm(16),\n    nn.Linear(16, 2),\n)\n\npred = model(x)\nloss = F.mse_loss(pred, target)\nloss.backward()\noptimizer.step()\n```\n\n\u003c/td\u003e\u003ctd\u003e\n\n```rust\nlet model = FlowBuilder::from(Linear::new(2, 16)?)\n    .through(GELU)\n    .through(LayerNorm::new(16)?)\n    .through(Linear::new(16, 2)?)\n    .build()?;\n\nlet pred = model.forward(\u0026x)?;\nlet loss = mse_loss(\u0026pred, \u0026target)?;\nloss.backward()?;\noptimizer.step()?;\n```\n\n\u003c/td\u003e\u003c/tr\u003e\n\u003c/table\u003e\n\nSame concepts, same names, same GPU kernels underneath. The `?` operator\nreplaces silent failures with compile-time error handling. `Drop` replaces the\ngarbage collector. 
The [full migration guide](https://github.com/fab2s/floDl/blob/main/docs/pytorch_migration.md) covers\nevery op, module, and pattern.\n\n\u003e **New to Rust?** Read [Rust for PyTorch Users](https://github.com/fab2s/floDl/blob/main/docs/tutorials/00-rust-primer.md) — 10 patterns in 15 minutes.\n\n## Getting Started\n\n**With the CLI** (recommended, no Rust needed):\n\n```bash\ncurl -sL https://flodl.dev/fdl -o fdl \u0026\u0026 chmod +x fdl\n./fdl setup          # detect hardware, download libtorch, configure build environment\n./fdl init my-proj   # scaffold a new project with training template\n```\n\nThe `fdl` script auto-downloads a pre-compiled CLI binary (~750KB, pure Rust,\nno libtorch dependency). It detects your GPUs, downloads the right libtorch\nvariant, and configures Docker or native builds. See the [full CLI\nreference](docs/cli.md) for all commands.\n\n**One-liner with Docker** (no Rust, no setup):\n\n```bash\ncurl -sL https://flodl.dev/init.sh | sh -s my-project\ncd my-project\n./fdl build   # first build (~5 min, downloads libtorch)\n./fdl run     # train the model\n```\n\n**Native** -- [Rust](https://rustup.rs/) 1.85+ and libtorch:\n\n```bash\n./fdl libtorch download    # auto-detects CPU or CUDA\ncargo add flodl \u0026\u0026 cargo build\n```\n\nFor CUDA: `cargo add flodl --features cuda` + [CUDA toolkit](https://developer.nvidia.com/cuda-downloads).\n\n\u003e **Using tch-rs or PyTorch C++?** `fdl` also works as a standalone\n\u003e libtorch manager outside of flodl: download any CPU/CUDA variant,\n\u003e switch between installs, compile from source for mixed GPU\n\u003e architectures (e.g. sm_61 + sm_120 in one build), and emit a\n\u003e machine-readable diagnostics report. No flodl buy-in required.\n\u003e See [docs/cli.md § Standalone](docs/cli.md#1-standalone-no-project-required)\n\u003e and the [`flodl-cli` crate](https://crates.io/crates/flodl-cli).\n\nBoth paths generate an annotated training template. 
Edit `src/main.rs` to\nbuild your model:\n\n```rust\nuse flodl::*;\n\nlet model = FlowBuilder::from(Linear::new(2, 16)?)\n    .through(GELU)\n    .through(LayerNorm::new(16)?)\n    .also(Linear::new(16, 16)?)     // residual connection\n    .through(Linear::new(16, 2)?)\n    .build()?;\n\nlet params = model.parameters();\nlet mut optimizer = Adam::new(\u0026params, 0.01);\nmodel.train();\n\nfor (input_t, target_t) in \u0026batches {\n    let input = Variable::new(input_t.clone(), true);\n    let target = Variable::new(target_t.clone(), false);\n\n    let pred = model.forward(\u0026input)?;\n    let loss = mse_loss(\u0026pred, \u0026target)?;\n\n    optimizer.zero_grad();\n    loss.backward()?;\n    clip_grad_norm(\u0026params, 1.0)?;\n    optimizer.step()?;\n}\n```\n\n## The Graph Builder\n\nfloDl's fluent graph builder lets you describe complex architectures as\nreadable data flow — no boilerplate, no `nn.Module` subclassing.\n\n```rust\nlet model = FlowBuilder::from(Linear::new(2, 16)?)\n    .through(GELU)                        // activation\n    .through(LayerNorm::new(16)?)         // normalization\n    .also(Linear::new(16, 16)?)           // residual connection\n    .through(Linear::new(16, 2)?)         // output projection\n    .build()?;\n```\n\n`build()` returns a `Graph` that implements `Module` — you can nest it\ninside other graphs. 
Things get interesting when architectures get complex:\n\n```rust\nlet g = FlowBuilder::from(encoder).tag(\"encoded\")\n    .split(modules![head_a, head_b, head_c]).merge(MergeOp::Mean)\n    .loop_body(refinement_block).for_n(3).tag(\"refined\")\n    .gate(router, modules![expert_a, expert_b]).using(\u0026[\"encoded\"])\n    .switch(selector, modules![light_path, heavy_path]).using(\u0026[\"refined\"])\n    .through(StateAdd).using(\u0026[\"memory\"]).tag(\"memory\")\n    .loop_body(decoder).while_cond(halt_condition, 10)\n    .through(output_head)\n    .build()?;\n```\n\nEvery construct — `split/merge`, `also`, `loop_body`, `gate`, `switch`, `map`,\n`tag/using` — composes cleanly. Forward references (`using` before `tag`) carry\nstate across calls, enabling recurrent architectures without special-casing.\n\n| Method | What it does |\n|--------|-------------|\n| `from(m).through(m)` | Linear chain |\n| `also(m)` | Residual: `input + m(input)` |\n| `fork(m)` | Side branch: capture output as tag, stream continues |\n| `split(modules![...]).merge(op)` | Parallel branches, merged by `Add` or `Mean` |\n| `tag(name)` / `using(refs)` | Named references — backward or forward (across calls) |\n| `loop_body(body).for_n(n)` | Fixed iteration with BPTT |\n| `loop_body(body).while_cond` / `until_cond` | Conditional loops |\n| `gate(router, modules![...])` | Soft routing — weighted combination |\n| `switch(selector, modules![...])` | Hard routing — only selected branch |\n| `map(body).each()` / `.over(tag)` / `.slices(n)` | Element-wise, tagged, or sliced iteration |\n| `input(names)` | Auxiliary graph inputs for multi-input architectures |\n\nSee the **[Graph Builder Tutorial](https://github.com/fab2s/floDl/blob/main/docs/tutorials/05-graph-builder.md)** and\nthe [full showcase](https://github.com/fab2s/floDl/tree/main/flodl/examples/showcase/).\n\n## Graph Tree: Hierarchical Composition\n\nThis is where floDl goes beyond PyTorch. 
Graphs nest inside graphs with\n**label-path addressing** — dot-separated paths that let you reach into any\nsubgraph from the root. Train components independently, compose them into\nlarger architectures, and control training phases declaratively.\n\n```rust\n// Build components independently\nlet scan = FlowBuilder::from(scan_net).tag(\"hidden\")\n    .label(\"scan\").build()?;\n\nlet read = FlowBuilder::from(read_net).tag(\"confidence\")\n    .label(\"read\").build()?;\n\nlet encoder = FlowBuilder::from(scan)\n    .through(read)\n    .label(\"encoder\").build()?;\n\n// Compose into full model\nlet model = FlowBuilder::from(encoder)\n    .through(classifier)\n    .build()?;\n```\n\n### Dotted paths reach anywhere\n\nEvery tag and subgraph is addressable through dotted paths from the root:\n\n```rust\nmodel.validate_path(\"encoder\")?;                 // -\u003e Subgraph\nmodel.validate_path(\"encoder.scan.hidden\")?;      // -\u003e Tag (three levels deep)\nmodel.validate_path(\"encoder.read.confidence\")?;  // -\u003e Tag\n```\n\n### Declarative training phases\n\nFreeze and thaw entire subtrees by path — no manual parameter iteration:\n\n```rust\n// Phase 1: train only the classifier, encoder is frozen\nmodel.freeze(\"encoder\")?;\nlet fresh_params = model.parameters();  // only unfrozen params\nlet mut opt = Adam::new(\u0026fresh_params, 1e-3);\n// ... 
train ...\n\n// Phase 2: thaw scan, keep read frozen (it's proven)\nmodel.thaw(\"encoder.scan\")?;\nlet mut opt = Adam::with_groups()\n    .group(\u0026model.parameters_at(\"encoder.scan\")?, 1e-4)  // low LR\n    .group(\u0026model.parameters_at(\"classifier\")?, 1e-3)\n    .build();\n```\n\n### Subgraph checkpoints\n\nTrain a component standalone, save it, load it into a larger model:\n\n```rust\n// Pre-trained encoder saved earlier\nencoder.save_checkpoint(\"encoder_v1.fdl.gz\")?;\n\n// Load into the composed model — namespace + hash validated\nmodel.load_subgraph_checkpoint(\"encoder\", \"encoder_v1.fdl.gz\")?;\nmodel.freeze(\"encoder.read\")?;  // lock what's proven\n```\n\n### Cross-boundary observation\n\nMetrics flow up through the tree automatically:\n\n```rust\nmodel.record_at(\"encoder.scan.loss\", scan_loss)?;\nmodel.record_at(\"encoder.read.accuracy\", read_acc)?;\nmodel.record_scalar(\"total_loss\", total)?;\n\nmodel.flush(\u0026[]);  // single call flushes the entire tree\n\n// Trends across boundaries — drive training decisions\nif model.trend_at(\"encoder.scan.loss\")?.stalled(10, 1e-4) {\n    model.thaw(\"encoder.read\")?;  // scan stalled, unfreeze read\n}\n\n// Monitor sees all metrics with dotted names automatically\nmonitor.log(epoch, elapsed, \u0026model);\n// -\u003e total_loss, encoder.scan.loss, encoder.read.accuracy\n```\n\nThis is progressive model composition: each component is trained and\nvalidated independently before becoming a building block in a larger\narchitecture. 
Checkpoints, metrics, and training phases compose just like\nthe graphs themselves.\n\nSee the full **[Graph Tree Tutorial](https://github.com/fab2s/floDl/blob/main/docs/tutorials/10-graph-tree.md)**.\n\n## The Training Experience\n\n### Training Monitor\n\nDrop-in monitor with adaptive ETA, resource tracking, and a live web\ndashboard — no external dependencies, no separate process.\n\n```rust\nuse flodl::monitor::Monitor;\n\nlet mut monitor = Monitor::new(num_epochs);\nmonitor.serve(3000)?;  // optional: live dashboard at http://localhost:3000\n\nfor epoch in 0..num_epochs {\n    let t = std::time::Instant::now();\n    // ... training ...\n    monitor.log(epoch, t.elapsed(), \u0026model);  // sees entire graph tree\n}\nmonitor.finish();\n```\n\n```\n  epoch   1/100  loss=1.5264  [49ms  ETA 4.8s]\n  epoch  10/100  loss=0.3817  [25ms  ETA 2.2s]  VRAM: 2.1/6.0 GB (82%)\n  epoch  50/100  loss=0.0023  [24ms  ETA 1.2s]  VRAM: 2.1/6.0 GB (82%)\n  epoch 100/100  loss=0.0012  [23ms]             VRAM: 2.1/6.0 GB (82%)\n  training complete in 2.8s  | loss: 0.0012\n```\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://flodl.dev/benchmark\"\u003e\n    \u003cimg src=\"https://raw.githubusercontent.com/fab2s/floDl/main/docs/dashboard.gif\" alt=\"floDl live training dashboard — click for interactive version\" width=\"800\"\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\u003cem\u003e\u003ca href=\"https://flodl.dev/benchmark\"\u003eInteractive benchmark dashboard\u003c/a\u003e — real data from a 100-epoch training run\u003c/em\u003e\u003c/p\u003e\n\nThe live dashboard updates via Server-Sent Events (no WebSocket, no npm),\ntracks CPU/GPU/RAM/VRAM, and supports late join — open it mid-training and\nall past epochs backfill instantly.\n\n```rust\nmonitor.save_html(\"training_report.html\");  // self-contained archive\nmonitor.export_csv(\"training.csv\")?;         // for external analysis\n```\n\n### Observation and Trend Queries\n\nTags double as 
observation points. Collect metrics during training and use\ntrend queries to make programmatic training decisions:\n\n```rust\nfor epoch in 0..num_epochs {\n    for (input, target) in \u0026batches {\n        let pred = graph.forward(\u0026input)?;\n        let loss = mse_loss(\u0026pred, \u0026target)?;      // e.g. the loss from Getting Started\n        graph.collect(\u0026[\"hidden\"])?;                 // from graph tag\n        graph.record_scalar(\"loss\", loss.item()?);   // external metric\n    }\n    graph.flush(\u0026[\"hidden\", \"loss\"]);\n\n    // Programmatic training control\n    if graph.trend(\"loss\").stalled(5, 1e-4) {\n        optimizer.set_lr(optimizer.lr() * 0.5);      // decay LR\n    }\n    if graph.trend(\"loss\").converged(5, 1e-5) {\n        break;                                        // early stopping\n    }\n}\n```\n\n| Method | What it does |\n|--------|-------------|\n| `g.collect(tags)` / `g.flush(tags)` | Batch -\u003e epoch metric aggregation |\n| `g.record_scalar(tag, value)` | Inject external metrics (loss, accuracy) |\n| `g.trend(tag).slope(n)` | OLS slope over last n epochs |\n| `g.trend(tag).stalled(n, tol)` | Is \\|slope\\| below tolerance? |\n| `g.trend(tag).improving(n)` | Is loss decreasing? |\n| `g.trend(tag).converged(n, tol)` | Is variance below tolerance? |\n| `g.trends(tags).all_improving(n)` | Group queries across branches |\n\n### Visualization\n\n```rust\nlet svg = g.svg(Some(\"model.svg\"))?;              // architecture diagram\ng.svg_with_profile(Some(\"profile.svg\"))?;          // timing heatmap\ng.plot_html(\"training.html\", \u0026[\"loss\", \"head\"])?;  // interactive curves\n```\n\nSee the **[Training Monitor Tutorial](https://github.com/fab2s/floDl/blob/main/docs/tutorials/09-monitor.md)** and\nthe **[Observation example](https://github.com/fab2s/floDl/tree/main/flodl/examples/observation/)**.\n\n## Multi-GPU Training\n\n`Ddp::setup()` gives you transparent heterogeneous multi-GPU training with\nzero changes to your training loop. 
floDl detects your GPUs, picks the best\nstrategy, and balances work automatically: the slowest GPU anchors the pace\nwhile faster ones run ahead intelligently.\n\n**Graph DDP** -- one line to go from single-GPU to multi-GPU:\n\n```rust\n// Detect GPUs, replicate model, set optimizer, enable training\nDdp::setup(\u0026model, \u0026builder, |p| Adam::new(p, 0.001))?;\n\n// Training loop is IDENTICAL for 1 or N GPUs\nfor batch in model.epoch(0) {\n    let loss = model.forward_batch(\u0026batch?)?;\n    model.step()?;  // AllReduce + sync + optimizer + zero_grad\n}\n```\n\n**DDP Builder** -- thread-per-GPU, works with any `Module`:\n\n```rust\nlet state = Ddp::builder(model_factory, optim_factory, train_fn)\n    .dataset(dataset)\n    .batch_size(32)\n    .num_epochs(10)\n    .policy(ApplyPolicy::Cadence)       // ElChe for mixed GPUs\n    .backend(AverageBackend::Nccl)      // or Cpu for A/B testing\n    .run()?\n    .join()?;\n```\n\n| | Graph DDP | DDP Builder |\n|---|---|---|\n| **Works with** | `Graph` builder | Any `Module` |\n| **GPU model** | Scatter per batch | Thread per GPU (Local SGD) |\n| **Mixed GPUs** | El Che auto-enabled | `ApplyPolicy` x `AverageBackend` |\n| **Setup** | One line (`Ddp::setup`) | Builder pattern |\n| **Dashboard** | Integrated | Stderr logging |\n\n**A/B testing**: swap `AverageBackend::Nccl` for `AverageBackend::Cpu`\nwith one line. 
If loss curves match, you have validated the cheaper\nbackend for your workload.\n\nSee the **[Multi-GPU Tutorial](https://github.com/fab2s/floDl/blob/main/docs/tutorials/11-multi-gpu.md)**,\n**[DDP Builder Tutorial](https://github.com/fab2s/floDl/blob/main/docs/tutorials/12-async-ddp.md)**,\n**[Data Loading Tutorial](https://github.com/fab2s/floDl/blob/main/docs/tutorials/13-data-loading.md)**, and\n**[DDP Reference](https://github.com/fab2s/floDl/blob/main/docs/ddp.md)**.\n\n### Validation suite — `ddp-bench`\n\nThe repo ships with [`ddp-bench/`](https://github.com/fab2s/floDl/tree/main/ddp-bench),\na workspace member that reproduces published training setups (Logistic /\nMLP / LeNet-5 / ResNet-20 / Char-RNN / GPT-nano / Conv-AE on MNIST,\nCIFAR-10, Shakespeare) to build scientifically valid solo baselines, then\nmeasures DDP/ElChe convergence quality against them across all 8\nbackend × policy combinations:\n\n```bash\nfdl ddp-bench --list                       # list models and modes\nfdl ddp-bench quick                        # 1-epoch smoke test\nfdl ddp-bench validate                     # full sweep vs structured baselines\nfdl ddp-bench --model gpt-nano --mode nccl-cadence --epochs 50 --lr-scale 2\nfdl ddp-bench --report runs/report.md      # convergence report from saved runs\n```\n\nEvery run produces a high-frequency `Timeline` (CPU/GPU utilization, sync\nevents, anchor changes, idle gaps) saved as JSON / CSV / interactive HTML\nunder `runs/\u003cmodel\u003e/\u003cmode\u003e/`.\n\n### Built-in datasets\n\nThe framework ships ready-to-use parsers for common benchmarks (all\nimplement `BatchDataSet`, plug straight into `DataLoader::builder`):\n\n```rust\nuse flodl::data::datasets::{Cifar10, Mnist, Shakespeare};\n\nlet mnist = Mnist::parse(\u0026images_gz, \u0026labels_gz)?;\nlet cifar = Cifar10::parse(\u0026[\u0026batch1, \u0026batch2, /* ... 
*/])?;\nlet text  = Shakespeare::parse(\u0026corpus, /*seq_len=*/ 128)?;\n```\n\n`ddp-bench` downloads and caches the underlying files on first run.\n\n## PyTorch Parity\n\nfloDl covers the modules, losses, and optimizers you actually use:\n\n| Category | Count | Highlights |\n|----------|------:|-----------|\n| **NN Modules** | 30+ | `Linear`, `Conv1d`/`2d`/`3d` + transpose, `GRU`/`LSTM`, `MultiheadAttention`, `Bilinear`, all norms (`Layer`/`RMS`/`Group`/`Batch`/`Instance`), all pooling, `Embedding`/`EmbeddingBag`, `PixelShuffle`, `Upsample`, `Unfold`/`Fold` |\n| **Activations** | 17 | `ReLU`, `LeakyReLU`, `ELU`, `GELU`, `SiLU`, `Mish`, `SELU`, `Softplus`, `Hardswish`, `PReLU`, `Softmax`, ... |\n| **Losses** | 15 | MSE, CrossEntropy, BCE, NLL, CTC, Focal, Triplet, KLDiv, SmoothL1, Cosine, Hinge, Margin, Poisson, ... |\n| **Optimizers** | 7 | `SGD`, `Adam`, `AdamW`, `RMSprop`, `Adagrad`, `RAdam`, `NAdam` — all with parameter groups |\n| **Schedulers** | 8 | Step, Cosine, Exponential, MultiStep, OneCycle, Cyclic, Warmup (composable), Plateau |\n| **Init** | 9 | Xavier, Kaiming, orthogonal, truncated normal, uniform, normal |\n| **Tensor Ops** | 100+ | Full arithmetic, trig, reductions, shape, indexing, comparisons, fused ops |\n| **Autograd** | 90+ | Differentiable backward for every op above |\n\nFused Adam/AdamW on CUDA (single kernel for all parameters). Fused gradient\nclipping via foreach ops. Mixed precision with `AutocastGuard` + `GradScaler`.\nCUDA Graphs for replay-based training.\n\nThe [full migration guide](https://github.com/fab2s/floDl/blob/main/docs/pytorch_migration.md) has side-by-side\ncode for every op, module, and pattern.\n\n## Performance\n\nSame CUDA kernels as PyTorch — the difference comes from what happens\n*between* kernel launches. 
Ten models, ten interleaved rounds, locked GPU\nclocks (RTX 5060 Ti, v0.3.0 vs PyTorch 2.10.0):\n\n| Model | PyTorch | flodl | Delta |\n|---|---:|---:|---:|\n| transformer | 3183.0 ms | 2199.8 ms | **-31%** |\n| mlp | 291.1 ms | 207.0 ms | **-29%** |\n| residual_tower | 406.9 ms | 309.7 ms | **-24%** |\n| feedback_fixed | 275.3 ms | 231.3 ms | **-16%** |\n| gated_routing | 248.0 ms | 217.3 ms | **-12%** |\n| iterative_refine | 230.7 ms | 206.0 ms | **-11%** |\n| gru_seq | 1105.1 ms | 1057.5 ms | **-4%** |\n| conv_autoenc | 398.2 ms | 395.3 ms | -1% |\n| lstm_seq | 692.3 ms | 692.3 ms | 0% |\n| convnet | 1298.0 ms | 1298.2 ms | 0% |\n\nWins 8 of 10, ties 2, zero regressions. The ties (convnet, lstm_seq) are\ncompute-bound -- both frameworks saturate the GPU, confirming identical\nCUDA kernels. The gap appears where framework overhead matters:\ndispatch-bound architectures (transformer -31%, mlp -29%), graph routing\n(residual_tower -24%), and recurrent loops (feedback_fixed -16%).\n\n**[Benchmark Report](https://github.com/fab2s/floDl/blob/main/docs/benchmark.md)** |\n[Interactive dashboard](https://flodl.dev/benchmark)\n\n### Multi-GPU (DDP)\n\nResNet-20 on CIFAR-10, 200 epochs -- heterogeneous GPUs (RTX 5060 Ti +\nGTX 1060, 2.5x speed ratio). Published reference: 91.25%\n([He et al. 2015](https://arxiv.org/abs/1512.03385), Table 6):\n\n| Mode | Eval | vs Published | Time | vs Solo-0 |\n|---|---:|---:|---:|---:|\n| solo-0 (fast GPU only) | 91.66% | +0.41% | 3127s | -- |\n| nccl-async | **92.44%** | **+1.19%** | 2697s | 1.2x |\n| nccl-cadence | **92.42%** | **+1.17%** | 2650s | 1.2x |\n| cpu-async | **92.43%** | **+1.18%** | 2614s | 1.2x |\n| cpu-cadence | **92.04%** | **+0.79%** | 2670s | 1.2x |\n\nEvery ElChe mode surpasses published accuracy while finishing faster\nthan the fast GPU alone. 
200 epochs is where ElChe's proportional\nscheduling has room to calibrate and shine -- shorter models (logistic\nthrough gpt-nano) confirm DDP convergence across architectures.\n\n**[DDP Benchmark Report](https://github.com/fab2s/floDl/blob/main/docs/ddp-benchmark.md)** --\nfull results for 8 models across 9 DDP modes\n\n## Why Rust for Deep Learning?\n\n**Deterministic memory.** Python adds ~3-5 us of framework overhead per GPU\nop. Go's GC can't manage VRAM — an [earlier Go implementation](https://github.com/fab2s/goDl)\nrequired 5 phases of lifecycle management (refcounting, GC callbacks, VRAM\nbudgets, pending-free queues). Rust replaces all of that with\n`impl Drop for Tensor`. Memory is freed the instant a tensor leaves scope.\n\n**Zero-cost safety.** Every op returns `Result\u003cT\u003e` — no silent failures.\nOwnership ensures tensors are freed exactly once. The borrow checker\nprevents data races at compile time.\n\n**Same GPU kernels.** floDl binds libtorch — the C++ library under\nPyTorch. CUDA, cuBLAS, cuDNN are identical. 
floDl replaces the dispatch\npath, autograd tracking, and graph execution.\n\n## Features Reference\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eTraining Tools\u003c/strong\u003e\u003c/summary\u003e\n\n| Tool | What it does |\n|------|-------------|\n| `clip_grad_norm` / `clip_grad_value` | Fused gradient clipping (2 kernels total via foreach ops) |\n| `save_checkpoint` / `load_checkpoint` | Named `.fdl` checkpoints, structural hash, partial loading, `LoadReport` |\n| `migrate_checkpoint` | Remap parameter names across versions |\n| `Parameter::freeze` / `unfreeze` | Per-parameter gradient control |\n| `GradScaler` | Dynamic loss scaling for fp16 training |\n| `cast_parameters` | Cast model parameters to any dtype |\n| `CpuWorker` / `ModelSnapshot` | Background checkpoint saving |\n| `CudaGraph` | Capture/replay training steps for fixed-shape models |\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eModule Traits\u003c/strong\u003e\u003c/summary\u003e\n\nBeyond `forward`/`parameters`, `Module` provides optional methods the graph\nrecognizes automatically:\n\n| Method | What happens |\n|--------|-------------|\n| `as_named_input()` | `using()` refs arrive as a named map |\n| `reset()` | Loops auto-call before iterating — clears per-forward state |\n| `detach_state()` | Break gradient chains on retained state |\n| `sub_modules()` | Recursive device placement, training mode, parameter collection |\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eBuild Profiles\u003c/strong\u003e\u003c/summary\u003e\n\n```toml\n# Optimize floDl in dev builds — your code stays fast to compile.\n[profile.dev.package.flodl]\nopt-level = 3\n\n[profile.dev.package.flodl-sys]\nopt-level = 3\n\n# Release: cross-crate optimization for maximum throughput.\n[profile.release]\nlto = \"thin\"\ncodegen-units = 1\n```\n\n| Profile | flodl | Your code | Typical rebuild 
|\n|---------|-------|-----------|-----------------|\n| `cargo build` | `-O3` (cached) | `-O0` (fast) | \u003c 2s |\n| `cargo build --release` | `-O3` + LTO | `-O3` + LTO | full link |\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eMulti-GPU (DDP)\u003c/strong\u003e\u003c/summary\u003e\n\n| Component | What it does |\n|-----------|-------------|\n| `Ddp::setup` | One-liner: detect GPUs, distribute, set optimizer, train |\n| `Ddp::builder` | Thread-per-GPU with Local SGD, any Module |\n| `ApplyPolicy` | Sync / Cadence / Async (when to average) |\n| `AverageBackend` | Nccl / Cpu (how to average, A/B testable) |\n| `ElChe` | Heterogeneous GPU cadence strategy |\n| `NcclComms` / `NcclRankComm` | NCCL AllReduce, Broadcast, abort handles |\n| `CudaEvent` / `CudaStream` | Async GPU-CPU pipeline, timing |\n| `DataLoader` | Resident/streaming/distributed, VRAM-aware prefetch, auto OOM fallback |\n\n\u003c/details\u003e\n\n### Numerical Verification\n\nEvery differentiable path is verified against finite-difference gradients:\n- 117 autograd op-level checks (every op + compositions)\n- Module-level checks (every NN module, input + parameter gradients)\n- Exact optimizer step verifications (SGD, Adam, AdamW, RMSprop, Adagrad, RAdam, NAdam)\n- 1027 library tests, zero clippy warnings — all tests run on both CPU and CUDA\n\n### Hardware Compatibility\n\nDeveloped and tested from NVIDIA Pascal (GTX 1060 6GB) to Blackwell\n(RTX 5060 Ti 16GB). PyTorch dropped Pascal support after 2.5.1 — floDl\nlinks libtorch's stable C API, which supports every architecture the driver\nsupports. 
If `nvidia-smi` works, floDl trains on it.\n\n## Documentation\n\n### Choose your path\n\n| Background | Start here |\n|-----------|-----------|\n| **New to Rust** | [Rust for PyTorch Users](https://github.com/fab2s/floDl/blob/main/docs/tutorials/00-rust-primer.md) — 10 patterns in 15 minutes |\n| **Know Rust, new to DL** | [Tensors](https://github.com/fab2s/floDl/blob/main/docs/tutorials/01-tensors.md) then [Training](https://github.com/fab2s/floDl/blob/main/docs/tutorials/04-training.md) |\n| **Know PyTorch** | [Porting Guide](https://github.com/fab2s/floDl/blob/main/docs/porting.md) (or `/port` with AI) then [Graph Builder](https://github.com/fab2s/floDl/blob/main/docs/tutorials/05-graph-builder.md) |\n| **Scaling to multi-GPU** | [Multi-GPU Training](https://github.com/fab2s/floDl/blob/main/docs/tutorials/11-multi-gpu.md) then [DDP Builder](https://github.com/fab2s/floDl/blob/main/docs/tutorials/12-async-ddp.md) |\n| **Just show me code** | [`quickstart`](https://github.com/fab2s/floDl/tree/main/flodl/examples/quickstart/) or [`showcase`](https://github.com/fab2s/floDl/tree/main/flodl/examples/showcase/) |\n\n### Tutorials\n\n0. **[Rust for PyTorch Users](https://github.com/fab2s/floDl/blob/main/docs/tutorials/00-rust-primer.md)** — 10 Rust patterns in 15 minutes\n1. **[Tensors](https://github.com/fab2s/floDl/blob/main/docs/tutorials/01-tensors.md)** — creation, ops, memory, CUDA\n2. **[Autograd](https://github.com/fab2s/floDl/blob/main/docs/tutorials/02-autograd.md)** — variables, gradients, backward\n3. **[Modules](https://github.com/fab2s/floDl/blob/main/docs/tutorials/03-modules.md)** — all layers, convolutions, RNNs, attention, normalization\n4. **[Training](https://github.com/fab2s/floDl/blob/main/docs/tutorials/04-training.md)** — losses, optimizers, mixed precision, full loop\n5. **[Graph Builder](https://github.com/fab2s/floDl/blob/main/docs/tutorials/05-graph-builder.md)** — fluent API from simple to complex\n6. 
**[Advanced Graphs](https://github.com/fab2s/floDl/blob/main/docs/tutorials/06-advanced-graphs.md)** — forward refs, loops, gates, switches\n7. **[Visualization](https://github.com/fab2s/floDl/blob/main/docs/tutorials/07-visualization.md)** — DOT/SVG, profiling heatmaps\n8. **[Utilities](https://github.com/fab2s/floDl/blob/main/docs/tutorials/08-utilities.md)** — checkpoints, clipping, freezing, initialization, scheduling, verbosity-gated logging\n9. **[Training Monitor](https://github.com/fab2s/floDl/blob/main/docs/tutorials/09-monitor.md)** — ETA, resource tracking, live dashboard\n10. **[Graph Tree](https://github.com/fab2s/floDl/blob/main/docs/tutorials/10-graph-tree.md)** — hierarchical composition, freeze/thaw, subgraph checkpoints\n11. **[Multi-GPU Training](https://github.com/fab2s/floDl/blob/main/docs/tutorials/11-multi-gpu.md)** — Ddp::setup, El Che, auto-balancing, DataLoader integration\n12. **[DDP Builder](https://github.com/fab2s/floDl/blob/main/docs/tutorials/12-async-ddp.md)** — thread-per-GPU, Local SGD, A/B testable backends\n13. 
**[Data Loading](https://github.com/fab2s/floDl/blob/main/docs/tutorials/13-data-loading.md)** — DataLoader, resident/streaming modes, VRAM-aware prefetch, DDP integration\n\n### Examples\n\n- [`quickstart`](https://github.com/fab2s/floDl/tree/main/flodl/examples/quickstart/) — build, train, and monitor a model with residual connections\n- [`sine_wave`](https://github.com/fab2s/floDl/tree/main/flodl/examples/sine_wave/) — sine regression with monitor, checkpoint round-trip\n- [`mixed_precision`](https://github.com/fab2s/floDl/tree/main/flodl/examples/mixed_precision/) — float16 training with `GradScaler`\n- [`transfer_learning`](https://github.com/fab2s/floDl/tree/main/flodl/examples/transfer_learning/) — checkpoint, partial load, freeze, fine-tune\n- [`schedulers`](https://github.com/fab2s/floDl/tree/main/flodl/examples/schedulers/) — warmup + cosine + plateau composition\n- [`observation`](https://github.com/fab2s/floDl/tree/main/flodl/examples/observation/) — collect, flush, trend queries, early stopping\n- [`showcase`](https://github.com/fab2s/floDl/tree/main/flodl/examples/showcase/) — every graph builder method in one graph\n\n### Porting from PyTorch\n\n- **[Porting Guide](https://github.com/fab2s/floDl/blob/main/docs/porting.md)** — module mapping, FlowBuilder patterns, training loop translation\n- **[AI-assisted porting](https://github.com/fab2s/floDl/tree/main/ai/skills/port/)** — point any AI coding assistant at the skill guide for automated translation. With Claude Code: `/port my_model.py`\n- **`fdl api-ref`** — generate a structured API reference for your flodl version. 
Used by AI tools and useful on its own.\n\n### Architecture\n\n```\n+-----------------------------------------------------------+\n|  User Code / Model Definitions                            |\n+-----------------------------------------------------------+\n|  monitor/  ETA, resource tracking, live web dashboard     |\n+-----------------------------------------------------------+\n|  graph/    Fluent builder, graph tree, execution, DOT/SVG |\n+-----------------------------------------------------------+\n|  data/     DataLoader, resident/streaming, prefetch       |\n+-----------------------------------------------------------+\n|  nn/       Modules, losses, optimizers, DDP, NCCL         |\n+-----------------------------------------------------------+\n|  autograd/ Reverse-mode AD, gradient tracking             |\n+-----------------------------------------------------------+\n|  tensor/   Owned tensors with Drop, CPU + CUDA            |\n+-----------------------------------------------------------+\n|  flodl-sys   FFI bindings to libtorch C++ shim            |\n+-----------------------------------------------------------+\n|  libtorch / CUDA / NCCL                                   |\n+-----------------------------------------------------------+\n```\n\n## Story\n\nfloDl started as a question: what would a deep learning framework look like\nif you designed it around Rust's ownership model instead of fighting a garbage\ncollector?\n\nAn [earlier attempt in Go](https://github.com/fab2s/goDl) proved the\narchitecture — the graph builder, the module system, the observation engine —\nbut hit a wall: Go's GC cannot manage GPU memory deterministically. That\nrequired building five layers of memory management infrastructure on top of\nthe language, not with it.\n\nRust solved this at the language level. `impl Drop for Tensor` replaced\nhundreds of lines of lifecycle management. 
The graph builder, module\ncomposition, and design philosophy carried forward; the memory fights didn't.\n\n## License\n\nfloDl is open-source software licensed under the [MIT license](https://github.com/fab2s/floDl/blob/main/LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffab2s%2Fflodl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffab2s%2Fflodl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffab2s%2Fflodl/lists"}