https://github.com/fab2s/flodl

rust recursive deep learning framework
https://github.com/fab2s/flodl
deep-learning graph machine-learning rust
Last synced: 2 months ago
JSON representation
rust recursive deep learning framework
Host: GitHub
URL: https://github.com/fab2s/flodl
Owner: fab2s
License: mit
Created: 2026-03-09T23:15:59.000Z (4 months ago)
Default Branch: main
Last Pushed: 2026-04-13T16:53:15.000Z (3 months ago)
Last Synced: 2026-04-13T18:30:53.435Z (3 months ago)
Topics: deep-learning, graph, machine-learning, rust
Language: Rust
Homepage: https://flodl.dev
Size: 18 MB
Stars: 36
Watchers: 1
Forks: 1
Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
Awesome Lists containing this project

README

          


  



floDl




A Rust-native deep learning framework built on libtorch.


Same GPU kernels as PyTorch. No Python. No GIL. No GC. Just Rust.





  

  

  

  

  





  PyTorch Users •

  Thesis •

  Getting Started •

  Graph Builder •

  Graph Tree •

  Training •

  Multi-GPU •

  Parity •

  Benchmarks •

  Roadmap •

  Migration Guide •

  Data Loading



---

> **What's new in 0.5.0** -- the `fdl` CLI maturity pass. New proc-macro

> crate [`flodl-cli-macros`](https://crates.io/crates/flodl-cli-macros)

> adds `#[derive(FdlArgs)]` -- any Rust binary gets typed argv parsing,

> JSON schema, shell completions, and env-var fallback for free.

> `fdl.yml` consolidates to a single `commands:` map with three clean

> kinds (`run:` / `path:` / preset). New

> [`--env` overlays](docs/cli.md#environment-overlays) and

> [`fdl config show`](docs/cli.md#fdl-config) surface per-environment

> config with per-field origin annotations, so you can see the

> resolved YAML before running a two-hour job. Migration from 0.4.0:

> see [UPGRADE.md](UPGRADE.md).

---

## If You Know PyTorch, You Know floDl

PyTorchfloDl

```python

model = nn.Sequential(

    nn.Linear(2, 16),

    nn.GELU(),

    nn.LayerNorm(16),

    nn.Linear(16, 2),

)

pred = model(x)

loss = F.mse_loss(pred, target)

loss.backward()

optimizer.step()

```

```rust

let model = FlowBuilder::from(Linear::new(2, 16)?)

    .through(GELU)

    .through(LayerNorm::new(16)?)

    .through(Linear::new(16, 2)?)

    .build()?;

let pred = model.forward(&x)?;

let loss = mse_loss(&pred, &target)?;

loss.backward()?;

optimizer.step()?;

```

Same concepts, same names, same GPU kernels underneath. The `?` operator

replaces silent failures with compile-time error handling. `Drop` replaces the

garbage collector. The [full migration guide](https://github.com/fab2s/floDl/blob/main/docs/pytorch_migration.md) covers

every op, module, and pattern.

> **New to Rust?** Read [Rust for PyTorch Users](https://github.com/fab2s/floDl/blob/main/docs/tutorials/00-rust-primer.md) — 10 patterns in 15 minutes.

## Getting Started

**With the CLI** (recommended, no Rust needed):

```bash

curl -sL https://flodl.dev/fdl -o fdl && chmod +x fdl

./fdl setup          # detect hardware, download libtorch, configure build environment

./fdl init my-proj   # scaffold a new project with training template

```

The `fdl` script auto-downloads a pre-compiled CLI binary (~750KB, pure Rust,

no libtorch dependency). It detects your GPUs, downloads the right libtorch

variant, and configures Docker or native builds. See the [full CLI

reference](docs/cli.md) for all commands.

**One-liner with Docker** (no Rust, no setup):

```bash

curl -sL https://flodl.dev/init.sh | sh -s my-project

cd my-project

./fdl build   # first build (~5 min, downloads libtorch)

./fdl run     # train the model

```

**Native** -- [Rust](https://rustup.rs/) 1.85+ and libtorch:

```bash

./fdl libtorch download    # auto-detects CPU or CUDA

cargo add flodl && cargo build

```

For CUDA: `cargo add flodl --features cuda` + [CUDA toolkit](https://developer.nvidia.com/cuda-downloads).

> **Using tch-rs or PyTorch C++?** `fdl` also works as a standalone

> libtorch manager outside of flodl: download any CPU/CUDA variant,

> switch between installs, compile from source for mixed GPU

> architectures (e.g. sm_61 + sm_120 in one build), and emit a

> machine-readable diagnostics report. No flodl buy-in required.

> See [docs/cli.md § Standalone](docs/cli.md#1-standalone-no-project-required)

> and the [`flodl-cli` crate](https://crates.io/crates/flodl-cli).

Both paths generate an annotated training template. Edit `src/main.rs` to

build your model:

```rust

use flodl::*;

let model = FlowBuilder::from(Linear::new(2, 16)?)

    .through(GELU)

    .through(LayerNorm::new(16)?)

    .also(Linear::new(16, 16)?)     // residual connection

    .through(Linear::new(16, 2)?)

    .build()?;

let params = model.parameters();

let mut optimizer = Adam::new(&params, 0.01);

model.train();

for (input_t, target_t) in &batches {

    let input = Variable::new(input_t.clone(), true);

    let target = Variable::new(target_t.clone(), false);

    let pred = model.forward(&input)?;

    let loss = mse_loss(&pred, &target)?;

    optimizer.zero_grad();

    loss.backward()?;

    clip_grad_norm(&params, 1.0)?;

    optimizer.step()?;

}

```

## The Graph Builder

floDl's fluent graph builder lets you describe complex architectures as

readable data flow — no boilerplate, no `nn.Module` subclassing.

```rust

let model = FlowBuilder::from(Linear::new(2, 16)?)

    .through(GELU)                        // activation

    .through(LayerNorm::new(16)?)         // normalization

    .also(Linear::new(16, 16)?)           // residual connection

    .through(Linear::new(16, 2)?)         // output projection

    .build()?;

```

`build()` returns a `Graph` that implements `Module` — you can nest it

inside other graphs. Things get interesting when architectures get complex:

```rust

let g = FlowBuilder::from(encoder).tag("encoded")

    .split(modules![head_a, head_b, head_c]).merge(MergeOp::Mean)

    .loop_body(refinement_block).for_n(3).tag("refined")

    .gate(router, modules![expert_a, expert_b]).using(&["encoded"])

    .switch(selector, modules![light_path, heavy_path]).using(&["refined"])

    .through(StateAdd).using(&["memory"]).tag("memory")

    .loop_body(decoder).while_cond(halt_condition, 10)

    .through(output_head)

    .build()?;

```

Every construct — `split/merge`, `also`, `loop_body`, `gate`, `switch`, `map`,

`tag/using` — composes cleanly. Forward references (`using` before `tag`) carry

state across calls, enabling recurrent architectures without special-casing.

| Method | What it does |

|--------|-------------|

| `from(m).through(m)` | Linear chain |

| `also(m)` | Residual: `input + m(input)` |

| `fork(m)` | Side branch: capture output as tag, stream continues |

| `split(modules![...]).merge(op)` | Parallel branches, merged by `Add` or `Mean` |

| `tag(name)` / `using(refs)` | Named references — backward or forward (across calls) |

| `loop_body(body).for_n(n)` | Fixed iteration with BPTT |

| `loop_body(body).while_cond` / `until_cond` | Conditional loops |

| `gate(router, modules![...])` | Soft routing — weighted combination |

| `switch(selector, modules![...])` | Hard routing — only selected branch |

| `map(body).each()` / `.over(tag)` / `.slices(n)` | Element-wise, tagged, or sliced iteration |

| `input(names)` | Auxiliary graph inputs for multi-input architectures |

See the **[Graph Builder Tutorial](https://github.com/fab2s/floDl/blob/main/docs/tutorials/05-graph-builder.md)** and

the [full showcase](https://github.com/fab2s/floDl/tree/main/flodl/examples/showcase/).

## Graph Tree: Hierarchical Composition

This is where floDl goes beyond PyTorch. Graphs nest inside graphs with

**label-path addressing** — dot-separated paths that let you reach into any

subgraph from the root. Train components independently, compose them into

larger architectures, and control training phases declaratively.

```rust

// Build components independently

let scan = FlowBuilder::from(scan_net).tag("hidden")

    .label("scan").build()?;

let read = FlowBuilder::from(read_net).tag("confidence")

    .label("read").build()?;

let encoder = FlowBuilder::from(scan)

    .through(read)

    .label("encoder").build()?;

// Compose into full model

let model = FlowBuilder::from(encoder)

    .through(classifier)

    .build()?;

```

### Dotted paths reach anywhere

Every tag and subgraph is addressable through dotted paths from the root:

```rust

model.validate_path("encoder")?;                 // -> Subgraph

model.validate_path("encoder.scan.hidden")?;      // -> Tag (three levels deep)

model.validate_path("encoder.read.confidence")?;  // -> Tag

```

### Declarative training phases

Freeze and thaw entire subtrees by path — no manual parameter iteration:

```rust

// Phase 1: train only the classifier, encoder is frozen

model.freeze("encoder")?;

let fresh_params = model.parameters();  // only unfrozen params

let mut opt = Adam::new(&fresh_params, 1e-3);

// ... train ...

// Phase 2: thaw scan, keep read frozen (it's proven)

model.thaw("encoder.scan")?;

let mut opt = Adam::with_groups()

    .group(&model.parameters_at("encoder.scan")?, 1e-4)  // low LR

    .group(&model.parameters_at("classifier")?, 1e-3)

    .build();

```

### Subgraph checkpoints

Train a component standalone, save it, load it into a larger model:

```rust

// Pre-trained encoder saved earlier

encoder.save_checkpoint("encoder_v1.fdl.gz")?;

// Load into the composed model — namespace + hash validated

model.load_subgraph_checkpoint("encoder", "encoder_v1.fdl.gz")?;

model.freeze("encoder.read")?;  // lock what's proven

```

### Cross-boundary observation

Metrics flow up through the tree automatically:

```rust

model.record_at("encoder.scan.loss", scan_loss)?;

model.record_at("encoder.read.accuracy", read_acc)?;

model.record_scalar("total_loss", total)?;

model.flush(&[]);  // single call flushes the entire tree

// Trends across boundaries — drive training decisions

if model.trend_at("encoder.scan.loss")?.stalled(10, 1e-4) {

    model.thaw("encoder.read")?;  // scan stalled, unfreeze read

}

// Monitor sees all metrics with dotted names automatically

monitor.log(epoch, elapsed, &model);

// -> total_loss, encoder.scan.loss, encoder.read.accuracy

```

This is progressive model composition: each component is trained and

validated independently before becoming a building block in a larger

architecture. Checkpoints, metrics, and training phases compose just like

the graphs themselves.

See the full **[Graph Tree Tutorial](https://github.com/fab2s/floDl/blob/main/docs/tutorials/10-graph-tree.md)**.

## The Training Experience

### Training Monitor

Drop-in monitor with adaptive ETA, resource tracking, and a live web

dashboard — no external dependencies, no separate process.

```rust

use flodl::monitor::Monitor;

let mut monitor = Monitor::new(num_epochs);

monitor.serve(3000)?;  // optional: live dashboard at http://localhost:3000

for epoch in 0..num_epochs {

    let t = std::time::Instant::now();

    // ... training ...

    monitor.log(epoch, t.elapsed(), &model);  // sees entire graph tree

}

monitor.finish();

```

```

  epoch   1/100  loss=1.5264  [49ms  ETA 4.8s]

  epoch  10/100  loss=0.3817  [25ms  ETA 2.2s]  VRAM: 2.1/6.0 GB (82%)

  epoch  50/100  loss=0.0023  [24ms  ETA 1.2s]  VRAM: 2.1/6.0 GB (82%)

  epoch 100/100  loss=0.0012  [23ms]             VRAM: 2.1/6.0 GB (82%)

  training complete in 2.8s  | loss: 0.0012

```



  

    

  



Interactive benchmark dashboard — real data from a 100-epoch training run


The live dashboard updates via Server-Sent Events (no WebSocket, no npm),

tracks CPU/GPU/RAM/VRAM, and supports late join — open it mid-training and

all past epochs backfill instantly.

```rust

monitor.save_html("training_report.html");  // self-contained archive

monitor.export_csv("training.csv")?;         // for external analysis

```

### Observation and Trend Queries

Tags double as observation points. Collect metrics during training and use

trend queries to make programmatic training decisions:

```rust

for epoch in 0..num_epochs {

    for (input, target) in &batches {

        let pred = graph.forward(&input)?;

        graph.collect(&["hidden"])?;                 // from graph tag

        graph.record_scalar("loss", loss.item()?);   // external metric

    }

    graph.flush(&["hidden", "loss"]);

    // Programmatic training control

    if graph.trend("loss").stalled(5, 1e-4) {

        optimizer.set_lr(optimizer.lr() * 0.5);      // decay LR

    }

    if graph.trend("loss").converged(5, 1e-5) {

        break;                                        // early stopping

    }

}

```

| Method | What it does |

|--------|-------------|

| `g.collect(tags)` / `g.flush(tags)` | Batch -> epoch metric aggregation |

| `g.record_scalar(tag, value)` | Inject external metrics (loss, accuracy) |

| `g.trend(tag).slope(n)` | OLS slope over last n epochs |

| `g.trend(tag).stalled(n, tol)` | Is \|slope\| below tolerance? |

| `g.trend(tag).improving(n)` | Is loss decreasing? |

| `g.trend(tag).converged(n, tol)` | Is variance below tolerance? |

| `g.trends(tags).all_improving(n)` | Group queries across branches |

### Visualization

```rust

let svg = g.svg(Some("model.svg"))?;              // architecture diagram

g.svg_with_profile(Some("profile.svg"))?;          // timing heatmap

g.plot_html("training.html", &["loss", "head"])?;  // interactive curves

```

See the **[Training Monitor Tutorial](https://github.com/fab2s/floDl/blob/main/docs/tutorials/09-monitor.md)** and

the **[Observation example](https://github.com/fab2s/floDl/tree/main/flodl/examples/observation/)**.

## Multi-GPU Training

`Ddp::setup()` gives you transparent heterogeneous multi-GPU training with

zero changes to your training loop. floDl detects your GPUs, picks the best

strategy, and balances work automatically: the slowest GPU anchors the pace

while faster ones run ahead intelligently.

**Graph DDP** -- one line to go from single-GPU to multi-GPU:

```rust

// Detect GPUs, replicate model, set optimizer, enable training

Ddp::setup(&model, &builder, |p| Adam::new(p, 0.001))?;

// Training loop is IDENTICAL for 1 or N GPUs

for batch in model.epoch(0) {

    let loss = model.forward_batch(&batch?)?;

    model.step()?;  // AllReduce + sync + optimizer + zero_grad

}

```

**DDP Builder** -- thread-per-GPU, works with any `Module`:

```rust

let state = Ddp::builder(model_factory, optim_factory, train_fn)

    .dataset(dataset)

    .batch_size(32)

    .num_epochs(10)

    .policy(ApplyPolicy::Cadence)       // ElChe for mixed GPUs

    .backend(AverageBackend::Nccl)      // or Cpu for A/B testing

    .run()?

    .join()?;

```

| | Graph DDP | DDP Builder |

|---|---|---|

| **Works with** | `Graph` builder | Any `Module` |

| **GPU model** | Scatter per batch | Thread per GPU (Local SGD) |

| **Mixed GPUs** | El Che auto-enabled | `ApplyPolicy` x `AverageBackend` |

| **Setup** | One line (`Ddp::setup`) | Builder pattern |

| **Dashboard** | Integrated | Stderr logging |

**A/B testing**: swap `AverageBackend::Nccl` for `AverageBackend::Cpu`

with one line. If loss curves match, you have validated the cheaper

backend for your workload.

See the **[Multi-GPU Tutorial](https://github.com/fab2s/floDl/blob/main/docs/tutorials/11-multi-gpu.md)**,

**[DDP Builder Tutorial](https://github.com/fab2s/floDl/blob/main/docs/tutorials/12-async-ddp.md)**,

**[Data Loading Tutorial](https://github.com/fab2s/floDl/blob/main/docs/tutorials/13-data-loading.md)**, and

**[DDP Reference](https://github.com/fab2s/floDl/blob/main/docs/ddp.md)**.

### Validation suite — `ddp-bench`

The repo ships with [`ddp-bench/`](https://github.com/fab2s/floDl/tree/main/ddp-bench),

a workspace member that reproduces published training setups (Logistic /

MLP / LeNet-5 / ResNet-20 / Char-RNN / GPT-nano / Conv-AE on MNIST,

CIFAR-10, Shakespeare) to build scientifically valid solo baselines, then

measures DDP/ElChe convergence quality against them across all 8

backend × policy combinations:

```bash

fdl ddp-bench --list                       # list models and modes

fdl ddp-bench quick                        # 1-epoch smoke test

fdl ddp-bench validate                     # full sweep vs structured baselines

fdl ddp-bench --model gpt-nano --mode nccl-cadence --epochs 50 --lr-scale 2

fdl ddp-bench --report runs/report.md      # convergence report from saved runs

```

Every run produces a high-frequency `Timeline` (CPU/GPU utilization, sync

events, anchor changes, idle gaps) saved as JSON / CSV / interactive HTML

under `runs///`.

### Built-in datasets

The framework ships ready-to-use parsers for common benchmarks (all

implement `BatchDataSet`, plug straight into `DataLoader::builder`):

```rust

use flodl::data::datasets::{Cifar10, Mnist, Shakespeare};

let mnist = Mnist::parse(&images_gz, &labels_gz)?;

let cifar = Cifar10::parse(&[&batch1, &batch2, /* ... */])?;

let text  = Shakespeare::parse(&corpus, /*seq_len=*/ 128)?;

```

`ddp-bench` downloads and caches the underlying files on first run.

## PyTorch Parity

floDl covers the modules, losses, and optimizers you actually use:

| Category | Count | Highlights |

|----------|------:|-----------|

| **NN Modules** | 30+ | `Linear`, `Conv1d`/`2d`/`3d` + transpose, `GRU`/`LSTM`, `MultiheadAttention`, `Bilinear`, all norms (`Layer`/`RMS`/`Group`/`Batch`/`Instance`), all pooling, `Embedding`/`EmbeddingBag`, `PixelShuffle`, `Upsample`, `Unfold`/`Fold` |

| **Activations** | 17 | `ReLU`, `LeakyReLU`, `ELU`, `GELU`, `SiLU`, `Mish`, `SELU`, `Softplus`, `Hardswish`, `PReLU`, `Softmax`, ... |

| **Losses** | 15 | MSE, CrossEntropy, BCE, NLL, CTC, Focal, Triplet, KLDiv, SmoothL1, Cosine, Hinge, Margin, Poisson, ... |

| **Optimizers** | 7 | `SGD`, `Adam`, `AdamW`, `RMSprop`, `Adagrad`, `RAdam`, `NAdam` — all with parameter groups |

| **Schedulers** | 8 | Step, Cosine, Exponential, MultiStep, OneCycle, Cyclic, Warmup (composable), Plateau |

| **Init** | 9 | Xavier, Kaiming, orthogonal, truncated normal, uniform, normal |

| **Tensor Ops** | 100+ | Full arithmetic, trig, reductions, shape, indexing, comparisons, fused ops |

| **Autograd** | 90+ | Differentiable backward for every op above |

Fused Adam/AdamW on CUDA (single kernel for all parameters). Fused gradient

clipping via foreach ops. Mixed precision with `AutocastGuard` + `GradScaler`.

CUDA Graphs for replay-based training.

The [full migration guide](https://github.com/fab2s/floDl/blob/main/docs/pytorch_migration.md) has side-by-side

code for every op, module, and pattern.

## Performance

Same CUDA kernels as PyTorch — the difference comes from what happens

*between* kernel launches. Ten models, ten interleaved rounds, locked GPU

clocks (RTX 5060 Ti, v0.3.0 vs PyTorch 2.10.0):

| Model | PyTorch | flodl | Delta |

|---|---:|---:|---:|

| transformer | 3183.0 ms | 2199.8 ms | **-31%** |

| mlp | 291.1 ms | 207.0 ms | **-29%** |

| residual_tower | 406.9 ms | 309.7 ms | **-24%** |

| feedback_fixed | 275.3 ms | 231.3 ms | **-16%** |

| gated_routing | 248.0 ms | 217.3 ms | **-12%** |

| iterative_refine | 230.7 ms | 206.0 ms | **-11%** |

| gru_seq | 1105.1 ms | 1057.5 ms | **-4%** |

| conv_autoenc | 398.2 ms | 395.3 ms | -1% |

| lstm_seq | 692.3 ms | 692.3 ms | 0% |

| convnet | 1298.0 ms | 1298.2 ms | 0% |

Wins 8 of 10, ties 2, zero regressions. The ties (convnet, lstm_seq) are

compute-bound -- both frameworks saturate the GPU, confirming identical

CUDA kernels. The gap appears where framework overhead matters:

dispatch-bound architectures (transformer -31%, mlp -29%), graph routing

(residual_tower -24%), and recurrent loops (feedback_fixed -16%).

**[Benchmark Report](https://github.com/fab2s/floDl/blob/main/docs/benchmark.md)** |

[Interactive dashboard](https://flodl.dev/benchmark)

### Multi-GPU (DDP)

ResNet-20 on CIFAR-10, 200 epochs -- heterogeneous GPUs (RTX 5060 Ti +

GTX 1060, 2.5x speed ratio). Published reference: 91.25%

([He et al. 2015](https://arxiv.org/abs/1512.03385), Table 6):

| Mode | Eval | vs Published | Time | vs Solo-0 |

|---|---:|---:|---:|---:|

| solo-0 (fast GPU only) | 91.66% | +0.41% | 3127s | -- |

| nccl-async | **92.44%** | **+1.19%** | 2697s | 1.2x |

| nccl-cadence | **92.42%** | **+1.17%** | 2650s | 1.2x |

| cpu-async | **92.43%** | **+1.18%** | 2614s | 1.2x |

| cpu-cadence | **92.04%** | **+0.79%** | 2670s | 1.2x |

Every ElChe mode surpasses published accuracy while finishing faster

than the fast GPU alone. 200 epochs is where ElChe's proportional

scheduling has room to calibrate and shine -- shorter models (logistic

through gpt-nano) confirm DDP convergence across architectures.

**[DDP Benchmark Report](https://github.com/fab2s/floDl/blob/main/docs/ddp-benchmark.md)** --

full results for 8 models across 9 DDP modes

## Why Rust for Deep Learning?

**Deterministic memory.** Python adds ~3-5 us of framework overhead per GPU

op. Go's GC can't manage VRAM — an [earlier Go implementation](https://github.com/fab2s/goDl)

required 5 phases of lifecycle management (refcounting, GC callbacks, VRAM

budgets, pending-free queues). Rust replaces all of that with

`impl Drop for Tensor`. Memory is freed the instant a tensor leaves scope.

**Zero-cost safety.** Every op returns `Result` — no silent failures.

Ownership ensures tensors are freed exactly once. The borrow checker

prevents data races at compile time.

**Same GPU kernels.** floDl binds libtorch — the C++ library under

PyTorch. CUDA, cuBLAS, cuDNN are identical. floDl replaces the dispatch

path, autograd tracking, and graph execution.

## Features Reference

Training Tools

| Tool | What it does |

|------|-------------|

| `clip_grad_norm` / `clip_grad_value` | Fused gradient clipping (2 kernels total via foreach ops) |

| `save_checkpoint` / `load_checkpoint` | Named `.fdl` checkpoints, structural hash, partial loading, `LoadReport` |

| `migrate_checkpoint` | Remap parameter names across versions |

| `Parameter::freeze` / `unfreeze` | Per-parameter gradient control |

| `GradScaler` | Dynamic loss scaling for fp16 training |

| `cast_parameters` | Cast model parameters to any dtype |

| `CpuWorker` / `ModelSnapshot` | Background checkpoint saving |

| `CudaGraph` | Capture/replay training steps for fixed-shape models |

Module Traits

Beyond `forward`/`parameters`, `Module` provides optional methods the graph

recognizes automatically:

| Method | What happens |

|--------|-------------|

| `as_named_input()` | `using()` refs arrive as a named map |

| `reset()` | Loops auto-call before iterating — clears per-forward state |

| `detach_state()` | Break gradient chains on retained state |

| `sub_modules()` | Recursive device placement, training mode, parameter collection |

Build Profiles

```toml

# Optimize floDl in dev builds — your code stays fast to compile.

[profile.dev.package.flodl]

opt-level = 3

[profile.dev.package.flodl-sys]

opt-level = 3

# Release: cross-crate optimization for maximum throughput.

[profile.release]

lto = "thin"

codegen-units = 1

```

| Profile | flodl | Your code | Typical rebuild |

|---------|-------|-----------|-----------------|

| `cargo build` | `-O3` (cached) | `-O0` (fast) | < 2s |

| `cargo build --release` | `-O3` + LTO | `-O3` + LTO | full link |

Multi-GPU (DDP)

| Component | What it does |

|-----------|-------------|

| `Ddp::setup` | One-liner: detect GPUs, distribute, set optimizer, train |

| `Ddp::builder` | Thread-per-GPU with Local SGD, any Module |

| `ApplyPolicy` | Sync / Cadence / Async (when to average) |

| `AverageBackend` | Nccl / Cpu (how to average, A/B testable) |

| `ElChe` | Heterogeneous GPU cadence strategy |

| `NcclComms` / `NcclRankComm` | NCCL AllReduce, Broadcast, abort handles |

| `CudaEvent` / `CudaStream` | Async GPU-CPU pipeline, timing |

| `DataLoader` | Resident/streaming/distributed, VRAM-aware prefetch, auto OOM fallback |

### Numerical Verification

Every differentiable path is verified against finite-difference gradients:

- 117 autograd op-level checks (every op + compositions)

- Module-level checks (every NN module, input + parameter gradients)

- Exact optimizer step verifications (SGD, Adam, AdamW, RMSprop, Adagrad, RAdam, NAdam)

- 1027 library tests, zero clippy warnings — all tests run on both CPU and CUDA

### Hardware Compatibility

Developed and tested from NVIDIA Pascal (GTX 1060 6GB) to Blackwell

(RTX 5060 Ti 16GB). PyTorch dropped Pascal support after 2.5.1 — floDl

links libtorch's stable C API, which supports every architecture the driver

supports. If `nvidia-smi` works, floDl trains on it.

## Documentation

### Choose your path

| Background | Start here |

|-----------|-----------|

| **New to Rust** | [Rust for PyTorch Users](https://github.com/fab2s/floDl/blob/main/docs/tutorials/00-rust-primer.md) — 10 patterns in 15 minutes |

| **Know Rust, new to DL** | [Tensors](https://github.com/fab2s/floDl/blob/main/docs/tutorials/01-tensors.md) then [Training](https://github.com/fab2s/floDl/blob/main/docs/tutorials/04-training.md) |

| **Know PyTorch** | [Porting Guide](https://github.com/fab2s/floDl/blob/main/docs/porting.md) (or `/port` with AI) then [Graph Builder](https://github.com/fab2s/floDl/blob/main/docs/tutorials/05-graph-builder.md) |

| **Scaling to multi-GPU** | [Multi-GPU Training](https://github.com/fab2s/floDl/blob/main/docs/tutorials/11-multi-gpu.md) then [DDP Builder](https://github.com/fab2s/floDl/blob/main/docs/tutorials/12-async-ddp.md) |

| **Just show me code** | [`quickstart`](https://github.com/fab2s/floDl/tree/main/flodl/examples/quickstart/) or [`showcase`](https://github.com/fab2s/floDl/tree/main/flodl/examples/showcase/) |

### Tutorials

0. **[Rust for PyTorch Users](https://github.com/fab2s/floDl/blob/main/docs/tutorials/00-rust-primer.md)** — 10 Rust patterns in 15 minutes

1. **[Tensors](https://github.com/fab2s/floDl/blob/main/docs/tutorials/01-tensors.md)** — creation, ops, memory, CUDA

2. **[Autograd](https://github.com/fab2s/floDl/blob/main/docs/tutorials/02-autograd.md)** — variables, gradients, backward

3. **[Modules](https://github.com/fab2s/floDl/blob/main/docs/tutorials/03-modules.md)** — all layers, convolutions, RNNs, attention, normalization

4. **[Training](https://github.com/fab2s/floDl/blob/main/docs/tutorials/04-training.md)** — losses, optimizers, mixed precision, full loop

5. **[Graph Builder](https://github.com/fab2s/floDl/blob/main/docs/tutorials/05-graph-builder.md)** — fluent API from simple to complex

6. **[Advanced Graphs](https://github.com/fab2s/floDl/blob/main/docs/tutorials/06-advanced-graphs.md)** — forward refs, loops, gates, switches

7. **[Visualization](https://github.com/fab2s/floDl/blob/main/docs/tutorials/07-visualization.md)** — DOT/SVG, profiling heatmaps

8. **[Utilities](https://github.com/fab2s/floDl/blob/main/docs/tutorials/08-utilities.md)** — checkpoints, clipping, freezing, initialization, scheduling, verbosity-gated logging

9. **[Training Monitor](https://github.com/fab2s/floDl/blob/main/docs/tutorials/09-monitor.md)** — ETA, resource tracking, live dashboard

10. **[Graph Tree](https://github.com/fab2s/floDl/blob/main/docs/tutorials/10-graph-tree.md)** — hierarchical composition, freeze/thaw, subgraph checkpoints

11. **[Multi-GPU Training](https://github.com/fab2s/floDl/blob/main/docs/tutorials/11-multi-gpu.md)** — Ddp::setup, El Che, auto-balancing, DataLoader integration

12. **[DDP Builder](https://github.com/fab2s/floDl/blob/main/docs/tutorials/12-async-ddp.md)** — thread-per-GPU, Local SGD, A/B testable backends

13. **[Data Loading](https://github.com/fab2s/floDl/blob/main/docs/tutorials/13-data-loading.md)** — DataLoader, resident/streaming modes, VRAM-aware prefetch, DDP integration

### Examples

- [`quickstart`](https://github.com/fab2s/floDl/tree/main/flodl/examples/quickstart/) — build, train, and monitor a model with residual connections

- [`sine_wave`](https://github.com/fab2s/floDl/tree/main/flodl/examples/sine_wave/) — sine regression with monitor, checkpoint round-trip

- [`mixed_precision`](https://github.com/fab2s/floDl/tree/main/flodl/examples/mixed_precision/) — float16 training with `GradScaler`

- [`transfer_learning`](https://github.com/fab2s/floDl/tree/main/flodl/examples/transfer_learning/) — checkpoint, partial load, freeze, fine-tune

- [`schedulers`](https://github.com/fab2s/floDl/tree/main/flodl/examples/schedulers/) — warmup + cosine + plateau composition

- [`observation`](https://github.com/fab2s/floDl/tree/main/flodl/examples/observation/) — collect, flush, trend queries, early stopping

- [`showcase`](https://github.com/fab2s/floDl/tree/main/flodl/examples/showcase/) — every graph builder method in one graph

### Porting from PyTorch

- **[Porting Guide](https://github.com/fab2s/floDl/blob/main/docs/porting.md)** — module mapping, FlowBuilder patterns, training loop translation

- **[AI-assisted porting](https://github.com/fab2s/floDl/tree/main/ai/skills/port/)** — point any AI coding assistant at the skill guide for automated translation. With Claude Code: `/port my_model.py`

- **`fdl api-ref`** — generate a structured API reference for your flodl version. Used by AI tools and useful on its own.

### Architecture

```

+-----------------------------------------------------------+

|  User Code / Model Definitions                            |

+-----------------------------------------------------------+

|  monitor/  ETA, resource tracking, live web dashboard     |

+-----------------------------------------------------------+

|  graph/    Fluent builder, graph tree, execution, DOT/SVG |

+-----------------------------------------------------------+

|  data/     DataLoader, resident/streaming, prefetch       |

+-----------------------------------------------------------+

|  nn/       Modules, losses, optimizers, DDP, NCCL         |

+-----------------------------------------------------------+

|  autograd/ Reverse-mode AD, gradient tracking             |

+-----------------------------------------------------------+

|  tensor/   Owned tensors with Drop, CPU + CUDA            |

+-----------------------------------------------------------+

|  flodl-sys   FFI bindings to libtorch C++ shim            |

+-----------------------------------------------------------+

|  libtorch / CUDA / NCCL                                   |

+-----------------------------------------------------------+

```

## Story

floDl started as a question: what would a deep learning framework look like

if you designed it around Rust's ownership model instead of fighting a garbage

collector?

An [earlier attempt in Go](https://github.com/fab2s/goDl) proved the

architecture — the graph builder, the module system, the observation engine —

but hit a wall: Go's GC cannot manage GPU memory deterministically. That

required building five layers of memory management infrastructure on top of

the language, not with it.

Rust solved this at the language level. `impl Drop for Tensor` replaced

hundreds of lines of lifecycle management. The graph builder, module

composition, and design philosophy carried forward; the memory fights didn't.

## License

floDl is open-sourced software licensed under the [MIT license](https://github.com/fab2s/floDl/blob/main/LICENSE).
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/fab2s/flodl

Awesome Lists containing this project

README

floDl