# Sizing AI Training by **Cost per Memory Bandwidth**

*A practical, first-order model (math + Python) to tell if you’re compute-, memory-, or network-bound—and how to pick the cheapest TB/s that hits your tokens/sec target.*

> Notebook: **Sizing\_AI\_Training\_by\_Cost\_per\_Memory\_Bandwidth.ipynb** (this repo). ([GitHub][1])

## Why this exists

Frontier-scale transformer training often hits the **memory wall**: step time is limited by how fast bytes move through **HBM/GDDR**, not by peak TFLOPs. This project provides a compact model—both in math and code—to:

* Diagnose whether a run is **compute**, **memory**, or **network** bound (a roofline check, sketched below)
* Estimate **tokens/sec per GPU**, GPUs needed for a target throughput, and cluster **TB/s**
* Compare hardware using **\$/TB/s/hour** (cost per memory bandwidth), which often tracks throughput/\$ better than TFLOPs/\$ for large LLM training
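
The diagnosis is a roofline comparison: the run’s arithmetic intensity (FLOPs performed per HBM byte moved) against the GPU’s machine balance (peak FLOP/s per HBM byte/s). A minimal sketch of that check; the H100-class specs and the bytes/token figure are illustrative placeholders, not values derived by the notebook:

```python
# Roofline-style bound check -- illustrative numbers, not the notebook's exact model.
PEAK_TFLOPS = 989.0   # approximate H100 SXM bf16 dense throughput
HBM_TBPS = 3.35       # approximate H100 SXM HBM3 bandwidth

machine_balance = (PEAK_TFLOPS * 1e12) / (HBM_TBPS * 1e12)  # FLOPs per HBM byte at the ridge point

flops_per_token = 6 * 70e9   # ~kappa * N for training a 70B-parameter model
bytes_per_token = 2.5e9      # placeholder; use the notebook's per-token HBM bytes for a real run

intensity = flops_per_token / bytes_per_token
bound = "memory" if intensity < machine_balance else "compute"
print(f"intensity ~{intensity:.0f} FLOPs/byte, balance ~{machine_balance:.0f} -> {bound}-bound")
```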

## What’s inside

* 📓 **Notebook** with the derivation + reference implementation
* 🧮 **Equations** for FLOPs/token, bytes/token (optimizer + activations), arithmetic intensity, and network-bound checks
* 🧰 **Tunable knobs** for FlashAttention, activation checkpointing, optimizer precision, global tokens/step, etc.
* 🧪 **Example catalog** entries for common GPUs (editable to your pricing/specs)

---

## Quickstart

```bash
# 1) Clone
git clone https://github.com/jman4162/Sizing-AI-Training-by-Cost-per-Memory-Bandwidth
cd Sizing-AI-Training-by-Cost-per-Memory-Bandwidth

# 2) (Recommended) Create an environment
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate

# 3) Install minimal deps for running the notebook
python -m pip install --upgrade pip jupyterlab

# 4) Launch and open the notebook
jupyter lab
```

> The notebook uses only the standard library (`dataclasses`, `math`). If you add plots, install `matplotlib` too.

---

## Usage pattern

1. **Fill in your run** (these knobs combine as sketched after this list)

   * Model size $N$, layers $L$, hidden size $d_{\text{model}}$
   * Global tokens per step $B_g$ (global batch × sequence length)
   * Optimizer traffic $\alpha_{\text{opt}}$ (e.g., Adam bf16 ≈ 16–20 B/param/step)
   * Activation traffic coefficient $c_{\text{act}}$ (lower with FlashAttention/fused kernels)
   * Recompute multiplier $\gamma$ (1.1–1.4 with activation checkpointing)

2. **Set hardware entries**

   Usable TFLOPs (bf16/fp16), HBM TB/s, NIC Gb/s, and your **\$/GPU-hr**.

3. **Ask the two key questions**

   * What’s the **bottleneck**? (`compute`, `memory`, or `network`)
   * Among configs that aren’t network-bound, which gives the lowest **\$/TB/s·hr** while meeting your tokens/sec target?

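For orientation, the sketch below gives one plausible first-order shape for the quantities these knobs feed (the notebook's exact expressions may differ); $b$ is the bytes per element and $\mathrm{DP}$ the data-parallel world size:

$$
\text{FLOPs/token} \approx \kappa\,N\,\gamma,
\qquad
\text{HBM bytes/token} \approx c_{\text{act}}\,L\,d_{\text{model}}\,b + \frac{\alpha_{\text{opt}}\,N}{B_g/\mathrm{DP}},
\qquad
I = \frac{\text{FLOPs/token}}{\text{HBM bytes/token}}
$$

A run is memory-bound when the arithmetic intensity $I$ falls below the GPU's machine balance (peak FLOP/s divided by HBM bytes/s), and network-bound when the gradient all-reduce bytes per step exceed what the NIC can move within the step time.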
---

## Minimal code snippet (from the notebook)

```python
from dataclasses import dataclass
from math import ceil

@dataclass
class Hardware:
    name: str
    peak_flops_tflops: float
    hbm_tbps: float
    nic_gbps: float
    price_per_gpu_hr: float
    utilization: float = 0.75

@dataclass
class Model:
    n_params: float
    layers: int
    d_model: int
    bytes_per_elem: int = 2

@dataclass
class TrainingCfg:
    k_flops_per_token: float = 6.0
    recompute_mult: float = 1.0
    alpha_opt_bytes_per_param: float = 16.0
    c_act: float = 6.0
    global_tokens_per_step: int = 512_000
    bytes_per_grad_elem: int = 2

# ...functions for per_token_flops, per_token_hbm_bytes, per_token_net_bytes...

def tokens_per_sec_per_gpu(hw, model, train, dp_world_size=1):
    # returns r_gpu, r_comp, r_mem, r_net, bound, intensity, machine_balance
    ...

def plan_cluster(hw, model, train, tokens_per_sec_target, dp_world_size=1):
    # returns per-GPU rate, GPUs needed, $/hr, cluster HBM TB/s, $/TB/s·hr
    ...
```
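
With the elided functions filled in as in the notebook, usage looks roughly like the following. The H100-style specs are approximate, the hourly price and `dp_world_size` are placeholders, and the model shape mirrors a Llama-2-70B-class network:

```python
# Hypothetical usage of the sketch above; requires the notebook's full function bodies.
h100 = Hardware(name="H100-SXM", peak_flops_tflops=989.0, hbm_tbps=3.35,
                nic_gbps=400.0, price_per_gpu_hr=4.50)    # price is a placeholder, not a quote
llama70b = Model(n_params=70e9, layers=80, d_model=8192)  # Llama-2-70B-like shape
cfg = TrainingCfg(recompute_mult=1.2, global_tokens_per_step=2_000_000)

per_gpu = tokens_per_sec_per_gpu(h100, llama70b, cfg, dp_world_size=256)
plan = plan_cluster(h100, llama70b, cfg, tokens_per_sec_target=2e5, dp_world_size=256)
print(per_gpu)  # per-GPU rate, the per-roofline rates, and which bound applies
print(plan)     # GPUs needed, $/hr, cluster HBM TB/s, $/TB/s·hr
```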

---

## Interpreting results

* **`bound == "memory"`** → You’re memory-bandwidth bound.

  * Reduce bytes/token: FlashAttention, fused kernels, 8-bit optimizers, bigger $B_g$ (if stable).
  * Prefer hardware with **better \$/TB/s·hr** (e.g., higher HBM BW per \$); a helper for this metric is sketched below.

* **`bound == "network"`** → All-reduce is the choke point.

  * Increase $B_g$, reduce pure DP (add TP/PP/ZeRO), overlap comms, or raise effective NIC BW (EFA/IB).

* **`bound == "compute"`** → Great! Improve utilization and ensure you’re not secretly I/O-constrained.

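The headline metric itself is simple catalog arithmetic; a minimal helper, assuming the `Hardware` fields from the snippet above (the example numbers are illustrative):

```python
def dollars_per_tbps_hr(hw: Hardware) -> float:
    """Cost of one TB/s of HBM bandwidth for one hour; lower is better when memory-bound."""
    return hw.price_per_gpu_hr / hw.hbm_tbps

# e.g., a $4.50/hr GPU with 3.35 TB/s of HBM comes to ~$1.34 per TB/s·hr
```
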
---

## Examples to try

* Compare **H100 vs H200 vs L4** for a 70B model at target 200k tokens/sec.
* Flip to **inference** by setting $\kappa\approx2$, $\alpha_{\text{opt}}=0$, and modeling **KV-cache** bytes/token instead of activations (a back-of-envelope sketch follows this list).
* Test the effect of **global tokens/step** on the network bound (watch `r_net`).
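
For the inference flip in the second bullet, KV-cache traffic per decoded token has a standard back-of-envelope form. The helper below is a sketch, not part of the notebook; it assumes full multi-head attention (grouped-query or multi-query attention shrinks the KV width accordingly):

```python
# Back-of-envelope HBM traffic for the KV cache per decoded token (decode phase, batch size 1).
def kv_bytes_per_token(layers: int, d_model: int, context_len: int, bytes_per_elem: int = 2) -> float:
    kv_write = 2 * layers * d_model * bytes_per_elem                # append K and V for the new token
    kv_read = 2 * layers * d_model * bytes_per_elem * context_len   # read the whole cache to attend
    return kv_write + kv_read

# An 80-layer, d_model=8192 model at 8k context in fp16: roughly 21 GB moved per decoded token
print(f"{kv_bytes_per_token(80, 8192, 8192) / 1e9:.1f} GB/token")
```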

---

## Roadmap

* [ ] Helper CLI: `python plan.py --model 70b --target-tps 2e5 --hw h100,h200`
* [ ] Plotting helpers (roofline view; \$/TB/s vs design points)
* [ ] Inference variant (KV cache), MoE variant (active params), long-context attention presets
* [ ] Optional YAML config for reproducible comparisons

---

## Contributing

PRs and issues welcome! Ideas:

* Add measured bandwidth/utilization from your cluster
* Additional hardware profiles and real **\$/TB/s·hr** snapshots
* Verified presets for FlashAttention, 8-bit optimizers, ZeRO, etc.

---

## Project Files

* [Sizing_AI_Training_by_Cost_per_Memory_Bandwidth.ipynb](./Sizing_AI_Training_by_Cost_per_Memory_Bandwidth.ipynb) — Main notebook with model and code.
* [The KV Cache: What It Is, Why It Matters, and How to Size It for Modern LLMs](./The_KV_Cache_What_It_Is,_Why_It_Matters,_and_How_to_Size_It_for_Modern_LLMs.ipynb) — Deep dive notebook on KV cache sizing and implications for LLM inference.

---

## References & further reading

* Roofline model (compute vs memory bound) — Williams et al., *CACM* (2009)
* FlashAttention (I/O-aware attention) — Dao et al., *arXiv:2205.14135*
* Megatron-LM scaling & comms patterns — Shoeybi et al., *arXiv:1909.08053*
* ZeRO optimizer sharding — Rajbhandari et al., *arXiv:1910.02054* (SC ’20)
* 8-bit optimizers — Dettmers et al., *arXiv:2110.02861*
* NCCL collectives, EFA/libfabric plugin — NVIDIA & AWS docs

*(See the blog post for a longer, linked bibliography.)*

---

## License

Specify a license for reuse (e.g., MIT or Apache-2.0). If you add a `LICENSE` file, link it here.

---

## Citation

If this helped your team ship or save money, feel free to cite the repo/blog post or drop a star ⭐.

```bibtex
@misc{cost_per_memory_bandwidth,
  title  = {Sizing AI Training by Cost per Memory Bandwidth},
  author = {Hodge, John},
  year   = {2025},
  url    = {https://github.com/jman4162/Sizing-AI-Training-by-Cost-per-Memory-Bandwidth}
}
```

---

[1]: https://github.com/jman4162/Sizing-AI-Training-by-Cost-per-Memory-Bandwidth/blob/main/Sizing_AI_Training_by_Cost_per_Memory_Bandwidth.ipynb "Sizing-AI-Training-by-Cost-per-Memory-Bandwidth/Sizing_AI_Training_by_Cost_per_Memory_Bandwidth.ipynb"