https://github.com/jman4162/sizing-ai-training-by-cost-per-memory-bandwidth
A practical model (with math + Python) to tell if you’re compute-, memory-, or network-bound—and what to buy next
- Host: GitHub
- URL: https://github.com/jman4162/sizing-ai-training-by-cost-per-memory-bandwidth
- Owner: jman4162
- License: mit
- Created: 2025-09-04T04:30:30.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2025-09-04T06:00:40.000Z (about 1 month ago)
- Last Synced: 2025-09-12T11:15:47.457Z (27 days ago)
- Topics: ai, ai-infrastructure, aws, aws-ec2, cost-optimization, distributed-systems, distributed-training, hbm, llm, llm-training, machine-learning, memory-bandwidth, ml, nccl, pytorch, roofline-model, systems-performance, transformer
- Language: Jupyter Notebook
- Homepage:
- Size: 23.4 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Sizing AI Training by **Cost per Memory Bandwidth**
*A practical, first-order model (math + Python) to tell if you’re compute-, memory-, or network-bound—and how to pick the cheapest TB/s that hits your tokens/sec target.*
> Notebook: **Sizing\_AI\_Training\_by\_Cost\_per\_Memory\_Bandwidth.ipynb** (this repo). ([GitHub][1])
## Why this exists
Frontier-scale transformer training often hits the **memory wall**: step time is limited by how fast bytes move through **HBM/GDDR**, not by peak TFLOPs. This project provides a compact model—both in math and code—to:
* Diagnose whether a run is **compute**, **memory**, or **network** bound
* Estimate **tokens/sec per GPU**, GPUs needed for a target throughput, and cluster **TB/s**
* Compare hardware using **\$/TB/s/hour** (cost per memory bandwidth), which often tracks throughput/\$ better than TFLOPs/\$ for large LLM training

## What’s inside
* 📓 **Notebook** with the derivation + reference implementation
* 🧮 **Equations** for FLOPs/token, bytes/token (optimizer + activations), arithmetic intensity, and network-bound checks
* 🧰 **Tunable knobs** for FlashAttention, activation checkpointing, optimizer precision, global tokens/step, etc.
* 🧪 **Example catalog** entries for common GPUs (editable to your pricing/specs)

---
## Quickstart
```bash
# 1) Clone
git clone https://github.com/jman4162/Sizing-AI-Training-by-Cost-per-Memory-Bandwidth
cd Sizing-AI-Training-by-Cost-per-Memory-Bandwidth

# 2) (Recommended) Create an environment
python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate

# 3) Install minimal deps for running the notebook
python -m pip install --upgrade pip jupyterlab

# 4) Launch and open the notebook
jupyter lab
```

> The notebook uses only the standard library (`dataclasses`, `math`). If you add plots, install `matplotlib` too.
---
## Usage pattern
1. **Fill in your run**
   * Model size $N$, layers $L$, hidden size $d_{\text{model}}$
   * Global tokens per step $B_g$ (global batch × sequence length)
   * Optimizer traffic $\alpha_{\text{opt}}$ (e.g., Adam bf16 ≈ 16–20 B/param/step)
   * Activation traffic coefficient $c_{\text{act}}$ (lower with FlashAttention/fused kernels)
   * Recompute multiplier $\gamma$ (1.1–1.4 with activation checkpointing); first-order forms of these quantities are sketched after this list
2. **Set hardware entries**
   Usable TFLOPs (bf16/fp16), HBM TB/s, NIC Gb/s, and your **\$/GPU-hr**.
3. **Ask the two key questions**
   * What’s the **bottleneck**? (`compute`, `memory`, or `network`)
   * Among configs that aren’t network-bound, which gives the lowest **\$/TB/s·hr** while meeting your tokens/sec target?
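As a quick orientation (the exact derivations, including the network-bound check, are in the notebook; the forms below are simplified and assume a single data-parallel rank):

$$
\text{FLOPs/token} \approx \kappa N,
\qquad
\text{HBM bytes/token} \approx \frac{\alpha_{\text{opt}}\,N}{B_g} \;+\; \gamma\, c_{\text{act}}\, L\, d_{\text{model}}\, b
$$

$$
I \;=\; \frac{\text{FLOPs/token}}{\text{HBM bytes/token}},
\qquad
\text{memory-bound when } I \;<\; \frac{\text{peak FLOPs/s}}{\text{HBM bytes/s}}
$$

Here $\kappa$ is the FLOPs-per-parameter-per-token constant ($\approx 6$ for training, $\approx 2$ for inference) and $b$ the bytes per activation element; the ratio on the right is the machine balance the code reports.

---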
## Minimal code snippet (from the notebook)
```python
from dataclasses import dataclass
from math import ceil

@dataclass
class Hardware:
    name: str
    peak_flops_tflops: float
    hbm_tbps: float
    nic_gbps: float
    price_per_gpu_hr: float
    utilization: float = 0.75

@dataclass
class Model:
    n_params: float
    layers: int
    d_model: int
    bytes_per_elem: int = 2

@dataclass
class TrainingCfg:
    k_flops_per_token: float = 6.0
    recompute_mult: float = 1.0
    alpha_opt_bytes_per_param: float = 16.0
    c_act: float = 6.0
    global_tokens_per_step: int = 512_000
    bytes_per_grad_elem: int = 2

# ...functions for per_token_flops, per_token_hbm_bytes, per_token_net_bytes...

def tokens_per_sec_per_gpu(hw, model, train, dp_world_size=1):
    # returns r_gpu, r_comp, r_mem, r_net, bound, intensity, machine_balance
    ...

def plan_cluster(hw, model, train, tokens_per_sec_target, dp_world_size=1):
    # returns per-GPU rate, GPUs needed, $/hr, cluster HBM TB/s, $/TB/s·hr
    ...
```
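The full implementations live in the notebook. As a rough illustration of the roofline logic those elided functions follow, here is a minimal sketch that plugs into the dataclasses and imports above; the byte and network accounting is a simplified assumption, not the notebook’s exact formulas:

```python
def per_token_flops(model, train):
    # κ · N: k_flops_per_token ≈ 6 covers fwd + bwd for a dense transformer.
    return train.k_flops_per_token * model.n_params

def per_token_hbm_bytes(model, train):
    # Optimizer/weight traffic amortized over the global tokens per step, plus a
    # c_act-scaled activation term (recompute_mult placed on activations here; a modeling choice).
    opt = train.alpha_opt_bytes_per_param * model.n_params / train.global_tokens_per_step
    act = train.recompute_mult * train.c_act * model.layers * model.d_model * model.bytes_per_elem
    return opt + act

def per_token_net_bytes(model, train, dp_world_size=1):
    # Ring all-reduce of gradients moves ≈ 2·(p−1)/p of the gradient bytes per step,
    # amortized over this rank's share of the global tokens per step.
    if dp_world_size <= 1:
        return 0.0
    grad_bytes = 2 * (dp_world_size - 1) / dp_world_size * model.n_params * train.bytes_per_grad_elem
    return grad_bytes / (train.global_tokens_per_step / dp_world_size)

def tokens_per_sec_per_gpu(hw, model, train, dp_world_size=1):
    # Roofline-style: each resource gives a tokens/sec ceiling; the minimum binds.
    r_comp = hw.utilization * hw.peak_flops_tflops * 1e12 / per_token_flops(model, train)
    r_mem = hw.hbm_tbps * 1e12 / per_token_hbm_bytes(model, train)
    net_bytes = per_token_net_bytes(model, train, dp_world_size)
    r_net = float("inf") if net_bytes == 0 else hw.nic_gbps * 1e9 / 8 / net_bytes
    r_gpu = min(r_comp, r_mem, r_net)
    bound = {r_comp: "compute", r_mem: "memory", r_net: "network"}[r_gpu]
    intensity = per_token_flops(model, train) / per_token_hbm_bytes(model, train)  # FLOPs per HBM byte
    machine_balance = hw.peak_flops_tflops / hw.hbm_tbps                           # peak FLOPs per HBM byte
    return r_gpu, r_comp, r_mem, r_net, bound, intensity, machine_balance

def plan_cluster(hw, model, train, tokens_per_sec_target, dp_world_size=1):
    # Scale out: GPUs needed to hit the target, and what that costs per TB/s of HBM.
    r_gpu, *_ = tokens_per_sec_per_gpu(hw, model, train, dp_world_size)
    n_gpus = ceil(tokens_per_sec_target / r_gpu)
    cost_per_hr = n_gpus * hw.price_per_gpu_hr
    cluster_tbps = n_gpus * hw.hbm_tbps
    return r_gpu, n_gpus, cost_per_hr, cluster_tbps, cost_per_hr / cluster_tbps
```

The \$/TB/s·hr figure is simply `cost_per_hr / cluster_tbps`; for a fixed bytes/token, minimizing it among configs that are not network-bound is equivalent to maximizing memory-bound tokens/sec per dollar.

---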
## Interpreting results
* **`bound == "memory"`** → You’re memory-bandwidth bound.
  * Reduce bytes/token: FlashAttention, fused kernels, 8-bit optimizers, bigger $B_g$ (if stable).
  * Prefer hardware with **better \$/TB/s·hr** (e.g., higher HBM BW per \$).
* **`bound == "network"`** → All-reduce is the choke point.
  * Increase $B_g$, reduce pure DP (add TP/PP/ZeRO), overlap comms, or raise effective NIC BW (EFA/IB).
* **`bound == "compute"`** → Great! Improve utilization and ensure you’re not secretly I/O-constrained.
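For a concrete feel of the metric: divide the hourly price by the HBM bandwidth. The bandwidths below are approximate public specs and the prices are made-up placeholders, so substitute your own quotes:

```python
# $/TB/s·hr = hourly price / HBM bandwidth. Prices here are placeholders, not quotes.
catalog = {
    "H100-SXM": {"hbm_tbps": 3.35, "price_per_gpu_hr": 4.00},
    "H200":     {"hbm_tbps": 4.80, "price_per_gpu_hr": 5.50},
}
for name, hw in catalog.items():
    print(f"{name}: ${hw['price_per_gpu_hr'] / hw['hbm_tbps']:.2f} per TB/s per hour")
# → H100-SXM ≈ $1.19, H200 ≈ $1.15: with these made-up prices the pricier GPU
#   is still the better buy for a memory-bound run.
```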
---
## Examples to try
* Compare **H100 vs H200 vs L4** for a 70B model at target 200k tokens/sec (a hypothetical `plan_cluster` call is sketched below).
* Flip to **inference** by setting $\kappa\approx2$, $\alpha_{\text{opt}}=0$, and modeling **KV-cache** bytes/token instead of activations (see the config sketch below).
* Test the effect of **global tokens/step** on the network bound (watch `r_net`).
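The first two bullets can be expressed directly against the sketch above. The hardware numbers and model shape below are illustrative assumptions (H100-class specs, a roughly Llama-70B-shaped model, placeholder pricing), not measured values:

```python
# 1) Hypothetical 70B training plan on an H100-class entry at 200k tokens/sec.
h100 = Hardware(name="H100-SXM", peak_flops_tflops=989, hbm_tbps=3.35,
                nic_gbps=400, price_per_gpu_hr=4.00)    # price is a placeholder
m70b = Model(n_params=70e9, layers=80, d_model=8192)    # Llama-70B-shaped
rate, n_gpus, cost_hr, cluster_tbps, usd_per_tbps_hr = plan_cluster(
    h100, m70b, TrainingCfg(), tokens_per_sec_target=2e5)

# 2) Rough inference flip: κ ≈ 2 FLOPs/param/token, no optimizer traffic; c_act becomes
#    a stand-in for KV-cache bytes/token until you model the cache explicitly.
inference_cfg = TrainingCfg(k_flops_per_token=2.0,
                            alpha_opt_bytes_per_param=0.0,
                            c_act=2.0)
```

---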
## Roadmap
* [ ] Helper CLI: `python plan.py --model 70b --target-tps 2e5 --hw h100,h200`
* [ ] Plotting helpers (roofline view; \$/TB/s vs design points)
* [ ] Inference variant (KV cache), MoE variant (active params), long-context attention presets
* [ ] Optional YAML config for reproducible comparisons

---
## Contributing
PRs and issues welcome! Ideas:
* Add measured bandwidth/utilization from your cluster
* Additional hardware profiles and real **\$/TB/s·hr** snapshots
* Verified presets for FlashAttention, 8-bit optimizers, ZeRO, etc.

---
## Project Files
* [Sizing_AI_Training_by_Cost_per_Memory_Bandwidth.ipynb](./Sizing_AI_Training_by_Cost_per_Memory_Bandwidth.ipynb) — Main notebook with model and code.
* [The KV Cache: What It Is, Why It Matters, and How to Size It for Modern LLMs](./The_KV_Cache_What_It_Is,_Why_It_Matters,_and_How_to_Size_It_for_Modern_LLMs.ipynb) — Deep dive notebook on KV cache sizing and implications for LLM inference.

---
## References & further reading
* Roofline model (compute vs memory bound) — Williams et al., *CACM* (2009)
* FlashAttention (I/O-aware attention) — Dao et al., *arXiv:2205.14135*
* Megatron-LM scaling & comms patterns — Shoeybi et al., *arXiv:1909.08053*
* ZeRO optimizer sharding — Rajbhandari et al., *SC’20* / arXiv:1910.02054
* 8-bit optimizers — Dettmers et al., *arXiv:2110.02861*
* NCCL collectives, EFA/libfabric plugin — NVIDIA & AWS docs

*(See the blog post for a longer, linked bibliography.)*
---
## License
Released under the MIT License; see [LICENSE](./LICENSE).
---
## Citation
If this helped your team ship or save money, feel free to cite the repo/blog post or drop a star ⭐.
```bibtex
@misc{cost_per_memory_bandwidth,
  title  = {Sizing AI Training by Cost per Memory Bandwidth},
  author = {Hodge, John},
  year   = {2025},
  url    = {https://github.com/jman4162/Sizing-AI-Training-by-Cost-per-Memory-Bandwidth}
}
```

---
[1]: https://github.com/jman4162/Sizing-AI-Training-by-Cost-per-Memory-Bandwidth/blob/main/Sizing_AI_Training_by_Cost_per_Memory_Bandwidth.ipynb "Sizing-AI-Training-by-Cost-per-Memory-Bandwidth/Sizing_AI_Training_by_Cost_per_Memory_Bandwidth.ipynb"