https://github.com/traceopt-ai/traceml
Lightweight training runtime health monitor
ai data-science deep-learning huggingface machine-learning mlops observability pytorch runtime runtime-system straggler-problem training
- Host: GitHub
- URL: https://github.com/traceopt-ai/traceml
- Owner: traceopt-ai
- License: apache-2.0
- Created: 2025-08-17T12:46:51.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2026-03-02T22:20:51.000Z (3 months ago)
- Last Synced: 2026-03-03T00:44:01.191Z (3 months ago)
- Topics: ai, data-science, deep-learning, huggingface, machine-learning, mlops, observability, pytorch, runtime, runtime-system, straggler-problem, training
- Language: Python
- Homepage: https://traceopt.ai/
- Size: 4.84 MB
- Stars: 97
- Watchers: 1
- Forks: 8
- Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
# TraceML
**Know what’s slowing your (PyTorch) training, while it runs**
[PyPI](https://pypi.org/project/traceml-ai/)
[Downloads](https://pepy.tech/project/traceml-ai)
[GitHub](https://github.com/traceopt-ai/traceml)
[Python](https://www.python.org/)
[License](./LICENSE)
TraceML provides step-level training visibility for PyTorch workloads. It shows where time and memory go inside each training step so you can
quickly understand performance behavior across single-GPU and single-node DDP runs.
**Current support**
- ✅ Single GPU
- ✅ Single-node multi-GPU (**DDP**)
- ❌ Multi-node DDP (not yet)
- ❌ FSDP / TP / PP (not yet)
---
## What You See in Minutes
- System signals (CPU, RAM, GPU)
- Breakdown of each training step:
  - `dataloader → forward → backward → optimizer → overhead`
- Median vs worst rank (in case of DDP)
- Skew (%) to surface imbalance
- GPU memory (allocated + peak)
- End-of-run summary card with straggler rank and step breakdown
Healthy runs are clearly stable. Unstable runs reveal drift, imbalance, or memory creep early.
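To make the step breakdown concrete, here is a minimal hand-rolled sketch of how named phases and a residual "overhead" bucket can be derived from wall-clock timestamps. It is an illustration only, not TraceML's implementation: the `StepBreakdown` class, its method names, and the choice to define overhead as total step time minus the sum of named phases are all assumptions made for this example.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StepBreakdown:
    """Hypothetical per-step phase timer (illustrative, not TraceML internals)."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._phases = defaultdict(float)
        self._step_start = None

    def start_step(self):
        # Reset phase totals and remember when the step began.
        self._phases.clear()
        self._step_start = self._clock()

    @contextmanager
    def phase(self, name):
        # Accumulate elapsed time under the given phase name.
        start = self._clock()
        try:
            yield
        finally:
            self._phases[name] += self._clock() - start

    def end_step(self):
        total = self._clock() - self._step_start
        report = dict(self._phases)
        # Assumed definition: time not covered by any named phase is overhead.
        report["overhead"] = max(total - sum(self._phases.values()), 0.0)
        report["total"] = total
        return report


# Usage with placeholder work standing in for real training phases.
timer = StepBreakdown()
timer.start_step()
with timer.phase("dataloader"):
    batch = list(range(1000))
with timer.phase("forward"):
    result = sum(x * x for x in batch)
report = timer.end_step()
print({name: round(seconds, 6) for name, seconds in report.items()})
```

On a GPU, CUDA kernels launch asynchronously, so plain host-side timestamps like these would under-report forward and backward time unless the device is synchronized or CUDA events are used.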
---
## Quick Start
> **More detailed Quickstart:** [docs/quickstart.md](docs/quickstart.md) covers install, run modes, DDP, and troubleshooting.
Install:
``` bash
pip install traceml-ai
```
Wrap your training step:
``` python
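from traceml.decorators import trace_step

# model, criterion, optimizer, and dataloader come from your existing script.
for batch in dataloader:
    # The with-block marks the boundaries of one training step.
    with trace_step(model):
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)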
```
Run with the CLI:
``` bash
traceml run train.py
```
The terminal dashboard opens alongside your logs. At run end, TraceML also prints a compact runtime summary card for quick review and sharing.

Optional web UI:
``` bash
traceml run train.py --mode=dashboard
```

---
## What TraceML Surfaces
### Step-Level Signals
- Dataloader fetch time
- Step time (low-overhead, GPU-aware)
- Step GPU memory (allocated + peak)
Across ranks:
- Median (typical behavior)
- Worst rank (slowest / highest memory)
- Skew (% difference)
This makes rank imbalance and straggler behavior immediately visible.
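As a rough illustration of the cross-rank summary, the sketch below computes median, worst rank, and skew from one metric value per rank. The README does not give TraceML's exact skew formula, so the definition used here (how far the worst rank sits above the median, as a percentage of the median) is an assumption, as are the `rank_summary` function name and the sample numbers.

```python
from statistics import median


def rank_summary(per_rank_values):
    """Summarize one metric (e.g. step time in ms) across DDP ranks.

    per_rank_values maps rank id -> metric value for the same step.
    """
    if not per_rank_values:
        raise ValueError("need at least one rank")
    med = median(per_rank_values.values())
    worst_rank = max(per_rank_values, key=per_rank_values.get)
    worst = per_rank_values[worst_rank]
    # Assumed definition of skew: worst rank relative to the median, in percent.
    skew_pct = 0.0 if med == 0 else (worst - med) / med * 100.0
    return {
        "median": med,
        "worst": worst,
        "worst_rank": worst_rank,
        "skew_pct": skew_pct,
    }


# Made-up step times (ms) for a 4-GPU run where rank 3 is a straggler.
step_ms = {0: 100.0, 1: 102.0, 2: 98.0, 3: 150.0}
print(rank_summary(step_ms))
```

Using the median rather than the mean keeps the "typical" value from being pulled toward the straggler, which is what makes the gap to the worst rank easy to read.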
---
## Deep-Dive Mode (Optional)
Enable model-level hooks for diagnostic context:
``` python
from traceml.decorators import trace_model_instance
trace_model_instance(model)
```
Use together with `trace_step(model)` to enable:
- Per-layer memory signals
- Per-layer forward/backward timing
- Lightweight failure attribution (experimental)
If deep-dive mode is not enabled, the ESSENTIAL (step-level) signals remain unchanged.
---
## What It Is Not
- Not a replacement for PyTorch Profiler or Nsight
- Not an auto-tuner
- Not a kernel-level tracer
TraceML focuses on step-level visibility that is practical during real
training runs.
---
## Supported Environments
- Python 3.10+
- PyTorch 2.5+
- macOS (Intel/ARM), Linux
- Single GPU
- Single-node DDP
---
## Hugging Face Integration
TraceML provides a seamless integration with Hugging Face `transformers` via `TraceMLTrainer`.
### Usage
Replace `transformers.Trainer` with `traceml.hf_decorators.TraceMLTrainer`.
```python
from traceml.hf_decorators import TraceMLTrainer

trainer = TraceMLTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    traceml_enabled=True,
)
```
See the full [Hugging Face integration guide](docs/huggingface.md) for NLP, vision, DDP examples, and a complete `TraceMLTrainer` parameter reference.
---
## PyTorch Lightning Integration
TraceML offers official support for PyTorch Lightning models through `TraceMLCallback`.
### Usage
Simply pass the callback to your `Trainer`.
```python
import lightning as L
from traceml.utils.lightning import TraceMLCallback
from traceml.decorators import trace_model_instance


class MyLightningModule(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = ...
        # Optional: enable deep-dive per-layer instrumentation
        trace_model_instance(self)

    def training_step(self, batch, batch_idx):
        ...


trainer = L.Trainer(callbacks=[TraceMLCallback()])
```
---
## Roadmap
Near-term:
- Single-node DDP hardening
- Disk run logging
- Compatibility validation (gradient accumulation, torch.compile)
- Accelerate / Lightning wrappers

Next:
- Multi-node DDP
- Initial FSDP support

Later:
- Tensor / Pipeline parallel awareness
---
## Contributing
Contributions are welcome.
When opening issues, include:
- Minimal repro script
- Hardware + CUDA + PyTorch versions
- ESSENTIAL vs DEEP-DIVE
- Single GPU vs DDP
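
One way to gather the version details for an issue is a small standard-library script like the sketch below. The `environment_report` helper is hypothetical and not part of TraceML. It only reads installed package metadata, using the `traceml-ai` distribution name from the install command above.

```python
import platform
from importlib import metadata


def environment_report(packages=("torch", "traceml-ai")):
    """Collect version details worth pasting into an issue (hypothetical helper)."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of failing.
            report[name] = "not installed"
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

With PyTorch installed, `torch.version.cuda` and `python -m torch.utils.collect_env` provide the CUDA and GPU details that this script does not cover.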
---
## Community & Support
Founding Engineer / Co-Founder track (Berlin/Germany): We are looking
for a senior systems + ML builder to help grow TraceML into a sustainable AI
infra product. See the GitHub Discussion: https://github.com/traceopt-ai/traceml/discussions/36
- 📧 Email: abhinav@traceopt.ai
- 🐙 LinkedIn: [Abhinav Srivastav](https://www.linkedin.com/in/abhinavsriva/)
- 📋 User Survey (2 min): https://forms.gle/KwPSLaPmJnJjoVXSA
Stars help more teams find the project. 🌟
---
## License
TraceML is released under the **Apache 2.0** license.
See [LICENSE](./LICENSE) for details.
---
## Citation
If TraceML helps your research, please cite:
```bibtex
@software{traceml2024,
author = {TraceOpt},
title = {TraceML: Real-time Training Observability for PyTorch},
year = {2024},
url = {https://github.com/traceopt-ai/traceml}
}
```
---
Made with ❤️ by TraceOpt