https://github.com/shadensmith/modelstepper
Musings on debugging DeepSpeed codes.
- Host: GitHub
- URL: https://github.com/shadensmith/modelstepper
- Owner: ShadenSmith
- Created: 2020-02-19T10:56:26.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2023-10-03T21:42:55.000Z (over 2 years ago)
- Last Synced: 2025-04-01T19:51:31.904Z (about 1 year ago)
- Language: Python
- Homepage:
- Size: 12.7 KB
- Stars: 0
- Watchers: 4
- Forks: 0
- Open Issues: 2
# ModelStepper
A debugger for DeepSpeed engines. ModelStepper tracks model parameters, gradients, and
loss with configurable tolerances.
ModelStepper accepts two training engines as returned from `deepspeed.initialize()` (one
baseline and one test) and a `DataLoader` for training. ModelStepper's `go()` method will
train some number of batches and track the specified values (i.e., parameters, loss,
and/or gradients). If a tracked component diverges from the baseline by more than the specified
tolerance, `go()` returns `False` and reports information on the divergence.
**Note:** the divergence of parameters and gradients is currently decided by the
*relative* difference between the tensors, i.e., `((B - A).norm() / A.norm())`. The
absolute difference is still reported when a divergence is detected.
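As a minimal sketch of the relative-difference check described in the note above, assuming plain PyTorch tensors and treating `A` as the baseline tensor (the helper name `tensor_diffs` is hypothetical and not part of ModelStepper):

```python
import torch


def tensor_diffs(base, test):
    """Return (abs_diff, rel_diff) between a baseline and a test tensor.

    Hypothetical helper for illustration; it mirrors the formula
    (B - A).norm() / A.norm() with A = base and B = test.
    """
    abs_diff = (test - base).norm()
    rel_diff = abs_diff / base.norm()
    return abs_diff.item(), rel_diff.item()


base = torch.tensor([3.0, 4.0])   # baseline tensor, norm is 5
test = torch.tensor([3.0, 4.5])   # test tensor, differs in one entry
abs_diff, rel_diff = tensor_diffs(base, test)

tol = 1e-5                        # same magnitude as the demo tolerances
diverged = rel_diff > tol
print(f"abs_diff={abs_diff:.5e} rel_diff={rel_diff:.5e} diverged={diverged}")
```

This sketch does not guard against a baseline tensor with zero norm, where the relative difference is undefined.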
Assumptions:
* The user must ensure that the baseline and tracked model are initialized with the same
state.
* If parameters or gradients are tracked, the models must be aligned such that
`base_eng.module.parameters()` are comparable to `test_eng.module.parameters()`.
In the near future, we should support doing an `all_gather()` to coordinate
with varying model parallelism.
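One way to meet both assumptions in plain PyTorch is to deep-copy the baseline module before either copy is wrapped in an engine, then confirm that the parameters line up pairwise. The following is a sketch only: the `same_state` helper and the small `Linear` model are hypothetical and not taken from the repository.

```python
import copy

import torch


def same_state(model_a, model_b):
    """Check that two modules have pairwise-identical parameters.

    Hypothetical helper: parameters are compared in the order returned
    by .parameters(), which is the alignment the assumptions require.
    """
    params_a = list(model_a.parameters())
    params_b = list(model_b.parameters())
    if len(params_a) != len(params_b):
        return False
    # torch.equal is True only when shapes and all values match.
    return all(torch.equal(a, b) for a, b in zip(params_a, params_b))


base_model = torch.nn.Linear(4, 2)
# A deep copy gives the test model the same initial state as the baseline.
test_model = copy.deepcopy(base_model)

print(same_state(base_model, test_model))
```

Running such a check before training starts helps separate a mismatched initialization from a genuine divergence during training.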
## Usage
ModelStepper has a small API:
```python
stepper = ModelStepper(base_engine,
                       test_engine,
                       trainloader,
                       num_batches=50,
                       test_every=1)
success = stepper.go()
```
Check out [demo.py](demo.py) and [ModelStepper.py](ModelStepper.py) for more details.
## Example
Try the demo:
```bash
$ deepspeed demo.py --deepspeed --deepspeed_config=ds_config.json
--- Model Stepper Configuration ---
batches=50
test_every=1
status_every=5
track_params=True
param_tol=1.000000e-05
track_loss=True
loss_tol=1.000000e-05
track_grads=True
grad_tol=1.000000e-05
STATUS batch=0 / 50 base_loss=2.30138 test_loss=2.30138 abs_diff=0.00000e+00 rel_diff=0.00000e+00
STATUS batch=5 / 50 base_loss=2.29744 test_loss=2.29744 abs_diff=0.00000e+00 rel_diff=0.00000e+00
STATUS batch=10 / 50 base_loss=2.25951 test_loss=2.25951 abs_diff=0.00000e+00 rel_diff=0.00000e+00
STATUS batch=15 / 50 base_loss=2.19609 test_loss=2.19609 abs_diff=0.00000e+00 rel_diff=0.00000e+00
STATUS batch=20 / 50 base_loss=2.12497 test_loss=2.12497 abs_diff=0.00000e+00 rel_diff=0.00000e+00
STATUS batch=25 / 50 base_loss=2.05403 test_loss=2.05403 abs_diff=2.38419e-07 rel_diff=1.16074e-07
STATUS batch=30 / 50 base_loss=1.99819 test_loss=1.99819 abs_diff=1.19209e-07 rel_diff=5.96587e-08
STATUS batch=35 / 50 base_loss=1.97918 test_loss=1.97918 abs_diff=0.00000e+00 rel_diff=0.00000e+00
STATUS batch=40 / 50 base_loss=1.98365 test_loss=1.98365 abs_diff=0.00000e+00 rel_diff=0.00000e+00
STATUS batch=45 / 50 base_loss=1.85610 test_loss=1.85610 abs_diff=0.00000e+00 rel_diff=0.00000e+00
STATUS batch=49 / 50 base_loss=1.89003 test_loss=1.89003 abs_diff=0.00000e+00 rel_diff=0.00000e+00
TEST PASSED
```
In contrast, here is the output when the demo is run with the `--fail` flag, which
forces a test failure by setting `lr=0` in the tested model:
```bash
$ deepspeed demo.py --deepspeed --deepspeed_config=ds_config.json --fail
DIVERGED PARAMETER rank=2 batch=0 param_idx=0 abs_diff=2.11209e-02 rel_diff=1.47273e-02 tol=1.00000e-04
DIVERGED PARAMETER rank=0 batch=0 param_idx=0 abs_diff=2.11209e-02 rel_diff=1.47273e-02 tol=1.00000e-04
DIVERGED PARAMETER rank=3 batch=0 param_idx=0 abs_diff=2.11209e-02 rel_diff=1.47273e-02 tol=1.00000e-04
DIVERGED PARAMETER rank=1 batch=0 param_idx=0 abs_diff=2.11209e-02 rel_diff=1.47273e-02 tol=1.00000e-04
STATUS batch=0 / 50 base_loss=2.30138 test_loss=2.30138 abs_diff=0.00000e+00 rel_diff=0.00000e+00
TEST FAILED
```
ModelStepper immediately detects that the model parameters have diverged from the
baseline.
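The failure mode above can be reproduced in miniature without DeepSpeed. The sketch below is not ModelStepper's implementation: it assumes two plain PyTorch modules trained with SGD and an MSE loss, and the function name `first_diverged_batch` is hypothetical. It trains both models on the same batches and compares parameters after each step using the relative difference.

```python
import copy

import torch


def first_diverged_batch(base_model, test_model, batches,
                         base_lr, test_lr, param_tol=1e-5):
    """Return the index of the first batch after which any parameter's
    relative difference exceeds param_tol, or None if none does."""
    base_opt = torch.optim.SGD(base_model.parameters(), lr=base_lr)
    test_opt = torch.optim.SGD(test_model.parameters(), lr=test_lr)
    loss_fn = torch.nn.MSELoss()

    for batch_idx, (x, y) in enumerate(batches):
        # Take one optimizer step on each model with the same batch.
        for model, opt in ((base_model, base_opt), (test_model, test_opt)):
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

        # Compare aligned parameters pairwise after the step.
        with torch.no_grad():
            pairs = zip(base_model.parameters(), test_model.parameters())
            for base_p, test_p in pairs:
                rel_diff = ((test_p - base_p).norm() / base_p.norm()).item()
                if rel_diff > param_tol:
                    return batch_idx
    return None


torch.manual_seed(0)
batches = [(torch.randn(8, 4), torch.randn(8, 2)) for _ in range(3)]
model = torch.nn.Linear(4, 2)

# Same learning rate in both models: no divergence is expected.
ok = first_diverged_batch(copy.deepcopy(model), copy.deepcopy(model),
                          batches, base_lr=0.1, test_lr=0.1)
# lr=0 in the test model, as with --fail: its parameters never move.
bad = first_diverged_batch(copy.deepcopy(model), copy.deepcopy(model),
                           batches, base_lr=0.1, test_lr=0.0)
print(ok, bad)
```

With `lr=0` in the test model, the parameters differ as soon as the baseline takes its first step, while the loss for that batch is still identical because it was computed before the update. This is consistent with the demo output, where the parameters diverge at `batch=0` even though the reported losses match.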