Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/iterative/dvc-checkpoints-mnist

Example of checkpoints in a DVC project training a simple convolutional neural net to classify MNIST data
https://github.com/iterative/dvc-checkpoints-mnist

example

Last synced: 2 months ago
JSON representation

Example of checkpoints in a DVC project training a simple convolutional neural net to classify MNIST data

Awesome Lists containing this project

README

        

# dvc-checkpoints-mnist

This example DVC project demonstrates the different ways to employ
[Checkpoint Experiments](https://dvc.org/doc/user-guide/experiment-management#checkpoints-in-source-code)
with DVC.

This scenario uses [DVCLive](https://dvc.org/doc/dvclive) to generate
[checkpoints](https://dvc.org/doc/api-reference/make_checkpoint) for iterative
model training. The model is a simple convolutional neural network (CNN)
classifier trained on the [MNIST](http://yann.lecun.com/exdb/mnist/) data of
handwritten digits to predict the digit (0-9) in each image.

๐Ÿ”„ Switch between scenarios

This repo has several
[branches](https://github.com/iterative/dvc-checkpoints-mnist/branches) that
show different methods for using checkpoints (using a similar pipeline):

- The [live](https://github.com/iterative/dvc-checkpoints-mnist/tree/live)
scenario introduces full-featured checkpoint usage โ€” integrating with
[DVCLive](https://github.com/iterative/dvclive).
- The [basic](https://github.com/iterative/dvc-checkpoints-mnist/tree/basic)
scenario uses single-checkpoint experiments to illustrate how checkpoints work
in a simple way.
- The
[Python-only](https://github.com/iterative/dvc-checkpoints-mnist/tree/make_checkpoint)
variation features the
[make_checkpoint](https://dvc.org/doc/api-reference/make_checkpoint) function
from DVC's Python API.
- Contrastingly, the
[signal file](https://github.com/iterative/dvc-checkpoints-mnist/tree/signal_file)
scenario shows how to make your own signal files (applicable to any
programming language).
- Finally, our
[full pipeline](https://github.com/iterative/dvc-checkpoints-mnist/tree/full_pipeline)
scenario elaborates on the full-featured usage with a more advanced process.

## Setup

To try it out for yourself:

1. Fork the repository and clone to your local workstation.
2. Install the prerequisites in `requirements.txt` (if you are using pip, run
`pip install -r requirements.txt`).

## Experimenting

Start training the model with `dvc exp run`. It will train for an unlimited
number of epochs, each of which will generate a checkpoint. Use `Ctrl-C` to stop
at the last checkpoint, and simply `dvc exp run` again to resume.

DVCLive will track performance at each checkpoint. Open `dvclive.html` in your
web browser during training to track performance over time (you will need to
refresh after each epoch completes to see updates). Metrics will also be logged
to `.tsv` files in the `dvclive` directory.

Once you stop the training script, you can view the results of the experiment
with:

```bash
$ dvc exp show
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”“
โ”ƒ Experiment โ”ƒ Created โ”ƒ step โ”ƒ acc โ”ƒ
โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”ฉ
โ”‚ workspace โ”‚ - โ”‚ 9 โ”‚ 0.4997 โ”‚
โ”‚ live โ”‚ 03:43 PM โ”‚ - โ”‚ โ”‚
โ”‚ โ”‚ โ•“ exp-34e55 โ”‚ 03:45 PM โ”‚ 9 โ”‚ 0.4997 โ”‚
โ”‚ โ”‚ โ•Ÿ 2fe819e โ”‚ 03:45 PM โ”‚ 8 โ”‚ 0.4394 โ”‚
โ”‚ โ”‚ โ•Ÿ 3da85f8 โ”‚ 03:45 PM โ”‚ 7 โ”‚ 0.4329 โ”‚
โ”‚ โ”‚ โ•Ÿ 4f64a8e โ”‚ 03:44 PM โ”‚ 6 โ”‚ 0.4686 โ”‚
โ”‚ โ”‚ โ•Ÿ b9bee58 โ”‚ 03:44 PM โ”‚ 5 โ”‚ 0.2973 โ”‚
โ”‚ โ”‚ โ•Ÿ e2c5e8f โ”‚ 03:44 PM โ”‚ 4 โ”‚ 0.4004 โ”‚
โ”‚ โ”‚ โ•Ÿ c202f62 โ”‚ 03:44 PM โ”‚ 3 โ”‚ 0.1468 โ”‚
โ”‚ โ”‚ โ•Ÿ eb0ecc4 โ”‚ 03:43 PM โ”‚ 2 โ”‚ 0.188 โ”‚
โ”‚ โ”‚ โ•Ÿ 28b170f โ”‚ 03:43 PM โ”‚ 1 โ”‚ 0.0904 โ”‚
โ”‚ โ”œโ”€โ•จ 9c705fc โ”‚ 03:43 PM โ”‚ 0 โ”‚ 0.0894 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
```

You can manage it like any other DVC
[experiments](https://dvc.org/doc/start/experiments), including:
* Run `dvc exp run` again to continue training from the last checkpoint.
* Run `dvc exp apply [checkpoint_id]` to revert to any of the prior checkpoints
(which will update the `model.pt` output file and metrics to that point).
* Run `dvc exp run --reset` to drop all the existing checkpoints and start from
scratch.

## Adding `dvclive` checkpoints to a DVC project

Using `dvclive` to add checkpoints to a DVC project requires a few additional
lines of code.

In your training script, use `dvclive.log()` to log metrics and
`dvclive.next_step()` to make a checkpoint with those metrics.
If you need the current epoch number, use `dvclive.get_step()` (e.g.
to use a [learning rate
schedule](https://en.wikipedia.org/wiki/Learning_rate#Learning_rate_schedule)
or stop training after a fixed number of epochs). See the
[train.py](https://github.com/iterative/dvc-checkpoints-mnist/blob/live/train.py)
script for an example:

```python
# Iterate over training epochs.
for epoch in itertools.count(dvclive.get_step()):
train(model, x_train, y_train)
torch.save(model.state_dict(), "model.pt")
# Evaluate and checkpoint.
metrics = evaluate(model, x_test, y_test)
for metric, value in metrics.items():
dvclive.log(metric, value)
dvclive.next_step()
```

Then, in `dvc.yaml`, add the `checkpoint: true` option to your model output and
a `live` section to your stage output. See
[dvc.yaml](https://github.com/iterative/dvc-checkpoints-mnist/blob/live/dvc.yaml)
for an example:

```yaml
stages:
train:
cmd: python train.py
deps:
- train.py
outs:
- model.pt:
checkpoint: true
live:
dvclive:
summary: true
html: true
```

If you do not already have a `dvc.yaml` stage, you can use [dvc stage
add](https://dvc.org/doc/command-reference/stage/add) to create it:

```bash
$ dvc stage add -n train -d train.py -c model.pt --live dvclive python train.py
```

That's it! For users already familiar with logging metrics in DVC, note that you
no longer need a `metrics` section in `dvc.yaml` since `dvclive` is already
logging metrics.