Towards Rehearsal-based Continual Learning at Scale: distributed CL using Horovod + PyTorch on up to 128 GPUs
- Host: GitHub
- URL: https://github.com/thomas-bouvier/distributed-continual-learning
- Owner: thomas-bouvier
- License: MIT
- Created: 2023-05-05T20:53:14.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-10-21T14:32:03.000Z (12 months ago)
- Last Synced: 2025-09-23T17:51:38.662Z (18 days ago)
- Topics: continual-learning, data-parallelism, deep-learning, experience-replay, hpc, ptychography, rehearsal
- Language: Python
- Homepage:
- Size: 930 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
# distributed-continual-learning
This is a PyTorch + Horovod implementation of the continual learning experiments with deep neural networks described in the following article:
* [Three types of incremental learning](https://www.nature.com/articles/s42256-022-00568-3) (2022, *Nature Machine Intelligence*)
Continual learning approaches implemented here are based on rehearsal, which is delegated to [Neomem](https://github.com/thomas-bouvier/neomem), a separate high-performance C++ backend.
This repository primarily supports experiments in the academic continual learning setting, whereby a classification-based problem is split up into multiple, non-overlapping tasks that must be learned sequentially (class-incremental scenario). Instance-incremental scenarios are supported too.
Some Python code has been inspired by the [mammoth](https://github.com/aimagelab/mammoth) and [convNet.pytorch](https://github.com/eladhoffer/convNet.pytorch/tree/master) repositories.
## Installation
The current version of the code has been tested with Python 3.10 with the following package versions:
* `pytorch 2.2`
* `timm 0.9.2`
* `horovod 0.28.1`
* `continuum 1.2.7`
* `nvidia-dali-cuda110 1.27.0` (optional)

Make sure to install [Neomem](https://github.com/thomas-bouvier/neomem) to benefit from global sampling of representatives. If it is not available, this code will fall back to a local, low-performance Python rehearsal buffer implementation.
If Neomem is installed outside of this directory, symlink it using `ln -s ../neomem cpp_loader`.
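For instance, a minimal setup could look like the sketch below (the layout is an assumption; see the Neomem README for its actual build steps):

```bash
# Clone Neomem next to this repository and build it following its own README
git clone https://github.com/thomas-bouvier/neomem ../neomem
# Expose it to this project under the expected name
ln -s ../neomem cpp_loader
```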
Further Python packages used are listed in `requirements.txt`. Assuming Python and pip are set up, these packages can be installed using:
```bash
pip install -r requirements.txt
```

In an HPC environment, we strongly advise using [Spack](https://github.com/spack/spack) to manage dependencies.
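As an illustration only, a Spack-based environment could be set up along these lines (package selection and versions are assumptions, not a tested recipe):

```bash
# Create and activate an environment for this project
spack env create distributed-cl
spack env activate distributed-cl
# Add the main dependencies (names follow the Spack package repository)
spack add python@3.10 py-torch py-horovod py-pip
spack install
# Remaining pure-Python packages can still come from pip
pip install -r requirements.txt
```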
## Usage
Parameters defined in `config.yaml` override CLI parameters. However, values for `backbone_config`, `buffer_config` and `tasksets_config` are concatenated with those defined on the CLI, instead of overriding them. Values for `optimizer_regime` override the regimes defined by `backbone/` in Python.
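As a rough illustration, a `config.yaml` could look like the sketch below; the key names mirror the CLI flags documented in the table that follows, but the exact structure expected by `main.py` is an assumption here:

```bash
# Hypothetical config.yaml written for illustration; adapt keys to main.py
cat > config.yaml <<'EOF'
backbone_config: {lr: 0.01, lr_min: 1.0e-6}
tasksets_config: {scenario: class, initial_increment: 5, increment: 5}
buffer_config: {rehearsal_ratio: 20}
EOF
# backbone_config/buffer_config/tasksets_config entries are merged with CLI
# values; other keys defined in config.yaml take precedence over the CLI.
python main.py --backbone mnistnet --dataset mnist --model Er
```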
| Parameter name | Required | Description | Possible values |
|---|---|---|---|
| `--backbone` | Yes | DL backbone model to instantiate | `mnistnet`, `resnet18`, `resnet50`, `mobilenetv3`, `efficientnetv2`, `convnext`, `ghostnet`, `ptychonn` |
| `--backbone-config` | | Backbone-specific parameters | `"{'lr': 0.01, 'lr_min': 1e-6, }"` |
| `--model` | Default: `Vanilla` | Continual Learning strategy | `Vanilla`, `Er`, `Agem`, `Der`, `Derpp` |
| `--model-config` | | Reset strategies and CL model-specific parameters | `"{'reset_state_dict': True}"` resets the model's internal state between tasks<br>`"{'alpha': 0.2}"` is needed for the `Der` model<br>`"{'alpha': 0.2, 'beta': 0.8}"` are needed for the `Derpp` model |
| `--buffer-config` | | Rehearsal buffer parameters | `"{'rehearsal_ratio': 20}"` sets the proportion of the input dataset to be stored in the rehearsal buffer |
| `--tasksets-config` | | Scenario configuration, as defined in the [`continuum` package](https://continuum.readthedocs.io/en/latest/tutorials/scenarios/scenarios.html) | Class-incremental scenario with 2 tasks: `"{'scenario': 'class', 'initial_increment': 5, 'increment': 5}"`<br>Instance-incremental scenario with 5 tasks: `"{'scenario': 'instance', 'num_tasks': 5}"`<br>`"{'concatenate_tasksets': True}"` concatenates previous tasksets before the next task |
| `--dataset` | | Dataset | `mnist`, `cifar10`, `cifar100`, `tinyimagenet`, `imagenet`, `imagenet_blurred`, `ptycho` |

### WandB sweeps
To run a hyperparameter search, first adapt the `sweep.py` file (located in this directory) if needed. Then, configure your optimization objective in `sweep.yaml`.
Make sure you export your WandB API key (`export WANDB_API_KEY=key`) and set `WANDB_MODE=run` before running anything. Once you are ready, execute the `sweep_launcher.sh` script on the master machine, providing the following parameters:
- `hostname`: the address of the current machine e.g., `chifflot-7.lille.grid5000.fr:1`
- `wandb_project`: the name of an existing W&B project where the run will be saved
- `sweep_conf`: the name of a sweep config defined in `sweep.py`

To stop a sweep run, go to the online WandB dashboard and click "Stop run". To stop the whole sweep process, run `ps aux | grep agent` on the machine and kill the process, then run `ps aux | grep wandb` and kill that process too.
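Putting the launch steps together, a run could look like the following sketch (the argument order of `sweep_launcher.sh` is an assumption here; check the script itself):

```bash
export WANDB_API_KEY=key
export WANDB_MODE=run
# Hypothetical invocation: hostname, W&B project, sweep config name
./sweep_launcher.sh chifflot-7.lille.grid5000.fr:1 my-wandb-project my_sweep_conf
```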
## Continual Learning Strategies
Specific implementations have to be selected using `--buffer-config "{'implementation': <implementation>}"`, as in the example below. ER with implementation `standard` was used in the paper.
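For instance, a hypothetical invocation combining ER with the `standard` implementation on CIFAR-100 could look like this:

```bash
python main.py --backbone resnet18 --model Er --dataset cifar100 \
    --buffer-config "{'implementation': 'standard', 'rehearsal_ratio': 20}" \
    --tasksets-config "{'scenario': 'class', 'initial_increment': 50, 'increment': 10}"
```

The available implementations per strategy are listed below.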
| Approach | Name | Available Implementations |
|---|---|---|
| Experience Replay (ER) | `Er` | `standard`, `flyweight`, `python` |
| Averaged Gradient Episodic Memory (A-GEM) | `Agem` | `python` |
| Dark Experience Replay (DER) | `Der` | `standard`, `flyweight`, `python` |
| Dark Experience Replay++ (DER++) | `Derpp` | `standard`, `flyweight`, `python` |

### Baselines
#### From Scratch
```
python main.py --backbone <backbone> --dataset <dataset> --model Vanilla --model-config "{'reset_state_dict': True}" --tasksets-config "{<tasksets-config>, 'concatenate_tasksets': True}"
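# For example, with hypothetical concrete values:
python main.py --backbone resnet18 --dataset cifar100 --model Vanilla --model-config "{'reset_state_dict': True}" --tasksets-config "{'scenario': 'class', 'initial_increment': 50, 'increment': 10, 'concatenate_tasksets': True}"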
```

#### Incremental
```
python main.py --backbone <backbone> --dataset <dataset> --model Vanilla --tasksets-config "{<tasksets-config>}"
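# For example, with hypothetical concrete values:
python main.py --backbone resnet18 --dataset cifar100 --model Vanilla --tasksets-config "{'scenario': 'class', 'initial_increment': 50, 'increment': 10}"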
```

## Examples
### Deep learning
Regular deep learning can be done using this project. The `Vanilla` model will be instantiated by default:
```
python main.py --backbone mnistnet --dataset mnist
python main.py --backbone resnet18 --dataset cifar100
python main.py --backbone resnet50 --dataset tinyimagenet
python main.py --backbone efficientnetv2 --dataset imagenet_blurred
```

### Continual learning
```
python main.py --backbone mnistnet --dataset mnist --tasksets-config "{'scenario': 'class', 'initial_increment': 5, 'increment': 5}"
python main.py --backbone resnet18 --dataset cifar10 --tasksets-config "{'scenario': 'class', 'initial_increment': 4, 'increment': 3}"
python main.py --backbone resnet18 --model Er --dataset cifar100 --tasksets-config "{'scenario': 'instance', 'num_tasks': 5}"
python main.py --backbone resnet18 --model Der --buffer-config "{'rehearsal_ratio': 20}" --dataset cifar10 --tasksets-config "{'scenario': 'class', 'initial_increment': 4, 'increment': 3}"
python main.py --backbone resnet18 --model Derpp --buffer-config "{'rehearsal_ratio': 20}" --dataset imagenet100small --tasksets-config "{'scenario': 'class', 'initial_increment': 40, 'increment': 30}"
python main.py --backbone resnet50 --model Agem --dataset tinyimagenet --tasksets-config "{'scenario': 'instance', 'num_tasks': 5}"
```

# Citation
```
@inproceedings{bouvier:hal-04600107,
TITLE = {{Efficient Data-Parallel Continual Learning with Asynchronous Distributed Rehearsal Buffers}},
AUTHOR = {Bouvier, Thomas and Nicolae, Bogdan and Chaugier, Hugo and Costan, Alexandru and Foster, Ian and Antoniu, Gabriel},
URL = {https://inria.hal.science/hal-04600107},
BOOKTITLE = {{CCGrid 2024 - IEEE 24th International Symposium on Cluster, Cloud and Internet Computing}},
ADDRESS = {Philadelphia (PA), United States},
PAGES = {1-10},
YEAR = {2024},
MONTH = May,
DOI = {10.1109/CCGrid59990.2024.00036},
KEYWORDS = {continual learning ; data-parallel training ; experience replay ; distributed rehearsal buffers ; asynchronous data management ; scalability},
PDF = {https://inria.hal.science/hal-04600107/file/paper.pdf},
HAL_ID = {hal-04600107},
HAL_VERSION = {v1},
}
```