https://github.com/thomas-bouvier/neomem
Torch rehearsal backend to mitigate catastrophic forgetting with a focus on performance, written in C++
- Host: GitHub
- URL: https://github.com/thomas-bouvier/neomem
- Owner: thomas-bouvier
- Created: 2023-05-09T13:21:21.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2025-01-14T17:30:37.000Z (9 months ago)
- Topics: catastrophic-forgetting, continual-learning, experience-replay, rehearsal
- Language: C++
- Size: 385 KB
- Stars: 0
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Neomem
Neomem is a C++ data loader with rehearsal for Torch, built on PyBind11 and [Mochi](https://www.mcs.anl.gov/research/projects/mochi/). Have a look at [distributed-continual-learning](https://github.com/thomas-bouvier/distributed-continual-learning), a Python training codebase that interfaces with Neomem.
Neomem aims to mitigate catastrophic forgetting, a common issue in traditional DNNs when handling continuous data streams. The system implements a distributed rehearsal buffer that (1) accumulates representative data samples and (2) efficiently serves augmented minibatches for further training iterations.
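
To make the rehearsal idea concrete, here is a minimal, self-contained Python sketch of the two steps above: reservoir-sample incoming batches into a bounded buffer, then concatenate a few replayed samples onto each new minibatch. This is an illustration of the technique only, not Neomem's API; the class and method names below are made up.

```python
# Conceptual sketch of a rehearsal buffer (illustration only, not Neomem's API).
# Assumes PyTorch is available, as listed in the requirements below.
import random
import torch


class RehearsalBuffer:
    """Bounded buffer of past (sample, label) pairs, filled by reservoir sampling."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.samples = []  # list of (x, y) tensor pairs
        self.seen = 0      # total number of samples observed so far

    def add(self, x: torch.Tensor, y: torch.Tensor) -> None:
        """Accumulate representative samples from an incoming minibatch."""
        for xi, yi in zip(x, y):
            self.seen += 1
            if len(self.samples) < self.capacity:
                self.samples.append((xi, yi))
            else:
                # Reservoir sampling: every sample seen so far has an equal
                # probability of ending up in the buffer.
                j = random.randrange(self.seen)
                if j < self.capacity:
                    self.samples[j] = (xi, yi)

    def augment_minibatch(self, x: torch.Tensor, y: torch.Tensor, n_replay: int):
        """Serve the incoming minibatch concatenated with up to n_replay old samples."""
        if not self.samples:
            return x, y
        picked = random.sample(self.samples, min(n_replay, len(self.samples)))
        rx = torch.stack([xi for xi, _ in picked])
        ry = torch.stack([yi for _, yi in picked])
        return torch.cat([x, rx]), torch.cat([y, ry])
```

In Neomem itself, the buffer is distributed across training processes and augmented minibatches are assembled asynchronously in C++, which is where the performance and scalability benefits come from.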

*Figure: extensive experiments on up to 128 GPUs have validated Neomem's increased accuracy, performance, and scalability.*

## Usage
### Requirements
- Python
- pybind11
- PyTorch
- MPI
- Thallium
- libfabric (optionally built with CUDA support)
- CUDA (optional)

If these dependencies are installed inside a Spack environment, don't forget to `activate` it before building Neomem.
### Compiling Neomem using CMake
```console
cmake . -DPython_ROOT=/path/to/spack-env/view/bin -DWITHOUT_CUDA=0
make
```

### Using Neomem in your Python project
```python
import neomem
```
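
After building, a quick sanity check is to confirm that the compiled extension is the one being imported and to see which symbols it exposes. The snippet below relies only on standard Python introspection and makes no assumptions about Neomem's API.

```python
# Build sanity check: confirm the compiled extension is importable and
# inspect the symbols it exposes (standard Python introspection only).
import neomem

print(neomem.__file__)  # path to the built extension module
print([name for name in dir(neomem) if not name.startswith("_")])
```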
## Providers
### Verbs
If using provider `verbs`, make sure IPoIB is enabled and that an interface appears as `UP` when running `ip link show`.
### RDMA+CUDA
Device memory registration should be enabled. To use RDMA+CUDA, your only options are providers `ofi+shm` (the shared-memory provider from libfabric, which supports GDRCopy) and `verbs`.
If using `verbs`, you need MOFED to support CUDA. More specifically, it requires the kernel "peer memory" API, which is only available in MOFED's version of the IB drivers. If running into issues with MOFED, check that the command `grep ib_register_peer_memory_client /proc/kallsyms` outputs something similar to the following:
```console
ffffffffc09c3595 r __kstrtab_ib_register_peer_memory_client [ib_core]
ffffffffc09c35b4 r __kstrtabns_ib_register_peer_memory_client [ib_core]
ffffffffc09bd54c r __ksymtab_ib_register_peer_memory_client [ib_core]
ffffffffc09b9620 T ib_register_peer_memory_client [ib_core]
```

`ucx` is another option that also supports CUDA, though we don't think anybody has tested it just yet :) The code is there, though. However, `na+sm` (the shared-memory plugin from Mercury) is not GPU-enabled.
## Tests
You can build a Docker image to run tests leveraging `pytest`.
```console
docker compose -f docker-compose.test.yml build test-cpu-openmpi-py3_10-torch2_1_0
docker run --rm -it neomem-test-cpu-openmpi-py3_10-torch2_1_0 bash -c "cd /neomem/tests && (ls -1 test_torch.py | xargs -n 1 mpirun --allow-run-as-root -np 1 -H localhost:1 bash /pytest.sh)"
```

# Citation
```
@inproceedings{bouvier:hal-04600107,
TITLE = {{Efficient Data-Parallel Continual Learning with Asynchronous Distributed Rehearsal Buffers}},
AUTHOR = {Bouvier, Thomas and Nicolae, Bogdan and Chaugier, Hugo and Costan, Alexandru and Foster, Ian and Antoniu, Gabriel},
URL = {https://inria.hal.science/hal-04600107},
BOOKTITLE = {{CCGrid 2024 - IEEE 24th International Symposium on Cluster, Cloud and Internet Computing}},
ADDRESS = {Philadelphia (PA), United States},
PAGES = {1-10},
YEAR = {2024},
MONTH = May,
DOI = {10.1109/CCGrid59990.2024.00036},
KEYWORDS = {continual learning ; data-parallel training ; experience replay ; distributed rehearsal buffers ; asynchronous data management ; scalability},
PDF = {https://inria.hal.science/hal-04600107/file/paper.pdf},
HAL_ID = {hal-04600107},
HAL_VERSION = {v1},
}
```