# TensorNVMe

A Python library that provides APIs to move PyTorch tensors between CPU and NVMe.

## Dependencies

- [liburing](https://github.com/axboe/liburing)
- [libaio](https://pagure.io/libaio)

## Install

This package is only supported on Linux. `liburing` and `libaio` can be installed automatically. `liburing` requires Linux kernel `5.10` or newer and will not be installed on older kernels.

The installer searches for `libaio` and `liburing` in `/usr/lib`, `/usr/lib64`, and `$LD_LIBRARY_PATH`. If they are not found, the backends are installed in `~/.tensornvme`, and `~/.bashrc` is modified to set `$LD_LIBRARY_PATH` correctly. **Please `source ~/.bashrc` after installation.** If you use another shell, make sure `$LD_LIBRARY_PATH` is set correctly.
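
If you are unsure whether the backends are discoverable after installation, a minimal check along the following lines can help. It only looks in the directories mentioned above; the exact layout under `~/.tensornvme` is an assumption, so treat this as a sketch rather than an official tool.

```python
import os
from pathlib import Path

# Directories the installer checks, per the paragraph above. The layout
# under ~/.tensornvme is an assumption; adjust the glob if yours differs.
search_dirs = ["/usr/lib", "/usr/lib64", str(Path.home() / ".tensornvme")]
search_dirs += [d for d in os.environ.get("LD_LIBRARY_PATH", "").split(":") if d]

for name in ("libaio", "liburing"):
    hits = sorted(
        str(p)
        for d in search_dirs
        if Path(d).is_dir()
        for p in Path(d).rglob(f"{name}.so*")
    )
    print(name, "->", hits if hits else "not found")
```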

> You must install PyTorch and CMake before installing tensornvme. Whenever you upgrade PyTorch, remember to reinstall tensornvme.
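
A quick way to confirm these prerequisites before running `pip install` (a sanity-check sketch, not part of tensornvme):

```python
import shutil

import torch  # fails loudly if PyTorch is missing

# tensornvme builds against the PyTorch version found at install time,
# so this is the version you must rebuild against after an upgrade.
print("torch:", torch.__version__)

# CMake must be on PATH for the native extension build.
print("cmake:", shutil.which("cmake") or "not found")
```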

### From source

```shell
git clone https://github.com/hpcaitech/TensorNVMe.git && cd TensorNVMe
```

First, install requirements:

```shell
pip install -r requirements.txt
```

To install `tensornvme` with `liburing` and `libaio`:

```shell
pip install -v --no-cache-dir .
```

To install `tensornvme` with only `liburing`:

```shell
DISABLE_AIO=1 pip install -v --no-cache-dir .
```

To install `tensornvme` with only `libaio`:

```shell
DISABLE_URING=1 pip install -v --no-cache-dir .
```

If you want to install `libaio` or `liburing` system-wide:

```shell
WITH_ROOT=1 sudo pip install -v --no-cache-dir .
```

They will then be installed under `/usr`, and `~/.bashrc` will not be modified. Make sure you have root access.

### From PIP

```shell
pip install packaging
pip install tensornvme
```

The environment variables accepted here are the same as when installing from source.

## Use Docker

```shell
git clone https://github.com/hpcaitech/TensorNVMe.git && cd TensorNVMe/docker && docker build -t tensornvme .
```

## CLI

We provide a CLI to check whether the backends work correctly.

```shell
tensornvme check
```

## Usage

TensorNVMe provides both synchronous and asynchronous I/O APIs.

> Only contiguous CPU tensors can be offloaded.

Synchronous API:

```python
import torch
from tensornvme import DiskOffloader

x = torch.rand(2, 2)
y = torch.rand(4, 4, 4)
offloader = DiskOffloader('./offload')
offloader.sync_write(x)
# x is saved to a file on disk (in ./offload folder) and the memory of x is freed
offloader.sync_read(x)
# x is restored
offloader.sync_writev([x, y])
# x and y are offloaded
offloader.sync_readv([x, y])
# x and y are restored.
# sync_writev() and sync_readv() are order sensitive
# E.g. sync_writev([x, y]) and sync_writev([y, x]) are different
```
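
The note above means a GPU tensor or a non-contiguous view must be converted before offloading. A minimal sketch using plain PyTorch (nothing tensornvme-specific is assumed here):

```python
import torch

# Transposes are views in PyTorch, so `t` is non-contiguous.
t = torch.rand(4, 4).t()
if torch.cuda.is_available():
    t = t.cuda()  # pretend the data currently lives on the GPU

# Move to host memory and materialize a contiguous copy before offloading.
eligible = t.cpu().contiguous()
assert not eligible.is_cuda and eligible.is_contiguous()
# `eligible` can now be passed to sync_write / async_write.
```

Keep in mind that offloading frees and later restores the storage of the exact tensor object you pass in, so hold on to `eligible` (not the original view) for the matching read.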

Asynchronous API:

```python
import torch
from tensornvme import DiskOffloader

x = torch.rand(2, 2)
y = torch.rand(4, 4, 4)
offloader = DiskOffloader('./offload')
offloader.async_write(x)
# x is being offloaded in the background
offloader.sync_write_events()
# x is offloaded and the memory of x is freed
offloader.async_read(x)
# x is being restored in the background
offloader.sync_read_events()
# x is restored
offloader.async_writev([x, y])
# x and y are being offloaded in the background
offloader.synchronize()
# synchronize() will synchronize both write and read events.
offloader.async_readv([x, y])
offloader.synchronize()
# x and y are restored.
# async_writev() and async_readv() are also order sensitive
```

You can use the asynchronous API to overlap computation and data movement.

```python
import torch
from tensornvme import DiskOffloader

offloader = DiskOffloader('./offload')
tensors = []

for _ in range(10):
    tensor = torch.rand(2, 2)
    tensors.append(tensor)
    offloader.sync_write(tensor)

offloader.sync_read(tensors[0])

# prefetch=1: while computing on tensors[i], tensors[i+1] is being read back
for i, tensor in enumerate(tensors):
    offloader.sync_read_events()
    if i + 1 < len(tensors):
        offloader.async_read(tensors[i + 1])
    tensor.mul_(2.0)  # compute
    offloader.sync_write_events()
    offloader.async_write(tensor)
offloader.synchronize()
```
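
The same pattern generalizes to a deeper prefetch. The sketch below is illustrative only and uses nothing beyond the calls shown above; the exact batching behavior of `sync_read_events()` across multiple in-flight reads is not specified here, so the achievable overlap may vary.

```python
import torch
from tensornvme import DiskOffloader

prefetch = 2  # hypothetical read-ahead depth
offloader = DiskOffloader('./offload')
tensors = [torch.rand(2, 2) for _ in range(10)]
for tensor in tensors:
    offloader.sync_write(tensor)

# Warm-up: start the first `prefetch` reads before any compute.
for tensor in tensors[:prefetch]:
    offloader.async_read(tensor)

for i, tensor in enumerate(tensors):
    offloader.sync_read_events()      # make sure tensors[i] is back in memory
    if i + prefetch < len(tensors):
        offloader.async_read(tensors[i + prefetch])
    tensor.mul_(2.0)                  # compute while later reads are in flight
    offloader.sync_write_events()
    offloader.async_write(tensor)
offloader.synchronize()
```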

## How to test

We have C++ test scripts for the `AsyncIO` and `SpaceManager` classes. Make sure you have installed `liburing` and `libaio`, and set the environment variables correctly before testing. To run the tests:

```shell
mkdir build
cd build
cmake ..
make
./test_asyncio
./test_space_mgr
```

We also have Python unit tests. Make sure you have installed `pytest`. To run them:

```shell
pytest ./tests
```

## How to benchmark

We have benchmarks for `Adam` and `CpuAdam` with different backends and prefetch depths to validate TensorNVMe's speed. To run the benchmarks:

```shell
cd benchmark
python benchmark_adam.py
python benchmark_cpuadam.py
```