# ParquetLoader

This project is inspired by [litdata](https://github.com/Lightning-AI/litdata).
It implements a PyTorch dataset and dataloader that support streaming and distributed loading of [Parquet](https://parquet.apache.org/docs/) datasets.

Key features:

* Streaming loading of large Parquet datasets, e.g., Hive tables stored in Parquet format.
* Near-zero redundancy loading across ranks & workers during distributed training.
* Asynchronous preloading to overlap training and loading for better efficiency.

Limitations:

* Less efficient than full memory loading for small datasets.
* Degrades to full loading (or worse) for datasets with only one or a few Parquet files/row groups.
* Row group size affects efficiency; it is recommended to set it to 1-1000 times the batch size (see the sketch after this list).
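
For context, row group size is set when the Parquet files are written, not by this library. Below is a minimal sketch using pyarrow's `row_group_size` argument; the file name, columns, and multiplier are illustrative only.

``` python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

batch_size = 64  # the batch size you plan to train with

# Toy data; in practice this would be your real dataset.
df = pd.DataFrame({"feature": range(100_000), "label": [0, 1] * 50_000})
table = pa.Table.from_pandas(df)

# Keep each row group within roughly 1-1000x the batch size, as recommended above.
pq.write_table(table, "data.parquet", row_group_size=batch_size * 100)
```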

## Installation

Install from source

``` shell
git clone https://github.com/clearhanhui/ParquetLoader.git
cd ParquetLoader
pip install .
```

## Usage

``` python
from parquet_loader import ParquetDataset, ParquetDataLoader
dataset = ParquetDataset('/path/to/parquet/dataset')
dataloader = ParquetDataLoader(dataset)
```
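
Continuing from the snippet above, the loader is meant to be iterated like a standard PyTorch `DataLoader`; this is only a sketch, and the batch structure depends on your Parquet schema.

``` python
# Iterate the loader like a regular PyTorch DataLoader.
# `train_step` is a placeholder for your own training code.
for batch in dataloader:
    train_step(batch)
```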

See examples in [tests](./tests).
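
For distributed training, the sketch below shows one way the loader could be used under `torchrun`. Only the claim of near-zero redundancy across ranks & workers comes from this project; the process-group setup is standard PyTorch, and the assumption that per-rank sharding is handled internally by `ParquetDataLoader` is not a documented guarantee here.

``` python
# Launch with e.g.: torchrun --nproc_per_node=4 train.py
import torch.distributed as dist
from parquet_loader import ParquetDataset, ParquetDataLoader

dist.init_process_group(backend="gloo")  # use "nccl" for GPU training

dataset = ParquetDataset('/path/to/parquet/dataset')
dataloader = ParquetDataLoader(dataset)  # assumption: rank/worker sharding is handled internally

for batch in dataloader:
    ...  # per-rank training step

dist.destroy_process_group()
```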

## Benchmark

* full loading vs. streaming loading

| Method            | Time (s) | Memory (MB) |
| ----------------- | -------- | ----------- |
| full loading      | 3.041    | 153         |
| streaming loading | 7.290    | 610         |

* synchronous loading vs. asynchronous loading

| Method               | Time (s) |
| -------------------- | -------- |
| synchronous loading  | 39.204   |
| asynchronous loading | 25.854   |

See full results in [benchmarks](./benchmarks).