# Pytorch Datastream

[![PyPI version](https://badge.fury.io/py/pytorch-datastream.svg)](https://badge.fury.io/py/pytorch-datastream)
[![Python versions](https://img.shields.io/pypi/pyversions/pytorch-datastream.svg)](https://pypi.python.org/pypi/pytorch-datastream)
[![License](https://img.shields.io/pypi/l/pytorch-datastream.svg)](https://pypi.python.org/pypi/pytorch-datastream)

This is a simple library for creating readable dataset pipelines and reusing best practices for issues such as imbalanced datasets. There are just two components to keep track of: `Dataset` and `Datastream`.

`Dataset` is a simple mapping between an index and an example. It provides pipelining of functions in a readable syntax, originally adapted from TensorFlow 2's `tf.data.Dataset`.

`Datastream` combines a `Dataset` and a sampler into a stream of examples. It provides a simple solution to oversampling / stratification, weighted sampling, and finally converting to a `torch.utils.data.DataLoader`.
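
To picture the weighted-sampling idea (this is a conceptual sketch using only the standard library, not the library's API), drawing a batch from several sources with per-source weights looks like this:

```python
import random

random.seed(0)  # make the sketch reproducible

# Hypothetical stand-ins for three datastreams, each of which
# repeatedly yields its fruit name.
sources = ["apple", "pear", "banana"]
weights = [2, 1, 1]  # apples are drawn twice as often as pears or bananas

# Draw one "batch" of 8 examples according to the sampling weights.
batch = random.choices(sources, weights=weights, k=8)
print(batch)
```

`Datastream` wraps this kind of weighted sampling behind a `torch.utils.data.DataLoader`-compatible interface, so the batch above is what a data loader over a merged stream would conceptually produce.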

See the [documentation](https://nextml-code.github.io/pytorch-datastream) for more information.

## Install

```bash
poetry add pytorch-datastream
```

Or, for the old-timers:

```bash
pip install pytorch-datastream
```

## Usage

The list below showcases functions that are useful in most standard and non-standard cases; it is not exhaustive. See the [documentation](https://nextml-code.github.io/pytorch-datastream) for the full API and usage.

```python
Dataset.from_subscriptable
Dataset.from_dataframe
Dataset
.map
.subset
.split
.cache
.with_columns

Datastream.merge
Datastream.zip
Datastream
.map
.data_loader
.zip_index
.update_weights_
.update_example_weight_
.weight
.state_dict
.load_state_dict
```
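
The chaining style used by `Dataset.map` and friends can be illustrated with a minimal sketch (a toy class written here for illustration, not the library's implementation): a dataset is an indexable source plus a composed pipeline of functions.

```python
class TinyDataset:
    """Toy sketch of an index -> example mapping with chainable .map."""

    def __init__(self, items, fn=lambda x: x):
        self.items = items
        self.fn = fn

    def map(self, fn):
        # Compose the new function after the existing pipeline,
        # returning a new dataset rather than mutating this one.
        return TinyDataset(self.items, lambda x, prev=self.fn: fn(prev(x)))

    def subset(self, indices):
        # Keep only the selected indices; the pipeline is unchanged.
        return TinyDataset([self.items[i] for i in indices], self.fn)

    def __getitem__(self, index):
        return self.fn(self.items[index])

    def __len__(self):
        return len(self.items)


dataset = (
    TinyDataset([1, 2, 3, 4])
    .map(lambda x: x * 10)
    .map(str)
)
print(dataset[0])  # prints 10 (as the string "10")
```

Functions are only applied when an item is accessed, which is the same lazy, readable pattern the real `Dataset` pipelines follow.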

### Simple image dataset example

Here's a basic example of loading images from a directory:

```python
from datastream import Dataset
from pathlib import Path
from PIL import Image

# Assuming images are in a directory structure like:
# images/
#     class1/
#         image1.jpg
#         image2.jpg
#     class2/
#         image3.jpg
#         image4.jpg

image_dir = Path("images")
image_paths = list(image_dir.glob("*/*.jpg"))

dataset = (
    Dataset.from_paths(
        image_paths,
        pattern=r".*/(?P<class_name>\w+)/(?P<image_name>\w+)\.jpg",
    )
    .map(lambda row: dict(
        image=Image.open(row["path"]),
        class_name=row["class_name"],
        image_name=row["image_name"],
    ))
)

# Access an item from the dataset

first_item = dataset[0]
print(f"Class: {first_item['class_name']}, Image name: {first_item['image_name']}")
```

### Merge / stratify / oversample datastreams

Each fruit datastream below repeatedly yields its fruit type as a string.

```python
>>> datastream = Datastream.merge([
...     (apple_datastream, 2),
...     (pear_datastream, 1),
...     (banana_datastream, 1),
... ])
>>> next(iter(datastream.data_loader(batch_size=8)))
['apple', 'apple', 'pear', 'banana', 'apple', 'apple', 'pear', 'banana']
```

### Zip independently sampled datastreams

Each fruit datastream below repeatedly yields its fruit type as a string.

```python
>>> datastream = Datastream.zip([
...     apple_datastream,
...     Datastream.merge([pear_datastream, banana_datastream]),
... ])
>>> next(iter(datastream.data_loader(batch_size=4)))
[('apple', 'pear'), ('apple', 'banana'), ('apple', 'pear'), ('apple', 'banana')]
```
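
The pairing behaviour above can be sketched with plain iterators (hypothetical stand-ins for the datastreams, not the library's API): each stream is consumed independently and the results are zipped into tuples.

```python
from itertools import cycle, islice

# Hypothetical infinite streams standing in for the datastreams above.
apple_stream = cycle(["apple"])
pear_banana_stream = cycle(["pear", "banana"])  # merged 1:1

# Pair one sample from each stream, four times.
batch = list(islice(zip(apple_stream, pear_banana_stream), 4))
print(batch)
# [('apple', 'pear'), ('apple', 'banana'), ('apple', 'pear'), ('apple', 'banana')]
```

Because each stream is sampled independently, the streams may have different lengths and sampling weights; `Datastream.zip` handles that bookkeeping for you.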

### More usage examples

See the [documentation](https://nextml-code.github.io/pytorch-datastream) for more usage examples.