https://github.com/pmeier/datapipes

Proof-of-concept for datapipes in torchvision.datasets

Host: GitHub
URL: https://github.com/pmeier/datapipes
Owner: pmeier
Created: 2021-03-18T14:07:37.000Z (over 4 years ago)
Default Branch: master
Last Pushed: 2021-04-26T20:54:47.000Z (about 4 years ago)
Last Synced: 2025-03-31T14:06:10.722Z (3 months ago)
Language: Python
Size: 133 KB
Stars: 8
Watchers: 3
Forks: 1
Open Issues: 1
Metadata Files:
- Readme: README.md

# datapipes

This is a proof-of-concept repository showing how `torch.utils.data.datapipes` can be used as a basis for `torchvision.datasets`.

## General observations

- `pathlib.Path` should be a first-class citizen for paths.

- `dp.iter.LoadFilesFromDisk` should have a `mode` parameter. Forcing `rb` makes it cumbersome to read from plain text files. An `opener` parameter that defaults to `open` and respects `mode` might be even better.

- Files loaded with `get_file_binaries_from_pathnames` used in `dp.iter.LoadFilesFromDisk` are never closed.

- `dp.iter.RoutedDecoder` only accepts `(path, buffer)` inputs, which is not usable for us. Our datasets return a buffer as well as some additional information.

- It feels weird to call `dp.iter.LoadFilesFromDisk` for a single file, which is usually the case for our datasets.

- I'm aware that this is not possible if we are streaming archives, but if that is not the case, we should be able to read specific files from an archive. Some datasets contain metadata in a separate file that should be available as soon as we create the dataset, rather than by luck depending on when it is streamed with the other files.

- `dp.iter.Map` expects an `IterDataPipe` rather than a more general `Iterable`, unlike the other datapipes.

- Instead of `ReadFilesFrom(Tar|Zip)` there should be a `ReadFilesFromArchive` that automatically detects the underlying archive type.

- `dp.iter.ReadFilesFrom(Tar|Zip)` should be split into `ListFilesIn(Tar|Zip)` and `LoadFilesFrom(Tar|Zip)`. Most datasets define splits of the data so that only part of the data has to be loaded. It would be a good idea to drop unused files before we load them.

- For some reason `dp.iter.ReadFilesFrom(Tar|Zip)` returns the files in reverse alphabetical order. This makes it awkward to align them with corresponding text files, which are usually read from top to bottom.
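The list/load split and the archive-type detection suggested above can be sketched with the standard library alone. This is only an illustration of the idea, not the `torch.utils.data.datapipes` API: the function names `list_files_in_archive` and `load_files_from_archive` are hypothetical, and the sketch assumes a local, non-streamed archive.

```python
import tarfile
import zipfile
from pathlib import Path
from typing import Iterable, Iterator, List, Tuple, Union


def list_files_in_archive(path: Union[str, Path]) -> List[str]:
    """List regular files in a tar or zip archive without reading their contents.

    Hypothetical helper: the archive type is detected from the file itself,
    and the names are sorted so the order is deterministic.
    """
    path = Path(path)
    if tarfile.is_tarfile(path):
        with tarfile.open(path) as archive:
            names = [m.name for m in archive.getmembers() if m.isfile()]
    elif zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            names = [i.filename for i in archive.infolist() if not i.is_dir()]
    else:
        raise ValueError(f"Unsupported archive type: {path}")
    return sorted(names)


def load_files_from_archive(
    path: Union[str, Path], names: Iterable[str]
) -> Iterator[Tuple[str, bytes]]:
    """Yield (name, content) only for the requested members."""
    path = Path(path)
    if tarfile.is_tarfile(path):
        with tarfile.open(path) as archive:
            for name in names:
                # extractfile returns a readable file object for regular files
                with archive.extractfile(name) as file:
                    yield name, file.read()
    elif zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            for name in names:
                yield name, archive.read(name)
    else:
        raise ValueError(f"Unsupported archive type: {path}")
```

With this split, a dataset could filter the listing first, for example `[n for n in list_files_in_archive(path) if n.startswith("train/")]`, and pass only those names to the loader, so unused files are never read.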

## Datasets

Legend:

- :heavy_check_mark: : Fully working

- :o: : Working, but with a significant performance hit

- :x: : Not working

For :o: and :x:, please check out the `README.md` in the corresponding folder for details.

| `torchvision.datasets.`                    | Status             |
|:-------------------------------------------|--------------------|
| [`Caltech101`](caltech101/)                | :heavy_check_mark: |
| [`Caltech256`](caltech256/)                | :heavy_check_mark: |
| [`CelebA`](celeba/)                        | :heavy_check_mark: |
| [`CIFAR10` / `CIFAR100`](cifar/)           | :heavy_check_mark: |
| [`CocoDetection` / `CocoCaptions`](coco/)  | :heavy_check_mark: |
| [`VOCDetection` / `VOCSegmentation`](voc/) | :heavy_check_mark: |
| [`LSUN`](lsun/)                            | :x:                |
| [`ImageNet`](imagenet/)                    | :heavy_check_mark: |
| [`HMDB51`](hmdb51/)                        | :heavy_check_mark: |

## Notes

- So far, I think the best approach for datasets with related files is to have each individual datapipe yield a key for the datapoint as well as the data.
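
A minimal sketch of the key-based approach, using plain Python iterables rather than actual datapipes. The helpers `key_by_stem` and `join_by_key` are hypothetical names, and using the file stem as the key is an assumption that will not hold for every dataset. Items are buffered until their partner with the same key arrives, so the two sources do not need to be in the same order.

```python
from pathlib import PurePosixPath
from typing import Any, Dict, Iterable, Iterator, Tuple


def key_by_stem(items: Iterable[Tuple[str, Any]]) -> Iterator[Tuple[str, Any]]:
    """Turn (path, data) pairs into (key, data) pairs, keyed by the file stem."""
    for path, data in items:
        yield PurePosixPath(path).stem, data


def join_by_key(
    left: Iterable[Tuple[str, Any]], right: Iterable[Tuple[str, Any]]
) -> Iterator[Tuple[str, Any, Any]]:
    """Yield (key, left_data, right_data) once both sides of a key were seen."""
    left_buffer: Dict[str, Any] = {}
    right_buffer: Dict[str, Any] = {}
    done = object()  # sentinel marking an exhausted iterator
    left_iter, right_iter = iter(left), iter(right)
    while True:
        left_item = next(left_iter, done)
        right_item = next(right_iter, done)
        if left_item is done and right_item is done:
            break
        if left_item is not done:
            key, data = left_item
            if key in right_buffer:
                yield key, data, right_buffer.pop(key)
            else:
                left_buffer[key] = data
        if right_item is not done:
            key, data = right_item
            if key in left_buffer:
                yield key, left_buffer.pop(key), data
            else:
                right_buffer[key] = data
```

For example, `join_by_key(key_by_stem(images), key_by_stem(annotations))` would pair each image with the annotation that shares its stem. Keys that never find a partner stay in the buffers and are silently dropped in this sketch; a real implementation would probably want to report them.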
