{"id":19352012,"url":"https://github.com/lukashedegaard/datasetops","last_synced_at":"2025-04-23T07:31:15.959Z","repository":{"id":62566735,"uuid":"242953843","full_name":"LukasHedegaard/datasetops","owner":"LukasHedegaard","description":"Fluent dataset operations, compatible with your favorite libraries","archived":false,"fork":false,"pushed_at":"2022-10-19T07:09:35.000Z","size":25939,"stargazers_count":11,"open_issues_count":19,"forks_count":0,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-11T23:59:45.117Z","etag":null,"topics":["data-cleaning","data-munging","data-processing","data-science","data-wrangling","dataset","dataset-combinations","deep-learning","multiple-datasets","pytorch","tensorflow"],"latest_commit_sha":null,"homepage":"https://datasetops.readthedocs.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LukasHedegaard.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-02-25T08:56:59.000Z","updated_at":"2024-12-12T21:27:22.000Z","dependencies_parsed_at":"2022-11-03T17:47:48.580Z","dependency_job_id":null,"html_url":"https://github.com/LukasHedegaard/datasetops","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LukasHedegaard%2Fdatasetops","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LukasHedegaard%2Fdatasetops/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LukasHe
degaard%2Fdatasetops/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LukasHedegaard%2Fdatasetops/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LukasHedegaard","download_url":"https://codeload.github.com/LukasHedegaard/datasetops/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250391153,"owners_count":21422849,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-cleaning","data-munging","data-processing","data-science","data-wrangling","dataset","dataset-combinations","deep-learning","multiple-datasets","pytorch","tensorflow"],"created_at":"2024-11-10T04:37:51.626Z","updated_at":"2025-04-23T07:31:10.948Z","avatar_url":"https://github.com/LukasHedegaard.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"docs/pics/logo.svg\"\u003e\u003cbr\u003e\n\u003c/div\u003e\n\n# Dataset Ops: Fluent dataset operations, compatible with your favorite libraries\n\n![Python package](https://github.com/LukasHedegaard/datasetops/workflows/Python%20package/badge.svg) [![Documentation Status](https://readthedocs.org/projects/datasetops/badge/?version=latest)](https://datasetops.readthedocs.io/en/latest/?badge=latest) [![codecov](https://codecov.io/gh/LukasHedegaard/datasetops/branch/master/graph/badge.svg)](https://codecov.io/gh/LukasHedegaard/datasetops) [![Code style: 
black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n\nDataset Ops provides a [fluent interface](https://martinfowler.com/bliki/FluentInterface.html) for _loading, filtering, transforming, splitting,_ and _combining_ datasets. \nDesigned specifically with data science and machine learning applications in mind, it integrates seamlessly with [Tensorflow](https://www.tensorflow.org) and [PyTorch](https://pytorch.org).\n\n## Appetizer\n```python\nimport datasetops as do\n\n# prepare your data\ntrain, val, test = (\n    do.from_folder_class_data('path/to/data/folder')\n    .named(\"data\", \"label\")\n    .image_resize((240, 240))\n    .one_hot(\"label\")\n    .shuffle(seed=42)\n    .split([0.6, 0.2, 0.2])\n)\n\n# use with your favorite framework\ntrain_tf = train.to_tensorflow() \ntrain_pt = train.to_pytorch() \n\n# or do your own thing\nfor img, label in train:\n    ...\n```\n\n## Installation \nBinary installers are available at the [Python Package Index](https://pypi.org/project/datasetops/)\n```bash\npip install datasetops\n```\n\n\n## Why? \nCollecting and preprocessing datasets is tiresome and often takes upwards of 50% of the effort spent in the data science and machine learning lifecycle.\nWhile [Tensorflow](https://www.tensorflow.org/datasets) and [PyTorch](https://www.tensorflow.org/datasets) have some useful dataset utilities available, they are designed specifically with the respective frameworks in mind.\nUnsurprisingly, this makes it hard to switch between them, and training-ready dataset definitions are bound to one or the other.\nMoreover, they do not aid you in standard scenarios where you want to:\n- Sample your dataset in non-random ways (e.g. with a fixed number of samples per class)\n- Center, standardize, or normalize your data\n- Combine multiple datasets, e.g. 
for parallel input to a multi-stream network\n- Create non-standard data splits\n\n_Dataset Ops_ aims to make these processing steps easier, faster, and more intuitive to perform, while retaining full compatibility to and from the leading libraries. This also means you can grab a dataset from [torchvision datasets](https://pytorch.org/docs/stable/torchvision/datasets.html#mnist) and use it directly with tensorflow:\n\n```python\nimport datasetops as do\nimport torchvision\n\ntorch_usps = torchvision.datasets.USPS('../dataset/path', download=True)\ntensorflow_usps = do.from_pytorch(torch_usps).to_tensorflow()\n```\n\n\n## Development Status\nThe library is still under heavy development and the API may be subject to change.\n\nWhat follows here is a list of implemented and planned features.\n\n### Loaders\n- [x] `Loader` (utility class used to define a dataset)\n- [x] `from_pytorch` (load from a `torch.utils.data.Dataset`)\n- [x] `from_tensorflow` (load from a `tf.data.Dataset`)\n- [x] `from_folder_data` (load flat folder with data)\n- [x] `from_folder_class_data` (load nested folder with a folder for each class)\n- [x] `from_folder_dataset_class_data` (load nested folder with multiple datasets, each with a nested class folder structure)\n- [ ] `from_mat` (load contents of a .mat file as a single dataset)\n- [x] `from_mat_single_mult_data` (load contents of a .mat file as multiple datasets)\n- [ ] `load` (load data from a path, automatically inferring type and structure)\n\n### Converters\n- [x] `to_tensorflow` (convert Dataset into tensorflow.data.Dataset)\n- [x] `to_pytorch` (convert Dataset into torchvision.Dataset)\n\n### Dataset information\n- [x] `shape` (get shape of a dataset item)\n- [x] `counts` (compute the counts of each unique item in the dataset by key)\n- [x] `unique` (get a list of unique items in the dataset by key)\n- [x] `named` (supply names for the item elements)\n- [x] `names` (get a list of names for the elements in an item)\n- [ ] `stats` (provide an 
overview of the dataset statistics)\n- [ ] `origin` (provide a description of how the dataset was made)\n\n### Sampling and splitting\n- [x] `shuffle` (shuffle the items in a dataset randomly)\n- [x] `sample` (sample data at random from a dataset)\n- [x] `filter` (filter the dataset using a predicate)\n- [x] `split` (split a dataset randomly based on fractions)\n- [x] `split_filter` (split a dataset into two based on a predicate)\n- [x] `allow_unique` (handy predicate used for balanced classwise filtering/sampling)\n- [x] `take` (take the first items in a dataset)\n- [x] `repeat` (repeat the items in a dataset, either itemwise or as a whole)\n\n### Item manipulation\n- [x] `reorder` (reorder the elements of the dataset items (e.g. flip label and data order))\n- [x] `transform` (transform function which takes other functions and applies them to the dataset items.)\n- [x] `categorical` (transforms an element into a categorical integer encoded label)\n- [x] `one_hot` (transforms an element into a one-hot encoded label)\n- [x] `numpy` (transforms an element into a numpy.ndarray)\n- [x] `reshape` (reshapes numpy.ndarray elements)\n- [x] `image` (transforms a numpy array or path string into a PIL.Image.Image)\n- [x] `image_resize` (resizes PIL.Image.Image elements)\n- [ ] `image_crop` (crops PIL.Image.Image elements)\n- [ ] `image_rotate` (rotates PIL.Image.Image elements)\n- [ ] `image_transform` (transforms PIL.Image.Image elements)\n- [ ] `image_brightness` (modify brightness of PIL.Image.Image elements)\n- [ ] `image_contrast` (modify contrast of PIL.Image.Image elements)\n- [ ] `image_filter` (apply an image filter to PIL.Image.Image elements)\n- [ ] `noise` (adds noise to the data)\n- [ ] `center` (modify each item according to dataset statistics)\n- [ ] `normalize` (modify each item according to dataset statistics)\n- [ ] `standardize` (modify each item according to dataset statistics)\n- [ ] `whiten` (modify each item according to dataset statistics)\n- [ ] 
`randomly` (apply data transformations with some probability)\n\n### Dataset combinations \n- [x] `concat` (concatenate two datasets, placing the items of one after the other)\n- [x] `zip` (zip datasets itemwise, extending the size of each item)\n- [x] `cartesian_product` (create a dataset whose items are all combinations of items (zipped) of the originating datasets)\n\n\n## Citation\nIf you use this software, please cite it as below:\n```bibtex\n@software{Hedegaard_DatasetOps_2022,\n  author = {Hedegaard, Lukas and Oleksiienko, Illia and Legaard, Christian Møldrup},\n  doi = {10.5281/zenodo.7223644},\n  month = {10},\n  title = {{DatasetOps}},\n  version = {0.0.7},\n  year = {2022}\n}\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flukashedegaard%2Fdatasetops","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flukashedegaard%2Fdatasetops","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flukashedegaard%2Fdatasetops/lists"}