{"id":15066290,"url":"https://github.com/aiporre/multidataloader","last_synced_at":"2025-04-10T16:42:44.428Z","repository":{"id":37632161,"uuid":"238936055","full_name":"aiporre/multidataloader","owner":"aiporre","description":"Dataloader for Tensor Flow using the multithreading features","archived":false,"fork":false,"pushed_at":"2023-03-25T00:55:08.000Z","size":3067,"stargazers_count":8,"open_issues_count":10,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-24T14:39:07.835Z","etag":null,"topics":["dataset","dpipe","tensorflow"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aiporre.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-02-07T13:58:20.000Z","updated_at":"2023-02-13T10:50:40.000Z","dependencies_parsed_at":"2023-02-18T00:46:08.027Z","dependency_job_id":null,"html_url":"https://github.com/aiporre/multidataloader","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aiporre%2Fmultidataloader","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aiporre%2Fmultidataloader/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aiporre%2Fmultidataloader/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aiporre%2Fmultidataloader/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aiporre","download_url":"https://codeload.github.com/aiporre/multidataloader/tar.gz/refs/heads/master","hos
t":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247785945,"owners_count":20995645,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset","dpipe","tensorflow"],"created_at":"2024-09-25T01:05:08.097Z","updated_at":"2025-04-10T16:42:44.408Z","avatar_url":"https://github.com/aiporre.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003e DPIPE\u003c/h1\u003e\r\n\r\n\u003cp align=\"center\"\u003e\r\n  \u003cimg src=\"images/pipeguy.png\" data-canonical-src=\"https://gyazo.com/eb5c5741b6a9a16c692170a41a49c858.png\" width=\"200\" /\u003e\r\n\u003c/p\u003e\r\n\r\n[![Documentation](http://inch-ci.org/github/dwyl/hapi-auth-jwt2.svg?branch=master)](https://multidataloader.readthedocs.io/en/latest/) \r\n\r\nWith dpipe you can create ready to use datasets from paths or list of files. You should \r\nspecify the type and the location of the input and target. The labels are assumed to be the name of the folder containing the file,\r\nif you need a dataset for classification. \r\n\r\nThe inputs and targets can be a list of paths, a path to be explored containing images or videos. 
For example:\r\n````shell script\r\n./dataset\r\n  |\r\n  |--cat/img1.png\r\n  |--cat/img2.png\r\n  |--dog/img1.png\r\n  |--dog/img2.png\r\n````\r\nThe function `make_dataset` outputs a `dpipe.dataset_builder` object whose methods apply predefined multiprocessing setups based on TensorFlow's recommendations.\r\n\r\n| method        | action          |\r\n| ------------- |:-------------:|\r\n| `dataset_builder.prefetch()`      | Preloads samples into memory |\r\n| `dataset_builder.batch()`      | Creates a batch dataset |\r\n| `dataset_builder.enumerate()`      | Appends an index to each output |\r\n| `dataset_builder.filter()`      | Applies a filter concurrently |\r\n| `dataset_builder.map()`      | Applies a function to each element concurrently |\r\n| `dataset_builder.repeat()`      | Creates a repeated dataset |\r\n| `dataset_builder.shuffle()`      | Shuffles the dataset after a complete run |\r\n \r\n\r\nThe dataset can be specified as:\r\n````python\r\nfrom dpipe import make_dataset\r\ndataset = make_dataset('image','label',x_path='./dataset',x_size=(128,128)).build()\r\n````\r\n## Creating dataset (more options)\r\nAdditionally, a dataset can be defined from functions or objects. Two use cases are presented here. A dataset can be created from a function and a list of elements to parse, for example a list of files and a reading function. 
\r\nFor example, if we are training a denoising autoencoder, we need pairs of noisy and clean images; this can be handled with the function `dpipe.from_function`:\r\n```python\r\nimport glob # to find the files\r\nimport matplotlib.image as mpimg # to read the images (you need to install it.)\r\nimport numpy as np\r\nfrom dpipe import from_function\r\n\r\nfilelist = glob.glob('./dataset/*.png') # glob takes a single pattern string\r\ndef read_file(filename):\r\n    target = mpimg.imread(filename) # read the image\r\n    noisy_image = np.random.randn(*target.shape) # randn expects the dimensions as separate arguments\r\n    return noisy_image, target\r\n# undetermined shape is used to define dimensions that vary across samples, in this case the height and the width of the images\r\ndataset = from_function(read_file, filelist, undetermined_shape=((1,2),(1,2))).build()\r\n```\r\nIf you are accessing your data in an object-oriented way, you can use `dpipe.from_object`. The next example consumes a list of files containing records via a generator function (this can also be handled with `dpipe.from_function`). The code should look like this:\r\n```python\r\nimport os\r\nimport pandas as pd\r\nfrom dpipe import from_object\r\n\r\nclass Reader():\r\n    def __init__(self,datapath='./dataset'):\r\n        # keep full paths so that the files can be opened from any working directory\r\n        self.filelist = [os.path.join(datapath, f) for f in os.listdir(datapath)]\r\n    def __len__(self):\r\n        return len(self.filelist)\r\n    def my_reading_function(self,filename):\r\n        df = pd.read_csv(filename)\r\n        for v, t in zip(df.values, df.targets):\r\n            yield v, t\r\nreader = Reader()\r\ndataset = from_object(reader, 'my_reading_function','filelist').build()\r\n```\r\nThe `build()` function creates a dataset with arguments ready to use with the `fit()` method of a `tf.keras.Model` object. 
It is used like this:\r\n```python\r\ntraining_ds = from_object(reader_training, 'my_reading_function').shuffle(len(reader_training), reshuffle_each_iteration=True).batch(32).repeat().build()\r\nvalidation_ds = from_object(reader_validation, 'my_reading_function',training=False).batch(32).build()\r\nmodel.fit(x=training_ds,validation_data=validation_ds, epochs=10,**training_ds.built_args,**validation_ds.built_args)\r\n```\r\n# Installation\r\n````shell script\r\npip install dapipe\r\n````\r\nWorking with video formats requires FFmpeg ([here](https://www.ffmpeg.org)) to be installed.\r\n\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faiporre%2Fmultidataloader","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faiporre%2Fmultidataloader","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faiporre%2Fmultidataloader/lists"}