{"id":19875670,"url":"https://github.com/zeptofine/dataset-creator","last_synced_at":"2026-06-08T01:01:36.626Z","repository":{"id":174990611,"uuid":"653150806","full_name":"zeptofine/dataset-creator","owner":"zeptofine","description":"Simply a creator for image datasets.","archived":false,"fork":false,"pushed_at":"2023-12-05T00:51:15.000Z","size":1822,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-01T01:28:52.282Z","etag":null,"topics":["image-processing-python","pyside6","python","python-multiprocessing","typer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zeptofine.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-06-13T13:52:07.000Z","updated_at":"2023-11-29T00:06:12.000Z","dependencies_parsed_at":"2024-11-12T16:35:56.735Z","dependency_job_id":"06bebdc9-8c76-43b7-be87-d1705322251d","html_url":"https://github.com/zeptofine/dataset-creator","commit_stats":null,"previous_names":["zeptofine/dataset-creator"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/zeptofine/dataset-creator","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zeptofine%2Fdataset-creator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zeptofine%2Fdataset-creator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zeptofine%2Fdataset-creator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/repositories/zeptofine%2Fdataset-creator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zeptofine","download_url":"https://codeload.github.com/zeptofine/dataset-creator/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zeptofine%2Fdataset-creator/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34043822,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-07T02:00:07.652Z","response_time":124,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["image-processing-python","pyside6","python","python-multiprocessing","typer"],"created_at":"2024-11-12T16:29:06.213Z","updated_at":"2026-06-08T01:01:36.608Z","avatar_url":"https://github.com/zeptofine.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# dataset-creator\n\nThis is a tool I made to assist in making datasets for image models.\n\n## Installation\n\nThe simplest way to install is to run\n\n```bash\ngit clone https://github.com/zeptofine/dataset-creator\ncd dataset-creator\n```\n\n, create a virtual environment, then\n\n```bash\npip install -e .\n```\n\ninside it. At the moment, this just uses pyproject.toml (and Poetry) to install the dependencies. 
You can use the requirements.txt most of the time.\n\n## GUI Configuration\n\nThe GUI is (currently) used to configure the actions of the\n[imdataset_creator](imdataset_creator) main.\n![dataset-creator-gui](github/images/dc_empty.png)\nTo run it, execute `python -m imdataset_creator.gui` (or, if you installed it, `imdataset-creator-gui`) in the terminal. When you save, the config will normally appear in `\u003cPWD\u003e/config.json`. Make sure it doesn't overwrite anything important!\n\n### Inputs\n\n![dataset creator inputs window](github/images/dc_inputs.png)\n\nThe folder is the directory that is searched for images, and the search patterns are passed to [`wcmatch.glob`](https://facelessuser.github.io/wcmatch/glob/) to find files.\n\n### Producers\n\n![dataset creator producers window](github/images/dc_producers.png)\n\nRules use Producers to get information about files. Rules could gather this information themselves, but that is inefficient across multiple consecutive runs. The data gathered by the Producers is saved to a file, by default `filedb.arrow`.\n\n### Rules\n\nRules are used to filter out unwanted files. For example, one of them restricts the resolution of allowed files to a certain range, and another restricts the modification time to a certain range.\n\n![dataset creator rules window](github/images/dc_rules.png)\n\nWhen a Rule needs a Producer, the Rule should tell you what it needs in its description. 
Pick the appropriate Producer in the Producers section.\n\nAs of writing this, there are 6 rules:\n\n- Time Range: Only allow files created within a time frame\n- Blacklist and whitelist: Only allow paths that include `str`s in the whitelist and none in the blacklist\n- Total count: Only allow a certain number of files\n- Resolution: Only allow files with a resolution within a certain range\n- Channels: Only allow files with a certain number of channels\n- Hash: Use ImageHash hashes to eliminate similar-looking images\n\nThe order of these rules in the list is important, as they are executed in order from top to bottom.\n\n**Neither Producers nor Rules need to be defined for inputs/outputs to work!**\n\n### Outputs \u0026 Filters\n\n![dataset creator outputs window](github/images/dc_outputs.png)\n\nOutputs have a folder, which is where created images are sent, and a format_text, which defines the new files' paths. The `overwrite existing files` checkbox controls whether files that already exist in the output folder are overwritten.\n\nThe `Filters` list shows functions that will be applied to images going through this step. They can apply noise, compression, etc. to images. 
Many of these are adapted almost directly from [Kim2091/helpful-scripts/Dataset-Destroyer](https://github.com/Kim2091/helpful-scripts/).\n\n### Every list item can be dragged and resized to make it easier to view\n\n## Running\n\nTo run the actual program, run `python -m imdataset_creator` (or, if you installed it, `imdataset-creator`) in the terminal.\n\n### Arguments\n\n```rich\n--config-path                              PATH     Where the dataset config is placed [default: config.json]\n--database-path                            PATH     Where the database is placed [default: filedb.arrow]\n--threads                                  INTEGER  multiprocessing threads [default: 9]\n--chunksize                                INTEGER  imap chunksize [default: 5]\n--population-chunksize                     INTEGER  chunksize when populating the df [default: 100]\n--population-interval                      INTEGER  save interval in secs when populating the df [default: 60]\n--simulate                --no-simulate             stops before conversion [default: no-simulate]\n--verbose                 --no-verbose              prints converted files [default: no-verbose]\n--sort-by                                  TEXT     Which database column to sort by [default: path]\n--help                                              Show this message and exit.\n```\n\n## TODO\n\n- [x] make UI\n- [x] redo config setup\n- [ ] More filters\n- [ ] bind UI to the CLI methods\n\nBefore the last point can be started, create_dataset.py must be broken down far enough that almost all it controls is progress tracking.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzeptofine%2Fdataset-creator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzeptofine%2Fdataset-creator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzeptofine%2Fdataset-creator/lists"}