{"id":13738187,"url":"https://github.com/robvanvolt/DALLE-datasets","last_synced_at":"2025-05-08T16:32:35.340Z","repository":{"id":38371410,"uuid":"363522896","full_name":"robvanvolt/DALLE-datasets","owner":"robvanvolt","description":"This is a summary of easily available datasets for generalized DALLE-pytorch training.","archived":false,"fork":false,"pushed_at":"2022-04-19T20:27:34.000Z","size":509,"stargazers_count":127,"open_issues_count":2,"forks_count":16,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-08-04T03:12:02.070Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/robvanvolt.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-05-01T22:53:22.000Z","updated_at":"2024-07-30T12:07:41.000Z","dependencies_parsed_at":"2022-08-25T02:11:42.373Z","dependency_job_id":null,"html_url":"https://github.com/robvanvolt/DALLE-datasets","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/robvanvolt%2FDALLE-datasets","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/robvanvolt%2FDALLE-datasets/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/robvanvolt%2FDALLE-datasets/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/robvanvolt%2FDALLE-datasets/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/robvanvolt","download_url":"https://codeload.github.com/robvanvolt/DALLE-datasets/tar.gz/refs/heads
/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224746394,"owners_count":17363038,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T03:02:13.672Z","updated_at":"2024-11-15T07:30:30.014Z","avatar_url":"https://github.com/robvanvolt.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"## DALLE-datasets\nThis is a summary of easily available, high-quality datasets consisting of captioned image files for generalized DALLE-pytorch training (https://github.com/lucidrains/DALLE-pytorch).\n\nThe scripts help you download and resize the files from the given sources.\n\n* general datasets\n  * Conceptual Images 12m\n  * Wikipedia\n  * Filtered yfcc100m\n  * Open Images\n* specific datasets\n  * None yet\n\n\n## Helper scripts\n\nAll helper scripts can be found in the utilities folder now:\n* TFrecords to WebDataset converter\n* Image-Text-Folder to WebDataset converter\n* Dataset sanitycheck for image-text-files\n* Example reader for WebDataset files\n\n\n### Sanitycheck for downloaded datasets\n\nThe following command will look for image-text-pairs (.jpg / .png / .bmp) and return a CSV table with incomplete data.\nWhen you add the optional argument -DEL, the incomplete files get deleted. 
The Python script checks one folder and its first-level subdirectories.\n\n```python sanity_check.py --dataset_folder my-dataset-folder```\n\n\n## Pretrained models\n\nIf you want to continue training on pretrained models or even upload your own Dall-E model, head over to https://github.com/robvanvolt/DALLE-models\n\n## Credits\n\nSpecial thanks go to \u003ca href=\"https://github.com/rom1504\"\u003eRomaine\u003c/a\u003e, who improved the download scripts and made the great WebDataset format more accessible with his continuous coding efforts! 🙏 \n\nA lot of inspiration was taken from https://github.com/yashbonde/dall-e-baby - unfortunately that repo does not get updated anymore...\nAlso, the shard creator was inspired by https://github.com/tmbdev-archive/webdataset-examples/blob/master/makeshards.py.\nThe custom tokenizer was inspired by afiaka87, who showed a simple way to generate custom tokenizers with youtokentome.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frobvanvolt%2FDALLE-datasets","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frobvanvolt%2FDALLE-datasets","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frobvanvolt%2FDALLE-datasets/lists"}