{"id":27957619,"url":"https://github.com/laion-ai/dataset-spec","last_synced_at":"2025-05-07T18:13:49.268Z","repository":{"id":41870342,"uuid":"441663171","full_name":"LAION-AI/dataset-spec","owner":"LAION-AI","description":"Describe the format of image/text datasets","archived":false,"fork":false,"pushed_at":"2022-04-26T09:02:03.000Z","size":21,"stargazers_count":11,"open_issues_count":0,"forks_count":1,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-05-07T18:13:44.520Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LAION-AI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-12-25T11:16:14.000Z","updated_at":"2024-01-04T17:04:27.000Z","dependencies_parsed_at":"2022-09-23T23:02:06.886Z","dependency_job_id":null,"html_url":"https://github.com/LAION-AI/dataset-spec","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LAION-AI%2Fdataset-spec","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LAION-AI%2Fdataset-spec/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LAION-AI%2Fdataset-spec/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LAION-AI%2Fdataset-spec/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LAION-AI","download_url":"https://codeload.github.com/LAION-AI/dataset-spec/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositori
es_count":252931550,"owners_count":21827112,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-05-07T18:13:48.570Z","updated_at":"2025-05-07T18:13:49.260Z","avatar_url":"https://github.com/LAION-AI.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# dataset-spec\nCheck out this video: https://www.youtube.com/watch?v=cmVBaShtygA\n\n## Metadata for image/text datasets\n\nIf the dataset can be downloaded from image URLs,\nprefer distributing the data publicly (e.g. on Hugging Face) as parquet files with these columns:\n* URL\n* TEXT\n\nIf other information is available, feel free to provide it as well in other columns.\n\n## Image-Text-Datasets\n\nThe format is a collection of tar files (this dataset format is called webdataset) containing images, captions, and metadata:\n\n* 00000.tar containing 10k samples\n  * 0.jpg\n  * 0.txt containing the caption\n  * 0.json containing metadata such as the URL, the original width, the EXIF data, and whether the image is NSFW\n\n\nIf the largest dimension of an image is greater than 512, resize it so that the largest dimension is 512, keeping the aspect ratio.\n\nIf the smallest dimension of the image is below 64, drop the sample.\n\nDo not increase the resolution of a sample if it is below 512 but above 64; just keep it as is.\n\nSave the image data in the webdataset format as JPEG at the highest quality.\n\nUse “jpg” as the field for the image and “txt” as the field for the caption.\nA \"json\" field should contain at least the height and width of the image, plus any further data that is available.\n\nIf you 
have a VQA dataset, put the prefix \"Q: \" before the question and the prefix \"A: \" before the answer, then concatenate both texts and put the result into the \"txt\" field.\nAlso put an entry into the \"json\" field with the key \"question\" and the question as the value, and add an entry with the key \"answer\" and the answer as the value.\n\nThis script can help create wds tar files from jpg and txt files: [wds_create_shards.py](wds_create_shards.py) (**json support added on 04-19-2022: you can use the --json flag to automatically read json files and add them to the tar files**)\n\n|  Dataset info  |  Who works on it  |\n|---|---|\n|   |   | \n|   |   | \n|   |   | \n|   |   | \n|   |   | \n|   |   | \n|   |   | \n|   |   | \n|   |   | \n|   |   | \n|   |   | \n|   |   | \n|   |   | \n|   |   | \n|   |   | \n|   |   | \n|   |   | \n|   |   | \n|   |   | \n|   |   | \n|   |   | \n|   |   | \n|   |   | \n|   |   | \n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flaion-ai%2Fdataset-spec","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flaion-ai%2Fdataset-spec","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flaion-ai%2Fdataset-spec/lists"}