{"id":13719748,"url":"https://poloclub.github.io/diffusiondb/","last_synced_at":"2025-05-07T11:32:38.239Z","repository":{"id":62134178,"uuid":"549133772","full_name":"poloclub/diffusiondb","owner":"poloclub","description":"A large-scale text-to-image prompt gallery dataset based on Stable Diffusion","archived":false,"fork":false,"pushed_at":"2024-07-11T16:10:02.000Z","size":6954,"stargazers_count":1266,"open_issues_count":3,"forks_count":73,"subscribers_count":17,"default_branch":"main","last_synced_at":"2025-04-08T11:09:41.782Z","etag":null,"topics":["ai-art","computer-vision","image-generation","prompt-engineering","stable-diffusion"],"latest_commit_sha":null,"homepage":"https://poloclub.github.io/diffusiondb","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/poloclub.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-10-10T18:18:57.000Z","updated_at":"2025-04-07T14:50:18.000Z","dependencies_parsed_at":"2024-01-26T14:43:28.936Z","dependency_job_id":"4d037c06-c65e-47a3-961e-c69ccd7b80a5","html_url":"https://github.com/poloclub/diffusiondb","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/poloclub%2Fdiffusiondb","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/poloclub%2Fdiffusiondb/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/poloclub%2Fdiffusiondb/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositori
es/poloclub%2Fdiffusiondb/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/poloclub","download_url":"https://codeload.github.com/poloclub/diffusiondb/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252869169,"owners_count":21816986,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-art","computer-vision","image-generation","prompt-engineering","stable-diffusion"],"created_at":"2024-08-03T01:00:54.955Z","updated_at":"2025-05-07T11:32:33.228Z","avatar_url":"https://github.com/poloclub.png","language":"Python","funding_links":[],"categories":["🗃 Content"],"sub_categories":["🏛 AI Art Galleries"],"readme":"# DiffusionDB  \u003ca href=\"https://huggingface.co/datasets/poloclub/diffusiondb\"\u003e\u003cpicture\u003e\u003csource media=\"(prefers-color-scheme: dark)\" srcset=\"https://i.imgur.com/yGxUUlX.png\"\u003e\u003cimg src=\"favicon.ico\" align=\"right\" src=\"favicon.ico\" height=\"40px\"\u003e\u003c/picture\u003e\n\n[![hugging](https://img.shields.io/badge/🤗%20Hugging%20Face-Datasets-yellow)](https://huggingface.co/datasets/poloclub/diffusiondb)\n[![license](https://img.shields.io/badge/License-CC0/MIT-blue)](#licensing)\n[![arxiv badge](https://img.shields.io/badge/arXiv-2210.14896-red)](https://arxiv.org/abs/2210.14896)\n[![datasheet](https://img.shields.io/badge/Data%20Sheet-Available-success)](https://poloclub.github.io/diffusiondb/datasheet.html)\n\u003c!-- 
[![DOI:10.1145/3491101.3519653](https://img.shields.io/badge/DOI-10.1145/3491101.3519653-blue)](https://doi.org/10.1145/3491101.3519653) --\u003e\n\n\u003cimg width=\"100%\" src=\"https://user-images.githubusercontent.com/15007159/201762588-f24db2b8-dbb2-4a94-947b-7de393fc3d33.gif\"\u003e\n\nDiffusionDB is the first large-scale text-to-image prompt dataset. It contains **14 million images** generated by Stable Diffusion using prompts and hyperparameters specified by real users. The unprecedented scale and diversity of this human-actuated dataset provide exciting research opportunities in understanding the interplay between prompts and generative models, detecting deepfakes, and designing human-AI interaction tools to help users more easily use these models.\n\n## Get Started\n\nDiffusionDB is available at [🤗 Hugging Face Datasets](https://huggingface.co/datasets/poloclub/diffusiondb).\n\n## Two Subsets\n\nDiffusionDB provides two subsets (DiffusionDB 2M and DiffusionDB Large) to support different needs.\n\n|Subset|Num of Images|Num of Unique Prompts|Size|Image Directory|Metadata Table|\n|:--|--:|--:|--:|--:|--:|\n|DiffusionDB 2M|2M|1.5M|1.6TB|`images/`|`metadata.parquet`|\n|DiffusionDB Large|14M|1.8M|6.5TB|`diffusiondb-large-part-1/` `diffusiondb-large-part-2/`|`metadata-large.parquet`|\n\n##### Key Differences\n\n1. The two subsets have a similar number of unique prompts, but DiffusionDB Large has many more images. DiffusionDB Large is a superset of DiffusionDB 2M.\n2. Images in DiffusionDB 2M are stored in `png` format; images in DiffusionDB Large use a lossless `webp` format.\n\n## Dataset Structure\n\nWe use a modularized file structure to distribute DiffusionDB. The 2 million images in DiffusionDB 2M are split into 2,000 folders, where each folder contains 1,000 images and a JSON file that links these 1,000 images to their prompts and hyperparameters. 
Similarly, the 14 million images in DiffusionDB Large are split into 14,000 folders.\n\n```bash\n# DiffusionDB 2M\n./\n├── images\n│   ├── part-000001\n│   │   ├── 3bfcd9cf-26ea-4303-bbe1-b095853f5360.png\n│   │   ├── 5f47c66c-51d4-4f2c-a872-a68518f44adb.png\n│   │   ├── 66b428b9-55dc-4907-b116-55aaa887de30.png\n│   │   ├── [...]\n│   │   └── part-000001.json\n│   ├── part-000002\n│   ├── part-000003\n│   ├── [...]\n│   └── part-002000\n└── metadata.parquet\n```\n\n```bash\n# DiffusionDB Large\n./\n├── diffusiondb-large-part-1\n│   ├── part-000001\n│   │   ├── 0a8dc864-1616-4961-ac18-3fcdf76d3b08.webp\n│   │   ├── 0a25cacb-5d91-4f27-b18a-bd423762f811.webp\n│   │   ├── 0a52d584-4211-43a0-99ef-f5640ee2fc8c.webp\n│   │   ├── [...]\n│   │   └── part-000001.json\n│   ├── part-000002\n│   ├── part-000003\n│   ├── [...]\n│   └── part-010000\n├── diffusiondb-large-part-2\n│   ├── part-010001\n│   │   ├── 0a68f671-3776-424c-91b6-c09a0dd6fc2d.webp\n│   │   ├── 0a0756e9-1249-4fe2-a21a-12c43656c7a3.webp\n│   │   ├── 0aa48f3d-f2d9-40a8-a800-c2c651ebba06.webp\n│   │   ├── [...]\n│   │   └── part-010001.json\n│   ├── part-010002\n│   ├── part-010003\n│   ├── [...]\n│   └── part-014000\n└── metadata-large.parquet\n```\n\nThese sub-folders have names `part-0xxxxx`, and each image has a unique name generated by [UUID Version 4](https://en.wikipedia.org/wiki/Universally_unique_identifier). The JSON file in a sub-folder has the same name as the sub-folder. Each image is a `PNG` file (DiffusionDB 2M) or a lossless `WebP` file (DiffusionDB Large). The JSON file contains key-value pairs mapping image filenames to their prompts and hyperparameters. 
For example, below is the image of `f3501e05-aef7-4225-a9e9-f516527408ac.png` and its key-value pair in `part-000001.json`.\n\n\u003cimg width=\"300\" src=\"https://i.imgur.com/gqWcRs2.png\"\u003e\n\n```json\n{\n  \"f3501e05-aef7-4225-a9e9-f516527408ac.png\": {\n    \"p\": \"geodesic landscape, john chamberlain, christopher balaskas, tadao ando, 4 k, \",\n    \"se\": 38753269,\n    \"c\": 12.0,\n    \"st\": 50,\n    \"sa\": \"k_lms\"\n  }\n}\n```\n\nThe data fields are:\n\n- key: Unique image name\n- `p`: Prompt\n- `se`: Random seed\n- `c`: CFG Scale (guidance scale)\n- `st`: Steps\n- `sa`: Sampler\n\n## Dataset Metadata\n\nTo help you easily access prompts and other attributes of images without downloading all the Zip files, we include two metadata tables, `metadata.parquet` and `metadata-large.parquet`, for DiffusionDB 2M and DiffusionDB Large, respectively.\n\nThe shape of `metadata.parquet` is (2000000, 13) and the shape of `metadata-large.parquet` is (14000000, 13). The two tables share the same schema, and each row represents an image. 
We store these tables in the Parquet format because Parquet is column-based: you can efficiently query individual columns (e.g., prompts) without reading the entire table.\n\nBelow are three random rows from `metadata.parquet`.\n\n| image_name                               | prompt                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |   part_id |       seed |   step |   cfg |   sampler |   width |   height | user_name                                                        | timestamp                 |   image_nsfw |   prompt_nsfw |\n|:-----------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------:|-----------:|-------:|------:|----------:|--------:|---------:|:-----------------------------------------------------------------|:--------------------------|-------------:|--------------:|\n| 0c46f719-1679-4c64-9ba9-f181e0eae811.png | a small liquid sculpture, corvette, viscous, reflective, digital art                                                                                                                                                                                                                                        
                                                                                                                                                                   |      1050 | 2026845913 |     50 |   7   |         8 |     512 |      512 | c2f288a2ba9df65c38386ffaaf7749106fed29311835b63d578405db9dbcafdb | 2022-08-11 09:05:00+00:00 |    0.0845108 |   0.00383462  |\n| a00bdeaa-14eb-4f6c-a303-97732177eae9.png | human sculpture of lanky tall alien on a romantic date at italian restaurant with smiling woman, nice restaurant, photography, bokeh                                                                                                                                                                                                                                                                                                                                           |       905 | 1183522603 |     50 |  10   |         8 |     512 |      768 | df778e253e6d32168eb22279a9776b3cde107cc82da05517dd6d114724918651 | 2022-08-19 17:55:00+00:00 |    0.692934  |   0.109437    |\n| 6e5024ce-65ed-47f3-b296-edb2813e3c5b.png | portrait of barbaric spanish conquistador, symmetrical, by yoichi hatakenaka, studio ghibli and dan mumford                                                                                                                                                                                                                                                                                                                                                                    |       286 | 1713292358 |     50 |   7   |         8 |     512 |      640 | 1c2e93cfb1430adbd956be9c690705fe295cbee7d9ac12de1953ce5e76d89906 | 2022-08-12 03:26:00+00:00 |    0.0773138 |   0.0249675   |\n\n\n\n### Metadata Schema\n\n`metadata.parquet` and `metatable-large.parquet` share the same schema.\n\n|Column|Type|Description|\n|:---|:---|:---|\n|`image_name`|`string`|Image UUID 
filename.|\n|`prompt`|`string`|The text prompt used to generate this image.|\n|`part_id`|`uint16`|Folder ID of this image.|\n|`seed`|`uint32`| Random seed used to generate this image.|\n|`step`|`uint16`| Step count (hyperparameter).|\n|`cfg`|`float32`| Guidance scale (hyperparameter).|\n|`sampler`|`uint8`| Sampler method (hyperparameter). Mapping: `{1: \"ddim\", 2: \"plms\", 3: \"k_euler\", 4: \"k_euler_ancestral\", 5: \"k_heun\", 6: \"k_dpm_2\", 7: \"k_dpm_2_ancestral\", 8: \"k_lms\", 9: \"others\"}`.|\n|`width`|`uint16`|Image width.|\n|`height`|`uint16`|Image height.|\n|`user_name`|`string`|SHA256 hash of the unique Discord ID of the user who generated this image. For example, the hash for `xiaohk#3146` is `e285b7ef63be99e9107cecd79b280bde602f17e0ca8363cb7a0889b67f0b5ed0`. \"deleted_account\" refers to users who have deleted their accounts. None means the image had been deleted before we scraped it for the second time.|\n|`timestamp`|`timestamp`|UTC timestamp when this image was generated. None means the image had been deleted before we scraped it for the second time. Note that the timestamp is not accurate for duplicate images that have the same prompt, hyperparameters, width, and height.|\n|`image_nsfw`|`float32`|Likelihood of an image being NSFW. Scores are predicted by [LAION's state-of-the-art NSFW detector](https://github.com/LAION-AI/LAION-SAFETY) (range from 0 to 1). A score of 2.0 means the image has already been flagged as NSFW and blurred by Stable Diffusion.|\n|`prompt_nsfw`|`float32`|Likelihood of a prompt being NSFW. Scores are predicted by the library [Detoxify](https://github.com/unitaryai/detoxify). Each score represents the maximum of `toxicity` and `sexual_explicit` (range from 0 to 1).|\n\n\u003e **Warning**\n\u003e Although the Stable Diffusion model has an NSFW filter that automatically blurs user-generated NSFW images, this NSFW filter is not perfect, and DiffusionDB still contains some NSFW images. 
Therefore, we compute and provide the NSFW scores for images and prompts using state-of-the-art models. The distribution of these scores is shown below. Please choose an appropriate NSFW score threshold to filter out NSFW images before using DiffusionDB in your projects.\n\n\n\u003cpicture\u003e\n  \u003csource media=\"(prefers-color-scheme: dark)\" srcset=\"https://i.imgur.com/KLbJkUr.png\"\u003e\n  \u003cimg alt=\"NSFW Score distributions.\" src=\"https://i.imgur.com/1RiGAXL.png\" width=\"100%\"\u003e\n\u003c/picture\u003e\n\n\n## Loading DiffusionDB\n\nDiffusionDB is large (1.6TB or 6.5TB)! However, with our modularized file structure, you can easily load a desired number of images and their prompts and hyperparameters. In the [`example-loading.ipynb`](https://github.com/poloclub/diffusiondb/blob/main/notebooks/example-loading.ipynb) notebook, we demonstrate three methods to load a subset of DiffusionDB. Below is a short summary.\n\n### Method 1. Use Hugging Face Datasets Loader\n\nYou can use the Hugging Face [`Datasets`](https://huggingface.co/docs/datasets/quickstart) library to easily load prompts and images from DiffusionDB. We pre-defined 16 DiffusionDB subsets (configurations) based on the number of instances. You can see all subsets in the [Dataset Preview](https://huggingface.co/datasets/poloclub/diffusiondb/viewer/all/train).\n\n\u003e **Note**\n\u003e To use the Datasets loader, you need to install `Pillow` as well (`pip install Pillow`).\n\n```python\nfrom datasets import load_dataset\n\n# Load the dataset with the `large_random_1k` subset\ndataset = load_dataset('poloclub/diffusiondb', 'large_random_1k')\n```\n\n### Method 2. Use a downloader script\n\nThis repo includes a Python downloader [`download.py`](https://github.com/poloclub/diffusiondb/blob/main/scripts/download.py) that allows you to download and load DiffusionDB. You can use it from your command line. 
Below is an example of loading a subset of DiffusionDB.\n\n#### Usage/Examples\n\nThe script is run using command-line arguments as follows:\n\n- `-i` `--index` - File to download or lower bound of a range of files if `-r` is also set.\n- `-r` `--range` - Upper bound of range of files to download if `-i` is set.\n- `-o` `--output` - Name of custom output directory. Defaults to the current directory if not set.\n- `-z` `--unzip` - Unzip the file/files after downloading.\n- `-l` `--large` - Download from DiffusionDB Large. Defaults to DiffusionDB 2M.\n\n##### Downloading a single file\n\nThe specific file to download is supplied as the number at the end of the file name on Hugging Face. The script will automatically pad the number out and generate the URL.\n\n```bash\npython download.py -i 23\n```\n\n##### Downloading a range of files\n\nThe lower and upper bounds of the set of files to download are set by the `-i` and `-r` flags respectively.\n\n```bash\npython download.py -i 1 -r 2000\n```\n\nNote that this range will download the entire dataset. The script will ask you to confirm that you have 1.7TB free at the download destination.\n\n##### Downloading to a specific directory\n\nThe script will default to the location of the dataset's `part` .zip files at `images/`. If you wish to move the download location, you should move these files as well or use a symbolic link.\n\n```bash\npython download.py -i 1 -r 2000 -o /home/$USER/datahoarding/etc\n```\n\nThe script will automatically add the `/` between the directory and the file name when it downloads.\n\n##### Setting the files to unzip once they've been downloaded\n\nThe script is set to unzip the files _after_ all files have downloaded, as both can be lengthy processes in certain circumstances.\n\n```bash\npython download.py -i 1 -r 2000 -z\n```\n\n### Method 3. 
Use `metadata.parquet` (Text Only)\n\nIf your task does not require images, then you can easily access all 2 million prompts and hyperparameters in the `metadata.parquet` table.\n\n```python\nfrom urllib.request import urlretrieve\nimport pandas as pd\n\n# Download the parquet table\ntable_url = 'https://huggingface.co/datasets/poloclub/diffusiondb/resolve/main/metadata.parquet'\nurlretrieve(table_url, 'metadata.parquet')\n\n# Read the table using Pandas\nmetadata_df = pd.read_parquet('metadata.parquet')\n```\n\n## Dataset Creation\n\nWe collected all images from the official Stable Diffusion Discord server. Please read our [research paper](https://arxiv.org/abs/2210.14896) for details. The code is included in [`./scripts/`](./scripts/).\n\n## Data Removal\n\nIf you find any harmful images or prompts in DiffusionDB, you can use [this Google Form](https://forms.gle/GbYaSpRNYqxCafMZ9) to report them. Similarly, if you are a creator of an image included in this dataset, you can use the [same form](https://forms.gle/GbYaSpRNYqxCafMZ9) to let us know if you would like to remove your image from DiffusionDB. We will closely monitor this form and update DiffusionDB periodically.\n\n## Credits\n\nDiffusionDB is created by [Jay Wang](https://zijie.wang), [Evan Montoya](https://www.linkedin.com/in/evan-montoya-b252391b4/), [David Munechika](https://www.linkedin.com/in/dmunechika/), [Alex Yang](https://alexanderyang.me), [Ben Hoover](https://www.bhoov.com), and [Polo Chau](https://faculty.cc.gatech.edu/~dchau/).\n\n## Citation\n\n```bibtex\n@article{wangDiffusionDBLargescalePrompt2022,\n  title = {{{DiffusionDB}}: {{A}} Large-Scale Prompt Gallery Dataset for Text-to-Image Generative Models},\n  author = {Wang, Zijie J. 
and Montoya, Evan and Munechika, David and Yang, Haoyang and Hoover, Benjamin and Chau, Duen Horng},\n  year = {2022},\n  journal = {arXiv:2210.14896 [cs]},\n  url = {https://arxiv.org/abs/2210.14896}\n}\n```\n\n## Licensing\n\nThe DiffusionDB dataset is available under the [CC0 1.0 License](https://creativecommons.org/publicdomain/zero/1.0/).\nThe Python code in this repository is available under the [MIT License](./LICENSE).\n\n## Contact\n\nIf you have any questions, feel free to [open an issue](https://github.com/poloclub/diffusiondb/issues/new) or contact [Jay Wang](https://zijie.wang).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/poloclub.github.io%2Fdiffusiondb%2F","html_url":"https://awesome.ecosyste.ms/projects/poloclub.github.io%2Fdiffusiondb%2F","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/poloclub.github.io%2Fdiffusiondb%2F/lists"}