{"id":15637383,"url":"https://github.com/sayakpaul/caption-upsampling","last_synced_at":"2025-09-20T20:32:25.918Z","repository":{"id":202732190,"uuid":"707985759","full_name":"sayakpaul/caption-upsampling","owner":"sayakpaul","description":"This repository implements the idea of \"caption upsampling\" from DALL-E 3 with Zephyr-7B and gathers results with SDXL.","archived":false,"fork":false,"pushed_at":"2023-10-25T13:41:29.000Z","size":47,"stargazers_count":154,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-12-28T03:39:32.347Z","etag":null,"topics":["diffusers","image-generation","pytorch","sdxl"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sayakpaul.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-10-21T07:04:22.000Z","updated_at":"2024-12-10T16:00:18.000Z","dependencies_parsed_at":"2024-10-22T15:22:33.248Z","dependency_job_id":null,"html_url":"https://github.com/sayakpaul/caption-upsampling","commit_stats":null,"previous_names":["sayakpaul/caption-upsampling"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sayakpaul%2Fcaption-upsampling","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sayakpaul%2Fcaption-upsampling/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sayakpaul%2Fcaption-upsampling/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/repositories/sayakpaul%2Fcaption-upsampling/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sayakpaul","download_url":"https://codeload.github.com/sayakpaul/caption-upsampling/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":233687137,"owners_count":18714251,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["diffusers","image-generation","pytorch","sdxl"],"created_at":"2024-10-03T11:11:33.913Z","updated_at":"2025-09-20T20:32:20.637Z","avatar_url":"https://github.com/sayakpaul.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# caption-upsampling\n\nThis repository implements the idea of \"caption upsampling\" from [DALL-E 3](https://cdn.openai.com/papers/dall-e-3.pdf) with [Zephyr-7B](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha) and gathers results with [SDXL](https://huggingface.co/papers/2307.01952).\n\n\u003ctable\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e\u003cimg src=\"https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/caption-upsampling/sandwich.jpg\" alt=\"Sample Image 1\"\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003cimg src=\"https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/caption-upsampling/car_sheep.jpg\" alt=\"Sample Image 2\"\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003cimg src=\"https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/caption-upsampling/owl.jpg\" alt=\"Sample Image 3\"\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    
\u003ctr\u003e\n        \u003ctd\u003e\u003cb\u003eA white colored sandwich.\u003c/b\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003cb\u003eA white car and a red sheep.\u003c/b\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003cb\u003eA side view of an owl sitting in a field.\u003c/b\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e\u003cimg src=\"https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/caption-upsampling/upsampled_sandwich.jpg\" alt=\"Sample Image 4\"\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003cimg src=\"https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/caption-upsampling/upsampled_car_sheep.jpg\" alt=\"Sample Image 5\"\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003cimg src=\"https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/caption-upsampling/upsampled_owl.jpg\" alt=\"Sample Image 6\"\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e\u003cb\u003eA white-bread sandwich with delicate layers of fluffy turkey, crisp lettuce, and juicy tomatoes is placed on a wooden cutting board. The sandwich is surrounded by various condiments, including mayonnaise, mustard, and a small jar of pickles. The scene is set in a cozy kitchen, with natural light pouring in through a window.\u003c/b\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003cb\u003eA white car is parked on the side of a road in a green meadow. In the distance, a flock of red sheep can be seen grazing. The car seems to be abandoned, and the windows are shattered. The scene is eerie, and there is an unsettling feeling in the air.\u003c/b\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003cb\u003eA regal-looking snowy owl perches on a rocky outcropping, its feathers fluffed against the chilly wind. The bird's large, yellow eyes are fixed on a rabbit nibbling on some grass in the distance. 
The sun sets behind the owl, casting a warm orange glow over the landscape.\u003c/b\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\u003csub\u003eExplore more samples \u003ca href=\"https://huggingface.co/datasets/sayakpaul/drawbench-sdxl\"\u003ehere\u003c/a\u003e. Find additional examples \u003ca href=\"https://github.com/sayakpaul/caption-upsampling#additional-examples\"\u003ebelow\u003c/a\u003e with SDXL Refiner and Kandinsky V2.2.\u003c/sub\u003e\n\u003cbr\u003e\u003cbr\u003e\n\n\"Caption upsampling\" is the $10 term for deriving a highly descriptive caption from a short caption. Here is an example:\n\n**Short**: _A bird scaring a scarecrow_\n\n**Upsampled**: _A large, vibrant bird with an impressive wingspan swoops down from the sky, letting out a piercing call as it approaches a weathered scarecrow in a sunlit field. The scarecrow, dressed in tattered clothing and a straw hat, appears to tremble, almost as if it’s coming to life in fear of the approaching bird._\n\nThis is particularly useful in the context of text-to-image generation.\n\n🌟 **Update 23/10/2023**: Got featured in this [TLDR newsletter](https://tldr.tech/ai/2023-10-23).\n\n## Why does this repo exist?\n\nDALL-E 3 uses GPT-4 for upsampling the captions. This repository aims at providing an implementation with an open-source model that is capable of performing something similar but doesn't require you to pay for the usage. As such it makes use of the \"zephyr-7b-alpha\" model, fine-tuned from the mighty [Mistral-7B model](https://huggingface.co/mistralai/Mistral-7B-v0.1).\n\nYou can find the upsampled captions from the DrawBench (introduced in [Imagen](https://imagen.research.google/)) benchmark dataset here: [sayakpaul/drawbench](https://huggingface.co/datasets/sayakpaul/drawbench). 
\n\nRefer to the `upsample_drawbench_captions.py` script for implementation details.\n\n## Images with and without caption upsampling\n\nAfter the DrawBench prompts were \"upsampled\", the `generate_images.py` script was used to generate images with the regular DrawBench prompts and the upsampled ones. You can find all the images here: [sayakpaul/drawbench-sdxl](https://huggingface.co/datasets/sayakpaul/drawbench-sdxl).\n\n## Additional examples\n\nThis section presents results generated using the SDXL Refiner and Kandinsky V2.2. These were generated using the scripts from the `additional_examples` directory.\n\n### SDXL Refiner \n\n\u003ctable\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e\u003cimg src=\"https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/caption-upsampling/refiner/sandwich.jpg\" alt=\"Sample Image 1\"\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003cimg src=\"https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/caption-upsampling/refiner/car_sheep.jpg\" alt=\"Sample Image 2\"\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003cimg src=\"https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/caption-upsampling/refiner/owl.jpg\" alt=\"Sample Image 3\"\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e\u003cb\u003eA white colored sandwich.\u003c/b\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003cb\u003eA white car and a red sheep.\u003c/b\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003cb\u003eA side view of an owl sitting in a field.\u003c/b\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e\u003cimg src=\"https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/caption-upsampling/refiner/upsampled_sandwich.jpg\" alt=\"Sample Image 4\"\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003cimg 
src=\"https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/caption-upsampling/refiner/upsampled_car_sheep.jpg\" alt=\"Sample Image 5\"\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003cimg src=\"https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/caption-upsampling/refiner/upsampled_owl.jpg\" alt=\"Sample Image 6\"\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e\u003cb\u003eA white-bread sandwich with delicate layers of fluffy turkey, crisp lettuce, and juicy tomatoes is placed on a wooden cutting board. The sandwich is surrounded by various condiments, including mayonnaise, mustard, and a small jar of pickles. The scene is set in a cozy kitchen, with natural light pouring in through a window.\u003c/b\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003cb\u003eA white car is parked on the side of a road in a green meadow. In the distance, a flock of red sheep can be seen grazing. The car seems to be abandoned, and the windows are shattered. The scene is eerie, and there is an unsettling feeling in the air.\u003c/b\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003cb\u003eA regal-looking snowy owl perches on a rocky outcropping, its feathers fluffed against the chilly wind. The bird's large, yellow eyes are fixed on a rabbit nibbling on some grass in the distance. 
The sun sets behind the owl, casting a warm orange glow over the landscape.\u003c/b\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\u003csub\u003eExplore more samples \u003ca href=\"https://huggingface.co/datasets/sayakpaul/drawbench-sdxl-refiner\"\u003ehere\u003c/a\u003e.\u003c/sub\u003e\n\u003cbr\u003e\u003cbr\u003e\n\n### Kandinsky V2.2\n\n\u003ctable\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e\u003cimg src=\"https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/caption-upsampling/kandinsky_v22/sandwich.jpg\" alt=\"Sample Image 1\"\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003cimg src=\"https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/caption-upsampling/kandinsky_v22/car_sheep.jpg\" alt=\"Sample Image 2\"\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003cimg src=\"https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/caption-upsampling/kandinsky_v22/owl.jpg\" alt=\"Sample Image 3\"\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e\u003cb\u003eA white colored sandwich.\u003c/b\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003cb\u003eA white car and a red sheep.\u003c/b\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003cb\u003eA side view of an owl sitting in a field.\u003c/b\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e\u003cimg src=\"https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/caption-upsampling/kandinsky_v22/upsampled_sandwich.jpg\" alt=\"Sample Image 4\"\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003cimg src=\"https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/caption-upsampling/kandinsky_v22/upsampled_car_sheep.jpg\" alt=\"Sample Image 5\"\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003cimg src=\"https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/caption-upsampling/kandinsky_v22/upsampled_owl.jpg\" alt=\"Sample Image 
6\"\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e\u003cb\u003eA white-bread sandwich with delicate layers of fluffy turkey, crisp lettuce, and juicy tomatoes is placed on a wooden cutting board. The sandwich is surrounded by various condiments, including mayonnaise, mustard, and a small jar of pickles. The scene is set in a cozy kitchen, with natural light pouring in through a window.\u003c/b\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003cb\u003eA white car is parked on the side of a road in a green meadow. In the distance, a flock of red sheep can be seen grazing. The car seems to be abandoned, and the windows are shattered. The scene is eerie, and there is an unsettling feeling in the air.\u003c/b\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003cb\u003eA regal-looking snowy owl perches on a rocky outcropping, its feathers fluffed against the chilly wind. The bird's large, yellow eyes are fixed on a rabbit nibbling on some grass in the distance. The sun sets behind the owl, casting a warm orange glow over the landscape.\u003c/b\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\u003csub\u003eExplore more samples \u003ca href=\"https://huggingface.co/datasets/sayakpaul/drawbench-kandinsky-v22\"\u003ehere\u003c/a\u003e.\u003c/sub\u003e\n\u003cbr\u003e\u003cbr\u003e\n\n## Limitations ⛔️\n\n1. Since SDXL uses CLIP, upsampled captions leading to more than 77 tokens will not be fully utilized. One way to remedy this would be to change the system prompt [here](https://github.com/sayakpaul/caption-upsampling/blob/c71388f39a9717c57faffcb14c0d9152c9d78657/upsample_drawbench_captions.py#L38) so that the underlying generation model is more length-aware.\n\n   This repository uses the prompt template from the DALL-E 3 technical report (Appendix C).\n\n2. DALL-E 3 conducts training on a recaptioned dataset where the captions were regenerated to be much more detailed using GPT-4. 
It then demonstrates the effectiveness of using detailed prompts during inference. However, existing works (as noted [here](#notes)) show that it's possible to improve the generation quality of existing systems like SDXL with detailed prompts even when they weren't specifically trained on datasets with very detailed captions.\n\n3. It's important to inspect the output of the language model that produces the descriptive captions, since it directly impacts the quality of the images. As mentioned above, the prompt template is the original one used in the DALL-E 3 report. However, different language models might respond differently to that template, so figuring out which template gives the best output most of the time is crucial.\n\n## Notes\n\nThe core idea of using detailed prompts to improve the quality of the generated samples has been explored before. Readers are welcome to check out the following resources in this regard:\n\n* \"Better prompt engineering\" section from [this doc](https://huggingface.co/docs/diffusers/main/en/stable_diffusion#better-prompt-engineering)\n* [lllyasviel/Fooocus](https://github.com/lllyasviel/Fooocus)\n\nAdditionally, [PixArt-Alpha](https://github.com/PixArt-alpha/PixArt-alpha) shows that fine-tuning on a dataset with highly detailed captions can lead to substantial quality improvements.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsayakpaul%2Fcaption-upsampling","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsayakpaul%2Fcaption-upsampling","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsayakpaul%2Fcaption-upsampling/lists"}