{"id":20435483,"url":"https://github.com/jacksonchen1998/image-to-prompts","last_synced_at":"2026-03-09T20:06:54.231Z","repository":{"id":153192927,"uuid":"626381388","full_name":"jacksonchen1998/Image-to-Prompts","owner":"jacksonchen1998","description":"A generative text-to-image model is a model that can generate an image from a text prompt.","archived":false,"fork":false,"pushed_at":"2023-07-06T05:34:39.000Z","size":5478,"stargazers_count":9,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-12T21:37:05.196Z","etag":null,"topics":["kaggle-competition","machine-learning","prompt","text-to-image","transformer"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jacksonchen1998.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-11T11:02:31.000Z","updated_at":"2025-03-27T02:02:04.000Z","dependencies_parsed_at":null,"dependency_job_id":"acbf6733-a055-47fd-a8d7-2d68999ee1ce","html_url":"https://github.com/jacksonchen1998/Image-to-Prompts","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/jacksonchen1998/Image-to-Prompts","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jacksonchen1998%2FImage-to-Prompts","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jacksonchen1998%2FImage-to-Prompts/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jacksonchen1998%2FImage-to
-Prompts/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jacksonchen1998%2FImage-to-Prompts/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jacksonchen1998","download_url":"https://codeload.github.com/jacksonchen1998/Image-to-Prompts/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jacksonchen1998%2FImage-to-Prompts/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30310066,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-09T20:05:46.299Z","status":"ssl_error","status_checked_at":"2026-03-09T19:57:04.425Z","response_time":61,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["kaggle-competition","machine-learning","prompt","text-to-image","transformer"],"created_at":"2024-11-15T08:34:44.594Z","updated_at":"2026-03-09T20:06:54.214Z","avatar_url":"https://github.com/jacksonchen1998.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Image to Prompt\n\n| [Code](./clipinterrogator-ofa-vit.ipynb) | [Slide](https://www.slideshare.net/jacksonChen22/imagetopromptspdf) | [Report](./Image_To_Prompts.pdf) |\n\nA generative text-to-image model is a model that can generate an image from a text prompt.\n\nThis repository is a final 
project for the course [EECM30064 Deep Learning](https://timetable.nycu.edu.tw/?r=main/crsoutline\u0026Acy=111\u0026Sem=2\u0026CrsNo=535361\u0026lang=zh-tw).\n\n## Contributors\n\n\u003ca href=\"https://github.com/jacksonchen1998/Image-to-Prompts/graphs/contributors\"\u003e\n  \u003cimg src=\"http://contributors.nn.ci/api?repo=jacksonchen1998/Image-to-Prompts\" /\u003e\n\u003c/a\u003e\n\n## Motivation and Background\n\n[Stable Diffusion - Image to Prompts](https://www.kaggle.com/competitions/stable-diffusion-image-to-prompts/overview) is a competition on Kaggle.\n\nThe goal of this competition is to reverse the typical direction of a generative text-to-image model: instead of generating an image from a text prompt, we want to create a model that predicts the text prompt given a generated image. Predictions are made on a dataset containing a wide variety of $\\verb|(prompt, image)|$ pairs generated by Stable Diffusion 2.0, in order to understand how reversible the latent relationship is.\n\nA sample image from the competition dataset and its corresponding prompt are shown below.\n\n\u003ctable\u003e\n    \u003ctr\u003e\n        \u003cth\u003e\n            \u003ccenter\u003eImage\u003c/center\u003e\n        \u003c/th\u003e\n        \u003cth\u003e\n            \u003ccenter\u003ePrompt\u003c/center\u003e\n        \u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e\n            \u003ccenter\u003e\u003cimg src=\"./images/92e911621.png\" width=\"200\" height=\"200\"\u003e\u003c/center\u003e\n        \u003c/td\u003e\n        \u003ctd\u003e\n            \u003ccenter\u003e\u003ccode\u003eultrasaurus holding a black bean taco in the woods, near an identical cheneosaurus\u003c/code\u003e\u003c/center\u003e\n        \u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\n## Methodology\n\nOur method ensembles the CLIP Interrogator, an OFA model, and a ViT model.\n\nEach model contributes to the ensemble in the following ratio:\n- Vision Transformer (ViT) 
model: 74.88%\n- CLIP Interrogator: 21.12%\n- OFA model fine-tuned for image captioning: 4%\n\n## Application and Datasets\n\n### Application\n\nBased on the Kaggle competition, we want to build a model that predicts the prompts used to generate the target images.\n\n### Datasets\n\nPrompts for this challenge were generated using a variety of undisclosed methods, and range from fairly simple to fairly complex with multiple objects and modifiers.\n\nImages were generated from the prompts using Stable Diffusion 2.0 (`768-v-ema.ckpt`) with 50 steps at $768 \\times 768$ px, then downsized to $512 \\times 512$ px for the competition dataset. The hidden re-run test folder contains approximately 16,000 images.\n\n## References\n\n[1] [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/pdf/2103.00020.pdf)\n\n[2] [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/pdf/2010.11929.pdf)\n\n[3] [Very Deep Convolutional Networks for Large-Scale Image Recognition](https://arxiv.org/pdf/1409.1556.pdf)\n\n[4] [SentenceTransformers](https://www.sbert.net/)\n\n[5] [CLIPInterrogator + OFA + ViT](https://www.kaggle.com/code/motono0223/clipinterrogator-ofa-vit)\n\n[6] [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/pdf/2103.14030.pdf)\n\n[7] [CoCa: Contrastive Captioners are Image-Text Foundation Models](https://arxiv.org/pdf/2205.01917.pdf)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjacksonchen1998%2Fimage-to-prompts","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjacksonchen1998%2Fimage-to-prompts","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjacksonchen1998%2Fimage-to-prompts/lists"}