{"id":20837766,"url":"https://github.com/astrazeneca/mcpl","last_synced_at":"2025-05-08T20:30:19.106Z","repository":{"id":231750821,"uuid":"777264974","full_name":"AstraZeneca/MCPL","owner":"AstraZeneca","description":"Official implementation for \"An image is worth multiple words: discovering object level concepts using multi-concepts prompts learning\" [ICML 2024]]","archived":false,"fork":false,"pushed_at":"2024-07-16T14:09:34.000Z","size":61442,"stargazers_count":14,"open_issues_count":3,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-31T17:59:07.113Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AstraZeneca.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS.md","dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-25T14:22:34.000Z","updated_at":"2025-03-10T02:32:35.000Z","dependencies_parsed_at":"2024-07-12T18:59:04.806Z","dependency_job_id":null,"html_url":"https://github.com/AstraZeneca/MCPL","commit_stats":null,"previous_names":["astrazeneca/mcpl"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraZeneca%2FMCPL","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraZeneca%2FMCPL/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraZeneca%2FMCPL/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraZeneca%2FMCPL/manifests","owner_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/owners/AstraZeneca","download_url":"https://codeload.github.com/AstraZeneca/MCPL/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253144486,"owners_count":21861068,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-18T01:08:32.043Z","updated_at":"2025-05-08T20:30:19.068Z","avatar_url":"https://github.com/AstraZeneca.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# An Image is Worth Multiple Words: Discovering Object Level Concepts using Multi-Concept Prompt Learning (ICML 2024)\r\n\r\n\u003ca href=\"https://astrazeneca.github.io/mcpl.github.io/\"\u003e\u003cimg src=\"https://img.shields.io/static/v1?label=Project\u0026message=Website\u0026color=blue\"\u003e\u003c/a\u003e\r\n\u003ca href=\"https://arxiv.org/abs/2310.12274\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2305.16311-b31b1b.svg\"\u003e\u003c/a\u003e\r\n\u003ca href=\"https://www.youtube.com/watch?v=EXnyT-JVG5U\"\u003e\u003cimg src=\"https://img.shields.io/static/v1?label=YouTube\u0026message=Video\u0026color=orange\"\u003e\u003c/a\u003e\r\n\u003ca href=\"https://www.apache.org/licenses/LICENSE-2.0.txt\"\u003e\u003cimg src=\"https://img.shields.io/badge/License-Apache-yellow\"\u003e\u003c/a\u003e\r\n\u003ca href=\"https://pytorch.org/\"\u003e\u003cimg src=\"https://img.shields.io/badge/PyTorch-\u003e=1.10.0-Red?logo=pytorch\"\u003e\u003c/a\u003e\r\n[![Hugging Face 
Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/papers/2310.12274)\r\n![Maturity level-0](https://img.shields.io/badge/Maturity%20Level-ML--0-red)\r\n\r\n![teaser](docs/teaser_tldr.jpg)\r\n\u003ca href=\"https://astrazeneca.github.io/mcpl.github.io/\"\u003e\u003cimg src=\"docs/MCPL-ICML-2024-ext-sim.gif\" /\u003e\u003c/a\u003e\r\n\r\n\u003e \u003ca href=\"https://astrazeneca.github.io/mcpl.github.io\"\u003e**An Image is Worth Multiple Words: Discovering Object Level Concepts using Multi-Concept Prompt Learning (ICML 2024)**\u003c/a\u003e\r\n\u003e\r\n\u003e \u003ca href=\"https://chenjin.netlify.app/\"\u003eChen Jin\u003csup\u003e1\u003c/sup\u003e\u003c/a\u003e \r\n    \u003ca href=\"https://rt416.github.io/\"\u003eRyutaro Tanno\u003csup\u003e2\u003c/sup\u003e\u003c/a\u003e \r\n    \u003ca href=\"https://uk.linkedin.com/in/amrutha-saseendran\"\u003eAmrutha Saseendran\u003csup\u003e1\u003c/sup\u003e \u003c/a\u003e \r\n    \u003ca href=\"https://tomdiethe.com/\"\u003eTom Diethe\u003csup\u003e1\u003c/sup\u003e\u003c/a\u003e \r\n    \u003ca href=\"https://uk.linkedin.com/in/philteare\"\u003ePhilip Teare\u003csup\u003e1\u003c/sup\u003e\u003c/a\u003e\u003cbr\u003e\r\n\u003e\r\n\u003e Multi-Concept Prompt Learning (MCPL) pioneers **mask-free** **text-guided** learning for multiple prompts from **one scene**. Our approach not only enhances current methodologies but also paves the way for novel applications, such as facilitating knowledge discovery through natural language-driven interactions between humans and machines.\r\n\r\n## Motivation\r\n\u003cimg src=\"docs/041_motivation_combined.jpg\"/\u003e\r\nWe use Textual Inversion (T.I.) 
to learn concepts from either masked (left-first) or cropped (left-second) images; MCPL-one learns both concepts jointly from the full image with a single string; and MCPL-diverse accounts for per-image specific relationships.\r\n\r\n\u003cimg src=\"docs/042_MCv12_vs_CL_AttnMask_v2_sim.jpg\"/\u003e\r\nNaively learning multiple text embeddings from a single image-sentence pair without imagery guidance leads to misalignment in per-word cross-attention (top). We propose three regularisation terms to enhance the accuracy of prompt-object level correlation (bottom).\r\n\r\n## Method\r\n\u003cimg src=\"docs/method.jpg\"/\u003e\r\n\r\nInput images from our [natural_2_concepts](dataset/natural_2_concepts) dataset.\r\n\r\n## Applications\r\n\r\n### Multiple concepts from single image\r\n\u003cimg src=\"docs/editing_examples_single_image.jpg\"/\u003e\r\n\r\nInput images from our [natural_2_concepts](dataset/natural_2_concepts) dataset.\r\n\r\n\r\n### Per-image different multiple concepts \r\n\u003cimg src=\"docs/editing_examples_per_image.jpg\"/\u003e\r\n\r\nInput images from [P2P demo images](https://github.com/google/prompt-to-prompt?tab=readme-ov-file).\r\n\r\n\r\n### Out-of-Distribution concept discovery and hypothesis generation\r\n\u003cimg src=\"docs/editing_examples_medical.jpg\"/\u003e\r\n\r\nInput images from the [LGE CMR](https://www.sciencedirect.com/science/article/pii/S1361841516000050) and [MIMIC-CXR](https://www.nature.com/articles/s41597-019-0322-0) datasets.\r\n\r\n## Dataset\r\nWe generated and collected a [Multi-Concept-Dataset](dataset) with a total of ~1400 images and masked objects/concepts, as follows:\r\n\r\n  /  (370 images)\r\n  /natural_2_concepts  \r\n  /natural_345_concepts  \r\n  /real_natural_concepts\r\n\r\n| Data file name | Size | # of images |\r\n| --- | --- | ---: |\r\n| [medical_2_concepts](dataset/medical_2_concepts/) | 2.5M | 370 |\r\n| [natural_2_concepts](dataset/natural_2_concepts/) | 36M | 415 |\r\n| 
[natural_345_concepts](dataset/natural_345_concepts/) | 13M | 525 |\r\n| [real_natural_concepts](dataset/real_natural_concepts/) | 5.6M | 137 |\r\n\r\n\r\n## Setup\r\n\r\nOur code builds on, and shares requirements with, [Latent Diffusion Models (LDM)](https://github.com/CompVis/latent-diffusion). To set up their environment, please run:\r\n\r\n```\r\nconda env create -f environment.yml\r\nconda activate ldm\r\n```\r\n\r\n```\r\npip install -e git+https://github.com/CompVis/taming-transformers.git@master#egg=taming-transformers\r\ncd ./src/taming-transformers\r\npip install -e .\r\n```\r\n\r\nYou will also need the official LDM text-to-image checkpoint, available through the [LDM project page](https://github.com/CompVis/latent-diffusion). \r\n\r\nCurrently, the model can be downloaded by running:\r\n\r\n```\r\nmkdir -p models/ldm/text2img-large/\r\nwget -O models/ldm/text2img-large/model.ckpt https://ommer-lab.com/files/latent-diffusion/nitro/txt2img-f8-large/model.ckpt\r\n```\r\n\r\n\r\n## Learning \r\n\r\n### MCPL-all: a naive approach that learns embeddings for all prompts in the string (including adjectives, prepositions and nouns, 
etc.)\r\n- specify the placeholder_string to describe your multi-concept images;\r\n- in presudo_words, list every word in the placeholder_string so that all of them are learned;\r\n\r\n```\r\npython main.py --base configs/latent-diffusion/txt2img-1p4B-finetune.yaml \\\r\n                -t \\\r\n                --actual_resume \u003c/path/to/pretrained/model.ckpt\u003e \\\r\n                -n \u003crun_name\u003e \\\r\n                --gpus 0, \\\r\n                --data_root \u003c/path/to/directory/with/images\u003e \\\r\n                --init_word \u003cinitialization_word\u003e \\\r\n                --placeholder_string 'green * and orange @' \\\r\n                --presudo_words 'green,*,and,orange,@'\r\n```\r\n\r\n### MCPL-one: simplifies the objective by learning a single prompt (noun) per concept\r\n- in this case, presudo_words lists only the subset of words in the placeholder_string to be learned;\r\n\r\n```\r\npython main.py --base configs/latent-diffusion/txt2img-1p4B-finetune.yaml \\\r\n                -t \\\r\n                --actual_resume \u003c/path/to/pretrained/model.ckpt\u003e \\\r\n                -n \u003crun_name\u003e \\\r\n                --gpus 0, \\\r\n                --data_root \u003c/path/to/directory/with/images\u003e \\\r\n                --init_word \u003cinitialization_word\u003e \\\r\n                --placeholder_string 'green * and orange @' \\\r\n                --presudo_words '*,@'\r\n```\r\n\r\n### MCPL-diverse: different strings are learned per image to observe variances among examples\r\n- before starting, name each training image with a single word representing the relation; \r\n- e.g. 
in the ball and box experiment, we train with: \u003c'front.jpg,  next.jpg,  on.jpg,  under.jpg'\u003e;\r\n- in placeholder_string we describe the multi-concept scene, and use 'RELATE' as a placeholder for the relationship between the concepts;\r\n- in presudo_words, we specify all words to be learnt, including the relations; the per-image relation is injected by replacing 'RELATE' with the relation given by each image's file name;\r\n\r\n```\r\npython main.py --base configs/latent-diffusion/txt2img-1p4B-finetune.yaml \\\r\n                -t \\\r\n                --actual_resume \u003c/path/to/pretrained/model.ckpt\u003e \\\r\n                -n \u003crun_name\u003e \\\r\n                --gpus 0, \\\r\n                --data_root \u003c/path/to/directory/with/images\u003e \\\r\n                --init_word \u003cinitialization_word\u003e \\\r\n                --placeholder_string 'green * RELATE orange @' \\\r\n                --presudo_words '*,@,on,under,next,front'\r\n```\r\n\r\n### Regularisation-1: adding PromptCL and Bind adjective (teddybear skateboard example)\r\n\r\n```\r\npython main.py --base configs/latent-diffusion/txt2img-1p4B-finetune.yaml \\\r\n                -t \\\r\n                --actual_resume \u003c/path/to/pretrained/model.ckpt\u003e \\\r\n                -n \u003crun_name\u003e \\\r\n                --gpus 0, \\\r\n                --data_root \u003c/path/to/directory/with/images\u003e \\\r\n                --init_word \u003cinitialization_word\u003e \\\r\n                --placeholder_string 'a brown @ on a rolling * at times square' \\\r\n                --presudo_words 'a,brown,on,rolling,at,times,square,@,*' \\\r\n                --attn_words 'brown,rolling,@,*' \\\r\n                --presudo_words_softmax '@,*' \\\r\n                --presudo_words_infonce '@,*' \\\r\n                --infonce_temperature 0.2 \\\r\n                --infonce_scale 0.0005 \\\r\n                --adj_aug_infonce 'brown,rolling' \\\r\n                --attn_mask_type 
'skip'\r\n```\r\n\r\n### Regularisation-2: adding PromptCL, Bind adjective and Attention Mask (teddybear skateboard example)\r\n\r\n```\r\npython main.py --base configs/latent-diffusion/txt2img-1p4B-finetune.yaml \\\r\n                -t \\\r\n                --actual_resume \u003c/path/to/pretrained/model.ckpt\u003e \\\r\n                -n \u003crun_name\u003e \\\r\n                --gpus 0, \\\r\n                --data_root \u003c/path/to/directory/with/images\u003e \\\r\n                --init_word \u003cinitialization_word\u003e \\\r\n                --placeholder_string 'a brown @ on a rolling * at times square' \\\r\n                --presudo_words 'a,brown,on,rolling,at,times,square,@,*' \\\r\n                --attn_words 'brown,rolling,@,*' \\\r\n                --presudo_words_softmax '@,*' \\\r\n                --presudo_words_infonce '@,*' \\\r\n                --infonce_temperature 0.3 \\\r\n                --infonce_scale 0.00075 \\\r\n                --adj_aug_infonce 'brown,rolling'\r\n```\r\n\r\n\r\n## Generation\r\n\r\nTo generate new images of the learned concepts, run:\r\n```\r\npython scripts/txt2img.py --ddim_eta 0.0 \\\r\n            --n_samples 8 \\\r\n            --n_iter 2 \\\r\n            --scale 10.0 \\\r\n            --ddim_steps 50 \\\r\n            --embedding_path /path/to/logs/trained_model/checkpoints/embeddings_gs-6099.pt \\\r\n            --ckpt_path /path/to/pretrained/model.ckpt \\\r\n            --prompt \"a photo of green * and orange @\"\r\n```\r\n\r\nwhere * and @ are the placeholder strings used during inversion.\r\n\r\n## Code structure\r\nOur code builds on the code from the [Textual Inversion](https://github.com/rinongal/textual_inversion) ([MIT licence](https://github.com/rinongal/textual_inversion?tab=MIT-1-ov-file#readme)) library as well as the [Prompt-to-Prompt](https://github.com/google/prompt-to-prompt/) ([Apache-2.0 licence](https://github.com/google/prompt-to-prompt/?tab=Apache-2.0-1-ov-file#readme)) codebase.\r\n\r\nThe 
majority of modifications are made in the following files, where we provide docstrings for all functions: \r\n```\r\n./main.py\r\n./src/p2p/p2p_ldm_utils.py\r\n./src/p2p/ptp_utils.py\r\n./ldm/modules/embedding_manager.py\r\n./ldm/models/diffusion/ddpm.py\r\n```\r\nThe remaining library files are mostly unchanged and inherited from prior work.\r\n\r\n## FAQ\r\n\r\n**BERT tokenizer error**\r\nYou may occasionally get the following error because the tokenizer splits a word into more than one token; if so, simply try a different word with a similar meaning.\r\nFor example, for the error below, replacing 'peachy' in your prompt with 'splendid' would resolve the issue.\r\n```\r\nFile \"/YOUR-HOME-PATH/MCPL/ldm/modules/embedding_manager.py\", line 22, in get_bert_token_for_string\r\n    assert torch.count_nonzero(token) == 3, f\"String '{string}' maps to more than a single token. Please use another string\"\r\nAssertionError: String 'peachy' maps to more than a single token. Please use another string\r\n```\r\n\r\n## Citation\r\n\r\nIf you make use of our work, please cite our paper:\r\n\r\n```\r\n@inproceedings{\r\nanonymous2024an,\r\ntitle={An Image is Worth Multiple Words: Discovering Object Level Concepts using Multi-Concept Prompt Learning},\r\nauthor={Anonymous},\r\nbooktitle={Forty-first International Conference on Machine Learning},\r\nyear={2024},\r\nurl={https://openreview.net/forum?id=F3x6uYILgL}\r\n}\r\n```\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fastrazeneca%2Fmcpl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fastrazeneca%2Fmcpl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fastrazeneca%2Fmcpl/lists"}