{"id":20515425,"url":"https://github.com/snap-research/myvlm","last_synced_at":"2025-04-14T21:09:39.965Z","repository":{"id":229073514,"uuid":"775148130","full_name":"snap-research/MyVLM","owner":"snap-research","description":"Official Implementation for \"MyVLM: Personalizing VLMs for User-Specific Queries\" (ECCV 2024)","archived":false,"fork":false,"pushed_at":"2024-07-05T08:06:09.000Z","size":49642,"stargazers_count":167,"open_issues_count":7,"forks_count":11,"subscribers_count":13,"default_branch":"master","last_synced_at":"2025-04-14T21:09:29.644Z","etag":null,"topics":["personalization","vision-language-models"],"latest_commit_sha":null,"homepage":"https://snap-research.github.io/MyVLM/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/snap-research.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-20T21:06:57.000Z","updated_at":"2025-03-27T04:31:36.000Z","dependencies_parsed_at":"2024-03-27T07:25:43.317Z","dependency_job_id":"82240877-432e-4f8f-921e-f92e2559516a","html_url":"https://github.com/snap-research/MyVLM","commit_stats":null,"previous_names":["snap-research/myvlm"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/snap-research%2FMyVLM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/snap-research%2FMyVLM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/snap-research%2FMyVLM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repo
sitories/snap-research%2FMyVLM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/snap-research","download_url":"https://codeload.github.com/snap-research/MyVLM/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248961237,"owners_count":21189993,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["personalization","vision-language-models"],"created_at":"2024-11-15T21:21:39.776Z","updated_at":"2025-04-14T21:09:39.930Z","avatar_url":"https://github.com/snap-research.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MyVLM: Personalizing VLMs for User-Specific Queries (ECCV 2024)\n\n\u003e Recent large-scale vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and generating textual descriptions for visual content. However, these models lack an understanding of user-specific concepts. In this work, we take a first step toward the personalization of VLMs, enabling them to learn and reason over user-provided concepts. For example, we explore whether these models can learn to recognize you in an image and communicate what you are doing, tailoring the model to reflect your personal experiences and relationships. To effectively recognize a variety of user-specific concepts, we augment the VLM with external concept heads that function as toggles for the model, enabling the VLM to identify the presence of specific target concepts in a given image. 
Having recognized the concept, we learn a new concept embedding in the intermediate feature space of the VLM. This embedding is tasked with guiding the language model to naturally integrate the target concept in its generated response. We apply our technique to BLIP-2 and LLaVA for personalized image captioning and further show its applicability for personalized visual question-answering. Our experiments demonstrate our ability to generalize to unseen images of learned concepts while preserving the model behavior on unrelated inputs.\n\n\u003ca href=\"https://arxiv.org/abs/2403.14599\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2403.14599-b31b1b.svg\"\u003e\u003c/a\u003e\n\u003ca href=\"https://snap-research.github.io/MyVLM/\"\u003e\u003cimg src=\"https://img.shields.io/static/v1?label=Project\u0026message=Website\u0026color=red\" height=20.5\u003e\u003c/a\u003e \n\u003ca href=\"https://www.youtube.com/watch?v=8Fy17kK_aZY\u0026ab_channel=YuvalAlaluf\"\u003e\u003cimg src=\"https://img.shields.io/static/v1?label=1 Minute\u0026message=Video\u0026color=darkgreen\" height=20.5\u003e\u003c/a\u003e \n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"docs/teaser.jpg\" width=\"800px\"/\u003e  \n\u003cbr\u003e\nGiven a set of images depicting user-specific concepts such as \u0026lt;you\u0026gt;, \u0026lt;your-dog\u0026gt;, and \u0026lt;your-friend\u0026gt; (left), we teach a pretrained vision-language model (VLM) to understand and reason over these concepts. First, we enable the model to generate personalized captions incorporating the concept into its output text (middle). 
We further allow the user to ask subject-specific questions about these concepts, querying the model with questions such as \"What are \u0026lt;you\u0026gt; doing?\" or \"What is my \u0026lt;your-friend\u0026gt; wearing?\" (right).\n\u003c/p\u003e\n\n\n# Description  \nOfficial implementation of our MyVLM personalization paper.\n\n\n# Table of Contents\n- [Description](#description)\n- [Table of Contents](#table-of-contents)\n- [Setup](#setup)\n  * [Environment](#environment)\n- [Dataset \u0026 Pretrained Concept Heads](#dataset---pretrained-concept-heads)\n- [Concept Heads](#concept-heads)\n- [Concept Embeddings](#concept-embeddings)\n  * [Data Setup](#data-setup)\n    + [Optional: Generating Additional VQA Data for LLaVA](#--optional--generating-additional-vqa-data-for-llava--)\n  * [Concept Embedding Training](#concept-embedding-training)\n    + [BLIP-2: Captioning](#blip-2--captioning)\n    + [LLaVA: Captioning \u0026 VQA](#llava--captioning---vqa)\n    + [MiniGPT-v2: Captioning \u0026 Referring Expression Comprehension](#minigpt-v2--captioning---referring-expression-comprehension)\n- [Inference](#inference)\n  * [Original VLM Captioning](#original-vlm-captioning)\n  * [MyVLM Inference](#myvlm-inference)\n- [Acknowledgements](#acknowledgements)\n- [Citation](#citation)\n\n\n# Setup\n\n## Environment\nTo set up the environment with all necessary dependencies, please run:\n```\nconda env create -f environment/environment.yaml\nconda activate myvlm\n```\n\n# Dataset \u0026 Pretrained Concept Heads\nAs part of our code release, we have also released our object dataset introduced in the paper. This contains 29 user-specific objects, each containing ~10 images and 5 corresponding personalized captions for each image. 
\n\nThe full dataset can be downloaded from [Google Drive](https://drive.google.com/drive/folders/1dxjwYVAmBRWLeqUjWsR8cWdqMvfsqW79?usp=sharing) or [HuggingFace](https://huggingface.co/datasets/yuvalalaluf/MyVLM).\nPlease note that our dataset is available under a non-commercial license (see `LICENSE` for details.)\n\nA pretrained concept head and concept embedding for each object can also be downloaded from [Google Drive](https://drive.google.com/drive/folders/1qnBWDv1l9JFEG_7HGPtfnZyhBerRknP9?usp=sharing) or [HuggingFace](https://huggingface.co/yuvalalaluf/MyVLM), also under the same license.  \nThese can be loaded using the `CLIPConceptHead` class and our inference scripts for reproducing the paper results.\n\n\n# Concept Heads\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"docs/concept_head.jpg\" width=\"200px\"/\u003e  \n\u003cbr\u003e\nFor each user-specific concept, we introduce an external concept head designed to identify the presence of the concept within an image.\n\u003c/p\u003e\n\n\nAs mentioned in the paper, we have two types of concept heads: \n1. A facial recognition model for recognizing individuals\n2. A CLIP-based concept head for recognizing user-specific objects\n\nFor faces, we use the `buffalo_l` face detection and face recognition model from [insightface](https://github.com/deepinsight/insightface/tree/master).\nSee `concept_heads/face_recognition/head.py` for usage.\n\nFor objects, we train a single linear layer over features extracted from a CLIP ViT-H/14 model (`DFN5B-CLIP-ViT-H-14-384`).\nSee `concept_heads/clip/head.py` for usage. \n\nTraining a new concept head for objects can be done using `concept_heads/clip/train.py`, using the following command: \n```\npython concept_heads/clip/train.py \\\n--config_path example_configs/concept_head_training.yaml\n```\nPlease see `concept_heads/clip/config.py` for all available parameters. The important parameters are: \n1. 
`concept_name`: the name of the concept we wish to train a concept head for.\n2. `positive_samples_path`: the directory containing all the positive samples. This directory should contain a sub-directory with\n   the concept name (e.g., `\u003cpositive_samples_path\u003e/\u003cconcept_name\u003e`).\n3. `negative_samples_path`: similar to above, the directory containing all the negative samples. This can be a set of negative samples that are downloaded from the internet, as was done in the paper.\n4. `n_positive_samples`: how many positive samples we want to sample from the positive samples directory. By default, we use `4` samples.  \nAll remaining parameters can likely remain using the default values.\n\nLogs will be saved to `\u003coutput_dir\u003e/\u003cconcept_name\u003e`. Please note that we only save the parameters of the \ntrained linear layer, and not the entire model.\n\n\n# Concept Embeddings\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"docs/method.jpg\" width=\"800px\"/\u003e  \n\u003cbr\u003e\nHaving identified the presence of a user-specific concept within an image, a learned concept embedding representing an object or individual is used to guide the LLM in incorporating the concept into its personalized textual response.\n\u003c/p\u003e\n\n## Data Setup\nFirst, we describe how to prepare your data for training the concept embedding.\nYour data should be organized using the following structure:\n```\ndata_root\n├── \u003cconcept_name\u003e\n│   ├── \u003cimage1\u003e.jpg\n│   ├── \u003cimage2\u003e.jpg\n│   ├── ...\n│   ├── captions.json (or captions_augmented.json)\n│   └── additional_llava_vqa_data.json  (optional, used for personalized VQA using LLaVA, see next section).\n└── \u003cconcept_name_2\u003e\n```\nThat is, the root directory should contain a sub-directory for each concept. Then, in each concept directory, you should have:\n1. the set of images we want to use either for training or inference.\n2. 
a `json` file containing the captions for each image, named `captions.json` or `captions_augmented.json`. \nThis file should be in the following format:\n```\n{\n    \"\u003cimage1\u003e.jpg\": [\"\u003ccaption1\u003e\", \"\u003ccaption2\u003e\", ...],\n    \"\u003cimage2\u003e.jpg\": [\"\u003ccaption1\u003e\", \"\u003ccaption2\u003e\", ...],\n    ...\n}\n```\nThat is, we have a dictionary mapping each image path to a list of target captions. \nAs described in the paper, at each optimization step we will randomly sample a caption from this list to use as the target caption for the image.\n\n\n### Optional: Generating Additional VQA Data for LLaVA\nIf you wish to perform personalized VQA using LLaVA as presented in the paper, you should also have an additional `json` file \ncontaining additional questions and answers for each image. \n\nThis can be created using the script `inference/generate_augmented_vqa_data.py`, which will save everything in the correct format.\n\nFor example, you can run: \n```\n# --concept_type is either OBJECT or PERSON\npython inference/generate_augmented_vqa_data.py \\\n--concept_name \"Alex\" \\\n--concept_identifier \"Alex\" \\\n--concept_class \"the man\" \\\n--concept_type PERSON \\\n--images_root /path/to/images\n```\nPlease see `generate_augmented_vqa_data.py` for more information on the arguments that should be passed to the script.\n\n\n## Concept Embedding Training\nBelow, we detail how to train the concept embedding for the various VLM models and personalization tasks \npresented in the paper.\n\nTraining embeddings for all VLMs and all personalization tasks can be done using the same script:\n```bash\npython concept_embedding_training/train.py \\\n--config_path example_configs/\u003cVLM_TYPE\u003e/\u003cCONFIG_NAME\u003e\n```\nwhere you should change `\u003cVLM_TYPE\u003e` to the VLM model you wish to train the concept embedding for and `\u003cCONFIG_NAME\u003e` to the name of the configuration file you wish to use. 
You can take a look at the various examples we have prepared in `example_configs`.\n\nPlease see `configs/train_config.py` for all available parameters. The important parameters that should be modified are: \n1. `concept_name`: the name of the concept we wish to train the embedding for.\n2. `concept_identifier`: the identifier we wish to use for the concept (e.g., `sks` for objects or a short name for people, e.g., `Bob` or `Anna`).\n      - We will replace all instances of `concept_name` in the caption with `concept_identifier` during training. \n      - If working with people, we will replace all instances of `concept_name.title()` with `concept_identifier`. \n3. `concept_type`: which type of concept we are training for, either an `OBJECT` or `PERSON`.\n4. `vlm_type`: which VLM we want to use, which should be one of `BLIP2`, `LLAVA`, or `MINIGPT_V2`.\n5. `personalization_task`: which personalization task we want to perform. Please note that we have currently tested only the following \n    combinations of VLMs and personalization tasks:\n    - `BLIP2`: `CAPTIONING`\n    - `LLAVA`: `CAPTIONING`, `VQA`\n    - `MINIGPT_V2`: `CAPTIONING`  \n    You may be able to try different combinations, but these have not been tested.\n6. `output_root`: where to save the outputs to. The exact output directory will be: `\u003coutput_root\u003e/\u003cconcept_name\u003e/seed_\u003cseed\u003e`.\n7. `data_root`: the path to the data. Please see above for how the data should be organized.\n8. `concept_head_path`: if working with objects, this should be the directory holding all the concept heads. This should have \n    a sub-directory for the concept we want to personalize.\n9. `optimization_steps`: how many optimization steps to perform. Please see the Appendix of the paper for the number of steps we used for each VLM and task.\n10. `batch_size`: the batch size to use for training. We typically use between `1` and `4` training samples.\n11. 
`reg_lambda`: the lambda value to use for the attention-based regularization. These are already defined to their correct values in the example configs.\n12. `seed`: random seed used for training. If working with objects, this should be the same seed used for training the concept head.\n\nThe remaining parameters can likely remain using the default values, but please see `configs/train_config.py` for more details.\n\nThis training script will save two outputs: \n1. A `pt` file containing the checkpoints of the trained concept embedding. This will be saved to `concept_embeddings_\u003cVLM_TYPE\u003e_\u003cTASK\u003e.pt`.  \n    This is saved in the following format:\n    ```\n    {\n      10: {\n        \"keys\": torch.Tensor(),    # the keys used for optimizing the concept embedding\n        \"values\": torch.Tensor(),  # the concept embedding itself\n      },\n      ...\n      20: {\n        \"keys\": torch.Tensor(),    \n        \"values\": torch.Tensor(),  \n      },\n      ...\n    }\n    ```\n    where each entry in the dictionary represents a different checkpoint during the optimization process.\n\n2. A `json` file containing the inference results on all validation images, named `inference_outputs_\u003cVLM_TYPE\u003e_\u003cTASK\u003e.json`.\n   - By default, we will run on the language instructions defined in `myvlm/common.py` (in `VLM_TO_PROMPTS`). Feel free to expand this list to include more prompts. 
\n   - Please see [MyVLM Inference](#myvlm-inference) for more details on the output format.\n\n\n### BLIP-2: Captioning\nFor BLIP-2, we provide support for personalized captioning using:\n```bash\npython concept_embedding_training/train.py \\\n--config_path example_configs/blip2/concept_embedding_training_captioning.yaml\n```\n\n\n### LLaVA: Captioning \u0026 VQA\nFor training the embedding for personalized captioning with LLaVA, please run: \n```bash\npython concept_embedding_training/train.py \\\n--config_path example_configs/llava/concept_embedding_training_captioning.yaml\n```\n\nFor personalized VQA with LLaVA, please run: \n```bash\npython concept_embedding_training/train.py \\\n--config_path example_configs/llava/concept_embedding_training_vqa.yaml\n```\nTo match the scheme used in the paper, please make sure that you create the additional VQA data as described above \n([Optional: Generating Additional VQA Data for LLaVA](#--optional--generating-additional-vqa-data-for-llava--)). The training process will load this data and integrate it into the optimization process.\n\n\n### MiniGPT-v2: Captioning \u0026 Referring Expression Comprehension\n\nBefore running MyVLM with MiniGPT-v2, you need to perform the following steps: \n1. Update the `HF_TOKEN_FOR_LLAMA` field in `myvlm/common.py` to your Hugging Face API token. This is required for downloading the Llama-2 LLM from Hugging Face. If you do not yet have access to Llama-2, you can request it [here](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf).\n2. 
Download the pretrained `minigptv2_checkpoint.pth` model from [here](https://drive.google.com/file/d/1aVbfW7nkCSYx99_vCRyP1sOlQiWVSnAl/view).\n    - This is taken from the official MiniGPT-v2 repository [here](https://github.com/Vision-CAIR/MiniGPT-4).\n    - After downloading the model, update the path defined in `myvlm/common.py` in the variable `MINIGPT_V2_CKPT_PATH`.\n\n\nFor personalized captioning using MiniGPT-v2, please follow: \n```bash\npython concept_embedding_training/train.py \\\n--config_path example_configs/minigpt_v2/concept_embedding_training_captioning.yaml\n```\nYou may need to increase the number of iterations for personalized captioning with MiniGPT-v2. \nThis will perform inference on both the captioning and referring expression comprehension personalization tasks.\n\n\n# Inference\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"docs/background.jpg\" width=\"800px\"/\u003e  \n\u003cbr\u003e\nVLMs possess \u003ci\u003egeneric\u003c/i\u003e knowledge, lacking a personal touch. With MyVLM we equip these models with the ability to comprehend user-specific concepts, tailoring the model specifically to \u003ci\u003eyou\u003c/i\u003e. 
\nMyVLM allows users to obtain personalized responses where outputs are no longer generic, but focus on communicating information about the target subject to the user.\n\u003c/p\u003e\n\n\n## Original VLM Captioning\nIf you wish to run captioning using the original VLMs, you can do so using the following command: \n```bash\npython inference/generate_original_captions.py \\\n--images_root /path/to/images \\\n--vlm_type \u003cVLM_TYPE\u003e\n```\nwhere `\u003cVLM_TYPE\u003e` is one of `BLIP2`, `LLAVA`, or `MINIGPT_V2`.\n\nYou can also run inference using: \n```bash\npython inference/generate_original_captions.py \\\n--config_path example_configs/inference/original_vlm_inference.yaml\n```\n\nPlease note that this script can likely be extended to run inference on other tasks/prompts by changing the input language \ninstruction that is defined in Line 56: \n```python\ninputs = vlm_wrapper.preprocess(image_path, prompt=VLM_TYPE_TO_PROMPT[cfg.vlm_type])\n```\n\n\n## MyVLM Inference\nAfter training the concept heads and concept embeddings, you can run inference on a new set of images using: \n```bash\npython inference/run_myvlm_inference.py \\\n--config_path example_configs/inference/myvlm_inference.yaml\n```\n\nAll parameters are defined in `configs/inference_config.py` and closely follow the parameters defined in the training config detailed above.\nThe main parameters that should be modified are:\n1. `concept_name`: same as above.\n2. `concept_identifier`: same as above.\n3. `concept_type`: same as above.\n4. `vlm_type`: same as above.\n5. `personalization_task`: same as above.\n6. `image_paths`: either (1) a list of paths we want to run inference on; or (2) the directory containing the images we want to run inference on.\n7. `checkpoint_path`: the path to all the trained concept embeddings. This should contain a sub-directory for each concept and seed (e.g., `\u003coutput_root\u003e/\u003cconcept_name\u003e/seed_\u003cseed\u003e`).\n8. 
`concept_head_path`: if working with objects, this should be the directory holding all the concept heads and seeds (e.g., `\u003cconcept_head_path\u003e/\u003cconcept_name\u003e/seed_\u003cseed\u003e`).\n9. `seed`: random seed. This should be the same as used for the concept head and embedding training.\n10. `iterations`: which optimization steps to run inference on. If `None`, we will run on all the checkpoints that were saved during the optimization process.\n11. `prompts`: a list of strings defining the prompts to use for inference. If `None`, we will use a default list that is defined in `myvlm/common.py` (in `VLM_TO_PROMPTS`).\n\nThe output results will be saved to `\u003ccheckpoint_path\u003e/\u003cconcept_name\u003e/seed_\u003cseed\u003e/inference_outputs/inference_outputs_\u003cVLM_TYPE\u003e_\u003cTASK\u003e.json`, in the following format:\n```\n{\n    \"iteration_10\": {\n        \"image_path_1\": {\n            \"prompt1\": \"caption1\",\n            \"prompt2\": \"caption2\",\n            ...\n        },\n        \"image_path_2\": {\n            \"prompt1\": \"caption1\",\n            \"prompt2\": \"caption2\",\n            ...\n        },\n        ...\n    },\n    \"iteration_20\": {\n        ...\n    },\n  ...\n}\n```\n\n\n# Acknowledgements \nThis code builds on code from the following repositories: \n- [Transformers](https://github.com/huggingface/transformers): we use the `transformers` library for various model architectures, including CLIP and BLIP-2.\n- [LLaVA](https://github.com/haotian-liu/LLaVA): the official implementation of LLaVA-1.6.\n- [MiniGPT-v2](https://github.com/Vision-CAIR/MiniGPT-4): the official implementation of MiniGPT-v2.\n- [GRACE](https://github.com/Thartvigsen/GRACE): official implementation of the GRACE model editing technique on which our original MyVLMLayer implementation was based.\n\n\n# Citation\nIf you use this code for your research, please cite the following work:\n```\n@misc{alaluf2024myvlm,\n      title={MyVLM: 
Personalizing VLMs for User-Specific Queries}, \n      author={Yuval Alaluf and Elad Richardson and Sergey Tulyakov and Kfir Aberman and Daniel Cohen-Or},\n      year={2024},\n      eprint={2403.14599},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsnap-research%2Fmyvlm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsnap-research%2Fmyvlm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsnap-research%2Fmyvlm/lists"}