{"id":28653881,"url":"https://github.com/tiger-ai-lab/abc","last_synced_at":"2025-06-13T07:08:00.691Z","repository":{"id":280061215,"uuid":"939516239","full_name":"TIGER-AI-Lab/ABC","owner":"TIGER-AI-Lab","description":"ABC: Achieving Better Control of Multimodal Embeddings using VLMs","archived":false,"fork":false,"pushed_at":"2025-04-05T03:51:09.000Z","size":5976,"stargazers_count":8,"open_issues_count":1,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-05T04:24:56.880Z","etag":null,"topics":["information-retrieval","multimodal"],"latest_commit_sha":null,"homepage":"https://tiger-ai-lab.github.io/ABC/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TIGER-AI-Lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-26T17:02:53.000Z","updated_at":"2025-04-05T03:51:55.000Z","dependencies_parsed_at":"2025-03-01T01:27:24.477Z","dependency_job_id":"9b252f79-ec8b-485d-85d0-6439dfabfb67","html_url":"https://github.com/TIGER-AI-Lab/ABC","commit_stats":null,"previous_names":["tiger-ai-lab/abc"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/TIGER-AI-Lab/ABC","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FABC","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FABC/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FABC/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FABC/manife
sts","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TIGER-AI-Lab","download_url":"https://codeload.github.com/TIGER-AI-Lab/ABC/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FABC/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259599331,"owners_count":22882357,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["information-retrieval","multimodal"],"created_at":"2025-06-13T07:07:59.202Z","updated_at":"2025-06-13T07:08:00.669Z","avatar_url":"https://github.com/TIGER-AI-Lab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ABC: Achieving Better Control of Multimodal Embeddings using VLMs\n\u003ca target=\"_blank\" href=\"https://arxiv.org/abs/2503.00329\"\u003e\n\u003cimg style=\"height:22pt\" src=\"https://img.shields.io/badge/-Paper-red?style=flat\u0026logo=arxiv\"\u003e\u003c/a\u003e\n\u003ca target=\"_blank\" href=\"https://github.com/TIGER-AI-Lab/ABC\"\u003e\n\u003cimg style=\"height:22pt\" src=\"https://img.shields.io/badge/-Code-green?style=flat\u0026logo=github\"\u003e\u003c/a\u003e\n\u003ca target=\"_blank\" href=\"https://tiger-ai-lab.github.io/ABC/\"\u003e\n\u003cimg style=\"height:22pt\" src=\"https://img.shields.io/badge/-🌐%20Website-blue?style=flat\"\u003e\u003c/a\u003e\n\u003ca target=\"_blank\" href=\"https://huggingface.co/TIGER-Lab/ABC-Qwen2VL-Instruct\"\u003e\n\u003cimg style=\"height:22pt\" src=\"https://img.shields.io/badge/-🤗%20Models-red?style=flat\"\u003e\u003c/a\u003e\n\u003ca 
target=\"_blank\" href=\"https://huggingface.co/datasets/TIGER-Lab/ABC-VG-Instruct\"\u003e\n\u003cimg style=\"height:22pt\" src=\"https://img.shields.io/badge/-🤗%20Dataset-red?style=flat\"\u003e\u003c/a\u003e\n\u003cbr\u003e\n\n\u003cbr\u003e\n\n\u003cspan style=\"font-size: 14pt; font-family: Roboto, Helvetica, Arial, Heveltica Neue, sans-serif\"\u003e\n     \u003cb\u003eAuthors:\u003c/b\u003e\n     \u003ca class=\"name\" target=\"_blank\" href=\"https://benjaminschneider.ca/\"\u003eBenjamin Schneider\u003c/a\u003e, \n     \u003ca class=\"name\" target=\"_blank\" href=\"https://cs.uwaterloo.ca/~fkerschb/\"\u003eFlorian Kerschbaum\u003c/a\u003e,\n     \u003ca class=\"name\" target=\"_blank\" href=\"https://wenhuchen.github.io/\"\u003eWenhu Chen\u003c/a\u003e\u0026nbsp; @ \n     \u003ca class=\"btna\" target=\"_blank\" href=\"https://huggingface.co/TIGER-Lab\"\u003eTIGER-Lab\u003c/a\u003e \u0026nbsp; \n     \u003c/span\u003e\n\n## 🔥News\n\n- [2025/3/24] Added scripts to easily fetch our datasets from HF hub, including our large (200 GB) pretraining dataset. Our training script now directly pulls these datasets from the hub, making it very easy to train your own models / adapters. I also added a batched inference embedding function (example in batched_demo.py).\n\n- [2025/3/4] Release of the [ABC Paper](https://arxiv.org/abs/2503.00329), along with the first release of our [🤗 Model and Datasets](https://huggingface.co/collections/TIGER-Lab/abc-67bf2036a7c51b2a99aa9f54) on Hugging Face (more to come, stay tuned!).\n\n\n## Overview\n![./assets/training_overview.png](./assets/training_overview.png)\n\n\u003cdetails\u003e\u003csummary\u003eABC's Design\u003c/summary\u003e  \n\n\n- We introduce ABC, an open-source multimodal embedding model that uses a\nvision-language model backbone to deeply integrate image features with natural language\ninstructions.\n\n- ABC is designed to give the user **maximum control** over how images are represented in embeddings. 
If you need to use natural language to specify which aspects of an image you want emphasized and represented, ABC is the perfect model for you!\n\n- The key behind ABC's training is that we pretrain the model using a large dataset of difficult embedding samples, where each batch contains many candidates that are relevant but not quite correct. The pretrained model is therefore able to generate embeddings that capture subtle differences. After a short finetuning stage, the model is ideal for tasks like VQA, where differences in user instructions result in different correct answers (right).\n\n- ABC outputs high-quality embeddings: it achieves best-for-size performance on MSCOCO image-to-text retrieval and is the\ntop performing model on zero-shot classification and VQA tasks in the Massive Multimodal Embedding\nBenchmark.\n\n\u003c/details\u003e\n\n## 🤗 Models\n\n| Model | Supports Instructions | Base Model | Training Dataset |\n|:---------------------:|:-----------:|:----------------:|:--------------:|\n| [ABC-Qwen2VL-Instruct](https://huggingface.co/TIGER-Lab/ABC-Qwen2VL-Instruct)  | ✅        | [ABC-Qwen2VL-Pretrain](https://huggingface.co/TIGER-Lab/ABC-Qwen2VL-Pretrain) | [TIGER-Lab/ABC-VG-Instruct](https://huggingface.co/datasets/TIGER-Lab/ABC-VG-Instruct) |\n| [ABC-Qwen2VL-Pretrain](https://huggingface.co/TIGER-Lab/ABC-Qwen2VL-Pretrain)  | ❌        | [Qwen2VL-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)     | [TIGER-Lab/ABC-Pretrain](https://huggingface.co/datasets/TIGER-Lab/ABC-Pretraining-Data)    |\n\n## 📚 Datasets\n- [ABC-VG-Instruct](https://huggingface.co/datasets/TIGER-Lab/ABC-VG-Instruct): A custom dataset for multimodal finetuning. 
Contains multiple instructions per image, each corresponding to a different aspect of the image.\n- [ABC-Pretrain](https://huggingface.co/datasets/TIGER-Lab/ABC-Pretraining-Data): Multimodal pretraining dataset with mined negatives.\n\n\n## 🚀 Quick Start\n\nInstall dependencies:\n```bash\ngit clone $\ncd ABC\npip install -r requirements.txt\n```\nStart making multimodal embeddings!\n```bash\npython -i ./quick_start.py\n```\n\n## 📈 Zero-shot Performance\n![./assets/results.png](./assets/results.png)\nCheck out our [paper](https://arxiv.org/abs/2503.00329) for additional evaluations!\n\n\n## Fetching Datasets from 🤗 Hub\n\nOur datasets are hosted on the Hugging Face Hub. The text data and dataset metadata can be fetched using HF's `load_dataset` utility.\nTo fetch the images from our datasets, we provide scripts in the `fetch_datasets` directory.\nThese scripts pull the pretraining/finetuning image data off the hub and unpack it in your Hugging Face datasets cache (under a directory called tigerlab).\nRun `python ./fetch_datasets/pretrain.py` to get the pretraining dataset and `python ./fetch_datasets/instruct.py` to get the finetuning dataset.\n\n## 🤖 Training\n\n**1. Install all requirements.**\n```\npip install -r training_requirements.txt\n```\n**2. Download the appropriate dataset.**  \nEither the pretraining dataset:\n```\npython ./fetch_datasets/pretrain.py\n```\nor the instruction finetuning dataset:\n```\npython ./fetch_datasets/instruct.py\n```\n**3. Update Config**  \nFind the config you want to run in the `config` folder\n(Currently the example configs are nested under the `qwen` folder, one for pretraining and one for finetuning).\nAt minimum, change the `output_dir` field to where you want the checkpoints to be saved.\nFeel free to change any other settings in your chosen config. 😊\n\n**4. 
Run the training script**  \nThe `scripts` directory contains a file for training the model with different GPU / system config settings:\n```\n./scripts/qwen_finetune.sh {GPU} {PORT} {CONFIG_PATH}\n```\nfor example:\n```\n./scripts/qwen_finetune.sh 0,1 44000 ./config/qwen/QwenVL-8B-Instruct.json\n```\nThis runs training on GPUs 0 and 1, with communication over port 44000. \nThis script still works if you only want to specify a single GPU for your training.\n\nIf you encounter any problems, please open an issue on this repo. 😊\n\n## Citation\nIf you find this work helpful, please consider citing:\n```bibtex\n@misc{schneider2025abcachievingbettercontrol,\n      title={ABC: Achieving Better Control of Multimodal Embeddings using VLMs}, \n      author={Benjamin Schneider and Florian Kerschbaum and Wenhu Chen},\n      year={2025},\n      eprint={2503.00329},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https://arxiv.org/abs/2503.00329}, \n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftiger-ai-lab%2Fabc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftiger-ai-lab%2Fabc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftiger-ai-lab%2Fabc/lists"}