{"id":22068803,"url":"https://github.com/wjpoom/SPEC","last_synced_at":"2025-07-24T07:31:05.310Z","repository":{"id":210722075,"uuid":"724008289","full_name":"wjpoom/SPEC","owner":"wjpoom","description":"[CVPR' 24] The official implementation of paper \"synthesize, diagnose, and optimize: towards fine-grained vision-language understanding\"","archived":false,"fork":false,"pushed_at":"2024-04-13T17:07:29.000Z","size":11908,"stargazers_count":15,"open_issues_count":2,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-04-14T09:41:28.893Z","etag":null,"topics":["clip","compositionality","computer-vision","fine-grained","multimodal","vision-language","vision-language-model"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2312.00081","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wjpoom.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-27T07:55:15.000Z","updated_at":"2024-06-03T07:26:54.207Z","dependencies_parsed_at":"2024-06-03T07:41:37.774Z","dependency_job_id":null,"html_url":"https://github.com/wjpoom/SPEC","commit_stats":null,"previous_names":["wjpoom/spec"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wjpoom%2FSPEC","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wjpoom%2FSPEC/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wjpoom%2FSPEC/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/rep
ositories/wjpoom%2FSPEC/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/wjpoom","download_url":"https://codeload.github.com/wjpoom/SPEC/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":227421331,"owners_count":17775010,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clip","compositionality","computer-vision","fine-grained","multimodal","vision-language","vision-language-model"],"created_at":"2024-11-30T20:04:22.083Z","updated_at":"2025-07-24T07:31:05.284Z","avatar_url":"https://github.com/wjpoom.png","language":"Jupyter Notebook","funding_links":[],"categories":["Paper List"],"sub_categories":["Follow-up Papers"],"readme":"\u003cdiv align=\"center\" style=\"font-family: charter;\"\u003e\n\u003ch1\u003e\u003ci\u003eSPEC\u003c/i\u003e: Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding\u003c/h1\u003e\n\u003ca href=\"https://arxiv.org/abs/2312.00081\" target=\"_blank\"\u003e\n    \u003cimg alt=\"arXiv\" src=\"https://img.shields.io/badge/arXiv-SPEC-red?logo=arxiv\" height=\"20\" /\u003e\n\u003c/a\u003e\n\u003ca href=\"https://huggingface.co/wjpoom/SPEC-CLIP-ViT-B-32\" target=\"_blank\"\u003e\n    \u003cimg alt=\"HF Checkpoint: SPEC\" src=\"https://img.shields.io/badge/📒_Checkpoint-SPEC-ffc107?color=5e84b6\u0026logoColor=white\" height=\"20\" /\u003e\n\u003c/a\u003e\n\u003ca href=\"https://huggingface.co/datasets/wjpoom/SPEC\" target=\"_blank\"\u003e\n    \u003cimg alt=\"HF Dataset: SPEC\" 
src=\"https://img.shields.io/badge/📒_Benchmark-SPEC-ffc107?color=A9B5DF\u0026logoColor=white\" height=\"20\" /\u003e\n\u003c/a\u003e\n\u003ca href=\"https://github.com/wjpoom/SPEC/tree/main/notebooks\" target=\"_blank\"\u003e\n    \u003cimg alt=\"Notebook: SPEC\" src=\"https://img.shields.io/badge/%F0%9F%A4%97%20_Notebook-SPEC-ffc107?color=B3D8A8\u0026logoColor=white\" height=\"20\" /\u003e\n\u003c/a\u003e\n\u003ca href=\"https://openaccess.thecvf.com/content/CVPR2024/supplemental/Peng_Synthesize_Diagnose_and_CVPR_2024_supplemental.pdf\" target=\"_blank\"\u003e\n    \u003cimg alt=\"Supplementary\" src=\"https://img.shields.io/badge/📑_Supplementary-SPEC-ffc107?color=FFCF50\u0026logoColor=white\" height=\"20\" /\u003e\n\u003c/a\u003e\n\n\u003cdiv\u003e\n    \u003ca href=\"https://scholar.google.com/citations?user=GTuWk9YAAAAJ\u0026hl=zh-CN\" target=\"_blank\"\u003eWujian Peng\u003c/a\u003e\u003csup\u003e\u003c/sup\u003e,\n    Sicheng Xie\u003csup\u003e\u003c/sup\u003e,\n    Zuyao You\u003csup\u003e\u003c/sup\u003e,\n    \u003ca href=\"https://voidrank.github.io/\" target=\"_blank\"\u003eShiyi Lan\u003c/a\u003e\u003csup\u003e\u003c/sup\u003e,\n    \u003ca href=\"https://zxwu.azurewebsites.net/\" target=\"_blank\"\u003eZuxuan Wu\u003c/a\u003e\u003csup\u003e\u0026dagger;\u003c/sup\u003e\n\u003c/div\u003e\n\n\u003cdiv\u003e\n    \u003csup\u003e\u0026dagger;\u003c/sup\u003e Corresponding author\u0026emsp;\n\u003c/div\u003e\n\n\u003c/div\u003e\n\n## :fire: News\n* `Jun. 17, 2025` 🔥 We have released the [checkpoints](https://huggingface.co/wjpoom/SPEC-CLIP-ViT-B-32) of our fine-tuned model.\n\u003c!-- * `Apr. 14, 2024` We have released a [preview](https://wjpoom.github.io/preview/) of a more advanced dataset version, the full version will come soon. --\u003e\n* `Apr. 13, 2024` We released the SPEC dataset and the evaluation code; sorry for the delay :relaxed:.\n* `Feb. 
28, 2024` Our work has been accepted by [CVPR 2024](https://cvpr.thecvf.com/) :tada:.\n\n\u003c!-- ## :rocket: A more advanced version is coming!\nWe are building a new version with a larger data scale, more object categories, and higher-quality images and text, and more. \nYou can preview it at [this website](https://wjpoom.github.io/preview/), and the full version will come soon. --\u003e\n\n## :mag: SPEC Benchmark\nTo evaluate how well vision-language models understand fine-grained concepts, we propose a new benchmark, SPEC,\nwhich consists of six distinct subsets distributed across the dimensions of **S**ize, **P**osition, **E**xistence, and **C**ount.\nEach test case consists of an image candidate set, whose images differ only in certain visual concepts, and a text candidate set,\nwhose texts differ only in the corresponding language concept.\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"assets/spec_overview.png\" width=\"720px\"/\u003e  \n\u003cbr\u003e\n\u003c/p\u003e\n\n## :wrench: Usage\n### install\n``` shell\ngit clone https://github.com/wjpoom/SPEC.git\ncd SPEC/\npip install -e .\n```\n### prepare data\n* Run the following code in a Python shell, replacing `/path/to/save/data` with the directory where the data should be stored.\n```python\nimport zipfile\nimport os\nfrom huggingface_hub import hf_hub_download\n\ndata_root = '/path/to/save/data'\n# download data.zip from the SPEC dataset repo into data_root\nhf_hub_download(repo_id='wjpoom/SPEC', repo_type='dataset', filename='data.zip', local_dir=data_root)\n\n# unpack the archive in place, then delete it\nwith zipfile.ZipFile(os.path.join(data_root, 'data.zip'), 'r') as zip_ref:\n    zip_ref.extractall(data_root)\n\nos.remove(os.path.join(data_root, 'data.zip'))\n```\n### explore the dataset\n* We provide a 📓notebook that lets you visually explore the test samples in the SPEC dataset.\n* Run this notebook either [locally](https://github.com/wjpoom/SPEC/blob/main/notebooks/explore_spec_local.ipynb) or online using 
[Colab](https://colab.research.google.com/github/wjpoom/SPEC/blob/main/notebooks/explore_spec_colab.ipynb).\n\n### reproduce the results\n* In our paper, we evaluated four popular VLMs using our SPEC dataset, namely: CLIP, BLIP, FLAVA and CoCa.\n* To reproduce the results with these VLMs, you can run [this script](https://github.com/wjpoom/SPEC/blob/main/spec/run_eval.sh).\n* You can also reproduce with this [local notebook](https://github.com/wjpoom/SPEC/blob/main/notebooks/evaluate_example_local.ipynb) or the online [Colab notebook](https://colab.research.google.com/github/wjpoom/SPEC/blob/main/notebooks/evaluate_example_colab.ipynb).\n\n### evaluate custom VLMs\n* If you want to evaluate your custom model on SPEC, you can follow the instructions in [this document](https://github.com/wjpoom/SPEC/blob/main/docs/evaluate_custom_model.md).\n\n## :space_invader: Model weights\n```shell\npip install open_clip_torch\nmkdir checkpoints\nhuggingface-cli download wjpoom/SPEC-CLIP-ViT-B-32 --local-dir checkpoints/SPEC-CLIP-ViT-B-32\n```\n```python\nimport torch\nfrom PIL import Image\nimport open_clip\n\nmodel, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='checkpoints/SPEC-CLIP-ViT-B-32', load_weights_only=False)\nmodel.eval()\ntokenizer = open_clip.get_tokenizer('ViT-B-32')\n\nimage = preprocess(Image.open(\"assets/image.png\")).unsqueeze(0)\ntext = tokenizer([\n    \"the broccoli is situated above the backpack.\", \n    \"the broccoli is situated to the right of the backpack\",\n    \"the broccoli is positioned on the left of the backpack.\",\n    \"the broccoli is placed beneath the backpack.\"\n    ])\n\nwith torch.no_grad(), torch.autocast(\"cuda\"):\n    image_features = model.encode_image(image)\n    text_features = model.encode_text(text)\n    image_features /= image_features.norm(dim=-1, keepdim=True)\n    text_features /= text_features.norm(dim=-1, keepdim=True)\n\n    text_probs = (100.0 * image_features @ 
text_features.T).softmax(dim=-1)\n\nprint(\"Label probs:\", text_probs)\n```\n\n## :memo: TODO\n\u003c!-- - [ ] Release the newly built version of the dataset --\u003e\n\u003c!-- - [ ] Release the code of our data synthesize pipeline --\u003e\n- [x] Release the checkpoint of the fine-tuned model\n- [x] Release the testing set of the SPEC benchmark\n- [x] Release the evaluation code of SPEC\n\n## :clap: Acknowledgement\nPart of this repository is built upon [ARO](https://github.com/mertyg/vision-language-models-are-bows); thanks to its authors for the well-organized codebase.\n\n## Contact Us\nFeel free to contact us if you have any questions or suggestions.\n\nEmail (Wujian Peng): wjpeng24@m.fudan.edu.cn\n\n## :black_nib: Citation\nIf you use the code or data in this repo, or find our work helpful, please consider citing it:\n\n``` bibtex\n@inproceedings{peng2024synthesize,\n  title={Synthesize, diagnose, and optimize: Towards fine-grained vision-language understanding},\n  author={Peng, Wujian and Xie, Sicheng and You, Zuyao and Lan, Shiyi and Wu, Zuxuan},\n  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},\n  pages={13279--13288},\n  year={2024}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwjpoom%2FSPEC","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwjpoom%2FSPEC","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwjpoom%2FSPEC/lists"}