{"id":29571019,"url":"https://github.com/compvis/disclip","last_synced_at":"2026-03-03T22:31:09.519Z","repository":{"id":271301179,"uuid":"894000019","full_name":"CompVis/DisCLIP","owner":"CompVis","description":"[AAAI 2025] Does VLM Classification Benefit from LLM Description Semantics?","archived":false,"fork":false,"pushed_at":"2025-08-05T09:48:10.000Z","size":7537,"stargazers_count":22,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"release","last_synced_at":"2025-09-10T05:24:25.592Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CompVis.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-11-25T15:19:17.000Z","updated_at":"2025-09-09T16:02:40.000Z","dependencies_parsed_at":"2025-09-10T04:00:08.784Z","dependency_job_id":"43d3e0d2-7232-48c2-8b4d-521a64c301b0","html_url":"https://github.com/CompVis/DisCLIP","commit_stats":null,"previous_names":["compvis/disclip"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/CompVis/DisCLIP","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CompVis%2FDisCLIP","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CompVis%2FDisCLIP/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CompVis%2FDisCLIP/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CompVis%2FDisCLIP/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CompVis","download_url":"https://codeload.github.com/CompVis/DisCLIP/tar.gz/refs/heads/release","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CompVis%2FDisCLIP/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30064278,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-03T18:21:05.932Z","status":"ssl_error","status_checked_at":"2026-03-03T18:20:59.341Z","response_time":61,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-07-19T03:07:50.870Z","updated_at":"2026-03-03T22:31:09.511Z","avatar_url":"https://github.com/CompVis.png","language":"Python","readme":"\u003cp align=\"center\"\u003e\n \u003ch2 align=\"center\"\u003e 🦙 Does VLM Classification Benefit from LLM Description Semantics?\u003c/h2\u003e\n \u003cp 
align=\"center\"\u003e \n    Pingchuan Ma\u003csup\u003e*\u003c/sup\u003e · Lennart Rietdorf\u003csup\u003e*\u003c/sup\u003e · Dmytro Kotovenko · Vincent Tao Hu · Björn Ommer \n \u003c/p\u003e\u003cp align=\"center\"\u003e \n \u003c/p\u003e\n \u003cp align=\"center\"\u003e \n    \u003cb\u003eCompVis Group @ LMU Munich, MCML\u003c/b\u003e\n \u003c/p\u003e\n  \u003cp align=\"center\"\u003e \u003csup\u003e*\u003c/sup\u003e \u003ci\u003eequal contribution\u003c/i\u003e \u003c/p\u003e\n\u003c/p\u003e\n\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://arxiv.org/abs/2412.11917\"\u003e\n        \u003cimg src=\"https://img.shields.io/badge/arXiv-PDF-b31b1b\" alt=\"Paper\"\u003e\n    \u003c/a\u003e\n\u003c/p\u003e\n\n## 📖 Overview\nDescribing images accurately through text is key to explainability. Vision-Language Models (VLMs) as CLIP align images and texts in a shared space. Descriptions generated by Large Language Models (LLMs) can  further improve their classification performance. However, it remains unclear if performance gains stem from true semantics or semantic-agnostic ensembling effects, as questioned by several prior works. To address this, we propose an alternative evaluation scenario to isolate the discriminative power of descriptions and introduce a training-free method for selecting discriminative descriptions. This method improves classification accuracy across datasets by leveraging CLIP’s local label neighborhood, offering insights into description-based classification and explainability in VLMs. [Figure 1](#fig1) depicts this procedure.\n\nThis repository is our official implementation for the paper **\"Does VLM Classification Benefit from LLM Description Semantics?\"**. It enables the evaluation of Visual-Language Model (VLM) classification accuracy across different datasets, leveraging the semantics of descriptions generated by Large Language Models (LLMs). \n\n\u003ca name=\"fig1\"\u003e\u003c/a\u003e\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"assets/pipeline_horizontal.png\" alt=\"Diagram of the process\" style=\"width: 1200px;\"\u003e\n  \u003cp style=\"font-size: 1.2em; font-weight: bold;\"\u003eFigure 1: Depiction of the suggested approach\u003c/p\u003e\n\u003c/div\u003e\n\n## 🛠️ Setup\n### Environment\nResults were obtained using `Ubuntu 22.04.5 LTS`, `Cuda 11.8`, and `Python 3.10.14` \n\nInstall the necessary dependencies manually via\n```bash\nconda create -n \u003cchoose_name\u003e python=3.10.14\nconda activate \u003cchoose_name\u003e\npip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu118\npip install tqdm\npip install torchmetrics\npip install imagenetv2_pytorch\npip install git+https://github.com/modestyachts/ImageNetV2_pytorch\npip install pyyaml\npip install git+https://github.com/openai/CLIP.git\npip install requests\n```\nThe resulting python env will correspond to requirements.txt.\n### Datasets\n\nThe datasets supported by this implementation are:\n- **Flowers102**\n- **DTD (Describable Textures Dataset)**\n- **Places365**\n- **EuroSAT**\n- **Oxford Pets**\n- **Food101**\n- **CUB-200**\n- **ImageNet**\n- **ImageNet V2**\n\nMost of these datasets will be automatically downloaded as `torchvision` datasets and stored in ```./datasets``` during the first run of ```main.py```. Instructions for datasets that have to be installed manually can be found below.\n\n### CUB-200 Dataset\nThe CUB-200 dataset requires downloading the dataset files first, e.g. 
### CUB-200 Dataset
The CUB-200 dataset requires downloading the dataset files first, e.g. from https://data.caltech.edu/records/65de6-vp158 via
```bash
wget https://data.caltech.edu/records/65de6-vp158/files/CUB_200_2011.tgz?download=1
```
After that, create a directory `./datasets/cub_200` and unpack `CUB_200_2011.tgz` into it. The dataset is then ready for embedding.

### ImageNet Dataset
Follow the instructions at https://pytorch.org/vision/main/generated/torchvision.datasets.ImageNet.html to download the ImageNet dataset files and save them to `./datasets/ilsvrc`. The dataset is then ready to be embedded and used by the `main.py` script.

### ImageNetV2 Dataset
ImageNetV2 is an additional test set for models trained on ImageNet. It requires the `imagenetv2_pytorch` package installed above in the _Environment_ section; the dataset files are downloaded automatically.

## 🦙 Description Pools
Available description pools can be found under `./descriptions`. DClip descriptions are taken from https://github.com/sachit-menon/classify_by_description_release.
The description pools supported by this implementation are:
- **DClip**
- **Contrastive Llama**

Assignments of selected descriptions are saved as JSON files to `./saved_descriptions`.

## 🔢 Embeddings
On the first run of `main.py`, the dataset is first embedded by the chosen CLIP backbone before the description-selection pipeline depicted in [Figure 1](#fig1) is executed. The image embeddings are stored in `./image_embeddings` for reuse, which speeds up subsequent runs; an illustrative caching sketch follows the Usage section below.

## 🚀 Usage
### Run
To run the whole pipeline depicted in [Figure 1](#fig1), call the script `main.py`. As stated above, a dataset is downloaded and embedded the first time it is used. Use the following command with the options described below, e.g. `python main.py --dataset cub --pool dclip --encoding_device 0 --calculation_device 1`:

```bash
python main.py --dataset <DATASET_NAME> --pool <DESCRIPTION_POOL> --encoding_device <CUDA_ID_0> --calculation_device <CUDA_ID_1>
```

### Arguments

`--dataset`
Choose the dataset to evaluate. Available options are:
  - **flowers**
  - **dtd**
  - **eurosat**
  - **places**
  - **food**
  - **pets**
  - **cub**
  - **ilsvrc**
  - **imagenet_v2**

_Be aware that_ downloading and embedding the Places365 dataset may take a long time.

**Default:** `flowers`

`--pool`
Select the description pool to use for the evaluation. Available options are:
  - **dclip**
  - **con_llama**

**Default:** `dclip`

`--encoding_device` and `--calculation_device`
Integer CUDA device IDs: the encoding device is used for embedding images and texts, the calculation device for the evaluation itself.

**Default:** 0 and 1

`--backbone`
Select the OpenAI ViT CLIP backbone. Available options are:
- **b32**
- **b16**
- **l14**
- **l14@336px**

**Default:** `b32`
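Relating back to the Embeddings section above, here is a hedged sketch of the caching idea: encode a dataset once with a CLIP backbone and store the features for reuse. The file name and the layout of `./image_embeddings` are assumptions for illustration, not the repository's actual format.

```python
# Hedged sketch of caching CLIP image embeddings to disk for reuse.
# The output file name and dictionary layout are illustrative assumptions.
import os
import clip
import torch
import torchvision.datasets as datasets
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

dataset = datasets.Flowers102("./datasets", split="test",
                              transform=preprocess, download=True)
loader = DataLoader(dataset, batch_size=64, num_workers=4)

features, labels = [], []
with torch.no_grad():
    for images, targets in loader:
        feats = model.encode_image(images.to(device))  # one pass per image
        features.append(feats.cpu())
        labels.append(targets)

os.makedirs("./image_embeddings", exist_ok=True)
torch.save({"features": torch.cat(features), "labels": torch.cat(labels)},
           "./image_embeddings/flowers_b32.pt")  # illustrative file name
```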
## 📈 Results
Our evaluation demonstrates that the proposed method significantly outperforms baselines in the classname-free setup, minimizing artificial gains from the ensembling effect. Additionally, we show that these improvements transfer to the conventional evaluation setup, achieving competitive results with substantially fewer descriptions while offering better interpretability.

![Results](/assets/tab_wo_cls.jpg)
![Results](/assets/tab_cls.jpg)

## 🎓 Citation

If you use this codebase or otherwise found our work valuable, please cite our paper:

```bibtex
@inproceedings{ma2025does,
  title={Does VLM Classification Benefit from LLM Description Semantics?},
  author={Ma, Pingchuan and Rietdorf, Lennart and Kotovenko, Dmytro and Hu, Vincent Tao and Ommer, Bj{\"o}rn},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={39},
  number={6},
  pages={5973--5981},
  year={2025}
}
```