{"id":18631007,"url":"https://github.com/aimagelab/dico","last_synced_at":"2025-04-11T06:31:22.146Z","repository":{"id":255331149,"uuid":"835675882","full_name":"aimagelab/DiCO","owner":"aimagelab","description":"[BMVC'24 Oral ✨] Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization","archived":false,"fork":false,"pushed_at":"2024-09-11T12:08:12.000Z","size":7084,"stargazers_count":17,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-08T16:35:59.340Z","etag":null,"topics":["bmvc2024","caption-generation","captioning-images","image-captioning","vision-and-language"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aimagelab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-30T10:06:51.000Z","updated_at":"2025-04-07T08:16:19.000Z","dependencies_parsed_at":"2024-08-29T10:53:25.675Z","dependency_job_id":null,"html_url":"https://github.com/aimagelab/DiCO","commit_stats":null,"previous_names":["aimagelab/dico"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aimagelab%2FDiCO","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aimagelab%2FDiCO/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aimagelab%2FDiCO/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aimagelab%2FDiCO/manifests","owner_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/owners/aimagelab","download_url":"https://codeload.github.com/aimagelab/DiCO/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248355790,"owners_count":21090084,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bmvc2024","caption-generation","captioning-images","image-captioning","vision-and-language"],"created_at":"2024-11-07T05:05:33.771Z","updated_at":"2025-04-11T06:31:20.234Z","avatar_url":"https://github.com/aimagelab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/dico_logo.png\" /\u003e\n\u003c/p\u003e\n\n\u003ch1 align=\"center\"\u003e Revisiting Image Captioning Training Paradigm\u003cbr/\u003evia Direct CLIP-based Optimization\u003cbr/\u003e(BMVC 2024 Oral ✨)\u003c/h1\u003e\n\n\n\u003cdiv align='center'\u003e\n\n#### [Nicholas Moratelli](https://nicholasmoratelli.github.io)\\*, [Davide Caffagni](https://github.com/dcaffo98)\\*, [Marcella Cornia](https://aimagelab.ing.unimore.it/imagelab/person.asp?idpersona=90), [Lorenzo Baraldi](https://www.lorenzobaraldi.com/), and [Rita Cucchiara](https://aimagelab.ing.unimore.it/imagelab/person.asp?idpersona=1)\n\n[![Paper](https://img.shields.io/badge/Paper-arxiv.2408.14547-B31B1B.svg)](https://arxiv.org/pdf/2408.14547)\n\n\u003c/div\u003e\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/dico_model.jpg\" width=80%/\u003e \n\u003c/p\u003e\n\nThis repository contains the reference code for the paper [Revisiting Image Captioning Training 
Paradigm via Direct CLIP-based Optimization](https://arxiv.org/pdf/2408.14547), **BMVC 2024**.\n\nPlease cite with the following BibTeX:\n```\n@inproceedings{moratelli2024revisiting,\n  title={{Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization}},\n  author={Moratelli, Nicholas and Caffagni, Davide and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},\n  booktitle={Proceedings of the British Machine Vision Conference},\n  year={2024}\n}\n```\n\n## 📣 Latest News 📣\n- **`10 September 2024`** Our paper has been selected for oral presentation at **BMVC2024**! ✨\n- **`19 July 2024`** Our paper has been accepted for publication at **BMVC2024**!\n\n## Abstract\n\nThe conventional training approach for image captioning involves pre-training a network using teacher forcing and subsequent fine-tuning with Self-Critical Sequence Training to maximize hand-crafted captioning metrics. However, when attempting to optimize modern and higher-quality metrics like CLIP-Score and PAC-Score, this training method often encounters instability and fails to acquire the genuine descriptive capabilities needed to produce fluent and informative captions. In this paper, we propose a new training paradigm termed *Direct CLIP-Based Optimization* (DiCO). Our approach jointly learns and optimizes a reward model that is distilled from a learnable captioning evaluator with high human correlation. This is done by solving a weighted classification problem directly inside the captioner. At the same time, DiCO prevents divergence from the original model, ensuring that fluency is maintained. DiCO not only exhibits improved stability and enhanced quality in the generated captions but also aligns more closely with human preferences compared to existing methods, especially in modern metrics. 
Additionally, it maintains competitive performance in traditional metrics.\n\n## Create the environment\n```\nconda create -y -n \"dico\" python=3.8.16\nconda activate dico\npip install -r requirements.txt\n```\n\n## Training\nBefore running the following scripts, edit them to set the correct checkpoint paths.\n1. **Cross-Entropy pre-training**\n```\n./scripts/train_xe_coco.sh\n```\n\n2. **DiCO fine-tuning**\n```\n./scripts/train_dico_coco.sh\n```\n\nWe train and evaluate our models on the [COCO Karpathy splits](https://github.com/karpathy/neuraltalk2). We use the [webdataset](https://github.com/webdataset/webdataset) format to prepare our datasets. Every `tar` file has the following structure (see also [datasets.json](datasets.json)):\n- Cross-entropy\n```\n├── webdatasets/coco-384-training-000.tar\n│   └── 177828__COCO_train2014_000000379613.jpg\n│   └── 177828__COCO_train2014_000000379613.txt\n│   └── 549457__COCO_val2014_000000195045.jpg\n│   └── 549457__COCO_val2014_000000195045.txt\n│   └── ...\n├── ...\n└── webdatasets/coco-384-training-113.tar\n    └── ...\n```\nEvery `.txt` file contains a single caption.\n- DiCO: fine-tuning | validation | test\n```\n├── webdatasets/coco-384-training-dict-000.tar\n│   └── 177828__COCO_train2014_000000379613.jpg\n│   └── 177828__COCO_train2014_000000379613.json\n│   └── 549457__COCO_val2014_000000195045.jpg\n│   └── 549457__COCO_val2014_000000195045.json\n│   └── ...\n├── ...\n└── webdatasets/coco-384-training-dict-022.tar\n    └── ...\n```\nEvery `.json` file contains all the captions for the given image.\n\n## Inference\n```\n./scripts/inference_coco.sh\n```\n\n## DiCO Weights\n- [ViT/L-14] The checkpoint is available [here](https://drive.google.com/file/d/19vV-SJYjFKg5-8XfbCQvq8OWB2vVTvpX).\n\nThe weights will soon also be available on the HuggingFace Hub.\n\n## Demo\n```\npython demo.py --checkpoint 
dico-ViTL14\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faimagelab%2Fdico","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faimagelab%2Fdico","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faimagelab%2Fdico/lists"}