{"id":18864608,"url":"https://github.com/yuhui-zh15/c3","last_synced_at":"2025-04-14T13:21:25.817Z","repository":{"id":217576595,"uuid":"738643756","full_name":"yuhui-zh15/C3","owner":"yuhui-zh15","description":"Official implementation of \"Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data\" (ICLR 2024)","archived":false,"fork":false,"pushed_at":"2024-10-16T21:22:30.000Z","size":32423,"stargazers_count":28,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-28T02:37:44.101Z","etag":null,"topics":["computer-vision","contrastive-learning","iclr2024","machine-learning","natural-language-processing"],"latest_commit_sha":null,"homepage":"https://yuhui-zh15.github.io/C3-Website/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yuhui-zh15.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-01-03T17:49:52.000Z","updated_at":"2025-02-14T07:32:12.000Z","dependencies_parsed_at":"2024-10-17T13:09:01.348Z","dependency_job_id":null,"html_url":"https://github.com/yuhui-zh15/C3","commit_stats":null,"previous_names":["yuhui-zh15/c3"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yuhui-zh15%2FC3","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yuhui-zh15%2FC3/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yuhui-zh15%2FC3/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposit
ories/yuhui-zh15%2FC3/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yuhui-zh15","download_url":"https://codeload.github.com/yuhui-zh15/C3/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248886334,"owners_count":21177645,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computer-vision","contrastive-learning","iclr2024","machine-learning","natural-language-processing"],"created_at":"2024-11-08T04:43:32.336Z","updated_at":"2025-04-14T13:21:25.755Z","avatar_url":"https://github.com/yuhui-zh15.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data\n\n[![MIT license](https://img.shields.io/badge/License-MIT-blue.svg)](https://lbesson.mit-license.org/)\n[![Python](https://img.shields.io/badge/python-3.11-blue.svg)](https://www.python.org/downloads/release/python-311/)\n[![Pytorch](https://img.shields.io/badge/Pytorch-2.1-red.svg)](https://pytorch.org/get-started/previous-versions/#v21)\n[![Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/ambv/black)\n\nThis repo provides the PyTorch source code of our paper: \n[Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data](https://arxiv.org/abs/2401.08567) (ICLR 2024). Check out the project page [here](https://yuhui-zh15.github.io/C3-Website/)!\n\n## 🔮 Abstract\n\nBuilding cross-modal applications is challenging due to limited paired multi-modal data. 
Recent works have shown that leveraging a pre-trained multi-modal contrastive representation space enables cross-modal tasks to be learned from uni-modal data. This is based on the assumption that contrastive optimization makes embeddings from different modalities interchangeable. However, this assumption is under-explored due to the poorly understood geometry of the multi-modal contrastive space, where a modality gap exists. In our study, we provide a theoretical explanation of this space's geometry and introduce a three-step method, $C^3$ (Connect, Collapse, Corrupt), to bridge the modality gap, enhancing the interchangeability of embeddings. Our $C^3$ method significantly improves cross-modal learning from uni-modal data, achieving state-of-the-art results on zero-shot image / audio / video captioning and text-to-image generation.\n\n\u003cimg src=\"./figures/poster.jpg\" width=\"100%\" /\u003e\n\n\n## 💡 Approach\n\n\u003cp float=\"left\"\u003e\n  \u003cimg src=\"./figures/figure1.png\" width=\"57%\" /\u003e\n  \u003cimg src=\"./figures/figure2.png\" width=\"39%\" /\u003e \n\u003c/p\u003e\n\n**Figure: Overview of the motivation behind our approach, $C^3$.** Our work provides a theoretical explanation of the unique geometry that arises from multi-modal contrastive learning, where a modality gap and alignment noise exist in the learned representation space. 
Building upon this observation, we present a straightforward technique, $C^3$, which enhances the interchangeability of embeddings between modalities, enabling the creation of cross-modal applications using only uni-modal data.\n\n## 🚀 Getting Started\n\n- Reproduce embedding geometry analysis results [here](geometry_analysis/README.md).\n\n- Reproduce image captioning results [here](image_captioning/README.md).\n\n- Reproduce image generation results [here](image_generation/README.md).\n\n- Reproduce ImageBind results on the [imagebind branch](https://github.com/yuhui-zh15/C3/tree/imagebind).\n\n## 🎯 Citation\n\nIf you use this repo in your research, please cite it as follows:\n\n```\n@inproceedings{C3,\n  title={Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data},\n  author={Zhang, Yuhui and Sui, Elaine and Yeung-Levy, Serena},\n  booktitle={International Conference on Learning Representations (ICLR)},\n  year={2024}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyuhui-zh15%2Fc3","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyuhui-zh15%2Fc3","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyuhui-zh15%2Fc3/lists"}