# 【NeurIPS'2022 🔥】Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations

[![Conference](http://img.shields.io/badge/NeurIPS-2022-FFD93D.svg)](https://neurips.cc/Conferences/2022)
[![Paper](http://img.shields.io/badge/Paper-arxiv.2211.11427-FF6B6B.svg)](https://arxiv.org/abs/2211.11427)

The implementation of NeurIPS 2022 paper [Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations](https://arxiv.org/pdf/2211.11427.pdf).

💡 I also have other video-language projects that may interest you ✨.

> [**Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning**](https://arxiv.org/abs/2303.14369)

> Accepted by CVPR 2023 (Highlight) | [[HBI Code]](https://github.com/jpthu17/HBI)

> Peng Jin, Jinfa Huang, Pengfei Xiong, Shangxuan Tian, Chang Liu, Xiangyang Ji, Li Yuan, Jie Chen

> [**DiffusionRet: Generative Text-Video Retrieval with Diffusion Model**](https://arxiv.org/abs/2303.09867)

> Accepted by ICCV 2023 | [[DiffusionRet Code]](https://github.com/jpthu17/DiffusionRet)

> Peng Jin, Hao Li, Zesen Cheng, Kehan Li, Xiangyang Ji, Chang Liu, Li Yuan, Jie Chen

> [**Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment**](https://arxiv.org/abs/2305.12218)

> Accepted by IJCAI 2023 | [[DiCoSA Code]](https://github.com/jpthu17/DiCoSA)

> Peng Jin, Hao Li, Zesen Cheng, Jinfa Huang, Zhennan Wang, Li Yuan, Chang Liu, Jie Chen

### 📣 Updates
* **[2023/04/12]**: We provide download links for the processed datasets, including MSR-VTT, MSVD, ActivityNet Captions, and DiDeMo (see [EMCL-Net](video_retrieval/EMCL-Net)).
* **[2023/04/10]**: Add the MSVD, LSMDC, ActivityNet Captions, and DiDeMo datasets (see [EMCL-Net](video_retrieval/EMCL-Net)).
* **[2023/01/12]**: Our approach achieves better performance on the MSR-VTT dataset (46.8 -> 48.2) when trained with more GPUs (2 -> 8), so we recommend using more GPUs for better results.

![results](pic/results.png)
* **[2022/12/14]**: Add the code of [EMCL-Net](video_retrieval/EMCL-Net).
* **[2022/11/21]**: Release the code for reproducing the experiments in the paper.

## 🚀 Quick Start
### Datasets

|Datasets|Google Cloud|Baidu Yun|Peking University Yun|
|:--------:|:--------------:|:-----------:|:-----------:|
| MSR-VTT | [Download](https://drive.google.com/drive/folders/1LYVUCPRxpKMRjCSfB_Gz-ugQa88FqDu_?usp=sharing) | [Download](https://pan.baidu.com/s/1Gdf6ivybZkpua5z1HsCWRA?pwd=enav) | [Download](https://disk.pku.edu.cn/link/AA6A028EE7EF5C48A788118B82D6ABE0C5) |
| MSVD | [Download](https://drive.google.com/drive/folders/18EXLWvCCQMRBd7-n6uznBUHdP4uC6Q15?usp=sharing) | [Download](https://pan.baidu.com/s/1hApFdxgV3TV2TCcnM_yBiA?pwd=kbfi) | [Download](https://disk.pku.edu.cn/link/AA6BD6FC1A490F4D0E9C384EF347F0D07F) |
| ActivityNet | TODO | [Download](https://pan.baidu.com/s/1tI441VGvN3In7pcvss0grg?pwd=2ddy) | [Download](https://disk.pku.edu.cn/link/AAE744E6488E2049BD9412738E14AAA8EA) |
| DiDeMo | TODO | [Download](https://pan.baidu.com/s/1Tsy9nb1hWzeXaZ4xr7qoTg?pwd=c842) | [Download](https://disk.pku.edu.cn/link/AA14E48D1333114022B736291D60350FA5) |

### Model Zoo

|Checkpoint|Google Cloud|Baidu Yun|Peking University Yun|
|:--------:|:--------------:|:-----------:|:-----------:|
| MSR-VTT | [Download](https://drive.google.com/file/d/1gxTKW5KfXvJK8-3WsOftRtCszMMCssv7/view?usp=sharing) | TODO | [Download](https://disk.pku.edu.cn:443/link/6165FBD0B60C4E1ED83E78ADF9635471) |
| ActivityNet | [Download](https://drive.google.com/file/d/1o4kVq8gHUIxR5wzWNw6NWVX13FGP8W2E/view?usp=drive_link) | [Download](https://pan.baidu.com/s/101iJ4Ml41k3TnWKgbV7sig?pwd=er2w) | [Download](https://disk.pku.edu.cn:443/link/50EBDF3124AD82272F061FE8E7880403) |

### Text-Video Retrieval
* The implementation of EMCL-Net ([video_retrieval/EMCL-Net](https://github.com/jpthu17/EMCL/tree/main/video_retrieval/EMCL-Net)).

* An example of using EMCL as a joint training module ([video_retrieval/As_a_joint_training_module](https://github.com/jpthu17/EMCL/tree/main/video_retrieval/As_a_joint_training_module)).

* An example of using EMCL as an inference module with no extra training ([video_retrieval/As_an_inference_module](https://github.com/jpthu17/EMCL/tree/main/video_retrieval/As_an_inference_module)); a hedged evaluation sketch follows below.
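As a point of reference for the inference-module setting, here is a minimal, self-contained sketch of how text-to-video retrieval metrics could be computed from precomputed embeddings. The `retrieval_metrics` helper, the embedding shapes, and the random inputs are illustrative assumptions, not the repository's evaluation code; in practice both sets of embeddings would first be refined by EMCL (see the sketch in the Method section below).

```python
import torch
import torch.nn.functional as F

def retrieval_metrics(sim: torch.Tensor) -> dict:
    """sim[i, j] = similarity of text i and video j; the matching pair sits on the diagonal."""
    # Rank of the ground-truth video for each text query (0 = best).
    ranks = (-sim).argsort(dim=1).argsort(dim=1).diagonal()
    return {f"R@{k}": (ranks < k).float().mean().item() * 100 for k in (1, 5, 10)}

# Precomputed, CLIP-style embeddings for N paired text-video examples (illustrative shapes).
N, D = 1000, 512
text_embeds = F.normalize(torch.randn(N, D), dim=-1)
video_embeds = F.normalize(torch.randn(N, D), dim=-1)

# When EMCL is used as a training-free inference module, both sets of embeddings would be
# re-projected into the compact subspace before this step; here we score the raw embeddings.
similarity = text_embeds @ video_embeds.t()
print(retrieval_metrics(similarity))
```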

### Video-Question Answering
* The implementation of EMCL-QA ([video_question_answering](https://github.com/jpthu17/EMCL/tree/main/video_question_answering)).

## 📕 Overview
Most video-and-language representation learning approaches employ contrastive learning (e.g., CLIP) to project video and text features into a common latent space according to the semantic similarities of text-video pairs. However, the learned shared latent space is often suboptimal, and the modality gap between visual and textual representations cannot be fully eliminated. In this paper, we propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations.

![motivation](pic/Modality_gap.png)
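
For context, the standard setup the paper starts from is a CLIP-style symmetric contrastive (InfoNCE) objective over a batch of paired text and video features. The sketch below is a generic version of that objective, not the paper's EMCL loss; the function name, batch size, feature dimension, and temperature are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(text: torch.Tensor, video: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    # Map both modalities onto the unit sphere and score every text-video pair.
    text = F.normalize(text, dim=-1)
    video = F.normalize(video, dim=-1)
    logits = text @ video.t() / temperature                 # (batch, batch); matches on the diagonal
    targets = torch.arange(text.size(0), device=text.device)
    # Symmetric cross-entropy over the text-to-video and video-to-text directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = clip_style_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```

Roughly speaking, EMCL applies a contrastive objective of this kind to features that have been re-projected into a compact shared subspace; the next section sketches that re-projection.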

## 📚 Method
![EMCL](pic/EMCL.png)
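
As a rough illustration of the expectation-maximization idea in the figure, the sketch below alternates a soft-assignment E-step with a basis-update M-step and then reconstructs the input features in the compact subspace spanned by the bases. The class name `EMReprojection`, the number of bases, the number of iterations, the temperature, and the random basis initialization are all illustrative assumptions, not the repository's actual implementation (see [video_retrieval/EMCL-Net](video_retrieval/EMCL-Net) for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EMReprojection(nn.Module):
    """Re-project features onto a small set of shared bases with a few EM iterations (illustrative)."""

    def __init__(self, dim: int = 512, n_bases: int = 32, n_iters: int = 3, temperature: float = 0.05):
        super().__init__()
        self.n_iters = n_iters
        self.temperature = temperature
        # Randomly initialized bases; the real module's initialization may differ.
        self.register_buffer("init_bases", F.normalize(torch.randn(n_bases, dim), dim=-1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, dim) text and video features stacked along the batch dimension.
        bases = self.init_bases
        for _ in range(self.n_iters):
            # E-step: soft-assign each feature to each basis.
            resp = F.softmax(feats @ bases.t() / self.temperature, dim=-1)   # (batch, n_bases)
            # M-step: update each basis as the responsibility-weighted mean of the features.
            bases = F.normalize(resp.t() @ feats, dim=-1)                    # (n_bases, dim)
        # Final assignment against the refined bases, then reconstruct in the compact subspace.
        resp = F.softmax(feats @ bases.t() / self.temperature, dim=-1)
        return resp @ bases                                                  # (batch, dim)

# Illustrative usage: refine stacked text and video features, then score them contrastively.
text, video = torch.randn(8, 512), torch.randn(8, 512)
refined = EMReprojection()(torch.cat([text, video], dim=0))
text_r, video_r = refined[:8], refined[8:]
similarity = F.normalize(text_r, dim=-1) @ F.normalize(video_r, dim=-1).t()
```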

## 📌 Citation
If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:
```
@inproceedings{jin2022expectationmaximization,
  title     = {Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations},
  author    = {Peng Jin and JinFa Huang and Fenglin Liu and Xian Wu and Shen Ge and Guoli Song and David A. Clifton and Jie Chen},
  booktitle = {Advances in Neural Information Processing Systems},
  volume    = {35},
  pages     = {30291--30306},
  editor    = {Alice H. Oh and Alekh Agarwal and Danielle Belgrave and Kyunghyun Cho},
  year      = {2022}
}
```

## πŸŽ—οΈ Acknowledgments
Our code is based on [MMT](https://github.com/gabeur/mmt), [CLIP](https://github.com/openai/CLIP), [CLIP4Clip](https://github.com/ArrowLuo/CLIP4Clip/), [DRL](https://github.com/foolwood/DRL) and [CLIP2Video](https://github.com/CryhanFang/CLIP2Video). We sincerely appreciate their contributions.
