
https://github.com/yuxie11/R2D2



# CCMB and R2D2: A Large-scale Chinese Cross-modal Benchmark and A Vision-Language Framework

🔥🔥🔥 **CCMB: A Large-scale Chinese Cross-modal Benchmark (ACM MM 2023)**

This repo is the official implementation of CCMB and R2D2.

CCMB is available. It includes the pre-training dataset (Zero) and 5 downstream datasets. A detailed introduction and the download URLs are at **http://zero.so.com**. The 250M data is at **https://pan.baidu.com/s/1gnNbjOdCQdqZ4bRNN1S-Vw?pwd=iau8**.

R2D2 is a vision-language framework. We release the following code and models:

✅ Pre-trained checkpoints.

✅ Inference demo.

✅ Fine-tuning code and checkpoints for Image-Text Retrieval and Image-Text Matching tasks.

## Performance
We show the performance of R2D2ViT-L fine-tuned on the Flickr30k-CNA dataset. The output of R2D2 is a similarity score between 0 and 1.

Chinese (English) | 乔丹投篮 (Jordan shot) | 乔丹运球 (Jordan dribble) | 詹姆斯投篮 (James shot)
--- | :---: | :---: | :---:
Similarity score | 0.99033021 | 0.91078649 | 0.61231128
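
Because each output is a score between 0 and 1, candidate captions for an image can be ranked or filtered directly. The sketch below is illustrative only: it reuses the three scores from the table, while `rank_captions` and the 0.9 cutoff are hypothetical choices and not part of the R2D2 code.

```python
# Scores copied from the table above; keys are the English glosses.
scores = {
    "Jordan shot": 0.99033021,
    "Jordan dribble": 0.91078649,
    "James shot": 0.61231128,
}


def rank_captions(scores, threshold=0.5):
    """Return (caption, score) pairs at or above `threshold`, best first.

    The threshold is an assumed cutoff; the README does not define one.
    """
    kept = [(caption, s) for caption, s in scores.items() if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


ranked = rank_captions(scores, threshold=0.9)
for caption, score in ranked:
    print(f"{score:.4f}  {caption}")
```

A suitable cutoff depends on the task, so it would need to be tuned on validation data rather than taken from this sketch.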

## Requirements

Install the dependencies with `pip install -r requirements.txt`.

## Pre-trained checkpoints
Pre-trained image-text pairs | R2D2ViT-L | PRD2ViT-L
--- | :---: | :---:
250M | Download | Download
23M | Download | -

## Fine-tuned checkpoints
Dataset | R2D2ViT-B (23M)
--- | :---:
Flickr30k-CNA | Download
IQR | Download
ICR | Download
IQM | Download
ICM | Download

## Inference demo
- To evaluate the pre-trained R2D2 model on image-text pairs, run `python r2d2_inference_demo.py`.
- To evaluate the pre-trained PRD2 model on image-text pairs, run `python prd2_inference_demo.py`.

## Downstream Tasks
1. Download datasets and pretrained models.
For the ICR, IQR, ICM, and IQM tasks, after downloading you should see the following folder structure:
```
├── IQR_IQM_ICR_ICM_images
│
├── IQR
│   ├── train
│   └── val
├── ICR
│   ├── train
│   └── val
├── IQM
│   ├── train
│   └── val
└── ICM
    ├── train
    └── val
```
For Flickr30k-CNA, after downloading you should see the following folder structure:
```
├── Flickr30k-images
│
├── train
│
├── val
│
└── test
```
2. In `config/retrieval_*.yaml`, set the paths to the datasets and the pre-trained model.
3. Run fine-tuning for the Image-Text Retrieval task.
```
sh train_r2d2_retrieval.sh
```
4. Run fine-tuning for the Image-Text Matching task.
```
sh train_r2d2_matching.sh
```
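
Before launching the fine-tuning scripts, it can help to confirm that the downloaded data matches the folder structures listed in step 1. The following is a minimal sketch, not part of the repository: `missing_entries`, `TASK_LAYOUT`, `FLICKR_LAYOUT`, and the `data` root are hypothetical names, and the check only tests that each entry exists because the README does not say whether `train`, `val`, and `test` are folders or files.

```python
from pathlib import Path

# Expected entries, transcribed from the folder structures in step 1.
TASK_LAYOUT = {
    "IQR_IQM_ICR_ICM_images": [],
    "IQR": ["train", "val"],
    "ICR": ["train", "val"],
    "IQM": ["train", "val"],
    "ICM": ["train", "val"],
}
FLICKR_LAYOUT = {"Flickr30k-images": [], "train": [], "val": [], "test": []}


def missing_entries(root, layout):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    missing = []
    for name, children in layout.items():
        # Check the top-level entry first, then each expected child.
        for rel in [Path(name)] + [Path(name) / child for child in children]:
            if not (root / rel).exists():
                missing.append(rel.as_posix())
    return missing


# "data" is a placeholder root; replace it with the actual download location.
print(missing_entries("data", TASK_LAYOUT))
```

An empty list means every expected entry was found; any paths printed are the ones still to be downloaded or moved into place.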

## Citation
If you find this dataset and code useful for your research, please consider citing:


@inproceedings{xie2023ccmb,
title={CCMB: A Large-scale Chinese Cross-modal Benchmark},
author={Xie, Chunyu and Cai, Heng and Li, Jincheng and Kong, Fanjing and Wu, Xiaoyu and Song, Jianfei and Morimitsu, Henrique and Yao, Lin and Wang, Dexin and Zhang, Xiangzheng and others},
booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
pages={4219--4227},
year={2023}
}