Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval (CVPR 2019)
- Host: GitHub
- URL: https://github.com/yalesong/pvse
- Owner: yalesong
- License: mit
- Created: 2019-06-11T02:26:56.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2024-02-08T20:55:44.000Z (about 1 year ago)
- Last Synced: 2024-02-08T22:39:36.250Z (about 1 year ago)
- Topics: cross-modal-retrieval, metric-learning, mrw-dataset, mscoco-dataset, tgif-dataset
- Language: Python
- Size: 15.9 MB
- Stars: 128
- Watchers: 4
- Forks: 24
- Open Issues: 9
Metadata Files:
- Readme: README.md
- License: LICENSE
- Security: SECURITY.md
# Polysemous Visual-Semantic Embedding (PVSE)
This repository contains a PyTorch implementation of the PVSE network and the MRW dataset proposed in our paper [*Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval* (CVPR 2019)](https://arxiv.org/abs/1906.04402). The code and data are free to use for *academic purposes only*.
Please also visit our [project page](https://yalesong.github.io/pvse/).

### Table of contents
1. [MRW Dataset](#mrw-dataset)
2. [Setting up an environment](#setting-up-an-environment)
3. [Download and prepare data](#download-and-prepare-data)
4. [Evaluate pretrained models](#evaluate-pretrained-models)
5. [Train your own model](#train-your-own-model)

## MRW Dataset
Our *My Reaction When (MRW)* dataset contains `50,107` video-sentence pairs crawled from social media, where videos display physical or emotional reactions to the situations described in the sentences. The subreddit [/r/reactiongifs](https://www.reddit.com/r/reactiongifs/) contains many such examples; the table below shows some representative pairs:
| (a) Physical Reaction | (b) Emotional Reaction | (c) Animal Reaction | (d) Lexical Reaction |
| :-------------------: | :--------------------: | :-----------------: | :------------------: |
| MRW a witty comment I wanted to make was already said | MFW I see a cute girl on Facebook change her status to single | MFW I cant remember if I've locked my front door | MRW a family member askes me why his computer isn't working |
The table below shows descriptive statistics of the dataset. The word vocabulary size is `34,835`. The dataset can be used for evaluating cross-modal retrieval systems under *ambiguous/weak association* between vision and language.
| | Train | Validation | Test | Total |
| :---: | :----: | :--------: | :---: | :---: |
| #pairs | 44,107 | 1,000 | 5,000 | 50,107 |
| Avg. #frames | 104.91 | 209.04 | 209.55 | 117.43 |
| Avg. #words | 11.36 | 15.02 | 14.79 | 11.78 |
| Avg. word frequency | 15.48 | 4.80 | 8.57 | 16.94 |

We provide detailed analysis of the dataset in the [supplementary material](https://arxiv.org/abs/1906.04402) of the main paper.
Follow the instructions [below](#mrw) to download the dataset.
## Setting up an environment
We recommend creating a virtual environment and installing packages there. Note that you must install the Cython package first.
```
python3 -m venv <env_name>
source <env_name>/bin/activate
pip3 install Cython
pip3 install -r requirements.txt
```

## Download and prepare data
### MRW
```
cd data
bash prepare_mrw_dataset.sh
```
This will download the dataset (without videos) in a JSON format, along with a vocabulary file and the train/val/test splits. It will then prompt you with an option:
```
Do you wish to download video data and gulp them? [y/n]
```
We provide two ways to obtain the data. The recommended option is to download pre-compiled data in a [GulpIO](https://github.com/TwentyBN/GulpIO) binary storage format, which contains video frames sampled at 8 FPS. For this, simply hit `n` (this will terminate the script) and download our pre-compiled GulpIO data from [this link (54 GB)](https://drive.google.com/file/d/14hni_L2iN3_zsZ1KDkh21juWYCE0s13j). After the download finishes, extract the tarball under `data/mrw/gulp` to train and/or test our models.

If you wish to download raw video clips and gulp them on your own, hit `y` when prompted with the message above. This will start downloading videos and, once finished, start gulping the video files at 8 FPS (you can change this in [download_gulp_mrw.py](./data/mrw/download_gulp_mrw.py)). If you encounter any problems downloading the video files, you can also download them directly from [this link (19 GB)](https://drive.google.com/file/d/1Gcv4j-b2TkWLgSNF1Ye5zOa-LTG3mBcL) and then continue gulping them using the script [download_gulp_mrw.py](./data/mrw/download_gulp_mrw.py).
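Once the tarball is extracted, you can sanity-check the gulped data by reading a few frames back. The snippet below is a minimal sketch assuming GulpIO's `GulpDirectory` reader interface (`merged_meta_dict` plus id-based indexing) and the `data/mrw/gulp` path used above; it is not part of this repository's data loaders.
```
# Minimal sanity check (not from this repo): read frames back from the gulped MRW data.
from gulpio import GulpDirectory

gulp_dir = GulpDirectory('data/mrw/gulp')          # assumed extraction path; adjust as needed

# merged_meta_dict maps each stored item id to its metadata across all chunks.
item_id = next(iter(gulp_dir.merged_meta_dict))
frames, meta = gulp_dir[item_id]                   # frames: a list of HxWxC numpy arrays
print(item_id, '->', len(frames), 'frames of shape', frames[0].shape)
```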
### TGIF
```
cd data
bash prepare_tgif_dataset.sh
```
This will download the dataset (without videos) in a TSV format, along with a vocabulary file and the train/val/test splits. Please note that we use a slightly modified version of the [TGIF dataset](https://github.com/raingo/TGIF-Release) because of invalid video files; the script will automatically download the modified version.
It will then prompt you with an option:
```
Do you wish to gulp the data? [y/n]
```

Similar to the MRW data, there are two options for obtaining the data: (1) download pre-compiled GulpIO data, or (2) download raw video clips and gulp them on your own. We recommend the first option for an easy start. For this, simply hit `n` and download our pre-compiled GulpIO data from [this link (89 GB)](https://drive.google.com/file/d/1W2iXV9tWxKkYkTzOCvUMTQBkVPg_8mOf). After the download of `tgif-gulp.tar.gz` finishes, extract the tarball under `data/tgif/gulp`.
If you wish to gulp your own dataset, hit `y` and follow the prompt. Note that you must first download a tarball containing the videos before gulping. You can download the file `tgif.tar.gz` (124 GB) from [this link](https://drive.google.com/open?id=0B15H16jpV4w2NHI2QmUxV21JdkE) and place it under `./data/tgif`. Once you have the video data, the script will start gulping the video files.
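Gulping stores each clip as a fixed-rate sequence of frames (8 FPS in the MRW script above). Purely as an illustration of that resampling step, and not code from this repository, a sketch of how source frame indices can be chosen for a fixed target rate might look like this (the function name and defaults are made up for the example):
```
# Illustrative only: pick which source frame indices to keep when resampling a clip
# from its native frame rate down to a fixed target rate such as 8 FPS.
def resample_indices(n_frames, native_fps, target_fps=8.0):
    n_out = int(n_frames * target_fps / native_fps)
    return [min(round(k * native_fps / target_fps), n_frames - 1) for k in range(n_out)]

# A 3-second clip at 30 FPS keeps roughly every 4th frame, 24 frames in total.
print(resample_indices(n_frames=90, native_fps=30.0))
```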
### MS-COCO
```
cd data
bash prepare_coco_dataset.sh
```

## Evaluate pretrained models
Download all six pretrained models in a tarball at [this link](https://drive.google.com/file/d/1bj6-LLx-jGMcS6qKiA23wPoSiXC9toIw). You can also download each model individually using the links below.
| Dataset | Model | Command |
| ------- | ----- | ------- |
| COCO | PVSE (k=1) [[download]](https://drive.google.com/file/d/1NvHhATydJ94r2O48_m72tx028Jy17Mxl) | `python3 eval.py --data_name coco --num_embeds 1 --img_attention --txt_attention --legacy --ckpt ./ckpt/coco_pvse_k1.pth` |
| COCO | PVSE [[download]](https://drive.google.com/file/d/1cFePGpvhs4ypu7q2j1dDbgspBbBcD2z7) | `python3 eval.py --data_name coco --num_embeds 2 --img_attention --txt_attention --legacy --ckpt ./ckpt/coco_pvse.pth` |
| MRW | PVSE (k=1) [[download]](https://drive.google.com/file/d/1gkXfdhupyv10y1pMIsHm9XaanCo2ewUR) | `python3 eval.py --data_name mrw --num_embeds 1 --img_attention --txt_attention --max_video_length 4 --legacy --ckpt ./ckpt/mrw_pvse_k1.pth` |
| MRW | PVSE [[download]](https://drive.google.com/file/d/1k5VO_7OGfy7k9H4xg5GqtfBTMtNflgH-) | `python3 eval.py --data_name mrw --num_embeds 5 --img_attention --txt_attention --max_video_length 4 --legacy --ckpt ./ckpt/mrw_pvse.pth` |
| TGIF | PVSE (k=1) [[download]](https://drive.google.com/file/d/1FGDDrszmdzHn7gnLGbZqYlIbA1lTVymk) | `python3 eval.py --data_name tgif --num_embeds 1 --img_attention --txt_attention --max_video_length 8 --legacy --ckpt ./ckpt/tgif_pvse_k1.pth` |
| TGIF | PVSE [[download]](https://drive.google.com/file/d/1BOhePBouPjdrd8o67avn7drWB5h1_YGI) | `python3 eval.py --data_name tgif --num_embeds 3 --img_attention --txt_attention --max_video_length 8 --legacy --ckpt ./ckpt/tgif_pvse.pth` |

Using the pretrained models, you should be able to reproduce the results in the table below.
| Dataset | Model | Image/Video-to-Text R@1 / R@5 / R@10 / Med r (nMR) | Text-to-Image/Video R@1 / R@5 / R@10 / Med r (nMR) |
| :-----: | :---: | :------------------------------------------------: | :------------------------------------------------: |
| COCO 1K | PVSE (K=1) | 66.72 / 91.00 / 96.22 / 1 (0.00) | 53.49 / 85.14 / 92.70 / 1 (0.00) |
| COCO 1K | PVSE | 69.24 / 91.62 / 96.64 / 1 (0.00) | 55.21 / 86.50 / 93.73 / 1 (0.00) |
| COCO 5K | PVSE (K=1) | 41.72 / 72.96 / 82.90 / 2 (0.00) | 30.64 / 61.37 / 73.62 / 3 (0.00) |
| COCO 5K | PVSE | 45.18 / 74.28 / 84.46 / 2 (0.00) | 32.42 / 62.97 / 74.96 / 3 (0.00)|
| MRW | PVSE (K=1) | 0.16 / 0.68 / 0.90 / 1700 (0.34) | 0.16 / 0.56 / 0.88 / 1650 (0.33) |
| MRW | PVSE | 0.18 / 0.62 / 1.18 / 1624 (0.32) | 0.20 / 0.70 / 1.16 / 1552 (0.31) |
| TGIF | PVSE (K=1) | 2.82 / 9.07 / 14.02 / 128 (0.01) | 2.63 / 9.37 / 14.58 / 115 (0.01) |
| TGIF | PVSE | 3.28 / 9.87 / 15.56 / 115 (0.01) | 3.01 / 9.70 / 14.85 / 109 (0.01) |
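R@K and Med r in the table are standard retrieval metrics computed from the cross-modal similarity matrix, and the nMR value in parentheses is the median rank normalized by the gallery size (e.g., 1700 / 5,000 = 0.34 on MRW). The snippet below is a generic sketch of how such metrics can be computed under the simplifying assumption of one ground-truth match per query; it is not the evaluation code in `eval.py`.
```
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    # sim[i, j]: similarity of query i to gallery item j; ground truth of query i is item i.
    order = np.argsort(-sim, axis=1)                                   # gallery ids, best first
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1)   # 0-based rank of the match
    recalls = {k: 100.0 * np.mean(ranks < k) for k in ks}
    med_r = np.median(ranks) + 1                                       # 1-based median rank
    return recalls, med_r, med_r / sim.shape[1]                        # R@K, Med r, nMR

# Toy example with 5 queries and 5 gallery items.
print(retrieval_metrics(np.random.RandomState(0).rand(5, 5)))
```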
## Train your own model

You can train your own model using [train.py](train.py); check [option.py](option.py) for all available options.
For example, you can train our PVSE model (k=2) on COCO using the command below. It uses ResNet152 as the backbone CNN, GloVe word embeddings, an MMD loss weight of 0.01, a DIV loss weight of 0.1, and a batch size of 256:
```
python3 train.py --data_name coco --cnn_type resnet152 --wemb_type glove --margin 0.1 --max_violation --num_embeds 2 --img_attention --txt_attention --mmd_weight 0.01 --div_weight 0.1 --batch_size 256
```

For video models, you should set the parameter `--max_video_length`; otherwise it defaults to 1 (single frame). Here's an example command:
```
python3 train.py --data_name mrw --max_video_length 4 --cnn_type resnet18 --wemb_type glove --margin 0.1 --num_embeds 4 --img_attention --txt_attention --mmd_weight 0.01 --div_weight 0.1 --batch_size 128
```
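The `--margin` and `--max_violation` options follow the VSE++-style triplet ranking objective (our code builds on the implementation by Faghri et al., as noted below). As a rough sketch of that max-of-hinges loss for a batch of matched embedding pairs, ignoring the multiple-embedding (`--num_embeds`), MMD, and diversity terms, the objective looks like this (this is not the exact loss module in this repository):
```
import torch

def max_hinge_loss(im, txt, margin=0.1):
    # im, txt: L2-normalized embeddings of matched pairs, both of shape (batch, dim).
    scores = im @ txt.t()                                # pairwise cosine similarities
    pos = scores.diag().view(-1, 1)                      # similarity of each matched pair
    cost_txt = (margin + scores - pos).clamp(min=0)      # image/video query, text candidates
    cost_im = (margin + scores - pos.t()).clamp(min=0)   # text query, image/video candidates
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_txt = cost_txt.masked_fill(mask, 0)
    cost_im = cost_im.masked_fill(mask, 0)
    # --max_violation: keep only the hardest negative for each query
    return cost_txt.max(dim=1)[0].sum() + cost_im.max(dim=0)[0].sum()
```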
---

If you use any of the material in this repository, we ask you to cite:
```
@inproceedings{song-pvse-cvpr19,
author = {Yale Song and Mohammad Soleymani},
title = {Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval},
booktitle = {CVPR},
year = 2019
}
```

Our code is based on [the implementation by Faghri et al.](https://github.com/fartashf/vsepp).
## Notes
*Last edit: Tuesday July 16, 2019*