With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning. ICCV 2023
https://github.com/aimagelab/pma-net

# PMA-Net: Prototypical Memory Attention Network (ICCV 2023)



This repository contains the reference code for the paper [With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning](https://arxiv.org/abs/2308.12383).

Please cite with the following BibTeX:
```
@inproceedings{sarto2023positive,
title={{With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning}},
author={Barraco, Manuele and Sarto, Sara and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
year={2023}
}
```



## Environment Setup
Clone the repository and create the `pma-net` conda environment using the `environment.yml` file:
```
conda env create -f environment.yml
conda activate pma-net
```

Note: Python 3.9 is required to run our code.

## Data Preparation
### Checkpoints

Checkpoints trained with cross-entropy (XE) and with self-critical sequence training (SCST) are available at the following links:

| **Model** | **Checkpoint** |
| -------------- | ------------- |
| **PMA-Net XE** | [pma-net_xe.tar](https://ailb-web.ing.unimore.it/publicfiles/pma-net_iccv2023/pma-net_xe.tar) |
| **PMA-Net SCST** | [pma-net_scst.tar](https://ailb-web.ing.unimore.it/publicfiles/pma-net_iccv2023/pma-net_scst.tar) |

Download the archives and extract them into a folder of your choice. The path to this folder is passed later as the `{CHECKPOINT_FOLDER}` argument.
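As a minimal sketch of the download-and-extract step, the helper below fetches one of the two archives from the table above using only the Python standard library. The function names and the destination folder are assumptions made for illustration and are not part of the repository; the internal layout of the archives is not documented here, so inspect the extracted folder before passing it as `{CHECKPOINT_FOLDER}`.

```python
import tarfile
import urllib.request
from pathlib import Path

# Base URL taken from the checkpoint table above.
BASE_URL = "https://ailb-web.ing.unimore.it/publicfiles/pma-net_iccv2023"


def checkpoint_url(variant: str) -> str:
    """Return the download URL of the 'xe' or 'scst' checkpoint archive."""
    if variant not in ("xe", "scst"):
        raise ValueError(f"unknown checkpoint variant: {variant!r}")
    return f"{BASE_URL}/pma-net_{variant}.tar"


def fetch_checkpoint(variant: str, dest: Path) -> Path:
    """Download one checkpoint archive and extract it under `dest`.

    `dest` is a hypothetical folder chosen by the caller.
    """
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / f"pma-net_{variant}.tar"
    urllib.request.urlretrieve(checkpoint_url(variant), archive)
    with tarfile.open(archive) as tar:
        tar.extractall(dest)
    return dest
```

For example, `fetch_checkpoint("scst", Path("checkpoints"))` would place the SCST archive and its extracted contents under a local `checkpoints` folder.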

### Dataset
To run the code, annotations for the COCO dataset are needed.
Please download the zip file containing the annotations ([annotations.zip](https://aimagelab.ing.unimore.it/go/coco_annotations)), extract it, and place the contents under the `datasets/annotations` folder.

To train and test our model, download the tar files containing the COCO image features, pre-extracted with CLIP ViT-L/14, from the following links:

| **Split** | **Features** |
| -------------- | ------------- |
| **COCO Training (chunk 0)** | [coco_training_CLIP-ViT-L14_cached_0.tar](https://ailb-web.ing.unimore.it/publicfiles/pma-net_iccv2023/coco_training_CLIP-ViT-L14_cached_0.tar) |
| **COCO Training (chunk 1)** | [coco_training_CLIP-ViT-L14_cached_1.tar](https://ailb-web.ing.unimore.it/publicfiles/pma-net_iccv2023/coco_training_CLIP-ViT-L14_cached_1.tar) |
| **COCO Training (chunk 2)** | [coco_training_CLIP-ViT-L14_cached_2.tar](https://ailb-web.ing.unimore.it/publicfiles/pma-net_iccv2023/coco_training_CLIP-ViT-L14_cached_2.tar) |
| **COCO Training (chunk 3)** | [coco_training_CLIP-ViT-L14_cached_3.tar](https://ailb-web.ing.unimore.it/publicfiles/pma-net_iccv2023/coco_training_CLIP-ViT-L14_cached_3.tar) |
| **COCO Training (chunk 4)** | [coco_training_CLIP-ViT-L14_cached_4.tar](https://ailb-web.ing.unimore.it/publicfiles/pma-net_iccv2023/coco_training_CLIP-ViT-L14_cached_4.tar) |
| **COCO Training (chunk 5)** | [coco_training_CLIP-ViT-L14_cached_5.tar](https://ailb-web.ing.unimore.it/publicfiles/pma-net_iccv2023/coco_training_CLIP-ViT-L14_cached_5.tar) |
| **COCO Training for SCST** | [coco_training_dict_CLIP-ViT-L14_cached.tar](https://ailb-web.ing.unimore.it/publicfiles/pma-net_iccv2023/coco_training_dict_CLIP-ViT-L14_cached.tar) |
| **COCO Validation** | [coco_validation_dict_CLIP-ViT-L14_cached.tar](https://ailb-web.ing.unimore.it/publicfiles/pma-net_iccv2023/coco_validation_dict_CLIP-ViT-L14_cached.tar) |
| **COCO Test** | [coco_test_dict_CLIP-ViT-L14_cached.tar](https://ailb-web.ing.unimore.it/publicfiles/pma-net_iccv2023/coco_test_dict_CLIP-ViT-L14_cached.tar) |

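Once all files are downloaded and extracted into a single folder, set the path to that folder in `configs/datasets/datasets.json`.

The datasets are then selected by name through the `--train_datasets`, `--validation_datasets`, and `--test_datasets` arguments in the commands below.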
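Before editing the config file, it can be useful to check that every archive from the table above has been downloaded. The snippet below is a small sketch of such a check; `FEATURE_ARCHIVES` and `missing_archives` are illustrative names that do not exist in the repository, and the download folder is hypothetical.

```python
from pathlib import Path

# Archive names copied from the feature table above:
# six training chunks plus the SCST, validation, and test archives.
FEATURE_ARCHIVES = [
    *(f"coco_training_CLIP-ViT-L14_cached_{i}.tar" for i in range(6)),
    "coco_training_dict_CLIP-ViT-L14_cached.tar",
    "coco_validation_dict_CLIP-ViT-L14_cached.tar",
    "coco_test_dict_CLIP-ViT-L14_cached.tar",
]


def missing_archives(download_dir: Path) -> list[str]:
    """Return the archive names from the table that are absent from `download_dir`."""
    return [name for name in FEATURE_ARCHIVES if not (download_dir / name).is_file()]
```

Calling `missing_archives(Path("downloads"))` returns an empty list once all nine archives are in place.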

## Evaluation
To evaluate our best model, use
```
torchrun --nproc_per_node {N_GPUS} --master_port {MASTER_PORT} main.py --do_eval --do_predict --predict_with_generate --output_dir {OUTPUT_DIR} --validation_datasets coco_validation_dict_CLIP-ViT-L14_cached --test_datasets coco_test_dict_CLIP-ViT-L14_cached --evaluation_strategy steps --generation_max_length 30 --generation_num_beams 5 --per_device_eval_batch_size {EVAL_BATCH_SIZE} --kmeans_memory --add_memory_slots_selfattn --n_memory_slots 1024 --deque_iters 1500 --window 0.25 --resume_from_checkpoint {CHECKPOINT_FOLDER}
```
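The command above contains several placeholders. As a hedged illustration of how they might be filled, the sketch below assembles the same command as an argument list in Python. The function name and the concrete values (one GPU, port 29500, batch size 16, local folders) are assumptions chosen for the example; only the flags themselves come from the command above.

```python
import shlex


def build_eval_command(n_gpus, master_port, output_dir, eval_batch_size, checkpoint_folder):
    """Fill the placeholders of the evaluation command and return it as an argv list."""
    return [
        "torchrun", "--nproc_per_node", str(n_gpus), "--master_port", str(master_port),
        "main.py", "--do_eval", "--do_predict", "--predict_with_generate",
        "--output_dir", str(output_dir),
        "--validation_datasets", "coco_validation_dict_CLIP-ViT-L14_cached",
        "--test_datasets", "coco_test_dict_CLIP-ViT-L14_cached",
        "--evaluation_strategy", "steps",
        "--generation_max_length", "30", "--generation_num_beams", "5",
        "--per_device_eval_batch_size", str(eval_batch_size),
        "--kmeans_memory", "--add_memory_slots_selfattn",
        "--n_memory_slots", "1024", "--deque_iters", "1500", "--window", "0.25",
        "--resume_from_checkpoint", str(checkpoint_folder),
    ]


# Hypothetical values for every placeholder.
cmd = build_eval_command(1, 29500, "outputs/eval", 16, "checkpoints/pma-net_scst")
print(shlex.join(cmd))  # shell-ready form of the command
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root, which avoids quoting issues when paths contain spaces.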

## Training Procedure
To train our best model with the parameters used in our experiments, use
```
torchrun --nproc_per_node {N_GPUS} --master_port {MASTER_PORT} main.py --do_train --do_eval --do_predict --predict_with_generate --output_dir {OUTPUT_DIR} --train_datasets coco_CLIP-ViT-L14_cached --validation_datasets coco_validation_dict_CLIP-ViT-L14_cached --test_datasets coco_test_dict_CLIP-ViT-L14_cached --evaluation_strategy steps \
--eval_steps 1000 --save_steps 1000 --max_steps -1 --generation_max_length 30 --generation_num_beams 5 --per_device_train_batch_size {TRAIN_BATCH_SIZE} --per_device_eval_batch_size {EVAL_BATCH_SIZE} --custom_lr_scheduler CustomScheduler --steps_min 15000 --start_decreasing_steps 10000 --learning_rate 2.5e-4 --warmup_steps 1000 --lr_min 1e-5 --gradient_accumulation_steps 8 --deepspeed configs/deepspeed/config_lamb_zero2.json --encoder --kmeans_memory --add_memory_slots_selfattn --n_memory_slots 1024 --deque_iters 1500 --window 0.25
```
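When choosing `{N_GPUS}` and `{TRAIN_BATCH_SIZE}`, note that the number of samples contributing to each optimizer update depends on both, together with `--gradient_accumulation_steps 8`. The short calculation below assumes the flags follow the usual data-parallel semantics (per-device batch size, multiplied by the number of processes and by the accumulation steps); the GPU count and batch size are hypothetical, and the helper is not part of the repository.

```python
def effective_batch_size(n_gpus: int, per_device_batch_size: int, grad_accum_steps: int = 8) -> int:
    """Samples contributing to one optimizer update under data-parallel training."""
    return n_gpus * per_device_batch_size * grad_accum_steps


# Hypothetical setup: 4 GPUs, 16 samples per device, 8 accumulation steps as in the command above.
print(effective_batch_size(4, 16))  # 4 * 16 * 8 = 512
```

If fewer GPUs are available, the per-device batch size or the accumulation steps can be raised to keep this product constant.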

After XE pre-training, for the SCST step use:
```
torchrun --nproc_per_node {N_GPUS} --master_port {MASTER_PORT} main.py --do_train --do_eval --do_predict --predict_with_generate --output_dir {OUTPUT_DIR} --train_datasets coco_training_dict_CLIP-ViT-L14_cached --validation_datasets coco_validation_dict_CLIP-ViT-L14_cached --test_datasets coco_test_dict_CLIP-ViT-L14_cached --evaluation_strategy steps \
--eval_steps 1000 --save_steps 1000 --max_steps -1 --generation_max_length 30 --generation_num_beams 5 --per_device_train_batch_size {TRAIN_BATCH_SIZE} --per_device_eval_batch_size {EVAL_BATCH_SIZE} --steps_min 15000 --learning_rate 5e-6 --gradient_accumulation_steps 8 --deepspeed configs/deepspeed/config_adam_zero2.json --encoder --kmeans_memory --add_memory_slots_selfattn --n_memory_slots 1024 --deque_iters 1500 --window 0.25 --scst --resume_from_checkpoint {CHECKPOINT_FOLDER}
```

## Custom Arguments
The complete list of custom arguments supported by our code:

| Argument | Description |
|------|------|
|`--encoder` | Add a BERT encoder. |
|`--n_layer` | Number of layers. |
|`--n_embd` | Embedding dimension. |
|`--n_head` | Number of attention heads. |
|`--custom_checkpoint_keeper` | Number of checkpoints to keep on disk, default is `5`. |
|`--scst` | Enable the SCST training phase. |
|`--train_datasets` | Training datasets, default is `coco_training`. |
|`--validation_datasets` | Validation datasets, default is `coco_validation_dict`. |
|`--test_datasets` | Test datasets, default is `coco_test_dict`. |
|`--scst_datasets` | SCST datasets, default is `coco_training_dict`. |
|`--custom_lr_scheduler` | Custom scheduler to use (`CustomScheduler` or `TransformerScheduler`), default is `None`. |
|`--lr_multiplier` | Learning rate multiplier, default is `1.0`. |
|`--steps_min` | Only with `CustomScheduler`. |
|`--lr_min` | Only with `CustomScheduler`. |
|`--start_decreasing_steps` | Only with `CustomScheduler`. |
|`--add_memory_slots_selfattn` | Add memory slots in the self-attention blocks. |
|`--n_memory_slots` | Number of memory slots, default is `64`. |
|`--freeze_memory` | Freeze the memories. |
|`--kmeans_memory` | Compute the memories using k-means. |
|`--deque_iters` | Maximum number of iterations of data kept in the deque, default is `10`. |
|`--window` | Overlap window of new data, default is `None`. |