{"id":18631063,"url":"https://github.com/aimagelab/pma-net","last_synced_at":"2025-07-22T19:33:53.732Z","repository":{"id":186340624,"uuid":"674132706","full_name":"aimagelab/PMA-Net","owner":"aimagelab","description":"With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning. ICCV 2023","archived":false,"fork":false,"pushed_at":"2024-06-07T08:52:03.000Z","size":5597,"stargazers_count":17,"open_issues_count":3,"forks_count":2,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-04-11T14:43:16.969Z","etag":null,"topics":["captioning","captioning-images","iccv2023","image-captioning","memory-augmented-neural-networks","transformer","vision-and-language","vision-language"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aimagelab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-08-03T08:06:06.000Z","updated_at":"2025-03-26T06:19:19.000Z","dependencies_parsed_at":null,"dependency_job_id":"8b8bc63a-1949-442f-8105-b69382d970d2","html_url":"https://github.com/aimagelab/PMA-Net","commit_stats":null,"previous_names":["aimagelab/pma-net"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/aimagelab/PMA-Net","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aimagelab%2FPMA-Net","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aimagelab%2FPMA-Net/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aimagelab%2FP
MA-Net/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aimagelab%2FPMA-Net/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aimagelab","download_url":"https://codeload.github.com/aimagelab/PMA-Net/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aimagelab%2FPMA-Net/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266561336,"owners_count":23948627,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-22T02:00:09.085Z","response_time":66,"last_error":null,"robots_txt_status":null,"robots_txt_updated_at":null,"robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["captioning","captioning-images","iccv2023","image-captioning","memory-augmented-neural-networks","transformer","vision-and-language","vision-language"],"created_at":"2024-11-07T05:05:36.514Z","updated_at":"2025-07-22T19:33:53.710Z","avatar_url":"https://github.com/aimagelab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n\u003cdiv align=\"center\"\u003e\n  \u003ch1\u003ePMA-Net: Prototypical Memory Attention Network\u003cbr\u003e(ICCV 2023)\u003c/h1\u003e\n  \n\u003c/div\u003e\n\nThis repository contains the reference code for the paper [With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning](https://arxiv.org/abs/2308.12383).\n\nPlease cite with the following BibTeX:\n```\n@inproceedings{sarto2023positive,\n  
title={{With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning}},\n  author={Barraco, Manuele and Sarto, Sara and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},\n  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},\n  year={2023}\n}\n```\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"model_pma_net.png\" alt=\"PMA-Net\" width=\"820\" /\u003e\n\u003c/p\u003e \n\n## Environment Setup\nClone the repository and create the `pma-net` conda environment using the `environment.yml` file:\n```\nconda env create -f environment.yml\nconda activate pma-net\n```\n\nNote: Python 3.9 is required to run our code. \n\n## Data Preparation\n### Checkpoints\n\nXE and SCST checkpoints are available at the following links:\n\n| **Model**       | **Checkpoint**         |\n| -------------- | -------------      |\n| **PMA-Net XE**  | [pma-net_xe.tar](https://ailb-web.ing.unimore.it/publicfiles/pma-net_iccv2023/pma-net_xe.tar)  |\n| **PMA-Net SCST**  |  [pma-net_scst.tar](https://ailb-web.ing.unimore.it/publicfiles/pma-net_iccv2023/pma-net_scst.tar) |\n\nDownload, extract, and place them in a folder anywhere. 
The path `{CHECKPOINT_FOLDER}` will be passed as an argument later.\n\n### Dataset\nTo run the code, annotations for the COCO dataset are needed.\nPlease download the zip files containing the annotations ([annotations.zip](https://aimagelab.ing.unimore.it/go/coco_annotations)), extract them, and place them under the ```datasets/annotations``` folder.\n\nTo train and test our model, download the tar files containing the COCO image features pre-extracted with CLIP ViT-L/14 at the following links:\n| **Split**       | **Features**         | \n| -------------- | -------------      |\n| **COCO Training (chunk 0)**  | [coco_training_CLIP-ViT-L14_cached_0.tar](https://ailb-web.ing.unimore.it/publicfiles/pma-net_iccv2023/coco_training_CLIP-ViT-L14_cached_0.tar)  |\n| **COCO Training (chunk 1)**  | [coco_training_CLIP-ViT-L14_cached_1.tar](https://ailb-web.ing.unimore.it/publicfiles/pma-net_iccv2023/coco_training_CLIP-ViT-L14_cached_1.tar)  |\n| **COCO Training (chunk 2)**  | [coco_training_CLIP-ViT-L14_cached_2.tar](https://ailb-web.ing.unimore.it/publicfiles/pma-net_iccv2023/coco_training_CLIP-ViT-L14_cached_2.tar)  |\n| **COCO Training (chunk 3)**  | [coco_training_CLIP-ViT-L14_cached_3.tar](https://ailb-web.ing.unimore.it/publicfiles/pma-net_iccv2023/coco_training_CLIP-ViT-L14_cached_3.tar)  |\n| **COCO Training (chunk 4)**  | [coco_training_CLIP-ViT-L14_cached_4.tar](https://ailb-web.ing.unimore.it/publicfiles/pma-net_iccv2023/coco_training_CLIP-ViT-L14_cached_4.tar)  |\n| **COCO Training (chunk 5)**  | [coco_training_CLIP-ViT-L14_cached_5.tar](https://ailb-web.ing.unimore.it/publicfiles/pma-net_iccv2023/coco_training_CLIP-ViT-L14_cached_5.tar)  |\n| **COCO Training for SCST**  |  [coco_training_dict_CLIP-ViT-L14_cached.tar](https://ailb-web.ing.unimore.it/publicfiles/pma-net_iccv2023/coco_training_dict_CLIP-ViT-L14_cached.tar) |\n| **COCO Validation**  |  
[coco_validation_dict_CLIP-ViT-L14_cached.tar](https://ailb-web.ing.unimore.it/publicfiles/pma-net_iccv2023/coco_validation_dict_CLIP-ViT-L14_cached.tar) |\n| **COCO Test**  |  [coco_test_dict_CLIP-ViT-L14_cached.tar](https://ailb-web.ing.unimore.it/publicfiles/pma-net_iccv2023/coco_test_dict_CLIP-ViT-L14_cached.tar) |\n\nOnce the files are downloaded and extracted into a single folder, set the correct paths in ```configs/datasets/datasets.json```. \n\nThese paths will be set as arguments later.\n\n## Evaluation\nTo evaluate our best model, use\n```\ntorchrun --nproc_per_node {N_GPUS} --master_port {MASTER_PORT} main.py --do_eval --do_predict --predict_with_generate --output_dir {OUTPUT_DIR} --validation_datasets coco_validation_dict_CLIP-ViT-L14_cached --test_datasets coco_test_dict_CLIP-ViT-L14_cached --evaluation_strategy steps --generation_max_length 30 --generation_num_beams 5 --per_device_eval_batch_size {EVAL_BATCH_SIZE} --kmeans_memory --add_memory_slots_selfattn --n_memory_slots 1024 --deque_iters 1500 --window 0.25 --resume_from_checkpoint {CHECKPOINT_FOLDER}\n```\n\n## Training Procedure\nTo train our best model with the parameters used in our experiments, use\n```\ntorchrun --nproc_per_node {N_GPUS} --master_port {MASTER_PORT} main.py --do_train --do_eval --do_predict --predict_with_generate --output_dir {OUTPUT_DIR} --train_datasets coco_CLIP-ViT-L14_cached --validation_datasets coco_validation_dict_CLIP-ViT-L14_cached --test_datasets coco_test_dict_CLIP-ViT-L14_cached --evaluation_strategy steps --eval_steps 1000 --save_steps 1000 --max_steps -1 --generation_max_length 30 --generation_num_beams 5 --per_device_train_batch_size {TRAIN_BATCH_SIZE} --per_device_eval_batch_size {EVAL_BATCH_SIZE} --custom_lr_scheduler CustomScheduler --steps_min 15000 --start_decreasing_steps 10000 --learning_rate 2.5e-4 --warmup_steps 1000 --lr_min 1e-5 --gradient_accumulation_steps 8 --deepspeed configs/deepspeed/config_lamb_zero2.json --encoder --kmeans_memory 
--add_memory_slots_selfattn --n_memory_slots 1024 --deque_iters 1500 --window 0.25 \n```\n\nAfter XE pre-training, for the SCST step use:\n```\ntorchrun --nproc_per_node {N_GPUS} --master_port {MASTER_PORT} main.py --do_train --do_eval --do_predict --predict_with_generate --output_dir {OUTPUT_DIR} --train_datasets coco_training_dict_CLIP-ViT-L14_cached --validation_datasets coco_validation_dict_CLIP-ViT-L14_cached --test_datasets coco_test_dict_CLIP-ViT-L14_cached --evaluation_strategy steps --eval_steps 1000 --save_steps 1000 --max_steps -1 --generation_max_length 30 --generation_num_beams 5 --per_device_train_batch_size {TRAIN_BATCH_SIZE} --per_device_eval_batch_size {EVAL_BATCH_SIZE} --steps_min 15000 --learning_rate 5e-6 --gradient_accumulation_steps 8 --deepspeed configs/deepspeed/config_adam_zero2.json --encoder --kmeans_memory --add_memory_slots_selfattn --n_memory_slots 1024 --deque_iters 1500 --window 0.25 --scst --resume_from_checkpoint {CHECKPOINT_FOLDER}\n```\n\n## Custom Arguments\nThe complete list of custom arguments accepted by our code:\n\n| Argument | Description |\n|------|------|\n|`--encoder` | Add a BERT encoder. |\n|`--n_layer` | Number of layers. |\n|`--n_embd` | Embedding dimension. |\n|`--n_head` | Number of attention heads. |\n|`--custom_checkpoint_keeper` | How many checkpoints to keep on disk, default is `5`. |\n|`--scst` | Use SCST phase. |\n|`--train_datasets` | Training datasets, default is `coco_training`. | \n|`--validation_datasets` | Validation datasets, default is `coco_validation_dict`. |\n|`--test_datasets` | Test datasets, default is `coco_test_dict`. |\n|`--scst_datasets` | SCST datasets, default is `coco_training_dict`. |\n|`--custom_lr_scheduler` | Which custom scheduler to use (`CustomScheduler`, `TransformerScheduler`), default is `None`. |\n|`--lr_multiplier` | Learning rate multiplier, default is `1.0`. |\n|`--steps_min` | Only with `CustomScheduler`. |\n|`--lr_min` | Only with `CustomScheduler`. 
|\n|`--start_decreasing_steps` | Only with `CustomScheduler`. |\n|`--add_memory_slots_selfattn` | Add memory slots in the self-attention blocks. |\n|`--n_memory_slots` | How many memory slots, default is `64`. |\n|`--freeze_memory` | Freeze the memories. |\n|`--kmeans_memory` | Compute the memories using k-means. |\n|`--deque_iters` | Max number of iterations data in the deque, default is `10`. |\n|`--window` | Overlap window of new data, default is `None`. |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faimagelab%2Fpma-net","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faimagelab%2Fpma-net","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faimagelab%2Fpma-net/lists"}