An unofficial Torch implementation of J. Lu, C. Xiong, et al., Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning, 2017, with deformable adaptive attention
- Host: GitHub
- URL: https://github.com/dito97/dense-image-captioning
- Owner: DiTo97
- License: MIT
- Created: 2021-06-14T20:57:23.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2023-07-24T12:13:11.000Z (about 2 years ago)
- Last Synced: 2025-04-03T12:56:50.262Z (6 months ago)
- Topics: attention, image-captioning, torch-2
- Language: Jupyter Notebook
- Homepage:
- Size: 3.72 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# dense image captioning
An unofficial Torch implementation of [J. Lu, C. Xiong, et al., *Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning*, 2017](https://arxiv.org/abs/1612.01887), trained on the COCO image captioning and Flickr30k datasets.
The implementation introduces the following variations with respect to the paper:
- deformable adaptive attention;
- larger visual sentinel size (128-dim);
- model eval against the [SPICE](https://panderson.me/spice/) metric;
- [MCTS-based decoding](https://arxiv.org/pdf/2104.05336.pdf).

## Introduction
Dense image captioning plays an important role in enabling visual-language understanding of the surrounding world.
In this project we propose a deformable variant of the adaptive attention with a visual sentinel introduced in the reference paper for estimating grounding probabilities. The variant allows larger networks to be constructed while running at a faster inference speed and training for almost half the epochs with comparable performance.
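For reference, below is a minimal PyTorch sketch of the (non-deformable) adaptive attention with a visual sentinel from the reference paper, which produces the grounding probability mentioned above. Layer names and dimensions are illustrative, the sentinel is assumed to share the feature dimension of the spatial features, and the deformable sampling used in this repository is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveAttention(nn.Module):
    """Adaptive attention with a visual sentinel (Lu et al., 2017).

    Minimal sketch: dimensions and layer names are illustrative, the sentinel
    s_t is assumed to share the feature dimension of V, and the deformable
    sampling used in this repository is not shown.
    """

    def __init__(self, dim: int, att_dim: int = 128):
        super().__init__()
        self.w_v = nn.Linear(dim, att_dim)   # projects the spatial features V
        self.w_s = nn.Linear(dim, att_dim)   # projects the sentinel s_t
        self.w_h = nn.Linear(dim, att_dim)   # projects the decoder state h_t
        self.w_a = nn.Linear(att_dim, 1)     # scores every candidate

    def forward(self, V, h_t, s_t):
        # V: (B, k, dim) spatial features; h_t, s_t: (B, dim)
        h_proj = self.w_h(h_t).unsqueeze(1)                                    # (B, 1, att_dim)
        z = self.w_a(torch.tanh(self.w_v(V) + h_proj)).squeeze(-1)             # (B, k)
        z_s = self.w_a(torch.tanh(self.w_s(s_t) + self.w_h(h_t))).squeeze(-1)  # (B,)

        # softmax over the k regions plus the sentinel; the last entry is the
        # grounding probability beta_t of "not looking" at the image
        alpha_hat = F.softmax(torch.cat([z, z_s.unsqueeze(1)], dim=1), dim=1)  # (B, k + 1)
        beta = alpha_hat[:, -1:]                                               # (B, 1)

        c_t = (alpha_hat[:, :-1].unsqueeze(-1) * V).sum(dim=1)                 # visual context
        c_hat = beta * s_t + (1.0 - beta) * c_t                                # adaptive context
        return c_hat, alpha_hat
```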
This project is part of a larger effort to develop visual-language aid tools for visually-impaired people, combining speech recognition, speech synthesis, image captioning and familiar person identification.

For more information, see the attached in-depth [report](report/F.%20Minutoli,%20G.%20Losapio,%20et%20al.%20-%20Improving%20Daily%20Interactions%20of%20Visually-impaired%20People.pdf).
## Training
The model was trained for 50 epochs on a multi-GPU HPC cluster courtesy of [CERN](https://abpcomputing.web.cern.ch/computing_resources/hpc_cern/).
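The cluster scripts are not shipped with the repository; the following is only a hedged sketch of how multi-GPU training for 50 epochs could be launched with `torchrun` and `DistributedDataParallel`, with a placeholder model and batch standing in for the captioner and the COCO/Flickr30k loaders.

```python
"""Minimal multi-GPU training sketch (assumed launch command):

    torchrun --nproc_per_node=<num_gpus> train_ddp.py

The script name, model and data below are placeholders, not this repository's code.
"""
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main(epochs: int = 50):
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    model = nn.Linear(512, 512).to(device)          # placeholder for the captioning model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    for _ in range(epochs):
        x = torch.randn(32, 512, device=device)     # placeholder batch
        loss = model(x).pow(2).mean()               # placeholder loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```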
## Usage
The following files must be downloaded from Google Drive:
- [preprocessing.zip](https://drive.google.com/file/d/1njpdzE1BHHrtC7CHt-WLe7V2w7e919wj/view?usp=sharing)
- [adaptive.pkl](https://drive.google.com/file/d/1g0HfjOmJA4Eh2m88O2sElPaDUm2OJi-q/view?usp=sharing)

The former contains the dataset with COCO-like annotations and the corresponding vocabulary.
The following files should be downloaded from Google Drive for display purposes:
- [eval-loss.pkl](https://drive.google.com/file/d/17Z9jpqp_B_TLzLa0MOQ8u4MqgcOROsMm/view?usp=sharing)
- [eval-metrics.pkl](https://drive.google.com/file/d/1CzkKbW-ZQM3cxkFCWLd3rE4U9rQD30J9/view?usp=sharing)
- [visual-grounding-probas.pkl](https://drive.google.com/file/d/1PU7eSV_M7Z56PzFhX4aIitKNtu6TNS0b/view?usp=sharing)

**N.B.:** If the provided links are no longer available, contact the authors.
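A hedged sketch for inspecting the downloaded pickles is shown below; the internal structure assumed here (a sequence of per-epoch losses in `eval-loss.pkl` and a metric-name-to-score mapping in `eval-metrics.pkl`) is an assumption, so adjust to the actual contents.

```python
import pickle

import matplotlib.pyplot as plt

with open("eval-loss.pkl", "rb") as f:
    eval_loss = pickle.load(f)      # assumed: a sequence of per-epoch losses

with open("eval-metrics.pkl", "rb") as f:
    eval_metrics = pickle.load(f)   # assumed: dict of metric name -> scores (e.g. SPICE)

plt.plot(eval_loss)
plt.xlabel("epoch")
plt.ylabel("evaluation loss")
plt.show()

print(eval_metrics)
```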
## Authors
- [@DiTo97](https://github.com/DiTo97)
- [@arcadeghira](https://github.com/arcadeghira)