# Image Captioning via Dynamic Path Customization

## Introduction
The official repository for “Image Captioning via Dynamic Path Customization”.

The Dynamic Transformer Network (DTNet) generates discriminative yet accurate captions by dynamically assigning customized paths to different samples.
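To illustrate the idea of sample-dependent path customization, here is a minimal, hypothetical sketch (not code from this repository): a small router scores several candidate cells per sample and mixes their outputs with softmax weights, so different inputs effectively take different paths. The class name `RoutedCell`, the use of plain `nn.Linear` cells, and the soft (rather than hard) routing are all simplifying assumptions for illustration.

```python
import torch
import torch.nn as nn

class RoutedCell(nn.Module):
    """Hypothetical sketch of sample-dependent routing: a router
    weights several candidate cells per input sample."""
    def __init__(self, dim, num_cells=3):
        super().__init__()
        # Candidate cells; DTNet uses richer spatial/channel cells.
        self.cells = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_cells)])
        self.router = nn.Linear(dim, num_cells)  # per-sample gate

    def forward(self, x):  # x: (batch, dim)
        weights = torch.softmax(self.router(x), dim=-1)         # (batch, num_cells)
        outs = torch.stack([c(x) for c in self.cells], dim=-1)  # (batch, dim, num_cells)
        return (outs * weights.unsqueeze(1)).sum(-1)            # weighted path mix

x = torch.randn(4, 16)
y = RoutedCell(16)(x)
print(tuple(y.shape))  # (4, 16)
```

Each sample's routing weights depend on its own features, which is what lets the network customize the computation path per input.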




*The framework of the proposed Dynamic Transformer Network (DTNet).*




*The detailed architectures of the different cells in the spatial and channel routing spaces.*

## News

- 2023.09.28: Released code

## Environment setup

Please refer to [meshed-memory-transformer](https://github.com/aimagelab/meshed-memory-transformer) for environment setup.

## Data preparation
* **Annotation**. Download the annotation file [annotation.zip](https://drive.google.com/file/d/1i8mqKFKhqvBr8kEp3DbIh9-9UNAfKGmE/view?usp=sharing). Extract it and put it in the project root directory.
* **Feature**. Download our ResNeXt-101 features (an .hdf5 file) [here](https://pan.baidu.com/s/1xVZO7t8k4H_l3aEyuA-KXQ). Access code: jcj6.
* **Evaluation**. Download the evaluation tools [here](https://pan.baidu.com/s/1xVZO7t8k4H_l3aEyuA-KXQ). Access code: jcj6. Extract them and put them in the project root directory.

Our .hdf5 file contains five kinds of keys:
* `['%d_features' % image_id]`: region features (N_regions, feature_dim)
* `['%d_boxes' % image_id]`: bounding box of region features (N_regions, 4)
* `['%d_size' % image_id]`: size of original image (for normalizing bounding box), (2,)
* `['%d_grids' % image_id]`: grid features (N_grids, feature_dim)
* `['%d_mask' % image_id]`: geometric alignment graph, (N_regions, N_grids)
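As a quick illustration of the key layout above, the following sketch writes a tiny dummy file with the five key patterns and reads one image's entries back. The file name `demo_feats.hdf5`, the image id, and all array sizes are placeholder values, not values from the released feature file.

```python
import h5py
import numpy as np

# Placeholder sizes for illustration only.
image_id, n_regions, n_grids, dim = 9, 3, 4, 8

# Write a dummy file with the five key patterns described above.
with h5py.File("demo_feats.hdf5", "w") as f:
    f["%d_features" % image_id] = np.zeros((n_regions, dim))  # region features
    f["%d_boxes" % image_id] = np.zeros((n_regions, 4))       # bounding boxes
    f["%d_size" % image_id] = np.array([640, 480])            # original image size
    f["%d_grids" % image_id] = np.zeros((n_grids, dim))       # grid features
    f["%d_mask" % image_id] = np.zeros((n_regions, n_grids))  # alignment graph

# Read one image's entries back.
with h5py.File("demo_feats.hdf5", "r") as f:
    regions = f["%d_features" % image_id][()]
    mask = f["%d_mask" % image_id][()]
print(regions.shape, mask.shape)  # (3, 8) (3, 4)
```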

Feature extraction can follow the procedure described [here](https://github.com/luo3300612/image-captioning-DLCT/tree/main).

## Training
```shell
python train.py --exp_name DTNet --batch_size 50 --rl_batch_size 100 --workers 4 --head 8 --warmup 10000 --features_path /home/data/coco_grid_feats2.hdf5 --annotation /home/data/m2_annotations --logs_folder tensorboard_logs
```
## Evaluation
```shell
python eval.py --batch_size 50 --exp_name DTNet --features_path /home/data/coco_grid_feats2.hdf5 --annotation /home/data/m2_annotations --ckpt_path your_model_path
```

## Performance




*Comparisons with SOTAs on the Karpathy test split.*

## Qualitative Results




*Examples of captions generated by the Transformer and DTNet.*




*Images and the corresponding number of passed cells.*




*Path visualization.*

## Acknowledgements
- Thanks to [meshed-memory-transformer](https://github.com/aimagelab/meshed-memory-transformer).
- Thanks to the amazing work of [grid-feats-vqa](https://github.com/facebookresearch/grid-feats-vqa).

## Citations
```bibtex
@ARTICLE{ma2024image,
author={Ma, Yiwei and Ji, Jiayi and Sun, Xiaoshuai and Zhou, Yiyi and Hong, Xiaopeng and Wu, Yongjian and Ji, Rongrong},
journal={IEEE Transactions on Neural Networks and Learning Systems},
title={Image Captioning via Dynamic Path Customization},
year={2024},
volume={},
number={},
pages={1-15},
keywords={Routing;Visualization;Transformers;Adaptation models;Task analysis;Feature extraction;Semantics;Dynamic network;image captioning;input-sensitive;transformer},
doi={10.1109/TNNLS.2024.3409354}}
```