https://github.com/tuanbeba/img_captions
Image caption generation based on an encoder-decoder architecture, trained on the Flickr8k dataset. Beam search is used for inference.
- Host: GitHub
- URL: https://github.com/tuanbeba/img_captions
- Owner: tuanbeba
- Created: 2024-11-24T15:05:39.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-11-24T15:46:15.000Z (6 months ago)
- Last Synced: 2025-02-03T21:47:22.283Z (4 months ago)
- Topics: beamer, encoder-decoder-architecture, image-caption-generator, mobilenetv3, pytorch, tranformers
- Language: Jupyter Notebook
- Homepage:
- Size: 2.73 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
## Image Caption Generator With Beam Search
The key components of the project:
- The model uses an encoder-decoder architecture: mobilenet_v3_small as the encoder and a TransformerDecoder as the decoder (see the sketch after this list).
- Preprocessing captions and building the tokenizer.
- Model training and evaluation.
- Greedy search and beam search algorithms for inference.
- Saving model weights and visualizing the results.
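Below is a minimal sketch of such an encoder-decoder captioning model. The hyperparameters (d_model, nhead, num_layers) and layer choices are illustrative assumptions; the repository's notebook may differ in detail.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_small

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, d_model=256, nhead=8, num_layers=3):
        super().__init__()
        # Encoder: MobileNetV3-Small backbone with its classifier head removed.
        self.encoder = mobilenet_v3_small(weights="DEFAULT").features
        self.proj = nn.Linear(576, d_model)        # 576 = backbone output channels
        # Decoder: a stack of standard TransformerDecoder layers.
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images)               # (B, 576, H', W')
        memory = self.proj(feats.flatten(2).transpose(1, 2))  # (B, H'*W', d_model)
        tgt = self.embed(captions)                 # (B, T, d_model)
        T = captions.size(1)                       # causal mask over target tokens
        mask = torch.triu(torch.full((T, T), float("-inf"),
                                     device=captions.device), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.fc(out)                        # (B, T, vocab_size)
```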
## About the dataset
This repository uses the [Flickr8k](https://www.kaggle.com/datasets/deekshithabandam/dataset-flick8k) dataset and the PyTorch framework. The dataset files are organized as follows:
- flickr8k
- images
- image files
  - captions.txt
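For reference, a hypothetical loader for this layout, assuming captions.txt is a CSV with image,caption columns as in the common Kaggle distribution of Flickr8k:

```python
import csv
from collections import defaultdict

def load_captions(path="flickr8k/captions.txt"):
    """Map each image filename to its list of reference captions."""
    captions = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)       # assumed header: image,caption
        for row in reader:
            captions[row["image"]].append(row["caption"].strip())
    return captions

captions = load_captions()
print(len(captions), "images,", sum(len(v) for v in captions.values()), "captions")
```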
## Inference
You can download the pre-trained [best_model.pt](https://drive.google.com/file/d/1Z6v04NykpclrC_RVOmVHRMzhhHZCg7Bb/view?usp=sharing) weights and the encoded image features [feature_extractor.pkl](https://drive.google.com/file/d/1-5dSmo62OEkeJTFUVCR0bydPSOdUv_H7/view?usp=sharing).
You have to change the config.root path to your workspace path.
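A minimal sketch of loading the downloaded artifacts; the checkpoint structure and the contents of feature_extractor.pkl are assumptions here, so adjust to match the notebook:

```python
import pickle
import torch

root = "/path/to/your/workspace"   # set config.root to this path

# Trained weights (whether this file holds a bare state_dict or a full
# checkpoint dict is an assumption; adapt the loading accordingly).
state = torch.load(f"{root}/best_model.pt", map_location="cpu")

# Pre-computed image features, assumed to be keyed by image filename.
with open(f"{root}/feature_extractor.pkl", "rb") as f:
    features = pickle.load(f)
```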
## Beam search algorithm
Beam search generates better captions by keeping multiple candidate sequences at each decoding step, rather than greedily selecting the single highest-scoring word. The example below demonstrates how a beam width (k) of 3 results in better captions.
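A minimal beam search sketch with beam width k, assuming a step function that returns log-probabilities over the vocabulary for the next token; step_fn, bos_id, and eos_id are illustrative names, not the repository's API:

```python
import torch

def beam_search(step_fn, bos_id, eos_id, k=3, max_len=30):
    """Return the highest-scoring token sequence under a beam of width k."""
    beams = [([bos_id], 0.0)]                        # (token sequence, log-prob sum)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:                    # finished beams pass through
                candidates.append((seq, score))
                continue
            log_probs = step_fn(torch.tensor([seq])) # (1, vocab_size)
            top_lp, top_ids = log_probs[0].topk(k)
            for lp, tok in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((seq + [tok], score + lp))
        # Keep only the k highest-scoring candidates for the next step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(seq[-1] == eos_id for seq, _ in beams):
            break
    return beams[0][0]
```

Setting k = 1 recovers greedy search, so the same routine covers both inference modes listed above.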
