https://github.com/tuanbeba/img_captions
Image caption generation based on an encoder-decoder architecture, trained on the Flickr8k dataset. Beam search is used for inference.
- Host: GitHub
- URL: https://github.com/tuanbeba/img_captions
- Owner: tuanbeba
- Created: 2024-11-24T15:05:39.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-11-24T15:46:15.000Z (6 months ago)
- Last Synced: 2025-02-03T21:47:22.283Z (4 months ago)
- Topics: beamer, encoder-decoder-architecture, image-caption-generator, mobilenetv3, pytorch, tranformers
- Language: Jupyter Notebook
- Homepage:
- Size: 2.73 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
## Image Caption Generator With Beam Search
The key components of the project:
- The model uses an encoder-decoder architecture: mobilenet_v3_small as the encoder and a TransformerDecoder as the decoder (see the sketch after this list).
- Preprocessing captions and building the tokenizer.
- Model training and evaluation.
- Greedy search and beam search algorithms for inference.
- Saving model weights and visualizing the results.
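Below is a minimal sketch of such an encoder-decoder captioning model. The hyperparameters (d_model, nhead, num_layers) and layer choices are illustrative assumptions; the repository's notebook may differ in detail.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_small

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, d_model=256, nhead=8, num_layers=3):
        super().__init__()
        # Encoder: MobileNetV3-Small backbone with its classifier head removed.
        self.encoder = mobilenet_v3_small(weights="DEFAULT").features
        self.proj = nn.Linear(576, d_model)        # 576 = backbone output channels
        # Decoder: a stack of standard TransformerDecoder layers.
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images)               # (B, 576, H', W')
        memory = self.proj(feats.flatten(2).transpose(1, 2))  # (B, H'*W', d_model)
        tgt = self.embed(captions)                 # (B, T, d_model)
        T = captions.size(1)                       # causal mask over target tokens
        mask = torch.triu(torch.full((T, T), float("-inf"),
                                     device=captions.device), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.fc(out)                        # (B, T, vocab_size)
```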
## About the dataset
This repository uses the [Flickr8k](https://www.kaggle.com/datasets/deekshithabandam/dataset-flick8k) dataset and the PyTorch framework. The dataset files are organized as follows:
- flickr8k
- images
- image files
  - captions.txt
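For reference, a hypothetical loader for this layout, assuming captions.txt is a CSV with image,caption columns as in the common Kaggle distribution of Flickr8k:

```python
import csv
from collections import defaultdict

def load_captions(path="flickr8k/captions.txt"):
    """Map each image filename to its list of reference captions."""
    captions = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)       # assumed header: image,caption
        for row in reader:
            captions[row["image"]].append(row["caption"].strip())
    return captions

captions = load_captions()
print(len(captions), "images,", sum(len(v) for v in captions.values()), "captions")
```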
## Inference
You can download the pre-trained [best_model.pt](https://drive.google.com/file/d/1Z6v04NykpclrC_RVOmVHRMzhhHZCg7Bb/view?usp=sharing) weights and the encoded image features [feature_extractor.pkl](https://drive.google.com/file/d/1-5dSmo62OEkeJTFUVCR0bydPSOdUv_H7/view?usp=sharing).
You have to change the config.root path to your workspace path.
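A minimal sketch of loading the downloaded artifacts; the checkpoint structure and the contents of feature_extractor.pkl are assumptions here, so adjust to match the notebook:

```python
import pickle
import torch

root = "/path/to/your/workspace"   # set config.root to this path

# Trained weights (whether this file holds a bare state_dict or a full
# checkpoint dict is an assumption; adapt the loading accordingly).
state = torch.load(f"{root}/best_model.pt", map_location="cpu")

# Pre-computed image features, assumed to be keyed by image filename.
with open(f"{root}/feature_extractor.pkl", "rb") as f:
    features = pickle.load(f)
```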
## Beam search algorithm
Beam search generates better captions by keeping multiple candidate sequences at each decoding step, rather than greedily selecting the single highest-scoring word. The example below demonstrates how a beam width (k) of 3 results in better captions.
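A minimal beam search sketch with beam width k, assuming a step function that returns log-probabilities over the vocabulary for the next token; step_fn, bos_id, and eos_id are illustrative names, not the repository's API:

```python
import torch

def beam_search(step_fn, bos_id, eos_id, k=3, max_len=30):
    """Return the highest-scoring token sequence under a beam of width k."""
    beams = [([bos_id], 0.0)]                        # (token sequence, log-prob sum)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:                    # finished beams pass through
                candidates.append((seq, score))
                continue
            log_probs = step_fn(torch.tensor([seq])) # (1, vocab_size)
            top_lp, top_ids = log_probs[0].topk(k)
            for lp, tok in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((seq + [tok], score + lp))
        # Keep only the k highest-scoring candidates for the next step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(seq[-1] == eos_id for seq, _ in beams):
            break
    return beams[0][0]
```

Setting k = 1 recovers greedy search, so the same routine covers both inference modes listed above.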
