https://github.com/ramyacp14/image-caption-generator
Developed an image captioning system using the BLIP model to generate detailed, context-aware captions. Achieved an average BLEU score of 0.72, providing rich descriptions that enhance accessibility and inclusivity.
- Host: GitHub
- URL: https://github.com/ramyacp14/image-caption-generator
- Owner: ramyacp14
- Created: 2024-06-10T20:55:29.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-09-06T20:30:57.000Z (about 1 year ago)
- Last Synced: 2025-01-13T08:46:26.740Z (9 months ago)
- Topics: blip, cnn-rnn, coco-dataset, imagenet, machine-learning, tensorflow, vision-transformer, vit-gpt2
- Language: Jupyter Notebook
- Size: 59.6 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# EqualEyes: Image Caption Generator
## Introduction
EqualEyes is an advanced image caption generator that aims to push the boundaries of image captioning technology. By combining recent advances in image recognition and language modeling, our system generates novel, descriptive captions that go beyond simply naming objects and actions. The goal is to create rich, detailed, and natural descriptions of photographs, making them more accessible and meaningful for all users.

### Key Features
- Generates detailed and contextual image captions
- Utilizes state-of-the-art image recognition and language modeling techniques
- Trained on diverse image datasets to ensure broad generalization
- Focuses on inclusivity and accessibility in image description

### Target Audience
- Individuals with visual impairments (especially color-blind people)
- Social media users and content creators
- Researchers analyzing image datasets
- Developers working on image recognition and understanding applications
- Educators and students for early literacy and language learning

## Data & Methods
### Datasets
1. COCO Dataset 2017
- 118k images in train set, 40k in test set
- 56k train and 14k validation images used for our models (see the caption-loading sketch after this list)
- 1.5 million object instances across 12 super categories and 80 sub-categories
2. ImageNet
- 1.2 million training images
- 1,000 object classes
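The project's preprocessing notebooks are not reproduced in this summary. As one way to assemble caption subsets like the 56k/14k split described above, the pycocotools caption API can be used; this is a minimal sketch, and the annotation path and random subsample are assumptions rather than the project's actual pipeline.

```python
# Minimal sketch: loading COCO 2017 reference captions with pycocotools.
# The annotation path and the 56k-image subsample are assumptions based on
# the split sizes mentioned in this README.
import random
from pycocotools.coco import COCO

coco_caps = COCO("annotations/captions_train2017.json")  # hypothetical local path

img_ids = coco_caps.getImgIds()
train_ids = random.sample(img_ids, 56_000)  # subsample as described above

def captions_for(img_id):
    """Return the list of reference caption strings for one image (~5 per image)."""
    ann_ids = coco_caps.getAnnIds(imgIds=img_id)
    return [ann["caption"] for ann in coco_caps.loadAnns(ann_ids)]

print(captions_for(train_ids[0]))
```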
### Models Developed
1. Basic CNN-RNN model (see the architecture sketch after this list)
2. CNN-RNN with Hyper-parameter tuning
3. Vision Transformer: ViT-GPT2
4. BLIP: Bootstrapping Language-Image Pre-training
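The README names a basic CNN-RNN model but does not show its architecture. Below is a minimal Keras sketch of a common CNN-feature + LSTM "merge" captioner, assuming TensorFlow as listed in the repo topics; the feature dimension, vocabulary size, layer widths, and caption length are illustrative assumptions, not the project's actual configuration.

```python
# Minimal sketch of a CNN-RNN captioner: pretrained CNN features merged with
# an LSTM over previous tokens to predict the next word. All sizes are assumed.
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 10_000   # assumed vocabulary size
EMBED_DIM = 256       # assumed embedding size
UNITS = 512           # assumed LSTM width
MAX_LEN = 40          # assumed max caption length

# Image branch: pooled CNN features (e.g. 2048-d from a pretrained backbone),
# projected into the decoder's hidden space.
image_features = layers.Input(shape=(2048,), name="cnn_features")
img_proj = layers.Dense(UNITS, activation="relu")(image_features)

# Text branch: previously generated caption tokens fed to an LSTM.
caption_in = layers.Input(shape=(MAX_LEN,), name="caption_tokens")
tok_embed = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(caption_in)
lstm_out = layers.LSTM(UNITS)(tok_embed)

# Merge the two representations and predict the next word.
merged = layers.add([img_proj, lstm_out])
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(merged)

model = Model(inputs=[image_features, caption_in], outputs=next_word)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.summary()
```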
## Results
- Model 1 (Basic CNN-RNN): Loss of 2.61
- Model 2 (Optimized CNN-RNN): Loss of 2.31
- Model 3 (Vision Transformer GPT-2): Loss of 0.376
- Model 4 (BLIP): Loss of 0.0062

The final application is built on the BLIP model, which demonstrated the best performance.
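The application code itself is not reproduced in this summary. One common way to run BLIP captioning is through Hugging Face transformers, as in the sketch below; the checkpoint name and image path are assumptions, and the project may use a different or fine-tuned BLIP variant.

```python
# Minimal sketch of BLIP caption generation via Hugging Face transformers.
# The checkpoint name and image path are assumptions, not the project's setup.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("example.jpg").convert("RGB")  # hypothetical input image
inputs = processor(images=image, return_tensors="pt")

# Beam search usually yields more fluent captions than greedy decoding.
output_ids = model.generate(**inputs, max_new_tokens=40, num_beams=3)
caption = processor.decode(output_ids[0], skip_special_tokens=True)
print(caption)
```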
### BLEU Score
Average BLEU score: 0.72, indicating a high n-gram overlap between generated and reference captions.
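The README does not state how the 0.72 average was computed (BLEU variant, n-gram weights, or tokenization). The sketch below shows one conventional corpus-level BLEU-4 computation with NLTK; the weights, smoothing, and toy data are assumptions.

```python
# Minimal sketch of corpus-level BLEU between generated and reference captions
# using NLTK. The exact BLEU settings behind the reported 0.72 are not given
# in the README; BLEU-4 weights and method1 smoothing are assumptions.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [
    # one list of tokenized reference captions per image (toy example)
    [["a", "dog", "runs", "on", "the", "beach"],
     ["a", "brown", "dog", "running", "along", "the", "shore"]],
]
hypotheses = [
    ["a", "dog", "running", "on", "a", "beach"],  # tokenized model output
]

smooth = SmoothingFunction().method1  # avoids zero scores for missing n-grams
score = corpus_bleu(references, hypotheses,
                    weights=(0.25, 0.25, 0.25, 0.25),  # BLEU-4, assumed
                    smoothing_function=smooth)
print(f"Corpus BLEU: {score:.2f}")
```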
## Limitations
- Computational resource constraints
- GPU compatibility issues with TensorFlow versions
- Limited training on object categories (currently 80 from COCO dataset)
- Time constraints for model training and optimization

## Future Work
- Resolve GPU detection and utilization issues
- Expand object category training (aim for 300+ categories)
- Incorporate advanced vocabulary training
- Add support for multiple languages (Italian, Spanish, Dutch)
- Develop specialized versions for different stakeholders (e.g., botanists)
- Create an interactive learning experience for students