{"id":49451706,"url":"https://github.com/dabasajay/Image-Caption-Generator","last_synced_at":"2026-05-16T15:00:42.851Z","repository":{"id":37666476,"uuid":"144290149","full_name":"dabasajay/Image-Caption-Generator","owner":"dabasajay","description":"A neural network to generate captions for an image using CNN and RNN with BEAM Search.","archived":false,"fork":false,"pushed_at":"2020-10-01T07:13:57.000Z","size":2518,"stargazers_count":247,"open_issues_count":15,"forks_count":76,"subscribers_count":6,"default_branch":"master","last_synced_at":"2023-11-07T17:36:58.910Z","etag":null,"topics":["attention","attention-mechanism","attention-model","beam-search","bleu","bleu-score","caption-generation","captioning-images","cnn-keras","convolutional-neural-networks","deep-learning","flickr-8k","flickr-dataset","image-caption","image-captioning","inception-v3","inceptionv3","lstm","recurrent-neural-networks","vgg16"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dabasajay.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-08-10T13:33:49.000Z","updated_at":"2023-11-01T12:46:21.000Z","dependencies_parsed_at":"2022-09-02T06:35:22.623Z","dependency_job_id":null,"html_url":"https://github.com/dabasajay/Image-Caption-Generator","commit_stats":null,"previous_names":[],"tags_count":0,"template":null,"template_full_name":null,"purl":"pkg:github/dabasajay/Image-Caption-Generator","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dabasajay%2FImage-Caption-Generator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dabasajay%2FImage-Ca
ption-Generator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dabasajay%2FImage-Caption-Generator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dabasajay%2FImage-Caption-Generator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dabasajay","download_url":"https://codeload.github.com/dabasajay/Image-Caption-Generator/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dabasajay%2FImage-Caption-Generator/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33107564,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-16T04:41:52.686Z","status":"ssl_error","status_checked_at":"2026-05-16T04:41:52.009Z","response_time":115,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["attention","attention-mechanism","attention-model","beam-search","bleu","bleu-score","caption-generation","captioning-images","cnn-keras","convolutional-neural-networks","deep-learning","flickr-8k","flickr-dataset","image-caption","image-captioning","inception-v3","inceptionv3","lstm","recurrent-neural-networks","vgg16"],"created_at":"2026-04-30T03:00:32.663Z","updated_at":"2026-05-16T15:00:42.839Z","avatar_url":"https://github.com/dabasajay.png","language":"Python","funding_links":[],"categories":["Image Generation \u0026 Editing"],"sub_categories":[],"readme":"## Image Caption Generator\n\n[![Issues](https://img.shields.io/github/issues/dabasajay/Image-Caption-Generator.svg?color=%231155cc)](https://github.com/dabasajay/Image-Caption-Generator/issues)\n[![Forks](https://img.shields.io/github/forks/dabasajay/Image-Caption-Generator.svg?color=%231155cc)](https://github.com/dabasajay/Image-Caption-Generator/network)\n[![Stars](https://img.shields.io/github/stars/dabasajay/Image-Caption-Generator.svg?color=%231155cc)](https://github.com/dabasajay/Image-Caption-Generator/stargazers)\n[![Ajay Dabas](https://img.shields.io/badge/Ajay-Dabas-825ee4.svg)](https://dabasajay.github.io/)\n\nA neural network to generate captions for an image using CNN and RNN with BEAM Search.\n\n\u003cp align=\"center\"\u003e\n  \u003cstrong\u003eExamples\u003c/strong\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://cdn-images-1.medium.com/max/1600/1*6BFOIdSHlk24Z3DFEakvnQ.png\" 
width=\"85%\" title=\"Example of Image Captioning\" alt=\"Example of Image Captioning\"\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n\tImage Credits : \u003ca href=\"https://towardsdatascience.com/image-captioning-in-deep-learning-9cd23fb4d8d2\"\u003eTowardsdatascience\u003c/a\u003e\n\u003c/p\u003e\n\n## Table of Contents\n\n1. [Requirements](#1-requirements)\n2. [Training parameters and results](#2-training-parameters-and-results)\n3. [Generated Captions on Test Images](#3-generated-captions-on-test-images)\n4. [Procedure to Train Model](#4-procedure-to-train-model)\n5. [Procedure to Test on new images](#5-procedure-to-test-on-new-images)\n6. [Configurations (config.py)](#6-configurations-configpy)\n7. [Frequently encountered problems](#7-frequently-encountered-problems)\n8. [TODO](#8-todo)\n9. [References](#9-references)\n\n## 1. Requirements\n\nRecommended system requirements to train the model:\n\n\u003cul type=\"square\"\u003e\n\t\u003cli\u003eA good CPU and a GPU with at least 8GB memory\u003c/li\u003e\n\t\u003cli\u003eAt least 8GB of RAM\u003c/li\u003e\n\t\u003cli\u003eActive internet connection so that Keras can download the InceptionV3/VGG16 model weights\u003c/li\u003e\n\u003c/ul\u003e\n\nRequired Python libraries, with the version numbers used while building \u0026 testing this project:\n\n\u003cul type=\"square\"\u003e\n\t\u003cli\u003ePython - 3.6.7\u003c/li\u003e\n\t\u003cli\u003eNumpy - 1.16.4\u003c/li\u003e\n\t\u003cli\u003eTensorflow - 1.13.1\u003c/li\u003e\n\t\u003cli\u003eKeras - 2.2.4\u003c/li\u003e\n\t\u003cli\u003enltk - 3.2.5\u003c/li\u003e\n\t\u003cli\u003ePIL - 4.3.0\u003c/li\u003e\n\t\u003cli\u003eMatplotlib - 3.0.3\u003c/li\u003e\n\t\u003cli\u003etqdm - 4.28.1\u003c/li\u003e\n\u003c/ul\u003e\n\n\u003cstrong\u003eFlickr8k Dataset:\u003c/strong\u003e \u003ca href=\"https://forms.illinois.edu/sec/1713398\"\u003eDataset Request Form\u003c/a\u003e\n\n\u003cstrong\u003eUPDATE (April/2019):\u003c/strong\u003e The official site seems 
to have been taken down (although the form still works). Here are some direct download links:\n\n\u003cul type=\"square\"\u003e\n\t\u003cli\u003e\u003ca href=\"https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_Dataset.zip\"\u003eFlickr8k_Dataset\u003c/a\u003e\u003c/li\u003e\n\t\u003cli\u003e\u003ca href=\"https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_text.zip\"\u003eFlickr8k_text\u003c/a\u003e\u003c/li\u003e\n\tDownload Link Credits:\u003ca href=\"https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/\"\u003e Jason Brownlee\u003c/a\u003e\n\u003c/ul\u003e\n\n\u003cstrong\u003eImportant:\u003c/strong\u003e After downloading the dataset, put the required files in the train_val_data folder.\n\n## 2. Training parameters and results\n\n#### NOTE\n\n- `batch_size=64` took ~14GB GPU memory in case of *InceptionV3 + AlternativeRNN* and *VGG16 + AlternativeRNN*\n- `batch_size=64` took ~8GB GPU memory in case of *InceptionV3 + RNN* and *VGG16 + RNN*\n- **If you're low on memory**, use Google Colab or reduce the batch size\n- With BEAM Search, `loss` and `val_loss` are the same as with argmax, since the underlying model is the same\n\n| Model \u0026 Config | Argmax | BEAM Search |\n| :--- | :--- | :--- |\n| **InceptionV3 + AlternativeRNN** \u003cul\u003e\u003cli\u003eEpochs = 20\u003c/li\u003e\u003cli\u003eBatch Size = 64\u003c/li\u003e\u003cli\u003eOptimizer = Adam\u003c/li\u003e\u003c/ul\u003e |\u003cul\u003e**Crossentropy loss**\u003cbr\u003e*(Lower the better)*\u003cli\u003eloss(train_loss): 2.4050\u003c/li\u003e\u003cli\u003eval_loss: 3.0527\u003c/li\u003e**BLEU Scores on Validation data**\u003cbr\u003e*(Higher the better)*\u003cli\u003eBLEU-1: 0.596818\u003c/li\u003e\u003cli\u003eBLEU-2: 0.356009\u003c/li\u003e\u003cli\u003eBLEU-3: 0.252489\u003c/li\u003e\u003cli\u003eBLEU-4: 0.129536\u003c/li\u003e\u003c/ul\u003e |\u003cul\u003e**k = 3**\u003cbr\u003e\u003cbr\u003e**BLEU Scores on Validation 
data**\u003cbr\u003e*(Higher the better)*\u003cli\u003eBLEU-1: 0.606086\u003c/li\u003e\u003cli\u003eBLEU-2: 0.359171\u003c/li\u003e\u003cli\u003eBLEU-3: 0.249124\u003c/li\u003e\u003cli\u003eBLEU-4: 0.126599\u003c/li\u003e\u003c/ul\u003e |\n| **InceptionV3 + RNN** \u003cul\u003e\u003cli\u003eEpochs = 11\u003c/li\u003e\u003cli\u003eBatch Size = 64\u003c/li\u003e\u003cli\u003eOptimizer = Adam\u003c/li\u003e\u003c/ul\u003e |\u003cul\u003e**Crossentropy loss**\u003cbr\u003e*(Lower the better)*\u003cli\u003eloss(train_loss): 2.5254\u003c/li\u003e\u003cli\u003eval_loss: 3.1769\u003c/li\u003e**BLEU Scores on Validation data**\u003cbr\u003e*(Higher the better)*\u003cli\u003eBLEU-1: 0.601791\u003c/li\u003e\u003cli\u003eBLEU-2: 0.344289\u003c/li\u003e\u003cli\u003eBLEU-3: 0.230025\u003c/li\u003e\u003cli\u003eBLEU-4: 0.108898\u003c/li\u003e\u003c/ul\u003e |\u003cul\u003e**k = 3**\u003cbr\u003e\u003cbr\u003e**BLEU Scores on Validation data**\u003cbr\u003e*(Higher the better)*\u003cli\u003eBLEU-1: 0.605097\u003c/li\u003e\u003cli\u003eBLEU-2: 0.356094\u003c/li\u003e\u003cli\u003eBLEU-3: 0.251132\u003c/li\u003e\u003cli\u003eBLEU-4: 0.129900\u003c/li\u003e\u003c/ul\u003e |\n| **VGG16 + AlternativeRNN** \u003cul\u003e\u003cli\u003eEpochs = 18\u003c/li\u003e\u003cli\u003eBatch Size = 64\u003c/li\u003e\u003cli\u003eOptimizer = Adam\u003c/li\u003e\u003c/ul\u003e |\u003cul\u003e**Crossentropy loss**\u003cbr\u003e*(Lower the better)*\u003cli\u003eloss(train_loss): 2.2880\u003c/li\u003e\u003cli\u003eval_loss: 3.1889\u003c/li\u003e**BLEU Scores on Validation data**\u003cbr\u003e*(Higher the better)*\u003cli\u003eBLEU-1: 0.596655\u003c/li\u003e\u003cli\u003eBLEU-2: 0.342127\u003c/li\u003e\u003cli\u003eBLEU-3: 0.229676\u003c/li\u003e\u003cli\u003eBLEU-4: 0.108707\u003c/li\u003e\u003c/ul\u003e | \u003cul\u003e**k = 3**\u003cbr\u003e\u003cbr\u003e**BLEU Scores on Validation data**\u003cbr\u003e*(Higher the better)*\u003cli\u003eBLEU-1: 0.593876\u003c/li\u003e\u003cli\u003eBLEU-2: 
0.348569\u003c/li\u003e\u003cli\u003eBLEU-3: 0.242063\u003c/li\u003e\u003cli\u003eBLEU-4: 0.123221\u003c/li\u003e\u003c/ul\u003e |\n| **VGG16 + RNN** \u003cul\u003e\u003cli\u003eEpochs = 7\u003c/li\u003e\u003cli\u003eBatch Size = 64\u003c/li\u003e\u003cli\u003eOptimizer = Adam\u003c/li\u003e\u003c/ul\u003e |\u003cul\u003e**Crossentropy loss**\u003cbr\u003e*(Lower the better)*\u003cli\u003eloss(train_loss): 2.6297\u003c/li\u003e\u003cli\u003eval_loss: 3.3486\u003c/li\u003e**BLEU Scores on Validation data**\u003cbr\u003e*(Higher the better)*\u003cli\u003eBLEU-1: 0.557626\u003c/li\u003e\u003cli\u003eBLEU-2: 0.317652\u003c/li\u003e\u003cli\u003eBLEU-3: 0.216636\u003c/li\u003e\u003cli\u003eBLEU-4: 0.105288\u003c/li\u003e\u003c/ul\u003e |\u003cul\u003e**k = 3**\u003cbr\u003e\u003cbr\u003e**BLEU Scores on Validation data**\u003cbr\u003e*(Higher the better)*\u003cli\u003eBLEU-1: 0.568993\u003c/li\u003e\u003cli\u003eBLEU-2: 0.326569\u003c/li\u003e\u003cli\u003eBLEU-3: 0.226629\u003c/li\u003e\u003cli\u003eBLEU-4: 0.113102\u003c/li\u003e\u003c/ul\u003e |\n\n\n## 3. 
Generated Captions on Test Images\n\n**Model used** - *InceptionV3 + AlternativeRNN*\n\n| Image | Caption |\n| :---: | :--- |\n| \u003cimg width=\"50%\" src=\"https://github.com/dabasajay/Image-Caption-Generator/raw/master/test_data/bikestunt.jpg\" alt=\"Image 1\"\u003e | \u003cul\u003e\u003cli\u003e\u003cstrong\u003eArgmax:\u003c/strong\u003e A man in a blue shirt is riding a bike on a dirt path.\u003c/li\u003e\u003cli\u003e\u003cstrong\u003eBEAM Search, k=3:\u003c/strong\u003e A man is riding a bicycle on a dirt path.\u003c/li\u003e\u003c/ul\u003e|\n| \u003cimg src=\"https://github.com/dabasajay/Image-Caption-Generator/raw/master/test_data/surfing.jpeg\" alt=\"Image 2\"\u003e | \u003cul\u003e\u003cli\u003e\u003cstrong\u003eArgmax:\u003c/strong\u003e A man in a red kayak is riding down a waterfall.\u003c/li\u003e\u003cli\u003e\u003cstrong\u003eBEAM Search, k=3:\u003c/strong\u003e A man on a surfboard is riding a wave.\u003c/li\u003e\u003c/ul\u003e|\n\n## 4. Procedure to Train Model\n\n1. Clone the repository to preserve directory structure.\u003cbr\u003e\n`git clone https://github.com/dabasajay/Image-Caption-Generator.git`\n2. Put the required dataset files in the `train_val_data` folder (the files are listed in the readme there).\n3. Review `config.py` for paths and other configurations (explained below).\n4. Run `train_val.py`.\n\n## 5. Procedure to Test on new images\n\n1. Clone the repository to preserve directory structure.\u003cbr\u003e\n`git clone https://github.com/dabasajay/Image-Caption-Generator.git`\n2. Train the model to generate the required files in the `model_data` folder (steps given above).\n3. Put the test images in the `test_data` folder.\n4. Review `config.py` for paths and other configurations (explained below).\n5. Run `test.py`.\n\n## 6. Configurations (config.py)\n\n**config**\n\n1. **`images_path`** :- Folder path containing Flickr dataset images\n2. `train_data_path` :- .txt file path containing image ids for training\n3. 
`val_data_path` :- .txt file path containing image ids for validation\n4. `captions_path` :- .txt file path containing captions\n5. `tokenizer_path` :- path for saving tokenizer\n6. `model_data_path` :- path for saving files related to model\n7. **`model_load_path`** :- path for loading trained model\n8. **`num_of_epochs`** :- Number of epochs\n9. **`max_length`** :- Maximum length of captions. This is set manually after training the model and is required for test.py\n10. **`batch_size`** :- Batch size for training (larger values consume more GPU \u0026 CPU memory)\n11. **`beam_search_k`** :- BEAM search width: how many candidate captions the algorithm keeps at each decoding step.\n12. `test_data_path` :- Folder path containing images for testing/inference\n13. **`model_type`** :- CNN Model type to use -\u003e inceptionv3 or vgg16\n14. **`random_seed`** :- Random seed for reproducibility of results\n\n**rnnConfig**\n\n1. **`embedding_size`** :- Embedding size used in Decoder(RNN) Model\n2. **`LSTM_units`** :- Number of LSTM units in Decoder(RNN) Model\n3. **`dense_units`** :- Number of Dense units in Decoder(RNN) Model\n4. **`dropout`** :- Dropout probability used in Dropout layer in Decoder(RNN) Model\n\n## 7. Frequently encountered problems\n\n- **Out of memory issue**:\n  - Try reducing `batch_size`\n- **Results differ every time I run the script**:\n  - Due to the stochastic nature of these algorithms, results *may* differ slightly between runs, even though a random seed is set to make them reproducible.\n- **Results aren't very great using beam search compared to argmax**:\n  - Try a higher `k` in BEAM search using the `beam_search_k` parameter in config. Note that a higher `k` usually improves results, but it also increases inference time significantly.\n\n## 8. TODO\n\n- [X] Support for VGG16 Model. 
Uses InceptionV3 Model by default.\n\n- [X] Implement 2 architectures of RNN Model.\n\n- [X] Support for batch processing in data generator with shuffling.\n\n- [X] Implement BEAM Search.\n\n- [X] Calculate BLEU Scores using BEAM Search.\n\n- [ ] Implement Attention and change model architecture.\n\n- [ ] Support for pre-trained word vectors like word2vec, GloVe etc.\n\n## 9. References\n\n\u003cul type=\"square\"\u003e\n\t\u003cli\u003e\u003ca href=\"https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Vinyals_Show_and_Tell_2015_CVPR_paper.pdf\"\u003eShow and Tell: A Neural Image Caption Generator\u003c/a\u003e - Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan\u003c/li\u003e\n\t\u003cli\u003e\u003ca href=\"https://arxiv.org/abs/1703.09137\"\u003eWhere to put the Image in an Image Caption Generator\u003c/a\u003e - Marc Tanti, Albert Gatt, Kenneth P. Camilleri\u003c/li\u003e\n\t\u003cli\u003e\u003ca href=\"https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/\"\u003eHow to Develop a Deep Learning Photo Caption Generator from Scratch\u003c/a\u003e\u003c/li\u003e\n\u003c/ul\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdabasajay%2FImage-Caption-Generator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdabasajay%2FImage-Caption-Generator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdabasajay%2FImage-Caption-Generator/lists"}