{"id":16780218,"url":"https://github.com/dev-khant/image-captioning","last_synced_at":"2025-03-16T20:13:00.441Z","repository":{"id":112231818,"uuid":"445118446","full_name":"Dev-Khant/Image-Captioning","owner":"Dev-Khant","description":"Image Captioning using EfficientNet and GRU","archived":false,"fork":false,"pushed_at":"2022-01-10T07:06:20.000Z","size":24003,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-23T06:32:39.810Z","etag":null,"topics":["attention-mechanism","efficientnetb0","gru"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Dev-Khant.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-01-06T09:48:44.000Z","updated_at":"2024-02-01T14:22:13.000Z","dependencies_parsed_at":"2023-05-11T18:32:04.875Z","dependency_job_id":null,"html_url":"https://github.com/Dev-Khant/Image-Captioning","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Dev-Khant%2FImage-Captioning","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Dev-Khant%2FImage-Captioning/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Dev-Khant%2FImage-Captioning/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Dev-Khant%2FImage-Captioning/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owner
s/Dev-Khant","download_url":"https://codeload.github.com/Dev-Khant/Image-Captioning/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243926071,"owners_count":20369910,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["attention-mechanism","efficientnetb0","gru"],"created_at":"2024-10-13T07:34:24.866Z","updated_at":"2025-03-16T20:13:00.417Z","avatar_url":"https://github.com/Dev-Khant.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Image Captioning Using EfficientNet and GRU\n\n### Models \u0026 Dataset\n- EfficientNetB0, a state-of-the-art image classification model, is used as the image encoder. The model and its variants are available [here](https://github.com/qubvel/efficientnet).\n- An attention mechanism is used to focus on the relevant parts of each image. Here is a nice explanation of [Attention](https://towardsdatascience.com/attention-in-neural-networks-e66920838742).\n- A Gated Recurrent Unit (GRU) is the model used to process text.\n- The dataset used here is [Flickr(30K)](https://www.kaggle.com/hsankesara/flickr-image-dataset).\n- The dataset contains 30k images with 5 annotations per image. Only the first annotation of each image is used here.\n\n### Parameters \u0026 Libraries\n- TensorFlow, NLTK, NumPy, and Pandas are used.\n- Text is tokenized with **TextVectorization** (vocabulary size of 5000, sequence length of 25) and mapped to embedding vectors of dimension 256.\n- The image size for EfficientNet is (224, 224, 3). 
EfficientNet is loaded with ImageNet weights.\n- The GRU has 512 units.\n\n### Training \u0026 Evaluation\n- Training uses a batch size of 64 for 25 epochs.\n- The encoder consists of EfficientNet and an FC layer for fine-tuning. The decoder consists of a GRU with an attention mechanism.\n- First, the image is passed to EfficientNet to obtain the image context vector. This vector, along with hidden_state (the initial state of the decoder), is passed to the attention layer, whose output is passed to the GRU along with the embedding vector of the \"[start]\" token.\n- Teacher forcing is used: during training, the word vector of the target sentence is passed to the GRU.\n- During testing, the input to the GRU is its previous output along with the attention output.\n- The loss obtained by the model is 0.511, and the BLEU score on the test data is 0.129.\n- Here are 2 examples after training.\n\n\u003cbr\u003e ![ex1](https://user-images.githubusercontent.com/57898986/148728626-d85c5ea0-e966-42a0-9689-7fdace11a480.png)\n\u0026nbsp; ![Screenshot 2022-01-06 151649](https://user-images.githubusercontent.com/57898986/148728501-4b15fab4-6722-4dee-8cf5-f54f8113acb6.png)\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdev-khant%2Fimage-captioning","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdev-khant%2Fimage-captioning","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdev-khant%2Fimage-captioning/lists"}