{"id":16282886,"url":"https://github.com/dito97/dense-image-captioning","last_synced_at":"2025-10-14T04:21:18.927Z","repository":{"id":124235138,"uuid":"376954301","full_name":"DiTo97/dense-image-captioning","owner":"DiTo97","description":"An unofficial Torch implementation of J. Lu, C. Xiong, et al., Knowing when to Look: Adaptive Attention via a Visual Sentinel for Image Captioning, 2017 with deformable adaptive attention","archived":false,"fork":false,"pushed_at":"2023-07-24T12:13:11.000Z","size":3901,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-04T06:34:46.387Z","etag":null,"topics":["attention","image-captioning","torch-2"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DiTo97.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-06-14T20:57:23.000Z","updated_at":"2024-07-22T18:55:00.000Z","dependencies_parsed_at":null,"dependency_job_id":"4650cfd2-d643-4dbd-9ed3-43cdd24a6227","html_url":"https://github.com/DiTo97/dense-image-captioning","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/DiTo97/dense-image-captioning","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DiTo97%2Fdense-image-captioning","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DiTo97%2Fdense-image-captioning/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DiT
o97%2Fdense-image-captioning/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DiTo97%2Fdense-image-captioning/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DiTo97","download_url":"https://codeload.github.com/DiTo97/dense-image-captioning/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DiTo97%2Fdense-image-captioning/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279017950,"owners_count":26086213,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-14T02:00:06.444Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["attention","image-captioning","torch-2"],"created_at":"2024-10-10T19:11:57.038Z","updated_at":"2025-10-14T04:21:18.871Z","avatar_url":"https://github.com/DiTo97.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# dense image captioning\n\nAn unofficial Torch implementation of [J. Lu, C. 
Xiong, et al., *Knowing when to Look: Adaptive Attention via a Visual Sentinel for Image Captioning*, 2017](https://arxiv.org/abs/1612.01887) trained on the COCO image captioning and Flickr30k datasets.\n\nThe implementation presents the following variations from the paper:\n- deformable adaptive attention;\n- larger visual sentinel size (128-dim);\n- model evaluation against the [SPICE](https://panderson.me/spice/) metric;\n- [MCTS-based decoding](https://arxiv.org/pdf/2104.05336.pdf).\n\n## Introduction\n\nDense image captioning plays a major role in enabling visual-language understanding of the outside world.\n\nIn this project we propose a deformable variant of the adaptive attention via a visual sentinel, introduced in the reference paper, for estimating grounding probabilities. The variant allows larger networks to be constructed while running at a faster inference speed and training for almost half the epochs with equal performance.\n\nThis project is part of a larger venture for the development of visual-language aid tools for visually-impaired people,\nby combining speech recognition, speech synthesis, image captioning and familiar person identification.\n\nFor more information, see the attached in-depth [report](report/F.%20Minutoli,%20G.%20Losapio,%20et%20al.%20-%20Improving%20Daily%20Interactions%20of%20Visually-impaired%20People.pdf).\n\n## Training\n\nThe model was trained for 50 epochs on a multi-GPU HPC cluster courtesy of [CERN](https://abpcomputing.web.cern.ch/computing_resources/hpc_cern/).\n\n## Usage\n\nThe following files must be downloaded from Google Drive:\n\n- [preprocessing.zip](https://drive.google.com/file/d/1njpdzE1BHHrtC7CHt-WLe7V2w7e919wj/view?usp=sharing)\n- [adaptive.pkl](https://drive.google.com/file/d/1g0HfjOmJA4Eh2m88O2sElPaDUm2OJi-q/view?usp=sharing)\n\nThe former contains the dataset with COCO-like annotations and the corresponding vocabulary.\n\nThe following files should be downloaded from Google Drive for display purposes:\n\n- 
[eval-loss.pkl](https://drive.google.com/file/d/17Z9jpqp_B_TLzLa0MOQ8u4MqgcOROsMm/view?usp=sharing)\n- [eval-metrics.pkl](https://drive.google.com/file/d/1CzkKbW-ZQM3cxkFCWLd3rE4U9rQD30J9/view?usp=sharing)\n- [visual-grounding-probas.pkl](https://drive.google.com/file/d/1PU7eSV_M7Z56PzFhX4aIitKNtu6TNS0b/view?usp=sharing)\n\n**N.B.:** If the provided links are no longer available, contact the authors.\n\n## Authors\n\n- [@DiTo97](https://github.com/DiTo97)\n- [@arcadeghira](https://github.com/arcadeghira)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdito97%2Fdense-image-captioning","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdito97%2Fdense-image-captioning","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdito97%2Fdense-image-captioning/lists"}