{"id":21693548,"url":"https://github.com/kleinyuan/image2text","last_synced_at":"2025-06-10T11:35:49.084Z","repository":{"id":87912022,"uuid":"95378095","full_name":"KleinYuan/image2text","owner":"KleinYuan","description":"A deep learning project to tell a story with an image or a video.","archived":false,"fork":false,"pushed_at":"2017-08-09T05:50:40.000Z","size":38,"stargazers_count":42,"open_issues_count":4,"forks_count":10,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-12T10:44:52.318Z","etag":null,"topics":["artificial-intelligence","cnn","convolutional-neural-networks","deep-learning","iapr","image-understanding","lasagne","machine-learning","multimodal-layers","natural-language-processing","neural-network","real-time","storyteller","tensorflow","theano","word2vec","word2vec-model"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/KleinYuan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-06-25T18:25:43.000Z","updated_at":"2025-03-23T23:20:01.000Z","dependencies_parsed_at":null,"dependency_job_id":"f0fb6fa8-8f2a-4489-91bb-a1c2e797759f","html_url":"https://github.com/KleinYuan/image2text","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KleinYuan%2Fimage2text","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KleinYuan%2Fimage2text/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/rep
ositories/KleinYuan%2Fimage2text/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KleinYuan%2Fimage2text/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/KleinYuan","download_url":"https://codeload.github.com/KleinYuan/image2text/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KleinYuan%2Fimage2text/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259067903,"owners_count":22800417,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","cnn","convolutional-neural-networks","deep-learning","iapr","image-understanding","lasagne","machine-learning","multimodal-layers","natural-language-processing","neural-network","real-time","storyteller","tensorflow","theano","word2vec","word2vec-model"],"created_at":"2024-11-25T18:20:45.925Z","updated_at":"2025-06-10T11:35:49.052Z","avatar_url":"https://github.com/KleinYuan.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Intro\n\nThis repo implements a multi-modal natural language model with TensorFlow.\n\n|**Dependencies**             |  **DataSets**|\n| --- | --- |\n|[python 2.7](https://www.python.org/download/releases/2.7/)\u003cbr/\u003e[tensorflow](https://www.tensorflow.org) \u003cbr/\u003e[lasagne](https://github.com/Lasagne/Lasagne) \u003cbr/\u003e[Theano](https://github.com/Theano/Theano) |[IAPR TC-12](http://www.imageclef.org/photodata)|\n\n\n# Project Overview\n\n1. 
First, a word2vec word embedding is trained on the IAPR TC-12 dataset.\n\n2. Second, the word vectors of each image description, filtered so that only the first sentence of an overly long description is kept, are used as the target output of a CNN.\n\n# Setup\n\nFirst, install the dependencies (tensorflow, lasagne, theano, nolearn, ...) with whatever tools suit your system.\n\nThen run one of the commands below to download the datasets.\n\nRun:\n\n```bash setup.sh```\n\nor:\n\n```make setup```\n\n\n# Network Design\n\n**Word2Vec**             |  **StoryNet**\n:-------------------------:|:-------------------------:\n![word2vec](https://www.tensorflow.org/images/softmax-nplm.png)|![storynet](https://user-images.githubusercontent.com/8921629/28401184-23dfdb4e-6ccd-11e7-8883-cf7749444d32.png)\n\n# Training\n\nRun:\n\n```python train.py```\n\nor:\n\n```make train```\n\n\n**Optimizer**             |  **Loss**\n:-------------------------:|:-------------------------:\nMomentumOptimizer  | MSE Loss\n\n\n![learning_curve](https://user-images.githubusercontent.com/8921629/28445982-bd35c1e6-6d7c-11e7-8100-cfdeee644167.png)\n\n# Pre-trained Model\n\nDownload [here](https://www.dropbox.com/s/hxt8xwpy4wz429k/storyNet.pb?dl=0)\n\n# Testing and Results\n\nRun:\n\n```\nmake demo\n```\n\n\n# Train on your own\n\n1. Run the setup bash script to download the datasets\n\n2. Run train.py directly or through the makefile\n\n3. Freeze the tensorflow model with the command provided in the makefile\n\n4. Run app.py directly or through the makefile\n\n\n# Data Sets\n\nThe image collection of the IAPR TC-12 Benchmark consists of 20,000 still natural images taken from locations around the world and comprising an assorted cross-section of still natural images. 
This includes pictures of different sports and actions, photographs of people, animals, cities, landscapes and many other aspects of contemporary life.\n\nEach image is associated with a text caption in up to three different languages (English, German and Spanish). These annotations are stored in a database which is managed by a benchmark administration system that allows the specification of parameters according to which different subsets of the image collection can be generated.\n\nThe IAPR TC-12 Benchmark is now available free of charge and without copyright restrictions.\n\nMore [details](http://www.imageclef.org/photodata).\n\nSample annotations:\n\n```\n\n    \u003cDOC\u003e\n    \u003cDOCNO\u003eannotations/01/1000.eng\u003c/DOCNO\u003e\n    \u003cTITLE\u003eGodchild Cristian Patricio Umaginga Tuaquiza\u003c/TITLE\u003e\n    \u003cDESCRIPTION\u003ea dark-skinned boy wearing a knitted woolly hat and a light and dark grey striped jumper with a grey zip, leaning on a grey wall;\u003c/DESCRIPTION\u003e\n    \u003cNOTES\u003e\u003c/NOTES\u003e\n    \u003cLOCATION\u003eQuilotoa, Ecuador\u003c/LOCATION\u003e\n    \u003cDATE\u003eApril 2002\u003c/DATE\u003e\n    \u003cIMAGE\u003eimages/01/1000.jpg\u003c/IMAGE\u003e\n    \u003cTHUMBNAIL\u003ethumbnails/01/1000.jpg\u003c/THUMBNAIL\u003e\n    \u003c/DOC\u003e\n\n```\n\n\n# References:\n\n1. [Dong, Jianfeng, Xirong Li, and Cees GM Snoek. \"Word2VisualVec: Image and video to sentence matching by visual feature prediction.\" CoRR, abs/1604.06838 (2016).](https://arxiv.org/pdf/1604.06838.pdf)\n\n2. [Karpathy, Andrej, and Li Fei-Fei. \"Deep visual-semantic alignments for generating image descriptions.\" Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.](https://cs.stanford.edu/people/karpathy/cvpr2015.pdf)\n\n3. [Kiros, Ryan, Ruslan Salakhutdinov, and Rich Zemel. \"Multimodal neural language models.\" Proceedings of the 31st International Conference on Machine Learning (ICML-14). 
2014.](http://proceedings.mlr.press/v32/kiros14.pdf)\n\n4. [word2vec tutorial](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkleinyuan%2Fimage2text","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkleinyuan%2Fimage2text","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkleinyuan%2Fimage2text/lists"}