{"id":18927770,"url":"https://github.com/peteanderson80/up-down-captioner","last_synced_at":"2025-06-20T10:33:30.468Z","repository":{"id":90817721,"uuid":"113699565","full_name":"peteanderson80/Up-Down-Captioner","owner":"peteanderson80","description":"Automatic image captioning model based on Caffe, using features from bottom-up attention.","archived":false,"fork":false,"pushed_at":"2023-02-03T03:44:28.000Z","size":2730,"stargazers_count":245,"open_issues_count":21,"forks_count":68,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-05-20T13:08:44.442Z","etag":null,"topics":["caffe","captioning-images","image-captioning","lstm"],"latest_commit_sha":null,"homepage":"http://www.panderson.me/up-down-attention/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/peteanderson80.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-12-09T20:36:45.000Z","updated_at":"2025-03-21T16:06:57.000Z","dependencies_parsed_at":null,"dependency_job_id":"4bcbe9f4-295a-4e62-90d2-c28ae73d9291","html_url":"https://github.com/peteanderson80/Up-Down-Captioner","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/peteanderson80/Up-Down-Captioner","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/peteanderson80%2FUp-Down-Captioner","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/peteanderson80%2FUp-Down-Captioner/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/peteand
erson80%2FUp-Down-Captioner/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/peteanderson80%2FUp-Down-Captioner/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/peteanderson80","download_url":"https://codeload.github.com/peteanderson80/Up-Down-Captioner/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/peteanderson80%2FUp-Down-Captioner/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260927164,"owners_count":23083971,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["caffe","captioning-images","image-captioning","lstm"],"created_at":"2024-11-08T11:20:27.917Z","updated_at":"2025-06-20T10:33:25.456Z","avatar_url":"https://github.com/peteanderson80.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Up-Down-Captioner\n\nSimple yet high-performing image captioning model using Caffe and python. Using image features from [bottom-up attention](https://github.com/peteanderson80/bottom-up-attention), in July 2017 this model achieved state-of-the-art performance on all metrics of the [COCO captions test leaderboard](http://cocodataset.org/#captions-leaderboard) (**SPICE 21.5**, **CIDEr 117.9**, **BLEU_4 36.9**). The architecture (2-layer LSTM with attention) is described in Section 3.2 of:\n- [Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering](https://arxiv.org/abs/1707.07998). 
\n\n### Reference\nIf you use this code in your research, please cite our paper:\n```\n@inproceedings{Anderson2017up-down,\n  author = {Peter Anderson and Xiaodong He and Chris Buehler and Damien Teney and Mark Johnson and Stephen Gould and Lei Zhang},\n  title = {Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering},\n  booktitle={CVPR},\n  year = {2018}\n}\n```\n\n### License\n\nThis code is released under the MIT License (refer to the LICENSE file for details).\n\n### Requirements: software\n\n0. **`Important`** Please use the version of Caffe provided as a submodule within this repository. It contains additional layers and features required for captioning.\n\n1.  Requirements for `Caffe` and `pycaffe` (see: [Caffe installation instructions](http://caffe.berkeleyvision.org/installation.html))\n\n    **Note:** Caffe *must* be built with support for Python layers and NCCL!\n\n    ```make\n    # In your Makefile.config, make sure to have these lines uncommented\n    WITH_PYTHON_LAYER := 1\n    USE_NCCL := 1\n    # Unrelatedly, it's also recommended that you use CUDNN\n    USE_CUDNN := 1\n    ```\n2.  NVIDIA's NCCL library, which is used for multi-GPU training: https://github.com/NVIDIA/nccl\n\n### Requirements: hardware\n\nBy default, the provided training scripts assume that two GPUs are available, with indices 0 and 1. Training on two GPUs takes around 9 hours. Any NVIDIA GPU with 8GB or more of memory should be OK. Training scripts and prototxt files will require minor modifications to train on a single GPU (e.g. set `iter_size` to 2).\n\n\n### Demo - Using the model to predict on new images\n\nRun installation steps 1-4 below, then use the notebook at `scripts/demo.ipynb`.\n\n### Installation\n\nAll instructions are run from the top-level directory. To run the demo, only steps 1-4 should be required (the remaining steps are for training a model).\n\n1.  
Clone the Up-Down-Captioner repository:\n    ```Shell\n    # Make sure to clone with --recursive\n    git clone --recursive https://github.com/peteanderson80/Up-Down-Captioner.git\n    ```\n\n    If you forget to clone with the `--recursive` flag, then you'll need to manually clone the submodules:\n    ```Shell\n    git submodule update --init --recursive\n    ```\n\n2.  Build Caffe and pycaffe:\n    ```Shell\n    cd ./external/caffe\n\n    # If you're experienced with Caffe and have all of the requirements installed\n    # and your Makefile.config in place, then simply do:\n    make -j8 \u0026\u0026 make pycaffe\n    ```\n\n3.  Build the COCO tools:\n    ```Shell\n    cd ./external/coco/PythonAPI\n    make\n    ```\n\n4.  Add the Python layers and Caffe build to PYTHONPATH:\n    ```Shell\n    cd $REPO_ROOT\n    export PYTHONPATH=${PYTHONPATH}:$(pwd)/layers:$(pwd)/lib:$(pwd)/external/caffe/python\n    ```\n    \n5.  Build Ross Girshick's Cython modules (to run the demo on new images):\n    ```Shell\n    cd $REPO_ROOT/lib\n    make\n    ```\n    \n6.  Download Stanford CoreNLP (required by the evaluation code):\n    ```Shell\n    cd ./external/coco-caption\n    ./get_stanford_models.sh\n    ```\n\n7.  Download the MS COCO train/val image caption annotations. Extract all the JSON files into one folder `$COCOdata`, then create a symlink to this location:\n    ```Shell\n    cd $REPO_ROOT/data\n    ln -s $COCOdata coco\n    ``` \n\n8.  Pre-process the caption annotations for training (building vocabularies, etc.):\n    ```Shell\n    cd $REPO_ROOT\n    python scripts/preprocess_coco.py\n    ``` \n    \n9.  Download or generate pretrained image features following the instructions below.\n\n\n### Pretrained image features\n\n**LINKS HAVE BEEN UPDATED**\n\nThe captioner takes pretrained image features as input (and does not fine-tune them). For best performance, bottom-up attention features should be used. 
Code for generating these features can be found [here](https://github.com/peteanderson80/bottom-up-attention). For ease of use, we provide pretrained features for the [MSCOCO dataset](http://mscoco.org/dataset/#download). Manually download the following tsv file and unzip it to `data/tsv/`:\n- [2014 Train/Val Image Features (120K / 23GB)](https://imagecaption.blob.core.windows.net/imagecaption/trainval.zip)\n\nTo make a test server submission, you would also need these features:\n- [2014 Testing Image Features (40K / 7.3GB)](https://imagecaption.blob.core.windows.net/imagecaption/test2014.zip)\n\nAlternatively, to generate conventional pretrained features from the ResNet-101 CNN:\n- Download the [pretrained ResNet-101 model](https://github.com/KaimingHe/deep-residual-networks#models) and save it in `baseline/ResNet-101-model.caffemodel`\n- Download the MS COCO train/val images, and extract them into `data/images`.\n- Run:\n```Shell\ncd $REPO_ROOT\n./scripts/generate_baseline.py\n``` \n\n### Training\n\nTo train the model on the Karpathy training set, and then generate and evaluate captions on the Karpathy testing set (using bottom-up attention features): \n```Shell\ncd $REPO_ROOT\n./experiments/caption_lstm/train.sh\n```\n\nTrained snapshots are saved under: `snapshots/caption_lstm/`\n\nLogging outputs are saved under: `logs/caption_lstm/`\n\nGenerated caption outputs are saved under: `outputs/caption_lstm/`\n\nScores for the generated captions (on the Karpathy test set) are saved under: `scores/caption_lstm/`\n\nTo train and evaluate the baseline using conventional pretrained features, follow the instructions above but replace `caption_lstm` with `caption_lstm_baseline_resnet`.\n\n### Results\n\nResults (using bottom-up attention features) should be similar to the numbers below (as reported in Table 1 of the paper).\n\n|                   | BLEU-1  | BLEU-4  | METEOR  | ROUGE-L |  CIDEr  |  SPICE  
|\n|-------------------|:-------:|:-------:|:-------:|:-------:|:-------:|:-------:|\n|Cross-Entropy Loss |  77.2   |  36.2   |  27.0   |  56.4   |  113.5  |  20.3   |\n|CIDEr Optimization |  79.8   |  36.3   |  27.7   |  56.9   |  120.1  |  21.4   |\n\n### Other useful scripts\n\n1. `scripts/create_caption_lstm.py`\n    The version of Caffe provided as a submodule with this repo includes (amongst other things) a custom `LSTMNode` layer that enables sampling and beam search through LSTM layers. However, the resulting network architecture prototxt files are quite complicated. The file `scripts/create_caption_lstm.py` scaffolds out network structures, such as those in `experiments`.\n\n2. `layers/efficient_rcnn_layers.py`\n    The provided `net.prototxt` file uses a Python data layer (`layers/rcnn_layers.py`) that loads all training data (including image features) into memory. If you have insufficient system memory, use this Python data layer instead by replacing `module: \"rcnn_layers\"` with `module: \"efficient_rcnn_layers\"` in `experiments/caption_lstm/net.prototxt`.\n\n3. `scripts/plot.py`\n    Basic script for plotting validation set scores during training.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpeteanderson80%2Fup-down-captioner","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpeteanderson80%2Fup-down-captioner","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpeteanderson80%2Fup-down-captioner/lists"}