{"id":20038514,"url":"https://github.com/yahoo/object_relation_transformer","last_synced_at":"2025-10-29T14:02:05.501Z","repository":{"id":45985261,"uuid":"217151325","full_name":"yahoo/object_relation_transformer","owner":"yahoo","description":"Implementation of the Object Relation Transformer for Image Captioning","archived":false,"fork":false,"pushed_at":"2024-09-17T21:38:40.000Z","size":1180,"stargazers_count":177,"open_issues_count":14,"forks_count":45,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-03-31T09:08:01.210Z","etag":null,"topics":["machine-learning"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/1906.05963","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yahoo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"Contributing.md","funding":null,"license":"LICENSE","code_of_conduct":"Code-of-Conduct.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-10-23T20:51:56.000Z","updated_at":"2024-11-26T13:02:06.000Z","dependencies_parsed_at":"2024-09-18T01:33:02.727Z","dependency_job_id":"4f1a96f8-3b00-4397-bc72-89a6539ecfe1","html_url":"https://github.com/yahoo/object_relation_transformer","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yahoo%2Fobject_relation_transformer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yahoo%2Fobject_relation_transformer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yahoo%2Fobject_relation_transformer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/repositories/yahoo%2Fobject_relation_transformer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yahoo","download_url":"https://codeload.github.com/yahoo/object_relation_transformer/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247640465,"owners_count":20971557,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-learning"],"created_at":"2024-11-13T10:29:40.753Z","updated_at":"2025-10-29T14:02:05.482Z","avatar_url":"https://github.com/yahoo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Object Relation Transformer\n\nThis is a PyTorch implementation of the Object Relation Transformer published in NeurIPS 2019. You can find the paper [here](https://papers.nips.cc/paper/9293-image-captioning-transforming-objects-into-words.pdf). This repository is largely based on code from Ruotian Luo's Self-critical Sequence Training for Image Captioning GitHub repo, which can be found [here](https://github.com/ruotianluo/self-critical.pytorch).\n\nThe primary additions are as follows:\n* Relation transformer model\n* Script to create reports for runs on MSCOCO\n\n\n## Requirements\n* Python 2.7 (because there is no [coco-caption](https://github.com/tylin/coco-caption) version for Python 3)\n* PyTorch 0.4+ (along with torchvision)\n* h5py\n* scikit-image\n* typing\n* pyemd\n* gensim\n* [cider](https://github.com/ruotianluo/cider.git) (already added as a submodule). 
See `.gitmodules` and clone the referenced repo into\n  the `object_relation_transformer` folder.  \n* The [coco-caption](https://github.com/tylin/coco-caption) library,\n  which is used for generating different evaluation metrics. To set it\n  up, clone the repo into the `object_relation_transformer`\n  folder. Make sure to keep the cloned repo folder name as\n  `coco-caption` and also to run the `get_stanford_models.sh`\n  script from within that repo.\n\n\n\n## Data Preparation\n\n### Download ResNet101 weights for feature extraction\n\nDownload the file `resnet101.pth` from [here](https://drive.google.com/drive/folders/0B7fNdx_jAqhtbVYzOURMdDNHSGM). Copy the weights to a folder `imagenet_weights` within the `data` folder:\n\n```\nmkdir data/imagenet_weights\ncp /path/to/downloaded/weights/resnet101.pth data/imagenet_weights\n```\n\n### Download and preprocess the COCO captions\n\nDownload the [preprocessed COCO captions](http://cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip) from Karpathy's homepage. Extract `dataset_coco.json` from the zip file and copy it into `data/`. This file provides preprocessed captions as well as the standard train-val-test splits.\n\nThen run:\n\n```\n$ python scripts/prepro_labels.py --input_json data/dataset_coco.json --output_json data/cocotalk.json --output_h5 data/cocotalk\n```\n`prepro_labels.py` will map all words that occur \u003c= 5 times to a special `UNK` token, and create a vocabulary for all the remaining words. 
The image information and vocabulary are dumped into `data/cocotalk.json` and discretized caption data are dumped into `data/cocotalk_label.h5`.\n\nNext run:\n```\n$ python scripts/prepro_ngrams.py --input_json data/dataset_coco.json --dict_json data/cocotalk.json --output_pkl data/coco-train --split train\n```\n\nThis will preprocess the dataset and build the cache used to calculate the CIDEr score.\n\n\n### Download the COCO dataset and pre-extract the image features\n\nDownload the [COCO images](http://mscoco.org/dataset/#download) from the MSCOCO website.\nWe need the 2014 training images and the 2014 validation images. You should put the `train2014/` and `val2014/` folders in the same directory, denoted as `$IMAGE_ROOT`:\n\n```\nmkdir $IMAGE_ROOT\npushd $IMAGE_ROOT\nwget http://images.cocodataset.org/zips/train2014.zip\nunzip train2014.zip\nwget http://images.cocodataset.org/zips/val2014.zip\nunzip val2014.zip\npopd\nwget https://msvocds.blob.core.windows.net/images/262993_z.jpg\nmv 262993_z.jpg $IMAGE_ROOT/train2014/COCO_train2014_000000167126.jpg\n```\n\nThe last two commands are needed to address an issue with a corrupted image in the MSCOCO dataset (see [here](https://github.com/karpathy/neuraltalk2/issues/4)). The prepro script will fail otherwise.\n\n\nThen run:\n\n```\n$ python scripts/prepro_feats.py --input_json data/dataset_coco.json --output_dir data/cocotalk --images_root $IMAGE_ROOT\n```\n\n`prepro_feats.py` extracts the ResNet101 features (both fc feature and last conv feature) of each image. The features are saved in `data/cocotalk_fc` and `data/cocotalk_att`, and the resulting files take up about 200 GB. Running this script may take a day or more, depending on hardware.\n\n(Check the prepro scripts for more options, like other ResNet models or other attention sizes.)\n\n### Download the Bottom-up features\n\nDownload the pre-extracted features from [here](https://github.com/peteanderson80/bottom-up-attention). 
For the paper, the adaptive features were used.\n\nDo the following:\n```\nmkdir data/bu_data; cd data/bu_data\nwget https://imagecaption.blob.core.windows.net/imagecaption/trainval.zip\nunzip trainval.zip\n\n```\nThe .zip file is around 22 GB.\nThen return to the base directory and run:\n```\npython scripts/make_bu_data.py --output_dir data/cocobu\n```\n\nThis will create `data/cocobu_fc`, `data/cocobu_att` and `data/cocobu_box`.\n\n\n### Generate the relative bounding box coordinates for the Relation Transformer\n\nRun the following:\n```\npython scripts/prepro_bbox_relative_coords.py --input_json data/dataset_coco.json --input_box_dir data/cocobu_box --output_dir data/cocobu_box_relative --image_root $IMAGE_ROOT\n```\nThis should take a couple hours or so, depending on hardware.\n\n\n## Model Training and Evaluation\n\n### Standard cross-entropy loss training\n\n```\npython train.py --id relation_transformer_bu --caption_model relation_transformer --input_json data/cocotalk.json --input_fc_dir data/cocobu_fc --input_att_dir data/cocobu_att --input_box_dir data/cocobu_box --input_rel_box_dir data/cocobu_box_relative --input_label_h5 data/cocotalk_label.h5 --checkpoint_path log_relation_transformer_bu --noamopt --noamopt_warmup 10000 --label_smoothing 0.0 --batch_size 15 --learning_rate 5e-4 --num_layers 6 --input_encoding_size 512 --rnn_size 2048 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --save_checkpoint_every 6000 --language_eval 1 --val_images_use 5000 --max_epochs 30 --use_box 1\n```\n\nThe train script will dump checkpoints into the folder specified by `--checkpoint_path` (default = `save/`). 
To save disk space, we only keep the best-performing checkpoint on validation and the latest checkpoint.\n\nTo resume training, set the `--start_from` option to the folder containing `infos.pkl` and `model.pth` (usually you can just set `--start_from` and `--checkpoint_path` to the same folder).\n\nIf you have TensorFlow installed, the loss histories are automatically dumped into `--checkpoint_path` and can be visualized using TensorBoard.\n\nThe current command uses scheduled sampling. You can set `--scheduled_sampling_start` to -1 to disable it.\n\nIf you'd like to evaluate BLEU/METEOR/CIDEr scores during training in addition to the validation cross-entropy loss, use the `--language_eval 1` option, but don't forget to download the [coco-caption code](https://github.com/tylin/coco-caption) into the `coco-caption` directory.\n\nFor more options, see `opts.py`.\n\n\nThe above training script should achieve a CIDEr-D score of about 115.\n\n\n### Self-critical RL training\n\nAfter training using cross-entropy loss, additional self-critical training produces significant gains in CIDEr-D score.\n\n\nFirst, copy the model that was pretrained using cross-entropy loss. 
(Copying the model is not mandatory; it just serves as a backup.)\n```\n$ bash scripts/copy_model.sh relation_transformer_bu relation_transformer_bu_rl\n```\n\nThen:\n\n```\npython train.py --id relation_transformer_bu_rl --caption_model relation_transformer --input_json data/cocotalk.json --input_fc_dir data/cocobu_fc --input_att_dir data/cocobu_att --input_box_dir data/cocobu_box --input_rel_box_dir data/cocobu_box_relative --input_label_h5 data/cocotalk_label.h5 --checkpoint_path log_relation_transformer_bu_rl --label_smoothing 0.0 --batch_size 10 --learning_rate 5e-4 --num_layers 6 --input_encoding_size 512 --rnn_size 2048 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --start_from log_relation_transformer_bu_rl --save_checkpoint_every 6000 --language_eval 1 --val_images_use 5000 --self_critical_after 30 --max_epochs 60 --use_box 1\n```\n\nThe above training script should achieve a CIDEr-D score of about 128.\n\n### Evaluate on Karpathy's test split\nTo evaluate the cross-entropy model, run:\n\n```\npython eval.py --dump_images 0 --num_images 5000 --model log_relation_transformer_bu/model.pth --infos_path log_relation_transformer_bu/infos_relation_transformer_bu-best.pkl --image_root $IMAGE_ROOT --input_json data/cocotalk.json --input_label_h5 data/cocotalk_label.h5  --input_fc_dir data/cocobu_fc --input_att_dir data/cocobu_att --input_box_dir data/cocobu_box --input_rel_box_dir data/cocobu_box_relative --use_box 1 --language_eval 1\n```\n\nand for cross-entropy+RL run:\n\n```\npython eval.py --dump_images 0 --num_images 5000 --model log_relation_transformer_bu_rl/model.pth --infos_path log_relation_transformer_bu_rl/infos_relation_transformer_bu-best.pkl --image_root $IMAGE_ROOT --input_json data/cocotalk.json --input_label_h5 data/cocotalk_label.h5  --input_fc_dir data/cocobu_fc --input_att_dir data/cocobu_att --input_box_dir data/cocobu_box --input_rel_box_dir data/cocobu_box_relative --language_eval 1\n```\n\n## 
Visualization\n\n### Visualize caption predictions\nPlace all your images of interest into a folder, e.g. `images`, and run\nthe eval script:\n\n```\n$ python eval.py --dump_images 1 --num_images 10 --model log_relation_transformer_bu/model.pth --infos_path log_relation_transformer_bu/infos_relation_transformer_bu-best.pkl --image_root $IMAGE_ROOT --input_json data/cocotalk.json --input_label_h5 data/cocotalk_label.h5  --input_fc_dir data/cocobu_fc --input_att_dir data/cocobu_att --input_box_dir data/cocobu_box --input_rel_box_dir data/cocobu_box_relative\n```\n\nThis tells the `eval` script to process up to 10 images from the given folder. If you have a big GPU, you can speed up the evaluation by increasing `batch_size`. Use `--num_images -1` to process all images. The eval script will create a `vis.json` file inside the `vis` folder, which can then be visualized with the provided HTML interface:\n\n```\n$ cd vis\n$ python -m SimpleHTTPServer\n```\n\nNow visit `localhost:8000` in your browser and you should see your predicted captions.\n\n### Generate reports from runs on MSCOCO\n\nThe [create_report.py](create_report.py) script can be used to generate HTML reports containing results from different runs. Please see the script for specific usage examples.\n\nThe script takes as input one or more pickle files containing results from runs on the MSCOCO dataset. 
It reads in the pickle files and creates a set of HTML files with tables and graphs generated from the different captioning evaluation metrics, as well as the generated image captions and corresponding metrics for individual images.\n\nIf more than one pickle file with results is provided as input, the script will also generate a report containing a comparison between the metrics generated by each pair of methods.\n\n\n## Model Zoo and Results\n\nPlease find all of our pre-trained models on huggingface:\n[yahoo-inc/object-relation-transformer](https://huggingface.co/yahoo-inc/object-relation-transformer).\nThe table below presents results from our paper on the Karpathy test\nsplit, along with the respective model folders which can be found in the huggingface link above.\nSimilar results should be obtained by running the respective commands in\n[neurips_training_runs.sh](neurips_training_runs.sh). As learning rate scheduling was not fully optimized, these\nvalues should only serve as a reference/expectation rather than what can be achieved with additional tuning.\n\nThe models are Copyright Verizon Media, licensed under the terms of the CC-BY-4.0 license. 
See associated\n[license file](LICENSE-CC-BY-4.md).\n\n| Algorithm\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp; | Model Folder                    | CIDEr-D | SPICE | BLEU-1 | BLEU-4 | METEOR | ROUGE-L |\n|:-------------------------------------------------------------------|:--------------------------------|:-------:|:----:| :--: |:------:|:------:|:-------:|\n| Up-Down + LSTM *                                                   | log_topdown_bu/                 |  106.6  | 19.9 | 75.6 |  32.9  |  26.5  |  55.4   |\n| Up-Down + Transformer                                              | log_transformer_bu/             |  111.0  | 20.9 | 75.0 |  32.8  |  27.5  |  55.6   |\n| Up-Down + Object Relation Transformer                              | log_relation_transformer_bu/    |  112.6  | 20.8 | 75.6 |  33.5  |  27.6  |  56.0   |\n| Up-Down + Object Relation Transformer + Beamsize 2                 | log_relation_transformer_bu/    |  115.4  | 21.2 | 76.6 |  35.5  |  28.0  |  56.6   |\n| Up-Down + Object Relation Transformer + Self-Critical + Beamsize 5 | log_relation_transformer_bu_rl/ |  128.3  | 22.6 | 80.5 |  38.6  |  28.7  |  58.4   |\n\n\\* Note that the pre-trained Up-Down + LSTM model above produces slightly better results than\nreported, as it came from a different training run. We kept the older LSTM results in the table above for consistency\nwith our paper.\n\n### Comparative Analysis\n\nIn addition, in the paper we also present a head-to-head comparison of the Object Relation Transformer against the \"Up-Down + Transformer\" model. 
(Results from the latter model are also included in the table above.)\nIn the paper, we refer to this latter model as \"Baseline Transformer\", as it does not make use of geometry in its attention definition. The idea of the head-to-head comparison is to better understand the improvement\nobtained by adding geometric attention to the Transformer, both quantitatively and qualitatively. The comparison consists of a set of evaluation metrics computed for each model on a per-image basis, as well as aggregated over all images.\nIt includes the results of paired t-tests, which test for statistically significant differences between the evaluation metrics resulting from each of the models. This comparison can be generated by running the commands in\n[neurips_report_commands.sh](neurips_report_commands.sh). The commands first run the two aforementioned models on the MSCOCO test set and then generate the corresponding report containing the complete comparative analysis.\n\n\n## Citation\n\nIf you find this repo useful, please consider citing (no obligation at all):\n\n```\n@article{herdade2019image,\n  title={Image Captioning: Transforming Objects into Words},\n  author={Herdade, Simao and Kappeler, Armin and Boakye, Kofi and Soares, Joao},\n  journal={arXiv preprint arXiv:1906.05963},\n  year={2019}\n}\n```\n\nOf course, please also cite the original papers of the models you are using (you can find references in the model files).\n\n## Contribute\n\nPlease refer to [the contributing.md file](Contributing.md) for information about how to get involved. We welcome\nissues, questions, and pull requests.\n\nPlease be aware that we (the maintainers) are currently busy with other projects, so it may take some days before we\nare able to get back to you. 
We do not foresee big changes to this repository going forward.\n\n## Maintainers\n\nKofi Boakye: kaboakye@verizonmedia.com\n\nSimao Herdade: sherdade@verizonmedia.com\n\nJoao Soares: jvbsoares@verizonmedia.com\n\n## License\n\nThis project is licensed under the terms of the MIT open source license. Please refer to [LICENSE](LICENSE) for the full terms.\n\n\n## Acknowledgments\n\nThanks to [Ruotian Luo](https://github.com/ruotianluo) for the original code.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyahoo%2Fobject_relation_transformer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyahoo%2Fobject_relation_transformer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyahoo%2Fobject_relation_transformer/lists"}