{"id":28967199,"url":"https://github.com/kuanghuei/scan","last_synced_at":"2025-06-24T08:06:01.795Z","repository":{"id":41063227,"uuid":"133075819","full_name":"kuanghuei/SCAN","owner":"kuanghuei","description":"PyTorch source code for \"Stacked Cross Attention for Image-Text Matching\" (ECCV 2018)","archived":false,"fork":false,"pushed_at":"2023-05-18T07:45:40.000Z","size":35,"stargazers_count":490,"open_issues_count":19,"forks_count":106,"subscribers_count":10,"default_branch":"master","last_synced_at":"2023-10-25T15:16:30.946Z","etag":null,"topics":["computer-vision","cross-modal","deep-learning","image-captioning","neural-network","pytorch","visual-semantic"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kuanghuei.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2018-05-11T18:37:52.000Z","updated_at":"2023-10-24T16:32:09.000Z","dependencies_parsed_at":"2022-07-14T23:16:59.601Z","dependency_job_id":"ee083689-2a72-41bd-8531-37620805e379","html_url":"https://github.com/kuanghuei/SCAN","commit_stats":null,"previous_names":[],"tags_count":0,"template":null,"template_full_name":null,"purl":"pkg:github/kuanghuei/SCAN","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kuanghuei%2FSCAN","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kuanghuei%2FSCAN/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kuanghuei%2FSCAN/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kuanghuei%2FSCAN/manifests","owner_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/owners/kuanghuei","download_url":"https://codeload.github.com/kuanghuei/SCAN/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kuanghuei%2FSCAN/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261632136,"owners_count":23187271,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computer-vision","cross-modal","deep-learning","image-captioning","neural-network","pytorch","visual-semantic"],"created_at":"2025-06-24T08:06:01.147Z","updated_at":"2025-06-24T08:06:01.763Z","avatar_url":"https://github.com/kuanghuei.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Introduction\n\nThis is the Stacked Cross Attention Network (SCAN), the source code for [Stacked Cross Attention for Image-Text Matching](https://arxiv.org/abs/1803.08024) ([project page](https://kuanghuei.github.io/SCANProject/)) from Microsoft AI and Research. The paper will appear in ECCV 2018. It is built on top of [VSE++](https://github.com/fartashf/vsepp) in PyTorch.\n\n\n## Requirements and Installation\nWe recommend the following dependencies.\n\n* Python 2.7\n* [PyTorch](http://pytorch.org/) 0.3\n* [NumPy](http://www.numpy.org/) (\u003e1.12.1)\n* [TensorBoard](https://github.com/TeamHG-Memex/tensorboard_logger)\n\n* Punkt Sentence Tokenizer:\n```python\nimport nltk\n# Download the Punkt sentence tokenizer non-interactively\nnltk.download('punkt')\n```\n\n## Download data\n\nDownload the dataset files and pre-trained models. 
We use splits produced by [Andrej Karpathy](http://cs.stanford.edu/people/karpathy/deepimagesent/). The raw images can be downloaded from their original sources [here](http://nlp.cs.illinois.edu/HockenmaierGroup/Framing_Image_Description/KCCA.html), [here](http://shannon.cs.illinois.edu/DenotationGraph/) and [here](http://mscoco.org/).\n\nThe precomputed image features of MS-COCO are from [here](https://github.com/peteanderson80/bottom-up-attention). The precomputed image features of Flickr30K are extracted from the raw Flickr30K images using the bottom-up attention model from [here](https://github.com/peteanderson80/bottom-up-attention). All the data needed for reproducing the experiments in the paper, including image features and vocabularies, can be downloaded from:\n\nhttps://www.kaggle.com/datasets/kuanghueilee/scan-features\n\nWe refer to the path of the files extracted from `data.zip` as `$DATA_PATH`, and the files from `vocab.zip` should be extracted to the `./vocab` directory. Alternatively, you can run `vocab.py` to produce the vocabulary files. For example:\n\n```bash\npython vocab.py --data_path data --data_name f30k_precomp\npython vocab.py --data_path data --data_name coco_precomp\n```\n\n## Data pre-processing (Optional)\n\nThe image features of Flickr30K and MS-COCO are available in NumPy array format, which can be used for training directly. However, if you wish to test on another dataset, you will need to start from scratch:\n\n1. Use `bottom-up-attention/tools/generate_tsv.py` and the bottom-up attention model to extract features of image regions. The output is a TSV file whose columns are ['image_id', 'image_w', 'image_h', 'num_boxes', 'boxes', 'features'].\n2. 
Use `util/convert_data.py` to convert the above output to a numpy array.\n\n\nIf downloading the whole data package containing bottom-up image features for Flickr30K and MS-COCO is too slow for you, you can download everything but image features from https://www.kaggle.com/datasets/kuanghueilee/scan-features and compute image features locally from raw images.\n\n\n## Training new models\nRun `train.py`:\n\n```bash\npython train.py --data_path \"$DATA_PATH\" --data_name coco_precomp --vocab_path \"$VOCAB_PATH\" --logger_name runs/coco_scan/log --model_name runs/coco_scan/log --max_violation --bi_gru\n```\n\nArguments used to train Flickr30K models:\n\n| Method    | Arguments |\n| :-------: | :-------: |\n| SCAN t-i LSE     | `--max_violation --bi_gru --agg_func=LogSumExp --cross_attn=t2i --lambda_lse=6 --lambda_softmax=9` |\n| SCAN t-i AVG     | `--max_violation --bi_gru --agg_func=Mean --cross_attn=t2i --lambda_softmax=9` |\n| SCAN i-t LSE     | `--max_violation --bi_gru --agg_func=LogSumExp --cross_attn=i2t --lambda_lse=5 --lambda_softmax=4` |\n| SCAN i-t AVG     | `--max_violation --bi_gru --agg_func=Mean --cross_attn=i2t --lambda_softmax=4` |\n\n\nArguments used to train MS-COCO models:\n\n| Method    | Arguments |\n| :-------: | :-------: |\n| SCAN t-i LSE     | `--max_violation --bi_gru --agg_func=LogSumExp --cross_attn=t2i --lambda_lse=6 --lambda_softmax=9 --num_epochs=20 --lr_update=10 --learning_rate=.0005` |\n| SCAN t-i AVG     | `--max_violation --bi_gru --agg_func=Mean --cross_attn=t2i --lambda_softmax=9 --num_epochs=20 --lr_update=10 --learning_rate=.0005` |\n| SCAN i-t LSE     | `--max_violation --bi_gru --agg_func=LogSumExp --cross_attn=i2t --lambda_lse=20 --lambda_softmax=4 --num_epochs=20 --lr_update=10 --learning_rate=.0005` |\n| SCAN i-t AVG     | `--max_violation --bi_gru --agg_func=Mean --cross_attn=i2t --lambda_softmax=4 --num_epochs=20 --lr_update=10 --learning_rate=.0005` |\n\n## Evaluate trained models\n\n```python\nfrom vocab import 
Vocabulary\nimport evaluation\nevaluation.evalrank(\"$RUN_PATH/coco_scan/model_best.pth.tar\", data_path=\"$DATA_PATH\", split=\"test\")\n```\n\nTo do cross-validation on MS-COCO, pass `fold5=True` with a model trained using \n`--data_name coco_precomp`.\n\n## Reference\n\nIf you find this code useful, please cite the following paper:\n\n```\n@inproceedings{lee2018stacked,\n  title={Stacked cross attention for image-text matching},\n  author={Lee, Kuang-Huei and Chen, Xi and Hua, Gang and Hu, Houdong and He, Xiaodong},\n  booktitle={Proceedings of the European conference on computer vision (ECCV)},\n  pages={201--216},\n  year={2018}\n}\n```\n\n## License\n\n[Apache License 2.0](http://www.apache.org/licenses/LICENSE-2.0)\n\n\n## Acknowledgments\n\nThe authors would like to thank [Po-Sen Huang](https://posenhuang.github.io/) and Yokesh Kumar for helping with the manuscript. We also thank Li Huang, Arun Sacheti, and the Bing Multimedia team for supporting this work.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkuanghuei%2Fscan","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkuanghuei%2Fscan","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkuanghuei%2Fscan/lists"}