{"id":19360134,"url":"https://github.com/yashkant/sam-textvqa","last_synced_at":"2025-04-23T11:33:02.055Z","repository":{"id":47053144,"uuid":"301547198","full_name":"yashkant/sam-textvqa","owner":"yashkant","description":"Official code for paper \"Spatially Aware Multimodal Transformers for TextVQA\" published at ECCV, 2020.","archived":false,"fork":false,"pushed_at":"2021-09-15T14:48:42.000Z","size":1010,"stargazers_count":64,"open_issues_count":5,"forks_count":13,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-02T15:04:10.960Z","etag":null,"topics":["eccv","language","textvqa","vision"],"latest_commit_sha":null,"homepage":"https://yashkant.github.io/projects/sam-textvqa","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yashkant.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-10-05T21:43:53.000Z","updated_at":"2025-03-17T07:19:31.000Z","dependencies_parsed_at":"2022-08-23T17:40:19.779Z","dependency_job_id":null,"html_url":"https://github.com/yashkant/sam-textvqa","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yashkant%2Fsam-textvqa","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yashkant%2Fsam-textvqa/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yashkant%2Fsam-textvqa/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yashkant%2Fsam-textvqa/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yashkant","download_url":"https://codeload.gi
thub.com/yashkant/sam-textvqa/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250425621,"owners_count":21428590,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["eccv","language","textvqa","vision"],"created_at":"2024-11-10T07:17:12.237Z","updated_at":"2025-04-23T11:33:00.942Z","avatar_url":"https://github.com/yashkant.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Spatially Aware Multimodal Transformers for TextVQA\n===================================================\n\u003ch4\u003e\nYash Kant, Dhruv Batra, Peter Anderson, Alex Schwing, Devi Parikh, Jiasen Lu, Harsh Agrawal\n\u003c/br\u003e\n\u003cspan style=\"font-size: 14pt; color: #555555\"\u003e\nPublished at ECCV, 2020\n\u003c/span\u003e\n\u003c/h4\u003e\n\u003chr\u003e\n\n**Paper:** [arxiv.org/abs/2007.12146](https://arxiv.org/abs/2007.12146)\n\n**Project Page:** [yashkant.github.io/projects/sam-textvqa](https://yashkant.github.io/projects/sam-textvqa.html)\n\nWe propose a novel spatially aware self-attention layer such that each visual entity only looks at neighboring entities defined by a spatial graph and use it to solve TextVQA.\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"tools/sam-textvqa-large.png\"\u003e\n\u003c/p\u003e\n\n\n## Repository Setup\n\nCreate a fresh conda environment, and install all dependencies.\n\n```text\nconda create -n sam python=3.6\nconda activate sam\ncd sam-textvqa\npip install -r requirements.txt\n```\n\nInstall pytorch\n```\nconda install pytorch torchvision cudatoolkit=10.0 
-c pytorch\n```\n\nFinally, install apex from: https://github.com/NVIDIA/apex\n\n## Data Setup\nDownload the files from the [Dropbox link](https://www.dropbox.com/sh/dk6oubjlt2x7w0h/AAAKExm33IKnVe8mkC4tOzUKa) and place them in the ``data/`` folder.\nEnsure that the data paths match the directory structure described in ``data/README.md``.\n\n## Run Experiments\nPick the suitable configuration file from the table below:\n\n | Method  |  Context (c)   |  Train Splits   |  Evaluation Splits  | Config File |\n | ------- | ------ | ------ | ------ | ------ |\n | SA-M4C  | 3 | TextVQA | TextVQA | train-tvqa-eval-tvqa-c3.yml |\n | SA-M4C  | 3 | TextVQA + STVQA | TextVQA | train-tvqa_stvqa-eval-tvqa-c3.yml |\n | SA-M4C  | 3 | STVQA | STVQA | train-stvqa-eval-stvqa-c3.yml |\n | SA-M4C  | 5 | TextVQA | TextVQA | train-tvqa-eval-tvqa-c5.yml |\n\nTo run an experiment, pass the chosen configuration file (shown here as the placeholder ``config.yml``):\n```\npython train.py \\\n--config config.yml \\\n--tag experiment-name\n```\n\n\nTo evaluate the provided pretrained checkpoint, use:\n```\npython train.py \\\n--config configs/train-tvqa_stvqa-eval-tvqa-c3.yml \\\n--pretrained_eval data/pretrained-models/best_model.tar\n```\nNote: The beam-search evaluation is undergoing changes and will be updated.\n\n**Resources Used**: We ran all the experiments on 2 Titan Xp GPUs. 
\n\n## Citation\n```\n@inproceedings{kant2020spatially,\n  title={Spatially Aware Multimodal Transformers for TextVQA},\n  author={Kant, Yash and Batra, Dhruv and Anderson, Peter\n          and Schwing, Alexander and Parikh, Devi and Lu, Jiasen\n          and Agrawal, Harsh},\n  booktitle={ECCV},\n  year={2020}}\n```\n\n## Acknowledgements\nParts of this codebase were borrowed from the following repositories:\n- [12-in-1: Multi-Task Vision and Language Representation Learning](https://github.com/facebookresearch/vilbert-multi-task): Training Setup\n- [MMF: A multimodal framework for vision and language research](https://github.com/facebookresearch/mmf/): Dataset processors and M4C model\n\nWe thank \u003ca href=\"https://abhishekdas.com/\"\u003eAbhishek Das\u003c/a\u003e and \u003ca href=\"https://amoudgl.github.io/\"\u003eAbhinav Moudgil\u003c/a\u003e for their feedback, and \u003ca href=\"https://ronghanghu.com/\"\u003eRonghang Hu\u003c/a\u003e for sharing an early version of his work. \nThe Georgia Tech effort was supported in part by NSF, AFRL, DARPA, ONR YIPs, ARO PECASE, and Amazon. \nThe views and conclusions contained herein are those of the authors and should not be interpreted\n as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government, or any sponsor.\n\n\n## License\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyashkant%2Fsam-textvqa","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyashkant%2Fsam-textvqa","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyashkant%2Fsam-textvqa/lists"}