{"id":24100107,"url":"https://github.com/uakarsh/latr","last_synced_at":"2025-07-29T11:06:32.238Z","repository":{"id":41519740,"uuid":"481092984","full_name":"uakarsh/latr","owner":"uakarsh","description":"Implementation of LaTr: Layout-aware transformer for scene-text VQA,a novel multimodal architecture for Scene Text Visual Question Answering (STVQA)","archived":false,"fork":false,"pushed_at":"2024-10-30T07:37:35.000Z","size":4866,"stargazers_count":53,"open_issues_count":10,"forks_count":6,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-07T22:31:58.379Z","etag":null,"topics":["deep-learning","pytorch"],"latest_commit_sha":null,"homepage":"https://uakarsh.github.io/latr/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/uakarsh.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-04-13T06:07:15.000Z","updated_at":"2025-02-19T05:43:02.000Z","dependencies_parsed_at":"2024-10-30T08:32:53.238Z","dependency_job_id":null,"html_url":"https://github.com/uakarsh/latr","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/uakarsh/latr","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uakarsh%2Flatr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uakarsh%2Flatr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uakarsh%2Flatr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uakarsh%2Flatr/manifests","owner
_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/uakarsh","download_url":"https://codeload.github.com/uakarsh/latr/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uakarsh%2Flatr/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267677054,"owners_count":24126306,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-29T02:00:12.549Z","response_time":2574,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","pytorch"],"created_at":"2025-01-10T15:57:48.121Z","updated_at":"2025-07-29T11:06:32.215Z","avatar_url":"https://github.com/uakarsh.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LaTr - PyTorch\n\n![latr architecture](images/latr-architecture.jpg)\n\nImplementation of [LaTr: Layout-aware transformer for scene-text VQA](https://arxiv.org/abs/2112.12494), a novel multimodal architecture for Scene Text Visual Question Answering (STVQA).\n\nLaTr improves robustness towards OCR errors, a common reason for failure cases in STVQA. In addition, by leveraging a vision transformer, LaTr eliminates the need for an external object detector. LaTr outperforms state-of-the-art STVQA methods on multiple datasets. 
In particular, +7.6% on TextVQA, +10.8% on ST-VQA and +4.0% on OCR-VQA (all absolute accuracy numbers).\n\nThe official implementation was not released by the authors.\n\n\nNOTE: I have tried my best to implement this paper with minimal assumptions. An essential part of any implementation is to provide pre-trained weights and report results on the datasets used in the paper; however, due to resource limitations on my side, I am unable to provide pre-trained weights. I will try to include scripts in the examples so that anyone with the resources can use them to obtain pre-trained weights and share them. All feedback is welcome, and I hope this implementation turns out to be useful to the community.\n\n## Demo\n![latr architecture](images/demo.gif)\n\nAn interactive demo can be found [here](https://huggingface.co/spaces/iakarshu/latr-vqa).\n\n## Install\n\n```bash\npip install transformers\npip install sentencepiece==0.1.91\npip install pytesseract\nsudo apt install tesseract-ocr\npip install 'Pillow==7.1.2'\n```\n\n## Usage\n\n* For the pre-training task, refer to [this notebook](https://github.com/uakarsh/latr/blob/main/examples/LaTr_PreTraining.ipynb)\n* For training LaTr from scratch with PyTorch Lightning, refer to [these examples](https://github.com/uakarsh/latr/tree/main/examples/textvqa)\n\n\n## Results\n\nCurrently, I use the following configuration:\n\n```yaml\nclasses: 32128\nhidden_state: 768\nlearning_rate: 0.0001\nmax_2d_position_embeddings: 1001\nmax_steps: 50000\nseq_len: 512\nt5_model: \"t5-base\"\nvocab_size: 32128\nbatch size: 1  # I think this is a major difference between my training and the authors'\n```\n\nWith this configuration, I obtained a validation accuracy of 27.42% (the authors achieved 44.03%).\n\n* 
The results of all the experiments can be found out [here](https://wandb.ai/iakarshu/VQA%20with%20LaTr?workspace=), note that I was not able to save the checkpoint of that (some kaggle error), but I initialized the weights from [here](https://www.kaggleusercontent.com/kf/99663112/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..JDENcUm0rUk0qGihFn1QuQ.wKuoRF1z1AmNCwFoZJN3SSFRMNKRvZLlGhzAykt7njLW3OUwV-TQCk9fbUx27ITQ6TpBWeYZl7G3mVorvDQquZfcYHoFam8yZpZ1zl9hmX_YQdZ1KtNrlMv0mKCpr2r6QH7WtUCbi0nWOG3R_31GJHV42pyUXJ1EII9KgnSmjKcTVNjRl7SdrwVnUW8caVtGDTZeMZuS8HH1T_-6pInZMwaZvekEvRqgIM2TArZH-0OVwIszKdfbQftcPz2f9NzpSHeu9bq6ZxhjUcUTCdNJxeNeIcxv4jnfTW146_r_zzmt4SWo8QSsG-zQAPAsxv5JL9nZiP65OUe4uNeWSO-t4ChzpRkUQLnv01ptWkzK0p9j00-xIlC36F5mXXtpbvLHlLXvkBKlrJ4NKEN76RdYAv77sbwoMQZ8RVHRj7-QYcBzaPZgTUNlRi65FnA30v0_UZIMreHyN0H1K7Kdj34TS8_pY058rYVhQY9avwuc32krDOoSG-sQ2FZA7Nvs5CoH0H6ejyvrsMMhCBbROkZDiD0jzeKwlPi-267OqjEMsKar77LsDgzkhccxp6Zgr8ZHTkEnVE553A8Yz7J76Q5vFx-M1ZXhoJIVfZcdSSpoI_jih7woeLdJVWIvctvE1aof88M1PmHPmB9qS2V9S10tK1MBIGeay06xW83d9dd5qD93ugxKZISxEg-IJddlSuII.o1fCKlUduAUrwtk1ANYLug/models/epoch=0-step=34602.ckpt)\n\nThe same weights can be downloaded by the command as follows:\n```\npip install gdown\ngdown 192-AETChd2FoNfut0hkLRwcLMfd5-uIj\n```\n\n* The script of the same can be found out [here](https://www.kaggle.com/code/akarshu121/latr-textvqa-training-with-wandb)\n\n##  License\n\nMIT\n\n## Maintainers\n\n- [uakarsh](https://github.com/uakarsh)\n\n## Contribute\n\n\n## Citations\n\n```bibtex\n@misc{https://doi.org/10.48550/arxiv.2112.12494,\n  doi = {10.48550/ARXIV.2112.12494},\n  url = {https://arxiv.org/abs/2112.12494},\n  author = {Biten, Ali Furkan and Litman, Ron and Xie, Yusheng and Appalaraju, Srikar and Manmatha, R.},\n  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},\n  title = {LaTr: Layout-Aware Transformer for Scene-Text VQA},\n  publisher = {arXiv},\n  year = {2021},\n  
copyright = {Creative Commons Attribution 4.0 International}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fuakarsh%2Flatr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fuakarsh%2Flatr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fuakarsh%2Flatr/lists"}