{"id":19932208,"url":"https://github.com/amazon-science/semimtr-text-recognition","last_synced_at":"2025-06-14T15:03:12.833Z","repository":{"id":48212439,"uuid":"515459560","full_name":"amazon-science/semimtr-text-recognition","owner":"amazon-science","description":"Multimodal Semi-Supervised Learning for Text Recognition (SemiMTR)","archived":false,"fork":false,"pushed_at":"2023-09-12T11:11:27.000Z","size":1290,"stargazers_count":82,"open_issues_count":1,"forks_count":12,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-05-03T11:35:42.109Z","etag":null,"topics":["computer-vision","consistency-regularization","contrastive-learning","deep-learning","ocr","pytorch","scene-text-recognition","self-supervised-learning","semi-supervised-learning","text-recognition"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/amazon-science.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-07-19T06:18:13.000Z","updated_at":"2025-03-02T12:03:07.000Z","dependencies_parsed_at":"2025-05-03T11:42:23.376Z","dependency_job_id":null,"html_url":"https://github.com/amazon-science/semimtr-text-recognition","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/amazon-science/semimtr-text-recognition","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fsemimtr-text-recognition","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/amazon-science%2Fsemimtr-text-recognition/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fsemimtr-text-recognition/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fsemimtr-text-recognition/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/amazon-science","download_url":"https://codeload.github.com/amazon-science/semimtr-text-recognition/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fsemimtr-text-recognition/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259835337,"owners_count":22918972,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computer-vision","consistency-regularization","contrastive-learning","deep-learning","ocr","pytorch","scene-text-recognition","self-supervised-learning","semi-supervised-learning","text-recognition"],"created_at":"2024-11-12T23:09:23.846Z","updated_at":"2025-06-14T15:03:12.806Z","avatar_url":"https://github.com/amazon-science.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Multimodal Semi-Supervised Learning for Text Recognition\n\nThe official code implementation of SemiMTR [Paper](https://arxiv.org/pdf/2205.03873) \n| [Pretrained Models](#Pretrained-Models) | [SeqCLR Paper](https://arxiv.org/pdf/2012.10873)\n|  [Citation](#citation) | [Demo](#demo).\n\n**[Aviad Aberdam](https://sites.google.com/view/aviad-aberdam/home),\n[Roy 
Ganz](https://il.linkedin.com/in/roy-ganz-270592),\n[Shai Mazor](https://il.linkedin.com/in/shai-mazor-529771b),\n[Ron Litman](https://scholar.google.com/citations?hl=iw\u0026user=69GY5dEAAAAJ)**\n\nWe introduce a multimodal semi-supervised learning algorithm for text recognition, which is customized for modern\nvision-language multimodal architectures. To this end, we present a unified one-stage pretraining method for the vision\nmodel, which suits scene text recognition. In addition, we offer a sequential, character-level, consistency\nregularization in which each modality teaches itself. Extensive experiments demonstrate state-of-the-art performance on\nmultiple scene text recognition benchmarks.\n\n### Figures\n\n\u003cfigure\u003e\n  \u003cp align=\"center\"\u003e\u003cimg src=\"figures/semimtr_vision_pretraining.svg\" alt=\"semimtr vision model pretraining\" width=\"512\" /\u003e\u003c/p\u003e\n  \u003cfigcaption\u003e\u003cp align=\"center\"\u003e\u003cb\u003eFigure 1:\u003c/b\u003e SemiMTR vision model pretraining: Contrastive learning \u003c/p\u003e\u003c/figcaption\u003e\n\u003c/figure\u003e\n\u003cbr/\u003e\u003cbr/\u003e\n\n\u003cfigure\u003e\n  \u003cp align=\"center\"\u003e\u003cimg src=\"figures/semimtr_consistency_regularization.svg\" alt=\"semimtr fine-tuning\" width=\"512\" /\u003e\u003c/p\u003e\n  \u003cfigcaption\u003e\u003cp align=\"center\"\u003e\u003cb\u003eFigure 2:\u003c/b\u003e SemiMTR model fine-tuning: Consistency regularization \u003c/p\u003e\u003c/figcaption\u003e\n\u003c/figure\u003e\n\n\u003c!-- \u003cbr/\u003e\u003cbr/\u003e\n\u003cfigure\u003e\n  \u003cp align=\"center\"\u003e\u003cimg src=\"figures/abinet_model_architecture.svg\" alt=\"semimtr model architecture\" width=\"512\" /\u003e\u003c/p\u003e\n  \u003cfigcaption\u003e\u003cp align=\"center\"\u003e\u003cb\u003e SemiMTR model architecture: ABINet Model \u003c/b\u003e\u003c/p\u003e\u003c/figcaption\u003e\n\u003c/figure\u003e --\u003e\n  \n\n# Getting Started\n\n\u003ch3 
id=\"demo\"\u003e \n    Run Demo with Pretrained Model \n    \u003ca \n    href=\"https://colab.research.google.com/github/amazon-research/semimtr-text-recognition/blob/master/notebook_demo.ipynb\" target=\"_parent\"\u003e\n    \u003cimg src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/\u003e\n    \u003c/a\u003e \n\u003c/h3\u003e\n\n## Dependencies\n\n- Inference and the demo require PyTorch \u003e= 1.7.1\n- For training and evaluation, install the dependencies:\n\n```\npip install -r requirements.txt\n```\n\n## Pretrained Models\n\nDownload pretrained models:\n\n- [SemiMTR Real-L + Real-U](https://awscv-public-data.s3.us-west-2.amazonaws.com/semimtr/semimtr_real_l_and_u.pth)\n- [SemiMTR Real-L + Real-U + Synth](https://awscv-public-data.s3.us-west-2.amazonaws.com/semimtr/semimtr_real_l_and_u_and_synth.pth)\n- [SemiMTR Real-L + Real-U + TextOCR](https://awscv-public-data.s3.us-west-2.amazonaws.com/semimtr/semimtr_real_l_and_u_and_textocr.pth)\n\nPretrained vision model:\n\n- [SemiMTR Vision Model Real-L + Real-U](https://awscv-public-data.s3.us-west-2.amazonaws.com/semimtr/semimtr_vision_model_real_l_and_u.pth)\n\nPretrained language model:\n\n- [ABINet Language Model](https://awscv-public-data.s3.us-west-2.amazonaws.com/semimtr/abinet_language_model.pth)\n\n\nTo fine-tune SemiMTR without running the vision and language pretraining yourself, place the above models in a `workdir` directory as follows:\n\n    workdir\n    ├── semimtr_vision_model_real_l_and_u.pth\n    ├── abinet_language_model.pth\n    └── semimtr_real_l_and_u.pth\n\n### SemiMTR Models Accuracy\n\n|Training Data|IIIT|SVT|IC13|IC15|SVTP|CUTE|Avg.|COCO|RCTW|Uber|ArT|LSVT|MLT19|ReCTS|Avg.|\n|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|\n|Synth 
(ABINet)|96.4|93.2|95.1|82.1|89.0|89.2|91.2|63.1|59.7|39.6|68.3|59.5|85.0|86.7|52.0|\n|Real-L+U|97.0|95.8|96.1|84.7|90.7|94.1|92.8|72.2|76.1|58.5|71.6|77.1|90.4|92.4|65.4|\n|Real-L+U+Synth|97.4|96.8|96.5|84.7|92.9|95.1|93.3|73.0|75.7|58.6|72.4|77.5|90.4|93.1|65.8|\n|Real-L+U+TextOCR|97.3|97.7|96.9|86.0|92.2|94.4|93.7|73.8|77.7|58.6|73.5|78.3|91.3|93.3|66.1|\n\n\n## Datasets\n\n- Download the preprocessed LMDB datasets for training and\n  evaluation. [Link](https://github.com/ku21fan/STR-Fewer-Labels/blob/main/data.md#download-preprocessed-lmdb-dataset-for-traininig-and-evaluation)\n- For training the language model, download WikiText103. [Link](https://github.com/FangShancheng/ABINet#datasets)\n- The final structure of the `data` directory can be found in [`DATA.md`](data/DATA.md).\n\n## Training\n\n1. Pretrain vision model\n    ```\n    CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config configs/semimtr_pretrain_vision_model.yaml\n    ```\n2. Pretrain language model\n    ```\n    CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config configs/pretrain_language_model.yaml\n    ```\n3. Train SemiMTR\n    ```\n    CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config configs/semimtr_finetune.yaml\n    ```\n\nNote:\n\n- You can set the `checkpoint` path for the vision and language models separately to load a specific pretrained model, or set\n  it to `None` to train from scratch.\n\n### Training ABINet\n\n1. Pretrain vision model\n    ```\n    CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config configs/abinet_pretrain_vision_model.yaml\n    ```\n2. Pretrain language model\n    ```\n    CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config configs/pretrain_language_model.yaml\n    ```\n3. 
Train ABINet\n    ```\n    CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config configs/abinet_finetune.yaml\n    ```\n\n## Evaluation\n\n```\nCUDA_VISIBLE_DEVICES=0 python main.py --config configs/semimtr_finetune.yaml --run_only_test\n```\n\n## Arguments\n\n- `--checkpoint /path/to/checkpoint` sets the path to the model checkpoint to evaluate\n- `--test_root /path/to/dataset` sets the path to the evaluation dataset\n- `--model_eval [alignment|vision]` selects which sub-model to evaluate\n\n## Citation\n\nIf you find our method useful for your research, please cite:\n\n```\n@article{aberdam2022multimodal,\n  title={Multimodal Semi-Supervised Learning for Text Recognition},\n  author={Aberdam, Aviad and Ganz, Roy and Mazor, Shai and Litman, Ron},\n  journal={arXiv preprint arXiv:2205.03873},\n  year={2022}\n}\n\n@inproceedings{aberdam2021sequence,\n  title={Sequence-to-sequence contrastive learning for text recognition},\n  author={Aberdam, Aviad and Litman, Ron and Tsiper, Shahar and Anschel, Oron and Slossberg, Ron and Mazor, Shai and Manmatha, R and Perona, Pietro},\n  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},\n  pages={15302--15312},\n  year={2021}\n}\n```\n\n## Acknowledgements\n\nThis implementation is based on the [ABINet](https://github.com/FangShancheng/ABINet) repository.\n\n## Security\n\nSee [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.\n\n## License\n\nThis project is licensed under the Apache-2.0 License.\n\n## Contact\n\nFeel free to contact us with any questions: [Aviad Aberdam](mailto:aaberdam@amazon.com?subject=[GitHub-SemiMTR])\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famazon-science%2Fsemimtr-text-recognition","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Famazon-science%2Fsemimtr-text-recognition","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famazon-science%2Fsemimtr-text-recognition/lists"}