{"id":18653309,"url":"https://github.com/roatienza/deep-text-recognition-benchmark","last_synced_at":"2025-04-05T16:10:24.133Z","repository":{"id":40341357,"uuid":"320124143","full_name":"roatienza/deep-text-recognition-benchmark","owner":"roatienza","description":"PyTorch code of my ICDAR 2021 paper Vision Transformer for Fast and Efficient Scene Text Recognition (ViTSTR)","archived":false,"fork":false,"pushed_at":"2024-04-09T10:25:11.000Z","size":26879,"stargazers_count":301,"open_issues_count":26,"forks_count":58,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-03-29T15:09:12.569Z","etag":null,"topics":["ocr","str","vision-transformer","vitstr"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/roatienza.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-12-10T01:22:34.000Z","updated_at":"2025-03-14T12:31:01.000Z","dependencies_parsed_at":"2024-11-07T07:11:13.626Z","dependency_job_id":"f61a0eb4-6de8-4831-a4b8-5611947f7214","html_url":"https://github.com/roatienza/deep-text-recognition-benchmark","commit_stats":{"total_commits":438,"total_committers":17,"mean_commits":"25.764705882352942","dds":0.545662100456621,"last_synced_commit":"fb06d18bde4e62e728208ba3274390b8a615418a"},"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/roatienza%2Fdeep-text-recognition-benchmark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/r
epositories/roatienza%2Fdeep-text-recognition-benchmark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/roatienza%2Fdeep-text-recognition-benchmark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/roatienza%2Fdeep-text-recognition-benchmark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/roatienza","download_url":"https://codeload.github.com/roatienza/deep-text-recognition-benchmark/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247361695,"owners_count":20926643,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ocr","str","vision-transformer","vitstr"],"created_at":"2024-11-07T07:11:07.067Z","updated_at":"2025-04-05T16:10:24.094Z","avatar_url":"https://github.com/roatienza.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Vision Transformer for Fast and Efficient Scene Text Recognition\n\nViTSTR is a simple single-stage model that uses a pre-trained Vision Transformer (ViT) to perform Scene Text Recognition (STR). Its accuracy is comparable to that of state-of-the-art STR models, although it uses significantly fewer parameters and FLOPS. ViTSTR is also fast due to the parallel computation inherent to the ViT architecture. 
\n\n### Paper\n* [ICDAR 2021](https://link.springer.com/chapter/10.1007/978-3-030-86549-8_21)\n* [arXiv](https://arxiv.org/abs/2105.08582)\n\n![ViTSTR Model](figures/vitstr_model.png)\n\nViTSTR is built on a fork of the [CLOVA AI Deep Text Recognition Benchmark](https://github.com/clovaai/deep-text-recognition-benchmark). Below we document how to train and evaluate ViTSTR-Tiny and ViTSTR-Small.\n\n### Install requirements\n\n```\npip3 install -r requirements.txt\n```\n\n### Inference\n\n```\npython3 infer.py --image demo_image/demo_1.png --model https://github.com/roatienza/deep-text-recognition-benchmark/releases/download/v0.1.0/vitstr_small_patch16_jit.pt\n```\n\nSet `--image` to the path of your target image file.\n\nAfter the model has been downloaded, you can perform inference using the local checkpoint:\n\n```\npython3 infer.py --image demo_image/demo_2.jpg --model vitstr_small_patch16_jit.pt\n```\n\n**Quantized Model on x86**\n\n```\npython3 infer.py --image demo_image/demo_1.png --model  https://github.com/roatienza/deep-text-recognition-benchmark/releases/download/v0.1.0/vitstr_small_patch16_quant.pt --quantized\n```\n\n**Quantized Model on Raspberry Pi 4**\n\n```\npython3 infer.py --image demo_image/demo_1.png --model  https://github.com/roatienza/deep-text-recognition-benchmark/releases/download/v0.1.0/vitstr_small_patch16_quant.pt --quantized --rpi\n```\n\n**Inference Time on GPU using JIT**\n```\npython3 infer.py --model https://github.com/roatienza/deep-text-recognition-benchmark/releases/download/v0.1.0/vitstr_small_patch16_jit.pt --time --gpu\n```\n```\nAverage inference time per image: 2.57e-03 sec (Quadro RTX 6000)\nAverage inference time per image: 4.53e-03 sec (V100)\n```\n\n**Inference Time on CPU using JIT**\n```\npython3 infer.py --model https://github.com/roatienza/deep-text-recognition-benchmark/releases/download/v0.1.0/vitstr_small_patch16_jit.pt --time\n```\n```\nAverage inference time per image: 2.80e-02 sec (AMD Ryzen Threadripper 
3970X 32-Core)\nAverage inference time per image: 2.70e-02 sec (Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz)\n```\n\n**Inference Time on RPi 4**\n```\npython3 infer.py --model https://github.com/roatienza/deep-text-recognition-benchmark/releases/download/v0.1.0/vitstr_small_patch16_quant.pt  --time --rpi --quantized\n```\n```\nAverage inference time per image: 3.69e-01 sec (Quantized)\n```\n```\npython3 infer.py --model https://github.com/roatienza/deep-text-recognition-benchmark/releases/download/v0.1.0/vitstr_small_patch16_jit.pt  --time --rpi\n```\n```\nAverage inference time per image: 4.64e-01 sec (JIT)\n```\n\n#### Sample Results:\n| Input Image | Output Prediction  |\n| :---: | :---: |\n| ![demo_1](demo_image/demo_1.png) | `Available` |\n| ![demo_2](demo_image/demo_2.jpg) | `SHAKESHACK` |\n| ![demo_3](demo_image/demo_3.png) | `Londen` |\n| ![demo_4](demo_image/demo_4.png) | `Greenstead` |\n\n### Dataset\n\nDownload the LMDB dataset from the [CLOVA AI Deep Text Recognition Benchmark](https://github.com/clovaai/deep-text-recognition-benchmark).\n\n### Quick validation using a pre-trained model\n\nViTSTR-Small\n\n```\nCUDA_VISIBLE_DEVICES=0 python3 test.py --eval_data data_lmdb_release/evaluation \\\n--benchmark_all_eval --Transformation None --FeatureExtraction None \\\n--SequenceModeling None --Prediction None --Transformer \\\n--sensitive --data_filtering_off  --imgH 224 --imgW 224 \\\n--TransformerModel=vitstr_small_patch16_224 \\\n--saved_model https://github.com/roatienza/deep-text-recognition-benchmark/releases/download/v0.1.0/vitstr_small_patch16_224_aug.pth\n```\n\nAvailable model weights:\n\n| Tiny | Small  | Base |\n| :---: | :---: | :---: |\n| `vitstr_tiny_patch16_224` | `vitstr_small_patch16_224` | 
`vitstr_base_patch16_224`|\n|[ViTSTR-Tiny](https://github.com/roatienza/deep-text-recognition-benchmark/releases/download/v0.1.0/vitstr_tiny_patch16_224.pth)|[ViTSTR-Small](https://github.com/roatienza/deep-text-recognition-benchmark/releases/download/v0.1.0/vitstr_small_patch16_224.pth)|[ViTSTR-Base](https://github.com/roatienza/deep-text-recognition-benchmark/releases/download/v0.1.0/vitstr_base_patch16_224.pth)|\n|[ViTSTR-Tiny+Aug](https://github.com/roatienza/deep-text-recognition-benchmark/releases/download/v0.1.0/vitstr_tiny_patch16_224_aug.pth)|[ViTSTR-Small+Aug](https://github.com/roatienza/deep-text-recognition-benchmark/releases/download/v0.1.0/vitstr_small_patch16_224_aug.pth)|[ViTSTR-Base+Aug](https://github.com/roatienza/deep-text-recognition-benchmark/releases/download/v0.1.0/vitstr_base_patch16_224_aug.pth)|\n\n\n### Benchmarks (Top-1 accuracy, %)\n\n| Model | IIIT | SVT | IC03 | IC03 | IC13 | IC13 | IC15 | IC15 | SVTP | CT | Acc | Std |\n| :--- | :---: | :---: | :---: | :---: | :--: | :--: | :---: | :---: | :---: | :---: | :---: | :--: |\n|  | 3000 | 647 | 860 | 867 | 857 | 1015 | 1811 | 2077 | 645 | 288 | % | % |\n| TRBA (Baseline) | 87.7 | 87.4 | 94.5 | 94.2 | 93.4 | 92.1 | 77.3 | 71.6 | 78.1 | 75.5 | 84.3 | 0.1 |\n| ViTSTR-Tiny | 83.7 | 83.2 | 92.8 | 92.5 | 90.8 | 89.3 | 72.0 | 66.4 | 74.5 | 65.0 | 80.3 | 0.2 |\n| ViTSTR-Tiny+Aug | 85.1 | 85.0 | 93.4 | 93.2 | 90.9 | 89.7 | 74.7 | 68.9 | 78.3 | 74.2 | 82.1 | 0.1 |\n| ViTSTR-Small | 85.6 | 85.3 | 93.9 | 93.6 | 91.7 | 90.6 | 75.3 | 69.5 | 78.1 | 71.3 | 82.6 | 0.3 |\n| ViTSTR-Small+Aug | 86.6 | 87.3 | 94.2 | 94.2 | 92.1 | 91.2 | 77.9 | 71.7 | 81.4 | 77.9 | 84.2 | 0.1 |\n| ViTSTR-Base | 86.9 | 87.2 | 93.8 | 93.4 | 92.1 | 91.3 | 76.8 | 71.1 | 80.0 | 74.7 | 83.7 | 0.1 |\n| ViTSTR-Base+Aug | 88.4 | 87.7 | 94.7 | 94.3 | 93.2 | 92.4 | 78.5 | 72.6 | 81.8 | 81.3 | 85.2 | 0.1 |\n\n\n### Comparison with other STR models\n\n#### Accuracy vs Number of Parameters\n\n![Acc vs 
Parameters](https://github.com/roatienza/deep-text-recognition-benchmark/blob/master/scripts/paper/Accuracy_vs_Number_of_Parameters.png)\n\n#### Accuracy vs Speed (2080Ti GPU)\n![Acc vs Speed](https://github.com/roatienza/deep-text-recognition-benchmark/blob/master/scripts/paper/Accuracy_vs_Msec_per_Image.png)\n\n#### Accuracy vs FLOPS\n![Acc vs FLOPS](https://github.com/roatienza/deep-text-recognition-benchmark/blob/master/scripts/paper/Accuracy_vs_GFLOPS.png)\n\n### Train\n\nViTSTR-Tiny without data augmentation\n\n```\nRANDOM=$$\n\nCUDA_VISIBLE_DEVICES=0 python3 train.py --train_data data_lmdb_release/training \\\n--valid_data data_lmdb_release/evaluation --select_data MJ-ST \\\n--batch_ratio 0.5-0.5 --Transformation None --FeatureExtraction None \\\n--SequenceModeling None --Prediction None --Transformer \\\n--TransformerModel=vitstr_tiny_patch16_224 --imgH 224 --imgW 224 \\\n--manualSeed=$RANDOM  --sensitive\n```\n\n### Multi-GPU training\n\nViTSTR-Small on a 4-GPU machine\n\nIt is recommended to train larger networks like ViTSTR-Small and ViTSTR-Base on a multi-GPU machine. To keep the total batch size fixed at `192`, set `--batch_size` to `192` divided by the number of GPUs. For example, to train ViTSTR-Small on a 4-GPU machine, this would be `--batch_size=48`.\n\n```\npython3 train.py --train_data data_lmdb_release/training \\\n--valid_data data_lmdb_release/evaluation --select_data MJ-ST \\\n--batch_ratio 0.5-0.5 --Transformation None --FeatureExtraction None \\\n--SequenceModeling None --Prediction None --Transformer \\\n--TransformerModel=vitstr_small_patch16_224 --imgH 224 --imgW 224 \\\n--manualSeed=$RANDOM --sensitive --batch_size=48\n```\n\n### Data augmentation\n\nViTSTR-Tiny using RandAugment\n\nIt is recommended to use more workers (e.g., `32` instead of the default `4`) since the data augmentation process is CPU intensive. 
As a rule of thumb, set the number of workers to between 25% and 50% of the total number of CPU cores. For example, for a system with `64` CPU cores, the number of workers can be set to `32` to use 50% of all cores. For multi-GPU systems, the number of workers must be divided by the number of GPUs. For example, for `32` workers in a 4-GPU system, use `--workers=8`. For convenience, pass `--workers=-1` to use 50% of all cores. Lastly, using a cosine learning-rate scheduler instead of a constant learning rate improves the performance of the model during training.\n\nBelow is a sample configuration for a 4-GPU system using a batch size of `192`.\n\n```\npython3 train.py --train_data data_lmdb_release/training \\\n--valid_data data_lmdb_release/evaluation --select_data MJ-ST \\\n--batch_ratio 0.5-0.5 --Transformation None --FeatureExtraction None \\\n--SequenceModeling None --Prediction None --Transformer \\\n--TransformerModel=vitstr_tiny_patch16_224 --imgH 224 --imgW 224 \\\n--manualSeed=$RANDOM  --sensitive \\\n--batch_size=48 --isrand_aug --workers=-1 --scheduler\n```\n\n\n### Test\n\nViTSTR-Tiny. 
Find the path to the `best_accuracy.pth` checkpoint file (usually in the `saved_model` folder).\n\n```\nCUDA_VISIBLE_DEVICES=0 python3 test.py --eval_data data_lmdb_release/evaluation \\\n--benchmark_all_eval --Transformation None --FeatureExtraction None \\\n--SequenceModeling None --Prediction None --Transformer \\\n--TransformerModel=vitstr_tiny_patch16_224 \\\n--sensitive --data_filtering_off  --imgH 224 --imgW 224 \\\n--saved_model \u003cpath_to/best_accuracy.pth\u003e\n```\n\n\n## Citation\nIf you find this work useful, please cite:\n\n```\n@inproceedings{atienza2021vision,\n  title={Vision transformer for fast and efficient scene text recognition},\n  author={Atienza, Rowel},\n  booktitle={International Conference on Document Analysis and Recognition},\n  pages={319--334},\n  year={2021},\n  organization={Springer}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Froatienza%2Fdeep-text-recognition-benchmark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Froatienza%2Fdeep-text-recognition-benchmark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Froatienza%2Fdeep-text-recognition-benchmark/lists"}