{"id":18659673,"url":"https://github.com/baudm/parseq","last_synced_at":"2025-04-04T11:16:43.876Z","repository":{"id":45700638,"uuid":"431325804","full_name":"baudm/parseq","owner":"baudm","description":"Scene Text Recognition with Permuted Autoregressive Sequence Models (ECCV 2022)","archived":false,"fork":false,"pushed_at":"2024-05-29T02:02:27.000Z","size":1333,"stargazers_count":630,"open_issues_count":46,"forks_count":132,"subscribers_count":13,"default_branch":"main","last_synced_at":"2025-03-28T10:10:05.561Z","etag":null,"topics":["computer-vision","eccv","eccv2022","ocr","optical-character-recognition","scene-text-recognition","text-recognition","vision-transformer"],"latest_commit_sha":null,"homepage":"https://huggingface.co/spaces/baudm/PARSeq-OCR","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/baudm.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-11-24T02:48:36.000Z","updated_at":"2025-03-26T13:37:46.000Z","dependencies_parsed_at":"2023-12-09T18:25:05.428Z","dependency_job_id":"37e7abbf-c6d2-49bf-a48d-2fddfbcc3984","html_url":"https://github.com/baudm/parseq","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baudm%2Fparseq","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baudm%2Fparseq/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baudm%2Fparseq/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/repositories/baudm%2Fparseq/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/baudm","download_url":"https://codeload.github.com/baudm/parseq/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247166169,"owners_count":20894654,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computer-vision","eccv","eccv2022","ocr","optical-character-recognition","scene-text-recognition","text-recognition","vision-transformer"],"created_at":"2024-11-07T07:37:33.442Z","updated_at":"2025-04-04T11:16:43.847Z","avatar_url":"https://github.com/baudm.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## News\n- **2024-02-22**: Updated for PyTorch 2.0 and Lightning 2.0\n- **2024-01-16**: Featured in the [NVIDIA Developer Blog](https://developer.nvidia.com/blog/robust-scene-text-detection-and-recognition-introduction/)\n- **2023-11-18**: [Interview with Deci AI at ECCV 2022](https://deeplearningdaily.substack.com/p/exclusive-interview-with-a-researcher) published\n- **2023-09-07**: [Added](https://github.com/PaddlePaddle/PaddleOCR/blob/main/doc/doc_en/algorithm_rec_parseq_en.md) to [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR), one of the most popular multilingual OCR toolkits\n- **2023-06-15**: [Added](https://mindee.github.io/doctr/modules/models.html#doctr.models.recognition.parseq) to [docTR](https://github.com/mindee/doctr), a deep learning-based library for OCR\n- **2022-07-14**: Initial public release (ranked #1 overall for STR on [Papers With 
Code](https://paperswithcode.com/paper/scene-text-recognition-with-permuted) at the time of release)\n- **2022-07-04**: Accepted at ECCV 2022\n\n\u003cdiv align=\"center\"\u003e\n\n# Scene Text Recognition with\u003cbr/\u003ePermuted Autoregressive Sequence Models\n[![Apache License 2.0](https://img.shields.io/github/license/baudm/parseq)](https://github.com/baudm/parseq/blob/main/LICENSE)\n[![arXiv preprint](http://img.shields.io/badge/arXiv-2207.06966-b31b1b)](https://arxiv.org/abs/2207.06966)\n[![In Proc. ECCV 2022](http://img.shields.io/badge/ECCV-2022-6790ac)](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/556_ECCV_2022_paper.php)\n[![Gradio demo](https://img.shields.io/badge/%F0%9F%A4%97%20demo-Gradio-ff7c00)](https://huggingface.co/spaces/baudm/PARSeq-OCR)\n\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scene-text-recognition-with-permuted/scene-text-recognition-on-coco-text)](https://paperswithcode.com/sota/scene-text-recognition-on-coco-text?p=scene-text-recognition-with-permuted)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scene-text-recognition-with-permuted/scene-text-recognition-on-ic19-art)](https://paperswithcode.com/sota/scene-text-recognition-on-ic19-art?p=scene-text-recognition-with-permuted)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scene-text-recognition-with-permuted/scene-text-recognition-on-icdar2013)](https://paperswithcode.com/sota/scene-text-recognition-on-icdar2013?p=scene-text-recognition-with-permuted)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scene-text-recognition-with-permuted/scene-text-recognition-on-iiit5k)](https://paperswithcode.com/sota/scene-text-recognition-on-iiit5k?p=scene-text-recognition-with-permuted)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scene-text-recognition-with-permuted/scene-text-recognition-on-cute80)](https://pa
perswithcode.com/sota/scene-text-recognition-on-cute80?p=scene-text-recognition-with-permuted)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scene-text-recognition-with-permuted/scene-text-recognition-on-icdar2015)](https://paperswithcode.com/sota/scene-text-recognition-on-icdar2015?p=scene-text-recognition-with-permuted)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scene-text-recognition-with-permuted/scene-text-recognition-on-svt)](https://paperswithcode.com/sota/scene-text-recognition-on-svt?p=scene-text-recognition-with-permuted)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scene-text-recognition-with-permuted/scene-text-recognition-on-svtp)](https://paperswithcode.com/sota/scene-text-recognition-on-svtp?p=scene-text-recognition-with-permuted)\n\n[**Darwin Bautista**](https://github.com/baudm) and [**Rowel Atienza**](https://github.com/roatienza)\n\nElectrical and Electronics Engineering Institute\u003cbr/\u003e\nUniversity of the Philippines, Diliman\n\n[Method](#method-tldr) | [Sample Results](#sample-results) | [Getting Started](#getting-started) | [FAQ](#frequently-asked-questions) | [Training](#training) | [Evaluation](#evaluation) | [Citation](#citation)\n\n\u003c/div\u003e\n\nScene Text Recognition (STR) models use language context to be more robust against noisy or corrupted images. Recent approaches like ABINet use a standalone or external Language Model (LM) for prediction refinement. In this work, we show that the external LM\u0026mdash;which requires upfront allocation of dedicated compute capacity\u0026mdash;is inefficient for STR due to its poor performance vs cost characteristics. We propose a more efficient approach using **p**ermuted **a**uto**r**egressive **seq**uence (PARSeq) models. 
View our ECCV [poster](https://drive.google.com/file/d/19luOT_RMqmafLMhKQQHBnHNXV7fOCRfw/view) and [presentation](https://drive.google.com/file/d/11VoZW4QC5tbMwVIjKB44447uTiuCJAAD/view) for a brief overview.\n\n![PARSeq](.github/gh-teaser.png)\n\n**NOTE:** _P-S and P-Ti are shorthands for PARSeq-S and PARSeq-Ti, respectively._\n\n### Method tl;dr\n\nOur main insight is that with an ensemble of autoregressive (AR) models, we could unify the current STR decoding methods (context-aware AR and context-free non-AR) and the bidirectional (cloze) refinement model:\n\u003cdiv align=\"center\"\u003e\u003cimg src=\".github/contexts-example.png\" alt=\"Unified STR model\" width=\"75%\"/\u003e\u003c/div\u003e\n\nA single Transformer can realize different models by merely varying its attention mask. With the correct decoder parameterization, it can be trained with Permutation Language Modeling to enable inference for arbitrary output positions given arbitrary subsets of the input context. This *arbitrary decoding* characteristic results in a _unified_ STR model\u0026mdash;PARSeq\u0026mdash;capable of context-free and context-aware inference, as well as iterative prediction refinement using bidirectional context **without** requiring a standalone language model. PARSeq can be considered an ensemble of AR models with shared architecture and weights:\n\n![System](.github/system.png)\n**NOTE:** _LayerNorm and Dropout layers are omitted. `[B]`, `[E]`, and `[P]` stand for beginning-of-sequence (BOS), end-of-sequence (EOS), and padding tokens, respectively. `T` = 25 results in 26 distinct position tokens. The position tokens both serve as query vectors and position embeddings for the input context. For `[B]`, no position embedding is added. Attention\nmasks are generated from the given permutations and are used only for the context-position attention. 
L\u003csub\u003ece\u003c/sub\u003e pertains to the cross-entropy loss._\n\n### Sample Results\n\u003cdiv align=\"center\"\u003e\n\n| Input Image                                                                | PARSeq-S\u003csub\u003eA\u003c/sub\u003e | ABINet            | TRBA              | ViTSTR-S          | CRNN              |\n|:--------------------------------------------------------------------------:|:--------------------:|:-----------------:|:-----------------:|:-----------------:|:-----------------:|\n| \u003cimg src=\"demo_images/art-01107.jpg\" alt=\"CHEWBACCA\" width=\"128\"/\u003e         | CHEWBACCA            | CHEWBA**GG**A     | CHEWBACCA         | CHEWBACCA         | CHEW**U**ACCA     |\n| \u003cimg src=\"demo_images/coco-1166773.jpg\" alt=\"Chevron\" width=\"128\"/\u003e        | Chevro**l**          | Chevro\\_          | Chevro\\_          | Chevr\\_\\_         | Chevr\\_\\_         |\n| \u003cimg src=\"demo_images/cute-184.jpg\" alt=\"SALMON\" height=\"128\"/\u003e            | SALMON               | SALMON            | SALMON            | SALMON            | SA\\_MON           |\n| \u003cimg src=\"demo_images/ic13_word_256.png\" alt=\"Verbandstoffe\" width=\"128\"/\u003e | Verbandst**e**ffe    | Verbandst**e**ffe | Verbandst**ell**e | Verbandst**e**ffe | Verbands**le**ffe |\n| \u003cimg src=\"demo_images/ic15_word_26.png\" alt=\"Kappa\" width=\"128\"/\u003e          | Kappa                | Kappa             | Ka**s**pa         | Kappa             | Ka**ad**a         |\n| \u003cimg src=\"demo_images/uber-27491.jpg\" alt=\"3rdAve\" height=\"128\"/\u003e          | 3rdAve               | 3=-Ave            | 3rdAve            | 3rdAve            | **Coke**          |\n\n**NOTE:** _Bold letters and underscores indicate wrong and missing character predictions, respectively._\n\u003c/div\u003e\n\n## Getting Started\nThis repository contains the reference implementation for PARSeq and reproduced models (collectively referred to as _Scene Text 
Recognition Model Hub_). See `NOTICE` for copyright information.\nThe majority of the code is licensed under the Apache License v2.0 (see `LICENSE`), while the ABINet and CRNN sources are\nreleased under the BSD and MIT licenses, respectively (see the corresponding `LICENSE` files for details).\n\n### Demo\nAn [interactive Gradio demo](https://huggingface.co/spaces/baudm/PARSeq-OCR) hosted at Hugging Face is available. The pretrained weights released here are used for the demo.\n\n### Installation\nRequires Python \u003e= 3.9 and PyTorch \u003e= 2.0. The default requirements files will install the latest versions of the dependencies (as of February 22, 2024).\n```bash\n# Use specific platform build. Other PyTorch 2.0 options: cu118, cu121, rocm5.7\nplatform=cpu\n# Generate requirements files for specified PyTorch platform\nmake torch-${platform}\n# Install the project and core + train + test dependencies. Subsets: [dev,train,test,bench,tune]\npip install -r requirements/core.${platform}.txt -e .[train,test]\n```\n#### Updating dependency version pins\n```bash\npip install pip-tools\nmake clean-reqs reqs  # Regenerate all the requirements files\n```\n### Datasets\nDownload the [datasets](Datasets.md) from the following links:\n1. [LMDB archives](https://drive.google.com/drive/folders/1NYuoi7dfJVgo-zUJogh8UQZgIMpLviOE) for MJSynth, SynthText, IIIT5k, SVT, SVTP, IC13, IC15, CUTE80, ArT, RCTW17, ReCTS, LSVT, MLT19, COCO-Text, and Uber-Text.\n2. 
[LMDB archives](https://drive.google.com/drive/folders/1D9z_YJVa6f-O0juni-yG5jcwnhvYw-qC) for TextOCR and OpenVINO.\n\n### Pretrained Models via Torch Hub\nAvailable models are: `abinet`, `crnn`, `trba`, `vitstr`, `parseq_tiny`, `parseq_patch16_224`, and `parseq`.\n```python\nimport torch\nfrom PIL import Image\nfrom strhub.data.module import SceneTextDataModule\n\n# Load model and image transforms\nparseq = torch.hub.load('baudm/parseq', 'parseq', pretrained=True).eval()\nimg_transform = SceneTextDataModule.get_transform(parseq.hparams.img_size)\n\nimg = Image.open('/path/to/image.png').convert('RGB')\n# Preprocess. Model expects a batch of images with shape: (B, C, H, W)\nimg = img_transform(img).unsqueeze(0)\n\nlogits = parseq(img)\nlogits.shape  # torch.Size([1, 26, 95]), 94 characters + [EOS] symbol\n\n# Greedy decoding\npred = logits.softmax(-1)\nlabel, confidence = parseq.tokenizer.decode(pred)\nprint('Decoded label = {}'.format(label[0]))\n```\n\n## Frequently Asked Questions\n- How do I train on a new language? See Issues [#5](https://github.com/baudm/parseq/issues/5) and [#9](https://github.com/baudm/parseq/issues/9).\n- Can you export to TorchScript or ONNX? Yes, see Issue [#12](https://github.com/baudm/parseq/issues/12#issuecomment-1267842315).\n- How do I test on my own dataset? See Issue [#27](https://github.com/baudm/parseq/issues/27).\n- How do I finetune and/or create a custom dataset? See Issue [#7](https://github.com/baudm/parseq/issues/7).\n- What is `val_NED`? See Issue [#10](https://github.com/baudm/parseq/issues/10).\n\n## Training\nThe training script can train any supported model. You can override any configuration using the command line. Please refer to [Hydra](https://hydra.cc) docs for more info about the syntax. 
Use `./train.py --help` to see the default configuration.\n\n\u003cdetails\u003e\u003csummary\u003eSample commands for different training configurations\u003c/summary\u003e\u003cp\u003e\n\n### Finetune using pretrained weights\n```bash\n./train.py +experiment=parseq-tiny pretrained=parseq-tiny  # Not all experiments have pretrained weights\n```\n\n### Train a model variant/preconfigured experiment\nThe base model configurations are in `configs/model/`, while variations are stored in `configs/experiment/`.\n```bash\n./train.py +experiment=parseq-tiny  # Some examples: abinet-sv, trbc\n```\n\n### Specify the character set for training\n```bash\n./train.py charset=94_full  # Other options: 36_lowercase or 62_mixed-case. See configs/charset/\n```\n\n### Specify the training dataset\n```bash\n./train.py dataset=real  # Other option: synth. See configs/dataset/\n```\n\n### Change general model training parameters\n```bash\n./train.py model.img_size='[32,128]' model.max_label_length=25 model.batch_size=384  # Quote the list so the shell passes it as a single argument\n```\n\n### Change data-related training parameters\n```bash\n./train.py data.root_dir=data data.num_workers=2 data.augment=true\n```\n\n### Change `pytorch_lightning.Trainer` parameters\n```bash\n./train.py trainer.max_epochs=20 trainer.accelerator=gpu trainer.devices=2\n```\nNote that you can pass any [Trainer parameter](https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html);\nyou just need to prefix it with `+` if it is not originally specified in `configs/main.yaml`.\n\n### Resume training from checkpoint (experimental)\n```bash\n./train.py +experiment=\u003cmodel_exp\u003e ckpt_path=outputs/\u003cmodel\u003e/\u003ctimestamp\u003e/checkpoints/\u003ccheckpoint\u003e.ckpt\n```\n\n\u003c/p\u003e\u003c/details\u003e\n\n## Evaluation\nThe test script, ```test.py```, can be used to evaluate any model trained with this project. For more info, see ```./test.py --help```.\n\nPARSeq runtime parameters can be passed using the format `param:type=value`. 
For example, PARSeq NAR decoding can be invoked via `./test.py parseq.ckpt refine_iters:int=2 decode_ar:bool=false`.\n\n\u003cdetails\u003e\u003csummary\u003eSample commands for reproducing results\u003c/summary\u003e\u003cp\u003e\n\n### Lowercase alphanumeric comparison on benchmark datasets (Table 6)\n```bash\n./test.py outputs/\u003cmodel\u003e/\u003ctimestamp\u003e/checkpoints/last.ckpt  # or use the released weights: ./test.py pretrained=parseq\n```\n**Sample output:**\n| Dataset   | # samples | Accuracy | 1 - NED | Confidence | Label Length |\n|:---------:|----------:|---------:|--------:|-----------:|-------------:|\n| IIIT5k    |      3000 |    99.00 |   99.79 |      97.09 |         5.09 |\n| SVT       |       647 |    97.84 |   99.54 |      95.87 |         5.86 |\n| IC13_1015 |      1015 |    98.13 |   99.43 |      97.19 |         5.31 |\n| IC15_2077 |      2077 |    89.22 |   96.43 |      91.91 |         5.33 |\n| SVTP      |       645 |    96.90 |   99.36 |      94.37 |         5.86 |\n| CUTE80    |       288 |    98.61 |   99.80 |      96.43 |         5.53 |\n| **Combined** | **7672** | **95.95** | **98.78** | **95.34** | **5.33** |\n--------------------------------------------------------------------------\n\n### Benchmark using different evaluation character sets (Table 4)\n```bash\n./test.py outputs/\u003cmodel\u003e/\u003ctimestamp\u003e/checkpoints/last.ckpt  # lowercase alphanumeric (36-character set)\n./test.py outputs/\u003cmodel\u003e/\u003ctimestamp\u003e/checkpoints/last.ckpt --cased  # mixed-case alphanumeric (62-character set)\n./test.py outputs/\u003cmodel\u003e/\u003ctimestamp\u003e/checkpoints/last.ckpt --cased --punctuation  # mixed-case alphanumeric + punctuation (94-character set)\n```\n\n### Lowercase alphanumeric comparison on more challenging datasets (Table 5)\n```bash\n./test.py outputs/\u003cmodel\u003e/\u003ctimestamp\u003e/checkpoints/last.ckpt --new\n```\n\n### Benchmark Model Compute Requirements (Figure 
5)\n```bash\n./bench.py model=parseq model.decode_ar=false model.refine_iters=3\n\u003ctorch.utils.benchmark.utils.common.Measurement object at 0x7f8fcae67ee0\u003e\nmodel(x)\n  Median: 14.87 ms\n  IQR:    0.33 ms (14.78 to 15.12)\n  7 measurements, 10 runs per measurement, 1 thread\n| module                | #parameters   | #flops   | #activations   |\n|:----------------------|:--------------|:---------|:---------------|\n| model                 | 23.833M       | 3.255G   | 8.214M         |\n|  encoder              |  21.381M      |  2.88G   |  7.127M        |\n|  decoder              |  2.368M       |  0.371G  |  1.078M        |\n|  head                 |  36.575K      |  3.794M  |  9.88K         |\n|  text_embed.embedding |  37.248K      |  0       |  0             |\n```\n\n### Latency Measurements vs Output Label Length (Appendix I)\n```bash\n./bench.py model=parseq model.decode_ar=false model.refine_iters=3 +range=true\n```\n\n### Orientation robustness benchmark (Appendix J)\n```bash\n./test.py outputs/\u003cmodel\u003e/\u003ctimestamp\u003e/checkpoints/last.ckpt --cased --punctuation  # no rotation\n./test.py outputs/\u003cmodel\u003e/\u003ctimestamp\u003e/checkpoints/last.ckpt --cased --punctuation --rotation 90\n./test.py outputs/\u003cmodel\u003e/\u003ctimestamp\u003e/checkpoints/last.ckpt --cased --punctuation --rotation 180\n./test.py outputs/\u003cmodel\u003e/\u003ctimestamp\u003e/checkpoints/last.ckpt --cased --punctuation --rotation 270\n```\n\n### Using trained models to read text from images (Appendix L)\n```bash\n./read.py outputs/\u003cmodel\u003e/\u003ctimestamp\u003e/checkpoints/last.ckpt --images demo_images/*  # Or use ./read.py pretrained=parseq\nAdditional keyword arguments: {}\ndemo_images/art-01107.jpg: CHEWBACCA\ndemo_images/coco-1166773.jpg: Chevrol\ndemo_images/cute-184.jpg: SALMON\ndemo_images/ic13_word_256.png: Verbandsteffe\ndemo_images/ic15_word_26.png: Kaopa\ndemo_images/uber-27491.jpg: 3rdAve\n\n# use NAR decoding + 2 refinement 
iterations for PARSeq\n./read.py pretrained=parseq refine_iters:int=2 decode_ar:bool=false --images demo_images/*\n```\n\u003c/p\u003e\u003c/details\u003e\n\n## Tuning\n\nWe use [Ray Tune](https://www.ray.io/ray-tune) for automated parameter tuning of the learning rate. See `./tune.py --help`. Extend `tune.py` to support tuning of other hyperparameters.\n```bash\n./tune.py tune.num_samples=20  # find optimum LR for PARSeq's default config using 20 trials\n./tune.py +experiment=tune_abinet-lm  # find the optimum learning rate for ABINet's language model\n```\n\n## Citation\n```bibtex\n@InProceedings{bautista2022parseq,\n  title={Scene Text Recognition with Permuted Autoregressive Sequence Models},\n  author={Bautista, Darwin and Atienza, Rowel},\n  booktitle={European Conference on Computer Vision},\n  pages={178--196},\n  month={10},\n  year={2022},\n  publisher={Springer Nature Switzerland},\n  address={Cham},\n  doi={10.1007/978-3-031-19815-1_11},\n  url={https://doi.org/10.1007/978-3-031-19815-1_11}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbaudm%2Fparseq","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbaudm%2Fparseq","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbaudm%2Fparseq/lists"}