https://github.com/HCIILAB/Scene-Text-Recognition
- Host: GitHub
- URL: https://github.com/HCIILAB/Scene-Text-Recognition
- Owner: HCIILAB
- Created: 2019-05-15T08:38:44.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2021-12-17T09:05:40.000Z (over 4 years ago)
- Last Synced: 2024-08-01T04:02:07.580Z (almost 2 years ago)
- Size: 1.05 MB
- Stars: 603
- Watchers: 32
- Forks: 117
- Open Issues: 2
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesomeai - Scene Text Recognition Resources
- awesome-ai-awesomeness - Scene Text Recognition Resources
- awesome-deep-text-detection-recognition - HCIILAB-Recognition
README
# Scene Text Recognition Resources
[Author: 陈晓雪](https://github.com/CCchenxiaoxue)
The paper "Text Recognition in the Wild: A Survey" (accepted to appear in ACM Computing Surveys) is now available on [arXiv](https://arxiv.org/pdf/2005.03492v3.pdf).
# ❗❗ Newest Version Can be Found Here ❗❗
## This repository is no longer maintained; please refer to our newest one
## [Scene Text Recognition Recommendations](https://github.com/HCIILAB/Scene-Text-Recognition-Recommendations)
## Updates
Dec 24, 2019: add 20 papers and update corresponding tables.
Feb 29, 2020: add AAAI-2020 papers and update corresponding tables.
May 8, 2020: add CVPR-2020 papers and update corresponding tables.
Dec 8, 2020: add 11 papers and update corresponding tables. You can download the new [Excel](https://pan.baidu.com/s/1xitxu7R5hw27pVV7eJ1c7w) prepared by us. (Password: sj2t)
***
- [1. Datasets](#1-datasets)
- [1.1 Regular Latin Datasets](#11-regular-latin-datasets)
- [1.2 Irregular Latin Datasets](#12-irregular-latin-datasets)
- [1.3 Multilingual Datasets](#13-multilingual-datasets)
- [1.4 Synthetic Datasets](#14-synthetic-datasets)
- [1.5 Comparison of the Benchmark Datasets](#15-comparison-of-the-benchmark-datasets)
- [2. Performance Comparison of Recognition Algorithms](#2-performance-comparison-of-recognition-algorithms)
- [2.1 Characteristics Comparison of Recognition Approaches](#21-characteristics-comparison-of-recognition-approaches)
- [2.2 Performance Comparison on Benchmark Datasets](#22-performance-comparison-on-benchmark-datasets)
- [2.2.1 Performance Comparison of Recognition Algorithms on Regular Latin Datasets](#221-performance-comparison-of-recognition-algorithms-on-regular-latin-datasets)
- [2.2.2 Performance Comparison of Recognition Algorithms on Irregular Latin Datasets](#222-performance-comparison-of-recognition-algorithms-on-irregular-latin-datasets)
- [3. Survey](#3-survey)
- [4. OCR Service](#4-ocr-service)
- [5. References](#5-references)
- [6.Help](#6help)
- [7.Copyright](#7copyright)
## 1. Datasets
### 1.1 Regular Latin Datasets
- IIIT5K[31]:
* **Introduction:** The IIIT5K dataset [31] contains 5,000 text instance images: 2,000 for training and 3,000 for testing. It contains words from street scenes and from originally-digital images. Every image is associated with a 50-word lexicon and a 1,000-word lexicon. Specifically, the lexicon consists of a ground-truth word and some randomly picked words.
* **Link:** [IIIT5K-download](http://cvit.iiit.ac.in/research/projects/cvit-projects/the-iiit-5k-word-dataset)
- SVT[1]:
* **Introduction:** The SVT dataset [1] contains 350 images: 100 for training and 250 for testing. Some images are severely corrupted by noise, blur, and low resolution. Each image is associated with a 50-word lexicon.
* **Link:** [SVT-download](http://vision.ucsd.edu/~kai/svt/)
- ICDAR 2003(IC03)[33]:
* **Introduction:** The IC03 dataset [33] contains 509 images: 258 for training and 251 for testing. Specifically, it contains 867 cropped text instances after discarding images that contain non-alphanumeric characters or fewer than three characters. Every image is associated with a 50-word lexicon and a full-word lexicon. Moreover, the full lexicon combines all lexicon words.
* **Link:** [IC03-download](http://www.iapr-tc11.org/mediawiki/index.php?title=ICDAR_2003_Robust_Reading_Competitions)
- ICDAR 2013(IC13)[34]:
* **Introduction:** The IC13 dataset [34] contains 561 images: 420 for training and 141 for testing. It inherits data from the IC03 dataset and extends it with new images. Similar to the IC03 dataset, the IC13 dataset contains 1,015 cropped text instance images after removing the words with non-alphanumeric characters. No lexicon is associated with IC13. Notably, 215 duplicate text instance images [65] exist between the IC03 training dataset and the IC13 testing dataset. Therefore, care should be taken regarding the overlapping data when evaluating a model on the IC13 testing data.
* **Link:** [IC13-download](http://dagdata.cvc.uab.es/icdar2013competition/?ch=2&com=downloads)
- SVHN[45]:
* **Introduction:** The SVHN [45] dataset contains more than 600,000 digits of house numbers in natural scenes. It is obtained from a large number of street view images using a combination of automated algorithms and the Amazon Mechanical Turk (AMT) framework. The SVHN dataset is typically used for scene digit recognition.
* **Link:** [SVHN-download](http://ufldl.stanford.edu/housenumbers/)
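
The lexicon settings described above (50-word, 1,000-word, full) are commonly used to constrain a recognizer's raw output to the closest lexicon entry. The following is a minimal Python sketch of that idea, assuming edit distance as the matching criterion and case-insensitive comparison. The function names `edit_distance` and `constrain_to_lexicon` and the sample lexicon are illustrative only and do not come from any dataset toolkit.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution
        prev = curr
    return prev[-1]


def constrain_to_lexicon(prediction: str, lexicon: list[str]) -> str:
    """Return the lexicon word closest to the raw prediction."""
    return min(lexicon,
               key=lambda w: edit_distance(prediction.lower(), w.lower()))


# Hypothetical example: a noisy prediction and a small lexicon
lexicon = ["street", "store", "stereo", "market"]
print(constrain_to_lexicon("stre3t", lexicon))  # prints "street"
```

With no lexicon (the "None" setting), the raw prediction is scored directly, which is why "None" results are typically lower than lexicon-constrained ones.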
### 1.2 Irregular Latin Datasets
- SVT-P[35]:
- **Introduction:** The SVT-P [35] dataset contains 238 images with 639 cropped text instances. It is specifically designed to evaluate perspective distorted text recognition. It is built based on the original SVT dataset by selecting the images at the same address on Google Street View but with different view angles. Therefore, most text instances are heavily distorted by the non-frontal view angle. Moreover, each image is associated with a 50-word lexicon and a full-word lexicon.
- **Link:** [SVT-P-download](https://pan.baidu.com/s/1rhYUn1mIo8OZQEGUZ9Nmrg) (Password: vnis)
- CUTE80[36]:
- **Introduction:** The CUTE80 dataset [36] contains 80 high-resolution images with 288 cropped text instances. It focuses on curved text recognition. Most images in CUTE80 have a complex background, perspective distortion, and poor resolution. No lexicon is associated with CUTE80.
- **Link:** [CUTE80-download](http://cs-chan.com/downloads_CUTE80_dataset.html)
- ICDAR 2015(IC15)[37]:
- **Introduction:** The IC15 dataset [37] contains 1,500 images: 1,000 for training and 500 for testing. Specifically, it contains 2,077 cropped text instances, including more than 200 irregular text samples. As the text images were taken with Google Glass without ensuring the image quality, most of the text is very small, blurred, and multi-oriented. No lexicon is provided.
- **Link:** [IC15-download](http://rrc.cvc.uab.es/?ch=4&com=downloads)
- COCO-Text[38]:
- **Introduction:** The COCO-Text dataset [38] contains 63,686 images with 145,859 cropped text instances. It is the first large-scale dataset for text in natural images and also the first dataset to annotate scene text with attributes such as legibility and type of text. However, no lexicon is associated with COCO-Text.
- **Link:** [COCO-Text-download](https://vision.cornell.edu/se3/coco-text-2/)
- Total-Text[39]:
- **Introduction:** The Total-Text dataset [39] contains 1,555 images with 11,459 cropped text instance images. It focuses on curved scene text recognition. Images in Total-Text have more than three different orientations, including horizontal, multi-oriented, and curved. No lexicon is associated with Total-Text.
- **Link:** [Total-Text-download](https://github.com/cs-chan/Total-Text-Dataset)
### 1.3 Multilingual Datasets
- RCTW-17(RCTW competition,ICDAR17)[40]:
- **Introduction:** The RCTW-17 dataset contains 12,514 images: 11,514 for training and 1,000 for testing. Most are natural images collected by cameras or mobile phones, whereas others are digital-born. Text instances are annotated with labels, fonts, languages, etc.
- **Link:** [RCTW-17-download](http://rctw.vlrlab.net/dataset/)
- MTWI(competition)[41]:
- **Introduction:** The MTWI dataset contains 20,000 images. This is the first dataset constructed from Chinese and Latin web text. Most images in MTWI have a relatively high resolution and cover diverse types of web text, including multi-oriented text, tightly-stacked text, and complex-shaped text.
- **Link:** [MTWI-download](https://pan.baidu.com/s/1SUODaOzV7YOPkrun0xSz6A#list/path=%2F) (Password: gox9)
- CTW[42]:
- **Introduction:** The CTW dataset includes 32,285 high-resolution street view images with 1,018,402 character instances. All images have character-level annotations: the underlying character, the bounding box, and six other attributes.
- **Link:** [CTW-download](https://ctwdataset.github.io/)
- SCUT-CTW1500[43]:
- **Introduction:** The SCUT-CTW1500 dataset contains 1,500 images: 1,000 for training and 500 for testing. In particular, it provides 10,751 cropped text instance images, including 3,530 with curved text. The images are manually harvested from the Internet, image libraries such as Google Open-Image, or phone cameras. The dataset contains a lot of horizontal and multi-oriented text.
- **Link:** [SCUT-CTW1500-download](https://github.com/Yuliang-Liu/Curve-Text-Detector)
* LSVT(LSVT competition, ICDAR2019)[57]:
* **Introduction:** The LSVT dataset contains 20,000 testing samples, 30,000 fully annotated training samples, and 400,000 training samples with weak annotations (i.e., with partial labels). All images are captured from streets and reflect a large variety of complicated real-world scenarios, e.g., store fronts and landmarks.
* **Link:** [LSVT-download](https://rrc.cvc.uab.es/?ch=16&com=downloads)
* ArT(ArT competition, ICDAR2019)[58]:
* **Introduction:** The ArT dataset [58] contains 10,166 images: 5,603 for training and 4,563 for testing. ArT is a combination of Total-Text, SCUT-CTW1500, and Baidu Curved Scene Text, which was collected to introduce the arbitrary-shaped text problem. Moreover, all existing text shapes (i.e., horizontal, multi-oriented, and curved) have multiple occurrences in the ArT dataset.
* **Link:** [ArT-download](https://rrc.cvc.uab.es/?ch=16&com=downloads)
* ReCTS-25k(ReCTS competition, ICDAR2019)[59]:
* **Introduction:** The ReCTS-25k dataset [59] contains 25,000 images: 20,000 for training and 5,000 for testing. All the images are from the Meituan-Dianping Group, collected by Meituan business merchants, using phone cameras under uncontrolled conditions. Specifically, the ReCTS-25k dataset mainly contains images of Chinese text on signboards.
* **Link:** [ReCTS-download](https://rrc.cvc.uab.es/?ch=16&com=downloads)
* MLT(MLT competition, ICDAR2019)[81]:
* **Introduction:** The MLT-2019 dataset [81] contains 20,000 images: 10,000 for training (1,000 per language) and 10,000 for testing. The dataset includes ten languages (Arabic, Bangla, Chinese, Devanagari, English, French, German, Italian, Japanese, and Korean), representing seven different scripts. The number of images per script is equal.
* **Link:** [MLT-download](https://rrc.cvc.uab.es/?ch=15&com=downloads)
### 1.4 Synthetic Datasets
* Synth90k[53]:
* **Introduction:** The Synth90k dataset contains 9 million synthetic text instance images from a set of 90k common English words. Words are rendered onto natural images with random transformations and effects, such as random fonts, colors, blur, and noise. The Synth90k dataset can emulate the distribution of scene text images and can be used instead of real-world data to train data-hungry deep learning algorithms. Besides, every image is annotated with a ground-truth word.
* **Link:** [Synth90k-download](http://www.robots.ox.ac.uk/~vgg/data/text/)
* SynthText [54] :
* **Introduction:** The SynthText dataset contains 800,000 images with 6 million synthetic text instances. As in the generation of the Synth90k dataset, the text sample is rendered using a randomly selected font and transformed according to the local surface orientation. Moreover, each image is annotated with a ground-truth word.
* **Link:** [SynthText-download](https://github.com/ankush-me/SynthText)
* Verisimilar Synthesis [73] :
* **Introduction:** The Verisimilar Synthesis dataset [73] contains 5 million synthetic text instance images. Given background images and source texts, a semantic map and a saliency map are first determined, which are then combined to identify semantically sensible and apt locations for text embedding. The color, brightness, and orientation of the source texts are further determined adaptively according to the color, brightness, and contextual structures around the embedding locations within the background image.
* **Link:** [Verisimilar Synthesis](https://github.com/fnzhan/Verisimilar-Image-Synthesis-for-Accurate-Detection-and-Recognition-of-Texts-in-Scenes)
* UnrealText [88]:
* **Introduction:** The UnrealText dataset [88] contains 600K synthetic images with 12 million cropped text instances. It is developed upon Unreal Engine 4 and the UnrealCV plugin [89]. Text instances are regarded as planar polygon meshes with text foregrounds loaded as texture. These meshes are placed in suitable positions in the 3D world and rendered together with the scene as a whole. The same font set from [Google Fonts](https://fonts.google.com/) and the same text corpus, i.e., Newsgroup20, are used as in SynthText.
* **Link:** [UnrealText](https://github.com/Jyouhou/UnrealText)
### 1.5 Comparison of the Benchmark Datasets
| Datasets | Language | Pictures | Training Pictures | Testing Pictures | Instances | Training Instances | Testing Instances | Lexicon: 50 | Lexicon: 1k | Lexicon: Full | Lexicon: None | Label: Char | Label: Word | Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IIIT5K[31] | English | 1120 | 380 | 740 | 5000 | 2000 | 3000 | √ | √ | × | √ | √ | √ | Regular |
| SVT[32] | English | 350 | 100 | 250 | 725 | 211 | 514 | √ | × | × | √ | × | √ | Regular |
| IC03[33] | English | 509 | 258 | 251 | 2268 | 1157 | 1111 | √ | √ | √ | √ | √ | √ | Regular |
| IC13[34] | English | 561 | 420 | 141 | 5003 | 3564 | 1439 | × | × | × | √ | √ | √ | Regular |
| SVHN[45] | Digits | 600000 | 573968 | 26032 | 600000 | 573968 | 26032 | × | × | × | √ | √ | √ | Regular |
| SVT-P[35] | English | 238 | 0 | 238 | 639 | 0 | 639 | √ | × | √ | √ | × | √ | Irregular |
| CUTE80[36] | English | 80 | 0 | 80 | 288 | 0 | 288 | × | × | × | √ | × | √ | Irregular |
| IC15[37] | English | 1500 | 1000 | 500 | 6545 | 4468 | 2077 | × | × | × | √ | × | √ | Irregular |
| COCO-Text[38] | English | 63686 | 43686 | 10000 | 145859 | 118309 | 27550 | × | × | × | √ | × | √ | Irregular |
| Total-Text[39] | English | 1555 | 1255 | 300 | 11459 | 11166 | 293 | × | × | × | √ | × | √ | Irregular |
| RCTW-17[40] | Chinese/English | 12514 | 11514 | 1000 | - | - | - | × | × | × | √ | × | √ | Regular |
| MTWI[41] | Chinese/English | 20000 | 10000 | 10000 | 290206 | 141476 | 148730 | × | × | × | √ | × | √ | Regular |
| CTW[42] | Chinese/English | 32285 | 25887 | 3269 | 1018402 | 812872 | 103519 | × | × | × | √ | √ | √ | Regular |
| SCUT-CTW1500[43] | Chinese/English | 1500 | 1000 | 500 | 10751 | 7683 | 3068 | × | × | × | √ | × | √ | Irregular |
| LSVT[57], [63] | Chinese/English | 450000 | 30000 | 20000 | - | - | - | × | × | × | √ | × | √ | Irregular |
| ArT[58] | Chinese/English | 10166 | 5603 | 4563 | 98455 | 50029 | 48426 | × | × | × | √ | × | √ | Irregular |
| ReCTS-25k[59] | Chinese/English | 25000 | 20000 | 5000 | 119713 | 108924 | 10789 | × | × | × | √ | √ | √ | Irregular |
| MLT[81] | Multilingual | 20000 | 10000 | 10000 | 191639 | 89177 | 102462 | × | × | × | √ | × | √ | Irregular |
| Synth90k[53] | English | ~9000000 | - | - | ~9000000 | - | - | × | × | × | √ | × | √ | Regular |
| SynthText[54] | English | ~6000000 | - | - | ~6000000 | - | - | × | × | × | √ | √ | √ | Regular |
| Verisimilar Synthesis[73] | English | - | - | - | ~5000000 | - | - | × | × | × | √ | × | √ | Regular |
| UnrealText[88] | English | ~600000 | - | - | ~12000000 | - | - | × | × | × | √ | √ | √ | Regular |
***
## 2. Performance Comparison of Recognition Algorithms
### 2.1 Characteristics Comparison of Recognition Approaches
Notes: 1) "Reg" stands for regular Latin datasets. 2) "Irreg" stands for irregular Latin datasets. 3) "Seg" denotes segmentation-based methods. 4) "Extra" indicates methods that use extra datasets other than Synth90k and SynthText. 5) "CTC" represents methods that apply a CTC-based algorithm to decode. 6) "Attn" represents methods that apply the attention mechanism to decode.
You can also download the new [Excel](https://pan.baidu.com/s/1xitxu7R5hw27pVV7eJ1c7w) prepared by us. (Password: sj2t)
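
To make the "CTC" category concrete: CTC-based methods emit one label (or a blank) per frame and obtain the final string by merging consecutive repeats and then dropping blanks. The following is a minimal Python sketch of greedy (best-path) decoding, assuming the per-frame argmax indices are already available. The blank index `0`, the function name `ctc_greedy_decode`, and the toy character set are assumptions made for illustration and are not taken from any of the listed methods.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_labels: list[int], charset: str) -> str:
    """Collapse repeated labels, then remove blanks (best-path decoding).

    frame_labels: argmax class index for each time step.
    charset: characters for indices 1..len(charset); index 0 is the blank.
    """
    # groupby merges runs of identical consecutive labels into one entry
    collapsed = [label for label, _ in groupby(frame_labels)]
    return "".join(charset[label - 1] for label in collapsed if label != BLANK)


# Toy example: charset "abc" maps a=1, b=2, c=3
frames = [1, 1, 0, 1, 2, 2, 0, 0, 3]
print(ctc_greedy_decode(frames, "abc"))  # prints "aabc"
```

The blank between the two runs of `1` is what allows the doubled letter to survive the collapse step. Attention-based decoders instead predict one character per step until an end-of-sequence token, so they need no collapse step.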
| Method | Code | Regular | Irregular | Segmentation | Extra data | CTC | Attention | Source | Time | Highlight |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Wang et al. [1] : ABBYY | √ | √ | × | √ | × | × | × | ICCV | 2011 | a state-of-the-art text detector + a leading commercial OCR engine |
| Wang et al. [1] : SYNTH+PLEX | √ | √ | × | × | × | × | × | ICCV | 2011 | the baseline of scene text recognition |
| Mishra et al. [2] | × | √ | × | √ | × | × | × | BMVC | 2012 | 1) incorporating higher order statistical language models to recognize words in an unconstrained manner 2) introducing IIIT5K-word dataset |
| Wang et al. [3] | √ | √ | × | √ | × | × | × | ICPR | 2012 | CNNs + Non-maximal suppression + beam search |
| Goel et al. [4] : wDTW | × | √ | × | √ | × | × | × | ICDAR | 2013 | recognizing text by matching the scene and synthetic image features with wDTW |
| Bissacco et al. [5] : PhotoOCR | × | √ | × | √ | × | × | × | ICCV | 2013 | applying a network with five hidden layers for character classification |
| Phan et al. [6] | × | × | √ | √ | × | × | × | ICCV | 2013 | 1) MSER + SIFT descriptors + SVM 2) introducing the SVT-P datasets |
| Alsharif et al. [7] : HMM/Maxout | × | √ | × | √ | × | × | × | ICLR | 2014 | convolutional Maxout networks + Hybrid HMM |
| Almazan et al. [8] : KCSR | √ | √ | × | × | × | × | × | TPAMI | 2014 | embedding word images and text strings in a common vectorial subspace and interpreting the task of recognition and retrieval as a nearest neighbor problem |
| Yao et al. [9] : Strokelets | × | √ | × | √ | × | × | × | CVPR | 2014 | proposing a novel multi-scale representation for scene text recognition: strokelets |
| R.-Serrano et al. [10] : Label embedding | × | √ | × | × | × | × | × | IJCV | 2015 | embedding word labels and word images into a common Euclidean space and finding the closest word label in this space |
| Jaderberg et al. [11] | √ | √ | × | √ | × | × | × | ECCV | 2014 | 1) enabling efficient feature sharing for text detection and classification 2) making technical changes over the traditional CNN architectures 3) proposing a method of automated data mining of Flickr. |
| Su and Lu [12] | × | √ | × | × | × | √ | × | ACCV | 2014 | HOG + BLSTM + CTC |
| Gordo [13] : Mid-features | × | √ | × | √ | × | × | × | CVPR | 2015 | proposing local mid-level features for building word image representations |
| Jaderberg et al. [14] | √ | √ | × | × | × | × | × | IJCV | 2015 | 1) treating each word as a category and training very large convolutional neural networks to perform word recognition on the whole proposal region 2) generating 9 million images with equal numbers of word samples from a 90k word dictionary |
| Jaderberg et al. [15] | × | √ | × | × | × | × | × | ICLR | 2015 | CNN + CRF |
| Shi, Bai, and Yao [16] : CRNN | √ | √ | × | × | × | √ | × | TPAMI | 2017 | CNN + BLSTM + CTC |
| Shi et al. [17] : RARE | × | × | √ | × | × | × | √ | CVPR | 2016 | STN + CNN + attentional BLSTM |
| Lee and Osindero [18] : R2AM | × | √ | × | × | × | × | √ | CVPR | 2016 | presenting recursive recurrent neural networks with attention modeling |
| Liu et al. [19] : STAR-Net | × | × | √ | × | × | √ | × | BMVC | 2016 | STN + ResNet + BLSTM + CTC |
| Liu et al. [78] | × | √ | × | √ | √ | × | × | ICPR | 2016 | integrating the CNN and WFST classification model |
| Mishra et al. [77] | × | √ | × | √ | √ | × | × | CVIU | 2016 | character detection (HOG/CNN + SVM + sliding window) + CRF, combining bottom-up cues from character detection and top-down cues from lexicon |
| Su and Lu [76] | × | √ | × | × | √ | √ | × | PR | 2017 | HOG (different scale) + BLSTM + CTC (ensemble) |
| \*Yang et al. [20] | × | × | √ | × | √ | × | √ | IJCAI | 2017 | 1) CNN + 2D attention-based RNN, applying an auxiliary dense character detection task that helps to learn text specific visual patterns 2) developing a large-scale synthetic dataset |
| Yin et al. [21] | × | √ | × | × | × | √ | × | ICCV | 2017 | CNN + CTC |
| Wang et al. [66] : GRCNN | √ | √ | × | × | × | √ | × | NIPS | 2017 | Gated Recurrent Convolution Layer + BLSTM + CTC |
| \*Cheng et al. [22] : FAN | × | √ | × | × | √ | × | √ | ICCV | 2017 | 1) proposing the concept of attention drift 2) introducing focusing network to focus deviated attention back on the target areas |
| Cheng et al. [23] : AON | × | × | √ | × | × | × | √ | CVPR | 2018 | 1) extracting scene text features in four directions 2) CNN + Attentional BLSTM |
| Gao et al. [24] | × | √ | × | × | × | √ | √ | NC | 2019 | attentional ResNet + CNN + CTC |
| Liu et al. [25] : Char-Net | × | × | √ | √ | × | × | √ | AAAI | 2018 | CNN + STN (facilitating the rectification of individual characters) + LSTM |
| \*Liu et al. [26] : SqueezedText | × | √ | × | × | √ | × | × | AAAI | 2018 | binary convolutional encoder-decoder network + Bi-RNN |
| Zhan et al. [73] | √ | √ | × | × | √ | √ | × | CVPR | 2018 | CRNN, achieving verisimilar scene text image synthesis by combining three novel designs, including semantic coherence, visual attention, and adaptive text appearance |
| \*Bai et al. [27] : EP | × | √ | × | × | √ | × | √ | CVPR | 2018 | proposing edit probability to effectively handle the misalignment between the training text and the output probability distribution sequence |
| Fang et al. [74] | √ | √ | × | × | × | × | √ | MultiMedia | 2018 | ResNet + [2D Attentional CNN, CNN-based language module] |
| Liu et al. [75] : EnEsCTC | √ | √ | × | × | × | √ | × | NIPS | 2018 | proposing a novel maximum entropy based regularization for CTC (EnCTC) and an entropy-based pruning method (EsCTC) to effectively reduce the space of the feasible set |
| Liu et al. [28] | × | √ | × | × | × | √ | × | ECCV | 2018 | designing a multi-task network with an encoder-discriminator-generator architecture to guide the feature of the original image toward that of the clean image |
| Wang et al. [61] : MAAN | × | √ | × | × | × | × | √ | ICFHR | 2018 | ResNet + BLSTM + Memory-Augmented attentional decoder |
| Gao et al. [29] | × | √ | × | × | × | √ | √ | ICIP | 2018 | attentional DenseNet + BLSTM + CTC |
| Shi et al. [30] : ASTER | √ | × | √ | × | × | × | √ | TPAMI | 2018 | TPS + ResNet + bidirectional attention-based BLSTM |
| Chen et al. [60] : ASTER + AEG | × | × | √ | × | × | × | √ | NC | 2019 | TPS + ResNet + bidirectional attention-based BLSTM + AEG |
| Luo et al. [46] : MORAN | √ | × | √ | × | × | × | √ | PR | 2019 | Multi-object rectification network + CNN + attentional BLSTM |
| Luo et al. [61] : MORAN-v2 | √ | × | √ | × | × | × | √ | PR | 2019 | Multi-object rectification network + ResNet + attentional BLSTM |
| Chen et al. [60] : MORAN-v2 + AEG | × | × | √ | × | × | × | √ | NC | 2019 | Multi-object rectification network + ResNet + attentional BLSTM + AEG |
| Xie et al. [47] : CAN | × | √ | × | × | × | × | √ | ACM | 2019 | ResNet + CNN + GLU |
| \*Liao et al. [48] : CA-FCN | × | × | √ | √ | √ | × | √ | AAAI | 2019 | performing character classification at each pixel location and needing character-level annotations |
| \*Li et al. [49] : SAR | √ | × | √ | × | √ | × | √ | AAAI | 2019 | ResNet + 2D attentional LSTM |
| Zhan et al. [55] : ESIR | × | × | √ | × | × | × | √ | CVPR | 2019 | Iterative rectification Network + ResNet + attentional BLSTM |
| Zhang et al. [56] : SSDAN | × | √ | × | √ | × | × | √ | CVPR | 2019 | attentional CNN + GAS + GRU |
| Yang et al. [62] : ScRN | × | × | √ | × | √ | × | √ | ICCV | 2019 | Symmetry-constrained Rectification Network + ResNet + BLSTM + attentional GRU |
| Wang et al. [64] : GCAM | × | √ | × | × | × | × | √ | ICME | 2019 | Convolutional Block Attention Module (CBAM) + ResNet + BLSTM + the proposed Gated Cascade Attention Module (GCAM) |
| Jeonghun et al. [65] | √ | × | √ | × | × | × | √ | ICCV | 2019 | TPS + ResNet + BLSTM + Attention Mechanism |
| Huang et al. [67] : EPAN | × | × | √ | × | × | × | √ | NC | 2019 | learning to sample features from the text region of 2D feature maps and innovatively introducing a two-stage attention mechanism |
| Gao et al. [68] | × | √ | × | × | × | √ | × | NC | 2019 | attentional DenseNet + 4-layer CNN + CTC |
| Qi et al. [69] : CCL | × | √ | × | × | √ | √ | × | ICDAR | 2019 | ResNet + [CTC, CCL] |
| Wang et al. [70] : ReELFA | × | × | √ | × | √ | × | √ | ICDAR | 2019 | VGG + attentional LSTM, utilizing one-hot encoded coordinates to indicate the spatial relationship of pixels and character center masks to help focus attention on the right feature areas |
| Zhu et al. [71] : HATN | × | × | √ | × | √ | × | √ | ICIP | 2019 | ResNet50 + Hierarchical Attention Mechanism (Transformer structure) |
| Zhan et al. [72] : SF-GAN | × | √ | × | × | √ | × | √ | CVPR | 2019 | ResNet50 + attentional Decoder, synthesising realistic scene text images for training better recognition models |
| Liao et al. [79] : SAM | √ | × | √ | × | × | × | √ | TPAMI | 2019 | Spatial attentional module (SAM) |
| Liao et al. [79] : seg-SAM | √ | × | √ | × | √ | × | √ | TPAMI | 2019 | Character segmentation module + Spatial attention module (SAM) |
| Wang et al. [80] : DAN | √ | × | √ | × | × | × | √ | AAAI | 2020 | decoupling the decoder of the traditional attention mechanism into a convolutional alignment module and a decoupled text decoder |
| Wang et al. [82] : TextSR | √ | × | √ | × | × | × | √ | arXiv | 2019 | attempting to solve small texts with super-resolution methods |
| Wan et al. [83] : TextScanner | × | × | √ | √ | √ | × | × | AAAI | 2020 | an effective segmentation-based dual-branch framework for scene text recognition |
| Hu et al. [84] : GTC | × | × | √ | × | √ | √ | √ | AAAI | 2020 | attempting to use GCN to learn the local correlations of feature sequence |
| Luo et al. [85] | × | × | √ | × | × | × | √ | IJCV | 2020 | separating text content from noisy background styles |
| \*Litman et al. [86] | × | × | √ | × | √ | × | √ | CVPR | 2020 | training a deep BiLSTM encoder, thus improving the encoding of contextual dependencies |
| Yu et al. [87] | × | × | √ | × | × | × | √ | CVPR | 2020 | introducing a global semantic reasoning module (GSRM) to capture global semantic context through multi-way parallel transmission |
| Qiao et al. [101] : SEED | √ | × | √ | × | × | × | √ | CVPR | 2020 | proposing a semantics enhanced encoder-decoder framework to robustly recognize low-quality scene texts |
| Bleeker et al. [93] : Bi-STET | √ | √ | × | × | × | × | √ | ECAI | 2020 | a novel bidirectional STR method with a single decoder for bidirectional text decoding |
| \*Bartz et al. [94] : KISS | √ | × | √ | × | √ | × | √ | arXiv | 2020 | a new model for STR that consists of two ResNet based feature extractors, a spatial transformer, and a transformer |
| Zhang et al. [95] : SPIN | × | × | √ | × | × | × | √ | arXiv | 2020 | a new learnable geometric-unrelated module which allows the color manipulation of source data within the network |
| Lin et al. [96] : FASDA | × | √ | × | × | × | × | √ | arXiv | 2020 | implementing sequence-level domain adaption for STR |
| Zhang et al. [98] : AutoSTR | √ | × | √ | × | × | × | √ | ECCV | 2020 | searching data-dependent backbones |
| Mou et al. [99] : PlugNet | × | × | √ | × | × | × | √ | ECCV | 2020 | combining the pluggable super-resolution unit to solve the low-quality text recognition from the feature-level |
| \*Yue et al. [100] : RobustScanner | × | × | √ | × | √ | × | √ | ECCV | 2020 | mitigating the misrecognition problem of the encoder-decoder with attention framework on contextless text images |
### 2.2 Performance Comparison on Benchmark Datasets
In this section, we compare the performance of current advanced algorithms on benchmark datasets, including IIIT5K, SVT, IC03, IC13, SVT-P, CUTE80, IC15, COCO-Text, RCTW-17, MTWI, CTW, SCUT-CTW1500, LSVT, ArT, and ReCTS-25k.
Notes: 1) '*' indicates methods that use extra datasets other than Synth90k and SynthText. 2) **Bold** represents the best recognition results. 3) '^' denotes the best recognition results obtained using extra datasets. 4) '@' represents methods evaluated under a different protocol that uses only 1,811 test images. 5) 'SK', 'ST', 'ExPu', 'ExPr', and 'Un' indicate methods that use Synth90K, SynthText, extra public data, extra private data, and unknown data, respectively. 6) 'D_A' means data augmentation. 7) IC15-S contains only 1,811 cropped text instances.
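
The figures reported in this section are word recognition accuracies in percent. The following is a minimal Python sketch of how such a score could be computed, assuming the common protocol of case-insensitive matching over alphanumeric characters only. Individual papers may apply different rules, and the function names `normalize` and `word_accuracy` and the sample data are illustrative.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and keep only ASCII letters and digits."""
    return re.sub(r"[^0-9a-z]", "", text.lower())


def word_accuracy(predictions: list[str], ground_truths: list[str]) -> float:
    """Percentage of words recognized exactly after normalization."""
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must have equal length")
    if not ground_truths:
        return 0.0
    correct = sum(normalize(p) == normalize(g)
                  for p, g in zip(predictions, ground_truths))
    return 100.0 * correct / len(ground_truths)


# Hypothetical predictions against hypothetical labels
preds = ["Hello", "w0rld", "OCR!"]
truth = ["hello", "world", "ocr"]
print(f"{word_accuracy(preds, truth):.1f}")  # prints "66.7"
```

Because the score is a fraction of a fixed test set, results obtained on differently filtered subsets (for example, the 1,811-image subset mentioned in the notes) are not directly comparable.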
#### 2.2.1 Performance Comparison of Recognition Algorithms on Regular Latin Datasets
| Method | IIIT5K: 50 | IIIT5K: 1K | IIIT5K: None | SVT: 50 | SVT: None | IC03: 50 | IC03: Full | IC03: 50k | IC03: None | IC13: None | Data | Source | Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Wang et al. [1] : ABBYY | 24.3 | - | - | 35 | - | 56 | 55 | - | - | - | Un | ICCV | 2011 |
| Wang et al. [1] : SYNTH+PLEX | - | - | - | 57 | - | 76 | 62 | - | - | - | ExPr | ICCV | 2011 |
| Mishra et al. [2] | 64.1 | 57.5 | - | 73.2 | - | 81.8 | 67.8 | - | - | - | ExPu | BMVC | 2012 |
| Wang et al. [3] | - | - | - | 70 | - | 90 | 84 | - | - | - | ExPr | ICPR | 2012 |
| Goel et al. [4] : wDTW | - | - | - | 77.3 | - | 89.7 | - | - | - | - | Un | ICDAR | 2013 |
| Bissacco et al. [5] : PhotoOCR | - | - | - | 90.4 | 78 | - | - | - | - | 87.6 | ExPr | ICCV | 2013 |
| Phan et al. [6] | - | - | - | 73.7 | - | 82.2 | - | - | - | - | ExPu | ICCV | 2013 |
| Alsharif et al. [7] : HMM/Maxout | - | - | - | 74.3 | - | 93.1 | 88.6 | 85.1 | - | - | ExPu | ICLR | 2014 |
| Almazan et al [8] : KCSR | 88.6 | 75.6 | - | 87 | - | - | - | - | - | - | ExPu | TPAMI | 2014 |
| Yao et al. [9] : Strokelets | 80.2 | 69.3 | - | 75.9 | - | 88.5 | 80.3 | - | - | - | ExPu | CVPR | 2014 |
| R.-Serrano et al.[10] : Label embedding | 76.1 | 57.4 | - | 70 | - | - | - | - | - | - | ExPu | IJCV | 2015 |
| Jaderberg et al. [11] | - | - | - | 86.1 | - | 96.2 | 91.5 | - | - | - | ExPu | ECCV | 2014 |
| Su and Lu [12] | - | - | - | 83 | - | 92 | 82 | - | - | - | ExPu | ACCV | 2014 |
| Gordo[13] : Mid-features | 93.3 | 86.6 | - | 91.8 | - | - | - | - | - | - | ExPu | CVPR | 2015 |
| Jaderberg et al. [14] | 97.1 | 92.7 | - | 95.4 | 80.7 | 98.7 | 98.6 | 93.3 | 93.1 | 90.8 | ExPr | IJCV | 2015 |
| Jaderberg et al. [15] | 95.5 | 89.6 | - | 93.2 | 71.7 | 97.8 | 97 | 93.4 | 89.6 | 81.8 | SK + ExPr | ICLR | 2015 |
| Shi, Bai, and Yao [16] : CRNN | 97.8 | 95 | 81.2 | 97.5 | 82.7 | 98.7 | 98 | 95.7 | 91.9 | 89.6 | SK | TPAMI | 2017 |
| Shi et al. [17] : RARE | 96.2 | 93.8 | 81.9 | 95.5 | 81.9 | 98.3 | 96.2 | 94.8 | 90.1 | 88.6 | SK | CVPR | 2016 |
| Lee and Osindero [18] : R2AM | 96.8 | 94.4 | 78.4 | 96.3 | 80.7 | 97.9 | 97 | - | 88.7 | 90 | SK | CVPR | 2016 |
| Liu et al. [19] : STAR-Net | 97.7 | 94.5 | 83.3 | 95.5 | 83.6 | 96.9 | 95.3 | - | 89.9 | 89.1 | SK + ExPr | BMVC | 2016 |
| \*Liu et al. [78] | 94.1 | 84.7 | - | 92.5 | - | 96.8 | 92.2 | - | - | - | ExPu (D_A) | ICPR | 2016 |
| \*Mishra et al. [77] | 78.07 | - | 46.73 | 78.2 | - | 88 | - | - | 67.7 | 60.18 | ExPu (D_A) | CVIU | 2016 |
| \*Su and Lu [76] | - | - | - | 91 | - | 95 | 89 | - | - | 76 | SK + ExPu | PR | 2017 |
| \*Yang et al. [20] | 97.8 | 96.1 | - | 95.2 | - | 97.7 | - | - | - | - | ExPu | IJCAI | 2017 |
| Yin et al. [21] | 98.7 | 96.1 | 78.2 | 95.1 | 72.5 | 97.6 | 96.5 | - | 81.1 | 81.4 | SK | ICCV | 2017 |
| Wang et al.[66] : GRCNN | 98 | 95.6 | 80.8 | 96.3 | 81.5 | 98.8 | 97.8 | - | 91.2 | - | SK | NIPS | 2017 |
| \*Cheng et al. [22] : FAN | 99.3 | 97.5 | 87.4 | 97.1 | 85.9 | 99.2 | 97.3 | - | 94.2 | 93.3 | SK + ST (Pixel_wise) | ICCV | 2017 |
| Cheng et al. [23] : AON | 99.6 | 98.1 | 87 | 96 | 82.8 | 98.5 | 97.1 | - | 91.5 | - | SK + ST (D_A) | CVPR | 2018 |
| Gao et al. [24] | 99.1 | 97.9 | 81.8 | 97.4 | 82.7 | 98.7 | 96.7 | - | 89.2 | 88 | SK | NC | 2019 |
| Liu et al. [25] : Char-Net | - | - | 83.6 | - | 84.4 | - | 93.3 | - | 91.5 | 90.8 | SK (D_A) | AAAI | 2018 |
| \*Liu et al. [26] : SqueezedText | 97 | 94.1 | 87 | 95.2 | - | 98.8 | 97.9 | 93.8 | 93.1 | 92.9 | ExPr | AAAI | 2018 |
| \*Zhan et al.[73] | 98.1 | 95.3 | 79.3 | 96.7 | 81.5 | - | - | - | - | 87.1 | Pr(5 million) | CVPR | 2018 |
| \*Bai et al. [27] : EP | 99.5 | 97.9 | 88.3 | 96.6 | 87.5 | 98.7 | 97.9 | - | 94.6 | 94.4 | SK + ST (Pixel_wise) | CVPR | 2018 |
| Fang et al.[74] | 98.5 | 96.8 | 86.7 | 97.8 | 86.7 | 99.3 | 98.4 | - | 94.8 | 93.5 | SK + ST | MultiMedia | 2018 |
| Liu et al.[75] : EnEsCTC | - | - | 82 | - | 80.6 | - | - | - | 92 | 90.6 | SK | NIPS | 2018 |
| Liu et al. [28] | 97.3 | 96.1 | 89.4 | 96.8 | 87.1 | 98.1 | 97.5 | - | 94.7 | 94 | SK | ECCV | 2018 |
| Wang et al.[61] : MAAN | 98.3 | 96.4 | 84.1 | 96.4 | 83.5 | 97.4 | 96.4 | - | 92.2 | 91.1 | SK | ICFHR | 2018 |
| Gao et al. [29] | 99.1 | 97.2 | 83.6 | 97.7 | 83.9 | 98.6 | 96.6 | - | 91.4 | 89.5 | SK | ICIP | 2018 |
| Shi et al. [30] : ASTER | 99.6 | 98.8 | 93.4 | 97.4 | 89.5 | 98.8 | 98 | - | 94.5 | 91.8 | SK + ST | TPAMI | 2018 |
| Chen et al. [60] : ASTER + AEG | 99.5 | 98.5 | 94.4 | 97.4 | 90.3 | 99 | 98.3 | - | 95.2 | 95 | SK + ST | NC | 2019 |

Luo et al. [46] : MORAN
97.9
96.2