https://github.com/HCIILAB/Scene-Text-End2end

# End-to-End Scene Text Detection and Recognition System Resources

Author: Canjie Luo, Chongyu Liu

- [1. Datasets](#1-datasets)
  - [1.1 Introduction](#11-introduction)
  - [1.2 Comparison of Datasets](#12-comparison-of-datasets)
- [2. Summary of End-to-end Scene Text Detection and Recognition Methods](#2-summary-of-end-to-end-scene-text-detection-and-recognition-methods)
  - [2.1 Comparison of methods](#21-comparison-of-methods)
  - [2.2 End-to-end scene text detection and recognition results](#22-end-to-end-scene-text-detection-and-recognition-results)
- [3. Survey](#3-survey)
- [4. OCR Service](#4-ocr-service)
- [5. References and codes](#5-references-and-codes)

------


## 1. Datasets


### 1.1 Introduction
- SVT [16]:
  * **Introduction:** There are 100 training images and 250 testing images of road-side scenes downloaded from Google Street View. The labelled text is very challenging, with a wide variety of fonts, orientations, and lighting conditions. A 50-word lexicon (SVT-50) is also provided for each image.
  * **Link:** [SVT-download](http://vision.ucsd.edu/~kai/grocr/)

- ICDAR 2003 (IC03) [17]:
  * **Introduction:** The dataset contains a varied array of real-world photos containing scene text. There are 251 testing images, each with a 50-word lexicon (IC03-50), plus a lexicon of all test ground-truth words (IC03-Full).
  * **Link:** [IC03-download](http://www.iapr-tc11.org/mediawiki/index.php?title=ICDAR_2003_Robust_Reading_Competitions)

- ICDAR 2011 (IC11) [18]:
  * **Introduction:** The dataset extends the one used for the ICDAR 2003 text locating competitions. It includes 485 natural images in total.
  * **Link:** [IC11-download](http://www.cvc.uab.es/icdar2011competition/?com=downloads)

- ICDAR 2013 (IC13) [19]:
  * **Introduction:** The dataset consists of 229 training images and 233 testing images. Most text is horizontal. Three lexicons are provided, named “Strong (S)”, “Weak (W)” and “Generic (G)”. The Strong lexicon provides 100 words per image, including all words that appear in that image; the Weak lexicon includes all words that appear in the entire test set; the Generic lexicon is a 90k-word vocabulary.
  * **Link:** [IC13-download](http://dagdata.cvc.uab.es/icdar2013competition/?ch=2&com=downloads)

- ICDAR 2015 (IC15) [20]:
  * **Introduction:** The dataset includes 1000 training images and 500 testing images captured with Google Glass. The scene text appears in arbitrary orientations. As with ICDAR 2013, “Strong (S)”, “Weak (W)” and “Generic (G)” lexicons are provided.
  * **Link:** [IC15-download](http://rrc.cvc.uab.es/?ch=4&com=downloads)

- Total-Text [21]:
  * **Introduction:** In addition to horizontal and oriented text, Total-Text contains a large amount of curved text. It comprises 1255 training images and 300 testing images, all annotated with word-level polygons and transcriptions. A “Full” lexicon containing all words in the test set is provided.
  * **Link:** [Total-Text-download](https://github.com/cs-chan/Total-Text-Dataset)
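The lexicon-based protocols above (SVT-50, IC03-50, the IC13/IC15 Strong lexicons) are commonly scored after snapping each raw recognition result to the closest lexicon word. A minimal sketch of that matching step, assuming plain Levenshtein distance and case-insensitive comparison (function names are illustrative, not from any official evaluation toolkit):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming (one rolling row)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            # deletion, insertion, substitution/match
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def snap_to_lexicon(word: str, lexicon) -> str:
    """Replace a raw prediction with the nearest lexicon entry (case-insensitive)."""
    return min(lexicon, key=lambda w: edit_distance(word.lower(), w.lower()))
```

For example, a noisy prediction like `"h0use"` would be snapped to `"house"` before comparison against the ground truth.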


### 1.2 Comparison of Datasets



| Dataset | Language | Images (Total / Train / Test) | Text instances (Total / Train / Test) |
| :---: | :---: | :---: | :---: |
| IC03 | English | 509 / 258 / 251 | 2266 / 1110 / 1156 |
| IC11 | English | 484 / 229 / 255 | 1564 / ~ / ~ |
| IC13 | English | 462 / 229 / 233 | 1944 / 849 / 1095 |
| SVT | English | 350 / 100 / 250 | 725 / 211 / 514 |
| SVT-P | English | 238 / ~ / ~ | 639 / ~ / ~ |
| IC15 | English | 1500 / 1000 / 500 | 17548 / 12318 / 5230 |
| Total-Text | English | 1555 / 1255 / 300 | 9330 / ~ / ~ |


## 2. Summary of End-to-end Scene Text Detection and Recognition Methods


### 2.1 Comparison of methods









| Method | Model | Detection | Recognition | Source | Time | Highlight |
| :---: | :---: | :---: | :---: | :---: | :---: | :--- |
| Wang et al. [1] | ~ | Sliding windows and Random Ferns | Pictorial Structures | ICCV | 2011 | Word re-scoring for NMS |
| Wang et al. [2] | ~ | CNN-based | Sliding windows for classification | ICPR | 2012 | CNN architecture |
| Jaderberg et al. [3] | ~ | CNN-based and saliency maps | CNN classifier | ECCV | 2014 | Data mining and annotation |
| Alsharif et al. [4] | ~ | CNN and hybrid HMM maxout models | Segmentation-based | ICLR | 2014 | Hybrid HMM maxout models |
| Yao et al. [5] | ~ | Random Forest | Component linking and word partition | TIP | 2014 | (1) Shared detection and recognition features; (2) oriented text; (3) a new dictionary search method |
| Neumann et al. [6] | ~ | Extremal Regions | Clustering algorithm to group characters | TPAMI | 2015 | Real-time performance (1.6 s/image) |
| Jaderberg et al. [7] | ~ | Region proposal mechanism | Word-level classification | IJCV | 2016 | Trained only on data produced by a synthetic text generation engine, requiring no human-labelled data |
| Liao et al. [8] | TextBoxes | SSD-based framework | CRNN | AAAI | 2017 | An end-to-end trainable fast scene text detector |
| Bŭsta et al. [9] | Deep TextSpotter | YOLOv2 | CTC | ICCV | 2017 | YOLOv2 + RPN, RNN + CTC; the first end-to-end trainable detection and recognition system with high speed |
| Li et al. [10] | ~ | Text Proposal Network | Attention | ICCV | 2017 | TPN + RNN encoder + attention-based RNN |
| Sun et al. [22] | TextNet | Scale-aware attention backbone and Perspective RoI Transform | Attention | ACCV | 2018 | Perspective RoI Transform for irregular text recognition |
| Lyu et al. [11] | Mask TextSpotter | Fast R-CNN with mask branch | Character segmentation | ECCV | 2018 | Precise text detection and recognition via semantic segmentation |
| He et al. [12] | ~ | Text-Alignment Layer | Attention | CVPR | 2018 | Character attention mechanism: character spatial information used as explicit supervision |
| Liu et al. [13] | FOTS | EAST with RoIRotate | CTC | CVPR | 2018 | Little computation overhead compared to the baseline text detection network (22.6 fps) |
| Liao et al. [14] | TextBoxes++ | SSD-based framework | CRNN | TIP | 2018 | Journal version of TextBoxes (adds multi-oriented scene text support) |
| Liao et al. [15] | Mask TextSpotter | Mask R-CNN | Character segmentation + Spatial Attention Module | TPAMI | 2019 | Journal version of Mask TextSpotter (proposes a Spatial Attention Module) |
| Xing et al. [23] | CharNet | A character branch and a detection branch | Character level | ICCV | 2019 | Uses the character as the basic element, sidestepping the difficulty of jointly optimizing text detection and RNN-based recognition |
| Feng et al. [24] | TextDragon | Local box regression, center line segmentation and RoI Sliding | CTC | ICCV | 2019 | A new differentiable operator, RoISlide, connects arbitrary-shaped text detection and recognition |
| Qin et al. [25] | ~ | Mask R-CNN with RoI masking | Attention | ICCV | 2019 | A simple yet effective RoI masking step extracts features of irregularly shaped text instances |
| Qiao et al. [26] | Text Perceptron | Mask R-CNN with order-aware semantic segmentation and boundary regression | Attention | AAAI | 2020 | A novel Shape Transform Module transforms feature regions into regular morphologies |
| Wang et al. [27] | ~ | Oriented Rectangular Box Detector and Boundary Point Detector | Attention | AAAI | 2020 | A set of points on the boundary of each text instance represents arbitrary shapes |
| Liu et al. [28] | ABCNet | Bezier curve detection and BezierAlign | CTC | CVPR | 2020 | 10× faster than recent state-of-the-art methods with competitive scene text spotting accuracy |
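Several recognizers in the table (Deep TextSpotter, FOTS, TextDragon, ABCNet) are trained with CTC. As a reminder of what that decoding step does, here is a minimal best-path (greedy) CTC decoder; the per-frame label ids and charset are made up for illustration:

```python
def ctc_greedy_decode(frame_ids, charset, blank=0):
    """Best-path CTC decoding: collapse repeated labels, then drop blanks."""
    out, prev = [], None
    for idx in frame_ids:
        if idx != prev and idx != blank:
            out.append(charset[idx - 1])  # charset holds the non-blank labels
        prev = idx
    return "".join(out)

# e.g. per-frame argmax ids over {blank=0, 'a'=1, 'b'=2}
print(ctc_greedy_decode([0, 1, 1, 0, 1, 2, 2, 0], "ab"))  # → "aab"
```

Note how the blank between the two runs of `1` is what allows the decoder to emit a doubled letter.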


### 2.2 End-to-end scene text detection and recognition results





| Method | Model | Source | Time | SVT | SVT-50 | IC03 (50 / Full / None) | IC11 | IC13 End-to-end (S / W / G) | IC13 Spotting (S / W / G) | IC15 End-to-end (S / W / G) | IC15 Spotting (S / W / G) | Total-Text (None / Full) | CTW1500 (None / Full) |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Wang et al. [1] | ~ | ICCV | 2011 | ~ | ~ | 51 / ~ / ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ |
| Wang et al. [2] | ~ | ICPR | 2012 | 46 | ~ | 72 / 67 / ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ |
| Jaderberg et al. [3] | ~ | ECCV | 2014 | ~ | 56 | 80 / 75 / ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ |
| Alsharif et al. [4] | ~ | ICLR | 2014 | ~ | 48 | 77 / 70 / ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ |
| Yao et al. [5] | ~ | TIP | 2014 | ~ | ~ | ~ | 48.6 | ~ | ~ | ~ | ~ | ~ | ~ |
| Neumann et al. [6] | ~ | TPAMI | 2015 | ~ | 68.1 | ~ | ~ | 45.2 / ~ / ~ | ~ | 35 / 19.9 / 15.6 | 35 / 19.9 / 15.6 | ~ | ~ |
| Jaderberg et al. [7] | ~ | IJCV | 2016 | 53 | 76 | 90 / 86 / 78 | 76 | 76 / ~ / ~ | ~ | ~ | ~ | ~ | ~ |
| Liao et al. [8] | TextBoxes | AAAI | 2017 | 64 | 84 | ~ | 87 | 91 / 89 / 84 | 94 / 92 / 87 | ~ | ~ | 36.3 / 48.9 | ~ |
| Bŭsta et al. [9] | Deep TextSpotter | ICCV | 2017 | ~ | ~ | ~ | ~ | 89 / 86 / 77 | 92 / 89 / 81 | 54 / 51 / 47 | 58 / 53 / 51 | ~ | 21.85 / ~ |
| Li et al. [10] | ~ | ICCV | 2017 | 66.18 | 84.91 | ~ | 87.7 | 91.08 / 89.8 / 84.6 | 94.2 / 92.4 / 88.2 | ~ | ~ | ~ | ~ |
| Sun et al. [22] | TextNet | ACCV | 2018 | ~ | ~ | ~ | ~ | 89.77 / 88.80 / 82.96 | 94.59 / 93.48 / 86.99 | 78.66 / 74.9 / 60.45 | 82.38 / 78.43 / 62.36 | 54.02 / ~ | ~ |
| Lyu et al. [11] | Mask TextSpotter | ECCV | 2018 | ~ | ~ | ~ | ~ | 92.2 / 91.1 / 86.5 | 92.5 / 92 / 88.2 | 79.3 / 73 / 62.4 | 79.3 / 74.5 / 64.2 | 52.9 / 71.8 | ~ |
| He et al. [12] | ~ | CVPR | 2018 | ~ | ~ | ~ | ~ | 91 / 89 / 86 | 93 / 92 / 87 | 82 / 77 / 63 | 85 / 80 / 65 | ~ | ~ |
| Liu et al. [13] | FOTS | CVPR | 2018 | ~ | ~ | ~ | ~ | 91.99 / 90.11 / 84.77 | 95.94 / 93.9 / 87.76 | 83.55 / 79.11 / 65.33 | 87.01 / 82.39 / 67.97 | ~ | ~ |
| Liao et al. [14] | TextBoxes++ | TIP | 2018 | 64 | 84 | ~ | ~ | 93 / 92 / 85 | 96 / 95 / 87 | 73.3 / 65.9 / 51.9 | 76.5 / 69 / 54.4 | ~ | ~ |
| Liao et al. [15] | Mask TextSpotter | TPAMI | 2019 | ~ | ~ | ~ | ~ | 93.3 / 91.3 / 88.2 | 92.7 / 91.7 / 87.7 | 83 / 77.7 / 73.5 | 82.4 / 78.1 / 73.6 | 65.3 / 77.4 | ~ |
| Xing et al. [23] | CharNet | ICCV | 2019 | ~ | ~ | ~ | ~ | ~ | ~ | 85.05 / 81.25 / 71.08 | ~ | 69.2 / ~ | ~ |
| Feng et al. [24] | TextDragon | ICCV | 2019 | ~ | ~ | ~ | ~ | ~ | ~ | 82.54 / 78.34 / 65.15 | 86.22 / 81.62 / 68.03 | 48.8 / 74.8 | 39.7 / 72.4 |
| Qin et al. [25] | ~ | ICCV | 2019 | ~ | ~ | ~ | ~ | ~ | ~ | 85.51 / 81.91 / 69.94 | ~ | 70.7 / ~ | ~ |
| Qiao et al. [26] | Text Perceptron | AAAI | 2020 | ~ | ~ | ~ | ~ | 91.4 / 90.7 / 85.8 | 94.9 / 94 / 88.5 | 80.5 / 76.6 / 65.1 | 84.1 / 79.4 / 67.9 | 69.7 / 78.3 | 57 / ~ |
| Wang et al. [27] | ~ | AAAI | 2020 | ~ | ~ | ~ | ~ | 88.2 / 87.7 / 84.1 | ~ | 79.7 / 75.2 / 64.1 | ~ | 65 / 76.1 | 41.3 / ~ |
| Liu et al. [28] | ABCNet | CVPR | 2020 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | 69.5 / 78.4 | 45.2 / 74.1 |
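The numbers above are F-measures: a prediction typically counts as correct only when its box matches a ground-truth box (IoU ≥ 0.5 in the standard protocol) and the transcription matches, usually case-insensitively. A minimal sketch of the arithmetic, using axis-aligned boxes for simplicity (the benchmarks themselves use quadrilaterals or polygons):

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def f_measure(num_correct, num_pred, num_gt):
    """Harmonic mean of precision and recall over matched (box, text) pairs."""
    p = num_correct / num_pred if num_pred else 0.0
    r = num_correct / num_gt if num_gt else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```

So a method that spots 50 of 100 ground-truth words with 100 predictions scores an F-measure of 0.5.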


## 3. Survey

**[A] \[TPAMI-2015]** Ye Q, Doermann D. **Text detection and recognition in imagery: A survey**[J]. IEEE transactions on pattern analysis and machine intelligence, 2015, 37(7): 1480-1500. [paper](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6945320)

**[B] \[Frontiers-Comput. Sci-2016]** Zhu Y, Yao C, Bai X. **Scene text detection and recognition: Recent advances and future trends**[J]. Frontiers of Computer Science, 2016, 10(1): 19-36. [paper](https://link.springer.com/article/10.1007/s11704-015-4488-0)

**[C] \[arXiv-2018]** Long S, He X, Yao C. **Scene Text Detection and Recognition: The Deep Learning Era**[J]. arXiv preprint arXiv:1811.04256, 2018. [paper](https://arxiv.org/pdf/1811.04256.pdf)


## 4. OCR Service

| OCR | API | Free |
| :----------------------------------------------------------: | :--: | :--: |
| [Tesseract OCR Engine](https://github.com/tesseract-ocr/tesseract) | × | √ |
| [Azure](https://azure.microsoft.com/zh-cn/services/cognitive-services/computer-vision/#Analysis) | √ | √ |
| [ABBYY](https://www.abbyy.cn/real-time-recognition-sdk/technical-specifications/) | √ | √ |
| [OCR Space](https://ocr.space/) | √ | √ |
| [SODA PDF OCR](https://www.sodapdf.com/ocr-pdf/) | √ | √ |
| [Free Online OCR](https://www.newocr.com/) | √ | √ |
| [Online OCR](https://www.onlineocr.net/) | √ | √ |
| [Super Tools](https://www.wdku.net/) | √ | √ |
| [Online Chinese Recognition](http://chongdata.com/ocr/) | √ | √ |
| [Calamari OCR](https://github.com/Calamari-OCR/calamari) | × | √ |
| [Tencent OCR](https://cloud.tencent.com/product/ocr?lang=cn) | √ | × |


## 5. References and codes

- [1] Wang K, Babenko B, Belongie S. **End-to-end scene text recognition**[C].2011 International Conference on Computer Vision. IEEE, 2011: 1457-1464. [paper](http://www.iapr-tc11.org/dataset/SVT/wang_iccv2011.pdf)

- [2] Wang T, Wu D J, Coates A, et al. **End-to-end text recognition with convolutional neural networks**[C]. Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012). IEEE, 2012: 3304-3308. [paper](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.664.6212&rep=rep1&type=pdf)

- [3] Jaderberg M, Vedaldi A, Zisserman A. **Deep features for text spotting**[C]. European conference on computer vision. Springer, Cham, 2014: 512-528. [paper](http://www.robots.ox.ac.uk/~vedaldi/assets/pubs/jaderberg14deep.pdf)

- [4] Alsharif O, Pineau J. **End-to-End Text Recognition with Hybrid HMM Maxout Models**[C]. In ICLR 2014. [paper](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.740.1108&rep=rep1&type=pdf)

- [5] Yao C, Bai X, Liu W. **A unified framework for multioriented text detection and recognition**[J]. IEEE Transactions on Image Processing, 2014, 23(11): 4737-4749. [paper](http://www.vlrlab.net/admin/uploads/avatars/A_Unified_Framework_for_Multi-Oriented_Text_Detection_and_Recognition.pdf)

- [6] Neumann L, Matas J. **Real-time lexicon-free scene text localization and recognition**[J]. IEEE transactions on pattern analysis and machine intelligence, 2015, 38(9): 1872-1885. [paper](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.717.4947&rep=rep1&type=pdf)

- [7] Jaderberg M, Simonyan K, Vedaldi A, et al. **Reading text in the wild with convolutional neural networks**[J]. International Journal of Computer Vision, 2016, 116(1): 1-20. [paper](http://www.academia.edu/download/43938680/jaderberg16.pdf)

- [8] Liao M, Shi B, Bai X, et al. **Textboxes: A fast text detector with a single deep neural network**[C]. In AAAI 2017. [paper](https://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/viewPDFInterstitial/14202/14295) [code](https://github.com/MhLiao/TextBoxes)

- [9] Busta M, Neumann L, Matas J. **Deep textspotter: An end-to-end trainable scene text localization and recognition framework**[C]. Proceedings of the IEEE International Conference on Computer Vision. 2017: 2204-2212. [paper](http://openaccess.thecvf.com/content_ICCV_2017/papers/Busta_Deep_TextSpotter_An_ICCV_2017_paper.pdf)

- [10] Li H, Wang P, Shen C. **Towards end-to-end text spotting with convolutional recurrent neural networks**[C]. Proceedings of the IEEE International Conference on Computer Vision. 2017: 5238-5246. [paper](http://openaccess.thecvf.com/content_ICCV_2017/papers/Li_Towards_End-To-End_Text_ICCV_2017_paper.pdf)

- [11] Lyu P, Liao M, Yao C, et al. **Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes**[C]. Proceedings of the European Conference on Computer Vision (ECCV). 2018: 67-83. [paper](http://openaccess.thecvf.com/content_ECCV_2018/papers/Pengyuan_Lyu_Mask_TextSpotter_An_ECCV_2018_paper.pdf) [code](https://github.com/lvpengyuan/masktextspotter.caffe2)

- [12] He T, Tian Z, Huang W, et al. **An end-to-end textspotter with explicit alignment and attention**[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 5020-5029. [paper](http://openaccess.thecvf.com/content_cvpr_2018/papers/He_An_End-to-End_TextSpotter_CVPR_2018_paper.pdf) [code](https://github.com/tonghe90/textspotter)

- [13] Liu X, Liang D, Yan S, et al. **FOTS: Fast oriented text spotting with a unified network**[C]. Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 5676-5685. [paper](http://openaccess.thecvf.com/content_cvpr_2018/papers/Liu_FOTS_Fast_Oriented_CVPR_2018_paper.pdf) [code](https://github.com/jiangxiluning/FOTS.PyTorch)

- [14] Liao M, Shi B, Bai X. **Textboxes++: A single-shot oriented scene text detector**[J]. IEEE transactions on image processing, 2018, 27(8): 3676-3690. [paper](https://ieeexplore.ieee.org/abstract/document/8334248) [code](https://github.com/MhLiao/TextBoxes_plusplus)

- [15] Liao M, Lyu P, He M, et al. **Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes**[J]. IEEE transactions on pattern analysis and machine intelligence, 2019. [paper](https://arxiv.org/abs/1908.08207) [code](https://github.com/MhLiao/MaskTextSpotter)

- [16] Wang K, Belongie S. **Word Spotting in the Wild**[C]. European Conference on Computer Vision (ECCV), 2010: 591-604. [paper](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.168.4897&rep=rep1&type=pdf)

- [17] Lucas S M, Panaretos A, Sosa L, et al. **ICDAR 2003 robust reading competitions: entries, results, and future directions**[J]. IJDAR, 2005, 7(2-3): 105-122. [paper](https://link.springer.com/content/pdf/10.1007%2Fs10032-004-0134-3.pdf)

- [18] Shahab A, Shafait F, Dengel A. **ICDAR 2011 robust reading competition challenge 2: Reading text in scene images**[C]. In ICDAR, 2011. [paper](https://ieeexplore.ieee.org/document/6065556)

- [19] Karatzas D, Shafait F, Uchida S, et al. **ICDAR 2013 robust reading competition**[C]. In ICDAR, 2013. [paper](https://ieeexplore.ieee.org/document/6628859)

- [20] Karatzas D, Gomez-Bigorda L, Nicolaou A, et al. **ICDAR 2015 competition on robust reading**[C]. In ICDAR, 2015: 1156-1160. [paper](https://ieeexplore.ieee.org/document/7333942)

- [21] Ch'ng C K, Chan C S. **Total-Text: A comprehensive dataset for scene text detection and recognition**[C]. In ICDAR, 2017, 1: 935-942. [paper](https://arxiv.org/abs/1710.10400)

- [22] Sun Y, Zhang C, Huang Z, et al. **TextNet: Irregular Text Reading from Images with an End-to-End Trainable Network**[C]. Asian Conference on Computer Vision (ACCV), 2018: 83-99. [paper](https://link.springer.com/chapter/10.1007/978-3-030-20893-6_6)

- [23] Xing L, Tian Z, Huang W, et al. **Convolutional character networks**[C]. In ICCV, 2019. [paper](http://openaccess.thecvf.com/content_ICCV_2019/papers/Xing_Convolutional_Character_Networks_ICCV_2019_paper.pdf) [code](https://github.com/MalongTech/research-charnet)

- [24] Feng W, He W, Yin F, et al. **TextDragon: An End-to-End Framework for Arbitrary Shaped Text Spotting**[C]. In ICCV, 2019. [paper](http://openaccess.thecvf.com/content_ICCV_2019/papers/Feng_TextDragon_An_End-to-End_Framework_for_Arbitrary_Shaped_Text_Spotting_ICCV_2019_paper.pdf)

- [25] Qin S, Bissacco A, Raptis M, et al. **Towards unconstrained end-to-end text spotting**[C]. In ICCV, 2019. [paper](http://openaccess.thecvf.com/content_ICCV_2019/papers/Qin_Towards_Unconstrained_End-to-End_Text_Spotting_ICCV_2019_paper.pdf)

- [26] Qiao L, Tang S, Cheng Z, et al. **Text Perceptron: Towards End-to-End Arbitrary-Shaped Text Spotting**[C]. In AAAI, 2020. [paper](https://www.aaai.org/Papers/AAAI/2020GB/AAAI-QiaoL.893.pdf)

- [27] Wang H, Lu P, Zhang H, et al. **All You Need Is Boundary: Toward Arbitrary-Shaped Text Spotting**[C]. In AAAI, 2020. [paper](https://arxiv.org/abs/1911.09550)

- [28] Liu Y, Chen H, Shen C, et al. **ABCNet: Real-time Scene Text Spotting with Adaptive Bezier-Curve Network**[C]. In CVPR, 2020. [paper](https://arxiv.org/abs/2002.10200) [code](https://github.com/Yuliang-Liu/bezier_curve_text_spotting)

If you find any problems in our resources, or any good papers/codes we have missed, please inform us at **[email protected]**. Thank you for your contribution.

### Copyright

Copyright © 2019 SCUT-DLVC. All Rights Reserved.

