https://github.com/xinke-wang/ocrdatasets
A collection of OCR-related datasets
datasets documents ocr scene-text-detection scene-text-recognition
- Host: GitHub
- URL: https://github.com/xinke-wang/ocrdatasets
- Owner: xinke-wang
- Created: 2022-09-07T07:21:05.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2022-09-07T09:57:06.000Z (almost 3 years ago)
- Last Synced: 2025-01-10T03:28:44.984Z (5 months ago)
- Topics: datasets, documents, ocr, scene-text-detection, scene-text-recognition
- Homepage:
- Size: 19.5 KB
- Stars: 141
- Watchers: 1
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
# OCR Datasets
This repo collects OCR-related datasets. In general, the datasets are classified into 6 types, *i.e.*, **Natural Scene Text**, **Document Text**, **Handwritten Text**, **Historical Document Text**, **Video Text**, and **Synthetic Text**.

- **Natural Scene Text**: The images in this type of dataset are usually taken in natural scenes, so the difficulty of this task lies in the complex lighting transformations, shooting angles, blurring, varied fonts, etc.
- **Document Text**: focuses only on document images; the main difficulty is the variety of typesetting.
- **Historical Document Text**: is usually designed to assist social science research. For example, digitized antiquarian documents help preserve historical materials and make it easier for scholars to conduct related research.
- **Video Text**: aims at recognizing texts in videos, which introduces temporal information into the OCR task.
- **Synthetic Text**: generates images containing text, together with the corresponding annotations, by rendering text in different fonts onto natural photos. This type of dataset usually includes hundreds of thousands of samples since it does not require human annotation. However, current synthesis techniques leave a large domain gap between synthetic images and authentic samples, so these datasets are often used for pre-training only.

## Natural Scene Text

| Year/Venue | Name | Task | #Train (#words) | #Val (#words) | #Test (#words) | Granularity | Annotation Form | Language | Scene | Paper | Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2003-05/ICDAR | IC03/IC05 | Det. & Rec. | 258 (1110) | N/A | 251 (1156) | Word | Rect [x, y, w, h, "transcript"] | English | Natural |  | 112MB |
| 2011-15/ICDAR | Born-Digital-Image (IC2011-2015) | Det. & Rec. & Seg. | 410 (3564) | N/A | 141 (1439) | Word & Pixel | Rect [x, y, w, h, "transcript"] | English | Natural/Web/Email |  | 40MB |
| 2013-15/ICDAR | Focused Scene Text (IC13) | Det. & Rec. & Seg. | 229 (848) | N/A | 233 (1095) | Word & Pixel | Rect [x1, y1, x2, y2, "transcript"] & SegMap | English | Natural |  | 250MB |
| 2015/ICDAR | Incidental Scene Text (IC15) | Det. & Rec. | 1,000 (4468) | N/A | 500 (2077) | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | Natural |  | 130MB |
| 2017/ICDAR | Multi-Lingual Scene Text (MLT2017) | Det. & Rec. | 7,200 | 1,800 | private | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, Lan, 'trans'] | multi-lingual | Natural | - | 12GB |
| 2019/ICDAR | Multi-Lingual Scene Text (MLT2019) | Det. & Rec. | 10,000 | N/A | 10,000 | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, Lan, 'trans'] | multi-lingual | Natural |  | ~12GB |
| 2017/ICDAR | COCO-Text v2.0 | Det. & Rec. | 43,686 | 10,000 | 10,000 | Word | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | En & NonEn | Natural |  | 13GB |
| 2019/ICDAR | ReCTS | Det. & Rec. | 20,000 | N/A | 5,000 | Word/Line | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | Chinese | Signboard | - | ~2.5GB |
| 2017/ICDAR | Total-Text | Det. & Rec. | 1255 | N/A | 300 | Word & Pixel | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | English | Natural |  | 441MB |
| 2019/PR | SCUT-CTW1500 | Det. & Rec. | 1,000 | N/A | 500 | Line | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | En & Ch | Natural |  | 800MB |
| 2019/ICDAR | Arbitrary-Shaped Text (ART) | Det. & Rec. | 5,603 (50,029) | N/A | 4,563 (52,631) | Word (En) / Line (Ch) | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], Lan, 'trans'] | En & Ch | Natural | - | 4.4GB |
| 2017/ICDAR | RCTW-17 (CTW-12k) | Det. & Rec. | 11514 | N/A | 1000 | Line | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | Chinese | Mixture |  | 11GB |
| 2019/ICDAR/ICCV | Large-scale Street View Text (LSVT) | Det. & Rec. | 30,000 | N/A | 20,000 | Line | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | En & Ch | Street View |  | 14GB |
| 2016/DAS | MLe2e | Det. & Script Identification | 450 | N/A | 261 | Word | Rect [x1, y1, x2, y2, language] | multi-lingual | Natural |  | 82MB |
| 2017/ICDAR | IIIT-ILST | Det. & Rec. | 893 |  |  | Word | Rect [x, y, w, h, "transcript"] | Indic | Google Images |  | 609MB |
| 2017/CVPRW | UberText | Det. & Rec. | 117,969 (571,534) |  |  | Word | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | English | Street View |  | 197GB |
| 2009/VISAPP | Chars74k | Det. & Rec. | 1922 |  |  | Character |  | En & Kannada | Natural Scene |  | 739MB |
| 2010/ICPR | KAIST | Det. & Rec. & Seg. | 3000 |  |  | Char & Word & Pixel | Rect [x, y, w, h, "transcript"] & SegMap | En & Korean | Mixture |  | 364MB |
| 2010/ECCV | SVT | Det. & Rec. | 100 (211) | N/A | 250 (514) | Word | Rect [x, y, w, h, "transcript"] | English | Street View |  | 118MB |
| 2013/ICCV | SVTP (download code: vnis) | Rec. | 238 (639) |  |  | - |  | English | Street View |  | ~1MB |
| 2011/NIPSw | SVHN | Det. & Rec. | 73,257+531,131 | N/A | 26,032 | Character | Rect [x, y, w, h, "transcript"] | Digit | House Number |  | ~3GB |
| 2011/ICDARw | NEOCR | Det. | 659 (5,238) |  |  | Line | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | multi-lingual | Natural Scene |  | 1.3GB |
| 2012/CVPR | MSRA-TD500 | Det. | 300 | N/A | 200 | Line | RotRect [ind, difficult, x, y, w, h, theta] | multi-lingual | Street View |  | 96MB |
| 2012/BMVC | IIIT 5k-word | Rec. | 380 (2000) | N/A | 740 (3000) | Word |  | English | Natural |  | 106MB |
| 2014/ESWA | CUTE80 | Rec. | 80 |  |  | Line | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]]] | English | Street View |  | 44MB |
| 2015/TPAMI | USTB-SV1K | Det. & Rec. | 500 | N/A | 500 | Word | RotRect [ind, difficult, x, y, w, h, theta, "trans"] | English | Street View |  | 36MB |
| 2019/JCST | Chinese Text in the Wild (CTW) | Det. & Rec. | 25,887 (812,872 chars) | N/A | 3,269 (103,519 chars) | Char & Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | Chinese | Street View |  | ~40GB |
| 2019/TITS | ShopSign | Det. & Rec. | 1258 sample images |  |  | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | Chinese | Signboard |  | 3GB |
| 2021/CVPR | TextOCR | Det. & Rec. & VQA | 24902 (822,572) | N/A | 3232 (80,497) | Word | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | English | Natural Scene |  | ~8GB |
| 2021/CVPR | VinText | Det. & Rec. | 1,200 | N/A | 300+500 | Word | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | Vietnamese | Natural Scene |  | 1GB |
| 2018/Competition | ICPR MTWI2018 | Det. & Rec. | 10,000 | N/A | 10,000 | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | En & Ch | Web Images |  | 2GB |
| 2019/Competition | Baidu Chinese Scene Text Recognition Competition (百度中文场景文字识别比赛) | Rec. | 50,000 | N/A | 10,000 | - | [h, w, 'trans'] | En & Ch | Street View |  | - |

Where a row has a single count, the source lists no train/val/test split; the count is shown in the first count column.
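
The annotation forms listed above (Rect, Quad, RotRect, Polygon) describe the same thing, a text region, in different coordinate layouts. When mixing datasets it is common to normalize them to a list of polygon vertices. The following is a minimal sketch of such a normalization, not code from any of the datasets. The function names are hypothetical, and the RotRect interpretation (`x, y` as the top-left corner of the unrotated box, `theta` in radians about the box center) is an assumption that should be checked against each dataset's own documentation.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def rect_xywh_to_polygon(x: float, y: float, w: float, h: float) -> List[Point]:
    """Rect [x, y, w, h] -> four corners, starting top-left, clockwise in image space."""
    return [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]


def rect_xyxy_to_polygon(x1: float, y1: float, x2: float, y2: float) -> List[Point]:
    """Rect [x1, y1, x2, y2] (two opposite corners) -> four corners."""
    return [(x1, y1), (x2, y1), (x2, y2), (x1, y2)]


def quad_to_polygon(coords: Sequence[float]) -> List[Point]:
    """Quad [x1, y1, ..., x4, y4] -> four (x, y) pairs in the given order."""
    if len(coords) != 8:
        raise ValueError("a quad needs exactly 8 coordinates")
    return [(coords[i], coords[i + 1]) for i in range(0, 8, 2)]


def rotrect_to_polygon(x: float, y: float, w: float, h: float, theta: float) -> List[Point]:
    """RotRect -> four corners.

    Assumption: (x, y) is the top-left corner of the unrotated box and
    theta is a rotation in radians about the box center.
    """
    cx, cy = x + w / 2, y + h / 2
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each corner offset with the standard 2-D rotation matrix.
    return [
        (cx + dx * cos_t - dy * sin_t, cy + dx * sin_t + dy * cos_t)
        for dx, dy in offsets
    ]
```

With every region expressed as a vertex list, the same loading and evaluation code can be reused across datasets; the transcript and any language tag are carried alongside unchanged.
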

## Document Text

| Year/Venue | Name | Task | #Train | #Val | #Test | Granularity | Annotation Form | Language | Scene | Paper | Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2011/ICDAR | RETAS | No public download link |  |  |  | Char & Word | No public download link |  |  |  | - |
| 2013/IJDAR | LRDE-DBD Document Binarization | Det. & Binarization | 125 |  |  | Line & Mask | Rect | French | Magazine |  | ~700MB |
| 2015/ICDAR | SmartDOC |  | 3630 | N/A | 8470 |  |  |  |  |  | ~30GB |
| 2016/ICFHR | KPTI | Rec. | 11,910 | 2,552 | 2,553 | - | ['transcripts'] | Pashto | Document |  | ~100MB |
| 2017/ICDAR | DeText | Det. & Rec. | 100 | 100 | 300 | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | Scientific |  | 10MB |
| 2019/ICDAR | SROIE | Det. & Rec. & Info Ext. | 600 |  | 400 | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | Receipt | - | <1GB |
| 2019/ICDAR | FUNSD | Det. & Rec. & Info Ext. | 149 | N/A | 50 | Word | Rect [x1, y1, x2, y2, "transcript"] | English | Form |  | 16MB |
| 2019/ICDAR | NAF | Det. & Rec. & Info Ext. | 682 | 59 | 63 | Line | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | Form |  |  |
| 2020 | BID | Det. & Rec. | 28880 |  |  | Line | Poly | Latin | ID Document |  |  |
| 2020/ISCSIC | DDI-100 | Det. & Rec. | ~100,000 (70% train, 30% val) |  |  | Char & Word & Mask | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | Distorted Document |  | ~300GB |

Where a row has a single count, the source lists no train/val/test split; the count is shown in the first count column.
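
Several entries above use the form `Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans']`. The listing only gives the field order, so the sketch below assumes one common on-disk layout: one annotation per line, fields separated by commas, transcript last. That layout, the function name `parse_quad_line`, and the sample string are all illustrative assumptions, not taken from any dataset. The point it shows is that the transcript itself may contain commas, so only the first eight commas should be treated as separators.

```python
from typing import List, NamedTuple, Tuple


class QuadAnnotation(NamedTuple):
    points: List[Tuple[int, int]]  # four (x, y) corners in file order
    transcript: str                # text content of the region


def parse_quad_line(line: str) -> QuadAnnotation:
    """Parse 'x1,y1,x2,y2,x3,y3,x4,y4,transcript' (assumed layout)."""
    # Split on the first 8 commas only, so commas inside the transcript survive.
    parts = line.strip().split(",", 8)
    if len(parts) != 9:
        raise ValueError(f"expected 8 coordinates and a transcript, got: {line!r}")
    # Go through float so values such as '72.0' are accepted as well.
    coords = [int(float(value)) for value in parts[:8]]
    points = list(zip(coords[0::2], coords[1::2]))
    return QuadAnnotation(points, parts[8])


# Made-up example line; the transcript contains a comma.
example = parse_quad_line("72,25,326,25,326,64,72,64,ACME STORE, INC")
```

Datasets that quote the transcript or add a language field before it (the `Lan` entry in some forms) need a different split count, so the separator limit should be adjusted per dataset.
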

## Handwritten Text

| Year/Venue | Name | Task | #Train | #Val | #Test | Granularity | Annotation Form | Language | Scene | Paper | Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2008-11/ICDAR | RIMES | No public download link |  |  |  | Word & Line | No public download link |  |  |  |  |
| 2010/DAS | HIT-OR3C | Rec. | Char set 832,650 chars / Doc set 77,168 chars |  |  | - | special format | Chinese | Handwritten |  | 1GB |
| 2012/PR | KHATT | Rec. | 8,368 | 1,793 | 1,822 | - | ['transcripts'] | Arabic | Handwritten |  |  |
| 98-2014 | HANDS | No public download link |  |  |  |  |  | Japanese | Handwritten |  |  |
| - | Lao-SABAIDEE |  | 500 samples |  |  |  | No public download link | Laos | Handwritten |  |  |
| 2014/ICFHR | ORAND-CAR/CVL | Rec. | 5,000 | N/A | 5,000 | Word | ['image_name', 'trans'] | Digits | Handwritten Digits |  | 194MB |
| 2018/ICFHR | VNOnDB | Rec. | 1,146 paragraphs, 7,296 lines, 380,000 chars |  |  | Word/Line/Parag. | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | Vietnamese | Handwritten |  | 200MB |
| 2013-16/IJDAR | PE92/SERI95/HanDB (HangulDB) | Rec. | 1200 samples (90% Train/10% Test) |  |  |  | .HGU1 format | Korean | Handwritten |  | 800MB |
| 95-2016 | NIST | Rec. |  |  |  |  |  | English |  |  |  |
| 2011/ICDAR | CASIA-OLHWDB/HWDB | Rec. |  |  |  |  |  | Chinese | Handwritten |  |  |
| 2021/ICDAR | IIT-INDIC-HW-WORDS | Rec. | 872,000 instances |  |  | Word | ['image_name', 'vocab_id'] & vocabulary | Indic | Handwritten |  | ~20GB |
| 1999/ICDAR | IAM Handwriting Database | Rec. | 6,161 | 900+940 | 1,861 |  | Registration is Required |  |  |  |  |
| 2005/ICDAR | IAM ONLINE Handwriting Data | Rec. | 86,272 word instances |  |  |  | Registration is Required |  |  |  |  |
| 2018/ICDAR | IAM-MonDo | Rec. |  |  |  |  | Registration is Required |  |  |  |  |
| 2011-14/ICDAR | CROHME | Rec. | > 10,000 expressions |  |  | symbol & expression | inkml format, latex | Symbol | Mathematical |  | 58MB |
| 2017/ICDAR | MUSCIMA++ | Rec. | 140 |  |  |  |  | Symbol | Music Notation |  |  |
| 2018/Access | SCUT-EPT | Rec. | 40,000 | N/A | 10,000 |  |  | Chinese | Educational Doc. |  | 1.08GB |
| 2020/ICFHR | HHD | Rec. | 3965 |  | 1134 |  |  | Hebrew |  |  |  |
| 2021/ArXiv | IMGUR5K | Det. & Rec. | (~108,000) | (~13,000) | (~14,000) | Word | Rect [x, y, w, h, "transcript"] | English | Handwritten |  | - |
| 2021/ArXiv | VML-MOC | Seg. & Rec. |  |  |  |  |  | Hebrew |  |  |  |
| 2021/ICDAR | Bengali | Rec. |  |  |  |  |  | Bengali |  |  |  |
| 2021/ICDAR | GNHK | Det. & Rec. | 687 |  |  | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English |  |  |  |

Where a row has a single count, the source lists no train/val/test split; the count is shown in the first count column.
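
Some entries above give only a split ratio rather than fixed file lists, for example HangulDB with 1200 samples at 90% train / 10% test. The source does not say how those splits are drawn, so the following is just one reproducible way to do it: a seeded shuffle followed by a cut. The function name `split_by_ratio` and the sample identifiers are hypothetical, and the result will not necessarily match a split used in any published paper.

```python
import random
from typing import List, Sequence, Tuple


def split_by_ratio(
    items: Sequence[str], train_ratio: float, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle with a fixed seed, then cut into (train, held-out) lists."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    shuffled = list(items)                   # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)    # local RNG: no global state involved
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical sample identifiers standing in for 1200 files.
sample_ids = [f"sample_{i:04d}" for i in range(1200)]
train_ids, test_ids = split_by_ratio(sample_ids, train_ratio=0.9, seed=0)
```

Using a local `random.Random(seed)` instead of the module-level functions keeps the split stable even if other code reseeds the global generator. Recording the seed together with the results makes the split repeatable.
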
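
## Historical Document Text

| Year/Venue | Name | Task | #Train | #Val | #Test | Granularity | Annotation Form | Language | Scene | Paper | Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2010-11/DAS | IAM-HistDB | Rec. | 127 |  |  | Word & Line | ['image_id', 'transcript'] | En & Ger & Latin |  |  | >200MB |
| 2016/ICFHR | H-KWS (1. Botany 2. AK) | Det. & Rec. | 1849 | 3734 | N/A | Word & Line | Rect [x, y, w, h, "transcript"] | English |  |  |  |
| 2016/ICFHR | READ | Registration is Required |  |  |  |  |  | German |  |  | ~600MB |
| 2017/ICFHR | Palm Leaf Manuscript | Det. & Rec. | ~19,000 Balinese + ~20,000 Khmer |  |  | Char | No public download link | Khmer | Palm Leaf |  |  |
| 2017/HIP | SleukRith-Set | Det. & Rec. | 658 |  |  | Char & Word | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'transcript'] | Khmer | Palm Leaf |  | 1GB |
| 2019/NCA | ARDIS | Rec. | 10,000 |  |  | Char & Word | ['transcript'] | Digits | Church Records |  |  |
| 2019/ICDAR | Pinkas | Det. & Rec. |  |  |  | Word & Line |  | Hebrew | Historical Manuscripts |  | ~50MB |
| 2020/ICFHR | Cuneiform |  |  |  |  |  |  |  |  |  |  |
| 2020/ICFHR | MTHv2 | Det. & Rec. | 2,399 | N/A | 800 | Char & Line |  | Chinese | Ancient Book |  | 4.6GB |
| 2021/ICDAR | IHR-NomDB | Det. & Rec. | 267 |  |  | Line | Rect [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | ChuNom | Ancient Book |  |  |
| 2021/ICDAR | VML-HP |  |  |  |  |  |  | Hebrew |  |  |  |
|  | VML-AHTE |  |  |  |  |  |  |  |  |  |  |
| 2019/ICDAR | IndiScapes | Seg. |  |  |  |  | No public download link | Indic |  |  |  |

Where a row has a single count, the source lists no train/val/test split; the count is shown in the first count column.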

## Video Text

| Year/Venue | Name | Task | #Train videos (#frames) | #Val videos (#frames) | #Test videos (#frames) | Granularity | Annotation Form | Language | Scene | Paper | Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2013-15/ICDAR | Text in Videos (IC13) | Det. & Rec. | 25 (13450) |  | 24 (14374) | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | Natural |  |  |
| 2015/ICDAR | CVSI2015 | No public link for download |  |  |  |  |  | multi-lingual |  |  |  |
| 2017/ICDAR | DOST |  |  |  |  | Word | Quad | Japanese |  |  |  |
| 2018/ICFHR | LectureVideoDB | Det. & Rec. | -52,225 | -27,900 | -36,460 | Word |  | English | Slides/Paper |  | 2.3GB |
| 2020/ICRA | RoadText-1K | Det. & Rec. | 500 (150,000) | 200 (60,000) | 300 (90,000) | Line | Rect [x1, y1, x2, y2, "transcript"] & SegMap | En & NonEn | Road/Traffic |  |  |
| 2020/ICMV | MIDV-500 & MIDV-2019 | Det. & Rec. & Others | 500 video clips |  |  |  | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | multi-lingual | Document |  | 32GB |
| 2021/ICDAR | MIDV-LAIT | Det. & Rec. & Others |  |  |  |  | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | multi-lingual | Document |  |  |
| 2020/ICPR | AcTiVComp | Det. & Rec. | 2557 frames |  |  | Line | Rect [x1, y1, x2, y2, "transcript"] | Arabic |  |  |  |

Where a row has a single count, the source lists no train/val/test split; the count is shown in the first count column.
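
Count cells in this listing often pack two numbers into one string, such as `25 (13450)` for videos and frames, or `258 (1110)` for images and words in the scene-text entries. The sketch below shows one way to pull both numbers out and derive an average, for example frames per video. The helper names `parse_count_cell` and `mean_per_unit` are hypothetical, and cells with free text (such as "500 video clips") are deliberately rejected rather than guessed at.

```python
import re
from typing import Optional, Tuple

# "<primary>" or "<primary> (<secondary>)", digits with optional thousands commas.
_COUNT_CELL = re.compile(r"^\s*([\d,]+)\s*(?:\(\s*([\d,]+)\s*\))?\s*$")


def parse_count_cell(cell: str) -> Tuple[int, Optional[int]]:
    """Parse '500 (150,000)' -> (500, 150000); '7,200' -> (7200, None)."""
    match = _COUNT_CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognized count cell: {cell!r}")
    primary = int(match.group(1).replace(",", ""))
    secondary = int(match.group(2).replace(",", "")) if match.group(2) else None
    return primary, secondary


def mean_per_unit(cell: str) -> float:
    """Average secondary count per primary unit, e.g. frames per video."""
    primary, secondary = parse_count_cell(cell)
    if secondary is None or primary == 0:
        raise ValueError(f"cell has no usable secondary count: {cell!r}")
    return secondary / primary


roadtext_train = mean_per_unit("500 (150,000)")   # 150,000 frames / 500 videos = 300.0
ic13_train = mean_per_unit("25 (13450)")          # 13,450 frames / 25 videos = 538.0
```

These averages are simple arithmetic on the listed totals; they say nothing about how frame counts vary between individual videos.
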
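
## Synthetic Text

| Year/Venue | Name | Task | #Train | #Val | #Test | Granularity | Annotation Form | Language | Scene | Paper | Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2016/CVPR | Synth800k | Det. & Rec. | 858,750 (7,266,866) |  |  | Char & Word & Line | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | Synthetic |  | 41GB |
| 2020 | UnrealText |  | 728,000 En + 674,000 others |  |  |  |  | multi-lingual |  |  |  |
| - | Chinese_ocr | Det. & Rec. | ~364 million |  |  |  |  | Chinese | Document |  |  |
| - | UPTI |  |  |  |  |  |  | Urdu |  |  |  |
| - | APTI |  | 45313600 (> 250 million chars) |  |  | Word |  | Arabic |  |  |  |
| 2021/ICDAR | SynthTiger | Rec. |  |  |  |  |  |  |  |  |  |
| 2021/ICDAR | DocSynth |  |  |  |  |  |  |  |  |  |  |

Where a row has a single count, the source lists no train/val/test split; the count is shown in the first count column.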