https://github.com/xinke-wang/ocrdatasets

A collection of OCR-related datasets

# OCR Datasets

This repo collects OCR-related datasets. The datasets are grouped into six types: **Natural Scene Text**, **Document Text**, **Handwritten Text**, **Historical Document Text**, **Video Text**, and **Synthetic Text**.

![OCR Dataset Type](https://user-images.githubusercontent.com/45810070/188843040-e8d95f94-ef00-406d-b470-2ca83fa0d3cd.png)

- **Natural Scene Text**: images are usually taken in natural scenes, so the difficulty lies in complex lighting variations, shooting angles, blur, varied fonts, etc.
- **Document Text**: focuses only on document images; the main difficulty is the variety of typesetting.
- **Historical Document Text**: is usually designed to assist social science research. For example, digitized antiquarian documents help preserve historical materials and help scholars conduct related research.
- **Video Text**: aims at recognizing text in videos, which introduces temporal information into the OCR task.
- **Synthetic Text**: generates images containing text, together with the corresponding annotations, by rendering text in different fonts into natural photos. This type of dataset usually includes hundreds of thousands of samples because it does not require human annotation. However, because rendering techniques are limited, there is usually a large domain gap between synthetic images and authentic samples, so these datasets are often used for pre-training only.


## Natural Scene Text

| Year/Venue | Name | Task | #Train (#words) | #Val (#words) | #Test (#words) | Granularity | Annotation Format | Language | Scene | Paper | Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2003-05/ICDAR | IC03/IC05 | Det. & Rec. | 258 (1110) | N/A | 251 (1156) | Word | Rect [x, y, w, h, "transcript"] | English | Natural | PDF | 112MB |
| 2011-15/ICDAR | Born-Digital-Image (IC2011-2015) | Det. & Rec. & Seg. | 410 (3564) | N/A | 141 (1439) | Word & Pixel | Rect [x, y, w, h, "transcript"] | English | Natural/Web/Email | PDF | 40MB |
| 2013-15/ICDAR | Focused Scene Text (IC13) | Det. & Rec. & Seg. | 229 (848) | N/A | 233 (1095) | Word & Pixel | Rect [x1, y1, x2, y2, "transcript"] & SegMap | English | Natural | PDF | 250MB |
| 2015/ICDAR | Incidental Scene Text (IC15) | Det. & Rec. | 1,000 (4468) | N/A | 500 (2077) | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | Natural | PDF | 130MB |
| 2017/ICDAR | Multi-Lingual Scene Text (MLT2017) | Det. & Rec. | 7,200 | 1,800 | private | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, Lan, 'trans'] | multi-lingual | Natural | - | 12GB |
| 2019/ICDAR | Multi-Lingual Scene Text (MLT2019) | Det. & Rec. | 10,000 | N/A | 10,000 | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, Lan, 'trans'] | multi-lingual | Natural | PDF | ~12GB |
| 2017/ICDAR | COCO-Text v2.0 | Det. & Rec. | 43,686 | 10,000 | 10,000 | Word | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | En & NonEn | Natural | PDF | 13GB |
| 2019/ICDAR | ReCTS | Det. & Rec. | 20,000 | N/A | 5,000 | Word/Line | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | Chinese | Signboard | - | ~2.5GB |
| 2017/ICDAR | Total-Text | Det. & Rec. | 1255 | N/A | 300 | Word & Pixel | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | English | Natural | PDF | 441MB |
| 2019/PR | SCUT-CTW1500 | Det. & Rec. | 1,000 | N/A | 500 | Line | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | En & Ch | Natural | PDF | 800MB |
| 2019/ICDAR | Arbitrary-Shaped Text (ART) | Det. & Rec. | 5,603 (50,029) | N/A | 4,563 (52,631) | Word(En)/Line(CH) | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], Lan, 'trans'] | En & Ch | Natural | - | 4.4GB |
| 2017/ICDAR | RCTW-17 (CTW-12k) | Det. & Rec. | 11514 | N/A | 1000 | Line | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | Chinese | Mixture | PDF | 11GB |
| 2019/ICDAR/ICCV | Large-scale Street View Text (LSVT) | Det. & Rec. | 30,000 | N/A | 20,000 | Line | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | En & Ch | Street View | PDF | 14GB |
| 2016/DAS | MLe2e | Det. & Script Identification | 450 | N/A | 261 | Word | Rect [x1, y1, x2, y2, language] | multi-lingual | Natural | PDF | 82MB |
| 2017/ICDAR | IIIT-ILST | Det. & Rec. | 893 | | | Word | Rect [x, y, w, h, "transcript"] | Indic | Google Images | PDF | 609MB |
| 2017/CVPRW | UberText | Det. & Rec. | 117,969 (571,534) | | | Word | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | English | Street View | PDF | 197GB |
| 2009/VISAPP | Chars74k | Det. & Rec. | 1922 | | | Character | | En & Kannada | Natural Scene | PDF | 739MB |
| 2010/ICPR | KAIST | Det. & Rec. & Seg. | 3000 | | | Char & Word & Pixel | Rect [x, y, w, h, "transcript"] & SegMap | En & Korean | Mixture | PDF | 364MB |
| 2010/ECCV | SVT | Det. & Rec. | 100 (211) | N/A | 250 (514) | Word | Rect [x, y, w, h, "transcript"] | English | Street View | PDF | 118MB |
| 2013/ICCV | SVTP (download code:vnis) | Rec. | 238 (639) | | | - | | English | Street View | PDF | ~1MB |
| 2011/NIPSw | SVHN | Det. & Rec. | 73,257+531,131 | N/A | 26,032 | Character | Rect [x, y, w, h, "transcript"] | Digit | House Number | PDF | ~3GB |
| 2011/ICDARw | NEOCR | Det. | 659 (5,238) | | | Line | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | multi-lingual | Natural Scene | PDF | 1.3GB |
| 2012/CVPR | MSRA-TD500 | Det. | 300 | N/A | 200 | Line | RotRect [ind, difficult, x, y, w, h, theta] | multi-lingual | Street View | PDF | 96MB |
| 2012/BMVC | IIIT 5k-word | Rec. | 380 (2000) | N/A | 740 (3000) | Word | | English | Natural | PDF | 106MB |
| 2014/ESWA | CUTE80 | Rec. | 80 | | | Line | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]]] | English | Street View | PDF | 44MB |
| 2015/TPAMI | USTB-SV1K | Det. & Rec. | 500 | N/A | 500 | Word | RotRect [ind, difficult, x, y, w, h, theta, "trans"] | English | Street View | PDF | 36MB |
| 2019/JCST | Chinese Text in the Wild (CTW) | Det. & Rec. | 25,887 (812,872 chars) | N/A | 3,269 (103,519 chars) | Char & Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | Chinese | Street View | PDF | ~40GB |
| 2019/TITS | ShopSign | Det. & Rec. | 1258 sample images | | | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | Chinese | Signboard | PDF | 3GB |
| 2021/CVPR | TextOCR | Det. & Rec. & VQA | 24902 (822,572) | N/A | 3232 (80,497) | Word | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | English | Natural Scene | PDF | ~8GB |
| 2021/CVPR | VinText | Det. & Rec. | 1,200 | N/A | 300+500 | Word | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | Vietnamese | Natural Scene | PDF | 1GB |
| 2018/Competition | ICPR MTWI2018 | Det. & Rec. | 10,000 | N/A | 10,000 | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | En & Ch | Web Images | PDF | 2GB |
| 2019/Competition | Baidu Chinese Scene Text Recognition Competition | Rec. | 50,000 | N/A | 10,000 | - | [h, w, 'trans'] | En & Ch | Street View | - | |
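
The annotation formats listed for these datasets (Rect, Quad, Polygon, RotRect) describe the same thing at different levels of detail: a region that encloses a text instance. A minimal sketch of normalizing them to one point-list representation is shown below. The function names are hypothetical and not part of any dataset toolkit. The RotRect convention used here (`x, y` as the top-left corner of the unrotated box, `theta` in radians, rotation about the box center) is an assumption and should be checked against each dataset's own documentation.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def rect_xywh_to_polygon(x: float, y: float, w: float, h: float) -> List[Point]:
    """Rect [x, y, w, h] -> four corners, clockwise from top-left."""
    return [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]


def rect_xyxy_to_polygon(x1: float, y1: float, x2: float, y2: float) -> List[Point]:
    """Rect [x1, y1, x2, y2] -> four corners, clockwise from top-left."""
    return [(x1, y1), (x2, y1), (x2, y2), (x1, y2)]


def quad_to_polygon(coords: Sequence[float]) -> List[Point]:
    """Quad [x1, y1, ..., x4, y4] -> list of four (x, y) points."""
    if len(coords) != 8:
        raise ValueError("a quad needs exactly 8 coordinates")
    return [(coords[i], coords[i + 1]) for i in range(0, 8, 2)]


def rotrect_to_polygon(x: float, y: float, w: float, h: float, theta: float) -> List[Point]:
    """RotRect [x, y, w, h, theta] -> four corners.

    Assumed convention: (x, y) is the top-left corner of the unrotated box,
    theta is in radians, and the box is rotated about its center.
    """
    cx, cy = x + w / 2.0, y + h / 2.0
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    offsets = [(-w / 2.0, -h / 2.0), (w / 2.0, -h / 2.0),
               (w / 2.0, h / 2.0), (-w / 2.0, h / 2.0)]
    return [(cx + dx * cos_t - dy * sin_t, cy + dx * sin_t + dy * cos_t)
            for dx, dy in offsets]


def polygon_to_bbox(points: Sequence[Point]) -> Tuple[float, float, float, float]:
    """Any polygon -> axis-aligned bounding box (x_min, y_min, x_max, y_max)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)
```

Converting everything to a point list first means a detector trained on polygon datasets can also consume Rect or Quad datasets, and `polygon_to_bbox` recovers a plain box when a model only supports axis-aligned regions.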



## Document Text

| Year/Venue | Name | Task | #Train | #Val | #Test | Granularity | Annotation Format | Language | Scene | Paper | Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2011/ICDAR | RETAS | No public download link | | | | Char & Word | No public download link | | | - | |
| 2013/IJDAR | LRDE-DBD Document Binarization | Det. & Binarization | 125 | | | Line & Mask | Rect | French | Magazine | PDF | ~700MB |
| 2015/ICDAR | SmartDOC | | 3630 | N/A | 8470 | | | | | PDF | ~30GB |
| 2016/ICFHR | KPTI | Rec. | 11,910 | 2,552 | 2,553 | - | ['transcripts'] | Pashto | Document | PDF | ~100MB |
| 2017/ICDAR | DeText | Det. & Rec. | 100 | 100 | 300 | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | Scientific | PDF | 10MB |
| 2019/ICDAR | SROIE | Det. & Rec. & Info Ext. | 600 | | 400 | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | Receipt | - | <1GB |
| 2019/ICDAR | FUNSD | Det. & Rec. & Info Ext. | 149 | N/A | 50 | Word | Rect [x1, y1, x2, y2, "transcript"] | English | Form | PDF | 16MB |
| 2019/ICDAR | NAF | Det. & Rec. & Info Ext. | 682 | 59 | 63 | Line | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | Form | PDF | |
| 2020 | BID | Det. & Rec. | 28880 | | | Line | Poly | Latin | ID Document | | |
| 2020/ISCSIC | DDI-100 | Det. & Rec. | ~100,000 (70% train, 30% val) | | | Char & Word & Mask | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | Distorted Document | PDF | ~300GB |
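
Several of the datasets above use the Quad form `[x1, y1, x2, y2, x3, y3, x4, y4, 'trans']`. A minimal sketch of reading one such annotation follows. It assumes, hypothetically, that each annotation is stored as one comma-separated text line; the actual on-disk layout differs between datasets, so check each dataset's documentation. `parse_quad_line` is an illustrative name, not an API of any dataset toolkit.

```python
from typing import List, Tuple


def parse_quad_line(line: str) -> Tuple[List[Tuple[int, int]], str]:
    """Parse 'x1,y1,x2,y2,x3,y3,x4,y4,transcription' into (points, text).

    Assumed layout: eight comma-separated coordinates followed by the
    transcription. The split is limited to eight cuts so that commas
    inside the transcription are preserved.
    """
    # Some annotation files start with a byte-order mark; drop it if present.
    cleaned = line.lstrip("\ufeff").strip()
    parts = cleaned.split(",", 8)
    if len(parts) < 9:
        raise ValueError(f"expected 8 coordinates and a transcription: {line!r}")
    coords = [int(float(value)) for value in parts[:8]]
    points = [(coords[i], coords[i + 1]) for i in range(0, 8, 2)]
    return points, parts[8]


points, text = parse_quad_line("10,20,110,20,110,60,10,60,Hello, world")
print(points, text)
```

Limiting the split matters in practice because transcriptions such as prices or addresses often contain commas themselves.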


## Handwritten Text

| Year/Venue | Name | Task | #Train | #Val | #Test | Granularity | Annotation Format | Language | Scene | Paper | Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2008-11/ICDAR | RIMES | No public download link | | | | Word & Line | No public download link | | | | |
| 2010/DAS | HIT-OR3C | Rec. | Char set 832,650 chars / Doc set 77,168 chars | | | - | special format | Chinese | Handwritten | PDF | 1GB |
| 2012/PR | KHATT | Rec. | 8,368 | 1,793 | 1,822 | - | ['transcripts'] | Arabic | Handwritten | PDF | |
| 98-2014 | HANDS | No public download link | | | | | | Japanese | Handwritten | | |
| - | Lao-SABAIDEE | | 500 samples | | | | No public download link | Lao | Handwritten | | |
| 2014/ICFHR | ORAND-CAR/CVL | Rec. | 5,000 | N/A | 5,000 | Word | ['image_name', 'trans'] | Digits | Handwritten Digits | PDF | 194MB |
| 2018/ICFHR | VNOnDB | Rec. | 1,146 paragraphs 7,296 lines | 380,000 chars | | Word/Line/Parag. | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | Vietnamese | Handwritten | PDF | 200MB |
| 2013-16/IJDAR | PE92/SERI95/HanDB (HangulDB) | Rec. | 1200 samples (90% Train/10% Test) | | | | .HGU1 format | Korean | Handwritten | PDF | 800MB |
| 95-2016 | NIST | Rec. | | | | | | English | | | |
| 2011/ICDAR | CASIA-OLHWDB/HWDB | Rec. | | | | | | Chinese | Handwritten | PDF | |
| 2021/ICDAR | IIT-INDIC-HW-WORDS | Rec. | 872,000 instances | | | Word | ['image_name', 'vocab_id'] & vocabulary | Indic | Handwritten | PDF | ~20GB |
| 1999/ICDAR | IAM Handwriting Database | Rec. | 6,161 | 900+940 | 1,861 | Registration is Required | | | | | |
| 2005/ICDAR | IAM ONLINE Handwriting Data | Rec. | 86,272 word instances | | | Registration is Required | | | | | |
| 2018/ICDAR | IAM-MonDo | Rec. | Registration is Required | | | | | | | PDF | |
| 2011-14/ICDAR | CROHME | Rec. | > 10,000 expressions | | | symbol & expression | inkml format, latex | Symbol | Mathematical | PDF | 58MB |
| 2017/ICDAR | MUSCIMA++ | Rec. | 140 | | | | | Symbol | Music Notation | PDF | |
| 2018/Access | SCUT-EPT | Rec. | 40,000 | N/A | 10,000 | | | Chinese | Educational Doc. | PDF | 1.08GB |
| 2020/ICFHR | HHD | Rec. | 3965 | | 1134 | | | Hebrew | | PDF | |
| 2021/ArXiv | IMGUR5K | Det. & Rec. | (~108,000) | (~13,000) | (~14,000) | Word | Rect [x, y, w, h, "transcript"] | English | Handwritten | PDF | - |
| 2021/ArXiv | VML-MOC | Seg. & Rec. | | | | | | Hebrew | | PDF | |
| 2021/ICDAR | Bengali | Rec. | | | | | | Bengali | | PDF | |
| 2021/ICDAR | GNHK | Det. & Rec. | 687 | | | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | | PDF | |


## Historical Document Text

| Year/Venue | Name | Task | #Train | #Val | #Test | Granularity | Annotation Format | Language | Scene | Paper | Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2010-11/DAS | IAM-HistDB | Rec. | 127 | | | Word & Line | ['image_id', 'transcript'] | En & Ger & Latin | | | >200MB |
| 2016/ICFHR | H-KWS (1. Botany 2. AK) | Det. & Rec. | 1849 | 3734 | N/A | Word & Line | Rect [x, y, w, h, "transcript"] | English | | PDF | |
| 2016/ICFHR | READ | Registration is Required | | | | | | German | | PDF | ~600MB |
| 2017/ICFHR | Palm Leaf Manuscript | Det. & Rec. | ~19,000 Balinese + ~20,000 Khmer | | | Char | No public download link | Khmer | Palm Leaf | | |
| 2017/HIP | SleukRith-Set | Det. & Rec. | 658 | | | Char & Word | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'transcript'] | Khmer | Palm Leaf | PDF | 1GB |
| 2019/NCA | ARDIS | Rec. | 10,000 | | | Char & Word | ['transcript'] | Digits | Church Records | PDF | |
| 2019/ICDAR | Pinkas | Det. & Rec. | | | | Word & Line | | Hebrew | Historical Manuscripts | PDF | ~50MB |
| 2020/ICFHR | | | | | | | | Cuneiform | | PDF | |
| 2020/ICFHR | MTHv2 | Det. & Rec. | 2,399 | N/A | 800 | Char & Line | | Chinese | Ancient Book | PDF | 4.6GB |
| 2021/ICDAR | IHR-NomDB | Det. & Rec. | 267 | | | Line | Rect [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | ChuNom | Ancient Book | PDF | |
| 2021/ICDAR | VML-HP | | | | | | | Hebrew | | PDF | |
| | VML-AHTE | | | | | | | | | PDF | |
| 2019/ICDAR | IndiScapes | Seg | No public download link | | | | | Indic | | PDF | |


## Video Text

| Year/Venue | Name | Task | #Train Vids (#frames) | #Val Vids (#frames) | #Test Vids (#frames) | Granularity | Annotation Format | Language | Scene | Paper | Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2013/15/ICDAR | Text in Videos (IC13) | Det. & Rec. | 25 (13450) | | 24 (14374) | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | Natural | PDF | |
| 2015/ICDAR | CVSI2015 | No public link for download | | | | | | multi-lingual | | PDF | |
| 2017/ICDAR | DOST | | | | | Word | QUAD | Japanese | | | |
| 2018/ICFHR | LectureVideoDB | Det. & Rec. | -52,225 | -27,900 | -36,460 | Word | | English | Slides/Paper | PDF | 2.3GB |
| 2020/ICRA | RoadText-1K | Det. & Rec. | 500 (150,000) | 200 (60,000) | 300 (90,000) | Line | Rect [x1, y1, x2, y2, "transcript"] & SegMap | En & NonEn | Road/Traffic | PDF | |
| 2020/ICMV | MIDV-500 & MIDV-2019 | Det. & Rec. & Others | 500 video clips | | | | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | multi-lingual | Document | PDF | 32GB |
| 2021/ICDAR | MIDV-LAIT | Det. & Rec. & Others | | | | | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | multi-lingual | Document | PDF | |
| 2020/ICPR | AcTiVComp | Det. & Rec. | 2557 frames | | | Line | Rect [x1, y1, x2, y2, "transcript"] | Arabic | | | |


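## Synthetic Text

| Year/Venue | Name | Task | #Train | #Val | #Test | Granularity | Annotation Format | Language | Scene | Paper | Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2016/CVPR | Synth800k | Det. & Rec. | 858,750 (7,266,866) | | | Char & Word & Line | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | Synthetic | PDF | 41GB |
| 2020 | UnrealText | | 728,000 En + 674,000 others | | | | | multi-lingual | | | |
| - | Chinese_ocr | Det. & Rec. | ~ 364 million | | | | | Chinese | Document | | |
| - | UPTI | | | | | | | Urdu | | | |
| - | APTI | | 45313600 (> 250 million chars) | | | Word | | Arabic | | | |
| 2021/ICDAR | SynthTiger | Rec. | | | | | | | | PDF | |
| 2021/ICDAR | DocSynth | | | | | | | | | PDF | |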