https://github.com/xinke-wang/ocrdatasets

A collection of OCR-related datasets

datasets documents ocr scene-text-detection scene-text-recognition

Host: GitHub
URL: https://github.com/xinke-wang/ocrdatasets
Owner: xinke-wang
Created: 2022-09-07T07:21:05.000Z (almost 4 years ago)
Default Branch: main
Last Pushed: 2022-09-07T09:57:06.000Z (almost 4 years ago)
Last Synced: 2025-02-27T04:48:03.831Z (over 1 year ago)
Topics: datasets, documents, ocr, scene-text-detection, scene-text-recognition
Homepage:
Size: 19.5 KB
Stars: 150
Watchers: 1
Forks: 5
Open Issues: 0
Metadata Files:
- Readme: README.md

# OCR Datasets

This repo collects OCR-related datasets. The datasets are grouped into six types: **Natural Scene Text**, **Document Text**, **Handwritten Text**, **Historical Document Text**, **Video Text**, and **Synthetic Text**.

![OCR Dataset Type](https://user-images.githubusercontent.com/45810070/188843040-e8d95f94-ef00-406d-b470-2ca83fa0d3cd.png)

- **Natural Scene Text**: Images in this type of dataset are usually taken in natural scenes, so the difficulty lies in complex lighting changes, varied shooting angles, blur, diverse fonts, etc.
- **Document Text**: Focuses only on document images; the difficulty lies in the variety of typesetting.
- **Historical Document Text**: Usually designed to assist social science research. For example, digitized antiquarian documents help preserve historical materials and make it easier for scholars to conduct related research.
- **Video Text**: Aims at recognizing text in videos, which introduces temporal information into the OCR task.
- **Synthetic Text**: Generates images containing text, together with the corresponding annotations, by rendering text in different fonts onto natural photos. This type of dataset usually includes hundreds of thousands of samples since no human annotation is required. However, because current synthesis techniques are limited, there is usually a large domain gap between synthetic images and authentic samples, so these datasets are often used for pre-training only.

## Natural Scene Text

| Year/Venue | Name | Task | #Train (#words) | #Val (#words) | #Test (#words) | Granularity | Annotation Format | Language | Scene | Paper | Size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2003-05/ICDAR | IC03/IC05 | Det. & Rec. | 258 (1110) | N/A | 251 (1156) | Word | Rect [x, y, w, h, "transcript"] | English | Natural | PDF | 112MB |
| 2011-15/ICDAR | Born-Digital-Image (IC2011-2015) | Det. & Rec. & Seg. | 410 (3564) | N/A | 141 (1439) | Word & Pixel | Rect [x, y, w, h, "transcript"] | English | Natural/Web/Email | PDF | 40MB |
| 2013-15/ICDAR | Focused Scene Text (IC13) | Det. & Rec. & Seg. | 229 (848) | N/A | 233 (1095) | Word & Pixel | Rect [x1, y1, x2, y2, "transcript"] & SegMap | English | Natural | PDF | 250MB |
| 2015/ICDAR | Incidental Scene Text (IC15) | Det. & Rec. | 1,000 (4468) | N/A | 500 (2077) | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | Natural | PDF | 130MB |
| 2017/ICDAR | Multi-Lingual Scene Text (MLT2017) | Det. & Rec. | 7,200 | 1,800 | private | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, Lan, 'trans'] | multi-lingual | Natural | - | 12GB |
| 2019/ICDAR | Multi-Lingual Scene Text (MLT2019) | Det. & Rec. | 10,000 | N/A | 10,000 | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, Lan, 'trans'] | multi-lingual | Natural | PDF | ~12GB |
| 2017/ICDAR | COCO-Text v2.0 | Det. & Rec. | 43,686 | 10,000 | 10,000 | Word | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | En & NonEn | Natural | PDF | 13GB |
| 2019/ICDAR | ReCTS | Det. & Rec. | 20,000 | N/A | 5,000 | Word/Line | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | Chinese | Signboard | - | ~2.5GB |
| 2017/ICDAR | Total-Text | Det. & Rec. | 1255 | N/A | 300 | Word & Pixel | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | English | Natural | PDF | 441MB |
| 2019/PR | SCUT-CTW1500 | Det. & Rec. | 1,000 | N/A | 500 | Line | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | En & Ch | Natural | PDF | 800MB |
| 2019/ICDAR | Arbitrary-Shaped Text (ART) | Det. & Rec. | 5,603 (50,029) | N/A | 4,563 (52,631) | Word (En)/Line (Ch) | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], Lan, 'trans'] | En & Ch | Natural | - | 4.4GB |
| 2017/ICDAR | RCTW-17 (CTW-12k) | Det. & Rec. | 11514 | N/A | 1000 | Line | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | Chinese | Mixture | PDF | 11GB |
| 2019/ICDAR/ICCV | Large-scale Street View Text (LSVT) | Det. & Rec. | 30,000 | N/A | 20,000 | Line | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | En & Ch | Street View | PDF | 14GB |
| 2016/DAS | MLe2e | Det. & Script Identification | 450 | N/A | 261 | Word | Rect [x1, y1, x2, y2, language] | multi-lingual | Natural | PDF | 82MB |
| 2017/ICDAR | IIIT-ILST | Det. & Rec. | 893 |  |  | Word | Rect [x, y, w, h, "transcript"] | Indic | Google Images | PDF | 609MB |
| 2017/CVPRW | UberText | Det. & Rec. | 117,969 (571,534) |  |  | Word | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | English | Street View | PDF | 197GB |
| 2009/VISAPP | Chars74k | Det. & Rec. | 1922 |  |  | Character |  | En & Kannada | Natural Scene | PDF | 739MB |
| 2010/ICPR | KAIST | Det. & Rec. & Seg. | 3000 |  |  | Char & Word & Pixel | Rect [x, y, w, h, "transcript"] & SegMap | En & Korean | Mixture | PDF | 364MB |
| 2010/ECCV | SVT | Det. & Rec. | 100 (211) | N/A | 250 (514) | Word | Rect [x, y, w, h, "transcript"] | English | Street View | PDF | 118MB |
| 2013/ICCV | SVTP (download code:vnis) | Rec. | 238 (639) |  |  | - |  | English | Street View | PDF | ~1MB |
| 2011/NIPSw | SVHN | Det. & Rec. | 73,257+531,131 | N/A | 26,032 | Character | Rect [x, y, w, h, "transcript"] | Digit | House Number | PDF | ~3GB |
| 2011/ICDARw | NEOCR | Det. | 659 (5,238) |  |  | Line | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | multi-lingual | Natural Scene | PDF | 1.3GB |
| 2012/CVPR | MSRA-TD500 | Det. | 300 | N/A | 200 | Line | RotRect [ind, difficult, x, y, w, h, theta] | multi-lingual | Street View | PDF | 96MB |
| 2012/BMVC | IIIT 5k-word | Rec. | 380 (2000) | N/A | 740 (3000) | Word |  | English | Natural | PDF | 106MB |
| 2014/ESWA | CUTE80 | Rec. | 80 |  |  | Line | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]]] | English | Street View | PDF | 44MB |
| 2015/TPAMI | USTB-SV1K | Det. & Rec. | 500 | N/A | 500 | Word | RotRect [ind, difficult, x, y, w, h, theta, "trans"] | English | Street View | PDF | 36MB |
| 2019/JCST | Chinese Text in the Wild (CTW) | Det. & Rec. | 25,887 (812,872 chars) | N/A | 3,269 (103,519 chars) | Char & Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | Chinese | Street View | PDF | ~40GB |
| 2019/TITS | ShopSign | Det. & Rec. | 1258 sample images |  |  | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | Chinese | Signboard | PDF | 3GB |
| 2021/CVPR | TextOCR | Det. & Rec. & VQA | 24902 (822,572) | N/A | 3232 (80,497) | Word | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | English | Natural Scene | PDF | ~8GB |
| 2021/CVPR | VinText | Det. & Rec. | 1,200 | N/A | 300+500 | Word | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | Vietnamese | Natural Scene | PDF | 1GB |
| 2018/Competition | ICPR MTWI2018 | Det. & Rec. | 10,000 | N/A | 10,000 | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | En & Ch | Web Images | PDF | 2GB |
| 2019/Competition | Baidu Chinese Scene Text Recognition Competition (百度中文场景文字识别比赛) | Rec. | 50,000 | N/A | 10,000 | - | [h, w, 'trans'] | En & Ch | Street View | - |  |
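
The annotation formats listed for the natural scene datasets mix several box conventions: `Rect [x, y, w, h]`, `Rect [x1, y1, x2, y2]`, `Quad` with eight flat coordinates, and `Polygon` with a list of points. When several datasets are combined, it can help to normalize all of them to one polygon representation. The following is a minimal Python sketch of such a normalization, not code from this repo. The function names are hypothetical, and the sketch assumes that `(x, y)` in the `[x, y, w, h]` form is the top-left corner and that `(x1, y1)` and `(x2, y2)` in the four-value form are opposite corners; both assumptions should be checked against each dataset's own documentation.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def rect_xywh_to_polygon(x: float, y: float, w: float, h: float) -> List[Point]:
    """Rect [x, y, w, h] -> four corners, clockwise from the top-left.

    Assumption: (x, y) is the top-left corner of the box.
    """
    return [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]


def rect_xyxy_to_polygon(x1: float, y1: float, x2: float, y2: float) -> List[Point]:
    """Rect [x1, y1, x2, y2] -> four corners, clockwise from (x1, y1).

    Assumption: (x1, y1) and (x2, y2) are opposite corners of the box.
    """
    return [(x1, y1), (x2, y1), (x2, y2), (x1, y2)]


def quad_to_polygon(coords: Sequence[float]) -> List[Point]:
    """Quad [x1, y1, ..., x4, y4] -> list of four (x, y) points, order kept."""
    if len(coords) != 8:
        raise ValueError("a Quad needs exactly 8 coordinates")
    return [(coords[i], coords[i + 1]) for i in range(0, 8, 2)]


def polygon_to_rect_xyxy(polygon: Sequence[Point]) -> Tuple[float, float, float, float]:
    """Any polygon -> its axis-aligned bounding box (xmin, ymin, xmax, ymax)."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return min(xs), min(ys), max(xs), max(ys)
```

Going from a polygon back to an axis-aligned rectangle loses the shape information of curved or rotated text, so the polygon form is usually the safer common denominator.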

## Document Text

| Year/Venue | Name | Task | #Train | #Val | #Test | Granularity | Annotation Format | Language | Scene | Paper | Size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2011/ICDAR | RETAS | No public download link |  |  |  | Char & Word | No public download link |  |  | - |  |
| 2013/IJDAR | LRDE-DBD Document Binarization | Det. & Binarization | 125 |  |  | Line & Mask | Rect | French | Magazine | PDF | ~700MB |
| 2015/ICDAR | SmartDOC |  | 3630 | N/A | 8470 |  |  |  |  | PDF | ~30GB |
| 2016/ICFHR | KPTI | Rec. | 11,910 | 2,552 | 2,553 | - | ['transcripts'] | Pashto | Document | PDF | ~100MB |
| 2017/ICDAR | DeText | Det. & Rec. | 100 | 100 | 300 | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | Scientific | PDF | 10MB |
| 2019/ICDAR | SROIE | Det. & Rec. & Info Ext. | 600 |  | 400 | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | Receipt | - | <1GB |
| 2019/ICDAR | FUNSD | Det. & Rec. & Info Ext. | 149 | N/A | 50 | Word | Rect [x1, y1, x2, y2, "transcript"] | English | Form | PDF | 16MB |
| 2019/ICDAR | NAF | Det. & Rec. & Info Ext. | 682 | 59 | 63 | Line | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | Form | PDF |  |
| 2020 | BID | Det. & Rec. | 28880 |  |  | Line | Poly | Latin | ID Document |  |  |
| 2020/ISCSIC | DDI-100 | Det. & Rec. | ~100,000 (70% train, 30% val) |  |  | Char & Word & Mask | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | Distorted Document | PDF | ~300GB |

## Handwritten Text

| Year/Venue | Name | Task | #Train | #Val | #Test | Granularity | Annotation Format | Language | Scene | Paper | Size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2008-11/ICDAR | RIMES | No public download link |  |  |  | Word & Line | No public download link |  |  |  |  |
| 2010/DAS | HIT-OR3C | Rec. | Char set 832,650 chars / Doc set 77,168 chars |  |  | - | special format | Chinese | Handwritten | PDF | 1GB |
| 2012/PR | KHATT | Rec. | 8,368 | 1,793 | 1,822 | - | ['transcripts'] | Arabic | Handwritten | PDF |  |
| 98-2014 | HANDS | No public download link |  |  |  |  |  | Japanese | Handwritten |  |  |
| - | Lao-SABAIDEE |  | 500 samples |  |  |  | No public download link | Lao | Handwritten |  |  |
| 2014/ICFHR | ORAND-CAR/CVL | Rec. | 5,000 | N/A | 5,000 | Word | ['image_name', 'trans'] | Digits | Handwritten Digits | PDF | 194MB |
| 2018/ICFHR | VNOnDB | Rec. | 1,146 paragraphs, 7,296 lines, 380,000 chars |  |  | Word/Line/Parag. | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | Vietnamese | Handwritten | PDF | 200MB |
| 2013-16/IJDAR | PE92/SERI95/HanDB (HangulDB) | Rec. | 1200 samples (90% Train/10% Test) |  |  |  | .HGU1 format | Korean | Handwritten | PDF | 800MB |
| 95-2016 | NIST | Rec. |  |  |  |  |  | English |  |  |  |
| 2011/ICDAR | CASIA-OLHWDB/HWDB | Rec. |  |  |  |  |  | Chinese | Handwritten | PDF |  |
| 2021/ICDAR | IIT-INDIC-HW-WORDS | Rec. | 872,000 instances |  |  | Word | ['image_name', 'vocab_id'] & vocabulary | Indic | Handwritten | PDF | ~20GB |
| 1999/ICDAR | IAM Handwriting Database | Rec. | 6,161 | 900+940 | 1,861 | Registration is Required |  |  |  |  |  |
| 2005/ICDAR | IAM ONLINE Handwriting Data | Rec. | 86,272 word instances |  |  | Registration is Required |  |  |  |  |  |
| 2018/ICDAR | IAM-MonDo | Rec. | Registration is Required |  |  |  |  |  |  | PDF |  |
| 2011-14/ICDAR | CROHME | Rec. | > 10,000 expressions |  |  | Symbol & Expression | InkML format, LaTeX | Symbol | Mathematical | PDF | 58MB |
| 2017/ICDAR | MUSCIMA++ | Rec. | 140 |  |  |  |  | Symbol | Music Notation | PDF |  |
| 2018/Access | SCUT-EPT | Rec. | 40,000 | N/A | 10,000 |  |  | Chinese | Educational Doc. | PDF | 1.08GB |
| 2020/ICFHR | HHD | Rec. | 3965 |  | 1134 |  |  | Hebrew |  | PDF |  |
| 2021/ArXiv | IMGUR5K | Det. & Rec. | (~108,000) | (~13,000) | (~14,000) | Word | Rect [x, y, w, h, "transcript"] | English | Handwritten | PDF | - |
| 2021/ArXiv | VML-MOC | Seg. & Rec. |  |  |  |  |  | Hebrew |  | PDF |  |
| 2021/ICDAR | Bengali | Rec. |  |  |  |  |  | Bengali |  | PDF |  |
| 2021/ICDAR | GNHK | Det. & Rec. | 687 |  |  | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English |  | PDF |  |

## Historical Document Text

| Year/Venue | Name | Task | #Train | #Val | #Test | Granularity | Annotation Format | Language | Scene | Paper | Size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2010-11/DAS | IAM-HistDB | Rec. | 127 |  |  | Word & Line | ['image_id', 'transcript'] | En & Ger & Latin |  |  | >200MB |
| 2016/ICFHR | H-KWS (1. Botany 2. AK) | Det. & Rec. | 1849 | 3734 | N/A | Word & Line | Rect [x, y, w, h, "transcript"] | English |  | PDF |  |
| 2016/ICFHR | READ | Registration is Required |  |  |  |  |  | German |  | PDF | ~600MB |
| 2017/ICFHR | Palm Leaf Manuscript | Det. & Rec. | ~19,000 Balinese + ~20,000 Khmer |  |  | Char | No public download link | Khmer | Palm Leaf |  |  |
| 2017/HIP | SleukRith-Set | Det. & Rec. | 658 |  |  | Char & Word | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'transcript'] | Khmer | Palm Leaf | PDF | 1GB |
| 2019/NCA | ARDIS | Rec. | 10,000 |  |  | Char & Word | ['transcript'] | Digits | Church Records | PDF |  |
| 2019/ICDAR | Pinkas | Det. & Rec. |  |  |  | Word & Line |  | Hebrew | Historical Manuscripts | PDF | ~50MB |
| 2020/ICFHR |  |  |  |  |  |  |  | Cuneiform |  | PDF |  |
| 2020/ICFHR | MTHv2 | Det. & Rec. | 2,399 | N/A | 800 | Char & Line |  | Chinese | Ancient Book | PDF | 4.6GB |
| 2021/ICDAR | IHR-NomDB | Det. & Rec. | 267 |  |  | Line | Rect [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | ChuNom | Ancient Book | PDF |  |
| 2021/ICDAR | VML-HP |  |  |  |  |  |  | Hebrew |  | PDF |  |
|  | VML-AHTE |  |  |  |  |  |  |  |  | PDF |  |
| 2019/ICDAR | IndiScapes | Seg. | No public download link |  |  |  |  | Indic |  | PDF |  |

## Video Text

| Year/Venue | Name | Task | #Train Vids (#frames) | #Val Vids (#frames) | #Test Vids (#frames) | Granularity | Annotation Format | Language | Scene | Paper | Size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013-15/ICDAR | Text in Videos (IC13) | Det. & Rec. | 25 (13450) |  | 24 (14374) | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | Natural | PDF |  |
| 2015/ICDAR | CVSI2015 | No public download link |  |  |  |  |  | multi-lingual |  | PDF |  |
| 2017/ICDAR | DOST |  |  |  |  | Word | Quad | Japanese |  |  |  |
| 2018/ICFHR | LectureVideoDB | Det. & Rec. | -52,225 | -27,900 | -36,460 | Word |  | English | Slides/Paper | PDF | 2.3GB |
| 2020/ICRA | RoadText-1K | Det. & Rec. | 500 (150,000) | 200 (60,000) | 300 (90,000) | Line | Rect [x1, y1, x2, y2, "transcript"] & SegMap | En & NonEn | Road/Traffic | PDF |  |
| 2020/ICMV | MIDV-500 & MIDV-2019 | Det. & Rec. & Others | 500 video clips |  |  |  | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | multi-lingual | Document | PDF | 32GB |
| 2021/ICDAR | MIDV-LAIT | Det. & Rec. & Others |  |  |  |  | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | multi-lingual | Document | PDF |  |
| 2020/ICPR | AcTiVComp | Det. & Rec. | 2557 frames |  |  | Line | Rect [x1, y1, x2, y2, "transcript"] | Arabic |  |  |  |
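
Many split cells in the tables above pair a primary count with a secondary count in parentheses, for example `25 (13450)` for videos and frames, or `258 (1110)` for images and words. For anyone who wants to sort or filter the datasets by size, a small parser for that notation can be sketched as follows. This is only an illustration: `parse_split_count` and `SplitCount` are hypothetical names, and the pattern deliberately handles only the plain `N` and `N (M)` forms, returning empty values for cells such as `N/A`, `private`, `-52,225`, or free text like `2557 frames`.

```python
import re
from typing import NamedTuple, Optional


class SplitCount(NamedTuple):
    """Primary count (e.g. images or videos) and optional secondary count."""
    items: Optional[int]
    sub_items: Optional[int]


# Matches "7,200", "~12", "25 (13450)", "500 (150,000)"; nothing else.
_COUNT = re.compile(r"^\s*~?\s*(\d[\d,]*)\s*(?:\(\s*(\d[\d,]*)\s*\))?\s*$")


def _to_int(text: str) -> int:
    """Drop thousands separators and convert to int."""
    return int(text.replace(",", ""))


def parse_split_count(cell: str) -> SplitCount:
    """Parse one split cell; unrecognized cells yield (None, None)."""
    match = _COUNT.match(cell)
    if match is None:
        return SplitCount(None, None)
    items = _to_int(match.group(1))
    sub_items = _to_int(match.group(2)) if match.group(2) else None
    return SplitCount(items, sub_items)


if __name__ == "__main__":
    for cell in ["25 (13450)", "500 (150,000)", "7,200", "N/A"]:
        print(cell, "->", parse_split_count(cell))
```

Returning `None` for anything outside the two simple forms keeps the helper from guessing at cells whose meaning the tables do not spell out.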

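## Synthetic Text

| Year/Venue | Name | Task | #Train | #Val | #Test | Granularity | Annotation Format | Language | Scene | Paper | Size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2016/CVPR | Synth800k | Det. & Rec. | 858,750 (7,266,866) |  |  | Char & Word & Line | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | Synthetic | PDF | 41GB |
| 2020 | UnrealText |  | 728,000 En + 674,000 others |  |  |  |  | multi-lingual |  |  |  |
| - | Chinese_ocr | Det. & Rec. | ~364 million |  |  |  |  | Chinese | Document |  |  |
| - | UPTI |  |  |  |  |  |  | Urdu |  |  |  |
| - | APTI |  | 45313600 (> 250 million chars) |  |  | Word |  | Arabic |  |  |  |
| 2021/ICDAR | SynthTiger | Rec. |  |  |  |  |  |  |  | PDF |  |
| 2021/ICDAR | DocSynth |  |  |  |  |  |  |  |  | PDF |  |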
