https://github.com/VictorAtPL/awesome-receipt-data-extraction

A curated list (and summaries) of awesome research publications on the topic of data extraction from photos of receipts.

# Awesome receipt data extraction

This repository collects resources for building a system that extracts key information from photos of receipts.

## Disclaimer

Quotes and images from the publications listed below, which are available in this GitHub repository, are shared here for educational purposes only. I don't own the copyright to any of these publications. If you want me to remove your publication from this list and repository, please open an issue in this repository.

## List of publications

| Year | Type of document | Title, authors | Works on | Dataset, quantity, country of origin | Receipt detection | Receipt localization | Receipt normalization | Text line segmentation | Optical character recognition | Semantic analysis |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2019.12 | Preprint | [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](reviews/xu2019layout.md)<br>Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou | scanned document images with text segments and their position from OCR | IIT-CDIP<br>6M | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ |
| 2019.09 | Workshop Paper | [Post-OCR parsing: building simple and robust parser via BIO tagging](reviews/hwang2019post.md)<br>Wonseok Hwang, Seonghyeon Kim, Minjoon Seo, Jinyeong Yim, Seunghyun Park, Sungrae Park, Junyeop Lee, Bado Lee, Hwalsuk Lee | receipts' text segments with position from OCR | CORD<br>1000 | ❌ | ❌ | ❌ | ❌ | ❗ | ✔️ |
| 2019.09 | Workshop Paper | [Chargrid-OCR: End-to-end Trainable Optical Character Recognition for Printed Documents using Instance Segmentation](reviews/reisswig2019chargrid.md)<br>Christian Reisswig, Anoop R Katti, Marco Spinaci, Johannes Höhne | printed documents | Proprietary<br>unknown synth + 43k real with noisy labels | ❌ | ❌ | ❌ | ❌ | ✔️ | ❌ |
| 2019.09 | Conference Paper | [EATEN: Entity-aware Attention for Single Shot Visual Text Extraction](reviews/guo2019eaten.md)<br>He Guo, Xiameng Qin, Jiaming Liu, Junyu Han, Jingtuo Liu, Errui Ding | train ticket photos and synthetic images of train tickets, passports and business cards | EATEN<br>2000 real train ticket + synth: 300k train ticket + 100k passport + 200k business card | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ |
| 2019.09 | Conference Paper | [End-to-End Information Extraction by Character-Level Embedding and Multi-Stage Attentional U-Net](reviews/dang2019end.md)<br>Tuan Anh Nguyen Dang, Dat Nguyen Thanh | scanned invoices' and receipts' text with char-level bounding boxes from OCR | Toyota invoices dataset<br>261<br>+<br>Daiichi medical receipts dataset<br>200 | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ |
| 2019.09 | Conference Paper (ICDAR) | [Attend, Copy, Parse End-to-end Information Extraction from Documents](reviews/palm2019attend.md)<br>Rasmus Berg Palm, Florian Laws, Ole Winther | scanned and digitalized invoices' text with char-level bounding boxes from OCR | Proprietary<br>1.2M | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ |
| 2019.09 | Bachelor's thesis | [Separation and Extraction of Valuable Information From Digital Receipts Using Google Cloud Vision OCR](reviews/johansson2019separation.md)<br>Elias Johansson | photos of receipts | Proprietary<br>53 | ❌ | ❌ | ✔️ | ❌ | ❗ | ✔️ |
| 2019.08 | Conference Paper | [Towards Unconstrained End-to-End Text Spotting](reviews/qin2019towards.md)<br>Siyang Qin, Alessandro Bissacco, Michalis Raptis, Yasuhisa Fujii, Ying Xiao | photos of scenes with naturalistic text | Proprietary, SynthText, ICDAR15, COCO-Text, ICDAR-MLT and Total-Text<br>30k, 200k, 1k, 17k, 7k and 1255 | ❌ | ❌ | ❌ | ✔️ | ✔️ | ❌ |
| 2019.07 | Conference Paper (CBMI) | [Receipt automatic reader](reviews/maslowa2019receipt.md)<br>Olga Maslova, Louis Klein, Damien Dabernat, A Benoit, Patrick Lambert | photos of receipts | Proprietary<br>1200 (receipt detection and segmentation)<br>+<br>15 (text recognition quality) | ✔️ | ✔️ | ✔️ | ✔️ | ❗ | ❌ |
| 2019.06 | Preprint | [CUTIE: Learning to Understand Documents with Convolutional Universal Text Information Extractor](reviews/zhao2019cutie.md)<br>Xiaohui Zhao, Endi Niu, Zhuo Wu, and Xiaoguang Wang | receipts' text from OCR | Proprietary<br>4484, Spain<br>+<br>SROIE 2019<br>1000 | ❌ | ❌ | ❌ | ❌ | ❗ | ✔️ |
| 2019.06 | Conference Paper | [A Multitask Network for Localization and Recognition of Text in Images](reviews/sarshogh2019multi.md)<br>Mohammad Reza Sarshogh, Keegan E. Hines | synthetically-generated documents | Proprietary<br>10000 | ❌ | ❌ | ❌ | ✔️ | ✔️ | ❌ |
| 2019.06 | Journal Article | [Visual-Linguistic Methods for Receipt Field Recognition](reviews/gal2018visual.md)<br>Rinon Gal, Nimrod Morag, Roy Shilkrot | scanned invoices' and receipts' text with char-level bounding boxes from OCR | Proprietary<br>5094 | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ |
| 2019.05 | Conference Paper | [Deep Learning Approach for Receipt Recognition](reviews/le2019deep.md)<br>Le Duc, Anh & Pham, Dung & Nguyen, Tuan | scanned receipts | SROIE 2019<br>1000 | ❌ | ✔️ | ❌ | ✔️ | ✔️ | ❌ |
| 2019.04 | Conference Paper (ESANN) | [A document detection technique using convolutional neural networks for optical character recognition systems](reviews/dobai2019document.md)<br>Lorand Dobai, Mihai Teletin | photos of receipts | Proprietary<br>6700 | ❌ | ✔️ | ✔️ | ❌ | ❌ | ❌ |
| 2019.03 | Conference Paper | [Graph Convolution for Multimodal Information Extraction from Visually Rich Documents](reviews/liu2019graph.md)<br>Xiaojing Liu, Feiyu Gao, Qiong Zhang, Huasha Zhao | receipts' text segments from OCR | Value-Added Tax Invoices (VATI)<br>3000<br>+<br>International Purchase Receipts (IPR)<br>1500 | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ |
| 2018.11 | Conference Paper (ICPR) | [A Novel Integrated Framework for Learning both Text Detection and Recognition](reviews/sui2018novel.md)<br>Wanchen Sui, Qing Zhang, Jun Yang, Wei Chu | business card photographs and scanned handwritten text | Chinese Business Card Database<br>20k<br>+<br>IAM Handwriting Database<br>747 | ❌ | ❌ | ❌ | ✔️ | ✔️ | ❌ |
| 2018.08 | Conference Paper | [Chargrid: Towards Understanding 2D Documents](reviews/katti2018chargrid.md)<br>Anoop Raveendra Katti, Christian Reisswig, Cordula Guder, Sebastian Brarda, Steffen Bickel, Johannes Höhne, Jean Baptiste Faddoul | scanned invoices' text with char-level bounding boxes from OCR | Proprietary<br>12000 | ❌ | ❌ | ❌ | ❌ | ❗ | ✔️ |
| 2018.03 | Conference Paper | [Optical Character Recognition Engine to extract Food-items and Prices from Grocery Receipt Images via Templating and Dictionary-Traversal Technique](reviews/sohani2018optical.md)<br>Ali Sohani, Rafi Ullah, Faraz Ali, Athaul Rai, Richard Messier | photos of receipts | N/A | ❌ | ✔️ | ✔️ | ❌ | ❗ | ✔️ |
| 2018.02 | Journal Article | [OCR Engine to Extract Food-Items, Prices, Quantity, Units from Receipt Images, Heuristics Rules Based Approach](reviews/ullah2018ocr.md)<br>Rafi Ullah, Ali Sohani, Athaul Rai, Faraz Ali, Richard Messier | photos of receipts | N/A | ❌ | ✔️ | ✔️ | ❌ | ❗ | ✔️ |
| 2018 | Bachelor's thesis | [Utilize OCR text to extract receipt data and classify receipts with common Machine Learning algorithms](reviews/odd2018utilize.md)<br>Joel Odd, Emil Theologou | receipts' text from OCR | Proprietary<br>556, Sweden | ❌ | ❌ | ❌ | ❌ | ❗ | ✔️ |
| 2018 | Journal Article | [Preprocessing Photos of Receipts for Recognition](reviews/korobacz2018preprocessing.md)<br>Wojciech Korobacz, Marek Tabędzki | photos of receipts | Proprietary<br>240 | ❌ | ✔️ | ✔️ | ❌ | ❗ | ❌ |
| 2018 | Preprint | [Automated Receipt Image Identification, Cropping, and Parsing](reviews/yue2018automated.md)<br>Alex Yue | photos of receipts | Proprietary<br>50 | ❌ | ✔️ | ✔️ | ❌ | ❗ | ✔️ |
| 2017.12 | Conference Paper | [OCR Engine to extract Food-items and Prices from Receipt Images via Pattern matching and heuristics approach](reviews/ullah2017ocr.md)<br>Rafi Ullah, Ali Sohani, Faraz Ali, Athaul Rai | photos of receipts | N/A | ❌ | ✔️ | ✔️ | ❌ | ❗ | ✔️ |
| 2017.10 | Conference Paper | [Deep Learning for automatic sale receipt understanding](reviews/raoui2017deep.md)<br>Rizlene Raoui-Outach, Cecile Million-Rousseau, Alexandre Benoit and Patrick Lambert | photos of receipts | Proprietary<br>3000 | ✔️ | ✔️ | ✔️ | ✔️ | ❗ | ❗ |
| 2017.09 | Conference Paper (ICPR) | [Fused Text Segmentation Networks for Multi-oriented Scene Text Detection](reviews/dai2017fused.md)<br>Yuchen Dai, Zheng Huang, Yuting Gao, Youxuan Xu, Kai Chen, Jie Guo, Weidong Qiu | photos of scenes with naturalistic text | SynthText<br>160k | ❌ | ❌ | ❌ | ✔️ | ❌ | ❌ |
| 2016.07 | Bachelor's thesis | [Optical Character Recognition on supermarket receipts](reviews/ziegaus2016optical.md)<br>Marco Ziegaus | scanned receipts | Proprietary<br>39 | ❌ | ❌ | ✔️ | ✔️ | ✔️ | ✔️ |
| 2015.08 | Journal Article | [OCR accuracy improvement on document images through a novel pre-processing approach](reviews/harraj2015ocr.md)<br>Abdeslam El Harraj, Naoufal Raissouni | scanned documents | MTDB<br>500 | ❌ | ❌ | ✔️ | ❌ | ❌ | ❌ |
| 2015 | Preprint | [Mobile Scanner and OCR (A first step towards receipt to spreadsheet)](reviews/nshuti2015mobile.md)<br>Clement Ntwari Nshuti | photos of documents | Proprietary<br>77 | ❌ | ✔️ | ✔️ | ❌ | ❗ | ❌ |
| 2014 | Preprint | [A Novel Machine Learning Based Approach for Retrieving Information from Receipt Images](reviews/szabo2014novel.md)<br>Roland Szabo | photos of receipts | Proprietary<br>20 | ❌ | ✔️ | ❌ | ✔️ | ✔️ | ❌ |
| 2012.09 | Conference Paper | [Receipts2Go: The Big World of Small Documents](reviews/janssen2012receipts.md)<br>Bill Janssen, Eric Saund, Eric A. Bier, Patricia Wall, Mary Ann Sprague | photos of receipts | N/A | ❌ | ✔️ | ✔️ | ❌ | ❗ | ✔️ |

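The six stage columns in the table can be read as a per-paper coverage matrix. The following is a minimal sketch, not part of this repository, of how one might encode a few rows and look up publications that cover a given set of pipeline stages. The `Stage`, `Publication`, and `covering` names are hypothetical, and only cells marked ✔️ are recorded as covered; cells marked ❗ are left out.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    """Pipeline stages, named after the table's column headers."""
    DETECTION = "Receipt detection"
    LOCALIZATION = "Receipt localization"
    NORMALIZATION = "Receipt normalization"
    LINE_SEGMENTATION = "Text line segmentation"
    OCR = "Optical character recognition"
    SEMANTIC_ANALYSIS = "Semantic analysis"


@dataclass(frozen=True)
class Publication:
    year: str
    title: str
    covered: frozenset  # stages marked with a check mark in the table


# Three rows copied from the table (hypothetical in-memory representation).
PUBLICATIONS = [
    Publication(
        "2019.12",
        "LayoutLM: Pre-training of Text and Layout for Document Image Understanding",
        frozenset({Stage.SEMANTIC_ANALYSIS}),
    ),
    Publication(
        "2019.05",
        "Deep Learning Approach for Receipt Recognition",
        frozenset({Stage.LOCALIZATION, Stage.LINE_SEGMENTATION, Stage.OCR}),
    ),
    Publication(
        "2016.07",
        "Optical Character Recognition on supermarket receipts",
        frozenset({Stage.NORMALIZATION, Stage.LINE_SEGMENTATION,
                   Stage.OCR, Stage.SEMANTIC_ANALYSIS}),
    ),
]


def covering(stages, publications=PUBLICATIONS):
    """Return titles of publications that cover every stage in `stages`."""
    wanted = frozenset(stages)
    return [p.title for p in publications if wanted <= p.covered]


if __name__ == "__main__":
    for title in covering([Stage.OCR, Stage.SEMANTIC_ANALYSIS]):
        print(title)
```

Extending the sketch to the full table would only require adding the remaining rows to `PUBLICATIONS`.
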
## Citations

Citations in BibTeX format are available in [references.bib](references.bib).

## To read

### High priority

* TBA

### Low priority

* [Expense Control: A Gamified, Semi-Automated, Crowd-Based Approach For Receipt Capturing](https://www.researchgate.net/publication/311492118_Expense_Control_A_Gamified_Semi-Automated_Crowd-Based_Approach_For_Receipt_Capturing)

* [BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding](https://arxiv.org/pdf/1909.04948.pdf)

* [CloudScan - A configuration-free invoice analysis system using recurrent neural networks](https://arxiv.org/pdf/1708.07403.pdf)

* [Segmentation, Labeling and Optical Character Recognition Applied on Receipt Images](https://upcommons.upc.edu/bitstream/handle/2117/97003/OCR_Project_claudi.pdf?sequence=1&isAllowed=y)

* [[D] Long-term Text-Recognition?](https://www.reddit.com/r/MachineLearning/comments/8krt45/d_longterm_textrecognition/)

* [Find receipts, warp perspective and OCR with Tesseract JS in browser](https://www.reddit.com/r/computervision/comments/92lnkv/find_receipts_warp_perspective_and_ocr_with/)

* [Survey Of Receipt Identification And Classification Using Machine Learning](https://archives.ourheritagejournal.com/index.php/oh/article/download/2424/2270/)

* TBA