Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-document-understanding
A curated list of resources on Document Understanding (DU)
https://github.com/tstanislawek/awesome-document-understanding
Resources
- The RVL-CDIP Dataset - dataset consists of 400,000 grayscale images in 16 classes, with 25,000 images per class
- The Industry Documents Library - a portal to millions of documents created by industries that influence public health, hosted by the UCSF Library
- Color Document Dataset - from the Intelligent Sensory Information Systems, University of Amsterdam
- The IIT CDIP Collection - documents from the states' lawsuit against the tobacco industry in the 1990s, around 7 million documents in total
- OCRmyPDF - OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted
- parsing-prickly-pdfs - Resources and worksheet for the NICAR 2016 workshop of the same name
- borb - a pure Python library to read, write and manipulate PDF documents. It represents a PDF document as a JSON-like data structure of nested lists, dictionaries and primitives (numbers, strings, booleans, etc.).
- pawls - PDF Annotations with Labels and Structure is software that makes it easy to collect a series of annotations associated with a PDF document
- pdfplumber - Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging
- Pdfminer.six - Pdfminer.six is a community-maintained fork of the original PDFMiner. It is a tool for extracting information from PDF documents. It focuses on getting and analyzing text data
- Layout Parser - Layout Parser is a deep-learning-based tool for document image layout analysis tasks
- Tabulo - Table extraction from images
- PDFBox - The Apache PDFBox library is an open source Java tool for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents
- PdfPig - This project allows users to read and extract text and other content from PDF files. In addition the library can be used to create simple PDF documents containing text and geometrical shapes. This project aims to port PDFBox to C#
- pdf-text-extraction-benchmark - PDF tools benchmark
- Born digital pdf scanner - checks whether a PDF is born-digital
- OpenContracts - an open-source PDF annotation platform for visually rich documents that preserves the original layout and exports x,y positional data for tokens as well as span starts and stops. Based on PAWLs, but with a Python-based backend and readily deployable on your local machine, company intranet or the web via Docker Compose.
- deepdoctection - tuning, evaluating and running models.
- pydoxtools - composition library for document analysis. It features an extensive toolset for building complex document analysis pipelines and recognizes most document formats out of the box. It supports typical NLP tasks such as keyword extraction, summarization and question answering, features a high-quality, low-CPU/memory table extraction algorithm, and makes NLP batch operations on a cluster easy.
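
Several of the libraries above (borb, for example) describe a PDF as a JSON-like structure of nested lists, dictionaries and primitives. The sketch below only illustrates that idea with the Python standard library: the `document` layout, its keys and the `walk` helper are hypothetical and are not the object model or API of borb or any other listed tool.

```python
from typing import Any, Iterator, Tuple

# Hypothetical stand-in for a parsed PDF: nested dicts, lists and primitives.
# The keys are illustrative only, not any library's real object model.
document = {
    "info": {"title": "Invoice 001", "pages": 2},
    "pages": [
        {"number": 1, "text": ["Invoice", "Total: 42.00"]},
        {"number": 2, "text": ["Terms and conditions"]},
    ],
}


def walk(node: Any, path: str = "$") -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) for every primitive leaf in a nested structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from walk(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from walk(value, f"{path}[{index}]")
    else:
        # Numbers, strings, booleans and None end the recursion.
        yield path, node


# Flatten the structure into a path -> value mapping and collect all strings.
leaves = dict(walk(document))
strings = [value for value in leaves.values() if isinstance(value, str)]
```

Flattening to path/value pairs is one simple way to inspect or search such a structure; a real library would typically expose its own typed objects and accessors instead.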
Conferences, workshops
- [2021 - pat.org/ICDAR2017/index.php)]
- [2021 - 64/) ]
- [2021 - 2019), [2018](https://www.aclweb.org/anthology/W18-31.pdf) ]
- ACM International Conference on AI in Finance (ICAIF)
- CVPR 2020 Workshop on Text and Documents in the Deep Learning Era
- KDD Workshop on Machine Learning in Finance (KDD MLF 2020)
- FinIR 2020: The First Workshop on Information Retrieval in Finance
- 2nd KDD Workshop on Anomaly Detection in Finance (KDD 2019)
- Document Understanding Conference (DUC 2007)
- The AAAI-21 Workshop on Scientific Document Understanding (SDU 2021)
- [2020 - international-workshop-scientific-document-analysis) ]
- The AAAI-21 Workshop on Knowledge Discovery from Unstructured Data in Financial Services
- First Workshop on Scholarly Document Processing (SDProc 2020)
Blogs
- A Survey of Document Understanding Models
- Document Form Extraction
- How to automate processes with unstructured data
- A Comprehensive Guide to OCR with RPA and Document Understanding
- Information Extraction from Receipts with Graph Convolutional Networks
- How to extract structured data from invoices
- To apply AI for good, think form extraction
- UiPath Document Understanding Solution Architecture and Approach
- How Can I Automate Data Extraction from Complex Documents?
- LegalTech: Information Extraction in legal documents
- Extracting Structured Data from Templatic Documents
Solutions
Document Question Answering
Keywords
- pdf (7)
- python (6)
- ocr (4)
- 2 each: library, pdf-generation, deep-learning, document-layout-analysis, layout-analysis, table-detection, table-recognition, tensorflow, tesseract, pdfbox, document-analysis, extraction, nlp
- 1 each: pdf-table-extraction, luminoth, faster-r-cnn, detection, object-detection, layout-parser, layout-detection, publaynet, pubtabnet, document-image-processing, detectron2, pytorch, computer-vision, parser, table-extraction, pdf-parsing, typesetting, sdk, python3, chatgpt, pdf-library, document-extraction, pdf-converter, pdf-conversion, information-retrieval, llm, tex, evaluation, benchmark, arxiv, pdf-files, pdf-extractor, pdf-document-processor, pdf-document