https://github.com/ispras/dedoc
Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser
doc document-analysis document-content-extraction documents docx docx-parser excel html html-parser logical-structure-extraction ocr odt pdf pdf-parser scanned-documents table-of-contents table-recognition txt
- Host: GitHub
- URL: https://github.com/ispras/dedoc
- Owner: ispras
- License: apache-2.0
- Created: 2020-12-07T13:53:27.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2025-02-14T14:21:39.000Z (3 months ago)
- Last Synced: 2025-04-01T12:03:51.144Z (about 2 months ago)
- Topics: doc, document-analysis, document-content-extraction, documents, docx, docx-parser, excel, html, html-parser, logical-structure-extraction, ocr, odt, pdf, pdf-parser, scanned-documents, table-of-contents, table-recognition, txt
- Language: Python
- Homepage:
- Size: 229 MB
- Stars: 226
- Watchers: 12
- Forks: 26
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE.txt