https://github.com/justmars/start-ocr
Applying pdfplumber + opencv + pytesseract to extract content and metadata from formal PDF files.
- Host: GitHub
- URL: https://github.com/justmars/start-ocr
- Owner: justmars
- Created: 2024-02-08T01:27:23.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-02-08T01:41:41.000Z (9 months ago)
- Last Synced: 2024-10-13T16:52:49.342Z (about 1 month ago)
- Language: Python
- Homepage:
- Size: 3.15 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# start-ocr
![Github CI](https://github.com/justmars/start-ocr/actions/workflows/ci.yml/badge.svg)
1. Applying pdfplumber + opencv + pytesseract to extract content and metadata from formal PDF files.
2. pdfplumber's `page.extract_text_lines()` is experimental, so it may or may not work depending on the PDF file.
3. See [documentation](https://justmars.github.io/start-ocr).

## Installation

```sh
just start
```
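
Because `page.extract_text_lines()` is described above as experimental, one way to cope with files where it yields nothing is to fall back to OCR. The following is a minimal sketch of that idea, not start-ocr's actual API: the helper names `text_lines_or_fallback` and `ocr_page` are hypothetical, and the fallback assumes pytesseract and a Tesseract binary are installed.

```python
from typing import Any, Callable


def text_lines_or_fallback(
    page: Any, fallback: Callable[[Any], list[str]]
) -> list[str]:
    """Return text lines from a pdfplumber page, or call `fallback` if none are found.

    Hypothetical helper for illustration; it is not part of start-ocr.
    """
    try:
        # pdfplumber returns a list of dicts, each carrying a "text" key.
        lines = [line["text"] for line in page.extract_text_lines()]
    except Exception:
        # The method is experimental, so treat any failure as "no lines".
        lines = []
    # Drop blank lines so an all-whitespace result also triggers the fallback.
    lines = [text for text in lines if text.strip()]
    return lines if lines else fallback(page)


def ocr_page(page: Any, resolution: int = 300) -> list[str]:
    """Hypothetical fallback: rasterize the page and OCR it with pytesseract."""
    import pytesseract  # imported lazily so the helper above has no hard dependency

    image = page.to_image(resolution=resolution).original  # a PIL image
    text = pytesseract.image_to_string(image)
    return [line for line in text.splitlines() if line.strip()]


# Example usage, assuming a local file named "sample.pdf":
#
#   import pdfplumber
#   with pdfplumber.open("sample.pdf") as pdf:
#       lines = text_lines_or_fallback(pdf.pages[0], ocr_page)
```

Passing the fallback in as a callable keeps the text-layer path usable without the OCR dependencies.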