Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/justmars/start-ocr

Applying pdfplumber + opencv + pytesseract to extract content and metadata from formal PDF files.

Last synced: 3 days ago

Host: GitHub
URL: https://github.com/justmars/start-ocr
Owner: justmars
Created: 2024-02-08T01:27:23.000Z (9 months ago)
Default Branch: main
Last Pushed: 2024-02-08T01:41:41.000Z (9 months ago)
Last Synced: 2024-10-13T16:52:49.342Z (about 1 month ago)
Language: Python
Homepage:
Size: 3.15 MB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# start-ocr

![GitHub CI](https://github.com/justmars/start-ocr/actions/workflows/ci.yml/badge.svg)

1. Applying pdfplumber + opencv + pytesseract to extract content and metadata from formal PDF files.
2. pdfplumber's `page.extract_text_lines()` is experimental, so it may or may not work depending on the PDF file.
3. See [documentation](https://justmars.github.io/start-ocr).
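
Because `page.extract_text_lines()` may return nothing useful for some files, one way to combine the three libraries is to try pdfplumber first and fall back to OCR. The following is a minimal sketch of that idea only; it is not start-ocr's own API. The function names `extract_lines` and `ocr_page`, the `resolution` default, and the choice of Otsu thresholding are assumptions made for illustration.

```python
from __future__ import annotations

from typing import Any, Callable


def extract_lines(page: Any, ocr_fallback: Callable[[Any], list[str]]) -> list[str]:
    """Return non-empty text lines from a pdfplumber page, else use the OCR fallback.

    Hypothetical helper: `page` is expected to behave like a pdfplumber Page.
    """
    try:
        # extract_text_lines() returns a list of dicts, each with a "text" key.
        lines = [line["text"] for line in page.extract_text_lines()]
    except Exception:
        # The method is experimental, so treat any failure as "no text found".
        lines = []
    lines = [text for text in lines if text.strip()]
    return lines or ocr_fallback(page)


def ocr_page(page: Any, resolution: int = 300) -> list[str]:
    """Hypothetical OCR fallback: render the page, binarize it, run Tesseract."""
    # Imported lazily so extract_lines() works without the OCR stack installed.
    import cv2
    import numpy as np
    import pytesseract

    # Render the page to a PIL image and convert it to an RGB array for OpenCV.
    rendered = page.to_image(resolution=resolution).original.convert("RGB")
    gray = cv2.cvtColor(np.array(rendered), cv2.COLOR_RGB2GRAY)
    # Otsu thresholding is an assumed preprocessing step, not a requirement.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    text = pytesseract.image_to_string(binary)
    return [line for line in text.splitlines() if line.strip()]


# Example usage (assumes a local file named "sample.pdf"):
#
#     import pdfplumber
#     with pdfplumber.open("sample.pdf") as pdf:
#         print(extract_lines(pdf.pages[0], ocr_page))
```

Passing the fallback as a callable keeps the pdfplumber path usable on its own, and lets the OCR step be swapped or skipped for files where the embedded text is already reliable.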

## Installation

```sh
just start
```