https://github.com/timhanewich/doc-ocr

My HOA provided its documents as a scanned PDF: 160+ pages that I could not search with Ctrl+F. I passed the full document through Azure OCR to extract plain text that I can Ctrl+F on.

Host: GitHub
URL: https://github.com/timhanewich/doc-ocr
Owner: TimHanewich
Created: 2024-05-14T19:04:45.000Z (about 2 years ago)
Default Branch: master
Last Pushed: 2025-05-15T17:43:16.000Z (about 1 year ago)
Last Synced: 2026-01-01T09:08:41.868Z (6 months ago)
Language: C#
Size: 512 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: readme.md

README

# Document OCR

My HOA provided its documents as a scanned PDF: 160+ pages that I could not search with Ctrl+F. I passed the full document through Azure OCR to extract plain text that I can Ctrl+F on. I built this program to loop through each page of the document (each page is an image), call the Azure service that performs OCR, and save the result.
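The loop described above can be sketched roughly as follows. This is a minimal Python illustration, not the repository's C# code: `run_ocr` is a hypothetical stand-in for the call to the Azure OCR service, and the `page`/`text` field names and the zero-padded image file names are assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Callable


def ocr_document(image_dir: Path, run_ocr: Callable[[Path], str], out_dir: Path) -> list[dict]:
    """OCR every page image in image_dir and save the results to out_dir."""
    # Assumption: pages are JPEGs with zero-padded names (001.jpg, 002.jpg, ...),
    # so sorting the file names gives the page order.
    pages = sorted(image_dir.glob("*.jpg"))

    results = []
    for number, image in enumerate(pages, start=1):
        # run_ocr is a placeholder for the real OCR service call.
        text = run_ocr(image)
        results.append({"page": number, "text": text})

    out_dir.mkdir(parents=True, exist_ok=True)
    # Page-level results, one entry per page.
    (out_dir / "result.json").write_text(json.dumps(results, indent=2), encoding="utf-8")
    # The whole document as plain text, pages separated by a blank line.
    full_text = "\n\n".join(r["text"] for r in results)
    (out_dir / "result.txt").write_text(full_text, encoding="utf-8")
    return results
```

Passing the OCR call in as a function keeps the loop independent of any particular service client, so it can be exercised with a fake OCR function before spending API calls on 160+ pages.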

Original documents:

- [Full HOA Documents PDF](https://github.com/TimHanewich/doc-ocr/releases/download/1/Palmero-HOA.pdf)

- [The full PDF split into individual JPEG images](https://github.com/TimHanewich/doc-ocr/releases/download/1/images.zip)

Resulting documents:

- All `ImageReadTask` objects ([this class](./src/ImageReadTask.cs)), each pairing a page number with the OCR result read for that page: [result.json](./results/result.json).

- [The full document](./results/result.txt).
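
With page-level results available, the Ctrl+F use case can also be reproduced in code, with a page number attached to each hit. The Python sketch below assumes each entry has hypothetical `page` and `text` fields; the actual property names of `ImageReadTask` in result.json may differ, and the sample data is made up.

```python
def find_pages(results: list[dict], phrase: str) -> list[int]:
    """Return the page numbers whose OCR text contains phrase (case-insensitive)."""
    needle = phrase.casefold()
    return [r["page"] for r in results if needle in r["text"].casefold()]


# Made-up sample data, not taken from the actual HOA documents.
sample = [
    {"page": 1, "text": "Article I: Definitions"},
    {"page": 2, "text": "Fences require written approval."},
]
print(find_pages(sample, "FENCES"))  # prints [2]
```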
