https://github.com/timhanewich/doc-ocr
My HOA provided its documents as a scanned PDF: 160+ pages that I could not search with Ctrl+F. I passed the full document through Azure OCR to extract plain text that I can search.
- Host: GitHub
- URL: https://github.com/timhanewich/doc-ocr
- Owner: TimHanewich
- Created: 2024-05-14T19:04:45.000Z (about 2 years ago)
- Default Branch: master
- Last Pushed: 2025-05-15T17:43:16.000Z (about 1 year ago)
- Last Synced: 2026-01-01T09:08:41.868Z (6 months ago)
- Language: C#
- Size: 512 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
README
# Document OCR
My HOA provided its documents as a scanned PDF: 160+ pages that I could not search with Ctrl+F. I passed the full document through Azure OCR to extract plain text that I can search. I built this program to loop through each page (an image) of the document, call the Azure service that performs OCR, and then save the result.
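The loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in this repository: `PageRead` is a hypothetical stand-in for the `ImageReadTask` class, and the `ocrPage` delegate is a placeholder for whatever call is made to the Azure OCR service. It also assumes the image paths are supplied in page order.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.Json;
using System.Threading.Tasks;

// Hypothetical stand-in for the repository's ImageReadTask class;
// the real class's members may differ.
public record PageRead(int PageNumber, string Text);

public static class OcrLoopSketch
{
    // ocrPage is a placeholder for the Azure OCR call (an assumption):
    // it takes an image path and returns the recognized text.
    public static async Task<List<PageRead>> ReadAllPagesAsync(
        IEnumerable<string> imagePaths,
        Func<string, Task<string>> ocrPage)
    {
        var results = new List<PageRead>();
        int pageNumber = 1;

        // Process pages one at a time, in the order given.
        foreach (string path in imagePaths)
        {
            string text = await ocrPage(path);
            results.Add(new PageRead(pageNumber, text));
            pageNumber++;
        }

        return results;
    }

    // Writes the page/text pairs as JSON and the concatenated text as one file.
    public static void Save(IReadOnlyList<PageRead> results, string jsonPath, string textPath)
    {
        File.WriteAllText(jsonPath, JsonSerializer.Serialize(results));
        File.WriteAllText(
            textPath,
            string.Join(Environment.NewLine, results.Select(r => r.Text)));
    }
}
```

Passing the OCR call in as a delegate keeps the loop independent of any particular Azure SDK, so it can be exercised with a fake function before spending API calls on 160+ pages.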
Original documents:
- [Full HOA Documents PDF](https://github.com/TimHanewich/doc-ocr/releases/download/1/Palmero-HOA.pdf)
- [The full PDF split into individual JPEG images](https://github.com/TimHanewich/doc-ocr/releases/download/1/images.zip)
Resulting documents:
- [result.json](./results/result.json): all `ImageReadTask` objects ([this class](./src/ImageReadTask.cs)), each pairing a page number with the OCR result for that page.
- [result.txt](./results/result.txt): the full document as plain text.