https://github.com/techytushar/ocr-date-extractor
API to extract dates from documents using OCR
- Host: GitHub
- URL: https://github.com/techytushar/ocr-date-extractor
- Owner: techytushar
- Created: 2019-12-04T15:43:13.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2024-06-26T21:03:52.000Z (7 months ago)
- Last Synced: 2024-06-27T01:11:58.991Z (7 months ago)
- Topics: flask, ocr, python
- Language: Python
- Size: 139 KB
- Stars: 9
- Watchers: 2
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
# OCR Date Extractor
Flask API to extract dates from documents
## How to use
The API provides two routes:
* If you want to pass a Base64-encoded image, send a POST request with payload `{"base_64_image_content": }` to
```
https://ocr-date-extractor.herokuapp.com/extract_date
```
* If you want to pass an image file, send a POST request with payload `{'image': }` to
```
https://ocr-date-extractor.herokuapp.com/extract_date_from_image
```
Python sample code to test out the API:
1. Sending the image as a Base64-encoded string
```python
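import base64
import requests

img_url = "path/to/image.png"  # placeholder (left blank in the original): set to your image path
with open(img_url, 'rb') as f:
    img = base64.b64encode(f.read())  # Base64-encode the raw image bytes
response = requests.post('https://ocr-date-extractor.herokuapp.com/extract_date', data={'base_64_image_content': img})
print(response.content)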
```
2. Directly uploading the file
```python
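import requests

url = "https://ocr-date-extractor.herokuapp.com/extract_date_from_image"
# Open the file in a context manager so the handle is closed after the upload
with open('/Users/tushar/peak/document.png', 'rb') as f:
    # Send the image as multipart/form-data under the 'image' field
    files = [('image', ('document.png', f, 'image/png'))]
    response = requests.post(url, files=files)
print(response.text)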
```
## Working
The project performs the following steps for any given image:
* Rescales the image if it is too large
* Performs thresholding to separate the foreground (the document) from the background
* Finds contours and draws a bounding box around the document present in the image
* Crops the image to keep only the document
* Performs thresholding again to separate the text from the background
* Applies OCR to extract the text
* Uses a regex to extract the date
* Parses the date and returns it in `YYYY-MM-DD` format
## Supported Date Formats
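The regex and parsing steps for the formats covered in this section could look roughly like the sketch below, which uses only the Python standard library. It is an illustration and not the project's actual code: the `PATTERNS` list, the `extract_date` name, the preference for day-first when a numeric date is ambiguous, and the expansion of two-digit years to 20xx are all assumptions made for this example.

```python
import re
from datetime import date

# Month abbreviations mapped to month numbers.
MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}
MONTH_RE = "|".join(MONTHS)

# Hypothetical patterns, one per family of date formats (assumed, not from the project).
PATTERNS = [
    # yyyy-mm-dd or yyyy/mm/dd
    ("ymd", re.compile(r"\b(\d{4})[-/](\d{1,2})[-/](\d{1,2})\b")),
    # dd-mm-yyyy, mm-dd-yyyy and their slash variants
    ("xxy", re.compile(r"\b(\d{1,2})[-/](\d{1,2})[-/](\d{4})\b")),
    # Aug23'19 or Feb 24, 2019
    ("mdy", re.compile(
        rf"\b({MONTH_RE})[a-z]*\.?\s*(\d{{1,2}})\s*[,']\s*(\d{{4}}|\d{{2}})\b", re.I)),
    # 24 May'19
    ("dmy", re.compile(
        rf"\b(\d{{1,2}})\s*({MONTH_RE})[a-z]*\.?\s*[,']?\s*(\d{{4}}|\d{{2}})\b", re.I)),
]


def _year(text):
    """Expand a two-digit year to 20xx (an assumption of this sketch)."""
    year = int(text)
    return year + 2000 if year < 100 else year


def extract_date(text):
    """Return a date found in text as 'YYYY-MM-DD', or None.

    Patterns are tried in the order they appear in PATTERNS.
    """
    for kind, pattern in PATTERNS:
        for match in pattern.finditer(text):
            a, b, c = match.groups()
            if kind == "ymd":
                candidates = [(int(a), int(b), int(c))]
            elif kind == "xxy":
                # Ambiguous order: try day-first, then fall back to month-first.
                candidates = [(int(c), int(b), int(a)), (int(c), int(a), int(b))]
            elif kind == "mdy":
                candidates = [(_year(c), MONTHS[a.lower()], int(b))]
            else:  # "dmy"
                candidates = [(_year(c), MONTHS[b.lower()], int(a))]
            for year, month, day in candidates:
                try:
                    return date(year, month, day).isoformat()
                except ValueError:
                    continue  # not a real calendar date; try the next reading
    return None
```

For example, `extract_date("Invoice date: 24 May'19")` returns `'2019-05-24'`. A date such as `03-04-2019` is read as 3 April under the day-first assumption; a real service would need some other signal (locale, document type) to resolve that ambiguity.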
The following date formats are supported, with some flexibility:
* dd-mm-yyyy
* mm-dd-yyyy
* yyyy-mm-dd
* dd/mm/yyyy
* mm/dd/yyyy
* yyyy/mm/dd
* Aug23'19
* Feb 24, 2019
* 24 May'19
## References
I drew on the following resources:
* Improving OCR Accuracy [Medium](https://medium.com/cashify-engineering/improve-accuracy-of-ocr-using-image-preprocessing-8df29ec3a033)
* [OpenCV Docs](https://docs.opencv.org/3.0-beta/doc/py_tutorials/py_tutorials.html)
* Automatic Canny Edge [PyImageSearch](https://www.pyimagesearch.com/2015/04/06/zero-parameter-automatic-canny-edge-detection-with-python-and-opencv/)