{"id":19380554,"url":"https://github.com/techytushar/ocr-date-extractor","last_synced_at":"2025-04-23T19:33:37.222Z","repository":{"id":111152092,"uuid":"225901513","full_name":"techytushar/ocr-date-extractor","owner":"techytushar","description":"API to extract dates from documents using OCR","archived":false,"fork":false,"pushed_at":"2024-06-26T21:03:52.000Z","size":142,"stargazers_count":10,"open_issues_count":0,"forks_count":3,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-02T19:46:37.299Z","etag":null,"topics":["flask","ocr","python"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/techytushar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-12-04T15:43:13.000Z","updated_at":"2025-03-07T08:19:21.000Z","dependencies_parsed_at":null,"dependency_job_id":"7e66e8ab-4612-4bbe-b531-3beeb9a2548a","html_url":"https://github.com/techytushar/ocr-date-extractor","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/techytushar%2Focr-date-extractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/techytushar%2Focr-date-extractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/techytushar%2Focr-date-extractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/techytushar%2Focr-date-extractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/owners/techytushar","download_url":"https://codeload.github.com/techytushar/ocr-date-extractor/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250499941,"owners_count":21440719,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["flask","ocr","python"],"created_at":"2024-11-10T09:14:16.348Z","updated_at":"2025-04-23T19:33:36.884Z","avatar_url":"https://github.com/techytushar.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# OCR Date Extractor\n\nFlask API to extract dates from documents\n\n## How to use\n\nThe API provides two routes:\n\n* To pass a Base64-encoded image, send a POST request with payload `{\"base_64_image_content\": \u003cbase_64_image_bytes\u003e}` to\n\n```\nhttps://ocr-date-extractor.herokuapp.com/extract_date\n```\n\n* To pass an image file, send a POST request with payload `{'image': \u003cimage_file\u003e}` to\n\n```\nhttps://ocr-date-extractor.herokuapp.com/extract_date_from_image\n```\n\nPython sample code to test the API:\n\n1. Sending the image Base64-encoded\n\n```python\nimport base64\nimport requests\n\nimg_path = '\u003cpath_to_image\u003e'  # replace with the path to your image\n# Read the file and Base64-encode its bytes\nwith open(img_path, 'rb') as f:\n    img = base64.b64encode(f.read())\nresponse = requests.post('https://ocr-date-extractor.herokuapp.com/extract_date', data={'base_64_image_content': img})\nprint(response.content)\n```\n\n2. 
Directly uploading the file\n\n```python\nimport requests\n\nurl = \"https://ocr-date-extractor.herokuapp.com/extract_date_from_image\"\n# Open the document in binary mode and send it as the multipart field 'image'\nwith open('/Users/tushar/peak/document.png', 'rb') as f:\n    files = [('image', ('document.png', f, 'image/png'))]\n    response = requests.post(url, files=files)\nprint(response.text)\n```\n\n## Working\n\nThe project performs the following steps for any given image:\n\n* Rescales the image if it is too large\n* Performs thresholding to separate the foreground (the document) from the background\n* Finds contours and draws a bounding box around the document in the image\n* Crops the image to keep only the document\n* Performs thresholding again to separate the text from the background\n* Applies OCR to extract the text\n* Uses a regex to extract the date\n* Parses the date and returns it in `YYYY-MM-DD` format\n\n## Supported Date Formats\n\nThe following date formats are supported, with some flexibility:\n\n* dd-mm-yyyy\n* mm-dd-yyyy\n* yyyy-mm-dd\n* dd/mm/yyyy\n* mm/dd/yyyy\n* yyyy/mm/dd\n* Aug23'19\n* Feb 24, 2019\n* 24 May'19\n\n## References\n\nI drew on the following resources:\n\n* Improving OCR Accuracy [Medium](https://medium.com/cashify-engineering/improve-accuracy-of-ocr-using-image-preprocessing-8df29ec3a033)\n* [OpenCV Docs](https://docs.opencv.org/3.0-beta/doc/py_tutorials/py_tutorials.html)\n* Automatic Canny Edge [PyImageSearch](https://www.pyimagesearch.com/2015/04/06/zero-parameter-automatic-canny-edge-detection-with-python-and-opencv/)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftechytushar%2Focr-date-extractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftechytushar%2Focr-date-extractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftechytushar%2Focr-date-extractor/lists"}