{"id":29136480,"url":"https://github.com/akshar-raaj/document-processing","last_synced_at":"2025-06-30T11:39:48.709Z","repository":{"id":295421094,"uuid":"976463511","full_name":"akshar-raaj/document-processing","owner":"akshar-raaj","description":"A fast, flexible API for extracting text from PDFs and images using smart file detection and OCR—perfect for automating your document workflows.","archived":false,"fork":false,"pushed_at":"2025-06-11T02:50:21.000Z","size":121,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-06-11T03:57:00.155Z","etag":null,"topics":["ai","artificial-intelligence","document-processing-pipeline","ocr","optical-character-recognition","tesseract","textract"],"latest_commit_sha":null,"homepage":"http://ocr.petprojects.in","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/akshar-raaj.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-02T06:35:52.000Z","updated_at":"2025-06-11T02:50:24.000Z","dependencies_parsed_at":"2025-05-25T13:40:15.116Z","dependency_job_id":"9963be49-8591-4442-90ac-b0a8a7824c92","html_url":"https://github.com/akshar-raaj/document-processing","commit_stats":null,"previous_names":["akshar-raaj/document-processing"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/akshar-raaj/document-processing","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/akshar-raaj%2Fdocument-processing","tags_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/repositories/akshar-raaj%2Fdocument-processing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/akshar-raaj%2Fdocument-processing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/akshar-raaj%2Fdocument-processing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/akshar-raaj","download_url":"https://codeload.github.com/akshar-raaj/document-processing/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/akshar-raaj%2Fdocument-processing/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262766527,"owners_count":23361123,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","artificial-intelligence","document-processing-pipeline","ocr","optical-character-recognition","tesseract","textract"],"created_at":"2025-06-30T11:39:38.180Z","updated_at":"2025-06-30T11:39:48.203Z","avatar_url":"https://github.com/akshar-raaj.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"![Python](https://img.shields.io/badge/python-3670A0?style=for-the-badge\u0026logo=python\u0026logoColor=ffdd54) 
![FastAPI](https://img.shields.io/badge/FastAPI-005571?style=for-the-badge\u0026logo=fastapi)\n![Redis](https://img.shields.io/badge/redis-%23DD0031.svg?style=for-the-badge\u0026logo=redis\u0026logoColor=white)\n![AWS](https://img.shields.io/badge/AWS-%23FF9900.svg?style=for-the-badge\u0026logo=amazon-aws\u0026logoColor=white)\n![Docker](https://img.shields.io/badge/docker-%230db7ed.svg?style=for-the-badge\u0026logo=docker\u0026logoColor=white)\n![GitHub Actions](https://img.shields.io/badge/github%20actions-%232671E5.svg?style=for-the-badge\u0026logo=githubactions\u0026logoColor=white)\n\u003c!--Taken from https://github.com/Ileriayo/markdown-badges--\u003e\n\n![GitHub last commit](https://img.shields.io/github/last-commit/akshar-raaj/document-processing) ![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/akshar-raaj/document-processing/lint.yml)\n\n## What\n\nThis repository powers the following:\n- [http://ocr.petprojects.in](http://ocr.petprojects.in)\n- [http://nlp.petprojects.in](http://nlp.petprojects.in)\n\nThis project provides the following broad functionality:\n- Text Detection\n- Text Extraction\n- OCR (Optical Character Recognition)\n- Text Analysis\n\nIt exposes an API endpoint `/ocr` that takes a PDF or an image as input. It performs OCR on the input if needed, extracts the text, and returns the extracted text.\n\n`/ocr` performs OCR using Tesseract. Another API endpoint, `/textract-ocr`, performs OCR using **AWS Textract**. AWS Textract provides better accuracy on low-quality images, skewed images, and images of handwritten text.\n\nInteractive API documentation is available at `/docs`; see http://ocr-api.petprojects.in/docs. This documentation is generated from an OpenAPI schema.\n\n## How\n\n### Dependencies\n\nThe following Python dependencies make OCR possible.\n\n#### python-magic\nPython interface to libmagic, a file type identification library. 
The Unix `file` command uses libmagic under the hood as well.\nIt uses file headers to identify the file's MIME type.\n\n#### pikepdf\nA PDF manipulation library, based on qpdf.\nIt supports PDF operations such as rotating, cropping, and merging.\n\n#### pytesseract\nPython interface to Tesseract OCR.\nTesseract OCR takes an image as input, extracts text from it, and can write the result in different output formats.\n\n#### pdf2image\nConverts PDF pages to individual images.\nTesseract OCR can only be performed on images, so we need the ability to convert non-searchable PDFs to images before performing OCR.\n\nThis depends on the poppler library.\n\n#### pdfminer.six\nExtracts text from searchable PDFs. In such cases no OCR is needed.\n\n#### boto3\nProvides Python interfaces to AWS services. We use it for AWS Textract.\n\n### AWS Textract\nAWS Textract is a critical component for performing accurate text recognition and detection on low-quality or skewed images.\n\nExample AWS CLI command:\n\n    aws textract detect-document-text --document '{\"S3Object\":{\"Bucket\":\"annals\",\"Name\":\"decathlon-whey.jpeg\"}}' --profile administrator --region ap-south-1 --debug\n\n### nltk\nUsed for Natural Language Processing. We analyse the extracted text to infer:\n- Word Frequency\n- Repetitions and Lexical Diversity\n- Part-of-Speech Tagging\n- Named Entity Recognition\n\nFor advanced purposes, we might explore using spaCy.\n\n### rq\nrq (Redis Queue) is used to enqueue the OCR extraction tasks on a Redis list. Workers running in the background dequeue from this list and invoke the service functions that perform the actual OCR.\n\n### opencv-contrib-python\nProvides computer vision and image processing capabilities. 
We preprocess the image before performing recognition and detection.\nWe apply grayscaling, smoothing and denoising, and thresholding and binarisation.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fakshar-raaj%2Fdocument-processing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fakshar-raaj%2Fdocument-processing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fakshar-raaj%2Fdocument-processing/lists"}