{"id":25876500,"url":"https://github.com/prithivsakthiur/multimodal-ocr","last_synced_at":"2026-06-09T19:31:12.892Z","repository":{"id":274682991,"uuid":"923724106","full_name":"PRITHIVSAKTHIUR/Multimodal-OCR","owner":"PRITHIVSAKTHIUR","description":"Multimodal-OCR is an experimental, high-performance visual reasoning and optical character recognition suite designed to accurately extract text, analyze visual content, and parse complex document structures. Built upon a diverse ecosystem of cutting-edge vision-language models.","archived":false,"fork":false,"pushed_at":"2026-03-22T07:21:33.000Z","size":14189,"stargazers_count":15,"open_issues_count":0,"forks_count":4,"subscribers_count":2,"default_branch":"main","last_synced_at":"2026-03-22T22:21:04.471Z","etag":null,"topics":["gradio","huggingface-models","huggingface-spaces","huggingface-transformers","ocr-recognition","opencv-python","pillow","python","qwen2-5-vl","qwen2-vl-2b","torch","torchvision"],"latest_commit_sha":null,"homepage":"https://huggingface.co/spaces/prithivMLmods/Multimodal-OCR","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PRITHIVSAKTHIUR.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-01-28T18:29:13.000Z","updated_at":"2026-03-22T07:23:49.000Z","dependencies_parsed_at":null,"dependency_job_id":"f8aacf61-3fae-466a-9874-fa8f86ed82bb","html_url":"https://github.com/PRITHIVSAKTHIUR/Multimodal-OCR","commit_stats":null,"previous_names":
["prithivsakthiur/multimodal-ocr"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/PRITHIVSAKTHIUR/Multimodal-OCR","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PRITHIVSAKTHIUR%2FMultimodal-OCR","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PRITHIVSAKTHIUR%2FMultimodal-OCR/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PRITHIVSAKTHIUR%2FMultimodal-OCR/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PRITHIVSAKTHIUR%2FMultimodal-OCR/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PRITHIVSAKTHIUR","download_url":"https://codeload.github.com/PRITHIVSAKTHIUR/Multimodal-OCR/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PRITHIVSAKTHIUR%2FMultimodal-OCR/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34123171,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-09T02:00:06.510Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gradio","huggingface-models","huggingface-spaces","huggingface-transformers","ocr-recognition","opencv-python","pillow","python","qwen2-5-vl","qwen2-vl-2b","torch","torchvision"],"created_at":"2025-03-02T10:29:37.005Z","updated_at":"2026-06-09T19:31:1
2.886Z","avatar_url":"https://github.com/PRITHIVSAKTHIUR.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# **Multimodal-OCR**\n\nMultimodal-OCR is an experimental, high-performance visual reasoning and optical character recognition suite designed to accurately extract text, analyze visual content, and parse complex document structures. Built upon a diverse ecosystem of cutting-edge vision-language models—including architectures based on Qwen2.5-VL, Qwen2-VL, and Cohere's Aya-Vision—this application excels at handling dense documents, multilingual texts, and real-world scene imagery. The suite features a highly customized, interactive web interface that enables users to effortlessly upload screenshots, receipts, and pages for rapid analysis. With built-in support for fully GPU-accelerated inference via Flash Attention 2 and granular manipulation of text generation parameters, Multimodal-OCR provides researchers and developers with a powerful, streamlined environment for testing and deploying robust multimodal AI workflows.\n\n\u003cimg width=\"1920\" height=\"1800\" alt=\"Screenshot 2026-03-22 at 12-44-23 Multimodal OCR - a Hugging Face Space by prithivMLmods\" src=\"https://github.com/user-attachments/assets/42b8c8f0-6903-4a83-96b8-04553fcc1df2\" /\u003e\n\n### **Key Features**\n\n* **Multi-Model Architecture:** Seamlessly switch between specialized vision-language models directly from the interface. Supported models include `Nanonets-OCR2-3B`, `olmOCR-7B-0725`, `RolmOCR-7B`, `Aya-Vision-8B`, and `Qwen2-VL-OCR-2B`.\n* **Custom User Interface:** Features a bespoke, responsive Gradio frontend built with custom HTML, CSS, and JavaScript. 
It includes a drag-and-drop media zone, real-time output streaming, and an integrated advanced settings panel.\n* **Granular Inference Controls:** Fine-tune the AI's output by adjusting text generation parameters such as Maximum New Tokens, Temperature, Top-p, Top-k, and Repetition Penalty.\n* **Output Management:** Built-in actions allow users to instantly copy the raw output text to their clipboard or save the generated response directly as a `.txt` file.\n* **Flash Attention 2 Integration:** Utilizes `kernels-community/flash-attn2` for optimized, memory-efficient inference on compatible GPUs.\n\n### **Repository Structure**\n\n```text\n├── examples/\n│   ├── 1.jpg\n│   ├── 2.jpg\n│   ├── 3.jpg\n│   ├── 4.jpg\n│   └── 5.jpg\n├── app.py\n├── LICENSE\n├── pre-requirements.txt\n├── README.md\n└── requirements.txt\n```\n\n### **Installation and Requirements**\n\nTo run Multimodal-OCR locally, you need to configure a Python environment with the following dependencies. Ensure you have a compatible CUDA-enabled GPU for optimal performance.\n\n**1. Install Pre-requirements**\nRun the following command to update pip to the required version:\n```bash\npip install \"pip\u003e=23.0.0\"\n```\n\n**2. Install Core Requirements**\nInstall the necessary machine learning and UI libraries. 
You can place these in a `requirements.txt` file and run `pip install -r requirements.txt`.\n\n```text\ngit+https://github.com/huggingface/transformers.git@v4.57.6\ngit+https://github.com/huggingface/accelerate.git\ngit+https://github.com/huggingface/peft.git\ntransformers-stream-generator\nhuggingface_hub\nqwen-vl-utils\nsentencepiece\nopencv-python\ntorch==2.8.0\ntorchvision\nmatplotlib\nrequests\nkernels\nhf_xet\nspaces\npillow\ngradio\nav\n```\n\n---\n\n### **Usage**\n\nOnce your environment is set up and the dependencies are installed, you can launch the application by running the main Python script:\n\n```bash\npython app.py\n```\n\nAfter the script initializes the interface, it will provide a local web address (usually `http://127.0.0.1:7860/`) which you can open in your browser to interact with the models. Note that the selected models will be downloaded and loaded into VRAM upon their first invocation.\n\n### **License and Source**\n\n* **License:** Apache License - Version 2.0\n* **GitHub Repository:** [https://github.com/PRITHIVSAKTHIUR/Multimodal-OCR.git](https://github.com/PRITHIVSAKTHIUR/Multimodal-OCR.git)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprithivsakthiur%2Fmultimodal-ocr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fprithivsakthiur%2Fmultimodal-ocr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprithivsakthiur%2Fmultimodal-ocr/lists"}