{"id":28952928,"url":"https://github.com/nsourlos/ocr_and_rag","last_synced_at":"2026-05-06T02:31:23.130Z","repository":{"id":300731477,"uuid":"1006953426","full_name":"nsourlos/OCR_and_RAG","owner":"nsourlos","description":"Tests of OCR and RAG with LLMs","archived":false,"fork":false,"pushed_at":"2025-06-23T09:01:15.000Z","size":22,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-06-23T10:20:28.125Z","etag":null,"topics":["cohere","colpali","document-processing","gemini","information-retrieval","mistral","ocr","openai","pdf-text-extraction","qwen2-vl","rag"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nsourlos.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-23T08:41:13.000Z","updated_at":"2025-06-23T09:01:18.000Z","dependencies_parsed_at":"2025-06-23T10:20:30.870Z","dependency_job_id":"d9e3ba33-fb2e-4af1-8207-fd083c773bb0","html_url":"https://github.com/nsourlos/OCR_and_RAG","commit_stats":null,"previous_names":["nsourlos/ocr_and_rag"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/nsourlos/OCR_and_RAG","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nsourlos%2FOCR_and_RAG","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nsourlos%2FOCR_and_RAG/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nsourlos%2FOCR_and_RAG/releases","manifests_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/repositories/nsourlos%2FOCR_and_RAG/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nsourlos","download_url":"https://codeload.github.com/nsourlos/OCR_and_RAG/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nsourlos%2FOCR_and_RAG/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261528613,"owners_count":23172747,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cohere","colpali","document-processing","gemini","information-retrieval","mistral","ocr","openai","pdf-text-extraction","qwen2-vl","rag"],"created_at":"2025-06-23T18:00:14.901Z","updated_at":"2026-05-06T02:31:23.084Z","avatar_url":"https://github.com/nsourlos.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# OCR and RAG Experiments\n\n## Description\n\nThis project benchmarks and demonstrates a wide range of Optical Character Recognition (OCR) and Retrieval-Augmented Generation (RAG) techniques for extracting, cleaning, and querying information from PDF documents. It covers both text-based and image-based PDFs, with a special focus on handling mathematical equations and complex layouts. 
The notebook provides code and commentary for using and comparing popular OCR tools, PDF parsers, and RAG pipelines, including integration with state-of-the-art LLMs and Vision-Language Models.\n\n---\n\n## Table of Contents\n\n- [Installation](#installation)\n- [Usage](#usage)\n- [Supported Tools \u0026 Models](#supported-tools--models)\n- [Results \u0026 Recommendations](#results--recommendations)\n- [References](#references)\n\n---\n\n## Installation\n\n1. **Clone the repository** and navigate to the project directory.\n\n2. **Install dependencies**:\n\n   ```bash\n   pip install -r requirements.txt\n   ```\n\n3. **Set up API keys**  \n   Create a `.env` file in your project directory with your API keys:\n\n   ```\n   OPENAI_API_KEY_DRACO=your_openai_api_key\n   MISTRAL_API_KEY=your_mistral_api_key\n   GEMINI_API_KEY=your_gemini_api_key\n   COHERE_API_KEY=your_cohere_api_key\n   ```\n\n---\n\n## Usage\n\n1. **Open the notebook**  \n   Launch Jupyter and open `ocr_RAG_tests.ipynb`.\n\n2. **Configure file paths**  \n   Place your PDF files in the `pdfs` directory or update the `files_path` and `pdf_file` variables as needed.\n\n3. **Run the cells**  \n   The notebook is organized into sections for each tool and workflow. You can run all cells or focus on the tools/models you want to benchmark.\n\n4. 
**Output**  \n   - Cleaned and formatted Markdown files are saved to your Desktop.\n   - JSON files with parsed document data are also generated.\n   - Results and recommendations are printed in the notebook.\n\n---\n\n## Supported Tools \u0026 Models\n\nThe notebook includes code and benchmarks for:\n\n- **PDF Text Extraction:**  \n  - `OpenAI OCR`\n  - `marker-pdf`\n  - `docling`\n  - `Pytesseract`\n  - `Mistral OCR`\n  - `surya-ocr`\n  - `alibaba-damo/mgp-str-base`\n  - `LatexOCR`\n  - `zerox`\n  - `Ollama OCR`\n  - `X-PLUG/mPLUG-DocOwl`\n  - `OlmOCR`\n  - `GOT OCR`\n  - `Nougat OCR`\n  - `MegaParse`\n- **RAG \u0026 LLMs:**  \n  - OpenAI GPT-4o, GPT-4o-mini\n  - Gemini 2.5\n  - Cohere Embed v4\n  - ColPali (via Byaldi)\n  - Qwen2-VL-2B/7B\n  - Visual RAG pipelines\n\n---\n\n## Results \u0026 Recommendations\n\n- **Best overall, especially for equations and complex layouts:**  \n  - OpenAI GPT-4o Vision (via API) and `marker-pdf`.\n- **Good for contracts and simple text:**  \n  - `docling`, `mistral`, `pytesseract`.\n- **Visual RAG:**  \n  - ColPali + Qwen2-VL pipeline is promising for multimodal retrieval and QA.\n- **Other tools:**  \n  - Many open-source OCR tools struggle with equations and complex formatting.\n\nSee the notebook for detailed benchmarks and code for each tool.\n\n---\n\n## References\n\n- [OpenAI Cookbook: Parse PDF Docs for RAG](https://cookbook.openai.com/examples/parse_pdf_docs_for_rag)\n- [Marker](https://github.com/VikParuchuri/marker)\n- [Docling](https://github.com/docling-project/docling)\n- [MistralAI](https://docs.mistral.ai/capabilities/document/)\n- [Byaldi (ColPali)](https://github.com/AnswerDotAI/byaldi)\n- [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)\n- [Cohere](https://cohere.com/)\n- [Google Gemini](https://ai.google.dev/)\n\n---\n\n**Note:**  \nSome tools require API keys and/or GPU support. 
See the notebook comments for installation and usage tips for each tool.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnsourlos%2Focr_and_rag","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnsourlos%2Focr_and_rag","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnsourlos%2Focr_and_rag/lists"}