{"id":28760718,"url":"https://github.com/preprocess-co/rag-document-viewer","last_synced_at":"2025-10-05T23:21:09.410Z","repository":{"id":299397146,"uuid":"1002884310","full_name":"preprocess-co/rag-document-viewer","owner":"preprocess-co","description":"RAG Document Viewer is an open-source library that generates high-fidelity file previews for seamless integration into your applications. It provides desktop-level file viewing capabilities for a wide range of document formats","archived":false,"fork":false,"pushed_at":"2025-08-14T10:23:34.000Z","size":1070,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-20T05:29:52.162Z","etag":null,"topics":["document-viewer","rag","rag-document-viewer"],"latest_commit_sha":null,"homepage":"https://preprocess.co/rag-document-viewer","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/preprocess-co.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-16T09:41:09.000Z","updated_at":"2025-08-28T10:26:45.000Z","dependencies_parsed_at":"2025-08-11T18:08:55.023Z","dependency_job_id":"b544e5be-14b4-44fa-892e-338f94b2a002","html_url":"https://github.com/preprocess-co/rag-document-viewer","commit_stats":null,"previous_names":["preprocess-co/rag-document-viewer"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/preprocess-co/rag-document-viewer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/preprocess-co%2Frag-document-viewer","tags_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/preprocess-co%2Frag-document-viewer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/preprocess-co%2Frag-document-viewer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/preprocess-co%2Frag-document-viewer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/preprocess-co","download_url":"https://codeload.github.com/preprocess-co/rag-document-viewer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/preprocess-co%2Frag-document-viewer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278532611,"owners_count":26002399,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-05T02:00:06.059Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["document-viewer","rag","rag-document-viewer"],"created_at":"2025-06-17T06:07:34.210Z","updated_at":"2025-10-05T23:21:09.403Z","avatar_url":"https://github.com/preprocess-co.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# RAG Document Viewer ![V1.1.2](https://img.shields.io/badge/Version-1.1.2-333.svg?labelColor=eee) ![MIT License](https://img.shields.io/badge/License-MIT-333.svg?labelColor=eee)\n\n**RAG Document Viewer** is an open-source library 
that generates high-fidelity file previews for seamless integration into your applications. It provides desktop-level file viewing capabilities for a wide range of document formats, including:\n\n- PDF documents\n- Microsoft Office files (Word, PowerPoint, Excel)\n- OpenOffice documents (ODS, ODT, ODP)\n\nThe library converts these files into interactive HTML-based previews that can be easily embedded into web applications, desktop applications, or any system that supports HTML rendering.\n\n*Developed by [Preprocess Team](https://preprocess.co)*\n\n## How it works\n-   Pass in a file and specify the destination path.\n-   An HTML bundle is created.\n-   You can now embed the viewer in your application with just an `\u003ciframe\u003e`.\n\n**Viewer capabilities:**\n\n1. **High-Fidelity Rendering**: Preserve the exact look-and-feel of PDFs, DOCX, PPTX \u0026 XLSX documents.\n2. **Embed in Seconds**: Generate a self-contained HTML bundle and drop it into an `\u003ciframe\u003e`.\n3. **Precise Highlights**: Pass bounding-box coordinates from your RAG chunks; the viewer auto-scrolls and spotlights them.\n4. **Lightweight \u0026 Secure** - Runs 100 % in-browser. Files are served directly from *your* backend under *your* auth logic, no external servers.\n\n\n**Viewer features:**\n\n\n![RAG Document Viewer Demo](previewer.png)\n\n1.  **Chunk Navigator**: Navigate between highlighted chunks with next/previous controls.\n2.  **Zoom Controls**: Renders the document at the optimal zoom level, and users can zoom in/out as needed.\n3.  **Scrollbar Navigator**: Visual indicators on the scrollbar show highlighted chunk positions; click to jump to a specific chunk.\n4.  
**Chunks Highlighting** - Visual emphasis of the important content part you select.\n\n**Demo:**\n\nWe've created a [demo on Hugging Face](https://preprocess-rag-dv-demo.hf.space/) that lets you see the results you can achieve with your documents.\n\u003e **The demo doesn't have chunk highlighting functionality.**\n\u003e For that feature, you'll need to use a supported provider like [preprocess.co](https://preprocess.co) for document chunking.\n\n---\n\n## 🚀 Quick Start\n\n**1. Install Dependencies**\n```bash\nwget \"https://raw.githubusercontent.com/preprocess-co/rag-document-viewer/refs/heads/main/install.sh\"\nchmod +x install.sh \u0026\u0026 ./install.sh\n```\n\n**2. Install the Library**\n```bash\npip install rag-document-viewer\n```\n\n**3. Create the bundle**\n```python\nfrom rag_document_viewer import RAG_DV\n\n# Generate an HTML viewer\nRAG_DV(\"document.pdf\", \"/static/viewers/document\")\n```\n\n**4. Serve in your application**\n```html\n\u003ciframe\n  src=\"/static/viewers/document/\"\n  width=\"100%\"\n  height=\"800\"\n  style=\"border:0\"\n\u003e\u003c/iframe\u003e\n```\n\n---\n\n## Prerequisites\n\u003e **TL;DR** – *You only need system tools when **building** viewers on your server. Pre-built viewers are pure HTML/JS and have no dependencies.*\n\nBefore you start, make sure the required system dependencies are installed. An `install.sh` convenience script is included for Ubuntu; support for additional operating systems is coming soon.\n\n### 1. System Dependencies\n\u003e For macOS, Windows, and other OSes, please refer to [this guide](./standard.md).\n\nInstall the required libraries:\n```bash\nwget \"https://raw.githubusercontent.com/preprocess-co/rag-document-viewer/refs/heads/main/install.sh\"\nchmod +x install.sh \u0026\u0026 ./install.sh\n```\n\n### 2. Python Library\nInstall the package from PyPI:\n```bash\npip install rag-document-viewer\n# or with Poetry:\n# poetry add rag-document-viewer\n```\n\n### 3. 
Verify Installations\n\nConfirm both system tools are properly installed:\n\n```bash\nlibreoffice --version\n# Expected output:\n# LibreOffice 24.2.7.2 420(Build:2)\n\npdf2htmlEX --version\n# Expected output:\n# pdf2htmlEX version 0.18.8.rc1\n# ...\n```\n\n---\n\n## Usage\n\n### Generate a standard viewer\n\n```python\nfrom rag_document_viewer import RAG_DV\n\n# Generate an HTML viewer\nRAG_DV(file_path=\"document.pdf\", store_path=\"/path/to/viewers/doc1\")\n```\n\n\u003e **Note**: We suggest setting `store_path` to a non-public, internal path and serving the content through a dedicated view. This way, you remain in full control of the authentication logic. See [Handling Authentication](#handling-authentication) for more details.\n\n### Generate a viewer with chunk highlighting\nYou can get chunk coordinates from chunking providers like [Preprocess.co](https://preprocess.co/rag-document-viewer) (which supports paragraphs, layout items, multi-column layouts, slides, and more) or Unstructured.io (which offers PDF-only item-level support).\n\n\u003e **Note**: Chunks' coordinates should be stored in a list. 
When storing and then accessing a chunk, you should use the list index to reference the correct chunk.\n\n**With the [Preprocess SDK](https://github.com/preprocess-co/pypreprocess)**\n```python\nfrom pypreprocess import Preprocess\nfrom rag_document_viewer import RAG_DV\n\n# Preprocess a file\npreprocess = Preprocess(api_key=YOUR_API_KEY, filepath=\"path/to/file\", boundary_boxes=True)\npreprocess.chunk()\npreprocess.wait()\n\nresult = preprocess.result() \n# result is a PreprocessResponse object\n\n# Generate an HTML viewer with highlighting capabilities\nRAG_DV(\n    file_path=\"path/to/file\",\n    store_path=\"/path/to/viewers/doc1\",\n    chunks=result.data['boundary_boxes'][\"boxes\"]\n)\n```\n\n**With other providers**\n```python\nfrom rag_document_viewer import RAG_DV\n\n# Define boxes for highlighting specific content areas.\n# Each chunk is a list of one or more boxes.\n# Each box has coordinates relative to the page dimensions (0.0 to 1.0).\n# page: is a 0 based index for identifying the document page.\n# top: position of the chunk between 0 and 1 relative to the page height\n# left: position of the chunk between 0 and 1 relative to the page width\n# height: vertical length of the chunk between 0 and 1 relative to the page height\n# width: horizontal length of the chunk between 0 and 1 relative to the page width\n\nboxes = [\n    [ # First chunk\n        {\"page\": 1, \"top\": 0.02, \"left\": 0.1, \"height\": 0.1, \"width\": 0.5},\n        # A chunk can be composed of multiple boxes (e.g., for multi-column text)\n    ],\n    [ # Second chunk\n        {\"page\": 2, \"top\": 0.5, \"left\": 0.2, \"height\": 0.2, \"width\": 0.6},\n    ],\n    # ... 
more chunks\n]\n\n# Generate an HTML viewer with highlighting capabilities\nRAG_DV(\n    file_path=\"path/to/file\",\n    store_path=\"/path/to/viewers/doc1\",\n    chunks=boxes\n)\n```\n\n\u003e **Important**: If no chunk information is provided when generating the viewer, the following features will be disabled:\n\u003e - Chunk highlighting and navigation\n\u003e - Scrollbar chunk indicators\n\u003e - The `goto_chunk` URL parameter\n\u003e\n\u003e Ensure you include chunk coordinates if you plan to use these interactive features.\n\n\n\u003e **Tip: Page Highlighting**\n\u003e If you prefer to highlight entire pages instead of precise portions, create a chunk that covers the full page:\n\u003e `[{\"page\": 3, \"top\": 0, \"left\": 0, \"height\": 1, \"width\": 1}]`\n\n\n### Viewer Options\nCustomize the viewer's appearance and behavior with these parameters during generation:\n\n| Parameter | Type | Default | Description |\n|-----------|------|---------|-------------|\n| `chunks` | `list` | `[]` | List of box coordinates for content chunks to highlight. |\n| `page_number` | `bool` | `True` | Display page numbers at the bottom. |\n| `chunks_navigator` | `bool` | `True` | Show chunk navigation controls (requires `chunks`). |\n| `scrollbar_navigator` | `bool` | `True` | Display chunk indicators on the scrollbar (requires `chunks`). |\n| `show_chunks_if_single` | `bool` | `False` | Show chunks navigator even with only one chunk (requires `chunks`). |\n| `chunk_navigator_text` | `str` | `\"Chunk %d of %d\"` | Text template for chunk counter (use `%d` placeholders, requires `chunks`). 
|\n\n\n**Example**\n```python\nfrom rag_document_viewer import RAG_DV\n\n# `boxes` defined earlier in the code\nRAG_DV(\n    file_path=\"path/to/file\",\n    store_path=\"/path/to/viewer\",\n    chunks=boxes,\n    chunk_navigator_text=\"Suggestion %d of %d\",\n    scrollbar_navigator=False\n)\n```\n\n\n### Color Customization\nCustomize the viewer's colors to match your branding.\n\n\u003e If `main_color` and `background_color` are set, all other colors are automatically derived. You can still override any specific color individually.\n\n| Parameter | Type | Default | Description |\n|-----------|------|---------|-------------|\n| `main_color` | `str` | `#ff8000` | Primary color for interactive elements |\n| `background_color` | `str` | `#dddddd` | Viewer background color |\n| `page_shadow` | `str` | `None` | CSS `box-shadow` for pages (auto-calculated if not set) |\n| `text_selection_color` | `str` | `None` | Browser text selection color for the viewer (auto-calculated if not set) |\n| `controls_text_color` | `str` | `None` | Text color of viewer controls, like zoom and page number (auto-calculated if not set) |\n| `controls_bg_color` | `str` | `None` | Background color of viewer controls, like zoom and page number (auto-calculated if not set) |\n| `scrollbar_color` | `str` | `None` | Scrollbar background color (auto-calculated if not set) |\n| `scroller_color` | `str` | `None` | Scrollbar thumb color (auto-calculated if not set) |\n| `bookmark_color` | `str` | `None` | Color for relevant chunk indicators in the scrollbar (defaults to main_color) |\n| `highlight_chunk_color` | `str` | `None` | CSS `background-image` for chunk highlight (auto-calculated if not set) |\n| `highlight_page_color` | `str` | `None` | CSS `background-image` for page highlight (auto-calculated if not set) |\n| `highlight_page_outline` | `str` | `None` | Page border color for highlighted pages (auto-calculated if not set) |\n\n**Example**\n```python\nfrom rag_document_viewer import 
RAG_DV\n\nRAG_DV(\n    file_path=\"path/to/file\",\n    store_path=\"/path/to/viewer\",\n    main_color=\"#0969da\",\n    background_color=\"#f6f8fa\"\n)\n```\n\n\n### Displaying the Viewer\nAdd an `\u003ciframe\u003e` to your application to show the document.\n\n\u003e ### **⚠️ Important**: The content must be served via HTTP/S. Opening the `index.html` directly from the local filesystem (`file://`) is not fully supported and may cause issues.\n\n```html\n\u003ciframe\n  src=\"/path/to/viewers/my_document\"\n  width=\"100%\"\n  height=\"800\"\n  style=\"border:0\"\n\u003e\u003c/iframe\u003e\n```\n\n\u003e **Note**: Please see the [Handling Authentication](#handling-authentication) section for best practices on securely integrating the viewer.\n\n\n### Viewer Display Parameters\n\nControl the viewer's initial state by passing parameters in the `\u003ciframe\u003e` URL:\n\n| Parameter | Type | Default | Description |\n|-----------|------|---------|-------------|\n| `chunks` | `string` | `[]` | An ordered JSON array of chunk indices to highlight and navigate. |\n| `goto_chunk`| `int` | `None` | Automatically scroll to this chunk index on load. |\n| `goto_page` | `int` | `None` | Automatically scroll to this page number on load. |\n\n\u003e **Note**: The `chunks` and `goto_chunk` parameters only work if chunk data was provided when the viewer was generated. The order of indices in the `chunks` URL parameter determines the \"Next/Previous\" navigation order.\n\u003e Chunk and page indices are 0-based.\n\n**Behavior Priority:**\nThe viewer determines the initial scroll position based on the following priority:\n1.  If `goto_chunk` is set, it scrolls to that chunk.\n2.  Else, if `chunks` is set, it scrolls to the first chunk in the list.\n3.  Else, if `goto_page` is set, it scrolls to that page.\n4.  Otherwise, it defaults to the beginning of the document.\n\n**Examples:**\n\nHighlight chunks `0`, `2`, and `3`, and jump directly to chunk `2` on load. 
Navigation will follow the `[0, 2, 3]` order.\n```html\n\u003ciframe src=\"/viewer/doc1?chunks=[0,2,3]\u0026goto_chunk=2\"\u003e\u003c/iframe\u003e\n```\n\nHighlight chunks `2`, `0`, and `3`. The \"Next/Previous\" buttons will navigate in this specific order (`2` -\u003e `0` -\u003e `3`). The view will initially scroll to chunk `2`.\n```html\n\u003ciframe src=\"/viewer/doc1?chunks=[2,0,3]\"\u003e\u003c/iframe\u003e\n```\n\nGo to a specific page on load.\n```html\n\u003ciframe src=\"/viewer/doc1?goto_page=4\"\u003e\u003c/iframe\u003e\n```\n\n\n### Handling Authentication\n**We strongly recommend storing viewer bundles in a non-public path. Here is a guide on how to manage authentication to prevent unwanted access to your documents.**\n\nWhen generating a viewer, you should store the resulting bundle in a directory that is not publicly accessible via HTTP. You can use your web server (Apache, Nginx, etc.) to block direct access to this folder. When a user requests to see a document, your application backend should first verify their permissions and then serve the viewer bundle from the disk.\n\nDepending on your stack, this can be implemented in many ways. Using a route handler is a common approach.\n\n**Flask Example**\nThis example shows how to serve a viewer only after checking user permissions.\n\n```python\nfrom flask import Flask, send_from_directory, abort\nfrom pathlib import Path\n\napp = Flask(__name__)\n\n# Path where viewer bundles are stored securely, outside the public web root\nBASE_DIR = Path(\"/var/secure_viewers\").resolve()\n\n@app.route(\"/view/\u003cdoc_id\u003e/\")\n@app.route(\"/view/\u003cdoc_id\u003e/\u003cpath:asset\u003e\")\ndef serve_my_document(doc_id, asset=\"index.html\"):\n    # 1. Add your authentication and authorization logic here\n    # Example: check_user_can_view(current_user, doc_id)\n    if not user_is_allowed:\n        abort(403) # Forbidden\n    \n    # 2. 
Securely resolve the path to the viewer\n    viewer_dir = (BASE_DIR / doc_id).resolve()\n    \n    # Security check: ensure the resolved path is still within the base directory\n    # This prevents path traversal attacks (e.g., doc_id = \"../../../etc/passwd\")\n    if viewer_dir.parent != BASE_DIR:\n        abort(404) # Not Found\n    \n    # 3. Serve the requested asset (index.html, CSS, JS, etc.)\n    return send_from_directory(viewer_dir, asset)\n```\n\n\u003e **Note**: Remember to include a wildcard in your route (e.g. `\u003cpath:asset\u003e`) to handle requests for all assets inside the bundle (CSS, JS, fonts, images), otherwise the viewer will not render correctly.\n\n---\n\n## Support\nContact the Preprocess team at `support@preprocess.co` or join our [Discord channel](https://discord.gg/7G5xqsZmGu).\n\n## License\n\nThis project is licensed under the MIT License.\n\n## Credits\nRAG Document Viewer would not be possible without the following open-source projects:\n\n| Project | License |\n|---------|---------|\n| **LibreOffice** \u003chttps://www.libreoffice.org/\u003e | **MPL 2.0 / LGPL v3** |\n| **pdf2htmlEX** \u003chttps://github.com/pdf2htmlEX/pdf2htmlEX\u003e | **GPL v3** |\n\nThese tools are **not** bundled with the `rag-document-viewer` package; they must be installed on the host system where viewers are generated. Please consult the upstream repositories for full license texts and source code.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpreprocess-co%2Frag-document-viewer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpreprocess-co%2Frag-document-viewer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpreprocess-co%2Frag-document-viewer/lists"}