{"id":29195685,"url":"https://github.com/gptscript-ai/gptparse","last_synced_at":"2025-07-02T05:05:29.887Z","repository":{"id":258400330,"uuid":"874914867","full_name":"gptscript-ai/gptparse","owner":"gptscript-ai","description":"Document parser for RAG","archived":false,"fork":false,"pushed_at":"2024-11-07T02:44:39.000Z","size":218,"stargazers_count":12,"open_issues_count":0,"forks_count":0,"subscribers_count":5,"default_branch":"main","last_synced_at":"2024-11-07T03:30:00.610Z","etag":null,"topics":["ocr","retrieval-augmented-generation","vision-language-model"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gptscript-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-18T17:39:38.000Z","updated_at":"2024-11-07T02:44:43.000Z","dependencies_parsed_at":"2024-10-18T20:39:08.487Z","dependency_job_id":"0a8f8deb-e6d0-476a-8440-81ecfadf658d","html_url":"https://github.com/gptscript-ai/gptparse","commit_stats":null,"previous_names":["gptscript-ai/gptparse"],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/gptscript-ai/gptparse","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gptscript-ai%2Fgptparse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gptscript-ai%2Fgptparse/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gptscript-ai%2Fgptparse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gptscript-ai%2Fgptparse/manif
ests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gptscript-ai","download_url":"https://codeload.github.com/gptscript-ai/gptparse/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gptscript-ai%2Fgptparse/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263077633,"owners_count":23410167,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ocr","retrieval-augmented-generation","vision-language-model"],"created_at":"2025-07-02T05:05:29.238Z","updated_at":"2025-07-02T05:05:29.882Z","avatar_url":"https://github.com/gptscript-ai.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# GPTParse\n\nGPTParse is a powerful and versatile document parser designed specifically for Retrieval-Augmented Generation (RAG) systems. 
It enables seamless conversion of PDF documents and images into Markdown format using either advanced vision language models (VLMs) or fast local processing, facilitating easy integration into text-based workflows and applications.\n\nWith GPTParse, you can:\n\n- Convert complex PDFs and images, including those with tables, lists, and embedded images, into well-structured Markdown.\n- Choose between AI-powered processing (using OpenAI, Anthropic, or Google) or fast local processing.\n- Use GPTParse as a Python library or via a command-line interface (CLI), offering flexibility in how you integrate it into your projects.\n\nIt's as simple as:\n\n```bash\n# Convert a PDF using Vision Language Models\ngptparse vision example.pdf --output_file output.md\n\n# Convert a PDF using fast local processing (no VLM or internet connection required)\ngptparse fast example.pdf --output_file output.md\n\n# Convert using hybrid mode (combines fast and vision for better results)\ngptparse hybrid example.pdf --output_file output.md\n\n# Convert using OCR mode (uses local deep learning model for text extraction)\ngptparse ocr example.pdf --output_file output.md\n\n# Convert an image\ngptparse vision screenshot.png --output_file output.md\n```\n\n## Features\n\n- **Convert PDFs and Images to Markdown**: Transform PDF documents and image files (PNG, JPG, JPEG) into Markdown format, preserving the structure and content.\n- **Multiple Parsing Methods**: Choose between using Vision Language Models (VLMs) for high-fidelity conversion, fast local processing for quick results, hybrid mode for enhanced accuracy, or OCR mode for direct text extraction.\n  - OCR processing powered by EasyOCR for fast and accurate text recognition\n- **Support for Multiple AI Providers**: Seamlessly integrate with OpenAI, Anthropic, and Google AI models, selecting the one that best fits your needs.\n- **Python Library and CLI Application**: Use GPTParse within your Python applications or interact with it through 
the command line.\n- **Customizable Processing Options**: Configure concurrency levels, select specific pages to process, and customize system prompts to tailor the output.\n- **Page Selection**: Process entire documents or specify individual pages or ranges of pages.\n- **Detailed Statistics**: Optionally display detailed processing statistics, including token usage and processing times.\n\n## Table of Contents\n\n- [Installation](#installation)\n  - [Prerequisites](#prerequisites)\n- [Quick Start](#quick-start)\n- [Usage](#usage)\n  - [Setting Up Environment Variables](#setting-up-environment-variables)\n  - [Configuration](#configuration)\n  - [Using GPTParse as a Python Package](#using-gptparse-as-a-python-package)\n  - [Using GPTParse via the CLI](#using-gptparse-via-the-cli)\n    - [Vision Mode Options](#vision-mode-options)\n    - [Fast Mode Options](#fast-mode-options)\n    - [Hybrid Mode Options](#hybrid-mode-options)\n    - [OCR Mode Options](#ocr-mode-options)\n- [Available Models and Providers](#available-models-and-providers)\n  - [OpenAI Models](#openai-models)\n  - [Anthropic Models](#anthropic-models)\n  - [Google Models](#google-models)\n- [Examples](#examples)\n  - [Processing Specific Pages](#processing-specific-pages)\n  - [Custom System Prompt](#custom-system-prompt)\n  - [Displaying Statistics](#displaying-statistics)\n  - [Processing Images](#processing-images)\n  - [Processing with OCR](#processing-with-ocr)\n- [Contributing](#contributing)\n- [License](#license)\n\n## Installation\n\nInstall GPTParse using pip:\n\n```bash\npip install gptparse\n```\n\n### Prerequisites\n\nEnsure you have the following installed:\n\n- **Python 3.9** or higher\n- **Poppler**: For PDF to image conversion\n\n#### Installing Poppler\n\nPoppler is the underlying project that handles PDF processing. 
You can check if you already have it installed by running `pdftoppm -h` in your terminal/command prompt.\n\n- **Ubuntu/Debian**:\n\n  ```bash\n  sudo apt-get install poppler-utils\n  ```\n\n- **Arch Linux**:\n\n  ```bash\n  sudo pacman -S poppler\n  ```\n\n- **macOS (with Homebrew)**:\n\n  ```bash\n  brew install poppler\n  ```\n\n- **Windows**:\n\n  1. Download the latest poppler package from [oschwartz10612's version](https://github.com/oschwartz10612/poppler-windows/releases/), which is the most up-to-date.\n  2. Extract the downloaded package and move the extracted directory to your desired location.\n  3. Add the `bin/` directory from the extracted folder to your [system PATH](https://www.architectryan.com/2018/03/17/add-to-the-path-on-windows-10/).\n  4. Verify the installation by opening a new command prompt and running `pdftoppm -h`.\n\nAfter installing Poppler, you should be ready to use GPTParse.\n\n## Quick Start\n\nHere's how you can quickly get started with GPTParse:\n\n```bash\n# Set your API key\nexport OPENAI_API_KEY=\"your-openai-api-key\"\n\n# Convert a PDF to Markdown using Vision Language Models\ngptparse vision example.pdf --output_file output.md\n\n# Convert a PDF to Markdown using fast local processing (no VLM or internet connection required)\ngptparse fast example.pdf --output_file output.md\n\n# Convert using hybrid mode (combines fast and vision for better results)\ngptparse hybrid example.pdf --output_file output.md\n\n# Convert using OCR mode (direct text extraction)\ngptparse ocr example.pdf --output_file output.md\n```\n\n## Usage\n\n### Setting Up Environment Variables\n\nBefore using GPTParse, set up the API keys for the AI providers you plan to use by setting the appropriate environment variables:\n\n- **OpenAI**:\n\n  ```bash\n  export OPENAI_API_KEY=\"your-openai-api-key\"\n  ```\n\n- **Anthropic**:\n\n  ```bash\n  export ANTHROPIC_API_KEY=\"your-anthropic-api-key\"\n  ```\n\n- **Google**:\n\n  ```bash\n  export 
GOOGLE_API_KEY=\"your-google-api-key\"\n  ```\n\nYou can set these variables in your shell profile (`~/.bashrc`, `~/.zshrc`, etc.) or include them in your Python script before importing GPTParse.\n\n\u003e **Note**: Keep your API keys secure and do not expose them in code repositories.\n\n### Configuration\n\nGPTParse allows you to set default configurations for ease of use. Use the `configure` command to set default values for the AI provider, model, and concurrency:\n\n```bash\ngptparse configure\n```\n\nYou will be prompted to enter the desired provider, model, and concurrency level. The configuration is saved in `~/.gptparse_config.json`.\n\nExample:\n\n```bash\n$ gptparse configure\nGPTParse Configuration\nEnter new values or press Enter to keep the current values.\nCurrent values are shown in [brackets].\n\nAI Provider [openai]: anthropic\nDefault Model for anthropic [claude-3-5-sonnet-latest]: claude-3-opus-latest\nDefault Concurrency [10]: 5\nConfiguration updated successfully.\n\nCurrent configuration:\n  provider: anthropic\n  model: claude-3-opus-latest\n  concurrency: 5\n```\n\n### Using GPTParse as a Python Package\n\nBelow is an example of how to use GPTParse in your Python code:\n\n```python\nimport os\n\n# For AI-powered vision processing\nos.environ[\"OPENAI_API_KEY\"] = \"your-openai-api-key\"\nfrom gptparse.modes.vision import vision\nfrom gptparse.modes.fast import fast\nfrom gptparse.modes.hybrid import hybrid\n\n# Using vision mode\nvision_result = vision(\n    concurrency=10,\n    file_path=\"example.pdf\",\n    model=\"gpt-4o\",\n    output_file=\"output.md\",\n    custom_system_prompt=None,\n    select_pages=None,\n    provider=\"openai\",\n)\n\n# Using fast mode (no AI required)\nfast_result = fast(\n    file_path=\"example.pdf\",\n    output_file=\"output.md\",\n    select_pages=None,\n)\n\n# Using hybrid mode (combines fast and vision)\nhybrid_result = hybrid(\n    concurrency=10,\n    file_path=\"example.pdf\",\n    model=\"gpt-4o\",\n  
  output_file=\"output.md\",\n    custom_system_prompt=None,\n    select_pages=None,\n    provider=\"openai\",\n)\n```\n\n### Using GPTParse via the CLI\n\nWhen using the command-line interface, you have four modes available:\n\n1. **Vision Mode** - Uses AI models for high-quality conversion:\n\n```bash\nexport OPENAI_API_KEY=\"your-openai-api-key\"\ngptparse vision example.pdf --output_file output.md --provider openai\n```\n\n2. **Fast Mode** - Uses local processing for quick conversion (no AI required):\n\n```bash\ngptparse fast example.pdf --output_file output.md\n```\n\n3. **Hybrid Mode** - Combines fast and vision modes for enhanced results:\n\n```bash\nexport OPENAI_API_KEY=\"your-openai-api-key\"\ngptparse hybrid example.pdf --output_file output.md --provider openai\n```\n\n4. **OCR Mode** - Uses direct OCR processing for text extraction:\n\n```bash\ngptparse ocr example.pdf --output_file output.md\n```\n\n- `--output_file`: Output file name (must have a `.md` or `.txt` extension).\n- `--abort-on-error`: Stop processing if an error occurs (optional).\n\n#### Vision Mode Options\n\n- `--concurrency`: Number of concurrent processes (default: value set in configuration or 10).\n- `--model`: Vision language model to use (overrides configured default).\n- `--output_file`: Output file name (must have a `.md` or `.txt` extension).\n- `--custom_system_prompt`: Custom system prompt for the language model.\n- `--select_pages`: Pages to process (e.g., `\"1,3-5,10\"`). Only applicable for PDF files.\n- `--provider`: AI provider to use (`openai`, `anthropic`, `google`).\n- `--stats`: Display detailed statistics after processing.\n\n#### Fast Mode Options\n\n- `--output_file`: Output file name (must have a `.md` or `.txt` extension).\n- `--select_pages`: Pages to process (e.g., `\"1,3-5,10\"`). 
Only applicable for PDF files.\n- `--stats`: Display basic processing statistics.\n\n#### Hybrid Mode Options\n\n- `--concurrency`: Number of concurrent processes (default: value set in configuration or 10).\n- `--model`: Vision language model to use (overrides configured default).\n- `--output_file`: Output file name (must have a `.md` or `.txt` extension).\n- `--custom_system_prompt`: Custom system prompt for the language model.\n- `--select_pages`: Pages to process (e.g., `\"1,3-5,10\"`). Only applicable for PDF files.\n- `--provider`: AI provider to use (`openai`, `anthropic`, `google`).\n- `--stats`: Display detailed statistics after processing.\n\n#### OCR Mode Options\n\n```bash\ngptparse ocr example.pdf --output_file output.md\n```\n\n- `--output_file`: Output file name (must have a `.md` or `.txt` extension).\n- `--abort-on-error`: Stop processing if an error occurs (optional).\n\n## Available Models and Providers\n\nGPTParse supports multiple models from different AI providers.\n\n### OpenAI Models\n\n- `gpt-4o` (Default)\n- `gpt-4o-mini`\n\n### Anthropic Models\n\n- `claude-3-5-sonnet-latest` (Default)\n- `claude-3-opus-latest`\n- `claude-3-sonnet-20240229`\n- `claude-3-haiku-20240307`\n\n### Google Models\n\n- `gemini-1.5-pro-002` (Default)\n- `gemini-1.5-flash-002`\n- `gemini-1.5-flash-8b`\n\nTo list available models for a provider in your code, you can use:\n\n```python\nfrom gptparse.models.model_interface import list_available_models\n\n# List models for a specific provider\nmodels = list_available_models(provider='openai')\nprint(\"OpenAI models:\", models)\n\n# List all available models from all providers\nall_models = list_available_models()\nprint(\"All available models:\", all_models)\n```\n\n## Examples\n\n### Processing Specific Pages\n\nTo process only specific pages from a PDF document, use the `--select_pages` option:\n\n```bash\ngptparse vision example.pdf --select_pages \"2,4,6-8\"\n```\n\nThis command will process pages 2, 4, 6, 7, and 
8 of `example.pdf`.\n\n### Custom System Prompt\n\nProvide a custom system prompt to influence the model's output:\n\n```bash\ngptparse vision example.pdf --custom_system_prompt \"Please extract all text in bullet points.\"\n```\n\n### Displaying Statistics\n\nTo display detailed processing statistics, use the `--stats` flag:\n\n```bash\ngptparse vision example.pdf --stats\n```\n\nSample output:\n\n```\nDetailed Statistics:\nFile Path: example.pdf\nCompletion Time: 12.34 seconds\nTotal Pages Processed: 5\nTotal Input Tokens: 2500\nTotal Output Tokens: 3000\nTotal Tokens: 5500\nAverage Tokens per Page: 1100.00\n\nPage-wise Statistics:\n  Page 1: 600 tokens\n  Page 2: 500 tokens\n  Page 3: 700 tokens\n  Page 4: 800 tokens\n  Page 5: 400 tokens\n```\n\n### Processing Images\n\nTo process an image file:\n\n```bash\n# Process a PNG file\ngptparse vision screenshot.png --output_file output.md\n\n# Process a JPG file\ngptparse vision photo.jpg --output_file output.md\n```\n\nSupported image formats:\n\n- PNG\n- JPG/JPEG\n\n### Processing with OCR\n\nTo process a file using direct OCR:\n\n```bash\n# Process a PDF file with OCR\ngptparse ocr document.pdf --output_file output.md\n\n# Process an image with OCR\ngptparse ocr scan.png --output_file output.md\n\n# Process with abort-on-error flag\ngptparse ocr document.pdf --output_file output.md --abort-on-error\n```\n\nThe OCR mode supports:\n\n- PDF documents\n- PNG images\n- JPG/JPEG images\n\n## Contributing\n\nContributions are welcome! If you'd like to contribute to GPTParse, please follow these steps:\n\n1. **Fork the repository** on GitHub.\n2. **Create a new branch** for your feature or bugfix.\n3. **Make your changes** and ensure tests pass.\n4. **Submit a pull request** with a clear description of your changes.\n\nPlease ensure that your code adheres to the existing style conventions and passes all tests.\n\n## License\n\nGPTParse is licensed under the Apache-2.0 License. 
See [LICENSE](LICENSE) for more information.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgptscript-ai%2Fgptparse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgptscript-ai%2Fgptparse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgptscript-ai%2Fgptparse/lists"}