{"id":13572091,"url":"https://github.com/Filimoa/open-parse","last_synced_at":"2025-04-04T09:31:45.907Z","repository":{"id":230121783,"uuid":"775822419","full_name":"Filimoa/open-parse","owner":"Filimoa","description":"Improved file parsing for LLM’s","archived":false,"fork":false,"pushed_at":"2024-11-13T01:28:53.000Z","size":7580,"stargazers_count":2888,"open_issues_count":25,"forks_count":117,"subscribers_count":23,"default_branch":"main","last_synced_at":"2025-04-02T01:02:36.827Z","etag":null,"topics":["document-parser","document-structure","layout-parsing","table-detection"],"latest_commit_sha":null,"homepage":"https://filimoa.github.io/open-parse/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Filimoa.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-22T05:35:12.000Z","updated_at":"2025-04-01T23:49:36.000Z","dependencies_parsed_at":"2024-05-01T22:57:57.718Z","dependency_job_id":"038680ae-0dba-4f56-a702-35035397171d","html_url":"https://github.com/Filimoa/open-parse","commit_stats":null,"previous_names":["filimoa/open-parse"],"tags_count":17,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Filimoa%2Fopen-parse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Filimoa%2Fopen-parse/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Filimoa%2Fopen-parse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Filimoa%2Fopen-parse/manifests","owner_url":"
https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Filimoa","download_url":"https://codeload.github.com/Filimoa/open-parse/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247153531,"owners_count":20892686,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["document-parser","document-structure","layout-parsing","table-detection"],"created_at":"2024-08-01T14:01:13.007Z","updated_at":"2025-04-04T09:31:45.898Z","avatar_url":"https://github.com/Filimoa.png","language":"Python","funding_links":[],"categories":["Python","开源工具","🔥LLM Extraction / Parsing"],"sub_categories":["预处理"],"readme":"\u003cp align=\"center\"\u003e\n \u003cimg src=\"https://sergey-filimonov.nyc3.digitaloceanspaces.com/open-parse/open-parse-with-text-tp-logo.webp\" width=\"350\" /\u003e\n\u003c/p\u003e\n\u003cbr/\u003e\n\n**Easily chunk complex documents the same way a human would.**  \n\nChunking documents is a challenging task that underpins any RAG system.  High-quality results are critical to a successful AI application, yet most open-source libraries are limited in their ability to handle complex documents.  
\n\nOpen Parse is designed to fill this gap by providing a flexible, easy-to-use library capable of visually discerning document layouts and chunking them effectively.\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cb\u003eHow is this different from other layout parsers?\u003c/b\u003e\u003c/summary\u003e\n\n  #### ✂️ Text Splitting\n  Text splitting converts a file to raw text and [slices it up](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/token_text_splitter/).\n  \n  - You lose the ability to easily overlay the chunk on the original PDF.\n  - You ignore the underlying semantic structure of the file - headings, sections, bullets represent valuable information.\n  - No support for tables, images or markdown.\n  \n  #### 🤖 ML Layout Parsers\n  There are some fantastic libraries like [layout-parser](https://github.com/Layout-Parser/layout-parser). \n  - While they can identify various elements like text blocks, images, and tables, they are not built to group related content effectively.\n  - They strictly focus on layout parsing - you will need to add another model to extract markdown from the images, parse tables, group nodes, etc.\n  - We've found performance to be sub-optimal on many documents while also being computationally heavy.\n\n  #### 💼 Commercial Solutions\n\n  - Typically priced at ≈ $10 / 1k pages. 
See [here](https://cloud.google.com/document-ai), [here](https://aws.amazon.com/textract/) and [here](https://www.reducto.ai/).\n  - Requires sharing your data with a vendor\n\n\u003c/details\u003e\n\n## Highlights\n\n- **🔍 Visually-Driven:** Open-Parse visually analyzes documents for superior LLM input, going beyond naive text splitting.\n- **✍️ Markdown Support:** Basic markdown support for parsing headings, bold and italics.\n- **📊 High-Precision Table Support:** Extract tables into clean Markdown formats with accuracy that surpasses traditional tools.\n    \u003cdetails\u003e\n  \u003csummary\u003e\u003ci\u003eExamples\u003c/i\u003e\u003c/summary\u003e\n  The following examples were parsed with unitable.\n    \u003cbr/\u003e\n    \u003cp align=\"center\"\u003e\n        \u003cbr/\u003e\n        \u003cimg src=\"https://sergey-filimonov.nyc3.digitaloceanspaces.com/open-parse/unitable-parsing-sample.webp\" width=\"650\"/\u003e\n    \u003c/p\u003e\n         \u003cbr/\u003e\n    \u003c/details\u003e\n\n- **🛠️ Extensible:** Easily implement your own post-processing steps.\n- **💡Intuitive:** Great editor support. Completion everywhere. Less time debugging.\n- **🎯 Easy:** Designed to be easy to use and learn. 
Less time reading docs.\n\n\u003cbr/\u003e\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://sergey-filimonov.nyc3.digitaloceanspaces.com/open-parse/marked-up-doc-2.webp\" width=\"250\" /\u003e\n\u003c/p\u003e\n\n## Example\n\n#### Basic Example\n\n```python\nimport openparse\n\nbasic_doc_path = \"./sample-docs/mobile-home-manual.pdf\"\nparser = openparse.DocumentParser()\nparsed_basic_doc = parser.parse(basic_doc_path)\n\nfor node in parsed_basic_doc.nodes:\n    print(node)\n```\n\n**📓 Try the sample notebook** \u003ca href=\"https://colab.research.google.com/drive/1Z5B5gsnmhFKEFL-5yYIcoox7-jQao8Ep?usp=sharing\" class=\"external-link\" target=\"_blank\"\u003ehere\u003c/a\u003e\n\n#### Semantic Processing Example\n\nChunking documents is fundamentally about grouping similar semantic nodes together. By embedding the text of each node, we can then cluster them together based on their similarity.\n\n```python\nfrom openparse import processing, DocumentParser\n\nsemantic_pipeline = processing.SemanticIngestionPipeline(\n    openai_api_key=OPEN_AI_KEY,\n    model=\"text-embedding-3-large\",\n    min_tokens=64,\n    max_tokens=1024,\n)\nparser = DocumentParser(\n    processing_pipeline=semantic_pipeline,\n)\nparsed_content = parser.parse(basic_doc_path)\n```\n\n**📓 Sample notebook** \u003ca href=\"https://github.com/Filimoa/open-parse/blob/main/src/cookbooks/semantic_processing.ipynb\" class=\"external-link\" target=\"_blank\"\u003ehere\u003c/a\u003e\n\n#### Serializing Results\nUses pydantic under the hood so you can serialize results with \n\n```python\nparsed_content.dict()\n\n# or serialize to a JSON string\nparsed_content.json()\n```\n\n## Requirements\n\nPython 3.8+\n\n**Dealing with PDFs:**\n\n- \u003ca href=\"https://github.com/pdfminer/pdfminer.six\" class=\"external-link\" target=\"_blank\"\u003epdfminer.six\u003c/a\u003e Fully open source.\n\n**Extracting Tables:**\n\n- \u003ca href=\"https://github.com/pymupdf/PyMuPDF\" 
class=\"external-link\" target=\"_blank\"\u003ePyMuPDF\u003c/a\u003e has some table detection functionality. Please see their \u003ca href=\"https://mupdf.com/licensing/index.html#commercial\" class=\"external-link\" target=\"_blank\"\u003elicense\u003c/a\u003e.\n- \u003ca href=\"https://huggingface.co/microsoft/table-transformer-detection\" class=\"external-link\" target=\"_blank\"\u003eTable Transformer\u003c/a\u003e is a deep learning approach.\n- \u003ca href=\"https://github.com/poloclub/unitable\" class=\"external-link\" target=\"_blank\"\u003eunitable\u003c/a\u003e is another transformers based approach with **state-of-the-art** performance.\n\n## Installation\n\n#### 1. Core Library\n\n```console\npip install openparse\n```\n\n**Enabling OCR Support**:\n\nPyMuPDF will already contain all the logic to support OCR functions. But it additionally does need Tesseract’s language support data, so installation of Tesseract-OCR is still required.\n\nThe language support folder location must be communicated either via storing it in the environment variable \"TESSDATA_PREFIX\", or as a parameter in the applicable functions.\n\nSo for a working OCR functionality, make sure to complete this checklist:\n\n1. Install Tesseract.\n\n2. Locate Tesseract’s language support folder. Typically you will find it here:\n\n   - Windows: `C:/Program Files/Tesseract-OCR/tessdata`\n\n   - Unix systems: `/usr/share/tesseract-ocr/5/tessdata`\n\n   - macOS (installed via Homebrew):\n     - Standard installation: `/opt/homebrew/share/tessdata`\n     - Version-specific installation: `/opt/homebrew/Cellar/tesseract/\u003cversion\u003e/share/tessdata/`\n\n3. 
Set the environment variable TESSDATA_PREFIX\n\n   - Windows: `setx TESSDATA_PREFIX \"C:/Program Files/Tesseract-OCR/tessdata\"`\n\n   - Unix systems: `declare -x TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata`\n\n    - macOS (installed via Homebrew): `export TESSDATA_PREFIX=$(brew --prefix tesseract)/share/tessdata`\n\n**Note:** _On Windows systems, this must happen outside Python – before starting your script. Just manipulating os.environ will not work!_\n\n#### 2. ML Table Detection (Optional)\n\nThis repository provides an optional feature to parse content from tables using a variety of deep learning models.\n\n```console\npip install \"openparse[ml]\"\n```\n\nThen download the model weights with\n\n```console\nopenparse-download\n```\n\nYou can run the parsing with the following. \n\n```python\nparser = openparse.DocumentParser(\n        table_args={\n            \"parsing_algorithm\": \"unitable\",\n            \"min_table_confidence\": 0.8,\n        },\n)\nparsed_nodes = parser.parse(pdf_path)\n```\n\nNote we currently use [table-transformers](https://github.com/microsoft/table-transformer) for all table detection and we find its performance to be subpar. This negatively affects the downstream results of unitable. If you're aware of a better model please open an Issue - the unitable team mentioned they might add this soon too.\n\n## Cookbooks\n\nhttps://github.com/Filimoa/open-parse/tree/main/src/cookbooks\n\n## Documentation\n\nhttps://filimoa.github.io/open-parse/\n\n## Sponsors\n\n\u003c!-- sponsors --\u003e\n\n\u003ca href=\"https://www.data.threesigma.ai/filings-ai\" target=\"_blank\" title=\"Three Sigma: AI for insurance filings.\"\u003e\u003cimg src=\"https://sergey-filimonov.nyc3.digitaloceanspaces.com/open-parse/marketing/three-sigma-wide.png\" width=\"250\"\u003e\u003c/a\u003e\n\n\u003c!-- /sponsors --\u003e\n\nDoes your use case need something special? 
Reach [out](https://www.linkedin.com/in/sergey-osu/).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFilimoa%2Fopen-parse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FFilimoa%2Fopen-parse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFilimoa%2Fopen-parse/lists"}