{"id":45896923,"url":"https://github.com/jimmy228676/arborparser","last_synced_at":"2026-02-27T21:01:11.268Z","repository":{"id":284118094,"uuid":"948469502","full_name":"Jimmy228676/arborparser","owner":"Jimmy228676","description":"ArborParser is a powerful Python library designed to parse structured text documents and convert them into a tree representation based on hierarchical headings. It intelligently handles various numbering schemes and document inconsistencies, making it ideal for processing outlines, reports, technical documentation, legal texts, and more.","archived":false,"fork":false,"pushed_at":"2025-11-14T06:04:10.000Z","size":77,"stargazers_count":10,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-11-14T06:20:22.783Z","etag":null,"topics":["arbor","chain","custom-pattern","document","error-correction","parser","parsing","tree"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Jimmy228676.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-03-14T11:50:25.000Z","updated_at":"2025-11-14T06:04:13.000Z","dependencies_parsed_at":"2025-03-24T09:29:20.342Z","dependency_job_id":"0cc9f05a-008d-46a6-a818-20af18fbd3a6","html_url":"https://github.com/Jimmy228676/arborparser","commit_stats":null,"previous_names":["jimmy228676/arborparser"],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/Jimmy228676/arborparser","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitH
ub/repositories/Jimmy228676%2Farborparser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Jimmy228676%2Farborparser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Jimmy228676%2Farborparser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Jimmy228676%2Farborparser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Jimmy228676","download_url":"https://codeload.github.com/Jimmy228676/arborparser/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Jimmy228676%2Farborparser/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29913647,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-27T19:37:42.220Z","status":"ssl_error","status_checked_at":"2026-02-27T19:37:41.463Z","response_time":57,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["arbor","chain","custom-pattern","document","error-correction","parser","parsing","tree"],"created_at":"2026-02-27T21:01:10.492Z","updated_at":"2026-02-27T21:01:11.248Z","avatar_url":"https://github.com/Jimmy228676.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ArborParser\n\nArborParser is a powerful Python library designed to parse structured text documents and convert them into a tree representation based on 
hierarchical headings. It intelligently handles various numbering schemes and document inconsistencies, making it ideal for processing outlines, reports, technical documentation, legal texts, and more.\n\n## Features\n\n*   **Chain Parsing:** Converts text into a linear sequence (`ChainNode` list) representing the document's hierarchical structure.\n*   **Multi-Candidate Parsing:** `parse_to_multi_chain` keeps every heading candidate per line, and the rest of the toolkit (tree builder/exporter) works directly on the resulting `List[List[ChainNode]]`.\n*   **Flexible Pattern Definition:** Define custom parsing patterns using regular expressions and specific number converters (Arabic, Roman, Chinese, Letters, Circled).\n*   **Built-in Patterns:** Provides ready-to-use patterns for common heading styles (`1.2.3`, `Chapter 1`, `第一章`, etc.).\n*   **Robust Tree Building:** Transforms the linear chain into a true hierarchical `TreeNode` structure.\n*   **Automatic Error Correction:** Includes an `AutoPruneStrategy` to intelligently handle skipped heading levels or lines mistakenly identified as headings.\n*   **Node Manipulation:** Allows merging content between nodes (`concat_node`, `merge_all_children`) for post-processing.\n*   **Reversible Transformation:** Preserves original text, enabling full document reconstruction from the tree (`tree.get_full_content()`).\n*   **Export Capabilities:** Outputs the parsed structure in various formats (e.g., human-readable tree view).\n\n**Example Transformation:**\n\n**Original Text**\n```text\nChapter 1 Animals\n1.1 Mammals\n1.1.1 Primates\n1.2 Reptiles\nChapter 2 Plants\n2.1 Angiosperms\n```\n\n**Chain Structure (Intermediate)**\n```\nLEVEL-[]: ROOT\nLEVEL-[1]: Animals\nLEVEL-[1, 1]: Mammals\nLEVEL-[1, 1, 1]: Primates\nLEVEL-[1, 2]: Reptiles\nLEVEL-[2]: Plants\nLEVEL-[2, 1]: Angiosperms\n```\n\n**Tree Structure (Final)**\n```\nROOT\n├─ Chapter 1 Animals\n│   ├─ 1.1 Mammals\n│   │   └─ 1.1.1 Primates\n│   └─ 1.2 Reptiles\n└─ 
Chapter 2 Plants\n    └─ 2.1 Angiosperms\n```\n\n## Installation\n\n```bash\npip install arborparser\n```\n\n## Basic Usage\n\n```python\nfrom arborparser.chain import ChainParser\nfrom arborparser.tree import TreeBuilder, TreeExporter, AutoPruneStrategy\nfrom arborparser.pattern import ENGLISH_CHAPTER_PATTERN_BUILDER, NUMERIC_DOT_PATTERN_BUILDER\n\ntest_text = \"\"\"\nChapter 1 Animals\n1.1 Mammals\n1.1.1 Primates\n1.2 Reptiles\nChapter 2 Plants\n2.1 Angiosperms\n\"\"\"\n\n# 1. Define parsing patterns\npatterns = [\n    ENGLISH_CHAPTER_PATTERN_BUILDER.build(),\n    NUMERIC_DOT_PATTERN_BUILDER.build(),\n]\n\n# 2. Parse text to chain\nparser = ChainParser(patterns)\nchain = parser.parse_to_chain(test_text)\n\n# 3. Build tree (using AutoPrune for robustness)\nbuilder = TreeBuilder(strategy=AutoPruneStrategy())\ntree = builder.build_tree(chain)\n\n# 4. Print the structured tree\nprint(TreeExporter.export_tree(tree))\n```\n\n## Multi-Chain Parsing\n\nSometimes a line can match multiple heading patterns (or a converter can emit more than one hierarchy).  
Call `ChainParser.parse_to_multi_chain` to preserve every candidate per line and let downstream consumers decide which one to keep.\n\n```python\nambiguous_text = \"\"\"\nChapter 2 Building Blocks\n    Content for the second chapter.\n\n2.1 A Component\n    Details about the first component.\n\n2.1.1 A details\n    Details 1\n\n2.1 .2 A details 2 [the title is corrupted due to OCR or other reasons]\n    Details 2\n\n2.2 2-Sided Materials B Component\n    Details about the second component.\n\"\"\"\n\nnon_strict = NUMERIC_DOT_PATTERN_BUILDER.modify(\n    prefix_regex=r\"[\\#\\s]*\",\n    suffix_regex=r\"[\\.\\s]*\",\n    separator=r\"[\\.\\s]+\",\n    is_sep_regex=True,\n    min_level=2,\n).build()\n\npatterns = [\n    ENGLISH_CHAPTER_PATTERN_BUILDER.build(),\n    NUMERIC_DOT_PATTERN_BUILDER.build(),\n    non_strict,\n]\n\nparser = ChainParser(patterns)\nmulti_chain = parser.parse_to_multi_chain(ambiguous_text)\n\nprint(TreeExporter.export_chain(multi_chain))\n\nbuilder = TreeBuilder()\ntree_from_multi = builder.build_tree(multi_chain)\nprint(TreeExporter.export_tree(tree_from_multi))\n```\n\nSample output (abridged):\n\n```\n[LEVEL-[]: ROOT]\n[LEVEL-[2]: Building Blocks]\n[LEVEL-[2, 1]: A Component, LEVEL-[2, 1]: A Component]\n[LEVEL-[2, 1, 1]: A details, LEVEL-[2, 1, 1]: A details]\n[LEVEL-[2, 1]: 2 A details 2 [...], LEVEL-[2, 1, 2]: A details 2 [...]]\n[LEVEL-[2, 2]: 2-Sided Materials B Component, LEVEL-[2, 2, 2]: -Sided Materials B Component]\n\nROOT\n└─ Chapter 2 Building Blocks\n    ├─ 2.1 A Component\n    │   ├─ 2.1.1 A details\n    │   └─ 2.1 .2 A details 2 [...]\n    └─ 2.2 2-Sided Materials B Component\n```\n\nKey points:\n\n* Each outer list entry represents a text line (the first entry is still `ROOT`).\n* Each inner list is ordered by detection priority. 
`TreeBuilder` prefers candidates that immediately follow the previous node (`is_imm_next`); otherwise it falls back to the lowest `pattern_priority`.\n* `TreeExporter.export_chain` renders multi-candidate rows in square brackets so you can quickly spot OCR errors or ambiguous headings.\n\n## Key Features in Detail\n\n### Built-in \u0026 Custom Patterns\n\nQuickly parse common formats using builders like `NUMERIC_DOT_PATTERN_BUILDER`, `CHINESE_CHAPTER_PATTERN_BUILDER`, etc., or define your own using `PatternBuilder` for full control over prefixes, suffixes, number types, and separators.\n\n```python\n# Example: Match \"Section A.\", \"Section B.\"\nletter_section_pattern = PatternBuilder(\n    prefix_regex=r\"Section\\s\",\n    number_type=NumberType.LETTER,\n    suffix_regex=r\"\\.\"\n).build()\n```\n\n### Automatic Error Correction (AutoPruneStrategy)\n\nDocuments aren't always perfect. `AutoPruneStrategy` (the default for `TreeBuilder`) handles common issues like skipped heading numbers (e.g., `1.1` followed by `1.3`) and prunes lines incorrectly matched as headings, ensuring a more robust parsing process compared to the `StrictStrategy`.\n\nReal-world documents often contain structural inconsistencies that can challenge parsers. Common issues include:\n\n*   **Skipped Heading Numbers:** Authors might jump from `1.1` directly to `1.3`, omitting `1.2`.\n*   **False Positives:** Regular text lines might accidentally match a heading pattern (e.g., a sentence mentioning \"section 1.1\").\n\nThe `AutoPruneStrategy` is designed to handle these imperfections gracefully. 
It uses heuristics to identify likely errors and prune the intermediate structure, resulting in a more accurate final tree.\n\n**Example: Handling Imperfections**\n\nConsider the following text with a missing section (`1.2`) and a line of text containing `1.1` which could be mistaken for a heading:\n\n**Input Text:**\n\n```text\nChapter 1 The Foundation\n    Introductory content for the first chapter.\n\n1.1 Core Concepts\n    Explanation of the fundamental ideas.\n    This section lays the groundwork.\n\n# NOTE: Heading '1.2 Intermediate Concepts' is MISSING here.\n\n1.3 Advanced Topics\n    Discussing more complex subjects. We build upon the ideas from section\n    1.1. This section is more advanced and goes into more detail.\n    # NOTE: The '1.1.' here is text, not a heading.\n\nChapter 2 Building Blocks\n    Content for the second chapter.\n\n2.1 Component A\n    Details about the first component.\n\n2.2 Component B\n    Details about the second component. End of document.\n```\n\n**Intermediate Chain (Before Pruning):**\n\nA naive parsing step might initially produce a chain like this, including the misidentified heading:\n\n```\nLEVEL-[]: ROOT\nLEVEL-[1]: The Foundation\nLEVEL-[1, 1]: Core Concepts\nLEVEL-[1, 3]: Advanced Topics\nLEVEL-[1, 1]: This section is more advanced and goes into more detail.  \u003c-- POTENTIAL FALSE POSITIVE\nLEVEL-[2]: Building Blocks\nLEVEL-[2, 1]: Component A\nLEVEL-[2, 2]: Component B\n```\n\n**How AutoPrune Works:**\n\nWhen building the tree, `AutoPruneStrategy` analyzes the sequence:\n\n1.  It recognizes that `LEVEL-[1, 3]` can logically follow `LEVEL-[1, 1]` even if `[1, 2]` is missing (sibling jump).\n2.  It sees the subsequent `LEVEL-[1, 1]` node (\"This section...\") followed by a completely different hierarchy (`LEVEL-[2]`). This discontinuity strongly suggests the second `LEVEL-[1, 1]` node was a false positive.\n3.  
The strategy \"prunes\" the misidentified node, effectively merging its content back into the preceding valid node (`LEVEL-[1, 3]` in this case, depending on implementation details of content association).\n\n**Final Tree Structure (After AutoPrune):**\n\nThe resulting tree correctly reflects the intended document structure:\n\n```\nROOT\n├─ Chapter 1 The Foundation\n│   ├─ 1.1 Core Concepts\n│   └─ 1.3 Advanced Topics  # Handles the jump and ignores the false positive\n└─ Chapter 2 Building Blocks\n    ├─ 2.1 Component A\n    └─ 2.2 Component B\n```\n\n### Node Operations \u0026 Reversibility\n\nArborParser works with `ChainNode` (linear sequence) and `TreeNode` (hierarchical tree) objects. Both inherit from `BaseNode`, which stores `level_seq`, `title`, and the original `content` string.\n\n*   **Concatenating Content:** You can merge the content of one node into another. This is useful internally for associating non-heading text with its preceding heading or for merging nodes during error correction.\n    ```python\n    # Append node B's content to node A\n    node_a.concat_node(node_b)\n    ```\n\n*   **Merging Children:** A parent node can absorb the content of all its descendants.\n    ```python\n    # Make node_a contain its own content plus all content from its children/grandchildren...\n    node_a.merge_all_children()\n    ```\n\n*   **Reconstructing Original Text:** Because each node retains its original text chunk (`content`), you can reconstruct the *entire* original document from the root `TreeNode`. 
This verifies parsing integrity and allows regeneration after modification.\n    ```python\n    # Get the full text back from the parsed tree structure\n    reconstructed_text = root_node.get_full_content()\n    assert reconstructed_text == original_text # Verification\n    ```\n\n## Potential Use Cases\n\n*   Documentation Parsing\n*   Legal Document Analysis (Laws, Contracts)\n*   Outline Processing \u0026 Conversion\n*   Report Structuring \u0026 Analysis\n*   Content Management System Import\n*   Data Extraction from Structured Text\n*   Format Conversion (e.g., Text to HTML/XML preserving structure)\n*   Better Chunking Strategies for RAG\n\n## Contributing\n\nContributions (pull requests, issues) are welcome!\n\n## License\n\nMIT License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjimmy228676%2Farborparser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjimmy228676%2Farborparser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjimmy228676%2Farborparser/lists"}