{"id":24306100,"url":"https://github.com/aminediro/ferrules","last_synced_at":"2026-02-16T21:20:13.813Z","repository":{"id":272651716,"uuid":"910945629","full_name":"AmineDiro/ferrules","owner":"AmineDiro","description":"Modern, fast, document parser written in  🦀","archived":false,"fork":false,"pushed_at":"2025-02-22T17:11:47.000Z","size":60137,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-22T18:23:02.200Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AmineDiro.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":"ROADMAP.md","authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-01T21:38:08.000Z","updated_at":"2025-02-22T17:11:50.000Z","dependencies_parsed_at":"2025-02-22T18:31:59.491Z","dependency_job_id":null,"html_url":"https://github.com/AmineDiro/ferrules","commit_stats":null,"previous_names":["aminediro/ferrules"],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AmineDiro%2Fferrules","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AmineDiro%2Fferrules/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AmineDiro%2Fferrules/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AmineDiro%2Fferrules/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AmineDiro","download_url":"https://codeload.github.com/AmineDiro/ferrules/t
ar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242250847,"owners_count":20096895,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-17T02:20:06.096Z","updated_at":"2026-02-16T21:20:13.801Z","avatar_url":"https://github.com/AmineDiro.png","language":"C","funding_links":[],"categories":["\u003ca name=\"ai\"\u003e\u003c/a\u003eAI / ChatGPT"],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\u003ch1\u003e Ferrules:  Modern, fast, document parser written in 🦀 \u003c/h1\u003e\n\u003c/div\u003e\n\n---\n\n\u003e 🚧 **Work in Progress**: Check out our [roadmap](./ROADMAP.md) for upcoming features and development plans.\n\nFerrules is an **opinionated high-performance document parsing library** designed to generate LLM-ready documents efficiently.\nUnlike alternatives such as `unstructured`, which are slow and Python-based, `ferrules` is written in Rust and aims to provide a seamless experience with robust deployment across various platforms.\n\n| **NOTE** A ferrule (a corruption of Latin *viriola*; on a pencil, also known as a shoe) is any of a number of types of objects, generally used for fastening, joining, sealing, or reinforcement.\n\n## Features\n\n- **📄 PDF Parsing and Layout Extraction:**\n    - Utilizes `pdfium2` to parse documents.\n    - Supports OCR using Apple's Vision on macOS (using `objc2` Rust bindings and [`VNRecognizeTextRequest`](https://developer.apple.com/documentation/vision/vnrecognizetextrequest) functionality).\n    - Extracts and analyzes **page layouts** with 
advanced preprocessing and postprocessing techniques.\n    - Accelerates model inference on Apple Neural Engine (ANE)/GPU (using the [`ort`](https://ort.pyke.io/) library).\n    - Merges layout with PDF text lines for comprehensive document understanding.\n\n- **📊 Advanced Table Parsing:**\n    - Robust table structure recognition using three complementary algorithms.\n    - Intelligent fallback heuristics to ensure high-accuracy extraction across different table styles.\n    - Handles both bordered (Lattice) and borderless (Stream/Vision) tables.\n    - Extracts spanning cells and preserves cell alignment.\n\n- **🔄 Document Transformation:**\n    - Groups captions, footers, and other elements intelligently.\n    - Structures lists and merges blocks into cohesive sections.\n    - Detects headings and titles using machine learning for logical document structuring.\n\n- **🖨️ Rendering:** Provides HTML, Markdown, and JSON rendering options for versatile use cases.\n\n- **⚡ High Performance \u0026 Easy Deployment:**\n    - Built with **Rust** for maximum speed and efficiency\n    - Zero-dependency deployment (no Python runtime required!)\n    - Hardware-accelerated ML inference (Apple Neural Engine, GPU)\n    - Designed for production environments with minimal setup\n\n- **⚙️ Advanced Functionalities:** Offers configurable inference parameters for optimized processing (COMING SOON)\n\n- **🛠️ API and CLI:**\n    - Provides both a CLI and API interface\n    - Supports tracing\n\n## Installation\n\nFerrules provides precompiled binaries for macOS, available for download from the [GitHub Releases](https://github.com/aminediro/ferrules/releases) page.\n\n### macOS Installation\n\n1. Download the latest `ferrules` binary from the [releases](https://github.com/aminediro/ferrules/releases).\n\n2. Verify the installation:\n\n    ```sh\n    ferrules --version\n    ```\n\n### Linux Installation\n\nLinux support with NVIDIA GPU acceleration will be available soon. 
Keep an eye out for updates on the [releases](https://github.com/aminediro/ferrules/releases) page.\n\n\u003e ⚠️ **Note:** Ensure that you have the necessary permissions to execute and move files to system directories.\n\nVisit the [GitHub Releases](https://github.com/aminediro/ferrules/releases) page to find the latest version suitable for your operating system.\n\n## Usage\n\nFerrules provides two ways to use the library:\n\n### 1. Command Line Interface (CLI)\n\n### Basic Usage\n\n```sh\nferrules path/to/your.pdf\n```\n\nThis will parse the PDF and save the results in the current directory:\n\n```sh\nferrules file.pdf\n[00:00:02] [########################################] Parsed document in 108ms\n✓ Results saved in: ./file-results.json\n```\n\n### Debug Mode\n\nTo get detailed processing information and debug outputs:\n\n```sh\nferrules path/to/your.pdf --debug\n```\n\nRunning with `--debug` will generate:\n1. Visual JSON results and cropped images (if enabled).\n2. A `.ferr` debug archive containing all intermediate states (layout, OCR, native lines, tables).\n\n### 🛠️ Visual Debugger (`ferrules-debug`)\n\n`ferrules-debug` is a lightweight, cross-platform visualizer built with [Iced](https://iced.rs/). It allows you to inspect exactly how the engine interpreted your document.\n\n\u003cdiv align=\"center\"\u003e\n\n| Simple Layout Analysis | Complex Table Extraction |\n|:---:|:---:|\n| \u003cimg src=\"./imgs/ferrules_debug_simple.png\" alt=\"Ferrules Debug Simple\" height=\"350\"\u003e | \u003cimg src=\"./imgs/ferrules_debug_table_cells.png\" alt=\"Ferrules Debug Table Cells\" height=\"350\"\u003e |\n\n\u003c/div\u003e\n**How to use:**\n1. Run the parser with the debug flag: `ferrules sample.pdf --debug`\n2. Open the resulting `.ferr` file: `ferrules-debug --file path/to/sample.ferr`\n3. Toggle layers (Layout, OCR, Tables, Blocks) to inspect the parsing logic.\n\n### 🧠 Table Parsing Algorithms\n\nFerrules uses a tiered approach to table extraction:\n\n1.  
**Lattice**: Detects tables with explicit borders by analyzing PDF vector paths. It's the most accurate for traditional tables.\n2.  **Stream**: Used for tables without visible borders. It analyzes text alignment and whitespace gaps to reconstruct the grid.\n3.  **Vision (Table Transformer)**: A deep learning fallback using the Table Transformer model. It is triggered when the previous methods yield \"suspicious\" results (e.g., low cell density in a large area).\n\n**Heuristics**: The engine automatically sequences these algorithms. If a `Stream` result appears incomplete or messy, it triggers `Vision` to verify and improve the structure recognition.\n\n### Available Options\n\n```\nOptions:\n  -r, --page-range \u003cPAGE_RANGE\u003e\n          Specify pages to parse (e.g., '1-5' or '1' for single page)\n      --output-dir \u003cOUTPUT_DIR\u003e\n          Specify the directory to store parsing result [env: FERRULES_OUTPUT_DIR=]\n      --save-images\n          Specify the directory to store parsing result\n      --layout-model-path \u003cLAYOUT_MODEL_PATH\u003e\n          Specify the path to the layout model for document parsing [env: FERRULES_LAYOUT_MODEL_PATH=]\n      --coreml\n          Enable or disable the use of CoreML for layout inference\n      --use-ane\n          Enable or disable Apple Neural Engine acceleration (only applies when CoreML is enabled)\n      --trt\n          Enable or disable the use of TensorRT for layout inference\n      --cuda\n          Enable or disable the use of CUDA for layout inference\n      --device-id \u003cDEVICE_ID\u003e\n          CUDA device ID to use (0 for first GPU) [default: 0]\n  -j, --intra-threads \u003cINTRA_THREADS\u003e\n          Number of threads to use for parallel processing within operations [default: 2]\n      --inter-threads \u003cINTER_THREADS\u003e\n          Number of threads to use for executing operations in parallel [default: 1]\n  -O, --graph-opt-level \u003cGRAPH_OPT_LEVEL\u003e\n          Ort 
graph optimization level\n      --debug\n          Activate debug mode for detailed processing information [env: FERRULES_DEBUG=]\n      --debug-dir \u003cDEBUG_DIR\u003e\n          Specify the directory to store debug output files [env: FERRULES_DEBUG_PATH=]\n  -h, --help\n          Print help\n  -V, --version\n          Print version\n```\n\nYou can also configure some options through environment variables:\n\n- `FERRULES_OUTPUT_DIR`: Set the output directory\n- `FERRULES_LAYOUT_MODEL_PATH`: Set the layout model path\n- `FERRULES_DEBUG`: Enable debug mode\n- `FERRULES_DEBUG_PATH`: Set the debug output directory\n\n### 2. HTTP API Server\n\nFerrules also provides an HTTP API server for integration into existing systems.\n\n#### Running locally\n\nTo start the API server locally:\n\n```sh\nferrules-api\n```\n\n#### Running with Docker (NVIDIA GPU)\n\nFor systems with NVIDIA GPU support, you can run the API server using Docker:\n\n```sh\ndocker run -p 3002:3002 --gpus all aminediro/ferrules-api-gpu\n```\n\nBy default, the server listens on `0.0.0.0:3002`. For detailed API documentation and additional running options, see [API.md](./API.md).\n\n## Resources:\n\n- Apple vision text detection:\n    - https://github.com/straussmaximilian/ocrmac/blob/main/ocrmac/ocrmac.py\n    - https://docs.rs/objc2-vision/latest/objc2_vision/index.html\n    - https://developer.apple.com/documentation/vision/recognizing-text-in-images\n\n- `ort` : https://ort.pyke.io/\n\n## Credits\n\nThis project uses models from the [yolo-doclaynet repository](https://github.com/ppaanngggg/yolo-doclaynet). We are grateful to the contributors of that project.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faminediro%2Fferrules","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faminediro%2Fferrules","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faminediro%2Fferrules/lists"}