{"id":31250550,"url":"https://github.com/scientist-labs/parsekit","last_synced_at":"2025-09-23T05:30:04.354Z","repository":{"id":311006660,"uuid":"1042078688","full_name":"scientist-labs/parsekit","owner":"scientist-labs","description":"Ruby document parsing toolkit with zero runtime dependencies. Parse PDFs, DOCX, XLSX, and images (with OCR) using a single, lightweight gem. Statically links MuPDF and Tesseract at   compile time for hassle-free installation - no system libraries or external tools required.","archived":false,"fork":false,"pushed_at":"2025-09-06T16:26:56.000Z","size":2028,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-06T16:42:09.815Z","etag":null,"topics":["content","extraction","metadata","ruby"],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/scientist-labs.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-08-21T13:00:37.000Z","updated_at":"2025-09-06T16:26:56.000Z","dependencies_parsed_at":"2025-09-06T16:42:10.515Z","dependency_job_id":null,"html_url":"https://github.com/scientist-labs/parsekit","commit_stats":null,"previous_names":["cpetersen/parsekit","scientist-labs/parsekit"],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/scientist-labs/parsekit","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scientist-labs%2Fparsekit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scientist-labs%2Fparsekit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scientist-labs%2Fparsekit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scientist-labs%2Fparsekit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/scientist-labs","download_url":"https://codeload.github.com/scientist-labs/parsekit/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scientist-labs%2Fparsekit/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":276519116,"owners_count":25656553,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-23T02:00:09.130Z","response_time":73,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["content","extraction","metadata","ruby"],"created_at":"2025-09-23T05:30:01.938Z","updated_at":"2025-09-23T05:30:04.347Z","avatar_url":"https://github.com/scientist-labs.png","language":"Ruby","funding_links":[],"categories":["Ruby"],"sub_categories":[],"readme":"\u003cimg src=\"/docs/assets/parsekit-wide.png\" alt=\"parsekit\" height=\"80px\"\u003e\n\n[![Gem Version](https://badge.fury.io/rb/parsekit.svg)](https://badge.fury.io/rb/parsekit)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\nNative Ruby bindings for the [parser-core](https://crates.io/crates/parser-core) Rust crate, providing high-performance document parsing and text extraction capabilities through Magnus. This gem wraps parser-core to extract text from PDFs, Office documents (DOCX, XLSX), images (with OCR), and more. Part of the ruby-nlp ecosystem.\n\n## Features\n\n- 📄 **Document Parsing**: Extract text from PDFs, Office documents (DOCX, XLSX)\n- 🖼️ **OCR Support**: Extract text from images using Tesseract OCR\n- 🚀 **High Performance**: Native Rust performance with Ruby convenience\n- 🔧 **Unified API**: Single interface for multiple document formats\n- 📦 **Cross-Platform**: Works on Linux, macOS, and Windows\n- 🧪 **Well Tested**: Comprehensive test suite with RSpec\n\n## Installation\n\nAdd this line to your application's Gemfile:\n\n```ruby\ngem 'parsekit'\n```\n\nAnd then execute:\n\n    $ bundle install\n\nOr install it yourself as:\n\n```bash\ngem install parsekit\n```\n\n### Requirements\n\n- Ruby \u003e= 3.0.0\n- Rust toolchain (stable)\n- C compiler (for linking)\n\nThat's it! ParseKit bundles all necessary libraries including Tesseract for OCR, so you don't need to install any system dependencies.\n\n## Usage\n\n### Basic Usage\n\n```ruby\nrequire 'parsekit'\n\n# Parse a PDF file\ntext = ParseKit.parse_file(\"document.pdf\")\nputs text  # Extracted text from the PDF\n\n# Parse an Excel file\ntext = ParseKit.parse_file(\"spreadsheet.xlsx\")\nputs text  # Extracted text from all sheets\n\n# Parse binary data directly\nfile_data = File.binread(\"document.pdf\")\ntext = ParseKit.parse_bytes(file_data)\nputs text\n\n# Parse with a Parser instance\nparser = ParseKit::Parser.new\ntext = parser.parse_file(\"report.docx\")\nputs text\n```\n\n### Module-Level Convenience Methods\n\n```ruby\n# Parse files directly\ncontent = ParseKit.parse_file('document.pdf')\n\n# Parse bytes\ndata = File.read('document.pdf', mode: 'rb')\ncontent = ParseKit.parse_bytes(data.bytes)\n\n# Check supported formats\nformats = ParseKit.supported_formats\n# =\u003e [\"txt\", \"json\", \"xml\", \"html\", \"docx\", \"xlsx\", \"xls\", \"csv\", \"pdf\", \"png\", \"jpg\", \"jpeg\", \"tiff\", \"bmp\"]\n\n# Check if a file is supported\nParseKit.supports_file?('document.pdf')  # =\u003e true\n```\n\n### Configuration Options\n\n```ruby\n# Create parser with options\nparser = ParseKit::Parser.new(\n  strict_mode: true,\n  max_size: 50 * 1024 * 1024,  # 50MB limit\n  encoding: 'UTF-8'\n)\n\n# Or use the strict convenience method\nparser = ParseKit::Parser.strict\n```\n\n### Format-Specific Parsing\n\n```ruby\nparser = ParseKit::Parser.new\n\n# Direct access to format-specific parsers\npdf_data = File.read('document.pdf', mode: 'rb').bytes\npdf_text = parser.parse_pdf(pdf_data)\n\nimage_data = File.read('image.png', mode: 'rb').bytes\nocr_text = parser.ocr_image(image_data)\n\nexcel_data = File.read('data.xlsx', mode: 'rb').bytes\nexcel_text = parser.parse_xlsx(excel_data)\n```\n\n## Supported Formats\n\n| Format | Extensions | Method | Notes |\n|--------|------------|--------|-------|\n| PDF | .pdf | `parse_pdf` | Text extraction via MuPDF |\n| Word | .docx | `parse_docx` | Office Open XML format |\n| Excel | .xlsx, .xls | `parse_xlsx` | Both modern and legacy formats |\n| PowerPoint | .pptx | `parse_pptx` | Text extraction from slides and notes |\n| Images | .png, .jpg, .jpeg, .tiff, .bmp | `ocr_image` | OCR via bundled Tesseract |\n| JSON | .json | `parse_json` | Pretty-printed output |\n| XML/HTML | .xml, .html | `parse_xml` | Extracts text content |\n| Text | .txt, .csv, .md | `parse_text` | With encoding detection |\n\n## Performance\n\nParseKit is built with performance in mind:\n\n- Native Rust implementation for speed\n- Statically linked C libraries (MuPDF, Tesseract) compiled with optimizations\n- Efficient memory usage with streaming where possible\n- Configurable size limits to prevent memory issues\n\n## Development\n\nAfter checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests.\n\nTo compile the Rust extension:\n\n```bash\nrake compile\n```\n\nTo run tests with coverage:\n\n```bash\nrake dev:coverage\n```\n\n### OCR Mode Configuration\n\nBy default, ParseKit bundles Tesseract for zero-dependency OCR support. Advanced users who already have Tesseract installed system-wide and want faster gem installation can use system mode:\n\n**Using system Tesseract during installation:**\n```bash\ngem install parsekit -- --no-default-features\n```\n\n**For development with system Tesseract:**\n```bash\nrake compile CARGO_FEATURES=\"\"  # Disables bundled-tesseract feature\n```\n\n**System Tesseract requirements:**\n- **macOS**: `brew install tesseract`\n- **Ubuntu/Debian**: `sudo apt-get install libtesseract-dev`\n- **Fedora/RHEL**: `sudo dnf install tesseract-devel`\n\nThe bundled mode adds ~1-3 minutes to initial gem installation but provides a completely self-contained experience with no external dependencies.\n\n## Architecture\n\nParseKit uses a hybrid Ruby/Rust architecture:\n\n- **Ruby Layer**: Provides convenient API and format detection\n- **Rust Layer**: Implements high-performance parsing using:\n  - MuPDF for PDF text extraction (statically linked)\n  - tesseract-rs for OCR (with bundled Tesseract by default)\n  - Pure Rust libraries for DOCX/XLSX parsing\n  - Magnus for Ruby-Rust FFI bindings\n\n## Contributing\n\nBug reports and pull requests are welcome on GitHub at https://github.com/scientist-labs/parsekit.\n\n## License\n\nThe gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).\n\nNote: This gem includes statically linked versions of MuPDF (AGPL/Commercial) and Tesseract (Apache 2.0). Please review their respective licenses for compliance with your use case.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscientist-labs%2Fparsekit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscientist-labs%2Fparsekit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscientist-labs%2Fparsekit/lists"}