https://github.com/scientist-labs/parsekit

Ruby document parsing toolkit with zero runtime dependencies. Parse PDFs, DOCX, XLSX, and images (with OCR) using a single, lightweight gem. Statically links MuPDF and Tesseract at compile time for hassle-free installation - no system libraries or external tools required.

Host: GitHub
URL: https://github.com/scientist-labs/parsekit
Owner: scientist-labs
License: mit
Created: 2025-08-21T13:00:37.000Z (11 months ago)
Default Branch: main
Last Pushed: 2025-09-06T16:26:56.000Z (11 months ago)
Last Synced: 2025-09-06T16:42:09.815Z (11 months ago)
Topics: content, extraction, metadata, ruby
Language: Ruby
Homepage:
Size: 1.93 MB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.txt

[![Gem Version](https://badge.fury.io/rb/parsekit.svg)](https://badge.fury.io/rb/parsekit)

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Native Ruby bindings for the [parser-core](https://crates.io/crates/parser-core) Rust crate, providing high-performance document parsing and text extraction capabilities through Magnus. This gem wraps parser-core to extract text from PDFs, Office documents (DOCX, XLSX), images (with OCR), and more. Part of the ruby-nlp ecosystem.

## Features

- 📄 **Document Parsing**: Extract text from PDFs and Office documents (DOCX, XLSX)

- 🖼️ **OCR Support**: Extract text from images using Tesseract OCR

- 🚀 **High Performance**: Native Rust performance with Ruby convenience

- 🔧 **Unified API**: Single interface for multiple document formats

- 📦 **Cross-Platform**: Works on Linux, macOS, and Windows

- 🧪 **Well Tested**: Comprehensive test suite with RSpec

## Installation

Add this line to your application's Gemfile:

```ruby
gem 'parsekit'
```

And then execute:

    $ bundle install

Or install it yourself as:

```bash
gem install parsekit
```

### Requirements

- Ruby >= 3.0.0

- Rust toolchain (stable)

- C compiler (for linking)

That's it! ParseKit bundles the libraries it needs, including Tesseract for OCR, so beyond the build tools listed above you don't need to install any system libraries.

## Usage

### Basic Usage

```ruby
require 'parsekit'

# Parse a PDF file
text = ParseKit.parse_file("document.pdf")
puts text  # Extracted text from the PDF

# Parse an Excel file
text = ParseKit.parse_file("spreadsheet.xlsx")
puts text  # Extracted text from all sheets

# Parse binary data directly
file_data = File.binread("document.pdf")
text = ParseKit.parse_bytes(file_data)
puts text

# Parse with a Parser instance
parser = ParseKit::Parser.new
text = parser.parse_file("report.docx")
puts text
```
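
When parsing many files, one unreadable document should not abort the whole run. The sketch below assumes only that the object passed in responds to `parse_file`, as `ParseKit` and `ParseKit::Parser` do in the examples above. `parse_all` is a hypothetical helper, not part of the gem, and because this README does not document ParseKit's error classes it rescues `StandardError` broadly.

```ruby
# Hypothetical helper (not part of ParseKit): parse several files and
# collect failures instead of stopping at the first one.
# `parser` is any object that responds to #parse_file.
def parse_all(paths, parser)
  results = { texts: {}, errors: {} }
  paths.each do |path|
    begin
      results[:texts][path] = parser.parse_file(path)
    rescue StandardError => e
      # Error classes are not documented here, so rescue broadly.
      results[:errors][path] = e.message
    end
  end
  results
end
```

A call might look like `parse_all(Dir.glob("docs/*.pdf"), ParseKit)`, after which `results[:errors]` lists the files that could not be parsed.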

### Module-Level Convenience Methods

```ruby
# Parse files directly
content = ParseKit.parse_file('document.pdf')

# Parse bytes
data = File.read('document.pdf', mode: 'rb')
content = ParseKit.parse_bytes(data.bytes)

# Check supported formats
formats = ParseKit.supported_formats
# => ["txt", "json", "xml", "html", "docx", "xlsx", "xls", "csv", "pdf", "png", "jpg", "jpeg", "tiff", "bmp"]

# Check if a file is supported
ParseKit.supports_file?('document.pdf')  # => true
```
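
To sort a batch of paths in one pass, the list returned by `ParseKit.supported_formats` can be compared against each file's extension. The helper below is a hypothetical sketch that looks only at the extension. Whether `supports_file?` applies the same rule is not stated in this README.

```ruby
# Hypothetical helper (not part of ParseKit): split paths into those whose
# extension appears in `formats` and those whose extension does not.
# `formats` is an array of extensions without dots, e.g. the value
# returned by ParseKit.supported_formats.
def partition_by_format(paths, formats)
  known = formats.map(&:downcase)
  paths.partition do |path|
    ext = File.extname(path).delete_prefix(".").downcase
    known.include?(ext)
  end
end
```

For example, `supported, skipped = partition_by_format(Dir.glob("inbox/*"), ParseKit.supported_formats)` yields two arrays of paths.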

### Configuration Options

```ruby
# Create parser with options
parser = ParseKit::Parser.new(
  strict_mode: true,
  max_size: 50 * 1024 * 1024,  # 50MB limit
  encoding: 'UTF-8'
)

# Or use the strict convenience method
parser = ParseKit::Parser.strict
```
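
This README does not say how the parser reacts when a file exceeds `max_size`. A caller who wants to skip oversized files before parsing could apply the same limit up front. The following is a minimal sketch in plain Ruby, where `within_size_limit?` and `MAX_SIZE` are hypothetical names chosen for illustration.

```ruby
# Hypothetical pre-check (not part of ParseKit): true when `path` is a
# regular file no larger than `max_size` bytes.
MAX_SIZE = 50 * 1024 * 1024 # same 50MB figure as the example above

def within_size_limit?(path, max_size = MAX_SIZE)
  File.file?(path) && File.size(path) <= max_size
end
```

Filtering with `paths.select { |p| within_size_limit?(p) }` keeps only the files that fit the limit.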

### Format-Specific Parsing

```ruby
parser = ParseKit::Parser.new

# Direct access to format-specific parsers
pdf_data = File.read('document.pdf', mode: 'rb').bytes
pdf_text = parser.parse_pdf(pdf_data)

image_data = File.read('image.png', mode: 'rb').bytes
ocr_text = parser.ocr_image(image_data)

excel_data = File.read('data.xlsx', mode: 'rb').bytes
excel_text = parser.parse_xlsx(excel_data)
```

## Supported Formats

| Format | Extensions | Method | Notes |
|--------|------------|--------|-------|
| PDF | .pdf | `parse_pdf` | Text extraction via MuPDF |
| Word | .docx | `parse_docx` | Office Open XML format |
| Excel | .xlsx, .xls | `parse_xlsx` | Both modern and legacy formats |
| PowerPoint | .pptx | `parse_pptx` | Text extraction from slides and notes |
| Images | .png, .jpg, .jpeg, .tiff, .bmp | `ocr_image` | OCR via bundled Tesseract |
| JSON | .json | `parse_json` | Pretty-printed output |
| XML/HTML | .xml, .html | `parse_xml` | Extracts text content |
| Text | .txt, .csv, .md | `parse_text` | With encoding detection |
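
The table can also be expressed as a lookup from extension to method name. The sketch below copies the mapping from the table and assumes every listed method accepts an array of bytes, as `parse_pdf`, `ocr_image`, and `parse_xlsx` do in the Format-Specific Parsing examples. `FORMAT_METHODS`, `method_for`, and `parse_with_table` are hypothetical names, and this is not necessarily how `parse_file` dispatches internally.

```ruby
# Extension -> parser method, copied from the Supported Formats table.
FORMAT_METHODS = {
  "pdf" => :parse_pdf,
  "docx" => :parse_docx,
  "xlsx" => :parse_xlsx, "xls" => :parse_xlsx,
  "pptx" => :parse_pptx,
  "png" => :ocr_image, "jpg" => :ocr_image, "jpeg" => :ocr_image,
  "tiff" => :ocr_image, "bmp" => :ocr_image,
  "json" => :parse_json,
  "xml" => :parse_xml, "html" => :parse_xml,
  "txt" => :parse_text, "csv" => :parse_text, "md" => :parse_text
}.freeze

# Returns the method name for a path, or raises ArgumentError.
def method_for(path)
  ext = File.extname(path).delete_prefix(".").downcase
  FORMAT_METHODS.fetch(ext) do
    raise ArgumentError, "unsupported extension: #{ext.inspect}"
  end
end

# Reads the file as bytes and calls the matching method on `parser`.
# Assumption: each method takes an array of bytes.
def parse_with_table(parser, path)
  bytes = File.binread(path).bytes
  parser.public_send(method_for(path), bytes)
end
```

With a real parser this would be called as `parse_with_table(ParseKit::Parser.new, "report.docx")`.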

## Performance

ParseKit is built with performance in mind:

- Native Rust implementation for speed

- Statically linked C libraries (MuPDF, Tesseract) compiled with optimizations

- Efficient memory usage with streaming where possible

- Configurable size limits to prevent memory issues

## Development

After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests.

To compile the Rust extension:

```bash
rake compile
```

To run tests with coverage:

```bash
rake dev:coverage
```

### OCR Mode Configuration

By default, ParseKit bundles Tesseract for zero-dependency OCR support. Advanced users who already have Tesseract installed system-wide and want faster gem installation can use system mode:

**Using system Tesseract during installation:**

```bash
gem install parsekit -- --no-default-features
```

**For development with system Tesseract:**

```bash
rake compile CARGO_FEATURES=""  # Disables bundled-tesseract feature
```

**System Tesseract requirements:**

- **macOS**: `brew install tesseract`

- **Ubuntu/Debian**: `sudo apt-get install libtesseract-dev`

- **Fedora/RHEL**: `sudo dnf install tesseract-devel`

The bundled mode adds ~1-3 minutes to initial gem installation but provides a completely self-contained experience with no external dependencies.

## Architecture

ParseKit uses a hybrid Ruby/Rust architecture:

- **Ruby Layer**: Provides convenient API and format detection

- **Rust Layer**: Implements high-performance parsing using:

  - MuPDF for PDF text extraction (statically linked)

  - tesseract-rs for OCR (with bundled Tesseract by default)

  - Pure Rust libraries for DOCX/XLSX parsing

  - Magnus for Ruby-Rust FFI bindings
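
The list above says the Ruby layer handles format detection but does not describe how. One common approach, shown here purely as an illustration and not as ParseKit's actual logic, is to inspect the leading bytes of the data. `SIGNATURES` and `sniff_format` are hypothetical names.

```ruby
# Well-known leading-byte signatures for a few of the listed formats.
# This is an illustrative sketch, not ParseKit's detection code.
SIGNATURES = {
  pdf:  "%PDF-".b,
  png:  "\x89PNG\r\n\x1A\n".b,
  jpeg: "\xFF\xD8\xFF".b,
  zip:  "PK\x03\x04".b # DOCX, XLSX and PPTX files are ZIP containers
}.freeze

# Returns a symbol from SIGNATURES, or nil when nothing matches.
# `bytes` is a String holding the raw file contents.
def sniff_format(bytes)
  head = bytes.byteslice(0, 8).to_s.b
  match = SIGNATURES.find { |_name, sig| head.start_with?(sig) }
  match&.first
end
```

Content sniffing complements extension checks: a ZIP signature alone cannot distinguish DOCX from XLSX, so the extension or the archive contents would still be needed for those.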

## Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/scientist-labs/parsekit.

## License

The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).

Note: This gem includes statically linked versions of MuPDF (AGPL/Commercial) and Tesseract (Apache 2.0). Please review their respective licenses for compliance with your use case.
