https://github.com/scientist-labs/parsekit
Ruby document parsing toolkit with zero runtime dependencies. Parse PDFs, DOCX, XLSX, and images (with OCR) using a single, lightweight gem. Statically links MuPDF and Tesseract at compile time for hassle-free installation - no system libraries or external tools required.
- Host: GitHub
- URL: https://github.com/scientist-labs/parsekit
- Owner: scientist-labs
- License: mit
- Created: 2025-08-21T13:00:37.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-09-06T16:26:56.000Z (10 months ago)
- Last Synced: 2025-09-06T16:42:09.815Z (10 months ago)
- Topics: content, extraction, metadata, ruby
- Language: Ruby
- Homepage:
- Size: 1.93 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.txt

Native Ruby bindings for the [parser-core](https://crates.io/crates/parser-core) Rust crate, providing high-performance document parsing and text extraction capabilities through Magnus. This gem wraps parser-core to extract text from PDFs, Office documents (DOCX, XLSX), images (with OCR), and more. Part of the ruby-nlp ecosystem.
## Features
- **Document Parsing**: Extract text from PDFs and Office documents (DOCX, XLSX)
- **OCR Support**: Extract text from images using Tesseract OCR
- **High Performance**: Native Rust performance with Ruby convenience
- **Unified API**: Single interface for multiple document formats
- **Cross-Platform**: Works on Linux, macOS, and Windows
- **Well Tested**: Comprehensive test suite with RSpec
## Installation
Add this line to your application's Gemfile:
```ruby
gem 'parsekit'
```
And then execute:

    $ bundle install

Or install it yourself as:
```bash
gem install parsekit
```
### Requirements
- Ruby >= 3.0.0
- Rust toolchain (stable)
- C compiler (for linking)
That's it! These are build-time requirements only. ParseKit bundles all necessary libraries, including MuPDF for PDF extraction and Tesseract for OCR, so you don't need to install any other system dependencies.
## Usage
### Basic Usage
```ruby
require 'parsekit'
# Parse a PDF file
text = ParseKit.parse_file("document.pdf")
puts text # Extracted text from the PDF
# Parse an Excel file
text = ParseKit.parse_file("spreadsheet.xlsx")
puts text # Extracted text from all sheets
# Parse binary data directly
file_data = File.binread("document.pdf")
text = ParseKit.parse_bytes(file_data)
puts text
# Parse with a Parser instance
parser = ParseKit::Parser.new
text = parser.parse_file("report.docx")
puts text
```
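When parsing many files, it can help to collect failures rather than stop at the first one. The following is a minimal sketch, not part of the gem: `parse_all` is a hypothetical helper, and `parser` is any object that responds to `parse_file` (with the gem loaded, a `ParseKit::Parser` instance). Which exception classes ParseKit raises is not documented here, so the sketch rescues `StandardError` broadly.

```ruby
# Hypothetical helper: parse every path with one parser object and
# collect failures instead of aborting on the first error.
# `parser` only needs to respond to #parse_file(path).
def parse_all(parser, paths)
  results = {}
  errors = {}
  paths.each do |path|
    begin
      results[path] = parser.parse_file(path)
    rescue StandardError => e
      # Assumption: parse errors surface as StandardError subclasses.
      errors[path] = e.message
    end
  end
  [results, errors]
end
```

A call would look like `texts, failures = parse_all(ParseKit::Parser.new, Dir.glob("docs/*.pdf"))`, after which `failures` maps each unreadable path to its error message.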
### Module-Level Convenience Methods
```ruby
# Parse files directly
content = ParseKit.parse_file('document.pdf')
# Parse bytes
data = File.read('document.pdf', mode: 'rb')
content = ParseKit.parse_bytes(data.bytes)
# Check supported formats
formats = ParseKit.supported_formats
# => ["txt", "json", "xml", "html", "docx", "xlsx", "xls", "csv", "pdf", "png", "jpg", "jpeg", "tiff", "bmp"]
# Check if a file is supported
ParseKit.supports_file?('document.pdf') # => true
```
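`ParseKit.supports_file?` answers the question for one path. If you also want the list of files that were rejected, the extension list can be used directly. This is a small sketch in plain Ruby: `partition_by_format` is a hypothetical helper, and the `formats` array below is copied from the example output above rather than fetched from the gem.

```ruby
# Hypothetical helper: split paths into [supported, unsupported] by extension.
# `formats` holds extensions without the leading dot, in the same shape as
# the ParseKit.supported_formats output shown above.
def partition_by_format(paths, formats)
  known = formats.map(&:downcase)
  paths.partition do |path|
    ext = File.extname(path).delete_prefix(".").downcase
    known.include?(ext)
  end
end

# Copied from the example output above; in an application this would be
# ParseKit.supported_formats.
formats = %w[txt json xml html docx xlsx xls csv pdf png jpg jpeg tiff bmp]
supported, skipped = partition_by_format(["report.PDF", "notes.md", "scan.png"], formats)
puts "will parse: #{supported.join(', ')}"
puts "skipping:   #{skipped.join(', ')}"
```

The comparison is case-insensitive, so `report.PDF` counts as supported. Whether the gem itself treats extensions case-insensitively is not stated here.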
### Configuration Options
```ruby
# Create parser with options
parser = ParseKit::Parser.new(
  strict_mode: true,
  max_size: 50 * 1024 * 1024, # 50MB limit
  encoding: 'UTF-8'
)
# Or use the strict convenience method
parser = ParseKit::Parser.strict
```
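The `max_size` value is a byte count: `50 * 1024 * 1024` is 52,428,800 bytes. If you want to reject oversized files before handing them to the parser at all, a pre-check along these lines is one option. This is a sketch only: `within_size_limit?` and `MAX_SIZE` are hypothetical names, and whether the gem's own limit is inclusive is an assumption.

```ruby
require "tempfile"

# Same figure as the max_size example above, in bytes.
MAX_SIZE = 50 * 1024 * 1024

# Hypothetical helper: true when the file is no larger than `limit` bytes.
# Assumption: the limit is treated as inclusive.
def within_size_limit?(path, limit = MAX_SIZE)
  File.size(path) <= limit
end

# Demonstrate with a 1 KiB temporary file.
Tempfile.create("sample") do |f|
  f.write("x" * 1024)
  f.flush
  puts within_size_limit?(f.path)      # prints true (1024 <= 52428800)
  puts within_size_limit?(f.path, 512) # prints false (1024 > 512)
end
```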
### Format-Specific Parsing
```ruby
parser = ParseKit::Parser.new
# Direct access to format-specific parsers
pdf_data = File.read('document.pdf', mode: 'rb').bytes
pdf_text = parser.parse_pdf(pdf_data)
image_data = File.read('image.png', mode: 'rb').bytes
ocr_text = parser.ocr_image(image_data)
excel_data = File.read('data.xlsx', mode: 'rb').bytes
excel_text = parser.parse_xlsx(excel_data)
```
## Supported Formats
| Format | Extensions | Method | Notes |
|--------|------------|--------|-------|
| PDF | .pdf | `parse_pdf` | Text extraction via MuPDF |
| Word | .docx | `parse_docx` | Office Open XML format |
| Excel | .xlsx, .xls | `parse_xlsx` | Both modern and legacy formats |
| PowerPoint | .pptx | `parse_pptx` | Text extraction from slides and notes |
| Images | .png, .jpg, .jpeg, .tiff, .bmp | `ocr_image` | OCR via bundled Tesseract |
| JSON | .json | `parse_json` | Pretty-printed output |
| XML/HTML | .xml, .html | `parse_xml` | Extracts text content |
| Text | .txt, .csv, .md | `parse_text` | With encoding detection |
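If you need to choose a format-specific method yourself, the table above can be written as a lookup. The sketch below is transcribed from the table; `PARSER_METHODS` and `parser_method_for` are hypothetical names, not part of the gem.

```ruby
# Extension-to-method lookup transcribed from the Supported Formats table.
PARSER_METHODS = {
  ".pdf"  => :parse_pdf,
  ".docx" => :parse_docx,
  ".xlsx" => :parse_xlsx,
  ".xls"  => :parse_xlsx,
  ".pptx" => :parse_pptx,
  ".png"  => :ocr_image,
  ".jpg"  => :ocr_image,
  ".jpeg" => :ocr_image,
  ".tiff" => :ocr_image,
  ".bmp"  => :ocr_image,
  ".json" => :parse_json,
  ".xml"  => :parse_xml,
  ".html" => :parse_xml,
  ".txt"  => :parse_text,
  ".csv"  => :parse_text,
  ".md"   => :parse_text
}.freeze

# Hypothetical helper: returns the method name for a path, or nil if the
# extension is not in the table.
def parser_method_for(path)
  PARSER_METHODS[File.extname(path).downcase]
end

puts parser_method_for("Report.PDF").inspect  # prints :parse_pdf
puts parser_method_for("archive.zip").inspect # prints nil
```

With a parser instance, the result could be dispatched as `parser.public_send(parser_method_for(path), File.binread(path).bytes)`. That every method in the table accepts a byte array the way `parse_pdf` does in the earlier example is an assumption.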
## Performance
ParseKit is built with performance in mind:
- Native Rust implementation for speed
- Statically linked C libraries (MuPDF, Tesseract) compiled with optimizations
- Efficient memory usage with streaming where possible
- Configurable size limits to prevent memory issues
## Development
After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests.
To compile the Rust extension:
```bash
rake compile
```
To run tests with coverage:
```bash
rake dev:coverage
```
### OCR Mode Configuration
By default, ParseKit bundles Tesseract for zero-dependency OCR support. Advanced users who already have Tesseract installed system-wide and want faster gem installation can use system mode:
**Using system Tesseract during installation:**
```bash
gem install parsekit -- --no-default-features
```
**For development with system Tesseract:**
```bash
rake compile CARGO_FEATURES="" # Disables bundled-tesseract feature
```
**System Tesseract requirements:**
- **macOS**: `brew install tesseract`
- **Ubuntu/Debian**: `sudo apt-get install libtesseract-dev`
- **Fedora/RHEL**: `sudo dnf install tesseract-devel`
The bundled mode adds ~1-3 minutes to initial gem installation but provides a completely self-contained experience with no external dependencies.
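Before opting into system mode, you may want a quick check that Tesseract is present on the machine. The sketch below looks for a `tesseract` executable on `PATH` using plain Ruby. `find_executable` is a hypothetical helper, and finding the command-line binary does not guarantee that the development headers needed for linking are installed, so treat a hit as a hint only.

```ruby
# Hypothetical helper: return the full path of the first executable called
# `name` found in the directories of `path_env`, or nil if there is none.
# Note: this does not try Windows extensions such as .exe.
def find_executable(name, path_env = ENV.fetch("PATH", ""))
  path_env.split(File::PATH_SEPARATOR).each do |dir|
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

if (path = find_executable("tesseract"))
  puts "tesseract found at #{path}; system mode may be an option"
else
  puts "tesseract not found on PATH; the bundled default is the simpler choice"
end
```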
## Architecture
ParseKit uses a hybrid Ruby/Rust architecture:
- **Ruby Layer**: Provides convenient API and format detection
- **Rust Layer**: Implements high-performance parsing using:
  - MuPDF for PDF text extraction (statically linked)
  - tesseract-rs for OCR (with bundled Tesseract by default)
  - Pure Rust libraries for DOCX/XLSX parsing
  - Magnus for Ruby-Rust FFI bindings
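
To illustrate what content-based format detection can look like, here is a sketch that inspects the leading bytes of a file. It is not ParseKit's actual detection logic, which is not described here. `sniff_format` is a hypothetical helper, and the signatures are the well-known file headers for PDF, PNG, JPEG, and ZIP (DOCX, XLSX, and PPTX files are ZIP containers).

```ruby
# Hypothetical helper: guess a broad format family from the first bytes of
# `data` (a String). Illustrative only, not ParseKit's detection code.
def sniff_format(data)
  head = data.byteslice(0, 8).to_s.b # compare as raw bytes
  return :pdf if head.start_with?("%PDF-".b)
  return :png if head.start_with?("\x89PNG\r\n\x1A\n".b)
  return :jpeg if head.start_with?("\xFF\xD8\xFF".b)
  # DOCX, XLSX and PPTX are ZIP archives, so they share this signature.
  return :zip_container if head.start_with?("PK\x03\x04".b)
  :unknown
end

puts sniff_format("%PDF-1.7\n...").inspect # prints :pdf
puts sniff_format("plain text").inspect    # prints :unknown
```

A ZIP signature alone cannot tell DOCX from XLSX, which is one reason to combine a check like this with the file extension.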
## Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/scientist-labs/parsekit.
## License
The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
Note: This gem includes statically linked versions of MuPDF (AGPL/Commercial) and Tesseract (Apache 2.0). Please review their respective licenses for compliance with your use case.