https://github.com/oeo/processor-rs
High-performance document processing pipeline in Rust. Extracts text, performs OCR, and optimizes images from PDFs and other document formats with parallel processing and memory efficiency.
document-processing image-optimization parallel-processing rust tesseract-ocr text-extraction
- Host: GitHub
- URL: https://github.com/oeo/processor-rs
- Owner: oeo
- Created: 2025-01-15T09:21:45.000Z (12 months ago)
- Default Branch: master
- Last Pushed: 2025-01-15T09:40:10.000Z (12 months ago)
- Last Synced: 2025-03-15T18:52:26.185Z (10 months ago)
- Topics: document-processing, image-optimization, parallel-processing, rust, tesseract-ocr, text-extraction
- Language: Rust
- Homepage:
- Size: 42 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# processor-rs
A high-performance document processing pipeline written in Rust that supports multiple file formats and provides text extraction, OCR capabilities, and image processing.
## Features
- Multi-format document processing:
  - Text files (txt, csv)
  - Office documents (docx, xlsx, pptx)
  - PDF files with advanced rendering
  - Images (jpg, png, gif, bmp, tiff, webp)
  - Spreadsheets (csv, xls, xlsx)
- Advanced Processing Capabilities:
  - Text extraction from various document formats
  - OCR (Optical Character Recognition) for images and scanned documents
  - Spreadsheet data parsing and formatting
  - PDF processing with 1.5x render scale for optimal quality
  - Intelligent text quality assessment
  - Advanced OCR filtering and validation
- Performance Optimizations:
  - Parallel image processing using rayon
  - Pre-allocated buffers with exact capacity
  - Single-pass pixel processing
  - Efficient alpha blending with white background
  - Fast image resizing using Triangle filter
  - Memory-efficient buffer reuse
  - Optimized PDF to image conversion
  - Smart page selection for large documents
- Quality Control Features:
  - Comprehensive OCR quality validation:
    - Character validity ratio checks (80% minimum valid characters)
    - Special character ratio limits (15% maximum)
    - Word length analysis (1-20 characters per word)
    - Word count validation (minimum 3 words)
    - Average word length checks (2-15 characters)
    - Single-character word ratio limits (30% maximum)
    - Word-like token validation (40% minimum)
    - Repeated character detection
  - Text cleaning and normalization:
    - Whitespace normalization
    - Line break standardization
    - Special character cleanup
    - Artifact removal
- Architecture Features:
  - Async processing pipeline
  - Configurable memory limits
  - Multi-threaded processing
  - Temporary file management
  - Progress tracking and metrics
  - Memory-efficient image handling
  - Optimized text processing
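To illustrate the "single-pass pixel processing", "pre-allocated buffers with exact capacity", and "alpha blending with white background" items, here is a minimal std-only sketch. The function name `flatten_rgba_on_white` is hypothetical and is not taken from the crate; the sketch assumes straight (non-premultiplied) 8-bit RGBA input and runs sequentially, whereas the crate itself reportedly parallelizes image work with rayon.

```rust
/// Flatten straight-alpha RGBA8 pixels onto an opaque white background.
/// Hypothetical helper for illustration; not the crate's actual API.
pub fn flatten_rgba_on_white(rgba: &[u8]) -> Result<Vec<u8>, String> {
    if rgba.len() % 4 != 0 {
        return Err(format!(
            "RGBA buffer length {} is not a multiple of 4",
            rgba.len()
        ));
    }

    // Allocate the output once, with exactly 3 bytes per input pixel.
    let mut rgb = Vec::with_capacity(rgba.len() / 4 * 3);

    // One pass over the input: each pixel is read once and written once.
    for px in rgba.chunks_exact(4) {
        let alpha = px[3] as u32;
        for &channel in &px[..3] {
            // out = (channel * alpha + white * (255 - alpha)) / 255, rounded.
            let blended = (channel as u32 * alpha + 255 * (255 - alpha) + 127) / 255;
            rgb.push(blended as u8);
        }
    }
    Ok(rgb)
}
```

A fully opaque pixel passes through unchanged, and a fully transparent one becomes white, which is usually what OCR engines expect for page backgrounds.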
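The OCR quality thresholds listed under Quality Control Features could be combined into a single pass/fail gate roughly as follows. This is a sketch, not the crate's implementation: the names `OcrThresholds` and `passes_ocr_quality` are hypothetical, and the README does not define what counts as a "valid" character, a "word-like" token, or a repeated-character run, so those definitions (and the run limit of 4) are assumptions.

```rust
/// Thresholds taken from the feature list; the struct itself is hypothetical.
#[derive(Debug, Clone, Copy)]
pub struct OcrThresholds {
    pub min_valid_char_ratio: f64,
    pub max_special_char_ratio: f64,
    pub max_word_len: usize,
    pub min_word_count: usize,
    pub min_avg_word_len: f64,
    pub max_avg_word_len: f64,
    pub max_single_char_ratio: f64,
    pub min_word_like_ratio: f64,
    /// Assumption: the README names no limit for repeated characters.
    pub max_char_run: usize,
}

impl Default for OcrThresholds {
    fn default() -> Self {
        Self {
            min_valid_char_ratio: 0.80,
            max_special_char_ratio: 0.15,
            max_word_len: 20,
            min_word_count: 3,
            min_avg_word_len: 2.0,
            max_avg_word_len: 15.0,
            max_single_char_ratio: 0.30,
            min_word_like_ratio: 0.40,
            max_char_run: 4,
        }
    }
}

/// Returns true when `text` clears every threshold in `t`.
pub fn passes_ocr_quality(text: &str, t: &OcrThresholds) -> bool {
    let chars: Vec<char> = text.chars().filter(|c| !c.is_whitespace()).collect();
    if chars.is_empty() {
        return false;
    }
    let total = chars.len() as f64;

    // Assumed definitions: "valid" = alphanumeric or ASCII punctuation,
    // "special" = anything non-alphanumeric (whitespace already excluded).
    let valid = chars
        .iter()
        .filter(|c| c.is_alphanumeric() || c.is_ascii_punctuation())
        .count() as f64;
    let special = chars.iter().filter(|c| !c.is_alphanumeric()).count() as f64;
    if valid / total < t.min_valid_char_ratio || special / total > t.max_special_char_ratio {
        return false;
    }

    let words: Vec<&str> = text.split_whitespace().collect();
    if words.len() < t.min_word_count {
        return false;
    }
    // split_whitespace never yields empty tokens, so the 1-character minimum holds.
    let lens: Vec<usize> = words.iter().map(|w| w.chars().count()).collect();
    if lens.iter().any(|&n| n > t.max_word_len) {
        return false;
    }

    let n = words.len() as f64;
    let avg = lens.iter().sum::<usize>() as f64 / n;
    if avg < t.min_avg_word_len || avg > t.max_avg_word_len {
        return false;
    }
    let single = lens.iter().filter(|&&len| len == 1).count() as f64;
    if single / n > t.max_single_char_ratio {
        return false;
    }
    let word_like = words.iter().filter(|&&w| is_word_like(w)).count() as f64;
    if word_like / n < t.min_word_like_ratio {
        return false;
    }

    !has_long_run(text, t.max_char_run)
}

/// Assumed definition: at least two letters once edge punctuation is trimmed.
fn is_word_like(token: &str) -> bool {
    let core = token.trim_matches(|c: char| c.is_ascii_punctuation());
    core.chars().count() >= 2 && core.chars().all(|c| c.is_alphabetic())
}

/// True if any non-whitespace character repeats more than `max_run` times in a row.
fn has_long_run(text: &str, max_run: usize) -> bool {
    let mut prev: Option<char> = None;
    let mut run = 0usize;
    for c in text.chars() {
        if c.is_whitespace() {
            prev = None;
            run = 0;
            continue;
        }
        if prev == Some(c) {
            run += 1;
        } else {
            prev = Some(c);
            run = 1;
        }
        if run > max_run {
            return true;
        }
    }
    false
}
```

Ordering the cheap character-level checks before the word-level ones lets obviously noisy OCR output be rejected early.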
## Installation
### Prerequisites
- Rust toolchain (1.75.0 or later recommended)
- Tesseract OCR engine for image text extraction
- System dependencies:
```bash
# Ubuntu/Debian
sudo apt-get install libleptonica-dev tesseract-ocr libtesseract-dev clang
# macOS
brew install tesseract leptonica
```
### Configuration
The processor supports various configuration options:
- OCR quality thresholds
- Maximum image size limits
- Memory usage constraints
- Temporary file handling
- Processing timeouts
- Thread count control
## Usage
Basic usage through command line:
```bash
processor-rs run [options]

Options:
  --format       Output format (json, html, protobuf)
  --config       Use custom config file
  --temp-dir     Specify temporary directory
  --keep-temps   Keep temporary files
  --verbose      Enable verbose logging
  --max-memory   Set maximum memory usage
  --timeout      Set processing timeout
```
## Output Formats
The processor supports multiple output formats:
### JSON
Structured output including:
- Extracted text with quality metrics
- OCR results with confidence scores
- Optimized image attachments
- Processing metadata and timing information
- Error and warning logs
### HTML
Clean, minimal visualization with:
- Extracted text content
- Image previews
- Processing metadata
### Protobuf
Binary format for efficient machine processing.
## Error Handling
The processor includes comprehensive error handling for:
- Invalid file formats
- OCR processing failures
- Memory constraints
- Timeout conditions
- File system errors
- Buffer size mismatches
- Image conversion issues
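The README does not show the crate's actual error type, so the following is only a sketch of how the categories above might map onto a single Rust enum. `ProcessError`, its variants, and `check_buffer` are hypothetical names introduced for illustration.

```rust
use std::fmt;
use std::time::Duration;

/// Hypothetical error type with one variant per category listed above.
#[derive(Debug)]
pub enum ProcessError {
    UnsupportedFormat(String),
    Ocr(String),
    MemoryLimit { requested: usize, limit: usize },
    Timeout(Duration),
    Io(std::io::Error),
    BufferSizeMismatch { expected: usize, actual: usize },
    ImageConversion(String),
}

impl fmt::Display for ProcessError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::UnsupportedFormat(ext) => write!(f, "unsupported file format: {ext}"),
            Self::Ocr(msg) => write!(f, "OCR failed: {msg}"),
            Self::MemoryLimit { requested, limit } => {
                write!(f, "memory limit exceeded: need {requested} bytes, limit is {limit}")
            }
            Self::Timeout(d) => write!(f, "processing timed out after {} s", d.as_secs()),
            Self::Io(err) => write!(f, "file system error: {err}"),
            Self::BufferSizeMismatch { expected, actual } => {
                write!(f, "buffer size mismatch: expected {expected} bytes, got {actual}")
            }
            Self::ImageConversion(msg) => write!(f, "image conversion failed: {msg}"),
        }
    }
}

impl std::error::Error for ProcessError {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            Self::Io(err) => Some(err),
            _ => None,
        }
    }
}

/// Lets `?` convert I/O errors into `ProcessError` automatically.
impl From<std::io::Error> for ProcessError {
    fn from(err: std::io::Error) -> Self {
        Self::Io(err)
    }
}

/// Example check: a raw pixel buffer must match its declared dimensions.
pub fn check_buffer(
    width: usize,
    height: usize,
    channels: usize,
    buf: &[u8],
) -> Result<(), ProcessError> {
    let expected = width * height * channels;
    if buf.len() == expected {
        Ok(())
    } else {
        Err(ProcessError::BufferSizeMismatch { expected, actual: buf.len() })
    }
}
```

Keeping every failure mode in one enum lets callers match on the specific condition (for example, retrying on a timeout but aborting on an unsupported format).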
## API Usage
```rust
// `Query` and `QueryMetadata` are assumed to be exported from the crate root;
// adjust the import path if they live in a submodule.
use processor_rs::{Config, Processor, Query, QueryMetadata, Strategy};
use processor_rs::steps::{ImageProcessor, PDFProcessor, TextProcessor};

async fn process_document() {
    // Start from the default config; override fields here to customize
    let config = Config::default();
    let mut processor = Processor::new(config);

    // Register the processing steps
    processor.add_step(TextProcessor);
    processor.add_step(PDFProcessor);
    processor.add_step(ImageProcessor);

    // Describe the document to process
    let mut query = Query {
        file_path: "document.pdf".to_string(),
        file_type: "pdf".to_string(),
        strategy: Strategy::PDF.to_string(),
        prompt_parts: Vec::new(),
        attachments: Vec::new(),
        system: "You are a helpful assistant.".to_string(),
        prompt: String::new(),
        metadata: Some(QueryMetadata::default()),
    };

    // Run the pipeline; `unwrap` panics on failure, so real callers
    // should handle the error instead
    let result = processor.process(&mut query).await.unwrap();
}
```
## Supported File Types
| Category | Extensions |
|----------|------------|
| Text | txt, csv |
| Office | docx, xlsx |
| Spreadsheets | xls, xlsx |
| Images | bmp, gif, jpg, jpeg, png, tiff, webp |
| PDF | pdf |
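As an illustration of how a caller might route files by the categories in this table, here is a small std-only sketch. The `Category` enum and `categorize` function are hypothetical and not part of the crate. Since `xlsx` appears under both Office and Spreadsheets, the sketch assumes it should be treated as a spreadsheet, and it places `pptx` under Office following the feature list.

```rust
use std::path::Path;

/// Hypothetical category enum mirroring the table rows.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Category {
    Text,
    Office,
    Spreadsheet,
    Image,
    Pdf,
}

/// Map a file path to a category by its extension (case-insensitive).
/// Returns `None` for missing or unsupported extensions.
pub fn categorize(path: &str) -> Option<Category> {
    let ext = Path::new(path).extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "txt" | "csv" => Some(Category::Text),
        // `pptx` is taken from the feature list rather than the table.
        "docx" | "pptx" => Some(Category::Office),
        // Assumption: `xlsx` is routed to the spreadsheet path.
        "xls" | "xlsx" => Some(Category::Spreadsheet),
        "bmp" | "gif" | "jpg" | "jpeg" | "png" | "tiff" | "webp" => Some(Category::Image),
        "pdf" => Some(Category::Pdf),
        _ => None,
    }
}
```

Returning `Option` rather than panicking keeps unsupported files as an ordinary, recoverable case for the caller.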