https://github.com/oeo/processor-rs

High-performance document processing pipeline in Rust. Extracts text, performs OCR, and optimizes images from PDFs and other document formats with parallel processing and memory efficiency.
https://github.com/oeo/processor-rs

document-processing image-optimization parallel-processing rust tesseract-ocr text-extraction

Last synced: 9 months ago
JSON representation

High-performance document processing pipeline in Rust. Extracts text, performs OCR, and optimizes images from PDFs and other document formats with parallel processing and memory efficiency.

Host: GitHub
URL: https://github.com/oeo/processor-rs
Owner: oeo
Created: 2025-01-15T09:21:45.000Z (about 1 year ago)
Default Branch: master
Last Pushed: 2025-01-15T09:40:10.000Z (about 1 year ago)
Last Synced: 2025-03-15T18:52:26.185Z (11 months ago)
Topics: document-processing, image-optimization, parallel-processing, rust, tesseract-ocr, text-extraction
Language: Rust
Homepage:
Size: 42 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

          # processor-rs

A high-performance document processing pipeline written in Rust that supports multiple file formats and provides text extraction, OCR capabilities, and image processing.

## Features

- Multi-format document processing:

  - Text files (txt, csv)

  - Office documents (docx, xlsx, pptx)

  - PDF files with advanced rendering

  - Images (jpg, png, gif, bmp, tiff, webp)

  - Spreadsheets (csv, xls, xlsx)

- Advanced Processing Capabilities:

  - Text extraction from various document formats

  - OCR (Optical Character Recognition) for images and scanned documents

  - Spreadsheet data parsing and formatting

  - PDF processing with 1.5x render scale for optimal quality

  - Intelligent text quality assessment

  - Advanced OCR filtering and validation

- Performance Optimizations:

  - Parallel image processing using rayon

  - Pre-allocated buffers with exact capacity

  - Single-pass pixel processing

  - Efficient alpha blending with white background

  - Fast image resizing using Triangle filter

  - Memory-efficient buffer reuse

  - Optimized PDF to image conversion

  - Smart page selection for large documents

- Quality Control Features:

  - Comprehensive OCR quality validation:

    - Character validity ratio checks (80% minimum valid characters)

    - Special character ratio limits (15% maximum)

    - Word length analysis (1-20 characters per word)

    - Word count validation (minimum 3 words)

    - Average word length checks (2-15 characters)

    - Single-character word ratio limits (30% maximum)

    - Word-like token validation (40% minimum)

    - Repeated character detection

  - Text cleaning and normalization:

    - Whitespace normalization

    - Line break standardization

    - Special character cleanup

    - Artifact removal

- Architecture Features:

  - Async processing pipeline

  - Configurable memory limits

  - Multi-threaded processing

  - Temporary file management

  - Progress tracking and metrics

  - Memory-efficient image handling

  - Optimized text processing

## Installation

### Prerequisites

- Rust toolchain (1.75.0 or later recommended)

- Tesseract OCR engine for image text extraction

- System dependencies:

  ```bash

  # Ubuntu/Debian

  sudo apt-get install leptonica-dev tesseract-ocr libtesseract-dev clang

  # macOS

  brew install tesseract leptonica

  ```

### Configuration

The processor supports various configuration options:

- OCR quality thresholds

- Maximum image size limits

- Memory usage constraints

- Temporary file handling

- Processing timeouts

- Thread count control

## Usage

Basic usage through command line:

```bash

processor-rs run  [options]

Options:

  --format      Output format (json, html, protobuf)

  --config        Use custom config file

  --temp-dir       Specify temporary directory

  --keep-temps          Keep temporary files

  --verbose            Enable verbose logging

  --max-memory     Set maximum memory usage

  --timeout   Set processing timeout

```

## Output Formats

The processor supports multiple output formats:

### JSON

Structured output including:

- Extracted text with quality metrics

- OCR results with confidence scores

- Optimized image attachments

- Processing metadata and timing information

- Error and warning logs

### HTML

Clean, minimal visualization with:

- Extracted text content

- Image previews

- Processing metadata

### Protobuf

Binary format for efficient machine processing.

## Error Handling

The processor includes comprehensive error handling for:

- Invalid file formats

- OCR processing failures

- Memory constraints

- Timeout conditions

- File system errors

- Buffer size mismatches

- Image conversion issues

## API Usage

```rust

use processor_rs::{Config, Processor, Strategy};

use processor_rs::steps::{TextProcessor, PDFProcessor, ImageProcessor};

async fn process_document() {

    // Initialize with custom config

    let config = Config::default();

    let mut processor = Processor::new(config);

    // Add processing steps

    processor.add_step(TextProcessor);

    processor.add_step(PDFProcessor);

    processor.add_step(ImageProcessor);

    // Process document

    let mut query = Query {

        file_path: "document.pdf".to_string(),

        file_type: "pdf".to_string(),

        strategy: Strategy::PDF.to_string(),

        prompt_parts: Vec::new(),

        attachments: Vec::new(),

        system: "You are a helpful assistant.".to_string(),

        prompt: String::new(),

        metadata: Some(QueryMetadata::default()),

    };

    let result = processor.process(&mut query).await.unwrap();

}

```

## Supported File Types

| Category | Extensions |

|----------|------------|

| Text | txt, csv |

| Office | docx, xlsx |

| Spreadsheets | xls, xlsx |

| Images | bmp, gif, jpg, jpeg, png, tiff, webp |

| PDF | pdf |

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/oeo/processor-rs

Awesome Lists containing this project

README