{"id":24530883,"url":"https://github.com/oeo/processor-rs","last_synced_at":"2025-06-10T23:05:00.367Z","repository":{"id":272574316,"uuid":"917058672","full_name":"oeo/processor-rs","owner":"oeo","description":"High-performance document processing pipeline in Rust. Extracts text, performs OCR, and optimizes images from PDFs and other document formats with parallel processing and memory efficiency.","archived":false,"fork":false,"pushed_at":"2025-01-15T09:40:10.000Z","size":43,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-15T18:52:26.185Z","etag":null,"topics":["document-processing","image-optimization","parallel-processing","rust","tesseract-ocr","text-extraction"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/oeo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-15T09:21:45.000Z","updated_at":"2025-01-15T09:40:11.000Z","dependencies_parsed_at":null,"dependency_job_id":"2aa903c4-7442-4ccc-947f-7a0df6c8c271","html_url":"https://github.com/oeo/processor-rs","commit_stats":null,"previous_names":["oeo/processor-rs"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oeo%2Fprocessor-rs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oeo%2Fprocessor-rs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oeo%2Fprocessor-rs/releases","manifests_url":"https://rep
os.ecosyste.ms/api/v1/hosts/GitHub/repositories/oeo%2Fprocessor-rs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/oeo","download_url":"https://codeload.github.com/oeo/processor-rs/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oeo%2Fprocessor-rs/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259166894,"owners_count":22815586,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["document-processing","image-optimization","parallel-processing","rust","tesseract-ocr","text-extraction"],"created_at":"2025-01-22T08:18:05.811Z","updated_at":"2025-06-10T23:05:00.360Z","avatar_url":"https://github.com/oeo.png","language":"Rust","readme":"# processor-rs\n\nA high-performance document processing pipeline written in Rust that supports multiple file formats and provides text extraction, OCR capabilities, and image processing.\n\n## Features\n\n- Multi-format document processing:\n  - Text files (txt, csv)\n  - Office documents (docx, xlsx, pptx)\n  - PDF files with advanced rendering\n  - Images (jpg, png, gif, bmp, tiff, webp)\n  - Spreadsheets (csv, xls, xlsx)\n\n- Advanced Processing Capabilities:\n  - Text extraction from various document formats\n  - OCR (Optical Character Recognition) for images and scanned documents\n  - Spreadsheet data parsing and formatting\n  - PDF processing with 1.5x render scale for optimal quality\n  - Intelligent text quality assessment\n  - Advanced OCR filtering and validation\n\n- Performance Optimizations:\n  - Parallel image 
processing using rayon\n  - Pre-allocated buffers with exact capacity\n  - Single-pass pixel processing\n  - Efficient alpha blending with white background\n  - Fast image resizing using Triangle filter\n  - Memory-efficient buffer reuse\n  - Optimized PDF to image conversion\n  - Smart page selection for large documents\n\n- Quality Control Features:\n  - Comprehensive OCR quality validation:\n    - Character validity ratio checks (80% minimum valid characters)\n    - Special character ratio limits (15% maximum)\n    - Word length analysis (1-20 characters per word)\n    - Word count validation (minimum 3 words)\n    - Average word length checks (2-15 characters)\n    - Single-character word ratio limits (30% maximum)\n    - Word-like token validation (40% minimum)\n    - Repeated character detection\n  - Text cleaning and normalization:\n    - Whitespace normalization\n    - Line break standardization\n    - Special character cleanup\n    - Artifact removal\n\n- Architecture Features:\n  - Async processing pipeline\n  - Configurable memory limits\n  - Multi-threaded processing\n  - Temporary file management\n  - Progress tracking and metrics\n  - Memory-efficient image handling\n  - Optimized text processing\n\n## Installation\n\n### Prerequisites\n\n- Rust toolchain (1.75.0 or later recommended)\n- Tesseract OCR engine for image text extraction\n- System dependencies:\n  ```bash\n  # Ubuntu/Debian\n  sudo apt-get install leptonica-dev tesseract-ocr libtesseract-dev clang\n\n  # macOS\n  brew install tesseract leptonica\n  ```\n\n### Configuration\n\nThe processor supports various configuration options:\n- OCR quality thresholds\n- Maximum image size limits\n- Memory usage constraints\n- Temporary file handling\n- Processing timeouts\n- Thread count control\n\n## Usage\n\nBasic usage through command line:\n```bash\nprocessor-rs run \u003cinfile\u003e [options]\n\nOptions:\n  --format \u003cformat\u003e     Output format (json, html, protobuf)\n  --config 
\u003cfile\u003e       Use custom config file\n  --temp-dir \u003cdir\u003e      Specify temporary directory\n  --keep-temps          Keep temporary files\n  --verbose             Enable verbose logging\n  --max-memory \u003cmb\u003e     Set maximum memory usage\n  --timeout \u003cseconds\u003e   Set processing timeout\n```\n\n## Output Formats\n\nThe processor supports multiple output formats:\n\n### JSON\nStructured output including:\n- Extracted text with quality metrics\n- OCR results with confidence scores\n- Optimized image attachments\n- Processing metadata and timing information\n- Error and warning logs\n\n### HTML\nClean, minimal visualization with:\n- Extracted text content\n- Image previews\n- Processing metadata\n\n### Protobuf\nBinary format for efficient machine processing.\n\n## Error Handling\n\nThe processor includes comprehensive error handling for:\n- Invalid file formats\n- OCR processing failures\n- Memory constraints\n- Timeout conditions\n- File system errors\n- Buffer size mismatches\n- Image conversion issues\n\n## API Usage\n\n```rust\n// Query and QueryMetadata are assumed to be exported from the crate root;\n// adjust the path if they live in another module.\nuse processor_rs::{Config, Processor, Query, QueryMetadata, Strategy};\nuse processor_rs::steps::{TextProcessor, PDFProcessor, ImageProcessor};\n\nasync fn process_document() {\n    // Initialize with the default config (replace with a custom Config as needed)\n    let config = Config::default();\n    let mut processor = Processor::new(config);\n\n    // Add processing steps\n    processor.add_step(TextProcessor);\n    processor.add_step(PDFProcessor);\n    processor.add_step(ImageProcessor);\n\n    // Describe the document to process\n    let mut query = Query {\n        file_path: \"document.pdf\".to_string(),\n        file_type: \"pdf\".to_string(),\n        strategy: Strategy::PDF.to_string(),\n        prompt_parts: Vec::new(),\n        attachments: Vec::new(),\n        system: \"You are a helpful assistant.\".to_string(),\n        prompt: String::new(),\n        metadata: Some(QueryMetadata::default()),\n    };\n\n    // Run the pipeline; unwrap() panics on failure, so handle the error in real code\n    let result = processor.process(\u0026mut query)
.await.unwrap();\n}\n```\n\n## Supported File Types\n\n| Category | Extensions |\n|----------|------------|\n| Text | txt, csv |\n| Office | docx, xlsx |\n| Spreadsheets | xls, xlsx |\n| Images | bmp, gif, jpg, jpeg, png, tiff, webp |\n| PDF | pdf |\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foeo%2Fprocessor-rs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foeo%2Fprocessor-rs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foeo%2Fprocessor-rs/lists"}