{"id":28719955,"url":"https://github.com/text2doc/redoc","last_synced_at":"2025-07-22T12:33:49.582Z","repository":{"id":297924360,"uuid":"998303147","full_name":"text2doc/redoc","owner":"text2doc","description":"image doc pdf ocr html json converter as DSL pipeline","archived":false,"fork":false,"pushed_at":"2025-06-08T21:08:22.000Z","size":734,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-06-15T15:57:41.023Z","etag":null,"topics":["converter","docs","dsl","html","json","llm","ml","ocr","ollama","pdf","pipeline","tensor","torch"],"latest_commit_sha":null,"homepage":"https://text2doc.github.io/redoc/","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/text2doc.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-08T10:18:38.000Z","updated_at":"2025-06-08T21:08:27.000Z","dependencies_parsed_at":"2025-06-08T11:41:24.162Z","dependency_job_id":null,"html_url":"https://github.com/text2doc/redoc","commit_stats":null,"previous_names":["text2doc/redoc"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/text2doc/redoc","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/text2doc%2Fredoc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/text2doc%2Fredoc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/text2doc%2Fredoc/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposit
ories/text2doc%2Fredoc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/text2doc","download_url":"https://codeload.github.com/text2doc/redoc/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/text2doc%2Fredoc/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265522320,"owners_count":23781652,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["converter","docs","dsl","html","json","llm","ml","ocr","ollama","pdf","pipeline","tensor","torch"],"created_at":"2025-06-15T06:06:14.936Z","updated_at":"2025-07-22T12:33:49.546Z","avatar_url":"https://github.com/text2doc.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n# 📄 Redoc - Universal Document Converter\n\n[![PyPI Version](https://img.shields.io/pypi/v/redoc?color=blue\u0026logo=pypi\u0026logoColor=white)](https://pypi.org/project/redoc/)\n[![Python Version](https://img.shields.io/pypi/pyversions/redoc?logo=python\u0026logoColor=white)](https://www.python.org/)\n[![License](https://img.shields.io/pypi/l/redoc?color=blue)](https://opensource.org/licenses/Apache-2.0)\n[![Documentation Status](https://readthedocs.org/projects/redoc/badge/?version=latest)](https://redoc.readthedocs.io/)\n[![Build Status](https://github.com/text2doc/redoc/actions/workflows/tests.yml/badge.svg)](https://github.com/text2doc/redoc/actions)\n[![Test Coverage](https://codecov.io/gh/text2doc/redoc/branch/main/graph/badge.svg)](https://codecov.io/gh/text2doc/redoc)\n[![Code 
Style](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n[![Docker Pulls](https://img.shields.io/docker/pulls/text2doc/redoc?logo=docker)](https://hub.docker.com/r/text2doc/redoc)\n[![Downloads](https://static.pepy.tech/badge/redoc)](https://pepy.tech/project/redoc)\n[![CodeQL](https://github.com/text2doc/redoc/actions/workflows/codeql-analysis.yml/badge.svg)](https://github.com/text2doc/redoc/actions/workflows/codeql-analysis.yml)\n[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit\u0026logoColor=white)](https://github.com/pre-commit/pre-commit)\n[![OpenSSF Scorecard](https://api.securityscorecards.dev/projects/github.com/text2doc/redoc/badge)](https://api.securityscorecards.dev/projects/github.com/text2doc/redoc)\n[![Discord](https://img.shields.io/discord/1234567890?logo=discord\u0026label=Discord\u0026color=7289DA)](https://discord.gg/softreck)\n[![Twitter Follow](https://img.shields.io/twitter/follow/text2doc?style=social)](https://twitter.com/softreck)\n\n\u003c/div\u003e\n\nRedoc is a powerful, modular document conversion framework that enables seamless transformation between various document formats including PDF, HTML, XML, JSON, DOCX, and EPUB. 
It features OCR capabilities, AI-powered content generation using Ollama Mistral:7b, and a bidirectional template system for document generation and data extraction.\n\n## 🌟 Features\n\n### Core Functionality\n- **Multi-format Support**: Bidirectional conversion between PDF, HTML, XML, JSON, DOCX, and EPUB\n- **Template System**: JSON+HTML templates for dynamic document generation with bidirectional support\n- **OCR Integration**: Extract text from scanned documents and images with Tesseract OCR\n- **AI-Powered**: Leverage Ollama Mistral:7b for intelligent content generation and processing\n- **Bidirectional Processing**: Convert documents to data and back with templates\n- **Batch Processing**: Process multiple documents efficiently with parallel execution\n\n### Advanced Capabilities\n- **Template Variables**: Support for dynamic content and conditional rendering\n- **Validation**: Built-in data validation with Pydantic models\n- **Extensible Architecture**: Plugin system for custom formats and processors\n- **Asynchronous Processing**: Non-blocking operations for high performance\n- **Web Interface**: Modern UI for document conversion and management\n\n### Developer Experience\n- **Comprehensive API**: Clean, well-documented Python API\n- **Command Line Interface**: Intuitive CLI for quick conversions\n- **Interactive Shell**: Built-in Python shell for exploration and debugging\n- **Logging \u0026 Debugging**: Configurable logging and error reporting\n- **Type Hints**: Full type annotations for better IDE support\n\n### Enterprise Ready\n- **Docker Support**: Containerized deployment with Docker and Docker Compose\n- **REST API**: Built with FastAPI for easy integration\n- **Security**: Input validation, sanitization, and secure defaults\n- **Monitoring**: Built-in metrics and health checks\n\n## 🚀 Quick Start\n\n### Installation\n\n#### Using pip (recommended)\n```bash\n# Install the 
latest stable version\npip install redoc\n\n# Install with all optional dependencies\npip install \"redoc[all]\"\n\n# Or install specific components\npip install \"redoc[cli]\"       # Command line interface\npip install \"redoc[server]\"     # Web server and API\npip install \"redoc[ai]\"         # AI features (requires Ollama)\npip install \"redoc[ocr]\"        # OCR capabilities (Tesseract)\npip install \"redoc[templates]\"  # Pre-built templates\n```\n\n#### Using Docker (recommended for production)\n```bash\n# Pull the latest image\ndocker pull text2doc/redoc:latest\n\n# Run a conversion\ndocker run -v $(pwd):/data text2doc/redoc convert input.pdf output.html\n\n# Start the web interface\ndocker run -p 8000:8000 -v $(pwd)/templates:/app/templates text2doc/redoc serve\n```\n\n#### Development Installation\n```bash\ngit clone https://github.com/text2doc/redoc.git\ncd redoc\npip install -e \".[dev]\"  # Install in development mode with all dependencies\npre-commit install  # Install git hooks\n```\n\n## 🛠 Basic Usage\n\n### Command Line Interface\n\n```bash\n# Convert a document\nredoc convert input.pdf output.html\n\n# Convert with a template\nredoc convert --template invoice.html data.json invoice.pdf\n\n# Start interactive shell\nredoc shell\n\n# Start web server\nredoc serve\n```\n\n### Python API\n\n```python\nfrom redoc import Redoc\n\n# Initialize with default settings\nconverter = Redoc()\n\n# Convert between formats\nconverter.convert('document.pdf', 'document.html')  # PDF to HTML\nconverter.convert('data.json', 'report.pdf')       # JSON to PDF with template\n\n# Process multiple files\nconverter.batch_convert(\n    input_glob='invoices/*.json',\n    output_dir='output/',\n    output_format='pdf',\n    template='invoice.html'\n)\n\n# Extract data from documents\ndata = converter.extract_data('document.pdf', 'invoice_schema.json')\n\n# Generate documents from templates\nconverter.generate_document(\n    template='invoice.html',\n    data='data.json',\n  
  output='invoice.pdf'\n)\n\n# Use the interactive shell\nconverter.shell()\n```\n\n#### Command Line Interface\n```bash\n# Show help\nredoc --help\n\n# Convert a document\nredoc convert input.pdf output.html\nredoc convert --template invoice.html data.json invoice.pdf\n\n# Start interactive shell\nredoc shell\n\n# Start web server\nredoc serve --host 0.0.0.0 --port 8000\n\n# Process multiple files\nredoc batch \"documents/*.pdf\" --format html --output-dir html_output\n```\n\n#### Using Templates\n```python\nfrom redoc import Redoc\n\nconverter = Redoc()\n\n# Simple template with variables\ntemplate = {\n    \"template\": \"invoice.html\",\n    \"data\": {\n        \"invoice\": {\n            \"number\": \"INV-2023-001\",\n            \"date\": \"2023-11-15\",\n            \"items\": [\n                {\"description\": \"Web Design\", \"quantity\": 10, \"price\": 100},\n                {\"description\": \"Hosting\", \"quantity\": 1, \"price\": 50}\n            ]\n        }\n    }\n}\n\n# Generate PDF from template\nconverter.convert(template, 'pdf', output_file='invoice.pdf')\n\n# Extract data from document\ndata = converter.extract_data('invoice.pdf', template='invoice_template.html')\n```\n\n## 📚 Supported Conversions\n\n| From \\ To | PDF | HTML | XML | JSON | DOCX | EPUB |\n|-----------|:---:|:----:|:---:|:----:|:----:|:----:|\n| **PDF**   | ❌  | ✅   | ✅  | ✅   | ✅   | ✅   |\n| **HTML**  | ✅  | ❌  | ✅  | ✅   | ✅   | ✅   |\n| **XML**   | ✅  | ✅   | ❌  | ✅   | ✅   | ✅   |\n| **JSON**  | ✅  | ✅   | ✅  | ❌   | ✅   | ✅   |\n| **DOCX**  | ✅  | ✅   | ✅  | ✅   | ❌   | ✅   |\n| **EPUB**  | ✅  | ✅   | ✅  | ✅   | ✅   | ❌   |\n\n### Conversion Features\n\n- **PDF Generation**: High-quality PDF output with support for headers, footers, and page numbers\n- **HTML Processing**: Clean HTML output with customizable CSS styling\n- **Data Extraction**: Extract structured data from documents using templates\n- **Template Variables**: Use Jinja2 syntax for dynamic content\n- 
**Batch Processing**: Process multiple files in parallel\n- **OCR Support**: Extract text from scanned documents and images\n- **AI-Powered**: Enhance documents with AI-generated content\n\n## 🏗️ Project Structure\n\n```\nredoc/\n├── src/\n│   └── redoc/\n│       ├── __init__.py          # Package initialization\n│       ├── core.py             # Core conversion logic\n│       ├── converters/         # Format-specific converters\n│       │   ├── base.py         # Base converter class\n│       │   ├── pdf_converter.py\n│       │   ├── html_converter.py\n│       │   ├── xml_converter.py\n│       │   ├── json_converter.py\n│       │   ├── docx_converter.py\n│       │   └── epub_converter.py\n│       ├── ocr/                # OCR functionality\n│       ├── templates/          # Default templates\n│       └── utils/              # Utility functions\n├── tests/                      # Test suite\n├── examples/                   # Usage examples\n├── docs/                       # Documentation\n├── pyproject.toml              # Project configuration\n└── README.md                   # This file\n```\n\n## 🔧 Advanced Usage\n\n### Using Templates\n\n```python\nfrom redoc import Redoc\n\nconverter = Redoc()\n\n# Convert JSON+HTML template to PDF\nconverter.convert(\n    {\n        \"template\": \"invoice.html\",\n        \"data\": {\n            \"invoice_number\": \"INV-2023-001\",\n            \"date\": \"2023-11-15\",\n            \"items\": [\n                {\"description\": \"Web Design\", \"quantity\": 1, \"price\": 1200}\n            ],\n            \"total\": 1200\n        }\n    },\n    'pdf',\n    output_file='invoice.pdf'\n)\n```\n\n### OCR Processing\n\n```python\nfrom redoc import Redoc\n\nconverter = Redoc()\n\n# Extract text from scanned PDF with OCR\nresult = converter.ocr('scanned_document.pdf')\nprint(result['text'])\n\n# Convert scanned document to searchable PDF\nconverter.ocr('scanned_document.pdf', output_file='searchable.pdf')\n```\n\n### AI-Powered 
Content Generation\n\n```python\nfrom redoc import Redoc\n\nconverter = Redoc()\n\n# Generate document using AI\nresult = converter.generate(\n    \"Create a professional invoice for web design services\",\n    format='pdf',\n    style='professional',\n    output_file='ai_invoice.pdf'\n)\n```\n\n## 🚧 Next Steps\n\nWe have an exciting roadmap ahead! Check out our [TODO list](TODO.txt) for upcoming features and improvements. Here are some highlights:\n\n### In Progress\n- Fixing pyproject.toml TOML syntax error\n- Resolving MkDocs build warnings\n- Enhancing documentation\n\n### Coming Soon\n- More template examples\n- Improved AI features\n- Performance optimizations\n- Additional document format support\n\n## 🤝 Contributing\n\nContributions are welcome! Please read our [Contributing Guidelines](CONTRIBUTING.md) for details on how to contribute to this project.\n\n## 📄 License\n\nThis project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.\n\n## 📧 Contact\n\nFor any questions or suggestions, please contact [info@softreck.dev](mailto:info@softreck.dev).\n\n---\n\n\u003cdiv align=\"center\"\u003e\n  Made with ❤️ by Text2Doc Team\n\u003c/div\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftext2doc%2Fredoc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftext2doc%2Fredoc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftext2doc%2Fredoc/lists"}