https://github.com/text2doc/redoc
image doc pdf ocr html json converter as DSL pipeline
https://github.com/text2doc/redoc
converter docs dsl html json llm ml ocr ollama pdf pipeline tensor torch
Last synced: 11 months ago
JSON representation
image doc pdf ocr html json converter as DSL pipeline
- Host: GitHub
- URL: https://github.com/text2doc/redoc
- Owner: text2doc
- License: apache-2.0
- Created: 2025-06-08T10:18:38.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-06-08T21:08:22.000Z (about 1 year ago)
- Last Synced: 2025-06-15T15:57:41.023Z (about 1 year ago)
- Topics: converter, docs, dsl, html, json, llm, ml, ocr, ollama, pdf, pipeline, tensor, torch
- Language: HTML
- Homepage: https://text2doc.github.io/redoc/
- Size: 717 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
Awesome Lists containing this project
README
# 📄 Redoc - Universal Document Converter
[](https://pypi.org/project/redoc/)
[](https://www.python.org/)
[](https://opensource.org/licenses/Apache-2.0)
[](https://redoc.readthedocs.io/)
[](https://github.com/text2doc/redoc/actions)
[](https://codecov.io/gh/text2doc/redoc)
[](https://github.com/psf/black)
[](https://hub.docker.com/r/text2doc/redoc)
[](https://pepy.tech/project/redoc)
[](https://github.com/text2doc/redoc/actions/workflows/codeql-analysis.yml)
[](https://github.com/pre-commit/pre-commit)
[](https://api.securityscorecards.dev/projects/github.com/text2doc/redoc)
[](https://discord.gg/softreck)
[](https://twitter.com/softreck)
Redoc is a powerful, modular document conversion framework that enables seamless transformation between various document formats including PDF, HTML, XML, JSON, DOCX, and EPUB. It features OCR capabilities, AI-powered content generation using Ollama Mistral:7b, and a bidirectional template system for document generation and data extraction.
## 🌟 Features
### Core Functionality
- **Multi-format Support**: Bidirectional conversion between PDF, HTML, XML, JSON, DOCX, and EPUB
- **Template System**: JSON+HTML templates for dynamic document generation with bidirectional support
- **OCR Integration**: Extract text from scanned documents and images with Tesseract OCR
- **AI-Powered**: Leverage Ollama Mistral:7b for intelligent content generation and processing
- **Bidirectional Processing**: Convert documents to data and back with templates
- **Batch Processing**: Process multiple documents efficiently with parallel execution
### Advanced Capabilities
- **Template Variables**: Support for dynamic content and conditional rendering
- **Validation**: Built-in data validation with Pydantic models
- **Extensible Architecture**: Plugin system for custom formats and processors
- **Asynchronous Processing**: Non-blocking operations for high performance
- **Web Interface**: Modern UI for document conversion and management
### Developer Experience
- **Comprehensive API**: Clean, well-documented Python API
- **Command Line Interface**: Intuitive CLI for quick conversions
- **Interactive Shell**: Built-in Python shell for exploration and debugging
- **Logging & Debugging**: Configurable logging and error reporting
- **Type Hints**: Full type annotations for better IDE support
### Enterprise Ready
- **Docker Support**: Containerized deployment with Docker and Docker Compose
- **REST API**: Built with FastAPI for easy integration
- **Asynchronous Processing**: Non-blocking operations for high performance
- **Security**: Input validation, sanitization, and secure defaults
- **Monitoring**: Built-in metrics and health checks
## 🚀 Quick START
### Installation
#### Using pip (recommended)
```bash
# Install the latest stable version
pip install redoc
# Install with all optional dependencies
pip install "redoc[all]"
# Or install specific components
pip install "redoc[cli]" # Command line interface
pip install "redoc[server]" # Web server and API
pip install "redoc[ai]" # AI features (requires Ollama)
pip install "redoc[ocr]" # OCR capabilities (Tesseract)
pip install "redoc[templates]" # Pre-built templates
```
#### Using Docker (recommended for production)
```bash
# Pull the latest image
docker pull text2doc/redoc:latest
# Run a conversion
docker run -v $(pwd):/data text2doc/redoc convert input.pdf output.html
# Start the web interface
docker run -p 8000:8000 -v $(pwd)/templates:/app/templates text2doc/redoc serve
```
#### Development Installation
```bash
git clone https://github.com/text2doc/redoc.git
cd redoc
pip install -e ".[dev]" # Install in development mode with all dependencies
pre-commit install # Install git hooks
```
## 🛠 Basic Usage
### Command Line Interface
```bash
# Convert a document
redoc convert input.pdf output.html
# Convert with a template
redoc convert --template invoice.html data.json invoice.pdf
# Start interactive shell
redoc shell
# Start web server
redoc serve
```
### Python API
```python
from redoc import Redoc
# Initialize with default settings
converter = Redoc()
# Convert between formats
converter.convert('document.pdf', 'document.html') # PDF to HTML
converter.convert('data.json', 'report.pdf') # JSON to PDF with template
# Process multiple files
converter.batch_convert(
input_glob='invoices/*.json',
output_dir='output/',
output_format='pdf',
template='invoice.html'
)
# Extract data from documents
data = converter.extract_data('document.pdf', 'invoice_schema.json')
# Generate documents from templates
converter.generate_document(
template='invoice.html',
data='data.json',
output='invoice.pdf'
)
# Use the interactive shell
converter.shell()
```
#### Command Line Interface
```bash
# Show help
redoc --help
# Convert a document
redoc convert input.pdf output.html
redoc convert --template invoice.html data.json invoice.pdf
# Start interactive shell
redoc shell
# Start web server
redoc serve --host 0.0.0.0 --port 8000
# Process multiple files
redoc batch "documents/*.pdf" --format html --output-dir html_output
```
#### Using Templates
```python
from redoc import Redoc
converter = Redoc()
# Simple template with variables
template = {
"template": "invoice.html",
"data": {
"invoice": {
"number": "INV-2023-001",
"date": "2023-11-15",
"items": [
{"description": "Web Design", "quantity": 10, "price": 100},
{"description": "Hosting", "quantity": 1, "price": 50}
]
}
}
}
# Generate PDF from template
converter.convert(template, 'pdf', output_file='invoice.pdf')
# Extract data from document
data = converter.extract_data('invoice.pdf', template='invoice_template.html')
```
## 📚 Supported Conversions
| From \ To | PDF | HTML | XML | JSON | DOCX | EPUB |
|-----------|:---:|:----:|:---:|:----:|:----:|:----:|
| **PDF** | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
| **HTML** | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ |
| **XML** | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ |
| **JSON** | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
| **DOCX** | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
| **EPUB** | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
### Conversion Features
- **PDF Generation**: High-quality PDF output with support for headers, footers, and page numbers
- **HTML Processing**: Clean HTML output with customizable CSS styling
- **Data Extraction**: Extract structured data from documents using templates
- **Template Variables**: Use Jinja2 syntax for dynamic content
- **Batch Processing**: Process multiple files in parallel
- **OCR Support**: Extract text from scanned documents and images
- **AI-Powered**: Enhance documents with AI-generated content
## 🏗️ Project Structure
```
redoc/
├── src/
│ └── redoc/
│ ├── __init__.py # Package initialization
│ ├── core.py # Core conversion logic
│ ├── converters/ # Format-specific converters
│ │ ├── base.py # Base converter class
│ │ ├── pdf_converter.py
│ │ ├── html_converter.py
│ │ ├── xml_converter.py
│ │ ├── json_converter.py
│ │ ├── docx_converter.py
│ │ └── epub_converter.py
│ ├── ocr/ # OCR functionality
│ ├── templates/ # Default templates
│ └── utils/ # Utility functions
├── tests/ # Test suite
├── examples/ # Usage examples
├── docs/ # Documentation
├── pyproject.toml # Project configuration
└── README.md # This file
```
## 🔧 Advanced Usage
### Using Templates
```python
from redoc import Redoc
converter = Redoc()
# Convert JSON+HTML template to PDF
converter.convert(
{
"template": "invoice.html",
"data": {
"invoice_number": "INV-2023-001",
"date": "2023-11-15",
"items": [
{"description": "Web Design", "quantity": 1, "price": 1200}
],
"total": 1200
}
},
'pdf',
output_file='invoice.pdf'
)
```
### OCR Processing
```python
from redoc import Redoc
converter = Redoc()
# Extract text from scanned PDF with OCR
result = converter.ocr('scanned_document.pdf')
print(result['text'])
# Convert scanned document to searchable PDF
converter.ocr('scanned_document.pdf', output_file='searchable.pdf')
```
### AI-Powered Content Generation
```python
from redoc import Redoc
converter = Redoc()
# Generate document using AI
result = converter.generate(
"Create a professional invoice for web design services",
format='pdf',
style='professional',
output_file='ai_invoice.pdf'
)
```
## 🚧 Next Steps
We have an exciting roadmap ahead! Check out our [TODO list](TODO.txt) for upcoming features and improvements. Here are some highlights:
### In Progress
- Fixing pyproject.toml TOML syntax error
- Resolving MkDocs build warnings
- Enhancing documentation
### Coming Soon
- More template examples
- Improved AI features
- Performance optimizations
- Additional document format support
## 🤝 Contributing
Contributions are welcome! Please read our [Contributing Guidelines](CONTRIBUTING.md) for details on how to contribute to this project.
## 📄 License
This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
## 📧 Contact
For any questions or suggestions, please contact [info@softreck.dev](mailto:info@softreck.dev).
---
Made with ❤️ by Text2Doc Team