https://github.com/derekslinz/recursive-local-translator
A recursive document translator tool that leverages argostranslate/ctranslate2 and cuda/mps acceleration when possible.
- Host: GitHub
- URL: https://github.com/derekslinz/recursive-local-translator
- Owner: derekslinz
- License: mit
- Created: 2026-04-28T01:32:43.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2026-04-28T06:17:47.000Z (about 2 months ago)
- Last Synced: 2026-04-28T08:05:01.067Z (about 2 months ago)
- Topics: argostranslate, ctranslate2, document-translation, localization-tool, osint-tool, translation-tool
- Language: Python
- Homepage: https://www.linzalytics.com
- Size: 73.2 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
# Recursive Local Translator
A high-performance, modular workspace translator designed to handle massive directories with mixed file formats. It leverages **CTranslate2** for ultra-fast inference and supports **CUDA** (NVIDIA) and **MPS** (Apple Silicon) hardware acceleration.
Rather than wrapping a higher-level translation API, this tool drives the underlying translation models directly, which is intended to improve speed and give finer control over memory and device utilization.
---
## Key Capabilities
### 1. Multi-Pass Workspace Processing
The tool operates in three distinct phases to ensure consistency and safety:
- **Pass 1: Renaming**: Recursively translates directory names and filenames.
- **Pass 2: Upgrading**: Converts legacy Office formats (e.g., `.doc`) to modern OpenXML (`.docx`) using LibreOffice.
- **Pass 3: Content**: Performs in-place translation of file contents or generates sidecar text extracts.
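
Renaming in Pass 1 raises an ordering question: if a directory is renamed before its children, the paths to those children become stale. One way to avoid this is a bottom-up walk, sketched below. `rename_tree` and the `translate_name` callback are hypothetical names used for illustration; they are not taken from the project's code.

```python
import os
from pathlib import Path


def rename_tree(root, translate_name):
    """Rename every file and directory under root, deepest entries first.

    translate_name is any callable mapping an old name to a new one
    (a stand-in for the real translation step).
    """
    renamed = []
    # topdown=False yields a directory only after all of its subdirectories,
    # so the parent path is still valid while its entries are renamed.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            new_name = translate_name(name)
            if new_name == name:
                continue
            src = Path(dirpath) / name
            dst = Path(dirpath) / new_name
            if dst.exists():
                continue  # never overwrite an existing entry
            src.rename(dst)
            renamed.append((src, dst))
    return renamed
```
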
### 2. Comprehensive Format Support
| Category | Supported Extensions |
| :--- | :--- |
| **Documentation** | `.txt`, `.log`, `.nfo`, `.md`, `.mdx`, `.rmd`, `.rst`, `.adoc`, `.org`, `.wiki`, `.rtx`, `.tex` |
| **Config / Web** | `.cfg`, `.conf`, `.toml`, `.properties`, `.mak`, `.cmake`, `.yaml`, `.yml`, `.xml`, `.html`, `.htm`, `.xhtml`, `.shtml`, `.json`, `.json5`, `.jsonc`, `.jsonl`, `.svg`, `.resx`, `.xliff`, `.xlf`, `.tmx` |
| **Subtitles** | `.srt`, `.vtt`, `.ass`, `.ssa`, `.sub`, `.sbv`, `.po`, `.pot` |
| **Additional Text** | `.lrc`, `.info`, `.textile`, `.strings`, `.arb`, `.fb2`, `.ts` (Qt XML autodetected only) |
| **Office** | `.docx`, `.xlsx`, `.pptx` (formatting preserved) |
| **OpenDocument** | `.odt`, `.ods`, `.odp` (native in-place translation) |
| **Email / Ebook** | `.eml` (subject/body + Base64 text-like attachments), `.epub` |
| **Sidecars** | `.pdf`, `.vsd`, `.vsdx`, `.msg`, `.djvu`, `.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp` (generates `.en.txt`) |
### 3. Intelligent Content Handling
- **Mixed Language Support**: Automatically detects language per paragraph/chunk. English text within a Russian document is preserved untouched.
- **Transliteration Mode**: High-speed mode to convert Cyrillic script to Latin script without full semantic translation.
- **Content-Based Renaming**: Can automatically rename generic filenames (like `scan_001.jpg` or `Untitled.txt`) based on the translated content found inside the file.
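
To illustrate per-paragraph handling and transliteration, here is a minimal sketch. It assumes a simple script-based heuristic (the share of Cyrillic letters in a paragraph) and a small hand-written transliteration table; the tool's actual language detection and romanization scheme may differ, and all function names here are hypothetical.

```python
import re

CYRILLIC = re.compile(r"[\u0400-\u04FF]")
LETTER = re.compile(r"[^\W\d_]")  # any Unicode letter

# Illustrative romanization table (an assumption, not the project's scheme).
_TRANSLIT = dict(zip("абвгдезийклмнопрстуфыэ", "abvgdeziyklmnoprstufye"))
_TRANSLIT.update({"ё": "yo", "ж": "zh", "х": "kh", "ц": "ts", "ч": "ch",
                  "ш": "sh", "щ": "shch", "ъ": "", "ь": "", "ю": "yu", "я": "ya"})


def is_mostly_cyrillic(paragraph, threshold=0.5):
    """Return True when at least `threshold` of the letters are Cyrillic."""
    letters = LETTER.findall(paragraph)
    if not letters:
        return False
    cyrillic = sum(1 for ch in letters if CYRILLIC.match(ch))
    return cyrillic / len(letters) >= threshold


def transliterate(text):
    """Convert Cyrillic letters to Latin, leaving other characters alone."""
    out = []
    for ch in text:
        latin = _TRANSLIT.get(ch.lower())
        if latin is None:
            out.append(ch)
        else:
            out.append(latin.capitalize() if ch.isupper() else latin)
    return "".join(out)


def translate_mixed(text, translate):
    """Apply `translate` only to paragraphs that look Cyrillic."""
    paragraphs = text.split("\n\n")
    return "\n\n".join(
        translate(p) if is_mostly_cyrillic(p) else p for p in paragraphs
    )
```

Passing `transliterate` as the `translate` callback gives the fast path; a real translation function can be substituted without changing the routing logic.
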
---
## Installation
### 1. Prerequisites
- **Python 3.8 to 3.12**: ML library support for Python 3.13+ is currently limited.
- **LibreOffice**: Required for format upgrades (`.doc` → `.docx`).
- **Tesseract OCR**: Required for image and PDF OCR fallback.
### 2. Setup
Use a virtual environment to keep the dependencies isolated:
```bash
# Clone the repository
git clone https://github.com/derekslinz/recursive-local-translator.git
cd recursive-local-translator
# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Install the Russian-English translation model
argospm install translate-ru_en
```
> [!TIP]
> **Hardware Acceleration**: Manual device selection via `--device` is typically **not necessary**. The tool automatically detects and uses the best available accelerator (CUDA on NVIDIA GPUs, MPS on Apple Silicon) and falls back to the CPU only when neither is available.
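
The detection order described in the tip can be sketched as follows. This assumes PyTorch is the library used to probe for accelerators, which the README does not confirm; `pick_device` is a hypothetical helper, and the sketch returns `"cpu"` when PyTorch is not installed.

```python
def pick_device(requested="auto"):
    """Return 'cuda', 'mps', or 'cpu' following the order CUDA > MPS > CPU."""
    if requested != "auto":
        return requested  # an explicit choice bypasses detection
    try:
        import torch  # assumption: PyTorch is available for probing
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is missing on older PyTorch builds, hence getattr.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```
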
---
## Usage Guide
### Basic Command
Translate a directory recursively from Russian to English (use `.` for the current directory):
```bash
python3 translate_all.py /path/to/workspace
```
### Advanced Flags
| Argument | Description |
| :--- | :--- |
| `root_path` | **Required**: The target directory to process recursively. |
| `--transliterate` | **Fast Mode**: Only converts Cyrillic to Latin (e.g., `Папка` → `Papka`). |
| `--auto-detect` | Detects language per file. Skips files that are already in the target language. |
| `--rename-only` | Only renames files and folders; skips content translation. |
| `--upgrade-only` | Only converts legacy formats; skips all translation. |
| `--sidecars` | Generates `.en.txt` extracts for binary formats (PDFs, images). |
| `--device [auto/cuda/mps/cpu]` | Forces a specific hardware accelerator (typically auto-detected). |
| `--workers [N]` | Sets concurrency level (default: 5). |
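
For readers who script around the tool, the flags above map naturally onto an `argparse` definition. The following is a reconstruction from the table only, not the parser in `translate_all.py`; defaults other than `--workers` and the attribute names are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="translate_all.py")
    parser.add_argument("root_path", help="directory to process recursively")
    parser.add_argument("--transliterate", action="store_true",
                        help="only convert Cyrillic to Latin")
    parser.add_argument("--auto-detect", action="store_true",
                        help="skip files already in the target language")
    parser.add_argument("--rename-only", action="store_true",
                        help="rename files and folders, skip content")
    parser.add_argument("--upgrade-only", action="store_true",
                        help="convert legacy formats, skip translation")
    parser.add_argument("--sidecars", action="store_true",
                        help="write .en.txt extracts for binary formats")
    parser.add_argument("--device", choices=["auto", "cuda", "mps", "cpu"],
                        default="auto")
    parser.add_argument("--workers", type=int, default=5)
    return parser
```
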
---
## Important Notes
- **Irreversible**: Content translation is performed **in-place**. ALWAYS work on a copy/mirror of your data.
- **Encoding**: The tool assumes UTF-8 encoding. Files containing invalid UTF-8 are read with Python's `surrogateescape` error handler, so undecodable bytes do not cause a crash.
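
The `surrogateescape` handler is a standard Python feature: bytes that are not valid UTF-8 are mapped to lone surrogate code points on decode and restored unchanged on encode. A minimal sketch of a lossless read/write pair follows; the helper names are hypothetical.

```python
from pathlib import Path


def read_lossless(path):
    """Decode a file as UTF-8, keeping invalid bytes as surrogate escapes."""
    return Path(path).read_bytes().decode("utf-8", errors="surrogateescape")


def write_lossless(path, text):
    """Encode text back to bytes, restoring any escaped invalid bytes."""
    Path(path).write_bytes(text.encode("utf-8", errors="surrogateescape"))
```

Working on raw bytes rather than text-mode file objects also avoids newline translation, so a read followed by a write reproduces the file byte for byte.
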
---
## Robustness & Reliability
### Cascading Fallbacks
Extraction tries several backends in order and moves on to the next one when a backend fails:
- **PDF Extraction**: `PyMuPDF` → `PyPDF2` → `pdfminer.six` → `pdftotext` (CLI).
- **OCR Engine**: `Tesseract` → `EasyOCR` → `pytesseract`.
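
The cascade above can be expressed as a small generic helper. This is a sketch of the pattern only: `first_successful` is a hypothetical name, and the extractors are passed in as plain callables rather than tied to the real PyMuPDF, PyPDF2, or OCR APIs.

```python
def first_successful(extractors, source):
    """Try each (name, callable) pair in order; return the first usable text.

    A backend counts as failed if it raises or returns empty/whitespace text.
    """
    errors = []
    for name, extract in extractors:
        try:
            text = extract(source)
        except Exception as exc:  # any backend error triggers the fallback
            errors.append(f"{name}: {exc!r}")
            continue
        if text and text.strip():
            return name, text
        errors.append(f"{name}: empty result")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))
```

Treating an empty result as a failure matters for scanned PDFs, where a text-layer extractor may succeed technically but return nothing, and the OCR path should then take over.
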
### Email Attachment Handling (`.eml`)
- Base64-encoded attachments are decoded before processing.
- Text-like attachments (for example `.txt`, `.csv`, `.json`, `.xml`) are translated and re-encoded.
- Binary attachments are preserved untouched to avoid corruption.
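
A sketch of this decode, translate, re-encode cycle using only the standard-library `email` package is shown below. The function name, the extension set, and the `translate` callback are assumptions for illustration; the project's own implementation may differ.

```python
import base64
import os
from email import message_from_bytes, policy

TEXT_LIKE = {".txt", ".csv", ".json", ".xml"}  # illustrative subset


def translate_text_attachments(raw_eml, translate):
    """Translate Base64 text-like attachments; leave everything else as-is."""
    msg = message_from_bytes(raw_eml, policy=policy.default)
    for part in msg.walk():
        if part.get_content_disposition() != "attachment":
            continue
        cte = str(part.get("Content-Transfer-Encoding", "")).lower()
        ext = os.path.splitext(part.get_filename() or "")[1].lower()
        if cte != "base64" or ext not in TEXT_LIKE:
            continue  # binary or differently encoded parts stay untouched
        data = part.get_payload(decode=True)  # Base64 -> raw bytes
        text = data.decode("utf-8", errors="surrogateescape")
        new_data = translate(text).encode("utf-8", errors="surrogateescape")
        # Keep the existing headers; only swap the Base64 body.
        part.set_payload(base64.encodebytes(new_data).decode("ascii"))
    return msg.as_bytes()
```
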
### Filesystem Safety
- **Path Sanitization**: Automatically removes invalid characters and collapses excessive repetitions.
- **Length Enforcement**: Filenames are limited to **255 bytes**, truncated on multi-byte character boundaries, which prevents OS "file name too long" errors.
- **Configuration Protection**: `.ini` and `.sys` files are automatically excluded from inline translation to prevent system corruption.
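
Byte-limited truncation needs care with UTF-8, because cutting in the middle of a multi-byte character produces invalid text. A minimal sketch, assuming the extension should be kept and only the stem shortened (`truncate_filename` is a hypothetical helper):

```python
import os


def truncate_filename(name, limit=255):
    """Shorten name to at most `limit` UTF-8 bytes without splitting a character."""
    stem, ext = os.path.splitext(name)
    budget = limit - len(ext.encode("utf-8"))
    if budget <= 0:
        # Pathological case: the extension alone exceeds the limit.
        return name.encode("utf-8")[:limit].decode("utf-8", errors="ignore")
    stem_bytes = stem.encode("utf-8")[:budget]
    # errors="ignore" drops a partial trailing character left by the byte slice.
    return stem_bytes.decode("utf-8", errors="ignore") + ext
```

For example, a stem of 200 Cyrillic letters occupies 400 bytes; with a `.txt` extension the stem budget is 251 bytes, so 125 whole letters (250 bytes) are kept and the dangling half character is discarded.
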
---