{"id":29691783,"url":"https://github.com/earth-app/doc2lora","last_synced_at":"2026-06-27T01:06:28.091Z","repository":{"id":304306698,"uuid":"1018372625","full_name":"earth-app/doc2lora","owner":"earth-app","description":"Generate LoRA Adapters from documents","archived":false,"fork":false,"pushed_at":"2025-07-20T01:43:25.000Z","size":101,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-07-20T03:50:40.456Z","etag":null,"topics":["ai","cloudflare-ai","cloudflare-workers","lora","numpy","py","python","torch"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/earth-app.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null},"funding":{"patreon":"gmitch215","liberapay":"gmitch215","buy_me_a_coffee":"gmitch215"}},"created_at":"2025-07-12T05:56:13.000Z","updated_at":"2025-07-20T01:54:00.000Z","dependencies_parsed_at":"2025-07-12T09:22:13.173Z","dependency_job_id":null,"html_url":"https://github.com/earth-app/doc2lora","commit_stats":null,"previous_names":["earth-app/doc2lora"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/earth-app/doc2lora","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/earth-app%2Fdoc2lora","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/earth-app%2Fdoc2lora/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/earth-app%2Fdoc2lora/releases","manifests_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/repositories/earth-app%2Fdoc2lora/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/earth-app","download_url":"https://codeload.github.com/earth-app/doc2lora/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/earth-app%2Fdoc2lora/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266633531,"owners_count":23959576,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-23T02:00:09.312Z","response_time":66,"last_error":null,"robots_txt_status":null,"robots_txt_updated_at":null,"robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","cloudflare-ai","cloudflare-workers","lora","numpy","py","python","torch"],"created_at":"2025-07-23T07:06:43.917Z","updated_at":"2026-06-27T01:06:28.079Z","avatar_url":"https://github.com/earth-app.png","language":"Python","funding_links":["https://patreon.com/gmitch215","https://liberapay.com/gmitch215","https://buymeacoffee.com/gmitch215"],"categories":[],"sub_categories":[],"readme":"# doc2lora\n\ndoc2lora is a small library for fine-tuning LLMs with LoRA (Low-Rank Adaptation), using a folder of documents as input. It is designed to be simple and easy to use, allowing users to quickly adapt large language models to specific tasks or domains.\n\nThe library allows you to pass a folder of documents (local or from an R2 bucket) and turn them into a LoRA adapter. 
It is particularly useful for fine-tuning models on domain-specific data, such as legal documents, medical texts, or any other specialized corpus. It is intended to be used with Cloudflare Workers AI or similar platforms that support LLM fine-tuning.\n\nIt supports the following formats:\n\n- **Markdown / reStructuredText**: `.md`, `.rst` files\n- **Text**: `.txt` files or blank text files\n- **PDF**: `.pdf` files\n- **HTML**: `.html` files\n- **Word Documents**: `.docx` files\n- **PowerPoint**: `.pptx` files (slide text + speaker notes)\n- **OpenDocument**: `.odt`, `.ods` files\n- **Rich Text**: `.rtf` files\n- **EPUB e-books**: `.epub` files\n- **Excel Spreadsheets**: `.xlsx` files\n- **CSV**: `.csv` files\n- **JSON**: `.json` files\n- **Jupyter notebooks**: `.ipynb` files (markdown + code cells)\n- **YAML**: `.yaml` / `.yml` files\n- **XML**: `.xml` files\n- **LaTeX**: `.tex` files\n- **Source code** (read as plaintext): `.py`, `.js`, `.ts`, `.java`, `.kt`, `.rs`, `.c`/`.cpp`, `.go`, `.rb`, `.php`, `.swift`, `.dart`, `.scala`, and more\n- **Audio** (speech-to-text via Whisper): `.wav`, `.mp3`, `.m4a`, `.flac`, `.aac`, `.ogg`, and more\n- **Images** (OCR text recognition): `.png`, `.jpg`, `.bmp`, `.gif`, `.tiff`, `.webp`, and more; `.svg` text is read from the markup\n- **Video** (audio transcript + on-screen-text OCR): `.mp4`, `.avi`, `.mov`, `.mkv`, `.webm`, and more\n- **Archive Formats**: `.zip`, `.tar.gz`, `.tar.xz`, `.7z`, single-file `.gz`/`.bz2`/`.xz`, etc with supported documents inside\n\nRun `doc2lora formats` to print the full list at any time.\n\n## Quick Start\n\n### Installation\n\n```bash\n# Core install (training only):\npip install doc2lora\n\n# Everything (all document formats, audio, R2, QLoRA):\npip install \"doc2lora[all]\"\n\n# Or pick what you need via extras:\npip install \"doc2lora[docs]\"    # pdf, docx, pptx, rtf, epub, xlsx, 7z\npip install \"doc2lora[image]\"   # image OCR (needs the system tesseract-ocr binary)\npip install 
\"doc2lora[audio]\"   # speech-to-text via Whisper (needs the ffmpeg binary)\npip install \"doc2lora[video]\"   # video: per-frame OCR + audio transcript\npip install \"doc2lora[r2]\"      # Cloudflare R2 ingestion\npip install \"doc2lora[quant]\"   # 4-bit QLoRA (CUDA only)\n\n# For local development (editable + dev tools):\npip install -e \".[all,dev]\"\n```\n\n**System dependencies** - the image/audio/video extras shell out to native binaries\nthat pip can't install:\n\n| Feature | Needs | macOS (Homebrew) | Debian/Ubuntu | Fedora |\n| ------- | ----- | ---------------- | ------------- | ------ |\n| Image / video OCR (`[image]`, `[video]`) | `tesseract-ocr` | `brew install tesseract` | `sudo apt-get install tesseract-ocr` | `sudo dnf install tesseract` |\n| Audio / video transcription (`[audio]`, `[video]`) | `ffmpeg` | `brew install ffmpeg` | `sudo apt-get install ffmpeg` | `sudo dnf install ffmpeg` |\n\n`opencv-python` bundles its own libraries in the wheel (no system package), and\nODT/ODS and SVG are parsed with the stdlib (no extra or system binary). Audio/video transcription defaults to\n`faster-whisper` (falls back to `openai-whisper`, then `SpeechRecognition`); choose a\nbackend with `--audio-backend` and a model size with `--whisper-model`. For more OCR\nlanguages, install the tesseract language pack (e.g. `brew install tesseract-lang`,\n`sudo apt-get install tesseract-ocr-fra`) and pass `--ocr-languages eng+fra`.\n\n### Basic Usage\n\n```bash\n# Test the example\ncd examples\npython basic_usage.py\n```\n\n## Library Usage\n\nTo use the library, you can import it into your project and call the `convert` function with the path to the folder containing your documents, or use `convert_from_r2` to process documents from an R2 bucket. 
The library will handle the parsing and conversion of the documents into a format suitable for LoRA fine-tuning.\n\nThe `convert` function now supports multiple input types:\n\n- **Folder path**: Pass a path to a folder containing documents\n- **Array of strings**: Pass document content directly as strings\n- **Array of bytes**: Pass document content as byte arrays\n- **Single string**: Pass individual document content\n- **Single bytes**: Pass individual document as bytes\n\n### Subdirectory-Based Labeling\n\n`doc2lora` now automatically uses subdirectory structure combined with filenames to create detailed labels, making it easy to organize training data by category.\n\nWhen processing a folder, each document is automatically labeled by combining its subdirectory and filename:\n\n```text\ntraining_data/\n├── legal/              # Documents labeled as \"legal_[filename]\"\n│   ├── contract1.pdf   # -\u003e \"legal_contract1\"\n│   └── agreement.docx  # -\u003e \"legal_agreement\"\n├── technical/          # Documents labeled as \"technical_[filename]\"\n│   ├── spec.md         # -\u003e \"technical_spec\"\n│   └── guide.txt       # -\u003e \"technical_guide\"\n├── marketing/          # Documents labeled as \"marketing_[filename]\"\n│   ├── campaign.html   # -\u003e \"marketing_campaign\"\n│   └── copy.txt        # -\u003e \"marketing_copy\"\n└── overview.txt        # Root-level files → \"root_overview\"\n```\n\n**Generated metadata includes:**\n\n```json\n{\n  \"content\": \"Document content...\",\n  \"filename\": \"contract1.pdf\",\n  \"label\": \"legal_contract1\",\n  \"category_path\": \"legal\",\n  \"extension\": \".pdf\",\n  \"size\": 1024\n}\n```\n\n**Use Cases:**\n\n- **Domain + Document type**: legal_contract, legal_agreement, technical_spec, technical_guide\n- **Difficulty + Topic**: beginner_python, intermediate_javascript, advanced_algorithms\n- **Type + Content**: manual_installation, faq_troubleshooting, tutorial_setup\n- **Language + Region**: 
en_privacy_policy, es_terms_service, fr_user_guide\n- **Time + Event**: 2023_quarterly_report, 2024_annual_summary, current_status\n\n```bash\n# See the labeling feature in action\ncd examples\npython subdirectory_labeling_demo.py\n```\n\n### Local Documents\n\n```py\nfrom doc2lora import convert\n\n# Method 1: Convert a folder of documents\nconvert(documents_path=\"path/to/documents\", output_path=\"path/to/output.json\")\n\n# Method 2: Convert array of strings directly\ndocuments = [\n    \"This is document 1 content...\",\n    \"This is document 2 content...\",\n    \"This is document 3 content...\"\n]\nconvert(input_data=documents, output_path=\"path/to/output.json\")\n\n# Method 3: Convert single string\ndocument_content = \"This is my document content...\"\nconvert(input_data=document_content, output_path=\"path/to/output.json\")\n\n# Method 4: Convert array of bytes\nwith open(\"doc1.txt\", \"rb\") as f1, open(\"doc2.txt\", \"rb\") as f2:\n    byte_documents = [f1.read(), f2.read()]\nconvert(input_data=byte_documents, output_path=\"path/to/output.json\")\n```\n\n### R2 Bucket Documents\n\n```py\nfrom doc2lora import convert_from_r2\n\n# Method 1: Direct credentials\nconvert_from_r2(\n    bucket_name=\"my-documents-bucket\",\n    folder_prefix=\"training-docs\",  # optional\n    output_path=\"path/to/output.json\",\n    aws_access_key_id=\"your-access-key\",\n    aws_secret_access_key=\"your-secret-key\",\n    endpoint_url=\"https://your-account.r2.cloudflarestorage.com\"\n)\n\n# Method 2: Using .env file (recommended)\nconvert_from_r2(\n    bucket_name=\"my-documents-bucket\",\n    folder_prefix=\"training-docs\",  # optional\n    output_path=\"path/to/output.json\",\n    env_file=\".env\"  # Load credentials from .env file\n)\n\n# The output will be a JSON file containing the LoRA adapter data\n# You can then use this output with your LLM fine-tuning framework\n# For example, with Cloudflare Workers AI:\nfrom cloudflare_workers_ai import LLM\nllm = 
LLM(model=\"your-model-name\")\nllm.load_lora_adapter(\"path/to/output.json\")\n```\n\n## CLI\n\nYou can also use the library from the command line. The CLI allows you to convert a folder of documents or R2 bucket contents into a LoRA adapter JSON file.\n\n### CLI for Local Documents\n\n```bash\ndoc2lora convert path/to/documents --output path/to/output.json\n\n# scan first to preview files + per-file and total training-time estimates\ndoc2lora scan path/to/documents --device cpu\n\n# low-memory machine: smaller batch + gradient accumulation (on by default:\n# gradient checkpointing). 4-bit QLoRA is available on CUDA via --load-in-4bit\ndoc2lora convert path/to/documents \\\n    --batch-size 1 --gradient-accumulation-steps 8 \\\n    --output adapter.json\n```\n\n### Deploy to Cloudflare Workers AI\n\nOnce you have an adapter, upload it as a Workers AI finetune with one command:\n\n```bash\n# uses the wrangler CLI under the hood (validates the adapter first)\ndoc2lora deploy adapter.json my-finetune-name \\\n    --cf-model \"@cf/mistralai/mistral-7b-instruct-v0.2-lora\"\n\n# or upload via the REST API (no wrangler needed)\ndoc2lora deploy adapter.json my-finetune-name --backend rest \\\n    --account-id \"$CLOUDFLARE_ACCOUNT_ID\" --api-token \"$CLOUDFLARE_API_TOKEN\"\n```\n\nThen reference it at inference time with the `lora` parameter\n(`env.AI.run(\"@cf/mistralai/mistral-7b-instruct-v0.2-lora\", { ..., lora: \"my-finetune-name\" })`).\n\n### CLI for R2 Bucket Documents\n\n```bash\n# Method 1: Set environment variables for credentials\nexport R2_ACCESS_KEY_ID=\"your-access-key\"\nexport R2_SECRET_ACCESS_KEY=\"your-secret-key\"\nexport R2_ENDPOINT_URL=\"https://your-account.r2.cloudflarestorage.com\"\n\n# Convert documents from R2 bucket\ndoc2lora convert-r2 my-documents-bucket --folder-prefix training-docs --output path/to/output.json\n\n# Method 2: Use .env file (recommended)\ndoc2lora convert-r2 my-documents-bucket \\\n    --env-file .env \\\n    --folder-prefix 
training-docs \\\n    --output path/to/output.json\n\n# Method 3: Pass credentials directly\ndoc2lora convert-r2 my-documents-bucket \\\n    --r2-access-key-id \"your-access-key\" \\\n    --r2-secret-access-key \"your-secret-key\" \\\n    --endpoint-url \"https://your-account.r2.cloudflarestorage.com\" \\\n    --output path/to/output.json\n```\n\n## Project Structure\n\n```text\ndoc2lora/\n├── doc2lora/             # Main package\n│   ├── __init__.py       # Package init + single-source __version__\n│   ├── core.py           # convert() / convert_from_r2() entry points\n│   ├── parsers.py        # Document / image / audio / video parsing\n│   ├── lora_trainer.py   # LoRA training, device/precision, speedups\n│   ├── deploy.py         # Upload adapters to Cloudflare Workers AI\n│   ├── cli.py            # Command-line interface\n│   └── utils.py          # R2 download + helpers\n├── examples/             # Example usage scripts\n│   ├── basic_usage.py\n│   ├── media_and_optimization.py   # images/audio/video + speed knobs\n│   ├── mistral_usage.py            # Mistral (needs HF_API_KEY)\n│   ├── gemma_usage.py              # Gemma\n│   ├── llama_usage.py              # Llama 2\n│   ├── qwq_usage.py                # QwQ-32B (4-bit QLoRA)\n│   ├── qlora_usage.py              # 4-bit QLoRA + deploy\n│   ├── r2_usage.py                 # R2 bucket integration\n│   ├── subdirectory_labeling_demo.py\n│   └── example_documents/          # Sample documents\n├── demo/                 # Complete Cloudflare Workers AI demo\n│   ├── data/             # Sample training corpus\n│   ├── scripts/          # train_lora.sh/.bat, deploy_to_r2.sh/.bat\n│   ├── worker.js         # Worker (loads adapter, /chat endpoints)\n│   ├── wrangler.toml     # Worker configuration\n│   ├── index.html        # Browser UI\n│   └── README.md         # Demo documentation\n├── tests/                # Test suite (pytest)\n├── pyproject.toml        # Packaging, dependencies/extras, tool config\n├── 
requirements.txt      # Full install (equivalent to the [all] extra)\n├── setup.py              # Thin shim (metadata lives in pyproject.toml)\n├── README.md             # This file\n├── USAGE.md              # Usage guide\n├── INSTALL_GUIDE.md      # Install + Mistral guide\n└── CLAUDE.md             # Repo guide for Claude Code\n```\n\n## Examples\n\nThe `examples/` directory contains usage examples for different models and scenarios:\n\n### Model-Specific Examples\n\n1. **`mistral_usage.py`** - Complete example for Mistral models with HuggingFace authentication\n\n   ```bash\n   cd examples\n   export HF_API_KEY=\"your_huggingface_token\"  # Required for Mistral models\n   python mistral_usage.py\n   ```\n\n2. **`gemma_usage.py`** - Google Gemma model fine-tuning for Cloudflare Workers AI\n\n   ```bash\n   cd examples\n   python gemma_usage.py\n   ```\n\n3. **`llama_usage.py`** - Meta Llama 2 model fine-tuning with optimized parameters\n\n   ```bash\n   cd examples\n   python llama_usage.py\n   ```\n\n4. **`r2_usage.py`** - R2 bucket integration with .env file support\n\n   ```bash\n   cd examples\n   python r2_usage.py\n   ```\n\n5. **`qlora_usage.py`** - Memory-efficient 4-bit QLoRA training (CUDA) + deploy\n\n   ```bash\n   cd examples\n   python qlora_usage.py\n   ```\n\n6. **`qwq_usage.py`** - Fine-tuning the QwQ-32B reasoning model\n   (`@cf/qwen/qwq-32b`) with 4-bit QLoRA; needs a 24 GB+ NVIDIA GPU\n\n   ```bash\n   cd examples\n   python qwq_usage.py\n   ```\n\n7. **`media_and_optimization.py`** - Ingest images / audio / video and tune the\n   training-speed knobs (the auto defaults plus the opt-in flags)\n\n   ```bash\n   cd examples\n   python media_and_optimization.py\n   ```\n\n### Demo Application\n\nThe `demo/` folder contains a complete working demonstration of a Cloudflare Worker using a custom LoRA adapter:\n\n```bash\n# 1. Train a LoRA adapter on software development data\ncd demo\n./scripts/train_lora.sh  # or train_lora.bat on Windows\n\n# 2. 
Deploy the adapter to R2 bucket\n./scripts/deploy_to_r2.sh  # or deploy_to_r2.bat on Windows\n\n# 3. Deploy the Cloudflare Worker\n./scripts/wrangler_deploy.sh  # or wrangler_deploy.bat on Windows\n```\n\nThe demo creates a **Software Developer Assistant** AI that provides guidance on:\n\n- Code development and architecture\n- Debugging and troubleshooting\n- Team collaboration and communication\n- Professional growth and career development\n- Technical decision-making\n\n**API Endpoints:**\n\n- `GET /health` - Health check\n- `POST /chat` - Send message and get response\n- `POST /chat/stream` - Streaming responses\n- `GET /docs` - API documentation\n\n## Configuration\n\n### GPU Support\n\n🚀 **Automatic GPU Detection**: doc2lora now automatically detects and uses the best available device for training:\n\n**Device Priority (Automatic):**\n\n1. 🚀 **NVIDIA GPU (CUDA)** - Fastest; bf16 on Ampere+ (else fp16), TF32 matmul, and fused AdamW\n2. 🍎 **Apple Silicon (MPS)** - Good performance on Mac M1/M2/M3 (bf16 on macOS 14+, else fp32; fp16 autocast on MPS is NaN-prone and is never auto-enabled)\n3. 
💻 **CPU** - Reliable fallback, works everywhere (fp32)\n\n**Automatic Detection (Recommended):**\n\n```bash\n# Will automatically use GPU if available, fallback to CPU\ndoc2lora convert ./docs --output adapter.json\n```\n\n**Manual Device Selection:**\n\n```bash\n# Force GPU usage\ndoc2lora convert ./docs --output adapter.json --device cuda\n\n# Force CPU usage (useful for troubleshooting)\ndoc2lora convert ./docs --output adapter.json --device cpu\n\n# Use Apple Silicon GPU (Mac M1/M2/M3)\ndoc2lora convert ./docs --output adapter.json --device mps\n```\n\n**Python API:**\n\n```python\nfrom doc2lora import convert\n\n# Auto-detect device (recommended)\nconvert(documents_path=\"./docs\", output_path=\"adapter.json\")\n\n# Specify device manually\nconvert(documents_path=\"./docs\", output_path=\"adapter.json\", device=\"cuda\")\nconvert(documents_path=\"./docs\", output_path=\"adapter.json\", device=\"cpu\")\nconvert(documents_path=\"./docs\", output_path=\"adapter.json\", device=\"mps\")  # Apple Silicon\n```\n\n**GPU Requirements:**\n\n- **NVIDIA GPUs**: Requires CUDA-compatible PyTorch installation\n- **Apple Silicon**: Requires PyTorch with MPS support (automatically included on macOS)\n- **Memory**: 8GB+ GPU memory recommended for larger models\n\n### Training Parameters\n\nCommon configuration options:\n\n```bash\ndoc2lora convert ./docs \\\n    --model mistralai/Mistral-7B-Instruct-v0.2 \\\n    --batch-size 2 \\\n    --epochs 3 \\\n    --learning-rate 2e-4 \\\n    --lora-r 8 \\\n    --lora-alpha 16 \\\n    --gradient-accumulation-steps 4 \\\n    --device auto  # or cuda/mps/cpu\n```\n\n**LoRA rank:** the default is `8` (broadest compatibility). 
Cloudflare Workers AI\nnow accepts adapters up to **rank 32** (with a 300MB safetensors limit), so you can\nraise `--lora-r` up to 32 for more capacity; doc2lora only warns above 32.\n\n**Performance / low-resource options:**\n\n- ⚡ **Gradient checkpointing** (on by default): trades ~20% compute for a large\n  memory saving. Disable with `--no-gradient-checkpointing`.\n- 🧮 **Gradient accumulation**: `--gradient-accumulation-steps N` emulates a larger\n  effective batch (`batch_size * N`) without the memory cost - ideal on weak machines.\n- 🪶 **4-bit QLoRA**: `--load-in-4bit` (CUDA + `pip install \"doc2lora[quant]\"`) loads\n  the base model in 4-bit (nf4) so large models fit on small GPUs.\n- 🚀 **Precision**: bf16 on bf16-capable CUDA and Apple MPS (macOS 14+), fp16 on other\n  CUDA GPUs, fp32 on CPU and older MPS (fp16 autocast on MPS is NaN-prone).\n- 💻 **Out of Memory**: reduce `--batch-size`, raise `--gradient-accumulation-steps`,\n  or fall back with `--device cpu` (CUDA OOM also auto-falls back to CPU).\n\n### Training speed optimizations\n\ndoc2lora applies a set of **hardware-aware speedups automatically** - most are\nno-ops where they don't apply - and exposes a few **opt-in** ones for power users.\n\n**Applied automatically:**\n\n| Optimization | What it does | Where it helps |\n| ------------ | ------------ | -------------- |\n| **TF32 matmul** (`set_float32_matmul_precision(\"high\")`) | runs fp32 matmuls on Tensor Cores | NVIDIA **Ampere+** (no-op on older CUDA / MPS / CPU) |\n| **bf16 / fp16 precision** | bf16 on bf16-capable CUDA \u0026 MPS (macOS 14+), fp16 on other CUDA, fp32 elsewhere | CUDA, Apple Silicon |\n| **Fused AdamW** (`optim=\"adamw_torch_fused\"`) | single fused optimizer kernel | CUDA with PyTorch \u003e= 2.8 (else plain AdamW) |\n| **SDPA attention** | PyTorch scaled-dot-product attention auto-selects the fastest kernel | all (CUDA fused kernels; math fallback on CPU/MPS) |\n| **`pad_to_multiple_of=8`** | aligns padded batches to 
Tensor-Core tiles | CUDA (harmless elsewhere) |\n| **CUDA-gated pinned memory** | `dataloader_pin_memory` only on CUDA | avoids wasted host RAM on CPU/MPS |\n| **Gradient checkpointing** (`use_reentrant=False`) | recompute activations to save memory | low-RAM machines (on by default; ~20% slower - disable with `--no-gradient-checkpointing` if you have memory headroom) |\n| **`torch.compile`** | fuses the model graph (~20-50% faster steps) | **auto on CUDA when the corpus is \u003e= ~10 MB of text** (compile cost amortizes on long runs only; CUDA-only). Force with `--torch-compile` / `--no-torch-compile` |\n| **Length-grouped batching** | groups similar-length samples to cut padding | **auto when batch_size \u003e= 2** (hardware-agnostic; nothing to cut at batch 1). Force with `--group-by-length` / `--no-group-by-length` |\n| **Parallel parsing** | thread pool over the document folder (PDF / OCR / transcription) | auto: ~`min(8, CPU count)` threads; tune with `--max-workers N` |\n\n**Opt-in (you choose when it's worth it):**\n\n| Flag | What it does | When to use |\n| ---- | ------------ | ----------- |\n| `--no-torch-compile` | force `torch.compile` off (it auto-enables on CUDA for large corpora) | short runs, debugging, or if compile graph-breaks |\n| `--no-group-by-length` | force length-grouped batching off (it auto-enables at batch \u003e= 2) | if you need strict shuffle order or hit a convergence quirk |\n| `--attn-implementation flash_attention_2` | FlashAttention-2 kernel | Ampere+ CUDA with `flash-attn` + bf16/fp16 (falls back to eager; SDPA already uses flash kernels by default) |\n| `--optim adamw_bnb_8bit` | 8-bit Adam (~75% less optimizer memory) | full fine-tuning on CUDA; **little benefit for LoRA** (optimizer state is just the tiny adapter) - needs `[quant]`/bitsandbytes |\n| `--dataloader-num-workers N` | extra DataLoader worker processes | large corpora on Linux/CUDA only (default 0; keep 0 on macOS / in-memory data) |\n\n**Per-platform 
notes:**\n\n- **NVIDIA CUDA**: prefer **bf16 on Ampere+** (no loss scaling, no NaNs); TF32, the\n  fused optimizer, and (for corpora \u003e= ~10 MB of text) `torch.compile` are on\n  automatically. For very long sequences, `--attn-implementation flash_attention_2`\n  (with `pip install flash-attn`) is the biggest single win.\n- **Apple Silicon (MPS)**: bf16 is used on **macOS 14+** (same memory as fp16, but\n  stable), else fp32. `torch.compile`, FlashAttention, and 4-bit QLoRA do **not** help\n  on MPS; length-grouped batching (auto at batch_size \u003e= 2) is the main extra lever that\n  does. Keep the whole model in unified memory (no CPU offload exists on MPS).\n- **CPU**: training is a slow fallback. PyTorch already defaults to your physical-core\n  count; you can pin it with `OMP_NUM_THREADS`. Use a small base model.\n\n### How long will training take?\n\nAll numbers below are **order-of-magnitude estimates** and vary widely with\nsequence length, batch size, LoRA rank, and data shape. `doc2lora scan \u003cdir\u003e\n--device \u003cd\u003e` prints an estimate for your own corpus.\n\n#### Small base model (DialoGPT-small / GPT-2 class), 3 epochs\n\n| Corpus size | CPU       | Apple MPS | NVIDIA CUDA |\n| ----------- | --------- | --------- | ----------- |\n| ~1 MB       | minutes   | ~1 min    | seconds     |\n| ~10 MB      | ~1 hour   | ~10 min   | ~2 min      |\n| ~100 MB     | many hrs  | ~1-2 hrs  | ~20 min     |\n\n#### 7B-class model (Mistral / Gemma / Llama) vs hardware and VRAM\n\nTimes below are for **3 epochs** at ~512-token sequences. The \"approach\" column\nreflects what fits in memory:\n\n- **\u003e= 24 GB VRAM**: full fp16/bf16 LoRA fits comfortably.\n- **12 GB VRAM**: use 4-bit QLoRA (`--load-in-4bit`) to fit a 7B model.\n- **Apple Silicon**: 4-bit QLoRA is CUDA-only (bitsandbytes), so MPS runs **bf16\n  LoRA** (macOS 14+, else fp32) and needs ~18 GB+ unified memory for a 7B model;\n  8 GB Macs cannot train 7B (use a smaller base model). 
MPS is also much slower\n  than a discrete GPU.\n\n| Hardware | Memory            | 7B approach               | 1 MB    | 10 MB    | 10 MB +optimizations\u0026dagger; | 100 MB    | 100 MB +optimizations\u0026dagger; |\n| -------- | ----------------- | ------------------------- | ------- | -------- | ---------------------------- | --------- | ----------------------------- |\n| Apple M2 | 8-24 GB unified   | bf16 LoRA (16 GB+ for 7B) | ~1 hr   | ~11 hrs  | n/a (CUDA only)              | ~4-5 days | n/a (CUDA only)               |\n| Apple M3 | 8-128 GB unified  | bf16 LoRA                 | ~40 min | ~6 hrs   | n/a                          | ~2-3 days | n/a                           |\n| Apple M4 | 16-128 GB unified | bf16 LoRA                 | ~25 min | ~4 hrs   | n/a                          | ~1.5 days | n/a                           |\n| RTX 4070 | 12 GB             | QLoRA (4-bit) required    | ~10 min | ~1.5 hrs | ~1 hr                        | ~17 hrs   | ~12 hrs                       |\n| RTX 5070 | 12 GB             | QLoRA (4-bit) required    | ~7 min  | ~1.2 hrs | ~50 min                      | ~12 hrs   | ~8-9 hrs                      |\n| RTX 3090 | 24 GB             | full bf16 LoRA            | ~7 min  | ~1 hr    | ~40 min                      | ~11 hrs   | ~7-8 hrs                      |\n| RTX 4090 | 24 GB             | full bf16 LoRA            | ~4 min  | ~35 min  | ~25 min                      | ~6 hrs    | ~4 hrs                        |\n| RTX 5090 | 32 GB             | full bf16 LoRA            | ~2 min  | ~20 min  | ~15 min                      | ~3-4 hrs  | ~2-3 hrs                      |\n\n\u0026dagger; **+optimizations** = the dynamic speedups doc2lora turns on for you, on top of the\nalways-on defaults (TF32, fused AdamW, bf16, SDPA, `pad_to_multiple_of=8` - already in the\nbase columns). 
For these CUDA rows it is dominated by **`torch.compile`** (auto on CUDA once\nthe corpus is \u003e= ~10 MB of text; ~20-40% faster steps per HuggingFace, the column applies\n~30%) plus **length-grouped batching** (auto at batch_size \u003e= 2). The compile cost (seconds\nto minutes) only amortizes on long runs, so the 10 MB and 100 MB columns reflect it while the\n~1 MB / minutes-scale column does not. Override with `--torch-compile` / `--no-torch-compile`\n(and `--group-by-length`). It does **not** help on **Apple MPS** - compile and FlashAttention\nare CUDA-only, and 7B on Apple runs at batch_size 1 (no length-grouping win), so Apple's gains\nare the bf16/etc. already in the base columns. For long sequences on Ampere+ GPUs, stack\n`--attn-implementation flash_attention_2` (needs `pip install flash-attn`).\n\n\u003e **What counts as an \"example\"?** Each document becomes **one or more** training\n\u003e examples: a file \u003c= `--max-length` tokens (default 512, ~2 KB of text) is one example,\n\u003e and a longer file is **auto-chunked** into consecutive `--max-length`-token windows -\n\u003e one example each - so **all of its content is trained on** (use `--chunk-overlap N` to\n\u003e overlap windows, or `--no-chunk` to revert to truncating each file to its first window).\n\u003e So \"a few hundred to a few thousand examples\" is really *total tokens / max_length*.\n\u003e\n\u003e **How this impacts performance:** training time scales with **total tokens**, so\n\u003e chunking a few huge files can balloon the run - a dozen 1 MB files is ~6,000 chunks\n\u003e (~500x more steps than the old truncate-to-12-examples behavior). Bound it with\n\u003e `--max-steps`, fewer `--epochs`, a smaller/curated corpus, or `--no-chunk`. Curated\n\u003e quality still beats raw quantity. 
`doc2lora scan` estimates time from total bytes - a\n\u003e rough figure, but its all-content basis now matches what's actually trained (chunking\n\u003e on); under the old truncation default it over-counted for big files. The small-model\n\u003e table above is ~20-40x faster if you only need a lightweight adapter.\n\n#### 32B-class model (QwQ-32B) vs hardware and VRAM\n\nQwQ-32B (`@cf/qwen/qwq-32b`) also accepts BYO LoRA adapters. A 32B base is roughly\n4-5x slower than 7B and only fits with **4-bit QLoRA**, which needs ~20-24 GB of\nVRAM - so it is realistically a 24 GB+ NVIDIA job. Times are for **3 epochs** at\n~512-token sequences (see `examples/qwq_usage.py`).\n\n| Hardware        | Memory   | 32B approach             | 1 MB     | 10 MB    | 100 MB   |\n| --------------- | -------- | ------------------------ | -------- | -------- | -------- |\n| Apple M2/M3/M4  | unified  | not practical (no 4-bit) | -        | -        | -        |\n| RTX 4070 / 5070 | 12 GB    | too small for 32B        | -        | -        | -        |\n| RTX 3090        | 24 GB    | QLoRA (4-bit), tight     | ~30 min  | ~4.5 hrs | ~2 days  |\n| RTX 4090        | 24 GB    | QLoRA (4-bit)            | ~18 min  | ~2.5 hrs | ~1 day   |\n| RTX 5090        | 32 GB    | QLoRA (4-bit), roomy     | ~9 min   | ~1.5 hrs | ~15 hrs  |\n\n\u003e A rank-8..32 adapter on a 32B model is still well under Cloudflare's 300 MB\n\u003e safetensors limit. 
doc2lora tags Qwen/QwQ adapters with `model_type: qwen`\n\u003e automatically; deploy with `--cf-model \"@cf/qwen/qwq-32b\"`.\n\n## Features\n\n- ✅ **Document Parsing**: Recursively scan directories for supported document types\n- ✅ **Subdirectory Labeling**: Automatically label documents based on directory structure and filename\n- ✅ **Multiple Formats**: Support for 20+ document, image, audio, and video formats, including archives\n- ✅ **Archive Support**: Extract and parse documents from ZIP, TAR, and 7z archives, plus single-file `.gz`/`.bz2`/`.xz`\n- ✅ **R2 Bucket Support**: Direct integration with Cloudflare R2 storage buckets\n- ✅ **CLI Interface**: `convert`, `convert-r2`, `scan`, `formats`, and `deploy` commands\n- ✅ **Image / Audio / Video**: OCR images (tesseract), transcribe audio \u0026 video with Whisper, and OCR on-screen text from frames\n- ✅ **Parallel Parsing**: Multithreaded parsing of the document folder\n- ✅ **Hardware-aware Speedups**: TF32, bf16, fused AdamW, SDPA, and auto `torch.compile` / length-grouped batching selected by device + corpus size\n- ✅ **One-command Deploy**: `doc2lora deploy` uploads adapters to Cloudflare Workers AI (wrangler or REST)\n- ✅ **Flexible Configuration**: Customizable LoRA parameters (rank, alpha, learning rate, epochs, batch size)\n- 🔄 **LoRA Training**: Fine-tune models using LoRA (requires ML dependencies)\n- 🔄 **Export Options**: JSON format compatible with various platforms\n\n## Status\n\n- **Document Parsing**: ✅ Fully working\n- **CLI Interface**: ✅ Basic functionality working\n- **LoRA Training**: 🔄 Requires ML dependencies (torch, transformers, peft, datasets)\n\nThe core document parsing functionality works out of the box. For full LoRA training capabilities, install the ML dependencies listed above.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fearth-app%2Fdoc2lora","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fearth-app%2Fdoc2lora","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fearth-app%2Fdoc2lora/lists"}