{"id":29691783,"url":"https://github.com/earth-app/doc2lora","last_synced_at":"2026-05-16T08:42:05.751Z","repository":{"id":304306698,"uuid":"1018372625","full_name":"earth-app/doc2lora","owner":"earth-app","description":"Generate LoRA Adapters from documents","archived":false,"fork":false,"pushed_at":"2025-07-20T01:43:25.000Z","size":101,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-07-20T03:50:40.456Z","etag":null,"topics":["ai","cloudflare-ai","cloudflare-workers","lora","numpy","py","python","torch"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/earth-app.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null},"funding":{"patreon":"gmitch215","liberapay":"gmitch215","buy_me_a_coffee":"gmitch215"}},"created_at":"2025-07-12T05:56:13.000Z","updated_at":"2025-07-20T01:54:00.000Z","dependencies_parsed_at":"2025-07-12T09:22:13.173Z","dependency_job_id":null,"html_url":"https://github.com/earth-app/doc2lora","commit_stats":null,"previous_names":["earth-app/doc2lora"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/earth-app/doc2lora","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/earth-app%2Fdoc2lora","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/earth-app%2Fdoc2lora/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/earth-app%2Fdoc2lora/releases","manifests_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/repositories/earth-app%2Fdoc2lora/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/earth-app","download_url":"https://codeload.github.com/earth-app/doc2lora/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/earth-app%2Fdoc2lora/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266633531,"owners_count":23959576,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-23T02:00:09.312Z","response_time":66,"last_error":null,"robots_txt_status":null,"robots_txt_updated_at":null,"robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","cloudflare-ai","cloudflare-workers","lora","numpy","py","python","torch"],"created_at":"2025-07-23T07:06:43.917Z","updated_at":"2026-05-16T08:42:05.738Z","avatar_url":"https://github.com/earth-app.png","language":"Python","funding_links":["https://patreon.com/gmitch215","https://liberapay.com/gmitch215","https://buymeacoffee.com/gmitch215"],"categories":[],"sub_categories":[],"readme":"# doc2lora\n\nThis repository is a small library for fine-tuning LLMs using LoRA (Low-Rank Adaptation) by using a folder of documents as input. It is designed to be simple and easy to use, allowing users to quickly adapt large language models to specific tasks or domains.\n\nThe library allows you to pass a folder of documents (local or from R2 bucket) and turn them into a LoRA Adapter. 
It is particularly useful for fine-tuning models on domain-specific data, such as legal documents, medical texts, or any other specialized corpus. It is intended to be used with Cloudflare Workers AI or similar platforms that support LLM fine-tuning.\n\nIt supports the following formats:\n\n- **Markdown**: `.md` files\n- **Text**: `.txt` files or blank text files\n- **PDF**: `.pdf` files\n- **HTML**: `.html` files\n- **Word Documents**: `.docx` files\n- **Excel Spreadsheets**: `.xlsx` files\n- **CSV**: `.csv` files\n- **JSON**: `.json` files\n- **YAML**: `.yaml` / `.yml` files\n- **XML**: `.xml` files\n- **LaTeX**: `.tex` files\n- **Archive Formats**: `.zip`, `.tar.gz`, `.tar.xz`, etc., with supported documents inside\n\n## Quick Start\n\n### Installation\n\n```bash\n# Install the package\npip install -e .\n\n# For full functionality with ML training, install additional dependencies:\npip install torch transformers peft datasets\n\n# For additional document format support:\npip install PyPDF2 python-docx beautifulsoup4 PyYAML openpyxl\n\n# For R2 bucket support:\npip install boto3\n```\n\n### Basic Usage\n\n```bash\n# Test the example\ncd examples\npython basic_usage.py\n```\n\n## Library Usage\n\nTo use the library, import it into your project and call the `convert` function with the path to the folder containing your documents, or use `convert_from_r2` to process documents from an R2 bucket. 
The library will handle the parsing and conversion of the documents into a format suitable for LoRA fine-tuning.\n\nThe `convert` function now supports multiple input types:\n\n- **Folder path**: Pass a path to a folder containing documents\n- **Array of strings**: Pass document content directly as strings\n- **Array of bytes**: Pass document content as byte arrays\n- **Single string**: Pass individual document content\n- **Single bytes**: Pass individual document as bytes\n\n### Subdirectory-Based Labeling\n\n`doc2lora` now automatically uses subdirectory structure combined with filenames to create detailed labels, making it easy to organize training data by category.\n\nWhen processing a folder, each document is automatically labeled by combining its subdirectory and filename:\n\n```text\ntraining_data/\n├── legal/              # Documents labeled as \"legal_[filename]\"\n│   ├── contract1.pdf   # -\u003e \"legal_contract1\"\n│   └── agreement.docx  # -\u003e \"legal_agreement\"\n├── technical/          # Documents labeled as \"technical_[filename]\"\n│   ├── spec.md         # -\u003e \"technical_spec\"\n│   └── guide.txt       # -\u003e \"technical_guide\"\n├── marketing/          # Documents labeled as \"marketing_[filename]\"\n│   ├── campaign.html   # -\u003e \"marketing_campaign\"\n│   └── copy.txt        # -\u003e \"marketing_copy\"\n└── overview.txt        # Root-level files → \"root_overview\"\n```\n\n**Generated metadata includes:**\n\n```json\n{\n  \"content\": \"Document content...\",\n  \"filename\": \"contract1.pdf\",\n  \"label\": \"legal_contract1\",\n  \"category_path\": \"legal\",\n  \"extension\": \".pdf\",\n  \"size\": 1024\n}\n```\n\n**Use Cases:**\n\n- **Domain + Document type**: legal_contract, legal_agreement, technical_spec, technical_guide\n- **Difficulty + Topic**: beginner_python, intermediate_javascript, advanced_algorithms\n- **Type + Content**: manual_installation, faq_troubleshooting, tutorial_setup\n- **Language + Region**: 
en_privacy_policy, es_terms_service, fr_user_guide\n- **Time + Event**: 2023_quarterly_report, 2024_annual_summary, current_status\n\n```bash\n# See the labeling feature in action\ncd examples\npython subdirectory_labeling_demo.py\n```\n\n### Local Documents\n\n```py\nfrom doc2lora import convert\n\n# Method 1: Convert a folder of documents\nconvert(documents_path=\"path/to/documents\", output_path=\"path/to/output.json\")\n\n# Method 2: Convert array of strings directly\ndocuments = [\n    \"This is document 1 content...\",\n    \"This is document 2 content...\",\n    \"This is document 3 content...\"\n]\nconvert(input_data=documents, output_path=\"path/to/output.json\")\n\n# Method 3: Convert single string\ndocument_content = \"This is my document content...\"\nconvert(input_data=document_content, output_path=\"path/to/output.json\")\n\n# Method 4: Convert array of bytes\nwith open(\"doc1.txt\", \"rb\") as f1, open(\"doc2.txt\", \"rb\") as f2:\n    byte_documents = [f1.read(), f2.read()]\nconvert(input_data=byte_documents, output_path=\"path/to/output.json\")\n```\n\n### R2 Bucket Documents\n\n```py\nfrom doc2lora import convert_from_r2\n\n# Method 1: Direct credentials\nconvert_from_r2(\n    bucket_name=\"my-documents-bucket\",\n    folder_prefix=\"training-docs\",  # optional\n    output_path=\"path/to/output.json\",\n    aws_access_key_id=\"your-access-key\",\n    aws_secret_access_key=\"your-secret-key\",\n    endpoint_url=\"https://your-account.r2.cloudflarestorage.com\"\n)\n\n# Method 2: Using .env file (recommended)\nconvert_from_r2(\n    bucket_name=\"my-documents-bucket\",\n    folder_prefix=\"training-docs\",  # optional\n    output_path=\"path/to/output.json\",\n    env_file=\".env\"  # Load credentials from .env file\n)\n\n# The output will be a JSON file containing the LoRA adapter data\n# You can then use this output with your LLM fine-tuning framework\n# For example, with Cloudflare Workers AI:\nfrom cloudflare_workers_ai import LLM\nllm = 
LLM(model=\"your-model-name\")\nllm.load_lora_adapter(\"path/to/output.json\")\n```\n\n## CLI\n\nYou can also use the library from the command line. The CLI allows you to convert a folder of documents or R2 bucket contents into a LoRA adapter JSON file.\n\n### CLI for Local Documents\n\n```bash\ndoc2lora convert path/to/documents --output path/to/output.json\n```\n\n### CLI for R2 Bucket Documents\n\n```bash\n# Method 1: Set environment variables for credentials\nexport R2_ACCESS_KEY_ID=\"your-access-key\"\nexport R2_SECRET_ACCESS_KEY=\"your-secret-key\"\nexport R2_ENDPOINT_URL=\"https://your-account.r2.cloudflarestorage.com\"\n\n# Convert documents from R2 bucket\ndoc2lora convert-r2 my-documents-bucket --folder-prefix training-docs --output path/to/output.json\n\n# Method 2: Use .env file (recommended)\ndoc2lora convert-r2 my-documents-bucket \\\n    --env-file .env \\\n    --folder-prefix training-docs \\\n    --output path/to/output.json\n\n# Method 3: Pass credentials directly\ndoc2lora convert-r2 my-documents-bucket \\\n    --r2-access-key-id \"your-access-key\" \\\n    --r2-secret-access-key \"your-secret-key\" \\\n    --endpoint-url \"https://your-account.r2.cloudflarestorage.com\" \\\n    --output path/to/output.json\n```\n\n## Project Structure\n\n```text\ndoc2lora/\n├── doc2lora/           # Main package\n│   ├── __init__.py     # Package initialization\n│   ├── core.py         # Main convert function\n│   ├── parsers.py      # Document parsing logic\n│   ├── lora_trainer.py # LoRA training implementation\n│   ├── cli.py          # Command-line interface\n│   └── utils.py        # Utility functions\n├── examples/           # Example usage\n│   ├── basic_usage.py  # Working example script\n│   ├── subdirectory_labeling_demo.py # Subdirectory labeling demonstration\n│   ├── mistral_usage.py # Mistral model example with HF API key\n│   ├── gemma_usage.py  # Gemma model example for Cloudflare AI\n│   ├── llama_usage.py  # Llama model example for Cloudflare 
AI\n│   ├── r2_usage.py     # R2 bucket integration example\n│   └── example_documents/  # Sample documents\n│       ├── sample.md\n│       ├── sample.txt\n│       ├── sample.json\n│       └── sample.csv\n├── demo/              # Complete working demonstration\n│   ├── data/          # Sample training documents about software development\n│   ├── scripts/       # Automation scripts (train_lora.sh/.bat, deploy_to_r2.sh/.bat)\n│   ├── worker.js      # Cloudflare Worker implementation\n│   ├── wrangler.toml  # Cloudflare Worker configuration\n│   ├── index.html     # Web interface for testing\n│   └── README.md      # Demo documentation\n├── tests/             # Test suite\n├── requirements.txt   # Dependencies\n├── setup.py          # Package setup\n└── README.md         # This file\n```\n\n## Examples\n\nThe `examples/` directory contains usage examples for different models and scenarios:\n\n### Model-Specific Examples\n\n1. **`mistral_usage.py`** - Complete example for Mistral models with HuggingFace authentication\n\n   ```bash\n   cd examples\n   export HF_API_KEY=\"your_huggingface_token\"  # Required for Mistral models\n   python mistral_usage.py\n   ```\n\n2. **`gemma_usage.py`** - Google Gemma model fine-tuning for Cloudflare Workers AI\n\n   ```bash\n   cd examples\n   python gemma_usage.py\n   ```\n\n3. **`llama_usage.py`** - Meta Llama 2 model fine-tuning with optimized parameters\n\n   ```bash\n   cd examples\n   python llama_usage.py\n   ```\n\n4. **`r2_usage.py`** - R2 bucket integration with .env file support\n\n   ```bash\n   cd examples\n   python r2_usage.py\n   ```\n\n### Demo Application\n\nThe `demo/` folder contains a complete working demonstration of a Cloudflare Worker using a custom LoRA adapter:\n\n```bash\n# 1. Train a LoRA adapter on software development data\ncd demo\n./scripts/train_lora.sh  # or train_lora.bat on Windows\n\n# 2. Deploy the adapter to R2 bucket\n./scripts/deploy_to_r2.sh  # or deploy_to_r2.bat on Windows\n\n# 3. 
Deploy the Cloudflare Worker\n./scripts/wrangler_deploy.sh  # or wrangler_deploy.bat on Windows\n```\n\nThe demo creates a **Software Developer Assistant** AI that provides guidance on:\n\n- Code development and architecture\n- Debugging and troubleshooting\n- Team collaboration and communication\n- Professional growth and career development\n- Technical decision-making\n\n**API Endpoints:**\n\n- `GET /health` - Health check\n- `POST /chat` - Send message and get response\n- `POST /chat/stream` - Streaming responses\n- `GET /docs` - API documentation\n\n## Configuration\n\n### GPU Support\n\n🚀 **Automatic GPU Detection**: doc2lora now automatically detects and uses the best available device for training:\n\n**Device Priority (Automatic):**\n\n1. 🚀 **NVIDIA GPU (CUDA)** - Fastest training with fp16 precision and optimal memory usage\n2. 🍎 **Apple Silicon (MPS)** - Good performance on Mac M1/M2/M3\n3. 💻 **CPU** - Reliable fallback, works everywhere\n\n**Automatic Detection (Recommended):**\n\n```bash\n# Will automatically use GPU if available, fallback to CPU\ndoc2lora convert ./docs --output adapter.json\n```\n\n**Manual Device Selection:**\n\n```bash\n# Force GPU usage\ndoc2lora convert ./docs --output adapter.json --device cuda\n\n# Force CPU usage (useful for troubleshooting)\ndoc2lora convert ./docs --output adapter.json --device cpu\n\n# Use Apple Silicon GPU (Mac M1/M2/M3)\ndoc2lora convert ./docs --output adapter.json --device mps\n```\n\n**Python API:**\n\n```python\nfrom doc2lora import convert\n\n# Auto-detect device (recommended)\nconvert(documents_path=\"./docs\", output_path=\"adapter.json\")\n\n# Specify device manually\nconvert(documents_path=\"./docs\", output_path=\"adapter.json\", device=\"cuda\")\nconvert(documents_path=\"./docs\", output_path=\"adapter.json\", device=\"cpu\")\nconvert(documents_path=\"./docs\", output_path=\"adapter.json\", device=\"mps\")  # Apple Silicon\n```\n\n**GPU Requirements:**\n\n- **NVIDIA GPUs**: Requires 
CUDA-compatible PyTorch installation\n- **Apple Silicon**: Requires PyTorch with MPS support (automatically included on macOS)\n- **Memory**: 8GB+ GPU memory recommended for larger models\n\n### Training Parameters\n\nCommon configuration options:\n\n```bash\ndoc2lora convert ./docs \\\n    --model mistralai/Mistral-7B-Instruct-v0.2 \\\n    --batch-size 2 \\\n    --epochs 3 \\\n    --learning-rate 2e-4 \\\n    --lora-r 8 \\\n    --lora-alpha 16 \\\n    --device auto  # or cuda/mps/cpu\n```\n\n**Memory Management:**\n\n- 🚀 **GPU Training**: Automatically uses fp16 precision on CUDA GPUs to save memory\n- 🔧 **Out of Memory**: Reduce `--batch-size` if you encounter GPU memory errors\n- 💻 **CPU Fallback**: Use `--device cpu` if GPU memory is insufficient\n- ⚡ **Automatic Optimization**: The system automatically chooses optimal settings per device\n\n## Features\n\n- ✅ **Document Parsing**: Recursively scan directories for supported document types\n- ✅ **Subdirectory Labeling**: Automatically label documents based on directory structure and filename\n- ✅ **Multiple Formats**: Support for 16+ document formats including archives\n- ✅ **Archive Support**: Extract and parse documents from ZIP and TAR archives\n- ✅ **R2 Bucket Support**: Direct integration with Cloudflare R2 storage buckets\n- ✅ **CLI Interface**: Easy-to-use command-line interface\n- ✅ **Flexible Configuration**: Customizable LoRA parameters\n- 🔄 **LoRA Training**: Fine-tune models using LoRA adaptation (requires ML dependencies)\n- 🔄 **Export Options**: JSON format compatible with various platforms\n\n## Status\n\n- **Document Parsing**: ✅ Fully working\n- **CLI Interface**: ✅ Basic functionality working\n- **LoRA Training**: 🔄 Requires ML dependencies (torch, transformers, peft, datasets)\n\nThe core document parsing functionality works out of the box. 
For full LoRA training capabilities, install the ML dependencies listed above.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fearth-app%2Fdoc2lora","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fearth-app%2Fdoc2lora","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fearth-app%2Fdoc2lora/lists"}