{"id":23463520,"url":"https://github.com/lx-0/web2llm","last_synced_at":"2025-04-12T11:58:13.245Z","repository":{"id":269351618,"uuid":"907057484","full_name":"lx-0/web2llm","owner":"lx-0","description":"🌐 Expand LLM knowledge beyond training cutoffs through transforming modern websites into AI-digestible PDFs via HTTrack-powered scraping","archived":false,"fork":false,"pushed_at":"2024-12-22T23:14:50.000Z","size":32,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-16T01:43:21.044Z","etag":null,"topics":["ai-tools","documentation-tools","knowledge-base","llm-tools","pdf-converter"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lx-0.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-22T17:32:46.000Z","updated_at":"2025-01-22T13:14:53.000Z","dependencies_parsed_at":"2024-12-23T00:24:16.789Z","dependency_job_id":"a4ceca8b-64f2-4173-ba8d-790f4ac4a258","html_url":"https://github.com/lx-0/web2llm","commit_stats":null,"previous_names":["lx-0/web2llm"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lx-0%2Fweb2llm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lx-0%2Fweb2llm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lx-0%2Fweb2llm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lx-0%2Fweb2llm/manifests","owne
r_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lx-0","download_url":"https://codeload.github.com/lx-0/web2llm/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248565079,"owners_count":21125415,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-tools","documentation-tools","knowledge-base","llm-tools","pdf-converter"],"created_at":"2024-12-24T09:12:33.449Z","updated_at":"2025-04-12T11:58:13.222Z","avatar_url":"https://github.com/lx-0.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🌐 web2llm - Website scraper for LLM consumption\n\n\u003e A Python tool that prepares online documentation for LLM consumption by downloading websites and converting them into standardized formats. 
Ideal for making post-cutoff documentation accessible to language models.\n\u003e\n\u003e For example, it can transform the latest Pydantic AI documentation into clean, structured PDFs, allowing LLMs to understand features released after their training cutoff.\n\n\u003cdiv align=\"center\"\u003e\n\n[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Code Style: Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n\n\u003c/div\u003e\n\n## 🎯 Purpose\n\nPrepare online documentation for LLM consumption through:\n\n- 📥 Downloading complete websites with full JavaScript support\n- 🔄 Converting content into standardized formats\n- 📚 Generating LLM-friendly PDFs with proper structure\n- 🤖 Making post-cutoff knowledge accessible\n\n## ✨ Features\n\n- 🌐 Website Processing\n  - Full JavaScript support\n  - Proper handling of relative paths\n  - Automatic resource collection\n- 📑 Document Generation\n  - Clean PDF output with proper formatting\n  - Automatic page breaks\n  - Table of contents generation\n  - Custom CSS and print styles support\n- 🛠️ Configuration Options\n  - Configurable margins and layout\n  - Environment variable support\n  - Progress tracking with color output\n  - Debug and quiet modes\n\n## 🔧 Prerequisites\n\n- Python 3.8 or higher\n- Conda (Miniconda or Anaconda)\n- HTTrack\n  - 🍎 macOS: `brew install httrack`\n  - 🐧 Linux: `apt-get install httrack`\n  - 🪟 Windows: Download from the HTTrack website\n- wkhtmltopdf (installed automatically via conda)\n\n## 🚀 Quick Start\n\n### 📦 Installation\n\n1. Install HTTrack for your OS (see Prerequisites)\n2. 
Clone the repository and set up the environment:\n\n```bash\n# Create and activate conda environment\nconda env create -f environment.yml\nconda activate web2llm\n\n# Install package\npip install -e .\n```\n\n### 📖 Usage\n\nBasic conversion:\n\n```bash\npython -m web2llm https://example.com --output docs.pdf\n```\n\n### 🎮 Command Options\n\n- `url`: Target website URL (required)\n- `--output`, `-o`: Output PDF path (required)\n- `--debug`: Keep temporary files and debug info\n- `--quiet`, `-q`: Suppress progress output\n- `--skip-download`: Use existing files\n- `--download-only`: Skip conversion\n\n### 📝 Examples\n\n1. Convert Pydantic AI docs:\n\n```bash\npython -m web2llm https://ai.pydantic.dev/ --output pydantic_ai.pdf\n```\n\n2. Debug mode:\n\n```bash\npython -m web2llm https://example.com --output docs.pdf --debug\n```\n\n3. Quiet mode for scripts:\n\n```bash\npython -m web2llm https://example.com --output docs.pdf --quiet\n```\n\n## ⚙️ Configuration\n\nSet via environment variables or `.env`:\n\n- `DOWNLOAD_DIR`: Temporary files location\n- `OUTPUT_DIR`: PDF output location\n\n## 🤝 Testing\n\nRun tests with pytest:\n\n```bash\n# Run all tests (via python)\npython -m pytest tests/ -v\n\n# Run all tests (directly)\npytest tests/ -v\n\n# Run tests with coverage report\npytest tests/ --cov=web2llm --cov-report=term-missing\n\n# Run specific test file\npytest tests/test_preprocessor.py -v\n\n# Run specific test function\npytest tests/test_preprocessor.py::test_normalize_url -v\n```\n\nTests cover:\n\n- CLI functionality\n- Website downloading\n- HTML preprocessing\n- SVG handling\n- Tabbed content processing\n- Document merging\n- PDF conversion\n\n## 🤝 Contributing\n\nContributions welcome! 
Please feel free to submit a Pull Request.\n\n## 📄 License\n\nThis project is licensed under the MIT License - see the LICENSE file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flx-0%2Fweb2llm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flx-0%2Fweb2llm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flx-0%2Fweb2llm/lists"}