{"id":26295571,"url":"https://github.com/ns81000/docdownloader","last_synced_at":"2025-03-15T04:14:26.838Z","repository":{"id":282409401,"uuid":"948498235","full_name":"Ns81000/DocDownloader","owner":"Ns81000","description":"DocDownloader 📚 Python tool that downloads web documentation to clean Markdown. Features multiple crawling methods (sitemap/recursive), maintains document hierarchy, and respects robots.txt. Perfect for offline reading with user-friendly CLI and command-line options for automation.","archived":false,"fork":false,"pushed_at":"2025-03-14T12:53:11.000Z","size":0,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-14T13:39:15.458Z","etag":null,"topics":["beautifulsoup","documentation","offline","python","sitemap"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Ns81000.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-14T12:51:14.000Z","updated_at":"2025-03-14T12:54:12.000Z","dependencies_parsed_at":"2025-03-14T13:39:17.660Z","dependency_job_id":"fc369569-ee09-4d85-9efb-f821f61de8c3","html_url":"https://github.com/Ns81000/DocDownloader","commit_stats":null,"previous_names":["ns81000/docdownloader"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ns81000%2FDocDownloader","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ns81000%2FDocDownloader/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ns81000%2FDocDownloader/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ns81000%2FDocDownloader/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Ns81000","download_url":"https://codeload.github.com/Ns81000/DocDownloader/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243681077,"owners_count":20330155,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["beautifulsoup","documentation","offline","python","sitemap"],"created_at":"2025-03-15T04:14:25.979Z","updated_at":"2025-03-15T04:14:26.827Z","avatar_url":"https://github.com/Ns81000.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Documentation Downloader 📚\n\nA powerful and user-friendly Python tool that downloads web documentation and converts it to clean Markdown format! Perfect for offline reading, documentation migration, or content analysis.\n\n## ✨ Features\n\n- 🔄 Multiple crawling methods:\n  - Sitemap-based crawling (auto-detects common sitemap locations)\n  - Recursive link-following for sites without sitemaps\n  - Custom sitemap URL support\n  - Support for sitemap indexes and nested sitemaps\n- 📝 Converts HTML to clean Markdown format\n- 🌳 Maintains documentation structure with proper directory hierarchy\n- 🚀 Shows real-time progress with nice progress bars\n- 🕊 Respects rate limiting and robots.txt rules\n- 🎯 Smart error handling and detailed logging\n- 💾 Organized output with clean filenames\n- 🎨 User-friendly command-line interface with clear prompts\n- 📊 Command-line arguments support for automation/scripting\n- 💡 Set maximum pages to download and custom delay between requests\n\n## 🛠 Installation\n\n1. **Clone or download this repository**\n\n2. **Create a virtual environment:**\n   ```bash\n   # On Windows\n   python -m venv venv\n   .\\venv\\Scripts\\activate\n\n   # On macOS/Linux\n   python -m venv venv\n   source venv/bin/activate\n   ```\n\n3. **Install dependencies:**\n   ```bash\n   pip install -r requirements.txt\n   ```\n\n## 🔧 Configuration Parameters\n\n### Request Delay\n- **Purpose**: Controls the time interval between consecutive requests to avoid overloading the target server.\n- **Default**: 1.0 second\n- **Usage**: \n  - Command line: `--delay 2.5` (in seconds)\n  - Interactive mode: Enter value when prompted\n- **Recommendation**: Use higher values (2-3 seconds) for smaller servers, lower values (0.5-1 second) for robust sites.\n\n### Page Limit\n- **Purpose**: Sets the maximum number of pages to download, preventing unintended large-scale crawling.\n- **Usage**:\n  - Command line: `--max-pages 100`\n  - Interactive mode: Enter value when prompted or leave empty for no limit\n- **Note**: Setting this appropriately helps control execution time and output size.\n\n### Robots.txt Compliance\n- **Purpose**: Determines whether the crawler should respect robots.txt restrictions.\n- **Default**: Enabled (respects robots.txt)\n- **Usage**:\n  - Command line: Use `--no-robots` to disable\n  - Interactive mode: Answer 'n' when prompted\n\n### Crawling Method\n- **Purpose**: Determines how the tool discovers pages to download.\n- **Options**:\n  1. **Auto-detect sitemap** (`--method auto`): Fastest when sitemaps are available\n  2. **Recursive crawling** (`--method recursive`): Most thorough but slower\n  3. **Custom sitemap URL** (`--method sitemap --sitemap URL`): Best for known sitemap locations\n- **Note**: Choose based on the structure of the documentation site and your specific needs.\n\n### Output Directory\n- **Purpose**: Specifies where the converted Markdown files will be saved.\n- **Default**: 'markdown_docs'\n- **Usage**:\n  - Command line: `--output custom_folder_name`\n  - Interactive mode: Enter value when prompted\n\n## 🚀 Usage\n\n### Interactive Mode\n\n1. **Activate the virtual environment** (if not already activated):\n   ```bash\n   # On Windows\n   .\\venv\\Scripts\\activate\n\n   # On macOS/Linux\n   source venv/bin/activate\n   ```\n\n2. **Run the script:**\n   ```bash\n   python main.py\n   ```\n\n3. **Follow the interactive prompts:**\n   - Enter the documentation base URL\n   - Choose your preferred crawling method:\n     1. Auto-detect sitemap.xml (tries common locations)\n     2. Recursive crawling (follows links within the domain)\n     3. Enter custom sitemap URL\n   - Choose an output directory for the Markdown files\n   - Set optional parameters like delay between requests\n\n### Command-Line Arguments Mode (for automation)\n\nYou can also run the script with command-line arguments for automation:\n\n```bash\npython main.py --url https://docs.example.com --output docs_output --method recursive --delay 1.5 --max-pages 100 --no-robots\n```\n\nAvailable arguments:\n- `--url`: Base URL of the documentation\n- `--output`: Output directory name (default: markdown_docs)\n- `--method`: Crawling method (auto/recursive/sitemap)\n- `--sitemap`: Custom sitemap URL (required if method=sitemap)\n- `--delay`: Delay between requests in seconds (default: 1.0)\n- `--max-pages`: Maximum number of pages to download\n- `--no-robots`: Ignore robots.txt restrictions\n\n## 📝 Example\n\n```bash\n$ python main.py\n\n╔═══════════════════════════════════════════╗\n║     Documentation Downloader v1.0         ║\n║         Convert Docs to Markdown          ║\n╚═══════════════════════════════════════════╝\n\nWelcome to Documentation Downloader!\nThis tool will help you convert web documentation to Markdown format.\n\nEnter the base documentation URL: https://docs.example.com\n\nChoose crawling method:\n1. Auto-detect sitemap.xml\n2. Recursive crawling (follows links)\n3. Enter custom sitemap URL\n\nEnter choice (1/2/3): 2\n\nEnter output directory name [markdown_docs]: my_docs\n\nEnter delay between requests in seconds [1.0]: 2\n\nMaximum number of pages to download (leave empty for no limit): 50\n\nRespect robots.txt restrictions? (y/n) [y]: y\n\nStarting documentation download...\nDownloading documentation: 100%|██████████| 42/42 [01:24\u003c00:00]\nPages: 42, Pending: 13\n\nSuccess! Documentation has been downloaded and converted.\nYou can find the Markdown files in the 'my_docs' directory.\n```\n\n## 📁 Output Structure\n\nThe downloaded documentation maintains its original structure:\n```\nmy_docs/\n├── index.md\n├── getting-started/\n│   ├── installation.md\n│   └── configuration.md\n├── guides/\n│   ├── basic-usage.md\n│   └── advanced-features.md\n└── api/\n    └── reference.md\n```\n\nEach Markdown file includes:\n- Clean, readable content\n- Original formatting preserved\n- YAML frontmatter with:\n  - Original title\n  - Source URL\n  - Download timestamp\n\nExample Markdown file:\n```markdown\n---\ntitle: Getting Started Guide\nsource_url: https://docs.example.com/getting-started\ndate_downloaded: 2024-03-14 11:20:15\n---\n\n# Getting Started\n\nRest of the converted content...\n```\n\n## 🔍 Logging\n\nThe script creates a `crawler.log` file with detailed information about the download process, helpful for debugging any issues.\n\n## 🛠️ Advanced Features\n\n### Robots.txt Support\n\nThe tool respects robots.txt rules by default, but you can disable this with the `--no-robots` flag or by answering \"n\" to the robots.txt prompt.\n\n### Sitemap Parsing\n\nThe tool can handle both standard sitemaps and sitemap indexes (which contain links to multiple sitemaps).\n\n### Error Handling\n\nThe tool provides detailed error handling and logging, with graceful fallbacks when issues occur.\n\n## ⚠️ Important Notes\n\n1. Choose the appropriate crawling method:\n   - **Sitemap-based**: Faster and more efficient if available\n   - **Recursive**: More thorough but slower, great for sites without sitemaps\n2. Respect website terms of service and robots.txt\n3. Use reasonable delays between requests (default: 1 second)\n4. Some websites may block automated downloads\n5. Large documentation sites may take significant time to download\n\n## 🤝 Contributing\n\nContributions are welcome! Feel free to:\n- Report issues\n- Suggest improvements\n- Submit pull requests\n\n## 📜 License\n\nThis project is licensed under the MIT License - see the LICENSE file for details.\n\n## 🙏 Acknowledgments\n\n- Built with Python and lots of ❤️\n- Uses excellent libraries:\n  - beautifulsoup4 for HTML parsing\n  - html2text for conversion\n  - tqdm for progress bars\n  - requests for HTTP requests\n  - validators for URL validation\n  - python-robots for robots.txt parsing\n- Inspired by the need for offline documentation access\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fns81000%2Fdocdownloader","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fns81000%2Fdocdownloader","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fns81000%2Fdocdownloader/lists"}