https://github.com/ns81000/docdownloader
DocDownloader 📚 Python tool that downloads web documentation to clean Markdown. Features multiple crawling methods (sitemap/recursive), maintains document hierarchy, and respects robots.txt. Perfect for offline reading with user-friendly CLI and command-line options for automation.
- Host: GitHub
- URL: https://github.com/ns81000/docdownloader
- Owner: Ns81000
- License: mit
- Created: 2025-03-14T12:51:14.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-14T12:53:11.000Z (over 1 year ago)
- Last Synced: 2025-03-14T13:39:15.458Z (over 1 year ago)
- Topics: beautifulsoup, documentation, offline, python, sitemap
- Language: Python
- Homepage:
- Size: 0 Bytes
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Documentation Downloader 📚
A powerful and user-friendly Python tool that downloads web documentation and converts it to clean Markdown format! Perfect for offline reading, documentation migration, or content analysis.
## ✨ Features
- 🔄 Multiple crawling methods:
  - Sitemap-based crawling (auto-detects common sitemap locations)
  - Recursive link-following for sites without sitemaps
  - Custom sitemap URL support
  - Support for sitemap indexes and nested sitemaps
- 📝 Converts HTML to clean Markdown format
- 🌳 Maintains documentation structure with proper directory hierarchy
- 🚀 Shows real-time progress with nice progress bars
- 🕊 Rate-limits its own requests and respects robots.txt rules
- 🎯 Smart error handling and detailed logging
- 💾 Organized output with clean filenames
- 🎨 User-friendly command-line interface with clear prompts
- 📊 Command-line arguments support for automation/scripting
- 💡 Set maximum pages to download and custom delay between requests
## 🛠 Installation
1. **Clone or download this repository**
2. **Create a virtual environment:**
```bash
# On Windows
python -m venv venv
.\venv\Scripts\activate
# On macOS/Linux
python -m venv venv
source venv/bin/activate
```
3. **Install dependencies:**
```bash
pip install -r requirements.txt
```
## 🔧 Configuration Parameters
### Request Delay
- **Purpose**: Controls the time interval between consecutive requests to avoid overloading the target server.
- **Default**: 1.0 second
- **Usage**:
  - Command line: `--delay 2.5` (in seconds)
  - Interactive mode: Enter value when prompted
- **Recommendation**: Use higher values (2-3 seconds) for smaller servers, lower values (0.5-1 second) for robust sites.
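The effect of the delay can be sketched as a small generator that pauses between consecutive URLs. This is only an illustration, not the tool's actual code: the name `throttled` and the injectable `sleep` parameter are assumptions made so the behaviour is easy to test.

```python
import time
from typing import Callable, Iterable, Iterator


def throttled(
    urls: Iterable[str],
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield URLs one at a time, pausing `delay` seconds between them."""
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # no pause is needed before the very first request
        yield url


if __name__ == "__main__":
    # Hypothetical URLs; a real crawler would fetch each one here.
    for page in throttled(["https://docs.example.com/a", "https://docs.example.com/b"], delay=0.5):
        print("would fetch", page)
```

Passing a different `sleep` callable makes it possible to check the pacing without actually waiting.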
### Page Limit
- **Purpose**: Sets the maximum number of pages to download, preventing unintended large-scale crawling.
- **Usage**:
  - Command line: `--max-pages 100`
  - Interactive mode: Enter value when prompted or leave empty for no limit
- **Note**: Setting this appropriately helps control execution time and output size.
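A page limit is typically enforced inside the crawl loop itself. The sketch below shows one plausible shape, assuming a breadth-first queue; `crawl` and `get_links` are hypothetical names, and the tool's real loop may differ.

```python
from collections import deque
from typing import Callable, Iterable, List, Optional


def crawl(
    start: str,
    get_links: Callable[[str], Iterable[str]],
    max_pages: Optional[int] = None,
) -> List[str]:
    """Breadth-first visit that stops after `max_pages` pages (None = no limit)."""
    queue = deque([start])
    seen = {start}
    visited: List[str] = []
    while queue:
        if max_pages is not None and len(visited) >= max_pages:
            break  # limit reached; whatever is left in the queue stays unvisited
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return visited
```

Here `get_links` stands in for "download the page and extract its links", which keeps the limit logic separate from any network code.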
### Robots.txt Compliance
- **Purpose**: Determines whether the crawler should respect robots.txt restrictions.
- **Default**: Enabled (respects robots.txt)
- **Usage**:
  - Command line: Use `--no-robots` to disable
  - Interactive mode: Answer 'n' when prompted
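To illustrate what the check involves, here is a minimal sketch that uses Python's standard-library `urllib.robotparser` as a stand-in for the robots.txt library the tool actually depends on. The helper name `build_checker` and its `respect_robots` parameter are assumptions; the latter plays the role of the inverse of `--no-robots`.

```python
from typing import Callable, Iterable
from urllib.robotparser import RobotFileParser


def build_checker(
    robots_lines: Iterable[str],
    user_agent: str = "*",
    respect_robots: bool = True,
) -> Callable[[str], bool]:
    """Return a function that says whether a URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # lines of an already-downloaded robots.txt

    def allowed(url: str) -> bool:
        if not respect_robots:
            return True  # compliance switched off: every URL is allowed
        return parser.can_fetch(user_agent, url)

    return allowed
```

A crawler would call `allowed(url)` before each request and skip any URL for which it returns `False`.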
### Crawling Method
- **Purpose**: Determines how the tool discovers pages to download.
- **Options**:
  1. **Auto-detect sitemap** (`--method auto`): Fastest when sitemaps are available
  2. **Recursive crawling** (`--method recursive`): Most thorough but slower
  3. **Custom sitemap URL** (`--method sitemap --sitemap URL`): Best for known sitemap locations
- **Note**: Choose based on the structure of the documentation site and your specific needs.
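The two discovery strategies can be sketched with the standard library alone. Both helpers below are illustrative assumptions: the README does not list which sitemap locations are probed, so `COMMON_SITEMAP_PATHS` is a guess, and `same_domain_links` only shows the kind of filtering a recursive crawl needs to stay within one site.

```python
from typing import Iterable, List
from urllib.parse import urldefrag, urljoin, urlparse

# Assumed candidates; the tool's actual list of locations is not documented here.
COMMON_SITEMAP_PATHS = ("sitemap.xml", "sitemap_index.xml")


def candidate_sitemaps(base_url: str) -> List[str]:
    """Build the sitemap URLs to probe for a documentation base URL."""
    root = base_url if base_url.endswith("/") else base_url + "/"
    return [urljoin(root, path) for path in COMMON_SITEMAP_PATHS]


def same_domain_links(page_url: str, hrefs: Iterable[str]) -> List[str]:
    """Resolve hrefs against the page and keep only links on the same host."""
    domain = urlparse(page_url).netloc
    result: List[str] = []
    for href in hrefs:
        # Resolve relative links and drop "#fragment" so anchors do not create duplicates.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if urlparse(absolute).netloc == domain and absolute not in result:
            result.append(absolute)
    return result
```

With `--method auto`, a tool could try each candidate in order and use the first one that responds; with `--method recursive`, the filtered links feed the crawl queue.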
### Output Directory
- **Purpose**: Specifies where the converted Markdown files will be saved.
- **Default**: 'markdown_docs'
- **Usage**:
  - Command line: `--output custom_folder_name`
  - Interactive mode: Enter value when prompted
## 🚀 Usage
### Interactive Mode
1. **Activate the virtual environment** (if not already activated):
```bash
# On Windows
.\venv\Scripts\activate
# On macOS/Linux
source venv/bin/activate
```
2. **Run the script:**
```bash
python main.py
```
3. **Follow the interactive prompts:**
   - Enter the documentation base URL
   - Choose your preferred crawling method:
     1. Auto-detect sitemap.xml (tries common locations)
     2. Recursive crawling (follows links within the domain)
     3. Enter custom sitemap URL
   - Choose an output directory for the Markdown files
   - Set optional parameters like delay between requests
### Command-Line Arguments Mode (for automation)
You can also run the script with command-line arguments for automation:
```bash
python main.py --url https://docs.example.com --output docs_output --method recursive --delay 1.5 --max-pages 100 --no-robots
```
Available arguments:
- `--url`: Base URL of the documentation
- `--output`: Output directory name (default: markdown_docs)
- `--method`: Crawling method (auto/recursive/sitemap)
- `--sitemap`: Custom sitemap URL (required if method=sitemap)
- `--delay`: Delay between requests in seconds (default: 1.0)
- `--max-pages`: Maximum number of pages to download
- `--no-robots`: Ignore robots.txt restrictions
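For readers who want to wrap or reimplement this interface, the argument list above maps naturally onto `argparse`. The following is a sketch under that assumption, not the contents of `main.py`; in particular, leaving `--method` and `--max-pages` without defaults is a guess.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Download documentation as Markdown")
    parser.add_argument("--url", help="Base URL of the documentation")
    parser.add_argument("--output", default="markdown_docs", help="Output directory name")
    parser.add_argument("--method", choices=("auto", "recursive", "sitemap"), help="Crawling method")
    parser.add_argument("--sitemap", help="Custom sitemap URL (needed with --method sitemap)")
    parser.add_argument("--delay", type=float, default=1.0, help="Seconds between requests")
    parser.add_argument("--max-pages", type=int, default=None, help="Maximum pages to download")
    parser.add_argument("--no-robots", action="store_true", help="Ignore robots.txt restrictions")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Enforce the documented dependency between --method sitemap and --sitemap.
    if args.method == "sitemap" and not args.sitemap:
        parser.error("--sitemap is required when --method is sitemap")
    return args
```

Calling `parse_args()` with no argument reads `sys.argv`, while passing an explicit list is convenient for scripts and tests.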
## 📝 Example
```bash
$ python main.py
╔═══════════════════════════════════════════╗
║ Documentation Downloader v1.0 ║
║ Convert Docs to Markdown ║
╚═══════════════════════════════════════════╝
Welcome to Documentation Downloader!
This tool will help you convert web documentation to Markdown format.
Enter the base documentation URL: https://docs.example.com
Choose crawling method:
1. Auto-detect sitemap.xml
2. Recursive crawling (follows links)
3. Enter custom sitemap URL
Enter choice (1/2/3): 2
Enter output directory name [markdown_docs]: my_docs
Enter delay between requests in seconds [1.0]: 2
Maximum number of pages to download (leave empty for no limit): 50
Respect robots.txt restrictions? (y/n) [y]: y
Starting documentation download...
Downloading documentation: 100%|██████████| 42/42 [01:24<00:00]
Pages: 42, Pending: 13
Success! Documentation has been downloaded and converted.
You can find the Markdown files in the 'my_docs' directory.
```
## 📁 Output Structure
The downloaded documentation maintains its original structure:
```
my_docs/
├── index.md
├── getting-started/
│   ├── installation.md
│   └── configuration.md
├── guides/
│   ├── basic-usage.md
│   └── advanced-features.md
└── api/
    └── reference.md
```
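One way a page URL could be mapped onto a tree like this is sketched below. The function name `url_to_markdown_path` and its rules (directory-style URLs become `index.md`, a trailing `.html` or `.htm` is dropped) are assumptions for illustration; the tool's actual filename cleaning may differ.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_to_markdown_path(url: str, output_dir: str = "markdown_docs") -> str:
    """Map a page URL to a relative Markdown path that mirrors the URL hierarchy."""
    path = urlparse(url).path
    parts = [segment for segment in path.split("/") if segment]
    if not parts or path.endswith("/"):
        parts.append("index")  # directory-style URL: store it as index.md
    else:
        for ext in (".html", ".htm"):
            if parts[-1].endswith(ext):
                parts[-1] = parts[-1][: -len(ext)]  # drop the HTML extension
                break
    parts[-1] += ".md"
    return str(PurePosixPath(output_dir, *parts))
```

The returned path uses forward slashes; a caller would create the parent directories before writing the file.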
Each Markdown file includes:
- Clean, readable content
- Original formatting preserved
- YAML frontmatter with:
  - Original title
  - Source URL
  - Download timestamp
Example Markdown file:
```markdown
---
title: Getting Started Guide
source_url: https://docs.example.com/getting-started
date_downloaded: 2024-03-14 11:20:15
---
# Getting Started
Rest of the converted content...
```
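A file in this format could be assembled as in the following sketch. The helper `with_frontmatter` is hypothetical and writes the three fields as plain YAML scalars, so a title containing characters such as `: ` would need quoting in a real implementation.

```python
from datetime import datetime
from typing import Optional


def with_frontmatter(
    title: str,
    source_url: str,
    body: str,
    downloaded: Optional[datetime] = None,
) -> str:
    """Prefix converted Markdown with a YAML frontmatter block."""
    stamp = (downloaded or datetime.now()).strftime("%Y-%m-%d %H:%M:%S")
    header = [
        "---",
        f"title: {title}",  # plain scalar: assumes the title needs no YAML quoting
        f"source_url: {source_url}",
        f"date_downloaded: {stamp}",
        "---",
    ]
    return "\n".join(header) + "\n" + body
```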
## 🔍 Logging
The script creates a `crawler.log` file with detailed information about the download process, helpful for debugging any issues.
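For comparison, a file logger of this kind can be set up with the standard `logging` module as sketched here. The logger name `docdownloader` and the message format are assumptions, not taken from the tool.

```python
import logging


def make_logger(log_file: str = "crawler.log") -> logging.Logger:
    """Create (or reuse) a logger that appends timestamped lines to `log_file`."""
    logger = logging.getLogger("docdownloader")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching a second handler on repeated calls
        handler = logging.FileHandler(log_file, encoding="utf-8")
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = make_logger()
    log.warning("Skipped %s (HTTP %d)", "https://docs.example.com/missing", 404)
```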
## 🛠️ Advanced Features
### Robots.txt Support
The tool respects robots.txt rules by default, but you can disable this with the `--no-robots` flag or by answering "n" to the robots.txt prompt.
### Sitemap Parsing
The tool can handle both standard sitemaps and sitemap indexes (which contain links to multiple sitemaps).
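The difference between the two document types can be sketched with `xml.etree.ElementTree`: a `<urlset>` lists pages directly, while a `<sitemapindex>` lists further sitemaps that must be followed. The function names and the injected `fetch` callable below are assumptions made so the example runs without network access.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional, Set, Tuple

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text: str) -> Tuple[str, List[str]]:
    """Return ("index" or "urlset", list of <loc> values) for one sitemap document."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag == SITEMAP_NS + "sitemapindex" else "urlset"
    locations = [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]
    return kind, locations


def collect_page_urls(
    sitemap_url: str,
    fetch: Callable[[str], str],
    seen: Optional[Set[str]] = None,
) -> List[str]:
    """Follow sitemap indexes recursively and return the page URLs they lead to."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:
        return []  # guard against sitemaps that reference each other
    seen.add(sitemap_url)
    kind, locations = parse_sitemap(fetch(sitemap_url))
    if kind == "urlset":
        return locations
    pages: List[str] = []
    for child in locations:
        pages.extend(collect_page_urls(child, fetch, seen))
    return pages
```

In practice `fetch` would download the sitemap text over HTTP; passing it in as a parameter keeps the parsing logic independent of the HTTP client.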
### Error Handling
The tool provides detailed error handling and logging, with graceful fallbacks when issues occur.
## ⚠️ Important Notes
1. Choose the appropriate crawling method:
   - **Sitemap-based**: Faster and more efficient if available
   - **Recursive**: More thorough but slower, great for sites without sitemaps
2. Respect website terms of service and robots.txt
3. Use reasonable delays between requests (default: 1 second)
4. Some websites may block automated downloads
5. Large documentation sites may take significant time to download
## 🤝 Contributing
Contributions are welcome! Feel free to:
- Report issues
- Suggest improvements
- Submit pull requests
## 📜 License
This project is licensed under the MIT License - see the LICENSE file for details.
## 🙏 Acknowledgments
- Built with Python and lots of ❤️
- Uses excellent libraries:
  - beautifulsoup4 for HTML parsing
  - html2text for conversion
  - tqdm for progress bars
  - requests for HTTP requests
  - validators for URL validation
  - python-robots for robots.txt parsing
- Inspired by the need for offline documentation access