https://github.com/pkharsimran/website-downloader

Website-downloader is a powerful and versatile Python script designed to download entire websites along with all their assets. This tool allows you to create a local copy of a website, including HTML pages, images, CSS, JavaScript files, and other resources. It is ideal for web archiving, offline browsing, and web development.



Host: GitHub
URL: https://github.com/pkharsimran/website-downloader
Owner: PKHarsimran
License: mit
Created: 2024-07-03T20:59:45.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2026-03-01T00:19:55.000Z (5 months ago)
Last Synced: 2026-03-01T03:43:22.677Z (5 months ago)
Topics: automation, beautifulsoup, data-mining, html, internet-tools, offline-browsing, open-source, python, python-scripts, requests, web-archiving, web-scraping, website-cloner, website-downloader, wget
Language: Python
Homepage: https://harsim.ca/
Size: 92.8 KB
Stars: 107
Watchers: 4
Forks: 23
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE


# 🌐 Website Downloader CLI

[![CI – Website Downloader](https://github.com/PKHarsimran/website-downloader/actions/workflows/python-app.yml/badge.svg)](https://github.com/PKHarsimran/website-downloader/actions/workflows/python-app.yml)

[![Lint & Style](https://github.com/PKHarsimran/website-downloader/actions/workflows/lint.yml/badge.svg)](https://github.com/PKHarsimran/website-downloader/actions/workflows/lint.yml)

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

[![Python](https://img.shields.io/badge/Python-3.10%2B-blue.svg)](https://www.python.org/)

[![Code style: Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

Website Downloader CLI is a lightweight, pure-Python site mirroring tool that creates a fully browsable offline copy of any publicly accessible website.

* Recursively crawls every same-origin link (including “pretty” `/about/` URLs)

* Downloads **all** assets (images, CSS, JS, …)

* Rewrites internal links so pages open flawlessly from your local disk

* Streams files concurrently with automatic retry / back-off

* Generates a clean directory tree that mirrors the site's URL structure (`example_com/index.html`, `example_com/about/index.html`, …)

* Handles extremely long filenames safely via hashing and graceful fallbacks

> Perfect for web archiving, pentesting labs, long flights, or just poking around a site without an internet connection.
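
To make the "recursively crawls every same-origin link" bullet concrete, here is a minimal sketch of a breadth-first, same-origin crawl with a page cap. It is not the project's code: `crawl`, `same_origin`, the `get_links` callback, and the in-memory `SITE` dictionary are hypothetical names used only for illustration, and no network access happens.

```python
from collections import deque
from urllib.parse import urljoin, urlsplit


def same_origin(a: str, b: str) -> bool:
    """True when both URLs share scheme and host (including port)."""
    pa, pb = urlsplit(a), urlsplit(b)
    return (pa.scheme, pa.netloc) == (pb.scheme, pb.netloc)


def crawl(start: str, get_links, max_pages: int = 50) -> list[str]:
    """Breadth-first walk from `start`, visiting at most `max_pages` pages.

    `get_links(url)` is a hypothetical callback returning the raw hrefs
    found on a page; a real crawler would fetch and parse the HTML here.
    """
    seen = {start}
    frontier = deque([start])
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for href in get_links(url):
            # Resolve relative links against the current page, drop fragments.
            target = urljoin(url, href).split("#", 1)[0]
            if same_origin(start, target) and target not in seen:
                seen.add(target)
                frontier.append(target)
    return visited


# Tiny in-memory "site" standing in for real HTTP responses.
SITE = {
    "https://example.com/": ["/about/", "/blog", "https://other.org/", "mailto:a@b.c"],
    "https://example.com/about/": ["/", "/team"],
    "https://example.com/blog": ["/about/"],
    "https://example.com/team": [],
}

pages = crawl("https://example.com/", lambda u: SITE.get(u, []), max_pages=3)
print(pages)
```

The off-site link and the `mailto:` link are never queued because their scheme/host pair differs from the start URL, and the cap stops the walk before `/team` is visited.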

## ❤️ Support This Project

If you find this tool useful, consider supporting the project:

[Donate via PayPal](https://www.paypal.com/donate/?business=MVEWG3QAX6UBC&no_recurring=1&item_name=Github+Project+-+Website+downloader&currency_code=CAD)

---

## 🚀 Quick Start

```bash
# 1. Grab the code
git clone https://github.com/PKHarsimran/website-downloader.git
cd website-downloader

# 2. Install dependencies (only two runtime libs!)
pip install -r requirements.txt

# 3. Mirror a site – no prompts needed
python website-downloader.py \
    --url https://harsim.ca \
    --destination harsim_ca_backup \
    --max-pages 100 \
    --threads 8
```

---

## 🛠️ Libraries Used

| Library | Purpose |
|----------|----------|
| **requests** + **urllib3.Retry** | HTTP client with automatic retry, backoff, and persistent session handling |
| **BeautifulSoup (bs4)** | Parses HTML and extracts `<a>`, `<img>`, `<script>`, and `<link>` elements |
| **argparse** | Provides structured CLI argument parsing and validation |
| **logging** | Dual console + file logging with crawl progress and summary metrics |
| **threading** & **queue** | Concurrent asset downloading via lightweight worker pool |
| **pathlib** & **os** | Cross-platform filesystem management and safe directory creation |
| **urllib.parse** | URL parsing, normalization, and safe internal link rewriting |
| **hashlib (sha256)** | Generates stable hashes for long filenames and query-string collisions |
| **posixpath** | Normalizes URL paths while preventing traversal |
| **time** | Measures crawl duration and per-page performance |
| **sys** | Handles CLI exit codes and stream output management |
| **re** | Normalizes path segments and collapses malformed multi-dot filenames |
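
As an illustration of the **threading** & **queue** row, the sketch below shows one common way to build a small worker pool from those two modules. It is an assumption about the general shape, not the downloader's actual implementation: `run_pool` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real HTTP download.

```python
import queue
import threading


def run_pool(urls, fetch, threads: int = 6) -> dict:
    """Run `fetch(url)` for every URL on `threads` worker threads."""
    tasks: queue.Queue = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker() -> None:
        while True:
            url = tasks.get()
            if url is None:  # sentinel: no more work for this worker
                tasks.task_done()
                return
            try:
                outcome = fetch(url)
            except Exception as exc:  # record the failure, keep the worker alive
                outcome = exc
            with lock:
                results[url] = outcome
            tasks.task_done()

    workers = [threading.Thread(target=worker, daemon=True) for _ in range(threads)]
    for t in workers:
        t.start()
    for url in urls:
        tasks.put(url)
    for _ in workers:
        tasks.put(None)  # one sentinel per worker
    tasks.join()
    for t in workers:
        t.join()
    return results


def fake_fetch(url: str) -> int:
    """Stand-in for a real download; returns a pretend byte count."""
    if url.endswith("missing.png"):
        raise IOError("404")
    return len(url)


out = run_pool(["/a.css", "/b.js", "/missing.png"], fake_fetch, threads=2)
print(out)
```

Storing the exception instead of letting it escape means one failed asset does not kill a worker, which is a typical design choice when mirroring sites with a few broken references.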


## 🗂️ Project Structure

| Path | What it is | Key features |
|------|------------|--------------|
| `website_downloader.py` | **Single-entry CLI** that performs the entire crawl *and* link-rewriting pipeline. | • Persistent `requests.Session` with automatic retries<br>• Breadth-first crawl capped by `--max-pages` (default = 50)<br>• Thread-pool (configurable via `--threads`, default = 6) to fetch images/CSS/JS in parallel<br>• Robust link rewriting so every internal URL works offline (pretty-URL folders ➜ `index.html`, plain paths ➜ `.html`)<br>• Smart output folder naming (`example.com` → `example_com`)<br>• Colourised console + file logging with per-page latency and crawl summary |
| `requirements.txt` | Minimal dependency pin-list. Only **`requests`** and **`beautifulsoup4`** are third-party; everything else is Python ≥ 3.10 std-lib. | |
| `web_scraper.log` | Auto-generated run log (rotates/overwrites on each invocation). Useful for troubleshooting or audit trails. | |
| `README.md` | The document you’re reading – quick-start, flags, and architecture notes. | |
| *(output folder)* | Created at runtime (`example_com/ …`) – mirrors the remote directory tree with `index.html` stubs and all static assets. | |

> **Removed:** The old `check_download.py` verifier is no longer required because the new downloader performs integrity checks (missing files, broken internal links) during the crawl and reports any issues directly in the log summary.
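
The link-rewriting rules listed in the table (pretty-URL folders ➜ `index.html`, plain paths ➜ `.html`, `example.com` → `example_com`) can be sketched roughly as follows. This is a simplified illustration under assumptions: the helper names `site_folder`, `local_path`, and `relative_link` are hypothetical, and query strings, hashing, and other edge cases the real script handles are ignored.

```python
import posixpath
from urllib.parse import urlsplit


def site_folder(url: str) -> str:
    """Derive the output folder name from the host: example.com -> example_com."""
    return urlsplit(url).netloc.replace(".", "_").replace(":", "_")


def local_path(url: str) -> str:
    """Map a URL to a relative file path inside the output folder."""
    path = urlsplit(url).path or "/"
    if path.endswith("/"):
        path += "index.html"      # pretty-URL folder -> index.html
    elif not posixpath.splitext(path)[1]:
        path += ".html"           # extension-less page -> .html
    return path.lstrip("/")


def relative_link(from_url: str, to_url: str) -> str:
    """Relative href that keeps working when both files are opened from disk."""
    start = posixpath.dirname(local_path(from_url)) or "."
    return posixpath.relpath(local_path(to_url), start)


print(site_folder("https://example.com/"))
print(local_path("https://example.com/about/"))
print(relative_link("https://example.com/about/", "https://example.com/css/site.css"))
```

Rewriting to relative paths, rather than absolute `file://` URLs, is what lets the mirrored folder be moved or zipped without breaking links.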

## ✨ Recent Improvements

✅ Type Conversion Fix

Fixed a `TypeError` caused by `int(..., 10)` when non-string arguments were passed.
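
For context, Python's built-in `int()` only accepts an explicit base when its first argument is a string (or bytes-like), so `int(8, 10)` raises `TypeError`. The sketch below reproduces that and shows one defensive conversion; `to_int` is a hypothetical helper, not necessarily how the fix was made in this project.

```python
def to_int(value, default: int = 0) -> int:
    """Convert CLI-style input to int without tripping over non-strings."""
    if isinstance(value, int):       # already an int: nothing to parse
        return value
    try:
        return int(str(value).strip(), 10)
    except ValueError:
        return default


raised = False
try:
    int(8, 10)                       # explicit base with a non-string argument
except TypeError:
    raised = True

print(raised, to_int(8), to_int(" 12 "), to_int("abc", 5))
```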

✅ Safer Path Handling

Added intelligent path shortening and hashing for long filenames to prevent `OSError: [Errno 36] File name too long` errors.
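
One way to picture the shortening step is shown below: keep names that fit, otherwise truncate the stem and append a short SHA-256 tag so distinct long names stay distinct. This is a sketch under assumptions, not the script's actual routine: `shorten` is a hypothetical name, the 255-byte limit is the usual per-component maximum on common Linux filesystems, and the 16-character digest length is arbitrary.

```python
import hashlib
from pathlib import PurePosixPath

MAX_NAME = 255  # assumed per-component byte limit; varies by filesystem


def shorten(name: str, limit: int = MAX_NAME) -> str:
    """Return `name` if it fits, else truncated stem + SHA-256 tag + suffix."""
    raw = name.encode("utf-8")
    if len(raw) <= limit:
        return name
    suffix = PurePosixPath(name).suffix[:16]            # keep the extension
    digest = hashlib.sha256(raw).hexdigest()[:16]       # stable for the same input
    tail = f"-{digest}{suffix}"
    budget = limit - len(tail.encode("utf-8"))
    # Cut on bytes, then drop any partial multi-byte character at the end.
    stem = raw[:budget].decode("utf-8", errors="ignore")
    return stem + tail


long_name = "a" * 300 + ".jpg"
short = shorten(long_name)
print(len(short), short[-21:])
```

Because the tag is derived from the full original name, re-running the crawl maps the same URL to the same file.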

✅ Improved CLI Experience

Rebuilt argument parsing with `argparse` for cleaner syntax and validation.
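
A minimal `argparse` setup covering the flags shown in Quick Start might look like the sketch below. The flag names and the defaults of 50 pages and 6 threads come from this README; everything else (making `--url` required, the `positive_int` validator, the help strings) is an assumption for illustration and may differ from the real script.

```python
import argparse


def positive_int(text: str) -> int:
    """argparse type: accept only integers >= 1."""
    value = int(text)  # a ValueError here is reported by argparse as a usage error
    if value < 1:
        raise argparse.ArgumentTypeError(f"expected a positive integer, got {text!r}")
    return value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Mirror a website for offline use.")
    parser.add_argument("--url", required=True, help="start URL to mirror")
    parser.add_argument("--destination", help="output folder")
    parser.add_argument("--max-pages", type=positive_int, default=50)
    parser.add_argument("--threads", type=positive_int, default=6)
    return parser


args = build_parser().parse_args(["--url", "https://harsim.ca", "--max-pages", "100"])
print(args.url, args.destination, args.max_pages, args.threads)
```

Putting validation in a `type=` callable lets `argparse` print a usage message and exit with a non-zero status for input such as `--threads 0`, without extra checks in the main code.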

✅ Code Quality & Linting

Applied Black + Flake8 formatting; the project now passes all CI lint checks.

✅ Logging & Stability

Improved error handling, logging, and fallback mechanisms for failed writes.
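
The dual console + file logging described in this README can be approximated with the standard `logging` module as sketched here. The log file name `web_scraper.log` and the overwrite-per-run behaviour come from the Project Structure table; the logger name, format string, and level split are assumptions, and colourised output is left out.

```python
import logging
import sys


def setup_logging(log_file: str = "web_scraper.log") -> logging.Logger:
    """Send INFO+ to the console and DEBUG+ to a file that is overwritten per run."""
    logger = logging.getLogger("website_downloader_sketch")  # hypothetical name
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()  # avoid duplicate handlers if called twice
    fmt = logging.Formatter("%(asctime)s | %(levelname)-8s | %(message)s")

    console = logging.StreamHandler(sys.stdout)
    console.setLevel(logging.INFO)
    console.setFormatter(fmt)

    to_file = logging.FileHandler(log_file, mode="w", encoding="utf-8")
    to_file.setLevel(logging.DEBUG)
    to_file.setFormatter(fmt)

    logger.addHandler(console)
    logger.addHandler(to_file)
    return logger
```

Calling `setup_logging()` once at start-up and then using `logger.info(...)` / `logger.debug(...)` keeps the console readable while the file retains the full detail for troubleshooting.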

✅ Skip Non-Fetchable Schemes  

The crawler now safely skips `mailto:`, `tel:`, `javascript:`, and `data:` links instead of trying to download them.  

This prevents `requests.exceptions.InvalidSchema: No connection adapters were found` errors and keeps those links intact in saved HTML.
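
A scheme check of this kind can be written with `urllib.parse.urlsplit`, as in the sketch below. The four schemes are the ones named above; the function name `is_fetchable` and the constant `SKIP_SCHEMES` are hypothetical, and the real script may structure the check differently.

```python
from urllib.parse import urlsplit

SKIP_SCHEMES = {"mailto", "tel", "javascript", "data"}


def is_fetchable(href: str) -> bool:
    """False for links that an HTTP client cannot (and should not) download."""
    scheme = urlsplit(href.strip()).scheme.lower()
    return scheme not in SKIP_SCHEMES


links = [
    "/about/",
    "https://example.com/logo.png",
    "mailto:hello@example.com",
    "tel:+1-555-0100",
    "javascript:void(0)",
    "data:image/png;base64,AAAA",
]
print([href for href in links if is_fetchable(href)])
```

Links that fail the check are simply left untouched in the saved HTML, so a `mailto:` link still opens the mail client when the page is viewed offline.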

✅ Improved Path Normalization

-   Decodes URL-encoded segments (`%20` → space)

-   Trims unnecessary whitespace

-   Collapses accidental multi-dot filenames (`file....jpg` → `file.jpg`)

-   Preserves traversal protection and hashing safeguards
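
The normalization steps in the list above can be combined roughly as follows. This is a sketch of the idea using `urllib.parse.unquote`, `re`, and `posixpath`, not a copy of the project's function; `normalize_path` is a hypothetical name and the hashing safeguard is omitted.

```python
import posixpath
import re
from urllib.parse import unquote


def normalize_path(url_path: str) -> str:
    """Clean a URL path for use as a relative path inside the output folder."""
    segments = []
    for raw in unquote(url_path).split("/"):      # %20 -> space, %2e -> "."
        seg = raw.strip()                         # trim stray whitespace
        if seg in ("", ".", ".."):
            segments.append(seg)                  # resolved by normpath below
            continue
        segments.append(re.sub(r"\.{2,}", ".", seg))  # file....jpg -> file.jpg
    joined = "/".join(segments).lstrip("/")
    # Anchoring at "/" means ".." can never climb above the output root.
    return posixpath.normpath("/" + joined).lstrip("/")


for sample in ("/img/my%20photo.jpg", "/a/file....jpg", "/a/%2e%2e/%2e%2e/secret.txt"):
    print(repr(normalize_path(sample)))
```

Decoding before the traversal check matters: an encoded `%2e%2e` segment would otherwise slip past a naive `..` filter.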

---

## 🤝 Contributing

Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.

## 📜 License

This project is licensed under the MIT License.
