https://github.com/pkharsimran/website-downloader
Website-downloader is a powerful and versatile Python script designed to download entire websites along with all their assets. This tool allows you to create a local copy of a website, including HTML pages, images, CSS, JavaScript files, and other resources. It is ideal for web archiving, offline browsing, and web development.
- Host: GitHub
- URL: https://github.com/pkharsimran/website-downloader
- Owner: PKHarsimran
- License: mit
- Created: 2024-07-03T20:59:45.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2026-03-01T00:19:55.000Z (3 months ago)
- Last Synced: 2026-03-01T03:43:22.677Z (3 months ago)
- Topics: automation, beautifulsoup, data-mining, html, internet-tools, offline-browsing, open-source, python, python-scripts, requests, web-archiving, web-scraping, website-cloner, website-downloader, wget
- Language: Python
- Homepage: https://harsim.ca/
- Size: 92.8 KB
- Stars: 107
- Watchers: 4
- Forks: 23
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Website Downloader CLI
Website Downloader CLI is a lightweight, pure-Python site mirroring tool that creates a fully browsable offline copy of any publicly accessible website.
* Recursively crawls every same-origin link (including "pretty" `/about/` URLs)
* Downloads **all** assets (images, CSS, JS, …)
* Rewrites internal links so pages open flawlessly from your local disk
* Streams files concurrently with automatic retry / back-off
* Generates a clean, flat directory tree (`example_com/index.html`, `example_com/about/index.html`, …)
* Handles extremely long filenames safely via hashing and graceful fallbacks
> Perfect for web archiving, pentesting labs, long flights, or just poking around a site without an internet connection.
## ❤️ Support This Project
If you find this tool useful, consider supporting the project:
[Donate via PayPal](https://www.paypal.com/donate/?business=MVEWG3QAX6UBC&no_recurring=1&item_name=Github+Project+-+Website+downloader&currency_code=CAD)
---
## Quick Start
```bash
# 1. Grab the code
git clone https://github.com/PKHarsimran/website-downloader.git
cd website-downloader
# 2. Install dependencies (only two runtime libs!)
pip install -r requirements.txt
# 3. Mirror a site (no prompts needed)
python website-downloader.py \
    --url https://harsim.ca \
    --destination harsim_ca_backup \
    --max-pages 100 \
    --threads 8
```
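The flags used in the Quick Start could be declared with `argparse` roughly as follows. This is a minimal sketch, not the project's actual parser: the defaults (50 pages, 6 threads) are the ones listed under Project Structure, while the `positive_int` helper, the help strings, and the choice to leave `--url` and `--destination` optional are assumptions made for illustration.

```python
import argparse


def positive_int(value: str) -> int:
    """argparse type callback: accept only integers greater than zero."""
    number = int(value)  # a ValueError here is reported by argparse as a usage error
    if number <= 0:
        raise argparse.ArgumentTypeError(f"{value!r} is not a positive integer")
    return number


def build_parser() -> argparse.ArgumentParser:
    """Declare the four flags shown in the Quick Start (hypothetical help text)."""
    parser = argparse.ArgumentParser(description="Mirror a website for offline use")
    parser.add_argument("--url", help="start URL of the crawl")
    parser.add_argument("--destination", help="output folder")
    parser.add_argument("--max-pages", type=positive_int, default=50)
    parser.add_argument("--threads", type=positive_int, default=6)
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    ["--url", "https://harsim.ca", "--max-pages", "100", "--threads", "8"]
)
print(args.url, args.destination, args.max_pages, args.threads)
```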
---
## 🛠️ Libraries Used
| Library | Purpose |
|----------|----------|
| **requests** + **urllib3.Retry** | HTTP client with automatic retry, backoff, and persistent session handling |
| **BeautifulSoup (bs4)** | Parses HTML and extracts the elements that reference pages and assets, including `<link>` |
| **argparse** | Provides structured CLI argument parsing and validation |
| **logging** | Dual console + file logging with crawl progress and summary metrics |
| **threading** & **queue** | Concurrent asset downloading via lightweight worker pool |
| **pathlib** & **os** | Cross-platform filesystem management and safe directory creation |
| **urllib.parse** | URL parsing, normalization, and safe internal link rewriting |
| **hashlib (sha256)** | Generates stable hashes for long filenames and query-string collisions |
| **posixpath** | Normalizes URL paths while preventing traversal |
| **time** | Measures crawl duration and per-page performance |
| **sys** | Handles CLI exit codes and stream output management |
| **re** | Normalizes path segments and collapses malformed multi-dot filenames |
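To illustrate the `threading` and `queue` row, here is a minimal worker-pool sketch using only the standard library. It is an assumption about the general pattern, not the project's code: `run_worker_pool`, the `handler` callback, and the sentinel-based shutdown are hypothetical names and choices.

```python
import queue
import threading
from typing import Callable, Iterable


def run_worker_pool(
    items: Iterable[str],
    handler: Callable[[str], None],
    threads: int = 6,
) -> list[tuple[str, Exception]]:
    """Run handler(item) for every item on a small pool of threads.

    Returns (item, exception) pairs for the items whose handler raised.
    """
    tasks: queue.Queue = queue.Queue()
    failures: list[tuple[str, Exception]] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            item = tasks.get()
            if item is None:  # sentinel: no more work for this thread
                tasks.task_done()
                return
            try:
                handler(item)
            except Exception as exc:  # record the failure and keep the pool alive
                with lock:
                    failures.append((item, exc))
            finally:
                tasks.task_done()

    pool = [threading.Thread(target=worker, daemon=True) for _ in range(threads)]
    for thread in pool:
        thread.start()
    for item in items:
        tasks.put(item)
    for _ in pool:  # one sentinel per worker
        tasks.put(None)
    tasks.join()
    for thread in pool:
        thread.join()
    return failures


# Usage: "download" three assets by recording their names.
seen: list[str] = []
failed = run_worker_pool(["a.css", "b.js", "c.png"], seen.append, threads=2)
print(sorted(seen), failed)
```

In a real downloader the handler would fetch and write one asset; collecting failures instead of raising lets the crawl finish and report problems in a summary.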
## Project Structure
| Path | What it is | Key features |
|------|------------|--------------|
| `website_downloader.py` | **Single-entry CLI** that performs the entire crawl *and* link-rewriting pipeline. | • Persistent `requests.Session` with automatic retries<br>• Breadth-first crawl capped by `--max-pages` (default = 50)<br>• Thread-pool (configurable via `--threads`, default = 6) to fetch images/CSS/JS in parallel<br>• Robust link rewriting so every internal URL works offline (pretty-URL folders → `index.html`, plain paths → `.html`)<br>• Smart output folder naming (`example.com` → `example_com`)<br>• Colourised console + file logging with per-page latency and crawl summary |
| `requirements.txt` | Minimal dependency pin-list. Only **`requests`** and **`beautifulsoup4`** are third-party; everything else is Python ≥ 3.10 std-lib. |
| `web_scraper.log` | Auto-generated run log (rotates/overwrites on each invocation). Useful for troubleshooting or audit trails. |
| `README.md` | The document you're reading: quick-start, flags, and architecture notes. |
| *(output folder)* | Created at runtime (`example_com/ …`); mirrors the remote directory tree with `index.html` stubs and all static assets. |
> **Removed:** The old `check_download.py` verifier is no longer required because the new downloader performs integrity checks (missing files, broken internal links) during the crawl and reports any issues directly in the log summary.
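The mapping rules in the table (pretty-URL folders → `index.html`, plain paths → `.html`, `example.com` → `example_com`) can be sketched as below. This is a simplified illustration under stated assumptions, not the downloader's implementation: `output_folder`, `local_page_path`, the `"site"` fallback, and the rule that paths with a file extension are kept unchanged are all hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def output_folder(url: str) -> str:
    """Derive the output folder from the host name: example.com -> example_com."""
    host = urlsplit(url).hostname or "site"  # "site" is a hypothetical fallback
    return host.replace(".", "_")


def local_page_path(url: str) -> str:
    """Map a page or asset URL to a relative path inside the output folder."""
    path = urlsplit(url).path
    if path == "" or path.endswith("/"):
        relative = path.lstrip("/") + "index.html"  # pretty URL -> folder/index.html
    elif PurePosixPath(path).suffix:
        relative = path.lstrip("/")                 # already has an extension
    else:
        relative = path.lstrip("/") + ".html"       # plain path -> .html
    return f"{output_folder(url)}/{relative}"


for example in (
    "https://example.com",
    "https://example.com/about/",
    "https://example.com/contact",
    "https://example.com/css/site.css",
):
    print(example, "->", local_page_path(example))
```

Query strings and fragments are ignored here; the libraries table notes that the real tool hashes query-string collisions, which this sketch does not attempt.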
## ✨ Recent Improvements
**Type Conversion Fix:** Fixed a `TypeError` caused by `int(..., 10)` when non-string arguments were passed.

**Safer Path Handling:** Added intelligent path shortening and hashing for long filenames to prevent `OSError: [Errno 36] File name too long` errors.

**Improved CLI Experience:** Rebuilt argument parsing with `argparse` for cleaner syntax and validation.

**Code Quality & Linting:** Applied Black + Flake8 formatting; the project now passes all CI lint checks.

**Logging & Stability:** Improved error handling, logging, and fallback mechanisms for failed writes.

**Skip Non-Fetchable Schemes:** The crawler now safely skips `mailto:`, `tel:`, `javascript:`, and `data:` links instead of trying to download them. This prevents `requests.exceptions.InvalidSchema: No connection adapters were found` errors and keeps those links intact in saved HTML.
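
A scheme check of this kind can be sketched with `urllib.parse`. The set name `NON_FETCHABLE_SCHEMES` and the helper `is_fetchable` are hypothetical; only the four schemes themselves come from the note above.

```python
from urllib.parse import urlsplit

# Schemes that name something other than a downloadable HTTP resource.
NON_FETCHABLE_SCHEMES = {"mailto", "tel", "javascript", "data"}


def is_fetchable(href: str) -> bool:
    """Return False for links the crawler should leave untouched."""
    scheme = urlsplit(href.strip()).scheme.lower()
    return scheme not in NON_FETCHABLE_SCHEMES


links = [
    "https://example.com/about/",
    "/css/site.css",
    "mailto:hello@example.com",
    "tel:+1-555-0100",
    "javascript:void(0)",
    "data:image/png;base64,AAAA",
]
to_download = [link for link in links if is_fetchable(link)]
print(to_download)
```

Relative links have an empty scheme and therefore pass the check, so they can still be resolved against the page URL before downloading.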

**Improved Path Normalization:**
- Decodes URL-encoded segments (`%20` → space)
- Trims unnecessary whitespace
- Collapses accidental multi-dot filenames (`file....jpg` → `file.jpg`)
- Preserves traversal protection and hashing safeguards
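
The normalization steps listed above, together with the hashing of over-long filenames, might look roughly like the following. This is a sketch under assumptions: the function names, the 200-byte limit, the 16-character digest, and the `stem-digest.ext` naming format are hypothetical, and the project's real rules may differ. Note that `posixpath.normpath` also drops a trailing slash, which a real crawler would need to track separately.

```python
import hashlib
import posixpath
import re
from urllib.parse import unquote

MAX_NAME_BYTES = 200  # hypothetical limit; many filesystems cap a name at 255 bytes


def shorten(name: str, limit: int = MAX_NAME_BYTES) -> str:
    """Replace an over-long filename with a truncated stem plus a stable hash."""
    if len(name.encode("utf-8")) <= limit:
        return name
    stem, ext = posixpath.splitext(name)
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()[:16]
    return f"{stem[:32]}-{digest}{ext[:8]}"


def normalize_url_path(url_path: str) -> str:
    """Turn a URL path into a safe relative path for the output folder."""
    decoded = unquote(url_path)  # %20 -> space, %2e -> "."
    # Anchoring at "/" makes normpath discard any ".." that would escape the root.
    resolved = posixpath.normpath("/" + decoded).lstrip("/")
    segments = []
    for segment in resolved.split("/"):
        segment = re.sub(r"\.{2,}", ".", segment.strip())  # file....jpg -> file.jpg
        if segment and segment != ".":
            segments.append(shorten(segment))
    return "/".join(segments)


print(normalize_url_path("/img/my%20file....jpg "))
print(normalize_url_path("/../../etc/passwd"))
print(shorten("a" * 300 + ".png"))
```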
------------------------------------------------------------------------
## 🤝 Contributing
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
## License
This project is licensed under the MIT License.