https://github.com/pkharsimran/website-downloader

Website-downloader is a powerful and versatile Python script designed to download entire websites along with all their assets. This tool allows you to create a local copy of a website, including HTML pages, images, CSS, JavaScript files, and other resources. It is ideal for web archiving, offline browsing, and web development.


# 🌐 Website Downloader CLI

[![CI – Website Downloader](https://github.com/PKHarsimran/website-downloader/actions/workflows/python-app.yml/badge.svg)](https://github.com/PKHarsimran/website-downloader/actions/workflows/python-app.yml)
[![Lint & Style](https://github.com/PKHarsimran/website-downloader/actions/workflows/lint.yml/badge.svg)](https://github.com/PKHarsimran/website-downloader/actions/workflows/lint.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python](https://img.shields.io/badge/Python-3.10%2B-blue.svg)](https://www.python.org/)
[![Code style: Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

Website Downloader CLI is a lightweight, pure-Python site mirroring tool that creates a fully browsable offline copy of any publicly accessible website.

* Recursively crawls every same-origin link (including “pretty” `/about/` URLs)
* Downloads **all** assets (images, CSS, JS, …)
* Rewrites internal links so pages open flawlessly from your local disk
* Streams files concurrently with automatic retry / back-off
* Generates a clean directory tree that mirrors the site's URL paths (`example_com/index.html`, `example_com/about/index.html`, …)
* Handles extremely long filenames safely via hashing and graceful fallbacks

> Perfect for web archiving, pentesting labs, long flights, or just poking around a site without an internet connection.
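
The crawl behaviour listed above can be pictured as a breadth-first walk that only follows same-origin links and stops at a page cap. The sketch below is illustrative only: `crawl_order`, `same_origin`, and the `get_links` callback are hypothetical names, not functions from this project, and a real run would fetch and parse HTML instead of reading an in-memory dict.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlsplit


def same_origin(a: str, b: str) -> bool:
    """True when both URLs share scheme and host."""
    pa, pb = urlsplit(a), urlsplit(b)
    return (pa.scheme, pa.netloc) == (pb.scheme, pb.netloc)


def crawl_order(start_url, get_links, max_pages=50):
    """Breadth-first walk over same-origin pages, capped at max_pages.

    get_links(url) is a hypothetical callback that returns the raw href
    values found on that page.
    """
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for href in get_links(page):
            # Resolve relative links and drop #fragments before comparing.
            target, _ = urldefrag(urljoin(page, href))
            if same_origin(start_url, target) and target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# Tiny in-memory "site" standing in for real HTTP responses.
site = {
    "https://example.com/": ["/about/", "https://other.org/x", "/blog"],
    "https://example.com/about/": ["/", "/team#top"],
}
print(crawl_order("https://example.com/", lambda url: site.get(url, [])))
```

The external `other.org` link is never queued, and `/team#top` is visited once as `/team`.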

## ❀️ Support This Project

If you find this tool useful, consider supporting the project:

[Donate via PayPal](https://www.paypal.com/donate/?business=MVEWG3QAX6UBC&no_recurring=1&item_name=Github+Project+-+Website+downloader&currency_code=CAD)

---

## 🚀 Quick Start

```bash
# 1. Grab the code
git clone https://github.com/PKHarsimran/website-downloader.git
cd website-downloader

# 2. Install dependencies (only two runtime libs!)
pip install -r requirements.txt

# 3. Mirror a site – no prompts needed
python website-downloader.py \
--url https://harsim.ca \
--destination harsim_ca_backup \
--max-pages 100 \
--threads 8
```

---

## πŸ› οΈ Libraries Used

| Library | Purpose |
|----------|----------|
| **requests** + **urllib3.Retry** | HTTP client with automatic retry, backoff, and persistent session handling |
| **BeautifulSoup (bs4)** | Parses HTML and extracts `<a>`, `<img>`, `<script>`, and `<link>` elements |
| **argparse** | Provides structured CLI argument parsing and validation |
| **logging** | Dual console + file logging with crawl progress and summary metrics |
| **threading** & **queue** | Concurrent asset downloading via lightweight worker pool |
| **pathlib** & **os** | Cross-platform filesystem management and safe directory creation |
| **urllib.parse** | URL parsing, normalization, and safe internal link rewriting |
| **hashlib (sha256)** | Generates stable hashes for long filenames and query-string collisions |
| **posixpath** | Normalizes URL paths while preventing traversal |
| **time** | Measures crawl duration and per-page performance |
| **sys** | Handles CLI exit codes and stream output management |
| **re** | Normalizes path segments and collapses malformed multi-dot filenames |
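
To show how the `threading` and `queue` rows fit together, here is a minimal worker-pool sketch using only the standard library. `run_pool` and `worker_fn` are hypothetical names chosen for illustration; the project's actual pool may be organised differently, and the lambda below stands in for a real download function.

```python
import queue
import threading


def run_pool(jobs, worker_fn, threads=6):
    """Run worker_fn(job) for every job on a fixed number of threads.

    Returns a dict mapping each job to its result, or to the exception
    it raised, so one failed asset does not stop the others.
    """
    todo = queue.Queue()
    for job in jobs:
        todo.put(job)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job = todo.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            try:
                outcome = worker_fn(job)
            except Exception as exc:
                outcome = exc
            with lock:
                results[job] = outcome

    pool = [threading.Thread(target=worker, daemon=True) for _ in range(threads)]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()
    return results


# Stand-in for a download function: just upper-case the asset name.
print(run_pool(["a.css", "b.js", "c.png"], lambda name: name.upper(), threads=2))
```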

## πŸ—‚οΈ Project Structure

| Path | What it is | Key features |
|------|------------|--------------|
| `website_downloader.py` | **Single-entry CLI** that performs the entire crawl *and* link-rewriting pipeline. | • Persistent `requests.Session` with automatic retries<br>• Breadth-first crawl capped by `--max-pages` (default = 50)<br>• Thread-pool (configurable via `--threads`, default = 6) to fetch images/CSS/JS in parallel<br>• Robust link rewriting so every internal URL works offline (pretty-URL folders ➜ `index.html`, plain paths ➜ `.html`)<br>• Smart output folder naming (`example.com` → `example_com`)<br>• Colourised console + file logging with per-page latency and crawl summary |
| `requirements.txt` | Minimal dependency pin-list. | Only **`requests`** and **`beautifulsoup4`** are third-party; everything else is Python ≥ 3.10 std-lib. |
| `web_scraper.log` | Auto-generated run log (rotates/overwrites on each invocation). | Useful for troubleshooting or audit trails. |
| `README.md` | The document you’re reading. | Quick-start, flags, and architecture notes. |
| *(output folder)* | Created at runtime (`example_com/ …`). | Mirrors the remote directory tree with `index.html` stubs and all static assets. |
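
The folder-naming and link-rewriting rules in the table can be sketched as a small URL-to-path mapping. This is an assumption-laden illustration, not the project's code: `output_folder` and `local_page_path` are hypothetical helpers, and details such as ports, query strings, and long names are ignored here.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def output_folder(url: str) -> str:
    """Derive the output folder from the host: example.com -> example_com."""
    return urlsplit(url).netloc.replace(".", "_")


def local_page_path(url: str) -> str:
    """Map a page or asset URL to a relative path inside the output folder."""
    path = urlsplit(url).path
    if path == "" or path.endswith("/"):
        # Pretty-URL folder: /about/ -> about/index.html
        rel = path.lstrip("/") + "index.html"
    elif PurePosixPath(path).suffix:
        # Already has an extension (images, CSS, JS): keep it as-is.
        rel = path.lstrip("/")
    else:
        # Plain path without an extension: /blog/post -> blog/post.html
        rel = path.lstrip("/") + ".html"
    return f"{output_folder(url)}/{rel}"


for example in (
    "https://example.com/",
    "https://example.com/about/",
    "https://example.com/blog/post",
    "https://example.com/img/logo.png",
):
    print(example, "->", local_page_path(example))
```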

> **Removed:** The old `check_download.py` verifier is no longer required because the new downloader performs integrity checks (missing files, broken internal links) during the crawl and reports any issues directly in the log summary.

## ✨ Recent Improvements

✅ Type Conversion Fix
Fixed a `TypeError` caused by `int(..., 10)` when non-string arguments were passed.
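
For context, Python only accepts an explicit base when the first argument is a string (or bytes), so `int(8, 10)` raises `TypeError`. A minimal guard might look like the sketch below; `to_int` is a hypothetical helper name, not necessarily what the script uses.

```python
def to_int(value):
    """Convert CLI-style input to int without tripping over int(x, 10)."""
    if isinstance(value, str):
        # An explicit base is only valid for string input.
        return int(value.strip(), 10)
    return int(value)


print(to_int("42"), to_int(8))

try:
    int(8, 10)
except TypeError as exc:
    print("TypeError:", exc)
```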

✅ Safer Path Handling
Added intelligent path shortening and hashing for long filenames to prevent `OSError: [Errno 36] File name too long` errors.
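
One way to shorten an over-long filename while keeping it unique is to truncate the stem and append a short SHA-256 digest of the full name. The sketch below assumes a 255-byte limit per path component and a 16-character digest; both numbers and the `shorten_name` helper are assumptions for illustration, not values taken from the project.

```python
import hashlib
from pathlib import PurePosixPath

MAX_NAME_BYTES = 255  # assumed per-component limit, common on Linux filesystems


def shorten_name(name: str, limit: int = MAX_NAME_BYTES) -> str:
    """Return name unchanged if it fits, else a truncated stem plus a hash."""
    raw = name.encode("utf-8")
    if len(raw) <= limit:
        return name
    suffix = PurePosixPath(name).suffix[:16]  # keep a reasonable extension
    digest = hashlib.sha256(raw).hexdigest()[:16]
    # Leave room for "-", the digest, and the extension.
    keep = limit - len(digest) - len(suffix.encode("utf-8")) - 1
    # errors="ignore" drops a multi-byte character cut in half by the slice.
    stem = raw[:keep].decode("utf-8", errors="ignore")
    return f"{stem}-{digest}{suffix}"


long_name = "a" * 300 + ".png"
short = shorten_name(long_name)
print(len(short), short[-24:])
```

Because the digest is computed from the full original name, two long names that share the same truncated stem still map to different files.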

✅ Improved CLI Experience
Rebuilt argument parsing with argparse for cleaner syntax and validation.
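
As a rough picture of what `argparse`-based parsing with validation can look like, the sketch below wires up the four flags shown in Quick Start, with the documented defaults for `--max-pages` (50) and `--threads` (6). The `positive_int` validator, the help strings, and whether any flag is required are assumptions, not details confirmed by the project.

```python
import argparse


def positive_int(text: str) -> int:
    """argparse type: accept only integers >= 1."""
    value = int(text, 10)
    if value < 1:
        raise argparse.ArgumentTypeError(f"expected a positive integer, got {text!r}")
    return value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Mirror a website for offline browsing.")
    parser.add_argument("--url", help="start URL to mirror")
    parser.add_argument("--destination", help="output folder")
    parser.add_argument("--max-pages", type=positive_int, default=50)
    parser.add_argument("--threads", type=positive_int, default=6)
    return parser


args = build_parser().parse_args(
    ["--url", "https://harsim.ca", "--destination", "harsim_ca_backup",
     "--max-pages", "100", "--threads", "8"]
)
print(args.url, args.destination, args.max_pages, args.threads)
```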

✅ Code Quality & Linting
Applied Black + Flake8 formatting; the project now passes all CI lint checks.

✅ Logging & Stability
Improved error handling, logging, and fallback mechanisms for failed writes.

✅ Skip Non-Fetchable Schemes
The crawler now safely skips `mailto:`, `tel:`, `javascript:`, and `data:` links instead of trying to download them.
This prevents `requests.exceptions.InvalidSchema: No connection adapters were found` errors and keeps those links intact in saved HTML.
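
A scheme filter of this kind can be as small as a prefix check run before any link is queued. `is_fetchable` and `NON_FETCHABLE_PREFIXES` below are hypothetical names used for illustration; the project's own check may be written differently.

```python
NON_FETCHABLE_PREFIXES = ("mailto:", "tel:", "javascript:", "data:")


def is_fetchable(href: str) -> bool:
    """False for links that no HTTP client can download."""
    return not href.strip().lower().startswith(NON_FETCHABLE_PREFIXES)


links = [
    "mailto:hi@example.com",
    "tel:+15551234",
    "javascript:void(0)",
    "data:image/png;base64,AAAA",
    "/about/",
    "https://example.com/style.css",
]
print([link for link in links if is_fetchable(link)])
```

Links that fail the check are simply left untouched in the saved HTML.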

✅ Improved Path Normalization
- Decodes URL-encoded segments (`%20` → space)
- Trims unnecessary whitespace
- Collapses accidental multi-dot filenames (`file....jpg` → `file.jpg`)
- Preserves traversal protection and hashing safeguards
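
The normalization steps listed above could be combined roughly as follows. This is a sketch under assumptions: `normalize_url_path` is a hypothetical helper, and the order of operations (decode, anchor at the root, then clean each segment) is one reasonable choice, not necessarily the project's.

```python
import posixpath
import re
from urllib.parse import unquote


def normalize_url_path(raw: str) -> str:
    """Turn a URL path into a safe relative path for the output folder."""
    decoded = unquote(raw).strip()  # %20 -> space, trim outer whitespace
    # Anchor at "/" so ".." segments cannot climb above the site root.
    anchored = posixpath.normpath("/" + decoded.lstrip("/"))
    segments = []
    for segment in anchored.split("/"):
        segment = segment.strip()
        # Collapse runs of dots: file....jpg -> file.jpg
        segment = re.sub(r"\.{2,}", ".", segment)
        if segment in ("", "."):
            continue
        segments.append(segment)
    return "/".join(segments)


for example in (
    "/images/my%20photo....jpg",
    "../../etc/passwd",
    " /docs/ guide.pdf ",
):
    print(repr(example), "->", repr(normalize_url_path(example)))
```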

------------------------------------------------------------------------

## 🤝 Contributing

Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.

## 📜 License

This project is licensed under the MIT License.