https://github.com/geiserx/wayback-archive
Download complete websites from the Wayback Machine with full asset preservation for offline viewing
https://github.com/geiserx/wayback-archive
archive cdn content-preservation css digital-preservation google-fonts html internet-archive minification offline-browsing python recursive-download static-site-generator url-rewriting wayback-machine web-archiving web-crawler web-scraping website-backup website-downloader
Last synced: 2 months ago
JSON representation
Download complete websites from the Wayback Machine with full asset preservation for offline viewing
- Host: GitHub
- URL: https://github.com/geiserx/wayback-archive
- Owner: GeiserX
- License: gpl-3.0
- Created: 2025-12-11T20:01:44.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2026-04-02T08:23:36.000Z (2 months ago)
- Last Synced: 2026-04-02T23:18:34.770Z (2 months ago)
- Topics: archive, cdn, content-preservation, css, digital-preservation, google-fonts, html, internet-archive, minification, offline-browsing, python, recursive-download, static-site-generator, url-rewriting, wayback-machine, web-archiving, web-crawler, web-scraping, website-backup, website-downloader
- Language: Python
- Size: 73.2 KB
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Security: SECURITY.md
Awesome Lists containing this project
README
Download complete websites from the Wayback Machine for offline viewing.
---
Wayback-Archive is a Python tool that downloads archived websites from the [Wayback Machine](https://web.archive.org/) and reconstructs them for fully functional offline viewing. It preserves all assets -- HTML, CSS, JavaScript, images, and fonts -- rewrites URLs to relative paths, and cleans up Wayback Machine artifacts so the result looks like the original site.
## Quick Start
```bash
# Install
git clone https://github.com/GeiserX/Wayback-Archive.git
cd Wayback-Archive
pip install -r config/requirements.txt
# Run
export WAYBACK_URL="https://web.archive.org/web/20250417203037/http://example.com/"
python3 -m wayback_archive.cli
# Preview
cd output && python3 -m http.server 8000
# Open http://localhost:8000
```
## Features
### Core
- **Full website download** -- HTML, CSS, JS, images, fonts, and all linked assets
- **Recursive link discovery** -- Automatically follows links in HTML, CSS, and JS files
- **Smart URL rewriting** -- Converts all links to relative paths for local serving
- **Timeframe fallback** -- Searches nearby Wayback Machine timestamps when a resource returns 404
- **Real-time progress logging** -- Displays download status and file processing as it happens
### Asset Handling
- **Google Fonts support** -- Downloads Google Fonts CSS and font files locally, fixing CORS issues
- **Font corruption detection** -- Identifies and removes corrupted font files (HTML error pages served as fonts)
- **CDN fallback** -- Automatic fallback to CDN for critical libraries (e.g., jQuery) when Wayback Machine fails
- **Data attribute processing** -- Processes `data-*` attributes containing URLs (videos, images, etc.)
### Preservation
- **Icon group preservation** -- Preserves all links in icon groups (social media, contact icons)
- **Button link preservation** -- Maintains styling and functionality of button links
- **Cookie consent preservation** -- Keeps cookie consent popups and functionality intact
### Optimization
- **HTML minification** -- Uses `minify-html` (Python 3.14+ compatible)
- **JS/CSS minification** -- Optional JavaScript and CSS minification via `rjsmin` and `cssmin`
- **Image compression** -- Optional image optimization with Pillow
- **Tracker/ad removal** -- Strips analytics, ads, and external iframes
- **Link cleanup** -- Configurable external link removal with anchor preservation options
- **www/non-www normalization** -- Normalize domain variations automatically
## Why Wayback-Archive?
| Capability | Wayback-Archive | wget | httrack |
|---|:---:|:---:|:---:|
| Wayback Machine URL rewriting | Yes | No | No |
| Wayback artifact cleanup | Yes | No | No |
| Timeframe fallback for 404s | Yes | No | No |
| Google Fonts localization | Yes | No | No |
| Font corruption detection | Yes | No | No |
| CDN fallback | Yes | No | No |
| HTML/CSS/JS minification | Yes | No | No |
| Tracker and ad removal | Yes | No | No |
| `data-*` attribute processing | Yes | No | No |
General-purpose tools like `wget --mirror` or `httrack` can download live websites, but they do not understand Wayback Machine URL structures, cannot clean up archive artifacts, and lack the specialized asset recovery that Wayback-Archive provides.
## Installation
### Prerequisites
- Python 3.8 or higher
- pip
### From Source
```bash
git clone https://github.com/GeiserX/Wayback-Archive.git
cd Wayback-Archive
# Optional: create a virtual environment
python3 -m venv venv
source venv/bin/activate # macOS/Linux
# venv\Scripts\activate # Windows
pip install -r config/requirements.txt
```
### As a Package
```bash
cd Wayback-Archive
pip install -e .
wayback-archive # Available as a CLI command after installation
```
## Configuration
All options are set via environment variables. You can also use a `.env` file.
### Required
| Variable | Description |
|---|---|
| `WAYBACK_URL` | The Wayback Machine URL to download |
### Output
| Variable | Default | Description |
|---|---|---|
| `OUTPUT_DIR` | `./output` | Output directory for downloaded files |
### Optimization
| Variable | Default | Description |
|---|---|---|
| `OPTIMIZE_HTML` | `true` | Minify HTML |
| `OPTIMIZE_IMAGES` | `false` | Compress images |
| `MINIFY_JS` | `false` | Minify JavaScript |
| `MINIFY_CSS` | `false` | Minify CSS |
### Content Removal
| Variable | Default | Description |
|---|---|---|
| `REMOVE_TRACKERS` | `true` | Remove analytics and trackers |
| `REMOVE_ADS` | `true` | Remove advertisements |
| `REMOVE_CLICKABLE_CONTACTS` | `true` | Remove `tel:` and `mailto:` links |
| `REMOVE_EXTERNAL_IFRAMES` | `false` | Remove external iframes |
### Link Handling
| Variable | Default | Description |
|---|---|---|
| `REMOVE_EXTERNAL_LINKS_KEEP_ANCHORS` | `true` | Remove external links, keep anchor text |
| `REMOVE_EXTERNAL_LINKS_REMOVE_ANCHORS` | `false` | Remove external links and anchor elements |
| `MAKE_INTERNAL_LINKS_RELATIVE` | `true` | Convert internal links to relative paths |
### Domain
| Variable | Default | Description |
|---|---|---|
| `MAKE_NON_WWW` | `true` | Convert www to non-www |
| `MAKE_WWW` | `false` | Convert non-www to www |
| `KEEP_REDIRECTIONS` | `false` | Keep redirect pages |
### Testing
| Variable | Default | Description |
|---|---|---|
| `MAX_FILES` | unlimited | Limit number of files to download |
## Usage
### macOS / Linux
```bash
export WAYBACK_URL="https://web.archive.org/web/20250417203037/http://example.com/"
export OUTPUT_DIR="./my_website"
export REMOVE_CLICKABLE_CONTACTS="false" # Keep email/phone links
python3 -m wayback_archive.cli
```
### Windows (PowerShell)
```powershell
$env:WAYBACK_URL = "https://web.archive.org/web/20250417203037/http://example.com/"
$env:OUTPUT_DIR = ".\my_website"
$env:REMOVE_CLICKABLE_CONTACTS = "false"
python -m wayback_archive.cli
```
### Windows (CMD)
```cmd
set WAYBACK_URL=https://web.archive.org/web/20250417203037/http://example.com/
set OUTPUT_DIR=.\my_website
set REMOVE_CLICKABLE_CONTACTS=false
python -m wayback_archive.cli
```
### Quick Test
Download a limited number of files to verify everything works:
```bash
export WAYBACK_URL="https://web.archive.org/web/20250417203037/http://example.com/"
export MAX_FILES=5
python3 -m wayback_archive.cli
```
## How It Works
1. **Initial download** -- Fetches the main page from the Wayback Machine
2. **Link extraction** -- Parses HTML to find all referenced assets (links, images, CSS, JS)
3. **CSS processing** -- Extracts font URLs, background images, and `@import` statements; downloads Google Fonts locally; detects corrupted font files
4. **JS processing** -- Extracts dynamically loaded resources from JavaScript
5. **Data attributes** -- Scans `data-*` attributes for additional asset URLs
6. **Iterative crawling** -- Continues discovering and downloading resources until the queue is empty
7. **Timeframe fallback** -- For 404 responses, searches nearby Wayback Machine timestamps
8. **URL rewriting** -- Converts all URLs to relative paths for offline serving
9. **Preservation** -- Maintains icon groups, button links, and cookie consent functionality
## Project Structure
```
Wayback-Archive/
wayback_archive/ # Main package
__init__.py
__main__.py
cli.py # CLI entry point
config.py # Environment variable configuration
downloader.py # Core download and processing engine
config/
requirements.txt # Runtime dependencies
requirements-dev.txt # Development dependencies
setup.py # Package setup
pytest.ini # Test configuration
tests/ # Test suite
docs/ # Documentation
LICENSE # GPL-3.0
README.md
```
## Testing
```bash
pip install -r config/requirements-dev.txt
# Run tests
pytest
# Run tests with coverage
pytest --cov=wayback_archive
```
## Troubleshooting
### Port Already in Use
```bash
python3 -m http.server 8080 # Use a different port
```
### Font Loading Issues
- **Google Fonts**: Downloaded automatically to avoid CORS issues
- **Corrupted fonts**: Detected and removed from CSS automatically
- **Missing fonts**: Some fonts may not exist in the Wayback Machine archive
See [Font Loading Research Notes](docs/FONT_LOADING.md) for details.
### Missing Links or Icons
- Icon groups (social media, contacts) are preserved automatically
- Button links with `sppb-btn` or `btn` classes are preserved
- Set `REMOVE_CLICKABLE_CONTACTS=false` to keep `tel:` and `mailto:` links
### jQuery or Libraries Not Loading
The tool includes automatic CDN fallback for critical libraries. If a file fails to download from the Wayback Machine, it will attempt to fetch it from a CDN.
## Dependencies
| Package | Purpose |
|---|---|
| [requests](https://pypi.org/project/requests/) | HTTP client |
| [beautifulsoup4](https://pypi.org/project/beautifulsoup4/) | HTML parsing |
| [lxml](https://pypi.org/project/lxml/) | Fast HTML/XML parser |
| [minify-html](https://pypi.org/project/minify-html/) | HTML minification |
| [cssmin](https://pypi.org/project/cssmin/) | CSS minification |
| [rjsmin](https://pypi.org/project/rjsmin/) | JS minification |
| [Pillow](https://pypi.org/project/Pillow/) | Image optimization |
| [python-dotenv](https://pypi.org/project/python-dotenv/) | `.env` file support |
## Contributing
Contributions are welcome. Please feel free to submit a Pull Request.
## Related Web Archiving Tools
- [Way-CMS](https://github.com/GeiserX/Way-CMS) — Simple web CMS for editing archived HTML/CSS files
- [Wayback-Diff](https://github.com/GeiserX/Wayback-Diff) — Web page comparison with Wayback Machine support
- [web-mirror](https://github.com/GeiserX/web-mirror) — Mirror any webpage for offline access
- [media-download](https://github.com/GeiserX/media-download) — Download all media files from any web page
## License
This project is licensed under the [GNU General Public License v3.0](LICENSE) (GPL-3.0).