{"id":44028427,"url":"https://github.com/geiserx/wayback-archive","last_synced_at":"2026-04-06T22:01:19.183Z","repository":{"id":328231189,"uuid":"1114732335","full_name":"GeiserX/Wayback-Archive","owner":"GeiserX","description":"Download complete websites from the Wayback Machine with full asset preservation for offline viewing","archived":false,"fork":false,"pushed_at":"2026-04-02T08:23:36.000Z","size":75,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-04-02T23:18:34.770Z","etag":null,"topics":["archive","cdn","content-preservation","css","digital-preservation","google-fonts","html","internet-archive","minification","offline-browsing","python","recursive-download","static-site-generator","url-rewriting","wayback-machine","web-archiving","web-crawler","web-scraping","website-backup","website-downloader"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GeiserX.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":"geiserx","patreon":"geiser","buy_me_a_coffee":"geiser","thanks_dev":"u/gh/geiserx"}},"created_at":"2025-12-11T20:01:44.000Z","updated_at":"2026-04-02T08:23:44.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/GeiserX/Wayback-Archive","commit_stats":null,"previous_names":["geiserx/wayback-archive"],"tags_count":4,"te
mplate":false,"template_full_name":null,"purl":"pkg:github/GeiserX/Wayback-Archive","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GeiserX%2FWayback-Archive","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GeiserX%2FWayback-Archive/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GeiserX%2FWayback-Archive/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GeiserX%2FWayback-Archive/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GeiserX","download_url":"https://codeload.github.com/GeiserX/Wayback-Archive/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GeiserX%2FWayback-Archive/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31491097,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-06T17:22:55.647Z","status":"ssl_error","status_checked_at":"2026-04-06T17:22:54.741Z","response_time":112,"last_error":"SSL_read: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["archive","cdn","content-preservation","css","digital-preservation","google-fonts","html","internet-archive","minification","offline-browsing","python","recursive-download","static-site-generator","url-rewriting","wayback-machine","web-archiving","web-crawler","web-scraping","website-backup","website-downloader"],"created_at":"2026-02-07T18:16:24.698Z","updated_at":"2026-04-06T22:01:19.176Z","avatar_url":"https://github.com/GeiserX.png","language":"Python","funding_links":["https://github.com/sponsors/geiserx","https://patreon.com/geiser","https://buymeacoffee.com/geiser","https://thanks.dev/u/gh/geiserx"],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/GeiserX/Wayback-Archive/main/docs/images/banner.svg\" alt=\"Wayback-Archive banner\" width=\"900\"/\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cstrong\u003eDownload complete websites from the Wayback Machine for offline viewing.\u003c/strong\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://pypi.org/project/wayback-archive/\"\u003e\u003cimg src=\"https://img.shields.io/pypi/v/wayback-archive?style=flat-square\" alt=\"PyPI\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/GeiserX/Wayback-Archive/actions/workflows/ci.yml\"\u003e\u003cimg src=\"https://img.shields.io/github/actions/workflow/status/GeiserX/Wayback-Archive/ci.yml?style=flat-square\u0026logo=github\u0026label=build\" 
alt=\"Build\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/GeiserX/Wayback-Archive/releases\"\u003e\u003cimg src=\"https://img.shields.io/github/v/release/GeiserX/Wayback-Archive?style=flat-square\" alt=\"Release\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/GeiserX/Wayback-Archive/blob/main/LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/github/license/GeiserX/Wayback-Archive?style=flat-square\" alt=\"License\"\u003e\u003c/a\u003e\n  \u003cimg src=\"https://img.shields.io/badge/python-3.8%2B-blue?style=flat-square\u0026logo=python\u0026logoColor=white\" alt=\"Python 3.8+\"\u003e\n  \u003ca href=\"https://github.com/GeiserX/Wayback-Archive/stargazers\"\u003e\u003cimg src=\"https://img.shields.io/github/stars/GeiserX/Wayback-Archive?style=flat-square\" alt=\"GitHub Stars\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n---\n\nWayback-Archive is a Python tool that downloads archived websites from the [Wayback Machine](https://web.archive.org/) and reconstructs them for fully functional offline viewing. 
It preserves all assets -- HTML, CSS, JavaScript, images, and fonts -- rewrites URLs to relative paths, and cleans up Wayback Machine artifacts so the result looks like the original site.\n\n## Quick Start\n\n```bash\n# Install\ngit clone https://github.com/GeiserX/Wayback-Archive.git\ncd Wayback-Archive\npip install -r config/requirements.txt\n\n# Run\nexport WAYBACK_URL=\"https://web.archive.org/web/20250417203037/http://example.com/\"\npython3 -m wayback_archive.cli\n\n# Preview\ncd output \u0026\u0026 python3 -m http.server 8000\n# Open http://localhost:8000\n```\n\n## Features\n\n### Core\n\n- **Full website download** -- HTML, CSS, JS, images, fonts, and all linked assets\n- **Recursive link discovery** -- Automatically follows links in HTML, CSS, and JS files\n- **Smart URL rewriting** -- Converts all links to relative paths for local serving\n- **Timeframe fallback** -- Searches nearby Wayback Machine timestamps when a resource returns 404\n- **Real-time progress logging** -- Displays download status and file processing as it happens\n\n### Asset Handling\n\n- **Google Fonts support** -- Downloads Google Fonts CSS and font files locally, fixing CORS issues\n- **Font corruption detection** -- Identifies and removes corrupted font files (HTML error pages served as fonts)\n- **CDN fallback** -- Automatic fallback to CDN for critical libraries (e.g., jQuery) when Wayback Machine fails\n- **Data attribute processing** -- Processes `data-*` attributes containing URLs (videos, images, etc.)\n\n### Preservation\n\n- **Icon group preservation** -- Preserves all links in icon groups (social media, contact icons)\n- **Button link preservation** -- Maintains styling and functionality of button links\n- **Cookie consent preservation** -- Keeps cookie consent popups and functionality intact\n\n### Optimization\n\n- **HTML minification** -- Uses `minify-html` (Python 3.14+ compatible)\n- **JS/CSS minification** -- Optional JavaScript and CSS minification via `rjsmin` and 
`cssmin`\n- **Image compression** -- Optional image optimization with Pillow\n- **Tracker/ad removal** -- Strips analytics and ads, and optionally external iframes\n- **Link cleanup** -- Configurable external link removal with anchor preservation options\n- **www/non-www normalization** -- Normalizes domain variations automatically\n\n## Why Wayback-Archive?\n\n| Capability | Wayback-Archive | wget | httrack |\n|---|:---:|:---:|:---:|\n| Wayback Machine URL rewriting | Yes | No | No |\n| Wayback artifact cleanup | Yes | No | No |\n| Timeframe fallback for 404s | Yes | No | No |\n| Google Fonts localization | Yes | No | No |\n| Font corruption detection | Yes | No | No |\n| CDN fallback | Yes | No | No |\n| HTML/CSS/JS minification | Yes | No | No |\n| Tracker and ad removal | Yes | No | No |\n| `data-*` attribute processing | Yes | No | No |\n\nGeneral-purpose tools like `wget --mirror` or `httrack` can download live websites, but they do not understand Wayback Machine URL structures, cannot clean up archive artifacts, and lack the specialized asset recovery that Wayback-Archive provides.\n\n## Installation\n\n### Prerequisites\n\n- Python 3.8 or higher\n- pip\n\n### From Source\n\n```bash\ngit clone https://github.com/GeiserX/Wayback-Archive.git\ncd Wayback-Archive\n\n# Optional: create a virtual environment\npython3 -m venv venv\nsource venv/bin/activate  # macOS/Linux\n# venv\\Scripts\\activate   # Windows\n\npip install -r config/requirements.txt\n```\n\n### As a Package\n\n```bash\ncd Wayback-Archive\npip install -e .\nwayback-archive  # Available as a CLI command after installation\n```\n\n## Configuration\n\nAll options are set via environment variables. 
You can also use a `.env` file.\n\n### Required\n\n| Variable | Description |\n|---|---|\n| `WAYBACK_URL` | The Wayback Machine URL to download |\n\n### Output\n\n| Variable | Default | Description |\n|---|---|---|\n| `OUTPUT_DIR` | `./output` | Output directory for downloaded files |\n\n### Optimization\n\n| Variable | Default | Description |\n|---|---|---|\n| `OPTIMIZE_HTML` | `true` | Minify HTML |\n| `OPTIMIZE_IMAGES` | `false` | Compress images |\n| `MINIFY_JS` | `false` | Minify JavaScript |\n| `MINIFY_CSS` | `false` | Minify CSS |\n\n### Content Removal\n\n| Variable | Default | Description |\n|---|---|---|\n| `REMOVE_TRACKERS` | `true` | Remove analytics and trackers |\n| `REMOVE_ADS` | `true` | Remove advertisements |\n| `REMOVE_CLICKABLE_CONTACTS` | `true` | Remove `tel:` and `mailto:` links |\n| `REMOVE_EXTERNAL_IFRAMES` | `false` | Remove external iframes |\n\n### Link Handling\n\n| Variable | Default | Description |\n|---|---|---|\n| `REMOVE_EXTERNAL_LINKS_KEEP_ANCHORS` | `true` | Remove external links, keep anchor text |\n| `REMOVE_EXTERNAL_LINKS_REMOVE_ANCHORS` | `false` | Remove external links and anchor elements |\n| `MAKE_INTERNAL_LINKS_RELATIVE` | `true` | Convert internal links to relative paths |\n\n### Domain\n\n| Variable | Default | Description |\n|---|---|---|\n| `MAKE_NON_WWW` | `true` | Convert www to non-www |\n| `MAKE_WWW` | `false` | Convert non-www to www |\n| `KEEP_REDIRECTIONS` | `false` | Keep redirect pages |\n\n### Testing\n\n| Variable | Default | Description |\n|---|---|---|\n| `MAX_FILES` | unlimited | Limit number of files to download |\n\n## Usage\n\n### macOS / Linux\n\n```bash\nexport WAYBACK_URL=\"https://web.archive.org/web/20250417203037/http://example.com/\"\nexport OUTPUT_DIR=\"./my_website\"\nexport REMOVE_CLICKABLE_CONTACTS=\"false\"  # Keep email/phone links\n\npython3 -m wayback_archive.cli\n```\n\n### Windows (PowerShell)\n\n```powershell\n$env:WAYBACK_URL = 
\"https://web.archive.org/web/20250417203037/http://example.com/\"\n$env:OUTPUT_DIR = \".\\my_website\"\n$env:REMOVE_CLICKABLE_CONTACTS = \"false\"\n\npython -m wayback_archive.cli\n```\n\n### Windows (CMD)\n\n```cmd\nset WAYBACK_URL=https://web.archive.org/web/20250417203037/http://example.com/\nset OUTPUT_DIR=.\\my_website\nset REMOVE_CLICKABLE_CONTACTS=false\n\npython -m wayback_archive.cli\n```\n\n### Quick Test\n\nDownload a limited number of files to verify everything works:\n\n```bash\nexport WAYBACK_URL=\"https://web.archive.org/web/20250417203037/http://example.com/\"\nexport MAX_FILES=5\npython3 -m wayback_archive.cli\n```\n\n## How It Works\n\n1. **Initial download** -- Fetches the main page from the Wayback Machine\n2. **Link extraction** -- Parses HTML to find all referenced assets (links, images, CSS, JS)\n3. **CSS processing** -- Extracts font URLs, background images, and `@import` statements; downloads Google Fonts locally; detects corrupted font files\n4. **JS processing** -- Extracts dynamically loaded resources from JavaScript\n5. **Data attributes** -- Scans `data-*` attributes for additional asset URLs\n6. **Iterative crawling** -- Continues discovering and downloading resources until the queue is empty\n7. **Timeframe fallback** -- For 404 responses, searches nearby Wayback Machine timestamps\n8. **URL rewriting** -- Converts all URLs to relative paths for offline serving\n9. 
**Preservation** -- Maintains icon groups, button links, and cookie consent functionality\n\n## Project Structure\n\n```\nWayback-Archive/\n  wayback_archive/          # Main package\n    __init__.py\n    __main__.py\n    cli.py                  # CLI entry point\n    config.py               # Environment variable configuration\n    downloader.py           # Core download and processing engine\n  config/\n    requirements.txt        # Runtime dependencies\n    requirements-dev.txt    # Development dependencies\n    setup.py                # Package setup\n    pytest.ini              # Test configuration\n  tests/                    # Test suite\n  docs/                     # Documentation\n  LICENSE                   # GPL-3.0\n  README.md\n```\n\n## Testing\n\n```bash\npip install -r config/requirements-dev.txt\n\n# Run tests\npytest\n\n# Run tests with coverage\npytest --cov=wayback_archive\n```\n\n## Troubleshooting\n\n### Port Already in Use\n\n```bash\npython3 -m http.server 8080  # Use a different port\n```\n\n### Font Loading Issues\n\n- **Google Fonts**: Downloaded automatically to avoid CORS issues\n- **Corrupted fonts**: Detected and removed from CSS automatically\n- **Missing fonts**: Some fonts may not exist in the Wayback Machine archive\n\nSee [Font Loading Research Notes](docs/FONT_LOADING.md) for details.\n\n### Missing Links or Icons\n\n- Icon groups (social media, contacts) are preserved automatically\n- Button links with `sppb-btn` or `btn` classes are preserved\n- Set `REMOVE_CLICKABLE_CONTACTS=false` to keep `tel:` and `mailto:` links\n\n### jQuery or Libraries Not Loading\n\nThe tool includes automatic CDN fallback for critical libraries. 
If a file fails to download from the Wayback Machine, it will attempt to fetch it from a CDN.\n\n## Dependencies\n\n| Package | Purpose |\n|---|---|\n| [requests](https://pypi.org/project/requests/) | HTTP client |\n| [beautifulsoup4](https://pypi.org/project/beautifulsoup4/) | HTML parsing |\n| [lxml](https://pypi.org/project/lxml/) | Fast HTML/XML parser |\n| [minify-html](https://pypi.org/project/minify-html/) | HTML minification |\n| [cssmin](https://pypi.org/project/cssmin/) | CSS minification |\n| [rjsmin](https://pypi.org/project/rjsmin/) | JS minification |\n| [Pillow](https://pypi.org/project/Pillow/) | Image optimization |\n| [python-dotenv](https://pypi.org/project/python-dotenv/) | `.env` file support |\n\n## Contributing\n\nContributions are welcome. Please feel free to submit a Pull Request.\n\n## Related Web Archiving Tools\n\n- [Way-CMS](https://github.com/GeiserX/Way-CMS) — Simple web CMS for editing archived HTML/CSS files\n- [Wayback-Diff](https://github.com/GeiserX/Wayback-Diff) — Web page comparison with Wayback Machine support\n- [web-mirror](https://github.com/GeiserX/web-mirror) — Mirror any webpage for offline access\n- [media-download](https://github.com/GeiserX/media-download) — Download all media files from any web page\n\n## License\n\nThis project is licensed under the [GNU General Public License v3.0](LICENSE) (GPL-3.0).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgeiserx%2Fwayback-archive","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgeiserx%2Fwayback-archive","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgeiserx%2Fwayback-archive/lists"}