{"id":49807754,"url":"https://github.com/stardothosting/shift8-waybackpress","last_synced_at":"2026-05-12T23:32:16.372Z","repository":{"id":323960110,"uuid":"1095362912","full_name":"stardothosting/shift8-waybackpress","owner":"stardothosting","description":"Wayback Press : Recover WordPress page and post data from Wayback Machine archives","archived":false,"fork":false,"pushed_at":"2025-12-21T19:16:56.000Z","size":135,"stargazers_count":22,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-12-22T08:53:39.962Z","etag":null,"topics":["backup-recovery","data-recovery","data-recovery-tool","wayback-machine","wayback-press","waybackmachine","waybackpress","wordpress","wordpress-development"],"latest_commit_sha":null,"homepage":"https://shift8web.ca","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/stardothosting.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-13T00:26:45.000Z","updated_at":"2025-12-21T19:16:59.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/stardothosting/shift8-waybackpress","commit_stats":null,"previous_names":["stardothosting/shift8-waybackpress"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/stardothosting/shift8-waybackpress","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stardothosting%2Fshift8-waybackpress","tags_url
":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stardothosting%2Fshift8-waybackpress/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stardothosting%2Fshift8-waybackpress/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stardothosting%2Fshift8-waybackpress/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/stardothosting","download_url":"https://codeload.github.com/stardothosting/shift8-waybackpress/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stardothosting%2Fshift8-waybackpress/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32961757,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-12T23:30:32.555Z","status":"ssl_error","status_checked_at":"2026-05-12T23:30:18.191Z","response_time":102,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["backup-recovery","data-recovery","data-recovery-tool","wayback-machine","wayback-press","waybackmachine","waybackpress","wordpress","wordpress-development"],"created_at":"2026-05-12T23:32:15.731Z","updated_at":"2026-05-12T23:32:16.365Z","avatar_url":"https://github.com/stardothosting.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# WaybackPress\n\nRecover WordPress sites from the Internet Archive's Wayback 
Machine. This tool discovers, validates, and exports WordPress content from archived snapshots into a standard WordPress WXR import file.\n\n## Features\n\n- Automated URL discovery from Wayback Machine CDX API\n- Intelligent post validation with content heuristics\n- Multi-pass media fetching with automatic retries\n- Clean WXR 1.2 export compatible with WordPress Importer\n- Resumable operations with progress tracking\n- Configurable request throttling to respect archive.org\n- Detailed logging and reporting\n\n## Legal and Ethical Use\n\n**This tool is for personal archival and legitimate content recovery only.**\n\nYou are responsible for:\n- Only recovering content you have legal rights to\n- Complying with Internet Archive's Terms of Service\n- Respecting copyright and intellectual property laws\n- Using conservative rate limiting (default: 5s delay, 2 concurrency)\n- Not using this for commercial scraping or bulk downloads\n\nThe tool has built-in safeguards (rate limiting, user-agent identification) but ultimately you are responsible for how you use it.\n\n## Installation\n\n### From Source (Recommended)\n\n**Python 3.12+ requires a virtual environment** due to [PEP 668](https://peps.python.org/pep-0668/). 
This is the recommended approach for all Python versions:\n\n```bash\ngit clone https://github.com/stardothosting/shift8-waybackpress.git\ncd shift8-waybackpress\n\n# Create virtual environment\npython3 -m venv venv\n\n# Activate virtual environment\nsource venv/bin/activate  # On Linux/macOS\n# OR\nvenv\\Scripts\\activate     # On Windows\n\n# Install package\npip install -e .\n\n# Verify installation\nwaybackpress --version\n```\n\n**When you're done using the tool:**\n\n```bash\ndeactivate\n```\n\n**For future use, always activate the virtual environment first:**\n\n```bash\ncd shift8-waybackpress\nsource venv/bin/activate\nwaybackpress run example.com\n```\n\n### Alternative: System-Wide Installation (Python 3.11 and older)\n\n```bash\npip install -r requirements.txt\npip install -e .\n```\n\n**Note:** This method will fail on Python 3.12+ with an \"externally-managed-environment\" error.\n\n### Requirements\n\n- Python 3.8 or higher\n- Dependencies: beautifulsoup4, lxml, aiohttp, python-dateutil, trafilatura\n\n## Quick Start\n\nThe simplest way to recover a site:\n\n```bash\nwaybackpress run example.com\n```\n\nTo limit recovery to a specific date range (e.g., October 2018 to October 2025):\n\n```bash\nwaybackpress run example.com --from 20181001 --to 20251031\n```\n\nThis will run the complete pipeline: discover URLs, validate posts, fetch media, and generate a WordPress import file.\n\n## Usage\n\nWaybackPress works in stages, allowing you to control each step of the recovery process.\n\n### Stage 1: Discover URLs\n\nQuery the Wayback Machine to find all archived URLs for your domain:\n\n```bash\nwaybackpress discover example.com\n```\n\n**Single URL Extraction:** Extract just one specific post instead of the entire site:\n\n```bash\nwaybackpress discover example.com --url https://example.com/2020/01/post-title/\n```\n\n**Date Range Filtering:** Limit discovery to specific date range:\n\n```bash\nwaybackpress discover example.com --from 20181001 --to 
20251031\n```\n\nThis queries only snapshots between October 1, 2018 and October 31, 2025. Useful for:\n- Recovering content from specific time periods\n- Avoiding very old or very recent snapshots\n- Reducing processing time for large sites\n\nOptions:\n- `--url URL`: Extract a single specific URL instead of entire site\n- `--from DATE`: Start date (YYYYMMDD or YYYYMMDDHHMMSS format)\n- `--to DATE`: End date (YYYYMMDD or YYYYMMDDHHMMSS format)\n- `--output DIR`: Specify output directory (default: wayback-data/example.com)\n- `--delay SECONDS`: Delay between requests (default: 5)\n- `--concurrency N`: Concurrent requests (default: 2)\n\n### Stage 2: Validate Posts\n\nDownload and validate discovered URLs to identify actual blog posts:\n\n```bash\nwaybackpress validate --output wayback-data/example.com\n```\n\nThis stage:\n- Downloads HTML for each URL\n- Extracts metadata (title, date, author, categories, tags)\n- Identifies valid posts using content heuristics\n- Filters out archives, category pages, and duplicates\n- Generates a detailed validation report\n\n### Stage 3: Fetch Media\n\nDownload images, CSS, and JavaScript referenced in posts:\n\n```bash\nwaybackpress fetch-media --output wayback-data/example.com\n```\n\nOptions:\n- `--pass N`: Pass number for multi-pass fetching (default: 1)\n\nThe media fetcher:\n- Parses HTML to extract all media URLs\n- Queries CDX API for available snapshots\n- Attempts multiple snapshots if initial fetch fails\n- Tracks successes and failures for additional passes\n- Saves progress incrementally\n\n#### Multi-Pass Media Fetching\n\nIf the first pass has a low success rate, run additional passes:\n\n```bash\nwaybackpress fetch-media --output wayback-data/example.com --pass 2\n```\n\nEach pass attempts different snapshots, increasing the likelihood of recovery.\n\n### Stage 4: Export to WordPress\n\nGenerate a WordPress WXR import file:\n\n```bash\nwaybackpress export --output wayback-data/example.com\n```\n\nOptions:\n- 
`--title TEXT`: Site title for export (default: domain name)\n- `--url URL`: Site URL for export (default: http://domain)\n- `--author-name NAME`: Post author name (default: admin)\n- `--author-email EMAIL`: Post author email (default: admin@example.com)\n\n### Complete Pipeline\n\nRun all stages at once:\n\n```bash\nwaybackpress run example.com\n```\n\nWith date range:\n\n```bash\nwaybackpress run example.com --from 20181001 --to 20251031\n```\n\nOptions:\n- `--skip-media`: Skip media fetching\n- `--output DIR`: Output directory\n- `--delay SECONDS`: Request delay\n- `--concurrency N`: Concurrent requests\n- `--from DATE`: Start date (YYYYMMDD or YYYYMMDDHHMMSS)\n- `--to DATE`: End date (YYYYMMDD or YYYYMMDDHHMMSS)\n- All export options (--title, --url, --author-name, --author-email)\n\n## Output Structure\n\nWaybackPress creates the following directory structure:\n\n```\nwayback-data/\n└── example.com/\n    ├── config.json              # Project configuration\n    ├── waybackpress.log         # Detailed logs\n    ├── discovered_urls.tsv      # All discovered URLs\n    ├── valid_posts.tsv          # Validated post URLs\n    ├── validation_report.csv    # Detailed validation results\n    ├── media_report.csv         # Media fetch results\n    ├── wordpress-export.xml     # Final WXR import file\n    ├── html/                    # Downloaded HTML files\n    │   └── post-slug.html\n    └── media/                   # Downloaded media assets\n        └── example.com/\n            └── wp-content/\n                └── uploads/\n```\n\n## Configuration\n\nEach project maintains a `config.json` file with settings and state:\n\n```json\n{\n  \"domain\": \"example.com\",\n  \"output_dir\": \"wayback-data/example.com\",\n  \"delay\": 5.0,\n  \"concurrency\": 2,\n  \"skip_media\": false,\n  \"discovered\": true,\n  \"validated\": true,\n  \"media_fetched\": true,\n  \"exported\": true\n}\n```\n\n## Best Practices\n\n### Respecting Archive.org\n\nThe Wayback Machine is a free 
public resource. Be respectful:\n\n- Use the default 5-second delay between requests\n- Keep concurrency at 2 or lower\n- Run during off-peak hours for large sites\n- Consider multiple sessions for sites with thousands of posts\n\n### Media Recovery\n\nMedia fetching has inherent limitations:\n\n- Not all media is archived\n- Some snapshots may be corrupted\n- Success rates typically range from 30-50%\n\nStrategies to improve recovery:\n- Run multiple passes (2-3 recommended)\n- Increase delay and decrease concurrency for better reliability\n- Review `media_report.csv` to identify patterns in failures\n- Consider manual recovery for high-value assets\n\n### Validation Heuristics\n\nThe validator applies several filters:\n\n- Minimum content length (200 characters)\n- Duplicate detection (content hash)\n- URL pattern matching (excludes /category/, /tag/, /feed/)\n- Date validation\n\nReview `validation_report.csv` to verify results and adjust if needed.\n\n## Importing into WordPress\n\nAfter generating the WXR file:\n\n1. Log into your WordPress admin panel\n2. Go to Tools → Import → WordPress\n3. Install the WordPress Importer if prompted\n4. Upload `wordpress-export.xml`\n5. Assign post authors and choose import options\n6. Click \"Run Importer\"\n\n### Media Files\n\nMedia files must be uploaded separately:\n\n1. Connect to your server via SFTP/SSH\n2. Navigate to `wp-content/uploads/`\n3. Upload the contents of the `media/` directory\n4. Preserve the directory structure (domain/wp-content/uploads/)\n\nAfter uploading, you can regenerate thumbnail sizes with WP-CLI. This only affects attachments already registered in the media library; it does not upload files:\n\n```bash\nwp media regenerate --yes\n```\n\n## Troubleshooting\n\n### Installation Issues\n\n**Error: \"externally-managed-environment\"**\n\nYou're using Python 3.12+ which requires virtual environments. Follow the recommended installation steps above using `python3 -m venv venv`.\n\n**Error: \"Cannot update time stamp of directory 'waybackpress.egg-info'\"**\n\nThe egg-info directory is owned by root. 
Remove it and reinstall:\n\n```bash\nsudo rm -rf waybackpress.egg-info\npython3 -m venv venv\nsource venv/bin/activate\npip install -e .\n```\n\n**ModuleNotFoundError: No module named 'trafilatura'**\n\nOlder versions of `setup.py` were missing the `trafilatura` dependency. This is fixed in the latest version. If you're using an older version, install it manually (the quotes keep the shell from treating `\u003e` as a redirect):\n\n```bash\npip install \"trafilatura\u003e=2.0.0\"\n```\n\n### No Posts Found\n\n- Verify the domain is archived: https://web.archive.org/\n- Check if posts use non-standard URL patterns\n- Review `discovered_urls.tsv` to see what was found\n- Adjust URL filtering logic in `utils.py` if needed\n\n### Low Media Success Rate\n\n- Run additional passes with `--pass 2`, `--pass 3`\n- Reduce concurrency: `--concurrency 1`\n- Increase delay: `--delay 10`\n- Check `media_report.csv` for failure patterns\n\n### Import Errors\n\n- Validate XML: `xmllint --noout wordpress-export.xml`\n- Check WordPress error logs\n- Ensure server has adequate memory (php.ini: memory_limit)\n- Split large imports into smaller batches\n\n## Development\n\nRun tests:\n\n```bash\npython -m pytest tests/\n```\n\nFormat code:\n\n```bash\nblack waybackpress/\n```\n\nType checking:\n\n```bash\nmypy waybackpress/\n```\n\n## Project Structure\n\n```\nwaybackpress/\n├── __init__.py       # Package metadata\n├── __main__.py       # Entry point for python -m\n├── cli.py            # Command-line interface\n├── config.py         # Configuration management\n├── utils.py          # Shared utilities\n├── discover.py       # URL discovery\n├── validate.py       # Post validation\n├── fetch.py          # Media fetching\n└── export.py         # WXR generation\n```\n\n## Known Limitations\n\n- Only works with WordPress sites (other CMSs not supported)\n- Requires posts to be archived in Wayback Machine\n- Media recovery depends on archive availability\n- Some dynamic content (comments, widgets) may not be preserved perfectly\n- Wayback snapshots may have inconsistent timestamps\n\n## 
Contributing\n\nContributions are welcome. Please:\n\n1. Fork the repository\n2. Create a feature branch\n3. Make your changes with tests\n4. Submit a pull request\n\n## License\n\nMIT License. See LICENSE file for details.\n\n## Credits\n\nDeveloped by [Shift8 Web](https://shift8web.ca) for the WordPress community.\n\nBuilt using:\n- BeautifulSoup4 for HTML parsing\n- aiohttp for async HTTP requests\n- python-dateutil for flexible date parsing\n- lxml for XML processing\n\n## Changelog\n\n### 0.1.0 (Initial Release)\n\n- URL discovery from Wayback CDX API\n- Post validation with content heuristics\n- Multi-pass media fetching\n- WXR 1.2 export generation\n- Resumable operations\n- Progress tracking and reporting\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstardothosting%2Fshift8-waybackpress","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstardothosting%2Fshift8-waybackpress","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstardothosting%2Fshift8-waybackpress/lists"}