{"id":13585590,"url":"https://github.com/sangaline/wayback-machine-scraper","last_synced_at":"2025-04-13T02:25:21.265Z","repository":{"id":37768315,"uuid":"87244601","full_name":"sangaline/wayback-machine-scraper","owner":"sangaline","description":"A command-line utility and Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.","archived":false,"fork":false,"pushed_at":"2024-02-23T22:42:02.000Z","size":84,"stargazers_count":439,"open_issues_count":10,"forks_count":80,"subscribers_count":15,"default_branch":"master","last_synced_at":"2025-04-04T04:41:15.157Z","etag":null,"topics":["archive-dot-org","command-line-tool","python","wayback-archiver","wayback-machine","web-scraping"],"latest_commit_sha":null,"homepage":"http://sangaline.com/post/wayback-machine-scraper/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"isc","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sangaline.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-04-04T23:27:58.000Z","updated_at":"2025-03-27T08:08:42.000Z","dependencies_parsed_at":"2024-05-05T03:42:11.322Z","dependency_job_id":null,"html_url":"https://github.com/sangaline/wayback-machine-scraper","commit_stats":{"total_commits":42,"total_committers":1,"mean_commits":42.0,"dds":0.0,"last_synced_commit":"32ba9503fa8438ee75d16909911821d6ca336e8f"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sangaline%2Fwayback-machine-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositori
es/sangaline%2Fwayback-machine-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sangaline%2Fwayback-machine-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sangaline%2Fwayback-machine-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sangaline","download_url":"https://codeload.github.com/sangaline/wayback-machine-scraper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248655038,"owners_count":21140419,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["archive-dot-org","command-line-tool","python","wayback-archiver","wayback-machine","web-scraping"],"created_at":"2024-08-01T15:05:01.918Z","updated_at":"2025-04-13T02:25:21.231Z","avatar_url":"https://github.com/sangaline.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"![The Wayback Machine Scraper Logo](img/logo.png)\n\n# The Wayback Machine Scraper\n\nThe repository consists of a command-line utility `wayback-machine-scraper` that can be used to scrape or download website data as it appears in [archive.org](http://archive.org)'s [Wayback Machine](https://archive.org/web/).\nIt crawls through historical snapshots of a website and saves the snapshots to disk.\nThis can be useful when you're trying to scrape a site that has anti-scraping measures that make direct scraping impossible or prohibitively slow.\nIt's also useful if you want to scrape a website as it appeared at some point in the past or to 
scrape information that changes over time.\n\nThe command-line utility is highly configurable in terms of what it scrapes but it only saves the unparsed content of the pages on the site.\nIf you're interested in parsing data from the pages that are crawled then you might want to check out [scrapy-wayback-machine](https://github.com/sangaline/scrapy-wayback-machine) instead.\nIt's a downloader middleware that handles all of the tricky parts and passes normal `response` objects to your [Scrapy](https://scrapy.org) spiders with archive timestamp information attached.\nThe middleware is very unobtrusive and should work seamlessly with existing [Scrapy](https://scrapy.org) middlewares, extensions, and spiders.\nIt's what `wayback-machine-scraper` uses behind the scenes and it offers more flexibility for advanced use cases.\n\n## Installation\n\nThe package can be installed using `pip`.\n\n```bash\npip install wayback-machine-scraper\n```\n\n## Command-Line Interface\n\nWriting a custom [Scrapy](https://scrapy.org) spider and using the `WaybackMachine` middleware is the preferred way to use this project, but a command line interface for basic mirroring is also included.\nThe usage information can be printed by running `wayback-machine-scraper -h`.\n\n```\nusage: wayback-machine-scraper [-h] [-o DIRECTORY] [-f TIMESTAMP]\n                               [-t TIMESTAMP] [-a REGEX] [-d REGEX]\n                               [-c CONCURRENCY] [-u] [-v]\n                               DOMAIN [DOMAIN ...]\n\nMirror all Wayback Machine snapshots of one or more domains within a specified\ntime range.\n\npositional arguments:\n  DOMAIN                Specify the domain(s) to scrape. Can also be a full\n                        URL to specify starting points for the crawler.\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -o DIRECTORY, --output DIRECTORY\n                        Specify the domain(s) to scrape. 
Can also be a full\n                        URL to specify starting points for the crawler.\n                        (default: website)\n  -f TIMESTAMP, --from TIMESTAMP\n                        The timestamp for the beginning of the range to\n                        scrape. Can either be YYYYmmdd, YYYYmmddHHMMSS, or a\n                        Unix timestamp. (default: 10000101)\n  -t TIMESTAMP, --to TIMESTAMP\n                        The timestamp for the end of the range to scrape. Use\n                        the same timestamp as `--from` to specify a single\n                        point in time. (default: 30000101)\n  -a REGEX, --allow REGEX\n                        A regular expression that all scraped URLs must match.\n                        (default: ())\n  -d REGEX, --deny REGEX\n                        A regular expression to exclude matched URLs.\n                        (default: ())\n  -c CONCURRENCY, --concurrency CONCURRENCY\n                        Target concurrency for crawl requests.The crawl rate\n                        will be automatically adjusted to match this\n                        target.Use values less than 1 to be polite and higher\n                        values to scrape more quickly. (default: 10.0)\n  -u, --unix            Save snapshots as `UNIX_TIMESTAMP.snapshot` instead of\n                        the default `YYYYmmddHHMMSS.snapshot`. (default:\n                        False)\n  -v, --verbose         Turn on debug logging. 
(default: False)\n```\n\n## Examples\n\nThe usage can perhaps be made clearer with a couple of concrete examples.\n\n### A Single Page Over Time\n\nOne of the key advantages of `wayback-machine-scraper` over other projects, such as [wayback-machine-downloader](https://github.com/hartator/wayback-machine-downloader), is that it offers the capability to download all available [archive.org](https://archive.org) snapshots.\nThis can be extremely useful if you're interested in analyzing how pages change over time.\n\nFor example, say that you would like to analyze many snapshots of the [Hacker News](https://news.ycombinator.com) front page as I did when writing [Reverse Engineering the Hacker News Algorithm](http://sangaline.com/post/reverse-engineering-the-hacker-news-ranking-algorithm/).\nThis can be done by running\n\n```bash\nwayback-machine-scraper -a 'news.ycombinator.com$' news.ycombinator.com\n```\n\nwhere the `--allow` regular expression `news.ycombinator.com$` limits the crawl to the front page.\nThis produces a file structure of\n\n```\nwebsite/\n└── news.ycombinator.com\n    ├── 20070221033032.snapshot\n    ├── 20070226001637.snapshot\n    ├── 20070405032412.snapshot\n    ├── 20070405175109.snapshot\n    ├── 20070406195336.snapshot\n    ├── 20070601184317.snapshot\n    ├── 20070629033202.snapshot\n    ├── 20070630222527.snapshot\n    ├── 20070630222818.snapshot\n    └── etc.\n```\n\nwith each snapshot file containing the full HTML body of the front page.\n\nA series of snapshots for any page can be obtained in this way as long as suitable regular expressions and start URLs are constructed.\nIf we are interested in a page other than the homepage then we should use it as the start URL instead.\nTo get all of the snapshots for a specific story we could run\n\n```bash\nwayback-machine-scraper -a 'id=13857086$' 'news.ycombinator.com/item?id=13857086'\n```\n\nwhich produces\n\n```\nwebsite/\n└── news.ycombinator.com\n    └── item?id=13857086\n        ├── 
20170313225853.snapshot\n        ├── 20170313231755.snapshot\n        ├── 20170314043150.snapshot\n        ├── 20170314165633.snapshot\n        └── 20170320205604.snapshot\n```\n\n### A Full Site Crawl at One Point In Time\n\nIf the goal is to take a snapshot of an entire site at once then this can also be easily achieved.\nSpecifying both the `--from` and `--to` options as the same point in time will ensure that only one snapshot is saved for each URL.\nRunning\n\n```bash\nwayback-machine-scraper -f 20080623 -t 20080623 news.ycombinator.com\n```\n\nproduces a file structure of\n\n```\nwebsite\n└── news.ycombinator.com\n    ├── 20080621143814.snapshot\n    ├── item?id=221868\n    │   └── 20080622151531.snapshot\n    ├── item?id=222157\n    │   └── 20080622151822.snapshot\n    ├── item?id=222341\n    │   └── 20080620221102.snapshot\n    └── etc.\n```\n\nwith a single snapshot for each page in the crawl as it appeared on June 23, 2008.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsangaline%2Fwayback-machine-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsangaline%2Fwayback-machine-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsangaline%2Fwayback-machine-scraper/lists"}