{"id":13577939,"url":"https://github.com/erlange/wbm-dl","last_synced_at":"2025-04-05T15:31:55.389Z","repository":{"id":119844710,"uuid":"208297067","full_name":"erlange/wbm-dl","owner":"erlange","description":"Wayback Machine Downloader. 🔥 Download your entire archived websites from the Internet Archive Wayback Machine. ","archived":false,"fork":false,"pushed_at":"2022-08-05T21:19:16.000Z","size":302,"stargazers_count":88,"open_issues_count":2,"forks_count":16,"subscribers_count":5,"default_branch":"master","last_synced_at":"2024-11-05T15:48:23.484Z","etag":null,"topics":["command-line-app","command-line-parser","command-line-tool","console","console-app","console-application","csharp","internet","internet-archive","internet-wayback-machine","wayback-machine","wayback-machine-downloader","website-scraper"],"latest_commit_sha":null,"homepage":"","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/erlange.png","metadata":{"files":{"readme":"README.flat.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-09-13T15:49:39.000Z","updated_at":"2024-10-17T15:23:06.000Z","dependencies_parsed_at":"2023-06-03T08:45:29.759Z","dependency_job_id":null,"html_url":"https://github.com/erlange/wbm-dl","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erlange%2Fwbm-dl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erlange%2Fwbm-dl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erlange%2Fwbm-dl/releases","manifests_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/repositories/erlange%2Fwbm-dl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/erlange","download_url":"https://codeload.github.com/erlange/wbm-dl/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247358939,"owners_count":20926326,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["command-line-app","command-line-parser","command-line-tool","console","console-app","console-application","csharp","internet","internet-archive","internet-wayback-machine","wayback-machine","wayback-machine-downloader","website-scraper"],"created_at":"2024-08-01T15:01:25.553Z","updated_at":"2025-04-05T15:31:50.798Z","avatar_url":"https://github.com/erlange.png","language":"C#","funding_links":[],"categories":["C# #"],"sub_categories":[],"readme":"![wbm-dl logo](wbm-dl.png \"wbm-dl logo\") \n# Wayback Machine Downloader\nA C# implementation of wayback machine downloader.  Download an entire archived website from the [Internet Archive Wayback Machine](http://web.archive.org/).  
The files downloaded are the original ones, not the versions rewritten by the Wayback Machine.\n\nFor complete documentation, consult the [Wiki page](https://github.com/erlange/wbm-dl/wiki).\n\n\n\n## Table of Contents\n\n* [**Requirements**](#requirements)\n* [**Installation**](#installation)\n  * [Stand Alone Executable](#Stand-Alone-Executable)\n  * [Source Code](#Source-Code)\n* [**Basic Usage**](#basic-usage)\n  * [Specifying the URL to Download](#specifying-the-url-to-download)\n  * [Output Directory](#output-directory)\n* [**Advanced Usage**](#advanced-usage)\n  * [Case Sensitive Parameter Names](#case-sensitive-parameter-names)\n  * [Downloading Snapshots for All Timestamps](#downloading-snapshots-for-all-timestamps)\n  * [From Timestamp](#from-timestamp)\n  * [To Timestamp](#to-timestamp)\n  * [Limiting Between Two Timestamps](#limiting-between-two-timestamps)\n  * [Limiting The Number of Files to Download](#limiting-the-number-of-files-to-Download)\n  * [Exact URL](#exact-url)\n  * [Download Only Specific Files](#download-only-specific-Files)\n  * [Excluding Specific Files](#excluding-specific-files)\n  * [Download All HTTP Status Codes](#download-all-http-status-codes)\n  * [Download Multiple Files at a Time](#download-multiple-files-at-a-Time)\n  * [Displaying the File List Without Downloading](#displaying-the-file-list-Without-downloading)\n* [**Log Files**](#log-files)\n  * [Log File Metadata](#log-file-metadata)\n* [**Considerations**](#considerations)\n  * [Avoid Mass-Scraping](#avoid-mass-scraping)\n  * [Windows Long Filename Limitation](#windows-long-filename-limitation)\n* [**Contributing**](#contributing)\n\n## Requirements\n1. .NET Framework 4.0 or newer.\n2. For development, use Visual Studio 2010 or newer. You can [download the latest version of Visual Studio here.](https://visualstudio.microsoft.com/downloads/) The Visual Studio Community Edition is free.\n
3. This tool uses the [Command Line Parser 2.6.0](http://github.com/commandlineparser/commandline) library.\n\n## Installation\n### Stand Alone Executable\n* Download the latest executable [here](https://github.com/erlange/wbm-dl/releases/download/v0.6/wbm-dl.1.0.6.zip) or choose from the available versions [here](https://github.com/erlange/wbm-dl/releases) \n\n### Source Code\n* Download the [ZIP file](https://github.com/erlange/wbm-dl/archive/master.zip) or clone this repository:\n    ```\n    mkdir [your-directory]\n    cd [your-directory]\n    git clone https://github.com/erlange/wbm-dl.git\n    cd wbm-dl\n    dir\n    ```\n    Then open the `.sln` solution file and build it with Visual Studio.\n* From Visual Studio, run this command from the Package Manager Console window:\n    ```\n    PM\u003e Install-Package CommandLineParser -Version 2.6.0\n    ```\n\n\n## Basic Usage\nAt its most basic, run `wbm-dl` followed by the website address, for example `http://yoursite.com`:\n```\nwbm-dl http://yoursite.com\n```\nor just\n```\nwbm-dl yoursite.com\n```\n\nIssuing the above command will download the website to the `./websites/yoursite.com` directory.\n\n## Specifying the URL to Download\nYou must supply a valid URL to download.\n### Examples\nSome valid URL examples are shown below:\n```\nwbm-dl yoursite.com \n```\n```\nwbm-dl http://yoursite.com \n```\n```\nwbm-dl https://yoursite.com \n```\n\n\n## Advanced Usage\nThe list of additional parameters is displayed when `wbm-dl` is run without any parameters:\n```\n
wbm-dl (Wayback Machine Downloader)\nhttp://erlange.github.com \n\n  -o, --out      Output/destination directory\n\n  -f, --from     From timestamp. Limits the archived result SINCE this timestamp.\n                 Use 1 to 14 digit with the format: yyyyMMddhhmmss\n                 If omitted, retrieves results since the earliest timestamp available.\n\n  -t, --to       To timestamp. Limits the archived result  UNTIL this timestamps.\n                 Use 1 to 14 digit with the format: yyyyMMddhhmmss\n                 If omitted, retrieves results until the latest timestamp available.\n\n  -l, --limit    Limits the first N or the last N results. Negative number limits the last N results.\n\n  -a             All timestamps. Retrieves snapshots for all timestamps.\n\n  -c, --count    (Default: 1) Number of concurrent processes.\n                 Can speed up the process but requires more memory.\n\n  -A, --All      Retrieves snapshots for all HTTP status codes.\n                 If omitted only retrieves the status code of 200\n\n  -e, --exact    Downloads only the url provided and not the full site.\n\n  -O, --Only     Restrict downloading to urls that match this filter.\n\n  -X, --eXclude  Skip downloading of urls that match this filter.\n\n  -L, --list     Displays only the list in a JSON format with the archived timestamps, does not download anything\n\n  --help         Display this help screen.\n\n  --version      Display version information.\n```\n\n#### Case Sensitive Parameter Names\nParameter names are case sensitive; for example, `-a` is different from `-A`. Take care when typing them.\n \n\n## Output Directory\n```\n-o, --out      Output/destination directory\n```\nOptional.  The `-o` or `--out` option specifies the directory in which the websites are saved.   A sub-directory called `/websites` will be created under the specified directory.\n\n### Examples\n```\nwbm-dl yoursite.com -o c:/download\n```\nWill download to the `c:/download/websites` directory.\n\n```\nwbm-dl yoursite.com -o ./myFolder/web\n```\nWill download to the `[Current Directory]/myFolder/web/websites` directory.\n\n\n## Downloading Snapshots for All Timestamps\nThe Wayback Machine archives your files in a separate snapshot for each timestamp.\n
You can specify the `-a` parameter to download all snapshot versions of each file.\n\nThe `-a` parameter is not to be confused with the `-A` parameter, although the two can be used together.\n```\n-a             All timestamps. Retrieves snapshots for all timestamps.\n```\nOptional.  The `-a` parameter downloads the versions of each file for all timestamps. The timestamp of each snapshot is used as a directory name.\n```\nwbm-dl yoursite.com -o c:/download  -a\n```\nWill download to the directory structure below:\n```\nc:/download/websites/yoursite.com/20180820202452/index.html\nc:/download/websites/yoursite.com/20181019232937/index.html\nc:/download/websites/yoursite.com/20190305194903/assets/logo.png\n```\n\nIf this parameter is omitted, the Wayback Machine Downloader will only download the latest snapshot version of each unique item.\n\n## From Timestamp\n```\n-f, --from     From timestamp. \n```\nOptional. You can limit the result by specifying **the earliest** timestamp in the *yyyyMMddhhmmss* format. This parameter is inclusive: snapshots taken at the specified timestamp are included in the result. The Wayback Machine Downloader will only fetch the snapshots since the timestamp specified.\n\n### Examples\n```\nwbm-dl yoursite.com -o c:/download -f 20171101210000\n```\nWill download only the snapshots since *November 01, 2017* at *21:00:00*\n\n```\nwbm-dl yoursite.com -o c:/download -f 2017\n```\nWill download only the snapshots since the year of *2017*\n\n```\nwbm-dl yoursite.com -o c:/download -f 201707\n```\nWill download only the snapshots since *July 2017*\n\n## To Timestamp\n```\n-t, --to     To timestamp. \n```\nOptional. You can limit the result by specifying **the latest** timestamp in the *yyyyMMddhhmmss* format. This parameter is inclusive: snapshots taken at the specified timestamp are included in the result.\n
The Wayback Machine Downloader will only fetch the snapshots until the timestamp specified.\n\n### Examples\n```\nwbm-dl yoursite.com -o c:/download -t 20180915220000\n```\nWill download only the snapshots until *September 15, 2018* at *22:00:00*\n\n```\nwbm-dl yoursite.com -o c:/download -t 2018\n```\nWill download only the snapshots until the year of *2018*\n\n```\nwbm-dl yoursite.com -o c:/download -t 201804\n```\nWill download only the snapshots until *April 2018*.\n\n## Limiting Between Two Timestamps\nYou can combine the `-f` and `-t` parameters to limit the result to the range between two timestamps.  Since both parameters are inclusive, snapshots taken at the from and to timestamps are included in the result.\n\n### Examples\n```\nwbm-dl yoursite.com -o c:/download -f 20171101210000 -t 20180915220000\n```\nWill download only the snapshots since *November 01, 2017 21:00:00* until *September 15, 2018 22:00:00*.\n\n\n```\nwbm-dl yoursite.com -o c:/download  -f 2017 -t 201804\n```\nWill download only the snapshots since *2017* until *April 2018*.\n\n\n```\nwbm-dl yoursite.com -o c:/download  -f 2017 -t 2017\n```\nWill download only the snapshots during *2017*.\n\n## Limiting The Number of Files to Download\n```\n-l, --limit    Limits the first N or the last N results. Negative number limits the last N results.\n```\nOptional. You can limit the number of files to download by specifying the `-l` parameter followed by a positive or negative integer value.\n\n\nThe `-l` parameter is not to be confused with the `-L` parameter, although the two can be used together.\n\n\n### Examples\n```\nwbm-dl yoursite.com -o c:/download -l 50\n```\nWill download only the first 50 files, counting **from** the earliest timestamp. The earliest timestamp is included in the result.\n\n\n```\nwbm-dl yoursite.com -o c:/download -l -25\n```\nWill download only the last 25 files, counting back from the latest timestamp.\n
The latest timestamp is included in the result.\n\n\n## Exact URL\n```\n-e, --exact    Downloads only the url provided and not the full site.\n```\nOptional. Instead of downloading the entire website, you can use the `-e` flag to download only the file you specify as the URL.\n### Examples\n```\nwbm-dl yoursite.com -o c:/download -e\n```\nWill download only the homepage HTML file of yoursite.com\n\n## Download Only Specific Files\n```\n-O, --Only\n```\nOptional. You can restrict the download to URLs that match a filter, for example to download only files of certain types (e.g., .jpg, .pdf, .doc). This parameter takes a string or a regex.\n\n### Examples\n```\nwbm-dl yoursite.com -o c:/download -O \"^.*\\.(jpg|gif|png)$\"\n```\nThis will download only image files of .jpg, .gif and .png types.\n\n\n```\nwbm-dl yoursite.com -o c:/download -O \"^.*\\b(themes|green).*\\b$\"\n```\nThis will download files containing the word `themes` or `green` in the path.\n\n## Excluding Specific Files\n```\n-X, --eXclude\n```\nOptional. In contrast with the `-O` parameter, the `-X` parameter excludes the files that match the filter. This parameter takes a string or a regex.\n\n\n### Examples\n```\nwbm-dl yoursite.com -o c:/download -X \"^.*\\.(jpg|gif|png)$\"\n```\nThis will not download image files of .jpg, .gif and .png types.\n\n```\nwbm-dl yoursite.com -o c:/download -X \"^.*\\b(themes|green).*\\b$\"\n```\nThis will exclude the files containing the word `themes` or `green` in the path.\n\n## Download All HTTP Status Codes\n```\n-A, --All      Retrieves snapshots for all HTTP status codes.\n               If omitted only retrieves the status code of 200\n```\nOptional. By default, the Wayback Machine Downloader downloads only the files archived with an HTTP status code of 200 (the HTTP status code for OK).\n
This `-A` flag will download responses with all HTTP status codes, such as 30x, 40x and 50x.\n### Examples\n```\nwbm-dl yoursite.com -o c:/download -A\n```\n\n\n\n## Download Multiple Files at a Time\n```\n-c, --count    (Default: 1) Number of concurrent processes.\n               Can speed up the process but requires more memory.\n```\nOptional. You can speed up the download process significantly by specifying the number of concurrent downloads (an integer) with the `-c` parameter.\n\n\n### Examples\n```\nwbm-dl yoursite.com -o c:/download -c 50\n```\nWill download a maximum of 50 files at a time.\n\n\n## Displaying the File List Without Downloading\n```\n-L, --list     Displays only the list in a JSON format with the archived timestamps, does not download anything\n```\nOptional.  This option will only display the file list in JSON format and save it to the `/logs` directory.  It won't download anything else.\n\n\n### Examples\n```\nwbm-dl yoursite.com -o c:/download -L\n```\nThis will only display the file list on screen and save the list in the `c:/download/logs` directory.\n\n## Log Files\nUpon completion, a `/logs` directory containing a log file will be created under the `/websites` directory.\nThe JSON-formatted log file contains the completion status of each downloaded item.  If errors occurred, you can examine the log file and download the failed items manually using the source URL recorded for each item.\n\nThis log file will not be generated when using the `-L` or `--list` flag.\n\nThe generated log filename will be `yoursite.com.log.json`\n\n### Log File Metadata\nThe JSON-formatted log file contains metadata as follows:\n* `ErrorMsg`    \n    Contains the error message if an error occurred.\n* `Num`    \n    Line number.\n* `Original`    \n    Contains the original location of the item.\n
* `Source`    \n    Contains the archived location of the item in the Internet Archive Wayback Machine.  You can use this value to download the item manually.\n* `Status`    \n    Contains the HTTP status code.  If the `-A` flag is omitted and no error occurred, the value will be `200 (OK)`.  If this value is empty, an error might have occurred.  You can then consult the `ErrorMsg` to examine the error and use the `Source` to manually download the individual file.\n* `Target`    \n    Contains the full path in the output directory where the file is saved.  If this value is empty, an error might have occurred.  You can then consult the `ErrorMsg` to examine the error and use the `Source` to manually download the individual file.\n* `Time`    \n    The time at which the `Source` responded to the request. The time is in the `yyyyMMdd hh:mm:ss` .NET format and might not conform to the standard JSON datetime format.\n\n## Considerations \n### Avoid Mass-Scraping\nAn archived website only gets bigger over time and can grow to millions of files.\nA few points are therefore worth considering.\n\n\nIt is always advisable to limit the downloads in each session with the filtering options, including, but not limited to:\n- Filtering by timestamp with the `-f` or `-t` options\n- Filtering by file with the `-O` option\n- Excluding what you don't need with the `-X` option\n- Keeping the number of simultaneous downloads low by passing a small number to the `-c` option\n\n\nIt is good etiquette to crawl politely.  \nAvoid mass-scraping: overloading the archive with too many requests for too many big files will surely hurt its servers.\nIf this happens too often, the archive might take measures to block downloader tools such as this one, and in the long run it might lead to anti-scraping legal action.\n\n
That said, download wisely.\n\n### Windows Long Filename Limitation\nWindows limits a directory path to a maximum of 248 characters, while a URL has no such limit.\nThis limitation can cause errors that prevent some files from being downloaded.\nIn this case you can examine the log file and download manually from the source URL provided.\n\n## Contributing\nContributions are welcome.  Just open an issue or a pull request on GitHub.\n\n\n## \nCopyright \u0026copy; 2018 - eri.airlangga@gmail.com","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ferlange%2Fwbm-dl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ferlange%2Fwbm-dl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ferlange%2Fwbm-dl/lists"}