{"id":13586465,"url":"https://github.com/ArchiveTeam/grab-site","last_synced_at":"2025-04-07T15:31:59.018Z","repository":{"id":26878043,"uuid":"30338639","full_name":"ArchiveTeam/grab-site","owner":"ArchiveTeam","description":"The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns","archived":false,"fork":false,"pushed_at":"2024-07-07T01:13:29.000Z","size":1268,"stargazers_count":1469,"open_issues_count":95,"forks_count":145,"subscribers_count":41,"default_branch":"master","last_synced_at":"2025-04-02T22:28:49.074Z","etag":null,"topics":["archiving","crawl","crawler","spider","warc"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ArchiveTeam.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2015-02-05T05:01:19.000Z","updated_at":"2025-03-30T17:45:22.000Z","dependencies_parsed_at":"2022-07-27T08:52:36.671Z","dependency_job_id":"d78fe50d-0080-4567-8946-6899f9c40a1a","html_url":"https://github.com/ArchiveTeam/grab-site","commit_stats":{"total_commits":1162,"total_committers":16,"mean_commits":72.625,"dds":0.07745266781411364,"last_synced_commit":"f4027f57fcb6c7f19421e348fa8c8d7acf725267"},"previous_names":["ludios/grab-site"],"tags_count":18,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveTeam%2Fgrab-site","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveTeam%2Fgrab-site/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveTeam%2Fgrab-site/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveTeam%2Fgrab-site/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ArchiveTeam","download_url":"https://codeload.github.com/ArchiveTeam/grab-site/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247610069,"owners_count":20966307,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["archiving","crawl","crawler","spider","warc"],"created_at":"2024-08-01T15:05:35.312Z","updated_at":"2025-04-07T15:31:58.975Z","avatar_url":"https://github.com/ArchiveTeam.png","language":"Python","readme":"grab-site\n=========\n\n[![Build status][travis-image]][travis-url]\n\ngrab-site is an easy preconfigured web crawler designed for backing up websites.\nGive grab-site a URL and it will recursively crawl the site and write\n[WARC files](https://www.archiveteam.org/index.php?title=The_WARC_Ecosystem).\nInternally, grab-site uses [a fork](https://github.com/ArchiveTeam/ludios_wpull) of\n[wpull](https://github.com/chfoo/wpull) for 
grab-site gives you

*   a dashboard with all of your crawls, showing which URLs are being
    grabbed, how many URLs are left in the queue, and more.

*   the ability to add ignore patterns while the crawl is already running.
    This allows you to skip the crawling of junk URLs that would
    otherwise prevent your crawl from ever finishing.  See below.

*   an extensively tested default ignore set ([global](https://github.com/ArchiveTeam/grab-site/blob/master/libgrabsite/ignore_sets/global))
    as well as additional (optional) ignore sets for forums, reddit, etc.

*   duplicate page detection: links are not followed on pages whose
    content duplicates an already-seen page.

The URL queue is kept on disk instead of in memory.  If you're really lucky,
grab-site will manage to crawl a site with ~10M pages.

![dashboard screenshot](https://raw.githubusercontent.com/ArchiveTeam/grab-site/master/images/dashboard.png)

Note: if you have any problems whatsoever installing or getting grab-site to run,
please [file an issue](https://github.com/ArchiveTeam/grab-site/issues) - thank you!

The installation methods below are the only ones supported in our GitHub issues.
Please do not modify the installation steps unless you really know what you're
doing with both Python packaging and your operating system.  grab-site runs
on a specific version of Python (3.7 or 3.8) and with specific dependency versions.

**Contents**

- [Install on Ubuntu 18.04, 20.04, 22.04, Debian 10 (buster), Debian 11 (bullseye)](#install-on-ubuntu-1804-2004-2204-debian-10-buster-debian-11-bullseye)
- [Install on NixOS](#install-on-nixos)
- [Install on another distribution lacking Python 3.7.x or 3.8.x](#install-on-another-distribution-lacking-python-37x-or-38x)
- [Install on macOS](#install-on-macos)
- [Install on Windows 10 (experimental)](#install-on-windows-10-experimental)
- [Upgrade an existing install](#upgrade-an-existing-install)
- [Usage](#usage)
  - [`grab-site` options, ordered by importance](#grab-site-options-ordered-by-importance)
  - [Warnings](#warnings)
  - [Tips for specific websites](#tips-for-specific-websites)
- [Changing ignores during the crawl](#changing-ignores-during-the-crawl)
- [Inspecting the URL queue](#inspecting-the-url-queue)
- [Preventing a crawl from queuing any more URLs](#preventing-a-crawl-from-queuing-any-more-urls)
- [Stopping a crawl](#stopping-a-crawl)
- [Advanced `gs-server` options](#advanced-gs-server-options)
- [Viewing the content in your WARC archives](#viewing-the-content-in-your-warc-archives)
- [Inspecting WARC files in the terminal](#inspecting-warc-files-in-the-terminal)
- [Automatically pausing grab-site processes when free disk is low](#automatically-pausing-grab-site-processes-when-free-disk-is-low)
- [Thanks](#thanks)
- [Help](#help)



Install on Ubuntu 18.04, 20.04, 22.04, Debian 10 (buster), Debian 11 (bullseye)
---

1.  On Debian, use `su` to become root if `sudo` is not configured to give you access.

    ```
    sudo apt-get update
    sudo apt-get install --no-install-recommends \
        wget ca-certificates git build-essential libssl-dev zlib1g-dev \
        libbz2-dev libreadline-dev libsqlite3-dev libffi-dev libxml2-dev \
        libxslt1-dev libre2-dev pkg-config
    ```

    If you see `Unable to locate package`, run the two commands again.

2.  As a **non-root** user:

    ```
    wget https://raw.githubusercontent.com/pyenv/pyenv-installer/master/bin/pyenv-installer
    chmod +x pyenv-installer
    ./pyenv-installer
    ~/.pyenv/bin/pyenv install 3.8.15
    ~/.pyenv/versions/3.8.15/bin/python -m venv ~/gs-venv
    ~/gs-venv/bin/pip install --no-binary lxml --upgrade git+https://github.com/ArchiveTeam/grab-site
    ```

    `--no-binary lxml` is necessary for the html5-parser build.

3.  Add this to your `~/.bashrc` or `~/.zshrc`:

    ```
    PATH="$PATH:$HOME/gs-venv/bin"
    ```

    and then restart your shell (e.g. by opening a new terminal tab/window).
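As a quick sanity check (not part of the official steps), you can confirm the install worked by running the `grab-site` binary directly from the virtualenv:

```
~/gs-venv/bin/grab-site --help
```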
Install on NixOS
---

grab-site was removed from nixpkgs master; 23.05 is the last release to contain grab-site.

```
nix-env -f https://github.com/NixOS/nixpkgs/archive/release-23.05.tar.gz -iA grab-site
```

or, if you are using profiles (i.e. when you have flakes enabled):

```
nix profile install nixpkgs/release-23.05#grab-site
```


Install on another distribution lacking Python 3.7.x or 3.8.x
---

grab-site and its dependencies are available in [nixpkgs](https://github.com/NixOS/nixpkgs), which can be used on any Linux distribution.

1.  As root, where `USER` is your non-root username:

    ```
    mkdir /nix
    chown USER:USER /nix
    ```

2.  As the **non-root** user, install Nix: https://nixos.org/nix/download.html

3.  As the **non-root** user:

    ```
    nix-env -f https://github.com/NixOS/nixpkgs/archive/release-23.05.tar.gz -iA grab-site
    ```

    and then restart your shell (e.g. by opening a new terminal tab/window).



Install on macOS
---

On OS X 10.10 through macOS 11:

1.  Run `locale` in your terminal.  If the output includes "UTF-8", you
    are all set.  If it does not, your terminal is misconfigured and grab-site
    will fail to start.  This can be corrected with:

    -   Terminal.app: Preferences... -> Profiles -> Advanced -> **check** Set locale environment variables on startup

    -   iTerm2: Preferences... -> Profiles -> Terminal -> Environment -> **check** Set locale variables automatically

### Using Homebrew (**Intel Mac**)

For M1 Macs, use the next section instead of this one.

2.  Install Homebrew using the install step on https://brew.sh/

3.  Run:

    ```
    brew update
    brew install python@3.8 libxslt re2 pkg-config
    /usr/local/opt/python@3.8/bin/python3 -m venv ~/gs-venv
    PKG_CONFIG_PATH="/usr/local/opt/libxml2/lib/pkgconfig" ~/gs-venv/bin/pip install --no-binary lxml --upgrade git+https://github.com/ArchiveTeam/grab-site
    ```

4.  To put the `grab-site` binaries in your PATH, add this to your `~/.zshrc` (macOS 10.15, 11+) or `~/.bash_profile` (earlier):

    ```
    PATH="$PATH:$HOME/gs-venv/bin"
    ```

    and then restart your shell (e.g. by opening a new terminal tab/window).

### Using Homebrew (**M1 Mac**)

2.  Install Homebrew using the install step on https://brew.sh/

    If you already have a Homebrew install at `/usr/local`, you may need to first remove that old Intel-based Homebrew install.

3.  Run:

    ```
    brew update
    brew install python@3.8 libxslt re2 pkg-config
    /opt/homebrew/opt/python@3.8/bin/python3 -m venv ~/gs-venv
    PKG_CONFIG_PATH="/opt/homebrew/opt/libxml2/lib/pkgconfig" ~/gs-venv/bin/pip install --no-binary lxml --upgrade git+https://github.com/ArchiveTeam/grab-site
    ```

4.  To put the `grab-site` binaries in your PATH, add this to your `~/.zshrc` (macOS 10.15, 11+) or `~/.bash_profile` (earlier):

    ```
    PATH="$PATH:$HOME/gs-venv/bin"
    ```

    and then restart your shell (e.g. by opening a new terminal tab/window).
Install on Windows 10 (experimental)
---

On Windows 10 Fall Creators Update (version 1709) or newer:

1. Start menu -> search "feature" -> Turn Windows features on or off

2. Scroll down, check "Windows Subsystem for Linux" and click OK.

3. Wait for the install to finish and click "Restart now".

4. Start menu -> Store

5. Search for "Ubuntu" in the store and install Ubuntu (publisher: Canonical Group Limited).

6. Start menu -> Ubuntu

7. Wait for the install to finish and create a user when prompted.

8. Follow the [Ubuntu 18.04, 20.04, 22.04, Debian 10 (buster), Debian 11 (bullseye)](#install-on-ubuntu-1804-2004-2204-debian-10-buster-debian-11-bullseye) steps.



Upgrade an existing install
---

To upgrade grab-site, simply run the `~/gs-venv/bin/pip install ...` or
`nix-env ...` command used to install it originally (see above).

After upgrading, stop `gs-server` with `kill` or ctrl-c, then start it again.
Existing `grab-site` crawls will automatically reconnect to the new server.



Usage
---

First, start the dashboard with:

```
gs-server
```

and point your browser to http://127.0.0.1:29000/

Note: gs-server listens on all interfaces by default, so you can reach the
dashboard by a non-localhost IP as well, e.g. a LAN or WAN IP.  (Sub-note:
no code execution capabilities are exposed on any interface.)

Then, start as many crawls as you want with:

```
grab-site 'URL'
```

Do this inside tmux unless they're very short crawls.

grab-site outputs WARCs, logs, and control files to a new subdirectory in the
directory from which you launched `grab-site`, referred to here as "DIR".
(Use `ls -lrt` to find it.)

You can pass multiple `URL` arguments to include them in the same crawl,
whether they are on the same domain or different domains entirely.

warcprox users: [warcprox](https://github.com/internetarchive/warcprox) breaks the
dashboard's WebSocket; please make your browser skip the proxy for whichever
host/IP you're using to reach the dashboard.
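For example, a long-running crawl of a site and a related subdomain (the hostnames below are placeholders) might be started like this:

```
tmux new -s my-crawl
grab-site 'https://example.com/' 'https://forum.example.com/'
```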
### `grab-site` options, ordered by importance

Options can come before or after the URL.

*   `--1`: grab just `URL` and its page requisites, without recursing.

*   `--igsets=IGSET1,IGSET2`: use ignore sets `IGSET1` and `IGSET2`.

    Ignore sets are used to avoid requesting junk URLs using a pre-made set of
    regular expressions.  See [the full list of available ignore sets](https://github.com/ArchiveTeam/grab-site/tree/master/libgrabsite/ignore_sets).

    The [global](https://github.com/ArchiveTeam/grab-site/blob/master/libgrabsite/ignore_sets/global)
    ignore set is implied and enabled unless `--no-global-igset` is used.

    The ignore sets can be changed during the crawl by editing the `DIR/igsets` file.

*   `--no-global-igset`: don't add the [global](https://github.com/ArchiveTeam/grab-site/blob/master/libgrabsite/ignore_sets/global) ignore set.

*   `--no-offsite-links`: don't follow links to pages on other domains.

    grab-site always grabs page requisites (e.g. inline images and stylesheets), even if
    they are on other domains.  By default, grab-site also grabs linked pages to a depth
    of 1 on other domains.  To turn off this behavior, use `--no-offsite-links`.

    Using `--no-offsite-links` may prevent all kinds of useful images, video, audio, downloads,
    etc. from being grabbed, because these are often hosted on a CDN or subdomain, and
    thus would otherwise not be included in the recursive crawl.

*   `-i` / `--input-file`: Load the list of URLs-to-grab from a local file or from a
    URL, like `wget -i`.  The file must be a newline-delimited list of URLs.
    Combine with `--1` to avoid a recursive crawl on each URL.

*   `--igon`: Print all URLs being ignored to the terminal and dashboard.  Can be
    changed during the crawl by `touch`ing or `rm`ing the `DIR/igoff` file.
    This is slower because it needs to find the specific regexp to blame.

*   `--no-video`: Skip the download of videos by both MIME type and file extension.
    Skipped videos are logged to `DIR/skipped_videos`.  Can be
    changed during the crawl by `touch`ing or `rm`ing the `DIR/video` file.

*   `--no-sitemaps`: don't queue URLs from `sitemap.xml` at the root of the site.

*   `--max-content-length=N`: Skip the download of any response that claims a
    Content-Length larger than `N` (default: -1, don't skip anything).
    Skipped URLs are logged to `DIR/skipped_max_content_length`.  Can be changed
    during the crawl by editing the `DIR/max_content_length` file.

*   `--no-dupespotter`: Disable dupespotter, a plugin that skips the extraction
    of links from pages that look like duplicates of earlier pages.  Disable this
    for sites that are directory listings, because they frequently trigger false
    positives.

*   `--concurrency=N`: Use `N` connections to fetch in parallel (default: 2).
    Can be changed during the crawl by editing the `DIR/concurrency` file.

*   `--delay=N`: Wait `N` milliseconds (default: 0) between requests on each concurrent fetcher.
    Can be a range like `X-Y` to use a random delay between `X` and `Y`.  Can be changed during
    the crawl by editing the `DIR/delay` file.

*   `--import-ignores`: Copy the given file to `DIR/ignores` before the crawl begins.

*   `--warc-max-size=BYTES`: Try to limit each WARC file to around `BYTES` bytes
    before rolling over to a new WARC file (default: 5368709120, which is 5GiB).
    Note that the resulting WARC files may be drastically larger if there are very
    large responses.

*   `--level=N`: recurse `N` levels instead of `inf` levels.

*   `--page-requisites-level=N`: recurse page requisites `N` levels instead of `5` levels.

*   `--ua=STRING`: Send User-Agent: `STRING` instead of pretending to be Firefox on Windows.

*   `--id=ID`: Use id `ID` for the crawl instead of a random 128-bit id.  This must be unique for every crawl.

*   `--dir=DIR`: Put control files, temporary files, and unfinished WARCs in `DIR`
    (default: a directory name based on the URL, date, and first 8 characters of the id).

*   `--finished-warc-dir=FINISHED_WARC_DIR`: absolute path to a directory into
    which finished `.warc.gz` and `.cdx` files will be moved.

*   `--permanent-error-status-codes=STATUS_CODES`: A comma-separated list of
    HTTP status codes to treat as a permanent error and therefore **not** retry
    (default: `401,403,404,405,410`).  Other error responses are retried another 2
    times for a total of 3 tries (customizable with `--wpull-args=--tries=N`).
    Note that, unlike wget, wpull puts retries at the end of the queue.

*   `--wpull-args=ARGS`: String containing additional arguments to pass to wpull;
    see `wpull --help`.  `ARGS` is split with `shlex.split` and individual
    arguments can contain spaces if quoted, e.g.
    `--wpull-args="--youtube-dl \"--youtube-dl-exe=/My Documents/youtube-dl\""`

    Examples:

    *   `--wpull-args=--no-skip-getaddrinfo` to respect `/etc/hosts` entries.
    *   `--wpull-args=--no-warc-compression` to write uncompressed WARC files.

*   `--which-wpull-args-partial`: Print a partial list of the wpull arguments that
    would be used and exit.  Excludes grab-site-specific features, and removes
    `DIR/` from paths.  Useful for reporting bugs on wpull without grab-site involvement.

*   `--which-wpull-command`: Populate `DIR/` but don't start wpull; instead print
    the command that would have been used to start wpull with all of the
    grab-site functionality.

*   `--debug`: print a lot of debug information.

*   `--help`: print help text.
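Putting a few of the options above together: a polite recursive crawl of a forum, staying on the main domain (the URL and values are only illustrative), might look like:

```
grab-site 'https://forum.example.com/' --igsets=forums --no-offsite-links \
    --concurrency=1 --delay=250-750
```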
### Warnings

If you pay no attention to your crawls, a crawl may head down some infinite bot
trap and stay there forever.  The site owner may eventually notice high CPU use
or log activity, then IP-ban you.

grab-site does not respect `robots.txt` files, because they frequently
[whitelist only approved robots](https://github.com/robots.txt),
[hide pages embarrassing to the site owner](https://web.archive.org/web/20140401024610/http://www.thecrimson.com/robots.txt),
or block image or stylesheet resources needed for proper archival.
[See also](https://www.archiveteam.org/index.php?title=Robots.txt).
Because of this, very rarely you might run into a robot honeypot and receive
an abuse@ complaint.  Your host may require a prompt response to such a complaint
for your server to stay online.  Therefore, we recommend against crawling the
web from a server that hosts your critical infrastructure.

Don't run grab-site on GCE (Google Compute Engine); as happened to me, your
entire API project may get nuked after a few days of crawling the web, with
no recourse.  Good alternatives include OVH ([dedicated servers](https://www.ovh.com/us/dedicated-servers/),
[So You Start](https://www.soyoustart.com/us/essential-servers/),
[Kimsufi](https://www.kimsufi.com/us/en/index.xml)) and online.net's
[dedicated](https://www.online.net/en/dedicated-server) and
[Scaleway](https://www.scaleway.com/) offerings.

### Tips for specific websites

#### Websites requiring login / cookies

Log in to the website in Chrome or Firefox.  Use the cookies.txt extension
([for Chrome](https://github.com/daftano/cookies.txt),
[for Firefox](https://addons.mozilla.org/en-US/firefox/addon/cookies-txt/))
to copy Netscape-format cookies.  Paste the cookies data into a new
file.  Start grab-site with `--wpull-args=--load-cookies=ABSOLUTE_PATH_TO_COOKIES_FILE`.
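For example, with the cookies saved to a file (the URL and path below are just illustrations):

```
grab-site 'https://example.com/members/' \
    --wpull-args=--load-cookies=/home/me/example-cookies.txt
```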
#### Static websites; WordPress blogs; Discourse forums

The defaults usually work fine.

#### Blogger / blogspot.com blogs

The defaults work fine except for blogs with a JavaScript-only Dynamic Views theme.

Some blogspot.com blogs use "[Dynamic Views](https://support.google.com/blogger/answer/1229061?hl=en)"
themes that require JavaScript and serve absolutely no HTML content.  In rare
cases, you can get JavaScript-free pages by appending `?m=1`
([example](https://happinessbeyondthought.blogspot.com/?m=1)).  Otherwise, you
can archive parts of these blogs through Google Cache instead
([example](https://webcache.googleusercontent.com/search?q=cache:http://blog.datomic.com/))
or by using https://archive.is/ instead of grab-site.

#### Tumblr blogs

Either don't crawl from Europe (because tumblr redirects to a GDPR `/privacy/consent` page), or add `Googlebot` to the user agent:

```
--ua "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/70.0 but not really nor Googlebot/2.1"
```

Use [`--igsets=singletumblr`](https://github.com/ArchiveTeam/grab-site/blob/master/libgrabsite/ignore_sets/singletumblr)
to avoid crawling the homepages of other tumblr blogs.

If you don't care about who liked or reblogged a post, add `\?from_c=` to the
crawl's `ignores`.

Some tumblr blogs appear to require JavaScript, but they are actually just
hiding the page content with CSS.  You are still likely to get a complete crawl.
(See the links in the page source for https://X.tumblr.com/archive.)

#### Subreddits

Use [`--igsets=reddit`](https://github.com/ArchiveTeam/grab-site/blob/master/libgrabsite/ignore_sets/reddit)
and add a `/` at the end of the URL to avoid crawling all subreddits.

When crawling a subreddit, you **must** get the casing of the subreddit right
for the recursive crawl to work.  For example,

```
grab-site https://www.reddit.com/r/Oculus/ --igsets=reddit
```

will crawl only a few pages instead of the entire subreddit.  The correct casing is:

```
grab-site https://www.reddit.com/r/oculus/ --igsets=reddit
```

You can hover over the "Hot"/"New"/... links at the top of the page to see the correct casing.

#### Directory listings ("Index of ...")

Use `--no-dupespotter` to avoid triggering false positives on the duplicate
page detector.  Without it, the crawl may miss large parts of the directory tree.

#### Very large websites

Use `--no-offsite-links` to stay on the main website and avoid crawling linked pages on other domains.

#### Websites that are likely to ban you for crawling fast

Use `--concurrency=1 --delay=500-1500`.

#### MediaWiki sites in English

Use [`--igsets=mediawiki`](https://github.com/ArchiveTeam/grab-site/blob/master/libgrabsite/ignore_sets/mediawiki).
Note that this ignore set skips old page revisions.

#### MediaWiki sites in other languages

You will probably have to add ignores with translated `Special:*` URLs based on
[ignore_sets/mediawiki](https://github.com/ArchiveTeam/grab-site/blob/master/libgrabsite/ignore_sets/mediawiki).

#### Forums that aren't Discourse

Forums require more manual intervention with ignore patterns.
[`--igsets=forums`](https://github.com/ArchiveTeam/grab-site/blob/master/libgrabsite/ignore_sets/forums)
is useful for most forums, but you will have to add other ignore
patterns, including one to ignore individual-forum-post pages if there are
too many posts to crawl.  (Generally, crawling the thread pages is enough.)
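For example, for a phpBB-style forum you could add a line like the following to the crawl's `DIR/ignores` (a hypothetical pattern; check the actual URL structure of the forum you are crawling before using it):

```
viewtopic\.php\?.*&p=[0-9]+
```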
#### GitHub issues / pull requests

Find the highest issue number from an issues page ([example](https://github.com/rust-lang/rust/issues)) and use:

```
grab-site --1 https://github.com/rust-lang/rust/issues/{1..30000}
```

This relies on your shell to expand the argument to thousands of arguments.
If there are too many arguments, you may have to write the URLs to a file
and use `grab-site -i` instead:

```
for i in {1..30000}; do echo https://github.com/rust-lang/rust/issues/$i >> .urls; done
grab-site --1 -i .urls
```

#### Websites whose domains have just expired but are still up at the webhost

Use a [DNS history](https://www.google.com/search?q=historical+OR+history+dns)
service to find the old IP address (the DNS "A" record) for the domain.  Add a
line to your `/etc/hosts` to point the domain to the old IP.  Start a crawl
with `--wpull-args=--no-skip-getaddrinfo` to make wpull use `/etc/hosts`.

#### twitter.com/user

Use [snscrape](https://github.com/JustAnotherArchivist/snscrape) to get a list
of tweets for a user.  Redirect `snscrape`'s output to a file with
`> urls` and pass this file to `grab-site --1 -i urls`.

Alternatively, use [webrecorder.io](https://webrecorder.io/) instead of
grab-site.  It has an autoscroll feature and you can download the WARCs.

Keep in mind that scrolling `twitter.com/user` returns a maximum of 3200 tweets,
while a [from:user](https://twitter.com/search?q=from%3Ainternetarchive&src=typd&f=realtime&qf=off&lang=en)
query can return more.



Changing ignores during the crawl
---
While the crawl is running, you can edit `DIR/ignores` and `DIR/igsets`; the
changes will be applied within a few seconds.

`DIR/igsets` is a comma-separated list of ignore sets to use.

`DIR/ignores` is a newline-separated list of [Python 3 regular expressions](https://pythex.org/)
to use in addition to the ignore sets.

You can `rm DIR/igoff` to display all URLs that are being filtered out
by the ignores, and `touch DIR/igoff` to turn this display back off.

Note that ignores will not apply to any of the crawl's start URLs.
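For example, to add a new pattern mid-crawl and watch which URLs it filters (the pattern itself is only an illustration):

```
echo 'action=lostpassword' >> DIR/ignores
rm DIR/igoff      # start printing the URLs now being ignored
touch DIR/igoff   # stop printing them again
```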
Inspecting the URL queue
---
Inspecting the URL queue is usually not necessary, but may be helpful
for adding ignores before grab-site crawls a large number of junk URLs.

To dump the queue, run:

```
gs-dump-urls DIR/wpull.db todo
```

Four other statuses can be used besides `todo`:
`done`, `error`, `in_progress`, and `skipped`.

You may want to pipe the output to `sort` and `less`:

```
gs-dump-urls DIR/wpull.db todo | sort | less -S
```



Preventing a crawl from queuing any more URLs
---
`rm DIR/scrape`.  Responses will no longer be scraped for URLs.  Scraping cannot
be re-enabled for a crawl.



Stopping a crawl
---
You can `touch DIR/stop` or press ctrl-c, which will do the same.  You will
have to wait for the current downloads to finish.



Advanced `gs-server` options
---
These environment variables control what `gs-server` listens on:

*   `GRAB_SITE_INTERFACE` (default `0.0.0.0`)
*   `GRAB_SITE_PORT` (default `29000`)

These environment variables control which server each `grab-site` process connects to:

*   `GRAB_SITE_HOST` (default `127.0.0.1`)
*   `GRAB_SITE_PORT` (default `29000`)
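For example, to run the dashboard on a different port and point a crawl at it (the port value is illustrative):

```
GRAB_SITE_INTERFACE=127.0.0.1 GRAB_SITE_PORT=29001 gs-server
GRAB_SITE_HOST=127.0.0.1 GRAB_SITE_PORT=29001 grab-site 'URL'
```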
Viewing the content in your WARC archives
---

Try [ReplayWeb.page](https://replayweb.page/) or [webrecorder-player](https://github.com/webrecorder/webrecorder-player).



Inspecting WARC files in the terminal
---
`zless` is a wrapper over `less` that can be used to view raw WARC content:

```
zless DIR/FILE.warc.gz
```

`zless -S` will turn off line wrapping.

Note that grab-site requests uncompressed HTTP responses to avoid
double-compression in .warc.gz files and to make zless output more useful.
However, some servers will send compressed responses anyway.



Automatically pausing grab-site processes when free disk is low
---

If you automatically upload and remove finished .warc.gz files, you can still
run into a situation where grab-site processes fill up your disk faster than
your uploader process can handle.  To prevent this situation, you can customize
and run [this script](https://github.com/ArchiveTeam/grab-site/blob/master/extra_docs/pause_resume_grab_sites.sh),
which will pause and resume grab-site processes as your free disk space
crosses a threshold value.



Thanks
---

grab-site is made possible only because of [wpull](https://github.com/chfoo/wpull),
written by [Christopher Foo](https://github.com/chfoo), who spent a year
making something much better than wget.  ArchiveTeam's most pressing
issue with wget at the time was that it kept the entire URL queue in memory
instead of on disk.  wpull has many other advantages over wget, including
better link extraction and Python hooks.

Thanks to [David Yip](https://github.com/yipdw), who created
[ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot).  The wpull
hooks in ArchiveBot served as the basis for grab-site.  The original ArchiveBot
dashboard inspired the newer dashboard now used in both projects.

Thanks to [Falcon Darkstar Momot](https://github.com/falconkirtaran) for
the many wpull 2.x fixes that were rolled into
[ArchiveTeam/wpull](https://github.com/ArchiveTeam/wpull).

Thanks to [JustAnotherArchivist](https://github.com/JustAnotherArchivist)
for investigating my wpull issues.

Thanks to [BrowserStack](https://www.browserstack.com/) for providing free
browser testing for grab-site, which we use to make sure the dashboard works
in various browsers.

[<img src="https://user-images.githubusercontent.com/211271/29110431-887941d2-7cde-11e7-8c2f-199d85c5a3b5.png" height="30" alt="BrowserStack Logo">](https://www.browserstack.com/)



Help
---
grab-site bugs and questions are welcome in
[grab-site/issues](https://github.com/ArchiveTeam/grab-site/issues).

Terminal output in your bug report should be surrounded by triple backquotes, like this:

<pre>
```
very
long
output
```
</pre>

Please report security bugs as regular bugs.


[travis-image]: https://img.shields.io/travis/ArchiveTeam/grab-site.svg
[travis-url]: https://travis-ci.org/ArchiveTeam/grab-site