{"id":21035816,"url":"https://github.com/archiveteam/ia.bak","last_synced_at":"2026-03-11T14:07:57.708Z","repository":{"id":29684022,"uuid":"33226509","full_name":"ArchiveTeam/IA.BAK","owner":"ArchiveTeam","description":"We back up a lot of stuff from around the web; now it's time to back up the Internet Archive, just in case.","archived":false,"fork":false,"pushed_at":"2020-07-13T10:49:34.000Z","size":491,"stargazers_count":89,"open_issues_count":18,"forks_count":22,"subscribers_count":25,"default_branch":"master","last_synced_at":"2025-05-07T23:27:26.595Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ArchiveTeam.png","metadata":{"files":{"readme":"README.md","changelog":"change-email","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-04-01T04:26:08.000Z","updated_at":"2025-03-19T20:16:34.000Z","dependencies_parsed_at":"2022-09-05T15:41:53.127Z","dependency_job_id":null,"html_url":"https://github.com/ArchiveTeam/IA.BAK","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveTeam%2FIA.BAK","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveTeam%2FIA.BAK/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveTeam%2FIA.BAK/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveTeam%2FIA.BAK/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ArchiveTeam","download_url":"https://codeload.github.com/ArchiveTeam/IA.BAK/tar.gz/refs/heads/master","host":{
"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254358717,"owners_count":22057960,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-19T13:16:24.732Z","updated_at":"2026-03-11T14:07:57.676Z","avatar_url":"https://github.com/ArchiveTeam.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"Client scripts for\n\u003chttp://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation\u003e\n\nYou can use git-annex commands by hand, if you prefer, but the `iabak`\nscript automates several things for you.\n\nClone this git repository to somewhere that has a lot of disk space,\nand run the `iabak` script to get started.\n\nYou can stop running it once it's downloaded enough. Just hit ctrl-C at any\ntime.\n\nThis script has been tested on:\n\n* Linux (any not too minimal distribution)\n* OSX\n\n## System Requirements\n\n* RAM: 2GB (git-annex will need some ram while verifying the files you've downloaded.)\n* CPU: yes (the faster it can hash a file the better)\n\n## care and feeding of your backup\n\nTo be sure that your backup still exists and is still in good shape,\nyou shoud periodically run either `iabak` or `iabak-cronjob` or\nboth. Either of these will check back in and verify that your repo\nexists. 
The difference is that `iabak-cronjob` avoids downloading any\nmore data from the IA, avoids verifying the checksums of the files you\nare storing, and logs to `iabak-cronjob.log`.\n\nWe recommend setting up a cron job that runs one of these at least once per\nweek, so we can notice when repositories go missing or develop problems.\n\nFor example, to run it at 10:30am on Mondays, put this in crontab:\n\n\t30 10 * * 1 /path/to/IA.BAK/iabak-cronjob\n\nThe `install-fsck-service` script installs a systemd timer or cron job that will run\n`iabak-cronjob` once a day. This is now set up automatically the first time\n`iabak` is run.\n\n## checking out additional shards\n\nRunning `iabak` will check out one shard of the IA at a time. Once it\nfinishes the current shard, if you have more disk space available, it will\nfind and check out another shard.\n\nTo manually check out a particular shard, you can run the\n`checkoutshard` script, passing it the name of a shard, such as \"shard3\".\nSee the `repolist` file for a list of shards and their status.\n\nOnce you have multiple shards checked out, the next time you run `iabak`,\nit will process all of them.\n\n## flag files\n\nYou can touch these files in the IA.BAK directory to control `iabak`.\n\n* `NOSHUF`\n\tPrevents shuffling files before downloading.\n* `NOMORE`\n\tPrevents `iabak` from checking out additional shards as existing\n\tshards complete.\n\nAlso, these files in the IA.BAK directory can have values written\nto them to tune its behavior.\n\n* `ANNEXGETOPTS`\n\tOptions passed to `git annex get`.\n\tThis is useful to enable concurrent downloads of multiple files.\n\tFor example, \"-J10\" allows up to 10 concurrent downloads.\n* `FSCKTIMELIMIT`\n\tLimits how much time is spent verifying checksums of\n\tfiles in your backup. 
The default is \"5h\", which means\n\tit will spend up to 5 hours per shard per run of `iabak`.\n\tFeel free to set this to a smaller time limit like \"1h\" or \"30m\".\n\t(Note that `iabak-cronjob` does not perform these expensive fscks.)\n\n\tThe goal is to verify the checksum of each file\n\tin your backup once per month. If it's interrupted by this time\n\tlimit, or just by your ctrl-C, it will pick up next time where it\n\tleft off. Once it's verified all files, it will avoid doing\n\tany more checksumming until the next month.\n\n## tuning resource usage\n\nSo you want to back up part of the IA, but don't want this to take over\nyour whole disk or internet pipe? Here are some tuning options you can use.\nRun these commands in git repos like IA.BAK/shard1, IA.BAK/shard2, etc.\n\n* `git config annex.diskreserve 200GB`\n\tThis will prevent git-annex from using up the last 200GB of your disk.\n\tAdjust to suit. You are prompted for this value the first time you run `iabak`, and it\n\tis automatically propagated to each new shard.\n\n* `git config annex.web-options --limit-rate=200k`\n\tThis will limit wget/curl to downloading at 200 KB/s. Adjust to suit.\n\n\tNote that if concurrent downloads are enabled, each download thread will\n\tuse up to this rate limit.\n\n## instructions for earlier users\n\nIf you cloned shard1 by hand before, here's how to convert to managing it\nwith iabak.\n\n1. Clone this repo to the same drive you cloned shard1 to before.\n2. Stop any running git-annex process.\n3. Move the shard1 repo to IA.BAK/shard1\n4. Go to IA.BAK, and run `./iabak`\n\n## FAQ\n\n* `Can I run this on BSD?`\n\tNot without some serious work. You'll need /bin/bash, GNU awk, and possibly other things I can't think of off the top of my head. Join the IRC channel and chat with other BSD users; they may have more up-to-date information.\n\n* `Can I store the backups on an NFS or SMB filesystem?`\n\tKinda. If you're using SMB then you're on your own (but do send us a pull request). 
If you're using NFS then you'll have to install git-annex manually (as the default install tarball uses symlinks), and you'll have to add \"-c annex.sshcaching=false\" to the ANNEXGETOPTS file so that git-annex doesn't try to create unix sockets on your NFS filesystem.\n\n* `I keep seeing this error message from git-annex: \"Unable to access these remotes: web\"; what do I do?`\n\tThis indicates that a file used to exist on archive.org but has since been hidden for one reason or another. The message will also list which git remotes are believed to contain the file; if remotes other than the web remote are listed then you could contact that user and arrange for access to the file. The best way to do this is to set up mutual SSH access.\n\n* `What do I do when git-annex tells me \"verification of content failed\"?`\n\tThis means that git-annex tried to verify the content of a file it has downloaded, but failed to do so. Most likely the file has changed since we first added it to the shard. This is most common with torrent files and the *_meta.xml files.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farchiveteam%2Fia.bak","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farchiveteam%2Fia.bak","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farchiveteam%2Fia.bak/lists"}