{"id":15655235,"url":"https://github.com/edsu/bagweb","last_synced_at":"2025-05-02T15:36:44.339Z","repository":{"id":66762460,"uuid":"50777034","full_name":"edsu/bagweb","owner":"edsu","description":"mirror a website, put it in a bag","archived":false,"fork":false,"pushed_at":"2022-12-18T15:46:40.000Z","size":10,"stargazers_count":25,"open_issues_count":4,"forks_count":8,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-05-01T14:54:33.502Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/edsu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-01-31T13:58:18.000Z","updated_at":"2023-09-05T16:18:56.000Z","dependencies_parsed_at":"2023-07-23T21:45:46.944Z","dependency_job_id":null,"html_url":"https://github.com/edsu/bagweb","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edsu%2Fbagweb","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edsu%2Fbagweb/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edsu%2Fbagweb/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edsu%2Fbagweb/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/edsu","download_url":"https://codeload.github.com/edsu/bagweb/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":2518
95424,"owners_count":21661342,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-03T12:57:15.078Z","updated_at":"2025-05-01T14:54:39.640Z","avatar_url":"https://github.com/edsu.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# bagweb ~ 👜🕸\n\nActive projects need active websites. An active website is one that can be\nupdated quickly and easily, which often means installing a CMS of some kind\n(WordPress, MediaWiki, Drupal, Rails, etc). But projects end, and so does the\nwrite activity on a website. People may still want to look at the website as a\nrecord of what happened, but they are less interested in contributing new\ncontent, since the project is, well, [finished].\n\nTo keep the record of a project you are stuck keeping the CMS software up to\ndate so it doesn't get hacked, and making sure the database is upgraded and\nbacked up. As the Web gets older, this problem gets [worse].  Wouldn't it be\nnice to replace the once dynamic site with a static version that wouldn't\nrequire software updates of any kind?\n\nbagweb is a utility script for mirroring a website, creating a WARC file, and\nwriting the data into a [BagIt] package for your data archive. The heavy lifting\nis done by [wget] and [bagit.py] so you'll need to have those installed.  
bagweb\nreally isn't anything special; it's just a way of remembering a sequence of\ncommand-line interactions, and perhaps a modest preservation pattern for the\nWeb.\n\n    % bagweb http://mith.umd.edu/apiworkshop/ apiworkshop\n\n    Crawling http://mith.umd.edu/apiworkshop/\n\n    Finished, see /Users/ed/Projects/bagweb/apiworkshop.log for details.\n\n    You may want to record additional provenance in\n    /Users/ed/Projects/bagweb/apiworkshop/bag-info.txt\n\n    % tree apiworkshop\n    apiworkshop\n    ├── bag-info.txt\n    ├── bagit.txt\n    ├── data\n    │   ├── apiworkshop.warc.gz\n    │   └── mith.umd.edu.tar.gz\n    ├── manifest-md5.txt\n    └── tagmanifest-md5.txt\n\nYou can take this bag and put it wherever you like to keep track of data.\nThen you can scp the snapshot file, in this case\n`apiworkshop/data/mith.umd.edu.tar.gz`, and unpack it in place of the previously\ninstalled CMS.\n\nFollow these steps to test your website tarball:\n\n1. tar -xzf mith.umd.edu.tar.gz\n2. docker run -v `pwd`:/usr/local/apache2/htdocs -p 8080:80 httpd\n3. disconnect from the Internet (disable wifi, remove ethernet cable, etc.)\n4. open http://localhost:8080/\n\nCAVEAT: If the website being archived has a lot of dynamic AJAX stuff going on,\nthe mirror copy may not be perfect, because wget doesn't execute JavaScript. But\nit may work well enough for you, considering the alternatives.\n\n[BagIt]: https://en.wikipedia.org/wiki/BagIt\n[bagit.py]: https://github.com/libraryofcongress/bagit-python\n[wget]: https://www.gnu.org/software/wget/\n[worse]: http://www.newyorker.com/magazine/2015/01/26/cobweb\n[finished]: https://www.youtube.com/watch?v=4vuW6tQ0218\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fedsu%2Fbagweb","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fedsu%2Fbagweb","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fedsu%2Fbagweb/lists"}