{"id":13571582,"url":"https://github.com/buren/wayback_archiver","last_synced_at":"2025-04-04T14:07:48.943Z","repository":{"id":18737471,"uuid":"21949146","full_name":"buren/wayback_archiver","owner":"buren","description":"Ruby gem to send URLs to Wayback Machine","archived":false,"fork":false,"pushed_at":"2024-12-11T12:46:18.000Z","size":164,"stargazers_count":59,"open_issues_count":10,"forks_count":7,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-03-28T13:08:59.749Z","etag":null,"topics":["internet-archive","ruby","rubygem","wayback-archiver","wayback-machine"],"latest_commit_sha":null,"homepage":"https://rubygems.org/gems/wayback_archiver","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/buren.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2014-07-17T16:23:51.000Z","updated_at":"2025-03-04T22:22:53.000Z","dependencies_parsed_at":"2025-01-18T17:11:35.828Z","dependency_job_id":"a0d74651-7194-4e1c-939f-b4cdfcea26a9","html_url":"https://github.com/buren/wayback_archiver","commit_stats":{"total_commits":92,"total_committers":5,"mean_commits":18.4,"dds":0.08695652173913049,"last_synced_commit":"6b3447d932cc30f3d621421b993d191dca9d821e"},"previous_names":[],"tags_count":14,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/buren%2Fwayback_archiver","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/buren%2Fwayback_archiver/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposit
ories/buren%2Fwayback_archiver/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/buren%2Fwayback_archiver/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/buren","download_url":"https://codeload.github.com/buren/wayback_archiver/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247190250,"owners_count":20898702,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["internet-archive","ruby","rubygem","wayback-archiver","wayback-machine"],"created_at":"2024-08-01T14:01:03.495Z","updated_at":"2025-04-04T14:07:48.923Z","avatar_url":"https://github.com/buren.png","language":"Ruby","funding_links":[],"categories":["Ruby"],"sub_categories":[],"readme":"# WaybackArchiver\n\nPost URLs to [Wayback Machine](https://archive.org/web/) (Internet Archive), using a crawler, from [Sitemap(s)](http://www.sitemaps.org), or a list of URLs.\n\n\u003e The Wayback Machine is a digital archive of the World Wide Web [...]\n\u003e The service enables users to see archived versions of web pages across time ...  
\n\u003e \\- [Wikipedia](https://en.wikipedia.org/wiki/Wayback_Machine)\n\n[![Build Status](https://travis-ci.org/buren/wayback_archiver.svg?branch=master)](https://travis-ci.org/buren/wayback_archiver) [![Code Climate](https://codeclimate.com/github/buren/wayback_archiver.png)](https://codeclimate.com/github/buren/wayback_archiver) [![Docs badge](https://inch-ci.org/github/buren/wayback_archiver.svg?branch=master)](http://www.rubydoc.info/github/buren/wayback_archiver/master) [![Gem Version](https://badge.fury.io/rb/wayback_archiver.svg)](http://badge.fury.io/rb/wayback_archiver)\n\n__Index__\n\n* [Installation](#installation)\n* [Usage](#usage)\n  - [Ruby](#ruby)\n  - [CLI](#cli)\n* [Configuration](#configuration)\n* [RubyDoc](#docs)\n* [Contributing](#contributing)\n* [MIT License](#license)\n* [References](#references)\n\n## Installation\n\nInstall the gem:\n```\n$ gem install wayback_archiver\n```\n\nOr add this line to your application's Gemfile:\n\n```ruby\ngem 'wayback_archiver'\n```\n\nAnd then execute:\n\n```\n$ bundle\n```\n\n## Usage\n\n* [Ruby](#ruby)\n* [CLI](#cli)\n\n__Strategies__:\n\n* `auto` (the default) - Will try to\n    1. Find Sitemap(s) defined in `/robots.txt`\n    2. Then in common sitemap locations `/sitemap-index.xml`, `/sitemap.xml` etc.\n    3. 
Fallback to crawling (using the excellent [spidr](https://github.com/postmodern/spidr/) gem)\n* `sitemap` - Parse Sitemap(s), supports [index files](https://www.sitemaps.org/protocol.html#index) (and gzip)\n* `urls` - Post URL(s)\n\n## Ruby\n\nFirst require the gem\n\n```ruby\nrequire 'wayback_archiver'\n```\n\n_Examples_:\n\nAuto\n\n```ruby\n# auto is the default\nWaybackArchiver.archive('example.com')\n\n# or explicitly\nWaybackArchiver.archive('example.com', strategy: :auto)\n```\n\nCrawl\n\n```ruby\nWaybackArchiver.archive('example.com',  strategy: :crawl)\n```\n\nOnly send one single URL\n\n```ruby\nWaybackArchiver.archive('example.com', strategy: :url)\n```\n\nSend multiple URLs\n\n```ruby\nWaybackArchiver.archive(%w[example.com www.example.com], strategy: :urls)\n```\n\nSend all URL(s) found in Sitemap\n\n```ruby\nWaybackArchiver.archive('example.com/sitemap.xml', strategy: :sitemap)\n\n# works with Sitemap index files too\nWaybackArchiver.archive('example.com/sitemap-index.xml.gz', strategy: :sitemap)\n```\n\nSpecify concurrency\n\n```ruby\nWaybackArchiver.archive('example.com', strategy: :auto, concurrency: 10)\n```\n\nSpecify max number of URLs to be archived\n\n```ruby\nWaybackArchiver.archive('example.com', strategy: :auto, limit: 10)\n```\n\nEach archive strategy can receive a block that will be called for each URL\n\n```ruby\nWaybackArchiver.archive('example.com', strategy: :auto) do |result|\n  if result.success?\n    puts \"Successfully archived: #{result.archived_url}\"\n  else\n    puts \"Error (HTTP #{result.code}) when archiving: #{result.archived_url}\"\n  end\nend\n```\n\nUse your own adapter for posting found URLs\n\n```ruby\nWaybackArchiver.adapter = -\u003e(url) { puts url } # whatever that responds to #call\n```\n\n## CLI\n\n__Usage__:\n\n```\nwayback_archiver [\u003curl\u003e] [options]\n```\n\nPrint full usage instructions\n\n```\nwayback_archiver --help\n```\n\n_Examples_:\n\nAuto\n\n```\n# auto is the default\nwayback_archiver 
example.com\n\n# or explicitly\nwayback_archiver example.com --auto\n```\n\nCrawl\n\n```bash\nwayback_archiver example.com --crawl\n```\n\nSend a single URL\n\n```bash\nwayback_archiver example.com --url\n```\n\nSend multiple URLs\n\n```bash\nwayback_archiver example.com www.example.com --urls\n```\n\nCrawl multiple URLs\n\n```bash\nwayback_archiver example.com www.example.com --crawl\n```\n\nSend all URL(s) found in Sitemap\n\n```bash\nwayback_archiver example.com/sitemap.xml\n\n# works with Sitemap index files too\nwayback_archiver example.com/sitemap-index.xml.gz\n```\n\nMost options\n\n```bash\nwayback_archiver example.com www.example.com --auto --concurrency=10 --limit=100 --log=output.log --verbose\n```\n\nView archive: [https://web.archive.org/web/*/http://example.com](https://web.archive.org/web/*/http://example.com) (replace `http://example.com` with your desired domain).\n\n## Configuration\n\n:information_source: By default `wayback_archiver` doesn't respect robots.txt files; see [this Internet Archive blog post](https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/) for more information.\n\nConfiguration (the values below are the defaults)\n\n```ruby\nWaybackArchiver.concurrency = 1\nWaybackArchiver.user_agent = WaybackArchiver::USER_AGENT\nWaybackArchiver.respect_robots_txt = WaybackArchiver::DEFAULT_RESPECT_ROBOTS_TXT\nWaybackArchiver.logger = Logger.new(STDOUT)\nWaybackArchiver.max_limit = -1 # unlimited\nWaybackArchiver.adapter = WaybackArchiver::WaybackMachine # must implement #call(url)\n```\n\nFor a more verbose log you can configure `WaybackArchiver` as such:\n\n```ruby\nWaybackArchiver.logger = Logger.new(STDOUT).tap do |logger|\n  logger.progname = 'WaybackArchiver'\n  logger.level = Logger::DEBUG\nend\n```\n\n_Pro tip_: If you're using the gem in a Rails app you can set `WaybackArchiver.logger = Rails.logger`.\n\n## Docs\n\nYou can find the docs online on 
[RubyDoc](http://www.rubydoc.info/github/buren/wayback_archiver/master).\n\nThis gem is documented using `yard`. To generate the docs, run it from the root of this repository:\n\n```bash\nyard # Generates documentation to doc/\n```\n\n## Contributing\n\nContributions, feedback and suggestions are very welcome.\n\n1. Fork it\n2. Create your feature branch (`git checkout -b my-new-feature`)\n3. Commit your changes (`git commit -am 'Add some feature'`)\n4. Push to the branch (`git push origin my-new-feature`)\n5. Create a new Pull Request\n\n## License\n\n[MIT License](LICENSE)\n\n## References\n\n* Don't know what the Wayback Machine (Internet Archive) is? [Wayback Machine](https://archive.org/web/)\n* Don't know what a Sitemap is? [sitemaps.org](http://www.sitemaps.org)\n* Don't know what robots.txt is? [www.robotstxt.org](http://www.robotstxt.org/robotstxt.html)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fburen%2Fwayback_archiver","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fburen%2Fwayback_archiver","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fburen%2Fwayback_archiver/lists"}