Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/buren/wayback_archiver
Ruby gem to send URLs to Wayback Machine
https://github.com/buren/wayback_archiver
internet-archive ruby rubygem wayback-archiver wayback-machine
Last synced: 5 days ago
JSON representation
Ruby gem to send URLs to Wayback Machine
- Host: GitHub
- URL: https://github.com/buren/wayback_archiver
- Owner: buren
- License: mit
- Created: 2014-07-17T16:23:51.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2024-12-11T12:46:18.000Z (22 days ago)
- Last Synced: 2024-12-21T13:05:47.518Z (12 days ago)
- Topics: internet-archive, ruby, rubygem, wayback-archiver, wayback-machine
- Language: Ruby
- Homepage: https://rubygems.org/gems/wayback_archiver
- Size: 160 KB
- Stars: 59
- Watchers: 8
- Forks: 7
- Open Issues: 10
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project
README
# WaybackArchiver
Post URLs to [Wayback Machine](https://archive.org/web/) (Internet Archive), using a crawler, from [Sitemap(s)](http://www.sitemaps.org), or a list of URLs.
> The Wayback Machine is a digital archive of the World Wide Web [...]
> The service enables users to see archived versions of web pages across time ...
> \- [Wikipedia](https://en.wikipedia.org/wiki/Wayback_Machine)[![Build Status](https://travis-ci.org/buren/wayback_archiver.svg?branch=master)](https://travis-ci.org/buren/wayback_archiver) [![Code Climate](https://codeclimate.com/github/buren/wayback_archiver.png)](https://codeclimate.com/github/buren/wayback_archiver) [![Docs badge](https://inch-ci.org/github/buren/wayback_archiver.svg?branch=master)](http://www.rubydoc.info/github/buren/wayback_archiver/master) [![Gem Version](https://badge.fury.io/rb/wayback_archiver.svg)](http://badge.fury.io/rb/wayback_archiver)
__Index__
* [Installation](#installation)
* [Usage](#usage)
- [Ruby](#ruby)
- [CLI](#cli)
* [Configuration](#configuration)
* [RubyDoc](#docs)
* [Contributing](#contributing)
* [MIT License](#license)
* [References](#references)## Installation
Install the gem:
```
$ gem install wayback_archiver
```Or add this line to your application's Gemfile:
```ruby
gem 'wayback_archiver'
```And then execute:
```
$ bundle
```## Usage
* [Ruby](#ruby)
* [CLI](#cli)__Strategies__:
* `auto` (the default) - Will try to
1. Find Sitemap(s) defined in `/robots.txt`
2. Then in common sitemap locations `/sitemap-index.xml`, `/sitemap.xml` etc.
3. Fallback to crawling (using the excellent [spidr](https://github.com/postmodern/spidr/) gem)
* `sitemap` - Parse Sitemap(s), supports [index files](https://www.sitemaps.org/protocol.html#index) (and gzip)
* `urls` - Post URL(s)## Ruby
First require the gem
```ruby
require 'wayback_archiver'
```_Examples_:
Auto
```ruby
# auto is the default
WaybackArchiver.archive('example.com')# or explicitly
WaybackArchiver.archive('example.com', strategy: :auto)
```Crawl
```ruby
WaybackArchiver.archive('example.com', strategy: :crawl)
```Only send one single URL
```ruby
WaybackArchiver.archive('example.com', strategy: :url)
```Send multiple URLs
```ruby
WaybackArchiver.archive(%w[example.com www.example.com], strategy: :urls)
```Send all URL(s) found in Sitemap
```ruby
WaybackArchiver.archive('example.com/sitemap.xml', strategy: :sitemap)# works with Sitemap index files too
WaybackArchiver.archive('example.com/sitemap-index.xml.gz', strategy: :sitemap)
```Specify concurrency
```ruby
WaybackArchiver.archive('example.com', strategy: :auto, concurrency: 10)
```Specify max number of URLs to be archived
```ruby
WaybackArchiver.archive('example.com', strategy: :auto, limit: 10)
```Each archive strategy can receive a block that will be called for each URL
```ruby
WaybackArchiver.archive('example.com', strategy: :auto) do |result|
if result.success?
puts "Successfully archived: #{result.archived_url}"
else
puts "Error (HTTP #{result.code}) when archiving: #{result.archived_url}"
end
end
```Use your own adapter for posting found URLs
```ruby
WaybackArchiver.adapter = ->(url) { puts url } # whatever that responds to #call
```## CLI
__Usage__:
```
wayback_archiver [] [options]
```Print full usage instructions
```
wayback_archiver --help
```_Examples_:
Auto
```
# auto is the default
wayback_archiver example.com# or explicitly
wayback_archiver example.com --auto
```Crawl
```bash
wayback_archiver example.com --crawl
```Only send one single URL
```bash
wayback_archiver example.com --url
```Send multiple URLs
```bash
wayback_archiver example.com www.example.com --urls
```Crawl multiple URLs
```bash
wayback_archiver example.com www.example.com --crawl
```Send all URL(s) found in Sitemap
```bash
wayback_archiver example.com/sitemap.xml# works with Sitemap index files too
wayback_archiver example.com/sitemap-index.xml.gz
```Most options
```bash
wayback_archiver example.com www.example.com --auto --concurrency=10 --limit=100 --log=output.log --verbose
```View archive: [https://web.archive.org/web/*/http://example.com](https://web.archive.org/web/*/http://example.com) (replace `http://example.com` with to your desired domain).
## Configuration
:information_source: By default `wayback_archiver` doesn't respect robots.txt files, see [this Internet Archive blog post](https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/) for more information.
Configuration (the below values are the defaults)
```ruby
WaybackArchiver.concurrency = 1
WaybackArchiver.user_agent = WaybackArchiver::USER_AGENT
WaybackArchiver.respect_robots_txt = WaybackArchiver::DEFAULT_RESPECT_ROBOTS_TXT
WaybackArchiver.logger = Logger.new(STDOUT)
WaybackArchiver.max_limit = -1 # unlimited
WaybackArchiver.adapter = WaybackArchiver::WaybackMachine # must implement #call(url)
```For a more verbose log you can configure `WaybackArchiver` as such:
```ruby
WaybackArchiver.logger = Logger.new(STDOUT).tap do |logger|
logger.progname = 'WaybackArchiver'
logger.level = Logger::DEBUG
end
```_Pro tip_: If you're using the gem in a Rails app you can set `WaybackArchiver.logger = Rails.logger`.
## Docs
You can find the docs online on [RubyDoc](http://www.rubydoc.info/github/buren/wayback_archiver/master).
This gem is documented using `yard` (run from the root of this repository).
```bash
yard # Generates documentation to doc/
```## Contributing
Contributions, feedback and suggestions are very welcome.
1. Fork it
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Commit your changes (`git commit -am 'Add some feature'`)
4. Push to the branch (`git push origin my-new-feature`)
5. Create new Pull Request## License
[MIT License](LICENSE)
## References
* Don't know what the Wayback Machine (Internet Archive) is? [Wayback Machine](https://archive.org/web/)
* Don't know what a Sitemap is? [sitemaps.org](http://www.sitemaps.org)
* Don't know what robot.txt is? [www.robotstxt.org](http://www.robotstxt.org/robotstxt.html)