https://github.com/ArchiveTeam/ArchiveBot

ArchiveBot, an IRC bot for archiving websites

Host: GitHub
URL: https://github.com/ArchiveTeam/ArchiveBot
Owner: ArchiveTeam
License: mit
Created: 2013-09-06T05:13:57.000Z (over 12 years ago)
Default Branch: master
Last Pushed: 2024-09-23T04:18:10.000Z (over 1 year ago)
Last Synced: 2024-10-30T00:55:21.716Z (over 1 year ago)
Topics: archiving, haxe, irc, javascript, python, ruby
Language: Python
Homepage: http://www.archiveteam.org/index.php?title=ArchiveBot
Size: 2.73 MB
Stars: 357
Watchers: 27
Forks: 71
Open Issues: 172
Metadata Files:
- Readme: README
- License: LICENSE

Awesome Lists containing this project

awesome-github-repos - ArchiveTeam/ArchiveBot - ArchiveBot, an IRC bot for archiving websites (Python)

README

1. ArchiveBot

Coders, I have a question.
Or, a request, etc.
I spent some time with xmc discussing something we could
do to make things easier around here.
What we came up with is a trigger for a bot, which can
be triggered by people with ops.
You tell it a website. It crawls it. WARC. Uploads it to
archive.org. Boom.
I can supply machine as needed.
Obviously there's some sanitation issues, and it is root
all the way down or nothing.
I think that would help a lot for smaller sites
Sites where it's 100 pages or 1000 pages even, pretty
simple.
And just being able to go "bot, get a sanity dump"

2. More info

ArchiveBot has two major backend components: the control node, which
runs the IRC interface and bookkeeping programs, and the crawlers, which
do all the Web crawling. ArchiveBot users communicate with ArchiveBot
by issuing commands in an IRC channel.

User's guide: http://archivebot.readthedocs.org/en/latest/
Control node installation guide: INSTALL.backend
Crawler installation guide: INSTALL.pipeline
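
The control node / crawler split described in this section can be
sketched roughly as follows. This is only an illustration, not
ArchiveBot's actual code: the "!archive <url>" command syntax, the Job
class, and the in-memory queue standing in for the link between the two
components are all assumptions made for the example. See the user's
guide for the real commands.

```python
import queue
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Job:
    """Hypothetical record of one crawl request."""
    url: str
    requested_by: str


def parse_command(nick: str, message: str) -> Optional[Job]:
    """Turn a chat line into a Job, or None if it is not a crawl request.

    The "!archive <url>" syntax is assumed here for illustration.
    """
    parts = message.strip().split()
    if (
        len(parts) == 2
        and parts[0] == "!archive"
        and parts[1].startswith(("http://", "https://"))
    ):
        return Job(url=parts[1], requested_by=nick)
    return None


def control_node(lines: Iterable[Tuple[str, str]], jobs: queue.Queue) -> None:
    """Control-node side: read (nick, message) pairs and enqueue valid jobs."""
    for nick, message in lines:
        job = parse_command(nick, message)
        if job is not None:
            jobs.put(job)


def crawler(jobs: queue.Queue) -> List[str]:
    """Crawler side: drain the queue and report what was handled.

    A real crawler would fetch pages and write WARC files at this point;
    this stand-in only records the job.
    """
    done = []
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            break
        done.append(f"crawled {job.url} for {job.requested_by}")
    return done
```

Feeding control_node a list such as
[("alice", "!archive http://example.com/"), ("bob", "hello")] enqueues
one job, which crawler then picks up. Keeping the two sides separate,
joined only by a queue, mirrors the idea that crawling can run on
different machines from the IRC interface.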

3. Local use

ArchiveBot was originally written as a set of separate programs for
deployment on a server. This means it has a poor distribution story.
However, Ivan Kozik (@ivan) has taken the ArchiveBot pipeline,
dashboard, ignores, and control system and created a package intended for
personal use. You can find it at https://github.com/ArchiveTeam/grab-site.

4. License

ArchiveBot is distributed under the MIT license; see the LICENSE file
for details.

5. Acknowledgments

Thanks to Alard (@alard), who added WARC generation and Lua scripting to
GNU Wget. Wget+lua was the first web crawler used by ArchiveBot.

Thanks to Christopher Foo (@chfoo) for wpull, ArchiveBot's current web
crawler.

Thanks to Ivan Kozik (@ivan) for maintaining ignore patterns and
tracking down performance problems at scale.

Other thanks go to the following projects:

* Celluloid
* Cinch
* CouchDB
* Ember.js
* Redis
* Seesaw

6. Special thanks

Dragonette, Barnaby Bright, Vienna Teng, NONONO.

The memory hole of the Web has gone too far.
Don't look down, never look away; ArchiveBot's like the wind.

vim:ts=2:sw=2:tw=72:et
