https://github.com/chriskite/anemone
Anemone web-spider framework
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/chriskite/anemone
- Owner: chriskite
- License: mit
- Created: 2009-04-14T18:31:48.000Z (over 15 years ago)
- Default Branch: next
- Last Pushed: 2020-03-20T11:27:38.000Z (over 4 years ago)
- Last Synced: 2024-04-04T09:02:45.166Z (7 months ago)
- Language: Ruby
- Homepage: http://anemone.rubyforge.org
- Size: 577 KB
- Stars: 1,614
- Watchers: 63
- Forks: 329
- Open Issues: 53
Metadata Files:
- Readme: README.rdoc
- Changelog: CHANGELOG.rdoc
- License: LICENSE.txt
README
= Anemone
Anemone is a web spider framework that can spider a domain and collect useful
information about the pages it visits. It is versatile, allowing you to
write your own specialized spider tasks quickly and easily.

See http://anemone.rubyforge.org for more information.

== Features
* Multi-threaded design for high performance
* Tracks 301 HTTP redirects
* Built-in BFS algorithm for determining page depth
* Allows exclusion of URLs based on regular expressions
* Choose the links to follow on each page with focus_crawl()
* HTTPS support
* Records response time for each page
* CLI program can list all pages in a domain, calculate page depths, and more
* Obeys robots.txt
* In-memory or persistent storage of pages during crawl, using TokyoCabinet, SQLite3, MongoDB, or Redis

== Examples
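As a rough illustration of two of the features listed above (breadth-first page depth and regex-based URL exclusion), here is a minimal sketch in plain Ruby. It does not use Anemone and is not Anemone's implementation: the +LINKS+ graph, the +page_depths+ method, and its +skip+ parameter are hypothetical names invented for this example, and an in-memory hash stands in for real HTTP fetches.

```ruby
# Hypothetical in-memory link graph standing in for fetched pages:
# each key is a page URL, each value is the list of links on that page.
LINKS = {
  'http://example.com/'          => ['http://example.com/a', 'http://example.com/b'],
  'http://example.com/a'         => ['http://example.com/c', 'http://example.com/private/x'],
  'http://example.com/b'         => ['http://example.com/c'],
  'http://example.com/c'         => ['http://example.com/'],
  'http://example.com/private/x' => []
}.freeze

# Breadth-first traversal from root, returning a Hash of URL => depth.
# A URL matching any Regexp in skip is never enqueued, so neither it
# nor anything reachable only through it receives a depth.
def page_depths(links, root, skip: [])
  depths = { root => 0 }
  queue = [root]
  until queue.empty?
    url = queue.shift
    links.fetch(url, []).each do |link|
      next if depths.key?(link)                        # already visited
      next if skip.any? { |pattern| link =~ pattern }  # excluded by regex
      depths[link] = depths[url] + 1
      queue << link
    end
  end
  depths
end

DEPTHS = page_depths(LINKS, 'http://example.com/', skip: [/private/])
DEPTHS.each { |url, depth| puts "#{depth} #{url}" }
```

Because the queue is processed first-in, first-out, each page is assigned the depth of the shortest link path from the root, which is what makes breadth-first search a natural fit for page-depth calculation.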
See the scripts under the lib/anemone/cli directory for examples of several useful Anemone tasks.

== Requirements
* nokogiri
* robots

== Development
To test and develop this gem, additional requirements are:
* rspec
* fakeweb
* tokyocabinet
* kyotocabinet-ruby
* mongo
* redis
* sqlite3

You will need to have KyotoCabinet, {Tokyo Cabinet}[http://fallabs.com/tokyocabinet/], {MongoDB}[http://www.mongodb.org/], and {Redis}[http://code.google.com/p/redis/] installed on your system and running.