https://github.com/kanety/kudzu
- Host: GitHub
- URL: https://github.com/kanety/kudzu
- Owner: kanety
- License: mit
- Created: 2017-11-18T13:19:47.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2024-10-17T05:51:44.000Z (3 months ago)
- Last Synced: 2024-10-19T07:04:45.009Z (3 months ago)
- Language: Ruby
- Size: 162 KB
- Stars: 1
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
# Kudzu
A simple web crawler for Ruby.
## Features
* Run single-threaded or multi-threaded.
* Pool HTTP connections.
* Restrict links by URL-based patterns.
* Respect robots.txt.
* Store page contents via adapter.
## Dependencies
* Ruby 2.5+
* libicu
## Installation
Add to your application's Gemfile:
```ruby
gem 'kudzu'
```
Then run:

    $ bundle install

## Usage
Crawl HTML files on `example.com`:
```ruby
# Configure the crawler and its link filter.
crawler = Kudzu::Crawler.new do
  user_agent 'YOUR_AWESOME_APP'
  add_filter do
    focus_host true
    allow_mime_type %w(text/html)
  end
end

# Start from the seed URL and print the URL of each successfully fetched page.
crawler.run('http://example.com/') do
  on_success do |page, link|
    puts page.url
  end
end
```
## Adapters
By default, this gem crawls in memory only. Use the following adapter to persist page contents:
* [kudzu-adapter-active_record](https://github.com/kanety/kudzu-adapter-active_record)
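
When only the list of crawled URLs needs to outlive a run, a full adapter may be more than is required. The following is a minimal sketch of that idea in plain Ruby. It is not part of Kudzu and does not use Kudzu's adapter interface: `UrlLog`, `record`, and `urls` are hypothetical names introduced here for illustration.

```ruby
require 'set'

# Hypothetical helper, not part of Kudzu: appends each distinct URL to a
# plain-text file so the list is still available after the process exits.
class UrlLog
  def initialize(path)
    @path = path
    @lock = Mutex.new
    # Reload URLs written by earlier runs so they are not recorded twice.
    @seen = File.exist?(path) ? Set.new(File.readlines(path, chomp: true)) : Set.new
  end

  # Records the URL once. Returns true if it was newly written, false if
  # it had already been recorded.
  def record(url)
    @lock.synchronize do
      next false unless @seen.add?(url)
      File.open(@path, 'a') { |f| f.puts(url) }
      true
    end
  end

  # Returns all URLs recorded so far.
  def urls
    @lock.synchronize { @seen.to_a }
  end
end
```

A `UrlLog` instance could then be called as `log.record(page.url)` from the `on_success` block shown in Usage, assuming that block can see the local variable `log`. The mutex is there because the crawler can run multi-threaded. Page bodies are not stored, so the adapter above remains the way to persist page contents.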
## Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/kanety/kudzu. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
## License
The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).