Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/cibernox/crawlette
Very simple web crawler.
- Host: GitHub
- URL: https://github.com/cibernox/crawlette
- Owner: cibernox
- License: MIT
- Created: 2014-09-14T19:47:13.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2014-09-14T19:49:35.000Z (over 10 years ago)
- Last Synced: 2024-10-29T20:12:41.729Z (3 months ago)
- Language: Ruby
- Size: 129 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README
# Crawlette
Very simple command-line utility that crawls a URL and shows the links and assets of each page.
## Installation
```
$ gem install crawlette
```
## Usage
```
$ crawlette http://miguelcamba.com
```
## Improvements
This approach discovers new pages and fetches them in batches of up to 8 pages using threads. A solution based on EventMachine instead of threads would probably be more performant.

Since I haven't used any mutex or thread-safe data structures, there is a small chance that a page is crawled twice unnecessarily. Since the chance is small and a duplicate fetch won't alter the result, I just don't care.
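The batched threaded fetching described above can be sketched roughly as follows. This is an illustration, not the gem's actual code: `fetch_in_batches` is a hypothetical name, and the Mutex-guarded `Set` is precisely the thread-safety measure the paragraph above says the crawler omits.

```ruby
require "set"

BATCH_SIZE = 8 # mirrors the batch size mentioned above

# Run one thread per URL, at most BATCH_SIZE at a time. A Mutex-guarded Set
# skips URLs that were already fetched (the safeguard deliberately omitted
# in the original crawler).
def fetch_in_batches(urls, visited = Set.new, mutex = Mutex.new, &fetcher)
  urls.each_slice(BATCH_SIZE).flat_map do |batch|
    batch.map do |url|
      Thread.new do
        # Set#add? returns nil when the URL was already present.
        unseen = mutex.synchronize { visited.add?(url) }
        unseen ? fetcher.call(url) : nil
      end
    end.map(&:value) # wait for every thread in this batch
  end.compact
end
```

Passing the fetch step as a block keeps the sketch testable without network access; in the real crawler the block would perform an HTTP GET.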
This crawler has no limits, neither on the number of links nor on crawl depth. Don't use it to crawl very large sites; crawling Twitter, for example, would be too much.
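One way to add such a limit, sketched here with a hypothetical `bounded_crawl` helper that takes the link-extraction step as a block and stops after `max_pages` distinct pages:

```ruby
require "set"

# Breadth-first crawl that stops after max_pages distinct pages.
# fetch_links is caller-supplied and returns the links found on a page.
def bounded_crawl(start_url, max_pages: 500, &fetch_links)
  queue   = [start_url]
  visited = Set.new
  until queue.empty? || visited.size >= max_pages
    url = queue.shift
    next unless visited.add?(url) # skip pages we already crawled
    queue.concat(fetch_links.call(url))
  end
  visited
end
```

With a cap like this, even a site with effectively unbounded links (Twitter, say) would stop after `max_pages` fetches.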