Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/cibernox/crawlette
Very simple web crawler.
- Host: GitHub
- URL: https://github.com/cibernox/crawlette
- Owner: cibernox
- License: MIT
- Created: 2014-09-14T19:47:13.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2014-09-14T19:49:35.000Z (over 10 years ago)
- Last Synced: 2024-10-29T20:12:41.729Z (3 months ago)
- Language: Ruby
- Size: 129 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README
# Crawlette
Very simple command-line utility that crawls a URL and shows the links and assets of each page.
## Installation
```
$ gem install crawlette
```
## Usage
```
$ crawlette http://miguelcamba.com
```
## Improvements
This approach discovers new pages and fetches them in batches of up to 8 pages using threads. A solution based on EventMachine instead of threads would probably be more performant.

Since I haven't used any mutex or thread-safe data structures, there is a small chance that a page is crawled twice unnecessarily. Since the chance is small and a duplicate fetch won't alter the result, I just don't care.
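The batched threaded fetching described above can be sketched roughly as follows. This is an illustration, not the gem's actual code: `fetch_in_batches` is a hypothetical name, and the Mutex-guarded `Set` is precisely the thread-safety measure the paragraph above says the crawler omits.

```ruby
require "set"

BATCH_SIZE = 8 # mirrors the batch size mentioned above

# Run one thread per URL, at most BATCH_SIZE at a time. A Mutex-guarded Set
# skips URLs that were already fetched (the safeguard deliberately omitted
# in the original crawler).
def fetch_in_batches(urls, visited = Set.new, mutex = Mutex.new, &fetcher)
  urls.each_slice(BATCH_SIZE).flat_map do |batch|
    batch.map do |url|
      Thread.new do
        # Set#add? returns nil when the URL was already present.
        unseen = mutex.synchronize { visited.add?(url) }
        unseen ? fetcher.call(url) : nil
      end
    end.map(&:value) # wait for every thread in this batch
  end.compact
end
```

Passing the fetch step as a block keeps the sketch testable without network access; in the real crawler the block would perform an HTTP GET.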
This crawler has no limits, neither on the number of links nor on crawl depth. Don't use it to crawl very large sites; crawling Twitter, for example, would be too much.
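One way to add such a limit, sketched here with a hypothetical `bounded_crawl` helper that takes the link-extraction step as a block and stops after `max_pages` distinct pages:

```ruby
require "set"

# Breadth-first crawl that stops after max_pages distinct pages.
# fetch_links is caller-supplied and returns the links found on a page.
def bounded_crawl(start_url, max_pages: 500, &fetch_links)
  queue   = [start_url]
  visited = Set.new
  until queue.empty? || visited.size >= max_pages
    url = queue.shift
    next unless visited.add?(url) # skip pages we already crawled
    queue.concat(fetch_links.call(url))
  end
  visited
end
```

With a cap like this, even a site with effectively unbounded links (Twitter, say) would stop after `max_pages` fetches.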