CRAWL-E
---------------------
https://github.com/mike442144/crawl-e
Automatically exported from code.google.com/p/crawl-e

Authors
---------------------
Bryce Boe (bboe _at_ cs.ucsb.edu)
Christo Wilson (bowlin _at_ cs.ucsb.edu)

About
---------------------
CRAWL-E was designed to crawl the web as fast as possible with as little
development time as possible. It is only a framework, and requires the
development of a Handler module in order to function properly.

The CRAWL-E developers are very familiar with how TCP and HTTP work, and have
used that knowledge to write a web crawler intended to maximize TCP throughput.
This benefit is realized when crawling web servers that support persistent HTTP
connections, as numerous requests can be made over a single TCP connection,
thus increasing throughput.

CRAWL-E also supports multiple HTTP request methods, the most basic being GET,
POST, PUT, DELETE, and HEAD.

Installation
---------------------
Requirements: Python >= 2.5
Run: python setup.py install
Note: You will probably need root privileges to install the package.

Running
---------------------
Check out the page downloader in the examples folder. The script run_it.sh
should be sufficient to get you started.

The heart of CRAWL-E is the Crawle.Handler class, which you extend by
implementing a process function. This function is where all the magic can
happen. We say can happen because it is entirely up to you what to do in this
function. Some possibilities are parsing links to add to the queue, simply
saving the contents of the page, or both!
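
As a rough illustration, here is a minimal handler sketch that does both.
The module name, the attribute names on the request/response object, and the
queue interface shown are assumptions and may not match your version of
CRAWL-E, so treat the page downloader in the examples folder as the
authoritative reference:

    import re
    import crawle

    # Naive link extraction, purely for illustration.
    LINK_RE = re.compile(r'href="(https?://[^"]+)"')

    class SaveAndQueueHandler(crawle.Handler):
        def process(self, request_response, queue):
            # Attribute names below (response_body, request_url) are assumed.
            body = request_response.response_body
            if not body:
                return
            # Possibility 1: save the contents of the page.
            filename = 'page_%d.html' % abs(hash(request_response.request_url))
            output = open(filename, 'w')
            output.write(body)
            output.close()
            # Possibility 2: parse links and add them to the crawl queue.
            for url in LINK_RE.findall(body):
                queue.put(url)

See the examples folder for how a handler like this is wired into a running
crawl.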