https://github.com/dnephin/threaded-crawler

Host: GitHub
URL: https://github.com/dnephin/threaded-crawler
Owner: dnephin
Created: 2011-03-01T00:31:31.000Z
Default Branch: master
Last Pushed: 2011-03-01T02:55:11.000Z
Last Synced: 2025-01-26T00:12:13.818Z
Language: Python
Homepage:
Size: 219 KB
Stars: 1
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README

README

Threaded Crawler

This is a generic, highly configurable web crawler that can quickly traverse
sites and pull content based on regular expressions and other selection criteria.
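The general idea of a threaded crawl with regex-based selection can be sketched as follows. This is a minimal illustration, not the project's actual code: the names `crawl`, `fetch`, `content_pattern`, and `LINK_RE` are hypothetical, links are extracted with a simple regex from the standard library rather than with BeautifulSoup, and `fetch` is assumed to be any callable that maps a URL to an HTML string.

```python
import queue
import re
import threading

# Assumption: links are found with a simple regex; the real project
# parses HTML with BeautifulSoup instead.
LINK_RE = re.compile(r'href="([^"]+)"')


def crawl(start_url, fetch, content_pattern, num_threads=4):
    """Traverse pages reachable from start_url and collect regex matches.

    fetch is a hypothetical callable: url -> HTML string (or None).
    Returns a dict mapping each URL to its list of matches.
    """
    pattern = re.compile(content_pattern)
    todo = queue.Queue()
    seen = {start_url}
    results = {}
    lock = threading.Lock()
    todo.put(start_url)

    def worker():
        while True:
            url = todo.get()
            if url is None:  # sentinel: no more work
                todo.task_done()
                break
            try:
                html = fetch(url) or ""
                matches = pattern.findall(html)
                with lock:
                    if matches:
                        results[url] = matches
                    # Queue unseen links before marking this URL done,
                    # so todo.join() cannot return too early.
                    for link in LINK_RE.findall(html):
                        if link not in seen:
                            seen.add(link)
                            todo.put(link)
            except Exception:
                pass  # skip pages that fail to fetch or parse
            finally:
                todo.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    todo.join()  # wait until every queued URL has been processed
    for _ in threads:
        todo.put(None)
    for t in threads:
        t.join()
    return results
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library; for example, `crawl("/a", pages.get, r"price: (\d+)")` runs against an in-memory dict `pages` of URL to HTML.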

__Requirements__

Uses BeautifulSoup to parse HTML pages (http://www.crummy.com/software/BeautifulSoup/)
Uses epydoc for documentation
Uses JobSite common package

python-psycopg2 2.0.8

__Development__

The 'cmd' script can be used to clean and build docs.
Documentation is in doc/API.

__Install__

python setup.py install

__Running__

Set the $COMMON environment variable to the path of the common/patterns.py
library, or install the library on the default Python path.
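One way such a lookup could work is sketched below. The README does not say how the crawler reads $COMMON, so this is an assumption throughout: the helper name `add_common_to_path` is hypothetical, and the sketch assumes $COMMON names a directory to be searched for imports.

```python
import os
import sys


def add_common_to_path(env=os.environ, path=sys.path):
    """Prepend the directory named by $COMMON to the import path, if set.

    Assumption: $COMMON points at a directory. If it is unset, the path
    is left alone and the library must already be importable.
    """
    common = env.get("COMMON")
    if common and common not in path:
        path.insert(0, common)
    return path
```

Called with no arguments before the crawler's imports, this modifies `sys.path` in place; the `env` and `path` parameters exist only so the behavior can be checked without touching the real environment.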
