https://github.com/dnephin/threaded-crawler
- Host: GitHub
- URL: https://github.com/dnephin/threaded-crawler
- Owner: dnephin
- Created: 2011-03-01T00:31:31.000Z (about 14 years ago)
- Default Branch: master
- Last Pushed: 2011-03-01T02:55:11.000Z (about 14 years ago)
- Last Synced: 2025-01-26T00:12:13.818Z (3 months ago)
- Language: Python
- Homepage:
- Size: 219 KB
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README
README
Threaded Crawler
This web crawler is designed to be generic and highly configurable. It can
quickly traverse sites and pull content based on regex and other selection criteria.
__Requirements__
Uses BeautifulSoup to parse HTML pages (http://www.crummy.com/software/BeautifulSoup/)
Uses epydoc for documentation
Uses JobSite common package
python-psycopg2 2.0.8
__Development__
The 'cmd' script can be used to clean and build docs.
Documentation is in doc/API.
__INSTALL__
python setup.py install
__Running__
The $COMMON environment variable should be set to the path of the common/patterns.py
lib, or the lib should be installed on the default Python path.
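As a rough illustration of the lookup described above, the sketch below prepends the location given by $COMMON to the import path and otherwise leaves the default Python path untouched. This is not taken from the project's code: the helper name add_common_to_path, its parameters, and the rule that a value ending in ".py" refers to the file rather than its directory are all assumptions made for the example.

```python
import os
import sys


def add_common_to_path(environ=None, path=None):
    """Prepend the directory named by $COMMON to the import path.

    Hypothetical helper, not part of the project. Returns True when the
    path list was changed, False otherwise.
    """
    environ = os.environ if environ is None else environ
    path = sys.path if path is None else path

    common = environ.get("COMMON")
    if not common:
        # $COMMON is unset: rely on the lib being on the default Python path.
        return False

    # Assumption: $COMMON may name patterns.py itself or its directory.
    directory = os.path.dirname(common) if common.endswith(".py") else common

    if directory and directory not in path:
        path.insert(0, directory)
        return True
    return False
```

Called with no arguments, the helper reads os.environ and updates sys.path. Passing a plain dict and list instead, for example with a made-up location such as /opt/jobsite/common/patterns.py, makes the behaviour easy to check without touching the real environment.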