Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/bububa/lego
A Python module, based on the bububa.SuperMario project, that provides a YAML-style template crawler along with some collective-intelligence functions.
- Host: GitHub
- URL: https://github.com/bububa/lego
- Owner: bububa
- Created: 2009-12-02T05:34:26.000Z (almost 15 years ago)
- Default Branch: master
- Last Pushed: 2010-03-03T12:04:19.000Z (over 14 years ago)
- Last Synced: 2023-04-18T23:29:47.245Z (over 1 year ago)
- Language: Python
- Homepage: http://syd.todayclose.com/
- Size: 147 KB
- Stars: 3
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.txt
README
= About Lego =
Lego is an advanced web crawler library written in Python. It
provides a number of methods for mining data from various kinds of
sites. You can use YAML to create crawler templates, so you don't
need to write Python code for a web crawl job.
This project is based on SuperMario (http://github.com/bububa/SuperMario/).

== License ==
BSD License
See 'LICENSE' for details.

== Requirements ==
Platform: *nix like system (Unix, Linux, Mac OS X, etc.)
Python: 2.5+
Storage: MongoDB
Other required Python modules:
- bububa.SuperMario

== Features ==
+ YAML-style templates
+ keyword IDF calculation
+ keyword COEF calculation
+ site manager
+ smart crawling
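One of the collective-intelligence features listed above is keyword IDF (inverse document frequency) calculation. Lego's own implementation isn't shown in this README; the following is a minimal standalone sketch of the standard IDF formula, for illustration only:

```python
import math

def idf(term, documents):
    """Inverse document frequency: log(N / df), where N is the number
    of documents and df is how many of them contain the term."""
    df = sum(1 for doc in documents if term in doc)
    if df == 0:
        return 0.0  # term never occurs; treat as uninformative
    return math.log(len(documents) / float(df))

# Toy corpus: each document is a set of extracted keywords.
docs = [
    {"python", "crawler", "yaml"},
    {"python", "mongodb"},
    {"crawler", "templates"},
]
print(idf("python", docs))   # common term -> lower score
print(idf("mongodb", docs))  # rare term -> higher score
```

Rarer terms score higher, which is why IDF is useful for weighting extracted keywords.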

== DEMO ==
Sample YAML file to crawl snipplr.com:

#snipplr.yaml
run: !SiteCrawler
  check: !Crawlable
    yaml_file: sites.yaml
    label: snipplr
  logger:
    filename: snipplr.log
  crawler: !YAMLStorage
    yaml_file: sites.yaml
    label: snipplr
    method: update_config
  data: !DetailCrawler
    sleep: 3
    proxies: !File
      method: readlines
      filename: /proxies
    pages: !PaginateCrawler
      proxies: !File
        method: readlines
        filename: /proxies
      url_pattern: http://snipplr.com/all/page/{NUMBER}
      start_no: 0
      end_no: 0
      multithread: True
      wrapper: '([^^].*?)'
      logger:
        filename: snipplr.log
    url_pattern: '/view/\d+/[^^]*?/$'
    wrapper:
      title: '([^^]*?)'
      language: 'Published in: ([^^]*?)'
      author: 'Posted By\s+([^^]*?)'
      code: ''
      comment: '([^^]*?)'
      tag: 'Tagged\s+([^^]*?)'
    essential_fields:
      - title
      - language
      - code
    multithread: True
    remove_external_duplicate: True
    logger:
      filename: snipplr.log
    page_callback: !Document
      label: snipplr
      method: write
      page: None
      logger:
        filename: snipplr.log
    furthure:
      tag:
        parser: !Steps
          inputs: None
          steps:
            - !Dict
              dictionary: None
              method: member
              args:
                - tag
            - !Regx
              string: None
              pattern: ([^^]*?)
              multiple: True
      code:
        parser: !Steps
          inputs: None
          steps:
            - !Dict
              dictionary: None
              method: member
              args:
                - code
            - !String
              args: None
              base_str: 'http://snipplr.com/view.php?codeview&id=%s'
            - !Init
              inputs: None
              obj: !URLCrawler
                urls: None
                params:
                  save_output: True
                  wrapper: '([^^]*?)'
            - !Array
              arr: None
              method: member
              args:
                - 0
            - !Dict
              dictionary: None
              method: member
              args:
                - wrapper
--------------------------------------------------------------------
#sites.yaml
snipplr:
  duration: 1800
  end_no: 1
  last_updated_at: 1263883143.0
  start_no: 1
  step: 1
--------------------------------------------------------------------
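In the demo above, the !PaginateCrawler block presumably expands url_pattern over the page range given by start_no and end_no (with the step value kept in sites.yaml). A sketch of that expansion, as assumed behavior rather than Lego's actual code:

```python
def paginate_urls(pattern, start_no, end_no, step=1):
    """Expand a {NUMBER} placeholder over an inclusive page range,
    mimicking what a PaginateCrawler-style component would fetch."""
    return [pattern.replace("{NUMBER}", str(n))
            for n in range(start_no, end_no + 1, step)]

urls = paginate_urls("http://snipplr.com/all/page/{NUMBER}", 0, 2)
print(urls)
```

With start_no and end_no both 0, as in the demo, only the first listing page is crawled.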
Run the crawler in a shell:

python lego.py -c snipplr.yaml

The results are stored in MongoDB. You can use the inserter module to insert the data into a MySQL database.
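The inserter module's interface isn't documented in this README. Purely as an illustration of the MongoDB-to-MySQL step, here is a hypothetical helper (table and column names are made up) that maps one crawled document onto a parameterized INSERT statement:

```python
def to_insert_sql(table, doc):
    """Build a parameterized INSERT for one crawled document.
    Returns (sql, values) suitable for a DB-API cursor.execute()."""
    cols = sorted(doc)  # deterministic column order
    placeholders = ", ".join(["%s"] * len(cols))
    sql = "INSERT INTO %s (%s) VALUES (%s)" % (
        table, ", ".join(cols), placeholders)
    return sql, [doc[c] for c in cols]

# A document shaped like the snipplr wrapper fields above.
doc = {"title": "Quicksort", "language": "Python", "code": "def qs(): pass"}
sql, values = to_insert_sql("snippets", doc)
print(sql)  # INSERT INTO snippets (code, language, title) VALUES (%s, %s, %s)
```

Using %s placeholders and passing values separately lets the MySQL driver handle escaping, rather than interpolating crawled text directly into SQL.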