Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


https://github.com/bububa/lego

A Python module, based on the bububa.SuperMario project, that provides a YAML-style template crawler along with some collective-intelligence functions.

README

        

= About Lego =
Lego is an advanced web crawler library written in Python. It
provides a number of methods to mine data from many kinds of sites. You
can use YAML to create crawler templates, so you do not need to write
Python code for a web crawling job.
This project is based on SuperMario (http://github.com/bububa/SuperMario/).
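
The `!SiteCrawler`-style entries in lego templates (see the demo below) are YAML application tags. As a rough illustration of how such tags can map to Python objects — this uses PyYAML's constructor mechanism, not lego's actual loader, and the `SiteCrawler` class here is a stand-in:

```python
import yaml  # PyYAML, assumed available

# Stand-in class: it only records the template parameters, to show how a
# "!SiteCrawler" tag could be turned into an object while loading YAML.
class SiteCrawler(object):
    def __init__(self, params):
        self.params = params

def construct_site_crawler(loader, node):
    # Build a SiteCrawler from the mapping that follows the tag.
    return SiteCrawler(loader.construct_mapping(node, deep=True))

yaml.add_constructor("!SiteCrawler", construct_site_crawler)

template = "run: !SiteCrawler {label: snipplr, sleep: 3}"
doc = yaml.load(template, Loader=yaml.Loader)
print(doc["run"].params["label"])  # prints: snipplr
```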

== License ==
BSD License
See 'LICENSE' for details.

== Requirements ==
Platform: *nix like system (Unix, Linux, Mac OS X, etc.)
Python: 2.5+
Storage: mongodb
Other required Python modules:
- bububa.SuperMario

== Features ==
+ YAML style templates
+ Keywords IDF calculation
+ Keywords COEF calculation
+ Sitemanager
+ Smart crawling
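
The keywords IDF feature refers to the standard inverse document frequency measure. A generic sketch of the calculation in plain Python (not lego's implementation):

```python
import math

def idf(term, documents):
    # IDF = log(N / df), where df is the number of documents containing
    # the term: rarer terms across the corpus score higher.
    df = sum(1 for doc in documents if term in doc)
    if df == 0:
        return 0.0
    return math.log(len(documents) / float(df))

docs = [
    ["python", "crawler", "yaml"],
    ["python", "mongodb"],
    ["yaml", "template"],
]
print(idf("mongodb", docs))  # rarest term here, so highest IDF
```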

== DEMO ==
A sample YAML template to crawl snipplr.com:

#snipplr.yaml
run: !SiteCrawler
  check: !Crawlable
    yaml_file: sites.yaml
    label: snipplr
  logger:
    filename: snipplr.log
  crawler: !YAMLStorage
    yaml_file: sites.yaml
    label: snipplr
    method: update_config
  data: !DetailCrawler
    sleep: 3
    proxies: !File
      method: readlines
      filename: /proxies
    pages: !PaginateCrawler
      proxies: !File
        method: readlines
        filename: /proxies
      url_pattern: http://snipplr.com/all/page/{NUMBER}
      start_no: 0
      end_no: 0
      multithread: True
      wrapper: '([^^].*?)'
      logger:
        filename: snipplr.log
    url_pattern: '/view/\d+/[^^]*?/$'
    wrapper:
      title: '([^^]*?)'
      language: 'Published in: ([^^]*?)'
      author: 'Posted By\s+([^^]*?)'
      code: ''
      comment: ''
      tag: 'Tagged\s+([^^]*?)'
    essential_fields:
    - title
    - language
    - code
    multithread: True
    remove_external_duplicate: True
    logger:
      filename: snipplr.log
    page_callback: !Document
      label: snipplr
      method: write
      page: None
      logger:
        filename: snipplr.log
    furthure:
      tag:
        parser: !Steps
          inputs: None
          steps:
          - !Dict
            dictionary: None
            method: member
            args:
            - tag
          - !Regx
            string: None
            pattern: '([^^]*?)'
            multiple: True
      code:
        parser: !Steps
          inputs: None
          steps:
          - !Dict
            dictionary: None
            method: member
            args:
            - code
          - !String
            args: None
            base_str: 'http://snipplr.com/view.php?codeview&id=%s'
          - !Init
            inputs: None
            obj: !URLCrawler
              urls: None
              params:
                save_output: True
                wrapper: '([^^]*?)'
          - !Array
            arr: None
            method: member
            args:
            - 0
          - !Dict
            dictionary: None
            method: member
            args:
            - wrapper

-------------------------------------------------------------------
#sites.yaml
snipplr:
  duration: 1800
  end_no: 1
  last_updated_at: 1263883143.0
  start_no: 1
  step: 1

--------------------------------------------------------------------
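
In the template above, the PaginateCrawler presumably substitutes each page number from start_no to end_no into the {NUMBER} placeholder of url_pattern. A minimal sketch of that expansion (assumed behavior, not lego's code):

```python
def expand_pages(url_pattern, start_no, end_no, step=1):
    # Replace the {NUMBER} placeholder with each page index in range,
    # yielding the list of listing-page URLs to fetch.
    return [url_pattern.replace("{NUMBER}", str(n))
            for n in range(start_no, end_no + 1, step)]

urls = expand_pages("http://snipplr.com/all/page/{NUMBER}", 0, 2)
print(urls[0])  # http://snipplr.com/all/page/0
```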

Run the crawler from the shell:

python lego.py -c snipplr.yaml

The results are stored in mongodb. You can use the inserter module to insert the data into a MySQL database.
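
The inserter module's schema is not documented here, but moving a crawled document from mongodb into MySQL amounts to building one parameterized INSERT per record. A generic sketch (the table and column names are assumptions for illustration, not the inserter module's actual schema):

```python
def to_insert_sql(table, doc):
    # Build a parameterized INSERT statement for one crawled document.
    # The returned (sql, params) pair is meant for a DB-API
    # cursor.execute(sql, params) call; values are never interpolated
    # into the SQL string directly.
    cols = sorted(doc)
    placeholders = ", ".join(["%s"] * len(cols))
    sql = "INSERT INTO %s (%s) VALUES (%s)" % (
        table, ", ".join(cols), placeholders)
    return sql, [doc[c] for c in cols]

sql, params = to_insert_sql("snippets",
                            {"title": "Quicksort", "language": "python"})
print(sql)  # INSERT INTO snippets (language, title) VALUES (%s, %s)
```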