Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/kish1/krawll
Web crawling made easy for Pythonistas. Just supply extractor and terminator functions to this higher-order function and let krawll do the rest.
Last synced: 2 months ago
- Host: GitHub
- URL: https://github.com/kish1/krawll
- Owner: kish1
- License: mit
- Created: 2015-05-15T23:33:29.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2017-03-25T04:39:00.000Z (almost 8 years ago)
- Last Synced: 2024-08-08T00:44:17.439Z (6 months ago)
- Language: Python
- Homepage:
- Size: 5.86 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- project-awesome - kish1/krawll - Web crawling made easy for Pythonistas. Just supply extractor and terminator functions to this higher-order function and let krawll do the rest. (Python)
README
# krawll
Web crawling made easy for Pythonistas. Just supply extractor and terminator functions to this higher-order function and let krawll do the rest.
# How to use krawll
Supply the following to the krawll function:
1. *Extractor* - A function that parses and extracts the required data from the HTML and returns the data as a dictionary.
2. *Terminator* - A function that returns True when the crawler should terminate, False otherwise.
_krawll(cookies, homepage, hostname, extractor, terminator)_
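To make the division of labour concrete, here is a minimal sketch of how a higher-order crawler could drive the two callbacks. It is not krawll's actual implementation, which this README does not show: `crawl_sketch`, `fetch`, `find_links`, and the in-memory `PAGES` site are all hypothetical names introduced for illustration, and merging results with `dict.update` is an assumption about how extracted dictionaries might be combined.

```python
from collections import deque


def crawl_sketch(start_url, fetch, find_links, extractor, terminator):
    """Breadth-first crawl that delegates parsing and stopping to callbacks.

    fetch(url) -> HTML string and find_links(html) -> list of URLs are
    hypothetical helpers for this sketch, not part of krawll's interface.
    """
    results = {}
    seen = {start_url}
    frontier = deque([start_url])
    while frontier:
        url = frontier.popleft()
        html = fetch(url)
        # The extractor turns one page into a dict of newly found data.
        results.update(extractor(html))
        # The terminator inspects everything collected so far.
        if terminator(results):
            break
        for link in find_links(html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return results


# A tiny in-memory "site" so the sketch runs without network access.
PAGES = {
    "/": "home a:/a a:/b",
    "/a": "item:x a:/b",
    "/b": "item:y a:/",
}


def fake_fetch(url):
    return PAGES[url]


def fake_links(html):
    # Tokens of the form "a:<url>" stand in for hyperlinks.
    return [tok[2:] for tok in html.split() if tok.startswith("a:")]


def item_extractor(html):
    # Tokens of the form "item:<name>" stand in for extracted data.
    return {tok[5:]: 1 for tok in html.split() if tok.startswith("item:")}


found = crawl_sketch("/", fake_fetch, fake_links, item_extractor,
                     lambda data: len(data) >= 2)
print(sorted(found))
```

In this toy run the crawl stops at `/b`, as soon as the terminator sees two items, and prints `['x', 'y']`.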
# Example
```python
from bs4 import BeautifulSoup

import krawll


# heading_extractor: String -> Dict
# GIVEN: the HTML source of a page
# RETURNS: a dict keyed by the newly extracted headlines,
#          or an empty dict if the page contains none.
def heading_extractor(html):
    soup = BeautifulSoup(html, 'html.parser')
    prefix = 'Breaking News: '
    news = {}
    for tag in soup.find_all('h1'):
        # Match on the heading text, not the serialized tag
        # (which would start with '<h1>').
        text = tag.get_text().strip()
        if text.startswith(prefix):
            # Keep the headline after the prefix; dict keys deduplicate hits.
            news[text[len(prefix):]] = 1
    return news


# may_exit: Dict -> Boolean
# GIVEN: a data structure containing the extracted data
# RETURNS: True if the required number of hits has been found, False otherwise.
# This implies that the crawler will terminate when this function returns True.
def may_exit(news):
    return len(news) == 20


# never_exit: Dict -> Boolean
# GIVEN: a data structure containing the extracted data
# RETURNS: False. This implies that the crawler will crawl the entire domain.
def never_exit(news):
    return False


cookies = {}
home_page = 'http://www.xyznews.com'
host = 'www.xyznews.com'
news_dict = krawll.krawll(cookies, home_page, host, heading_extractor, may_exit)
```
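The terminator above hard-codes its threshold and compares with `==`. As a hedged variation, a small factory can build terminators for any limit and use `>=`, which still fires if the count jumps past the limit between checks. Whether that can happen depends on when krawll calls the terminator, which this README does not specify. `make_hit_limit` is a hypothetical helper, not part of krawll.

```python
def make_hit_limit(limit):
    """Build a terminator that stops once at least `limit` items are collected."""
    def terminator(data):
        return len(data) >= limit
    return terminator


may_exit_20 = make_hit_limit(20)
print(may_exit_20({}))                         # False
print(may_exit_20(dict.fromkeys(range(25))))   # True
```

The resulting function takes a dict and returns a boolean, so it could be passed wherever `may_exit` is used in the example.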