Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/7ws/webkitcrawler
QtWebKit-based web crawler
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/7ws/webkitcrawler
- Owner: 7ws
- Created: 2010-08-01T11:50:39.000Z (about 14 years ago)
- Default Branch: master
- Last Pushed: 2011-04-16T04:35:33.000Z (over 13 years ago)
- Last Synced: 2024-07-05T09:33:37.218Z (3 months ago)
- Language: Python
- Homepage:
- Size: 270 KB
- Stars: 71
- Watchers: 9
- Forks: 27
- Open Issues: 0
Metadata Files:
- Readme: README.markdown
README
# WebKit Crawler
A Python Qt based tool for extracting content from complex websites.

    from webkit_browser import Browser
    from lxml import html

    # Load the page in an embedded WebKit view so its JavaScript runs.
    b = Browser()
    b.open('http://google.com/search?q=python')

    # The root frame exposes its rendered markup as a file-like object.
    content = b.main_frame['content'].read()
    dom = html.fromstring(content)
    results = dom.xpath('//*[@id="ires"]/ol/li')

    for result in results:
        print(result.find('h3').text_content())

## Dependencies

Assuming that you're on a Linux box, you need ``python-qt`` installed
in your system to make it work. If you plan to run it on a server, you
may want to use ``xvfb``, since Qt needs a display backend.

If you get it working in another environment, please contribute to this
README. :)

## Target

Use this software if you are dealing with a web page that depends
entirely on JavaScript and you have already dug through its code but
still can't extract the info you want with simple HTTP requests.

Note that this software **is not** intended to replace tools like
[Mechanize][1] or other simple web scraping tools. I would not use it if
the page I wanted could be downloaded with a simple ``curl`` call.

## Usage

There's a ``Browser`` class that works similarly to Mechanize's
``Browser``, but without all that extra functionality. You can follow
the example above and interact with the ``main_frame`` dict.

When the page is loaded, the code looks for all the page frames,
recursively, and puts them in a dictionary. Each "frame" has three
keys: ``'title'`` (unicode), ``'content'`` (a file-like object) and
``'children'`` (a list containing child frames, if they exist). As you
can see in the example, ``main_frame`` serves as the root frame.

Additionally, if you're running the code in a graphical environment,
a mini-browser window will open, showing what's happening under the hood.

If you need to handle authentication and/or your page goes through many
redirects until you finally get what you want, consider providing a
``validate`` function for it. Example:

    def proceed_after_redirect(qwebview):
        """
        After all the redirects, a page with title 'Home' will be
        displayed. Note that you'll be handling the ``QWebView`` instance
        in this function, not a ``Browser`` object.
        """
        # ``title`` is a method on ``QWebFrame``, so it has to be called.
        if 'Home' in qwebview.page().mainFrame().title():
            return True

Or even:

    def proceed_after_login(qwebview):
        """
        Fill in, then submit, the authentication form.
        """
        main_frame = qwebview.page().mainFrame()
        if 'Login' in main_frame.title():
            # ``username`` and ``password`` are expected to be defined
            # in the enclosing scope.
            main_frame.evaluateJavaScript('''
                var form = document.querySelector('form#login');
                form['username'].value = '{0}';
                form['password'].value = '{1}';
                form.submit();
            '''.format(
                username, password,
            ))
        else:
            return True

You can always mix these to get what fits your problem. ;)

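As a rough sketch of working with the frame dictionaries described earlier in this section, the helper below walks a root frame and its ``'children'`` recursively. ``iter_frames`` and the ``sample`` dict are hypothetical names used only for illustration and are not part of this project. The only thing assumed about a frame is the three keys listed above (``'title'``, ``'content'``, ``'children'``).

```python
import io


def iter_frames(frame, depth=0):
    """Yield (depth, frame) for a frame dict and all of its descendants."""
    yield depth, frame
    # 'children' is described as a list of child frames; tolerate it
    # being missing or empty (an assumption, not confirmed by the project).
    for child in frame.get('children') or []:
        for item in iter_frames(child, depth + 1):
            yield item


# Stand-in for ``Browser.main_frame`` so the sketch runs without Qt.
sample = {
    'title': u'Main',
    'content': io.StringIO(u'<html>main</html>'),
    'children': [
        {
            'title': u'Sidebar',
            'content': io.StringIO(u'<html>side</html>'),
            'children': [],
        },
    ],
}

# Print an indented outline of the frame titles.
for depth, frame in iter_frames(sample):
    print('  ' * depth + frame['title'])
```

With a real page you would pass ``b.main_frame`` instead of ``sample`` and call ``frame['content'].read()`` on whichever frame holds the markup you need.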
## License
This project is licensed under the DWTFYW (Do What The F*ck You Want)
license. No, I'm kidding. It's *MIT* licensed; anyway you are free to do
anything you want with it.

If this program saved your day, please consider sending me some soda or
donating to my PayPal account ([email protected]) so I can buy it
here. :)

**Note**: this code needs packaging. Why don't you fork it and make it
a Python package?

[1]: http://wwwsearch.sourceforge.net/mechanize/