Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/7ws/webkitcrawler

QtWebKit-based web crawler
Host: GitHub
URL: https://github.com/7ws/webkitcrawler
Owner: 7ws
Created: 2010-08-01T11:50:39.000Z (about 14 years ago)
Default Branch: master
Last Pushed: 2011-04-16T04:35:33.000Z (over 13 years ago)
Last Synced: 2024-07-05T09:33:37.218Z (3 months ago)
Language: Python
Homepage:
Size: 270 KB
Stars: 71
Watchers: 9
Forks: 27
Open Issues: 0
Metadata Files:
- Readme: README.markdown

README

# WebKit Crawler

A Python Qt based tool for extracting content from complex websites.

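    from webkit_browser import Browser
    from lxml import html

    # Load the page in WebKit so its JavaScript runs before we read it.
    b = Browser()
    b.open('http://google.com/search?q=python')

    # 'content' is a file-like object holding the frame's markup.
    content = b.main_frame['content'].read()
    dom = html.fromstring(content)

    # Print the heading of each search result.
    results = dom.xpath('//*[@id="ires"]/ol/li')
    for result in results:
        print(result.find('h3').text_content())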

## Dependencies

Assuming that you're on a Linux box, you need ``python-qt`` installed
on your system to make it work. If you plan to run it on a server, you
may want to use ``xvfb``, since Qt needs a display backend.
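
As a rough illustration of the server case, the sketch below wraps a crawler script in ``xvfb-run -a`` whenever no X display is advertised. The helper names and ``my_crawler.py`` are hypothetical and not part of this project; it also assumes the ``xvfb-run`` wrapper that usually ships with the ``xvfb`` package is on your ``PATH``.

```python
import os
import sys


def has_display(env=None):
    """Return True when an X display is advertised via the DISPLAY variable."""
    env = os.environ if env is None else env
    return bool(env.get('DISPLAY'))


def crawler_command(script, env=None):
    """Build the command line used to run ``script`` (a hypothetical crawler).

    Without a display, the command is wrapped in ``xvfb-run -a`` so that Qt
    gets a virtual X server; otherwise the script is run directly.
    """
    command = [sys.executable, script]
    if not has_display(env):
        command = ['xvfb-run', '-a'] + command
    return command
```

You could then launch it with something like ``subprocess.call(crawler_command('my_crawler.py'))``.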

If you get it working on another environment, please contribute to this
README. :)

## Target

Use this software if you are dealing with a web page that completely
depends on JavaScript and you have already dug into its code but still
can't extract the info you want with simple HTTP requests.

Note that this software **is not** intended to replace tools like
[Mechanize][1] or other simple web scraping tools. I would not use it
if the page I want could be downloaded with a simple ``curl`` call.

## Usage

There's a ``Browser`` class that works similarly to Mechanize's
``Browser``, but without all that extra functionality. You can follow
the example above and interact with the ``main_frame`` dict.

When the page is loaded, the code looks for all the page frames,
recursively, and puts them in a dictionary. Each "frame" has three
keys: ``'title'`` (unicode), ``'content'`` (a file-like object) and
``'children'`` (a list containing child frames, if they exist). As you
can see in the example, ``main_frame`` serves as the root frame.
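
To make that structure concrete, here is a minimal sketch that walks such a frame tree depth-first. ``iter_frames`` is a hypothetical helper, not something this project provides, and ``example_main_frame`` is made-up data standing in for ``Browser().main_frame`` so the snippet runs without Qt.

```python
import io


def iter_frames(frame, depth=0):
    """Yield (depth, frame) for a frame and all of its descendants."""
    yield depth, frame
    for child in frame.get('children', []):
        for item in iter_frames(child, depth + 1):
            yield item


# Stand-in for ``Browser().main_frame``; the titles and markup are made up.
example_main_frame = {
    'title': u'Outer page',
    'content': io.StringIO(u'<html><body><iframe></iframe></body></html>'),
    'children': [
        {
            'title': u'Inner frame',
            'content': io.StringIO(u'<html><body>Hello</body></html>'),
            'children': [],
        },
    ],
}

# Print an indented outline of the frame titles.
for depth, frame in iter_frames(example_main_frame):
    print(u'{0}{1}'.format(u'  ' * depth, frame['title']))
```

Since ``'content'`` is file-like, calling ``frame['content'].read()`` inside the loop gives you the markup of each frame in turn.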

Additionally, if you're running the code in a graphical environment,
a mini-browser window will open, showing what's happening under the hood.

If you need to handle authentication and/or your page goes through many
redirects until you finally get what you want, consider providing a
``validate`` function for it. Example:

    def proceed_after_redirect(qwebview):
        """
        After all the redirects, a page with title 'Home' will be
        displayed. Note that you'll be handling the ``QWebView`` instance
        in this function, not a ``Browser`` object.
        """
        # title() is a method on QWebFrame, so it has to be called.
        if 'Home' in qwebview.page().mainFrame().title():
            return True

Or even

    def proceed_after_login(qwebview):
        """
        Fill in, then submit, the authentication form.
        """
        main_frame = qwebview.page().mainFrame()
        # title() is a method on QWebFrame, so it has to be called.
        if 'Login' in main_frame.title():
            # username and password must be defined in the enclosing scope.
            main_frame.evaluateJavaScript('''
                var form = document.querySelector('form#login');
                form['username'].value = '{0}';
                form['password'].value = '{1}';
                form.submit();
            '''.format(
                username, password,
            ))
        else:
            return True

You can always mix these to get what fits your problem. ;)
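
One possible way to mix them is sketched below: a small combinator that only proceeds once every validator returns ``True``. Everything here is an assumption made for illustration: ``all_of``, ``submit_login``, ``reached_home`` and the ``Fake*`` classes are hypothetical names, and the fakes merely mimic the ``page().mainFrame().title()`` and ``evaluateJavaScript()`` calls of ``QWebView``/``QWebFrame`` so the snippet runs without Qt. How the resulting function is handed to ``Browser`` is not shown in this README, so the sketch stops at building it.

```python
def all_of(*validators):
    """Build a validator that passes only when every given validator passes."""
    def validate(qwebview):
        # all() stops at the first validator that does not return True.
        return all(v(qwebview) is True for v in validators)
    return validate


class FakeFrame(object):
    """Minimal stand-in for QWebFrame, only for trying the idea without Qt."""

    def __init__(self, title):
        self._title = title
        self.scripts = []  # JavaScript snippets "evaluated" so far

    def title(self):
        return self._title

    def evaluateJavaScript(self, source):
        self.scripts.append(source)


class FakeView(object):
    """Minimal stand-in for QWebView exposing page().mainFrame()."""

    def __init__(self, title):
        self._frame = FakeFrame(title)

    def page(self):
        return self

    def mainFrame(self):
        return self._frame


def submit_login(qwebview):
    """Submit the login form if we are on the login page."""
    frame = qwebview.page().mainFrame()
    if 'Login' in frame.title():
        frame.evaluateJavaScript(
            "document.querySelector('form#login').submit();")
        return False  # keep waiting for the next page
    return True


def reached_home(qwebview):
    """Pass once the final 'Home' page is displayed."""
    return 'Home' in qwebview.page().mainFrame().title()


validate = all_of(submit_login, reached_home)
```

With these fakes, ``validate(FakeView('Login'))`` submits the form and keeps waiting, an intermediate page also keeps waiting, and only ``validate(FakeView('Home'))`` lets the crawl proceed.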

## License

This project is licensed under the DWTFYW (Do What The F*ck You Want)
license. No, I'm kidding. It's *MIT* licensed; anyway you are free to do
anything you want with it.

If this program saved your day, please consider sending me some soda or
donating to my PayPal account ([email protected]) so I can buy it
here. :)

**Note**: this code needs packaging. Why don't you fork it and make it
a Python package?

[1]: http://wwwsearch.sourceforge.net/mechanize/