https://github.com/kwlzn/webscraping
github convenience fork of http://code.google.com/p/webscraping/
- Host: GitHub
- URL: https://github.com/kwlzn/webscraping
- Owner: kwlzn
- Created: 2012-09-04T08:16:29.000Z (over 12 years ago)
- Default Branch: master
- Last Pushed: 2012-09-04T09:13:28.000Z (over 12 years ago)
- Last Synced: 2025-03-26T16:39:07.407Z (30 days ago)
- Language: Python
- Size: 146 KB
- Stars: 8
- Watchers: 2
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
webscraping
===========

github convenience fork of http://code.google.com/p/webscraping/
Overview
========

The webscraping library aims to make web scraping easier.
All code is pure Python and has been run on multiple Linux servers, Windows machines, and Google App Engine.
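The core pattern the library supports, crawling a site and caching each downloaded page, can be sketched in plain modern Python. This is a simplified illustration, not the library's implementation: the `fetch` callable and the regex-based link extraction are stand-ins.

```python
import re
from collections import deque
from urllib.parse import urljoin

def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl, caching each page's HTML by URL.

    `fetch` is a callable url -> html string, passed in so the sketch
    stays testable without network access.
    """
    cache = {}
    queue = deque([start_url])
    while queue and len(cache) < max_pages:
        url = queue.popleft()
        if url in cache:
            continue                      # already downloaded
        html = fetch(url)
        cache[url] = html                 # cache the raw HTML
        # naive link extraction; a real crawler would also filter
        # links to stay within the target domain
        for link in re.findall(r'href="([^"]+)"', html):
            queue.append(urljoin(url, link))
    return cache
```

Here `cache[url]` plays the same role as `D.cache[url]` in the download example below.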
Examples
========

common
------
>>> from webscraping import common
>>> common.remove_tags('hello <b>world</b>!')
'hello world!'
>>> common.extract_domain('http://www.google.com.au/tos.html')
'google.com.au'
>>> common.unescape('&lt;hello &amp; world&gt;')
'<hello & world>'
>>> common.extract_emails('hello richard AT sitescraper DOT net world')
['richard@sitescraper.net']
>>> import urllib2
>>> cj = common.firefox_cookie()
>>> opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
>>> html = opener.open(url).read() # use current Firefox cookies to access url

download
--------
>>> from webscraping import download
>>> D = download.Download()
>>> # crawl given domain
>>> domain = ...
>>> for url in D.crawl(domain):
>>>     html = D.cache[url]

pdict
-----
>>> from webscraping import pdict
>>> cache = pdict.PersistentDict(CACHE_FILE)
>>> cache['a'] = range(5) # pickle stored in sqlite database
>>> 'a' in cache
True
>>> cache['a']
[0, 1, 2, 3, 4]
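The trick PersistentDict relies on, pickling values into a sqlite table, can be sketched with the standard library. The class name and schema below are illustrative, not pdict's actual implementation:

```python
import pickle
import sqlite3

class SqliteDict:
    """Minimal dict-like store: pickled values keyed by text, in sqlite."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value BLOB)")

    def __setitem__(self, key, value):
        # REPLACE handles both insert and update for an existing key
        self.conn.execute(
            "REPLACE INTO kv (key, value) VALUES (?, ?)",
            (key, pickle.dumps(value)))
        self.conn.commit()

    def __getitem__(self, key):
        row = self.conn.execute(
            "SELECT value FROM kv WHERE key=?", (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return pickle.loads(row[0])

    def __contains__(self, key):
        return self.conn.execute(
            "SELECT 1 FROM kv WHERE key=?", (key,)).fetchone() is not None
```

With a file path instead of `':memory:'`, values survive across runs, which is what makes the cache persistent.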
(see a further example here)

xpath
-----
>>> from webscraping import xpath
>>> import urllib2
>>> html = urllib2.urlopen(url).read()
>>> xpath.parse(html, '/html/body/ul[2]/li[@class="info"]/div[1]')
['div content']
>>> xpath.parse(html, '/html/body/ul[2]/li[@class="info"]/a/@href')
['url1', 'url2', 'url3']

Install
=======

Some options for installing the webscraping package:
Clone the repository: hg clone https://code.google.com/p/webscraping/
Install with pip: sudo pip install -e hg+https://code.google.com/p/webscraping/#egg=webscraping
Download zip: http://webscraping.googlecode.com/files/webscraping.zip
License
=======

This code is licensed under the LGPL.