Python tool for automatic data scraping from Html templates
https://github.com/santinic/htmlmatch

## htmlmatch: Automatic data scraping

Suppose you have a page with a list of videos (videos.html), and you want to get all the videos:

```html

Example




...

```

You can extract the data from this page by creating an extraction template like this (template.html):

```html


```

Put a `$variable$` placeholder wherever you want to capture a value. If you then run the script against videos.html and the template (saved as pattern.html in the command below), you get the raw data:

```
claudio@laptop:~$ ./htmlmatch.py videos.html pattern.html
code: 0001
title: The first video
preview: preview1.jpg

code: 0002
title: The second video
preview: preview2.jpg

code: 0003
title: The third video
preview: preview3.jpg
```
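To make the `$variable$` idea more concrete, here is a minimal sketch of how a template with placeholders could be matched against a page. It is not htmlmatch's actual implementation, which this README does not describe. It is a plain regex-based illustration, and the helper names `template_to_regex` and `extract` as well as the sample markup are hypothetical.

```python
import re


def template_to_regex(template):
    """Compile a template containing $name$ placeholders into a regex.

    Illustration only: literal text is escaped and each $name$ becomes a
    lazy named capture group.
    """
    # With one capture group, re.split alternates literal text and
    # placeholder names: [literal, name, literal, name, ..., literal].
    parts = re.split(r"\$(\w+)\$", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)           # literal markup
        else:
            pattern += "(?P<%s>.*?)" % part      # placeholder to capture
    return re.compile(pattern, re.DOTALL)


def extract(page, template):
    """Return one dictionary per place where the template matches the page."""
    return [m.groupdict() for m in template_to_regex(template).finditer(page)]


# Hypothetical markup, invented for this sketch.
page = (
    '<li id="0001"><b>The first video</b><img src="preview1.jpg"></li>\n'
    '<li id="0002"><b>The second video</b><img src="preview2.jpg"></li>\n'
)
template = '<li id="$code$"><b>$title$</b><img src="$preview$"></li>'

records = extract(page, template)
for record in records:
    print(record)
```

A regex like this only works when the page markup matches the template character for character, so treat it as a way to understand the placeholder concept and not as a replacement for the tool.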

You can access all these fields from your own Python code by calling the library as a function and iterating over the list of dictionaries it returns. For example:

```python
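import urllib2  # Python 2 standard library

# Assumption: htmlmatch.py is on the import path and exposes htmlmatch().
from htmlmatch import htmlmatch

videos_page = urllib2.urlopen("http://www.videos-website.com/")
pattern = open("pattern.html", "r")
matches = htmlmatch(videos_page, pattern)
for match in matches:  # each match is a dictionary: variable name -> value
    for k, v in match.iteritems():
        print k, v
    print  # blank line between records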
```
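Since the result is a list of dictionaries, it can be passed straight to standard tools. As a rough sketch, the following turns such a list into CSV text with the standard `csv` module. It is written for Python 3, and it assumes the dictionaries carry the `code`, `title` and `preview` keys from the sample output above. The `matches` list here is hand-written sample data standing in for what `htmlmatch` would return, and `matches_to_csv` is a hypothetical helper.

```python
import csv
import io

# Sample data shaped like the output shown earlier; in real use this
# would be the list returned by htmlmatch (assumption).
matches = [
    {"code": "0001", "title": "The first video", "preview": "preview1.jpg"},
    {"code": "0002", "title": "The second video", "preview": "preview2.jpg"},
    {"code": "0003", "title": "The third video", "preview": "preview3.jpg"},
]


def matches_to_csv(rows, fieldnames):
    """Serialize a list of dictionaries to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = matches_to_csv(matches, ["code", "title", "preview"])
print(csv_text)
```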