https://github.com/santinic/htmlmatch
Python tool for automatic data scraping from Html templates
- Host: GitHub
- URL: https://github.com/santinic/htmlmatch
- Owner: santinic
- Created: 2013-03-07T16:17:23.000Z (almost 12 years ago)
- Default Branch: master
- Last Pushed: 2016-05-02T18:54:58.000Z (over 8 years ago)
- Last Synced: 2024-05-02T23:39:16.062Z (8 months ago)
- Language: Python
- Size: 3.91 KB
- Stars: 19
- Watchers: 3
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
## htmlmatch: Automatic data scraping
Suppose you have a page with a list of videos (videos.html), and you want to get all the videos:
```html
Example
...
```
You can extract the data from this web page by creating an extraction template like this (pattern.html):
```html
```
Just put `$variable$` wherever you want to capture a value. Now if you run the script against videos.html and pattern.html, you get the raw data:
```
claudio@laptop:~$ ./htmlmatch.py videos.html pattern.html
code: 0001
title: The first video
preview: preview1.jpg

code: 0002
title: The second video
preview: preview2.jpg

code: 0003
title: The third video
preview: preview3.jpg
```
You can access all these fields by using the library as a function in your Python code and iterating over the list of dictionaries it returns. For example:
```python
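# Python 2 example, as in the original README.
import urllib2
# Assumption: htmlmatch.py is on the import path and exposes htmlmatch().
from htmlmatch import htmlmatch

videos_page = urllib2.urlopen("http://www.videos-website.com/")
pattern = open("pattern.html", "r")
matches = htmlmatch(videos_page, pattern)
for match in matches:
    for k, v in match.iteritems():
        print k, v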
```
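To give a rough idea of how `$variable$` placeholders can drive extraction, here is a minimal sketch in Python 3 that turns a template into a regular expression with named groups. It is not htmlmatch's actual implementation, which this README does not describe. The helper names `template_to_regex` and `match_template` and the sample markup are hypothetical, chosen only to reproduce the kind of output shown above.

```python
import re

# A placeholder is an identifier wrapped in dollar signs, e.g. $title$.
PLACEHOLDER = re.compile(r"\$([A-Za-z_]\w*)\$")


def template_to_regex(template):
    """Compile a template with $name$ placeholders into a regex (sketch)."""
    # With one capturing group, split() returns literal text at even
    # indexes and placeholder names at odd indexes.
    parts = PLACEHOLDER.split(template)
    pieces = []
    for index, part in enumerate(parts):
        if index % 2 == 0:
            # Literal markup must match exactly.
            pieces.append(re.escape(part))
        else:
            # Each placeholder becomes a non-greedy named group.
            # Note: a name used twice in one template would raise re.error.
            pieces.append("(?P<%s>.+?)" % part)
    return re.compile("".join(pieces), re.DOTALL)


def match_template(page, template):
    """Return one dict per place where the template matches the page."""
    regex = template_to_regex(template)
    return [m.groupdict() for m in regex.finditer(page)]


# Hypothetical markup, invented for this example.
page = (
    '<li id="0001"><img src="preview1.jpg"><b>The first video</b></li>\n'
    '<li id="0002"><img src="preview2.jpg"><b>The second video</b></li>\n'
)
template = '<li id="$code$"><img src="$preview$"><b>$title$</b></li>'

matches = match_template(page, template)
for match in matches:
    for key, value in sorted(match.items()):
        print("%s: %s" % (key, value))
```

This sketch requires the literal parts of the template to match the page character for character, including whitespace. A real tool would likely need to be more tolerant of formatting differences, for instance by comparing parsed HTML rather than raw text.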