https://github.com/santinic/htmlmatch

Python tool for automatic data scraping from Html templates

Last synced: 6 months ago

Host: GitHub
URL: https://github.com/santinic/htmlmatch
Owner: santinic
Created: 2013-03-07T16:17:23.000Z (over 12 years ago)
Default Branch: master
Last Pushed: 2016-05-02T18:54:58.000Z (about 9 years ago)
Last Synced: 2025-01-14T17:29:20.802Z (6 months ago)
Language: Python
Size: 3.91 KB
Stars: 19
Watchers: 3
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md

README

## htmlmatch: Automatic data scraping

Suppose you have a page with a list of videos (videos.html), and you want to get all the videos:

```html

Example

Title first video

Title second video

Title third video

...

```

You can extract the data from this page by writing an extraction template like this (pattern.html):

```html

$title$

```

Put a `$variable$` placeholder wherever a value should be captured. Running the script against videos.html and pattern.html then prints the raw data:

```
claudio@laptop:~$ ./htmlmatch.py videos.html pattern.html
code: 0001
title: The first video
preview: preview1.jpg

code: 0002
title: The second video
preview: preview2.jpg

code: 0003
title: The third video
preview: preview3.jpg
```
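
To make the `$variable$` idea concrete, here is a minimal sketch of placeholder-based extraction using regular expressions. It is not how htmlmatch is implemented; the functions `template_to_regex` and `extract`, as well as the sample markup, are hypothetical and exist only for illustration. The sketch targets Python 3 and assumes each captured value sits on a single line.

```python
import re

# Matches a $name$ placeholder and captures the name.
PLACEHOLDER = re.compile(r"\$([A-Za-z_]\w*)\$")


def template_to_regex(template):
    """Compile a template containing $name$ placeholders into a regex."""
    # re.split with one capture group alternates: literal, name, literal, ...
    parts = PLACEHOLDER.split(template.strip())
    pieces = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            # Placeholder name: capture as little text as possible.
            pieces.append("(?P<%s>.*?)" % part)
        else:
            # Literal text: escape it, and let whitespace in the template
            # match optional whitespace in the page.
            tokens = re.split(r"\s+", part)
            pieces.append(r"\s*".join(re.escape(tok) for tok in tokens))
    return re.compile("".join(pieces))


def extract(page, template):
    """Return one dict per place where the template matches the page."""
    regex = template_to_regex(template)
    return [m.groupdict() for m in regex.finditer(page)]


# Hypothetical markup, made up for this sketch.
page = """
<div class="video" id="0001"><img src="preview1.jpg"><h1>The first video</h1></div>
<div class="video" id="0002"><img src="preview2.jpg"><h1>The second video</h1></div>
"""
template = '<div class="video" id="$code$"><img src="$preview$"><h1>$title$</h1></div>'

rows = extract(page, template)
for row in rows:
    for key in ("code", "title", "preview"):
        print("%s: %s" % (key, row[key]))
    print()
```

A regex approach like this breaks on nested or irregular markup and rejects a template that repeats a placeholder name, which is one reason to prefer a tool that works on the parsed HTML structure.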

You can access all these fields from Python by calling the library as a function and iterating over the list of dictionaries it returns. For example:

```python
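import urllib2  # Python 2 standard library

from htmlmatch import htmlmatch  # assumes htmlmatch.py is on the Python path

videos_page = urllib2.urlopen("http://www.videos-website.com/")
pattern = open("pattern.html", "r")
matches = htmlmatch(videos_page, pattern)
for match in matches:
    # Each match is a dictionary mapping variable names to extracted values
    for k, v in match.iteritems():
        print k, v
    print  # blank line between records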
```
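
Because the result is described as a list of dictionaries, it can be passed to standard-library tools directly. The sketch below writes such a list out as CSV. It targets Python 3 and does not call htmlmatch: the `matches` list is a stand-in copied from the sample output above, and `matches_to_csv` is a hypothetical helper name.

```python
import csv
import io

# Stand-in for the list of dictionaries that htmlmatch is described as returning.
matches = [
    {"code": "0001", "title": "The first video", "preview": "preview1.jpg"},
    {"code": "0002", "title": "The second video", "preview": "preview2.jpg"},
    {"code": "0003", "title": "The third video", "preview": "preview3.jpg"},
]


def matches_to_csv(matches, fields):
    """Serialize a list of dicts to CSV text with the given column order."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(matches)
    return buf.getvalue()


csv_text = matches_to_csv(matches, ["code", "title", "preview"])
print(csv_text)
```

Passing the column order explicitly keeps the output stable regardless of how the dictionaries order their keys.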
