https://github.com/jhyao/tp_html
html parser python template
- Host: GitHub
- URL: https://github.com/jhyao/tp_html
- Owner: jhyao
- License: mit
- Created: 2018-06-13T05:42:13.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2018-12-30T09:03:29.000Z (over 7 years ago)
- Last Synced: 2025-10-28T17:48:10.929Z (8 months ago)
- Topics: html, parser, python, template
- Language: Python
- Size: 35.2 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Template Parser HTML
This tool extracts useful data from an HTML web page. It parses the page against a template file that marks the data you need with special attributes. The template locates the HTML blocks that contain data and describes the type, name, and structure of each value. You can create a template by modifying an example page, or by writing a basic HTML structure that locates your data. The first method is recommended: the tool deletes irrelevant parts and organizes the HTML tree automatically.
## Install
```
pip install tp_html
```
## How to use
```python
from tp_html import Template, ThtmlParser
# get template
template = Template(template_file='samples/basic_template.html')
# save template
template.save('samples/basic_template.min.html')
# get parser
parser = ThtmlParser(template_file='samples/basic_template.html')
parser = ThtmlParser(template_text='...')
parser = ThtmlParser(template=template)
# parse data
data = parser.parse(page_file='samples/basic_sample.html', encoding='utf-8')
data = parser.parse(page_url='http://.....')
data = parser.parse(page_text='.....')
```
## Template file
### string
A string value is taken from the content or from an attribute of an element.
```html
link
```
To get the content of the element. This yields the data `{'name': 'link'}`.
```html
```
To get the `href` attribute. This yields the data `{'name': '...'}`.
```html
```
### list
For HTML
```html
```
template:
```html
```
data:
```json
{
"images": [
"/image/1",
"/image/2",
"/image/3",
"/image/4"
]
}
```
In a list template, the element marked with `p-type=list` requires exactly one child `p-value` node, which selects the item data. If a list item is itself a dict or a list, nested structure inside the item is also allowed.
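The shape a list template produces can be pictured with the standard library alone. The sketch below is not how tp_html works internally; the sample markup, the `ImageCollector` class, and the choice of `img` and `src` are assumptions made for illustration. It collects one value per repeated element into a list.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.images.append(src)


# Assumed sample page; the real sample markup may differ.
page = '<div><img src="/image/1"><img src="/image/2"><img src="/image/3"></div>'

collector = ImageCollector()
collector.feed(page)
data = {"images": collector.images}
print(data)
```

With a template, the same result is described declaratively instead of being coded by hand for each page.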
### dict
For HTML
```html
```
template:
```html
```
data:
```json
{
"user_link": {
"name": "xxx",
"link": "/user/13456",
"title": "user xxx",
"age": "20",
"fans_num": "10",
"follow_num": "20"
}
}
```
In a dict template, `p-name` is required and is used as the key of the dictionary. Multiple `p-item` values are allowed, separated by spaces; "string" means the content of the element, and any other item is an attribute name.
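The rule that "string" selects the element content and any other item selects an attribute can be sketched in plain Python. This is only an illustration: `build_dict` and the key-to-source mapping are hypothetical, and the actual `p-item` syntax used by tp_html may differ.

```python
def build_dict(text, attrs, items):
    """Build a dict from one element.

    items maps an output key to its source: "string" means the element's
    text content, anything else is treated as an attribute name.
    (Hypothetical helper, not part of tp_html.)
    """
    result = {}
    for key, source in items.items():
        if source == "string":
            result[key] = text
        else:
            result[key] = attrs.get(source)
    return result


# Assumed element: text "xxx" with href and title attributes.
user_link = build_dict(
    "xxx",
    {"href": "/user/13456", "title": "user xxx"},
    {"name": "string", "link": "href", "title": "title"},
)
print({"user_link": user_link})
```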
### complex nesting
HTML:
```html
Title
```
template:
```html
```
data:
```json
{
"user_list": [
{
"name": "xxx",
"link": "/user/1",
"title": "user xxx",
"age": "20",
"fans_num": "10",
"follow_num": "20"
},
{
"name": "yyy",
"link": "/user/2",
"title": "user yyy",
"age": "10",
"fans_num": "10",
"follow_num": "20"
}
],
"user_all": {
"name": "xxx",
"link": "/user/1",
"title": "user xxx",
"age": "20",
"fans_num": "10",
"follow_num": "20",
"images": [
"/image/1",
"/image/2",
"/image/3",
"/image/4"
]
},
"user_info": {
"profile": {
"name": "xxx",
"link": "/user/1",
"title": "user xxx",
"age": "20"
},
"counts": {
"fans_num": "10",
"follow_num": "20"
}
},
"image_tags": [
[
"tag1-1",
"tag1-2",
"tag1-3"
],
[
"tag2-1",
"tag2-2",
"tag2-3"
],
[
"tag3-1",
"tag3-2",
"tag3-3"
]
]
}
```
## p-data tag in template
`p-data` is an empty HTML tag used in templates to organize the data structure. In the example below, four elements are at the same level in the HTML, but with the `p-data` tag the extracted data can be given more structure.
HTML:
```html
```
template:
```html
```
data:
```json
{
"user_info": {
"profile": {
"name": "xxx",
"link": "/user/1",
"title": "user xxx",
"age": "20"
},
"counts": {
"fans_num": "10",
"follow_num": "20"
}
}
}
```
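The regrouping shown in the data above can be pictured as turning a flat set of fields into nested dictionaries. The sketch below only illustrates the resulting shape; `group_fields` and the `groups` mapping are hypothetical and are not part of tp_html.

```python
def group_fields(flat, groups):
    """Nest flat fields under group names (hypothetical helper)."""
    return {
        group: {key: flat[key] for key in keys}
        for group, keys in groups.items()
    }


# Fields as they would appear without any grouping.
flat = {
    "name": "xxx",
    "link": "/user/1",
    "title": "user xxx",
    "age": "20",
    "fans_num": "10",
    "follow_num": "20",
}

# Grouping that mirrors the "profile" and "counts" keys in the data above.
groups = {
    "profile": ["name", "link", "title", "age"],
    "counts": ["fans_num", "follow_num"],
}

data = {"user_info": group_fields(flat, groups)}
print(data)
```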
## Save template
The tool also provides a method to save a minimal template to a file. Building a template from the minimal file is faster.
```python
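from tp_html import Template, ThtmlParser
# build the template once and save its minimal form
template = Template(template_file='samples/pixiv_user_template.html')
template.save('samples/pixiv_user_template.min.html')
# later runs can load the minimal template directly
parser = ThtmlParser(template_file='samples/pixiv_user_template.min.html')
data = parser.parse(page_file='samples/pixiv_user.html')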
```
minimal template:
```html
```
## Test
Timing tests for the templates in the samples folder. The test tool is [functime](https://github.com/jhyao/functime).
test code:
```python
import functime
from tp_html import ThtmlParser
parser = ThtmlParser(template_file='samples/basic_template.html')
functime.func_time(ThtmlParser, template_file='samples/basic_template.html')
functime.func_time(parser.parse, page_file='samples/basic_sample.html')
```
result:
```
basic_template.html
TemplateParser AVG(1000): 1.251ms
parse AVG(100): 3.2599ms
complex_template.html
TemplateParser AVG(1000): 1.6005ms
parse AVG(100): 6.265ms
pixiv_user_template.html
TemplateParser AVG(10): 24.7ms
parse AVG(10): 34.4997ms
pixiv_user_template.min.html
TemplateParser AVG(1000): 760.001183us
parse AVG(10): 31.2003ms
```
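Comparable averages can be measured with the standard library if functime is not installed. This is a minimal sketch, not functime's implementation: `average_time` is a hypothetical helper, and `sorted` stands in for the call being timed.

```python
import time


def average_time(func, *args, repeat=100, **kwargs):
    """Return the mean wall-clock time of func(*args, **kwargs) in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeat):
        func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return elapsed / repeat * 1000.0


# Stand-in workload; pass parser.parse and page_file=... to time a real parse.
avg_ms = average_time(sorted, list(range(1000)), repeat=50)
print(f"sorted AVG(50): {avg_ms:.4f}ms")
```

Absolute numbers depend on the machine, so only compare results measured in the same environment.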