Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/sihaelov/harser
Easy way for HTML parsing and building XPath
- Host: GitHub
- URL: https://github.com/sihaelov/harser
- Owner: sihaelov
- License: mit
- Created: 2016-11-30T18:07:04.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2022-07-06T19:20:57.000Z (over 2 years ago)
- Last Synced: 2024-09-26T20:05:13.700Z (4 months ago)
- Topics: html, html-parser, parser, python, xpath
- Language: Python
- Size: 5.86 KB
- Stars: 138
- Watchers: 5
- Forks: 3
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- starred-awesome - harser - Easy way for HTML parsing and building XPath (Python)
README
# Harser
[![Build Status](https://travis-ci.org/sihaelov/harser.svg?branch=master)](https://travis-ci.org/sihaelov/harser) [![Coverage Status](https://img.shields.io/codecov/c/github/sihaelov/harser.svg)](https://codecov.io/gh/sihaelov/harser) [![Wheel Status](https://img.shields.io/badge/wheel-yes-brightgreen.svg)](https://pypi.python.org/pypi/harser) ![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg) [![PyPI Version](https://img.shields.io/pypi/v/harser.svg)](https://pypi.python.org/pypi/harser)
Harser is a library for easily extracting data from HTML and building XPath expressions.
## Installation
```bash
pip install harser
```
## Examples

The sample markup below is illustrative: its tags and attributes are a sketch inferred from the queries that follow (a `header` div of `nav-item` entries followed by several "layer" blocks).

```python
>>> from harser import Harser

# Illustrative sample document (structure assumed from the queries below)
>>> HTML = '''
    <div class="header" id="id-header">
        <li class="nav-item" href="/nav1">First item</li>
        <li class="nav-item" href="/nav2" data-nav="second">Second item</li>
        <li class="nav-item" href="/nav3">Third item</li>
    </div>
    <div class="layer1">
        First layer
        <span>Lorem Ipsum</span>
        <span>Dolor sit amet</span>
    </div>
    <div class="layer2">Second layer</div>
    <div class="layer3">
        Third layer
        <h3>first block</h3>
        <h3 some-attr="hey"><span id="foo-span">second block</span></h3>
        <h3>third block</h3>
    </div>
    <div class="footer" id="id-bar">fourth layer</div>
'''
>>> harser = Harser(HTML)

>>> harser.find('div', class_='header').children(class_='nav-item').find('text').extract()
# Or just
# harser.find(class_='nav-item').find('text').extract()
['First item', 'Second item', 'Third item']

>>> harser.find(class_='nav-item').get_attr('href').extract()
['/nav1', '/nav2', '/nav3']

# The two calls below are equivalent
>>> harser.find('div', class_='header', id='id-header')
>>> harser.find('div', attrs={'class': 'header', 'id': 'id-header'})

>>> harser.find(id__contains='bar').get_attr('class').extract()
['footer']

>>> harser.find(href__not_contains='2').find('text').extract()
['First item', 'Third item']

>>> harser.find(attrs={'data-nav__contains': 'second'}).next_siblings().find('text').extract()
['Third item']

>>> harser.find('li').parent().next_siblings(filters={'text__contains': 'Second'}).clean_extract()
['Second layer']

>>> harser.find('h3', filters={'span.@id__starts_with': 'foo'}).get_attr('some-attr').extract()
['hey']

>>> harser.find('div').children('h3').xpath
'//descendant::div/h3'
```
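
Because `.xpath` returns a plain XPath string, the built expression can also be evaluated by other tools. Below is a minimal sketch, assuming `lxml` is installed; the `DOC` snippet is just an illustrative stand-in, not part of Harser's API.

```python
# Sketch: Harser builds the XPath expression, and lxml (assumed installed)
# evaluates the same string independently.
from harser import Harser
from lxml import html as lxml_html

DOC = '<div><h3>first block</h3><h3>second block</h3></div>'

# .xpath returns the expression built so far as a plain string.
expr = Harser(DOC).find('div').children('h3').xpath  # '//descendant::div/h3'

# Evaluate the same expression with lxml.
tree = lxml_html.fromstring(DOC)
print([h3.text_content() for h3 in tree.xpath(expr)])
# ['first block', 'second block']
```

Harser can of course return the same data itself via `extract()`; the sketch only shows that the generated expression is standard XPath.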
## Support the project
Please contact [Michael Sinov](mailto:[email protected]?subject=Harser) if you want to support the Harser project.