Easy way for HTML parsing and building XPath
https://github.com/sihaelov/harser


Host: GitHub
URL: https://github.com/sihaelov/harser
Owner: sihaelov
License: mit
Created: 2016-11-30T18:07:04.000Z (about 8 years ago)
Default Branch: master
Last Pushed: 2022-07-06T19:20:57.000Z (over 2 years ago)
Last Synced: 2024-09-26T20:05:13.700Z (5 months ago)
Topics: html, html-parser, parser, python, xpath
Language: Python
Size: 5.86 KB
Stars: 138
Watchers: 5
Forks: 3
Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

starred-awesome - harser - Easy way for HTML parsing and building XPath (Python)

README

        
# Harser

[![Build Status](https://travis-ci.org/sihaelov/harser.svg?branch=master)](https://travis-ci.org/sihaelov/harser) [![Coverage Status](https://img.shields.io/codecov/c/github/sihaelov/harser.svg)](https://codecov.io/gh/sihaelov/harser) [![Wheel Status](https://img.shields.io/badge/wheel-yes-brightgreen.svg)](https://pypi.python.org/pypi/harser) ![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg) [![PyPI Version](https://img.shields.io/pypi/v/harser.svg)](https://pypi.python.org/pypi/harser)

Harser is a library for easily extracting data from HTML and building XPath expressions.

## Installation

```bash
pip install harser
```

## Examples

```python
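>>> from harser import Harser

# Tag names and attribute values in this document are inferred from the
# queries below; treat them as assumptions.
>>> HTML = '''
<html>
<body>
    <div class="header" id="id-header">
        <li class="nav-item" data-nav="first-item" href="/nav1">First item</li>
        <li class="nav-item" data-nav="second-item" href="/nav2">Second item</li>
        <li class="nav-item" data-nav="third-item" href="/nav3">Third item</li>
    </div>
    <div>First layer
        <h3>Lorem Ipsum</h3>
        <span>Dolor sit amet</span>
    </div>
    <div>Second layer</div>
    <div>Third layer
        <span>first block</span>
        <span>second block</span>
        <span>third block</span>
    </div>
    <span>fourth layer</span>
    <div class="footer" id="foobar">
        <h3 some-attr="hey">
            <span id="foo-span">foo ter</span>
        </h3>
    </div>
</body>
</html>
'''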

>>> harser = Harser(HTML)

>>> harser.find('div', class_='header').children(class_='nav-item').find('text').extract()
['First item', 'Second item', 'Third item']

# Or just:
# harser.find(class_='nav-item').find('text').extract()

>>> harser.find(class_='nav-item').get_attr('href').extract()
['/nav1', '/nav2', '/nav3']

# The following two calls are equivalent:
>>> harser.find('div', class_='header', id='id-header')
>>> harser.find('div', attrs={'class': 'header', 'id': 'id-header'})

>>> harser.find(id__contains='bar').get_attr('class').extract()
['footer']

>>> harser.find(href__not_contains='2').find('text').extract()
['First item', 'Third item']

>>> harser.find(attrs={'data-nav__contains': 'second'}).next_siblings().find('text').extract()
['Third item']

>>> harser.find('li').parent().next_siblings(filters={'text__contains': 'Second'}).clean_extract()
['<div>Second layer</div>']

>>> harser.find('h3', filters={'span.@id__starts_with': 'foo'}).get_attr('some-attr').extract()
['hey']

>>> harser.find('div').children('h3').xpath
'//descendant::div/h3'
```
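
The lookup suffixes used above (`__contains`, `__not_contains`, `__starts_with`) read naturally as XPath 1.0 predicates. The sketch below shows one way such keyword filters could be translated into an XPath string. It is not Harser's implementation: `build_xpath`, `build_predicate`, `xpath_literal` and the `LOOKUPS` table are hypothetical names introduced for illustration, and the exact strings Harser generates may differ.

```python
# Hypothetical helpers, not part of Harser: they illustrate how lookup
# suffixes could map onto standard XPath 1.0 functions.
LOOKUPS = {
    "contains": "contains(@{attr}, {value})",
    # Require the attribute to exist, so elements without it do not match.
    "not_contains": "@{attr} and not(contains(@{attr}, {value}))",
    "starts_with": "starts-with(@{attr}, {value})",
}


def xpath_literal(value):
    """Quote a string as an XPath 1.0 literal (XPath 1.0 has no escapes)."""
    if "'" not in value:
        return "'" + value + "'"
    if '"' not in value:
        return '"' + value + '"'
    raise ValueError("value mixes both quote types; concat() would be needed")


def build_predicate(key, value):
    """Turn 'attr' or 'attr__lookup' plus a value into one XPath predicate."""
    attr, sep, lookup = key.partition("__")
    attr = attr.rstrip("_")  # class_ -> class, as in the keyword arguments above
    literal = xpath_literal(value)
    if not sep:
        return "[@{}={}]".format(attr, literal)  # plain equality test
    template = LOOKUPS[lookup]  # raises KeyError for an unknown suffix
    return "[" + template.format(attr=attr, value=literal) + "]"


def build_xpath(tag="*", **filters):
    """Assemble a descendant query for `tag` with one predicate per filter."""
    predicates = "".join(build_predicate(k, v) for k, v in filters.items())
    return "//descendant::{}{}".format(tag, predicates)


print(build_xpath("div", class_="header", id="id-header"))
# //descendant::div[@class='header'][@id='id-header']
print(build_xpath(id__contains="bar"))
# //descendant::*[contains(@id, 'bar')]
print(build_xpath(href__not_contains="2"))
# //descendant::*[@href and not(contains(@href, '2'))]
```

Attribute names that are not valid Python identifiers, such as `data-nav`, cannot be passed as keyword arguments, which is presumably why the examples above use an `attrs` dictionary for them. In this sketch the same case is handled by calling `build_predicate('data-nav__contains', 'second')` directly.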

## Support the project

Please contact [Michael Sinov](mailto:[email protected]?subject=Harser) if you want to support the Harser project.