https://github.com/vkolev/parsy

HTML parsing library with YAML definitions

![Logo](https://raw.githubusercontent.com/vkolev/parsy/master/images/parsy-logo.png)

![CI](https://github.com/vkolev/parsy/actions/workflows/main.yml/badge.svg?branch=master) ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pyparsy) ![PyPI](https://img.shields.io/pypi/v/pyparsy)

# PyParsy

PyParsy is an HTML parsing library driven by YAML definition files. The idea is to use the YAML file as
a statement of intent: you describe the result you want and let PyParsy do the heavy lifting. The main
difference from similar libraries (e.g. [selectorlib](https://selectorlib.com/)) is that it
supports multiple versions of a selector for a single field. This way you do not need to create a new
YAML definition file every time a website changes.
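
The idea of keeping several selector versions for one field can be sketched in plain Python. This is only an illustration of the fallback concept using the standard `re` module. It is not PyParsy's API or implementation, and `extract_first`, the sample pages, and the patterns are all hypothetical names made up for the example.

```python
import re
from typing import Optional, Sequence


def extract_first(html: str, patterns: Sequence[str]) -> Optional[str]:
    """Try each regex in order and return the first captured group that matches."""
    for pattern in patterns:
        match = re.search(pattern, html)
        if match:
            return match.group(1).strip()
    return None  # no selector version matched


# Two hypothetical versions of the same page: the site renamed a CSS class.
old_page = '<h1 class="title">Blue Widget</h1>'
new_page = '<h1 class="product-name">Blue Widget</h1>'

# Both selector versions belong to one field, so a single definition
# keeps working across both layouts.
title_patterns = [
    r'<h1 class="title">(.*?)</h1>',
    r'<h1 class="product-name">(.*?)</h1>',
]

print(extract_first(old_page, title_patterns))
print(extract_first(new_page, title_patterns))
```

Ordering the patterns from newest to oldest layout (or the reverse) is a design choice; the sketch simply returns the first one that matches.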

The YAML files contain:
- The desired structure of the output
- XPath/CSS/Regex selectors for the element extraction
- Return type definition
- Optional children of the field
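
To make the list above concrete, here is a rough sketch of what such a definition might look like once loaded, written as the Python dict a YAML loader would produce. The field names, the `children` key, and the `return_type` values are assumptions made for illustration, and `validate` is a hypothetical helper, not PyParsy's own validation.

```python
from typing import Any, Dict, List

ALLOWED_SELECTOR_TYPES = {"XPATH", "CSS", "REGEX"}

# Hypothetical definition: output structure, selectors, return types, children.
definition: Dict[str, Any] = {
    "title": {
        "selector": "//h1/text()",
        "selector_type": "XPATH",
        "return_type": "STRING",  # assumed value
    },
    "reviews": {
        "selector": "div.review",
        "selector_type": "CSS",
        "multiple": True,
        "return_type": "MAP",  # assumed value
        "children": {  # assumed key name for nested fields
            "author": {
                "selector": "span.author",
                "selector_type": "CSS",
                "return_type": "STRING",
            },
        },
    },
}


def validate(fields: Dict[str, Any], path: str = "") -> List[str]:
    """Collect human-readable problems found in a definition dict."""
    errors: List[str] = []
    for name, spec in fields.items():
        where = f"{path}{name}"
        if "selector" not in spec:
            errors.append(f"{where}: missing selector")
        if spec.get("selector_type") not in ALLOWED_SELECTOR_TYPES:
            errors.append(f"{where}: invalid selector_type")
        # Recurse into nested fields, if any.
        errors.extend(validate(spec.get("children", {}), path=f"{where}."))
    return errors


print(validate(definition))
```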

## Features

- [x] YAML File definitions
- [x] YAML File validation
- [x] Intent instead of coding
- [x] Support for XPath, CSS and Regex selectors
- [ ] Different output formats e.g. JSON, YAML, XML
- [x] Somewhat opinionated
- [x] 99% test coverage

## Installation

Using pip:
```shell
pip install pyparsy
```

## Running Tests

To run the tests, run the following command:

```bash
poetry run pytest
```

## YAML Structure

- `<field_name>:` The field name is the top-level key of the YAML
- `selector:` The selector expression
- `selector_type:` The type of the selector expression, one of `XPATH`, `CSS`, `REGEX`
- `multiple:` *[Optional]* `true` returns all matching results as a list, `false` returns the first matching result
- `return_type:` `