https://github.com/vkolev/parsy
HTML parsing library with YAML definitions
html python yaml
- Host: GitHub
- URL: https://github.com/vkolev/parsy
- Owner: vkolev
- License: mit
- Created: 2022-11-28T07:29:09.000Z (over 3 years ago)
- Default Branch: master
- Last Pushed: 2022-12-16T07:16:10.000Z (over 3 years ago)
- Last Synced: 2025-09-18T08:29:58.336Z (9 months ago)
- Topics: html, python, yaml
- Language: HTML
- Homepage:
- Size: 950 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# PyParsy
PyParsy is an HTML parsing library driven by YAML definition files. The YAML file acts as a
statement of intent: you describe the result you want and let Parsy do the heavy lifting. The
main difference from similar libraries (e.g. [selectorlib](https://selectorlib.com/)) is that it
supports multiple versions of a selector for a single field, so you do not need to create a new
YAML definition file every time a website changes.
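The idea of keeping several selector versions for one field can be sketched in plain Python. This is only an illustration of the fallback concept, not PyParsy's own API or schema: the `PRICE_SELECTORS` list, the `extract_first` helper and the sample HTML are all hypothetical, and only the standard library `re` module is used.

```python
import re

# Hypothetical example: two regex selectors for the same field, ordered from
# the newest page layout to the oldest. This is NOT PyParsy's actual schema.
PRICE_SELECTORS = [
    r'<span class="price-new">([^<]+)</span>',
    r'<span class="price">([^<]+)</span>',
]


def extract_first(html, patterns):
    """Return the first capture group of the first pattern that matches, else None."""
    for pattern in patterns:
        match = re.search(pattern, html)
        if match:
            return match.group(1)
    return None


# The same selector list handles both the old and the new page layout.
old_page = '<div><span class="price">19.99</span></div>'
new_page = '<div><span class="price-new">24.50</span></div>'

print(extract_first(old_page, PRICE_SELECTORS))
print(extract_first(new_page, PRICE_SELECTORS))
```

Because the selectors are tried in order, a site redesign only requires adding one more entry to the list rather than maintaining a second definition.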
The YAML files contain:
- The desired structure of the output
- XPath/CSS/Regex selectors for the element extraction
- Return type definition
- Optional children of the field
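As a rough sketch of how these four parts might fit together, the definition below is written as a Python dictionary instead of YAML so that it runs without extra dependencies. The key names `selector`, `selector_type`, `multiple` and `return_type` come from the YAML structure described in this README. The `children` key name, the `STRING` and `MAP` return type values, the decision to treat three keys as required, and the `find_problems` helper are assumptions made for illustration and may not match PyParsy's real schema or validator.

```python
# Selector types named in this README.
ALLOWED_SELECTOR_TYPES = {"XPATH", "CSS", "REGEX"}

# Hypothetical definition; "children", "STRING" and "MAP" are assumed names.
definition = {
    "title": {
        "selector": "//h1/text()",
        "selector_type": "XPATH",
        "return_type": "STRING",
    },
    "reviews": {
        "selector": "div.review",
        "selector_type": "CSS",
        "multiple": True,
        "return_type": "MAP",
        "children": {
            "author": {
                "selector": "span.author",
                "selector_type": "CSS",
                "return_type": "STRING",
            },
        },
    },
}


def find_problems(fields, path=""):
    """Collect human-readable problems for every field, including nested children."""
    problems = []
    for name, spec in fields.items():
        where = f"{path}{name}"
        # Assumption: these three keys are treated as required in this sketch.
        for key in ("selector", "selector_type", "return_type"):
            if key not in spec:
                problems.append(f"{where}: missing '{key}'")
        selector_type = spec.get("selector_type")
        if selector_type is not None and selector_type not in ALLOWED_SELECTOR_TYPES:
            problems.append(f"{where}: unsupported selector_type '{selector_type}'")
        # Recurse into optional child fields.
        problems.extend(find_problems(spec.get("children", {}), path=f"{where}."))
    return problems


print(find_problems(definition))
```

A definition with a misspelled or unsupported selector type would show up in the returned list with the dotted path of the offending field, which makes nested mistakes easy to locate.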
## Features
- [x] YAML File definitions
- [x] YAML File validation
- [x] Intent instead of coding
- [x] Support for XPath, CSS and Regex selectors
- [ ] Different output formats e.g. JSON, YAML, XML
- [x] Somewhat opinionated
- [x] 99% coverage
## Installation
Using pip:
```shell
pip install pyparsy
```
## Running Tests
To run the tests, use the following command:
```bash
poetry run pytest
```
## YAML Structure
- `<field_name>:` - The field name is the top-level key of the YAML file
- `selector:` - The selector expression
- `selector_type:` - The type of the selector expression, one of `XPATH`, `CSS`, `REGEX`
- `multiple:` *[Optional]* `true` returns all matching results as a list, `false` returns only the first matching result
- `return_type:` `