Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/orf/html-query
jq, but for HTML
- Host: GitHub
- URL: https://github.com/orf/html-query
- Owner: orf
- License: mit
- Created: 2022-11-30T22:12:30.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-06-26T00:47:35.000Z (5 months ago)
- Last Synced: 2024-10-12T14:38:52.743Z (29 days ago)
- Topics: html, json, parser, rust
- Language: HTML
- Homepage: https://orf.github.io/html-query/
- Size: 1.23 MB
- Stars: 630
- Watchers: 7
- Forks: 7
- Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# hq
[![Crates.io](https://img.shields.io/crates/v/html-query.svg)](https://crates.io/crates/html-query)
jq, but for HTML. [Try it in your browser here](https://orf.github.io/html-query/)
![](./images/readme-example.gif)
`hq` reads HTML and converts it into a JSON object based on a series of CSS selectors. The selectors are expressed
in a similar way to JSON, but where the values are CSS selectors. For example:

```
{posts: .athing | [ {title: .titleline > a, url: .titleline > a | @(href)} ] }
```

This selects all `.athing` elements and creates an array (`| [{...}]`) containing one object per selected element.
For each element, it selects the text of the `.titleline > a` element and its `href` attribute (`| @(href)`).

The end result is the following structure:

```json
{
  "posts": [
    {
      "title": "...",
      "url": "..."
    }
  ]
}
```

## Install
`brew install hq`, or `cargo install html-query`
## Special query syntax
### Text
`.foo | @text`
This will select the text content from the first element matching `.foo`.
### Selecting attributes
`.foo | @(href)`
This will select the `href` attribute from the first element matching `.foo`.
### Parents
`.foo | @parent`
This will return the parent element from the first element matching `.foo`.
### Siblings
`.foo | @sibling(1)`
This will return the sibling element from the first element matching `.foo`.
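
To make the four operators above concrete, here is a rough Python analogue built on the standard library's `xml.etree.ElementTree`. It is only a sketch of the semantics described in this section, not how `hq` is implemented. The sample markup and the helper names (`first`, `text_of`, `parent_of`, `sibling`) are made up for illustration, and reading `@sibling(1)` as "the element one position after the match" is an assumption.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample markup (ElementTree needs valid XML).
DOC = ET.fromstring(
    "<div>"
    "<a class='foo' href='/x'>Hello</a>"
    "<span class='bar'>Next</span>"
    "</div>"
)

# ElementTree has no parent pointers, so build a child -> parent map once.
PARENTS = {child: parent for parent in DOC.iter() for child in parent}


def first(root, cls):
    """First descendant whose class attribute equals `cls`, like `.foo`."""
    return root.find(f".//*[@class='{cls}']")


def text_of(el):
    """All text inside `el`, roughly `| @text`."""
    return "".join(el.itertext())


def parent_of(el):
    """Parent of `el`, roughly `| @parent`."""
    return PARENTS[el]


def sibling(el, offset):
    """Element `offset` positions after `el` under the same parent.

    This is the assumed meaning of `| @sibling(n)`.
    """
    siblings = list(PARENTS[el])
    return siblings[siblings.index(el) + offset]


foo = first(DOC, "foo")
print(text_of(foo))                   # .foo | @text       -> Hello
print(foo.get("href"))                # .foo | @(href)     -> /x
print(parent_of(foo).tag)             # .foo | @parent     -> div
print(sibling(foo, 1).get("class"))   # .foo | @sibling(1) -> bar
```

Each helper works on the first match only, mirroring the "first element matching `.foo`" wording used throughout this section.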
## Examples
### Full hacker news story extraction
```
{posts: .athing | [{href: .titleline > a | @(href), title: .titleline > a, meta: @sibling(1) | {user: .hnuser, posted: .age | @(title) }}]}
```

This selects each `.athing` element and extracts the URL from the `href` attribute as well as the title. It then selects
the _sibling_ of the `.athing` element and extracts the user and post time from it:

```json
{
  "posts": [
    {
      "title": "...",
      "href": "...",
      "meta": {
        "posted": "...",
        "user": "..."
      }
    }
  ]
}
```
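
For readers who want to see the same extraction spelled out step by step, the following Python sketch walks a small table and builds the nested structure by hand. It is an illustration of what the query describes, not `hq`'s implementation: the sample rows, the `extract` function name, and the assumption that the metadata row is the element immediately after each `.athing` row are all hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Made-up sample shaped like a story listing: each story row (`.athing`)
# is followed by a sibling row holding the user and the post time.
TABLE = ET.fromstring(
    "<table>"
    "<tr class='athing'><td><span class='titleline'>"
    "<a href='https://example.com/a'>First story</a></span></td></tr>"
    "<tr><td><a class='hnuser'>alice</a>"
    "<span class='age' title='2024-01-01T00:00:00'>1 hour ago</span></td></tr>"
    "</table>"
)


def extract(table):
    """Build {posts: [...]} with one object per `.athing` row."""
    rows = list(table)
    posts = []
    for i, row in enumerate(rows):
        if row.get("class") != "athing":
            continue
        # .titleline > a
        link = row.find(".//*[@class='titleline']/a")
        # Assumed meaning of @sibling(1): the row right after the match.
        meta_row = rows[i + 1]
        user = meta_row.find(".//*[@class='hnuser']")
        age = meta_row.find(".//*[@class='age']")
        posts.append({
            "href": link.get("href"),            # | @(href)
            "title": "".join(link.itertext()),   # text of the link
            "meta": {
                "user": "".join(user.itertext()),
                "posted": age.get("title"),      # | @(title)
            },
        })
    return {"posts": posts}


result = extract(TABLE)
print(json.dumps(result, indent=2))
```

The object keys come straight from the keys written in the query, which is why the loop body reads almost like the query itself.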