Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ludovicianul/hq
lightweight command line HTML processor using CSS and XPath selectors
cli command-line css-selectors graal-native hacktoberfest html java jq xpath
Last synced: 2 months ago
- Host: GitHub
- URL: https://github.com/ludovicianul/hq
- Owner: ludovicianul
- License: mit
- Created: 2021-09-07T14:25:08.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2023-09-28T07:06:05.000Z (over 1 year ago)
- Last Synced: 2024-08-02T06:17:10.829Z (6 months ago)
- Topics: cli, command-line, css-selectors, graal-native, hacktoberfest, html, java, jq, xpath
- Language: Shell
- Homepage:
- Size: 103 KB
- Stars: 64
- Watchers: 4
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-tools - ludovicianul/hq - lightweight command line HTML processor using CSS and XPath selectors (Command Line / Like jq)
README
# hq
Small utility to parse and grep HTML files. It uses [CSS selectors](https://www.w3schools.com/cssref/css_selectors.asp) or [XPath Selectors](https://www.w3schools.com/xml/xpath_intro.asp) to extract HTML elements.

# Usage
```bash
hq - command line HTML elements finder; version 1.0.0

Usage: hq [-hptV] [-a=] [-f=] [-o=] [-s=] [-x=]
[COMMAND]
The CSS selector
-a, --attribute=
Return only this attribute from the selected HTML elements
-f, --file= The HTML input file. If not supplied it will default to stdin
-h, --help Show this help message and exit.
-o, --output= The output file. If not supplied it will default to stdout
-p, --pretty Force pretty printing the output
-r, --remove= Remove nodes matching given selector
-s, --sanitize= Sanitizes the html input according to the given policy
-t, --text Display only the inner text of the selected HTML top element
-V, --version Print version information and exit.
-x, --xpath= Supply an XPath selector instead of CSS
Commands:
generate-completion Generate bash/zsh completion script for hq.
```
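
As a hedged illustration of how the options above combine, the sketch below writes a small HTML file and queries it with `-f`, `-t`, `-a`, and `-o`. The file names `sample.html` and `links.txt` and the markup are made up for this example; only the flags come from the help text above. The `hq` calls are skipped when the binary is not on your `PATH`.

```shell
#!/usr/bin/env bash
# Create a tiny HTML document to query (hypothetical content).
cat > sample.html <<'EOF'
<html>
  <body>
    <div id="main">
      <p class="intro">Hello <b>world</b></p>
      <a href="https://example.com/docs">Docs</a>
      <a href="/about/">About</a>
    </div>
  </body>
</html>
EOF

if command -v hq >/dev/null 2>&1; then
  # Read from a file instead of stdin and print only the inner text.
  hq "p.intro" -t -f sample.html

  # Extract the href attribute of every link and write the result to a file.
  hq "a" -a href -f sample.html -o links.txt
else
  echo "hq not found on PATH; skipping queries" >&2
fi
```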
# Installation
## Homebrew
```
> brew tap ludovicianul/tap
> brew install ludovicianul/tap/hq
```

## Manual
`hq` is compiled to native code using GraalVM. Check the [release page](https://github.com/ludovicianul/hq/releases/) for binaries (Linux, macOS, uberjar).

After downloading, you can make `hq` globally available:
```bash
sudo cp hq-macos /usr/local/bin/hq
```

The uberjar can be run using `java -jar hq`. It requires Java 11+.
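
If you would rather not use `sudo`, one possible approach is to copy the binary into a per-user directory. The sketch below is only an illustration: `install_hq` is a hypothetical helper (not part of `hq`), and the default target `~/.local/bin` is an assumption that must already be on your `PATH`.

```shell
#!/usr/bin/env bash
# install_hq is a hypothetical helper, not something shipped with hq.
# Usage: install_hq <downloaded-binary> [target-dir]
install_hq() {
  local src="$1"
  local dest_dir="${2:-$HOME/.local/bin}"

  if [ ! -f "$src" ]; then
    echo "no such file: $src" >&2
    return 1
  fi

  mkdir -p "$dest_dir"
  cp "$src" "$dest_dir/hq"
  # Downloaded release binaries are not always marked executable.
  chmod +x "$dest_dir/hq"
  echo "installed $dest_dir/hq"
}
```

For example, `install_hq hq-macos` would copy the downloaded binary to `~/.local/bin/hq`.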
# Autocomplete
Run the following commands to get autocomplete:

```bash
hq generate-completion >> hq_autocomplete
source hq_autocomplete
```

# HTML Sanitizing
`hq` can sanitize HTML output. Supported modes are: `NONE, BASIC, SIMPLE_TEXT, BASIC_WITH_IMAGES, RELAXED`.

This is how sanitization works:

| Policy | Details |
| ------- | ------- |
| `NONE` | Allows only text nodes: all HTML will be stripped. |
| `BASIC` | Allows a fuller range of text nodes: `a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li, ol, p, pre, q, small, span, strike, strong, sub, sup, u, ul`, and appropriate attributes. Does not allow images.|
| `SIMPLE_TEXT` | Allows only simple text formatting: `b, em, i, strong, u`. All other HTML (tags and attributes) will be removed.|
| `BASIC_WITH_IMAGES` | Allows the same text tags as `BASIC`, and also allows `img` tags, with appropriate attributes, with `src` pointing to `http` or `https`. |
| `RELAXED` | Allows a full range of text and structural body HTML: `a, b, blockquote, br, caption, cite, code, col, colgroup, dd, div, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, span, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, u, ul`. |

# Examples
Get the text of the `p` element that is the sixth child of the element with id `main`:
```
➜ curl -s https://www.w3schools.com/cssref/css_selectors.php | hq "#main > p:nth-child(6)" -t
In CSS, selectors are patterns used to select the element(s) you want to style.
```
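
The `-x` option takes an XPath expression in place of a CSS selector. The following is a minimal sketch on a made-up local file, `page.html`, comparing a CSS query with an XPath query that targets the same paragraph. It assumes that no positional CSS selector is needed when `-x` is given; the README does not show an XPath invocation, so treat the exact form as unverified.

```shell
#!/usr/bin/env bash
# Hypothetical input file for the comparison.
cat > page.html <<'EOF'
<html>
  <body>
    <div id="main">
      <h1>Title</h1>
      <p>First paragraph.</p>
    </div>
  </body>
</html>
EOF

if command -v hq >/dev/null 2>&1; then
  # CSS selector: p elements that are direct children of #main.
  hq "#main > p" -t -f page.html

  # XPath selector aimed at the same elements (assumed invocation form).
  hq -x "//*[@id='main']/p" -t -f page.html
else
  echo "hq not found on PATH; skipping queries" >&2
fi
```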
Get the text inside an article:
```
➜ curl -s https://ludovicianul.github.io/2021/07/16/unicode_language_version/ | hq '.post' -t
Make sure you know which Unicode version is supported by your programming language version 16 Jul 2021 While enhancing CATS I recently added a feature to send requests that include
single and multi code point emojis. This is a single code point emoji: 🥶, which can be represented in Java as the \uD83E\uDD76 string. The test case is simple: inject emojis within
strings and expect that the REST endpoint will sanitize the input and remove them entirely (I appreciate this might not be a valid case for all APIs, this is why the behaviour is
configurable in CATS, but not the focus of this article). I usually recommend that any REST endpoint should sanitize input before validating it and remove special characters.
A typical regex for this would be [\p{C}\p{Z}\p{So}]+ (although you should enhance it to allow spaces between words), which means: p{C} - match Unicode invisible Control
Chars (\u000D - carriage return for example) ...
...
```
Sanitize the HTML according to the [specified policy](#html-sanitizing):
```
➜ curl -s https://ludovicianul.github.io/2021/07/16/unicode_language_version/ | hq html -s=BASIC -p
practical thoughts about software engineering
Home
About
GitHub
© 2021. All rights reserved.
Make sure you know which Unicode version is supported by your programming language version
16 Jul 2021
...
```
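
To see how the policies differ, it can help to run each one against the same input. The sketch below does that for a made-up file, `unsafe.html`, whose markup is purely illustrative. The policy names are the ones listed in the [HTML Sanitizing](#html-sanitizing) section; no output is shown because it depends on the installed `hq` version.

```shell
#!/usr/bin/env bash
# Hypothetical input mixing formatting, an image, an event handler, and a script.
cat > unsafe.html <<'EOF'
<html>
  <body>
    <div>
      <p onclick="alert(1)">Hi <b>there</b> <img src="http://example.com/x.png"></p>
      <script>alert(1)</script>
    </div>
  </body>
</html>
EOF

# Policies as listed in the HTML Sanitizing section.
policies="NONE SIMPLE_TEXT BASIC BASIC_WITH_IMAGES RELAXED"

if command -v hq >/dev/null 2>&1; then
  for policy in $policies; do
    echo "== $policy =="
    # Select the whole document and apply the given sanitization policy.
    hq html -s="$policy" -f unsafe.html
  done
else
  echo "hq not found on PATH; skipping sanitization run" >&2
fi
```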
Get all `href` attributes from a given page:
```shell
➜ curl -s https://ludovicianul.github.io | hq "*" -a "href"
http://gmpg.org/xfn/11
https://ludovicianul.github.io/public/css/poole.css
https://ludovicianul.github.io/public/css/syntax.css
https://ludovicianul.github.io/public/css/hyde.css
https://fonts.googleapis.com/css?family=PT+Sans:400,400italic,700|Abril+Fatface
https://ludovicianul.github.io/public/apple-touch-icon-144-precomposed.png
https://ludovicianul.github.io/public/favicon.ico
/atom.xml
https://ludovicianul.github.io/
https://ludovicianul.github.io/
/about/
...
```

# Resources
- [Universal selector in CSS](https://www.scaler.com/topics/universal-selector-in-css/)
- [HTML elements](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)