https://github.com/ludovicianul/hq

lightweight command line HTML processor using CSS and XPath selectors

cli command-line css-selectors graal-native hacktoberfest html java jq xpath

# hq

A small utility to parse and grep HTML files. It
uses [CSS selectors](https://www.w3schools.com/cssref/css_selectors.asp) or [XPath selectors](https://www.w3schools.com/xml/xpath_intro.asp) to extract HTML elements.

# Usage

```bash
hq - command line HTML elements finder; version 1.0.0

Usage: hq [-hptV] [-a=] [-f=] [-o=] [-s=] [-x=]
          [COMMAND]
                      The CSS selector
  -a, --attribute=    Return only this attribute from the selected HTML elements
  -f, --file=         The HTML input file. If not supplied it will default to stdin
  -h, --help          Show this help message and exit.
  -o, --output=       The output file. If not supplied it will default to stdout
  -p, --pretty        Force pretty printing the output
  -r, --remove=       Remove nodes matching given selector
  -s, --sanitize=     Sanitizes the html input according to the given policy
  -t, --text          Display only the inner text of the selected HTML top element
  -V, --version       Print version information and exit.
  -x, --xpath=        Supply an XPath selector instead of CSS
Commands:
  generate-completion Generate bash/zsh completion script for hq.
```
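
The options above can be combined in one invocation. The following is a minimal sketch, not taken from the project itself: the file names `sample.html` and `links.html` and the page content are hypothetical, and the exact output format may differ between `hq` versions, so no output is shown.

```shell
# Create a small sample page (hypothetical content, for illustration only).
cat > sample.html <<'EOF'
<html><body>
  <div id="main">
    <p class="intro">Hello <b>world</b></p>
    <a href="https://example.com/docs">Docs</a>
  </div>
</body></html>
EOF

if command -v hq >/dev/null 2>&1; then
  # CSS selector, reading from a file (-f) and printing only the inner text (-t).
  hq -f sample.html -t '.intro'

  # XPath selector (-x) instead of CSS, writing the matches to a file (-o).
  hq -f sample.html -x '//a' -o links.html
else
  echo "hq is not installed; skipping the examples" >&2
fi
```

Reading from a file with `-f` avoids the `curl ... | hq` pipeline when the HTML is already on disk.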

# Installation

## Homebrew

```
> brew tap ludovicianul/tap
> brew install ludovicianul/tap/hq
```

## Manual

`hq` is compiled to native code using GraalVM. Check
the [release page](https://github.com/ludovicianul/hq/releases/) for binaries (Linux,
macOS, uberjar).

After download, you can make `hq` globally available:

```bash
sudo cp hq-macos /usr/local/bin/hq
```

The uberjar can be run using `java -jar hq`. Requires Java 11+.

# Autocomplete
Run the following commands to get autocomplete:

```bash
hq generate-completion > hq_autocomplete

source hq_autocomplete
```
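
Sourcing the script this way only affects the current shell session. One possible way to make it permanent is sketched below. The function name `install_hq_completion` and both file paths are hypothetical choices, not part of `hq`.

```shell
# Generate the completion script and load it from a shell startup file.
# Both arguments are illustrative; pass whichever paths suit your setup.
install_hq_completion() {
  completion_file="$1"   # where to store the generated script
  rc_file="$2"           # startup file, e.g. ~/.bashrc or ~/.zshrc

  # '>' overwrites the file, so re-running never duplicates its contents.
  hq generate-completion > "$completion_file" || return 1

  # Add the 'source' line only if it is not already present.
  line="source \"$completion_file\""
  grep -qxF -- "$line" "$rc_file" 2>/dev/null || printf '%s\n' "$line" >> "$rc_file"
}
```

For example, `install_hq_completion "$HOME/.hq_autocomplete" "$HOME/.bashrc"` would register the script for new bash sessions.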

# HTML Sanitizing
`hq` can sanitize HTML according to a policy. Supported policies are: `NONE, BASIC, SIMPLE_TEXT, BASIC_WITH_IMAGES, RELAXED`.

Each policy allows the following:

| Policy | Details |
| ------- | ------- |
| `NONE` | Allows only text nodes: all HTML will be stripped. |
| `BASIC` | Allows a fuller range of text nodes: `a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li, ol, p, pre, q, small, span, strike, strong, sub, sup, u, ul`, and appropriate attributes. Does not allow images.|
| `SIMPLE_TEXT` | Allows only simple text formatting: `b, em, i, strong, u`. All other HTML (tags and attributes) will be removed.|
| `BASIC_WITH_IMAGES` | Allows the same text tags as `BASIC`, and also allows `img` tags, with appropriate attributes, with `src` pointing to `http` or `https`.|
| `RELAXED` | Allows a full range of text and structural body HTML: `a, b, blockquote, br, caption, cite, code, col, colgroup, dd, div, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, span, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, u, ul`.|
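
To see how the policies differ in practice, it can help to run the same input through each of them. The sketch below assumes the `hq html -s=POLICY` invocation form used in the Examples section. The function name `sanitize_with_all_policies` is hypothetical.

```shell
# Print the result of every sanitizing policy for the HTML passed as $1.
sanitize_with_all_policies() {
  for policy in NONE SIMPLE_TEXT BASIC BASIC_WITH_IMAGES RELAXED; do
    echo "== $policy =="
    # Select the whole document ('html') and sanitize it with this policy.
    printf '%s\n' "$1" | hq html -s="$policy"
  done
}
```

Calling it with a snippet that mixes formatting, links, and images, such as `sanitize_with_all_policies '<p>Hi <b>there</b> <a href="https://example.com">link</a></p>'`, should make the differences in the table visible side by side.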

# Examples

Get the text of the paragraph matching `#main > p:nth-child(6)`:

```
➜ curl -s https://www.w3schools.com/cssref/css_selectors.php | hq "#main > p:nth-child(6)" -t

In CSS, selectors are patterns used to select the element(s) you want to style.

```

Get the text inside an article:

```
➜ curl -s https://ludovicianul.github.io/2021/07/16/unicode_language_version/ | hq '.post' -t

Make sure you know which Unicode version is supported by your programming language version 16 Jul 2021 While enhancing CATS I recently added a feature to send requests that include
single and multi code point emojis. This is a single code point emoji: 🥶, which can be represented in Java as the \uD83E\uDD76 string. The test case is simple: inject emojis within
strings and expect that the REST endpoint will sanitize the input and remove them entirely (I appreciate this might not be a valid case for all APIs, this is why the behaviour is
configurable in CATS, but not the focus of this article). I usually recommend that any REST endpoint should sanitize input before validating it and remove special characters.
A typical regex for this would be [\p{C}\p{Z}\p{So}]+ (although you should enhance it to allow spaces between words), which means: p{C} - match Unicode invisible Control
Chars (\u000D - carriage return for example) ...
...
```

Sanitize the HTML according to the [specified policy](#html-sanitizing):
```
➜ curl -s https://ludovicianul.github.io/2021/07/16/unicode_language_version/ | hq html -s=BASIC -p



m's blog

practical thoughts about software engineering


Home
About
GitHub

© 2021. All rights reserved.


Make sure you know which Unicode version is supported by your programming language version
16 Jul 2021


...

```

Get all `href` attributes from a given page:

```shell
➜ curl -s https://ludovicianul.github.io | hq "*" -a "href"
http://gmpg.org/xfn/11
https://ludovicianul.github.io/public/css/poole.css
https://ludovicianul.github.io/public/css/syntax.css
https://ludovicianul.github.io/public/css/hyde.css
https://fonts.googleapis.com/css?family=PT+Sans:400,400italic,700|Abril+Fatface
https://ludovicianul.github.io/public/apple-touch-icon-144-precomposed.png
https://ludovicianul.github.io/public/favicon.ico
/atom.xml
https://ludovicianul.github.io/
https://ludovicianul.github.io/
/about/
...
```
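
The list above mixes absolute and relative links and contains duplicates. Because the output appears to be one value per line, it can be post-processed with standard tools. The helper below is a sketch under that assumption. `absolute_links` is a hypothetical name and not part of `hq`.

```shell
# Keep only absolute http(s) URLs from stdin, de-duplicated and sorted.
absolute_links() {
  grep -E '^https?://' | sort -u
}

# Usage (requires network access and hq):
#   curl -s https://ludovicianul.github.io | hq "*" -a "href" | absolute_links
```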

# Resources

- [Universal selector in CSS](https://www.scaler.com/topics/universal-selector-in-css/)
- [HTML elements](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)