https://github.com/TUVIMEN/reliq


# reliq

reliq is an HTML parsing and searching tool.

## Building

### Build

make install

### Build library

make lib-install

### Build linked

make linked

## Usage

Get some help

man reliq

Get `div` tags with class `tile`.

reliq 'div class="tile"'

Get `div` tags with class `tile` as a word and id `current` as a word.

reliq 'div .tile #current' index.html

Get tags that do not have any tags inside them from file `index.html`.

reliq '* c@[0]' index.html

Get empty tags from file `index.html`.

reliq '* m@>[0]' index.html

Get hyperlinks at a level greater than or equal to 6 from file `index.html`.

reliq 'a href @l[6:] | "%(href)v\n"' index.html

Get any tag with class `cont` and without id starting with `img-`.

reliq '* .cont -#b>img-' index.html

Get hyperlinks ending with `/[0-9]+.html`

reliq 'a href=Ee>/[0-9]+\.html | "%(href)a\n"' index.html

Get `ul` tags and html inside `i` tags that are inside `p` tags.

reliq 'ul, p; i | "%i\n"' index.html

Get the last images in every `li` with `id` matching extended regex `img-[0-9]+`

reliq 'li #E>img-[0-9]+; img src [-1] | "%(src)v\n"' index.html

Get `tr` and `td` inside `table` tag.

reliq 'table; { tr, td }' index.html

Process output using `cut` for each tag, and with `sed` and `tr` for the whole output

reliq 'div #B>msg_[0-9-]* | "%(id)v" cut [2] "-" / sed "s/^msg_//" tr "\n" "\t"' index.html

Process output in a block (note that things in blocks have to be separated by `,`, as a newline alone is not enough)

reliq '
{
div class=B>"memberProfileBanner memberTooltip-header.*" style=a>"url(" | "%(style)v" / sed "s#.*url(##;s#^//#https://#;s/?.*//;p;q" "n",
img src | "%(src)v" / sed "s/?.*//; q"
} / sed "s#^//#https://#"
' index.html

Get `tr` in `table` with level relative to `table` equal to `1`, process it individually for every tag in the block, and at the end of each block delete every `\n` character and append `\n` at the end

reliq '
table border=1; tr l@[1]; {
td; * c@[0] | "%i\t"
} | tr "\n" echo "" "\n"
' index.html

The same, but process all tags in the block at once, then process the final output of the block by deleting all `\n` and appending `\n` at the end. The example above creates tsv where each `tr` has its own line, but this example creates only one line

reliq '
table border=1; tr; {
td; * c@[0] | "%i\t"
} / tr "\n" echo "" "\n"
' index.html
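The tsv produced by the first of these two examples can be read back with a few lines of Python. This is a minimal sketch, not part of reliq: `parse_rows` is a hypothetical helper, and it assumes every cell was written as `"%i\t"`, so each line ends with one extra tab.

```python
def parse_rows(output: str) -> list[list[str]]:
    """Turn tab-separated output into a list of rows of cells."""
    rows = []
    for line in output.split("\n"):
        if not line:
            continue  # skip the empty piece after the final newline
        cells = line.split("\t")
        # Every cell is followed by a tab, so the split leaves one
        # trailing empty string that is not a real cell.
        if cells[-1] == "":
            cells.pop()
        rows.append(cells)
    return rows


if __name__ == "__main__":
    # Hypothetical output: two table rows with two cells each.
    print(parse_rows("a\tb\t\nc\td\t\n"))
```

With the second example the whole table arrives as a single row, so the same helper returns one list holding every cell.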

### JSON like output

The output of an expression can be assigned to a field. Fields are combined into a JSON-like structure that is written at the end of the output; if an expression is not assigned to a field, its output is written before the structure.

reliq '.links.a a href | "%(href)v\n", img src | "%(src)v\n"'

will return

```
/static/images/icons/wikipedia.png
/static/images/mobile/copyright/wikipedia-wordmark-en.svg
/static/images/mobile/copyright/wikipedia-tagline-en.svg
//upload.wikimedia.org/wikipedia/commons/thumb/0/08/Sodium_nitrite.svg/220px-Sodium_nitrite.svg.png
https//upload.wikimedia.org/wikipedia/commons/thumb/a/a6/Nitrite-3D-vdW.png/110px-Nitrite-3D-vdW.png
https//upload.wikimedia.org/wikipedia/commons/thumb/0/05/Sodium-3D.png/80px-Sodium-3D.png
{"links":["/wiki/Main_Page","/wiki/Wikipedia:Contents","/wiki/Portal:Current_events","/wiki/Special:Random","/wiki/Wikipedia:About"]}
```

Note that from now on any JSON structure in examples will be prettified for ease of reading; in reality reliq returns compressed JSON.
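A compressed structure can be prettified by re-serializing it. The following is a minimal sketch in Python using only the standard `json` module; `prettify` and the sample string are hypothetical and assume the structure has already been captured as a string.

```python
import json


def prettify(compressed: str, indent: int = 4) -> str:
    """Re-serialize a compressed JSON string with indentation for reading."""
    # json.loads raises json.JSONDecodeError if the text is not valid JSON.
    return json.dumps(json.loads(compressed), indent=indent, ensure_ascii=False)


if __name__ == "__main__":
    # Hypothetical sample standing in for the structure written by reliq.
    sample = '{"links":["/wiki/Main_Page","/wiki/Wikipedia:Contents"]}'
    print(prettify(sample))
```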

It is a JSON-like structure because reliq does not enforce JSON output, as the example above shows. If the output is connected directly to a JSON parser, an error might occur when incorrect changes are made to the reliq script.

It also does not check for repeating field names e.g.

reliq '.a span #views | "%i", .a time datetime | "%(datetime)v\a"'

will return incorrect json

```
{
"a": "29423",
"a": "2024-07-19 07:12:05"
}
```

So whether reliq returns valid JSON depends on the user!
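Repeated field names can be caught on the consuming side. Python's `json` module, for instance, accepts such input without complaint and keeps only the last value, so a stricter loader is needed to notice the problem. A minimal sketch, where `reject_duplicates` and `load_strict` are hypothetical helper names:

```python
import json


def reject_duplicates(pairs):
    """object_pairs_hook that raises when an object repeats a key."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate field name: {key!r}")
        seen[key] = value
    return seen


def load_strict(text: str):
    """Parse JSON, refusing objects with repeated keys."""
    return json.loads(text, object_pairs_hook=reject_duplicates)


if __name__ == "__main__":
    doc = '{"a": "29423", "a": "2024-07-19 07:12:05"}'
    # The default parser silently keeps only the last value for "a".
    print(json.loads(doc))
    try:
        load_strict(doc)
    except ValueError as err:
        print(err)
```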

A field is specified by a `.` followed by a string matching `[A-Za-z0-9_-]+`, placed before the expression, e.g.

reliq '.some_fiELD-1 p | "%i"'
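When scripts are generated programmatically, a candidate name can be checked against that pattern first. A minimal sketch in Python; `is_valid_field_name` is a hypothetical helper, and it checks only the name itself, without the leading `.` or anything that follows the name.

```python
import re

# Pattern quoted from the text; it describes the name that follows the leading '.'.
FIELD_NAME = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_field_name(name: str) -> bool:
    """Return True if the whole string matches the field-name pattern."""
    return FIELD_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("some_fiELD-1", "some field", ""):
        print(repr(candidate), is_valid_field_name(candidate))
```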

Fields can only be specified before an expression; inside an expression they are not recognized as fields, which produces incorrect patterns, e.g.

reliq 'ul; .a li' index.html

This will not return a JSON-like structure, or anything at all, since `.a` is treated as the plain name of a tag, which cannot exist in HTML.

Fields take precedence over everything

reliq '
.field dd; {
time datetime | "%(datetime)v\a",
a m@v>"<" | "%i\a",
* l@[0] | "%i"
} / tr "\n" sed "s/^\a*//;s/\a.*//"
' index.html

Fields can be nested in other fields

reliq '
.user {
h4 class=b>message-name,
* .MessageCard__user-info__name
}; {
.name [0] * c@[0] | "%i",
.link [0] a class href | "%(href)v",
.id.u [0] * data-user-id | "%(data-user-id)v"
},
' index.html

will return

```
{
"user": {
"name": "TUVIMEN",
"link": "https://examplesite.com/u/TUVIMEN/",
"id": 1245
}
}
```

If a field that nests other fields contains an expression without a field, that expression will not be assigned to the nesting field; its output is written normally before the JSON structure, breaking compatibility with JSON, e.g.

reliq '
.user div #user; {
.name span .user-name | "%i",
.link [0] a href | "%(href)v",
div .user-avatar; img src | "%(src)v\n"
}
' index.html

will return

```
https://exampleother.com/a/8122.jpg
{
"user": {
"name": "hexderm",
"link": "https://exampleother.com/users/8122"
}
}
```

This further underlines the importance of writing a correct script.
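A consumer that has to cope with such mixed output can separate the leading plain text from the structure. The following is a minimal sketch in Python, not something reliq provides: `split_output` is a hypothetical helper, and it assumes the structure, when present, is compressed onto the final line of the output.

```python
import json


def split_output(output: str):
    """Split mixed output into (plain_text, parsed_structure).

    Assumes the structure, if any, occupies the final line of the output.
    Returns (output, None) when the final line is not a JSON object.
    """
    body = output.rstrip("\n")
    head, sep, last = body.rpartition("\n")
    try:
        structure = json.loads(last)
    except json.JSONDecodeError:
        return output, None
    if not isinstance(structure, dict):
        return output, None
    return head + sep, structure


if __name__ == "__main__":
    # Compressed form of the example output shown above.
    mixed = (
        "https://exampleother.com/a/8122.jpg\n"
        '{"user":{"name":"hexderm","link":"https://exampleother.com/users/8122"}}\n'
    )
    text, data = split_output(mixed)
    print(repr(text))
    print(data)
```

Splitting on the consumer side only works around the problem; assigning every expression to a field keeps the output parseable as a whole.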

Unless fields are nested, they will be on the same level no matter where they appear in the script

reliq '
.signature div .signature #B>sig[0-9]* | "%i",
dl .postprofile #B>profile[0-9]*; {
dt l@[1]; {
.avatar img src | "%(src)v",
.user a c@[0] | "%i",
.userid.u a href c@[0] | "%(href)v" / sed "s/.*[&;]u=([0-9]+).*/\1/" "E",
},
.userinfo.a("\a") dd l@[1] m@vf>" " | "%i\a" / tr "\n\t" sed "
s/([^<]*)<\/strong>/\1/g
s/ +:/:/
/