{"id":22330436,"url":"https://github.com/tuvimen/reliq","last_synced_at":"2025-07-29T19:32:38.090Z","repository":{"id":74846554,"uuid":"325020145","full_name":"TUVIMEN/reliq","owner":"TUVIMEN","description":"HTML parsing and searching tool","archived":false,"fork":false,"pushed_at":"2024-11-28T14:12:20.000Z","size":2787,"stargazers_count":12,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2024-11-28T15:24:08.654Z","etag":null,"topics":["c","html","parsing","searching"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TUVIMEN.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-12-28T13:28:24.000Z","updated_at":"2024-11-28T14:12:25.000Z","dependencies_parsed_at":"2023-12-10T16:22:41.167Z","dependency_job_id":"8b22cebb-9e2d-4b9f-a4eb-373d279840cf","html_url":"https://github.com/TUVIMEN/reliq","commit_stats":null,"previous_names":["tuvimen/reliq","tuvimen/hgrep"],"tags_count":17,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TUVIMEN%2Freliq","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TUVIMEN%2Freliq/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TUVIMEN%2Freliq/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TUVIMEN%2Freliq/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TUVIMEN","download_url":"https://codeload.github.com/TU
VIMEN/reliq/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":228041296,"owners_count":17860221,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["c","html","parsing","searching"],"created_at":"2024-12-04T04:06:53.819Z","updated_at":"2025-07-29T19:32:38.067Z","avatar_url":"https://github.com/TUVIMEN.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"# reliq\n\nHTML parser with its own matching language. Benchmark [here](https://github.com/TUVIMEN/reliq-python#benchmark), examples of syntax with highlighting [here](https://github.com/TUVIMEN/reliq-vim).\n\n## Installation\n\n### AUR\n\n```shell\nyay -S reliq\n```\n\n### CLI tool\n\n```shell\nmake install\n```\n\n### Library only\n\n```shell\nmake lib-install\n```\n\n### Linked CLI tool and library\n\n```shell\nmake linked\n```\n\n## Build options\n\n```makefile\nD := 0 # debug build\nS := 0 # sanitizer build\n\nO_SMALL_STACK := 0 # change internal limits for a small stack\nO_PHPTAGS := 1 # support for \u003c?php ?\u003e\nO_AUTOCLOSING := 1 # html quirks, can be disabled if you close your tags correctly\n\n# only one should be set\nO_HTML_FULL := 0 # full html sizes, you probably don't know what you're doing\nO_HTML_SMALL := 1 # reasonable html sizes\nO_HTML_VERY_SMALL := 0 # this can be set if you don't put thousands of spaces between attributes\n```\n\n## Usage\n\nThe manual provides coarse but complete documentation of the tool and its expression language.\n\n```shell\nman reliq\n```\n\nI recommend reading it with colors 
like\n\n```bash\nman() {\n    LESS_TERMCAP_md=$'\\e[01;31m' \\\n    LESS_TERMCAP_me=$'\\e[0m' \\\n    LESS_TERMCAP_se=$'\\e[0m' \\\n    LESS_TERMCAP_so=$'\\e[01;32;31m' \\\n    LESS_TERMCAP_ue=$'\\e[0m' \\\n    LESS_TERMCAP_us=$'\\e[01;34m' \\\n    command man \"$@\"\n}\n```\n\nGet `div` tags with class `tile`.\n\n```shell\nreliq 'div class=\"tile\"'\n```\n\nGet `div` tags with class `tile` as a word and id `current` as a word.\n\n```shell\nreliq 'div .tile #current' index.html\n```\n\nGet tags that do not have any tags inside them from file `index.html`.\n\n```shell\nreliq '* c@[0]' index.html\n```\n\nGet tags whose insides have a length equal to 0 from file `index.html`.\n\n```shell\nreliq '* i@\u003e[0]' index.html\n```\n\nGet hyperlinks from level greater than or equal to 6 from file `index.html`.\n\n```shell\nreliq 'a href @l[6:] | \"%(href)v\\n\"' index.html\n```\n\nGet any tag with class `cont` and without id starting with `img-`.\n\n```shell\nreliq '* .cont -#b\u003eimg-' index.html\n```\n\nGet hyperlinks ending with `/[0-9]+\\.html`.\n\n```shell\nreliq 'a href=Ee\u003e/[0-9]+\\.html | \"%(href)a\\n\"' index.html\n```\n\nGet links that are not at level 0; both commands are equivalent.\n\n```shell\nreliq 'a href l@[!0] | \"%(href)v\\n\"' index.html\nreliq 'a href -l@[0] | \"%(href)v\\n\"' index.html\n```\n\nGet `li` tags whose insides don't start with `Views:`; both commands are equivalent.\n\n```shell\nreliq 'li i@bv\u003e\"Views:\"' index.html\nreliq 'li -i@b\u003e\"Views:\"' index.html\n```\n\nGet `ul` tags and html inside `i` tags that are inside `p` tags.\n\n```shell\nreliq 'ul, p; i | \"%i\\n\"' index.html\n```\n\nGet the second to last image in every `li` with `id` matching extended regex `img-[0-9]+`.\n\n```shell\nreliq 'li #E\u003eimg-[0-9]+; img src [-1] | \"%(src)v\\n\"' index.html\n```\n\nGet `tr` and `td` inside `table` tag.\n\n```shell\nreliq 'table; { tr, td }' index.html\n```\n\nGet `tr` with either `title` attribute with class `x1` or `x2`, or if it has 
one child. Note that `)` has to be preceded by a space, otherwise it will be part of the previous argument, and that the brackets have to touch for it to act as an or.\n\n```shell\nreliq 'tr ( title ( .x1 )( .x2 ) )( c@[1] )'\n```\n\nProcess output using `cut` for each tag, and with `sed` and `tr` for the whole output.\n\n```shell\nreliq 'div #B\u003emsg_[0-9-]* | \"%(id)v\" cut [1] \"-\" / sed \"s/^msg_//\" tr \"\\n\" \"\\t\"' index.html\n```\n\nProcess output in a block (note that expressions in a block have to be separated by `,`; a newline alone is not enough).\n\n```shell\nreliq '\n    {\n        div class=B\u003e\"memberProfileBanner memberTooltip-header.*\" style=a\u003e\"url(\" | \"%(style)v\" / sed \"s#.*url(##;s#^//#https://#;s/?.*//;p;q\" \"n\",\n        img src | \"%(src)v\" / sed \"s/?.*//; q\"\n    } / sed \"s#^//#https://#\"\n' index.html\n```\n\nGet `tr` in `table` with level relative to `table` equal to `1`, and process the block individually for every tag; at the end of each block delete every `\\n` character and append `\\n`.\n\n```shell\nreliq '\n    table border=1; tr l@[1]; {\n        td; * c@[0] | \"%i\\t\"\n    } | tr \"\\n\" echo \"\" \"\\n\"\n' index.html\n```\n\nThe same, but process all tags at once in the block, then process the final output of the block, deleting all `\\n` and appending `\\n` at the end. The previous example creates tsv where each `tr` has its own line, but this example creates only one line.\n\n```shell\nreliq '\n    table border=1; tr; {\n        td; * c@[0] | \"%i\\t\"\n    } / tr \"\\n\" echo \"\" \"\\n\"\n' index.html\n```\n\n### JSON like output\n\nThe output of an expression can be assigned to a field. 
Fields combined in a json like structure are written at the end of the output; if an expression is not assigned to a field, its output is written before the structure.\n\n```shell\nreliq '.links.a a href | \"%(href)v\\n\", img src | \"%(src)v\\n\"'\n```\n\nwill return\n\n```json\n/static/images/icons/wikipedia.png\n/static/images/mobile/copyright/wikipedia-wordmark-en.svg\n/static/images/mobile/copyright/wikipedia-tagline-en.svg\n//upload.wikimedia.org/wikipedia/commons/thumb/0/08/Sodium_nitrite.svg/220px-Sodium_nitrite.svg.png\nhttps//upload.wikimedia.org/wikipedia/commons/thumb/a/a6/Nitrite-3D-vdW.png/110px-Nitrite-3D-vdW.png\nhttps//upload.wikimedia.org/wikipedia/commons/thumb/0/05/Sodium-3D.png/80px-Sodium-3D.png\n{\"links\":[\"/wiki/Main_Page\",\"/wiki/Wikipedia:Contents\",\"/wiki/Portal:Current_events\",\"/wiki/Special:Random\",\"/wiki/Wikipedia:About\"]}\n```\n\nNote that from now on any json structure in examples is prettified for ease of reading; in reality reliq returns compact json.\n\nIt is a json like structure because reliq doesn't enforce json output, as the above example shows. If the output is fed directly to a json parser, an error might occur when incorrect changes are made to the reliq script.\n\nIt also does not check for repeated field names, e.g.\n\n```shell\nreliq '\n    .a span #views | \"%i\",\n    .a time datetime | \"%(datetime)v\\a\"\n'\n```\n\nwill return incorrect json\n\n```json\n{\n  \"a\": \"29423\",\n  \"a\": \"2024-07-19 07:12:05\"\n}\n```\n\nSo whether reliq returns valid json depends on the user!\n\nA field is specified by a `.` followed by a string matching `[A-Za-z0-9_-]+` before an expression, e.g.\n\n```shell\nreliq '.some_fiELD-1 p | \"%i\"'\n```\n\nFields can only be specified before an expression; inside one they are not recognized and produce incorrect patterns, e.g.\n\n```shell\nreliq 'ul; .a li' index.html\n```\n\nThis will not return a json like structure, or anything at all, since `.a` is treated as the plain name of a tag, which cannot exist in 
html.\n\nFields take precedence over everything else\n\n```shell\nreliq '\n    .field dd; {\n        time datetime | \"%(datetime)v\\a\",\n        a i@v\u003e\"\u003c\" | \"%i\\a\",\n        * l@[0] | \"%i\"\n    } / tr '\\n' sed \"s/^\\a*//;s/\\a.*//\"\n' index.html\n```\n\nFields can be nested in other fields\n\n```shell\nreliq '\n    .user {\n        h4 class=b\u003emessage-name,\n        * .MessageCard__user-info__name\n    }; {\n        .name [0] * c@[0] | \"%i\",\n        .link [0] a class href | \"%(href)v\",\n        .id.u [0] * data-user-id | \"%(data-user-id)v\"\n    },\n' index.html\n```\n\nwill return\n\n```json\n{\n  \"user\": {\n    \"name\": \"TUVIMEN\",\n    \"link\": \"https://examplesite.com/u/TUVIMEN/\",\n    \"id\": 1245\n  }\n}\n```\n\nIf a field that nests other fields contains an expression without a field, that expression will not be assigned to the nesting field and will be output normally before the json structure, breaking compatibility with json, e.g.\n\n```shell\nreliq '\n    .user div #user; {\n        .name span .user-name | \"%i\",\n        .link [0] a href | \"%(href)v\",\n        div .user-avatar; img src | \"%(src)v\\n\"\n    }\n' index.html\n```\n\nwill return\n\n```json\nhttps://exampleother.com/a/8122.jpg\n{\n    \"user\": {\n        \"name\": \"hexderm\",\n        \"link\": \"https://exampleother.com/users/8122\"\n    }\n}\n```\n\nThis further underlines the importance of writing a correct script.\n\nUnless fields are nested, they will be on the same level no matter where they are in the script\n\n```shell\nreliq '\n    .signature div .signature #B\u003esig[0-9]* | \"%i\",\n    dl .postprofile #B\u003eprofile[0-9]*; {\n        dt l@[1]; {\n            .avatar img src | \"%(src)v\",\n            .user a c@[0] | \"%i\",\n            .userid.u a href c@[0] | \"%(href)v\" / sed \"s/.*[\u0026;]u=([0-9]+).*/\\1/\" \"E\",\n        },\n        .userinfo.a(\"\\a\") dd l@[1] i@vf\u003e\"\u0026nbsp;\" | \"%i\\a\" / tr '\\n\\t' sed \"\n            
s/\u003cstrong\u003e([^\u003c]*)\u003c\\/strong\u003e/\\1/g\n            s/ +:/:/\n            /\u003cul [^\u003e]*class=\\\"profile-icons\\\"\u003e/{\n                s/.*\u003ca href=\\\"([^\\\"]*)\\\" title=\\\"Site [^\\\"]*\\\".*/Site\\t\\1/\n                t;d\n            }\n            /^[^\u003c\u003e]+:/!{s/^/Rank:/}\n            s/: */\\t/\n        \" \"E\" \"\\a\"\n    }\n' index.html\n```\n\nwill return\n\n```json\n{\n  \"signature\": \"Helicopter: \\\"helico\\\" -\u003e spiral, \\\"pter\\\" -\u003e with wings\",\n  \"avatar\": \"https://other.org/as/d.png\",\n  \"user\": \"Twospoons\",\n  \"userid\": 5218,\n  \"userinfo\": [\n    \"Posts\\t1306\",\n    \"Mood\\tA trace of hope...\"\n  ]\n}\n```\n\nIf field has block with fields that has `|` while format is empty, field type will change to array and block will be executed for each tag\n\n```shell\nreliq '\n    .posts form #quickModForm; table l@[1]; tr l@[1] i@B\u003e\"id=\\\"subject_[0-9]*\\\"\" i@v\u003e\".googlesyndication.com/\"; {\n        .postid.u div #B\u003esubject_[0-9]* | \"%(id)v\" / sed \"s/.*_//\",\n        .date td valign=middle; div .smalltext | \"%i\" / sed \"s/.* ://;s/^\u003c\\/b\u003e //;s/ \u0026#187;//g;s/\u003cbr *\\/\u003e.*//;s/\u003c[^\u003e]*\u003e//g;s/ *$//\",\n        .avatar td valign=top rowspan=2; img .avatar src | \"%(src)v\",\n    } |,\n    .title td #top_subject | \"%i\" / sed \"s/[^:]*: //;s/([^(]*)//;s/\u0026nbsp;//g;s/ *$//\"\n'\n```\n\nwill return\n\n```json\n{\n    \"posts\": [\n        {\n          \"postid\": 62225223,\n          \"date\": \"May 10, 2023, 08:46:51 PM\",\n          \"avatar\": \"/useravatars/avatar_2654005.png\"\n        },\n        {\n          \"postid\": 62225251,\n          \"date\": \"May 10, 2023, 08:57:28 PM\",\n          \"avatar\": \"/useravatars/avatar_300014.png\"\n        }\n    ],\n    \"title\": \"Bitcoins held by the US government\"\n}\n```\n\nIf field name is followed by another `.` with alphanumeric string it will have 
a different type. Types are described in the manual. The most used ones are `.u` and `.a`.\n\n`.u` finds the first unsigned integer, e.g. for `is -2849.4 kg` it will return `2849`.\n\n`.a` splits a string by a delimiter, creating an array. By default the delimiter is set to `\\n` and the element type to `.s`, so `.a`, `.a(\"\\n\")`, `.a.s` and `.a(\"\\n\").s` are equivalent. `.a(\"\\t\").u` uses `\\t` as the delimiter and holds elements that are unsigned integers.\n\n```shell\nreliq '.years.a(\"\\t\").u table .int-values; [0] tr | \"%i\" / tr \"\\n\" \"\\t\"'\n```\n\nwill return\n\n```json\n{\n    \"years\": [\n        1930,\n        1942,\n        1948,\n        1965,\n        1989,\n        1994,\n        2010\n    ]\n}\n```\n\n## Python interface\n\n[reliq-python](https://github.com/TUVIMEN/reliq-python)\n\n## Projects using reliq\n\n- [forumscraper](https://github.com/TUVIMEN/forumscraper)\n- [torge](https://github.com/TUVIMEN/torge)\n- [wordpress-madara-scraper](https://github.com/TUVIMEN/wordpress-madara-scraper)\n- [mangabuddy-scraper](https://github.com/TUVIMEN/mangabuddy-scraper)\n- [elektroda-scraper](https://github.com/TUVIMEN/elektroda-scraper)\n- [lightnovelworld](https://github.com/TUVIMEN/lightnovelworld)\n- and many other repositories of mine\n\n## Syntax highlighting\n\n- [vim](https://github.com/TUVIMEN/reliq-vim)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftuvimen%2Freliq","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftuvimen%2Freliq","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftuvimen%2Freliq/lists"}