# htmlq - command line HTML query
This is a simple command-line utility for querying HTML content with CSS selectors, much as you would with jQuery. The idea is to be able to run commands like this one:

```console
$ wget -q -O- www.google.com | htmlq title
```

and get an output like this one:

```console
Google
```

The command is mainly useful for scripting. For example, you can try to find the version of a WordPress installation by checking the `meta name="generator"` tag:

```console
$ htmlq -u www.wordpress.com 'meta[name="generator"]'

```

Or the title of a web page:

```console
$ htmlq -u www.google.com title
Google
```

It is even possible to get the value of a particular attribute of a tag:

```console
$ htmlq -u www.wordpress.com 'meta[name="generator"]' -a name
generator
```

Or run more sophisticated queries that remove inner elements, drop empty values, and so on. As an example, the next query gets the title string of each item in an eBay search:

```
$ htmlq -u "https://www.ebay.com/sch/i.html?_nkw=laptop" "li.s-item h3.s-item__title" -s "\n" --rm span -a . -n
```
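
As a rough illustration of what `--rm span -a .` does to each entry, the following Python sketch extracts the text of an HTML fragment while skipping the subtrees of the removed tags. It uses only the standard library's `html.parser`, not the parser that `htmlq` itself relies on. The names `TextWithout` and `text_without` and the sample fragment are made up for the example, and trimming the surrounding whitespace is an assumption rather than documented `htmlq` behavior.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they cannot open a subtree.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextWithout(HTMLParser):
    """Collect the text of a fragment, ignoring the subtrees of some tags."""

    def __init__(self, removed_tags):
        super().__init__()
        self.removed_tags = set(removed_tags)
        self.skip_depth = 0  # greater than 0 while inside a removed subtree
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth or tag in self.removed_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def text_without(fragment, removed_tags):
    """Return the text of `fragment` with the given tags (and children) removed."""
    parser = TextWithout(removed_tags)
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts).strip()


# Hypothetical fragment, loosely shaped like a search result title.
sample = '<h3 class="s-item__title"><span>New Listing</span>Laptop 15 inch</h3>'
print(text_without(sample, ["span"]))
```

Calling `text_without(sample, [])` instead keeps the text of the inner `span`, which is the kind of noise that `--rm` is meant to drop.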

## Installing

Install using `python-pip`:

```
$ pip install htmlq
```

or building from source:

```
$ pip install bs4 html5lib urllib3 requests pathlib
...
$ git clone https://github.com/dealfonso/htmlq.git
$ cd htmlq
$ python3 setup.py install
```

## Use Cases

The simplest way to use `htmlq` is to get a tag from a web page:

```console
$ htmlq -u www.github.com title
GitHub: Where the world builds software · GitHub
```

_(*) this example gets the title of a web page_

---

But we may want to get other tags:

```console
$ htmlq -u www.github.com a
Skip to content
Learn more about the browsers we support.



Sign up
...
```

_(*) this example gets the "a" tags of a web page_

---

That returns a lot of information, so we narrow the query to drop the entries that we do not need:

```console
$ htmlq -u www.github.com "a[aria-label]"



GitHub

```

_(*) this example gets the link tags of a web page, but only those that have the attribute aria-label set_

---

But we only want the _a_ tag itself, without some of its inner tags:

```console
$ htmlq -u www.github.com "a[aria-label]" --rm svg --rm img


```

_(*) the --rm parameter removes the inner elements that match a query from each entry_

---

But we only need the value of the `aria-label` attribute, one per line:

```console
$ htmlq -u www.github.com "a[aria-label]" --rm svg --rm img -a aria-label -s "\n"
Homepage
Go to GitHub homepage
```

_(*) the -a parameter outputs the values of the given attributes for each resulting entry, and -s sets the separator between results_

---

Finally, we also want the destination URL, joined with a pretty arrow:

```console
$ htmlq -u www.github.com "a[aria-label]" --rm svg --rm img -a aria-label -a href -s "\n" -S " -> "
Homepage -> https://github.com/
Go to GitHub homepage -> /
```

_(*) we may include multiple attributes (using multiple -a entries) and join them with a specific connector using -S_
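
To make the last step of the walkthrough concrete, here is a minimal Python sketch of the same idea: keep only the tags that carry a given attribute, then print the requested attributes joined by a field separator. It is not how `htmlq` is implemented; it uses the standard library's `html.parser` and only understands the simple `tag[attribute]` form of selector. The names `AttrCollector` and `query_attrs` and the sample page are invented for the example.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Record the attributes of every `tag` element that carries `required_attr`."""

    def __init__(self, tag, required_attr):
        super().__init__()
        self.tag = tag
        self.required_attr = required_attr
        self.entries = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == self.tag and self.required_attr in attrs:
            self.entries.append(attrs)


def query_attrs(html, tag, required_attr, wanted, sep="\n", field_sep=" -> "):
    """Join the `wanted` attributes of each matching tag, one entry per `sep`."""
    collector = AttrCollector(tag, required_attr)
    collector.feed(html)
    collector.close()
    lines = [field_sep.join(entry.get(name) or "" for name in wanted)
             for entry in collector.entries]
    return sep.join(lines)


# Made-up page: only the two labelled links should be reported.
page = """
<a href="/login">Sign in</a>
<a href="https://github.com/" aria-label="Homepage"><svg></svg></a>
<a href="/" aria-label="Go to GitHub homepage">GitHub</a>
"""
print(query_attrs(page, "a", "aria-label", ["aria-label", "href"]))
```

The `sep` and `field_sep` arguments play the roles of `-s` and `-S` in the command above.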

## Detailed options

The usage syntax for `htmlq` is as follows:

```console
htmlq [-h] [-f FILENAME] [-u URL] [-a ATTRIBUTE] [-r RMQUERY] [-s SEPARATOR] [-S FIELDSEPARATOR] [-n] [-N] [-1] [-U USER_AGENT] query
```

`htmlq` has multiple options and flags; each of them is explained below.

- __-f | --filename FILENAME__ reads the content of the file `FILENAME`. You can give the full path to the file (e.g. `/path/to/my/file`) or use special paths (e.g. `~/myfile`). If neither a filename nor a URL is given, `htmlq` reads from the standard input.

- __-u | --url URL__ retrieves the content to be parsed from the URL. It is advisable to include the scheme in the URL (e.g. `https://my.url`). If it is not included, `https` is tried first and, if that fails, `http` is tried. If a file is given on the command line, this parameter is ignored. If neither a filename nor a URL is given, `htmlq` reads from the standard input.

- __-a | --attr ATTRIBUTE__ if a query returns a set of tags, the default behavior (when this parameter is not set) is to output the whole HTML fragments obtained. If instead an attribute is requested (using _-a_), the output is the value of that attribute for each entry. If an attribute is not present in the HTML node, its output is _blank_. It is possible to request multiple attributes by including multiple _-a_ entries (e.g. `-a href -a aria-label`). There is a special attribute (`.`) that refers to the text representation of the entry (i.e. `-a .`).

- __-r | --rm RMQUERY__ the result of a query to the HTML document may contain child nodes (e.g. `<ul><li></li><li></li></ul>`). When querying for `ul`, the result will be the `<ul>` node along with its `<li>` child nodes. Using `-r` it is possible to delete the `<li>` nodes. It is possible to remove multiple child trees by including multiple `--rm` queries.

- __-s | --separator SEPARATOR__ the string used to join the output of the different entries. It may include escape sequences (e.g. `\n`) or whole arbitrary strings (e.g. `\n ->`). The default value is `\0`.

- __-S | --field-separator FIELDSEPARATOR__ the string used to join the values of the different attributes obtained from an entry with the `-a` parameter. It may include escape sequences (e.g. `\n`) or whole arbitrary strings (e.g. `->`). The default value is `,`.

- __-n | --no-empty-lines__ with this flag, `htmlq` does not output empty lines (i.e. lines whose value is _blank_ as a result of the combination of attributes).

- __-N | --no-empty-attr__ with this flag, `htmlq` omits the values of attributes that are empty (i.e. attributes whose value is _blank_ for an entry). With this option, the number of resulting values may differ from the number of requested attributes (e.g. `-a href -a class -a id` may produce `/,mylink` if _class_ is not set for an entry).

- __-1 | --only-first__ if a query returns multiple results, with this flag `htmlq` processes only the first one and ignores the rest.

- __-U | --user-agent USER_AGENT__ sets an arbitrary user agent string to use when retrieving the web page. You can check your user agent string at https://www.whatsmyua.info

- __query__ the query string to run against the HTML document. A query may match multiple trees; in that case, `htmlq` treats them as individual entries and processes all of them (or only the first one when using the `-1` flag).
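
The interaction between `-a`, `-s`, `-S`, `-n` and `-N` can be pictured with a small Python sketch. It is an interpretation of the descriptions above, not the actual `htmlq` code: the function names are hypothetical, entries are modelled as plain dictionaries of attributes, and a line is treated as blank when every requested attribute is empty, which is an assumption.

```python
def decode_escapes(text):
    """Turn literal backslash sequences such as '\\n' or '\\0' into real characters."""
    return text.encode("latin-1", "backslashreplace").decode("unicode_escape")


def format_entries(entries, attrs, sep="\\0", field_sep=",",
                   no_empty_lines=False, no_empty_attr=False):
    """Join the requested attributes of each entry, then join the entries."""
    sep = decode_escapes(sep)
    field_sep = decode_escapes(field_sep)
    lines = []
    for entry in entries:
        # A missing attribute yields a blank value.
        values = [entry.get(name, "") for name in attrs]
        if no_empty_lines and not any(values):
            continue  # -n: skip entries where nothing was found (assumed meaning)
        if no_empty_attr:
            values = [value for value in values if value]  # -N: drop blank values
        lines.append(field_sep.join(values))
    return sep.join(lines)


# The example from the -N description: `class` is not set for this entry.
entry = {"href": "/", "id": "mylink"}
print(format_entries([entry], ["href", "class", "id"]))
print(format_entries([entry], ["href", "class", "id"], no_empty_attr=True))
```

Separators are passed as they would arrive from the shell, that is, as a literal backslash followed by a letter, which is why they are decoded before use.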

# urlf - format url

This command is an add-on to _htmlq_: a command-line application for working with URLs and extracting information from them.

The original purpose was to extract the values of variables in URLs, so that they can be used in scripts. An example:

```console
$ urlf -v oq "https://www.google.com/search?q=github&oq=github&sourceid=chrome&ie=UTF-8"
github
```

_(*) This example gets the value of var **oq**._

The application later evolved to support rewriting URLs from the command line, as in the next example:

```console
$ urlf "https://www.google.com/search?q=github&oq=github&sourceid=chrome&ie=UTF-8" -F "%s://%H?oq=%#oq#"
https://www.google.com?oq=github
```

_(*) This example rewrites the URL to build a new one that removes the path and just includes the value of var **oq**._
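
For readers who want to see the underlying idea, the following Python sketch splits the same URL into parts and looks up a query variable with the standard `urllib.parse` module. It illustrates the concept only; whether `urlf` uses these functions internally is not stated here, and the helper name `query_var` is made up.

```python
from urllib.parse import urlsplit, parse_qs


def query_var(url, name):
    """Return the values of query variable `name` (an empty list if absent)."""
    return parse_qs(urlsplit(url).query).get(name, [])


url = "https://www.google.com/search?q=github&oq=github&sourceid=chrome&ie=UTF-8"
parts = urlsplit(url)
print(parts.scheme)          # https
print(parts.hostname)        # www.google.com
print(parts.path)            # /search
print(query_var(url, "oq"))  # ['github']
```

A variable may appear several times in a query string, which is why the helper returns a list rather than a single value.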

## Detailed options

`urlf` has multiple options and flags; each of them is explained below.

```console
usage: urlf [-h] [-U] [-s] [-u] [-w] [-H] [-p] [-P] [-q] [-m] [-f] [-v var name] [-j SEPARATOR] [-F format string] [-V] urls [urls ...]
```

- __-h, --help__: shows the help
- __-V, --version__: shows the program's version number and exits
- __-U, --url__: displays the URL as provided in the input.
- __-s, --scheme__: shows the scheme provided in the url (e.g. https)
- __-u, --username__: shows the username used to access the url (i.e. user in https://user:pass@myserver.com)
- __-w, --password__: shows the password used to access the url (i.e. pass in https://user:pass@myserver.com)
- __-H, --hostname__: shows the hostname in the url (i.e. myserver.com in https://myserver.com/my/path)
- __-p, --port__: shows the port in the url (i.e. 443 in https://myserver.com:443/my/path)
- __-P, --path__: shows the path in the url (i.e. my/path in https://myserver.com/my/path)
- __-q, --query__: shows the query in the url (i.e. q=1&r=2 in https://myserver.com/my/path?q=1&r=2)
- __-m, --parameters__: shows the parameters of the url (i.e. param in https://myserver.com/my/path;param?q=1&r=2)
- __-f, --fragment__: shows the fragment in the url (i.e. sec1 in https://myserver.com/my/path#sec1)
- __-v var name, --var var name__:
shows the value of a var in the query string (this parameter may appear multiple times to get the values of multiple variables; they appear in the same order as on the command line)
- __-j | --join-string SEPARATOR__:
character (or string) used to separate the different fields (default: )
- __-F | --format-string format string__:
user-defined format string to get a custom output of the URL parts. Any arbitrary text may appear in this string, and each field is substituted using the letter of the shorthand flag of its parameter, preceded by the symbol `%`. E.g. `urlf -H` is the same as `urlf -F "%H"`, and `urlf -s -H` is the same as `urlf -F "%s%H"`, but you can use `urlf -F "%s://%H"` to obtain a better output. In the case of variables, the value is obtained by surrounding the name of the var with the symbol `#` and prepending the symbol `%`; e.g. `urlf -v q` is the same as `urlf -F "%#q#"`.
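
The format-string mechanism described for `-F` can be sketched in a few lines of Python. This is a hedged illustration built on `urllib.parse` and `re`, not the real `urlf` implementation: the function name `expand_format` is hypothetical, repeated variables are joined with a comma as an assumption, and details such as the leading slash of the path may differ from what `urlf` prints.

```python
import re
from urllib.parse import urlparse, parse_qs


def expand_format(url, fmt):
    """Replace %X fields and %#name# variables in `fmt` with parts of `url`."""
    parts = urlparse(url)
    variables = parse_qs(parts.query)
    # One entry per shorthand flag letter listed in the options above.
    fields = {
        "U": url,
        "s": parts.scheme,
        "u": parts.username or "",
        "w": parts.password or "",
        "H": parts.hostname or "",
        "p": "" if parts.port is None else str(parts.port),
        "P": parts.path,
        "q": parts.query,
        "m": parts.params,
        "f": parts.fragment,
    }

    def replace(match):
        if match.group("var") is not None:
            # Assumption: several values of the same variable are comma-joined.
            return ",".join(variables.get(match.group("var"), []))
        return fields[match.group("field")]

    pattern = r"%(?:#(?P<var>[^#]*)#|(?P<field>[UsuwHpPqmf]))"
    return re.sub(pattern, replace, fmt)


url = "https://www.google.com/search?q=github&oq=github&sourceid=chrome&ie=UTF-8"
print(expand_format(url, "%s://%H?oq=%#oq#"))
```

Any text in the format string that is not a recognized `%` field is copied to the output unchanged, which is what lets `://` and `?oq=` act as literal connectors.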

# A combined example (guessing WordPress version)
If we wanted to get the version of a WordPress installation, we could check the `generator` meta tag and get its content:

```
$ htmlq -u www.grycap.upv.es 'meta[name="generator"]'

$ htmlq -u www.grycap.upv.es 'meta[name="generator"]' -a content
WordPress 5.8.1
```

But many plugins hide the version in the tag, so we can try to guess the version from the links:

```
$ htmlq -u www.grycap.upv.es 'link[href*="?ver="]' -s '\n'

```

From the links, we see that WordPress includes its version in the "ver" variable of the links, so we can get the value of that variable using `urlf`:

```
$ htmlq -u www.grycap.upv.es 'link[href*="?ver="]' -s '\n' -a href | ./urlf.py -v ver -
5.8.1
5.8.1
5.8.1
5.8.1
5.8.1
5.8.1
5.8.1
5.8.1
5.8.1
5.8.1
5.8.1
5.8.1
5.8.1
5.8.1
2.1.2
5.8.1
```

Now, the most frequent value will probably be the one that refers to the WordPress version (because plugins may also use that variable for their own purposes):

```
$ htmlq -u www.grycap.upv.es 'link[href*="?ver="]' -s '\n' -a href | ./urlf.py -v ver - | sort | uniq -c | sort -k1 -n -r
15 5.8.1
1 2.1.2
```

The first line is the most voted version.
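
The same "most voted value" idea can be expressed in Python with `collections.Counter`, which may be handy when the result is consumed by a larger program rather than a shell pipeline. This is only a sketch of the counting step; the function name `most_voted` is made up, and the input list mirrors the counts shown above.

```python
from collections import Counter


def most_voted(values):
    """Return the most frequent non-empty value, or None if there is none."""
    counts = Counter(value for value in values if value)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Same distribution as in the output above: 15 links vote for 5.8.1, one for 2.1.2.
versions = ["5.8.1"] * 15 + ["2.1.2"]
print(Counter(versions).most_common())  # [('5.8.1', 15), ('2.1.2', 1)]
print(most_voted(versions))             # 5.8.1
```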

Now we can compare it with the currently available WordPress version:

```
$ curl -s https://api.wordpress.org/core/version-check/1.7/ | jq ".offers[].version" | tr -d '"' | sort -V | tail -n 1
5.8.1
```
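
The reason for using a version-aware sort instead of a plain text sort can be shown with a short Python sketch. It assumes purely numeric dotted versions (for example `5.8.1`), which is narrower than what `sort -V` accepts; the function names and the list of offered versions are invented for the example.

```python
def version_key(version):
    """Split '5.8.1' into (5, 8, 1) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def is_up_to_date(installed, available):
    """True if `installed` is at least as new as the newest available version."""
    newest = max(available, key=version_key)
    return version_key(installed) >= version_key(newest)


offers = ["5.8.1", "5.8", "5.7.3", "5.10"]  # made-up list of offered versions
print(max(offers))                     # 5.8.1 (plain text comparison, misleading)
print(max(offers, key=version_key))    # 5.10
print(is_up_to_date("5.8.1", offers))  # False
```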

Our final script would look something like this:

```bash
#!/bin/bash
# Most frequent "ver" value in the page links: our guess of the installed version
MYVERSION="$(htmlq -u www.grycap.upv.es 'link[href*="?ver="]' -s '\n' -a href | ./urlf.py -v ver - | sort | uniq -c | sort -k1 -n -r | head -n 1 | awk '{print $2}')"
# Newest version offered by wordpress.org
CURRENTVERSION="$(curl -s https://api.wordpress.org/core/version-check/1.7/ | jq ".offers[].version" | tr -d '"' | sort -V | tail -n 1)"

# The greater of the two versions, according to version sort
LATESTVERSION="$(printf '%s\n' "$MYVERSION" "$CURRENTVERSION" | sort -V -r | head -n 1)"
if [ "$LATESTVERSION" = "$MYVERSION" ]; then
    echo "you have the latest version of wordpress ($LATESTVERSION)"
else
    echo "you should update your wordpress version"
    exit 1
fi
exit 0
```