https://github.com/ericchiang/pup

Parsing HTML at the command line

Host: GitHub
URL: https://github.com/ericchiang/pup
Owner: ericchiang
License: mit
Created: 2014-09-01T01:31:29.000Z (over 11 years ago)
Default Branch: master
Last Pushed: 2024-05-02T13:43:38.000Z (over 1 year ago)
Last Synced: 2025-02-24T01:04:49.633Z (10 months ago)
Language: HTML
Homepage:
Size: 3.69 MB
Stars: 8,222
Watchers: 91
Forks: 263
Open Issues: 107
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-cli - pup - Parsing HTML at the command line. (Lovely Commands)
awesome-starred - ericchiang/pup - Parsing HTML at the command line (others)
my-awesome-github-stars - ericchiang/pup - Parsing HTML at the command line (HTML)
awesome-terminals - pup - Parsing HTML at the command line. (Tools / Go)
my-awesome-list - pup - Parsing HTML at the command line (Programming Languages / Go)
awesome-cli-tui-software - ericchiang/pup - Parsing HTML at the command line (data)
awesome-hacking-lists - ericchiang/pup - Parsing HTML at the command line (HTML)
awesome-cli-apps-in-a-csv - pup - Parsing HTML at the command line. (Text processing)
awesome-cli-apps - pup - Parsing HTML at the command line. (Text processing)

README

# pup

pup is a command line tool for processing HTML. It reads from stdin,
prints to stdout, and allows the user to filter parts of the page using
[CSS selectors](https://developer.mozilla.org/en-US/docs/Web/Guide/CSS/Getting_started/Selectors).

Inspired by [jq](http://stedolan.github.io/jq/), pup aims to be a
fast and flexible way of exploring HTML from the terminal.

## Install

Direct downloads are available through the [releases page](https://github.com/EricChiang/pup/releases/latest).

If you have Go installed on your computer, just run `go get`.

    go get github.com/ericchiang/pup

If you're on OS X, use [Homebrew](http://brew.sh/) to install (no Go required).

    brew install https://raw.githubusercontent.com/EricChiang/pup/master/pup.rb

## Quick start

```bash
$ curl -s https://news.ycombinator.com/
```

Ew, HTML. Let's run that through some pup selectors:

```bash
$ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a'
```

Okay, how about only the links?

```bash
$ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a attr{href}'
```

Even better, let's grab the titles too:

```bash
$ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a json{}'
```

## Basic Usage

```bash
$ cat index.html | pup [flags] '[selectors] [display function]'
```
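The pattern above can be tried offline with a small page of your own. The following is a minimal sketch: the file name `sample.html` and its contents are made up for illustration and are not part of pup.

```shell
# Write a small, hypothetical HTML page so the pipeline can be tried offline.
cat > sample.html <<'EOF'
<html><body><ul><li><a href="/one">First</a></li><li><a href="/two">Second</a></li></ul></body></html>
EOF

# Selectors come first and the display function comes last,
# all inside one quoted argument.
titles=$(cat sample.html | pup 'li a text{}')
printf '%s\n' "$titles"
```

Keeping the whole expression in single quotes stops the shell from interpreting characters such as `{`, `}`, `>`, and `*` before pup sees them.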

## Examples

Download a webpage with wget.

```bash
$ wget http://en.wikipedia.org/wiki/Robots_exclusion_standard -O robots.html
```

#### Clean and indent

By default pup will fill in missing tags and properly indent the page.

```bash
$ cat robots.html
# nasty looking HTML
$ cat robots.html | pup --color
# cleaned, indented, and colorful HTML
```

#### Filter by tag

```bash
$ cat robots.html | pup 'title'
<title>
 Robots exclusion standard - Wikipedia, the free encyclopedia
</title>
```

#### Filter by id

```bash
$ cat robots.html | pup 'span#See_also'

See also

```

#### Filter by attribute

```bash
$ cat robots.html | pup 'th[scope="row"]'

Exclusion standards

Robots exclusion standard

```

```bash
$ cat robots.html | pup 'h1#firstHeading span'

Robots exclusion standard

```

## Implemented Selectors

For further examples of these selectors, head over to [MDN](https://developer.mozilla.org/en-US/docs/Web/CSS/Reference).

```bash
pup '.class'
pup '#id'
pup 'element'
pup 'selector + selector'
pup 'selector > selector'
pup '[attribute]'
pup '[attribute="value"]'
pup '[attribute*="value"]'
pup '[attribute~="value"]'
pup '[attribute^="value"]'
pup '[attribute$="value"]'
pup ':empty'
pup ':first-child'
pup ':first-of-type'
pup ':last-child'
pup ':last-of-type'
pup ':only-child'
pup ':only-of-type'
pup ':contains("text")'
pup ':nth-child(n)'
pup ':nth-of-type(n)'
pup ':nth-last-child(n)'
pup ':nth-last-of-type(n)'
pup ':not(selector)'
pup ':parent-of(selector)'
```

You can mix and match selectors as you wish.

```bash
cat index.html | pup 'element#id[attribute="value"]:first-of-type'
```
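As a concrete sketch of mixing selectors, the snippet below combines an element with an id, a class, and an attribute-prefix match from the list above. The HTML fragment and the `example.com` URL are hypothetical.

```shell
# Hypothetical navigation list with one relative and one absolute link.
html='<ul id="nav"><li class="item"><a href="/home">Home</a></li><li class="item"><a href="https://example.com/docs">Docs</a></li></ul>'

# element#id, then .class, then [attribute^="value"], separated by spaces
# so that each part matches a descendant of the previous one.
external=$(printf '%s\n' "$html" | pup 'ul#nav li.item a[href^="https"] attr{href}')
printf '%s\n' "$external"
```

Only the link whose `href` starts with `https` should be selected, so the relative `/home` link is skipped.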

## Display Functions

Non-HTML selectors which affect the output type are implemented as functions
which can be provided as a final argument.

#### `text{}`

Print all text from selected nodes and children in depth first order.

```bash
$ cat robots.html | pup '.mw-headline text{}'
History
About the standard
Disadvantages
Alternatives
Examples
Nonstandard extensions
Crawl-delay directive
Allow directive
Sitemap
Host
Universal "*" match
Meta tags and headers
See also
References
External links
```
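To see what depth-first order means for nested markup, here is a small sketch on a made-up fragment. The text nodes deliberately contain no surrounding whitespace so that each one lands on its own clean line.

```shell
# Hypothetical fragment: a <b> nested inside the first <p>.
html='<div><p>one<b>two</b></p><p>three</p></div>'

# text{} walks each selected node and its children depth first,
# printing every text node it meets on a separate line.
words=$(printf '%s\n' "$html" | pup 'div text{}')
printf '%s\n' "$words"
```

The nested `two` is expected between `one` and `three`, because the children of the first `<p>` are visited before the second `<p>`.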

#### `attr{attrkey}`

Print the values of all attributes with a given key from all selected nodes.

```bash
$ cat robots.html | pup '.catlinks div attr{id}'
mw-normal-catlinks
mw-hidden-catlinks
```
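Because `attr{}` prints one value per line, its output fits naturally into a shell loop. The sketch below turns site-relative links into absolute URLs; the fragment and the `base` value are assumptions chosen for illustration.

```shell
# Hypothetical snippet of a page with site-relative links.
html='<div><a href="/wiki/A">A</a><a href="/wiki/B">B</a></div>'
base='https://en.wikipedia.org'   # assumed site root, illustration only

# Read each href value on its own line and prefix it with the base URL.
urls=$(printf '%s\n' "$html" | pup 'a attr{href}' | while IFS= read -r href; do
  printf '%s%s\n' "$base" "$href"
done)
printf '%s\n' "$urls"
```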

#### `json{}`

Print HTML as JSON.

```bash
$ cat robots.html | pup 'div#p-namespaces a'
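<a href="/wiki/Robots_exclusion_standard" title="View the content page [c]" accesskey="c">
 Article
</a>
<a href="/wiki/Talk:Robots_exclusion_standard" title="Discussion about the content page [t]" accesskey="t">
 Talk
</a>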
```

```bash
$ cat robots.html | pup 'div#p-namespaces a json{}'
[
 {
  "accesskey": "c",
  "href": "/wiki/Robots_exclusion_standard",
  "tag": "a",
  "text": "Article",
  "title": "View the content page [c]"
 },
 {
  "accesskey": "t",
  "href": "/wiki/Talk:Robots_exclusion_standard",
  "tag": "a",
  "text": "Talk",
  "title": "Discussion about the content page [t]"
 }
]
```

Use the `-i` / `--indent` flag to control the indent level.

```bash
$ cat robots.html | pup -i 4 'div#p-namespaces a json{}'
[
    {
        "accesskey": "c",
        "href": "/wiki/Robots_exclusion_standard",
        "tag": "a",
        "text": "Article",
        "title": "View the content page [c]"
    },
    {
        "accesskey": "t",
        "href": "/wiki/Talk:Robots_exclusion_standard",
        "tag": "a",
        "text": "Talk",
        "title": "Discussion about the content page [t]"
    }
]
```

If the selectors return only one element, the results will be printed as a JSON
object, not a list.

```bash
$ cat robots.html | pup --indent 4 'title json{}'
{
    "tag": "title",
    "text": "Robots exclusion standard - Wikipedia, the free encyclopedia"
}
```

Because there is no universal standard for converting HTML/XML to JSON, a
method has been chosen which hopefully fits. The goal is simply to get the
output of pup into a more consumable format.
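Since pup takes its inspiration from jq, one way to consume this output is to pipe it into jq. The sketch below assumes jq is installed and uses a trimmed copy of the `json{}` output shown above in place of a live pup call; the filter accepts both a list and a single object, since either may be printed.

```shell
# Trimmed copy of the json{} output above (two matches, so a list).
links='[
 {"href": "/wiki/Robots_exclusion_standard", "tag": "a", "text": "Article"},
 {"href": "/wiki/Talk:Robots_exclusion_standard", "tag": "a", "text": "Talk"}
]'
# In practice the JSON would come straight from pup, for example:
#   cat robots.html | pup 'div#p-namespaces a json{}' | jq ...

# Unwrap a list if there is one, then print "text<TAB>href" per element.
rows=$(printf '%s\n' "$links" | jq -r '(if type == "array" then .[] else . end) | "\(.text)\t\(.href)"')
printf '%s\n' "$rows"
```

Checking the top-level type first keeps the same filter usable when a selector happens to match exactly one element.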

## Flags

Run `pup --help` for a list of further options.
