https://github.com/gajus/surgeon
Declarative DOM extraction expression evaluator. 👨⚕️
- Host: GitHub
- URL: https://github.com/gajus/surgeon
- Owner: gajus
- License: other
- Created: 2017-01-16T14:14:34.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2020-06-05T18:11:27.000Z (over 4 years ago)
- Last Synced: 2025-01-08T01:15:27.928Z (13 days ago)
- Topics: css-selector, parser, scraper, subroutines
- Language: JavaScript
- Homepage:
- Size: 712 KB
- Stars: 695
- Watchers: 16
- Forks: 30
- Open Issues: 17
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
# Surgeon
[![GitSpo Mentions](https://gitspo.com/badges/mentions/gajus/surgeon?style=flat-square)](https://gitspo.com/mentions/gajus/surgeon)
[![Travis build status](http://img.shields.io/travis/gajus/surgeon/master.svg?style=flat-square)](https://travis-ci.org/gajus/surgeon)
[![Coveralls](https://img.shields.io/coveralls/gajus/surgeon.svg?style=flat-square)](https://coveralls.io/github/gajus/surgeon)
[![NPM version](http://img.shields.io/npm/v/surgeon.svg?style=flat-square)](https://www.npmjs.org/package/surgeon)
[![Canonical Code Style](https://img.shields.io/badge/code%20style-canonical-blue.svg?style=flat-square)](https://github.com/gajus/canonical)
[![Twitter Follow](https://img.shields.io/twitter/follow/kuizinas.svg?style=social&label=Follow)](https://twitter.com/kuizinas)

Declarative DOM extraction expression evaluator.
Powerful, succinct, composable, extendable, declarative API.
```yaml
articles:
- select article {0,}
- body:
  - select .body
  - read property innerHTML
  imageUrl:
  - select img
  - read attribute src
  summary:
  - select ".body p:first-child"
  - read property innerHTML
  - format text
  title:
  - select .title
  - read property textContent
pageName:
- select .body
- read property innerHTML
```
> Not succinct enough for you? Use [aliases](#declare-subroutine-aliases) and the [pipe operator (`|`)](#the-pipe-operator-) to shorten and concatenate the commands:
>
> ```
> articles:
> - sm article
> - body: s .body | rp innerHTML
>   imageUrl: s img | ra src
>   summary: s .body p:first-child | rp innerHTML | f text
>   title: s .title | rp textContent
> pageName: s .body | rp innerHTML
> ```

Have you got suggestions for improvement? [I am all ears](https://github.com/gajus/surgeon/issues).
---
* [Configuration](#configuration)
* [Evaluators](#evaluators)
    * [`browser` evaluator](#browser-evaluator)
    * [`cheerio` evaluator](#cheerio-evaluator)
* [Subroutines](#subroutines)
    * [Built-in subroutines](#built-in-subroutines)
        * [`append` subroutine](#append-subroutine)
        * [`closest` subroutine](#closest-subroutine)
        * [`constant` subroutine](#constant-subroutine)
        * [`format` subroutine](#format-subroutine)
        * [`match` subroutine](#match-subroutine)
        * [`nextUntil` subroutine](#nextuntil-subroutine)
        * [`prepend` subroutine](#prepend-subroutine)
        * [`previous` subroutine](#previous-subroutine)
        * [`read` subroutine](#read-subroutine)
        * [`remove` subroutine](#remove-subroutine)
        * [`select` subroutine](#select-subroutine)
            * [Quantifier expression](#quantifier-expression)
        * [`test` subroutine](#test-subroutine)
    * [User-defined subroutines](#user-defined-subroutines)
    * [Inline subroutines](#inline-subroutines)
    * [Built-in subroutine aliases](#built-in-subroutine-aliases)
* [Expression reference](#expression-reference)
    * [The pipe operator (`|`)](#the-pipe-operator-)
* [Cookbook](#cookbook)
    * [Extract a single node](#extract-a-single-node)
    * [Extract multiple nodes](#extract-multiple-nodes)
    * [Name results](#name-results)
    * [Validate the results using RegExp](#validate-the-results-using-regexp)
    * [Validate the results using a user-defined test function](#validate-the-results-using-a-user-defined-test-function)
    * [Declare subroutine aliases](#declare-subroutine-aliases)
* [Error handling](#error-handling)
* [Debugging](#debugging)

## Configuration
|Name|Type|Description|Default value|
|---|---|---|---|
|`evaluator`|[`EvaluatorType`](./src/types.js)|HTML parser and selector engine. See [evaluators](#evaluators).|[`browser` evaluator](#browser-evaluator) if `window` and `document` variables are present, [`cheerio`](#cheerio-evaluator) otherwise.|
|`subroutines`|[`$PropertyType`](./src/types.js)|User defined subroutines. See [subroutines](#subroutines).|N/A|

## Evaluators
[Subroutines](#subroutines) use an evaluator to parse input (i.e. convert a string into an object) and to select nodes in the resulting document.
The default evaluator is configured based on the user environment:
* [`browser` evaluator](#browser-evaluator) is used if `window` and `document` variables are defined; otherwise
* [`cheerio`](#cheerio-evaluator)

> Have a use case for another evaluator? [Raise an issue](https://github.com/gajus/surgeon/issues).
>
> For an example implementation of an evaluator, refer to:
>
> * [`./src/evaluators/browserEvaluator.js`](./src/evaluators/browserEvaluator.js)
> * [`./src/evaluators/cheerioEvaluator.js`](./src/evaluators/cheerioEvaluator.js)

### `browser` evaluator
Uses native browser methods to parse the document and to evaluate CSS selector queries.
Use `browser` evaluator if you are running Surgeon in a browser or a headless browser (e.g. PhantomJS).
```js
import {
  browserEvaluator
} from './evaluators';

surgeon({
  evaluator: browserEvaluator()
});
```
### `cheerio` evaluator
Uses [cheerio](https://github.com/cheeriojs/cheerio) to parse the document and to evaluate CSS selector queries.
Use `cheerio` evaluator if you are running Surgeon in Node.js.
```js
import {
  cheerioEvaluator
} from './evaluators';

surgeon({
  evaluator: cheerioEvaluator()
});
```
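The default-evaluator rule described in this section (use `browser` when both `window` and `document` are defined, `cheerio` otherwise) amounts to a simple environment check. The sketch below only illustrates that rule; `chooseEvaluatorName` is a hypothetical helper and is not part of Surgeon's API.

```js
// Hypothetical helper (not part of Surgeon): returns the name of the
// evaluator that the rule above would pick for a given global scope.
const chooseEvaluatorName = (scope = globalThis) => {
  const hasBrowserGlobals =
    typeof scope.window !== 'undefined' &&
    typeof scope.document !== 'undefined';

  return hasBrowserGlobals ? 'browser' : 'cheerio';
};

// A scope that looks like a browser.
console.log(chooseEvaluatorName({window: {}, document: {}})); // browser

// A scope without browser globals.
console.log(chooseEvaluatorName({})); // cheerio
```

Passing the scope as a parameter keeps the check easy to exercise outside a real browser.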
## Subroutines
A subroutine is a function used to advance the DOM extraction expression evaluator, e.g.
```js
x('foo | bar baz', 'qux');
```
In the above example, the Surgeon expression uses two subroutines: `foo` and `bar`.
`foo` subroutine is invoked without additional values. `bar` subroutine is executed with 1 value ("baz").
Subroutines are executed in the order in which they are defined: the result of each subroutine is passed on to the next one. The first subroutine receives the document input (in this case: "qux" string).
Multiple subroutines can be written as an array. The following example is equivalent to the earlier example.
```js
x([
  'foo',
  'bar baz'
], 'qux');
```
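To make the chaining concrete, here is a minimal sketch of how a piped string or an array of instructions could be reduced over an input. It is not Surgeon's implementation: the `evaluate` helper, the `(subject, values)` signature, and the `upper` and `append` stand-ins are assumptions made for illustration. The naive whitespace split also ignores quoted values and nested queries.

```js
// Hypothetical subroutines: each receives the current subject and the
// values that follow the subroutine name in the instruction.
const subroutines = {
  append: (subject, [tail]) => subject + tail,
  upper: (subject) => subject.toUpperCase()
};

// Accepts either a piped string ('a | b c') or an array (['a', 'b c']).
const evaluate = (expression, input) => {
  const instructions = Array.isArray(expression)
    ? expression
    : expression.split('|');

  // The result of each subroutine becomes the subject of the next one.
  return instructions.reduce((subject, instruction) => {
    const [name, ...values] = instruction.trim().split(/\s+/);
    const subroutine = subroutines[name];

    if (!subroutine) {
      throw new Error(`Unknown subroutine: ${name}`);
    }

    return subroutine(subject, values);
  }, input);
};

console.log(evaluate('upper | append !', 'qux')); // QUX!
console.log(evaluate(['upper', 'append !'], 'qux')); // QUX!
```

Both calls produce the same result, which mirrors the equivalence between the piped and the array form.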
There are two types of subroutines:
* [Built-in subroutines](#built-in-subroutines)
* [User-defined subroutines](#user-defined-subroutines)

> Note:
>
> These functions are called subroutines to emphasise the cross-platform nature of the declarative API.

### Built-in subroutines
The following subroutines are available out of the box.
#### `append` subroutine
`append` appends a string to the input string.
|Parameter name|Description|Default|
|---|---|---|
|tail|Appends a string to the end of the input string.|N/A|

Examples:

```js
// Assuming an element <a href='http://foo' />,
// then the result is 'http://foo/bar'.
x(`select a | read attribute href | append '/bar'`);
```
#### `closest` subroutine
`closest` subroutine iterates through all the preceding nodes (including parent nodes) searching for either a preceding node matching the selector expression or a descendant of the preceding node matching the selector.
Note: This is different from the jQuery [`.closest()`](https://api.jquery.com/closest/) in that the latter method does not search for parent descendants matching the selector.
|Parameter name|Description|Default|
|---|---|---|
|CSS selector|CSS selector used to select an element.|N/A|

#### `constant` subroutine
`constant` returns the parameter value regardless of the input.
|Parameter name|Description|Default|
|---|---|---|
|`constant`|Constant value that will be returned as the result.|N/A|

#### `format` subroutine
`format` is used to format input using [printf](https://en.wikipedia.org/wiki/Printf_format_string).
|Parameter name|Description|Default|
|---|---|---|
|format|[sprintf format](https://www.npmjs.com/package/sprintf-js) used to format the input string. The subroutine input is the first argument, i.e. `%1$s`.|`%1$s`|

Examples:

```js
// Formats the input (the `href` attribute value)
// by prefixing it with 'http://foo.com'.
x(`select a | read attribute href | format 'http://foo.com%1$s'`);
```
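As a rough illustration of what the `%1$s` placeholder does, the sketch below substitutes the input into a template. It is a simplified stand-in for illustration only: `formatSketch` is a hypothetical function that handles positional `%N$s` placeholders and nothing else, whereas Surgeon delegates to a sprintf implementation.

```js
// Hypothetical, simplified stand-in for sprintf-style formatting.
// Only positional string placeholders (%1$s, %2$s, ...) are supported.
const formatSketch = (subject, template = '%1$s') => {
  const values = [subject];

  return template.replace(/%(\d+)\$s/g, (placeholder, position) => {
    const value = values[Number(position) - 1];

    // Leave placeholders without a matching value untouched.
    return value === undefined ? placeholder : String(value);
  });
};

console.log(formatSketch('/bar', 'http://foo.com%1$s')); // http://foo.com/bar
console.log(formatSketch('/bar')); // /bar
```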
#### `match` subroutine
`match` is used to extract matching [capturing groups](https://www.regular-expressions.info/refcapture.html) from the subject input.
|Parameter name|Description|Default|
|---|---|---|
|Regular expression|Regular expression used to match capturing groups in the string.|N/A|
|Sprintf format|[sprintf format](https://www.npmjs.com/package/sprintf-js) used to construct a string using the matching capturing groups.|`%s`|

Examples:

```js
// Extracts 1 matching capturing group from the input string.
// Throws `InvalidDataError` if the value does not pass the test.
x('select .foo | read property textContent | match "/input: (\d+)/"');

// Extracts 2 matching capturing groups from the input string and formats the output using sprintf.
// Throws `InvalidDataError` if the value does not pass the test.
x('select .foo | read property textContent | match "/input: (\d+)-(\d+)/" %2$s-%1$s');
```
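The behaviour described above (extract capturing groups, then optionally reorder them with a sprintf-style format) can be approximated in plain JavaScript. This is a sketch under assumptions, not Surgeon's code: `matchSketch` is a hypothetical function, the `InvalidDataError` class here is a local stand-in for the error named above, and only `%s` and `%N$s` placeholders are handled.

```js
// Local stand-in for the error type named in the documentation.
class InvalidDataError extends Error {}

// Hypothetical sketch: extracts capturing groups and formats them.
const matchSketch = (subject, pattern, template = '%s') => {
  const result = subject.match(pattern);

  if (!result) {
    throw new InvalidDataError('Input does not match the regular expression.');
  }

  // Drop the full match; keep only the capturing groups.
  const groups = result.slice(1);
  let nextIndex = 0;

  // `%s` consumes groups in order; `%N$s` picks the N-th group.
  return template.replace(/%(?:(\d+)\$)?s/g, (placeholder, position) => {
    const index = position ? Number(position) - 1 : nextIndex++;

    return groups[index] === undefined ? placeholder : groups[index];
  });
};

console.log(matchSketch('input: 42', /input: (\d+)/)); // 42
console.log(matchSketch('input: 12-34', /input: (\d+)-(\d+)/, '%2$s-%1$s')); // 34-12
```

An input that does not match the pattern throws instead of returning an empty value.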
#### `nextUntil` subroutine
`nextUntil` subroutine is used to select all following siblings of each element up to but not including the element matched by the selector.
|Parameter name|Description|Default|
|---|---|---|
|selector expression|A string containing a selector expression to indicate where to stop matching following sibling elements.|N/A|
|filter expression|A string containing a selector expression to match elements against.|

#### `prepend` subroutine
`prepend` prepends a string to the input string.
|Parameter name|Description|Default|
|---|---|---|
|head|Prepends a string to the start of the input string.|N/A|

Examples:

```js
// Assuming an element <a href='//foo/bar' />,
// then the result is 'http://foo/bar'.
x(`select a | read attribute href | prepend 'http:'`);
```
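Both `prepend` and `append` boil down to string concatenation on the subroutine input. The sketch below restates them as plain functions; `prependSketch` and `appendSketch` are hypothetical names, and the input strings are assumed attribute values chosen to match the documented results.

```js
// Hypothetical plain-function equivalents of the two subroutines.
const prependSketch = (subject, head) => head + subject;
const appendSketch = (subject, tail) => subject + tail;

// Protocol-relative URL completed with a scheme.
console.log(prependSketch('//foo/bar', 'http:')); // http://foo/bar

// Base URL completed with a path.
console.log(appendSketch('http://foo', '/bar')); // http://foo/bar
```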
#### `previous` subroutine
`previous` subroutine selects the preceding sibling.
|Parameter name|Description|Default|
|---|---|---|
|CSS selector|CSS selector used to select an element.|N/A|

Example:
```html
- foo