Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/jpillora/scraper
A dual interface Go module for building simple web scrapers
- Host: GitHub
- URL: https://github.com/jpillora/scraper
- Owner: jpillora
- License: mit
- Created: 2015-04-16T15:27:35.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2023-12-21T09:45:30.000Z (about 1 year ago)
- Last Synced: 2024-10-14T11:41:24.896Z (3 months ago)
- Language: Go
- Homepage:
- Size: 45.9 KB
- Stars: 50
- Watchers: 9
- Forks: 14
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# scraper
[![GoDoc](https://godoc.org/github.com/jpillora/scraper?status.svg)](https://godoc.org/github.com/jpillora/scraper) [![CI](https://github.com/jpillora/scraper/workflows/CI/badge.svg)](https://github.com/jpillora/scraper/actions?workflow=CI)
A dual interface Go module for building simple web scrapers
### Features

* Go struct-tag interface
* Command-line interface
* HTML⇒JSON API server
* Single binary
* Simple configuration
* Zero-downtime config reload with `kill -s SIGHUP <pid>`

### Install
**Binaries**
See [the latest release](https://github.com/jpillora/scraper/releases/latest) or download it with this one-liner: `curl https://i.jpillora.com/scraper | bash`
**Source**
``` sh
$ go get -v github.com/jpillora/scraper
```

### Go Example
```go
package main

import (
	"fmt"
	"log"

	"github.com/jpillora/scraper/scraper"
)

func main() {
	// result maps each output field to its extractor(s).
	type result struct {
		Title string `scraper:"h3 span"`
		URL   string `scraper:"a[href] | @href"`
	}

	// google describes the endpoint: URL template, list selector and input parameter.
	type google struct {
		URL    string   `scraper:"https://www.google.com/search?q={{query}}"`
		Result []result `scraper:"#rso div[class=g]"`
		Query  string   `scraper:"query"`
	}

	g := google{Query: "hello world"}
	if err := scraper.Execute(&g); err != nil {
		log.Fatal(err)
	}

	for i, r := range g.Result {
		fmt.Printf("#%d: '%s' => %s\n", i+1, r.Title, r.URL)
	}
}
```

```
#1: 'Helloworld Travel – Deals on Accommodation, Flights ...' => https://www.helloworld.com.au/
#2: '"Hello, World!" program - Wikipedia' => https://en.wikipedia.org/wiki/%22Hello,_World!%22_program
#3: 'Helloworld Travel - Wikipedia' => https://en.wikipedia.org/wiki/Helloworld_Travel
#4: 'Helloworld Travel Limited' => https://www.helloworldlimited.com.au/
#5: 'Total immersion, Serious fun! with Hello-World!' => https://www.hello-world.com/
#6: 'Helloworld Travel - Home | Facebook' => https://www.facebook.com/helloworldau/
```
### CLI Example

Given `google.json`:
``` json
{
"/search": {
"url": "https://www.google.com/search?q={{query}}",
"list": "#rso div[class=g]",
"result": {
"title": "h3 span",
"url": ["a[href]", "@href"]
}
}
}
```

``` sh
$ scraper google.json
2015/05/16 20:10:46 listening on 3000...
```

``` sh
$ curl "localhost:3000/search?query=hellokitty"
[
{
"title": "Official Home of Hello Kitty \u0026 Friends | Hello Kitty Shop",
"url": "http://www.sanrio.com/"
},
{
"title": "Hello Kitty - Wikipedia, the free encyclopedia",
"url": "http://en.wikipedia.org/wiki/Hello_Kitty"
},
...
```

### JSON API
``` plain
{
  <path>: {
    "method": <method>,
    "url": <url>,
    "list": <selector>,
    "result": {
      <field>: <extractor>,
      <field>: [<extractor>, <extractor>, ...],
      ...
    }
  }
}
```

* `<path>` - **Required** The path of the scraper
    * Accessible at `http://<host>:port/<path>`
    * You may define path variables like `my/path/:var`; when the request path is `/my/path/foo`, then `:var = "foo"`
* `<url>` - **Required** The URL of the remote server to scrape
    * It may contain template variables in the form `{{ var }}`. Scraper looks for a `var` path variable first and, if none is found, for a query parameter `var`
* `result` - **Required** Represents the resulting JSON object, after executing the `<extractor>` on the current DOM context. A field may use a sequence of `<extractor>`s to perform more complex queries.
* `<method>` - The HTTP request method (defaults to `GET`)
* `<extractor>` - A string which must be one of:
    * a regex in the form `/abc/` - searches the text of the current DOM context (extracts the first group when provided).
    * a regex in the form `s/abc/xyz/` - searches the text of the current DOM context and replaces it with the provided text (sed-like syntax).
    * an attribute in the form `@abc` - gets the attribute `abc` from the DOM context.
    * a function in the form `html()` - gets the DOM context as a string
    * a function in the form `trim()` - trims space from the beginning and the end of the string
    * a query param in the form `query-param(abc)` - parses the current context as a URL and extracts the provided param
    * a CSS selector `abc` (if not in the forms above) - alters the DOM context.
* `list` - **Optional** A CSS selector used to split the root DOM context into a set of DOM contexts. Useful for capturing search results.

### Go API
Replace each `<placeholder>` with your configuration, documented above.
1. Define your endpoint struct:
```go
type endpoint struct {
	Method    string   `scraper:"<method>"`
	URL       string   `scraper:"<url>"`
	Result    []result `scraper:"<list>"`
	<MyParam> string   `scraper:"<param-name>"`
}
```

`Method`, `URL`, `Result` and `Debug` are special fields; the remaining **string** fields are treated as input parameters. Input parameters use the field name with the first character lowercased by default.
2. Define your result struct:
```go
type result struct {
	<Field> string `scraper:"<extractor>"`
	<Field> string `scraper:"<extractor> | <extractor>"`
}
```

The result struct is used to define field-to-extractor mappings. All fields must be `string`s. Struct tags cannot contain arrays, so multiple `<extractor>`s are joined with ` | ` instead.
3. Execute it:
```go
e := endpoint{MyParam: "hello world"}
if err := scraper.Execute(&e); err != nil {
...
}
// e.Result is now set
```

#### Similar projects
* https://github.com/ernesto-jimenez/scraperboard