https://github.com/wspl/creeper
:paw_prints: Creeper - The Next Generation Crawler Framework (Go)
crawler cross-platform framework golang language script spider
- Host: GitHub
- URL: https://github.com/wspl/creeper
- Owner: wspl
- License: apache-2.0
- Created: 2017-02-17T03:01:50.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2017-05-16T12:14:14.000Z (over 7 years ago)
- Last Synced: 2024-07-31T14:09:59.176Z (5 months ago)
- Topics: crawler, cross-platform, framework, golang, language, script, spider
- Language: Go
- Homepage:
- Size: 393 KB
- Stars: 778
- Watchers: 47
- Forks: 59
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- starred-awesome - creeper - :paw_prints: Creeper - The Next Generation Crawler Framework (Go) (Go)
README
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg?style=flat)](https://opensource.org/licenses/Apache-2.0)
[![Go Report Card](https://goreportcard.com/badge/github.com/wspl/creeper)](https://goreportcard.com/report/github.com/wspl/creeper)
[![Gitter](https://img.shields.io/gitter/room/nwjs/nw.js.svg)](https://gitter.im/creeper-project/Lobby?utm_source=share-link&utm_medium=link&utm_campaign=share-link)
![Creeper](https://raw.githubusercontent.com/wspl/creeper/master/art/Creeper.png)
## About
Creeper is a *next-generation* crawler that fetches web pages using Creeper scripts. As a cross-platform embedded crawler, it can be used in news apps, subscription programs, and similar projects.
**Warning:** This project is still in early-stage development; please do not use it in production environments.
## Get Started
#### Installation
```
$ go get github.com/wspl/creeper
```
#### Hello World!
Create `hacker_news.crs`
```
page(@page=1) = "https://news.ycombinator.com/news?p={@page}"

news[]: page -> $("tr.athing")
    title: $(".title a.storylink").text
    site: $(".title span.sitestr").text
    link: $(".title a.storylink").href
```
Then, create `main.go`:
```go
package main

import "github.com/wspl/creeper"

func main() {
	// Open the Creeper script created above.
	c := creeper.Open("./hacker_news.crs")
	// Visit each item of the `news` array node and print its fields.
	c.Array("news").Each(func(c *creeper.Creeper) {
		println("title: ", c.String("title"))
		println("site: ", c.String("site"))
		println("link: ", c.String("link"))
		println("===")
	})
}
```
Build and run. The console prints something like:
```
title: Samsung chief Lee arrested as S.Korean corruption probe deepens
site: reuters.com
link: http://www.reuters.com/article/us-southkorea-politics-samsung-group-idUSKBN15V2RD
===
title: ReactOS 0.4.4 Released
site: reactos.org
link: https://reactos.org/project-news/reactos-044-released
===
title: FeFETs: How this new memory stacks up against existing non-volatile memory
site: semiengineering.com
link: http://semiengineering.com/what-are-fefets/
```
## Script Spec
### Town
A town is a lambda-like expression that holds a mutable or immutable string. Most of the time it is used to store a URL.
```
page(@page=1, ext) = "https://news.ycombinator.com/news?p={@page}&ext={ext}"
```
When you need a town, use it as if you were calling a function:
```
news[]: page(ext="Hello World!") -> $("tr.athing")
```
You might have noticed that the `@page` parameter is not passed in the call above. That is because it is a special parameter.
An expression such as `name="something"` in a town definition gives the parameter `name` the default value `"something"`.
Incidentally, `@page` is a parameter that automatically increases when the current page has no more content.
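
To make the default-value rule concrete, here is a minimal Go sketch of how a town template could be filled in. It is not Creeper's implementation: `expandTown` and its map-based arguments are hypothetical names introduced only for this illustration, and the sketch assumes a parameter without a default expands to an empty string. It does no URL escaping and does not model the automatic increase of `@page`.

```go
package main

import (
	"fmt"
	"strings"
)

// expandTown is a hypothetical helper, not part of Creeper's API.
// It fills each {name} placeholder in a town template, preferring an
// explicitly passed argument and falling back to the declared default.
func expandTown(template string, defaults, args map[string]string) string {
	out := template
	for name, def := range defaults {
		val, ok := args[name]
		if !ok {
			val = def // no argument passed: use the default value
		}
		out = strings.ReplaceAll(out, "{"+name+"}", val)
	}
	return out
}

func main() {
	template := "https://news.ycombinator.com/news?p={@page}&ext={ext}"
	// @page defaults to "1"; ext has no default, modeled here as "" (assumption).
	defaults := map[string]string{"@page": "1", "ext": ""}
	// Only ext is passed, so @page falls back to its default.
	fmt.Println(expandTown(template, defaults, map[string]string{"ext": "hello"}))
}
```

With only `ext` supplied, the printed URL ends in `p=1&ext=hello`, mirroring how the call above omits `@page`.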
### Node
Nodes form a tree that represents the data structure you are going to crawl.
```
news[]: page -> $("tr.athing")
    title: $(".title a.storylink").text
    site: $(".title span.sitestr").text
    link: $(".title a.storylink").href
```
Like `yaml`, nodes distinguish the hierarchy by indentation.
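
As an illustration of indentation-based nesting, the following Go sketch reads indented lines into a tree. It is not how Creeper parses scripts: the `node` type and `parseNodes` function are hypothetical, only leading spaces are counted (tabs are ignored), and everything after the first `:` on a line is discarded.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical type for this sketch, not Creeper's own structure.
type node struct {
	name     string
	children []*node
}

// parseNodes builds a tree from lines whose nesting is given by leading spaces.
func parseNodes(src string) []*node {
	root := &node{}
	// stack holds the chain of open ancestors; indents holds their indentation.
	stack := []*node{root}
	indents := []int{-1}
	for _, line := range strings.Split(src, "\n") {
		trimmed := strings.TrimLeft(line, " ")
		if trimmed == "" {
			continue // skip blank lines
		}
		indent := len(line) - len(trimmed)
		// Pop back to the nearest ancestor that is less indented than this line.
		for indents[len(indents)-1] >= indent {
			stack = stack[:len(stack)-1]
			indents = indents[:len(indents)-1]
		}
		name, _, _ := strings.Cut(trimmed, ":")
		n := &node{name: name}
		parent := stack[len(stack)-1]
		parent.children = append(parent.children, n)
		stack = append(stack, n)
		indents = append(indents, indent)
	}
	return root.children
}

func main() {
	src := `news[]: page -> $("tr.athing")
    title: $(".title a.storylink").text
    site: $(".title span.sitestr").text
    link: $(".title a.storylink").href`
	for _, n := range parseNodes(src) {
		var names []string
		for _, c := range n.children {
			names = append(names, c.name)
		}
		fmt.Printf("%s -> %v\n", n.name, names)
	}
}
```

For the script above, this prints `news[] -> [title site link]`: the three fields become children of the `news[]` array node because they are indented further than it.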
#### Node Name
Every node has a name. `title` is a field name, representing a plain string value. `news[]` is an array name, representing a parent structure with multiple sub-items.
#### Page
Page indicates where to fetch the field data from. It can be a town expression or a field reference.
Field reference is an advanced usage of Node; you can find the details in [./eh.crs](./eh.crs).
If a node has both a page and a fun, the page goes on the left of `->` and the fun on the right, that is, `page -> fun`.
#### Fun
A fun describes how the selected data is processed.
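
To give a feel for this kind of processing, the sketch below approximates two of the funs listed in the table that follows, `match` and `expand`, using Go's standard `regexp` package. This is an assumption about their behavior based only on the one-line descriptions in the table, not Creeper's actual code; `matchFirst` and `expandTo` are hypothetical names, and the sample `href` value is made up for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchFirst is a hypothetical stand-in for the `match` fun: it returns the
// first substring of s that matches pattern, or "" when nothing matches.
func matchFirst(s, pattern string) string {
	return regexp.MustCompile(pattern).FindString(s)
}

// expandTo is a hypothetical stand-in for the `expand` fun: it rewrites the
// first match of pattern into target, where $1, $2, ... name capture groups.
func expandTo(s, pattern, target string) string {
	re := regexp.MustCompile(pattern)
	m := re.FindStringSubmatchIndex(s)
	if m == nil {
		return "" // no match: nothing to expand
	}
	return string(re.ExpandString(nil, target, s, m))
}

func main() {
	href := "item?id=13662222" // made-up sample value
	fmt.Println(matchFirst(href, `\d+`))
	fmt.Println(expandTo(href, `id=(\d+)`, "https://news.ycombinator.com/item?id=$1"))
}
```

The first call extracts the numeric id from the string; the second turns the relative link into an absolute URL by substituting the captured group into the target.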
The supported funs are:
| Name | Parameters | Description |
| --------- | -------------------------------- | ---------------------------------------- |
| $ | (selector: string) | Relative CSS selector (select from parent node)|
| $root | (selector: string) | Absolute CSS selector (select from body)|
| html | | inner HTML |
| text | | inner text |
| outerHTML | | outer HTML |
| attr | (attr: string) | attribute value |
| style | | style attribute value |
| href | | href attribute value |
| src | | src attribute value |
| class | | class attribute value |
| id | | id attribute value |
| calc | (prec: int) | calculate arithmetic expression |
| match | (regexp: string) | match first sub-string via regular expression |
| expand | (regexp: string, target: string) | expand matched strings to target string |
## Author
Plutonist
> [impl.moe](https://impl.moe) · GitHub [@wspl](https://github.com/wspl)