
https://github.com/foolin/scrago

A simple, fast, extensible page-crawling framework for Go

crawler go scrago scrapy


# scrago

Scrago is a simple, fast, extensible page-crawling framework for Go.

# Install

```
go get github.com/foolin/scrago
```

# Documentation

[Godoc](https://godoc.org/github.com/foolin/scrago "go document")

# Example

### Step 1:
```go

// Imports used by this snippet: "fmt", "strings",
// and "github.com/PuerkitoBio/goquery".

// ExampModel maps page elements to fields through `scrago` struct tags.
type ExampModel struct {
	Title       string   `scrago:"title"`
	Name        string   `scrago:"#main>.intro>h2::text()"`
	Description string   `scrago:"#main>.intro>p::html()"`
	Intro       string   `scrago:"#main>.intro::outerHtml()"`
	Keywords    []string `scrago:"#main .keywords::GetMyKeywords()"`
}

// GetMyKeywords is a custom function: it splits the comma-separated
// keyword text and trims each entry.
func (e *ExampModel) GetMyKeywords(s *goquery.Selection) ([]string, error) {
	v := s.Text()
	if v == "" {
		return nil, fmt.Errorf("keywords not found")
	}
	arr := strings.Split(v, ",")
	for i := range arr {
		arr[i] = strings.TrimSpace(arr[i])
	}
	return arr, nil
}

```
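
The `scrago` tags above are ordinary Go struct tags, so they can be inspected with the standard `reflect` package. The sketch below is not scrago's own code; `scragoTags` is a hypothetical helper, and the struct is a two-field copy of the model from Step 1, used only to show what a framework sees when it reads the tags.

```go
package main

import (
	"fmt"
	"reflect"
)

// ExampModel repeats two fields of the Step 1 struct for illustration.
type ExampModel struct {
	Title string `scrago:"title"`
	Name  string `scrago:"#main>.intro>h2::text()"`
}

// scragoTags (hypothetical helper) returns the raw `scrago` tag of every
// struct field that has one, keyed by field name. It accepts a struct or
// a pointer to a struct and returns an empty map for anything else.
func scragoTags(v interface{}) map[string]string {
	tags := make(map[string]string)
	t := reflect.TypeOf(v)
	if t == nil {
		return tags
	}
	if t.Kind() == reflect.Ptr {
		t = t.Elem()
	}
	if t.Kind() != reflect.Struct {
		return tags
	}
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if tag, ok := f.Tag.Lookup("scrago"); ok {
			tags[f.Name] = tag
		}
	}
	return tags
}

func main() {
	for name, tag := range scragoTags(&ExampModel{}) {
		fmt.Printf("%s -> %s\n", name, tag)
	}
}
```

Map iteration order is not fixed in Go, so the two lines may print in either order.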

### Step 2:
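```go

// Imports used by this snippet: "encoding/json", "log", "os",
// and "github.com/foolin/scrago".

func main() {
	examp := ExampModel{}
	s := scrago.New()
	// Fetch the page and fill examp according to its struct tags.
	err := s.HttpGetParser("https://raw.githubusercontent.com/foolin/scrago/master/example/data/example.html", &examp)
	if err != nil {
		log.Fatal(err)
	}
	printjson(examp)
}

// printjson writes v to stdout as indented JSON without escaping HTML.
func printjson(v interface{}) {
	enc := json.NewEncoder(os.Stdout)
	enc.SetEscapeHTML(false)
	enc.SetIndent("", " ")
	if err := enc.Encode(v); err != nil {
		log.Fatal(err)
	}
}

```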

### Step 3:
Execution result (the markup inside `Intro` is restored from the Step 1 selectors; the `div` element names are an assumption):

```json

{
 "Title": "Scrago exmaples",
 "Name": "Scrago framework",
 "Description": "An open source and collaborative framework for extracting the data you need from websites.\n In a fast, simple, yet extensible way.",
 "Intro": "<div class=\"intro\">\n <h2>Scrago framework</h2>\n <p>An open source and collaborative framework for extracting the data you need from websites.\n In a fast, simple, yet extensible way.</p>\n <div class=\"keywords\">Scrago, Scrap, Spider, Crawl, GoLang, Simple, Easy</div>\n </div>",
 "Keywords": [
  "Scrago",
  "Scrap",
  "Spider",
  "Crawl",
  "GoLang",
  "Simple",
  "Easy"
 ]
}

```

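Original page (tags are reconstructed from the Step 1 selectors and the rendered lists; the `div` element names and the placement of the lists are assumptions):
```html

<html>
<head>
<title>Scrago exmaples</title>
</head>
<body>
<div id="main">
 <div class="intro">
 <h2>Scrago framework</h2>
 <p>An open source and collaborative framework for extracting the data you need from websites.
 In a fast, simple, yet extensible way.</p>
 <div class="keywords">Scrago, Scrap, Spider, Crawl, GoLang, Simple, Easy</div>
 </div>
 <ul>
 <li>true</li>
 <li>123</li>
 <li>45.6</li>
 <li>hello</li>
 </ul>
 <ol>
 <li>Aa</li>
 <li>Bb</li>
 <li>Cc</li>
 </ol>
</div>
</body>
</html>

```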

# Struct tag
Separate the selector from the function with `::`:
```go
`scrago:"selector::function"`

```
* selector:
CSS selector; for details, see github.com/PuerkitoBio/goquery.

* function:
The function that extracts the data; the default is text().

1. Built-in functions:
- text() gets the text value.
- html() gets the HTML value.
- outerHtml() gets the outer HTML value.
- attr(xxx) gets an attribute value, e.g. attr(href).

2. Custom functions:
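```go

// MyReturnType and ReturnValue are placeholders for the field's type and value.
func (e *ExampModel) MyFunc(s *goquery.Selection) (MyReturnType, error) {
	// TODO: extract the value from s
	return ReturnValue, nil
}

```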
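
A tag names its custom function as a string, so a framework has to find the method at run time. One common way to do that in Go is `reflect.Value.MethodByName`. The sketch below shows that technique under stated assumptions: it is not taken from scrago's source, `callCustom` is a hypothetical helper, and `selection` is a stand-in for `*goquery.Selection` so the example runs with the standard library only.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// selection is a hypothetical stand-in for *goquery.Selection.
type selection struct{ text string }

func (s *selection) Text() string { return s.text }

type ExampModel struct {
	Keywords []string
}

// GetMyKeywords follows the custom-function shape:
// one selection in, (value, error) out.
func (e *ExampModel) GetMyKeywords(s *selection) ([]string, error) {
	parts := strings.Split(s.Text(), ",")
	for i := range parts {
		parts[i] = strings.TrimSpace(parts[i])
	}
	return parts, nil
}

// callCustom (hypothetical helper) looks up an exported method by name
// on model, checks its shape, and calls it with s.
func callCustom(model interface{}, name string, s *selection) (interface{}, error) {
	v := reflect.ValueOf(model)
	if !v.IsValid() {
		return nil, fmt.Errorf("nil model")
	}
	m := v.MethodByName(name)
	if !m.IsValid() {
		return nil, fmt.Errorf("method %q not found", name)
	}
	mt := m.Type()
	if mt.NumIn() != 1 || mt.NumOut() != 2 || mt.In(0) != reflect.TypeOf(s) {
		return nil, fmt.Errorf("method %q has an unexpected signature", name)
	}
	out := m.Call([]reflect.Value{reflect.ValueOf(s)})
	// The second result is the error; a nil error fails the assertion.
	if err, ok := out[1].Interface().(error); ok && err != nil {
		return nil, err
	}
	return out[0].Interface(), nil
}

func main() {
	v, err := callCustom(&ExampModel{}, "GetMyKeywords", &selection{text: "Scrago, Scrap, Spider"})
	fmt.Println(v, err)
}
```

Passing a pointer to the model matters here: methods with a pointer receiver, like the custom functions in this README, are only visible to reflection through a pointer value.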

For example:
```go

type ExampModel struct {
	TextField  string `scrago:"#xxx"`
	TextField2 string `scrago:".xxx::text()"`
	Link       string `scrago:"a::attr(href)"`
	MyField    string `scrago:"#xxx::MyFunc()"`
}

// MyFunc returns the text of the selected element.
func (e *ExampModel) MyFunc(s *goquery.Selection) (string, error) {
	return s.Text(), nil
}

```
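
The four tags in this example cover every form the `selector::function` syntax takes: no function, a built-in without an argument, a built-in with an argument, and a custom function. As a rough illustration of how such a tag breaks down, the sketch below splits a tag into its parts with the standard `strings` package. It is an assumption about the format as described in this section, not scrago's actual parser; `tagSpec` and `parseTag` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// tagSpec is a hypothetical parsed form of a `selector::function` tag.
type tagSpec struct {
	Selector string
	Func     string // function name without parentheses
	Arg      string // text between the parentheses, may be empty
}

// parseTag splits a tag at the last "::". When no function is given,
// Func falls back to "text", the documented default.
func parseTag(tag string) tagSpec {
	spec := tagSpec{Selector: strings.TrimSpace(tag), Func: "text"}
	i := strings.LastIndex(tag, "::")
	if i < 0 {
		return spec
	}
	spec.Selector = strings.TrimSpace(tag[:i])
	fn := strings.TrimSpace(tag[i+2:])
	name, arg := fn, ""
	if open := strings.Index(fn, "("); open >= 0 && strings.HasSuffix(fn, ")") {
		name = fn[:open]
		arg = strings.TrimSpace(fn[open+1 : len(fn)-1])
	}
	if name != "" {
		spec.Func, spec.Arg = name, arg
	}
	return spec
}

func main() {
	for _, tag := range []string{"#xxx", ".xxx::text()", "a::attr(href)", "#xxx::MyFunc()"} {
		fmt.Printf("%q -> %+v\n", tag, parseTag(tag))
	}
}
```

Splitting at the last `::` rather than the first is a design choice of this sketch, so that a selector containing `::` would still leave the function part intact.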

# Examples
* [Simple](https://github.com/foolin/scrago/tree/master/example/simple "Simple Example")
* [Parser](https://github.com/foolin/scrago/tree/master/example/parser "Parser Example")
* [Quotesbot](https://github.com/foolin/scrago/tree/master/example/quotesbot "Quotesbot Example")

# Related projects
* github.com/PuerkitoBio/goquery