Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/hengfeiyang/spider
spider provides a set of components for crawling web pages.
- Host: GitHub
- URL: https://github.com/hengfeiyang/spider
- Owner: hengfeiyang
- License: MIT
- Created: 2019-05-08T09:10:04.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2020-06-13T03:53:18.000Z (over 4 years ago)
- Last Synced: 2023-07-13T11:42:34.862Z (over 1 year ago)
- Language: Go
- Size: 40.4 MB
- Stars: 7
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# spider
spider provides a set of components for crawling web pages.

## common
Utility functions for the spider.
## config
Runtime configuration files.
## components
### fetcher
* gokit: fetch data with Go's net/http
* webkit: fetch data with WebKit (PhantomJS), which can execute the JavaScript in a page
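For orientation, a minimal sketch of the plain-HTTP fetch that the gokit fetcher performs, using only Go's standard net/http (an illustration, not the package's fetcher API; the URL is just an example):

```
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Plain HTTP fetch, roughly what the gokit fetcher does; the webkit
	// fetcher would instead drive PhantomJS so in-page JavaScript runs.
	resp, err := http.Get("https://blog.golang.org/index")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Printf("fetched %d bytes\n", len(body))
}
```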
### parser
* htmldom: parse HTML documents with goquery, with support for DOM and XPath selection
* jsonpath: parse JSON data
* regexp: parse data with regular expressions
* substring: parse data with split and substring operations
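As an illustration of the htmldom approach, a small goquery snippet (goquery's own API, not this package's) running the same kind of CSS-selector query the quick demo uses:

```
package main

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	html := `<div id="content"><div><h3><a href="/post/1">Hello, spider</a></h3></div></div>`
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		panic(err)
	}
	// CSS-selector query, the same style used by the quick demo below.
	doc.Find("#content > div > h3 > a").Each(func(i int, s *goquery.Selection) {
		href, _ := s.Attr("href")
		fmt.Println(s.Text(), href)
	})
}
```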
### proxy
* kxdaili: proxy provider for HTTP requests, backed by kx100.com
* more providers to come
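A minimal sketch of routing requests through an HTTP proxy with the standard library; a provider such as kxdaili would supply the proxy address (the address below is a placeholder, and this is not the proxy component's API):

```
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

func main() {
	// Placeholder proxy address; a provider would return a real one.
	proxyURL, err := url.Parse("http://127.0.0.1:8888")
	if err != nil {
		panic(err)
	}
	client := &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)},
	}
	resp, err := client.Get("https://blog.golang.org/index")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status via proxy:", resp.Status)
}
```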
### task
Provides a complete crawl task: start with a URL and end with data.
### url
Provides URL filtering, path fixing, content fetching, content parsing, and more.
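For orientation, a hedged fragment that reuses only calls appearing in the quick demo at the end of this README: a field definition with a CSS-selector match rule, plus SetFixURL to make relative links absolute. It is a sketch of typical usage, not the component's full surface:

```
package main

import (
	"fmt"

	"github.com/safeie/spider/component/task"
	"github.com/safeie/spider/component/url"
)

func main() {
	// Task setup exactly as in the quick demo below.
	t := task.New("1", "golang blog", "https://blog.golang.org", "")

	// Field matched by a CSS selector; SetFixURL asks the url component to
	// turn relative links inside the extracted content into absolute ones.
	content := t.NewField("content", "content").
		SetMatchRule(url.MatchTypeSelector, "#content > div").
		SetFixURL(true)

	fmt.Printf("defined field: %T\n", content)
}
```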
### useragent
* Common // common, generic
* PC // desktop browser
* Mobile // mobile phone
* IOS // iOS
* IPhone // iPhone
* IPad // iPad
* MacOS // macOS
* Android // Android
* Wechat // WeChat
* QQ // QQ
* Baidu // Baidu spider
* Google // Google spider
* Bing // Bing spider
* Sogou // Sogou spider
* Qihu // Qihu spider
* Yahoo // Yahoo spider
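These presets end up in the User-Agent request header; a standard-library sketch of where such a value goes (the UA string below is a placeholder, not one of the package's actual preset values):

```
package main

import (
	"fmt"
	"net/http"
)

func main() {
	req, err := http.NewRequest("GET", "https://blog.golang.org/index", nil)
	if err != nil {
		panic(err)
	}
	// Placeholder mobile UA; the useragent component would supply one of the
	// presets listed above (PC, Mobile, Baidu, ...).
	req.Header.Set("User-Agent", "Mozilla/5.0 (iPhone; CPU iPhone OS 15_0 like Mac OS X)")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```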
## task flow
* task: init->PrepareFunc->URLinitFunc->{url}->BeforeQuitFunc
* url: rule->beforeRuleFunc->BeforeFetchFunc->CheckRepeatFunc->fetch->AntiSpiderFunc->AfterFetchFunc->{rule}->afterRuleFunc
* rule:
* fetchURL
* fetchField->[field.remote]->field.fieldFilterFuncs->rule.fieldFilterFuncs
* save->beforeSaveFunc->saveFunc->afterSaveFunc

### callback
* task level (see the registration sketch after these lists):
1. task.PrepareFunc: called once with the Task; use it for initialization, e.g. setting cookies
2. task.URLinitFunc: called once with no arguments; returns the initial URLs (it may return more than one)
3. task.BeforeFetchFunc: called per URI with the URI; run custom logic before the URL content is fetched
4. task.CheckRepeatFunc: called per URI with the URI; check whether the URL has already been crawled; returning true skips it
5. task.AntiSpiderFunc: called per URI with the URI; check whether the crawl behaviour has triggered an anti-spider rule
6. task.AfterFetchFunc: called per URI with the URI; run custom logic after the URL content is fetched
7. task.BeforeQuitFunc: called once with the task ID and the URL queue; persist the queue before quitting so it can seed the initial URLs of the next run
* rule level:
1. rule.beforeRuleFunc: called per URI with the Rule and URI, after the fetch and before the rule functions run; you can change the page content or anything else
2. rule.afterRuleFunc: called per URI with the Rule and URI, after the rule's parse functions have run
3. rule.beforeSaveFunc: called per URI with the Rule, URI, and data map, before the data is saved; you have a chance to filter the data
4. rule.saveFunc: called per URI with the Rule ID, primary key (pk), and data map; it hands the data to you, and you are responsible for saving it
5. rule.afterSaveFunc: called per URI with the Rule and URI, after the save; you can log or dispatch a success message
6. rule.fieldFilterFuncs: called per field with the Field; a rule-level global field filter, executed on every field
* field level:
1. field.fieldFilterFuncs: called per field with the Field; filters the field value, and runs before the rule-level filter
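A hedged sketch of where these hooks plug in, reusing only setters that appear in the quick demo below: SetURLinitFunc at the task level, SetFilterFunc at the field level, and SetSaveFunc at the rule level (the demo passes nil for SetSaveFunc's second and third arguments, which presumably map to the beforeSaveFunc/afterSaveFunc hooks; the other hooks are assumed to follow the same Set&lt;Name&gt; convention and are not shown):

```
package main

import (
	"fmt"

	"github.com/safeie/spider/component/task"
	"github.com/safeie/spider/component/url"
)

func main() {
	t := task.New("1", "golang blog", "https://blog.golang.org", "")

	// Task-level hook: URLinitFunc supplies the initial URLs.
	t.SetURLinitFunc(func() []string {
		return []string{"https://blog.golang.org/index"}
	})

	// Field-level hook: a field filter runs before the rule-level filters.
	title := t.NewField("title", "title").
		SetMatchRule(url.MatchTypeSelector, "#content > div > h3 > a").
		SetFilterFunc(func(f *url.Field) {
			// clean up the extracted value here
		})

	// Rule-level hook: saveFunc receives the parsed data; saving is up to you.
	t.Rule("https://blog.golang.org/*").
		SetName("blog paper").
		URLs().
		Row(title).
		SetSaveFunc(func(taskID, pk string, val map[string]interface{}) error {
			fmt.Println("saving", pk, val["title"])
			return nil
		}, nil, nil).
		Save()

	fmt.Printf("done with: %v\n", t.Run())
}
```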
## quick demo

Crawl the golang.org blog articles.
```
package main

import (
	"fmt"
	"sync/atomic"

	"github.com/safeie/spider/component/task"
	"github.com/safeie/spider/component/url"
)

func main() {
	// create a new task
	t := task.New("1", "golang blog", "https://blog.golang.org", "")
	// set init urls
	t.SetURLinitFunc(func() []string {
		return []string{"https://blog.golang.org/index"}
	})
	// continue on errors
	t.SetErrorContinue(true)

	// prepare page fields
	// use html document selector query
	title := t.NewField("title", "title").SetMatchRule(url.MatchTypeSelector, "#content > div > h3 > a")
	content := t.NewField("content", "content").SetMatchRule(url.MatchTypeSelector, "#content > div").
		SetFilterFunc(func(f *url.Field) {
			// strip unwanted fragments from the extracted content
			f.Remove(`(*)
`)
			f.Remove(`(*)
`)
		}).
		SetFixURL(true)

	var num uint32
	// use regexp to match urls
	t.Rule("https://blog.golang.org/*").
		SetName("blog paper"). // set rule name
		URLs().                // collect matched urls
		Row(title, content).   // parse page fields
		SetSaveFunc(func(taskID, pk string, val map[string]interface{}) error {
			// the save callback may run concurrently, so use the value
			// returned by the atomic increment
			n := atomic.AddUint32(&num, 1)
			fmt.Printf("%5d taskID: %s, \n-> pk: %s\n-> title: %s\n", n, taskID, pk, val["title"])
			// only print one article's content
			if n == 2 {
				fmt.Println(val["content"])
			}
			return nil
		}, nil, nil).
		Save() // save rule

	// run
	fmt.Printf("done with: %v\n", t.Run())
}
```