https://github.com/twiny/wbot
A simple & efficient web crawler.
big-data crawler golang scraper seo spider
- Host: GitHub
- URL: https://github.com/twiny/wbot
- Owner: twiny
- License: MIT
- Created: 2022-05-10T11:15:36.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-02-18T15:01:29.000Z (11 months ago)
- Last Synced: 2024-06-19T04:10:52.912Z (7 months ago)
- Topics: big-data, crawler, golang, scraper, seo, spider
- Language: Go
- Homepage: https://github.com/twiny/wbot/wiki
- Size: 58.6 KB
- Stars: 17
- Watchers: 2
- Forks: 3
- Open Issues: 2
Metadata Files:
- Readme: README.md
# WBot
A configurable, thread-safe web crawler that provides a minimal interface for crawling and downloading web pages.
## Features
- Clean minimal API.
- Configurable: MaxDepth, MaxBodySize, Rate Limit, Parallelism, User Agent & Proxy rotation.
- Memory-efficient, thread-safe.
- Provides built-in interfaces: Fetcher, Store, Queue & Logger.

## API
WBot provides a minimal API for crawling web pages.
```go
Run(links ...string) error
OnReponse(fn func(*wbot.Response))
Metrics() map[string]int64
Shutdown()
```

## Usage
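The `Metrics()` method listed in the API returns a `map[string]int64`. Go randomizes map iteration order, so printing that map directly gives a different line order on each run. The sketch below shows one possible way to render such a map in a stable order. The helper `formatMetrics` and the sample keys are hypothetical and not part of WBot; the metric names WBot actually reports are not documented here.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatMetrics is a hypothetical helper (not part of WBot). It renders a
// metrics map as "key=value" lines sorted by key, so the output is stable
// even though Go map iteration order is not.
func formatMetrics(m map[string]int64) string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&b, "%s=%d\n", k, m[k])
	}
	return b.String()
}

func main() {
	// Made-up sample data; in a real program this would be bot.Metrics().
	sample := map[string]int64{"example_b": 2, "example_a": 40}
	fmt.Print(formatMetrics(sample))
}
```

With a real crawler, the same helper could be called as `fmt.Print(formatMetrics(bot.Metrics()))` once `Run` returns.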
```go
package main

import (
	"fmt"
	"log"

	"github.com/rs/zerolog"
	"github.com/twiny/wbot"
	"github.com/twiny/wbot/pkg/api"
)

func main() {
	bot := wbot.New(
		wbot.WithParallel(50),
		wbot.WithMaxDepth(5),
		wbot.WithRateLimit(&api.RateLimit{
			Hostname: "*",
			Rate:     "10/1s",
		}),
		wbot.WithLogLevel(zerolog.DebugLevel),
	)
	defer bot.Shutdown()

	// read responses
	bot.OnReponse(func(resp *api.Response) {
		fmt.Printf("crawled: %s\n", resp.URL.String())
	})

	if err := bot.Run(
		"https://go.dev/",
	); err != nil {
		log.Fatal(err)
	}

	log.Printf("finished crawling\n")
}
```

### Wiki
More documentation can be found in the [wiki](https://github.com/twiny/wbot/wiki).
### Bugs
Bugs or suggestions? Please visit the [issue tracker](https://github.com/twiny/wbot/issues).