https://github.com/dineshsprabu/concurrent-web-crawler
Flexible and concurrent web crawler implemented in 'go'
- Host: GitHub
- URL: https://github.com/dineshsprabu/concurrent-web-crawler
- Owner: dineshsprabu
- License: MIT
- Created: 2017-05-04T13:44:29.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2017-05-05T06:00:06.000Z (about 9 years ago)
- Last Synced: 2024-06-20T03:34:48.633Z (almost 2 years ago)
- Topics: concurrent-web-crawler, crawler, go-crawler, spider, web-crawler
- Language: Go
- Size: 13.7 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Concurrent Web Crawler
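A configurable web crawler written in Go. It fetches pages concurrently up to a set limit, waits a configurable delay between fetches, and logs the status of every URL it processes.
[GoDoc](https://godoc.org/github.com/dineshsprabu/concurrent-web-crawler) [Build Status](https://travis-ci.org/dineshsprabu/concurrent-web-crawler)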
## Installation
```
go get github.com/dineshsprabu/concurrent-web-crawler
```
## Usage
```go
package main

import (
	// The code below refers to the package as "web", so the alias makes that explicit.
	web "github.com/dineshsprabu/concurrent-web-crawler"
)

func main() {
	// Create a crawler with its configuration.
	myCrawler := web.Crawler{
		MaxConcurrencyLimit: 2,                 // upper bound on concurrent fetches
		StoragePath:         "crawler/storage", // directory for the crawled pages
		CrawlDelay:          10,                // delay between fetches; the sample log suggests seconds
	}

	// URLs to crawl, as a slice of strings.
	urls := []string{
		"https://httpbin.org/ip",
		"http://example.com",
		"https://archive.org/details/opensource_movies",
	}

	// Start the crawler with the list of URLs.
	myCrawler.Start(urls)
}
```
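To illustrate what a concurrency limit and a crawl delay typically mean in practice, here is a minimal sketch of a bounded worker pool using only the standard library. It is not the library's implementation: `crawlAll`, its `fetch` callback, and `errUnsupported` are hypothetical names introduced for this example, and the per-worker pause is only one plausible reading of `CrawlDelay`.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errUnsupported is a hypothetical error used only by this example.
var errUnsupported = errors.New("unsupported scheme")

// crawlAll is a hypothetical helper, not part of concurrent-web-crawler.
// It calls fetch for every URL using at most limit goroutines and returns
// the URLs for which fetch reported an error.
func crawlAll(urls []string, limit int, delay time.Duration, fetch func(string) error) []string {
	jobs := make(chan string)
	var (
		mu     sync.Mutex // guards failed
		failed []string
		wg     sync.WaitGroup
	)

	// Start exactly limit workers; each takes URLs from the shared channel.
	for i := 0; i < limit; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				if err := fetch(u); err != nil {
					mu.Lock()
					failed = append(failed, u)
					mu.Unlock()
				}
				time.Sleep(delay) // pause before this worker takes the next URL
			}
		}()
	}

	for _, u := range urls {
		jobs <- u
	}
	close(jobs) // no more work; workers exit once the channel drains
	wg.Wait()
	return failed
}

func main() {
	urls := []string{"https://httpbin.org/ip", "http://example.com", "bad://url"}

	// A stand-in fetch function that fails for one URL instead of using the network.
	failed := crawlAll(urls, 2, 10*time.Millisecond, func(u string) error {
		if u == "bad://url" {
			return errUnsupported
		}
		return nil
	})
	fmt.Println("Failed urls :", failed)
}
```

With a limit of 2 and three URLs, the third URL waits until one of the two workers has finished its fetch and its delay.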
## Log
```
> go run crawler_sample.go
2017/05/04 20:29:59 || [Processing] Spawning subroutines : 2
2017/05/04 20:29:59 || [Processing] Fetching page content : https://archive.org/details/opensource_movies
2017/05/04 20:29:59 || [Processing] Fetching page content : https://httpbin.org/ip
2017/05/04 20:30:01 || [Processing] Writing to the file : crawler/ip.html
2017/05/04 20:30:01 || [Success] Crawled page : https://httpbin.org/ip
2017/05/04 20:30:03 || [Processing] Writing to the file : crawler/details/opensource_movies.html
2017/05/04 20:30:03 || [Success] Crawled page : https://archive.org/details/opensource_movies
2017/05/04 20:30:11 || [Processing] Fetching page content : http://example.com
2017/05/04 20:30:12 || [Processing] Writing to the file : crawler/example.com/index.html
2017/05/04 20:30:12 || [Success] Crawled page : http://example.com
2017/05/04 20:30:22 || [Status] Failed urls : []
```
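The file names in the log hint at how a URL could be turned into a path on disk. The sketch below reproduces the three paths shown above under a rule inferred from that log alone: a URL with a path is stored as `<path>.html`, and a URL without one is stored as `<host>/index.html`. `storagePathFor` is a hypothetical function, and the library's actual naming logic may differ.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// storagePathFor is a hypothetical helper, not part of concurrent-web-crawler.
// It maps a URL to a file path under base using a rule inferred from the
// sample log only.
func storagePathFor(base, rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}

	p := strings.Trim(u.Path, "/")
	if p == "" {
		// No path component: store the page as <host>/index.html.
		return path.Join(base, u.Host, "index.html"), nil
	}
	// Otherwise reuse the URL path and append ".html".
	return path.Join(base, p+".html"), nil
}

func main() {
	urls := []string{
		"https://httpbin.org/ip",
		"http://example.com",
		"https://archive.org/details/opensource_movies",
	}
	for _, raw := range urls {
		file, err := storagePathFor("crawler", raw)
		if err != nil {
			fmt.Println("invalid URL:", raw)
			continue
		}
		fmt.Println(raw, "->", file)
	}
}
```

Under this rule two different hosts that share a path would map to the same file, so a real crawler would likely want the host in every path.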