Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/tossmilestone/crawlgo
A crawler written in golang
- Host: GitHub
- URL: https://github.com/tossmilestone/crawlgo
- Owner: tossmilestone
- License: apache-2.0
- Created: 2018-02-13T08:14:50.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2020-04-19T14:11:26.000Z (over 4 years ago)
- Last Synced: 2024-06-20T09:18:13.013Z (5 months ago)
- Topics: crawler, golang
- Language: Go
- Size: 175 KB
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
Awesome Lists containing this project
README
# Crawlgo ![Go](https://github.com/tossmilestone/crawlgo/workflows/Go/badge.svg) [![CircleCI](https://circleci.com/gh/tossmilestone/crawlgo.svg?style=shield)](https://circleci.com/gh/tossmilestone/crawlgo) [![Coverage Status](https://coveralls.io/repos/github/tossmilestone/crawlgo/badge.svg?branch=master)](https://coveralls.io/github/tossmilestone/crawlgo?branch=master) [![Go Report Card](https://goreportcard.com/badge/github.com/tossmilestone/crawlgo)](https://goreportcard.com/report/github.com/tossmilestone/crawlgo)
Crawlgo is a crawler written in Go that aims to be an extensible, scalable, and high-performance distributed crawler system.
Using `phantomjs`, Crawlgo can crawl web pages rendered with JavaScript.
## Prerequisites

* Linux OS
* phantomjs: `phantomjs` must be runnable via the `PATH` environment variable. It can be downloaded [here](http://phantomjs.org/download.html).

## Install
```
go get github.com/tossmilestone/crawlgo
cd ${GOPATH}/src/github.com/tossmilestone/crawlgo
sudo make install
```

The above commands will install `crawlgo` in `${GOPATH}/bin`.
## Usage
```
crawlgo [flags]

Flags:
--download-selector string The DOM selector to query the links that will be downloaded from the site
--enable-profile Enable profiling: starts a pprof HTTP server on localhost:6360
-h, --help help for crawlgo
--save-dir string The directory to save downloaded files. (default "./crawlgo")
--site string The site to crawl
--version version for crawlgo
--workers int The number of workers to run the crawl tasks. If not set, defaults to 'runtime.NumCPU()'
```

Crawlgo uses file names to identify downloaded links: if the file for a link already exists in the save directory, the link is assumed to have been downloaded already.