https://github.com/tossmilestone/crawlgo

A crawler written in golang

Host: GitHub
URL: https://github.com/tossmilestone/crawlgo
Owner: tossmilestone
License: apache-2.0
Created: 2018-02-13T08:14:50.000Z (almost 8 years ago)
Default Branch: master
Last Pushed: 2020-04-19T14:11:26.000Z (over 5 years ago)
Last Synced: 2025-01-07T14:46:29.506Z (11 months ago)
Topics: cralwer, golang
Language: Go
Size: 175 KB
Stars: 2
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Crawlgo ![Go](https://github.com/tossmilestone/crawlgo/workflows/Go/badge.svg) [![CircleCI](https://circleci.com/gh/tossmilestone/crawlgo.svg?style=shield)](https://circleci.com/gh/tossmilestone/crawlgo) [![Coverage Status](https://coveralls.io/repos/github/tossmilestone/crawlgo/badge.svg?branch=master)](https://coveralls.io/github/tossmilestone/crawlgo?branch=master) [![Go Report Card](https://goreportcard.com/badge/github.com/tossmilestone/crawlgo)](https://goreportcard.com/report/github.com/tossmilestone/crawlgo)

Crawlgo is a crawler written in Go. It aims to be an extensible, scalable, and high-performance distributed crawler system.

Using `phantomjs`, crawlgo can crawl web pages that are rendered with JavaScript.

## Prerequisites

* Linux OS
* phantomjs: the `phantomjs` executable must be available on your `PATH`. It can be downloaded [here](http://phantomjs.org/download.html).

## Install

```
go get github.com/tossmilestone/crawlgo
cd ${GOPATH}/src/github.com/tossmilestone/crawlgo
sudo make install
```

The above commands will install `crawlgo` in `${GOPATH}/go/bin`.

## Usage

```
crawlgo [flags]

Flags:
      --download-selector string   The DOM selector to query the links that will be downloaded from the site
      --enable-profile             enable profiling the program to start a pprof HTTP server on localhost:6360
  -h, --help                       help for crawlgo
      --save-dir string            The directory to save downloaded files. (default "./crawlgo")
      --site string                The site to crawl
      --version                    version for crawlgo
      --workers int                The number of workers to run the crawl tasks. If no set, will be 'runtime.NumCPU()'
```
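
The `--workers` default can be pictured with a small worker pool. The following is only a sketch and is not taken from crawlgo's source: `resolveWorkers` and `runTasks` are hypothetical names, and treating a value of zero or less as "not set" is an assumption.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// resolveWorkers returns the requested worker count, or runtime.NumCPU()
// when the value looks unset. Treating <= 0 as "unset" is an assumption.
func resolveWorkers(requested int) int {
	if requested <= 0 {
		return runtime.NumCPU()
	}
	return requested
}

// runTasks feeds every link to a fixed pool of goroutines and returns
// how many links were processed. Hypothetical helper for illustration.
func runTasks(links []string, workers int, process func(string)) int {
	n := resolveWorkers(workers)
	tasks := make(chan string)

	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		done int
	)

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for link := range tasks {
				process(link)
				mu.Lock()
				done++
				mu.Unlock()
			}
		}()
	}

	for _, link := range links {
		tasks <- link
	}
	close(tasks) // lets the workers exit their range loops
	wg.Wait()
	return done
}

func main() {
	links := []string{
		"https://example.com/a.pdf",
		"https://example.com/b.pdf",
	}
	// 0 stands in for "flag not given" in this sketch.
	count := runTasks(links, 0, func(link string) {
		fmt.Println("crawling", link)
	})
	fmt.Println("processed", count, "links")
}
```

A fixed pool keeps the number of concurrent crawl tasks bounded regardless of how many links are queued.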

Crawlgo uses file names to identify downloaded links. If a file for a link already exists in the save directory, the link is assumed to have been downloaded already.
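
The file-name check described above could look roughly like the sketch below. It is an illustration under assumptions, not crawlgo's actual code: `fileNameForLink` and `alreadyDownloaded` are hypothetical helpers, and deriving the name from the last path segment of the URL is a guess at the naming rule.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
	"path"
	"path/filepath"
)

// fileNameForLink derives a local file name from the last path segment
// of a link. Assumption: the real naming rule in crawlgo may differ.
func fileNameForLink(link string) (string, error) {
	u, err := url.Parse(link)
	if err != nil {
		return "", err
	}
	name := path.Base(u.Path)
	if name == "." || name == "/" {
		return "", fmt.Errorf("link %q has no file name", link)
	}
	return name, nil
}

// alreadyDownloaded reports whether a file with the link's name
// exists in saveDir. Any stat error is treated as "not downloaded".
func alreadyDownloaded(saveDir, link string) bool {
	name, err := fileNameForLink(link)
	if err != nil {
		return false
	}
	_, err = os.Stat(filepath.Join(saveDir, name))
	return err == nil
}

func main() {
	link := "https://example.com/files/report.pdf"
	if alreadyDownloaded("./crawlgo", link) {
		fmt.Println("skip:", link)
	} else {
		fmt.Println("download:", link)
	}
}
```

One consequence of a name-only check is that two different links with the same file name would be treated as the same download, and a partially written file would count as complete.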
