https://github.com/bestnathan/crazycrawler
crawl everything by this
- Host: GitHub
- URL: https://github.com/bestnathan/crazycrawler
- Owner: BestNathan
- License: mit
- Created: 2018-01-25T04:40:17.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2018-06-11T04:44:25.000Z (over 7 years ago)
- Last Synced: 2025-09-05T00:46:48.814Z (4 months ago)
- Topics: nodejs
- Language: JavaScript
- Size: 70.3 KB
- Stars: 2
- Watchers: 0
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# crazyCrawler 3.0
crawl everything by this
# Usage
## install
```bash
npm install crazy-crawler
```
## require
```js
const { CrazyCrawler, Task, TaskChain } = require('crazy-crawler')
const crawler = new CrazyCrawler({ maxTask: 5, sleep: 100 })
crawler.on('done', () => {
  // the crawler has finished working
})
```
## create crawler task
```js
const task = new Task({
  name: 'example',
  url: 'http://www.baidu.com',
  handler: function(response) {
    // response is the axios response
    console.log(response.data) // data of the axios response
    console.log(response.task.name) // example
  },
  errorHandler: function(err) {
    // invoked when an error occurs in the HTTP request
  }
})
```
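
The `handler` and `errorHandler` callbacks can be written as small plain functions and tried offline first. The sketch below is only an illustration: `extractTitle` and `describeError` are hypothetical helper names, and it assumes the error passed to `errorHandler` has the usual axios error shape (`response`, `request`, `code`, `message`), which the README does not state explicitly.

```javascript
// Hypothetical helpers; only the axios field names (data, status, response,
// request, code, message) come from axios itself.
function extractTitle(response) {
  // For an HTML page, response.data is the body as a string
  const match = /<title[^>]*>([^<]*)<\/title>/i.exec(String(response.data))
  return match ? match[1].trim() : null
}

function describeError(err) {
  if (err.response) {
    // The server answered with a status outside the 2xx range
    return `HTTP ${err.response.status}`
  }
  if (err.request) {
    // The request was sent but no response arrived (timeout, DNS failure, ...)
    return `no response (${err.code || 'unknown'})`
  }
  // Something failed before the request was sent
  return `setup error: ${err.message}`
}

// Stub objects standing in for what axios would pass to the callbacks
const fakeResponse = { status: 200, data: '<html><title> Example Domain </title></html>' }
const title = extractTitle(fakeResponse)
const reason = describeError({ response: { status: 404 }, message: 'Request failed' })
console.log(title, reason)
```

Under that assumption, the helpers could be called from inside `handler` and `errorHandler` respectively.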
## queue task to crawler and run
```js
crawler.queueTask(task).run() // run crawler
```
## create crawler taskChain
```js
const taskChain = new TaskChain()
const firstTask = new Task({
  name: 'first',
  url: 'http://www.baidu.com',
  beforeTask: function({ lastTask, task, state }) {
    // lastTask is undefined when this is the first task of the chain
    // task is the task about to be executed
    // state is a property of the taskChain, shared by every task in it
    console.log(task.name) // first
    state.firstStatus = 'before'
  },
  afterTask: function({ task, state, response }) {
    // response is the axios response, the same object the handler receives
    console.log(state.firstStatus) // before
    state.firstStatus = 'finish'
  }
})
const secondTask = new Task({
  name: 'second',
  url: 'http://www.baidu.com',
  beforeTask: function({ lastTask, task, state }) {
    console.log(lastTask.name) // first
    console.log(state.firstStatus) // finish
  },
  afterTask: function({ task, state, response }) {
    // response is the axios response, the same object the handler receives
    console.log(response.task.name) // second
  }
})
taskChain.queue([firstTask, secondTask])
```
## queue taskChain to crawler and run
```js
crawler.queueTask(taskChain).run() // run crawler
```
# examples
## repeat task
* example 1
```js
const crawler = new CrazyCrawler({ maxTask: 5, sleep: 100 })
let counter = 0
crawler.on('done', () => {
  console.log(counter) // 3
})
const repeatTask = new Task({
  name: 'repeat',
  url: 'http://example.com',
  handler: function(response) {
    counter++
  },
  repeat: true,
  limit: 3
})
crawler.queueTask(repeatTask).run()
```
* example 2
```js
const crawler = new CrazyCrawler({ maxTask: 5, sleep: 100 })
let counter = 0
crawler.on('done', () => {
  console.log(counter) // 4
})
const repeatTask = new Task({
  name: 'repeat',
  url: 'http://example.com',
  handler: function(response) {
    counter++
  },
  repeat: true,
  limit: 2
})
const repeatTask1 = new Task({
  name: 'repeat',
  url: 'http://example.com',
  handler: function(response) {
    counter++
  },
  repeat: true,
  limit: 2
})
crawler
  .queueTask(repeatTask)
  .queueTask(repeatTask1)
  .run()
```
## functional task
* example 3
```js
const crawler = new CrazyCrawler({ maxTask: 5, sleep: 100 })
let counter = 0
crawler.on('done', () => {
  console.log(counter) // 2
})
const functionalTask = new Task({
  name: 'functional',
  baseUrl: 'http://example.com/:id',
  paramSetters: {
    // this `counter` is the setter's own argument; it shadows the outer variable
    id: function(counter) {
      return counter + 123
      // url will be http://example.com/123 http://example.com/124 ...
    }
  },
  handler: function(response) {
    counter++
  },
  functional: true,
  limit: 2
})
crawler.queueTask(functionalTask).run()
```
## functional and repeat task
* example 4
```js
const crawler = new CrazyCrawler({ maxTask: 5, sleep: 100 })
let counter = 0
crawler.on('done', () => {
  console.log(counter) // 4
})
const functionalTask = new Task({
  name: 'functional',
  baseUrl: 'http://example.com/:id',
  paramSetters: {
    // this `counter` is the setter's own argument; it shadows the outer variable
    id: function(counter) {
      return counter + 123
      // url will be http://example.com/123 http://example.com/124 ...
    }
  },
  handler: function(response) {
    counter++
  },
  functional: true,
  limit: 2
})
const repeatTask = new Task({
  name: 'repeat',
  url: 'http://example.com',
  handler: function(response) {
    counter++
  },
  repeat: true,
  limit: 2
})
crawler
  .queueTask(functionalTask)
  .queueTask(repeatTask)
  .run()
```
# API
## CrazyCrawler
### CrazyCrawler.constructor({ maxTask, sleep })
* maxTask: maximum number of tasks the downloader executes at the same time
* sleep: pause between tasks
### CrazyCrawler.queueTask(task: Task | TaskChain)
* adds a `Task` or `TaskChain` to the crawler
### CrazyCrawler.run()
* run crawler
### events
#### done
* the 'done' event is emitted when the crawler has finished working
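
To make the roles of `maxTask`, `sleep`, and the 'done' event concrete, here is a minimal sketch of a scheduler with the same three knobs. It is not crazy-crawler's implementation: `MiniCrawler`, its `peak` and `finished` fields, and the choice to keep a slot occupied for `sleep` milliseconds after each job are all assumptions made for illustration.

```javascript
const { EventEmitter } = require('events')

// Hypothetical stand-in for the crawler's scheduling behaviour
class MiniCrawler extends EventEmitter {
  constructor({ maxTask = 5, sleep = 0 } = {}) {
    super()
    this.maxTask = maxTask
    this.sleep = sleep
    this.queue = []
    this.running = 0
    this.peak = 0 // highest number of jobs observed in flight
  }

  queueTask(job) {
    this.queue.push(job)
    return this // allow chaining, as in crawler.queueTask(task).run()
  }

  run() {
    // `finished` resolves when 'done' is emitted
    this.finished = new Promise(resolve => this.once('done', resolve))
    this._fill()
    return this
  }

  _fill() {
    if (this.queue.length === 0 && this.running === 0) {
      this.emit('done')
      return
    }
    // Start jobs until every slot is taken or the queue is empty
    while (this.running < this.maxTask && this.queue.length > 0) {
      const job = this.queue.shift()
      this.running++
      this.peak = Math.max(this.peak, this.running)
      Promise.resolve()
        .then(job)
        .catch(() => {}) // a failing job must not stall the queue
        .then(() => {
          // Assumption: the slot is released only after the sleep interval
          setTimeout(() => {
            this.running--
            this._fill()
          }, this.sleep)
        })
    }
  }
}

const mini = new MiniCrawler({ maxTask: 2, sleep: 10 })
const wait = ms => new Promise(resolve => setTimeout(resolve, ms))
let completed = 0
for (let i = 0; i < 5; i++) {
  mini.queueTask(async () => {
    await wait(5)
    completed++
  })
}
mini.on('done', () => console.log(`done: ${completed} jobs, peak ${mini.peak}`))
mini.run()
```

With `maxTask: 2`, no more than two of the five jobs are in flight at once, and 'done' fires once after the last slot is released.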
## Task
### Task.constructor({...options})
#### basic options
* name: the name of the task
* url: target URL
* method: HTTP method, defaults to 'get'
* data: used only when `method` is 'post'; can be a plain object or a string
* headers: can be a plain object or a string
* cookies: cookie object, used only if `headers` has no 'Cookie' property
* axiosOptions: any options supported by axios, including `url`, `method`, `data`, and `headers`
* handler: handles the `response` on success; its parameter is the axios response
* errorHandler: handles any `Error` that occurs during the axios request
* fakeIP: adds 'X-Forword-For' and 'CLIENT_IP' with a random IP to `headers`
* repeat: marks the task as a repeat task
* limit: applies when the task is `repeat` or `functional`; can be a number or a function
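
The interaction between `headers`, `cookies`, and `fakeIP` can be sketched as follows. This is an illustration, not the library's code: `buildHeaders` and `randomIPv4` are hypothetical names, `headers` is assumed to be a plain object, the cookie object is assumed to be serialized as `name=value` pairs joined by `; `, and the sketch uses the standard `X-Forwarded-For` spelling, which may differ from the header name the library actually sends.

```javascript
// Hypothetical helper: a random dotted-quad address
function randomIPv4() {
  const octet = () => Math.floor(Math.random() * 256)
  // First octet stays in 1..223, avoiding 0.x.x.x and the 224+ ranges
  return [1 + Math.floor(Math.random() * 223), octet(), octet(), octet()].join('.')
}

// Hypothetical helper: merge the three options into one headers object
function buildHeaders({ headers = {}, cookies, fakeIP = false } = {}) {
  const result = { ...headers }
  // An explicit Cookie header wins over the cookies option
  if (cookies && result.Cookie === undefined) {
    result.Cookie = Object.entries(cookies)
      .map(([name, value]) => `${name}=${value}`)
      .join('; ')
  }
  if (fakeIP) {
    const ip = randomIPv4()
    result['X-Forwarded-For'] = ip // assumed spelling, see the note above
    result['CLIENT_IP'] = ip
  }
  return result
}

console.log(buildHeaders({ cookies: { session: 'abc', lang: 'en' }, fakeIP: true }))
```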
#### functional task options
* functional: marks the task as functional
* baseUrl: template from which `url` is generated
* baseData: template from which `data` is generated
* paramSetters: specifies the properties to be generated into `url` and `data`
* baseUrlPattern: pattern used to find the places in `baseUrl` to replace with generated params
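
How `baseUrl`, `paramSetters`, and `baseUrlPattern` could fit together is sketched below. The function `generateUrl` and the default `:name` pattern are assumptions based on the `http://example.com/:id` example earlier in this README; the library's real pattern handling may differ.

```javascript
// Hypothetical re-implementation of the idea, not crazy-crawler's own code
function generateUrl(baseUrl, paramSetters, counter, pattern = /:(\w+)/g) {
  return baseUrl.replace(pattern, (placeholder, name) => {
    const setter = paramSetters[name]
    // Placeholders without a setter (for example a ":8080" port) stay untouched
    return typeof setter === 'function' ? String(setter(counter)) : placeholder
  })
}

const paramSetters = { id: counter => counter + 123 }
const urls = [0, 1].map(n => generateUrl('http://example.com/:id', paramSetters, n))
console.log(urls) // http://example.com/123 and http://example.com/124
```

Leaving unmatched placeholders alone is a deliberate choice in this sketch, so that a port number after the host is not mistaken for a parameter.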
#### task in chain options
* inChain: marks the task as running in a chain
* beforeTask: invoked before the axios request; you can modify the task here
* afterTask: invoked after the axios request; you can store useful data here for later tasks in the chain
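
The hook order and the shared `state` object can be illustrated with a small offline sketch. `runChain` and `fakeRequest` are hypothetical names, and the stubbed request replaces the real HTTP call; the hook argument shapes (`{ lastTask, task, state }` and `{ task, state, response }`) follow the task chain example earlier in this README.

```javascript
// Hypothetical sketch: run tasks one after another with one shared state object
async function runChain(tasks, request) {
  const state = {}
  let lastTask // undefined for the first task in the chain
  for (const task of tasks) {
    if (task.beforeTask) await task.beforeTask({ lastTask, task, state })
    const response = await request(task)
    response.task = task
    if (task.afterTask) await task.afterTask({ task, state, response })
    lastTask = task
  }
  return state
}

// Stub standing in for the HTTP layer so the sketch runs without a network
const fakeRequest = async task => ({ status: 200, data: `body of ${task.url}` })

const chainTasks = [
  {
    name: 'first',
    url: 'http://example.com/a',
    // Store something from the response for the next task
    afterTask: ({ state, response }) => {
      state.firstStatus = response.status
    }
  },
  {
    name: 'second',
    url: 'http://example.com/b',
    // Read the stored value and modify the task before it is requested
    beforeTask: ({ lastTask, task, state }) => {
      state.previous = lastTask.name
      task.url += `?status=${state.firstStatus}`
    }
  }
]

const chainResult = runChain(chainTasks, fakeRequest)
chainResult.then(state => console.log(state, chainTasks[1].url))
```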
### Task.exec()
runs the task
### Task.checkLimit()
checks whether the task is over its `limit`
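
Since `limit` can be a number or a function, a limit check might look like the sketch below. The names `isOverLimit` and `repeatUntilLimit` are hypothetical, and the function form is assumed to receive the number of completed runs and return a truthy value once the task should stop; the README does not document the function's actual signature.

```javascript
// Hypothetical sketch of a limit check; the function signature is an assumption
function isOverLimit(limit, timesRun) {
  if (typeof limit === 'function') return Boolean(limit(timesRun))
  if (typeof limit === 'number') return timesRun >= limit
  throw new TypeError('limit must be a number or a function')
}

// Run `work` until the limit is reached and report how many times it ran
function repeatUntilLimit(limit, work) {
  let timesRun = 0
  while (!isOverLimit(limit, timesRun)) {
    work(timesRun)
    timesRun++
  }
  return timesRun
}

let hits = 0
const byNumber = repeatUntilLimit(3, () => hits++)
const byFunction = repeatUntilLimit(n => n >= 2, () => hits++)
console.log(byNumber, byFunction, hits)
```

With a numeric limit of 3 the work runs three times, which matches the counter in repeat example 1.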
### Task.copy()
returns a task with the `copyFrom` property of this task
### Task.repeatTask()
returns a task like this one
### Task.generateTask()
if the task is `functional`, returns a task generated from the functional options;
otherwise returns `this.copy()`
## TaskChain
### TaskChain.constructor({ repeat, functional, limit })
* repeat: marks this task chain as a repeat chain
* limit: number of times to repeat; does not apply to functional chains
* functional: marks this task chain as functional
### TaskChain.queue(task)
queues tasks to `exec` in the chain; they run in the order they were queued
### TaskChain.toTask()
converts the chain to a `Task`
### TaskChain.checkLimit()
if `repeat`, checks whether the chain is over `limit`; if `functional`, invokes `checkLimit` on every task in the chain
### TaskChain.generateTaskChain()
if `functional`, invokes `generateTask` on every task in the chain, pushes the results to a new `TaskChain`, and returns the new chain
### TaskChain.repeatTaskChain()
if `repeat`, returns a new `TaskChain` based on this `taskChain`
# Pull requests welcome
# License
MIT