Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/lulurun/kick-off-crawling
make web scraping easy
- Host: GitHub
- URL: https://github.com/lulurun/kick-off-crawling
- Owner: lulurun
- License: mit
- Created: 2016-02-02T07:10:14.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2019-04-03T02:51:27.000Z (almost 6 years ago)
- Last Synced: 2024-12-14T13:11:39.398Z (27 days ago)
- Topics: crawler, nodejs, scraper
- Language: JavaScript
- Homepage:
- Size: 97.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
README
# Kick-off-crawling

Kick-off-crawling provides a simple API/structure for developers to create scrapers and connect those scrapers together to achieve complicated data mining jobs.

Kick-off-crawling is made possible by the powerful libraries below:
* [`request`](https://github.com/request/request), [`request-promise`](https://github.com/request/request-promise) for making HTTP requests
* [`puppeteer`](https://github.com/GoogleChrome/puppeteer) for fetching HTML pages generated by JavaScript
* [`cheerio`](https://github.com/cheeriojs/cheerio) for parsing HTML text into DOM objects
* [`async`](https://github.com/caolan/async) for scheduling scraping tasks
* [`html-minifier`](https://github.com/kangax/html-minifier) for minifying HTML in order to get stable scraping results

# Overview
Kick-off-crawling exposes a `Scraper` class and a `kickoff` function.
`Scraper`s are self-managed:
* scrape data and URLs from a DOM object parsed by cheerio,
* post the data to a developer-defined function,
* pass the URLs to the next scraper.

`kickoff` takes a URL and a scraper and kicks off the crawling process.
During the crawling process, new scrapers (scraping jobs) are generated and scheduled by the `kickoff` function. The crawling process stops when there are no more scraping jobs.
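As a rough sketch of the API shape described above: the `require` path, class names, selectors, and URL here are placeholders (not from the repo), and the full working example is in the Usage section below.

```
// Assumed import; the package's actual export style may differ.
const { Scraper, kickoff } = require('kick-off-crawling');

// First scraper: finds links on a listing page and schedules a
// follow-up scraping job for each of them.
class ListScraper extends Scraper {
  scrape($, emitter) { // $ is the cheerio-parsed DOM
    $('a.item').each((i, x) => {
      emitter.emitJob($(x).attr('href'), new ItemScraper()); // new scraping job
    });
  }
}

// Second scraper: extracts data from each linked page and posts it
// to the developer-defined onItem callback.
class ItemScraper extends Scraper {
  scrape($, emitter) {
    emitter.emitItem({ title: $('h1').text() });
  }
}

kickoff('https://example.com/list', new ListScraper(), {
  onItem: (item) => console.log(item),
  onDone: () => console.log('done'),
});
```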
# Quick start & Running the Examples
```
$ cd ${REPO_PATH}
$ npm install
$ cd examples
$ node google.js # getting top 30 search results
$ node amazon.js # getting top 5 apps from Amazon Appstore
```

# Usage
_Working example source code: [examples/amazon.js](https://github.com/lulurun/kick-off-crawling/blob/master/examples/amazon.js)_
Here is an example of getting app info from the Amazon Appstore.

### 1. Define your scrapers

We need to define two Scrapers to complete the task:
1. `BrowseNodeScraper` for getting the list of app detail pages.
2. `DetailPageScraper` for getting app info (title, stars, version) from each detail page.

##### Scraping app list
```
class BrowseNodeScraper extends Scraper {
  scrape($, emitter) {
    // pick the first 5 result containers on the browse-node (app list) page
    $('#mainResults .s-item-container').slice(0, 5).each((i, x) => {
      const detailPageUrl = $(x).find('.s-access-detail-page').attr('href');
      emitter.emitJob(detailPageUrl, new DetailPageScraper()); // <-- new scraping job
    });
  }
}
```

##### Scraping the detail page
```
class DetailPageScraper extends Scraper {
  getVersion($) {
    let version = '';
    // walk the technical-details rows and pull out the "バージョン:" ("Version:") value
    $('#mas-technical-details .masrw-content-row .a-section').each((ii, xx) => {
      const text = $(xx).text();
      const re = /バージョン:\s+(.+)/;
      const matched = re.exec(text);
      if (matched) {
        [, version] = matched;
      }
    });
    return version;
  }

  scrape($, emitter) {
    const item = {
      title: $('#btAsinTitle').text(),
      // strip the "5つ星のうち " ("out of 5 stars") prefix from the rating text
      star: $('.a-icon-star .a-icon-alt').eq(0).text().replace('5つ星のうち ', ''),
      version: this.getVersion($),
    };
    emitter.emitItem(item); // <-- the emitted item is received by the *onItem* callback set with the `kickoff` function
  }
}
```
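For reference, a standalone sketch of what the `getVersion` regex extracts: "バージョン" is Japanese for "Version", and the sample row text below is made up.

```
// Hypothetical text of a technical-details row on the Japanese detail page.
const text = 'バージョン: 1.2.3';
const matched = /バージョン:\s+(.+)/.exec(text);
console.log(matched ? matched[1] : ''); // -> '1.2.3'
```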
### 2. Kick off
```
kickoff(
  'https://www.amazon.co.jp/b/?node=2386870051',
  new BrowseNodeScraper(),
  {
    concurrency: 2, // <-- max 2 requests at a time, default 1
    minify: true, // <-- minify html, default true
    headless: false, // <-- set to true when scraping a js-generated page, default false
    onItem: (item) => { // <-- receives each item emitted from a scraper
      console.log(item);
    },
    onDone: () => { // <-- called when there are no more scraping tasks, optional
      console.log('done');
    },
  },
);
```
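As a follow-up, a sketch of one way to use the `onItem`/`onDone` callbacks to persist the crawled results, assuming the same `BrowseNodeScraper` and imports as above (the output filename is arbitrary):

```
const fs = require('fs');

const items = [];
kickoff('https://www.amazon.co.jp/b/?node=2386870051', new BrowseNodeScraper(), {
  concurrency: 2,
  onItem: (item) => items.push(item), // collect each emitted item
  onDone: () => fs.writeFileSync('apps.json', JSON.stringify(items, null, 2)),
});
```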