Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/schne324/moocher
Web content scraper
https://github.com/schne324/moocher
scraper scraping scraping-websites web-scraper web-scraping
Last synced: about 2 months ago
JSON representation
Web content scraper
- Host: GitHub
- URL: https://github.com/schne324/moocher
- Owner: schne324
- License: mit
- Created: 2017-02-07T03:57:59.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2018-08-20T00:03:40.000Z (over 6 years ago)
- Last Synced: 2024-10-04T00:25:03.376Z (3 months ago)
- Topics: scraper, scraping, scraping-websites, web-scraper, web-scraping
- Language: JavaScript
- Homepage:
- Size: 46.9 KB
- Stars: 1
- Watchers: 2
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Moocher
> Web content scraper
[![CircleCI](https://circleci.com/gh/schne324/moocher.svg?style=svg)](https://circleci.com/gh/schne324/moocher)
## Installation
```bash
$ npm install --save moocher # or yarn add moocher
```## Usage
```js
new Moocher(urls, options);
```- `urls` {String|Array} a single string url or an array of urls to scrape content from.
- `options` {Object} (optional) the configuration object.
- `limit` {Number} (optional) the number of concurrent requests to make while scraping. Defaults to `undefined` which does not enforce a concurrency limit (all requests will be run in parallel).## API
Moocher emits the following events:- `"mooch"`: Emits for each response. The callback receives the following arguments:
- `$`: The cheerio-loaded document. This means you can just use jQuery methods on the response document.
- `url`: The original url passed to Moocher.
- `response`: The full response object
- `"error"`: Emits when a single request fails
- `"complete"`: Emits when the moocher is done mooching.## Example
```js
const mooch = new Moocher([
'https://url-1.com',
'http://url-2.com',
'http://url-3.com',
'https://url-4.com',
'http://url-5.com'
], {
limit: 2 // allow only 2 concurrent requests
});mooch
// emitted for each web page mooched
.on('mooch', ($, url) => {
const $h1 = $('h1');
titles.push($h1.text());
})
// emitted if any request fails
.on('error', (err) => console.error(err))
// emitted when all urls have been mooched
.on('complete', () => {
console.log(`All titles have been mooched: ${titles.join(', ')}`);
})
// start mooching!
.start();
```