Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/schne324/moocher

Web content scraper
Host: GitHub
URL: https://github.com/schne324/moocher
Owner: schne324
License: mit
Created: 2017-02-07T03:57:59.000Z (almost 8 years ago)
Default Branch: master
Last Pushed: 2018-08-20T00:03:40.000Z (over 6 years ago)
Last Synced: 2024-10-04T00:25:03.376Z (3 months ago)
Topics: scraper, scraping, scraping-websites, web-scraper, web-scraping
Language: JavaScript
Homepage:
Size: 46.9 KB
Stars: 1
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Moocher

> Web content scraper

[![CircleCI](https://circleci.com/gh/schne324/moocher.svg?style=svg)](https://circleci.com/gh/schne324/moocher)

## Installation

```bash
$ npm install --save moocher # or yarn add moocher
```

## Usage

```js
new Moocher(urls, options);
```

- `urls` {String|Array} A single URL string or an array of URLs to scrape content from.
- `options` {Object} (optional) The configuration object.
  - `limit` {Number} (optional) The maximum number of concurrent requests to make while scraping. Defaults to `undefined`, which enforces no concurrency limit (all requests run in parallel).
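
To illustrate what a concurrency `limit` means in practice, here is a minimal sketch of a worker-pool limiter. It is not Moocher's implementation; `runWithLimit` and its `tasks` argument (an array of functions that each return a promise) are hypothetical names used only for this example.

```javascript
// Hypothetical helper, not part of Moocher: runs async tasks with at most
// `limit` in flight. An undefined (or non-positive) limit starts every task
// at once, mirroring the "no limit" default described above.
async function runWithLimit(tasks, limit) {
  const max = limit > 0 ? limit : tasks.length;
  const results = new Array(tasks.length);
  let next = 0;

  // Each worker claims the next unstarted task until none remain.
  async function worker() {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  const workers = [];
  for (let i = 0; i < Math.min(max, tasks.length); i++) {
    workers.push(worker());
  }
  await Promise.all(workers);
  return results; // results keep the order of the input tasks
}
```

With `limit: 2` and five tasks, two workers start immediately and each picks up a new task as soon as its current one settles, so no more than two tasks are ever pending at the same time.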

## API

Moocher emits the following events:

- `"mooch"`: Emitted for each response. The callback receives the following arguments:
  - `$`: The cheerio-loaded document, so you can use jQuery-style methods on the response document.
  - `url`: The original URL passed to Moocher.
  - `response`: The full response object.
- `"error"`: Emitted when a single request fails.
- `"complete"`: Emitted when the moocher is done mooching.
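
The event contract above can be wrapped in a promise when you would rather `await` the whole run. The sketch below is an assumption-laden illustration, not part of Moocher: `collect`, `fake`, and `fake$` are hypothetical names, a plain Node `EventEmitter` stands in for a Moocher instance so the code runs without network access, and it assumes `"complete"` still fires after a failed request, which the list above does not state.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper, not part of Moocher: gathers results from any emitter
// that follows the "mooch" / "error" / "complete" contract described above.
function collect(emitter, extract) {
  return new Promise((resolve) => {
    const results = [];
    const errors = [];
    emitter
      .on('mooch', ($, url, response) => {
        results.push({ url, value: extract($, response) });
      })
      .on('error', (err) => errors.push(err))
      .on('complete', () => resolve({ results, errors }));
  });
}

// Stand-in emitter and a fake `$` so the sketch runs offline. With the real
// library you would pass a Moocher instance instead and then call .start().
const fake = new EventEmitter();
const fake$ = (selector) => ({ text: () => `text of ${selector}` });

collect(fake, ($) => $('h1').text()).then(({ results, errors }) => {
  console.log(results, `${errors.length} error(s)`);
});

fake.emit('mooch', fake$, 'https://url-1.com', { statusCode: 200 });
fake.emit('error', new Error('request failed'));
fake.emit('complete');
```

Collecting errors instead of rejecting on the first one keeps the results of the requests that did succeed, which suits an event described as firing when a single request fails.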

## Example

```js
// Assumes the package's main export is the Moocher class.
const Moocher = require('moocher');

// Collects the <h1> text of every page mooched.
const titles = [];

const mooch = new Moocher([
  'https://url-1.com',
  'http://url-2.com',
  'http://url-3.com',
  'https://url-4.com',
  'http://url-5.com'
], {
  limit: 2 // allow only 2 concurrent requests
});

mooch
  // emitted for each web page mooched
  .on('mooch', ($, url) => {
    const $h1 = $('h1');
    titles.push($h1.text());
  })
  // emitted if any request fails
  .on('error', (err) => console.error(err))
  // emitted when all urls have been mooched
  .on('complete', () => {
    console.log(`All titles have been mooched: ${titles.join(', ')}`);
  })
  // start mooching!
  .start();
```