Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/patrickschababerle/schabbi-webscraper
Small and easy-to-use NodeJS webcrawler project. Returns basic information about the crawled sites.
- Host: GitHub
- URL: https://github.com/patrickschababerle/schabbi-webscraper
- Owner: PatrickSchababerle
- License: apache-2.0
- Created: 2021-03-23T16:09:15.000Z (almost 4 years ago)
- Default Branch: master
- Last Pushed: 2024-08-07T13:23:34.000Z (6 months ago)
- Last Synced: 2025-02-08T15:38:31.282Z (2 days ago)
- Topics: crawler, puppeteer, scraper, scraping, web-crawler
- Language: JavaScript
- Homepage:
- Size: 1.05 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
README
[![Build Status](https://travis-ci.com/PatrickSchababerle/schabbi-webscraper.svg?token=x3Xxx6fmnZtByDoY9d4v&branch=master)](https://travis-ci.com/PatrickSchababerle/schabbi-webscraper)
[![npm puppeteer package](https://img.shields.io/npm/v/schabbi-webscraper)](https://npmjs.org/package/schabbi-webscraper)
[![Package Quality](https://packagequality.com/shield/schabbi-webscraper.svg)](https://packagequality.com/#?package=schabbi-webscraper)
![Downloads](https://img.shields.io/npm/dw/schabbi-webscraper)

Lightweight and easy-to-use webcrawler.
## Features
- Fast and reliable
- Supports custom page handling
- Results also include all cookies
- Accepts all Puppeteer parameters

## Requirements
- NodeJS v15.*
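You can check your installed Node.js version with:

```bash
$ node --version
```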
## Installation
### via NPM
```bash
$ npm i schabbi-webscraper
```
### via GitHub
```bash
$ git clone https://github.com/PatrickSchababerle/schabbi-webscraper
$ cd schabbi-webscraper
$ npm install
```
## Usage

#### Standard use case
```js
const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

Crawler.setUrl('https://www.example.com').crawl();
```
#### With custom option parameters
```js
const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

Crawler.setUrl('https://www.example.com').withOptions({
includeExternalLinks : true,
userAgent : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
authentication : {
username : 'Testuser',
password : 'Test'
}
}).crawl();
```
You can decide which crawled links are added to the queue by using the `queue` option. For example, to crawl only pages with a specific attribute, class or target:

```js
const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

Crawler.setUrl('https://www.digitalsterne.de').withOptions({
queue : {
pattern : 'a[href*="/2021/05/06"]'
}
}).crawl();
```
You can also decide whether URL parameters are ignored when adding URLs to the queue:

```js
const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

Crawler.setUrl('https://www.digitalsterne.de').withOptions({
ignoreUrlParameter : true
}).crawl();
```

#### Work with crawled pages while they're being processed
With custom functions you can perform actions on each crawled page. The results will be pushed into the final results.
```js
const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

Crawler.setUrl('https://digitalsterne.de').eachPage(async (page) => {
const links = await page.$$eval('a', as => as.map(a => a.href));
return links;
}).crawl().then((result) => {
console.log(result);
});
```
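The callback receives the Puppeteer page handle (as the `$$eval` call above suggests), so other Puppeteer `Page` methods should work inside it as well; a minimal sketch collecting each page's `<title>` via `page.title()`:

```js
const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

// page.title() is a standard Puppeteer Page method
Crawler.setUrl('https://www.example.com').eachPage(async (page) => {
    return await page.title();
}).crawl().then((result) => {
    console.log(result);
});
```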
#### Further work with the result

Schabbi returns a promise which is resolved as soon as the crawl has finished:
```js
const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

Crawler.setUrl('https://www.example.com').crawl().then((result) => {
console.log(result);
});
```
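Because `crawl()` returns an ordinary promise, errors can be handled with the usual mechanisms; a minimal sketch using `async`/`await` with `try`/`catch`:

```js
const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

(async () => {
    try {
        const result = await Crawler.setUrl('https://www.example.com').crawl();
        console.log(result);
    } catch (err) {
        // e.g. unreachable host or browser launch failure
        console.error('Crawl failed:', err);
    }
})();
```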
## Methods

| Method | Description |
|--|--|
| withOptions( Object ) | Set custom options for the crawler |
| setUrl( String ) | Set initial url |
| crawl() | Start the crawl |
| eachPage( Function ) | Register an async callback that is executed on each crawled page (see the example above) |

## Configuration
| Option | Description | Type |
|--|--|--|
| includeExternalLinks | Determine if Schabbi should output external links in the results | BOOLEAN |
| userAgent | Use a custom User Agent for crawling | STRING |
| browser | Settings for Puppeteer. All Puppeteer browser launch arguments are accepted | OBJECT |
| queue | Set a custom pattern for the evaluation of links inside crawled pages | OBJECT |
| authentication | Credentials for HTTP authentication (`username`, `password`), as used in the example above | OBJECT |
| ignoreUrlParameter | Ignore URL parameters when adding URLs to the queue | BOOLEAN |

Visit the examples for detailed information on how to use the options properly.
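The `browser` option is passed through to Puppeteer; a minimal sketch (the `headless` and `args` keys are standard Puppeteer launch options, used here as an illustration rather than documented Schabbi defaults):

```js
const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

Crawler.setUrl('https://www.example.com').withOptions({
    // Forwarded to Puppeteer's browser launch
    browser : {
        headless : true,
        args : ['--no-sandbox']
    }
}).crawl();
```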
## About this project
This is one of my first projects to be publicly available on GitHub. Please feel free to provide feedback!