Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/patrickschababerle/schabbi-webscraper
Small and easy-to-use NodeJS webcrawler project. Returns basic information about the crawled sites.
- Host: GitHub
- URL: https://github.com/patrickschababerle/schabbi-webscraper
- Owner: PatrickSchababerle
- License: apache-2.0
- Created: 2021-03-23T16:09:15.000Z (almost 4 years ago)
- Default Branch: master
- Last Pushed: 2024-08-07T13:23:34.000Z (6 months ago)
- Last Synced: 2025-02-08T15:38:31.282Z (2 days ago)
- Topics: crawler, puppeteer, scraper, scraping, web-crawler
- Language: JavaScript
- Homepage:
- Size: 1.05 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
README
[![Build Status](https://travis-ci.com/PatrickSchababerle/schabbi-webscraper.svg?token=x3Xxx6fmnZtByDoY9d4v&branch=master)](https://travis-ci.com/PatrickSchababerle/schabbi-webscraper)
[![npm puppeteer package](https://img.shields.io/npm/v/schabbi-webscraper)](https://npmjs.org/package/schabbi-webscraper)
[![Package Quality](https://packagequality.com/shield/schabbi-webscraper.svg)](https://packagequality.com/#?package=schabbi-webscraper)
![Downloads](https://img.shields.io/npm/dw/schabbi-webscraper)

Lightweight and easy-to-use webcrawler.
## Features
- Fast and reliable
- Supports custom page handling
- Results also include all cookies
- Accepts all Puppeteer parameters

## Requirements
- NodeJS v15.*
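You can check your installed Node.js version with:

```bash
$ node --version
```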
## Installation
### via NPM
```bash
$ npm i schabbi-webscraper
```
### via GitHub
```bash
$ git clone https://github.com/PatrickSchababerle/schabbi-webscraper
$ cd schabbi-webscraper
$ npm install
```
## Usage

#### Standard use case
```js
const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

Crawler.setUrl('https://www.example.com').crawl();
```
#### With custom option parameters
```js
const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

Crawler.setUrl('https://www.example.com').withOptions({
includeExternalLinks : true,
userAgent : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
authentication : {
username : 'Testuser',
password : 'Test'
}
}).crawl();
```
You can decide which crawled links are added to the queue by using the `queue` option. For example, to crawl only pages with a specific attribute, class or target:

```js
const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

Crawler.setUrl('https://www.digitalsterne.de').withOptions({
queue : {
pattern : 'a[href*="/2021/05/06"]'
}
}).crawl();
```
You can also decide whether URL parameters are ignored when adding URLs to the queue:

```js
const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

Crawler.setUrl('https://www.digitalsterne.de').withOptions({
ignoreUrlParameter : true
}).crawl();
```

#### Work with crawled pages while they're being processed
With custom functions you can perform actions on each crawled page. The results will be pushed into the final results.
```js
const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

Crawler.setUrl('https://digitalsterne.de').eachPage(async (page) => {
const links = await page.$$eval('a', as => as.map(a => a.href));
return links;
}).crawl().then((result) => {
console.log(result);
});
```
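The callback receives the Puppeteer page handle (as the `$$eval` call above suggests), so other Puppeteer `Page` methods should work inside it as well; a minimal sketch collecting each page's `<title>` via `page.title()`:

```js
const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

// page.title() is a standard Puppeteer Page method
Crawler.setUrl('https://www.example.com').eachPage(async (page) => {
    return await page.title();
}).crawl().then((result) => {
    console.log(result);
});
```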
#### Further work with the result

Schabbi returns a promise which is resolved as soon as the crawl has finished:
```js
const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

Crawler.setUrl('https://www.example.com').crawl().then((result) => {
console.log(result);
});
```
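Because `crawl()` returns an ordinary promise, errors can be handled with the usual mechanisms; a minimal sketch using `async`/`await` with `try`/`catch`:

```js
const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

(async () => {
    try {
        const result = await Crawler.setUrl('https://www.example.com').crawl();
        console.log(result);
    } catch (err) {
        // e.g. unreachable host or browser launch failure
        console.error('Crawl failed:', err);
    }
})();
```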
## Methods

| Method | Description |
|--|--|
| withOptions( Object ) | Set custom options for the crawler |
| setUrl( String ) | Set initial url |
| crawl() | Start the crawl |
| eachPage( Function ) | Register an async callback that is executed on each crawled page (see the example above) |

## Configuration
| Option | Description | Type |
|--|--|--|
| includeExternalLinks | Determine if Schabbi should output external links in the results | BOOLEAN |
| userAgent | Use a custom User Agent for crawling | STRING |
| browser | Settings for Puppeteer. All Puppeteer browser launch arguments are accepted | OBJECT |
| queue | Set a custom pattern for the evaluation of links inside crawled pages | OBJECT |
| authentication | Credentials for HTTP authentication (`username`, `password`), as used in the example above | OBJECT |
| ignoreUrlParameter | Ignore URL parameters when adding URLs to the queue | BOOLEAN |

Visit the examples for detailed information on how to use the options properly.
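The `browser` option is passed through to Puppeteer; a minimal sketch (the `headless` and `args` keys are standard Puppeteer launch options, used here as an illustration rather than documented Schabbi defaults):

```js
const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

Crawler.setUrl('https://www.example.com').withOptions({
    // Forwarded to Puppeteer's browser launch
    browser : {
        headless : true,
        args : ['--no-sandbox']
    }
}).crawl();
```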
## About this project
This is one of my first projects to be publicly available on GitHub. Please feel free to provide feedback!