https://github.com/siristechnology/news-crawler


# news-crawler

A config-based news crawler built on Google's Puppeteer.

- Uses `puppeteer-extra-plugin-adblocker` to block ads
- Uses `puppeteer-extra-plugin-stealth` to reduce the chance of bot detection
- Uses `html-to-text` to convert HTML to plain text
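
To illustrate how the three libraries listed above typically fit together, here is a minimal sketch. It is not taken from the package's source: the function names `launchBrowser` and `fetchPageText` are hypothetical, and the sketch assumes `puppeteer`, `puppeteer-extra`, both plugins, and `html-to-text` (version 6 or later, for `convert`) are installed.

```javascript
// Hypothetical helpers; the package's real internals may differ.

// Registers the plugins once and launches a browser.
async function launchBrowser(headless = true) {
  // Dependencies are loaded here so that defining these helpers
  // does not require the packages to be present.
  const puppeteer = require('puppeteer-extra');
  const StealthPlugin = require('puppeteer-extra-plugin-stealth');
  const AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');

  puppeteer.use(StealthPlugin()); // masks common headless-browser fingerprints
  puppeteer.use(AdblockerPlugin({ blockTrackers: true })); // blocks ads and trackers
  return puppeteer.launch({ headless });
}

// Loads a URL in a new tab and returns the page as plain text.
async function fetchPageText(browser, url) {
  const { convert } = require('html-to-text');
  const page = await browser.newPage();
  try {
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    const html = await page.content();
    return convert(html, { wordwrap: false }); // no hard line wrapping
  } finally {
    await page.close();
  }
}
```

A caller would launch the browser once, call `fetchPageText` for each URL, and close the browser with `browser.close()` when done.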

## Install

    yarn add news-crawler

## Sample code

    const articles = await NewsCrawler(sourceConfig, { maxArticlesPerPage: 1, headless: false })
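
Because the call above uses `await`, it has to run inside an `async` function (or an ES module). The sketch below shows one way to wrap it with default options and basic error handling. `crawlSafely` is a hypothetical helper, not part of the package; the crawler function is passed in as an argument because how `NewsCrawler` is exported is not shown here.

```javascript
// Hypothetical wrapper around a NewsCrawler-style function.
// `crawl` is expected to have the shape (sourceConfig, options) => Promise<articles>.
async function crawlSafely(crawl, sourceConfig, options = {}) {
  // Option names come from the sample call; the default values are assumptions.
  const defaults = { maxArticlesPerPage: 1, headless: true };
  try {
    const articles = await crawl(sourceConfig, { ...defaults, ...options });
    // Normalize anything unexpected to an empty list.
    return Array.isArray(articles) ? articles : [];
  } catch (err) {
    console.error(`Crawl failed: ${err.message}`);
    return [];
  }
}
```

Assuming the package's main export is the crawler function, usage could look like `const articles = await crawlSafely(require('news-crawler'), sourceConfig, { headless: false })`.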

## Sample News Source Config

```
[
  {
    "name": "ekantipur",
    "pages": [
      {
        "url": "https://ekantipur.com",
        "category": "headlines",
        "linkSelector": "article.normal > h1 > a"
      }
    ],
    "article-detail-selectors": {
      "title": "main > article > header > h1",
      "excerpt": "article .text-wrap > h2",
      "leadImage": "#wrapper main article header figure img",
      "content": [
        "main article div.text-wrap p.description"
      ],
      "tags": "",
      "likes-count": "main > article > header div.total.shareTotal"
    }
  }
]
```
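
A selector typo in a config like the one above is easy to miss until a crawl returns nothing, so it can help to check the config's shape before crawling. The following is a minimal sketch of such a check. `validateSourceConfig` is a hypothetical helper, and the set of keys it treats as required is an assumption drawn from the sample, not from the package's documentation.

```javascript
// Keys assumed to be required, based on the sample config only.
const PAGE_KEYS = ['url', 'category', 'linkSelector'];
const DETAIL_KEYS = ['title', 'content'];

// Returns a list of human-readable problems; an empty list means the shape looks fine.
function validateSourceConfig(sources) {
  if (!Array.isArray(sources)) return ['config must be an array of sources'];
  const errors = [];
  sources.forEach((source, i) => {
    if (source === null || typeof source !== 'object') {
      errors.push(`source[${i}]: must be an object`);
      return;
    }
    const label = source.name || `source[${i}]`;
    if (!source.name) errors.push(`${label}: missing "name"`);

    const pages = Array.isArray(source.pages) ? source.pages : [];
    if (pages.length === 0) errors.push(`${label}: "pages" must be a non-empty array`);
    pages.forEach((page, j) => {
      for (const key of PAGE_KEYS) {
        if (!page || !page[key]) errors.push(`${label}: pages[${j}] is missing "${key}"`);
      }
    });

    const selectors = source['article-detail-selectors'] || {};
    for (const key of DETAIL_KEYS) {
      const value = selectors[key];
      // "content" is an array of selectors in the sample; other keys are strings.
      const empty = Array.isArray(value) ? value.length === 0 : !value;
      if (empty) errors.push(`${label}: article-detail-selectors is missing "${key}"`);
    }
  });
  return errors;
}
```

Calling `validateSourceConfig(sourceConfig)` before starting a crawl and logging any returned messages gives a quick sanity check without launching a browser.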

## Sample News Output JSON

```
[
  {
    source: 'ekantipur',
    category: 'sports',
    url: 'https://ekantipur.com/sports/2020/06/11/159183662731487753.html',
    title: 'बायर्न जर्मनकप फाइनलमा',
    leadImage: 'https://assets-cdn-usae.kantipurdaily.com/uploads/source/news/kantipur/2020/third-party/bayern-1162020024916-1000x0.jpg',
    content: 'म्युनिख — बायर्न म्युनिखले कप डबलको उपलब्धि जीवन्त राख्न बुधबार राति आइनट्राख्ट फ्रान्कफर्टलाई २–१ ले हरायो र जर्मनकपको फाइनल'
  }
]
```
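
Since the output is a flat array of article objects, ordinary array processing applies. As a small sketch, the hypothetical helpers below drop articles with a repeated `url` and group the rest by `category`. They assume only the field names shown in the sample output.

```javascript
// Keeps the first article seen for each URL.
function dedupeByUrl(articles) {
  const byUrl = new Map();
  for (const article of articles) {
    if (!byUrl.has(article.url)) byUrl.set(article.url, article);
  }
  return [...byUrl.values()];
}

// Groups articles into { [category]: article[] }.
function groupByCategory(articles) {
  const groups = {};
  for (const article of articles) {
    const key = article.category || 'uncategorized';
    if (!groups[key]) groups[key] = [];
    groups[key].push(article);
  }
  return groups;
}
```

For example, `groupByCategory(dedupeByUrl(articles))` yields one list per category, which is convenient when the same story is linked from several pages.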