https://github.com/xop/news-scraper-core

NewScraper Core
https://github.com/xop/news-scraper-core

Last synced: 2 months ago
JSON representation

NewScraper Core

Host: GitHub
URL: https://github.com/xop/news-scraper-core
Owner: XOP
License: mit
Created: 2016-10-01T15:56:48.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2016-12-29T13:52:44.000Z (over 8 years ago)
Last Synced: 2025-02-07T07:48:54.413Z (3 months ago)
Language: JavaScript
Size: 31.3 KB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

        # NewScraper Core Module

[![npm version](https://badge.fury.io/js/news-scraper-core.svg)](https://badge.fury.io/js/news-scraper-core) [![npm dependencies](https://david-dm.org/stewiekillsloiss/news-scraper-core.svg)](https://david-dm.org/stewiekillsloiss/news-scraper-core)

> The core module for the NewScraper

> [https://github.com/XOP/news-scraper](https://github.com/XOP/news-scraper)

## Goal

**NewScraper Core Module** (NewScraper) is a NodeJS module, that receives specific directives as props and returns scraped pages data.

Both directives' and output' format is `JSON`.

NewScraper is designed to be used as a **middleware** for a server / hybrid / CLI application.

## API

### Config

`limit`      

Number, default: `undefined` (bypass)  

Defines the default common limit; will overwrite directive's [Input -> limit](#input)

`output`  

Object:

```

{ 

    path,

    current

}

```

`output.path`  

String, default: "./"  

Path to the scraped data directory

`output.current`

String, default: "data.json"  

Path to the current data json file (used to filter previously shown news)

`updateStrategy`  

String, default: ""  

Defines logic of the post-processing the scraped data:  

`"scratch"` - ignores previous runs, creates new json file every new scraping round  

`"compare"` - compares scraping results to the previous result, stores in `output.current` file (data.json by default)  

`""` - bypass, no scraping results saved

`scraperOptions`  

Object, default: {}  

Parameters to pass to the currently used scraper.  

Version 1.x - [Nightmare](http://www.nightmarejs.org/), find all options [here](https://github.com/segmentio/nightmare#api).

### Input

Input is the collection of directives in a `JSON` format.

> It is recommended for the application to store directives in a most readable format (e.g. `YAML`) and convert it on the fly to the `JSON`.

Example:

```

[

    {

        "title": "Smashing magazine",

        "url": "http://www.smashingmagazine.com/",

        "elem": "article.post",

        "link": "h2 > a",

        "author": "h2 + ul li.a a",

        "time": "h2 + ul li.rd",

        "image": "figure > a > img",

        "limit": 6

    },

    {...},

    {...}

]

```

`title`  

String  

Name of the resource, **required**

`url`  

String  

Url of the resource, **required**

`elem`  

String  

CSS selector of the news item container element, **required**

`link`  

String  

CSS selector of the link (...) _inside_ of the `elem`

If the `elem` itself _is_ a link, this is not required

`author`  

String  

CSS selector of the author element _inside_ of the `elem`

`time`  

String  

CSS selector of the time element _inside_ of the `elem`

`image`  

String  

CSS selector of the image element _inside_ of the `elem`

This one can be `img` tag or any other - NewScraper will search for `data-src` and `background-image` CSS properties to find proper image data

`limit`  

Number  

How many `elem`-s from the `url` will be scraped, maximum  

See also: [Config -> limit](#config)

### Output

Output includes all Input data  

`pages -> [] -> {...}`

**Plus** the parsed scraping result, ready for the favourite templating engine  

`pages -> [] -> {data -> [] -> {...}}`

**Plus** the unmodified markup from the specified pages  

`pages -> [] -> {data -> [] -> {raw}}`

It also contains some **meta-data**, such as path to the current data file and the exact moment of the scraping start.

Example:

```

{

    "meta": {

        "file": "/Users/[...]/data/1474811135645.json",

        "date": 1474811135645

    },

    "pages": [

        {

            "url": "https://www.smashingmagazine.com",

            "elem": "article.post",

            "link": "h2 > a",

            "author": "h2 + ul li.a a",

            "time": "h2 + ul li.rd",

            "image": "figure > a > img",

            "limit": 6,

            "data": [

                {

                "href": "https://www.smashingmagazine.com/2016/09/interview-with-matan-stauber/",

                "text": "\n\t\t\tAn Interview With Matan Stauber\n\t\t\tStretching The Limits Of What’s Possible\n\t\t",

                "title": "Read 'Stretching The Limits Of What’s Possible'",

                "raw": " [ ... a lot of markup ... ] ",

                "author": "Cosima Mielke",

                "time": "September 23rd, 2016",

                "imageSrc": "https://www.smashingmagazine.com/wp-content/uploads/2016/09/histography-website-small-opt.png"

                },

                {... x5}

            ]

        },

        {...},

        {...}

    ]

```

## Events

:construction: coming up!

## [MIT License](LICENSE)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/xop/news-scraper-core

Awesome Lists containing this project

README