https://github.com/unistudents/saffron

A fairly intuitive & powerful framework that enables you to collect & save articles and news from all over the web.
https://github.com/unistudents/saffron

aggregator announcements api-scraper articles crawler crawler-framework dynamic-scraping html-scraping javascript news parser rss rss-aggregator rss-feed rss-parser saffron scraping typescript wordpress-api

Last synced: 8 days ago
JSON representation

A fairly intuitive & powerful framework that enables you to collect & save articles and news from all over the web.

Host: GitHub
URL: https://github.com/unistudents/saffron
Owner: UniStudents
License: mit
Created: 2021-03-28T09:43:30.000Z (almost 5 years ago)
Default Branch: main
Last Pushed: 2024-09-14T09:08:17.000Z (over 1 year ago)
Last Synced: 2025-10-03T01:27:20.201Z (5 months ago)
Topics: aggregator, announcements, api-scraper, articles, crawler, crawler-framework, dynamic-scraping, html-scraping, javascript, news, parser, rss, rss-aggregator, rss-feed, rss-parser, saffron, scraping, typescript, wordpress-api
Language: HTML
Homepage:
Size: 902 KB
Stars: 11
Watchers: 3
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          # Saffron | News & announcements aggregation framework.

## Table of Contents

- [What is Saffron?](#what-is-saffron)

- [Architecture](#architecture)

- [Installation](#installation)

- [Initialization](#initialization)

- [Configuration](#configuration)

- [Parsers](#parsers)

    - [WordPress V2](#wordpress-v2)

    - [RSS](#rss)

    - [HTML](#html)

    - [JSON / XML](#json--xml)

    - [Dynamic](#dynamic)

    - [Which to choose](#which-to-choose)

- [Article](#article)

- [Source files](#source-files)

    - [What is a source file?](#what-is-a-source-file)

    - [Creating a source file](#creating-a-source-file)

- [Middleware](#middleware)

    - [Register a middleware](#register-a-middleware)

    - [Format article](#format-article)

    - [Articles](#articles)

- [Listeners](#listeners)

- [Standalone](#standalone)

## What is Saffron?

Saffron stands for **S**imple **A**bstract **F**ramework **F**or the **R**etrieval **O**f **N**ews

As said saffron is a framework. It is an abstraction engine that helps you collect news and

announcements from websites in a uniform way.

It supports different ways of data collection, such as API endpoints and web-scraping.

It tries to ease the process of integrating all data sources, by abstracting data collection into a few simple

and powerful functions.

## Architecture

Saffron's architecture is based on a `main` node that issues scraping instructions and several `worker` nodes

that do the scraping & upload the data to the database.

The communication between the nodes is happening through the `Grid`. The grid will generate events to communicate

with other classes. Saffron supports remote nodes by using [`socket.io`](https://socket.io) server and clients

as a middleware to connect to the `main` node.

## Installation

To install the latest release:

```shell

npm install @unistudents/saffron

```

To install a specific version:

```shell

npm install @unistudents/saffron@version

```

## Initialization

Once you have installed the library and created your [configuration](./docs/configuration.md):

```ts

import Saffron from "@unistudents/saffron";

const saffron = new Saffron();

// Initialize saffron

saffron.initialize({/* configuration */});

// Start sheduler and workers.

saffron.start();

```

## Configuration

Read the [configuration](./docs/configuration.md) file for more information.

## Parsers

To retrieve the desired information from the websites we use parsers.

There are four available parser types: `wordpress`, `rss`, `html`, `api` and `dynamic`.

### WordPress V2

Parser type: `wordpress-v2`

By default, [`WordPress`](https://wordpress.com/) based websites has an open API for news retrieval.

We make use of that to get access on the articles and categories of the website.

To quickly check if a website supports the WordPress API simply open your browser and

type `/wp-json/wp/v2/posts/`.

If a valid JSON file is displayed on the browser (or downloaded on your computer) which contains the website's articles,

then you can safely use the `wordpress` parser.

### RSS

Parser type: `rss`

Many websites support [`RSS`](https://en.wikipedia.org/wiki/RSS) feed. RSS allows users and applications to access updates

to websites in a standardized, computer-readable format. You can check if a website supports RSS if you can see this

icon .

### JSON / XML

Parser type: `json` (or `xml`)

This parser is best to be used when it comes to pages that are loading data using API requests (e.g. lazy loading).

The only prerequisite for this parser is that the response of the API requests is in a structured JSON or XML format.

### HTML

Parser type: `html`

This parser uses scrapping tools like [CheerioJS](https://cheerio.js.org/) to scrape the website content and receive

the displayed news. This parser is best to be used when the HTML in the website is structured. Websites where the HTML

and CSS are not structured will be very difficult to scrape.

### Dynamic

Parser type: `dynamic`

Unlike the other parsers, this parser uses javascript/typescript code to parse a website. All the logic for the scraping is

decided by the user by extending the class `DynamicSourceFile`.

### Which to choose

We recommend a specific order for using the available parsers.

* If the desired website is based an [`WordPress`](https://wordpress.com/) and the WordPress articles API is enabled, then choose the `wordpress-v2` parser.

* If the desired website supports [`RSS`](https://en.wikipedia.org/wiki/RSS) feed. then choose the `rss` parser.

* If the desired website is loading data using API requests with structured responses (e.g. lazy loading), then choose the `json` or `xml` parser.

* If the desired website has a structured form, the use the `html` parser.

* If none of the above is possible (bad html or custom API) then the `dynamic` parser is our last choice.

## Article

We have created a universal format for the parsed news, and we named it `Article`.

Read the [article](./docs/article.md) file for more information.

## Source files

### What is a source file?

A source file is a `json` or `javascript` file that represents a website.

These files are generated from the user and guide Saffron on how to parse a website.

### Creating a source file

Read the [source](./docs/source_files/source_file.md) file for the common options or the parsers files

[WordPress V2](./docs/source_files/wordpress_v2.md), [RSS](./docs/source_files/rss.md), [API](./docs/source_files/json.md), [HTML](./docs/source_files/html.md) or [Dynamic](./docs/source_files/dynamic.md) for the scrape options.

## Middleware

A middleware is a function that gets executed before the articles are passed to `newArticles` function.

Middleware functions can be useful for logging, article formatting or sorting.

The order where the middleware are executed is the order where they were reistered.

Each middleware function can be called more than once.

### Register a middleware

```typescript

saffron.use("name", (...args: any) => {

    //...

});

```

### Format article

For changing the contents of the articles.

It gets as parameter every article that was found from the parsers and must return the same object when it changed.

```javascript

saffron.use("article.format", (article: Article) => {

    // If possible set pubDate with milliseconds.

    let ms = new Date(article.pubDate).getTime();

    if (!isNaN(ms)) article.pubDate = ms;

    // Append source name before title for every article

    article.title = `[${article.getSource(saffron).name}] ${article.title}`;

    // Return the changed article.

    return article;

});

```

You can also access the source class of the article by calling `article.getSource()`.

Note that any changes made on the source class will also affect the saved source.

### Articles

This middleware can be used to edit the articles in bulk. You can sort or filter them as you want.

The only requirement is to return an array (empty or not) of articles.

```js

saffron.use("articles", (articles: Article[]) => {

    sort(articles);

    return articles.filter(

        (article) => article.title != null && article.title !== ""

    );

});

```

## Listeners

Saffron supports listeners for various event. Listeners can be used for logging or creating analytics.

Read the [listeners](./docs/listeners.md) file for more information.

## Standalone

Saffron supports immediate parsing using the static function `parse`.

```ts

import {Saffron} from "@unistudents/saffron";

try {

    const result = Saffron.parse({

        name: "source-name",

        url: ["Category 1", "https://example.com"],

        type: "html",

        // ...

        scrape: {

            // ...

        },

    }, null); // or pass a config

    console.log("Result:", result);

} catch (e) {

    console.log("Encountered an error during parsing:", e);

}

```

The result of the `parse` function is an array of objects for each url passed in the source file:

```ts

[

    {

        url: "https://example.com",

        aliases: ["Category 1"],

        articles: [/*Article*/, /*Article*/, /*Article*/, /*...*/]

    },

];

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/unistudents/saffron

Awesome Lists containing this project

README