https://github.com/ghostdogpr/readability4s

Scala library to extract relevant content from an article HTML

Host: GitHub
URL: https://github.com/ghostdogpr/readability4s
Owner: ghostdogpr
License: apache-2.0
Created: 2017-10-06T06:47:47.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2018-06-20T07:57:50.000Z (over 6 years ago)
Last Synced: 2024-10-04T19:07:06.330Z (3 months ago)
Topics: article-extracting, readability, scala
Language: Scala
Size: 31.3 KB
Stars: 7
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# readability4s [![Build Status](https://travis-ci.org/ghostdogpr/readability4s.svg?branch=master)](https://travis-ci.org/ghostdogpr/readability4s) [![License](http://img.shields.io/:license-Apache%202-red.svg)](http://www.apache.org/licenses/LICENSE-2.0.txt)
A Scala library that extracts content from an article's HTML: title, full text, favicon, image, etc.

This project is a Scala port of Mozilla's [Readability.js](https://github.com/mozilla/readability) with a few tweaks and improvements.
It is built for Scala 2.12.

## Usage

Import the project with Maven as follows:

```xml
<dependency>
    <groupId>com.github.ghostdogpr</groupId>
    <artifactId>readability4s</artifactId>
    <version>1.0.9</version>
</dependency>
```
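
For sbt builds, the Maven coordinates above would translate to the line below. This is a sketch derived from those coordinates rather than something the project documents. It assumes the artifact name carries no Scala-version suffix, which is why it uses a single `%`.

```scala
// build.sbt
// Assumption: the artifact is published as plain "readability4s" (no _2.12 suffix),
// matching the Maven coordinates, so a single % is used instead of %%.
libraryDependencies += "com.github.ghostdogpr" % "readability4s" % "1.0.9"
```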

To parse a document, you must create a new `Readability` object from a URI string and an HTML string, and then call `parse()`. Here's an example:

```scala
val article = Readability(url, htmlString).parse()
```

`parse()` returns an `Option[Article]`: `None` when the article could not be processed, or an `Article` (wrapped in `Some`) with the following properties:

* `uri`: original URI string that was passed to constructor
* `title`: article title
* `byline`: author metadata
* `content`: HTML string of processed article content
* `textContent`: text of processed article content
* `length`: length of article, in characters
* `excerpt`: article description, or short excerpt from content
* `faviconUrl`: URL of the favicon image
* `imageUrl`: URL of an image representing the article
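
Since the result is an `Option`, callers typically pattern match on it. The following is a minimal sketch of that handling, not the library's own code. The `Article` case class is a stand-in so the example compiles on its own: its field names follow the list above, but the field types are assumptions. `ArticleSummary.summarize` is a hypothetical helper.

```scala
// Stand-in for the library's Article type so this sketch is self-contained.
// Field names follow the property list above; the field types are assumptions.
final case class Article(
  uri: String,
  title: String,
  byline: String,
  content: String,
  textContent: String,
  length: Int,
  excerpt: String,
  faviconUrl: String,
  imageUrl: String
)

object ArticleSummary {
  // Hypothetical helper: turns the result of parse() into a one-line summary.
  def summarize(result: Option[Article]): String = result match {
    case Some(a) => s"${a.title} (${a.length} characters)"
    case None    => "Article could not be processed"
  }
}
```

With the real library on the classpath, the stand-in class would be dropped and the helper called as `ArticleSummary.summarize(Readability(url, htmlString).parse())`.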