https://github.com/ghostdogpr/readability4s
Scala library to extract relevant content from an article HTML
- Host: GitHub
- URL: https://github.com/ghostdogpr/readability4s
- Owner: ghostdogpr
- License: apache-2.0
- Created: 2017-10-06T06:47:47.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2018-06-20T07:57:50.000Z (over 6 years ago)
- Last Synced: 2024-10-04T19:07:06.330Z (3 months ago)
- Topics: article-extracting, readability, scala
- Language: Scala
- Size: 31.3 KB
- Stars: 7
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# readability4s [![Build Status](https://travis-ci.org/ghostdogpr/readability4s.svg?branch=master)](https://travis-ci.org/ghostdogpr/readability4s) [![License](http://img.shields.io/:license-Apache%202-red.svg)](http://www.apache.org/licenses/LICENSE-2.0.txt)
A Scala library to extract content from an article's HTML: title, full text, favicon, image, etc.

This project is a Scala port of Mozilla's [Readability.js](https://github.com/mozilla/readability) with a few tweaks and improvements.

Scala version is 2.12.

## Usage
Import the project with Maven as follows:
```xml
<dependency>
  <groupId>com.github.ghostdogpr</groupId>
  <artifactId>readability4s</artifactId>
  <version>1.0.9</version>
</dependency>
```
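
For sbt builds, the same coordinates would presumably translate to the line below. This is an assumption derived from the Maven coordinates above, not something the README states; it uses a single `%` because the artifact ID shown carries no Scala version suffix.

```scala
// build.sbt (assumed sbt equivalent of the Maven dependency above)
libraryDependencies += "com.github.ghostdogpr" % "readability4s" % "1.0.9"
```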
To parse a document, you must create a new `Readability` object from a URI string and an HTML string, and then call `parse()`. Here's an example:
```scala
val article = Readability(url, htmlString).parse()
```

It returns an `Option[Article]`: either `None` when the article could not be processed, or an `Article` with the following properties:

* `uri`: original URI string that was passed to the constructor
* `title`: article title
* `byline`: author metadata
* `content`: HTML string of processed article content
* `textContent`: text of processed article content
* `length`: length of article, in characters
* `excerpt`: article description, or short excerpt from content
* `faviconUrl`: URL of the favicon image
* `imageUrl`: URL of an image representing the article
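
Since `parse()` returns an `Option[Article]`, callers need to handle both cases. The sketch below shows one way to do that with pattern matching. To stay self-contained it defines a hypothetical stand-in `Article` case class with the properties listed above; the field types and the `ArticleSummary.summarize` helper are assumptions for illustration, not part of the library.

```scala
// Hypothetical stand-in mirroring the documented properties.
// Field types are assumptions; the library's own Article may differ.
final case class Article(
  uri: String,
  title: String,
  byline: String,
  content: String,
  textContent: String,
  length: Int,
  excerpt: String,
  faviconUrl: String,
  imageUrl: String
)

object ArticleSummary {
  // Builds a one-line summary, or a fallback message when no article was produced.
  def summarize(result: Option[Article]): String =
    result match {
      case Some(article) =>
        s"${article.title} (${article.length} characters): ${article.excerpt}"
      case None =>
        "Article could not be processed"
    }
}
```

With the real library on the classpath, the stand-in class would be dropped and the helper called as `ArticleSummary.summarize(Readability(url, htmlString).parse())`, assuming the library's `Article` exposes these properties under the same names.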