https://github.com/ghostdogpr/readability4s
Scala library to extract relevant content from an article HTML
- Host: GitHub
- URL: https://github.com/ghostdogpr/readability4s
- Owner: ghostdogpr
- License: apache-2.0
- Created: 2017-10-06T06:47:47.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2018-06-20T07:57:50.000Z (almost 8 years ago)
- Last Synced: 2025-03-31T18:51:43.653Z (about 1 year ago)
- Topics: article-extracting, readability, scala
- Language: Scala
- Size: 31.3 KB
- Stars: 7
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# readability4s [Build Status](https://travis-ci.org/ghostdogpr/readability4s) [License](http://www.apache.org/licenses/LICENSE-2.0.txt)
A Scala library that extracts content from an article's HTML: title, full text, favicon, image, etc.
This project is a Scala port of Mozilla's [Readability.js](https://github.com/mozilla/readability) with a few tweaks and improvements.
It is built for Scala 2.12.
## Usage
Import the project with Maven as follows:
```xml
<dependency>
  <groupId>com.github.ghostdogpr</groupId>
  <artifactId>readability4s</artifactId>
  <version>1.0.9</version>
</dependency>
```
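For sbt builds, the same coordinates would presumably translate to the line below. This is a sketch that assumes the artifact is published under exactly the Maven coordinates shown above, with no Scala version suffix on the artifact name (hence a single `%`); that is not confirmed here.

```scala
// build.sbt: assumes the artifact ID carries no Scala version suffix
libraryDependencies += "com.github.ghostdogpr" % "readability4s" % "1.0.9"
```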
To parse a document, create a new `Readability` object from a URI string and an HTML string, then call `parse()`. Here's an example:
```scala
val article = Readability(url, htmlString).parse()
```
`parse()` returns an `Option[Article]`: either `None` when the article could not be processed, or an `Article` with the following properties:
* `uri`: original URI string that was passed to the constructor
* `title`: article title
* `byline`: author metadata
* `content`: HTML string of processed article content
* `textContent`: text of processed article content
* `length`: length of article, in characters
* `excerpt`: article description, or short excerpt from content
* `faviconUrl`: URL of the favicon image
* `imageUrl`: URL of an image representing the article
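
Since the result is an `Option`, callers need to handle both cases. The following is a minimal sketch of one way to do that with pattern matching. To keep it self-contained, it defines a stand-in `Article` case class whose field names follow the list above; the field types (`String`, `Int`) and the `ArticleSummary.summarize` helper are assumptions made for illustration, not part of the library.

```scala
// Stand-in for the library's Article type, used only to keep this sketch
// self-contained. Field names follow the property list above; the field
// types are assumptions, not taken from the library source.
final case class Article(
  uri: String,
  title: String,
  byline: String,
  content: String,
  textContent: String,
  length: Int,
  excerpt: String,
  faviconUrl: String,
  imageUrl: String
)

object ArticleSummary {
  // Hypothetical helper: turns the Option[Article] returned by parse()
  // into a one-line summary, covering the failure case explicitly.
  def summarize(result: Option[Article]): String = result match {
    case Some(a) => s"${a.title} (${a.length} characters): ${a.excerpt}"
    case None    => "Article could not be processed"
  }
}
```

With the real library, the same pattern match can be applied directly to the value returned by `Readability(url, htmlString).parse()`, provided the actual field types are compatible with the ones assumed here.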