https://github.com/cdimascio/essence

Automatically extract the main text content (and more) from an HTML document
https://github.com/cdimascio/essence

extractor hacktoberfest html-extractor scraper web-content-extractor webpage-extractor website-extractor

Last synced: about 2 months ago
JSON representation

Automatically extract the main text content (and more) from an HTML document

Host: GitHub
URL: https://github.com/cdimascio/essence
Owner: cdimascio
License: apache-2.0
Created: 2018-11-22T17:16:46.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2022-09-01T22:59:29.000Z (almost 3 years ago)
Last Synced: 2025-03-28T08:11:13.334Z (2 months ago)
Topics: extractor, hacktoberfest, html-extractor, scraper, web-content-extractor, webpage-extractor, website-extractor
Language: Kotlin
Homepage:
Size: 1.93 MB
Stars: 117
Watchers: 6
Forks: 16
Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

        # essence

![](https://travis-ci.org/cdimascio/essence.svg?branch=master) [![Maven Central](https://img.shields.io/maven-central/v/io.github.cdimascio/essence.svg?label=Maven%20Central)](https://search.maven.org/search?q=g:%22io.github.cdimascio%22%20AND%20a:%22essence%22) ![](https://camo.githubusercontent.com/208c24da54eea1ae12f8abed5dcc6b84b6ce8440/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6c6963656e73652d417061636865253230322e302d626c75652e737667) 

[![All Contributors](https://img.shields.io/badge/all_contributors-2-orange.svg?style=flat-square)](#contributors-)

An automatic web page content extractor for _Kotlin_ and _Java_.

Given an HTML document, **essence** automatically extracts the main text content (and much more).

[Try out the demo](https://essence.mybluemix.net) - _a simple webapp to demonstrate essence_.



  



_This library is inspired by [node-unfluff](https://github.com/ageitgey/node-unfluff) and its [lineage](#credits)_

## Usage

**Java**

```Java

import io.github.cdimascio.essence.Essence;

EssenceResult data = Essence.extract(html);

System.out.println(data.getText());

```

**Kotlin**

```Kotlin

val data = Essence.extract(html)

println(data.text)

```

See [Extracted data elements](#extracted-data-elements) for additional extracted metadata.

## Install

**Maven**

```xml

  io.github.cdimascio

  essence

  0.13.0

  pom

```

**Gradle**

```groovy

compile 'io.github.cdimascio:essence:0.13.0'

```

## Try the Essence web demo

[Essence web](https://essence.mybluemix.net) is a simple web page that fetches content at a given url and passes the HTML to this essence library.

![](https://raw.githubusercontent.com/cdimascio/essence/master/assets/example.png)

The essence web project lives [here](https://github.com/cdimascio/essence-web)

## Extracted data elements

**essence** attempts to extract the following content:

- `title` - The document's title

- `softTitle` - A version of `title` with less truncation

- `date` - The document's publication date

- `copyright` - The document's copyright line, if present

- `author` - The document's author

- `publisher` - The document's publisher (website name)

- `text` - The main text of the document with all the junk thrown away

- `image` - The main image for the document (what's used by facebook, etc.)

- *(coming soon...)*`videos` - An array of videos that were embedded in the article. Each video has src, width and height.

- `tags`- Any tags or keywords that could be found by checking <rel> tags or by looking at href urls.

- `canonicalLink` - The [canonical url](https://support.google.com/webmasters/answer/139066?hl=en) of the document, if given.

- `lang` - The language of the document, either detected or supplied by you.

- `description` - The description of the document, from <meta> tags

- `favicon` - The url of the document's [favicon](http://en.wikipedia.org/wiki/Favicon).

- `links` - An array of links embedded within the article text. (text and href for each)

## Credits

- node-unfluff by [https://github.com/ageitgey](ageitgey)

- python-goose by [Xavier Grangier](https://github.com/grangier)

- goose by [Gravity Labs](https://github.com/GravityLabs)

## License

[Apache 2.0](LICENSE)



## Contributors ✨

Thanks goes to these wonderful people ([emoji key](https://allcontributors.org/docs/en/emoji-key)):

  

    
_{Clément P.}
💻

  

This project follows the [all-contributors](https://github.com/all-contributors/all-contributors) specification. Contributions of any kind welcome!

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/cdimascio/essence

Awesome Lists containing this project

README