{"id":15590521,"url":"https://github.com/cdimascio/essence","last_synced_at":"2025-04-14T19:44:52.287Z","repository":{"id":37263367,"uuid":"158731101","full_name":"cdimascio/essence","owner":"cdimascio","description":"Automatically extract the main text content (and more) from an HTML document","archived":false,"fork":false,"pushed_at":"2022-09-01T22:59:29.000Z","size":2023,"stargazers_count":117,"open_issues_count":8,"forks_count":16,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-03-28T08:11:13.334Z","etag":null,"topics":["extractor","hacktoberfest","html-extractor","scraper","web-content-extractor","webpage-extractor","website-extractor"],"latest_commit_sha":null,"homepage":"","language":"Kotlin","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cdimascio.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-11-22T17:16:46.000Z","updated_at":"2025-02-19T15:37:25.000Z","dependencies_parsed_at":"2022-08-02T23:46:17.308Z","dependency_job_id":null,"html_url":"https://github.com/cdimascio/essence","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cdimascio%2Fessence","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cdimascio%2Fessence/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cdimascio%2Fessence/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cdimascio%2Fessence/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cdimascio","download_url":"https://codelo
ad.github.com/cdimascio/essence/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248950133,"owners_count":21188211,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["extractor","hacktoberfest","html-extractor","scraper","web-content-extractor","webpage-extractor","website-extractor"],"created_at":"2024-10-02T23:22:28.566Z","updated_at":"2025-04-14T19:44:52.245Z","avatar_url":"https://github.com/cdimascio.png","language":"Kotlin","funding_links":["https://www.buymeacoffee.com/m97tA5c"],"categories":[],"sub_categories":[],"readme":"# essence\n\n![](https://travis-ci.org/cdimascio/essence.svg?branch=master) [![Maven Central](https://img.shields.io/maven-central/v/io.github.cdimascio/essence.svg?label=Maven%20Central)](https://search.maven.org/search?q=g:%22io.github.cdimascio%22%20AND%20a:%22essence%22) ![](https://camo.githubusercontent.com/208c24da54eea1ae12f8abed5dcc6b84b6ce8440/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6c6963656e73652d417061636865253230322e302d626c75652e737667) \u003c!-- ALL-CONTRIBUTORS-BADGE:START - Do not remove or modify this section --\u003e\n[![All Contributors](https://img.shields.io/badge/all_contributors-2-orange.svg?style=flat-square)](#contributors-)\n\u003c!-- ALL-CONTRIBUTORS-BADGE:END --\u003e\n\nAn automatic web page content extractor for _Kotlin_ and _Java_.\n\nGiven an HTML document, **essence** automatically extracts the main text content (and much more).\n\n[Try out the demo](https://essence.mybluemix.net) - _a simple webapp to demonstrate essence_.\n\n\u003cp 
align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/cdimascio/essence/master/assets/essence.png\" width=\"400px\"/\u003e\n\u003c/p\u003e\n\n_This library is inspired by [node-unfluff](https://github.com/ageitgey/node-unfluff) and its [lineage](#credits)_\n\n## Usage\n\n**Java**\n\n```Java\nimport io.github.cdimascio.essence.Essence;\n\nEssenceResult data = Essence.extract(html);\nSystem.out.println(data.getText());\n```\n\n**Kotlin**\n\n```Kotlin\nval data = Essence.extract(html)\nprintln(data.text)\n```\n\nSee [Extracted data elements](#extracted-data-elements) for additional extracted metadata.\n\n## Install\n\n**Maven**\n\n```xml\n\u003cdependency\u003e\n  \u003cgroupId\u003eio.github.cdimascio\u003c/groupId\u003e\n  \u003cartifactId\u003eessence\u003c/artifactId\u003e\n  \u003cversion\u003e0.13.0\u003c/version\u003e\n  \u003ctype\u003epom\u003c/type\u003e\n\u003c/dependency\u003e\n```\n\n**Gradle**\n\n```groovy\nimplementation 'io.github.cdimascio:essence:0.13.0'\n```\n\n## Try the Essence web demo\n\n[Essence web](https://essence.mybluemix.net) is a simple web page that fetches content at a given URL and passes the HTML to this essence library.\n\n![](https://raw.githubusercontent.com/cdimascio/essence/master/assets/example.png)\n\nThe essence web project lives [here](https://github.com/cdimascio/essence-web).\n\n## Extracted data elements\n\n**essence** attempts to extract the following content:\n\n- `title` - The document's title\n- `softTitle` - A version of `title` with less truncation\n- `date` - The document's publication date\n- `copyright` - The document's copyright line, if present\n- `author` - The document's author\n- `publisher` - The document's publisher (website name)\n- `text` - The main text of the document with all the junk thrown away\n- `image` - The main image for the document (what's used by Facebook, etc.)\n- *(coming soon...)* `videos` - An array of videos that were embedded in the article. 
Each video has src, width and height.\n- `tags` - Any tags or keywords that could be found by checking \u0026lt;rel\u0026gt; tags or by looking at href URLs.\n- `canonicalLink` - The [canonical URL](https://support.google.com/webmasters/answer/139066?hl=en) of the document, if given.\n- `lang` - The language of the document, either detected or supplied by you.\n- `description` - The description of the document, from \u0026lt;meta\u0026gt; tags.\n- `favicon` - The URL of the document's [favicon](http://en.wikipedia.org/wiki/Favicon).\n- `links` - An array of links embedded within the article text. (text and href for each)\n\n\n## Credits\n- node-unfluff by [ageitgey](https://github.com/ageitgey)\n- python-goose by [Xavier Grangier](https://github.com/grangier)\n- goose by [Gravity Labs](https://github.com/GravityLabs)\n\n## License\n\n[Apache 2.0](LICENSE)\n\n\n\u003ca href=\"https://www.buymeacoffee.com/m97tA5c\" target=\"_blank\"\u003e\u003cimg src=\"https://bmc-cdn.nyc3.digitaloceanspaces.com/BMC-button-images/custom_images/orange_img.png\" alt=\"Buy Me A Coffee\" style=\"height: auto !important;width: auto !important;\" \u003e\u003c/a\u003e\n\n## Contributors ✨\n\nThanks goes to these wonderful people ([emoji key](https://allcontributors.org/docs/en/emoji-key)):\n\n\u003c!-- ALL-CONTRIBUTORS-LIST:START - Do not remove or modify this section --\u003e\n\u003c!-- prettier-ignore-start --\u003e\n\u003c!-- markdownlint-disable --\u003e\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\"\u003e\u003ca href=\"https://cleymax.fr/\"\u003e\u003cimg src=\"https://avatars3.githubusercontent.com/u/24879740?v=4\" width=\"100px;\" alt=\"\"/\u003e\u003cbr /\u003e\u003csub\u003e\u003cb\u003eClément P.\u003c/b\u003e\u003c/sub\u003e\u003c/a\u003e\u003cbr /\u003e\u003ca href=\"https://github.com/cdimascio/essence/commits?author=Cleymax\" title=\"Code\"\u003e💻\u003c/a\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n\u003c!-- markdownlint-enable 
--\u003e\n\u003c!-- prettier-ignore-end --\u003e\n\u003c!-- ALL-CONTRIBUTORS-LIST:END --\u003e\n\nThis project follows the [all-contributors](https://github.com/all-contributors/all-contributors) specification. Contributions of any kind welcome!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcdimascio%2Fessence","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcdimascio%2Fessence","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcdimascio%2Fessence/lists"}