Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/droidsonroids/jspoon
Annotation based HTML to Java parser + Retrofit converter
- Host: GitHub
- URL: https://github.com/droidsonroids/jspoon
- Owner: DroidsOnRoids
- License: mit
- Created: 2017-07-12T13:27:58.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2024-04-19T06:49:08.000Z (8 months ago)
- Last Synced: 2024-12-21T23:09:36.929Z (7 days ago)
- Topics: html, java, parser
- Language: Java
- Homepage: https://www.thedroidsonroids.com/blog/scraping-web-pages-with-retrofit-jspoon-library
- Size: 345 KB
- Stars: 323
- Watchers: 16
- Forks: 23
- Open Issues: 10
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
README
[![Maven Central](https://maven-badges.herokuapp.com/maven-central/pl.droidsonroids/jspoon/badge.svg?style=flat)](https://maven-badges.herokuapp.com/maven-central/pl.droidsonroids/jspoon)
[![Javadocs](https://javadoc.io/badge/pl.droidsonroids/jspoon.svg?color=blue)](https://javadoc.io/doc/pl.droidsonroids/jspoon)
# jspoon
jspoon is a Java library that parses HTML into Java objects based on CSS selectors. It uses [jsoup][jsoup] underneath as the HTML parser.
## Installation
Insert the following dependency into your project's `build.gradle` file:
```gradle
dependencies {
    implementation 'pl.droidsonroids:jspoon:1.3.2'
}
```
## Usage
jspoon works on any class with a default constructor. To use it, annotate fields with the `@Selector` annotation and set a CSS selector as the annotation's value:
```java
class Page {
    @Selector("#title") String title;
    @Selector("li.a") List<Integer> intList;
    @Selector(value = "#image1", attr = "src") String imageSource;
}
```
Then you can create a `HtmlAdapter` and use it to build objects:
```java
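// Sample markup matching the selectors declared in Page
String htmlContent = "<div>"
    + "<p id='title'>Title</p>"
    + "<ul>"
    + "<li class='a'>1</li>"
    + "<li>2</li>"
    + "<li class='a'>3</li>"
    + "</ul>"
    + "<img id='image1' src='image.bmp' />"
    + "</div>";
Jspoon jspoon = Jspoon.create();
HtmlAdapter<Page> htmlAdapter = jspoon.adapter(Page.class);
Page page = htmlAdapter.fromHtml(htmlContent);
// title = "Title"; intList = [1, 3]; imageSource = "image.bmp"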
```
jspoon looks for the first occurrence of the selector in the HTML and sets its value on the field.
### Supported types
`@Selector` can be applied to any field of the following types (or their primitive equivalents):
* `String`
* `Boolean`
* `Integer`
* `Long`
* `Float`
* `Double`
* `Date`
* `BigDecimal`
* Jsoup's `Element`
* Any class with default constructor
* `List` (or its superclass/superinterface) of a supported type

`@Selector` can also be applied to a class, in which case you don't need to annotate every field inside it.
### Attributes
By default, the element's `textContent` value is used for Strings, Dates and numbers. To read an attribute instead, set the `attr` parameter in the `@Selector` annotation. You can also use `"html"` (or `"innerHtml"`) and `"outerHtml"` as the value of `attr`.
### Formatting and regex
A regex can be set by passing the `regex` parameter to the `@Selector` annotation. Example:
```java
class Page {
    @Selector(value = "#numbers", regex = "([a-z]+),") String matchedNumber;
}
```
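To make the effect of the `regex` parameter concrete, here is a minimal sketch in plain `java.util.regex`. It assumes jspoon keeps the first capture group of the first match, which agrees with the `matchedNumber = "three"` result for the text `ONE, TwO, three,`. The `RegexSketch` class and `firstGroup` helper are hypothetical names for illustration and are not part of jspoon.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexSketch {
    // Hypothetical helper (not jspoon API): returns the first capture group
    // of the first match, or null when the regex does not match at all.
    // The regex must contain at least one capture group.
    static String firstGroup(String text, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // "ONE," has no lowercase letters and the "w" in "TwO," is not
        // directly followed by a comma, so the first match is "three,".
        System.out.println(firstGroup("ONE, TwO, three,", "([a-z]+),")); // prints "three"
    }
}
```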
The date format can be set by passing the `value` parameter to the `@Format` annotation. Example:
```java
class Page {
    @Format(value = "HH:mm:ss dd.MM.yyyy")
    @Selector(value = "#date") Date date;
}
```
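The pattern string follows the usual Java date pattern letters. As a rough illustration of how `HH:mm:ss dd.MM.yyyy` reads the text `13:30:12 14.07.2017`, the sketch below parses it with `SimpleDateFormat`. That jspoon delegates to `SimpleDateFormat` internally is an assumption here, and `DateFormatSketch` and its `parse` helper are hypothetical names.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

class DateFormatSketch {
    // Hypothetical helper (not jspoon API): parses text with the given pattern and locale.
    static Date parse(String text, String pattern, Locale locale) {
        try {
            return new SimpleDateFormat(pattern, locale).parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse '" + text + "' with " + pattern, e);
        }
    }

    public static void main(String[] args) {
        Date date = parse("13:30:12 14.07.2017", "HH:mm:ss dd.MM.yyyy", Locale.ENGLISH);
        // Format it back in ISO-like order to check which fields were read.
        SimpleDateFormat iso = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.ENGLISH);
        System.out.println(iso.format(date)); // prints "2017-07-14 13:30:12"
    }
}
```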
```java
// Sample markup matching the #date and #numbers selectors
String htmlContent = "<span id='date'>13:30:12 14.07.2017</span>"
    + "<span id='numbers'>ONE, TwO, three,</span>";
Jspoon jspoon = Jspoon.create();
HtmlAdapter<Page> htmlAdapter = jspoon.adapter(Page.class);
Page page = htmlAdapter.fromHtml(htmlContent);
// date = Jul 14, 2017 13:30:12; matchedNumber = "three"
```
Java's `Locale` is used for parsing Floats, Doubles and Dates. You can override it by setting the `languageTag` parameter of `@Format`:
```java
@Format(languageTag = "pl")
@Selector(value = "div > p > span") Double pi; //3,14 will be parsed
```
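The reason `3,14` parses under the `pl` language tag is that Polish uses a comma as the decimal separator. The sketch below shows the same effect with the standard `NumberFormat` class. That jspoon uses `NumberFormat` for this is an assumption, and `LocaleNumberSketch` and `parseDouble` are hypothetical names.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

class LocaleNumberSketch {
    // Hypothetical helper (not jspoon API): parses a decimal number using
    // the conventions of the locale identified by the language tag.
    static double parseDouble(String text, String languageTag) {
        Locale locale = Locale.forLanguageTag(languageTag);
        try {
            return NumberFormat.getInstance(locale).parse(text).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse '" + text + "' for " + languageTag, e);
        }
    }

    public static void main(String[] args) {
        // In the Polish locale the comma is the decimal separator.
        System.out.println(parseDouble("3,14", "pl")); // prints 3.14
    }
}
```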
If jspoon doesn't find an HTML element, it won't set the field's value unless you set the `defValue` parameter:
```java
@Selector(value = "div > p > span", defValue = "NO_TEXT") String text;
```
### Custom converters
When format or regex is not enough, a custom converter can be used to implement parsing from jsoup's `Element`. This can be done by implementing the `ElementConverter` interface:
```java
public class JoinChildrenClassConverter implements ElementConverter<String> {
    @Override
    public String convert(Element node, Selector selector) {
        // Join the text of all child elements into one comma-separated string
        return node.children().stream().map(Element::text).collect(Collectors.joining(", "));
    }
}
```
And it can be used the following way:
```java
public class Model {
    @Selector(value = "#id", converter = JoinChildrenClassConverter.class)
    String childrenText;
}
```
### Retrofit
A Retrofit converter is available [here][retrofit-converter].
### Changelog
See [GitHub releases][changelog].
### Other libraries/inspirations
* [jsoup][jsoup] - all HTML parsing in jspoon is done by this library
* [webGrude][webGrude] - I found this library when I first had the idea for jspoon. It was the biggest inspiration, and I used some ideas from it
* [Moshi][Moshi] - I wanted to make jspoon work with HTML the same way as Moshi works with JSON. I adapted the caching mechanism (fields and adapters) from it.
* [jsoup-annotations][jsoup-annotations] - similar to jspoon

[//]: #
[jsoup]:
[webGrude]:
[Moshi]:
[jsoup-annotations]:
[retrofit-converter]:
[changelog]: