https://github.com/code4craft/xsoup

When jsoup meets XPath.

Xsoup
----
[![Build Status](https://api.travis-ci.org/code4craft/xsoup.png?branch=master)](https://travis-ci.org/code4craft/xsoup)

XPath selector based on Jsoup.

## Get started:

```java
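import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.junit.Assert;
import org.junit.Test;
import us.codecraft.xsoup.Xsoup;

// The class name is illustrative.
public class XsoupExampleTest {

    @Test
    public void testSelect() {
        // Sample page: one link and a table row with two cells.
        String html = "<html><div><a href='https://github.com'>github.com</a></div>" +
                "<table><tr><td>a</td><td>b</td></tr></table></html>";

        Document document = Jsoup.parse(html);

        // get() returns the first match as a string.
        String result = Xsoup.compile("//a/@href").evaluate(document).get();
        Assert.assertEquals("https://github.com", result);

        // list() returns every match as a string.
        List<String> list = Xsoup.compile("//tr/td/text()").evaluate(document).list();
        Assert.assertEquals("a", list.get(0));
        Assert.assertEquals("b", list.get(1));
    }
}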
```

## Performance:

Xsoup uses Jsoup as its HTML parser.

Compared with [**`HtmlCleaner`**](http://htmlcleaner.sourceforge.net/), another widely used XPath selector for HTML, Xsoup was roughly 2.5 times faster at parsing and 4 times faster at selecting in the following benchmark:

- Input: normal HTML, 44 KB
- XPath: `//a`
- Runs: 2000
- Environment: MacBook Air MD231CH/A
- CPU: 1.8 GHz Intel Core i5


| Operation | Xsoup | HtmlCleaner |
|-----------|-------|-------------|
| parse | 3,207 ms | 7,999 ms |
| select | 95 ms | 380 ms |

## Syntax supported:

### XPath1.0:


| Name | Expression | Support |
|------|------------|---------|
| nodename | `nodename` | yes |
| immediate parent | `/` | yes |
| parent | `//` | yes |
| attribute | `[@key=value]` | yes |
| nth child | `tag[n]` | yes |
| attribute | `/@key` | yes |
| wildcard in tagname | `/*` | yes |
| wildcard in attribute | `/[@*]` | yes |
| function | `function()` | partial |
| or | `a \| b` | yes since 0.2.0 |
| parent in path | `.` or `..` | no |
| predicates | `price>35` | no |
| predicates logic | `@class=a or @class=b` | yes since 0.2.0 |
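
For readers less familiar with XPath 1.0, the sketch below shows what several of the expressions in the table mean. It deliberately uses the JDK's built-in `javax.xml.xpath` engine on a small well-formed XML sample rather than Xsoup itself, so it only illustrates standard semantics; Xsoup's results on real HTML may differ. The class name, helper methods, and sample markup are assumptions made for illustration. In standard terms, `/` is a child step and `//` is a descendant step.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class StandardXPathDemo {
    // Well-formed XML standing in for an HTML page (hypothetical sample data).
    static final String XML =
        "<html><div><a href='https://github.com'>github.com</a></div>"
      + "<table><tr><td>a</td><td>b</td></tr></table></html>";

    // Evaluates an XPath 1.0 expression against the sample with the JDK engine.
    static Object eval(String expr, QName type) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expr, new InputSource(new StringReader(XML)), type);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expr, e);
        }
    }

    // String value of the first matched node.
    static String string(String expr) {
        return (String) eval(expr, XPathConstants.STRING);
    }

    // Number of nodes matched by the expression.
    static int count(String expr) {
        return ((NodeList) eval(expr, XPathConstants.NODESET)).getLength();
    }

    public static void main(String[] args) {
        System.out.println(string("/html/div/a")); // child steps: github.com
        System.out.println(string("//a/@href"));   // attribute: https://github.com
        System.out.println(string("//td[2]"));     // nth child, 1-based: b
        System.out.println(count("//*"));          // wildcard tag name: 7
        System.out.println(count("//a | //td"));   // union of two node sets: 3
        System.out.println(count("//a[@href='https://github.com']")); // predicate: 1
    }
}
```

Note that the JDK engine requires quoted string literals in predicates, whereas the table above writes attribute values unquoted.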

### Function supported:

Xsoup provides several functions, some of which are not part of standard XPath 1.0:


| Expression | Description | Standard XPath |
|------------|-------------|----------------|
| `text(n)` | nth text content of the element (0 for all) | `text()` only |
| `allText()` | text including children | not supported |
| `tidyText()` | text including children, well formatted | not supported |
| `html()` | inner HTML of the element | not supported |
| `outerHtml()` | outer HTML of the element | not supported |
| `regex(@attr,expr,group)` | use a regex to extract content | not supported |
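
The `regex(@attr,expr,group)` row is terse, so here is a rough sketch in plain Java of the kind of extraction it describes: apply a pattern to an attribute value and return one capture group. This is not Xsoup's implementation. The helper name `extract`, the use of the first match, and the `null` result when nothing matches are all assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexExtractSketch {
    // Returns capture group `group` of the first match of `expr` in `input`,
    // or null when nothing matches (the null fallback is an assumption).
    static String extract(String input, String expr, int group) {
        Matcher matcher = Pattern.compile(expr).matcher(input);
        return matcher.find() ? matcher.group(group) : null;
    }

    public static void main(String[] args) {
        String href = "https://github.com/code4craft/xsoup";
        // Group 1 captures the path segment after "github.com/".
        System.out.println(extract(href, "github\\.com/(\\w+)", 1)); // code4craft
        // Group 0 is the whole match.
        System.out.println(extract(href, "\\w+$", 0)); // xsoup
    }
}
```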

### Extended syntax supported:

The following syntax extensions exist only in Xsoup. They are modeled on the Jsoup CSS selector syntax, for convenience when extracting content from HTML:


| Name | Expression | Support |
|------|------------|---------|
| attribute value not equals | `[@key!=value]` | yes |
| attribute value starts with | `[@key^=value]` | yes |
| attribute value ends with | `[@key$=value]` | yes |
| attribute value contains | `[@key*=value]` | yes |
| attribute value matches regex | `[@key~=value]` | yes |
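
To make the intended meaning of each operator concrete, here is a minimal sketch in plain Java. It assumes the operators follow the conventions of Jsoup's CSS attribute selectors (`^=` prefix, `$=` suffix, `*=` substring, `~=` regex). It is not Xsoup's code: the class and method names are hypothetical, the comparison is case-sensitive for simplicity, and the regex case uses a "find anywhere" match, which is also an assumption.

```java
import java.util.regex.Pattern;

final class AttrOperatorSketch {
    // Returns true when the attribute value `actual` satisfies `op` against `expected`.
    static boolean matches(String op, String actual, String expected) {
        switch (op) {
            case "=":  return actual.equals(expected);
            case "!=": return !actual.equals(expected);
            case "^=": return actual.startsWith(expected);
            case "$=": return actual.endsWith(expected);
            case "*=": return actual.contains(expected);
            case "~=": return Pattern.compile(expected).matcher(actual).find();
            default:   throw new IllegalArgumentException("Unknown operator: " + op);
        }
    }

    public static void main(String[] args) {
        String href = "https://github.com";
        System.out.println(matches("!=", href, "http://github.com")); // true
        System.out.println(matches("^=", href, "https"));             // true
        System.out.println(matches("$=", href, ".com"));              // true
        System.out.println(matches("*=", href, "github"));            // true
        System.out.println(matches("~=", href, "git\\w+"));           // true
        System.out.println(matches("^=", href, "github"));            // false
    }
}
```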

## License

MIT License, see file `LICENSE`

[![Bitdeli Badge](https://d2weczhvl823v0.cloudfront.net/code4craft/xsoup/trend.png)](https://bitdeli.com/free "Bitdeli Badge")