{"id":17081006,"url":"https://github.com/zhegexiaohuozi/jsoupxpath","last_synced_at":"2025-05-15T01:04:30.582Z","repository":{"id":15209097,"uuid":"17937492","full_name":"zhegexiaohuozi/JsoupXpath","owner":"zhegexiaohuozi","description":"纯Java实现的支持W3C Xpath 1.0标准语法的HTML解析器。A html parser with xpath base on Jsoup and Antlr4. Maybe it is the best in java.Just try it. ","archived":false,"fork":false,"pushed_at":"2024-11-28T10:05:00.000Z","size":1926,"stargazers_count":454,"open_issues_count":12,"forks_count":154,"subscribers_count":21,"default_branch":"master","last_synced_at":"2025-04-06T14:09:27.320Z","etag":null,"topics":["antlr4","html-parser","jsoupxpath","xpath"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zhegexiaohuozi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2014-03-20T09:46:17.000Z","updated_at":"2025-04-04T15:50:37.000Z","dependencies_parsed_at":"2023-12-14T12:32:04.263Z","dependency_job_id":"46938487-eb5a-4edd-a948-8265104df654","html_url":"https://github.com/zhegexiaohuozi/JsoupXpath","commit_stats":null,"previous_names":[],"tags_count":22,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhegexiaohuozi%2FJsoupXpath","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhegexiaohuozi%2FJsoupXpath/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhegexiaohuozi%2FJsoupXpath/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/repositories/zhegexiaohuozi%2FJsoupXpath/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zhegexiaohuozi","download_url":"https://codeload.github.com/zhegexiaohuozi/JsoupXpath/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248784911,"owners_count":21161195,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["antlr4","html-parser","jsoupxpath","xpath"],"created_at":"2024-10-14T12:49:09.022Z","updated_at":"2025-04-13T21:28:55.584Z","avatar_url":"https://github.com/zhegexiaohuozi.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"JsoupXpath\n==========\n\n[![GitHub release](https://img.shields.io/github/release/zhegexiaohuozi/JsoupXpath.svg)](https://github.com/zhegexiaohuozi/JsoupXpath/releases)\n[![Maven](https://maven-badges.herokuapp.com/maven-central/cn.wanghaomiao/JsoupXpath/badge.svg)](http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22cn.wanghaomiao%22%20AND%20a%3A%22JsoupXpath%22)\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n\nA pure-Java HTML parser that supports the W3C XPath 1.0 standard syntax, built on Jsoup and Antlr4. It may be the best of its kind in Java, so give it a try.\n\nIf you like this project, please give it a Star.\nRead the detailed documentation in [English](docs/English.md) | [日本語](docs/Japanese.md) | [한국어](docs/Korean.md) | [Русский](docs/Russian.md) | [Français](docs/French.md).\n\n## Introduction ##\n\n**JsoupXpath** is a pure-Java parser that uses XPath to extract data from HTML. It reimplements the W3C XPath 
1.0 standard syntax specifically for HTML parsing: the XPath lexer and parser are built with Antlr4 and the HTML DOM tree is generated by Jsoup, hence the name JsoupXpath.\nJsoupXpath was written to bring the power and convenience of XPath to Java, where a sufficiently usable XPath parser was hard to find. Its implementation logic is clear and it is easy to extend.\nIt supports the complete W3C XPath 1.0 standard syntax. W3C specification: http://www.w3.org/TR/1999/REC-xpath-19991116 . The JsoupXpath grammar file is [Xpath.g4](https://github.com/zhegexiaohuozi/JsoupXpath/blob/master/src/main/resources/Xpath.g4).\n\n# Change Log #\n\nhttps://github.com/zhegexiaohuozi/JsoupXpath/releases\n\n# Community #\n\n- Issue\n\nhttps://github.com/zhegexiaohuozi/JsoupXpath/issues\n\n- WeChat subscription account\n\n![weixin](https://imgs.wanghaomiao.cn/seimiweixin_v2.jpeg)\n\nThe account publishes usage examples and the latest updates on projects in the seimi family, along with the author's articles and reflections on internet backend technology.\n\n\n## Quick Start ##\n\nMaven dependency (for all versions, see the [release notes](https://github.com/zhegexiaohuozi/JsoupXpath/releases) or [Maven Central](http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22cn.wanghaomiao%22%20AND%20a%3A%22JsoupXpath%22)):\n```\n\u003cdependency\u003e\n   \u003cgroupId\u003ecn.wanghaomiao\u003c/groupId\u003e\n   \u003cartifactId\u003eJsoupXpath\u003c/artifactId\u003e\n   \u003cversion\u003e2.5.3\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\nExample:\n\n```\nString html = \"\u003chtml\u003e\u003cbody\u003e\u003cscript\u003econsole.log('aaaaa')\u003c/script\u003e\u003cdiv class='test'\u003esome body\u003c/div\u003e\u003cdiv class='xiao'\u003eTwo\u003c/div\u003e\u003c/body\u003e\u003c/html\u003e\";\nJXDocument underTest = JXDocument.create(html);\nString xpath = \"//div[contains(@class,'xiao')]/text()\";\nJXNode node = underTest.selNOne(xpath);\nAssert.assertEquals(\"Two\",node.asString());\n```\nFor more, see [`org.seimicrawler.xpath.JXDocumentTest`](https://github.com/zhegexiaohuozi/JsoupXpath/blob/master/src/test/java/org/seimicrawler/xpath/JXDocumentTest.java), which contains a large number of test cases,\n\nor the [typical examples](https://github.com/zhegexiaohuozi/JsoupXpath/issues?q=is%3Aissue+is%3Aclosed+label%3A%E6%96%B0%E6%89%8B%E5%8F%82%E8%80%83) among the issues.\n\n## Syntax ##\n\nJsoupXpath supports the complete W3C XPath 
1.0 standard syntax. W3C specification: http://www.w3.org/TR/1999/REC-xpath-19991116\n\nBelow are examples of Antlr4-based parse trees from JsoupXpath, which give a quick overview of its syntax handling and of how expressions are parsed and executed.\n- `//ul[@class='subject-list']/li[./div/div/span[@class='pl']/num()\u003e(1000+90*(2*50))][last()][1]/div/h2/allText()`\nThis example mainly shows the parsing of nested expressions. Click the image for a larger view.\n[![muti_expr](https://imgs.wanghaomiao.cn/jsoupxpath/antlr4_parse_tree_muti_expr.png)](https://imgs.wanghaomiao.cn/jsoupxpath/antlr4_parse_tree_muti_expr.png)\n\n- `//ul[@class='subject-list']/li[not(contains(self::li/div/div/span[@class='pl']//text(),'14582'))]/div/h2//text()`\nThis example shows the parsing of built-in functions. Click the image for a larger view.\n[![functions](https://imgs.wanghaomiao.cn/jsoupxpath/antlr4_parse_tree_functions_v2.png)](https://imgs.wanghaomiao.cn/jsoupxpath/antlr4_parse_tree_functions_v2.png)\n\n### Notes on Using XPath ###\n\nIn most cases, pasting an XPath generated by Firefox or Chrome directly is not recommended. When rendering a page, these browsers complete some tags automatically according to the standard (for example, a `tbody` tag is added inside a `table` tag), so the generated XPath is not the most general one and may well fail to match anything. To get the full power, convenience, and elegance of XPath, it is best to learn the standard XPath syntax, which makes it much easier to handle whatever problems come up.\n\n## Functions ##\n\n- `int position()` Returns the position of the current node within its context.\n- `int last()` Returns the position of the last node in the context.\n- `int first()` Returns the position of the first node in the context.\n- `string concat(string, string, string*)` Concatenates the given strings.\n- `boolean contains(string, string)` Tests whether the first string contains the second.\n- `int count(node-set)` Counts the nodes in the given node-set.\n- `double/long sum(node-set)` Sums the numeric node values in the given node-set. The calculation is invalid if the arguments include non-numeric content.\n- `boolean starts-with(string, string)` Tests whether the first string starts with the second.\n- `int string-length(string?)` Returns the length of the given string; if no string is given, converts the current node to a string and returns its length.\n- `string substring(string, number, number?)` The first argument is the string, the second is the start position (XPath indexes start at 1), and the third is the length to extract. Note that in XPath syntax the third argument is a length, not an end position.\n\n  substring(\"12345\", 1.5, 2.6) returns \"234\"\n\n  substring(\"12345\", 2, 3) returns \"234\"\n\n- `string substring-ex(string, number, number)` The first argument is the string, the second is the start position (0-based, following the Java convention), and the third is the end position (negative values are supported). This is a JsoupXpath extension function for developers used to Java conventions.\n- `string substring-after(string, string)` Returns the part of the first string that follows the second string.\n- `string substring-after-last(string, string)` Returns the part of the first string that follows the last occurrence of the second string.\n- `string substring-before(string, string)` 
Returns the part of the first string that precedes the second string.\n- `string substring-before-last(string, string)` Returns the part of the first string that precedes the last occurrence of the second string.\n- `date format-date(string, string ,string)` The first argument is an expression, the second is the date format of the expression's value, and the third is the time-zone locale (optional).\n\n### Adding Custom Functions ###\nThe list above covers functions from the XPath 1.0 standard along with JsoupXpath extensions such as `substring-ex`. Developers can also add custom functions quickly and easily: implement the `org.seimicrawler.xpath.core.Function.java` interface and call `Scanner.registerFunction(Class\u003c? extends Function\u003e func)` when your system initializes. No change to the grammar is needed; JsoupXpath recognizes and loads the function at runtime. For standard functions that JsoupXpath does not yet implement, pull requests to the main repository are welcome.\n\n### NodeTest ###\n- `allText()` Extracts all text under a node, replacing recursive text extraction such as `//div/h3//text()`.\n- `html()` Returns the inner HTML of every matched node.\n- `outerHtml()` Returns the full HTML of every matched node, including the node itself.\n- `num()` Extracts numbers from a node's own text. If the node's own text (that is, text not contained in child nodes) holds only one number, such as a read count, comment count, or price, that number can be extracted directly. If there are several numbers, the first matching run of consecutive digits is extracted.\n- `text()` Extracts a node's own text. For more details, see https://github.com/zhegexiaohuozi/JsoupXpath/releases/tag/v2.4.1\n- `node()` Extracts all nodes.\n\n## Axes ##\n```\nAxisName:  'ancestor'         //selects among the ancestors of the context node\n  |  'ancestor-or-self'       //selects among the ancestors of the context node and the node itself\n  |  'attribute'              //marks an attribute extraction on the node\n  |  'child'                  //selects among the children of the context node; this is the default XPath axis, e.g. /div/li is short for /div/child::li\n  |  'descendant'             //selects among the descendants of the context node\n  |  'descendant-or-self'     //selects among the descendants of the context node and the node itself\n  |  'following'              //selects among all nodes after the context node\n  |  'following-sibling'      //selects among all sibling nodes after the context node\n  |  'parent'                 //selects the parent of the context node\n  |  'preceding'              //selects among all nodes before the context node\n  |  'preceding-sibling'      //selects among all sibling nodes before the context node\n  |  'self'                   //selects the context node itself\n  |  'following-sibling-one'  //selects the next sibling of the context node (JsoupXpath extension)\n  |  'preceding-sibling-one'  //selects the previous sibling of the context node (JsoupXpath extension)\n  |  'sibling'                //all siblings (JsoupXpath extension, in development)\n  ;\n```\n\n## Operators ##\n\n```\nMINUS\n       :  '-';\n  PLUS\n       :  '+';\n  DOT\n       :  '.';\n  MUL\n       : '*';\n  DIVISION\n       : '`div`';\n  MODULO\n       : '`mod`';\n  DOTDOT\n       :  '..';\n  AT\n       : '@';\n  COMMA\n       : ',';\n  PIPE\n       :  '|';\n  LESS\n       :  
'\u003c';\n  MORE_\n       :  '\u003e';\n  LE\n       :  '\u003c=';\n  GE\n       :  '\u003e=';\n  START_WITH\n       :  '^=';  // `a^=b` string a starts with string b (JsoupXpath extension)\n  END_WITH\n       :  '$=';  // `a$=b` string a ends with string b (JsoupXpath extension)\n  CONTAIN_WITH\n       :  '*=';  // `a*=b` string a contains string b (JsoupXpath extension)\n  REGEXP_WITH\n       :  '~=';  // the content of a matches regular expression b (JsoupXpath extension)\n  REGEXP_NOT_WITH\n       :  '!~';  // the content of a does not match regular expression b (JsoupXpath extension)\n```\n\n\n## Projects Using JsoupXpath ##\nProjects and organizations that currently make heavy use of JsoupXpath: [SeimiCrawler](https://github.com/zhegexiaohuozi/SeimiCrawler).\nIf your project also uses JsoupXpath and you would like it to appear in this list, you can contact me using the contact details below. The email can take the following format:\n```\nProject or organization name: XX\nProject or organization URL: http://xxx.xxx.cc\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzhegexiaohuozi%2Fjsoupxpath","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzhegexiaohuozi%2Fjsoupxpath","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzhegexiaohuozi%2Fjsoupxpath/lists"}