{"id":19084342,"url":"https://github.com/coghost/xparse","last_synced_at":"2026-02-14T21:32:01.908Z","repository":{"id":61625623,"uuid":"529082449","full_name":"coghost/xparse","owner":"coghost","description":"parse yaml syntax config file to map html as structured data ","archived":false,"fork":false,"pushed_at":"2025-03-10T02:47:41.000Z","size":447,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-09-07T21:41:16.290Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/coghost.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-08-26T02:07:51.000Z","updated_at":"2025-03-10T02:47:45.000Z","dependencies_parsed_at":"2023-02-17T19:30:32.764Z","dependency_job_id":"ba114ad8-b7a9-4031-9fcb-e68cfdef11a3","html_url":"https://github.com/coghost/xparse","commit_stats":null,"previous_names":[],"tags_count":10,"template":false,"template_full_name":null,"purl":"pkg:github/coghost/xparse","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/coghost%2Fxparse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/coghost%2Fxparse/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/coghost%2Fxparse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/coghost%2Fxparse/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/coghost","download_url":"https://codeload.github.com
/coghost/xparse/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/coghost%2Fxparse/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29456239,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-14T21:29:27.764Z","status":"ssl_error","status_checked_at":"2026-02-14T21:28:11.111Z","response_time":53,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-09T02:50:59.097Z","updated_at":"2026-02-14T21:32:01.893Z","avatar_url":"https://github.com/coghost.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# xparse\n\nParse raw HTML or JSON files into structured data using a YAML config file.\n\n## demo\n\n```yaml\n__raw:\n  site_url: https://xkcd.com/\n  test_keys:\n    - bottom.comic_links.*\n    - middle.ctitle\n    - middle.transcript\n\nbottom:\n  _locator: div#bottom\n  comic_links:\n    _locator: div#comicLinks\u003ea\n    _index: ~\n    text:\n    href:\n      _attr: href\n      _attr_refine: enrich_url\n\nmiddle:\n  _locator: div#middleContainer\n  ctitle: div#ctitle\n  transcript: div#transcript\n```\n\n## constants\n\nAll reserved keys used when writing a YAML config file to map HTML/JSON:\n\n```go\npackage xparse\n\n// Core extraction configuration keys\nconst (\n\t// Index 
specifies which elements to extract from results\n\t// Formats: \"_index\" or \"_i\"\n\t// Values:\n\t//   - nil/absent: get all elements\n\t//   - array: [0,1] gets elements[0] and elements[1]\n\t//   - single: 0 gets elements[0]\n\t// Index types:\n\t//  1. without index\n\t//  2. index: ~ (index is null)\n\t//  3. index: 0\n\t//  4. index: [0, 1, ...]\n\t//  5. index: 0,4 =\u003e 0,1,2,3\n\tIndex = \"_index\"\n\n\t// Locator specifies the path/selector to find desired elements\n\t// Formats: \"_locator\" or \"_l\"\n\t// Supported types:\n\t//  \u003e string:\n\t//   _locator: string\n\t//\n\t//  \u003e list:\n\t//   _locator:\n\t//     - div.001\n\t//     - div.002\n\t//     - div.003\n\t//\n\t//  \u003e map:\n\t//   _locator:\n\t//     key1: div.001\n\t//     key2: div.002\n\t//     key3: div.003\n\tLocator = \"_locator\"\n\n\t// Element navigation keys\n\t// ExtractPrevElem is used when no proper locator exists\n\t// in most cases, we can use locator to get the elem we want,\n\t// but in some rare cases, there is no proper locator to use, so we have to use this to get prev elem\n\tExtractPrevElem = \"_extract_prev\"\n\tExtractParent   = \"_extract_parent\"\n)\n\n// Attribute related configuration keys\nconst (\n\t// Attr specifies which attribute to extract\n\t// Default is element text\n\t// Special value \"__html\" returns raw HTML\n\tAttr = \"_attr\"\n\n\t// AttrRefine specifies how to refine the extracted attribute\n\t// Formats: \"_attr_refine\" or \"_ar\"\n\t// Values:\n\t//   - bool(true): auto-generate method name\n\t//   - string(_name): adds prefix \"refine\" so \"_name\" becomes \"_refine_name\"\n\t//   - string(refine_xxx/_refine_xxx): used as-is\n\t//   - string(not starting with _): used as-is\n\tAttrRefine = \"_attr_refine\"\n\n\t// AttrJoiner specifies the joiner for attributes\n\tAttrJoiner = \"_joiner\"\n\n\t// AttrIndex configuration:\n\t//   - _joiner: \",\"\n\t//   - _attr_refine: _attr_by_index\n\t//   - _attr_index: 0\n\tAttrIndex = 
\"_attr_index\"\n\n\tAttrRegex = \"_attr_regex\"\n\n\t// AttrPython runs Python script directly (requires Python environment)\n\t// Example:\n\t//   import sys\n\t//   raw = sys.argv[1] # raw is globally registered\n\t//   arr = raw.split(\"_\")\n\t//   print(arr[1]) # required: output value as refined attr value\n\tAttrPython = \"_attr_python\"\n\n\t// AttrJS runs JavaScript code\n\t// Example:\n\t//   arr = raw.split(\"_\") // raw is registered by default\n\t//   refined = arr[1] // refined is required value\n\t// Note: Underscore.js (https://underscorejs.org/) is supported by default\n\tAttrJS = \"_attr_js\"\n)\n\n// Post-processing configuration keys\nconst (\n\t// PostJoin joins parsed attributes array into string using joiner\n\tPostJoin = \"_post_join\"\n\n\t// Strip controls string trimming\n\t// Values:\n\t//   - if `_strip: true` or not existed: does strings.TrimSpace\n\t//   - if `_strip: str`: does strings.ReplaceAll(raw, str, \"\")\n\t//   - if `_strip: [\"(\", \")\"]`: replaces one by one\n\t// Note: Called by default, use `_strip: false` to disable\n\tStrip = \"_strip\"\n\n\t// Type converts output to specified type\n\t// Without `_type: b/i/f`, returns as string\n\t// Values:\n\t//   - b: bool\n\t//   - i: int\n\t//   - f: float\n\tType = \"_type\"\n)\n\n// Abbreviated keys\nconst (\n\tLocatorAbbr    = \"_l\"\n\tIndexAbbr      = \"_i\"\n\tAttrRefineAbbr = \"_ar\"\n\tTypeAbbr       = \"_t\"\n)\n\n// Special locators and internal constants\nconst (\n\t// JSONArrayRootLocator is used for JSON arrays without root object\n\t// Used when JSON file has ordered list of values like: `[{...}, {...}]`\n\tJSONArrayRootLocator = \"*/*\"\n\n\t// PrefixLocatorStub for multiple locators not in same stub\n\t// Recalculates from base locator (map root)\n\t// Example:\n\t//   jobs:\n\t//     _locator: jobs\n\t//     _index:\n\t//     taxo:\n\t//       _locator: taxonomyAttributes\n\t//       _index: 0\n\t//       attr:\n\t//       _locator:\n\t//         - 
attributes\n\t//         - ___.salarySnippet\n\tPrefixLocatorStub = \"___\"\n\n\t// _prefixRefine defines the word we use as the prefix of method of attr refiner\n\t_prefixRefine = \"_refine\"\n\t// AttrJoinerSep is a separator used to join an array to string\n\tAttrJoinerSep = \"|||\"\n)\n\n// Special attribute values\nconst (\n\t// AttrJoinElemsText joins all elements inner text to string\n\t// Used only when parsing HTML\n\t// Warning: Rarely used, consider alternatives\n\tAttrJoinElemsText = \"__join_text\"\n\n\t// AttrRawHTML returns the raw html of locator\n\tAttrRawHTML = \"__html\"\n\n\t// RefineWithKeyName uses key name as refiner method\n\t// Example:\n\t//   root:\n\t//     a_changeable_name:\n\t//       _locator: div.xxx\n\t//       _attr: title\n\t//       _attr_refine: __key\n\tRefineWithKeyName = \"__key\"\n)\n\n// Type constants\nconst (\n\tAttrTypeB = \"b\" // Boolean\n\tAttrTypeF = \"f\" // Float\n\tAttrTypeI = \"i\" // Integer\n\n\t// Time types\n\tAttrTypeT  = \"t\"  // Quick mode\n\tAttrTypeT1 = \"t1\" // Search mode\n)\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcoghost%2Fxparse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcoghost%2Fxparse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcoghost%2Fxparse/lists"}