{"id":50182336,"url":"https://github.com/vkolev/parsy","last_synced_at":"2026-05-25T07:41:41.927Z","repository":{"id":63895470,"uuid":"571463202","full_name":"vkolev/parsy","owner":"vkolev","description":"HTML parsing library with YAML definitions","archived":false,"fork":false,"pushed_at":"2022-12-16T07:16:10.000Z","size":973,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-09-18T08:29:58.336Z","etag":null,"topics":["html","python","yaml"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vkolev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-11-28T07:29:09.000Z","updated_at":"2024-01-12T18:49:39.000Z","dependencies_parsed_at":"2023-01-14T12:30:35.101Z","dependency_job_id":null,"html_url":"https://github.com/vkolev/parsy","commit_stats":null,"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"purl":"pkg:github/vkolev/parsy","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vkolev%2Fparsy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vkolev%2Fparsy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vkolev%2Fparsy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vkolev%2Fparsy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vkolev","download_url":"https://codeload.github.com/vkolev/parsy/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vkolev%2Fparsy/sbom","scorecar
d":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33465616,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-25T06:32:55.349Z","status":"ssl_error","status_checked_at":"2026-05-25T06:32:35.322Z","response_time":57,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["html","python","yaml"],"created_at":"2026-05-25T07:41:41.104Z","updated_at":"2026-05-25T07:41:41.917Z","avatar_url":"https://github.com/vkolev.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"![Logo](https://raw.githubusercontent.com/vkolev/parsy/master/images/parsy-logo.png)\n\n![CI](https://github.com/vkolev/parsy/actions/workflows/main.yml/badge.svg?branch=master) ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pyparsy) ![PyPI](https://img.shields.io/pypi/v/pyparsy)\n\n# PyParsy\n\nPyParsy is an HTML parsing library that uses YAML definition files. The idea is to use the YAML file as\na statement of intent: you describe the result you want and let Parsy do the heavy lifting for you. The main\ndifference from other similar libraries (e.g. [selectorlib](https://selectorlib.com/)) is that it \nsupports multiple versions of selectors for a single field. 
This way you do not need to create a new \nYAML definition file for every change to a website.\n\n\nThe YAML files contain:\n- The desired structure of the output\n- XPath/CSS/Regex selectors for the element extraction\n- Return type definition\n- Optional children of the field\n\n## Features\n\n- [x] YAML File definitions\n- [x] YAML File validation\n- [x] Intent instead of coding\n- [x] Support for XPath, CSS and Regex selectors\n- [ ] Different output formats e.g. JSON, YAML, XML\n- [x] Somewhat opinionated\n- [x] 99% coverage\n\n## Installation\n\nUsing pip:\n```shell\npip install pyparsy\n```\n\n## Running Tests\n\nTo run the tests, use the following command:\n\n```bash\n  poetry run pytest\n```\n\n## YAML Structure\n\n- `\u003cfield_name\u003e:` The field name is the top level of the YAML\n  - `selector:` `\u003cselector_definition\u003e` - The selector expression\n  - `selector_type:` `\u003cselector_type[XPATH, CSS, REGEX]\u003e` - The type of the selector expression, one of `XPATH, CSS, REGEX`\n  - `multiple:` `\u003ctrue/false\u003e` *[Optional]* true - get all matching results as a list, false - get the first matching result\n  - `return_type:` `\u003creturn_type[STRING, INTEGER, FLOAT, MAP]\u003e` - Desired return type, one of `STRING, INTEGER, FLOAT or MAP`\n  - `children:` `\u003clist of definitions\u003e` *[Optional]* - used for `return_type: MAP`\n\n## Examples\n\nAs an example, consider the Amazon bestseller page. 
First we define the .yaml definition file:\n\n```yaml\ntitle:\n  selector: //div[contains(@class, \"_card-title_\")]/h1/text()\n  selector_type: XPATH\n  return_type: STRING\npage:\n  selector: //ul[contains(@class, \"a-pagination\")]/li[@class=\"a-selected\"]/a/text()\n  selector_type: XPATH\n  return_type: INTEGER\nproducts:\n  selector: //div[@id=\"gridItemRoot\"]\n  selector_type: XPATH\n  multiple: true\n  return_type: MAP\n  children:\n    image:\n      selector: //img[contains(@class, \"a-dynamic-image\")]/@src\n      selector_type: XPATH\n      return_type: STRING\n    title:\n      selector: //a[@class=\"a-link-normal\"]/span/div/text()\n      selector_type: XPATH\n      return_type: STRING\n    price:\n      selector: //span[contains(@class, \"a-color-price\")]/span/text()\n      selector_type: XPATH\n      return_type: FLOAT\n    asin:\n      selector: //div[contains(@class, \"sc-uncoverable-faceout\")]/@id\n      selector_type: XPATH\n      return_type: STRING\n    reviews_count:\n      selector: //div[contains(@class, \"sc-uncoverable-faceout\")]/div/div/a/span/text()\n      selector_type: XPATH\n      return_type: INTEGER\n```\nThen we can use this definition in code:\n\n```python\nfrom pathlib import Path\nfrom pyparsy import Parsy\nimport httpx\nimport json\n\n\ndef main():\n    parser = Parsy.from_file(Path('tests/assets/amazon_bestseller_de.yaml'))\n    response = httpx.get(\"https://www.amazon.de/-/en/gp/bestsellers/ce-de/ref=zg_bs_nav_0\")\n    result = parser.parse(response.text)\n    print(json.dumps(dict(result), indent=4))\n\n\nif __name__ == \"__main__\":\n    main()\n```\n\nWill result in:\n```json\n{\n    \"title\": \"Best Sellers in Electronics \u0026 Photo\",\n    \"page\": 1,\n    \"products\": [\n        {\n            \"image\": \"https://images-eu.ssl-images-amazon.com/images/I/81ZnAYiX5sL._AC_UL300_SR300,200_.jpg\",\n            \"title\": \"Amazon Basics High Power 1.5V AA Alkaline Batteries, Pack of 48 (Appearance May Vary)\",\n  
          \"price\": 19.12,\n            \"asin\": \"B00MNV8E0C\",\n            \"reviews_count\": 526202\n        },\n        {\n            \"image\": \"https://images-eu.ssl-images-amazon.com/images/I/71C3lbbeLsL._AC_UL300_SR300,200_.jpg\",\n            \"title\": \"All-new Echo Dot (5th generation, 2022 release) smart speaker with Alexa | Charcoal\",\n            \"price\": 59.99,\n            \"asin\": \"B09B8X9RGM\",\n            \"reviews_count\": 760\n        },\n        {\n            \"image\": \"https://images-eu.ssl-images-amazon.com/images/I/811OG1FsNFL._AC_UL300_SR300,200_.jpg\",\n            \"title\": \"Fire TV Stick with Alexa Voice Remote (includes TV controls) | HD streaming device\",\n            \"price\": 39.99,\n            \"asin\": \"B08C1KN5J2\",\n            \"reviews_count\": 92504\n        },\n        {\n            \"image\": \"https://images-eu.ssl-images-amazon.com/images/I/81FGpGF5kaL._AC_UL300_SR300,200_.jpg\",\n            \"title\": \"Amazon Basics AA Industrial Alkaline Batteries, Pack of 40\",\n            \"price\": 11.78,\n            \"asin\": \"B07MLFBJG3\",\n            \"reviews_count\": 72375\n        },\n        {\n            \"image\": \"https://images-eu.ssl-images-amazon.com/images/I/61ymYQD3gaL._AC_UL300_SR300,200_.jpg\",\n            \"title\": \"Fire TV Stick 4K with Alexa Voice Remote (includes TV controls)\",\n            \"price\": 59.99,\n            \"asin\": \"B08XW4FDJV\",\n            \"reviews_count\": 46503\n        },\n        {\n            \"image\": \"https://images-eu.ssl-images-amazon.com/images/I/61UV1sshWKL._AC_UL300_SR300,200_.jpg\",\n            \"title\": \"Varta Lithium Button Cell Battery\",\n            \"price\": 3.29,\n            \"asin\": \"B00TYEL11K\",\n            \"reviews_count\": 62993\n        },\n        {\n            \"image\": \"https://images-eu.ssl-images-amazon.com/images/I/61bLsZejhPL._AC_UL300_SR300,200_.jpg\",\n            \"title\": \"Instax Fujifilm Mini Instant 
Film, White, 2 x 10 Sheets (20 Sheets)\",\n            \"price\": 15.95,\n            \"asin\": \"B0000C73CQ\",\n            \"reviews_count\": 197326\n        },\n        {\n            \"image\": \"https://images-eu.ssl-images-amazon.com/images/I/71PTROtCLRL._AC_UL300_SR300,200_.jpg\",\n            \"title\": \"2032 20 40 Cell Battery Silver\",\n            \"price\": 8.99,\n            \"asin\": \"B07CSZ575S\",\n            \"reviews_count\": 16096\n        },\n        {\n            \"image\": \"https://images-eu.ssl-images-amazon.com/images/I/51CDcTTd3-S._AC_UL300_SR300,200_.jpg\",\n            \"title\": \"Apple AirTag, pack of 4\",\n            \"price\": 119.0,\n            \"asin\": \"B0935JRJ59\",\n            \"reviews_count\": 47525\n        },\n        {\n            \"image\": \"https://images-eu.ssl-images-amazon.com/images/I/71yf6yTNWSL._AC_UL300_SR300,200_.jpg\",\n            \"title\": \"All-new Echo Dot (5th generation, 2022 release) smart speaker with clock and Alexa | Cloud Blue\",\n            \"price\": 69.99,\n            \"asin\": \"B09B8RVKGW\",\n            \"reviews_count\": 665\n        },\n        {\n            \"image\": \"https://images-eu.ssl-images-amazon.com/images/I/71EFiZtPjML._AC_UL300_SR300,200_.jpg\",\n            \"title\": \"Duracell Plus C Baby Alkaline Batteries 1.5 V LR14 MN1400 Pack of 4\",\n            \"price\": 7.13,\n            \"asin\": \"B093C9FN7W\",\n            \"reviews_count\": 42466\n        },\n        {\n            \"image\": \"https://images-eu.ssl-images-amazon.com/images/I/51Z0FcUPmgL._AC_UL300_SR300,200_.jpg\",\n            \"title\": \"ooono traffic alarm: Warns about speed cameras and hazards in road traffic in real time, automatically active after connection to smartphone via Bluetooth, data from Blitzer.de\",\n            \"price\": 49.95,\n            \"asin\": \"B07Q619ZKS\",\n            \"reviews_count\": 26587\n        },\n        {\n            \"image\": 
\"https://images-eu.ssl-images-amazon.com/images/I/71g8a2BcgRL._AC_UL300_SR300,200_.jpg\",\n            \"title\": \"Fire TV Stick 4K Max streaming device, Wi-Fi 6, Alexa Voice Remote (includes TV controls)\",\n            \"price\": 64.99,\n            \"asin\": \"B08MT4MY9J\",\n            \"reviews_count\": 30523\n        },\n        {\n            \"image\": \"https://images-eu.ssl-images-amazon.com/images/I/81Tt3+NBcSL._AC_UL300_SR300,200_.jpg\",\n            \"title\": \"KabelDirekt - 2m - 4K HDMI Cable (4K @120Hz \u0026 4K @60Hz - Spectacular Ultra HD Experience - High Speed with Ethernet - HDMI 2.0/1.4, Blu-ray/PS4/PS5/Xbox Series X/Switch - Black\",\n            \"price\": 7.99,\n            \"asin\": \"B004BEMD5Q\",\n            \"reviews_count\": 125724\n        },\n        {\n            \"image\": \"https://images-eu.ssl-images-amazon.com/images/I/61a3VAbtpQL._AC_UL300_SR300,200_.jpg\",\n            \"title\": \"Soundcore Life P2 Bluetooth Headphones, Wireless Earbuds with CVC 8.0 Noise Isolation for a Crystal Clear Sound Profile, 40-hour Battery Life, IPX7 Water Protection Class, for Work and Travel\",\n            \"price\": 23.99,\n            \"asin\": \"B07SJR6HL3\",\n            \"reviews_count\": 115557\n        },\n        {\n            \"image\": \"https://images-eu.ssl-images-amazon.com/images/I/41qfJN7dLhL._AC_UL300_SR300,200_.jpg\",\n            \"title\": \"Fire TV Stick Lite mit Alexa-Sprachfernbedienung Lite (ohne TV-Steuerungstasten) | HD-Streamingger\\u00e4t\",\n            \"price\": 29.99,\n            \"asin\": \"B091G3WT74\",\n            \"reviews_count\": 5601\n        },\n        {\n            \"image\": \"https://images-eu.ssl-images-amazon.com/images/I/41hX+2Es+vL._AC_UL300_SR300,200_.jpg\",\n            \"title\": \"Echo Dot (3rd Gen) - Smart speaker with Alexa - Charcoal Fabric\",\n            \"price\": 49.99,\n            \"asin\": \"B07PHPXHQS\",\n            \"reviews_count\": 312374\n        },\n        {\n            
\"image\": \"https://images-eu.ssl-images-amazon.com/images/I/71nnAxdMtkL._AC_UL300_SR300,200_.jpg\",\n            \"title\": \"Pack of 40 AG13 LR44 1.5 V Alkaline Button Cell Batteries, Mercury-Free (357 / 357A / L1154 / A76 / GPA76)\",\n            \"price\": 6.99,\n            \"asin\": \"B079HZ6RQR\",\n            \"reviews_count\": 10692\n        },\n        {\n            \"image\": \"https://images-eu.ssl-images-amazon.com/images/I/418AP8pw3KL._AC_UL300_SR300,200_.jpg\",\n            \"title\": \"EarPods with Lightning Connector\",\n            \"price\": 16.9,\n            \"asin\": \"B01M1EEPOB\",\n            \"reviews_count\": 22486\n        },\n        {\n            \"image\": \"https://images-eu.ssl-images-amazon.com/images/I/715i0StnSlS._AC_UL300_SR300,200_.jpg\",\n            \"title\": \"Amazon Basics High Capacity AA Rechargeable 2400mAh Batteries Pre-Charged Pack of 12\",\n            \"price\": 21.94,\n            \"asin\": \"B07NWT6YLD\",\n            \"reviews_count\": 146981\n        },\n        {\n            \"image\": \"https://images-eu.ssl-images-amazon.com/images/I/61iYFNhtwHL._AC_UL300_SR300,200_.jpg\",\n            \"title\": \"NEW'C tempered glass foil, protective foil for iPhone 11, iPhone XR, 2pcs., free from scratches, fingerprints and oil, 9H hardness, 0.33 mm ultra clear, screen protective foil for iPhone 11, iPhone XR\",\n            \"price\": 5.99,\n            \"asin\": \"B07NC8PWDM\",\n            \"reviews_count\": 76098\n        },\n        {\n            \"image\": \"https://images-eu.ssl-images-amazon.com/images/I/81Pd4ogDITL._AC_UL300_SR300,200_.jpg\",\n            \"title\": \"LiCB CR2032 3V Lithium Button Cell Batteries CR 2032 Pack of 10\",\n            \"price\": 6.99,\n            \"asin\": \"B07P7V9SP7\",\n            \"reviews_count\": 10781\n        },\n        {\n            \"image\": \"https://images-eu.ssl-images-amazon.com/images/I/61MRw0Bun4L._AC_UL300_SR300,200_.jpg\",\n            \"title\": \"Varta 
Ready2Use Rechargeable Battery, Pre-Charged AAA Micro Ni-Mh Battery, Pack of 4, 1000 mAh, Rechargeable without Memory Effect, Ready to Use\",\n            \"price\": 10.22,\n            \"asin\": \"B000IGW3JC\",\n            \"reviews_count\": 47147\n        },\n        {\n            \"image\": \"https://images-eu.ssl-images-amazon.com/images/I/61pBvlYVPxL._AC_UL300_SR300,200_.jpg\",\n            \"title\": \"Amazon Basics - high-speed cable, Ultra HD HDMI 2.0, supports 3D formats, with audio return channel, 1.8 m\",\n            \"price\": 6.99,\n            \"asin\": \"B014I8SSD0\",\n            \"reviews_count\": 469788\n        },\n        {\n            \"image\": \"https://images-eu.ssl-images-amazon.com/images/I/71AwNMpA29L._AC_UL300_SR300,200_.jpg\",\n            \"title\": \"Instax Mini 11 Camera\",\n            \"price\": 79.0,\n            \"asin\": \"B084S3Y6L1\",\n            \"reviews_count\": 14441\n        },\n        {\n            \"image\": \"https://images-eu.ssl-images-amazon.com/images/I/51hsq3bombL._AC_UL300_SR300,200_.jpg\",\n            \"title\": \"Soundcore by Anker Life P2 Mini Bluetooth Headphones, In-Ear Headphones with 10 mm Audio Driver, Intense Bass, EQ, Bluetooth 5.2, 32 Hours Battery, Charging with USB-C, Minimalist Design (Night Black)\",\n            \"price\": 39.99,\n            \"asin\": \"B099DP3617\",\n            \"reviews_count\": 14245\n        },\n        {\n            \"image\": \"https://images-eu.ssl-images-amazon.com/images/I/61pUgAx+pPL._AC_UL300_SR300,200_.jpg\",\n            \"title\": \"NEW'C Pack of 3 Tempered Protective Glass for iPhone 14, 13, 13 Pro (6.1 inches), Free from Scratches, 9H Hardness, HD Screen Protector, 0.33 mm Ultra Clear, Ultra Resistant\",\n            \"price\": 5.99,\n            \"asin\": \"B09F3P3DQD\",\n            \"reviews_count\": 10136\n        },\n        {\n            \"image\": \"https://images-eu.ssl-images-amazon.com/images/I/71fzcZQlbqS._AC_UL300_SR300,200_.jpg\",\n         
   \"title\": \"Echo Show 5 | 2nd generation (2021 release), smart display with Alexa and 2 MP camera | Charcoal\",\n            \"price\": 84.99,\n            \"asin\": \"B08KH2MTSS\",\n            \"reviews_count\": 22944\n        },\n        {\n            \"image\": \"https://images-eu.ssl-images-amazon.com/images/I/81Jz6OogtbL._AC_UL300_SR300,200_.jpg\",\n            \"title\": \"Misxi hard case with glass screen protector compatible with Apple Watch Series 6 / SE / Series 5 / Series 4 44 mm, pack of 2\",\n            \"price\": 10.99,\n            \"asin\": \"B07ZRMCRG7\",\n            \"reviews_count\": 28111\n        },\n        {\n            \"image\": \"https://images-eu.ssl-images-amazon.com/images/I/71gG2vN8FFS._AC_UL300_SR300,200_.jpg\",\n            \"title\": \"Duracell Plus AAA Micro Alkaline Batteries 1.5 V LR03 MN2400 Pack of 12\",\n            \"price\": 6.99,\n            \"asin\": \"B093LT2N4Q\",\n            \"reviews_count\": 20307\n        }\n    ]\n}\n\n```\n\nFor the sake of the example, let's store the file as `amazon_bestseller.yaml`.\n\nThen we can use the PyParsy library in our code:\n\n```python\nimport httpx\nfrom pathlib import Path\nfrom pyparsy import Parsy\n\n\ndef main():\n  response = httpx.get(\"https://www.amazon.com/gp/bestsellers/hi/?ie=UTF8\u0026ref_=sv_hg_1\")\n  parser = Parsy.from_file(Path(\"amazon_bestseller.yaml\"))\n  result = parser.parse(response.text)\n  print(result)\n\n\nif __name__ == \"__main__\":\n  main()\n```\n\nFor more examples, see the library's tests.\n\n## Documentation\n\n[Documentation](https://pyparsy.readthedocs.com) (hopefully some day)\n\n## Acknowledgements\n\n - [selectorlib](https://selectorlib.com/) - The main inspiration for this project\n - [Scrapy](https://scrapy.org/) - One of the best crawling libraries for Python\n - [parsel](https://github.com/scrapy/parsel) - Scrapy's parsing library; it is heavily used in this project and can be considered the main dependency.\n - 
[schema](https://github.com/keleshev/schema) - Used for validating the YAML file schema\n\n## Contributing\n\nContributions are very much welcome. Just create your Pull request with enough tests.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvkolev%2Fparsy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvkolev%2Fparsy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvkolev%2Fparsy/lists"}