{"id":22797330,"url":"https://github.com/lexndru/hap","last_synced_at":"2025-03-30T18:29:11.018Z","repository":{"id":43367528,"uuid":"140211940","full_name":"lexndru/hap","owner":"lexndru","description":"Hap! is an HTML parser and scraping tool.","archived":false,"fork":false,"pushed_at":"2022-07-06T19:54:43.000Z","size":125,"stargazers_count":1,"open_issues_count":2,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-05T22:09:49.462Z","etag":null,"topics":["cli","hap","html-parser","javascript","json","nodejs","protocol","python","scraper"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lexndru.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-07-09T00:14:37.000Z","updated_at":"2022-01-22T10:55:34.000Z","dependencies_parsed_at":"2022-07-08T20:30:49.702Z","dependency_job_id":null,"html_url":"https://github.com/lexndru/hap","commit_stats":null,"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lexndru%2Fhap","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lexndru%2Fhap/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lexndru%2Fhap/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lexndru%2Fhap/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lexndru","download_url":"https://codeload.github.com/lexndru/hap/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","reposit
ories_count":246362095,"owners_count":20765035,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","hap","html-parser","javascript","json","nodejs","protocol","python","scraper"],"created_at":"2024-12-12T06:05:44.750Z","updated_at":"2025-03-30T18:29:10.994Z","avatar_url":"https://github.com/lexndru.png","language":"Python","readme":"# Hap!\n[![Build Status](https://travis-ci.org/lexndru/hap.svg?branch=master)](https://travis-ci.org/lexndru/hap)\n\nHap! is an HTML parser and scraping tool built upon its own concept of transforming markup-language documents into valuable data. It implements the dataplan application specification for markup-language data-oriented documents. This repository contains both the [dataplan specification](DATAPLAN.md) and the reference implementation of the HTML parser, originally written in Python2/3.\n\n\n## Install\nHap! has been implemented in two languages: Python and JavaScript. Both implementations respect the dataplan specification and support all features described in this document. The most recent release is `v1.3.0`.\n\n#### Get Hap! for Python (2/3)\n```\n$ pip install hap\n$ hap -h\nusage: hap [-h] [--sample] [--link LINK] [--save] [--verbose] [--no-cache]\n           [--refresh] [--silent] [--version]\n           [input]\n\nHap! 
Simple HTML scraping tool\n\npositional arguments:\n  input        your JSON formated dataplan input\n\noptional arguments:\n  -h, --help   show this help message and exit\n  --sample     generate a sample dataplan\n  --link LINK  overwrite link in dataplan\n  --save       save collected data to dataplan\n  --verbose    enable verbose mode\n  --no-cache   disable cache link\n  --refresh    reset stored records before save\n  --silent     suppress any output\n  --version    print version number\n```\n\n#### Get Hap! for Node.js\n```\n$ npm install hap-js\n$ hap -h\nusage: hap [-h] [-v] [--sample] [--link LINK] [--save] [--verbose]\n           [--no-cache] [--refresh] [--silent]\n           [input]\n\nHap! Simple HTML scraping tool\n\nPositional arguments:\n  input          Your JSON formated dataplan input.\n\nOptional arguments:\n  -h, --help     Show this help message and exit.\n  -v, --version  Show program's version number and exit.\n  --sample       Generate a sample dataplan.\n  --link LINK    Overwrite link in dataplan.\n  --save         Save collected data to dataplan.\n  --verbose      Enable verbose mode.\n  --no-cache     Disable cache link.\n  --refresh      Reset stored records before save.\n  --silent       Suppress any output.\n```\n**Note:** Hap! 
has been ported to Node.js \u003e= 6.x; older versions may not work.\n\n\n## Features\n- [x] Scrape HTML pages from the web or locally\n- [x] Quickly parse and extract data from HTML documents\n- [x] Convert desired output data (\"declared data\") to `string`, `integer`, `float`, `decimal` or `boolean` values\n- [x] Run step-by-step instructions (\"defined steps\") with performers to obtain the desired output\n- [x] Create variables and directly or indirectly assign string values\n- [x] Query CSS selectors and XPath expressions\n- [x] Evaluate patterns with regular expressions\n- [x] Capture and create variables from named groups (Python2/3 only)\n- [x] Remove unwanted characters or patterns from an intermediate result\n- [x] Replace characters or patterns with static content or dynamic content with variables\n- [x] Concatenate lists of strings with support for variable interpolation\n- [x] Self-stored history records with `datetime` support\n- [x] Configurable outgoing requests\n- [x] Supports metafields\n\n\n## Getting started with Hap!\n*Notice: The following guide uses the Python implementation of Hap!*\n\nThe purpose of Hap! is to provide a simple and fast way to retrieve specific data from the internet. It uses JSON formatted data as input and output. Input can come either from a local file or from another process via stdin. Output is either printed to stdout or saved to a file. If the input is provided as a file, Hap! calls it a dataplan (\"data planning\") and the same file is used when the output is saved.\n\n## Usage\n```\nusage: hap [-h] [--sample] [--link LINK] [--save] [--verbose] [--no-cache]\n           [--refresh] [--silent] [--version]\n           [input]\n\nHap! 
Simple HTML scraping tool\n\npositional arguments:\n  input        your JSON formated dataplan input\n\noptional arguments:\n  -h, --help   show this help message and exit\n  --sample     generate a sample dataplan\n  --link LINK  overwrite link in dataplan\n  --save       save collected data to dataplan\n  --verbose    enable verbose mode\n  --no-cache   disable cache link\n  --refresh    reset stored records\n  --silent     suppress any output\n  --version    print version number\n```\n\n\n## An educational example\nIf we had an online store with a list of products, we could create a dataplan that describes the process of extracting some important aspects of a product, such as the product name, product price or product currency.\n\n#### The HTML\nImagine the HTML of a product looks like this:\n```\n\u003chtml\u003e\n    ...\n    \u003cdiv class=\"products\"\u003e\n        ...\n        \u003cdiv id=\"product9\" data-product-id=\"900\"\u003e\n            \u003ch1\u003eCool product\u003c/h1\u003e\n            \u003cimg src=\"cool-product-image.svg\"\u003e\n            \u003cspan class=\"price\"\u003e10.99 \u003cem\u003eEur\u003c/em\u003e\u003c/span\u003e\n        \u003c/div\u003e\n        ...\n    \u003c/div\u003e\n    ...\n\u003c/html\u003e\n```\n\n#### The dataplan\nA dataplan has at least two properties: a declaration and a definition. The first, the declaration, is a JSON key-value object whose keys are the desired (targeted) attributes of the dataplan (e.g. product name, product price). The second, the definition, is a JSON list of objects, with each object describing the process of obtaining something. 
That something can be the targeted data itself, or an unprocessed, raw or auxiliary part of the targeted data.\n\nTaking the example above, the declaration would look like this:\n```\n    ...\n    \"declare\": {\n        \"product_name\": \"string\",\n        \"product_price\": \"decimal\",\n        \"product_currency\": \"string\"\n    }\n    ...\n```\n\nDefined entries must also appear in the declaration object above, otherwise they will not be exported. Following the example above, the definition could look like this:\n```\n    ...\n    \"define\": [\n        {\n            \"product_name\": {\n                \"query\": \".products \u003e [data-product-id] \u003e h1\"\n            }\n        },\n        {\n            \"product_price\": {\n                \"query_xpath\": \"//*/div[@class='products']/div/span[@class='price']/text()\"\n            }\n        },\n        {\n            \"product_currency\": {\n                \"query\": \".products \u003e [data-product-id] \u003e span \u003e em\"\n            }\n        }\n    ]\n    ...\n```\n\nThe final touch is to add a link to the product's web page. Let's pretend the link is http://localhost/cool-store/product/900. 
The updated dataplan with both declarations and definitions would look like this:\n```\n{\n    \"link\": \"http://localhost/cool-store/product/900\",\n    \"declare\": {\n        \"product_name\": \"string\",\n        \"product_price\": \"decimal\",\n        \"product_currency\": \"string\"\n    },\n    \"define\": [\n        {\n            \"product_name\": {\n                \"query\": \".products \u003e [data-product-id] \u003e h1\"\n            }\n        },\n        {\n            \"product_price\": {\n                \"query_xpath\": \"//*/div[@class='products']/div/span[@class='price']/text()\"\n            }\n        },\n        {\n            \"product_currency\": {\n                \"query\": \".products \u003e [data-product-id] \u003e span \u003e em\"\n            }\n        }\n    ]\n}\n```\n\n#### Running a dataplan\nHap! can work with JSON formatted input either piped to stdin from another process (*Python only*) or read from a file given as the first command-line argument. Let's pretend we have a fancy process that outputs the dataplan above. Running it could look like this:\n```\n$ fancy_process | hap --verbose\n2018-07-09 01:21:28,436 Filepath: None\n2018-07-09 01:21:28,436 Save to file? 
False\n2018-07-09 01:21:28,436 HTML Parser initialized\n2018-07-09 01:21:28,447 Getting content from cache: http://localhost/cool-store/product/900\n2018-07-09 01:21:28,473 Parsing definition for 'product_name'\n2018-07-09 01:21:28,486 Performing product_name:query '.products \u003e [data-product-id] \u003e h1' =\u003e Cool product\n2018-07-09 01:21:28,486 Parsing definition for 'product_price'\n2018-07-09 01:21:28,489 Performing product_price:query '//*/div[@class='products']/div/span[@class='price']/text()' =\u003e 10.99\n2018-07-09 01:21:28,522 Parsing definition for 'product_currency'\n2018-07-09 01:21:28,523 Performing product_currency:query '.products \u003e [data-product-id] \u003e span \u003e em' =\u003e Eur\n2018-07-09 01:21:28,523 Updating records with 'product_name' as 'Cool product' (string)\n2018-07-09 01:21:28,523 Updating records with 'product_price' as '10.99' (decimal)\n2018-07-09 01:21:28,523 Updating records with 'product_currency' as 'Eur' (string)\n2018-07-09 01:21:28,523 Logging records datetime...\n2018-07-09 01:21:28,523 Done...\n{\n    \"_datetime\": \"2018-07-08 22:21:28.523606\",\n    \"product_currency\": \"Eur\",\n    \"product_name\": \"Cool product\",\n    \"product_price\": 10.99\n}\n```\nThe `--verbose` flag is responsible for the detailed output above. Had the flag been omitted, only the JSON at the bottom would have been printed. 
**Note:** scripts can capture verbose output and errors from stderr, and collect results from stdout or from the saved file.\n\n#### A test sample\nAdd the following content to a file named `test.json` and then run `hap test.json --verbose`:\n```\n{\n    \"link\": \"http://skyle.codeissues.net/\",\n    \"declare\": {\n        \"github\": \"string\"\n    },\n    \"define\": [\n        {\n            \"github\": {\n                \"query_xpath\": \"//*/a[@title='Skyle GitHub']/@href\"\n            }\n        }\n    ]\n}\n```\n\n\n## License\nCopyright 2018 Alexandru Catrina\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in\nall copies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN\nTHE SOFTWARE.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flexndru%2Fhap","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flexndru%2Fhap","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flexndru%2Fhap/lists"}