{"id":23315381,"url":"https://github.com/lexndru/hap-utils","last_synced_at":"2025-04-07T03:27:11.329Z","repository":{"id":144407599,"uuid":"150893063","full_name":"lexndru/hap-utils","owner":"lexndru","description":"A set of utilities for Hap! to automate scraping processes","archived":false,"fork":false,"pushed_at":"2018-10-24T13:07:56.000Z","size":141,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-02-13T08:17:40.900Z","etag":null,"topics":["automated","bash","hap","python","scraping"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lexndru.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-09-29T18:41:36.000Z","updated_at":"2018-10-24T13:07:58.000Z","dependencies_parsed_at":null,"dependency_job_id":"f2f21993-8908-4ef2-8e5c-3da706c41935","html_url":"https://github.com/lexndru/hap-utils","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lexndru%2Fhap-utils","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lexndru%2Fhap-utils/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lexndru%2Fhap-utils/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lexndru%2Fhap-utils/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lexndru","download_url":"https://codeload.g
ithub.com/lexndru/hap-utils/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247585914,"owners_count":20962424,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automated","bash","hap","python","scraping"],"created_at":"2024-12-20T15:34:30.300Z","updated_at":"2025-04-07T03:27:11.289Z","avatar_url":"https://github.com/lexndru.png","language":"Shell","readme":"# Hap! utils\n[![Build Status](https://travis-ci.org/lexndru/hap-utils.svg?branch=master)](https://travis-ci.org/lexndru/hap-utils)\n\nHap! utils brings a set of utilities to generate and validate dataplans, automate tasks as background jobs, collect harvested records and extend with custom user functionality. Get Hap! from https://github.com/lexndru/hap or PyPI.\n\nNotice: installing utils replaces hap CLI with a new and improved one; it does NOT delete hap from your system.\n\n## Install from sources\n```\n$ make install\n$ hap\n_                   _       _   _ _      \n| |__   __ _ _ __   / \\_   _| |_(_) |___  \n| '_ \\ / _' | '_ \\ /  / | | | __| | / __|\n| | | | (_| | |_) /\\_/| |_| | |_| | \\__ \\\n|_| |_|\\__,_| .__/\\/   \\__,_|\\__|_|_|___/\n           |_|                           \n\nHap! utils v0.2.1 [installed hap v1.2.3 x86_64 GNU/Linux]\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n\n Please report bugs at http://github.com/lexndru/hap-utils\n\nUsage:\n hap [input | option]        - Launch an utility or invoke Hap! directly\n\nOptions:\n input [flags]               - File with JSON formated dataplan\n dataplans                   - List all master dataplans available\n register [DATAPLAN | name]  - Register new dataplan or create it\n unregister DATAPLAN         - Unregister existing dataplan\n check DATAPLAN LINK         - Run once a dataplan with a link and test its output\n jobs                        - List all background jobs\n join DATAPLAN LINK          - Add background job with a dataplan and a link\n purge LINK                  - Permanently remove a background job\n pause LINK                  - Temporary pause a background job\n resume LINK                 - Resume a paused a background job\n dump LINK                   - Export job's stored records as tsv\n logs                        - View recent log activity\n upgrade                     - Upgrade Hap! to the latest version\n\nInput flags:\n --link LINK                 - Overwrite link in dataplan\n --save                      - Save collected data to dataplan\n --verbose                   - Enable verbose mode\n --no-cache                  - Disable cache link\n --refresh                   - Reset stored records before save\n --silent                    - Suppress any output\n```\n\n## Compatibility\nThe new CLI is backwards compatible with JSON input files as dataplans. E.g. launching a dataplan from `/tmp/dataplan.json` with the new CLI `hap /tmp/dataplan.json --verbose`.\n\n## Usage and options\nThe purpose of these utilities is to help generate dataplans and automate harvesting processes in a cronjob-like way. 
In fact, the background jobs system relies on Linux's `crontab` utility. Dataplans are handled with a set of tools that `register` and `join` such files.\n\n#### Register master dataplan\nA proper dataplan is required in order to harvest something from an HTML document. A dataplan is like a blueprint for an HTML document: any other URL matching the pattern of that document can use the same dataplan, as long as the user wants to harvest exactly the same fields. The purpose of registering a dataplan is to keep it in a \"safe place\" for later use with any URL that fits the requirements. This is called registering a master dataplan. E.g. for a dataplan kept at `/tmp/another_dataplan.json` a user would run:\n\n```\n$ hap register /tmp/another_dataplan.json\nHap! dataplan validator ready to parse /tmp/another_dataplan.json\n------------------------------------------------------------\n  ✔️  Found a meta name (\"another_dataplan\")\n  ✔️  Using custom configuration\n  ✔️  Link found (\"http://localhost:8080/something\")\n  ✔️  Found a valid declared field \"full_name\" as \"string\"\n  ✔️  Dataplan has all declared fields covered\n  ✔️  Found 1 record(s) saved to dataplan\n  ✔️  Validated records\n------------------------------------------------------------\nValid dataplan\nNew master dataplan has been registered!\nYou can use \"another_dataplan.json\" to add jobs or tasks\n```\n\n#### View known master dataplans\nAt any point in time the user can see a complete list of the master dataplans registered with the system. 
Each entry shows the name of the dataplan (which can be used to add jobs) and a table-like display of the declared fields, their datatypes and a valid sample value for each field.\n\n```\n$ hap dataplans\nFound 1 master dataplan(s):\n# another_dataplan.json\n  Field     | Type       | Sample\n  ==========|============|============================================================\n  fist_name | string     | Alexandru Catrina\n```\n\n#### Unregister a master dataplan\nRemoving or unregistering a master dataplan does not affect background jobs that were already added, but you will no longer be able to add jobs with the removed dataplan. It is possible to register it again.\n\n```\n$ hap unregister another_dataplan.json\nWarning: unregistering a master dataplan means you will no longer be able\nWarning: to add jobs or tasks with it. Current running tasks or jobs will\nWarning: not be affected.\nPermanently unregister another_dataplan.json? [yn]\n...\n```\n\n#### Validating a URL with a master dataplan\nChecking the compatibility of a URL is good practice before adding a background job that runs indefinitely. The procedure runs a master dataplan with a given URL as a parameter and prints the results to stdout. Fields with non-null values are considered valid.\n\n```\n$ hap check another_dataplan.json http://localhost/path/to/something\n{\n    \"_datetime\": \"2018-10-14 23:44:34.195463\",\n    \"first_name\": \"some value here... or null if incompatible\"\n}\n```\n\n#### Background jobs\nUtils extend Hap! by automating it. Adding a background job is similar to adding a cronjob, but with dataplans. A background job requires a master dataplan and a valid URL compatible with the dataplan. The job runs indefinitely and, once a day, updates the new dataplan created by joining the master dataplan with the provided URL. 
Jobs cannot be created directly without master dataplans.\n\n```\n$ hap join another_dataplan.json http://localhost/path/to/something\n...\n```\n\nBackground jobs can be listed with `jobs`, temporarily paused with `pause` or permanently removed with `purge`. A paused job is skipped by the daily update and does not receive any new records. A paused job can be resumed with `resume`, but resuming a job does not recover the records it missed while it was paused.\n\n#### Job records\nThe collected results can be exported to a `*.tsv` file on local disk. The records can be imported into a database or viewed with any program capable of handling csv-like files (e.g. LibreOffice).\n\n```\n$ hap dump http://localhost/path/to/something/saved/as/job\n...\n```\n\n\n## License\nCopyright 2018 Alexandru Catrina\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in\nall copies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN\nTHE SOFTWARE.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flexndru%2Fhap-utils","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flexndru%2Fhap-utils","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flexndru%2Fhap-utils/lists"}