{"id":21629584,"url":"https://github.com/abrie/custom-web-scraper","last_synced_at":"2026-04-12T20:40:58.348Z","repository":{"id":148468403,"uuid":"236325415","full_name":"abrie/custom-web-scraper","owner":"abrie","description":"A One-off web scraper.","archived":false,"fork":false,"pushed_at":"2020-01-26T17:12:49.000Z","size":6,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-18T21:23:41.079Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/abrie.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-01-26T14:43:20.000Z","updated_at":"2020-01-26T17:12:50.000Z","dependencies_parsed_at":"2023-05-20T06:45:20.388Z","dependency_job_id":null,"html_url":"https://github.com/abrie/custom-web-scraper","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/abrie/custom-web-scraper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abrie%2Fcustom-web-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abrie%2Fcustom-web-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abrie%2Fcustom-web-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abrie%2Fcustom-web-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/abrie","download_url":"https://codeload.github.com/a
brie/custom-web-scraper/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abrie%2Fcustom-web-scraper/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31729856,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-12T13:21:33.774Z","status":"ssl_error","status_checked_at":"2026-04-12T13:21:29.265Z","response_time":58,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-25T02:08:06.808Z","updated_at":"2026-04-12T20:40:58.280Z","avatar_url":"https://github.com/abrie.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Custom Web Scraper\n\nThis is a one-off project to scrape data from the web. Built for a hospitality and entertainment product. Documented here for posterity and discussion. Specific URLs and proprietary details are excluded from this repository.\n\n## Technologies\n\nIt uses a mix of technologies, selected for expedience and utility:\nMake, Bash, [cURL](https://curl.haxx.se/), [awk](https://www.gnu.org/software/gawk/manual/gawk.html), Python3, [jq](https://stedolan.github.io/jq/).\n\n## Overview\n\nThe scraper runs in a series of stages. Each stage takes an input and generates an output. Outputs are cached on the filesystem. 
The stages are invoked through a `Makefile`.\n\n| Stage | Input      | Action                                                                                        | Output                                           |\n| ----- | ---------- | --------------------------------------------------------------------------------------------- | ------------------------------------------------ |\n| 1     | **secret** | [Bash scripts cURL to query a list of URLs](step1.sh)                                         | A list of indexed 'location' headers             |\n| 2     | stage 1    | [Awk extracts URL from location header](Makefile#L14)                                         | A list of indexed URLs                           |\n| 3     | stage 2    | [Python iterates through list and caches URL content](step3.py)                               | Directory of .gz files named by index value      |\n| 4     | stage 3    | [Python iterates through cached .gz files and applies regex for fields of interest](step4.py) | Directory of JSON files named by index           |\n| 5     | stage 4    | Bash and jq filter JSON files according to tuned selection criteria                           | A file with a list of indexes relevant to search |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabrie%2Fcustom-web-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabrie%2Fcustom-web-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabrie%2Fcustom-web-scraper/lists"}